# **Markov Decision Process**

A Markov decision process (MDP) is a central concept in reinforcement learning (RL): almost all RL problems can be modeled as an MDP. Unlike other ML methods, RL does not need labels or hand-written rules. Instead, the RL agent interacts with the environment and receives a reward from it, and maximizing that reward is the objective.

<img src="img/reinforcement_learning.png" width="600">

1. The environment tells the agent the current state.
2. The agent chooses an action based on the current state.
3. The environment returns the reward and the next state according to the action.
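
The three steps above can be sketched as a simple interaction loop. This is only an illustration: `CoinFlipEnv`, `random_agent`, and `run_episode` are hypothetical names invented for this sketch, and the toy environment (guess a coin flip, get reward 1 for a correct guess) is an assumption, not part of the figure.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a fair coin flip."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0  # the state is simply the current time step

    def reset(self):
        # Step 1: the environment tells the agent the current state.
        self.t = 0
        return self.t

    def step(self, action):
        # Step 3: the environment returns the reward and the next state.
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def random_agent(state, rng):
    # Step 2: the agent picks an action based on the current state.
    # This placeholder ignores the state and guesses uniformly at random.
    return rng.choice([0, 1])


def run_episode(env, agent, seed=1):
    rng = random.Random(seed)
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnv(), random_agent)
print(total)  # total reward collected over one 5-step episode
```

A learning agent would replace `random_agent` with a rule that uses the observed rewards to improve its choices over time.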

We can see that three pieces of information are exchanged between the agent and the environment: state, action, and reward. In response to the action $a$, the current state $S_t$ switches to the next state $S_{t+1}$. If the environment is fully observable, so that the agent sees the state directly and the transition probabilities $P(S_{t+1}|S_t=s,A_t=a)$ are determined by the current state and action alone, we say that the state fulfills the Markov property.

### **Markov Property**

The future depends only on the current state; the current state summarizes the entire past.

$P(S_{t+1}|S_1,\dots,S_t)=P(S_{t+1}|S_t)$


<img src="img/markov_process.png" width="600">

This is an example of a Markov process. If the current state $S_t$ is Standing, then  
$P(S_{t+1}=\text{Sitting}|S_{t}=\text{Standing})=0.3$  
$P(S_{t+1}=\text{Hands Raised}|S_{t}=\text{Standing})=0.1$  
$P(S_{t+1}=\text{Foot Forward}|S_{t}=\text{Standing})=0.5$  
$P(S_{t+1}=\text{Shut Down}|S_{t}=\text{Standing})=0.1$  
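
As a small sketch of what these probabilities mean in practice, the snippet below samples the next state from the Standing row many times and compares the empirical frequencies with the listed values. Only the Standing row is taken from the example; the helper name `sample_next_state` and the sample size are assumptions made for illustration.

```python
import random

# Transition probabilities out of Standing, copied from the example above.
P_STANDING = {
    "Sitting": 0.3,
    "Hands Raised": 0.1,
    "Foot Forward": 0.5,
    "Shut Down": 0.1,
}

# Each row of a transition matrix must sum to 1.
assert abs(sum(P_STANDING.values()) - 1.0) < 1e-9


def sample_next_state(row, rng):
    """Draw S_{t+1} from one row of the transition matrix."""
    states = list(row)
    weights = [row[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_next_state(P_STANDING, rng) for _ in range(10_000)]
freq = {s: samples.count(s) / len(samples) for s in P_STANDING}
print(freq)  # empirical frequencies, close to the probabilities above
```

Note that the sampler only needs the current state's row, never the earlier history, which is the Markov property in action.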

### **Markov Reward Process**

A Markov reward process is a tuple $\langle S, P, R, \gamma \rangle$

$S$: finite set of states  
$P$: state transition probability matrix $P_{ss\prime}=P(S_{t+1}=s\prime|S_t=s)$  
$R$: reward function $R_s=E(R_{t+1}|S_t=s)$  
$\gamma$: discount factor, $\gamma \in [0,1]$

<img src="img/markov_reward_process.png" width="600">

Here is an example of a Markov reward process based on the previous Markov process example. Similarly, if the current state $S_t$ is Standing, then

$R_{\text{Standing}}=E(R_{t+1}|S_{t}=\text{Standing})$  
$=0.3\times(-1)+0.1\times0.1+0.5\times(-1)+0.1\times0$  
$=-0.3+0.01-0.5+0=-0.79$

#### **Discount Factor**

We usually prefer rewards that arrive sooner because the future is uncertain, so a reward further in the future is worth less than its nominal value. The return $G_t$ is the discounted sum of future rewards:

$G_t=R_{t+1}+\gamma R_{t+2}+\dots=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$
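
To make the effect of $\gamma$ concrete, here is a minimal sketch that evaluates the sum above for a finite reward sequence. The function name `discounted_return` and the reward list are hypothetical, chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward list."""
    g = 0.0
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}

print(discounted_return(rewards, 1.0))  # 3.0  (no discounting)
print(discounted_return(rewards, 0.5))  # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, 0.0))  # 1.0  (only the immediate reward)
```

With $\gamma=0$ the agent is fully short-sighted, and with $\gamma=1$ every reward counts equally no matter how late it arrives.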

#### **Bellman Equation for MRP**

The Bellman equation computes the state value, which consists of not only the immediate reward but also the discounted value of the next state.

$v(s)=E(G_t|S_t=s)$  
$=E(R_{t+1}+\gamma R_{t+2}+\dots|S_t=s)$  
$=E(R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\dots)|S_t=s)$  
$=E(R_{t+1}+\gamma G_{t+1}|S_t=s)$  
$=E(R_{t+1}+\gamma v(S_{t+1})|S_t=s)$

<img src="img/mrp_state_value.png" width="400">

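So if $\gamma=1$, which means there is no discount,  

$v(\text{Standing})=-1+R_{\text{Standing}}$

where $R_{\text{Standing}}$ is the expected reward of the Standing state from the Markov reward process example.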
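
Expanding the expectation in the Bellman equation gives $v(s)=R_s+\gamma\sum_{s'}P_{ss'}v(s')$, which can be solved by repeatedly applying the right-hand side until the values stop changing. The sketch below does this for a made-up two-state MRP; the states `A` and `B`, their numbers, and the name `evaluate_mrp` are assumptions for illustration and are not taken from the figures.

```python
def evaluate_mrp(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v(s) <- R_s + gamma * sum_s' P_ss' v(s') until convergence."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {
            s: R[s] + gamma * sum(p * v[s2] for s2, p in P[s].items())
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MRP: transition probabilities and expected rewards per state.
P = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}
R = {"A": 1.0, "B": -1.0}
GAMMA = 0.9  # gamma < 1 keeps the values finite in this never-ending process

v = evaluate_mrp(P, R, GAMMA)
print(v)
```

The iteration converges because each sweep shrinks the error by a factor of at most $\gamma$, so this approach needs $\gamma<1$ or a process that is guaranteed to terminate.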

### **Markov Decision Process**

Adding **actions** to a Markov reward process gives a Markov decision process, which is a tuple $\langle S,A,P,R,\gamma \rangle$

$S$: finite set of states  
$A$: finite set of actions  
$P$: state transition probability matrix $P_{ss\prime}^a=P(S_{t+1}=s\prime|S_t=s, A_t=a)$  
$R$: reward function $R_s^a=E(R_{t+1}|S_t=s,A_t=a)$  
$\gamma$: discount factor, $\gamma \in [0,1]$

#### **Policy**

A policy is a distribution over actions given a state.

$\pi(a|s)=P(A_t=a|S_t=s)$

#### **Bellman Equation for MDP**

State value function: $v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_t=s]$

Action value function: $q_{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1},A_{t+1})|S_t=s,A_t=a]$

<img src="img/mdp_state_value.png" width="800">

The relation between the state value and action value functions:  

$v_{\pi}(s)=\sum_{a\in A}\pi(a|s)q_{\pi}(s,a)$

$\implies$ the state value is the expectation of the action values under the policy.

$q_{\pi}(s,a)=R_s^a+\gamma\sum_{s\prime\in S}P_{ss\prime}^a v_{\pi}(s\prime)$
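
The two relations above can be alternated to evaluate a fixed policy: compute $q_{\pi}$ from the current estimate of $v_{\pi}$, then average over the policy to get a new $v_{\pi}$. The following is a sketch under stated assumptions: the two-state, two-action MDP, the uniform policy, and the names `q_from_v` and `evaluate_policy` are all hypothetical.

```python
def q_from_v(P, R, v, gamma, s, a):
    # q_pi(s, a) = R_s^a + gamma * sum_s' P_ss'^a * v_pi(s')
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())


def evaluate_policy(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Iterate v_pi(s) <- sum_a pi(a|s) * q_pi(s, a) until convergence."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {
            s: sum(pi[s][a] * q_from_v(P, R, v, gamma, s, a) for a in P[s])
            for s in P
        }
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MDP: P[s][a] maps next states to probabilities, R[s][a] is R_s^a.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 2.0, "move": -1.0},
}
GAMMA = 0.9

# Uniform random policy: pi(a|s) = 0.5 for both actions in every state.
pi = {s: {a: 0.5 for a in P[s]} for s in P}

v_pi = evaluate_policy(P, R, pi, GAMMA)
q_pi = {s: {a: q_from_v(P, R, v_pi, GAMMA, s, a) for a in P[s]} for s in P}
print(v_pi)
print(q_pi)
```

In this toy example `s1` ends up with the higher value, because staying there pays a reward of 2 while every other action pays 0 or -1.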

#### **Optimal Value Function**

The optimal value functions are the maximum state value and action value over all policies, and an optimal policy is one that achieves them.  
$v_{*}(s)=\max_{\pi}v_{\pi}(s)$  
$q_{*}(s,a)=\max_{\pi}q_{\pi}(s,a)$  

$\pi_{*}(a|s)=\begin{cases} 1\quad \text{if}\ a=\arg\max_{a'\in A}q_{*}(s,a') \\ 0\quad \text{otherwise}\end{cases}$
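
One common way to obtain $v_{*}$ and $q_{*}$ in a small MDP is value iteration, which repeatedly applies $v(s)\leftarrow\max_{a}\left[R_s^a+\gamma\sum_{s'}P_{ss'}^a v(s')\right]$. Value iteration is not covered in these notes, so treat the snippet as a sketch of one possible approach. The MDP and all names are hypothetical; the last step builds the greedy policy exactly as in the case formula above.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] maps next states to probabilities, R[s][a] is R_s^a.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 2.0, "move": -1.0},
}


def q_value(v, s, a):
    # q(s, a) = R_s^a + gamma * sum_s' P_ss'^a * v(s')
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-10, max_iters=10_000):
    """Iterate v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {s: max(q_value(v, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


v_star = value_iteration()
q_star = {s: {a: q_value(v_star, s, a) for a in P[s]} for s in P}

# Greedy policy: probability 1 on the action with the largest q_*(s, a).
pi_star = {}
for s in P:
    best = max(q_star[s], key=q_star[s].get)
    pi_star[s] = {a: 1.0 if a == best else 0.0 for a in P[s]}

print(v_star)
print(pi_star)
```

For this toy MDP the greedy policy moves out of `s0` and stays in `s1`, where collecting a reward of 2 forever gives $v_{*}(\text{s1})=2/(1-0.9)=20$.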