## Chapter 2: Markov Decision Processes


A **Markov Decision Process (MDP)** is an environment that can be defined as a 5-tuple:
</br>
</br>
<font size="4">
$$\begin{align}
(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)
\end{align}$$
</font>
where

- $\mathcal{S}$: state space (set of states)

- $\mathcal{A}$: action space (set of actions)

- $\mathcal{P}$: state transition probability matrix $\mathcal{P}_{ss'}^a = \mathbb{P}[S_{t+1}=s'|S_t=s, A_t=a]$

- $\mathcal{R}$: reward function $\mathcal{R}_s^a = \mathbb{E}[R_{t+1}|S_t=s, A_t=a]$

- $\gamma$: discount factor $\gamma \in [0, 1]$


The state space $\mathcal{S}$ and action space $\mathcal{A}$ can be infinite, but most RL theory assumes they are finite.
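
As a minimal sketch, a small finite MDP can be written down directly as Python dictionaries. The two-state example below is hypothetical: the state names, action names, probabilities, and rewards are assumptions chosen only for illustration.

```python
# Hypothetical two-state MDP, used only for illustration.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] maps each next state s' to P(s' | s, a).
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a] is the expected immediate reward E[R_{t+1} | s, a].
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 1.0, "move": -1.0},
}

gamma = 0.9  # discount factor


def is_valid_transition_model(P, tol=1e-9):
    """Check that every P[s][a] is a probability distribution over next states."""
    return all(
        abs(sum(dist.values()) - 1.0) < tol and all(p >= 0 for p in dist.values())
        for by_action in P.values()
        for dist in by_action.values()
    )
```

Nested dictionaries are used here instead of a dense matrix so that only the nonzero transitions need to be listed.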


### Markov Property

The next state $S_{t+1} \in \mathcal{S}$ is **independent of the past given the current state** $S_t$. This is called the **Markov property**. It can be written as 
</br>
</br>
<font size="4">
$$\begin{align}
\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, S_2, \ldots, S_t]
\end{align}$$
</font>

Thus, if the environment satisfies the Markov property, knowing only the current state $S_{t}$ is enough for RL; the earlier history adds no further information about the future.


### Policy

In a Markov Decision Process, the agent needs to decide which action to take in a given state. A distribution over actions conditioned on the state is called a **policy**.
</br>
</br>
<font size="4">
$$\begin{align}
\pi(a|s) = \mathbb{P}[A_{t} = a|S_t = s]
\end{align}$$
</font>

Because the state is Markov, the policy depends only on the current state $S_t$, not on the history.
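
The sketch below shows one way a stochastic policy could be stored and sampled. The table `policy`, its state and action names, and the helper `sample_action` are all hypothetical, introduced only to make $\pi(a|s)$ concrete.

```python
import random

# Hypothetical stochastic policy: policy[s][a] = pi(a | s).
policy = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 0.7, "move": 0.3},
}


def sample_action(policy, state, rng=random):
    """Draw an action a ~ pi(. | state); only the current state is used."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draw
action = sample_action(policy, "s0", rng)
```

A deterministic policy is the special case where one action in each state has probability 1.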


### Return

In RL, the reward function $\mathcal{R}_s^a$ is the expected immediate reward for taking action $a$ in state $s$. However, considering only the reward of the next timestep may be short-sighted. To make the agent account for rewards in the far future as well, the **return** is widely used.
</br>
</br>
<font size="4">
$$\begin{align}
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}
\end{align}$$
</font>

The **return** is the total discounted reward from timestep $t$.</br>
The discount factor $\gamma \in [0, 1]$ determines the present value of future rewards.</br>
A reward $R_{t+k+1}$ received $k+1$ timesteps in the future is worth $\gamma^k R_{t+k+1}$ at the current timestep.
</br></br>
A $\gamma$ close to 0 makes the agent near-sighted, caring mostly about immediate rewards, while a $\gamma$ close to 1 makes it far-sighted.  
</br>
$\gamma$ is usually set to 0.9 or 0.99 rather than 1, because $\gamma = 1$ can produce infinite returns in a cyclic MDP.
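
To make the effect of $\gamma$ concrete, here is a minimal sketch that computes the return of a finite reward sequence. The function name `discounted_return` and the example rewards are assumptions for illustration; `rewards[0]` plays the role of $R_{t+1}$.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
near_sighted = discounted_return(rewards, 0.0)  # only the first reward counts
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25
undiscounted = discounted_return(rewards, 1.0)  # plain sum of rewards
```

With a constant reward of 1 at every step, the infinite sum equals $1/(1-\gamma)$ for $\gamma < 1$, which is finite, whereas with $\gamma = 1$ it diverges.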


### Value function & Action-value function (Q-value function)

Almost all RL algorithms estimate the value of states in the environment, that is, how good a state is for the agent.  
In RL, 'how good' is measured by the **return**, and the expected return from state $s$ under policy $\pi$ is called the **value function** $v_\pi (s)$.
</br>
</br>
<font size="4">
$$\begin{align}
v_\pi(s) = \mathbb{E}_\pi[G_t|S_t = s]
\end{align}$$
</font>

We can also consider the value of taking action $a$ in state $s$ and following policy $\pi$ afterwards. This is called the **action-value function**, or Q-value, $q_\pi(s, a)$.
</br>
</br>
<font size="4">
$$\begin{align}
q_\pi(s, a) = \mathbb{E}_\pi[G_t|S_t = s, A_t = a]
\end{align}$$
</font>
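
The two functions are linked through the policy. Assuming a finite action space, averaging the action values over the actions the policy would choose gives back the state value, which follows directly from the two definitions above:
</br>
</br>
<font size="4">
$$\begin{align}
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, q_\pi(s, a)
\end{align}$$
</font>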

### Bellman equation

The **Bellman equation**, one of the central elements of RL, expresses that the value and action-value functions can be decomposed into two parts:

1. **immediate reward**
2. **discounted future values**

For example, the value function $v_\pi(s)$ is
</br>
</br>
<font size="4">
$$\begin{align}
v_\pi(s) &= \mathbb{E}_\pi[G_t|S_t = s] \\
&= \mathbb{E}_\pi[\sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s]
\end{align}$$
</font>

and similarly, the action-value function $q_\pi(s, a)$ is 
</br>
</br>
<font size="4">
$$\begin{align}
q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi (S_{t+1}, A_{t+1})|S_t = s, A_t = a]
\end{align}$$
</font>

With the Bellman equation, we can decompose the estimation of the value and action-value functions into simpler, recursive subproblems.  
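
One way to use this recursion, sketched here under stated assumptions, is iterative policy evaluation: start from $v = 0$ and repeatedly replace each $v(s)$ by the right-hand side of the Bellman equation until the values stop changing. The two-state MDP, the fixed policy, and the function name `policy_evaluation` are hypothetical and serve only as an illustration.

```python
def policy_evaluation(P, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Approximate v_pi by repeatedly applying the Bellman equation as an update."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_v = {}
        for s in P:
            total = 0.0
            for a, pi_sa in policy[s].items():
                # Immediate reward plus discounted value of the successor states.
                backup = R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())
                total += pi_sa * backup
            new_v[s] = total
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical two-state MDP: P[s][a][s'] = P(s' | s, a), R[s][a] = expected reward.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 1.0, "move": -1.0},
}
# Fixed policy to evaluate: move out of s0, then stay in s1.
policy = {"s0": {"move": 1.0}, "s1": {"stay": 1.0}}

v_pi = policy_evaluation(P, R, policy, gamma=0.9)
```

In this example the agent collects a reward of 1 forever once it stays in `s1`, so `v_pi["s1"]` approaches $1/(1-0.9) = 10$.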

### Optimal value functions

The final goal of RL is to find the policy that achieves the highest cumulative reward in the long run. Therefore, in an MDP, a better policy can be defined as follows:
</br>
</br>
<font size="4">
$$\begin{align}
\pi \geq \pi' \text{  iff  } v_\pi (s) \geq v_{\pi'} (s) \text{  for all  } s \in \mathcal{S}
\end{align}$$
</font>

There exists at least one **optimal policy** $\pi_{*}$ that is better than or equal to all other policies.

An **optimal policy** $\pi_{*}$ achieves the **optimal value function** $v_*$ and the **optimal action-value function** $q_*$, which are 

</br>
</br>
<font size="4">
$$\begin{align}
v_*(s) = \max_{\pi} v_\pi (s)
\end{align}$$
</font>
</br>
<font size="4">
$$\begin{align}
q_*(s, a) = \max_{\pi} q_\pi (s, a)
\end{align}$$
</font>
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.

The optimal value function and the optimal action-value function are connected as follows:
</br>
</br>
<font size="4">
$$\begin{align}
v_*(s) = \max_{a} q_* (s, a)
\end{align}$$
</font>

The Bellman equation can also be applied to the optimal functions. The Bellman equation applied to $v_*$ and $q_*$ is called the **Bellman optimality equation**.
</br>
</br>
<font size="4">
$$\begin{align}
v_*(s) = \max_{a} \mathbb{E}[R_{t+1} + \gamma v_* (S_{t+1})|S_t = s, A_t = a]
\end{align}$$
</font>
</br>
<font size="4">
$$\begin{align}
q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_* (S_{t+1}, a')|S_t = s, A_t = a]
\end{align}$$
</font>
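
For a small finite MDP whose transition probabilities and rewards are known, the optimality equation for $q_*$ can be used as an update rule. The sketch below applies value iteration to action values, a technique not named in the text above, to a hypothetical two-state MDP, and then reads off $v_*(s) = \max_a q_*(s, a)$ and a greedy policy. All names and numbers are assumptions for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Approximate q_* by repeatedly applying the Bellman optimality backup."""
    q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(max_iters):
        delta = 0.0
        new_q = {}
        for s in P:
            new_q[s] = {}
            for a in P[s]:
                # q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' q(s',a')
                target = R[s][a] + gamma * sum(
                    p * max(q[s2].values()) for s2, p in P[s][a].items()
                )
                delta = max(delta, abs(target - q[s][a]))
                new_q[s][a] = target
        q = new_q
        if delta < tol:
            break
    return q


# Hypothetical two-state MDP: P[s][a][s'] = P(s' | s, a), R[s][a] = expected reward.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": -1.0},
    "s1": {"stay": 1.0, "move": -1.0},
}

q_star = value_iteration(P, R, gamma=0.9)
# v_*(s) = max_a q_*(s, a)
v_star = {s: max(q_star[s].values()) for s in q_star}
# A greedy policy picks an action that attains the maximum in each state.
greedy_policy = {s: max(q_star[s], key=q_star[s].get) for s in q_star}
```

Here the greedy policy pays the one-off cost of moving out of `s0` in order to collect the recurring reward in `s1`, which a purely near-sighted agent would not do.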

If we know the optimal value function (or the optimal action-value function) of the MDP, the MDP is 'solved', because we then know an optimal policy $\pi_*$ that achieves the maximum expected return. However, solving the Bellman optimality equation is not easy.

### Difficulty of solving the Bellman optimality equation