# Assignment Week 2

## 3. MDP (Markov Decision Process)

A Markov Decision Process (MDP) is a Markov reward process (MRP) with decisions.
It is an environment in which all states are Markov. Unlike in an MRP, where we follow the transitions passively, in an MDP we actively choose actions that influence which state we reach next.

As an extension of the MRP, an MDP has 5 components, $S, A, P, R, \gamma$:

- $S$ : A finite set of states that satisfy the Markov property.
- $A$ : A finite set of actions.
- $P$ : A state transition probability matrix. With actions included, it is now a three-dimensional array indexed by action, current state, and next state:
$$ P_{ss^{'}}^a=P(S_{t+1}=s^{'} \mid S_t=s, A_t=a)$$
- $R$ : A reward function:
$$ R_{s}^a=E[R_{t+1} \mid S_t=s, A_t=a]$$
- $\gamma$: A discount factor.
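
To make the five components concrete, here is a minimal sketch of a tabular MDP container. The class name `MDP`, the index layout `P[a][s][s2]` and `R[s][a]`, and the two-state toy numbers are all assumptions made for illustration; they are not part of the assignment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MDP:
    """Tabular MDP; states and actions are the integers 0..n-1 (hypothetical layout)."""
    P: List[List[List[float]]]  # P[a][s][s2] = P(S_{t+1}=s2 | S_t=s, A_t=a)
    R: List[List[float]]        # R[s][a] = E[R_{t+1} | S_t=s, A_t=a]
    gamma: float                # discount factor

    @property
    def n_actions(self) -> int:
        return len(self.P)

    @property
    def n_states(self) -> int:
        return len(self.R)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError if the shapes or probabilities are inconsistent."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        if any(len(row) != self.n_actions for row in self.R):
            raise ValueError("R needs one entry per (state, action) pair")
        for a in range(self.n_actions):
            if len(self.P[a]) != self.n_states:
                raise ValueError(f"P[{a}] needs one row per state")
            for s in range(self.n_states):
                row = self.P[a][s]
                if len(row) != self.n_states or min(row) < 0.0:
                    raise ValueError(f"P[{a}][{s}] is not a valid distribution")
                if abs(sum(row) - 1.0) > tol:
                    raise ValueError(f"P[{a}][{s}] does not sum to 1")


# Made-up toy MDP: action 0 stays in place, action 1 moves to the other state.
toy = MDP(
    P=[[[1.0, 0.0], [0.0, 1.0]],
       [[0.0, 1.0], [1.0, 0.0]]],
    R=[[0.0, 1.0], [2.0, 0.0]],
    gamma=0.5,
)
toy.validate()  # passes silently when every row of P is a distribution
```

Each fixed action `a` selects one ordinary transition matrix `P[a]`, which is why the transition model becomes three-dimensional once actions are added.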


### 3.1 Policy


A *policy* $\pi$ is a distribution over actions given states: $$\pi(a \mid s)=P[A_t=a \mid S_t=s]$$

- A policy fully defines the behaviour of an agent.
- MDP policies depend on the current state (not the history).
- Given an MDP $\{S, A ,P ,R, \gamma \}$ and a policy $\pi$, the state sequence $S_1, S_2, ...$ is an MP $\{S, P^\pi\}$.
- The state and reward sequence $S_1, R_2, S_2, ...$ is an MRP $\{S, P^\pi ,R^\pi, \gamma\}$,
    where, for the given policy, $$\mathcal P_{s,s^{'}}^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal P_{ss^{'}}^a$$ $$\mathcal R_s^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal R_s^a$$
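
The two averaging formulas above can be sketched in a few lines of Python. The function name `induced_mrp`, the index layout `P[a][s][s2]`, `R[s][a]`, `pi[s][a]`, and the toy numbers are assumptions made for this example only.

```python
def induced_mrp(P, R, pi):
    """Average the MDP dynamics over a policy to get the MRP (P_pi, R_pi).

    P[a][s][s2] : transition probability, R[s][a] : expected reward,
    pi[s][a]    : probability of taking action a in state s.
    """
    n_actions, n_states = len(P), len(R)
    P_pi = [
        [sum(pi[s][a] * P[a][s][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    R_pi = [
        sum(pi[s][a] * R[s][a] for a in range(n_actions))
        for s in range(n_states)
    ]
    return P_pi, R_pi


# Made-up toy MDP: action 0 stays in place, action 1 moves to the other state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [2.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]  # pick either action with probability 1/2

P_pi, R_pi = induced_mrp(P, R, uniform)
print(P_pi)  # [[0.5, 0.5], [0.5, 0.5]]
print(R_pi)  # [0.5, 1.0]
```

Under the uniform policy each state stays or moves with equal probability, so every row of `P_pi` is `[0.5, 0.5]`, and each entry of `R_pi` is the plain average of that state's two action rewards.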


### 3.2.1 Value Function for MDP (Prediction)

An MDP has two value functions: one defined over states and one defined over state-action pairs.
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $$v_\pi(s)=E_\pi[G_t \mid S_t=s]$$ 

The action-value function $q_\pi(s,a)$ of an MDP is the expected return starting from state $s$, taking action $a$ and then following policy $\pi$: $$q_\pi(s,a)=E_\pi[G_t \mid S_t=s, A_t=a]$$
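
Both definitions are expectations of the return $G_t$, which this section uses without restating. As a small sketch, assuming the usual definition $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, the return of one finite sampled reward sequence can be computed as follows. The function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for one sampled reward sequence [R_{t+1}, R_{t+2}, ...].

    Works backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return([1.0, 0.0, 2.0], 0.5))  # 1.5
```

$v_\pi(s)$ is the average of this quantity over all trajectories that start in $s$ and follow $\pi$; $q_\pi(s,a)$ is the same average with the first action fixed to $a$.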


### 3.2.2 Bellman Equation for MDP (Prediction)

The same analysis carries over to the Bellman equation. Given a fixed policy, we turn the MDP into an MRP and then apply the earlier definition. I call this the prediction problem, as in the heading, because the policy is already given, in contrast to the optimization problem discussed later.



- The general expressions are: $$v_\pi(s)=E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s]$$ 

$$q_\pi(s,a)=E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a]$$

- Consider a one-step lookahead for $v_\pi(s)$ and $q_\pi(s,a)$:

A state node branching into its action nodes gives:
$$v_\pi(s)=\sum_{a \in \mathcal A} \pi(a \mid s) q_\pi(s,a)$$
while an action node branching into its successor state nodes gives:
$$q_\pi(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_\pi(s^{'})$$

- Substituting each of the 2 equations above into the other gives recursive expressions for $v_\pi(s)$ and $q_\pi(s,a)$: $$v_\pi(s)=\sum_{a \in \mathcal A} \pi(a \mid s) \Big (\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_\pi(s^{'}) \Big)$$ $$q_\pi(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a \sum_{a^{'} \in \mathcal A} \pi(a^{'} \mid s^{'}) q_\pi(s^{'},a^{'})$$
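
One way to use the recursive expression for $v_\pi(s)$ is to apply its right-hand side repeatedly until the values stop changing. The sketch below does this for a made-up two-state MDP. The function names, the index layout `P[a][s][s2]`, `R[s][a]`, `pi[s][a]`, and the stopping tolerance are assumptions for this example, not something prescribed by the text.

```python
def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Approximate v_pi by repeatedly applying the Bellman expectation equation."""
    n_actions, n_states = len(P), len(R)
    v = [0.0] * n_states
    for _ in range(max_iters):
        new_v = []
        for s in range(n_states):
            total = 0.0
            for a in range(n_actions):
                # q(s, a) = R_s^a + gamma * sum_s' P_ss'^a v(s')
                q_sa = R[s][a] + gamma * sum(
                    P[a][s][s2] * v[s2] for s2 in range(n_states))
                total += pi[s][a] * q_sa  # v(s) = sum_a pi(a|s) q(s, a)
            new_v.append(total)
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


def q_from_v(P, R, v, gamma):
    """One-step lookahead: q(s, a) from a state-value table v."""
    n_actions, n_states = len(P), len(R)
    return [
        [R[s][a] + gamma * sum(P[a][s][s2] * v[s2] for s2 in range(n_states))
         for a in range(n_actions)]
        for s in range(n_states)
    ]


# Made-up toy MDP: action 0 stays in place, action 1 moves to the other state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [2.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]

v_pi = policy_evaluation(P, R, uniform, gamma=0.5)
q_pi = q_from_v(P, R, v_pi, gamma=0.5)
print([round(x, 6) for x in v_pi])  # close to [1.25, 1.75]
```

For this toy problem the fixed point can be checked by hand: $v_\pi(0) = 0.5 + 0.25\,(v_\pi(0) + v_\pi(1))$ and $v_\pi(1) = 1 + 0.25\,(v_\pi(0) + v_\pi(1))$, which give $v_\pi = (1.25, 1.75)$.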


### 3.3.1 Optimal Value Function for MDP (Optimization)

The optimal state-value function $v_*(s)$ is the maximum state-value function over all policies: $$v_*(s) = \max _\pi v_\pi(s)$$ The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies: $$q_*(s,a) = \max _\pi q_\pi(s,a)$$

- The optimal value function specifies the best possible performance in the MDP.
- An MDP is "solved" when we know the optimal value function.


### 3.3.2 Optimal Policy for MDP (Optimization)

Define a partial ordering over policies: $\pi \geq \pi^{'}$ if $v_\pi(s) \geq v_{\pi^{'}}(s), \forall s$.

Theorem: For any MDP,

- There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq  \pi, \forall \pi$.
- All optimal policies achieve the optimal state-value function, $v_{\pi_*}(s)=v_*(s)$.
- All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s,a)=q_*(s,a)$.

An optimal policy can be found by maximising over $q_*(s,a)$: $$\pi_*(a \mid s)= \begin{cases} 1& \text{if $a=\mathop{\arg\max}_{a^{'} \in \mathcal A}q_*(s,a^{'})$}\\ 0& \text{otherwise} \end{cases} $$

- There is always a deterministic optimal policy for any MDP.
- If we know $q_*(s,a)$, we immediately have the optimal policy.
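
Reading the policy off $q_*$ is a row-wise argmax. Here is a minimal sketch, assuming a hypothetical table `q_star[s][a]` with illustrative numbers and breaking ties in favour of the lowest action index (the text does not specify a tie-breaking rule).

```python
def greedy_policy(q):
    """Build a deterministic policy pi[s][a] that is greedy with respect to q[s][a]."""
    pi = []
    for q_s in q:
        # max() returns the first maximiser, so ties go to the lowest action index.
        best = max(range(len(q_s)), key=lambda a: q_s[a])
        pi.append([1.0 if a == best else 0.0 for a in range(len(q_s))])
    return pi


# Hypothetical optimal action values for a two-state, two-action MDP.
q_star = [[1.5, 3.0],
          [4.0, 1.5]]

print(greedy_policy(q_star))  # [[0.0, 1.0], [1.0, 0.0]]
```

Each row of the result puts probability 1 on a single action, which is the deterministic optimal policy mentioned above.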


### 3.3.3 Bellman Optimality Equation for MDP (Optimization)


Applying the one-step Bellman equations under an optimal policy gives: $$v_*(s)=\max_a q_*(s,a)$$ $$q_*(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_*(s^{'})$$

Combining these 2 equations again gives recursive expressions for $v_*(s)$ and $q_*(s,a)$: $$v_*(s)=\max_a \Big (\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a v_*(s^{'}) \Big)$$ $$q_*(s,a)=\mathcal R_s^a + \gamma \sum_{s^{'} \in \mathcal S} \mathcal P_{ss^{'}}^a \max_{a^{'}} q_*(s^{'},a^{'})$$

The Bellman optimality equation is nonlinear because of the $\max$ operator, so it has no closed-form solution in general. We can solve it with iterative methods, which will be explored in the next assignment.
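
As a preview only, one such iterative method (commonly called value iteration) repeatedly applies the right-hand side of the equation for $v_*(s)$ as an update. The sketch below is not the method of this assignment; the function name, the index layout `P[a][s][s2]` and `R[s][a]`, the tolerance, and the toy MDP are all assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Approximate v_* by iterating the Bellman optimality backup."""
    n_actions, n_states = len(P), len(R)
    v = [0.0] * n_states
    for _ in range(max_iters):
        # v(s) <- max_a ( R_s^a + gamma * sum_s' P_ss'^a v(s') )
        new_v = [
            max(
                R[s][a] + gamma * sum(
                    P[a][s][s2] * v[s2] for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


# Made-up toy MDP: action 0 stays in place, action 1 moves to the other state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0], [2.0, 0.0]]

v_star = value_iteration(P, R, gamma=0.5)
print([round(x, 6) for x in v_star])  # close to [3.0, 4.0]
```

The result can be checked by hand: staying in state 1 forever gives $v_*(1) = 2 + 0.5\,v_*(1) = 4$, and moving from state 0 gives $v_*(0) = 1 + 0.5 \cdot 4 = 3$, both of which beat the alternative action.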


### 3.4 Relation between MP, MRP, and MDP Explained

These concepts build on one another step by step. An MRP adds rewards to an MP, and an MDP adds decisions to an MRP.

So, to summarize:

- Given an MDP $M = \{S, A ,P ,R, \gamma\}$ and a policy $\pi$, the state sequence $S_1, S_2, ...$ is an MP $\{S, P^\pi\}$, the foundation on which the MRP and the MDP are built.
- The state and reward sequence $S_1, R_2, S_2, ...$ is an MRP $\{S, P^\pi ,R^\pi,\gamma\}$. In other words, once a policy is fixed, we treat it as part of the "environment", so we are back to an MRP that follows the environment "passively".
- Here, $$\mathcal P_{s,s^{'}}^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal P_{ss^{'}}^a$$ $$\mathcal R_s^\pi=\sum_{a \in \mathcal A} \pi(a \mid s) \mathcal R_s^a$$
