# Write-up and code for Assignment 3 - Week 2

### To do:
- ~~Write the Bellman equation for MRP Value Function and code to calculate MRP Value Function (based on Matrix inversion method you learnt in this lecture)~~
- ~~Write out the MDP definition, Policy definition and MDP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)~~
- ~~Think about the data structure/class design (in Python 3) to represent MDP, Policy, Value Function, and implement them with clear type definitions~~
- ~~The data structure/code design of MDP should be incremental (and not independent) to that of MRP~~
- ~~Separately implement the r(s,s',a) and R(s,a) = \sum_{s'} p(s,s',a) * r(s,s',a) definitions of MDP~~
- ~~Write code to convert/cast the r(s,s',a) definition of MDP to the R(s,a) definition of MDP (put some thought into code design here)~~
- ~~Write code to create a MRP given a MDP and a Policy~~
- Write out all 8 MDP Bellman Equations and also the transformation from Optimal Action-Value function to Optimal Policy (in LaTeX)

#### Bellman Equation for MRP
The value of being in the current state $s$ in an MRP can be decomposed into two parts: the immediate reward $R_{t+1}$, and the discounted value of the successor state, $\gamma v(S_{t+1})$. Hence, we can define the value function for state $s$ as:
<br>
<br>
$$v(s) = \mathbb E[R_{t+1} + \gamma v(S_{t+1}) ~|~ S_t = s]$$
<br>
<br>
For a discrete state space this becomes:
<br>
<br>
$$v(s) = \mathcal R_s + \gamma \sum_{s'\in\mathcal S} \mathcal P_{ss'} v(s')$$
<br>
<br>
Using matrix notation this can be written as:
<br>
<br>
$$v = \mathcal R + \gamma \mathcal P v$$
<br>
<br>
Since this equation is linear in $v$, we can solve for $v$ directly (the matrix $I - \gamma \mathcal P$ is invertible whenever $\gamma < 1$):
<br>
<br>
$$v = (I - \gamma \mathcal P)^{-1} \mathcal R$$

In [12]:
def get_value_function(mrp: MRP) -> Dict[S, float]:
    # solve the Bellman equation in matrix form: v = (I - gamma * P)^{-1} R
    R, P = get_matrices(mrp)
    I = np.identity(R.shape[0])

    v = np.matmul(np.linalg.inv(np.subtract(I, mrp.gamma * P)), R)
    # get_matrices enumerates mrp.mp.States in this same order
    return {s: float(v[i, 0]) for i, s in enumerate(mrp.mp.States)}


def get_matrices(mrp: MRP) -> Tuple[np.ndarray, np.ndarray]:
    # function to convert reward and transition dicts to matrices
    # this is necessary to solve the value function using matrix operations
    # assumes rewards are defined as a function of the current state
    states = mrp.mp.States
    sz = len(states)
    R = np.zeros((sz, 1))
    P = np.zeros((sz, sz))

    for i, s in enumerate(states):
        R[i] = mrp.R[s]
        for j, sp in enumerate(states):
            # transitions missing from the dict have probability 0
            P[i, j] = mrp.mp.P[s].get(sp, 0.0)

    return R, P
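
As a quick sanity check of the closed-form solution, here is a minimal sketch on a hypothetical two-state MRP. The rewards and transition probabilities are made up for illustration, and the sketch uses `np.linalg.solve` instead of forming the inverse explicitly, which yields the same $v$.

```python
import numpy as np

# Hypothetical two-state MRP; all numbers are illustrative only.
gamma = 0.9
R = np.array([1.0, 2.0])            # expected immediate reward per state
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])          # row-stochastic transition matrix

# Solve (I - gamma * P) v = R, which is equivalent to v = (I - gamma * P)^{-1} R.
v = np.linalg.solve(np.eye(2) - gamma * P, R)

# The solution must satisfy the Bellman equation v = R + gamma * P v.
bellman_residual = np.max(np.abs(v - (R + gamma * P @ v)))
print(v, bellman_residual)
```

The residual should be zero up to floating-point error, which confirms that the computed $v$ is a fixed point of the Bellman equation.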

#### Markov Decision Process
A Markov Decision Process (MDP) is a tuple $(\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)$ consisting of a finite set of states $\mathcal S$, a finite set of actions $\mathcal A$, transition probabilities $\mathcal P_{s,s'}^a = \mathbb P[S_{t+1} = s' ~|~ S_t = s, A_t = a]$ (which now also depend on the action taken), a reward function $\mathcal R_s^a = \mathbb E[R_{t+1} ~|~ S_t = s, A_t = a]$, and a discount factor $\gamma \in [0, 1]$. An MDP is thus an extension of an MRP in which an action is chosen at every step.


#### Policy
A policy $\pi$ is a probability distribution over actions $a \in \mathcal A$ given a state $s$.
<br>
<br>
$$\pi(a~|~s) = \mathbb P[A_t = a ~|~ S_t = s]$$
<br>
- A policy fully defines the behaviour of an agent.
- An MDP combined with a policy results in an MRP.
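
To make the definition concrete, the following is a small sketch of a stochastic policy stored as a nested dict, with a check that each $\pi(\cdot~|~s)$ is a valid distribution and a helper that samples $A_t \sim \pi(\cdot~|~S_t)$. The state names, action names, and the helpers `is_valid_policy` and `sample_action` are hypothetical and only serve as illustration.

```python
import random
from typing import Dict

# Hypothetical stochastic policy over two states and two actions.
pi: Dict[str, Dict[str, float]] = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 1.0},
}


def is_valid_policy(pi: Dict[str, Dict[str, float]], tol: float = 1e-9) -> bool:
    # pi(. | s) must be a probability distribution for every state s.
    return all(
        all(p >= 0.0 for p in dist.values())
        and abs(sum(dist.values()) - 1.0) <= tol
        for dist in pi.values()
    )


def sample_action(pi: Dict[str, Dict[str, float]], state: str,
                  rng: random.Random) -> str:
    # Draw an action according to pi(. | state).
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(is_valid_policy(pi), sample_action(pi, "s1", rng))
```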


#### MDP Value Function
We can define the value function for an MDP in two ways, firstly as solely a function of the current state:
<br>
- The state-value function $v_\pi(s)$ is the expected aggregated and discounted reward starting in state $s$ and then following policy $\pi$.
<br>
<br>
$$v_\pi(s) = \mathbb E_\pi[ G_t ~|~ S_t = s]$$
<br>
and secondly as a function of both the current state and the action (action-value function):
<br>
- The action-value function $q_\pi(s,a)$ is the expected aggregated and discounted reward starting in state $s$, taking action $a$ and then following policy $\pi$.
<br>
<br>
$$q_\pi(s,a) = \mathbb E_\pi[ G_t ~|~ S_t = s, A_t = a]$$
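
Both definitions are expectations of the return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$. As a minimal sketch, the return of one finite, made-up reward sequence can be computed as follows (the function name `discounted_return` is hypothetical):

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    # rewards[k] plays the role of R_{t+k+1}; accumulate backwards so that
    # G = r_0 + gamma * (r_1 + gamma * (r_2 + ...)).
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

Averaging such returns over many episodes that start in $s$ (and, for $q_\pi$, take action $a$ first) would give a sample-based estimate of the two value functions.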

In [23]:
class MDP(NamedTuple):
    States: Set[S]
    # the transitions depend on s, a, and s'
    # mapping from a state to a mapping of an action to a mapping of a state to a float (probability)
    P: Dict[S, Dict[A, Dict[S, float]]]
    Actions: Set[A]
    # reward is a function of the current state s and the action a: R(s, a)
    R: Dict[S, Dict[A, float]]
    gamma: float


class MDPTransitionReward(NamedTuple):
    # same fields as MDP, except that the reward also depends on the next state
    # (a distinct class name keeps this definition from shadowing MDP)
    States: Set[S]
    P: Dict[S, Dict[A, Dict[S, float]]]
    Actions: Set[A]
    # reward is a function of the current state s, the action a, and the next state sp: r(s, s', a)
    R: Dict[S, Dict[A, Dict[S, float]]]
    gamma: float


class Policy(NamedTuple):
    # state to action to a probability
    pi: Dict[S, Dict[A, float]]


class ValueFunction(NamedTuple):
    # state-value function: state to value
    vf: Dict[S, float]


class ActionValueFunction(NamedTuple):
    # action-value function: state to action to value
    qf: Dict[S, Dict[A, float]]

As the code above shows, the reward can be defined in more than one way. It can be a function of the current state and the action taken, $R(s,a)$, or a function of the current state, the action taken, and the next state, $r(s,s',a)$. The code below converts the latter definition to the former by taking the expectation over the next state $s'$:
$$R(s,a) = \sum_{s'\in \mathcal S}\mathcal P_{s,s'}^a ~ r(s,s',a)$$

In [22]:
def convert_reward_mdp(mdp: MDP) -> Dict[S, Dict[A, float]]:
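    # function to convert the rewards of an mdp from r(s, s', a) to R(s, a)
    # expects the MDP variant in which mdp.R[s][a][sp] holds r(s, s', a)
    new_r = {}
    for s in mdp.States:
        new_r[s] = {}
        for a in mdp.R[s]:
            # R(s, a) = sum over s' of P(s, s', a) * r(s, s', a)
            # iterate over the possible transitions; a next state without a
            # listed reward contributes 0
            new_r[s][a] = sum(
                p * mdp.R[s][a].get(sp, 0.0) for sp, p in mdp.P[s][a].items()
            )

    return new_r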
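
A small worked instance of this conversion, written against plain dicts so that it runs on its own. The transition probabilities, rewards, and the helper name `expected_reward` are hypothetical.

```python
from typing import Dict

# Hypothetical example: in state "s", action "a" leads back to "s" with
# probability 0.25 (reward 4.0) and to "t" with probability 0.75 (reward 0.0).
P = {"s": {"a": {"s": 0.25, "t": 0.75}}}
r = {"s": {"a": {"s": 4.0, "t": 0.0}}}


def expected_reward(P, r) -> Dict[str, Dict[str, float]]:
    # R(s, a) = sum over s' of P(s, s', a) * r(s, s', a)
    return {
        s: {
            a: sum(p * r[s][a].get(sp, 0.0) for sp, p in P[s][a].items())
            for a in P[s]
        }
        for s in P
    }


R = expected_reward(P, r)
print(R)  # {'s': {'a': 1.0}}
```

Here $R(s,a) = 0.25 \cdot 4.0 + 0.75 \cdot 0.0 = 1.0$.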

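Given a policy $\pi$ and an MDP we can create an MRP by averaging the rewards and transition probabilities over the actions chosen by the policy: $\mathcal R_\pi(s) = \sum_{a \in \mathcal A}\pi(a~|~s)\mathcal R_s^a$ and $\mathcal P_\pi(s,s') = \sum_{a \in \mathcal A}\pi(a~|~s)\mathcal P_{s,s'}^a$.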

In [24]:
def mdp_to_mrp(mdp: MDP, policy: Policy) -> MRP:
    # function to convert an MDP and a policy into an MRP
    # expects mdp.R[s][a] to hold R(s, a)
    States = mdp.States
    gamma = mdp.gamma
    reward = {}
    probs = {}
    for s in States:
        # the reward of the resulting MRP is a function of the current state only
        reward[s] = 0.0
        probs[s] = {}
        for a in policy.pi[s]:
            # weight the reward and the transitions by the probability of taking a
            reward[s] += policy.pi[s][a] * mdp.R[s][a]
            for sp in mdp.P[s][a]:
                if sp not in probs[s]:
                    probs[s][sp] = 0.0
                probs[s][sp] += policy.pi[s][a] * mdp.P[s][a][sp]

    mp = MP(States, probs)
    mrp = MRP(mp, reward, gamma)

    return mrp
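
The same averaging can be traced by hand on a tiny example. The sketch below uses plain dicts rather than the classes of this write-up so that it is self-contained; the two-state MDP, the policy, and the names `R_pi` and `P_pi` are hypothetical.

```python
# Hypothetical two-state, two-action MDP: P[s][a][s'] and R[s][a].
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
# Policy pi[s][a]: uniform in s0, always "stay" in s1.
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0}}

R_pi = {}
P_pi = {}
for s, dist in pi.items():
    # Expected reward in s under the policy.
    R_pi[s] = sum(p_a * R[s][a] for a, p_a in dist.items())
    # Transition probabilities out of s under the policy.
    P_pi[s] = {}
    for a, p_a in dist.items():
        for sp, p in P[s][a].items():
            P_pi[s][sp] = P_pi[s].get(sp, 0.0) + p_a * p

print(R_pi)  # {'s0': 0.5, 's1': 2.0}
print(P_pi)  # {'s0': {'s0': 0.5, 's1': 0.5}, 's1': {'s1': 1.0}}
```

Each row of `P_pi` sums to 1, as required for the transition probabilities of an MRP.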

#### Bellman Equations
As we saw earlier, the state-value function $v_\pi(s)$ is the expected discounted return starting in state $s$. As in the MRP case, we can decompose it into the immediate reward and the discounted value of the successor state:
$$v_\pi(s) = \mathbb E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})~|~S_t = s]$$
<br>
Similarly, we can decompose the action-value function into the same two parts:
<br> <br>
$$q_\pi(s,a) = \mathbb E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1},A_{t+1})~|~S_t = s, A_{t} = a]$$
<br> <br>
We can see how the state-value function is connected to the action-value function, and how we can transform the latter to the former using the policy $\pi$:
<br> <br>
$$v_\pi(s) = \sum_{a \in \mathcal A}\pi(a~|~s) q_\pi(s,a)$$
<br> <br>
The action-value function also depends on the state-value function. If we take action $a$ in state $s$ the value is the immediate reward, plus the discounted value of the state we end up in when taking action $a$. This can be illustrated by the following equation:
<br> <br>
$$q_\pi(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_\pi(s')$$
<br> <br>
Combining the last two equations gives us:
<br> <br>
$$v_\pi(s) = \sum_{a \in \mathcal A}\pi(a~|~s)\Bigg( \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_\pi(s')\Bigg)$$
<br> <br>
We can also combine them into:
<br> <br>
$$q_\pi(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a \sum_{a' \in \mathcal A}\pi(a'~|~s') q_\pi(s',a')$$
<br> <br>
Finally, using matrix notation we can write the state-value function as:
<br> <br>
$$v_\pi = \mathcal R_\pi + \gamma\mathcal P_\pi v_\pi$$
<br> <br>
which can then be transformed into:
<br> <br>
$$v_\pi = (I - \gamma\mathcal P_\pi)^{-1}\mathcal R_\pi$$
<br> <br>
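
The relations between $v_\pi$ and $q_\pi$ can be checked numerically. The sketch below is only an illustration under assumed data: the two-state, two-action MDP and the policy are made up, and the array layouts `P[a, s, s']`, `R[s, a]` and `pi[s, a]` are a choice made for this example rather than the dict-based design used elsewhere in this write-up.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP with 2 states and 2 actions.
# P[a, s, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])    # action 1: switch
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])                 # pi[s, a]

# Policy-averaged reward vector and transition matrix.
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,ast->st", pi, P)

# Matrix form: v_pi = (I - gamma * P_pi)^{-1} R_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s, a) = R(s, a) + gamma * sum_{s'} P(s, s', a) * v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->sa", P, v_pi)

# Consistency check: v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v_from_q = (pi * q_pi).sum(axis=1)
print(v_pi, v_from_q)
```

Both printed vectors should agree up to floating-point error, which ties the matrix form back to the equation expressing $v_\pi$ in terms of $q_\pi$.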

#### Optimal Value and Optimal Policy
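The optimal state-value function and the optimal action-value function are the maxima of the respective value functions over all policies:
<br><br>
$$v_*(s) = \max_\pi v_\pi(s)$$
<br><br>
$$q_*(s,a) = \max_\pi q_\pi(s,a)$$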
<br><br>
We say that $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all states $s$. Then,
- For any MDP there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$.
- All optimal policies achieve the optimal state-value function, $v_{\pi_*}(s) = v_*(s), \forall s$.
- All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s,a) = q_*(s,a), \forall s, a$.
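
The optimal value functions satisfy the Bellman optimality equations. These are standard results, written here in the notation of this write-up; they mirror the four Bellman expectation equations of the previous section, with the average over the policy replaced by a maximum over actions:
<br> <br>
$$v_*(s) = \max_{a \in \mathcal A} q_*(s,a)$$
<br> <br>
$$q_*(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_*(s')$$
<br> <br>
$$v_*(s) = \max_{a \in \mathcal A}\Bigg( \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_*(s')\Bigg)$$
<br> <br>
$$q_*(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a \max_{a' \in \mathcal A} q_*(s',a')$$
<br> <br>
Unlike the expectation equations, these are non-linear because of the maximum, so they have no closed-form matrix solution in general.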


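An optimal policy can be found by acting greedily with respect to $q_*$, i.e. by choosing in every state an action that maximizes $q_*(s,a)$:
<br><br>
$$\pi_*(a~|~s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal A} q_*(s,a') \\ 0 & \text{otherwise} \end{cases}$$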
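
As a minimal sketch of this step, the function below turns a table of action values into a deterministic policy in the same state-to-action-to-probability layout used for policies in this write-up. The function name `greedy_policy` and the numbers in `q_star` are hypothetical; ties are broken by taking the first maximizing action.

```python
from typing import Dict, Hashable

QTable = Dict[Hashable, Dict[Hashable, float]]


def greedy_policy(q: QTable) -> QTable:
    # Put probability 1 on an action that maximizes q(s, a) in each state s.
    policy = {}
    for s, q_s in q.items():
        best = max(q_s, key=q_s.get)  # first maximizer in dict order
        policy[s] = {a: (1.0 if a == best else 0.0) for a in q_s}
    return policy


# Hypothetical optimal action values, for illustration only.
q_star = {"s0": {"stay": 1.0, "go": 3.0}, "s1": {"stay": 2.0, "go": 0.5}}
print(greedy_policy(q_star))
# {'s0': {'stay': 0.0, 'go': 1.0}, 's1': {'stay': 1.0, 'go': 0.0}}
```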

## Appendix
Here is the code from previous assignments that is necessary to run the code above.

In [16]:
from typing import NamedTuple, Any, Dict, Tuple, Set, TypeVar
import numpy as np

# generic state and action types, defined before the classes that use them
S = TypeVar('S')
A = TypeVar('A')


class MP(NamedTuple):
    States: Set[S]
    P: Dict[S, Dict[S, float]]


class MRP(NamedTuple):
    # assumes that the reward is just a function of the current state
    mp: MP
    R: Dict[S, float]
    gamma: float


class MRPTransitionReward(NamedTuple):
    # assumes that the reward is a function of the transition
    # (a distinct class name keeps this definition from shadowing MRP)
    mp: MP
    R: Dict[S, Dict[S, float]]
    gamma: float