# Write-up and code for Assignment 3 - Week 2

### To do:
- ~~Write the Bellman equation for MRP Value Function and code to calculate MRP Value Function (based on Matrix inversion method you learnt in this lecture)~~
- ~~Write out the MDP definition, Policy definition and MDP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)~~
- Think about the data structure/class design (in Python 3) to represent MDP, Policy, Value Function, and implement them with clear type definitions
- The data structure/code design of MDP should be incremental (and not independent) to that of MRP
- Separately implement the $r(s,s',a)$ and $R(s,a) = \sum_{s'} p(s,s',a) \cdot r(s,s',a)$ definitions of MDP
- Write code to convert/cast the $r(s,s',a)$ definition of MDP to the $R(s,a)$ definition of MDP (put some thought into code design here)
- Write code to create an MRP given an MDP and a Policy
- Write out all 8 MDP Bellman Equations and also the transformation from Optimal Action-Value function to Optimal Policy (in LaTeX)

#### Bellman Equation for MRP
The value of being in the current state $s$ in an MRP can be decomposed into two parts: the immediate reward $R_{t+1}$, and the discounted value of the successor state $\gamma v(S_{t+1})$. Hence, we can define the value function for state $s$ as:
<br>
<br>
$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) ~|~ S_t = s]$$
<br>
<br>
For a discrete state space, this becomes:
<br>
<br>
$$v(s) = \mathcal R_s + \gamma \sum_{s'\in\mathcal S} \mathcal P_{ss'} v(s')$$
<br>
<br>
Using matrix notation this can be written as:
<br>
<br>
$$v = \mathcal R + \gamma \mathcal P v$$
<br>
<br>
We can then find $v$ by using matrix operations as follows:
<br>
<br>
$$v = (I - \gamma \mathcal P)^{-1} \mathcal R$$

In [12]:
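def get_value_function(mrp: MRP) -> np.ndarray:
    # solves v = (I - gamma * P)^{-1} R and returns v as a column vector,
    # with rows in the same state order as used by get_matrices
    R, P = get_matrices(mrp)
    I = np.identity(R.shape[0])

    v = np.matmul(np.linalg.inv(np.subtract(I, mrp.gamma * P)), R)
    return v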


def get_matrices(mrp: MRP) -> Tuple[np.ndarray, np.ndarray]:
    # function to convert reward and transition dicts to matrices
    # this is necessary to solve the value function using matrix operations
    # assumes rewards are defined as a function of the current state
    # the states and transitions live on the wrapped MP (mrp.mp)
    sz = len(mrp.mp.S)
    R = np.zeros((sz, 1))
    P = np.zeros((sz, sz))

    for i, s in enumerate(mrp.mp.S):
        R[i] = mrp.R[s]
        for j, sp in enumerate(mrp.mp.S):
            # transitions that are not listed keep probability 0
            if sp in mrp.mp.P[s]:
                P[i, j] = mrp.mp.P[s][sp]

    return R, P
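
As a quick sanity check of the matrix solution, the sketch below solves a hypothetical two-state MRP directly from arrays (the rewards and transition probabilities are made up for illustration) and checks that the result satisfies $v = \mathcal R + \gamma \mathcal P v$. It uses `np.linalg.solve` instead of forming the inverse explicitly, which gives the same answer here and is usually the numerically safer choice.

```python
import numpy as np

# Hypothetical two-state MRP, made up for illustration.
gamma = 0.9
R = np.array([1.0, 0.0])            # expected immediate reward per state
P = np.array([[0.5, 0.5],           # row i holds P(s' | s = i)
              [0.0, 1.0]])          # state 1 is absorbing with zero reward

# Solve (I - gamma * P) v = R instead of inverting the matrix explicitly.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# The solution should satisfy the Bellman equation v = R + gamma * P v.
bellman_residual = np.max(np.abs(v - (R + gamma * P @ v)))
print(v, bellman_residual)
```

Since state 1 is absorbing with zero reward, its value is 0, and the value of state 0 reduces to $1 / (1 - 0.9 \cdot 0.5)$.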

#### Markov Decision Process
A Markov Decision Process (MDP) consists of a finite set of states $\mathcal S$, a finite set of actions $\mathcal A$, a transition probability matrix $\mathcal P$ (which now also depends on the action taken), a reward function $\mathcal R$ (which may likewise depend on the action), and a discount factor $\gamma$. An MDP is thus defined by the tuple $(\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)$ and extends an MRP by adding actions.
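<br>
<br>
To make the dependence on the action explicit, the transition probabilities and rewards can be written out as follows. This is a sketch in one common notation, where the superscript $a$ marks the action taken:
<br>
<br>
$$\mathcal P^a_{ss'} = \mathbb P[S_{t+1} = s' ~|~ S_t = s, A_t = a]$$
<br>
<br>
$$\mathcal R^a_s = \mathbb E[R_{t+1} ~|~ S_t = s, A_t = a]$$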


#### Policy
A policy $\pi$ is a probability distribution over actions $a \in \mathcal A$ given a state $s$.
<br>
<br>
$$\pi(a~|~s) = \mathbb P[A_t = a ~|~ S_t = s]$$
<br>
- A policy fully defines the behaviour of an agent.
- An MDP combined with a policy results in an MRP.
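
The last point can be sketched in code: averaging the MDP's transitions and rewards over the policy's action probabilities gives the induced MRP, with $\mathcal P^\pi_{ss'} = \sum_a \pi(a|s)\, p(s' | s, a)$ and $\mathcal R^\pi_s = \sum_a \pi(a|s)\, R(s,a)$. This is a minimal sketch under assumptions that are not part of the classes in this write-up: the policy is a nested dict `pi[s][a]` of probabilities, rewards are given as `R[s][a]`, and the function name `mdp_to_mrp` is hypothetical.

```python
from typing import Any, Dict, Tuple


def mdp_to_mrp(
    P: Dict[Any, Dict[Any, Dict[Any, float]]],   # P[s][a][s'] = transition probability
    R: Dict[Any, Dict[Any, float]],              # R[s][a] = expected reward (assumed form)
    pi: Dict[Any, Dict[Any, float]],             # pi[s][a] = probability of a in s (assumed form)
) -> Tuple[Dict[Any, Dict[Any, float]], Dict[Any, float]]:
    """Average the MDP dynamics and rewards over the policy's action probabilities."""
    P_pi: Dict[Any, Dict[Any, float]] = {}
    R_pi: Dict[Any, float] = {}
    for s, action_probs in pi.items():
        P_pi[s] = {}
        R_pi[s] = 0.0
        for a, prob_a in action_probs.items():
            R_pi[s] += prob_a * R[s][a]
            for sp, prob_sp in P[s][a].items():
                # accumulate pi(a|s) * p(s'|s,a) over all actions
                P_pi[s][sp] = P_pi[s].get(sp, 0.0) + prob_a * prob_sp
    return P_pi, R_pi


# Hypothetical two-state, two-action MDP, made up for illustration.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0}}

P_pi, R_pi = mdp_to_mrp(P, R, pi)
print(P_pi)
print(R_pi)
```

The returned dicts have the same shape as the `P` and `R` of a state-reward MRP, so they could be passed on to a matrix-based value calculation.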


#### MDP Value Function
We can define the value function for an MDP in two ways, firstly as solely a function of the current state:
<br>
- The state-value function $v_\pi(s)$ is the expected return (the discounted sum of future rewards) when starting in state $s$ and then following policy $\pi$.
<br>
<br>
$$v_\pi(s) = \mathbb E_\pi[ G_t ~|~ S_t = s]$$
<br>
and secondly as a function of both the current state and the action (action-value function):
<br>
- The action-value function $q_\pi(s,a)$ is the expected return (the discounted sum of future rewards) when starting in state $s$, taking action $a$ and then following policy $\pi$.
<br>
<br>
$$q_\pi(s,a) = \mathbb E_\pi[ G_t ~|~ S_t = s, A_t = a]$$
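
Both definitions rely on the return $G_t$, which is not written out above. Assuming the usual definition $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, a minimal sketch of computing it for one finite sampled reward sequence (the function name and the numbers are made up for illustration) is:

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], i.e. G_t for one finite sampled episode."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(g)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

$v_\pi(s)$ is then the expectation of this quantity over all reward sequences that start in $s$ and follow $\pi$, and $q_\pi(s,a)$ additionally fixes the first action to $a$.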

In [1]:
from typing import NamedTuple, Any, Dict, Tuple, Set
import numpy as np

class MDP(NamedTuple):
    S: Set[Any]
    # the transitions depend on s, a, and s'
    # mapping from a state to a mapping of an action to a mapping of a state to a float (probability)
    P: Dict[Any, Dict[Any, Dict[Any, float]]]
    # finite set of actions
    A: Set[Any]
    # reward as a function of the current state only
    R: Dict[Any, float]
    gamma: float


class Policy(NamedTuple):
    # mapping from a state to the action chosen in that state
    pi: Dict[Any, Any]


class value_function(NamedTuple):
    # mapping from a state to its value
    v: Dict[Any, float]
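
The `R` field of the MDP above stores one reward per state. If rewards are instead specified per transition as $r(s,s',a)$, they can be collapsed into $R(s,a) = \sum_{s'} p(s,s',a) \cdot r(s,s',a)$. The following is a minimal sketch of that conversion, assuming both the transition probabilities and the transition rewards are nested dicts indexed as `[s][a][s']`. The function name `expected_rewards` and the example numbers are hypothetical.

```python
from typing import Any, Dict


def expected_rewards(
    P: Dict[Any, Dict[Any, Dict[Any, float]]],   # P[s][a][s'] = transition probability
    r: Dict[Any, Dict[Any, Dict[Any, float]]],   # r[s][a][s'] = reward for that transition
) -> Dict[Any, Dict[Any, float]]:
    """Compute R[s][a] = sum over s' of P[s][a][s'] * r[s][a][s']."""
    return {
        s: {
            # assumption: a transition without a listed reward contributes 0
            a: sum(p * r[s][a].get(sp, 0.0) for sp, p in next_states.items())
            for a, next_states in actions.items()
        }
        for s, actions in P.items()
    }


# Hypothetical single state-action pair, made up for illustration.
P = {"s0": {"a0": {"s0": 0.25, "s1": 0.75}}}
r = {"s0": {"a0": {"s0": 4.0, "s1": 8.0}}}

R = expected_rewards(P, r)
print(R)  # 0.25 * 4.0 + 0.75 * 8.0 = 7.0
```

Keeping the conversion as a standalone function means the $r(s,s',a)$ form can be cast to the $R(s,a)$ form once, and everything downstream only needs to handle the latter.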

## Appendix
Here is the code from previous assignments that is necessary to run the code above.

In [8]:
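class MP(NamedTuple):
    S: Set[Any]
    # mapping from a state to a mapping of a successor state to a float (probability)
    P: Dict[Any, Dict[Any, float]]


class MRP(NamedTuple):
    # assumes that the reward is just a function of the current state
    mp: MP
    R: Dict[Any, float]
    gamma: float


class TransitionRewardMRP(NamedTuple):
    # assumes that the reward is a function of the transition
    # has its own name so that it does not shadow the state-reward MRP above
    mp: MP
    R: Dict[Any, Dict[Any, float]]
    gamma: float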