# Write-up and code for Assignment 3 - Week 2

#### Bellman Equation for MRP
The value of being in the current state $s$ in an MRP can be decomposed into two parts: the immediate reward $R_{t+1}$, and the discounted value of the successor state $\gamma v(S_{t+1})$. Hence, we can define the value function for state $s$ as:
<br>
<br>
$$v(s) = \mathbb E[R_{t+1} + \gamma v(S_{t+1}) ~|~ S_t = s]$$
<br>
<br>
For a discrete state space this becomes:
<br>
<br>
$$v(s) = \mathcal R_s + \gamma \sum_{s'\in\mathcal S} \mathcal P_{ss'} v(s')$$
<br>
<br>
Using matrix notation this can be written as:
<br>
<br>
$$v = \mathcal R + \gamma \mathcal P v$$
<br>
<br>
We can then find $v$ by using matrix operations as follows:
<br>
<br>
$$v = (I - \gamma \mathcal P)^{-1} \mathcal R$$

In [65]:
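def get_value_function(mrp: MRP) -> Dict[S, float]:
    # solve the Bellman equation v = R + gamma * P v in closed form
    R, P, index_to_state = get_matrices(mrp)
    I = np.identity(R.shape[0])
    
    # v = (I - gamma * P)^{-1} R, a column vector of shape (n, 1)
    v = np.matmul(np.linalg.inv(np.subtract(I, mrp.gamma*P)), R)
    vf = {}
    for i in range(v.shape[0]):
        s = index_to_state[i]
        # index the single column explicitly so we convert a scalar, not a 1-element array
        vf[s] = float(v[i, 0])
    return vf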


def get_matrices(mrp: MRP) -> Tuple[np.ndarray, np.ndarray, Dict[int, S]]:
    # function to convert reward and transition dicts to matrices
    # this is necessary to solve for the value function using matrix operations
    # assumes rewards are defined as a function of the current state
    sz = len(mrp.mp.States)
    R = np.zeros((sz,1))
    P = np.zeros((sz,sz))
    # keep an index where we map each index in the matrix to the state
    index_to_state = {}
    
    for i, s in enumerate(mrp.mp.States):
        index_to_state[i] = s
        R[i] = mrp.R[s]
        for j, sp in enumerate(mrp.mp.States):
            if sp in mrp.mp.P[s].keys():
                P[i, j] = mrp.mp.P[s][sp]
                
    return R, P, index_to_state
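
As a quick sanity check of the closed-form solution, the sketch below solves a hypothetical two-state MRP and confirms that the result satisfies the Bellman equation. The transition matrix, rewards, and discount factor are made up for illustration. It uses `np.linalg.solve` rather than an explicit inverse, which is a common numerical preference and not what the function above does.

```python
import numpy as np

# hypothetical two-state MRP, chosen only for illustration
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])   # state 1 is absorbing
R = np.array([1.0, 2.0])
gamma = 0.9

# solve (I - gamma * P) v = R without forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - gamma * P, R)

# the solution should satisfy the Bellman equation v = R + gamma * P v
bellman_residual = np.max(np.abs(v - (R + gamma * P @ v)))
print(v, bellman_residual)
```

The absorbing state collects a reward of 2 forever, so its value should be $2/(1-0.9) = 20$, which is an easy number to check by hand.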

#### Markov Decision Process
A Markov Decision Process (MDP) consists of a finite set of states $\mathcal S$, a transition probability matrix $\mathcal P$ (that now also depends on the action taken), a reward function $\mathcal R$, a discount factor $\gamma$, and a finite set of actions $\mathcal A$. An MDP is then defined by the parameters $\mathcal S$, $\mathcal A$, $\mathcal P$, $\mathcal R$ and $\gamma$. An MDP is thus an extension of an MRP.


#### Policy
A policy $\pi$ is a probability distribution over actions $a \in \mathcal A$ given a state $s$.
<br>
<br>
$$\pi(a~|~s) = \mathbb P[A_t = a ~|~ S_t = s]$$
<br>
- A policy fully defines the behaviour of an agent.
- An MDP combined with a policy results in an MRP.


#### MDP Value Function
We can define the value function for an MDP in two ways, firstly as solely a function of the current state:
<br>
- The state-value function $v_\pi(s)$ is the expected aggregated and discounted reward starting in state $s$ and then following policy $\pi$.
<br>
<br>
$$v_\pi(s) = \mathbb E_\pi[ G_t ~|~ S_t = s]$$
<br>
and secondly as a function of both the current state and the action (action-value function):
<br>
- The action-value function $q_\pi(s,a)$ is the expected aggregated and discounted reward starting in state $s$, taking action $a$ and then following policy $\pi$.
<br>
<br>
$$q_\pi(s,a) = \mathbb E_\pi[ G_t ~|~ S_t = s, A_t = a]$$
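
Both definitions are expectations of the return $G_t$. Assuming the usual definition $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ (it is not spelled out in this write-up), the following minimal sketch computes the return of one finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    # rewards[k] plays the role of R_{t+k+1}
    g = 0.0
    # accumulate backwards using G = r + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# a reward of 3 received two steps into the future is discounted by gamma^2
g = discounted_return([0.0, 0.0, 3.0], gamma=0.9)
print(g)
```

The value functions average this quantity over all trajectories that the policy and the transition probabilities can generate from the given starting point.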

In [25]:
class MDP(NamedTuple):
    States: Set[S]
    # the transitions depend on s, a, and s'
    # mapping from a state to a mapping of an action to a mapping of a state to a float (probability)
    P: Dict[S, Dict[A, Dict[S, float]]]
    # the set of available actions
    Actions: Set[A]
    # reward is a function of the current state s and the action a
    R: Dict[S, Dict[A, float]]
    gamma: float
        
        
# alternative definition; it replaces the class above when the cell is run
class MDP(NamedTuple):
    States: Set[S]
    # the transitions depend on s, a, and s'
    # mapping from a state to a mapping of an action to a mapping of a state to a float (probability)
    P: Dict[S, Dict[A, Dict[S, float]]]
    # the set of available actions
    Actions: Set[A]
    # reward is a function of the current state s, the action a, and the next state sp
    R: Dict[S, Dict[A, Dict[S, float]]]
    gamma: float
        
        
class Policy(NamedTuple):
    # state to action to a probability
    pi: Dict[S, Dict[A, float]]
        
        
class state_value_function(NamedTuple):
    vf: Dict[S,  float]
        

class action_value_function(NamedTuple):
    vf: Dict[S, Dict[A, float]]

As the code above shows, the reward can be defined in more than one way. It can be a function of the current state and the action taken: $R(s,a)$.
It can also be a function of the current state, the action taken, and the next state: $r(s,a,s')$. The code below converts the latter definition to the former by taking the expectation over the next state:
$$R(s,a) = \sum_{s'\in \mathcal S}\mathcal P_{s,s'}^a\, r(s,a,s')$$

In [22]:
def convert_reward_mdp(mdp: MDP) -> Dict[S, Dict[A, float]]:
    # function to convert an mdp using r(s, s', a) to R(s,a)
    new_r = {}
    for s in mdp.States:
        new_r[s] = {}
        for a in mdp.R[s]:
            new_r[s][a] = 0
            for sp in mdp.R[s][a]:
                new_r[s][a] += mdp.R[s][a][sp]*mdp.P[s][a][sp]
    
    return new_r
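
To make the conversion concrete, here is a small self-contained sketch on plain dictionaries. The single state-action pair, its probabilities, and its rewards are hypothetical and only illustrate the expectation.

```python
# hypothetical example with one state-action pair; all numbers are made up
# P[s][a][sp] is the transition probability, r[s][a][sp] the transition reward
P = {"s0": {"go": {"s0": 0.2, "s1": 0.8}}}
r = {"s0": {"go": {"s0": 0.0, "s1": 5.0}}}

# R(s, a) = sum over s' of P(s' | s, a) * r(s, a, s')
R = {
    s: {a: sum(P[s][a][sp] * r[s][a][sp] for sp in P[s][a]) for a in P[s]}
    for s in P
}
print(R)
```

Here the expected reward is $0.2 \cdot 0 + 0.8 \cdot 5 = 4$.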

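Given a policy $\pi$ and an MDP we can create an MRP by averaging the MDP's rewards and transition probabilities over the actions the policy chooses:
$$(\mathcal R_\pi)_s = \sum_{a \in \mathcal A}\pi(a~|~s)\mathcal R_s^a, \qquad (\mathcal P_\pi)_{s,s'} = \sum_{a \in \mathcal A}\pi(a~|~s)\mathcal P_{s,s'}^a$$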

In [79]:
def mdp_to_mrp(mdp: MDP, policy: Dict[S, Dict[A, float]]) -> MRP:
    # function to convert an MDP and a policy into an MRP
    # policy is the plain state -> action -> probability dict (the pi field of Policy)
    States = mdp.States
    gamma = mdp.gamma
    reward = {}
    probs = {}
    for s in States:
        # the MRP reward for s is the policy-weighted average of R(s, a)
        reward[s] = 0
        probs[s] = {}
        for a in policy[s]:
            reward[s] += policy[s][a]*mdp.R[s][a]
            for sp in mdp.P[s][a]:
                if sp not in probs[s]:
                    probs[s][sp] = 0
                probs[s][sp] += policy[s][a]*mdp.P[s][a][sp]
    
    mp = MP(States, probs)
    mrp = MRP(mp, reward, gamma)
    
    return mrp
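
The same reduction can be written with arrays. The sketch below is an illustration under assumptions, not part of the assignment code: the array layout `P[a, s, s']`, `R[s, a]`, `pi[s, a]` and all numbers are hypothetical.

```python
import numpy as np

# hypothetical MDP with 2 states and 2 actions
# P[a, s, s'] is the transition probability, R[s, a] the reward
P = np.array([
    [[1.0, 0.0], [0.5, 0.5]],   # action 0
    [[0.0, 1.0], [0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
# pi[s, a] is the probability of taking action a in state s
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
P_pi = np.einsum("sa,ast->st", pi, P)
# R_pi[s] = sum_a pi[s, a] * R[s, a]
R_pi = np.sum(pi * R, axis=1)

print(P_pi)
print(R_pi)
```

Each row of `P_pi` is again a probability distribution, which is what makes the result a valid MRP.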

#### Bellman Equations
As we saw earlier the state-value function $v_\pi(s)$ can be expressed as the immediate and future discounted rewards starting in state $s$. We can therefore decompose the state-value function as follows:
$$v_\pi(s) = \mathbb E[R_{t+1} + \gamma v_\pi(S_{t+1})~|~S_t = s]$$
<br>
Similarly, we can decompose the action-value function into the same two parts:
<br> <br>
$$q_\pi(s,a) = \mathbb E[R_{t+1} + \gamma q_\pi(S_{t+1},A_{t+1})~|~S_t = s, A_{t} = a]$$
<br> <br>
We can see how the state-value function is connected to the action-value function, and how we can transform the latter to the former using the policy $\pi$:
<br> <br>
$$v_\pi(s) = \sum_{a \in \mathcal A}\pi(a~|~s) q_\pi(s,a)$$
<br> <br>
The action-value function also depends on the state-value function. If we take action $a$ in state $s$ the value is the immediate reward, plus the discounted value of the state we end up in when taking action $a$. This can be illustrated by the following equation:
<br> <br>
$$q_\pi(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_\pi(s')$$
<br> <br>
Combining the last two equations gives us:
<br> <br>
$$v_\pi(s) = \sum_{a \in \mathcal A}\pi(a~|~s)\Bigg( \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a v_\pi(s')\Bigg)$$
<br> <br>
We can also combine them into:
<br> <br>
$$q_\pi(s,a) = \mathcal R_s^a + \gamma\sum_{s' \in \mathcal S}\mathcal P_{s,s'}^a \sum_{a' \in \mathcal A}\pi(a'~|~s') q_\pi(s',a')$$
<br> <br>
Finally, using matrix notation we can write the state-value function as:
<br> <br>
$$v_\pi = \mathcal R_\pi + \gamma\mathcal P_\pi v_\pi$$
<br> <br>
which can then be solved for $v_\pi$:
<br> <br>
$$v_\pi = (I - \gamma\mathcal P_\pi)^{-1}\mathcal R_\pi$$
<br> <br>
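
The relations above can be checked numerically. The following sketch, using a hypothetical two-state, two-action MDP with the assumed array layout `P[a, s, s']`, `R[s, a]`, `pi[s, a]`, computes $v_\pi$ from the matrix form, derives $q_\pi$ from it, and then recovers $v_\pi$ again from $q_\pi$.

```python
import numpy as np

gamma = 0.9
# hypothetical MDP and policy; all numbers are made up for illustration
P = np.array([
    [[1.0, 0.0], [0.5, 0.5]],   # action 0
    [[0.0, 1.0], [0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

# induced MRP: P_pi[s, s'] and R_pi[s]
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.sum(pi * R, axis=1)

# v_pi = (I - gamma * P_pi)^{-1} R_pi, solved as a linear system
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * v_pi(s')
q = R + gamma * np.einsum("ast,t->sa", P, v)

# v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v_from_q = np.sum(pi * q, axis=1)

print(v, v_from_q)
```

If the two printed vectors agree up to floating-point error, the state-value and action-value equations are consistent for this example.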

#### Optimal Value and Optimal Policy
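The optimal state-value function and the optimal action-value function are the maxima of the corresponding value functions over all policies:
<br><br>
$$v_*(s) = \max_\pi v_\pi(s)$$
<br><br>
$$q_*(s,a) = \max_\pi q_\pi(s,a)$$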
<br><br>
We say that $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all states $s$. Then,
- For any MDP there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$.
- All optimal policies achieve the optimal state-value function, $v_{\pi_*}(s) = v_*(s), \forall s$.
- All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s,a) = q_*(s,a), \forall s, a$.


An optimal policy can be found by acting greedily with respect to $q_*$, i.e. by choosing in every state $s$ an action that maximizes $q_*(s,a)$.
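
This write-up does not say how to obtain $q_*$. One standard option, used here purely as an illustration, is value iteration on the action-value function followed by a greedy action choice. The function name `q_value_iteration`, the array layout `P[a, s, s']`, `R[s, a]`, and the example MDP are all assumptions.

```python
import numpy as np


def q_value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    # P[a, s, s'] are transition probabilities, R[s, a] are rewards
    q = np.zeros_like(R, dtype=float)
    for _ in range(max_iter):
        # v(s) = max_a q(s, a)
        v = q.max(axis=1)
        # q(s, a) <- R(s, a) + gamma * sum_s' P(s' | s, a) * v(s')
        q_new = R + gamma * np.einsum("ast,t->sa", P, v)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q


# hypothetical two-state, two-action MDP
P = np.array([
    [[1.0, 0.0], [0.5, 0.5]],   # action 0
    [[0.0, 1.0], [0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

q_star = q_value_iteration(P, R, gamma=0.9)
# greedy (deterministic) policy: best action index per state
greedy_policy = q_star.argmax(axis=1)
print(q_star, greedy_policy)
```

In this example the greedy policy takes action 1 in state 0 (move to the rewarding state) and action 0 in state 1 (collect the reward of 2).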

### Example 1
We will continue on the example used in the second assignment which was a $4\times4$ gridworld problem.

In [100]:
def is_in_grid(state: Tuple[int, int], size: int) -> bool:
    # helper function to check whether a state is in the grid
    return 0 <= state[0] < size and 0 <= state[1] < size


def get_neighbor_states(state: Tuple[int, int], size: int) -> Set[Tuple[int, int]]:
    # function to return a set of neighboring states in the grid
    nbr_states = set()
    
    # use the state argument rather than a global variable s
    up_state = state[0]-1, state[1]
    if is_in_grid(up_state, size):
        nbr_states.add(up_state)
        
    down_state = state[0]+1, state[1]
    if is_in_grid(down_state, size):
        nbr_states.add(down_state)
        
    left_state = state[0], state[1]-1
    if is_in_grid(left_state, size):
        nbr_states.add(left_state)
        
    right_state = state[0], state[1]+1
    if is_in_grid(right_state, size):
        nbr_states.add(right_state)
    
    return nbr_states

In [101]:
# define the gridworld parameters
States = set()
P = {}
for i in range(4):
    for j in range(4):
        state = (i, j)
        States.add(state)

for s in States:
    P[s] = {}
    nbrs = get_neighbor_states(s, 4)
    if s == (0,0) or s == (3,3):
        P[s][s] = 1.0
    else:
        for sp in nbrs:
            P[s][sp] = 0.25
        if len(nbrs) < 4:
            P[s][s] = 0.25*(4 - len(nbrs))
    
        
mp = MP(States, P)

In [102]:
# here the reward is just a function of the current state
R = {}
for s in mp.States:
    if s == (0,0) or s == (3,3):
        R[s] = 3.
    elif s == (1,2):
        R[s] = -2.
    else:
        R[s] = 0.
gamma = 0.9

In [103]:
mrp = MRP(mp, R, gamma)

Now we will compute the value function of the MRP:

In [104]:
v = get_value_function(mrp)
for s in sorted(v):
    print(s, v[s])

(0, 0) 30.000000000000007
(0, 1) 13.386087847135503
(0, 2) 7.09545337311291
(0, 3) 5.805370941637835
(1, 0) 13.693968875824414
(1, 1) 9.012182544798252
(1, 2) 5.248436163060024
(1, 3) 7.095453373112911
(2, 0) 8.15593247193028
(2, 1) 7.725651757527839
(2, 2) 9.01218254479825
(2, 3) 13.3860878471355
(3, 0) 6.673035658852047
(3, 1) 8.15593247193028
(3, 2) 13.693968875824414
(3, 3) 30.000000000000004


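The top-left and bottom-right corner states are absorbing but not terminating, so the process stays in them forever and keeps collecting their reward of 3 at every step. Their value is therefore $3/(1-\gamma) = 3/(1-0.9) = 30$, which matches the output above up to floating-point error. They are also the only states with positive rewards, so it is reasonable that they have the highest value.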

### Example 2
We will modify the previous example and turn it into an MDP.

In [89]:
def get_neighbor_direction(s: S, sp: S) -> int:
    # function to figure out in which direction the state sp is from state s
    # assume that both states are adjacent
    if s[1] > sp[1]:
        # sp is to the left of s
        return 1
    elif s[1] < sp[1]:
        # sp is to the right of s
        return 2
    elif s[0] > sp[0]:
        # sp is above s
        return 3
    elif s[0] < sp[0]:
        # sp is below s
        return 4
    else:
        # sp is equal to s
        return 0

In [90]:
# define the gridworld parameters
States = set()
P_2 = {}
# actions: 1 is move left, 2 is move right, 3 is up, 4 is down
# note: this rebinds the name A, which the appendix uses for the action type variable
A = {1, 2, 3, 4}
for i in range(4):
    for j in range(4):
        state = (i, j)
        States.add(state)

for s in States:
    P_2[s] = {}
    nbrs = get_neighbor_states(s, 4)
    for a in A:
        P_2[s][a] = {}
        if s == (0,0) or s == (3,3):
            P_2[s][a][s] = 1.0
        else:
            agg_p = 0
            for sp in nbrs:
                if get_neighbor_direction(s, sp) == a:
                    P_2[s][a][sp] = 0.7
                    agg_p += 0.7
                else:
                    P_2[s][a][sp] = 0.1
                    agg_p += 0.1
            if len(nbrs) < 4:
                P_2[s][a][s] = 1. - agg_p
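
Because the leftover probability mass is assigned to staying in place, every action's distribution should sum to one. A small helper along the following lines can confirm that. The name `rows_sum_to_one` and the demo dictionary are hypothetical; the demo mirrors the top-right corner state trying to move right, which is off the grid.

```python
import math
from typing import Dict, Hashable


def rows_sum_to_one(P: Dict[Hashable, Dict[Hashable, Dict[Hashable, float]]]) -> bool:
    # P[s][a] must be a probability distribution over next states
    return all(
        math.isclose(sum(dist.values()), 1.0, abs_tol=1e-9)
        for actions in P.values()
        for dist in actions.values()
    )


# corner state (0, 3) with action 2 (move right, which leaves the grid):
# each of the two neighbors gets 0.1 and the remaining 0.8 stays in place
P_demo = {(0, 3): {2: {(0, 2): 0.1, (1, 3): 0.1, (0, 3): 0.8}}}
print(rows_sum_to_one(P_demo))

# a distribution with missing mass is rejected
P_bad = {(0, 3): {2: {(0, 2): 0.1, (1, 3): 0.1}}}
print(rows_sum_to_one(P_bad))
```

Calling the helper on `P_2` after the cell above has run would be one way to catch mistakes in the transition construction.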

In [91]:
# here the reward depends only on the current state, so it is the same for every action
R_2 = {}
for s in States:
    R_2[s] = {}
    for a in A:
        if s == (0,0) or s == (3,3):
            R_2[s][a] = 3.
        elif s == (1,2):
            R_2[s][a] = -2.
        else:
            R_2[s][a] = 0.
gamma_2 = 0.9

In [92]:
mdp = MDP(States, P_2, A, R_2, gamma_2)

Create a policy that always aims to move right.

In [93]:
policy = {}
for s in mdp.States:
    policy[s] = {}
    for a in A:
        if a == 2:
            policy[s][a] = 1.0
        else:
            policy[s][a] = 0.

Combine the above MDP and policy to create an MRP.

In [94]:
mrp2 = mdp_to_mrp(mdp, policy)

In [99]:
v2 = get_value_function(mrp2)
for s in sorted(v2):
    print(s, v2[s])

(0, 0) 30.000000000000007
(0, 1) 4.3392109618152865
(0, 2) 1.6319135097620148
(0, 3) 1.5581165139564894
(1, 0) 5.458593761423049
(1, 1) 2.4508496011315613
(1, 2) 1.2544322614163288
(1, 3) 3.215560089213728
(2, 0) 8.036500824245438
(2, 1) 8.652831681642478
(2, 2) 9.568674724791766
(2, 3) 10.406976035839175
(3, 0) 15.229537245561264
(3, 1) 18.674500741552613
(3, 2) 23.56251185930879
(3, 3) 30.000000000000007


## Appendix
Here is the code from previous assignments that is necessary to run the code above.

In [16]:
from typing import NamedTuple, Dict, Tuple, Set, TypeVar
import numpy as np

# generic type variables for states and actions, defined before the classes that use them
S = TypeVar('S')
A = TypeVar('A')


class MP(NamedTuple):
    States: Set[S]
    P: Dict[S, Dict[S, float]]
        
        
class MRP(NamedTuple):
    # assumes that the reward is just a function of the current state
    mp: MP
    R: Dict[S, float]
    gamma: float
        
        
class MRP(NamedTuple):
    # assumes that the reward is a function of the transition
    # note: this definition replaces the one above when the cell is run; type hints are
    # not enforced at runtime, so the examples can still store one float per state in R
    mp: MP
    R: Dict[S, Dict[S, float]]
    gamma: float