# Write-up and code for Assignment 2 Week 1

### Definitions

#### Markov Process
A Markov Process (MP) is a stochastic process in which every state satisfies the Markov property, i.e. $\mathbb P[S_{t+1}=s'~|~S_1, S_2, ..., S_t] = \mathbb P[S_{t+1} = s' ~|~ S_t]$. This means that the history before the current state offers no additional information about the transition to the next state. The states form a finite set $\mathcal S$, and the transitions are governed by a transition probability matrix $\mathcal P$ with entries $\mathcal P_{ss'} = \mathbb P[S_{t+1} = s'~|~S_t = s]$. An MP is thus defined by its parameters $\mathcal S$ and $\mathcal P$.

#### Markov Reward Process
In addition to a finite set of states $\mathcal S$ and a transition probability matrix $\mathcal P$, a Markov Reward Process (MRP) has a reward function $\mathcal R$, where $\mathcal R_s = \mathbb E[R_{t+1}~|~S_t = s]$, and a discount factor $\gamma \in [0,1]$. An MRP is thus defined by the parameters $\mathcal S$, $\mathcal P$, $\mathcal R$ and $\gamma$, and is an extension of an MP.

#### Markov Decision Process
A Markov Decision Process (MDP) consists of a finite set of states $\mathcal S$, a transition probability matrix $\mathcal P$ (that now also depends on the action taken), a reward function $\mathcal R$, a discount factor $\gamma$, and a finite set of actions $\mathcal A$. An MDP is then defined by the parameters $\mathcal S$, $\mathcal A$, $\mathcal P$, $\mathcal R$ and $\gamma$. An MDP is thus an extension of an MRP.
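
One way to see the relationship between the two: fixing a deterministic policy picks one action per state, which collapses the action-dependent transitions of an MDP into the plain state-to-state transitions of an MRP. The sketch below assumes the transitions are stored as a nested mapping `P[s][a][s']`. The function name `induced_transitions` and the toy states and actions are hypothetical and for illustration only.

```python
from typing import Any, Dict


def induced_transitions(
    P: Dict[Any, Dict[Any, Dict[Any, float]]],
    pi: Dict[Any, Any],
) -> Dict[Any, Dict[Any, float]]:
    """Collapse P[s][a][s'] to P_pi[s][s'] by following the action pi[s] in each state."""
    return {s: dict(P[s][pi[s]]) for s in P}


# Toy two-state MDP (illustrative numbers only).
P = {
    "low": {"wait": {"low": 1.0}, "charge": {"high": 0.9, "low": 0.1}},
    "high": {"wait": {"high": 0.8, "low": 0.2}, "charge": {"high": 1.0}},
}
pi = {"low": "charge", "high": "wait"}

# Transition probabilities of the MRP induced by following pi.
P_pi = induced_transitions(P, pi)
```

A stochastic policy would instead average the rows `P[s][a]` with weights given by the probability of each action in state `s`.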

#### Value function
First we define the return $G_t$ as the total discounted reward received from time $t$ onward. We use the same indexing as the reward function above, where the reward for leaving $S_t$ is $R_{t+1}$: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$$

The state value function $v(s)$ is the expected return when starting in state $s$ (for an MDP, under a fixed policy). It is given by:
$$v(s) = \mathbb E[G_t~|~S_t = s]$$
Using the definition of the return, which gives $G_t = R_{t+1} + \gamma G_{t+1}$, we can write the value function as $$v(s) = \mathbb E[R_{t+1} + \gamma R_{t+2} + ... ~|~S_t = s] = \mathbb E[R_{t+1} ~|~ S_t = s] + \gamma \mathbb E[G_{t+1}~|~S_t = s]$$
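
For an MRP, the last expectation can be expanded over the next state, which gives $\mathbb E[G_{t+1}~|~S_t = s] = \sum_{s'} \mathcal P_{ss'} v(s')$. The recursion then becomes the linear system $v = \mathcal R + \gamma \mathcal P v$. The following is a minimal sketch of solving it directly, assuming $\gamma < 1$ and a matrix representation of $\mathcal P$. The function name `mrp_value` and the toy numbers are illustrative and not part of the assignment code.

```python
import numpy as np


def mrp_value(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve (I - gamma * P) v = R for the vector of state values v."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)


# Toy MRP: state 0 pays an expected reward of 1, state 1 is absorbing with reward 0.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
gamma = 0.5

v = mrp_value(P, R, gamma)
```

Here state 0 satisfies $v_0 = 1 + 0.5\,(0.5\,v_0 + 0.5 \cdot 0)$, so $v_0 = 4/3$, which is what the solver returns.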


### Data Structures
The code below represents the processes defined above as `NamedTuple` structs.

In [12]:
from typing import NamedTuple, Any, Dict, Tuple, Set
import numpy as np


class MP(NamedTuple):
    S: Set[Any]
    # mapping from a state s to a mapping from a next state s' to P(s' | s)
    P: Dict[Any, Dict[Any, float]]


class MRP(NamedTuple):
    # assumes that the reward is just a function of the current state
    mp: MP
    R: Dict[Any, float]
    gamma: float


class MDP(NamedTuple):
    S: Set[Any]
    # the transitions depend on s, a, and s'
    # mapping from a state to a mapping of an action to a mapping of a state to a float (probability)
    P: Dict[Any, Dict[Any, Dict[Any, float]]]
    # finite set of actions
    A: Set[Any]
    # as in MRP, the reward is assumed to depend only on the current state
    R: Dict[Any, float]
    gamma: float


class Policy(NamedTuple):
    # deterministic policy: mapping from a state to the action taken in it
    pi: Dict[Any, Any]
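
Since the structs themselves do not enforce anything, a small consistency check can be useful before running algorithms on them. The sketch below takes the state set and transition mapping directly, so it works on the `S` and `P` fields of an `MP`. The helper name `is_valid_mp` and the weather example are hypothetical.

```python
from typing import Any, Dict, Set


def is_valid_mp(S: Set[Any], P: Dict[Any, Dict[Any, float]], tol: float = 1e-9) -> bool:
    """Check that every state has an outgoing distribution over S that sums to 1."""
    for s in S:
        row = P.get(s, {})
        # every target must be a known state and every probability non-negative
        if any(sp not in S or p < 0 for sp, p in row.items()):
            return False
        if abs(sum(row.values()) - 1.0) > tol:
            return False
    return True


S = {"sunny", "rainy"}
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}
ok = is_valid_mp(S, P)
```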

### Reward Functions
The reward for a transition can be defined in multiple ways. It can be a function of both the state the agent leaves and the state it enters, i.e. $R(s,s')$, or a function of the current state only, i.e. $R(s)$. The latter is simply the expected value of the former over the next transition: $$R(s) = \mathbb E[R(s,S_{t+1})~|~ S_t = s] = \sum_{s'\in \mathcal S}\mathbb P[S_{t+1} = s'~|~S_t=s]\times R(s, s')$$

In [13]:
def convert_reward(R: Dict[Tuple[Any, Any], float], P: Dict[Any, Dict[Any, float]]) -> Dict[Any, float]:
    # this function converts reward as a function of the transition to reward as a function of the current state
    new_R: Dict[Any, float] = {}
    for s, d in P.items():
        # the key tuple of R is (current state, next state)
        new_R[s] = 0.0
        for sp, p in d.items():
            # weight each transition reward by its transition probability
            new_R[s] += p * R[(s, sp)]
    return new_R
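
As a worked example of the same conversion, the expectation can also be written as an element-wise product followed by a row sum when the transition probabilities and transition rewards are stored as matrices. This is only a sketch with made-up numbers; the array names `R_trans` and `R_state` are illustrative.

```python
import numpy as np

# Rows index the current state s, columns the next state s'.
P = np.array([[0.2, 0.8],
              [0.6, 0.4]])
R_trans = np.array([[1.0, 5.0],
                    [0.0, 10.0]])

# R(s) = sum over s' of P[s, s'] * R_trans[s, s']
R_state = (P * R_trans).sum(axis=1)
```

For the first state this gives $0.2 \times 1 + 0.8 \times 5 = 4.2$, and for the second $0.6 \times 0 + 0.4 \times 10 = 4.0$.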

### Stationary Distribution
Below is code to find the stationary distribution of a Markov Process. The stationary distribution can be found analytically from the eigenvalues and eigenvectors of the transition matrix. Here I've first chosen to estimate it by simulation: we simulate the process many times, each to a sufficient depth, and count how often each state is the final state. The resulting frequencies estimate the stationary distribution, provided the chain actually converges to it from any starting state. For a large set of states we need a large number of simulations to get a reasonable estimate.

In [16]:
import random

def get_stationary_dist(mp: MP, n_iter: int, depth: int) -> Dict[Any, float]:
    # this function estimates a stationary distribution for a Markov Process through simulation
    stat_dist: Dict[Any, float] = {}
    # random.choice needs a sequence, so convert the set of states to a list once
    states = list(mp.S)
    for i in range(n_iter):
        state = random.choice(states)
        for j in range(depth):
            state = get_random_state(mp.P[state])
        # record the state the chain ends in after `depth` steps
        stat_dist[state] = stat_dist.get(state, 0.0) + 1 / n_iter
    return stat_dist


def get_random_state(d: Dict[Any, float]) -> Any:
    # helper function to return a random state based on a mapping from a state to a float
    p = np.random.rand()
    agg_prob = 0.0
    sp = None
    for sp, prob in d.items():
        agg_prob += prob
        if p <= agg_prob:
            return sp
    # only reached if the probabilities sum to less than 1 (e.g. through rounding);
    # fall back to the last state, or None for an empty distribution
    return sp
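
The same simulation idea can be sketched in vectorized form when the states are indexed `0..n-1` and the transitions are stored as a matrix, which makes larger values of `n_iter` cheap. This is an illustration under those assumptions and not a replacement for the dictionary-based code. The function name `simulate_end_states`, the fixed seed, and the two-state chain are hypothetical.

```python
import numpy as np


def simulate_end_states(P: np.ndarray, n_iter: int, depth: int, seed: int = 0) -> np.ndarray:
    """Run n_iter independent chains for depth steps; return the fraction ending in each state."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cum = np.cumsum(P, axis=1)             # row-wise cumulative transition probabilities
    states = rng.integers(n, size=n_iter)  # uniform random starting states
    for _ in range(depth):
        u = rng.random(n_iter)
        # inverse-CDF sampling: count how many cumulative thresholds u has passed
        nxt = (u[:, None] >= cum[states]).sum(axis=1)
        states = np.minimum(nxt, n - 1)    # guard against rounding in the last threshold
    return np.bincount(states, minlength=n) / n_iter


P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
est = simulate_end_states(P, n_iter=20_000, depth=50)
```

For this chain the exact stationary distribution is $(5/6, 1/6)$, so the estimate should land close to those values, with a sampling error that shrinks roughly like $1/\sqrt{n\_iter}$.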

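Below I've used the eigenvector approach to find the stationary distribution: it is the left eigenvector of the transition matrix with eigenvalue 1, normalized so that its entries sum to 1.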

In [17]:
from scipy.linalg import eig

def get_stat_dist_eig(mp: MP) -> Dict[Any, float]:
    # fix an ordering of the states so that matrix indices map back to states
    states = list(mp.S)
    # create a matrix to store the transition probabilities
    mat = np.zeros((len(states), len(states)))
    for i, s in enumerate(states):
        for j, sp in enumerate(states):
            if sp in mp.P[s]:
                # if we do not have a mapping from s to sp to a probability the transition has zero probability
                mat[i, j] = mp.P[s][sp]
    v = get_valid_eig(mat)
    # map each state to its entry in the normalized eigenvector
    return {s: float(np.real(p)) for s, p in zip(states, v)}


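def get_valid_eig(mat: np.ndarray) -> np.ndarray:
    # helper function to return a normalized valid eigenvector (i.e. no negative numbers)
    # eigenvectors of the transpose are the left eigenvectors of mat
    eig_vals, eig_vecs = eig(mat.T)
    # we use the eigenvector associated with an eigenvalue = 1
    for i, val in enumerate(eig_vals):
        if np.abs(val - 1.) < 10 ** (-5):
            # eig returns complex arrays; this eigenvector is real up to rounding
            v = np.real(eig_vecs[:, i])
            # dividing by the sum fixes the sign and makes the entries sum to 1
            return v / v.sum()
    raise ValueError("transition matrix has no eigenvalue equal to 1")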
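
A quick sanity check for the eigenvector approach is a two-state chain, where the stationary distribution has a closed form: with $a = \mathbb P[0 \to 1]$ and $b = \mathbb P[1 \to 0]$, it is $\pi = (b, a)/(a+b)$. The sketch below compares that against a left eigenvector computed with `np.linalg.eig`. The values of `a` and `b` are arbitrary, and the variable names are illustrative.

```python
import numpy as np

a, b = 0.1, 0.5  # P(0 -> 1) and P(1 -> 0); illustrative values
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Closed-form stationary distribution of a two-state chain.
pi_exact = np.array([b, a]) / (a + b)

# Left eigenvector of P for the eigenvalue closest to 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi_eig = v / v.sum()
```

Both vectors should satisfy $\pi \mathcal P = \pi$ and agree with each other up to floating-point error.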