# Write-up and code for Feb 6

### To Do
- ~~Write code for the interface for tabular RL algorithms. The core of this interface should be a mapping from a (state, action) pair to a sample of the (next state, reward) pair. It is important that this interface does not expose the state-transition probability model or the reward model.~~
- ~~Implement any tabular Monte-Carlo algorithm for Value Function prediction~~
- ~~Implement the tabular 1-step TD algorithm for Value Function prediction~~
- ~~Test the above implementations of the Monte-Carlo and TD Value Function prediction algorithms against the DP Policy Evaluation algorithm on an example MDP~~
- Prove that a fixed learning rate (step size alpha) for MC is equivalent to an exponentially decaying average of episode returns

## Interface for tabular RL algorithms

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
from modules.MDP import MDP, Policy, V, Q
from modules.state_action_vars import S, A
import numpy as np
import random

In [3]:
from typing import Tuple

def RL_interface(mdp: MDP, s: S, a: A) -> Tuple[S, float]:
    # sample one transition: takes a state 's' and an action 'a', and returns a new state 'sp' and an observed reward 'r'
    # callers only see samples, never mdp.P or mdp.R directly
    sp = sp_sampler(mdp, s, a)
    reward = mdp.R[s][a]
    # the reward is stored either as a scalar R(s, a) or as a mapping R(s, a, s')
    # np.isscalar also accepts ints and NumPy floats, which 'type(...) == float' rejects
    if np.isscalar(reward):
        r = reward
    else:
        r = reward[sp]

    return sp, r


def sp_sampler(mdp: MDP, s: S, a: A) -> S:
    # sample a next state sp from the transition distribution P(. | s, a) by walking the cumulative distribution
    p_cum = 0
    prob = np.random.rand()
    for sp, p in mdp.P[s][a].items():
        p_cum += p
        if prob <= p_cum:
            return sp

    # floating-point rounding can leave p_cum slightly below 1; fall back to the last state
    return sp
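
The cumulative-probability loop in `sp_sampler` can also be written with the standard library's `random.choices`. The following is a minimal sketch, assuming the distribution for one (state, action) pair is a plain dict from next state to probability. The names `sample_from` and `next_state_probs` are made up for illustration and are not part of `modules.MDP`.

```python
import random
from collections import Counter
from typing import Dict, Hashable


def sample_from(dist: Dict[Hashable, float], rng: random.Random) -> Hashable:
    # Draw one key of 'dist' with probability proportional to its value.
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Hypothetical transition distribution for a single (state, action) pair.
next_state_probs = {"left": 0.2, "stay": 0.5, "right": 0.3}

rng = random.Random(0)  # seeded so that repeated runs draw the same samples
num_samples = 10000
counts = Counter(sample_from(next_state_probs, rng) for _ in range(num_samples))

# Empirical frequencies should be close to the specified probabilities.
freqs = {sp: counts[sp] / num_samples for sp in next_state_probs}
```

Comparing `freqs` with `next_state_probs` is a quick sanity check that a sampler draws from the intended distribution without the caller ever reading the model itself.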

## Monte-Carlo Algorithm for Value Function Prediction

In [15]:
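def first_visit_mc(policy: Policy, mdp: MDP, num_epi: int, num_steps: int) -> V:
    # follows the first-visit MC prediction algorithm outlined in Sutton and Barto's RL book
    gamma = mdp.gamma
    v = {s: 0.0 for s in mdp.States}
    returns = {s: [] for s in mdp.States}

    for i in range(num_epi):
        # generate an episode of num_steps transitions
        s_list, a_list, r_list = generate_path(policy, mdp, num_steps)
        # initialize the episode return
        G = 0
        # walk backwards over t = T-1, ..., 0 (r_list[j] is the reward that follows s_list[j]);
        # the range must stop at -1 so that the first step of the episode is included
        for j in range(num_steps - 1, -1, -1):
            G = gamma*G + r_list[j]
            # record G only at the first visit to s_list[j] in this episode
            if s_list[j] not in s_list[:j]:
                returns[s_list[j]].append(G)

    for s in mdp.States:
        # states that were never visited keep their initial value of 0 instead of becoming nan
        if returns[s]:
            v[s] = np.mean(returns[s])

    return v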


def generate_path(policy: Policy, mdp: MDP, num_steps: int) -> Tuple[list, list, list]:
    # generate a sample path that follows the provided policy
    # the function returns: S_0, A_0, R_1, ... , S_(T-1), A_(T-1), R_T
    s_list = []
    a_list = []
    r_list = []
    
    # uniformly random start state; random.choice on a list also works when States is a set
    s0 = random.choice(list(mdp.States))
    s_list.append(s0)
    a0 = action_sampler(policy, s0)
    a_list.append(a0)
    for i in range(num_steps - 1):
        sp, r = RL_interface(mdp, s_list[-1], a_list[-1])
        a = action_sampler(policy, sp)
        s_list.append(sp)
        a_list.append(a)
        r_list.append(r)
        
    # sample the last reward
    _, r = RL_interface(mdp, s_list[-1], a_list[-1])
    r_list.append(r)
    
    return s_list, a_list, r_list


def action_sampler(policy: Policy, s: S) -> A:
    # function that takes in a policy and a state and samples an action according to this policy 
    p_cum = 0
    prob = np.random.rand()
    for a in policy[s].keys():
        p_cum += policy[s][a]
        if prob <= p_cum:
            return a
        
    return a
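
`first_visit_mc` stores every return and averages at the end. The same sample mean can be maintained incrementally with a step size of 1/n, which is the form that the fixed-step-size variant modifies. The following is a small sketch with a hypothetical helper name (`incremental_mean`) and made-up return values, not code from the modules used above.

```python
from typing import Iterable


def incremental_mean(returns: Iterable[float]) -> float:
    # Running average: after the n-th return G, move v toward G by (G - v) / n.
    v = 0.0
    for n, G in enumerate(returns, start=1):
        v += (G - v) / n
    return v


# Made-up returns for one state, used only to compare against the batch mean.
sample_returns = [1.0, 2.0, 3.0, 4.0]
v_incremental = incremental_mean(sample_returns)
v_batch = sum(sample_returns) / len(sample_returns)
```

Both quantities are equal up to floating-point error, so the list of returns does not have to be kept in memory.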

## 1-step TD Algorithm for Value Function Prediction

In [71]:
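def TD_0(policy: Policy, mdp: MDP, alpha: float, fixed_alpha: bool, num_epi: int, num_steps: int) -> V:
    # follows the tabular TD(0) prediction algorithm outlined in Sutton and Barto's RL book
    # if fixed_alpha is False, the 'alpha' argument is ignored and episode i uses alpha = 1/(i + 1)
    v = {}

    for s in mdp.States:
        v[s] = 0
        
    if not fixed_alpha:
        alpha = 1.0
    
    for i in range(num_epi):
        # start each episode from a uniformly random state
        s = random.choice(list(mdp.States))
        
        for j in range(num_steps):
            a = action_sampler(policy, s)
            sp, r = RL_interface(mdp, s, a)
            # move v[s] toward the bootstrapped target r + gamma*v[sp]
            v[s] += alpha*(r + mdp.gamma*v[sp] - v[s])
            s = sp
        
        if not fixed_alpha:
            # step size for the next episode
            alpha = 1.0/(i + 2)
        
    return v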
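
To see the TD(0) update in isolation, here is a minimal self-contained sketch on a hypothetical two-state chain. The function `step` stands in for `RL_interface` under a fixed policy, and the chain, its rewards, and the discount factor are assumptions chosen so that the true values can be solved by hand: V(A) = 1 + 0.5 V(B) and V(B) = 0.5 V(A) give V(A) = 4/3 and V(B) = 2/3.

```python
from typing import Dict, Tuple

GAMMA = 0.5  # discount factor assumed for this toy chain


def step(s: str) -> Tuple[str, float]:
    # Hypothetical deterministic environment: A -> B with reward 1, B -> A with reward 0.
    return ("B", 1.0) if s == "A" else ("A", 0.0)


def td0_chain(alpha: float, num_steps: int) -> Dict[str, float]:
    # One long trajectory with a fixed step size, using the same update as TD_0.
    v = {"A": 0.0, "B": 0.0}
    s = "A"
    for _ in range(num_steps):
        sp, r = step(s)
        v[s] += alpha * (r + GAMMA * v[sp] - v[s])
        s = sp
    return v


v_chain = td0_chain(alpha=0.1, num_steps=2000)
v_true = {"A": 4.0 / 3.0, "B": 2.0 / 3.0}
```

Because this chain is deterministic, a fixed step size converges to the exact values. With stochastic transitions a fixed step size keeps fluctuating around the true values, which is one reason `TD_0` also offers a decaying step size.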

## Example for Comparing MC, TD(0) and DP Algorithms
We will reuse the example from Assignment 4, which is a simple gridworld problem.

In [6]:
from modules.gridworld import gridworld
gw = gridworld(0.9)

Create a deterministic policy that always moves up (action `2`):

In [7]:
policy = {}
for s in gw.States:
    policy[s] = {}
    for a in gw.Actions:
        if a == 2:
            policy[s][a] = 1.0
        else:
            policy[s][a] = 0.

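Find the optimal policy with policy iteration, starting from the always-up policy. The result overwrites `policy`, so every evaluation below is for the optimal policy: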

In [66]:
from modules.DP import policy_eval, policy_iter
policy = policy_iter(gw, policy, 100)

First, evaluate the value function using policy evaluation:

In [67]:
vf_pe = policy_eval(gw, policy, 1000)
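
The implementation of `policy_eval` lives in `modules.DP` and is not shown here. For reference, the following is a minimal sketch of iterative policy evaluation, assuming the model is stored as nested dicts `P[s][a][sp]` and `R[s][a]` and the policy as `policy[s][a]`. The function name `policy_eval_sketch` and the two-state toy model are hypothetical and may differ from the actual module.

```python
from typing import Dict, Hashable

Dist = Dict[Hashable, float]


def policy_eval_sketch(P: Dict[Hashable, Dict[Hashable, Dist]],
                       R: Dict[Hashable, Dict[Hashable, float]],
                       policy: Dict[Hashable, Dist],
                       gamma: float,
                       num_iter: int) -> Dict[Hashable, float]:
    # Repeatedly apply the Bellman expectation backup to every state.
    v = {s: 0.0 for s in P}
    for _ in range(num_iter):
        v_new = {}
        for s in P:
            total = 0.0
            for a, pi_sa in policy[s].items():
                expected_next = sum(p * v[sp] for sp, p in P[s][a].items())
                total += pi_sa * (R[s][a] + gamma * expected_next)
            v_new[s] = total
        v = v_new
    return v


# Hypothetical two-state model with a single action 'go'.
toy_P = {"A": {"go": {"B": 1.0}}, "B": {"go": {"A": 1.0}}}
toy_R = {"A": {"go": 1.0}, "B": {"go": 0.0}}
toy_policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}

toy_v = policy_eval_sketch(toy_P, toy_R, toy_policy, gamma=0.5, num_iter=100)
```

Unlike the MC and TD(0) code above, this routine reads the transition and reward models directly, which is why it serves as the reference solution in the comparison.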

Now evaluate the same policy using first-visit Monte-Carlo, with 10000 episodes of 100 steps each:

In [68]:
vf_mc = first_visit_mc(policy, gw, 10000, 100)

Finally, predict the value function using one-step temporal difference, TD(0), with a fixed step size of 0.01:

In [72]:
vf_td = TD_0(policy, gw, 1/100, True, 10000, 100)

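Compare the three methods. Note that the output below was printed by cell 70, which ran before `TD_0` was last defined (cell 71) and rerun (cell 72), so the TD(0) column may not reflect the current implementation: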

In [70]:
for key in sorted(vf_td.keys()):
    print(key, 'Policy Evaluation: {:0.3f} \tMonte-Carlo: {:0.3f} \tTD(0): {:0.3f}'.format(vf_pe[key], vf_mc[key], vf_td[key]))

(0, 0) Policy Evaluation: 30.000 	Monte-Carlo: 29.999 	TD(0): 30.000
(0, 1) Policy Evaluation: 24.916 	Monte-Carlo: 24.953 	TD(0): 19.973
(0, 2) Policy Evaluation: 20.730 	Monte-Carlo: 20.933 	TD(0): 10.652
(0, 3) Policy Evaluation: 18.202 	Monte-Carlo: 18.205 	TD(0): 3.336
(1, 0) Policy Evaluation: 24.940 	Monte-Carlo: 24.979 	TD(0): 18.687
(1, 1) Policy Evaluation: 21.197 	Monte-Carlo: 21.111 	TD(0): 6.456
(1, 2) Policy Evaluation: 16.993 	Monte-Carlo: 17.158 	TD(0): 2.573
(1, 3) Policy Evaluation: 20.730 	Monte-Carlo: 21.129 	TD(0): 4.196
(2, 0) Policy Evaluation: 20.970 	Monte-Carlo: 21.517 	TD(0): 5.591
(2, 1) Policy Evaluation: 19.036 	Monte-Carlo: 19.045 	TD(0): 4.331
(2, 2) Policy Evaluation: 21.197 	Monte-Carlo: 21.226 	TD(0): 13.171
(2, 3) Policy Evaluation: 24.916 	Monte-Carlo: 25.040 	TD(0): 16.626
(3, 0) Policy Evaluation: 18.412 	Monte-Carlo: 18.406 	TD(0): 2.928
(3, 1) Policy Evaluation: 20.970 	Monte-Carlo: 20.934 	TD(0): 11.443
(3, 2) Policy Evaluation: 24.940 	Monte-C
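
A table like the one above can be condensed into one error number per method. The following is a minimal sketch, assuming each value function is a dict keyed by state. The helper name `max_abs_error` and the toy values are made up for illustration. In this notebook it would be called as `max_abs_error(vf_pe, vf_mc)` and `max_abs_error(vf_pe, vf_td)`.

```python
from typing import Dict, Hashable


def max_abs_error(v_ref: Dict[Hashable, float], v_est: Dict[Hashable, float]) -> float:
    # Largest absolute deviation of the estimate from the reference over all states.
    return max(abs(v_ref[s] - v_est[s]) for s in v_ref)


# Made-up value functions, used only to exercise the helper.
v_ref_toy = {"s0": 1.0, "s1": 2.0}
v_est_toy = {"s0": 1.5, "s1": 1.75}
err = max_abs_error(v_ref_toy, v_est_toy)
```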

## Proof that a Fixed Learning Rate for MC is Equivalent to an Exponentially Decaying Average of Episode Returns
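
One way to carry out this proof is the standard unrolling argument, sketched here. The notation is an assumption introduced for this sketch: for a fixed state, $G_n$ is the $n$-th return observed for that state, $V_n$ is the estimate before $G_n$ is processed, $V_1$ is the initial estimate, and $\alpha \in (0, 1]$ is the fixed step size. The MC update with a fixed step size is

$$
V_{n+1} = V_n + \alpha \, (G_n - V_n) = (1-\alpha) \, V_n + \alpha \, G_n .
$$

Substituting the same rule for $V_n$, then $V_{n-1}$, and so on down to $V_1$ gives

$$
\begin{aligned}
V_{n+1} &= (1-\alpha) \, V_n + \alpha \, G_n \\
        &= (1-\alpha)^2 \, V_{n-1} + \alpha (1-\alpha) \, G_{n-1} + \alpha \, G_n \\
        &\;\;\vdots \\
        &= (1-\alpha)^n \, V_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} \, G_i .
\end{aligned}
$$

The last line can be confirmed by induction on $n$: it holds for $n = 1$, and applying the update once more multiplies every existing weight by $(1-\alpha)$ and adds the new term $\alpha \, G_{n+1}$. The weights sum to one, since

$$
(1-\alpha)^n + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i}
= (1-\alpha)^n + \alpha \cdot \frac{1 - (1-\alpha)^n}{\alpha}
= 1 .
$$

So $V_{n+1}$ is a weighted average of the initial estimate and the observed returns, in which the weight $\alpha (1-\alpha)^{n-i}$ on $G_i$ decays exponentially with the age $n - i$ of the return. For $\alpha < 1$ the most recent returns dominate, and the influence of $V_1$ vanishes as $n$ grows.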