# Assignment 8

## Model-free (RL) Prediction With Monte Carlo and Temporal Difference

### 9.1 Monte Carlo

- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
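The return that MC averages can be computed with one backward pass over the rewards of a complete episode. The following is a minimal sketch, assuming the usual discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$; the function name `discounted_returns` and the parameter `gamma` are illustrative and do not appear in the assignment code.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, G_1, ...] with G_t = rewards[t] + gamma * G_{t+1}.

    Assumption: rewards[t] is the reward received after step t, and the
    episode ends after the last entry (so the return beyond it is 0).
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Because the pass needs the final reward before it can start, this also illustrates why MC requires episodes that terminate.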

### 9.1.1 First Visit Monte Carlo Policy Evaluation

The $\textbf{first}$ time-step $t$ that state $s$ is visited in an episode,
- Increment counter $N(s)\leftarrow N(s) + 1$
- Increment total return $S(s)\leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s)$
- By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$

### 9.1.2 Every Visit Monte Carlo Policy Evaluation

$\textbf{Every}$ time-step $t$ that state $s$ is visited in an episode,
- Increment counter $N(s)\leftarrow N(s) + 1$
- Increment total return $S(s)\leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s)/N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
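In both variants the mean can be maintained without storing $S(s)$. Writing $V_N(s)$ for the estimate after the $N$-th counted visit and $G^{(N)}$ for the return observed on that visit (notation introduced here for illustration), the update is:

$$
V_N(s) = \frac{S_N(s)}{N} = \frac{(N-1)\,V_{N-1}(s) + G^{(N)}}{N} = V_{N-1}(s) + \frac{1}{N}\left(G^{(N)} - V_{N-1}(s)\right)
$$

The middle form is the one used in the `MC` implementation in section 9.3.2, `V[s] = (V[s] * (c - 1) + G) / c`.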

### 9.2 Temporal Difference

- TD methods learn directly from episodes of experience
- TD is model-free:  no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
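For reference, the standard one-step TD update (often written TD(0)) can be stated as follows, assuming a step size $\alpha$ and a discount factor $\gamma$, neither of which is named in the bullets above:

$$
V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)
$$

Here $R_{t+1} + \gamma V(S_{t+1})$ is the TD target and the bracketed difference is the TD error. The target contains the current estimate $V(S_{t+1})$, which is the sense in which TD updates a guess towards a guess, and it is available after a single step, which is why incomplete episodes suffice.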

### 9.3 Applications 

### 9.3.1 Write code for the interface for tabular RL algorithms. The core of this interface should be a mapping from a (state, action) pair to a sampling of the (next state, reward) pair. It is important that this interface doesn't present the state-transition probability model or the reward model.

In [1]:
import numpy as np
import random

In [2]:
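def simulator(state, action):
    # Sample the next state uniformly from [0, 20); callers only ever see the sample.
    next_state = np.random.uniform(0, 20, size=None)
    # The second return value is the reward, computed here as state + action.
    reward = state + action
    return next_state, reward

def get_mdp_rep_for_rl_tabular(data):
    # data is a list of [state, action] pairs; each is mapped to a sampled
    # [next_state, reward] pair without exposing transition or reward models.
    output = []
    for state, action in data:
        next_state, reward = simulator(state, action)
        output.append([next_state, reward])
    return output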

In [3]:
data=[[0,5],[2,3],[12,2],[6,0],[4,1]]
get_mdp_rep_for_rl_tabular(data)

[[3.822124337799695, 5],
 [14.940305864168437, 5],
 [3.6953362781060184, 14],
 [7.149493978261985, 6],
 [4.543915349097594, 5]]
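The same idea can be packaged so that a tabular model exists internally but is reachable only through sampling. The following is a sketch under stated assumptions, not the interface the assignment prescribes: the class name `SampledMDP`, the method `step`, and the layout of `transitions` are all hypothetical, and the leading underscore hides the model by convention only.

```python
import random


class SampledMDP:
    """Hypothetical wrapper exposing only step(state, action) -> (next_state, reward)."""

    def __init__(self, transitions, seed=None):
        # transitions[(state, action)] is a list of (probability, next_state, reward).
        # It is stored under a private name so algorithms do not read it directly.
        self._transitions = transitions
        self._rng = random.Random(seed)

    def step(self, state, action):
        outcomes = self._transitions[(state, action)]
        weights = [p for p, _, _ in outcomes]
        # Draw one outcome according to its probability.
        _, next_state, reward = self._rng.choices(outcomes, weights=weights, k=1)[0]
        return next_state, reward


# Toy model, invented purely for illustration.
toy = {
    (0, "go"): [(0.7, 1, 1.0), (0.3, 0, 0.0)],
    (1, "go"): [(1.0, 0, 2.0)],
}
env = SampledMDP(toy, seed=0)
next_state, reward = env.step(0, "go")
```

An RL algorithm written against `step` alone works unchanged whether the samples come from a table like this, a simulator, or logged data.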

### 9.3.2 Implement any tabular Monte-Carlo algorithm for Value Function prediction

In [11]:
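def MC(sample_size, mc_path, states, visit="first"):
    """Tabular MC prediction. Each episode is a list of [state, return G_t] pairs."""
    V, count, _ = build_dict_MC(sample_size, states)
    for episode in mc_path:
        # Reset the per-episode "already visited" flags.
        _, _, first_visit = build_dict_MC(sample_size, states)
        for s, G in episode:
            if visit == "first":
                # Only the first occurrence of s in this episode is counted.
                if first_visit[s] == 0:
                    count[s] += 1
                    c = count[s]
                    # Incremental mean, equivalent to S(s) / N(s).
                    V[s] = (V[s] * (c - 1) + G) / c
                    first_visit[s] = 1
            elif visit == "every":
                # Every occurrence of s is counted.
                count[s] += 1
                c = count[s]
                V[s] = (V[s] * (c - 1) + G) / c
    return V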

In [12]:
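def build_dict_MC(size, states):
    """Return zero-initialised value, visit-count and first-visit-flag dictionaries."""
    value_dict = {}
    count_dict = {}
    first_visit_dict = {}
    # Values and counts are keyed by the integers 0 .. size-1.
    for i in range(size):
        value_dict[i] = 0
        count_dict[i] = 0
    # First-visit flags are keyed by the given state labels.
    for s in states:
        first_visit_dict[s] = 0
    return value_dict, count_dict, first_visit_dict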

In [13]:
mc_path=[
    [[0,5],[1,3],[3,2],[0,2],[5,2]],
    [[0,3],[2,3],[4,3]],
    [[1,6],[2,4],[4,3],[1,2],[5,2]],
    [[1,6],[3,4],[4,3],[3,2],[0,2]],
    [[1,6],[4,4],[4,3],[2,2],[1,2]],
]

states = [0, 1, 2, 3, 4, 5]

In [14]:
MC(6, mc_path, states, "first")

{0: 3.3333333333333335, 1: 5.25, 2: 3.0, 3: 3.0, 4: 3.25, 5: 2.0}

In [15]:
MC(6, mc_path, states, "every")

{0: 3.0, 1: 4.166666666666667, 2: 3.0, 3: 2.6666666666666665, 4: 3.2, 5: 2.0}

### 9.3.3 Implement tabular 1-step TD algorithm for Value Function prediction

In [22]:
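def TD(sample_size, TD_path, gamma, alpha=0):
    """Tabular 1-step TD prediction over episodes of [state, reward] pairs."""
    V, count = build_dict_TD(sample_size)

    for episode in TD_path:
        for j, i in enumerate(episode):
            s = i[0]
            r = i[1]
            if j < len(episode) - 1:
                next_s = episode[j + 1][0]
            else:
                # Last step: no successor in the data, so bootstrap from s itself.
                next_s = i[0]
            count[s] += 1
            # alpha == 0 means "not set". It is assigned once, on the first update
            # (count is 1, so alpha becomes 1.0), and reused for all later updates.
            if alpha == 0:
                alpha = 1 / count[s]
            V[s] = V[s] + alpha * (r + gamma * V[next_s] - V[s])
    return V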

In [23]:
def build_dict_TD(size):
    value_dict={}
    count_dict={}
    for i in range(size):
        value_dict[i]=0
        count_dict[i]=0
    return value_dict, count_dict

In [24]:
# TD_path holds the episode data. Each episode is a list of [state, reward] pairs.
TD_path=[
    [[0,5],[1,3],[3,2],[0,2],[5,2]],
    [[0,3],[2,3],[4,3]],
    [[1,6],[2,4],[4,3],[1,2],[5,2]],
    [[1,6],[3,4],[4,3],[3,2],[0,2]],
    [[1,6],[4,4],[4,3],[2,2],[1,2]],
    [[1,6],[3,4],[4,3],[3,2],[0,2]],
]

In [25]:
TD(6,TD_path,1)

{0: 7.0, 1: 11.0, 2: 27.0, 3: 7.0, 4: 17.0, 5: 4.0}

In [26]:
TD(6,TD_path,1,0.8)

{0: 6.08,
 1: 13.432576000000001,
 2: 16.391424,
 3: 7.6874752,
 4: 14.4373248,
 5: 3.2}
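As written, `TD` assigns `alpha = 1/count[s]` only while `alpha == 0`, so a default run keeps the first value (1) for every later update, and the last step of each episode bootstraps from its own state. A possible variant is sketched below, assuming two things the assignment does not state: that a per-state step size $1/N(s)$ is intended when no `alpha` is given, and that each episode terminates after its last pair, so the value beyond it is 0. The name `td0_prediction` and the `alpha=None` convention are illustrative. Its results differ from the outputs recorded above.

```python
def td0_prediction(episodes, num_states, gamma=1.0, alpha=None):
    """Tabular TD(0) over episodes of [state, reward] pairs.

    alpha=None uses a per-state step size 1/N(s); otherwise alpha is constant.
    Assumption: the state after the last pair is terminal, with value 0.
    """
    V = {s: 0.0 for s in range(num_states)}
    count = {s: 0 for s in range(num_states)}
    for episode in episodes:
        for t, (s, r) in enumerate(episode):
            if t < len(episode) - 1:
                v_next = V[episode[t + 1][0]]
            else:
                v_next = 0.0  # terminal: nothing left to bootstrap from
            count[s] += 1
            # Recompute the step size on every update instead of caching it.
            step = 1.0 / count[s] if alpha is None else alpha
            V[s] += step * (r + gamma * v_next - V[s])
    return V
```

It accepts the same episode format, for example `td0_prediction(TD_path, 6, gamma=1)`.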