## Monte Carlo Methods

The key idea of Monte Carlo methods is to replace the explicit transition structure used by other methods (such as dynamic programming) with an approximation built from average returns over a number of simulated episodes. Here we assume that the experience gathered by the agent is divided into episodes that eventually terminate. Monte Carlo learning is model-free because we don't need to know the rewards or transitions of the underlying MDP. In particular, the estimate for one state does not "build upon" the estimates for other states. This is useful if you only require the value at certain states. The main drawback is that it may need many episodes, and therefore a lot of time, even for small problems.


The goal is to learn $Q_\pi$ from episodes of experience under $\pi$:
$$S_1, A_1, R_2, S_2, A_2, R_3, \ldots, S_T \sim \pi$$
Recall that the return $G_t$ is
$$G_t = R_{t+1}+\gamma R_{t+2}+\ldots + \gamma^{T-t-1}R_T$$
and that the value function is the expected return
$$Q_\pi(s,a) = \mathbb E_\pi(G_t  \ | \ S_t = s, A_t = a)$$
For Monte Carlo simulation we replace the expectation above by the empirical mean of the returns observed after visiting $(s,a)$.
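As a minimal sketch of the quantity being averaged, the returns of one finished episode can be computed with a single backward pass over its rewards. The helper name `discounted_returns` and the zero-based convention that `rewards[t]` holds the reward received after step `t` are assumptions made for this illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for one finished episode.

    Hypothetical helper: rewards[t] is assumed to be the reward received
    after the action taken at step t (R_{t+1} in the text).
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Averaging these returns over many episodes, separately for each state-action pair, gives the Monte Carlo estimate of $Q_\pi$.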

To ensure that the sampled average returns converge to the value function, we require *exploring starts*:
	* Every episode starts from a chosen state-action pair.
	* Every state-action pair has positive probability of being selected at the start.
This guarantees that, in the limit of an infinite number of episodes, every pair is selected infinitely many times.
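One simple way to satisfy both conditions is to draw the starting pair uniformly at random. The sketch below assumes a finite problem with `n_states` states and `n_actions` actions and an environment that can be reset to an arbitrary state, which many simulators do not allow; the function name is hypothetical.

```python
import numpy as np


def sample_exploring_start(n_states, n_actions, rng):
    """Draw a start pair (s, a) uniformly, so every pair has positive probability."""
    state = int(rng.integers(n_states))    # uniform over 0 .. n_states - 1
    action = int(rng.integers(n_actions))  # uniform over 0 .. n_actions - 1
    return state, action


rng = np.random.default_rng(0)
starts = [sample_exploring_start(16, 4, rng) for _ in range(5)]
print(starts)
```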


It turns out that MC control with exploring starts has not been formally proven to converge to the optimal policy! However, for evaluating a fixed policy, convergence is guaranteed for **First-Visit Monte Carlo Policy Evaluation**. The algorithm goes as follows:


<div class="alert alert-block alert-info">
<h4>First-Visit Monte Carlo Policy Evaluation </h4>
<p>We want to evaluate a state-action pair $(s,a)$ under a fixed policy $\pi$: 
    <ul>
     <li> Increment a counter the first time that the pair $(s,a)$ is visited in an episode
        $$N(s,a) \leftarrow N(s,a)+1.$$ </li>
    <li> Increment total return $R(s,a)\leftarrow R(s,a)+G_t$    </li>
    <li> Let $Q(s,a) = R(s,a)/N(s,a)$ </li>
    <li> $Q(s,a) \rightarrow Q_\pi(s,a)$ as $N(s,a)\rightarrow +\infty$ </li>    
    <li> For control, improve the policy: $\pi \leftarrow \epsilon\text{-greedy}(Q)$ </li>
    </ul>
</div>    

A variation of this idea is the **Every-Visit Monte Carlo Policy Evaluation**:


<div class="alert alert-block alert-info">
<h4>Every-Visit Monte Carlo Policy Evaluation </h4>
<p>We want to evaluate a state-action pair $(s,a)$ under a fixed policy $\pi$: 
    <ul>
     <li> Increment a counter every time that the pair $(s,a)$ is visited in an episode
        $$N(s,a) \leftarrow N(s,a)+1.$$ </li>
    <li> Increment total return $R(s,a)\leftarrow R(s,a)+G_t$    </li>
    <li> Let $Q(s,a) = R(s,a)/N(s,a)$ </li>
    <li> $Q(s,a) \rightarrow Q_\pi(s,a)$ as $N(s,a)\rightarrow +\infty$ </li>    
    <li> For control, improve the policy: $\pi \leftarrow \epsilon\text{-greedy}(Q)$ </li>
    </ul>
</div>    

What's the difference between these two? Almost none in practice, although the theoretical analysis is different. Both methods are **on-policy**, meaning that the policy being evaluated is the same one that generates the episodes.
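To make the two variants concrete, here is a small sketch of tabular policy evaluation with a `first_visit` switch. It assumes each episode is a list of `(state, action, reward)` tuples; the function name `mc_evaluate` is made up for this example.

```python
from collections import defaultdict


def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Estimate Q(s, a) by averaging sampled returns (illustrative sketch)."""
    total_return = defaultdict(float)  # R(s, a) in the text
    visit_count = defaultdict(int)     # N(s, a) in the text
    for episode in episodes:
        # Return following each time step, computed backwards
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if first_visit and (s, a) in seen:
                continue  # first-visit: skip repeated occurrences
            seen.add((s, a))
            total_return[(s, a)] += returns[t]
            visit_count[(s, a)] += 1
    return {sa: total_return[sa] / visit_count[sa] for sa in total_return}


# The pair (0, 0) is visited twice, with returns 2 and 1 (gamma = 1)
episode = [(0, 0, 1.0), (0, 0, 1.0), (1, 0, 0.0)]
print(mc_evaluate([episode], first_visit=True)[(0, 0)])   # 2.0
print(mc_evaluate([episode], first_visit=False)[(0, 0)])  # 1.5
```

The two estimates differ only for pairs that occur more than once within the same episode.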


### GLIE Monte Carlo control

Before discussing algorithms for solving the control problem, let us make a remark:

Observe that the mean of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$\begin{aligned}
\mu_k & = & \frac{1}{k}\sum_{j=1}^k x_j \\
 & = & \frac{1}{k}\left( x_k + \sum_{j=1}^{k-1}x_j\right) \\
 & = & \frac{1}{k}\left( x_k + (k-1)\mu_{k-1}\right) \\
 & = & \mu_{k-1} +\frac{1}{k}(x_k-\mu_{k-1})
\end{aligned}$$
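The last line of the derivation translates directly into a running update. The following sketch (the function name is an assumption) computes the mean without storing the sequence:

```python
def incremental_mean(xs):
    """Compute the mean with mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
    return mu


print(incremental_mean([2.0, 4.0, 6.0]))  # 4.0
```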



This means we can do incremental Monte Carlo Updates for the $Q$ function:

- Update $Q(s,a)$ incrementally after each episode.
- For each state-action pair $(S_t, A_t)$ with return $G_t$
 $$\begin{aligned}
 N(S_t,A_t) &\leftarrow& N(S_t, A_t) +1\\
 Q(S_t,A_t) &\leftarrow& Q(S_t,A_t)+\frac{1}{N(S_t, A_t)}(G_t-Q(S_t,A_t))
 \end{aligned}
 $$
-  We can "forget the past" by computing an exponential moving average instead. We don't correct the value all the way to the mean; we only move it a fraction $\alpha \in (0,1]$ of the way toward the new return.
$$Q(S_t,A_t) \leftarrow Q(S_t,A_t)+\alpha(G_t-Q(S_t,A_t))$$
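A tiny sketch of the constant step-size update, with a hypothetical helper name and $\alpha = 0.5$ chosen only for illustration. Each update moves the estimate halfway toward the newly observed return, so older returns are weighted less and less:

```python
def constant_alpha_update(q, G, alpha):
    """Move the estimate q a fraction alpha of the way toward the return G."""
    return q + alpha * (G - q)


q = 0.0
history = []
for G in [1.0, 1.0, 1.0]:
    q = constant_alpha_update(q, G, alpha=0.5)
    history.append(q)
print(history)  # [0.5, 0.75, 0.875]
```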
 

Now we are ready to introduce the GLIE algorithm. GLIE stands for *Greedy in the Limit with Infinite Exploration* and the algorithm goes as follows:

- Sample an episode using policy $\pi$
- For each state $S_t$ and action $A_t$ in the episode, 

$$ \begin{aligned}
N(S_t,A_t) & \leftarrow & N(S_t, A_t) + 1 \\
Q(S_t, A_t) & \leftarrow & Q(S_t, A_t) + \frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))
\end{aligned}$$

- Improve the policy based on the new state-action value function, where $k$ is the number of episodes sampled so far:
	- $\epsilon \leftarrow 1/k$
	- $\pi \leftarrow \epsilon\text{-greedy}(Q)$
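The name GLIE can be read off the action probabilities of the $\epsilon$-greedy policy. In the sketch below (the helper `epsilon_greedy_probs` is an assumption, not part of any library), every action keeps probability at least $\epsilon/n$, which gives the exploration, while the schedule $\epsilon = 1/k$ pushes the greedy action's probability toward 1.

```python
import numpy as np


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy in a single state."""
    n_actions = len(q_row)
    # Every action receives the exploration mass epsilon / n_actions ...
    probs = np.full(n_actions, epsilon / n_actions)
    # ... and the greedy action additionally receives 1 - epsilon
    probs[int(np.argmax(q_row))] += 1.0 - epsilon
    return probs


q_row = np.array([0.0, 1.0, 0.0, 0.0])
for k in (1, 2, 10, 100):
    print(k, epsilon_greedy_probs(q_row, epsilon=1.0 / k))
```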



You can check some implementation examples here:
- https://gym.openai.com/evaluations/eval_TtcFIoaZQu6fGDIQICFKtw

- https://gist.github.com/jpmaldonado/fbc572b3bb517ac0848687b6e987f9a0


GLIE is an on-policy algorithm.

## Off-policy MC control

The goal of off-policy methods is to learn about a policy **different** from the one we are using to generate the episodes. Off-policy methods consider both a 
	- **target policy $t$**, the policy we want to learn about.
	- **behavior policy $b$**, the policy that generates the episodes.

Off-policy methods can be thought of as learning by example, whereas on-policy methods are about learning by doing.
We need to guarantee **coverage**, that is, $t(a|s) > 0$ implies $b(a|s) > 0$. Given coverage, we can use importance sampling, which estimates expected values under one distribution from samples of another.
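For a single episode, the importance-sampling correction is the product of the per-step probability ratios $t(A_k|S_k)/b(A_k|S_k)$. The sketch below assumes we have already recorded, for each step of an episode generated by $b$, the probability that each policy assigns to the action actually taken; the function name is hypothetical.

```python
import numpy as np


def importance_ratio(target_probs, behavior_probs):
    """Product of t(a|s) / b(a|s) over the steps of one episode."""
    target_probs = np.asarray(target_probs, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    # Actions sampled from b must have b(a|s) > 0, otherwise the data is inconsistent
    if np.any(behavior_probs <= 0):
        raise ValueError("behavior policy assigned zero probability to a sampled action")
    return float(np.prod(target_probs / behavior_probs))


# Two-step episode: each taken action is twice as likely under t as under b
print(importance_ratio([0.5, 1.0], [0.25, 0.5]))  # 4.0
```

A ratio of 0 means that the target policy would never have produced the episode, so its return carries no weight.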


### Weighted importance sampling
![Off-policy MC control, from David Silver's slides.](images/mcwis.png)
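As a rough sketch of the estimator itself (not a transcription of the algorithm in the figure), the code below compares ordinary and weighted importance sampling for one state-action pair. It assumes we already have the return and the importance ratio of each episode; both function names are made up for this example.

```python
import numpy as np


def ordinary_is(returns, ratios):
    """Ordinary importance sampling: average of ratio-scaled returns."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    return float(np.sum(ratios * returns) / len(returns))


def weighted_is(returns, ratios):
    """Weighted importance sampling: normalize by the sum of the ratios."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    total = np.sum(ratios)
    # With zero total weight there is no information; return 0 by convention
    return float(np.sum(ratios * returns) / total) if total > 0 else 0.0


returns = [1.0, 3.0]
ratios = [4.0, 0.0]
print(ordinary_is(returns, ratios))  # 2.0
print(weighted_is(returns, ratios))  # 1.0
```

The weighted estimate is a weighted average of observed returns, so it always stays within their range, whereas the ordinary estimate can fall outside it when the ratios are large.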

In [27]:
## The code below shows how to implement GLIE.
import numpy as np
import gym

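def epsilon_greedy_policy(Q, epsilon, actions):
    """Return an epsilon-greedy policy for the table Q.

    Q is a numpy array of shape (n_states, n_actions), epsilon is the
    exploration probability (values above 1 behave like 1) and actions
    is the list of available actions."""
    
    def policy_fn(state):
        # Exploit with probability 1 - epsilon, explore otherwise
        if np.random.rand()>epsilon:
            action = np.argmax(Q[state,:])
        else:
            action = np.random.choice(actions)
        return action
    return policy_fn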




In [28]:
env = gym.make("FrozenLake-v0")

Q = np.zeros([env.observation_space.n, env.action_space.n])
R = np.zeros([env.observation_space.n, env.action_space.n])
N = np.zeros([env.observation_space.n, env.action_space.n])
actions = range(env.action_space.n)
gamma = 1


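def run_episode(env, policy): 
    """Run one episode and return a list of (state, action, reward) tuples.

    Uses the classic gym API: reset() returns the state and step()
    returns (state, reward, done, info)."""
    done = False
    state = env.reset()
    episode = []
    while not done:
        action = policy(state)
        new_state, reward, done, _ = env.step(action)
        episode.append((state,action,reward))
        state = new_state    
    return episode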

In [32]:
from tqdm import tqdm
score = 0
n_iter = 50000
for j in tqdm(range(n_iter)):
    policy = epsilon_greedy_policy(Q,epsilon=1000/(j+1), actions = actions )
    episode = run_episode(env,policy)
    ep_reward = sum(x[2]*(gamma**i) for i, x in enumerate(episode))
    score += ep_reward # counter for the 100 episode reward
    
    sa_in_episode = set([(x[0],x[1]) for x in episode])
    
    # Find first visit of each s,a in the episode
    for s,a in sa_in_episode:
        first_visit = next(i for i,x in enumerate(episode) if 
                           x[0]==s and x[1]==a)
        
        G = sum(x[2]*(gamma**i) for i, x in enumerate(episode[first_visit:]))
        R[s,a] += G
        N[s,a] += 1
        Q[s,a] += 1/N[s,a]*(G-Q[s,a])

    
    
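    # Report the average reward over the last 100 episodes
    if (j+1)%10000 == 0:
        print("Score: ", score/100)
    
    # Reset the counter at the end of every 100-episode window
    if (j+1)%100 == 0:
        score = 0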
    
env.close()

Score:  0.41

Score:  0.49

Score:  0.56

Score:  0.61

Score:  0.65
