# Value-Based Methods

#### Your task:
- Implement and compare the following algorithms.

You can use some of the environments in [https://github.com/openai/gym#environments](https://github.com/openai/gym#environments). 

Some of the easy environments:

* `FrozenLake`
* `CartPole`
* `MountainCar`

Note that `CartPole` and `MountainCar` have continuous state spaces, so you need to discretize them (for example, by binning each state variable) before applying a tabular method.
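
One common way to discretize is to place each state variable on a uniform grid of bins. The sketch below assumes a 2-dimensional state with hand-picked bounds; the helper names `make_bins` and `discretize`, the bounds, and the bin count are all illustrative choices, not part of gym. For a real environment, check `env.observation_space.low` and `env.observation_space.high`, and choose the bounds by hand where they are unbounded.

```python
import numpy as np

def make_bins(low, high, n_bins):
    """Return, per state dimension, the interior edges of n_bins uniform bins."""
    return [np.linspace(l, h, n_bins + 1)[1:-1] for l, h in zip(low, high)]

def discretize(observation, bins):
    """Map a continuous observation to a tuple of integer bin indices."""
    return tuple(int(np.digitize(x, b)) for x, b in zip(observation, bins))

# Assumed bounds for a (position, velocity) state; these are illustrative.
low, high = [-1.2, -0.07], [0.6, 0.07]
bins = make_bins(low, high, n_bins=10)

state = discretize([-0.5, 0.01], bins)      # a tuple of two bin indices
clipped = discretize([-5.0, 5.0], bins)     # out-of-range values fall in the edge bins
```

With this encoding, a tabular `Q` can be stored as an array of shape `(10, 10, n_actions)` and indexed with `Q[state]` to get the action values of a discretized state.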

<div class="alert alert-block alert-info">
<h4> GLIE Monte Carlo Control </h4>
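<p>Sample the $k$-th episode using the current $\epsilon$-greedy policy $\pi$. Then, for each pair $(s,a)$ visited at time $t$ in the episode:</p>
    <p><b> (Policy Evaluation) </b></p>
    <ul>
      <li> Increment the visit counter of the pair $(s,a)$:
        $$N(s,a) \leftarrow N(s,a)+1.$$ </li>
      <li> Add the return $G_t$ that followed the visit to the total return: $R(s,a)\leftarrow R(s,a)+G_t$. </li>
      <li> Estimate the action value by the average return: $Q(s,a) \leftarrow R(s,a)/N(s,a)$. </li>
      <li> For a fixed policy $\pi$, $Q(s,a) \rightarrow q_\pi(s,a)$ as $N(s,a)\rightarrow +\infty$. </li>
    </ul>
    <p><b> (Reducing the exploration rate)</b> </p>
    <ul>
      <li> $\epsilon \leftarrow 1/k$ </li>
      <li> Act $\epsilon$-greedily with respect to the updated $Q$ in the next episode. </li>
    </ul>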
</div>  
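
The evaluation step above can be sketched as follows. This is a minimal every-visit version, assuming the episode has already been collected as a list of `(state, action, reward)` tuples; the function name `glie_mc_update`, the table sizes, and the sample episode are illustrative assumptions.

```python
import numpy as np

def glie_mc_update(Q, N, R, episode, gamma=1.0):
    """Every-visit Monte Carlo evaluation for one finished episode.

    episode is a list of (state, action, reward) tuples in time order.
    N counts visits, R accumulates returns, and Q holds their ratio.
    All three arrays are updated in place.
    """
    G = 0.0
    # Walk backwards so the return G_t can be accumulated incrementally.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[state, action] += 1
        R[state, action] += G
        Q[state, action] = R[state, action] / N[state, action]
    return Q

n_states, n_actions = 3, 2          # hypothetical table sizes
Q = np.zeros((n_states, n_actions))
N = np.zeros((n_states, n_actions))
R = np.zeros((n_states, n_actions))

# A made-up episode: reward 1 only on the last step.
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]
glie_mc_update(Q, N, R, episode, gamma=0.9)
```

In a full training loop, this would be called once per episode, with the episode generated by an $\epsilon$-greedy policy using $\epsilon = 1/k$.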

<div class="alert alert-block alert-info">
<h4> Sarsa Algorithm </h4>
Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and set $Q(\text{terminal state},\cdot) = 0$.
    <ul>
        <li> Repeat for each episode:
        <ul>
            <li> Initialize the state $S$. </li>
            <li> Choose $A$ from $S$ using the policy derived from $Q$, for instance, $\epsilon$-greedy. </li>
            <li> Repeat for each step of the episode, until $S$ is terminal:
            <ul>
                <li> Take action $A$, observe $R$, $S'$. </li>
                <li> Choose action $A'$ from $S'$ using the policy derived from $Q$. </li>
                <li> $Q(S,A) \leftarrow Q(S,A)+\alpha \left [ R+\gamma\cdot Q(S',A')-Q(S,A) \right ] $</li>
                <li> $S \leftarrow S'$, $A \leftarrow A'$ </li>
            </ul>
            </li>
        </ul>
        </li>
    </ul>
</div>
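
The Sarsa update above can be written as a small function. This is a sketch under stated assumptions: `Q` is a NumPy array indexed by `[state, action]`, and the `done` flag (not part of the box above) is used to enforce $Q(\text{terminal state},\cdot) = 0$. The name `sarsa_update` and the numbers in the example are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """Apply one Sarsa update to Q in place and return the TD error."""
    # The value of a terminal state is 0, so drop the bootstrap term when done.
    target = r if done else r + gamma * Q[s_next, a_next]
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))
Q[1, 0] = 1.0
# Target is 0.5 + 0.9 * Q[1, 0]; Q[0, 1] moves halfway towards it.
sarsa_update(Q, s=0, a=1, r=0.5, s_next=1, a_next=0, done=False,
             alpha=0.5, gamma=0.9)
```

Because the target uses the action $A'$ that the behaviour policy actually selected, Sarsa is an on-policy method.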

<div class="alert alert-block alert-info">
<h4> Q-Learning Algorithm </h4>
Initialize $Q(s,a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and set $Q(\text{terminal state},\cdot) = 0$.
    <ul>
        <li> Repeat for each episode:
        <ul>
            <li> Initialize the state $S$. </li>
            <li> Repeat for each step of the episode, until $S$ is terminal:
            <ul>
                <li> Choose $A$ from $S$ using the policy derived from $Q$, for instance, $\epsilon$-greedy. </li>
                <li> Take action $A$, observe $R$, $S'$. </li>
                <li> $Q(S,A) \leftarrow Q(S,A)+\alpha \left [ R+\gamma\cdot \max_{a \in \mathcal{A}} Q(S',a)-Q(S,A) \right ] $</li>
                <li> $S \leftarrow S'$ </li>
            </ul>
            </li>
        </ul>
        </li>
    </ul>
</div>
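
The Q-learning update differs from Sarsa only in its target, which takes the maximum over the actions in $S'$. A minimal sketch, again assuming `Q` is a NumPy array indexed by `[state, action]` and that a `done` flag marks terminal transitions; the name `q_learning_update` and the example values are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to Q in place and return the TD error."""
    # Bootstrap from the best action in s_next, whatever action is taken next.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))
Q[1] = [0.2, 1.0]
# Target is 0.5 + 0.9 * max(0.2, 1.0); Q[0, 1] moves halfway towards it.
q_learning_update(Q, s=0, a=1, r=0.5, s_next=1, done=False,
                  alpha=0.5, gamma=0.9)
```

Since the target ignores the action chosen by the behaviour policy, Q-learning is off-policy: in this example the update is the same whether the next action in state 1 is 0 or 1.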


In [8]:
import numpy as np
import gym

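def epsilon_greedy_policy(Q, epsilon, actions):
    """Return an epsilon-greedy policy derived from Q.

    Q is a NumPy array of shape (n_states, n_actions), epsilon is the
    exploration probability in [0, 1], and actions is a sequence of actions.
    The policy reads Q by reference, so in-place updates to Q take effect."""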
    
    def policy_fn(state):
        if np.random.rand()>epsilon:
            action = np.argmax(Q[state,:])
        else:
            action = np.random.choice(actions)
        return action
    return policy_fn


env = gym.make("FrozenLake-v0")
n_episodes = 100

# Initialization
Q = np.zeros([env.observation_space.n, env.action_space.n])

actions = range(env.action_space.n)

score = []    
for j in range(n_episodes):
    done = False
    state = env.reset()
    policy = epsilon_greedy_policy(Q, epsilon=1./(j+1), actions = actions )
    
    
    ### Generate sample episode
    t=0
    while not done:
        t+=1
        action = policy(state)    
        new_state, reward, done, _ =  env.step(action)
        new_action = policy(new_state)
        
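        '''
        YOUR ALGORITHM GOES HERE:
        
        Your policy is derived from the value function Q,
        so you need to update Q (in place) here
        ''' 
        
        # Advance to the next state; without this the agent always acts
        # from the initial state.
        state = new_state
            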
        if done:
            if len(score) < 100:
                score.append(reward)
            else:
                score[j % 100] = reward
                
                
            if (j+1)%10 == 0:
                print("INFO: Episode {} finished after \
                {} timesteps with r={}.\
                 Running score: {}".format(j+1, t, reward, np.mean(score)))
            

env.close()

INFO: Episode 10 finished after                 6 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 20 finished after                 6 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 30 finished after                 21 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 40 finished after                 31 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 50 finished after                 7 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 60 finished after                 8 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 70 finished after                 16 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 80 finished after                 24 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 90 finished after                 5 timesteps with r=0.0.                 Running score: 0.0
INFO: Episode 100 finished after 