# Q* Learning 
<br> 
In this notebook, we'll implement an agent <b>that plays FrozenLake.</b>
<img src="frozenlake.png" alt="Frozen Lake"/>

The goal of this game is <b>to go from the starting state (S) to the goal state (G)</b> by walking only on frozen tiles (F) and avoiding holes (H). However, the ice is slippery, <b>so you won't always move in the direction you intend (stochastic environment).</b>

## Step 0: Import the dependencies
We use 3 libraries:
- `NumPy` for our Q-table
- `OpenAI Gym` for our FrozenLake environment
- `random` to generate random numbers

In [None]:
import numpy as np
import gym
import random

## Step 1: Create the environment
- Here we'll create the FrozenLake environment. 
- OpenAI Gym is a library <b> composed of many environments that we can use to train our agents.</b>
- In our case we choose to use Frozen Lake.
- Note that S is the start tile, F is a frozen tile, H is a hole, and G is the goal.
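To make the tile letters concrete, here is a minimal sketch of how a flat state index could be mapped back to a tile. It assumes the default 4x4 layout of `FrozenLake-v1` and row-major state numbering; `LAKE_MAP` and `tile_at` are illustrative names, not part of Gym, so check the layout against your installed version.

```python
# Assumed default 4x4 layout of FrozenLake-v1 (verify against your gym version).
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])

def tile_at(state):
    """Return the tile letter for a flat state index (row-major order)."""
    row, col = divmod(state, N_COLS)
    return LAKE_MAP[row][col]

print(tile_at(0))   # tile under the first state
print(tile_at(5))   # tile in row 1, column 1
print(tile_at(15))  # tile under the last state
```

Under these assumptions, the agent starts in state 0 and an episode ends successfully when it reaches state 15.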

In [None]:
env = gym.make("FrozenLake-v1")
# Reset first so the environment has a valid state to render
env.reset()
env.render()

## Step 2: Create the Q-table and initialize it
- Now we'll create our Q-table. To know how many rows (states) and columns (actions) we need, we have to compute `state_size` and `action_size`.
- OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
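As a rough illustration of the table's shape, the sketch below builds a zero-filled array with one row per state and one column per action. The sizes 16 and 4 are hard-coded assumptions for a 4x4 grid with four moves, and `q_example` is a hypothetical name; in the notebook the sizes come from `env.observation_space.n` and `env.action_space.n`.

```python
import numpy as np

# Hypothetical sizes for a 4x4 grid with 4 moves; in the notebook these
# come from env.observation_space.n and env.action_space.n.
n_states, n_actions = 16, 4

# One row per state, one column per action; every estimate starts at zero.
q_example = np.zeros((n_states, n_actions))
print(q_example.shape)
```

Starting from zeros is one common choice, not the only one; it simply means the agent has no prior preference for any action.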

In [None]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [None]:
# Complete the code to create the qtable from the action_size and state_size
qtable = ####################
print(qtable)

## Step 3: Create the hyperparameters
- Here, we'll specify the hyperparameters
- Comment on the impact of each parameter on the learning process. You can rerun the notebook with different configurations.
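To get a feel for how `decay_rate` shapes exploration, the sketch below evaluates the exponential decay schedule that the training loop applies to `epsilon` after each episode. The helper `epsilon_at` is an illustrative name, and its defaults mirror the values in this notebook.

```python
import numpy as np

def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

# Print the exploration rate at a few points during training.
for ep in (0, 100, 500, 1000, 5000):
    print(ep, round(float(epsilon_at(ep)), 3))
```

Trying a larger or smaller `decay_rate` in this helper shows how quickly the agent would switch from exploring to exploiting.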

In [None]:
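total_episodes = 15000        # Number of training episodes
learning_rate = 0.8           # Learning rate (lr in the update rule)
max_steps = 99                # Maximum number of steps per episode
gamma = 0.95                  # Discount rate

epsilon = 1.0                 # Current exploration rate
max_epsilon = 1.0             # Exploration rate at the start of training
min_epsilon = 0.01            # Lower bound on the exploration rate
decay_rate = 0.005            # Exponential decay rate of the exploration rate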

In [None]:
# Write your comments in this cell:
'''



'''

## Step 4: The Q-learning algorithm
- Now we implement the Q-learning algorithm:
<img src="qtable_algo.png" alt="Q algo"/>
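The two core pieces of the algorithm, epsilon-greedy action selection and the Q-value update, can be sketched in isolation on a toy table without any environment. The helper names `choose_action` and `q_update` and the toy transition are assumptions made for illustration; this is one possible way to write these steps, not the reference solution for the cells below.

```python
import random
import numpy as np

def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy: exploit the best known action, otherwise explore at random."""
    if rng.uniform(0, 1) > epsilon:
        return int(np.argmax(q[state, :]))
    return rng.randrange(q.shape[1])

def q_update(q, state, action, reward, new_state, lr=0.8, gamma=0.95):
    """One step of Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(q[new_state, :])
    q[state, action] += lr * (td_target - q[state, action])

# Toy transition with hypothetical numbers: state 14, action 2, reward 1, next state 15.
q = np.zeros((16, 4))
q_update(q, state=14, action=2, reward=1.0, new_state=15)
print(q[14, 2])
print(choose_action(q, state=14, epsilon=0.0))
```

After the update, only the entry for the visited state and action changes, which is why the agent needs many episodes before rewards propagate back toward the start state.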

In [None]:
# Step 1 --> Q-values are already initialized.

# List of rewards
rewards = []

# Step 2 --> For life or until learning is stopped ...
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Step 3 --> Choose an action (a) in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = ####################
        
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = # you can use the function argmax from numpy

        # Else doing a random choice --> exploration
        else:
            action = # do something

        # Step 4 --> Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = # use the method step from env. Check gym library documentation.

        # Step 5 --> Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = # update following Bellman's equation
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (the agent fell into a hole or reached the goal): finish episode
        if done:
            break
        
    # Reduce epsilon 
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)
    # Why do we need to reduce the epsilon? Comment below:
    '''
    Write your answer here:
    

    '''

print("Score over time: " + str(sum(rewards) / total_episodes))
print(qtable)

## Step 5: Use our Q-table to play FrozenLake! 
- After training (15,000 episodes with the hyperparameters above), our Q-table can be used as a "cheat sheet" to play FrozenLake.
- By running this cell you can see our agent playing FrozenLake.
- Evaluate how the reward evolves in each step and comment on it.
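One possible way to look at how the reward evolves is to average the per-episode rewards over consecutive blocks of episodes. The sketch below uses a small made-up list purely as placeholder data; `block_means` and `fake_rewards` are hypothetical names, and in the notebook you would pass the `rewards` list collected during training.

```python
import numpy as np

def block_means(rewards, block_size):
    """Average reward in consecutive blocks of `block_size` episodes."""
    # Drop the trailing episodes that do not fill a whole block.
    usable = len(rewards) - len(rewards) % block_size
    blocks = np.asarray(rewards[:usable], dtype=float).reshape(-1, block_size)
    return blocks.mean(axis=1)

# Placeholder data for illustration only, not real training results.
fake_rewards = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
print(block_means(fake_rewards, block_size=5))
```

With real training data, a block size of around 1000 episodes would give 15 points that can be printed or plotted to see whether the success rate rises over time.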

In [None]:
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in the current state
        action = # do something
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Here, we only render the last state (to see whether our agent reached the goal or fell into a hole)
            env.render()
            
            # Print the number of steps it took (step is zero-based, hence the + 1)
            print("Number of steps", step + 1)
            break
        state = new_state
env.close()

In [None]:
# Write here your code to evaluate the evolution of the reward in each step.




In [None]:
# Write your comments in this cell
'''


'''