# Q* Learning with FrozenLake
In this notebook, we'll implement an agent that plays **FrozenLake**.
<br>The goal of this game is to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H). However, the ice is slippery, so you won't always move in the direction you intend (the environment is stochastic).
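
To make the slipperiness concrete, here is a minimal sketch that does not use gym at all. It assumes the documented FrozenLake behaviour in which the intended direction and the two perpendicular directions are each taken with probability 1/3. The names `slippery_action` and `rng` are illustrative and not part of the gym API.

```python
import random

# Action indices as used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def slippery_action(intended, rng):
    """Return the direction actually taken on slippery ice.

    Assumption: the intended direction and its two perpendicular
    neighbours are each chosen with probability 1/3.
    """
    candidates = [(intended - 1) % 4, intended, (intended + 1) % 4]
    return rng.choice(candidates)

rng = random.Random(0)  # fixed seed so the run is repeatable
taken = [slippery_action(RIGHT, rng) for _ in range(10_000)]
share_intended = taken.count(RIGHT) / len(taken)
print(f"Moved in the intended direction {share_intended:.0%} of the time")
```

Under this assumption the agent never moves opposite to its intended direction, but it follows its intention only about a third of the time, which is why even a well-trained policy can still fall into a hole.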

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/q2ZOEFAaaI0?showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

## Step 0: Import the dependencies

In [3]:
import numpy as np # for our table
import gym # for Frozenlake Environment
import random # generate random number

## Step 1: Create the environment

In [4]:
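# "FrozenLake-v0" is the ID used by older gym releases; newer releases register
# the environment as "FrozenLake-v1" and change what reset() and step() return.
env = gym.make("FrozenLake-v0")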

## Step 2: Create the Q-table and initialize it

In [5]:
action_size = env.action_space.n
state_size = env.observation_space.n
qtable = np.zeros((state_size, action_size))  # one row per state, one column per action
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
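
Each of the 16 rows printed above corresponds to one tile of the lake, and each of the 4 columns to one action. As a small sketch, assuming states are numbered row by row starting from the top-left tile, a state index can be mapped back to its grid position. The `LAKE` layout is copied from the rendered output further down in this notebook; `state_to_tile` is a hypothetical helper.

```python
# The 4x4 map as it appears in the rendered output of this notebook.
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = 4

def state_to_tile(state):
    """Map a state index to (row, column, tile letter), assuming row-major numbering."""
    row, col = divmod(state, N_COLS)
    return row, col, LAKE[row][col]

for s in (0, 5, 15):
    print(s, state_to_tile(s))
```

With this numbering, `qtable[15, :]` would hold the values for the goal tile and `qtable[0, :]` those for the start tile.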


## Step 3: Create the hyperparameters

In [6]:
total_episodes = 150000
learning_rate = 0.8
max_steps = 99 # max steps per episode
gamma = 0.95 # discounting rate

# Exploration params
epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005
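
The exploration parameters feed an exponential decay that is applied once per episode in the training loop. The sketch below evaluates that same formula in isolation so the schedule can be inspected before training; `epsilon_at` is a hypothetical helper name.

```python
import math

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.005

def epsilon_at(episode):
    """Exploration rate after the decay step of the given episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 4))
```

With these values epsilon is already close to its floor of 0.01 by episode 1000, so almost all of the 150000 episodes are played nearly greedily.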

## Step 4: The Q-learning algorithm
![](https://cdn-images-1.medium.com/max/1116/1*QeoQEqWYYPs1P8yUwyaJVQ.png?raw=true)
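
Before running the full loop, it can help to apply the update rule Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)] by hand to a couple of transitions. The sketch below is a minimal illustration: plain Python lists stand in for the NumPy Q-table so that it has no dependencies, `q_update` is a hypothetical helper, and the two transitions are made up for the example.

```python
learning_rate, gamma = 0.8, 0.95

def q_update(q, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * max(q[new_state])
    q[state][action] += learning_rate * (td_target - q[state][action])
    return q[state][action]

# 16 states x 4 actions, all zeros, like the freshly initialised Q-table.
q = [[0.0] * 4 for _ in range(16)]

# Made-up transition: from state 14, action 2 (Right) reaches the goal (reward 1).
first = q_update(q, 14, 2, 1.0, 15)
# Made-up transition: from state 13, action 2 (Right) lands on state 14 (reward 0).
second = q_update(q, 13, 2, 0.0, 14)
print(first, second)
```

The second update earns no reward of its own, yet Q(13, Right) becomes positive because it bootstraps from the value just learned for state 14. This is how the goal reward propagates backwards through the table.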

In [7]:
# List of rewards
rewards = []

# 2. For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # 3. Choose an action (a) in the current world state (s)
        ## First we draw a random number between 0 and 1
        exp_exp_tradeoff = random.uniform(0,1)
        
        ## If this number is greater than epsilon --> exploitation
        ## (take the action with the biggest Q value for this state)
        
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
        
        # Otherwise, take a random action --> exploration
        else:
            action = env.action_space.sample()
        
        # Take the action (a) and observe the outcome state (s') 
        # and reward (r)
        
        new_state, reward, done, info = env.step(action)
        
        # Update Q(s,a) := 
        # Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        
        # qtable[new_state,:] : all the actions 
        # we can take from new state
        
        qtable[state, action] = \
            qtable[state, action] +\
            learning_rate * (
                reward +
                gamma * np.max(qtable[new_state, :])-
                qtable[state, action]
            )
        total_rewards += reward
        
        # Our new state is new_state
        state = new_state
        
        # If done (we fell into a hole or reached the goal): finish the episode
        if done:
            break
    # reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*\
              np.exp(-decay_rate*episode)
    rewards.append(total_rewards)
    
    if episode % 1000 == 0:
        print("Episode:", episode, 
              "--Reward:", total_rewards, 
              "--Epsilon:", epsilon)
    
print("Score over time: " + str(sum(rewards)/total_episodes))
print("="*50, "\n" , qtable)
print("="*50)

Episode: 0 --Reward: 0.0 --Epsilon: 1.0
Episode: 1000 --Reward: 0.0 --Epsilon: 0.016670567529094613
Episode: 2000 --Reward: 1.0 --Epsilon: 0.010044945930464861
Episode: 3000 --Reward: 1.0 --Epsilon: 0.010000302843297297
Episode: 4000 --Reward: 1.0 --Epsilon: 0.010000002040542086
Episode: 5000 --Reward: 0.0 --Epsilon: 0.010000000013749065
Episode: 6000 --Reward: 1.0 --Epsilon: 0.010000000000092641
Episode: 7000 --Reward: 1.0 --Epsilon: 0.010000000000000625
Episode: 8000 --Reward: 0.0 --Epsilon: 0.010000000000000004
Episode: 9000 --Reward: 0.0 --Epsilon: 0.01
Episode: 10000 --Reward: 1.0 --Epsilon: 0.01
Episode: 11000 --Reward: 1.0 --Epsilon: 0.01
Episode: 12000 --Reward: 0.0 --Epsilon: 0.01
Episode: 13000 --Reward: 0.0 --Epsilon: 0.01
Episode: 14000 --Reward: 1.0 --Epsilon: 0.01
Episode: 15000 --Reward: 0.0 --Epsilon: 0.01
Episode: 16000 --Reward: 1.0 --Epsilon: 0.01
Episode: 17000 --Reward: 0.0 --Epsilon: 0.01
Episode: 18000 --Reward: 0.0 --Epsilon: 0.01
Episode: 19000 --Reward: 0.0 --
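
The single "Score over time" figure averages over every episode, including the early exploratory ones, and the per-episode rewards logged above are only 0 or 1. One way to see the trend is to average the rewards over consecutive windows. This is a rough sketch: `windowed_success_rate` is a hypothetical helper and the sample rewards are invented for illustration, not taken from the run above.

```python
def windowed_success_rate(rewards, window):
    """Average reward over consecutive, non-overlapping windows of episodes."""
    return [
        sum(rewards[start:start + window]) / window
        for start in range(0, len(rewards) - window + 1, window)
    ]

# Invented rewards for 8 episodes (1.0 = reached the goal, 0.0 = did not).
sample_rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
rates = windowed_success_rate(sample_rewards, window=4)
print(rates)
```

Applied to the `rewards` list from training with a window of, say, 1000 episodes, this would give one success rate per block of episodes instead of a single global average.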

## Step 5: Use our Q-table to play FrozenLake!

In [31]:
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Here, we only print the last state (to see whether our agent reached the goal or fell into a hole)
            env.render()
            
            # We print the number of steps it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 22
****************************************************
EPISODE  1
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 27
****************************************************
EPISODE  2
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 3
****************************************************
EPISODE  3
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 16
****************************************************
EPISODE  4
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 20
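
The play loop always picks the action with the highest Q value for the current state, so the learned behaviour can also be read directly off the table. The sketch below decodes a greedy policy into action names, assuming FrozenLake's action order 0=Left, 1=Down, 2=Right, 3=Up. `greedy_policy` is a hypothetical helper, and the three-row table is invented so the example runs without the trained `qtable`.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def greedy_policy(q):
    """Return the name of the highest-valued action for each state (row) of q."""
    policy = []
    for row in q:
        # On ties the first maximal index wins, as with np.argmax.
        best = max(range(len(row)), key=row.__getitem__)
        policy.append(ACTION_NAMES[best])
    return policy

# Invented Q-values for three states, for illustration only.
sample_q = [
    [0.1, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.9, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]
print(greedy_policy(sample_q))
```

A row that is still all zeros decodes to "Left" simply because index 0 wins the tie, so a state the agent never learned anything about looks the same as one where Left is genuinely preferred.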
