# Reinforcement learning approach to the Frozen Lake game#
## A simple illustration of the Q-learning algorithm ##
We are going to teach a computer to play the game "Frozen Lake" which, conveniently enough, is provided as a gym environment:

In [1]:
import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")

The goal of the game is simple: the player starts at the top left of the screen and must reach the bottom right while avoiding the holes:
![](Frozen-Lake.png)
At each step the player must decide what to do, but beware: the lake is slippery, so even if you choose to move left, from time to time you will find yourself moving in a different direction.
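
To make "slippery" concrete, here is a minimal sketch in plain Python. It assumes the rule that, to the best of our knowledge, the gym implementation follows: the chosen direction and its two perpendicular neighbours are equally likely. The helper names `slippery_outcomes` and `slippery_move` are ours and are not part of gym; the action numbering (0 = Left, 1 = Down, 2 = Right, 3 = Up) is also stated as an assumption.

```python
import random

ACTIONS = ["Left", "Down", "Right", "Up"]  # assumed numbering: 0, 1, 2, 3


def slippery_outcomes(action):
    """Directions that can actually happen when `action` is chosen.

    Assumption: the chosen direction and its two perpendicular
    neighbours each occur with probability 1/3.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]


def slippery_move(action, rng=random):
    """Sample the direction really taken on the slippery lake."""
    return rng.choice(slippery_outcomes(action))


# Choosing "Left" (0) may in fact move the player Up, Left or Down.
print([ACTIONS[a] for a in slippery_outcomes(0)])
```

Under this assumption the player never slips backwards: choosing Left can never result in a move to the Right.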

As can easily be checked, there are $4$ possible actions and $16$ possible states:

In [2]:
action_size = env.action_space.n
state_size = env.observation_space.n
print("Number of possible actions: %d, number of possible states :%d." % (action_size,state_size))

Number of possible actions: 4, number of possible states :16.
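
The $16$ states are simply the cells of the $4\times 4$ grid. The sketch below assumes they are numbered row by row, starting from the top-left corner; the helper names `state_to_cell` and `cell_to_state` are ours, introduced only for illustration.

```python
N_COLS = 4  # the default FrozenLake map is a 4x4 grid


def state_to_cell(state):
    """Convert a state index (0..15) into (row, column)."""
    return divmod(state, N_COLS)


def cell_to_state(row, col):
    """Convert (row, column) back into a state index."""
    return row * N_COLS + col


print(state_to_cell(0))    # (0, 0): the start, top-left corner
print(state_to_cell(15))   # (3, 3): the goal, bottom-right corner
```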


Our goal will be to construct the Q-table that gives us, for each state $s$ and each action $a$, the total discounted sum of future rewards. In other words, the ideal table should be
$$
Q^*(s,a)=R^0(a)+\sum_{t=1}^{\infty} \gamma^t R^t
$$
where $R^0(a)$ is the immediate reward if action $a$ is taken, $R^t$ is the best possible reward at each later time $t$, and $\gamma$ is the discounting rate.
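
As a quick numerical illustration of this discounted sum, here is a minimal sketch. The function name `discounted_return` and the reward sequence are made up for the example; $\gamma = 0.99$ is the value used later in this notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] for t = 0, 1, 2, ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# A hypothetical game that reaches the goal on its third move:
# rewards 0, 0, 1, so the discounted sum is gamma**2.
print(discounted_return([0, 0, 1], gamma=0.99))
```

A reward obtained later is worth less: the same reward of $1$ received after ten moves would only contribute $0.99^{9}$.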

Since we do not know this table *a priori*, we start with a random guess $Q$:

In [3]:
qtable = np.random.uniform(0,1e-4,(state_size, action_size))
print(qtable)

[[8.20126535e-05 9.51917993e-05 7.58549581e-06 4.20522717e-05]
 [5.93560679e-05 2.05266150e-05 7.73609902e-05 3.32627626e-05]
 [7.15391994e-05 7.81981546e-05 5.41331483e-06 1.38914220e-05]
 [8.88068036e-05 1.02153289e-05 4.83677338e-05 5.71423813e-05]
 [1.04484668e-06 6.92089489e-05 4.78812433e-05 6.40257544e-05]
 [9.59348388e-05 1.57587732e-06 5.01156269e-05 6.72649751e-05]
 [7.07425060e-05 1.06172168e-05 2.96746011e-05 4.19711066e-05]
 [9.82190629e-05 2.72426338e-05 7.39274089e-05 6.66954648e-05]
 [9.95079939e-05 2.24644631e-05 7.77881135e-05 8.00153735e-05]
 [3.46003198e-05 3.06226874e-05 7.38690638e-05 8.63406604e-05]
 [7.28911027e-05 7.16976933e-05 5.95028190e-05 4.41792308e-05]
 [3.26390314e-05 4.93716965e-05 1.73102804e-06 1.68723502e-05]
 [2.54424347e-05 5.87725510e-05 7.56597495e-05 9.60635491e-05]
 [3.13446334e-05 6.32912262e-05 7.90374495e-05 6.88290966e-05]
 [1.90775409e-05 8.01124821e-05 2.29613871e-05 9.05698301e-05]
 [8.67596055e-05 5.94842029e-05 9.12279735e-05 5.229162

To learn this table (the Q-learning part) we need to derive the Bellman equation. It follows from the observation that, for the ideal table, one has:
$$Q^*(s,a)=R^0(a)+\sum_{t=1}^{\infty} \gamma^t R^t= R^0(a) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} R^t = R^0(a) + \gamma \left[R^1  + \sum_{\tau=1}^{\infty} \gamma^{\tau} R^{1+\tau}\right]$$

Since $Q^*(s,a)$ is time-translation invariant, the term in brackets is the value of the best action $a'$ in the state $s'$ reached after taking action $a$, and we can thus write:
$$Q^*(s,a)= R^0(a) + \gamma \left[R_{\rm best}^1  + \sum_{\tau=1}^{\infty} \gamma^{\tau} R^{1+\tau}\right]$$
and this leads to the **Bellman equation**:
$$Q^*(s,a)=  R^0(a) + \gamma \max_{a'} Q^*(s',a')$$

Given this identity, we will use the update rule:
$$
Q^{t+1}(s,a)=(1-\delta) Q^{t}(s,a)+ \delta\left[R(a) + \gamma \max_{a'} Q^{t}(s',a')\right]
$$
where $\delta$ is the learning rate and $s'$ is the state reached after taking action $a$ in state $s$.
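
A single update can be checked by hand. The sketch below uses made-up numbers and two hypothetical helper functions: `q_update` mixes the old value and the new target with weights $1-\delta$ and $\delta$, while `q_update_incremental` is the "old value plus $\delta$ times the error" form that appears in the training loop later in this notebook. Both should give the same number.

```python
def q_update(q_sa, reward, max_q_next, delta, gamma):
    """One Q-learning update for a single (state, action) pair."""
    target = reward + gamma * max_q_next
    return (1 - delta) * q_sa + delta * target


def q_update_incremental(q_sa, reward, max_q_next, delta, gamma):
    """The same update written as 'old value + delta * error'."""
    return q_sa + delta * (reward + gamma * max_q_next - q_sa)


# Hypothetical numbers: current estimate 0.2, reward 1,
# best value in the next state 0.5.
new_q = q_update(0.2, reward=1.0, max_q_next=0.5, delta=0.5, gamma=0.99)
print(new_q)  # close to 0.5 * 0.2 + 0.5 * (1 + 0.99 * 0.5) = 0.8475
```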

The important point, before updating this table, is to strike a balance between exploration and exploitation when we play the game. Of course, we ultimately want to play according to the Q-table $Q^*$ (*exploitation*), but since our table $Q$ is essentially random at the beginning, we should also allow random moves from time to time (*exploration*).
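
The trade-off described above can be sketched as a small function. This is only an illustration in plain Python: the name `epsilon_greedy` and the Q values are hypothetical, and `epsilon` stands for the probability of playing a random move.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-table."""
    if rng.random() > epsilon:
        # Exploitation: the action with the largest Q value.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Exploration: any action, uniformly at random.
    return rng.randrange(len(q_row))


q_row = [0.1, 0.7, 0.2, 0.0]  # hypothetical Q values for one state

# With epsilon = 0 we follow the table, which prefers action 1.
print(epsilon_greedy(q_row, epsilon=0.0))
# With epsilon = 1 every move is random.
print(epsilon_greedy(q_row, epsilon=1.0))
```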

We shall do this less and less over time, of course: every time we play a new game (a new episode) we should trust our table a little more, so we set the exploration rate to
$$
\epsilon^t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min})e^{-n_{\rm episode} \lambda} 
$$
where $\lambda$ is a decay rate and $n_{\rm episode}$ is the number of episodes played so far.
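
To get a feeling for this schedule, the sketch below evaluates it at a few episode numbers. The function name `exploration_rate` is ours; the default values ($\epsilon_{\min}=0.01$, $\epsilon_{\max}=1$, $\lambda=0.001$) are the ones used later in this notebook.

```python
import math


def exploration_rate(episode, eps_min=0.01, eps_max=1.0, decay=0.001):
    """Exploration rate after `episode` games have been played."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay * episode)


for n in (0, 1000, 5000, 20000):
    print("episode %5d: epsilon = %.4f" % (n, exploration_rate(n)))
```

With these values the exploration rate starts at $1$, falls to roughly $0.37$ after $1000$ games, and is already very close to its floor of $0.01$ after $5000$ games.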

Let us set up all these parameters:

In [10]:
total_episodes = 20000      # Total episodes (number of games played)
learning_rate = 0.5         # Learning rate in Bellman equation (delta)
max_steps = 99              # Max steps per episode
gamma = 0.99                # Discounting rate in the Q-table

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.001             # Exponential decay rate for exploration prob

We are now ready to write the learning algorithm:

In [11]:
# List of rewards
rewards = []

# For each episode/game, we play:
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    # Now we play until the game ends or until it has lasted too long
    for step in range(max_steps):
        # First we decide whether we play in or out of policy:
        exp_exp_tradeoff = random.uniform(0, 1)
        # If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
        # Otherwise we make a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Now we take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Finally we perform the Bellman update...
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        #... and update the reward for this game.
        # Note that here, we only get a reward 1 if we eventually reach the goal!
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state

        # If done (we fell into a hole or reached the goal): finish the episode
        if done:
            break
        # Otherwise we continue to play
        
    # Reduce epsilon after each game/episode
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Update the list of rewards
    rewards.append(total_rewards)

    # Every 100 games, print the average reward over these games and reset the list
    if episode % 100 == 0:
        av_rewards = sum(rewards)/100
        print("Game numer %d: total reward:%f" %(episode,av_rewards))
        rewards = []

print(qtable)

Game numer 0: total reward:0.000000
Game numer 100: total reward:0.000000
Game numer 200: total reward:0.020000
Game numer 300: total reward:0.020000
Game numer 400: total reward:0.000000
Game numer 500: total reward:0.020000
Game numer 600: total reward:0.040000
Game numer 700: total reward:0.030000
Game numer 800: total reward:0.100000
Game numer 900: total reward:0.090000
Game numer 1000: total reward:0.070000
Game numer 1100: total reward:0.080000
Game numer 1200: total reward:0.080000
Game numer 1300: total reward:0.060000
Game numer 1400: total reward:0.120000
Game numer 1500: total reward:0.150000
Game numer 1600: total reward:0.170000
Game numer 1700: total reward:0.240000
Game numer 1800: total reward:0.290000
Game numer 1900: total reward:0.120000
Game numer 2000: total reward:0.250000
Game numer 2100: total reward:0.240000
Game numer 2200: total reward:0.230000
Game numer 2300: total reward:0.210000
Game numer 2400: total reward:0.320000
Game numer 2500: total reward:0.25000

We can now watch our little Q-table play the game, this time using *in-policy* moves only:

In [13]:
# Start a new game and remember the starting state
state = env.reset()
env.render()
for step in range(max_steps):
    print("t=%d " % (step))
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(qtable[state,:])
    new_state, reward, done, info = env.step(action)
    if done:
        env.render()
        # We print the number of steps it took.
        print("Number of steps", step)
        break
    # Otherwise we move
    state = new_state
    env.render()


[S]FFF
FHFH
FFFH
HFFG
t=0 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=1 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=2 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=3 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=4 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=5 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=6 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=7 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=8 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=9 
  (Left)
SFFF
FHFH
[F]FFH
HFFG
t=10 
  (Up)
SFFF
FHFH
F[F]FH
HFFG
t=11 
  (Down)
SFFF
FHFH
[F]FFH
HFFG
t=12 
  (Up)
SFFF
[F]HFH
FFFH
HFFG
t=13 
  (Left)
SFFF
FHFH
[F]FFH
HFFG
t=14 
  (Up)
SFFF
FHFH
[F]FFH
HFFG
t=15 
  (Up)
SFFF
FHFH
F[F]FH
HFFG
t=16 
  (Down)
SFFF
FHFH
[F]FFH
HFFG
t=17 
  (Up)
SFFF
[F]HFH
FFFH
HFFG
t=18 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=19 
  (Left)
SFFF
FHFH
[F]FFH
HFFG
t=20 
  (Up)
SFFF
FHFH
F[F]FH
HFFG
t=21 
  (Down)
SFFF
FHFH
FF[F]H
HFFG
t=22 
  (Left)
SFFF
FH[

By playing with the parameters, we should be able to obtain an algorithm that plays the game quite well.
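
When comparing parameter settings, watching a single game is not very informative because the lake is slippery. A minimal sketch of a more robust check is to play many games with in-policy moves only and count the wins. The function `greedy_success_rate` is our own helper, not part of gym; it assumes the older gym interface used in this notebook (`env.reset()` returns a state, `env.step(action)` returns four values). The class `CorridorEnv` is a hypothetical toy environment, included only so that the example runs on its own.

```python
def greedy_success_rate(env, qtable, episodes=1000, max_steps=99):
    """Fraction of games won when always taking the best-known action.

    Assumes the older gym interface used in this notebook:
    env.reset() returns a state, env.step(a) returns 4 values.
    """
    wins = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            row = qtable[state]
            action = max(range(len(row)), key=lambda a: row[a])
            state, reward, done, _ = env.step(action)
            if done:
                # A game counts as won only if it ended with a reward.
                wins += reward > 0
                break
    return wins / episodes


class CorridorEnv:
    """Hypothetical stand-in: 3 cells in a row, goal on the right."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action 0 = left, 1 = right
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 2
        return self.pos, float(done), done, {}


toy_q = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]  # "always go right"
print(greedy_success_rate(CorridorEnv(), toy_q, episodes=10))
```

With the notebook's objects, the call would be `greedy_success_rate(env, qtable)`, which can then be repeated for different learning rates, discounting rates or decay rates.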