# <center>Frozen Lake
### <center>Using Q-learning to not die

<img src='frozenlake.png'>

First, let's import all the necessary libraries. 

In [1]:
import gym
import numpy as np
import time
from IPython.display import clear_output
import pandas as pd

Next, let's watch an agent play the game by choosing random actions.

In [5]:
env = gym.make("FrozenLake-v0", is_slippery=False)
for game in range(10):
    state = env.reset()
    for t in range(100):
        clear_output(wait=True)
        env.render()
        time.sleep(0.2)
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        if done:
            break

  (Down)
SFFF
FHFH
FFFH
HFFG


### <center>Discuss
#### <center> What do you observe from watching the random agent? 
#### <center> Do you think this will be an easy environment to solve?

Now, let's try using Q-learning to train an agent to play the game.

We will start with a version of FrozenLake where the surface is not slippery. This is a deterministic environment: every action we choose always moves us in that direction.

You will need the following elements:
- <b>Q-table</b>:
    - Instantiate a table of zeros of size (number of states)x(number of actions)
    - You can use np.zeros() to create the array.
    
- <b>Gamma</b>:
    - The discount factor for future rewards.
    - Typically a number close to 1 (0.95-0.99) but feel free to try different values.
- <b>Scores array</b>:
    - An empty list to append scores from each game to.
    - Also, set a running score variable to zero at the start of each game.
- <b>Action selection</b>:
    - At each step, choose the best action.
    - Since our table starts out as all zeros, we need to choose at random whenever more than one action shares the highest Q-value.
    - You can use np.max() to get the highest Q-value.
    - Then get the action for each of those highest Q-values with np.where(). (Note: np.where() returns a tuple where the first element is the array of actions.)
    - Then use np.random.choice() to select one of the best actions.
- <b>Q-value calculation</b>:
    - Using the Q-formula:
    <img src='q_formula.png'>
    - You will have to find the maximum Q-value at the next state.
    - Store the result in the Q-table entry for the current state and action.
- <b>Reward tracking</b>:
    - Add the current reward to the running score total.
    - After each game, store the game score in the scores list.
- <b>State transition</b>:
    - Set the next state as the current state.
- <b>Termination</b>:
    - If the agent reaches the goal without falling in a hole for 10 episodes in a row, the environment is considered solved.
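
The action-selection step is the one most likely to trip you up, so here is a minimal sketch of it in isolation. The helper name `pick_best_action` and its `rng` argument are illustrative assumptions, not part of gym or NumPy. It shows why the random tie-break matters: `np.argmax` on its own always returns the first maximal index, so an all-zero row would pick action 0 every time.

```python
import numpy as np

def pick_best_action(q_row, rng=None):
    """Return the index of a highest-valued action, breaking ties at random.

    `pick_best_action` and `rng` are illustrative names, not part of gym.
    """
    rng = np.random.default_rng() if rng is None else rng
    # np.where returns a tuple; element 0 holds the indices of the best actions
    best_actions = np.where(q_row == np.max(q_row))[0]
    return int(rng.choice(best_actions))

rng = np.random.default_rng(0)
fresh_row = np.zeros(4)                  # untrained state: every action ties
picks = {pick_best_action(fresh_row, rng) for _ in range(200)}
print(sorted(picks))                     # the distinct actions that were chosen
print(int(np.argmax(fresh_row)))         # argmax alone returns index 0 for this row
```

Without the tie-break, an untrained agent would repeat the same action in every state and might never stumble onto the goal.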

In [4]:
env = gym.make("FrozenLake-v0", is_slippery=False)

Q_table = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.95
scores = []

for episode in range(10000):
    state = env.reset()
    score = 0
    done = False
    while not done:
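        # Choose among the highest-valued actions, breaking ties at random
        best_actions = np.where(Q_table[state, :] == np.max(Q_table[state, :]))[0]
        action = np.random.choice(best_actions)
        next_state, reward, done, info = env.step(action)
        # One-step target: reward plus the discounted best Q-value at the next state
        Q_value = reward + gamma * np.max(Q_table[next_state, :])
        Q_table[state, action] = Q_value  # overwrite the old estimate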
        score += reward
        state = next_state     
    scores.append(score)
    if episode>9:
        if np.mean(scores[-10:])==1:
            print("Solved!")
            print("Average per past 10 games: " + str(np.mean(scores[-10:])))
            print("Number of games played: " + str(episode))
            break
            
if np.mean(scores[-10:])!=1:
    print('Failed')
env.close()

Solved!
Average per past 100 games: 0.18478260869565216
Number of games played: 91


Take a look at your Q-table now.

In [6]:
q_table_not_slippery = np.round(Q_table,3)
q_table_not_slippery

array([[0.   , 0.   , 0.774, 0.   ],
       [0.   , 0.   , 0.815, 0.   ],
       [0.   , 0.857, 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.902, 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.902, 0.   ],
       [0.   , 0.95 , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.95 , 0.   ],
       [0.   , 0.   , 1.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ]])
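
One way to read this table is to turn it into a map of the preferred move in each state. The sketch below rebuilds the table from the non-zero entries printed above and relies on FrozenLake's action indices (0 = Left, 1 = Down, 2 = Right, 3 = Up). The helper `greedy_policy_grid` and the `ARROWS` lookup are hypothetical names introduced for illustration.

```python
import numpy as np

# Non-zero entries copied from the output above: (state, action) -> Q-value.
learned = {(0, 2): 0.774, (1, 2): 0.815, (2, 1): 0.857, (6, 1): 0.902,
           (9, 2): 0.902, (10, 1): 0.95, (13, 2): 0.95, (14, 2): 1.0}
q_table_not_slippery = np.zeros((16, 4))
for (state, action), value in learned.items():
    q_table_not_slippery[state, action] = value

# FrozenLake action indices: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ARROWS = np.array(["<", "v", ">", "^"])

def greedy_policy_grid(q_table, side=4):
    """Arrow for the best action in each state; '.' where the row is all zeros."""
    best = np.argmax(q_table, axis=1)
    has_value = q_table.max(axis=1) > 0
    symbols = np.where(has_value, ARROWS[best], ".")
    return symbols.reshape(side, side)

for row in greedy_policy_grid(q_table_not_slippery):
    print(" ".join(row))
```

Following the arrows from the start state traces a single safe path to the goal, and the Q-values along it shrink by a factor of gamma for each step further from the goal (1, 0.95, 0.902, ...).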

### <center>Discuss
#### <center> How did your agent do? 
#### <center> What does your Q-table look like?

Now we are going to try a stochastic version of our environment by setting is_slippery to True. There is now a chance that we will slide on the ice and move in a different direction than the one we chose.
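
To get a feel for what slipping does, here is a minimal sketch of the slip rule written without gym. It assumes the rule used by gym's FrozenLake as we understand it: the move actually taken is drawn uniformly from the intended direction and its two perpendicular neighbours, so the intended move happens only about a third of the time. `slippery_move` is a hypothetical helper, not a gym function.

```python
import numpy as np

ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # FrozenLake action indices 0..3

def slippery_move(action, rng):
    """Sample the direction actually taken for an intended action (assumed rule)."""
    # Neighbouring indices are the two directions perpendicular to `action`
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
intended = 2  # Right
moves = [slippery_move(intended, rng) for _ in range(3000)]
for a in range(4):
    # Fraction of the 3000 samples that ended up moving in each direction
    print(ACTION_NAMES[a], round(moves.count(a) / len(moves), 2))
```

Under this rule the agent never moves opposite to its intended direction, but any single step can still push it sideways into a hole.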

### <center>Discuss
#### <center> How do you think the challenge will change in the new, slippery version of the environment?


Copy your code from above and paste it below with the following change:
- <b>is_slippery</b>:
    - Set to True

In [7]:
env = gym.make("FrozenLake-v0", is_slippery=True)

Q_table = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.95
scores = []

for episode in range(10000):
    state = env.reset()
    score = 0
    done = False
    while not done:
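        # Choose among the highest-valued actions, breaking ties at random
        best_actions = np.where(Q_table[state, :] == np.max(Q_table[state, :]))[0]
        action = np.random.choice(best_actions)
        next_state, reward, done, info = env.step(action)
        # One-step target: reward plus the discounted best Q-value at the next state
        Q_value = reward + gamma * np.max(Q_table[next_state, :])
        Q_table[state, action] = Q_value  # overwrite the old estimate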
        score += reward
        state = next_state     
    scores.append(score)
    if episode>9:
        if np.mean(scores[-10:])==1:
            print("Solved!")
            print("Average per past 10 games: " + str(np.mean(scores[-10:])))
            print("Number of games played: " + str(episode))
            break
     
    if episode%1000==0:
        print('Episode: ',episode, 'Average success rate: ', np.mean(scores))
if np.mean(scores[-10:])!=1:
    print('Failed')
env.close()

Episode:  0 Average success rate:  0.0
Episode:  1000 Average success rate:  0.017982017982017984
Episode:  2000 Average success rate:  0.02048975512243878
Episode:  3000 Average success rate:  0.01866044651782739
Episode:  4000 Average success rate:  0.018245438640339916
Episode:  5000 Average success rate:  0.01919616076784643
Episode:  6000 Average success rate:  0.019330111648058656
Episode:  7000 Average success rate:  0.01814026567633195
Episode:  8000 Average success rate:  0.017872765904261966
Episode:  9000 Average success rate:  0.016887012554160648
Failed


Take a look at your Q-table now.

In [8]:
q_table_is_slippery = np.round(Q_table,3)
q_table_is_slippery

array([[0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.857],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.95 , 0.   ],
       [0.   , 0.   , 1.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ]])

### <center>Discuss
#### <center> How did your agent perform in the slippery environment?
#### <center> Why is performance different in this environment?


### <center>Solution

#### <center>Epsilon-greedy policy
<center>Makes the agent explore the environment with random actions at first, then rely more and more on its learned Q-values as epsilon decays.
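
As a rough sketch of the idea, the snippet below separates the two pieces: choosing an action with probability epsilon of exploring, and shrinking epsilon after each episode. The helper names `epsilon_greedy_action` and `decayed_epsilon` are hypothetical, and the starting value of 1 with a decay of 0.995 simply mirrors the settings this notebook uses.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit (random tie-break)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))          # explore: uniform random action
    best_actions = np.where(q_row == np.max(q_row))[0]
    return int(rng.choice(best_actions))              # exploit: a best-known action

def decayed_epsilon(start, decay, episodes):
    """Epsilon after multiplying by `decay` once per finished episode."""
    epsilon = start
    for _ in range(episodes):
        epsilon *= decay
    return epsilon

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9, 0.2, 0.0])
print(epsilon_greedy_action(q_row, epsilon=0.0, rng=rng))   # no exploration, so action 1
for episodes in (1, 101, 501, 1001):
    print(episodes, decayed_epsilon(1.0, 0.995, episodes))
```

With these settings epsilon falls to roughly 0.6 after about 100 episodes, so most early moves are still exploratory.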

#### <center>Learning rate
<center>Allows the agent to accumulate knowledge gradually instead of overwriting previous Q-values with each new experience.

<center>$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
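
To see why this matters in a noisy environment, here is a small sketch that applies the update to a made-up stream of targets. The scenario is hypothetical: one (state, action) pair whose target is 1 about a third of the time and 0 otherwise. `q_update` is an illustrative helper, and the small alpha of 0.05 is chosen only to make the averaging effect easy to see; it is not the value used in the cell below.

```python
import numpy as np

def q_update(old_q, target, alpha):
    """Move old_q a fraction alpha of the way toward the new target."""
    return old_q + alpha * (target - old_q)

# Hypothetical noisy situation: the same (state, action) pair yields a
# target of 1 about a third of the time and 0 otherwise.
rng = np.random.default_rng(0)
targets = (rng.random(5000) < 1 / 3).astype(float)

overwrite, blended = 0.0, 0.0
for target in targets:
    overwrite = q_update(overwrite, target, alpha=1.0)   # same as plain overwriting
    blended = q_update(blended, target, alpha=0.05)      # small steps average out the noise

print("mean target :", targets.mean())
print("alpha = 1.0 :", overwrite)   # equals the last sample, so either 0.0 or 1.0
print("alpha = 0.05:", blended)     # a running average of recent targets
```

Setting alpha to 1 recovers the overwrite rule from the earlier cells, which is why a single unlucky slip could erase everything the agent had learned about a state.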

In [11]:
env = gym.make("FrozenLake-v0", is_slippery=True)

Q_table = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.95
scores = []

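epsilon = 1            # start fully random
epsilon_decay = 0.995  # multiply epsilon by this after every episode
epsilon_min = 0.01     # intended floor for epsilon; note the loop below never applies it
alpha = 0.85           # learning rate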

for episode in range(10000):
    state = env.reset()
    score = 0
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
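            # Exploit: choose among the highest-valued actions, breaking ties at random
            best_actions = np.where(Q_table[state, :] == np.max(Q_table[state, :]))[0]
            action = np.random.choice(best_actions)
        next_state, reward, done, info = env.step(action)
        # One-step target, then move the old estimate a fraction alpha toward it
        Q_value = reward + gamma * np.max(Q_table[next_state, :])
        Q_table[state, action] = Q_table[state, action] + alpha * (Q_value - Q_table[state, action])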
        score += reward
        state = next_state     
    scores.append(score)
    epsilon = epsilon*epsilon_decay
    if episode>99:
        if np.mean(scores[-10:])==1:
            print("Solved!")
            print("Average per past 10 games: " + str(np.mean(scores[-10:])))
            print("Number of games played: " + str(episode))
            break
     
    if episode%100==0:
        print('Average success rate: ', np.mean(scores), ' Epsilon: ', epsilon)
if np.mean(scores[-10:])!=1:
    print('Failed')
env.close()

Average success rate:  0.0  Epsilon:  0.995
Average success rate:  0.0594059405940594  Epsilon:  0.6027415843082742
Average success rate:  0.05472636815920398  Epsilon:  0.36512303261753626
Average success rate:  0.05647840531561462  Epsilon:  0.2211807388415433
Average success rate:  0.07481296758104738  Epsilon:  0.13398475271138335
Average success rate:  0.09780439121756487  Epsilon:  0.0811640021330769
Average success rate:  0.11314475873544093  Epsilon:  0.04916675299948831
Average success rate:  0.12838801711840228  Epsilon:  0.029783765425331846
Average success rate:  0.15355805243445692  Epsilon:  0.018042124582040707
Average success rate:  0.19089900110987792  Epsilon:  0.010929385683282892
Average success rate:  0.22977022977022976  Epsilon:  0.0066206987359377885
Solved!
Average per past 10 games: 1.0
Number of games played: 1019


In [12]:
q_table_is_slippery_solved = np.round(Q_table,3)
q_table_is_slippery_solved

array([[0.297, 0.07 , 0.028, 0.062],
       [0.003, 0.004, 0.001, 0.28 ],
       [0.004, 0.006, 0.012, 0.178],
       [0.   , 0.01 , 0.011, 0.074],
       [0.38 , 0.008, 0.016, 0.001],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.031, 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.002, 0.011, 0.013, 0.755],
       [0.003, 0.866, 0.   , 0.003],
       [0.746, 0.004, 0.002, 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.008, 0.   , 0.747, 0.014],
       [0.118, 0.958, 0.107, 0.167],
       [0.   , 0.   , 0.   , 0.   ]])
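
A convenient way to inspect this table is to keep only the best Q-value in each state and lay the results out on the 4x4 map. The sketch below copies the table from the output above; the variable name `state_values` is our own.

```python
import numpy as np

# Q-table copied from the output above (rows = states 0..15, columns = actions).
q_table_is_slippery_solved = np.array([
    [0.297, 0.07 , 0.028, 0.062],
    [0.003, 0.004, 0.001, 0.28 ],
    [0.004, 0.006, 0.012, 0.178],
    [0.   , 0.01 , 0.011, 0.074],
    [0.38 , 0.008, 0.016, 0.001],
    [0.   , 0.   , 0.   , 0.   ],
    [0.   , 0.   , 0.031, 0.   ],
    [0.   , 0.   , 0.   , 0.   ],
    [0.002, 0.011, 0.013, 0.755],
    [0.003, 0.866, 0.   , 0.003],
    [0.746, 0.004, 0.002, 0.   ],
    [0.   , 0.   , 0.   , 0.   ],
    [0.   , 0.   , 0.   , 0.   ],
    [0.008, 0.   , 0.747, 0.014],
    [0.118, 0.958, 0.107, 0.167],
    [0.   , 0.   , 0.   , 0.   ]])

# Best Q-value per state, arranged like the map (SFFF / FHFH / FFFH / HFFG).
state_values = q_table_is_slippery_solved.max(axis=1).reshape(4, 4)
print(state_values)
```

The holes and the goal stay at zero because an episode ends as soon as the agent enters them, so no action is ever taken (or updated) from those states. Unlike the non-slippery table, almost every reachable state now has non-zero entries for several actions.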