# <center>Frozen Lake
### <center>Using Q-learning to not die

<center>
S F F F <br>
F H F H <br>
F F F H <br>
H F F G <br>
</center>

S = starting position<br>
F = frozen, or safe<br>
H = hole, or death<br>
G = goal<br>

<img src='frozenlake.png'>

First, let's import all the necessary libraries. 

In [1]:
import gym
import numpy as np
import time
from IPython.display import clear_output
import pandas as pd

Next, let's watch an agent play the game by choosing random actions.

In [2]:
env = gym.make("FrozenLake-v0", is_slippery=False)
for game in range(10):
    state = env.reset()
    for t in range(100):
        clear_output(wait=True)
        env.render()
        time.sleep(0.2)
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        if done:
            break

  (Down)
SFFF
FHFH
FFFH
HFFG


### <center>Discuss
#### <center> What do you observe from watching the random agent? 
#### <center> Do you think this will be an easy environment to solve?

Now, let's try using Q-learning to train an agent to play the game.

We will start with a version of FrozenLake where the surface is not slippery. This is a deterministic environment: every action we choose will always move us in that direction.

You will need the following elements:
- <b>Q-table</b>:
    - Instantiate a table of zeros of size (number of states)x(number of actions)
    - You can use np.zeros() to create the array.
    
- <b>Gamma</b>:
    - The discount factor for future rewards.
    - Typically a number close to 1 (0.95-0.99) but feel free to try different values.
- <b>Scores array</b>:
    - An empty list to append scores from each game to.
    - Also, set a running score variable to zero at the start of each game.
- <b>Action selection</b>:
    - At each step, choose the best action.
    - At the beginning our table will be all zeros, so we need to choose at random whenever more than one action has the highest Q-value.
    - You can use np.max() to get the highest Q-value.
    - Then get the action for each of those highest Q-values with np.where(). (Note: np.where() returns a tuple where the first element is the array of actions.)
    - Then use np.random.choice() to select one of the best actions.
- <b>Q-value calculation</b>:
    - Using the Q-formula:
    <img src='q_formula.png'>
    - You will have to find the maximum Q-value at the next state.
    - You will want to set the appropriate value in your Q-table to this.
- <b>Reward tracking</b>:
    - Add the current reward to the running score total
    - After each game, store the game score in the scores list
- <b>State transition</b>
    - Set the next state as the current state.
- <b>Termination</b>
    - The environment is considered solved once the agent reaches the goal, without falling into a hole, for 10 episodes in a row.
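
The action-selection and Q-value steps above can be tried on a toy table before wiring them into the game loop. This is a minimal sketch, not the full exercise: the names `best_action_random_ties`, `Q_demo` and `gamma_demo` are made up for illustration, and the update assumes the deterministic form `reward + gamma * max(Q[next_state])`, which may differ in notation from the formula image. The sample transition (state 14, action 2 = Right, reaching the goal at state 15 with reward 1) follows from the map at the top when states are numbered row by row.

```python
import numpy as np

def best_action_random_ties(q_row):
    """Pick the action with the highest Q-value, breaking ties at random."""
    best_value = np.max(q_row)
    # np.where returns a tuple; element 0 holds the indices of the best actions.
    best_actions = np.where(q_row == best_value)[0]
    return int(np.random.choice(best_actions))

# Toy table: 16 states (4x4 grid) by 4 actions, all zeros to start.
Q_demo = np.zeros((16, 4))
gamma_demo = 0.95

# With an all-zero row every action ties, so any of 0..3 can be returned.
first_pick = best_action_random_ties(Q_demo[0])

# One deterministic update for a single transition (illustrative values).
state, action, reward, next_state = 14, 2, 1.0, 15
Q_demo[state, action] = reward + gamma_demo * np.max(Q_demo[next_state])

# After the update, action 2 is the unique best action in state 14.
print(Q_demo[14], best_action_random_ties(Q_demo[14]))
```

Comparing the row against its maximum, rather than calling `np.argmax`, matters here because `np.argmax` always returns the first maximal index and would send an untrained agent Left on every step.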

In [3]:
env = gym.make("FrozenLake-v0", is_slippery=False)

Q_table = None
gamma = None
scores = []

## We will play for a maximum of 10000 games.
for episode in range(10000):
    state = env.reset()
    score = 0
    ## Reset the done flag so the loop below runs for every new game
    done = False
    while not done:
        ## Choose action
        
        next_state, reward, done, info = env.step(action)
        
        ## Calculate Q-value
        
        ## Set appropriate value in Q-table to Q-value
        
        ## Add current reward to running score total
        
        ## Set the current state to the next_state
        
    ## Add the score for this game to the scores list
    
    
    ## If we have played at least 10 games, check if the last 10 games resulted in success
    if episode>9:
        if np.mean(scores[-10:])==1:
            print("Solved!")
            print("Average per past 100 games: " + str(np.mean(scores[-100:])))
            print("Number of games played: " + str(episode))
            break
            
if np.mean(scores[-10:])!=1:
    print('Failed')
env.close()


Failed


Take a look at your Q-table now.

In [4]:
q_table_not_slippery = np.round(Q_table,3)
q_table_not_slippery

TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'

### <center>Discuss
#### <center> How did your agent do? 
#### <center> What does your Q-table look like?

Now we are going to try a stochastic version of our environment by setting is_slippery to True. There is now a chance that we will slide on the ice and move in a different direction than the one we chose.
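
To get a feel for what sliding means, here is a small stand-alone simulation. It is a sketch under an assumption: that the slippery lake behaves as in the classic Gym implementation, where the intended direction and the two perpendicular directions are each taken with probability 1/3. The helper names `slippery_outcomes` and `sample_slip` are made up for illustration and are not part of gym.

```python
import numpy as np

# Action indices used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def slippery_outcomes(action):
    """Directions the agent may actually move in (assumed equally likely)."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_slip(action, rng):
    """Sample the direction actually taken for an intended action."""
    return int(rng.choice(slippery_outcomes(action)))

rng = np.random.default_rng(0)
intended = 2  # try to move Right
counts = np.zeros(4, dtype=int)
for _ in range(3000):
    counts[sample_slip(intended, rng)] += 1

# The opposite direction (Left) never occurs; the other three share the moves.
for name, count in zip(ACTION_NAMES, counts):
    print(name, count)
```

Under this assumption the agent only moves where it intended about a third of the time, so a single good or bad outcome says much less about an action than it did in the deterministic version.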

### <center>Discuss
#### <center> How do you think the challenge will change in the new, slippery version of the environment?


Copy your code from above and paste below with the following changes:
- <b>is_slippery</b>:
    - Set to True

In [None]:
## Code for slippery environment agent

Take a look at your Q-table now.

In [None]:
q_table_is_slippery = np.round(Q_table,3)
q_table_is_slippery

### <center>Discuss
#### <center> How did your agent perform in the slippery environment?
#### <center> Why is performance different in this environment?


## <center>Solution

#### <center>Epsilon-greedy policy
<center>Makes the agent explore the environment more at first and rely increasingly on its learned knowledge later on.
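
As a rough illustration of this policy, the sketch below picks a random action with probability epsilon and a best-known action otherwise, then estimates how long a multiplicative decay takes to reach its floor. The function name `epsilon_greedy` and the variable `episodes_to_floor` are made up for illustration; the values 1, 0.995 and 0.01 are the ones used in the solution cell.

```python
import math
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit with random tie-breaking."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    best_actions = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best_actions))

epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01

# Number of episodes until epsilon * decay**n first drops to epsilon_min.
episodes_to_floor = math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))
print("Episodes until epsilon reaches its floor:", episodes_to_floor)

rng = np.random.default_rng(0)
q_row = np.array([0.0, 0.2, 0.9, 0.1])
# With epsilon = 0 the choice is purely greedy, so action 2 is always picked.
print(epsilon_greedy(q_row, 0.0, rng))
```

A slower decay (a value closer to 1) keeps the agent exploring for more episodes, which tends to matter more in the slippery environment where single outcomes are noisy.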

#### <center>Learning rate
<center>Allows the agent to accumulate knowledge rather than overwriting previous Q-values with each new experience.

<center>$Q(s, a) \leftarrow Q(s, a) + \alpha \left( R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
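
A short worked example may make the effect of alpha concrete. The helper `q_update` and the reward sequence are made up for illustration: they imagine the same state-action pair being tried three times, each time landing in a terminal state (so the next-state maximum is 0), with the slippery ice producing a success, a failure, then a success.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning step: move q_sa a fraction alpha toward the target."""
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)

alpha, gamma = 0.85, 0.95

q = 0.0
history = []
for reward in [1.0, 0.0, 1.0]:  # hypothetical outcomes of the same action
    q = q_update(q, reward, 0.0, alpha, gamma)
    history.append(q)
    print(round(q, 4))

# With alpha = 1 the old estimate is discarded and replaced by the target.
overwritten = q_update(0.3, 1.0, 0.0, 1.0, gamma)
```

With alpha below 1 the failed attempt lowers the estimate without erasing it, whereas alpha = 1 would reset it to 0 after the single failure.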

In [None]:
env = gym.make("FrozenLake-v0", is_slippery=True)

Q_table = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.95
scores = []

epsilon = 1
epsilon_decay = 0.995
epsilon_min = 0.01
alpha = 0.85

for episode in range(10000):
    state = env.reset()
    score = 0
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.random.choice(np.where(Q_table[state,:]==np.max(Q_table[state,:]))[0])
        next_state, reward, done, info = env.step(action)
        Q_value = reward + gamma * np.max(Q_table[next_state, :])
        Q_table[state, action] = Q_table[state, action] + alpha*(Q_value-Q_table[state,action])
        score += reward
        state = next_state     
    scores.append(score)
    # Decay epsilon after each game, but never below epsilon_min
    epsilon = max(epsilon_min, epsilon*epsilon_decay)
    if episode>99:
        if np.mean(scores[-10:])==1:
            print("Solved!")
            print("Average per past 10 games: " + str(np.mean(scores[-10:])))
            print("Number of games played: " + str(episode))
            break
     
    if episode%100==0:
        print('Average success rate: ', np.mean(scores), ' Epsilon: ', epsilon)
if np.mean(scores[-10:])!=1:
    print('Failed')
env.close()

In [None]:
q_table_is_slippery_solved = np.round(Q_table,3)
q_table_is_slippery_solved
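
One way to read a Q-table is to turn it into a grid of arrows showing the highest-valued action in each state. The sketch below assumes states are numbered row by row over the 4x4 map and that actions are 0=Left, 1=Down, 2=Right, 3=Up; `policy_grid`, `ARROWS` and `demo_q` are made-up names, and `demo_q` is a hand-built stand-in for a learned table, not a real training result.

```python
import numpy as np

ARROWS = np.array(["<", "v", ">", "^"])  # 0=Left, 1=Down, 2=Right, 3=Up
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]

def policy_grid(q_table, lake=LAKE):
    """Render the greedy action per state; holes and the goal keep their letter."""
    n_rows, n_cols = len(lake), len(lake[0])
    # np.argmax resolves ties to the first index, so all-zero rows show "<".
    best = ARROWS[np.argmax(q_table, axis=1)].reshape(n_rows, n_cols)
    rows = []
    for r in range(n_rows):
        cells = [lake[r][c] if lake[r][c] in "HG" else best[r, c]
                 for c in range(n_cols)]
        rows.append(" ".join(cells))
    return "\n".join(rows)

# Stand-in table in which moving Right has the highest value everywhere.
demo_q = np.zeros((16, 4))
demo_q[:, 2] = 1.0
print(policy_grid(demo_q))
```

Passing `Q_table` from the solution cell instead of `demo_q` should show whether the learned arrows steer around the holes.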