# Value Iteration Q-Learning Agent to play the Frozen Lake Game

*Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:*

<center> SFFF </center> 
<center>FHFH </center>
<center>FFFH </center>
<center>HFFG </center>

This grid is our environment where S is the agent's starting point, and it's safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that's not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.

The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.

- S: Starting point, Safe. 0 Reward
- F: Frozen Lake, Safe. 0 Reward
- H: Hole, Unsafe. Episode Ends. 0 Reward
- G: Goal. Episode ends. 1 Reward
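
As a minimal sketch of how the grid relates to the integer states used later, the snippet below assumes the usual row-major numbering (state = row * 4 + column). The names `GRID`, `state_to_cell`, and `tile_at` are hypothetical helpers introduced only for illustration; they are not part of Gym.

```python
# The 4x4 map from the description above, one string per row.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]

def state_to_cell(state, n_cols=4):
    """Convert a state index into (row, column), assuming row-major numbering."""
    return divmod(state, n_cols)

def tile_at(state):
    """Return the tile letter (S, F, H or G) for a state index."""
    row, col = state_to_cell(state)
    return GRID[row][col]

n_states = len(GRID) * len(GRID[0])
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = [s for s in range(n_states) if tile_at(s) == "G"]

print("Hole states:", holes)
print("Goal state:", goal)
```

Hole and goal states are terminal, so no action is ever taken from them and their rows in a Q-table stay at zero.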


In [None]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output
import matplotlib.pyplot as plt

## Creating the environment
Using OpenAI Gym, we can load the Frozen Lake environment and render it with the following code:

In [None]:
env = gym.make("FrozenLake-v0")
env.render()


SFFF
FHFH
FFFH
HFFG


## Constructing the Q Table
We're now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table equals the size of the environment's state space, and the number of columns equals the size of the action space. We can get these sizes from `env.observation_space.n` and `env.action_space.n`, as shown below.

In [None]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size)) #Initialize q_table to a np array of zeros of shape state_space_size x action_space_size
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Initialize Q-Learning Parameters

Initialize the parameters for the Q-learning agent. Challenge: tune these values to get better results than the original settings produce.

In [None]:
num_episodes = 10000 
max_steps_per_episode = 100 #Maximum number of moves per episode

learning_rate = 0.05 #Learning rate for updating the q values.
discount_rate = 0.99

exploration_rate = 1 #Initial exploration rate
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001 
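
To get a feel for the exploration parameters, here is a small sketch of the exponential decay schedule that the training loop applies after each episode. The function name `exploration_rate_at` is a hypothetical helper; its default arguments mirror the values set above.

```python
import math

def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.001):
    """Exploration rate after `episode` episodes under exponential decay."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

# Print the schedule at a few checkpoints across the 10000 training episodes.
for episode in (0, 1000, 2500, 5000, 10000):
    print(episode, round(exploration_rate_at(episode), 4))
```

The rate starts at `max_exploration_rate` and approaches `min_exploration_rate` without ever dropping below it, so the agent keeps a small amount of random exploration even late in training. A larger `exploration_decay_rate` shifts the agent toward exploitation sooner.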

## Train the Q-Learning agent

In [None]:
rewards_all_episodes = [] #This will store the rewards in each episode
for episode in range(num_episodes): #Iterate over each episode
    state = env.reset() #Reset the environment before starting each episode
    done = False #The done variable tells if the episode is over yet
    rewards_current_episode = 0
    for step in range(max_steps_per_episode): #Iterate over the steps

        ##### Exploration-exploitation trade-off #####
        #Generate a random sample between 0 and 1, explore if this value is less than the exploration rate, exploit otherwise
        exploration_rate_threshold = random.uniform(0,1) 
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample() #Hint: You want your action to be randomly chosen from {0,1,2,3} in this case
            
        new_state, reward, done, info = env.step(action) #Get the updated environment
        
        #Update the Q-table with the Q-learning rule: a weighted average of the old value and the learned value
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        state = new_state
        rewards_current_episode += reward 
        if done:
            break
          
    #Update the exploration rate        
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode) 
    rewards_all_episodes.append(rewards_current_episode) 
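
The update inside the loop can be isolated into a single step, which makes it easier to see how reward flows backward from the goal. This is a sketch only: `q_update` is a hypothetical helper, and the two transitions below are hand-picked examples rather than output from the environment.

```python
import numpy as np

def q_update(q_table, state, action, reward, new_state,
             learning_rate=0.05, discount_rate=0.99):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Learned value: immediate reward plus discounted best value of the next state.
    learned_value = reward + discount_rate * np.max(q_table[new_state, :])
    q_table[state, action] = ((1 - learning_rate) * q_table[state, action]
                              + learning_rate * learned_value)
    return q_table[state, action]

q = np.zeros((16, 4))

# Moving Right (action 2) from state 14 into the goal (state 15) with reward 1.
first = q_update(q, state=14, action=2, reward=1.0, new_state=15)

# Moving Right from state 13 into state 14 gives no reward, but the value
# learned for state 14 now propagates one step back.
second = q_update(q, state=13, action=2, reward=0.0, new_state=14)

print(first, second)
```

Because every reward except the goal is zero, Q-values near the start only become non-zero after the goal's value has been propagated back through many such updates.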

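Look at the average reward per thousand episodes. Because the only reward is 1 for reaching the goal, this average is the fraction of episodes in which the agent retrieved the frisbee. It should increase over time if your agent is learning properly.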

In [None]:
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000) #Integer division: the number of sections must be a whole number
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.04300000000000003
2000 :  0.19800000000000015
3000 :  0.4070000000000003
4000 :  0.5740000000000004
5000 :  0.6180000000000004
6000 :  0.6690000000000005
7000 :  0.6980000000000005
8000 :  0.7040000000000005
9000 :  0.6750000000000005
10000 :  0.7160000000000005


In [None]:
#Final Q Table
print(q_table)

[[0.57167596 0.54216396 0.53900705 0.5338552 ]
 [0.27621282 0.32140462 0.30082611 0.52869067]
 [0.44235584 0.40940537 0.40430461 0.49758144]
 [0.3211564  0.36309904 0.25882046 0.47944005]
 [0.58817814 0.39342404 0.3525503  0.36208995]
 [0.         0.         0.         0.        ]
 [0.21092903 0.2160294  0.39917689 0.16138619]
 [0.         0.         0.         0.        ]
 [0.41976919 0.39802194 0.41382696 0.62645387]
 [0.35390398 0.67461136 0.43783381 0.33740017]
 [0.60303274 0.42049938 0.43218607 0.30058745]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.39649383 0.56792192 0.78893972 0.5812407 ]
 [0.72221288 0.90018348 0.79199654 0.77800173]
 [0.         0.         0.         0.        ]]
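
One way to read the final Q-table is to take the best action in each state and lay the result out on the grid. The sketch below copies the values printed above and assumes Gym's FrozenLake action encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up); `ACTION_SYMBOLS`, `GRID`, and `policy_grid` are hypothetical names used only for this illustration.

```python
import numpy as np

# Final Q-table, copied from the printed output above.
q_table = np.array([
    [0.57167596, 0.54216396, 0.53900705, 0.5338552],
    [0.27621282, 0.32140462, 0.30082611, 0.52869067],
    [0.44235584, 0.40940537, 0.40430461, 0.49758144],
    [0.3211564, 0.36309904, 0.25882046, 0.47944005],
    [0.58817814, 0.39342404, 0.3525503, 0.36208995],
    [0.0, 0.0, 0.0, 0.0],
    [0.21092903, 0.2160294, 0.39917689, 0.16138619],
    [0.0, 0.0, 0.0, 0.0],
    [0.41976919, 0.39802194, 0.41382696, 0.62645387],
    [0.35390398, 0.67461136, 0.43783381, 0.33740017],
    [0.60303274, 0.42049938, 0.43218607, 0.30058745],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.39649383, 0.56792192, 0.78893972, 0.5812407],
    [0.72221288, 0.90018348, 0.79199654, 0.77800173],
    [0.0, 0.0, 0.0, 0.0],
])

ACTION_SYMBOLS = ["<", "v", ">", "^"]  # Left, Down, Right, Up
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Greedy policy: the highest-valued action in every state.
greedy_actions = np.argmax(q_table, axis=1)

def policy_grid(actions, grid=GRID):
    """Render the greedy action per cell, keeping H and G as terminal markers."""
    lines = []
    for r, row in enumerate(grid):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)
            else:
                cells.append(ACTION_SYMBOLS[actions[r * len(row) + c]])
        lines.append(" ".join(cells))
    return lines

for line in policy_grid(greedy_actions):
    print(line)
```

Some of the greedy moves point away from the goal. On slippery ice the agent does not always move in the intended direction, so steering away from a neighbouring hole can be the safer choice.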


## Watch the trained agent play Frozen Lake

In [None]:

for episode in range(3):
    state = env.reset()
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)
    for step in range(max_steps_per_episode):        
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        action = np.argmax(q_table[state,:])   #Since this is not part of training, we want the agent to exploit all the time.     
        new_state, reward, done, info = env.step(action)
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                print("****You reached the goal!****")
                time.sleep(3)
            else:
                print("****You fell through a hole!****")
                time.sleep(3)
                clear_output(wait=True)
            break
        state = new_state
env.close()

  (Right)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
****You reached the goal!****
