##### Description of the game "Frozen Lake"

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

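The surface is described using a grid like the following:

    SFFF       (S: starting point, safe)
    FHFH       (F: frozen surface, safe)
    FFFH       (H: hole, fall to your doom)
    HFFG       (G: goal, where the frisbee is located)

https://gym.openai.com/envs/FrozenLake-v0/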
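
To make the grid concrete, here is a minimal sketch of how the 16 states map onto tiles. It assumes the default 4x4 map (the one that `env.render()` prints later in this notebook) and row-by-row state numbering; the names `GRID`, `state_to_cell` and `tile_at` are our own helpers, not part of gym.

```python
# Assumed default 4x4 FrozenLake map, one string per row
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])

def state_to_cell(state):
    """Convert a flat state index (0..15) into a (row, col) pair."""
    return divmod(state, N_COLS)

def tile_at(state):
    """Return the tile letter (S, F, H or G) for a flat state index."""
    row, col = state_to_cell(state)
    return GRID[row][col]

# States that are holes under this layout
holes = [s for s in range(16) if tile_at(s) == "H"]
print(holes)                    # [5, 7, 11, 12]
print(tile_at(0), tile_at(15))  # S G
```

The hole states are terminal, so no action is ever taken from them and their rows in the Q-table never receive an update.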

In [1]:
import numpy as np
import random
from IPython.display import clear_output
import time
import gym

In [2]:
env = gym.make('FrozenLake-v0')

In [3]:
type(env)

gym.wrappers.time_limit.TimeLimit

Now we will construct the Q-table, with one row per state and one column per action, and initialize all the Q-values to zero.

In [4]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


Next, we create and initialize the parameters needed to implement the Q-learning algorithm.

In [20]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.1
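
To get a feel for how quickly exploration fades with these settings, the sketch below evaluates the same exponential decay formula that the training loop applies after each episode. The function name `exploration_rate_at` is our own, and the second decay value (0.001) is only a hypothetical alternative for comparison; it is not used anywhere in this notebook.

```python
import math

def exploration_rate_at(episode, decay_rate, min_rate=0.01, max_rate=1.0):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

# 0.1 is the notebook's value; 0.001 is a hypothetical slower schedule
for decay_rate in (0.1, 0.001):
    rates = [round(exploration_rate_at(e, decay_rate), 3) for e in (0, 50, 1000, 9999)]
    print(decay_rate, rates)
```

With a decay rate of 0.1 the exploration rate is already below 0.02 after about 50 episodes, so nearly all of the 10000 training episodes act greedily on the current Q-table.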


Implementing the Algorithm
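
Before the full loop, it may help to look at a single Q-value update in isolation. The sketch below applies the same weighted-average rule as the training code, using made-up numbers; `q_update` is a helper name of our own and the inputs are purely illustrative.

```python
def q_update(q_old, reward, max_q_next, learning_rate=0.1, discount_rate=0.99):
    """One Q-learning update: blend the old value with the new target."""
    target = reward + discount_rate * max_q_next
    return q_old * (1 - learning_rate) + learning_rate * target

# Hypothetical step onto the goal (reward 1) from an untrained entry
print(q_update(q_old=0.0, reward=1.0, max_q_next=0.0))

# Hypothetical unrewarded step into a state whose best Q-value is 0.5
print(q_update(q_old=0.2, reward=0.0, max_q_next=0.5))
```

With a learning rate of 0.1, each update moves the stored value only one tenth of the way toward the target, which smooths out the randomness of the slippery ice.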

In [21]:
rewards_all_episodes = list()

# Q-Learning Algorithm
for episode in range(num_episodes):
    state = env.reset()
    
    done = False
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        
        # Exploration-Exploitation trade-off
        exploration_rate_threshold = random.uniform(0,1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:])
        else:
            action = env.action_space.sample()
            
        new_state, reward, done, info = env.step(action)
        
        # Update Q-Table for Q(s,a)
        q_table[state,action] = q_table[state,action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
                
        state = new_state
        rewards_current_episode += reward
        
        if done:
            break
            
    # Exploration Rate decay 
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    
    rewards_all_episodes.append(rewards_current_episode)
    
# Calculate and print the average reward per 1000 episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
                                 
print("********* Average Reward per thousand episodes *********\n")
for r in rewards_per_thousand_episodes:
    print(count, ":", str(sum(r/1000)))
    count += 1000
                                
# Print updated Q-table
print("\n\n********** Q-table ***********\n\n")
print(q_table)

********* Average Reward per thousand episodes *********

1000 : 0.6750000000000005
2000 : 0.6740000000000005
3000 : 0.6670000000000005
4000 : 0.6950000000000005
5000 : 0.7010000000000005
6000 : 0.6520000000000005
7000 : 0.6750000000000005
8000 : 0.6670000000000005
9000 : 0.7040000000000005
10000 : 0.6870000000000005


********** Q-table ***********


[[0.55616189 0.51399505 0.51051543 0.52904098]
 [0.37103296 0.33522746 0.30874892 0.4967635 ]
 [0.42479016 0.42368968 0.40863647 0.46898985]
 [0.23433073 0.32958115 0.27589545 0.45005899]
 [0.57052725 0.43469314 0.36495402 0.42802347]
 [0.         0.         0.         0.        ]
 [0.18340733 0.1678329  0.40207604 0.16206023]
 [0.         0.         0.         0.        ]
 [0.37940093 0.31980009 0.43297366 0.60365794]
 [0.477043   0.65009997 0.46648155 0.38779166]
 [0.61797683 0.35795662 0.3787674  0.40144973]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.51737766 0.55179143 0.7365094  
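
A Q-table is easier to read once it is turned into a policy. As a rough sketch, the code below copies a few rows from the printed table and picks the highest-valued action in each. It assumes FrozenLake's action encoding (0 = Left, 1 = Down, 2 = Right, 3 = Up); `q_rows` and `greedy_action` are names of our own.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # assumed FrozenLake encoding 0..3

# A few rows copied from the Q-table printed above (state -> Q-values)
q_rows = {
    0: [0.55616189, 0.51399505, 0.51051543, 0.52904098],
    1: [0.37103296, 0.33522746, 0.30874892, 0.4967635],
    4: [0.57052725, 0.43469314, 0.36495402, 0.42802347],
    5: [0.0, 0.0, 0.0, 0.0],  # a hole: terminal, never updated
}

def greedy_action(q_values):
    """Index of the largest Q-value (first index on ties, like np.argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

for state, q_values in q_rows.items():
    print(state, ACTION_NAMES[greedy_action(q_values)])
```

The all-zero rows correspond to the hole states, so the action reported for them carries no meaning.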

We have now trained our agent, so it is time to watch it play the game.

We will watch three episodes.

In [22]:
for episode in range(3):
    state = env.reset()
    done = False
    print('***** Episode ', episode+1 ,'*****\n\n\n\n')
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        action = np.argmax(q_table[state,:])
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait=True)
            env.render()
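            if reward == 1:
                print("*****You reached the goal******")
            else:
                # Reward 0 when the episode ends: the agent fell in a hole
                # (or the environment's time limit cut the episode short)
                print('*****You fell in the hole******')
            time.sleep(3)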
            clear_output(wait=True)
            break
            
        state = new_state
        
env.close()
        
        
        
          

  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
*****You fell in the hole******
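
One caveat about the final message: the watch loop prints "You fell in the hole" whenever an episode ends with a reward other than 1. Since `env` is wrapped in a `TimeLimit` wrapper (see the `type(env)` output near the top), an episode can presumably also end because the step limit was reached while the agent was still standing on a frozen tile. The sketch below shows one way to tell these cases apart; `classify_outcome` is a hypothetical helper, and the flattened map is the assumed default 4x4 layout.

```python
GRID = "SFFFFHFHFFFHHFFG"  # assumed 4x4 map, flattened row by row

def classify_outcome(final_state, reward):
    """Hypothetical helper: label how an episode ended."""
    if reward == 1:
        return "goal"
    if GRID[final_state] == "H":
        return "hole"
    return "step limit"

print(classify_outcome(15, 1))  # goal
print(classify_outcome(5, 0))   # hole
print(classify_outcome(9, 0))   # step limit
```

In the watch loop, the state to pass in would be `new_state` at the moment `done` becomes true.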
