Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that 
left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted.
If you step into one of those holes, you'll fall into the freezing water. 
At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake 
and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. 
The surface is described using a grid like the following:

SFFF
FHFH
FFFH
HFFG

This grid is our environment where S is the agent’s starting point, and it’s safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that’s not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.

The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.

| State | Description | Reward |
|-------|-------------|--------|
| S | Agent’s starting point - safe | 0 |
| F | Frozen surface - safe | 0 |
| H | Hole - game over | 0 |
| G | Goal - game over | 1 |
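
To make the grid, the state numbering, and the slippery movement concrete, here is a minimal sketch that does not use Gym itself. It assumes the dynamics of Gym's slippery FrozenLake: states are numbered row by row from 0 to 15, actions are 0 = Left, 1 = Down, 2 = Right, 3 = Up, and the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3. The names `GRID`, `to_state`, and `slippery_step` are illustrative and not part of the Gym API.

```python
import random

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N = len(GRID)

# Assumed action encoding (matches Gym's FrozenLake)
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def to_state(row, col):
    """Map a grid cell to its state index (row-major order)."""
    return row * N + col


def slippery_step(state, action, rng=random):
    """Hypothetical stand-in for env.step() on slippery ice."""
    row, col = divmod(state, N)
    # The move actually taken is the intended one or a perpendicular one
    actual = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    d_row, d_col = MOVES[actual]
    # Moving into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    tile = GRID[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "HG"
    return to_state(row, col), reward, done


# Holes sit at these state indices
hole_states = [i for i, tile in enumerate("".join(GRID)) if tile == "H"]
print(hole_states)
print(slippery_step(to_state(3, 2), RIGHT))
```

From the cell just left of the goal, choosing Right reaches the goal only about one time in three; otherwise the agent bumps into the bottom wall or slides up.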

In [50]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

In [51]:
env = gym.make("FrozenLake-v0")

In [52]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n
# Initialize the Q-table with zeros: one row per state, one column per action
q_table = np.zeros((state_space_size, action_space_size))



In [53]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Initializing Q-Learning Parameters

In [64]:
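num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1            # weight given to new information in each update
discount_rate = 0.99           # weight given to future rewards

exploration_rate = 1           # start by exploring on every step
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.0001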
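
To see what these exploration parameters imply, the sketch below evaluates the exponential decay formula that the training loop applies at the end of each episode. The helper name `exploration_rate_at` is hypothetical; the parameter values are the ones set above.

```python
import math

min_exploration_rate = 0.01
max_exploration_rate = 1.0
exploration_decay_rate = 0.0001


def exploration_rate_at(episode):
    """Exploration rate after the given episode under exponential decay."""
    return min_exploration_rate + (
        max_exploration_rate - min_exploration_rate
    ) * math.exp(-exploration_decay_rate * episode)


for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 3))
```

With a decay rate of 0.0001, the exploration rate is still about 0.37 after 10,000 episodes, so the agent keeps taking random actions on roughly a third of its steps until the end of training.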


In [65]:
rewards_all_episodes = []

In [66]:
# Q-learning training loop.
# Assumes the classic Gym API (gym < 0.26): reset() returns the state
# and step() returns four values.
for episode in range(num_episodes):
    # Reset the environment and the per-episode bookkeeping
    state = env.reset()
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Draw a random number in [0, 1] to decide whether the agent
        # exploits or explores in this time step
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # Exploit: choose the action with the highest Q-value for the current state
            action = np.argmax(q_table[state, :])
        else:
            # Explore: sample an action at random
            action = env.action_space.sample()

        # Take the action by calling step() on the env object
        new_state, reward, done, info = env.step(action)

        # Update Q(state, action) as a weighted average of the old value and the learned value
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done:
            break

    # Decay the exploration rate exponentially after each episode
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    rewards_all_episodes.append(rewards_current_episode)
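
As a worked example of the update rule used in the loop, the sketch below applies it by hand to a few transitions near the goal. The helper `q_update` and the chosen transitions are illustrative assumptions; the formula and the `learning_rate` and `discount_rate` values come from the cells above.

```python
import numpy as np

learning_rate = 0.1
discount_rate = 0.99


def q_update(q_table, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the new Q-value."""
    learned_value = reward + discount_rate * np.max(q_table[new_state, :])
    q_table[state, action] = (
        q_table[state, action] * (1 - learning_rate)
        + learning_rate * learned_value
    )
    return q_table[state, action]


q = np.zeros((16, 4))

# State 14, action 2 (Right) reaches the goal (state 15) with reward 1
first = q_update(q, state=14, action=2, reward=1.0, new_state=15)
second = q_update(q, state=14, action=2, reward=1.0, new_state=15)

# State 13, action 2 (Right) lands on state 14 with no immediate reward;
# the value it picks up comes entirely from the discounted Q-value of state 14
upstream = q_update(q, state=13, action=2, reward=0.0, new_state=14)

print(first, second, upstream)
```

The first update moves Q(14, Right) from 0 to 0.1, the second to 0.19, and the update for state 13 yields 0.1 × 0.99 × 0.19 ≈ 0.0188. This is how the reward at the goal propagates backwards through the table one step at a time.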

In [67]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.01900000000000001
2000 :  0.01900000000000001
3000 :  0.028000000000000018
4000 :  0.03300000000000002
5000 :  0.03800000000000003
6000 :  0.04300000000000003
7000 :  0.060000000000000046
8000 :  0.08900000000000007
9000 :  0.10200000000000008
10000 :  0.10200000000000008


In [63]:
print(q_table)

[[0.47857402 0.45673019 0.44683036 0.45397963]
 [0.28675429 0.24693639 0.22332269 0.36779719]
 [0.31511233 0.27572929 0.28724263 0.27804774]
 [0.09591841 0.1920452  0.06280701 0.10178862]
 [0.48755724 0.39613361 0.22822045 0.41665569]
 [0.         0.         0.         0.        ]
 [0.11386927 0.18522519 0.29295839 0.08119003]
 [0.         0.         0.         0.        ]
 [0.41541809 0.31879924 0.27609227 0.5120276 ]
 [0.38343411 0.54712532 0.36512931 0.31852695]
 [0.54317864 0.36385362 0.33689954 0.37188045]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.37765347 0.49803681 0.69345935 0.43302213]
 [0.66871024 0.85350789 0.74855357 0.71962128]
 [0.         0.         0.         0.        ]]
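
A Q-table is easier to read as a policy. The sketch below takes the values printed above, picks the highest-valued action in each state, and lays the result out on the 4x4 grid. It assumes Gym's FrozenLake action order (0 = Left, 1 = Down, 2 = Right, 3 = Up) and marks rows that stayed all-zero with `-`; here those are the holes and the goal, which are terminal states and are therefore never updated.

```python
import numpy as np

ACTION_LETTERS = ["L", "D", "R", "U"]  # assumed order: Left, Down, Right, Up

# Q-table values copied from the output above
q_table = np.array([
    [0.47857402, 0.45673019, 0.44683036, 0.45397963],
    [0.28675429, 0.24693639, 0.22332269, 0.36779719],
    [0.31511233, 0.27572929, 0.28724263, 0.27804774],
    [0.09591841, 0.1920452, 0.06280701, 0.10178862],
    [0.48755724, 0.39613361, 0.22822045, 0.41665569],
    [0.0, 0.0, 0.0, 0.0],
    [0.11386927, 0.18522519, 0.29295839, 0.08119003],
    [0.0, 0.0, 0.0, 0.0],
    [0.41541809, 0.31879924, 0.27609227, 0.5120276],
    [0.38343411, 0.54712532, 0.36512931, 0.31852695],
    [0.54317864, 0.36385362, 0.33689954, 0.37188045],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.37765347, 0.49803681, 0.69345935, 0.43302213],
    [0.66871024, 0.85350789, 0.74855357, 0.71962128],
    [0.0, 0.0, 0.0, 0.0],
])

greedy_actions = np.argmax(q_table, axis=1)   # best action per state
untouched = ~q_table.any(axis=1)              # rows that are still all zeros

policy_rows = []
for row in range(4):
    cells = []
    for col in range(4):
        state = row * 4 + col
        cells.append("-" if untouched[state] else ACTION_LETTERS[greedy_actions[state]])
    policy_rows.append(" ".join(cells))

print("\n".join(policy_rows))
```

Choices such as Left in the start state can look odd, but on slippery ice pushing against a wall is a plausible way to avoid sliding toward a hole.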
