# Intro to the Gym API

The tutorials are taken from https://deeplizard.com/learn/video/QK_PP_2KgGE


In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output


## Creating The Environment

Next, to create our environment, we just call `gym.make()` and pass the name of the environment we want to set up as a string. We'll be using the environment FrozenLake-v1. All the environments you can use here, with their corresponding names, are listed on [Gym's website](https://gym.openai.com/envs/#classic_control).


In [2]:
env = gym.make("FrozenLake-v1")




## Creating The Q-Table

We're now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table is equal to the size of the state space in the environment, and the number of columns is equal to the size of the action space. We can get this information using env.observation_space.n and env.action_space.n, as shown below. We can then use it to build the Q-table and fill it with zeros.


In [3]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))
q_table


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
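
To make the rows and columns of this table easier to read, here is a minimal sketch of how a row index and a column index can be mapped back to the game. It assumes the default 4x4 FrozenLake map, where states are numbered row by row starting from the top-left corner, and the action encoding 0=Left, 1=Down, 2=Right, 3=Up. The helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of Gym.

```python
# Assumption: 4x4 map, states numbered row by row from the top-left corner.
N_COLS = 4
# Assumption: FrozenLake's action encoding.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]


def state_to_cell(state, n_cols=N_COLS):
    """Convert a flat state index into a (row, column) grid position."""
    return divmod(state, n_cols)


def cell_to_state(row, col, n_cols=N_COLS):
    """Convert a (row, column) grid position into a flat state index."""
    return row * n_cols + col


print(state_to_cell(0))    # top-left cell
print(state_to_cell(15))   # bottom-right cell
print(ACTION_NAMES[2])     # name of the action stored in column 2
```

Under these assumptions, `q_table[15, 2]` would hold the Q-value for moving right from the bottom-right cell.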

## Initializing Q-Learning Parameters

Now, we're going to create and initialize all the parameters needed to implement the Q-learning algorithm.


In [4]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99       # gamma

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


## Coding The Q-Learning Algorithm Training Loop

Let's start from the top.

First, we create this list to hold all of the rewards we'll get from each episode. This will be so we can see how our game score changes over time. We'll discuss this more in a bit.


In [5]:
rewards_all_episodes = []


The following block outlines the structure of the Q-learning algorithm we're about to implement:

```python
# Q-learning algorithm
for episode in range(num_episodes):
    # initialize new episode params

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        # Take new action
        # Update Q-table
        # Set new state
        # Add new reward

    # Exploration rate decay
    # Add current episode reward to total rewards list
```


## For Each Episode

Let's get inside of our first loop. For each episode, we're going to first reset the state of the environment back to the starting state.


The done variable just keeps track of whether or not our episode is finished, so we initialize it to False when we first start the episode, and we'll see later where it will get updated to notify us when the episode is over.

Then, we need to keep track of the rewards within the current episode as well, so we set rewards_current_episode to 0 since we start out with no rewards at the beginning of each episode.


## For Each Time-Step

Now we're entering into the nested loop, which runs for each time-step within an episode. The remaining steps, until we say otherwise, will occur for each time-step.

### Exploration Vs. Exploitation

For each time-step within an episode, we set our exploration_rate_threshold to a random number between 0 and 1. This will be used to determine whether our agent explores or exploits the environment in this time-step. We discussed the details of this exploration-exploitation trade-off in a previous post of this series.

If the threshold is greater than the exploration_rate (which, remember, is initially set to 1), then our agent will exploit the environment and choose the action that has the highest Q-value in the Q-table for the current state. If, on the other hand, the threshold is less than or equal to the exploration_rate, then the agent will explore the environment and sample an action at random.
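
The rule described above can be sketched in isolation, without an environment. This is an illustrative version that uses only the standard library; the function name `choose_action` and the sample Q-values are made up for the example.

```python
import random


def choose_action(q_row, exploration_rate, rng=random):
    """Pick an action index for one state with an epsilon-greedy rule.

    q_row holds the Q-values of the current state, one per action.
    """
    threshold = rng.uniform(0, 1)
    if threshold > exploration_rate:
        # Exploit: index of the largest Q-value (the first one on ties).
        return max(range(len(q_row)), key=lambda action: q_row[action])
    # Explore: uniformly random action index.
    return rng.randrange(len(q_row))


rng = random.Random(0)  # seeded so that repeated runs give the same picks
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for a single state

greedy_picks = [choose_action(q_row, 0.0, rng) for _ in range(50)]
random_picks = [choose_action(q_row, 1.0, rng) for _ in range(200)]
print(set(greedy_picks))
print(sorted(set(random_picks)))
```

With an exploration rate of 1 the agent always explores, which is why the first episodes of training are essentially random play.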


## Taking Action

After our action is chosen, we then take that action by calling step() on our env object and passing our action to it. The function step() returns a tuple containing the new state, the reward for the action we took, whether or not the action ended our episode, and diagnostic information regarding our environment, which may be helpful for us if we end up needing to do any debugging.
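
One caveat worth hedging: newer Gym releases (0.26 and later) and Gymnasium return five values from step(), splitting the end-of-episode flag into `terminated` and `truncated`. If you run this notebook on such a version, a small adapter along the following lines may help. `unpack_step` is a hypothetical helper, and the tuples below are stand-ins rather than real environment output.

```python
def unpack_step(result):
    """Normalize a step() result to (state, reward, done, info).

    Accepts the 4-tuple described above as well as the 5-tuple
    (state, reward, terminated, truncated, info) of newer releases.
    """
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        # The episode is over if it ended naturally or hit a time limit.
        return state, reward, terminated or truncated, info
    state, reward, done, info = result
    return state, reward, done, info


# Stand-in tuples for illustration only.
old_style = (4, 0.0, False, {"prob": 1 / 3})
new_style = (15, 1.0, True, False, {"prob": 1 / 3})

print(unpack_step(old_style))
print(unpack_step(new_style))
```

In those newer versions, reset() also returns a `(state, info)` pair instead of the bare state, so the reset call would need the same kind of adjustment.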

## Update the Q-Value

After we observe the reward we obtained from taking the action from the previous state, we can then update the Q-value for that state-action pair in the Q-table. This is done using the formula we introduced in an earlier post, and remember, there we walked through a concrete example showing how to implement the Q-table update.

Here is the formula:

<img src="__ref/formulae-1.png" alt="" width="50%">
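
In words, the new Q-value is a weighted average: (1 - learning rate) times the old Q-value, plus the learning rate times the learned value, where the learned value is the reward plus the discounted maximum Q-value of the next state. The following sketch works through two updates by hand using the learning rate and discount rate chosen earlier in this notebook; the function name `updated_q_value` and the input numbers are made up for illustration.

```python
def updated_q_value(old_q, reward, max_next_q,
                    learning_rate=0.1, discount_rate=0.99):
    """Blend the old Q-value with the newly learned value."""
    learned_value = reward + discount_rate * max_next_q
    return (1 - learning_rate) * old_q + learning_rate * learned_value


# First visit to a state-action pair that leads straight to a reward of 1.
first = updated_q_value(old_q=0.0, reward=1.0, max_next_q=0.0)

# No immediate reward, but the next state already looks promising.
second = updated_q_value(old_q=0.5, reward=0.0, max_next_q=0.8)

print(round(first, 4))   # 0.1
print(round(second, 4))  # 0.5292
```

The second case shows how value propagates backwards: the Q-value rises even though the step itself earned no reward.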

## Transition To The Next State
Next, we set our current state to the new_state that was returned to us once we took our last action, and we then update the rewards from our current episode by adding the reward we received for our previous action.

We then check to see if our last action ended the episode for us, meaning, did our agent step in a hole or reach the goal? If the action did end the episode, then we jump out of this loop and move on to the next episode. Otherwise, we transition to the next time-step.

## Exploration Rate Decay
Once an episode is finished, we need to update our exploration_rate using exponential decay, which just means that the exploration rate decreases or decays at a rate proportional to its current value. We decay the exploration_rate using the formula in the training code below, which makes use of all the exploration rate parameters that we defined earlier.
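
To get a feel for how quickly exploration fades, the decay formula can be evaluated on its own for a few episode numbers. This sketch reuses the parameter values defined earlier in the notebook as defaults; the function name `exploration_rate_at` is hypothetical.

```python
import math


def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.001):
    """Exploration rate after the given episode, per the decay formula."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 4))
```

Strictly speaking, it is the gap between the current rate and min_exploration_rate that shrinks exponentially, so the rate approaches 0.01 but never drops below it.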

In [6]:
# for each episode
for episode in range(num_episodes):
    state = env.reset()
    done = False
    rewards_current_episode = 0

    # For each time step
    for step in range(max_steps_per_episode):
        # Exploitation or exploration
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        # Taking Action
        new_state, reward, done, info = env.step(action)
        
        # Update Q-value
        q_table[state, action] = q_table[state, action] * \
            (1-learning_rate) + learning_rate * \
            (reward+discount_rate * np.max(q_table[new_state, :]))
        # Transition to new state
        state = new_state
        rewards_current_episode+=reward
        if done:
            break
    
    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
    (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)

    # Append the reward from current episode
    rewards_all_episodes.append(rewards_current_episode)


## After All Episodes Complete
After all episodes are finished, we now just calculate the average reward per thousand episodes from our list that contains the rewards for all episodes so that we can print it out and see how the rewards changed over time.

In [7]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.04900000000000004
2000 :  0.21300000000000016
3000 :  0.3800000000000003
4000 :  0.5310000000000004
5000 :  0.6200000000000004
6000 :  0.6450000000000005
7000 :  0.6950000000000005
8000 :  0.6720000000000005
9000 :  0.6960000000000005
10000 :  0.6540000000000005
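
The same averaging idea can be checked on a tiny, made-up reward list where the result is easy to verify by hand. This is a plain-Python sketch, not the notebook's code, and `average_per_chunk` is a hypothetical helper name.

```python
def average_per_chunk(rewards, chunk_size):
    """Mean reward for each consecutive block of chunk_size episodes."""
    if len(rewards) % chunk_size != 0:
        raise ValueError("number of episodes must be a multiple of chunk_size")
    return [
        sum(rewards[start:start + chunk_size]) / chunk_size
        for start in range(0, len(rewards), chunk_size)
    ]


# Made-up rewards for 8 episodes (1 = reached the goal, 0 = did not).
toy_rewards = [0, 0, 0, 1, 1, 0, 1, 1]
print(average_per_chunk(toy_rewards, 4))  # [0.25, 0.75]
```

Because each episode's total reward is either 0 or 1, each average is simply the fraction of episodes won in that block.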


## Interpreting The Training Results
Let's take a second to understand how we can interpret these results. Our agent played 10,000 episodes. At each time-step within an episode, the agent received a reward of 1 if it reached the frisbee and a reward of 0 otherwise. If the agent did reach the frisbee, the episode finished at that time-step.

That means the total reward the agent received for an entire episode is either 1 or 0. So, for the first thousand episodes, we can interpret the score of 0.049 as meaning that the agent received a reward of 1 and won the episode about 5% of the time. By the last thousand episodes of the 10,000, the agent was winning about 65% of the time.

By analyzing the grid of the game, we can see that the agent is far more likely to fall in a hole, or perhaps reach the max time steps, than to reach the frisbee. Reaching the frisbee roughly 65% of the time therefore isn't too shabby, especially since the agent had no explicit instructions to reach the frisbee. It learned that this is the correct thing to do.

    SFFF
    FHFH
    FFFH
    HFFG
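
As a small sketch of how to read this grid programmatically, the snippet below lists the state indices of the holes and the goal. It assumes the usual FrozenLake legend (S = start, F = frozen, H = hole, G = goal) and row-by-row state numbering; `states_with_tile` is a hypothetical helper.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]


def states_with_tile(grid, tile):
    """Flat state indices of every cell holding the given tile letter."""
    n_cols = len(grid[0])
    return [
        row * n_cols + col
        for row, line in enumerate(grid)
        for col, char in enumerate(line)
        if char == tile
    ]


holes = states_with_tile(GRID, "H")
goal = states_with_tile(GRID, "G")
print(holes)  # [5, 7, 11, 12]
print(goal)   # [15]
```

These are exactly the rows that remain all zeros in the Q-table printed at the end of this notebook: an episode ends as soon as the agent enters one of these states, so no action is ever taken from them and their Q-values are never updated.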
    
Lastly, we print out our updated Q-table to see how that has transitioned from its initial state of all zeros.

## Save The Trained Q-Table

In [8]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)
np.savetxt("L-07-Q_table.txt",q_table)



********Q-table********

[[0.50757154 0.44019976 0.45177627 0.45741024]
 [0.2855845  0.31854987 0.27639278 0.42394095]
 [0.33069364 0.26150264 0.26573058 0.27020972]
 [0.08498068 0.15198439 0.07541347 0.10260243]
 [0.52518316 0.3934814  0.34749928 0.37584832]
 [0.         0.         0.         0.        ]
 [0.16623461 0.1321614  0.31466261 0.10396448]
 [0.         0.         0.         0.        ]
 [0.37000304 0.42302033 0.41291809 0.55496024]
 [0.4862758  0.58800947 0.42354015 0.36065907]
 [0.5296818  0.32909486 0.38978523 0.24914056]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.30754057 0.55994573 0.74227529 0.49746954]
 [0.73491478 0.84302153 0.76762264 0.75333791]
 [0.         0.         0.         0.        ]]
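
A saved table can later be read back with `np.loadtxt` and used to look up the greedy action for each state. The sketch below demonstrates the round trip on a small stand-in table written to a temporary directory, so it does not touch the file saved above; the table values and the file name `demo_q_table.txt` are made up for the example.

```python
import os
import tempfile

import numpy as np

# Small stand-in table (3 states x 4 actions), not the trained values.
demo_q_table = np.array([
    [0.5, 0.4, 0.4, 0.4],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.6, 0.7, 0.5],
])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_q_table.txt")
    np.savetxt(path, demo_q_table)   # write as plain text
    restored = np.loadtxt(path)      # read it back as a 2-D float array

# Greedy policy: the column with the highest Q-value in each row.
greedy_actions = np.argmax(restored, axis=1)
print(restored.shape)
print(greedy_actions)
```

Note that `np.argmax` returns the first index on ties, so an all-zero row (such as a terminal state) maps to action 0 even though no action is meaningful there.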
