# Deep Reinforcement Learning

Tutorial obtained from [learndatasci.com](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/)

## Install:

```bash
git clone https://github.com/openai/gym
cd gym
pip install -e .
```

In [1]:
!pip install gym

In [1]:
import gym

env = gym.make("Taxi-v2").env

env.render()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



The core gym interface is `env`, the unified environment interface. The following `env` methods will be the most useful to us:

- `env.reset`: Resets the environment and returns a random initial state.
- `env.step(action)`: Steps the environment by one timestep. Returns
  - **observation**: Observations of the environment
  - **reward**: Whether your action was beneficial or not
  - **done**: Indicates whether we have successfully picked up and dropped off a passenger, which ends one episode
  - **info**: Additional info such as performance and latency for debugging purposes
- `env.render`: Renders one frame of the environment (helpful in visualizing the environment)

*Note: We append `.env` to the result of `gym.make` to avoid training stopping at 200 iterations, which is the default for the new version of Gym (reference).*

The **Action Space** is:
- 0 = south
- 1 = north
- 2 = east
- 3 = west
- 4 = pickup
- 5 = dropoff
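
To make logs easier to read, the action indices above can be kept in a small lookup table. This is only a minimal sketch: `ACTION_NAMES` and `describe_action` are hypothetical helpers introduced here for illustration and are not part of Gym.

```python
# Hypothetical helper (not part of Gym): maps the Taxi action indices
# listed above to readable names.
ACTION_NAMES = {
    0: "south",
    1: "north",
    2: "east",
    3: "west",
    4: "pickup",
    5: "dropoff",
}

def describe_action(action):
    """Return the readable name for an action index."""
    return ACTION_NAMES[int(action)]

print(describe_action(4))  # pickup
```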

In [2]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

Action Space Discrete(6)
State Space Discrete(500)


In [3]:
# (taxi row, taxi column, passenger index, destination index)
state = env.encode(3, 1, 2, 0) 
print("State:", state)

env.s = state
env.render()

State: 328
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
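
The state index can be read as the four components packed into a single integer. The sketch below is a standalone illustration, not Gym's own code: it assumes 5 rows, 5 columns, 5 passenger locations (the four lettered cells plus "in the taxi") and 4 destinations, which gives the 500 states reported above, and it reproduces the value 328 for `(3, 1, 2, 0)`. The function names `encode_state` and `decode_state` are hypothetical.

```python
def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four components into one integer in [0, 500)."""
    # Assumed sizes: 5 rows x 5 columns x 5 passenger locations x 4 destinations
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_idx) * 4 + destination_idx

def decode_state(state):
    """Inverse of encode_state: recover the four components."""
    state, destination_idx = divmod(state, 4)
    state, passenger_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_idx, destination_idx

print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(328))         # (3, 1, 2, 0)
```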



`P` is the reward table, available as `env.P`. It maps each state to a dictionary with the following structure:

`{action: [(probability, nextstate, reward, done)]}`.

In [4]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}
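
To show how such an entry can be read, the sketch below copies the `env.P[328]` output above into a plain dictionary and unpacks each tuple. The variable names (`transitions`, `moving_actions`, `penalized_actions`) are illustrative only.

```python
# Copy of the env.P[328] entry printed above.
transitions = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}
current_state = 328

for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        moved = next_state != current_state
        print(f"action {action}: next_state={next_state}, "
              f"reward={reward}, moved={moved}, done={done}")

# Actions that actually change the state, and actions that cost -10
moving_actions = [a for a, o in transitions.items() if o[0][1] != current_state]
penalized_actions = [a for a, o in transitions.items() if o[0][2] == -10]
print(moving_actions)     # [0, 1, 2]
print(penalized_actions)  # [4, 5]
```

In this state, action 3 (west) leaves the taxi where it is, and the pickup and dropoff actions each cost -10 without changing the state.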

In [5]:
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 862
Penalties incurred: 280


In [14]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 14
State: 475
Action: 5
Reward: 20


Not good. In the run above, our agent took 862 timesteps and incurred 280 penalties to deliver just one passenger to the right destination.

This is because we aren't learning from past experience. We can run this over and over, and it will never optimize. The agent has no memory of which action was best for each state, which is exactly what Reinforcement Learning will do for us.

## Intro to Q-Learning

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

In our Taxi environment, we have the reward table, P, that the agent will learn from. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.

    
    TODO: Add theory
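
As a small worked example of "updating a Q-value", the sketch below isolates the update used in the training loop later in this notebook: the new value blends the old value with the observed reward plus the discounted best value of the next state. The function name `q_update` is hypothetical, and the defaults `alpha=0.1` and `gamma=0.6` simply mirror the hyperparameters in the training cell.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update for a single (state, action) pair.

    alpha: learning rate, how far to move toward the new estimate
    gamma: discount factor applied to the best next-state value
    """
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# First visit: the table is all zeros and an ordinary move costs -1.
new_value = q_update(old_value=0.0, reward=-1, next_max=0.0)
print(new_value)
```

With an all-zero table and a reward of -1, the entry moves one tenth of the way toward -1.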

In [6]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [7]:
q_table

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

## Training the agent

In [15]:
%%time

import random
import time
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 10001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(q_table)
        print(f"Episode: {i}")

print("Training finished.\n")

[[  0.           0.           0.           0.           0.
    0.        ]
 [ -2.27294961  -2.12204027  -2.27308804  -2.122014    -1.870144
  -11.11829469]
 [ -1.86891841  -1.45069001  -1.86989488  -1.45100413  -0.7504
  -10.44857059]
 ...
 [ -1.19727543   0.41220266  -1.05770049  -1.22960122  -1.94179156
   -3.61203677]
 [ -2.13060336  -2.11078713  -2.11701026  -2.11026458  -4.448542
   -4.35022508]
 [  2.1769634    0.05959      0.82764379  10.99999998  -2.19439328
   -1.82619831]]
Episode: 1000
Training finished.

CPU times: user 453 ms, sys: 103 ms, total: 556 ms
Wall time: 430 ms
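
Once training is done, the agent's choice in a state is simply the column with the largest Q-value in that state's row. As a minimal sketch, the snippet below takes the last row of the table printed above (state 499, assuming rows are indexed by state) and picks the greedy action in plain Python; for a single row this selects the same index as `np.argmax`. The helper name `greedy_action` is hypothetical.

```python
# Last row of the Q-table printed above.
q_row = [2.1769634, 0.05959, 0.82764379, 10.99999998, -2.19439328, -1.82619831]

def greedy_action(q_values):
    """Return the index of the largest Q-value (first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

best_action = greedy_action(q_row)
print(best_action)  # 3
```

Action 3 is "west" in the action space listed earlier.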


## Evaluate agent's performance after Q-learning

In [13]:
from time import sleep
from IPython.display import clear_output

total_epochs, total_penalties = 0, 0
episodes = 100
frames = [] # for animation

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])  # always exploit the learned values
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

        # Put each rendered frame into dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
        })

    total_penalties += penalties
    total_epochs += epochs

# Replay the recorded frames once all episodes have finished
print_frames(frames)

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

| Measure                                 | Random agent's performance | Q-learning agent's performance |
| --------------------------------------- | -------------------------- | ------------------------------ |
| Average rewards per move                | -3.9012092102214075        | 0.6962843295638126             |
| Average number of penalties per episode | 920.45                     | 0.0                            |
| Average number of timesteps per trip    | 2848.14                    | 12.38                          |

These metrics were computed over 100 episodes. And as the results show, our Q-learning agent nailed it!
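
The evaluation cell reports timesteps and penalties but does not compute the "Average rewards per move" row. One way it could be derived, sketched here under the assumption that each recorded frame carries a `'reward'` key as in the `frames` list above, is to divide the summed rewards by the number of moves. The helper `average_reward_per_move` is hypothetical and `example_frames` is made-up data used only to exercise it.

```python
def average_reward_per_move(frames):
    """Mean reward over all recorded moves."""
    total_reward = sum(frame['reward'] for frame in frames)
    return total_reward / len(frames)

# Made-up three-move episode: two ordinary moves, then a successful dropoff.
example_frames = [{'reward': -1}, {'reward': -1}, {'reward': 20}]
print(average_reward_per_move(example_frames))  # 6.0
```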