<a href="https://colab.research.google.com/github/jbpacker/deep-rl-class/blob/main/unit5/HuggingFace_unit_5_%F0%9F%92%AA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 5: Code your first Deep Reinforcement Learning Algorithm with PyTorch: Reinforce. And test its robustness 💪

Link to the [original Colab](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb)

🎮 Environments:
- [CartPole-v1](https://www.gymlibrary.ml/environments/classic_control/cart_pole/)
- [PixelCopter](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)
- [Pong](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pong.html)

## Get everything ready

### Step 1: Install libraries

In [1]:
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(500, 500))
virtual_display.start()


In [2]:
!pip install gym
!pip install git+https://github.com/ntasfi/PyGame-Learning-Environment.git
!pip install git+https://github.com/qlan3/gym-games.git
!pip install huggingface_hub
!pip install wandb

!pip install pyyaml==6.0 # avoid key error metadata

!pip install pyglet # Virtual Screen


### Step 2: Import packages

In [3]:
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

import wandb

import gym
import gym_pygame

from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

import imageio

Print the device that will be used (GPU if available, otherwise CPU).

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Build Architecture

### Step 3: Create the CartPole environment and understand how it works
#### [The environment 🎮](https://www.gymlibrary.ml/environments/classic_control/cart_pole/)

In [5]:
env_id = "CartPole-v1"
env = gym.make(env_id)

num_obs = env.observation_space.shape[0]
num_act = env.action_space.n

### Build Model

A fully connected neural network that takes the observation as input and outputs a probability for each action.

In [6]:
class PolicyNetwork(nn.Module):
    def __init__(self, num_obs, num_act):
        super(PolicyNetwork, self).__init__()

        self.l1 = nn.Linear(num_obs, 64)
        self.dropout = nn.Dropout(p=0.6)
        self.l2 = nn.Linear(64, num_act)

    def forward(self, x):
        x = self.l1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.l2(x)
        action_probs = F.softmax(action_scores, dim=1)

        return action_probs

    def act(self, state):
        """
        Given a state, take action
        """
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)
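
To make `act` concrete, the sketch below reproduces its three steps in plain Python: a softmax over the action scores, a sample from the resulting categorical distribution, and the log-probability of the sampled action. The helper names `softmax` and `sample_action` and the example scores are illustrative assumptions, not part of the notebook; in the notebook, `F.softmax` and `Categorical` do the equivalent work on tensors.

```python
import math
import random


def softmax(scores):
    """Turn raw action scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Sample an action index from `probs` and return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


rng = random.Random(0)       # seeded only so the sketch is repeatable
probs = softmax([2.0, 0.5])  # hypothetical scores for CartPole's two actions
action, log_prob = sample_action(probs, rng)
print(probs, action, log_prob)
```

The log-probability is what the training step later multiplies by the return, so it has to be kept alongside the sampled action.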

## Build the Reinforce Training Algorithm

Start with a loop that collects one episode and saves it into a replay buffer.

In [7]:
def generate_episode_data(policy):
    data = [
            np.empty((0, num_obs), dtype=np.float32), # obs
            np.empty((0, 1), dtype=np.float32), # action
            np.empty((0, 1), dtype=np.float32), # reward
            np.empty((0, 1), dtype=bool),  # done
            np.empty((0, num_obs), dtype=np.float32), # next_obs
            ] 

    log_prob = []

    state = env.reset()
    done = False
    reward = 0

    while not done:
        action, lp = policy.act(state)
        log_prob.append(lp)
        data[0] = np.append(data[0], np.reshape(state, (1,-1)), axis=0)
        data[1] = np.append(data[1], np.reshape(action, (1,-1)), axis=0)
        data[2] = np.append(data[2], np.reshape(reward, (1,-1)), axis=0)
        data[3] = np.append(data[3], np.reshape(done, (1,-1)), axis=0)

        state, reward, done, info = env.step(action)

        data[4] = np.append(data[4], np.reshape(state, (1,-1)), axis=0)

    # The final replay buffer idx won't have a "next_state" or "action"
    data[0] = np.append(data[0], np.reshape(state, (1,-1)), axis=0)
    data[2] = np.append(data[2], np.reshape(reward, (1,-1)), axis=0)
    data[3] = np.append(data[3], np.reshape(done, (1,-1)), axis=0)

    return data, log_prob

## Debug printing
# data, _ = generate_episode_data(policy)
# for i in range(5):
#     print(len(data[i]))
# print(data)
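
The layout of the buffer is easy to miss: rewards are appended before each step, so row `i` holds the reward received on arriving at state `i`, and row 0 holds the placeholder `0`. The sketch below shows the same loop with plain lists and a hypothetical `ToyEnv` that follows the older Gym API used in this notebook (`reset()` returns an observation, `step()` returns four values). Both `ToyEnv` and `collect_episode` are assumptions made for illustration.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym environment: ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done, {}  # obs, reward, done, info


def collect_episode(env, rng):
    states, actions, rewards, dones = [], [], [], []
    state, reward, done = env.reset(), 0.0, False
    while not done:
        action = rng.choice([0, 1])  # random policy, for illustration only
        states.append(state)
        actions.append(action)
        rewards.append(reward)  # reward received on arriving at `state`
        dones.append(done)
        state, reward, done, _ = env.step(action)
    # Record the terminal state; no action is taken from it.
    states.append(state)
    rewards.append(reward)
    dones.append(done)
    return states, actions, rewards, dones


states, actions, rewards, dones = collect_episode(ToyEnv(horizon=3), random.Random(0))
print(len(states), len(actions), rewards, dones)
```

For an episode of T steps, the state, reward, and done columns have T + 1 rows while there are only T actions (and T log-probabilities).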


Next, a function that takes the replay buffer and calculates the discounted cumulative reward for every step.

In [8]:
def find_cumulative_reward(data, gamma):
    num_states = len(data[0])
    cumulative_reward = np.empty((num_states, 1), dtype=np.float32)
    cumulative_reward[num_states - 1] = data[2][num_states - 1]
    for i in reversed(range(num_states-1)):
        cumulative_reward[i] = data[2][i] + gamma * cumulative_reward[i + 1]

    return cumulative_reward

## For debugging
# gamma = 0.99
# data, _ = generate_episode_data(policy)
# R = find_cumulative_reward(data, gamma)

# for i in range(len(data[2])):
#     print("[{}] r: {} cr: {}".format(i, data[2][i], R[i]))
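
As a small worked example of the backward recursion `G[i] = r[i] + gamma * G[i + 1]`, here is a plain-Python sketch. The name `discounted_returns` is illustrative, and `gamma=0.5` is chosen only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Compute G[i] = rewards[i] + gamma * G[i + 1], working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


# Rewards laid out as generate_episode_data stores them: a leading 0, then three 1s.
returns = discounted_returns([0.0, 1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [0.875, 1.75, 1.5, 1.0]
```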

**Notes:**

In the Hugging Face class, G(t) is only calculated for the entire episode. Here we calculate G(t) for each state in the episode, weight that step's log-probability by it, and sum the results.

The following normalization trick, taken from [the PyTorch REINFORCE example](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py), is then used to improve performance:
```
R(t) = (G(t) - mean(G)) / (std(G) + eps)
```
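
A minimal plain-Python sketch of this standardization, assuming a hypothetical helper `normalize_returns` and an illustrative default `eps` (the notebook itself uses the float32 machine epsilon). `statistics.stdev` is the sample standard deviation (n - 1 in the denominator), which matches the default behaviour of `torch.Tensor.std`; it needs at least two returns.

```python
import statistics


def normalize_returns(returns, eps=1e-8):
    """Standardize returns to zero mean and roughly unit sample standard deviation."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)  # sample std (n - 1), like torch.Tensor.std by default
    return [(g - mean) / (std + eps) for g in returns]


normalized = normalize_returns([1.0, 2.0, 3.0])
print(normalized)
```

After normalization, returns below the episode average become negative, so the corresponding actions are pushed down rather than merely pushed up less.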


In [16]:
def train_single_episode(optimizer, policy):
    data, log_prob = generate_episode_data(policy)
    R = find_cumulative_reward(data, gamma)

    policy_losses = []
    R = torch.tensor(R)

    # This comes from the pytorch reinforce example
    # https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py
    R = (R - R.mean()) / (R.std() + eps)
    for r, l_p in zip(R, log_prob):
        # Weird for me here that 
        policy_losses.append(-l_p * r)
    
    optimizer.zero_grad()
    # Note here that div by the len seems to degrade performance
    policy_loss = torch.cat(policy_losses).sum()# / len(data[0])

    if log:
        # Log plain Python numbers in a single call so all three metrics share one step.
        wandb.log({
            "loss": policy_loss.item(),
            "reward sum": np.sum(data[2]),
            "episode length": len(data[0]),
        })

    policy_loss.backward()
    optimizer.step()



## Debug - Single Step
# env_id = "CartPole-v1"
# env = gym.make(env_id)

# num_obs = env.observation_space.shape[0]
# num_act = env.action_space.n

# policy = PolicyNetwork(num_obs, num_act)
# optimizer = optim.Adam(policy.parameters(), lr=1e-2)
# # eps = np.finfo(np.float32).eps.item()

# train_single_episode(optimizer, policy)
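
To see what `policy_loss.backward()` followed by `optimizer.step()` does to the policy, the sketch below applies one hand-computed gradient step to a softmax policy parameterized directly by two logits. This is a simplification: the notebook uses a neural network, autograd, and Adam, while the sketch uses plain gradient descent. The names `reinforce_step`, `ret`, and `lr` are illustrative assumptions.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, ret, lr):
    """One gradient-descent step on the loss -log pi(action) * ret.

    For a softmax over logits, d(loss)/d(logit_k) = ret * (pi_k - 1[k == action]).
    """
    probs = softmax(logits)
    grads = [ret * (p - (1.0 if k == action else 0.0)) for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]


start = [0.0, 0.0]  # both actions equally likely
after_good = softmax(reinforce_step(start, action=1, ret=1.0, lr=0.5))
after_bad = softmax(reinforce_step(start, action=1, ret=-1.0, lr=0.5))
print(after_good[1], after_bad[1])
```

A positive return raises the probability of the sampled action and a negative return lowers it, which is the behaviour the `-l_p * r` terms encode.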

In [10]:
def train(env_id):
    if log: 
        wandb.init(project="reinforce")
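    # NOTE: generate_episode_data() reads the module-level `env` and `num_obs`,
    # so this local `env` is only used to size the policy network below.
    env = gym.make(env_id)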

    num_obs = env.observation_space.shape[0]
    num_act = env.action_space.n

    policy = PolicyNetwork(num_obs, num_act)

    if log: 
        wandb.watch(policy, log_freq=1)  

    optimizer = optim.Adam(policy.parameters(), lr=1e-2)
    # eps = np.finfo(np.float32).eps.item()

    for i in range(1, steps):
        if log:
            wandb.log({"epoch": i})
        train_single_episode(optimizer, policy)

In [17]:
# Discount factor
gamma = 0.99
steps = 500

log = True
eps = np.finfo(np.float32).eps.item()
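
For intuition about the discount factor, the sketch below computes the discounted return of a stream of rewards that are all 1, using the closed-form geometric sum. The helper name `discounted_sum_of_ones` is an illustrative assumption. With `gamma = 0.99` the return can never exceed `1 / (1 - gamma) = 100`, however long the episode runs.

```python
def discounted_sum_of_ones(n_steps, gamma):
    """Closed-form sum of gamma**k for k = 0 .. n_steps - 1."""
    return (1.0 - gamma ** n_steps) / (1.0 - gamma)


gamma = 0.99
bound = 1.0 / (1.0 - gamma)  # limit as the episode length grows
full_episode = discounted_sum_of_ones(500, gamma)  # CartPole-v1 caps episodes at 500 steps
print(bound, full_episode)
```

Rewards more than a few hundred steps ahead therefore contribute almost nothing to the return of the current step.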

In [18]:
env_id = "CartPole-v1"
train(env_id)

[wandb run history: sparkline charts for episode length, epoch, loss, and reward sum]

wandb run summary:

| Metric | Value |
| --- | --- |
| episode length | 501.0 |
| epoch | 999.0 |
| loss | -0.01161 |
| reward sum | 500.0 |
