<a href="https://colab.research.google.com/github/kobybibas/huggingface_deep_reinforcement-learning_course/blob/main/unit8_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 8: Proximal Policy Optimization (PPO) with PyTorch 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>


In this notebook, you'll learn to **code your PPO agent from scratch with PyTorch, using the CleanRL implementation as a model**.

To test its robustness, we're going to train it in:

- [LunarLander 🚀](https://www.gymlibrary.dev/environments/box2d/lunar_lander/) (the code in this notebook uses the Gymnasium `LunarLander-v3` version)


⬇️ Here is an example of what you will achieve. ⬇️

In [None]:
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).

## Objectives of this notebook 🏆

At the end of the notebook, you will:

- Be able to **code your PPO agent from scratch using PyTorch**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.




## This notebook is from the Deep Reinforcement Learning Course
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>

In this free course, you will:

- 📖 Study Deep Reinforcement Learning in **theory and practice**.
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
- 🤖 Train **agents in unique environments**

Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**


The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5

## Prerequisites 🏗️
Before diving into the notebook, you need to:

🔲 📚 Study [PPO by reading Unit 8](https://huggingface.co/deep-rl-course/unit8/introduction) 🤗  

To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push one model. We don't ask for a minimal result, but we **advise you to try different hyperparameter settings to get better results**.

If you don't find your model, **go to the bottom of the page and click on the refresh button**

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">

## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so in Colab, **we need a virtual screen to render the environment** (and thus record the frames).

The following cells install the libraries, then create and start a virtual screen 🖥

In [None]:
!pip install setuptools==65.5.0

In [None]:
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!apt install swig cmake
!pip install pyglet==1.5
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

## Install dependencies 🔽
For this exercise, we use `gymnasium`; the earlier `gym==0.22` install lines are left commented out.

In [None]:
# !pip install gym==0.22
# !pip install gym[box2d]==0.22

!pip install imageio-ffmpeg
!pip install huggingface_hub
!pip install gymnasium
!pip install gymnasium[box2d]
!pip install -q imageio[ffmpeg]
!pip install tqdm

## Let's code PPO from scratch with Costa Huang's tutorial
- For the core implementation of PPO, we're going to use the excellent tutorial by [Costa Huang](https://costa.sh/).
- In addition to the tutorial, to go deeper you can read the 37 core implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

👉 The video tutorial: https://youtu.be/MEt6rrxH8W4

In [None]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MEt6rrxH8W4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

- It's best to code first in the cell below; this way, if you kill the machine, **you don't lose the implementation**.

In [None]:
import torch
import gymnasium as gym
import numpy as np
import random

random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

def make_env(num_envs):
    envs = gym.make_vec("LunarLander-v3", num_envs=num_envs, vectorization_mode="sync")
    envs.reset(seed=123)
    return envs

num_envs = 10
envs = make_env(num_envs=num_envs)

print("Action space:", envs.action_space)
print("Sample action:", envs.action_space.sample())
obs, info = envs.reset(seed=123)
print("Observation shape:", obs.shape)  # Expect (10, 8) for 10 envs
num_actions = 4
input_size = num_observations = 8 # Observation shape


In [None]:
from torch import nn

class ActorCritic(nn.Module):

    def __init__(self, input_size, num_actions):
        super().__init__()
        self.shared_layers = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
        )
        self.value_head = nn.Linear(64, 1)
        self.policy_head = nn.Linear(64, num_actions)


    def forward(self, x):
        x = self.shared_layers(x)
        logits = self.policy_head(x)
        value = self.value_head(x)
        return logits, value


model = ActorCritic(input_size, num_actions)
model = model.to('cuda')
# The input must live on the same device as the model
obs_tensor = torch.tensor(obs, dtype=torch.float32).to('cuda')
logits, value = model(obs_tensor)
print("Logits shape:", logits.shape)  # Expect (10, 4) for 10 envs
print("Value shape:", value.shape)    # Expect (10, 1)

In [None]:
from torch.distributions.categorical import Categorical
from tqdm.notebook import tqdm  # progress bar used in collect_rollout


def collect_rollout(model, envs, n_steps):
    with torch.no_grad():
        memory = []
        obs, info = envs.reset(seed=123)
        for _ in tqdm(range(n_steps)):
            obs_tensor = torch.tensor(obs, dtype=torch.float32).to('cuda')
            logits, values = model(obs_tensor)
            dist = Categorical(logits=logits)
            actions = dist.sample()
            logprobs = dist.log_prob(actions)
            actions_np = actions.cpu().numpy().astype(np.int32)
            # print(actions_np)

            next_obs, rewards, terminations, truncations, infos = envs.step(actions_np)

            memory.append({
                'obs': obs_tensor,
                'action': actions,
                'logprob': logprobs,
                'reward': torch.tensor(rewards, dtype=torch.float32).to('cuda'),
                'done': torch.tensor(np.logical_or(terminations, truncations), dtype=torch.float32).to('cuda'),  # episode ended by termination or time limit
                'value': values
            })

            obs = next_obs
    return memory
envs = make_env(num_envs)
n_steps = 100000
memory = collect_rollout(model, envs, n_steps)
# print(memory)
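
The rollout above samples actions from `Categorical(logits=logits)` and stores their log-probabilities, and the update step later uses the entropy of that distribution. As a rough illustration of what those three quantities are, here is a minimal pure-Python sketch. The helper names `softmax`, `log_prob` and `entropy` are made up for this example and are not part of PyTorch or CleanRL.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    """Log-probability the policy assigns to one discrete action."""
    return math.log(softmax(logits)[action])

def entropy(logits):
    """Entropy of the action distribution: high when the policy is undecided."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

uniform = [0.0, 0.0, 0.0, 0.0]  # four equally likely actions, as in LunarLander
peaked = [5.0, 0.0, 0.0, 0.0]   # policy strongly prefers action 0

print("probs (uniform):", softmax(uniform))
print("log-prob of action 2 (uniform):", log_prob(uniform, 2))
print("entropy uniform vs peaked:", entropy(uniform), entropy(peaked))
```

The uniform policy has the largest possible entropy for four actions, which is why an entropy bonus in the loss pushes the agent to keep exploring.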

In [None]:
# Compute GAE (Generalized Advantage Estimation)
def compute_gae(memory, gamma, lam):
    """Return advantages + values (the value targets), shape [n_step, num_envs].

    The value after the last rollout step is assumed to be 0 (no bootstrap).
    """
    n_step = len(memory)
    num_envs = memory[0]['value'].shape[0]
    advantages = torch.zeros((n_step, num_envs)).to('cuda')

    adv_t_plus_1 = torch.zeros(num_envs).to('cuda')
    memory_t_plus_1_value = torch.zeros(num_envs).to('cuda')

    for t in reversed(range(n_step)):
        memory_t = memory[t]
        r_t = memory_t['reward']                      # shape: [num_envs]
        done_t = memory_t['done']                     # shape: [num_envs]
        v_t = memory_t['value'].squeeze(-1)           # shape: [num_envs]

        # TD error; the next value is masked out when the episode ended at step t
        delta_t = r_t + gamma * memory_t_plus_1_value * (1 - done_t) - v_t
        # Backward recursion: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
        adv_t = delta_t + gamma * lam * (1 - done_t) * adv_t_plus_1

        advantages[t] = adv_t.detach()
        adv_t_plus_1 = adv_t
        memory_t_plus_1_value = v_t

    values = torch.stack([m['value'].squeeze(-1) for m in memory])  # shape: [n_step, num_envs]
    # Note: advantages + values are the return targets, not the raw advantages
    return advantages + values



gamma = 0.99  # discount factor (CleanRL's PPO default)
lam = 0.95    # GAE lambda (CleanRL's PPO default)
gae = compute_gae(memory, gamma, lam)
print(gae, gae.shape)
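
To make the backward recursion in `compute_gae` easier to check by hand, here is a minimal sketch of the same computation for a single environment, using plain Python floats. The function name `gae_single_env` and the toy rewards and values are invented for this illustration; like the cell above, it assumes the value after the last step is 0.

```python
def gae_single_env(rewards, values, dones, gamma, lam, last_value=0.0):
    """Backward GAE recursion for one environment.

    dones[t] is 1.0 when the episode ended right after step t.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value  # value estimate of the state after the final step
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        advantages[t] = delta + gamma * lam * not_done * next_adv
        next_value = values[t]
        next_adv = advantages[t]
    return advantages

# Toy 3-step episode that terminates on the last step
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [0.0, 0.0, 1.0]

adv = gae_single_env(rewards, values, dones, gamma=0.99, lam=0.95)
returns = [a + v for a, v in zip(adv, values)]  # targets for the value head
print("advantages:", adv)
print("returns:   ", returns)
```

With `gamma=1.0` and `lam=1.0` the advantage at each step reduces to the sum of the remaining rewards minus the value estimate, which is a quick sanity check for any GAE implementation.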

In [None]:
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
from tqdm.notebook import tqdm

batch_size = 2048
lr = 0.001
epoch = 100

# PPO Loss and Policy Update
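def compute_ppo_loss(obs, values_predicted, logprobs_new, logprobs_old, returns, advantages, entropy, epsilon=0.2, c1=0.5, c2=0.02):
    # Probability ratio between the updated policy and the policy that collected the data
    r_t = torch.exp(logprobs_new - logprobs_old)
    # Clipped surrogate objective, negated because the optimizer minimizes
    loss_ppo = -torch.minimum(r_t * advantages, torch.clamp(r_t, 1 - epsilon, 1 + epsilon) * advantages)
    # The value head regresses onto the targets passed in as `returns`
    loss_value = nn.functional.mse_loss(values_predicted, returns)
    # Subtracting the entropy term rewards exploration
    entropy_bonus = entropy
    return loss_ppo.mean() + c1 * loss_value - c2 * entropy_bonus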


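# Keep training the model that collected the rollout, so logprobs_old matches the current policy
model.train()
optimizer = Adam(model.parameters(), lr=lr)

# Stack rollout data into tensors
obs_all = torch.stack([m['obs'] for m in memory]).reshape(-1, num_observations)
logprobs_old_all = torch.stack([m['logprob'] for m in memory]).reshape(-1, 1)
actions_all = torch.stack([m['action'] for m in memory]).reshape(-1, 1)
values_old_all = torch.stack([m['value'].squeeze(-1) for m in memory]).reshape(-1, 1)
# compute_gae returns advantages + values, i.e. the targets for the value head
returns_all = gae.reshape(-1, 1)
# Recover the advantages by subtracting the value estimates stored during the rollout
advantages_all = returns_all - values_old_all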

# Move to gpu
model = model.to('cuda')
obs_all = obs_all.to('cuda')
logprobs_old_all = logprobs_old_all.to('cuda')
returns_all = returns_all.to('cuda')
actions_all = actions_all.to('cuda')
advantages_all = advantages_all.to('cuda')


print(f'{obs_all.shape, logprobs_old_all.shape, returns_all.shape, actions_all.shape,  advantages_all.shape}')
dataset = TensorDataset(obs_all, logprobs_old_all, returns_all, actions_all,  advantages_all)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # shuffle so each minibatch mixes time steps and envs

loss_tracker = []
for _ in tqdm(range(epoch)):

    loss_tracker_batch = []
    for obs, logprobs_old, returns, actions, advantages in dataloader:

        # Feed obs through your current policy network to get the updated logprob_new
        logits, values_predicted = model(obs)
        dist = Categorical(logits=logits)
        logprobs_new = dist.log_prob(actions.squeeze(-1)).unsqueeze(1)  # using the actions we logged but with the new probability the model assigns
        entropy = dist.entropy().mean()
        # print(f'{[obs.shape, logprobs_new.shape, logprobs_old.shape, returns.shape, obs.shape, advantages.shape]}')
        # print(f'{[obs, logprobs_new, logprobs_old, returns, obs, advantages]}')

        loss = compute_ppo_loss(obs, values_predicted, logprobs_new, logprobs_old, returns, advantages, entropy)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss_tracker_batch.append(loss.item())
    loss_tracker.append(np.mean(loss_tracker_batch))
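
The clipping inside `compute_ppo_loss` is easier to see on single numbers. The sketch below is a pure-Python illustration of the clipped surrogate term for one sample; the function name `clipped_surrogate` and the probabilities are invented for this example, and the sign is left positive (the loss above negates it).

```python
import math

def clipped_surrogate(logprob_new, logprob_old, advantage, epsilon=0.2):
    """Clipped surrogate objective for a single (state, action) sample."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

old = math.log(0.25)  # the rollout policy gave this action probability 0.25

# New policy doubles the probability (ratio about 2) of a good action:
# the gain is capped at (1 + epsilon) * advantage
good_up = clipped_surrogate(math.log(0.5), old, advantage=1.0)

# Same ratio for a bad action: no clipping, the full penalty is kept
bad_up = clipped_surrogate(math.log(0.5), old, advantage=-1.0)

# New policy halves the probability (ratio about 0.5) of a bad action:
# the term is capped at (1 - epsilon) * advantage
bad_down = clipped_surrogate(math.log(0.125), old, advantage=-1.0)

print(good_up, bad_up, bad_down)
```

In the two capped cases the term no longer depends on the ratio, so those samples contribute no gradient and cannot push the policy further away from the one that collected the data.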

In [None]:
import matplotlib.pyplot as plt
plt.plot(loss_tracker)
plt.show()

In [None]:
# Eval
def evaluate_agent(model, env_name="LunarLander-v3", episodes=5, render=True):
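    # env.render() only draws if a render_mode is set when the env is created
    env = gym.make(env_name, render_mode="human" if render else None)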
    model.eval()

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0
        while not done:
            obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)  # add batch dim
            obs_tensor = obs_tensor.to('cuda')
            # print(obs_tensor.shape)
            with torch.no_grad():
                logits, _ = model(obs_tensor)
                dist = Categorical(logits=logits)
                action = dist.probs.argmax(dim=-1).item()  # Greedy action (no sampling)

            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total_reward += reward

            if render:
                env.render()  # This will open a window locally or show animation in notebook

        print(f"Episode {ep+1} finished with reward: {total_reward}")

    env.close()
evaluate_agent(model, env_name="LunarLander-v3", episodes=5, render=True)

In [None]:
import imageio
import os
from IPython.display import HTML
from base64 import b64encode

def evaluate_and_save_video(model, env_name="LunarLander-v3", filename="lander.mp4", max_frames=1000):
    import gymnasium as gym
    env = gym.make(env_name, render_mode="rgb_array")
    obs, _ = env.reset()
    model.eval()

    frames = []
    done = False
    total_reward = 0

    for _ in range(max_frames):
        obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0).to('cuda')
        with torch.no_grad():
            logits, _ = model(obs_tensor)
            dist = Categorical(logits=logits)
            action = dist.probs.argmax(dim=-1).item()

        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        frame = env.render()
        frames.append(frame)
        total_reward += reward

        if done:
            break

    env.close()
    print(f"Episode reward: {total_reward}")

    # Save video
    imageio.mimsave(filename, frames, fps=30)

def show_video(filename):
    # Read the file with a context manager so the handle is closed afterwards
    with open(filename, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML(f'<video width=480 controls><source src="{data_url}" type="video/mp4"></video>')

# Usage
evaluate_and_save_video(model)
show_video("lander.mp4")
