# Deep RL with Gym and Stable-Baselines3

Trimmed version of https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb#scrollTo=zpz8kHlt_a_m

Note: Set the runtime type to GPU before running (in Colab: Runtime > Change runtime type).

## Install Dependencies

In [None]:
!apt install python-opengl
!apt install ffmpeg freeglut3-dev xvfb  # For visualization
!pip3 install pyvirtualdisplay
!pip install gym[box2d] # environments
!pip install stable-baselines3[extra] # deep RL library
!pip install huggingface_sb3 # extra code for sb3 to load/upload models to HFHub
!pip install pyglet
!pip install ale-py==0.7.4 # To overcome an issue with gym (https://github.com/DLR-RM/stable-baselines3/issues/875)

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
Suggested packages:
  libgle3
The following NEW packages will be installed:
  python-opengl
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 496 kB of archives.
After this operation, 5,416 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-opengl all 3.1.0+dfsg-1 [496 kB]
Fetched 496 kB in 1s (483 kB/s)
Selecting previously unselected package python-opengl.
(Reading database ... 155639 files and directories currently installed.)
Preparing to unpack .../python-opengl_3.1.0+dfsg-1_all.deb ...
Unpacking python-opengl (3.1.0+dfsg-1) ...
Setting up python-opengl (3.1.0+dfsg-1) ...
Reading package lists... Done
Building dependency tree       
Reading state information... Done

## Import Dependencies

In [None]:
import gym

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO, DQN

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

import matplotlib.pyplot as plt
from gym import wrappers
from gym.wrappers import Monitor
import io
import base64
import glob
import numpy as np
from IPython.display import HTML
from IPython import display as ipythondisplay

from pyvirtualdisplay import Display # Virtual display

### Set the virtual display

In [None]:
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7fa92a91f750>

## Create the environment
#### [The original notebook's environment 🎮 (LunarLander)](https://www.gymlibrary.ml/environments/box2d/lunar_lander/)

This trimmed version uses `MountainCar-v0` instead of LunarLander.

In [None]:
def query_environment(name):
    env = gym.make(name)
    spec = gym.spec(name)
    print(f"Action Space: {env.action_space}")
    # .n is the number of discrete actions (Discrete spaces only), not a shape
    print(f"Action Space Shape: {env.action_space.n}")
    print(f"Action Space Sample: {env.action_space.sample()}")
    print(f"Observation Space: {env.observation_space.shape}")
    print(f"Max Episode Steps: {spec.max_episode_steps}")
    print(f"Nondeterministic: {spec.nondeterministic}")
    print(f"Reward Range: {env.reward_range}")
    print(f"Reward Threshold: {spec.reward_threshold}")
    print(f"Sample observation: {env.observation_space.sample()}")

In [None]:
# Bonus: 3 new environments
env_name = "MountainCar-v0"
env = gym.make(env_name)
env.reset()

array([-0.4921547,  0.       ], dtype=float32)

In [None]:
query_environment(env_name)

Action Space: Discrete(3)
Action Space Shape: 3
Action Space Sample: 2
Observation Space: (2,)
Max Episode Steps: 200
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: -110.0
Sample observation: [ 0.5884096  -0.06373038]
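
The query output above (3 discrete actions, a 2-value observation, a 200-step limit) is easier to interpret next to the environment's dynamics. The following is a minimal plain-Python sketch, not the Gym API: the constants and update rule are recalled from Gym's classic-control MountainCar implementation and should be treated as assumptions, and the fixed start position of -0.5 replaces Gym's random start. It illustrates why a policy that never reaches the flag scores exactly -200.

```python
import math

# Constants recalled from Gym's MountainCar-v0 (assumed, not read from the notebook).
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE, GRAVITY = 0.001, 0.0025
MAX_EPISODE_STEPS = 200


def step(position, velocity, action):
    """Advance one step; action is 0 (push left), 1 (no push) or 2 (push right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    done = position >= GOAL_POS
    return position, velocity, -1.0, done  # the reward is -1 on every step


def episode_return(policy, start_position=-0.5):
    """Run one episode from rest and return the summed reward."""
    position, velocity, total = start_position, 0.0, 0.0
    for _ in range(MAX_EPISODE_STEPS):
        action = policy(position, velocity)
        position, velocity, reward, done = step(position, velocity, action)
        total += reward
        if done:
            break
    return total


def always_right(position, velocity):
    return 2  # the engine alone is too weak to climb the hill


def follow_velocity(position, velocity):
    # Push in the direction the car is already moving to build up momentum.
    return 2 if velocity >= 0 else 0


print(episode_return(always_right))     # runs into the 200-step limit
print(episode_return(follow_velocity))  # reaches the flag before the limit
```

Because every step costs -1, the return is simply minus the episode length, so an agent that never reaches the flag scores exactly -200.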


#### Vectorized Environment
- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments. This way, **we'll have more diverse experiences during training.**

In [None]:
# Create the environment
env = make_vec_env(env_name, n_envs=16)

## Create the Model

To solve this problem, we're going to use SB3 **PPO**. [PPO (Proximal Policy Optimization) is one of the SOTA (state-of-the-art) deep reinforcement learning algorithms that you'll study during this course](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example).

PPO is a combination of:
- *Value-based reinforcement learning method*: learning a value function that estimates **how valuable a state is**; PPO uses this estimate to judge whether the actions it took were better or worse than expected.
- *Policy-based reinforcement learning method*: learning a policy that **gives us a probability distribution over actions**.
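
The two parts meet in PPO's clipped objective. The sketch below shows the per-sample term in plain Python, under these assumptions: `ratio` is the new policy's probability of the taken action divided by the old policy's, `advantage` comes from the value estimate, and `clip_range=0.2` is the default. The function name is invented for illustration; this is not Stable-Baselines3's implementation, which works on batched tensors and adds value-loss and entropy terms.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: the smaller of the raw and the clipped term."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage): the gain is capped once the ratio exceeds 1.2.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# A bad action (negative advantage) made more likely: the penalty is not capped.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))

# A bad action already made much less likely: no extra credit below a ratio of 0.8.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

Taking the minimum means the policy gets no extra credit for moving far from the old policy in a single update, which is what keeps the updates "proximal".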


Stable-Baselines3 is easy to set up:

```
# Create environment
env = gym.make('LunarLander-v2')

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))
```



In [None]:
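# Define a PPO agent with an MLP policy.
# We use a multilayer perceptron (MlpPolicy) because the input is a vector;
# if we had frames as input we would use CnnPolicy.
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,     # steps collected per environment before each update
    batch_size=64,    # minibatch size for each gradient step
    n_epochs=4,       # passes over the collected rollout per update
    gamma=0.999,      # discount factor
    gae_lambda=0.98,  # GAE bias/variance trade-off
    ent_coef=0.01,    # entropy bonus coefficient (encourages exploration)
    verbose=1,
)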

# model = DQN(policy='MlpPolicy', env=env, verbose=1)

Using cuda device


## Train the PPO agent

In [None]:
model.learn(total_timesteps=1_000_000)
model_name = f"{env_name}-ppo-1-million"

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 200      |
|    ep_rew_mean     | -200     |
| time/              |          |
|    fps             | 8647     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 16384    |
---------------------------------
-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 200           |
|    ep_rew_mean          | -200          |
| time/                   |               |
|    fps                  | 4095          |
|    iterations           | 2             |
|    time_elapsed         | 8             |
|    total_timesteps      | 32768         |
| train/                  |               |
|    approx_kl            | 0.00025686642 |
|    clip_fraction        | 0             |
|    clip_range           | 0.2           |
|    entropy_loss         | -1.1          |
|    explained_variance   | 6.1e-05       |
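
The `total_timesteps` values in the log (16384, then 32768) follow directly from the settings used above. Here is a small arithmetic sketch in plain Python. The iteration count assumes that training keeps collecting full rollouts until the requested budget is reached, which is my understanding of Stable-Baselines3's training loop rather than something shown in this notebook.

```python
import math

n_envs, n_steps = 16, 1024       # from make_vec_env and the PPO constructor
batch_size, n_epochs = 64, 4     # from the PPO constructor
total_timesteps = 1_000_000      # from model.learn

rollout_size = n_envs * n_steps                       # transitions per iteration
minibatches_per_epoch = rollout_size // batch_size    # gradient steps per epoch
gradient_steps = minibatches_per_epoch * n_epochs     # gradient steps per iteration
iterations = math.ceil(total_timesteps / rollout_size)

print(rollout_size)               # 16384, matches the first log block
print(minibatches_per_epoch)      # 256
print(gradient_steps)             # 1024
print(iterations)                 # 62 (assuming full rollouts until the budget is met)
print(iterations * rollout_size)  # 1015808 timesteps actually collected
```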


In [None]:
model.save(model_name)

## Visualise learned agent

In [None]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [None]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
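
The embedding step used by `show_videos` can be tried without a notebook or a real video. This minimal sketch uses placeholder bytes and a hypothetical helper name, `video_tag`. It shows only how the base64 payload ends up inside the `<source>` element as a data URI.

```python
import base64


def video_tag(video_bytes, label):
    """Build an HTML <video> tag that embeds the bytes as a base64 data URI."""
    payload = base64.b64encode(video_bytes).decode("ascii")
    return (
        f'<video alt="{label}" autoplay loop controls style="height: 400px;">'
        f'<source src="data:video/mp4;base64,{payload}" type="video/mp4" />'
        "</video>"
    )


fake_video = b"not a real mp4"  # placeholder bytes, for illustration only
tag = video_tag(fake_video, "demo.mp4")
print(tag)
```

Embedding the file this way keeps the video inside the notebook output itself, at the cost of a payload roughly one third larger than the original file.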

In [None]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv, SubprocVecEnv

def record_video(env_name, model, video_length=500, prefix='', video_folder='videos/'):
  """
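  :param env_name: (str) Gym environment id to record
  :param model: (RL model) trained agent used to pick actions
  :param video_length: (int) number of steps to record
  :param prefix: (str) prefix for the video file name
  :param video_folder: (str) folder where the video is saved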
  """
  eval_env = SubprocVecEnv([lambda: gym.make(env_name) for i in range(4)])
  # Start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()

In [None]:
record_video(env_name, model, video_length=1000, prefix=model_name)

Saving video to /content/videos/MountainCar-v0-ppo-1-million-step-0-to-step-1000.mp4


In [None]:
show_videos('videos', prefix=model_name)

## Evaluate the agent

💡 When you evaluate your agent, you should not use your training environment but create an evaluation environment.

In [None]:
# Create a new environment for evaluation
eval_env = gym.make(env_name)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=-200.00 +/- 0.0
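
The two numbers reported above are the mean and standard deviation of the per-episode returns. The sketch below reproduces that summary in plain Python for ten episodes that all hit the 200-step limit. `summarize_returns` is a hypothetical helper, and the use of the population standard deviation is an assumption about how the library computes it. The comparison uses the reward threshold of -110.0 printed by `query_environment` earlier.

```python
import statistics


def summarize_returns(episode_returns):
    """Return the mean and the population standard deviation of episode returns."""
    mean_reward = statistics.fmean(episode_returns)
    std_reward = statistics.pstdev(episode_returns)
    return mean_reward, std_reward


episode_returns = [-200.0] * 10  # every episode timed out without reaching the flag
mean_reward, std_reward = summarize_returns(episode_returns)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

reward_threshold = -110.0  # "Reward Threshold" from the environment query
print(mean_reward >= reward_threshold)  # False: the task is not solved
```

A standard deviation of exactly 0.0 together with a mean of -200 is the signature of an agent that never reached the goal in any evaluation episode.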


## Publish our trained model on the Hub (optional)

In [None]:
notebook_login()
!git config --global credential.helper store

Login successful
Your token has been saved to /root/.huggingface/token


If you're not using Google Colab or a Jupyter notebook, use this command instead: `huggingface-cli login`

Let's fill the `package_to_hub` function:
- `model`: our trained model.
- `model_name`: the name of the trained model, defined earlier in `model_name`
- `model_architecture`: the model architecture we used, in our case PPO
- `env_id`: the name of the environment, in our case `MountainCar-v0`
- `eval_env`: the evaluation environment, defined in `eval_env`
- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`

💡 **A good name is {username}/{model_architecture}-{env_id}**

- `commit_message`: message of the commit
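
The naming tip can be wrapped in a small helper. `build_repo_id` is a hypothetical function, not part of `huggingface_sb3`. Lowercasing the architecture is an assumption, made to match the `ThomasSimonini/ppo-LunarLander-v2` example used in this notebook.

```python
def build_repo_id(username, model_architecture, env_id):
    """Compose a Hub repo id of the form {username}/{model_architecture}-{env_id}."""
    return f"{username}/{model_architecture.lower()}-{env_id}"


print(build_repo_id("ThomasSimonini", "PPO", "LunarLander-v2"))
print(build_repo_id("your-username", "PPO", "MountainCar-v0"))
```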

In [None]:
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

# PLACE the variables you've just defined two cells above
# Define the name of the environment
env_id = env_name

# Define the model architecture we used
model_architecture = "PPO"

## Define a repo_id
## repo_id is the id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
## CHANGE WITH YOUR REPO ID
repo_id = f"harryb0905/{model_name}"

## Define the commit message
commit_message = "Upload PPO Mountain Car trained agent"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])

# PLACE the package_to_hub function you've just filled here
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model 
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository on the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
               commit_message=commit_message)


ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue and use
push_to_hub instead.


Cloning https://huggingface.co/harryb0905/dqn-MountainCar-v0-1-million into local empty directory.


Saving video to /content/-step-0-to-step-1000.mp4
ℹ Pushing repo dqn-MountainCar-v0-1-million to the Hugging Face Hub


Upload file replay.mp4:   2%|1         | 3.34k/196k [00:00<?, ?B/s]

Upload file MountainCar-v0_dqn_1_million/policy.pth:   8%|8         | 3.34k/39.5k [00:00<?, ?B/s]

Upload file MountainCar-v0_dqn_1_million/policy.optimizer.pth:   9%|8         | 3.34k/38.8k [00:00<?, ?B/s]

Upload file MountainCar-v0_dqn_1_million/pytorch_variables.pth: 100%|##########| 431/431 [00:00<?, ?B/s]

Upload file MountainCar-v0_dqn_1_million.zip:   3%|3         | 3.34k/96.5k [00:00<?, ?B/s]

To https://huggingface.co/harryb0905/dqn-MountainCar-v0-1-million
   0402fab..3e97385  main -> main



ℹ Your model is pushed to the hub. You can view your model here:
https://huggingface.co/harryb0905/dqn-MountainCar-v0-1-million


'https://huggingface.co/harryb0905/dqn-MountainCar-v0-1-million'