# Training my first Deep Reinforcement Learning Agent 🤖

In this notebook, I'll train my **first Deep Reinforcement Learning agent**: a Lunar Lander agent that will learn to **land correctly on the Moon 🌕**. I'm using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/), a Deep Reinforcement Learning library.




### The environment 🎮
- [LunarLander-v2](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)

### The library used 📚
- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)

## Install dependencies and create a virtual screen 🔽
The first step is to install the dependencies. We'll install several:

- `gym[box2d]`: Contains the LunarLander-v2 environment 🌛 (we use `gym==0.21`)
- `stable-baselines3[extra]`: The deep reinforcement learning library.
- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face 🤗 Hub.
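Because the notebook pins `gym==0.21`, it can help to confirm which versions actually got installed before going further. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names taken from the dependency list above.
for name in ("gym", "stable-baselines3", "huggingface_sb3"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `gym` reports something other than 0.21, the `env.reset()` and `env.step()` return values used later in this notebook may differ.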


In [2]:
!apt install swig cmake

Reading package lists... Done
Building dependency tree       
Reading state information... Done
cmake is already the newest version (3.16.3-1ubuntu1.20.04.1).
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 23 not upgraded.
Need to get 1,086 kB of archives.
After this operation, 5,413 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 swig4.0 amd64 4.0.1-5build1 [1,081 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 swig all 4.0.1-5build1 [5,528 B]
Fetched 1,086 kB in 1s (2,124 kB/s)
Selecting previously unselected package swig4.0.
(Reading database ... 128276 files and directories currently installed.)
Preparing to unpack .../swig4.0_4.0.1-5build1_amd64.deb ...
Unpacking swig4.0 (4.0.1-5build1) ...
Selecting previously unselected package swig.
Preparing to unpack .../swig_4.0.1-5build1_a

In [3]:
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stable-baselines3[extra]
  Downloading stable_baselines3-1.7.0-py3-none-any.whl (171 kB)
Collecting box2d-py
  Downloading box2d-py-2.3.8.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Collecting huggingface_sb3
  Downloading huggingface_sb3-2.2.4-py3-none-any.whl (9.4 kB)
Collecting pyglet==1.5.1
  Downloading pyglet-1.5.1-py2.py3-none-any.whl (1.0 MB)
Collecting importlib-metadata~=4.13
  Downloading importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting gym==0.21
  Downloading gym

In [4]:
!sudo apt-get update
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease [1,581 B]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease [3,622 B]
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease

In [5]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7f033935ef10>

## Import the packages 📦

One additional library we import is `huggingface_hub`, **which lets us upload and download trained models from the Hub**.

You can see all the Deep Reinforcement Learning models available here 👉 https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads


In [6]:
import gym

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

## Understand the LunarLander environment and how it works 🚀
### Action Space
There are four discrete actions available:
- Do nothing
- Fire left orientation engine
- Fire the main engine
- Fire right orientation engine
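The integers returned by `env.action_space.sample()` follow the order of the list above (0 to 3, as in the Gym documentation for this environment). A small sketch that gives them readable names; the `LanderAction` enum and `describe` helper are hypothetical conveniences, not part of Gym.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Discrete action indices for LunarLander-v2, in the order listed above."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action: int) -> str:
    """Turn an integer action into a readable label."""
    return LanderAction(action).name.replace("_", " ").lower()


print(describe(3))  # fire right engine
```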

### Observation Space
The state is an 8-dimensional vector:
- Horizontal coordinate of the lander (x)
- Vertical coordinate of the lander (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- Whether the left leg is in contact with the ground (boolean)
- Whether the right leg is in contact with the ground (boolean)
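To make the layout of the vector easier to read, here is a minimal sketch that unpacks an observation into named fields. The `LanderState` type and `parse_observation` helper are hypothetical; the sample values are copied from the first observation printed by the random-action loop in this notebook.

```python
from typing import NamedTuple


class LanderState(NamedTuple):
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def parse_observation(obs) -> LanderState:
    """Map the 8 raw values onto named fields, in the order listed above."""
    values = [float(v) for v in obs]
    if len(values) != 8:
        raise ValueError(f"expected 8 values, got {len(values)}")
    return LanderState(*values[:6], bool(values[6]), bool(values[7]))


first_obs = [-0.01515503, 1.4153616, -0.7601681, 0.08560633,
             0.01535855, 0.1304709, 0.0, 0.0]
state = parse_observation(first_obs)
print(state.y, state.left_leg_contact)
```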

### Rewards
- Moving from the top of the screen to the landing pad and coming to zero speed is worth about 100 to 140 points.
- Firing the main engine costs 0.3 points each frame.
- Each leg in contact with the ground is worth +10 points.
- The episode finishes if the lander crashes (an additional -100 points) or comes to rest (+100 points).
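As a rough worked example of how these terms add up, the sketch below tallies only the event-based rewards from the list (engine cost, leg contact, crash, rest). It deliberately ignores the 100 to 140 points earned for approaching the pad, so it is not the environment's actual reward function; `event_reward` is a hypothetical helper.

```python
def event_reward(main_engine_frames: int, legs_in_contact: int,
                 crashed: bool = False, came_to_rest: bool = False) -> float:
    """Sum the event-based reward terms listed above (approach reward excluded)."""
    total = -0.3 * main_engine_frames   # main engine cost per frame
    total += 10 * legs_in_contact       # bonus per leg touching the ground
    if crashed:
        total -= 100                    # crash penalty
    if came_to_rest:
        total += 100                    # bonus for coming to rest
    return total


# 100 frames of main engine, both legs down, lander at rest.
print(event_reward(100, 2, came_to_rest=True))
```

Adding the approach reward on top of this tally gives a sense of why a score above 200 is usually treated as a successful landing.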

In [9]:
# create the environment
env = gym.make("LunarLander-v2")

# reset the environment to initial state
observation = env.reset()

for _ in range(10):
  print("---------------------------")

  # take a random action
  action = env.action_space.sample()
  print("Action taken: ", action)

  observation, reward, done, info = env.step(action)
  print("Observation: ", observation)
  print("Reward: ", reward)

  # If the episode is done (the lander landed, crashed, or timed out)
  if done:
    # Reset the environment
    print("Environment is reset")
    observation = env.reset()


---------------------------
Action taken:  3
Observation:  [-0.01515503  1.4153616  -0.7601681   0.08560633  0.01535855  0.1304709
  0.          0.        ]
Reward:  0.6025386020033647
---------------------------
Action taken:  2
Observation:  [-0.0227375   1.417569   -0.76466954  0.09802797  0.02169633  0.12676743
  0.          0.        ]
Reward:  -1.7600620968701322
---------------------------
Action taken:  2
Observation:  [-0.03032856  1.4197516  -0.76551473  0.09689875  0.02803298  0.12674493
  0.          0.        ]
Reward:  -1.2356458812990525
---------------------------
Action taken:  2
Observation:  [-0.03810682  1.4228346  -0.78344953  0.13690223  0.03360759  0.11150227
  0.          0.        ]
Reward:  -3.554176033630415
---------------------------
Action taken:  2
Observation:  [-0.04606523  1.4261175  -0.80069077  0.1457906   0.03840673  0.09599131
  0.          0.        ]
Reward:  -2.985006956070708
---------------------------
Action taken:  0
Observation:  [-0.054023

### Vectorized Environment
- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments. This way, **we'll have more diverse experiences during training.**

In [10]:
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)
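To make the idea of stacking environments concrete, here is a toy sketch in plain Python. It is not how Stable-Baselines3 implements its vectorized environments (those return NumPy arrays and per-environment info dicts); `CountdownEnv` and `ToyVecEnv` are hypothetical classes that only illustrate stepping several independent environments in lockstep and resetting each one as soon as its episode ends.

```python
import random


class CountdownEnv:
    """Toy environment: an episode ends after a random number of steps."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)
        self.remaining = 0

    def reset(self):
        self.remaining = self.rng.randint(2, 5)
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, 1.0, done, {}


class ToyVecEnv:
    """Steps several independent environments together, auto-resetting finished ones."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, _ = env.step(action)
            if done:
                obs = env.reset()  # start a new episode right away
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


vec_env = ToyVecEnv([lambda seed=i: CountdownEnv(seed) for i in range(16)])
observations = vec_env.reset()
observations, rewards, dones = vec_env.step([0] * 16)
print(len(observations), len(rewards), len(dones))  # 16 16 16
```

Each call to `step` yields one transition per environment, which is why 16 environments produce 16 times as much experience per step as a single one.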

## Create the Model 🤖
- Now that we have studied our environment and understood the problem, **landing the Lunar Lander correctly on the landing pad by controlling the left orientation, right orientation, and main engines**, let's build the algorithm we're going to use to solve it 🚀.

- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).

- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.

---

💡 A good habit when using a new library is to dive into the documentation first (https://stable-baselines3.readthedocs.io/en/master/) and then try some tutorials.

----

To solve this problem, we're going to use SB3 **PPO**. PPO (aka Proximal Policy Optimization) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms.

PPO is a combination of:
- *Value-based reinforcement learning method*: learning an action-value function that will tell us the **most valuable action to take given a state**.
- *Policy-based reinforcement learning method*: learning a policy that will **give us a probability distribution over actions**.
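On the policy side, what gives PPO its name is that each update is kept "proximal": the ratio between the new and old action probabilities is clipped so that a single update cannot move the policy too far. The sketch below shows that clipped objective for one sample, as described in the PPO paper. It is an illustration, not SB3's implementation; the function name is hypothetical, and the default `clip_range` of 0.2 matches the value shown in this notebook's training log.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped objective for one sample.

    ratio:     new_policy_prob / old_policy_prob for the action taken
    advantage: how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50%: the gain is capped at 1.2x.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A bad action whose probability already halved: the objective is held at the clipped value.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```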

In [21]:
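model = PPO(
    policy='MlpPolicy',  # fully connected (MLP) actor and critic networks
    env=env,
    n_steps=1024,        # steps collected per environment before each update
    batch_size=32,       # minibatch size for each gradient step
    n_epochs=4,          # optimization passes over each rollout
    gamma=0.999,         # discount factor
    gae_lambda=0.98,     # bias/variance trade-off for GAE
    ent_coef=0.01,       # entropy bonus coefficient (encourages exploration)
    verbose=1)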

Using cpu device


In [22]:
# Train it for 1,500,000 timesteps
model.learn(total_timesteps=1500000)
# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 88.7     |
|    ep_rew_mean     | -190     |
| time/              |          |
|    fps             | 4597     |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 16384    |
---------------------------------
---------------------------------------
| rollout/                |           |
|    ep_len_mean          | 94.8      |
|    ep_rew_mean          | -158      |
| time/                   |           |
|    fps                  | 2468      |
|    iterations           | 2         |
|    time_elapsed         | 13        |
|    total_timesteps      | 32768     |
| train/                  |           |
|    approx_kl            | 0.009522  |
|    clip_fraction        | 0.072     |
|    clip_range           | 0.2       |
|    entropy_loss         | -1.38     |
|    explained_variance   | -0.000881 |
|    learning_rate        | 0.0003    |
|    loss           
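The `total_timesteps` values in the log (16384, then 32768) follow directly from the hyperparameters: each iteration collects `n_steps` transitions from each of the 16 environments. The arithmetic can be sketched as follows; the iteration count assumes training stops only at a rollout boundary, once the requested number of timesteps has been reached.

```python
import math

# Values used in this notebook.
n_envs = 16
n_steps = 1024
batch_size = 32
n_epochs = 4
total_timesteps = 1_500_000

rollout_size = n_envs * n_steps                                   # 16384 transitions per iteration
minibatches_per_epoch = rollout_size // batch_size                # 512
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs   # 2048
iterations = math.ceil(total_timesteps / rollout_size)            # 92
timesteps_collected = iterations * rollout_size                   # 1507328

print(rollout_size, gradient_steps_per_iteration, iterations, timesteps_collected)
```

So the run slightly overshoots the requested 1,500,000 timesteps, because the last rollout is collected in full.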

## Evaluate the agent 📈
- Now that our Lunar Lander agent is trained 🚀, we need to **check its performance**.
- Stable-Baselines3 provides a method to do that: `evaluate_policy`.
- Its usage is shown in [the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading).


💡 When you evaluate your agent, you should not use your training environment but create an evaluation environment.

In [23]:
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=273.46 +/- 18.11657668450089
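The two numbers reported here are the mean and the standard deviation of the total reward per evaluation episode. A minimal sketch of that summary, assuming the population standard deviation (NumPy's default for `np.std`); the returns below are hypothetical values chosen for illustration, not results from this run.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of per-episode returns."""
    return fmean(episode_returns), pstdev(episode_returns)


# Hypothetical per-episode returns, for illustration only.
hypothetical_returns = [250.0, 270.0, 290.0]
mean_reward, std_reward = summarize_returns(hypothetical_returns)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")  # mean_reward=270.00 +/- 16.33
```

A small standard deviation relative to the mean suggests the agent lands consistently rather than alternating between good landings and crashes.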


## Publish the trained model on the Hugging Face Hub

In [24]:
notebook_login()
!git config --global credential.helper store

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Let's fill in the arguments of the `package_to_hub` function:
- `model`: our trained model.
- `model_name`: the name of the trained model that we defined when calling `model.save`.
- `model_architecture`: the model architecture we used, in our case PPO.
- `env_id`: the name of the environment, in our case `LunarLander-v2`.
- `eval_env`: the evaluation environment defined in `eval_env`.
- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated `(repo_id = {username}/{repo_name})`.

💡 **A good name is {username}/{model_architecture}-{env_id}**

- `commit_message`: the message of the commit.
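The naming tip above can be expressed as a small helper. This is only a sketch: `build_repo_id` is a hypothetical function, and lowercasing the architecture is an assumption made to match the repository name used in this notebook.

```python
def build_repo_id(username: str, model_architecture: str, env_id: str) -> str:
    """Build a repo id of the form {username}/{model_architecture}-{env_id}."""
    if not username or "/" in username:
        raise ValueError("username must be non-empty and must not contain '/'")
    # Lowercasing the architecture is an assumption, matching this notebook's repo name.
    return f"{username}/{model_architecture.lower()}-{env_id}"


print(build_repo_id("gokcenazakyol", "PPO", "LunarLander-v2"))
# gokcenazakyol/ppo-LunarLander-v2
```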

In [25]:
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import package_to_hub

# repo_id is the id of the model repository on the Hugging Face Hub
# (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = "gokcenazakyol/ppo-LunarLander-v2"

# The name of the environment
env_id = "LunarLander-v2"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])


# The model architecture we used
model_architecture = "PPO"

# The commit message
commit_message = "Push LunarLander-v2 model"

# package_to_hub saves and evaluates the model, generates a model card, and records a replay video of the agent before pushing the repo to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation environment
               repo_id=repo_id, # id of the model repository on the Hugging Face Hub
               commit_message=commit_message)

# Note: if package_to_hub fails with a rebasing issue, run the following:
# cd <path_to_repo> && git add . && git commit -m "Add message" && git pull
# And don't forget to do a "git push" at the end to push the change to the Hub.

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.




Saving video to /tmp/tmpuftdq4d8/-step-0-to-step-1000.mp4
ℹ Pushing repo gokcenazakyol/ppo-LunarLander-v2 to the Hugging Face
Hub


pytorch_variables.pth:   0%|          | 0.00/431 [00:00<?, ?B/s]

policy.optimizer.pth:   0%|          | 0.00/87.5k [00:00<?, ?B/s]

policy.pth:   0%|          | 0.00/43.3k [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

ppo-LunarLander-v2.zip:   0%|          | 0.00/147k [00:00<?, ?B/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/gokcenazakyol/ppo-LunarLander-v2/tree/main/


'https://huggingface.co/gokcenazakyol/ppo-LunarLander-v2/tree/main/'

⬇️ Here is an example of what **I achieved** ⬇️

In [27]:
%%html
<video controls autoplay><source src="https://huggingface.co/gokcenazakyol/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>