## Welcome to the Reinforcement Learning 101 have-a-go tutorial!

Today we'll work through an example notebook that trains an agent to land on the moon. We will use the Box2D-based LunarLander environment from the Gym library and the PPO algorithm from Stable Baselines3. We will also need the Hugging Face `huggingface_sb3` package to share our model on the Hub.

### Steps:

1. Getting set up
2. Getting familiar with the environment
3. Testing the environment
4. Building the model
5. Training the model
6. Testing the model
7. Uploading it to the Hub and rendering it

Let's have a go!

In [None]:
!pip install stable-baselines3[extra] 
!pip install gym[box2d] 
!pip install huggingface_sb3
!pip install ale-py==0.7.4 # To overcome an issue with gym (https://github.com/DLR-RM/stable-baselines3/issues/875)
!pip install pickle5

# Virtual display dependencies
!sudo apt-get update
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

This is what our agent interacting with the environment will look like...

In [None]:
%%html
<!-- Example video (the %%html cell magic must be the first line of the cell) -->
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

In [None]:
# import dependencies
import gym 
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv  # wraps a single environment as a vectorised one (some algorithms and utilities expect this)
from stable_baselines3.common.evaluation import evaluate_policy  # evaluates a trained model
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login

#import numpy as np
#import matplotlib.pyplot as plt
#from IPython import display as ipythondisplay


The `Observation Space Shape (8,)` printed by the cell below tells us that the observation is a vector of size 8, where each value holds a different piece of information about the lander:
- Horizontal pad coordinate (x)
- Vertical pad coordinate (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- Whether the left leg is in contact with the ground
- Whether the right leg is in contact with the ground
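
To make the layout above concrete, here is a minimal sketch that gives each of the 8 values a name. `LanderObservation` and `decode_observation` are hypothetical helpers made up for this tutorial (they are not part of Gym), and the example values are invented rather than taken from a real episode.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical helper: one named field per entry of the 8-value observation."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode_observation(obs: Sequence[float]) -> LanderObservation:
    """Unpack an 8-value observation in the order listed above."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, left, right = (float(v) for v in obs)
    # The two leg entries act as flags: treat anything above 0.5 as "in contact"
    return LanderObservation(x, y, vx, vy, angle, angular_velocity, left > 0.5, right > 0.5)


# Invented example values, for illustration only
example = decode_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example)
```

Once you have a real environment, you could pass the array returned by `env.reset()` to the same function.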

In [None]:
# We create our environment with gym.make("<name_of_the_environment>")
environment_name = 'LunarLander-v2'
env = gym.make(environment_name)
env.reset()
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

Observation Space Shape (8,)
Sample observation [ 0.18261679  0.04823878 -0.61862314  0.00383941 -0.981477    2.1404572
  0.00590046  1.1117045 ]


The action space (the set of possible actions the agent can take) is discrete, with 4 actions available 🎮:

- Do nothing,
- Fire left orientation engine,
- Fire the main engine,
- Fire right orientation engine.
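
A small sketch of how the integer returned by `env.action_space.sample()` could be turned into a readable name. `LanderAction` is a hypothetical enum written for this tutorial, and it assumes the action indices 0 to 3 follow the order of the list above.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the 4 discrete actions (assumed to follow the list order)."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


sampled_action = 3  # stand-in for a value returned by env.action_space.sample()
print(LanderAction(sampled_action).name)
```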

Reward function (the function that gives a reward at each timestep) 💰:

- Moving from the top of the screen to the landing pad and reaching zero speed earns about 100 to 140 points.
- Firing the main engine costs 0.3 points each frame.
- Each leg in contact with the ground is worth +10 points.
- The episode finishes if the lander crashes (an additional -100 points) or comes to rest (+100 points).
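
To get a feel for the numbers, here is a rough tally of the fixed terms listed above. This is only an illustration, not the environment's actual reward code: `tally_listed_terms` is a hypothetical function, it ignores the 100 to 140 points earned for approaching the pad, and the frame counts are invented.

```python
MAIN_ENGINE_COST = -0.3    # per frame with the main engine firing
LEG_CONTACT_BONUS = 10.0   # per leg touching the ground
CRASH_PENALTY = -100.0
REST_BONUS = 100.0


def tally_listed_terms(main_engine_frames: int, legs_in_contact: int, outcome: str) -> float:
    """Add up only the fixed terms from the list above (hypothetical helper)."""
    total = main_engine_frames * MAIN_ENGINE_COST + legs_in_contact * LEG_CONTACT_BONUS
    if outcome == "crash":
        total += CRASH_PENALTY
    elif outcome == "rest":
        total += REST_BONUS
    return total


# Invented scenarios: a soft landing after 100 frames of thrust, and an early crash
soft_landing = tally_listed_terms(main_engine_frames=100, legs_in_contact=2, outcome="rest")
early_crash = tally_listed_terms(main_engine_frames=20, legs_in_contact=0, outcome="crash")
print(f"soft landing: {soft_landing:.1f}, early crash: {early_crash:.1f}")
```

The takeaway is that fuel use is cheap compared with the terminal bonus or penalty, so how the episode ends dominates the score.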

In [None]:
# Explore the action space
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 3


In [None]:
# within the environment we generated, let's take a bunch of steps to test it out
# try to land 10 times
episodes = 10
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        #env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:-349.3061014611998
Episode:2 Score:-222.80266070166454
Episode:3 Score:-130.50956246096473
Episode:4 Score:-127.72743794116559
Episode:5 Score:-86.26976025850452
Episode:6 Score:-97.81366725380178
Episode:7 Score:-426.91490562120674
Episode:8 Score:-155.82354952131868
Episode:9 Score:-227.1455477807237
Episode:10 Score:-200.41219603863527
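
The ten scores above make a handy baseline for a purely random policy. A minimal sketch that summarises them (the values are copied from the output above, and the variable names are our own):

```python
from statistics import mean, stdev

# Scores printed by the random-action run above
random_scores = [
    -349.3061014611998, -222.80266070166454, -130.50956246096473,
    -127.72743794116559, -86.26976025850452, -97.81366725380178,
    -426.91490562120674, -155.82354952131868, -227.1455477807237,
    -200.41219603863527,
]

baseline_mean = mean(random_scores)
baseline_std = stdev(random_scores)  # sample standard deviation
print(f"random policy: mean={baseline_mean:.2f}, std={baseline_std:.2f}")
```

Any trained agent should comfortably beat this average.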


### Building the model

In [None]:
# Build the model
# Create a vectorised environment: 16 copies give more diverse training experience
env = make_vec_env('LunarLander-v2', n_envs=16)
model = PPO('MlpPolicy', env, verbose=1)

Using cpu device


### Training the model

In [None]:
# train model - https://stable-baselines3.readthedocs.io/en/master/common/logger.html - how to read output
model.learn(total_timesteps=500000)



---------------------------------
| rollout/           |          |
|    ep_len_mean     | 94.5     |
|    ep_rew_mean     | -182     |
| time/              |          |
|    fps             | 3704     |
|    iterations      | 1        |
|    time_elapsed    | 8        |
|    total_timesteps | 32768    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 96.3        |
|    ep_rew_mean          | -143        |
| time/                   |             |
|    fps                  | 1855        |
|    iterations           | 2           |
|    time_elapsed         | 35          |
|    total_timesteps      | 65536       |
| train/                  |             |
|    approx_kl            | 0.010343928 |
|    clip_fraction        | 0.0842      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.00391     |
|    learning_rate        | 0.

<stable_baselines3.ppo.ppo.PPO at 0x7fd2ecf2f2d0>
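
The `total_timesteps` column in the log grows in steps of 32768. A quick sketch of where that number likely comes from, assuming PPO's default rollout length of `n_steps=2048` per environment (the notebook does not set it explicitly):

```python
import math

n_envs = 16                 # from make_vec_env(..., n_envs=16)
n_steps = 2048              # assumption: PPO's default rollout length per environment
total_timesteps = 500_000   # value passed to model.learn()

steps_per_iteration = n_envs * n_steps
iterations = math.ceil(total_timesteps / steps_per_iteration)
timesteps_collected = iterations * steps_per_iteration

print(f"steps per iteration: {steps_per_iteration}")
print(f"iterations needed:   {iterations}")
print(f"timesteps collected: {timesteps_collected}")
```

Because data is collected in whole rollouts, training can run slightly past the requested 500,000 steps.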

In [None]:
# save trained model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

### Evaluating the model

In [None]:
# Create a new environment for evaluation
eval_env = gym.make('LunarLander-v2')

# Evaluate the model with 10 evaluation episodes and deterministic=True
mean_reward, std_reward =  evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

# Print the results
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")




mean_reward=249.91 +/- 21.53714682555622
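
One way to read this result is to subtract the standard deviation from the mean to get a conservative score. The sketch below uses the numbers printed above; the 200-point "solved" threshold for LunarLander and the mean-minus-std convention are assumptions on our part, not something the notebook defines.

```python
mean_reward = 249.91            # printed above (rounded to 2 decimals)
std_reward = 21.53714682555622  # printed above

SOLVED_THRESHOLD = 200.0        # assumption: commonly quoted "solved" score for LunarLander

conservative_score = mean_reward - std_reward
print(f"conservative score: {conservative_score:.2f}")
print("above threshold" if conservative_score >= SOLVED_THRESHOLD else "below threshold")
```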


### Posting to the Hub

In [None]:
notebook_login()
!git config --global credential.helper store

Login successful
Your token has been saved to /root/.huggingface/token


In [None]:
# Define the name of the environment
env_id = "LunarLander-v2"


# Define the model architecture we used
model_architecture = "PPO"

# Define a repo_id: the id of the model repository on the Hugging Face Hub,
# in the form {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2
# CHANGE THIS TO YOUR OWN REPO ID
repo_id = "your-user-name/ppo-LunarLander-v2"

# Define the commit message
commit_message = "Update PPO LunarLander-v2 trained agent"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])


# Package the trained model and push it to the Hub
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case PPO
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation environment
               repo_id=repo_id, # id of the model repository on the Hugging Face Hub ({organization}/{repo_name})
               commit_message=commit_message)


Some boilerplate code to load a pre-trained model from the Hub

In [None]:
from huggingface_sb3 import load_from_hub
repo_id = "" # The repo_id
filename = "" # The model filename.zip

# When the model was trained on Python 3.8, the pickle protocol is 5,
# but Python 3.6 and 3.7 use protocol 4.
# To stay compatible we need to:
# 1. Install pickle5 (we did this at the beginning of the notebook)
# 2. Create a custom objects dictionary that we pass as a parameter to PPO.load()
custom_objects = {
            "learning_rate": 0.0,
            "lr_schedule": lambda _: 0.0,
            "clip_range": lambda _: 0.0,
}

checkpoint = load_from_hub(repo_id, filename)
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)

# Evaluate this model
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

Rendering our agent in a Google Colab setting using the `colabgymrender` package


In [None]:
# install dependencies
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!sudo pip3 install imageio==2.4.1
!pip install colabgymrender==1.0.2

In [None]:
from colabgymrender.recorder import Recorder

directory = './video'
env = Recorder(eval_env, directory)

obs = env.reset()
done = False
while not done:
    action, _state = model.predict(obs)
    obs, reward, done, info = env.step(action)
env.play()