In [1]:
%%html
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>

In [2]:
# Create a VirtualScreen to render the video
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libnvidia-common-460 nsight-compute-2020.2.0
Use 'apt autoremove' to remove them.
Suggested packages:
  libgle3
The following NEW packages will be installed:
  python-opengl
0 upgraded, 1 newly installed, 0 to remove and 42 not upgraded.
Need to get 496 kB of archives.
After this operation, 5,416 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-opengl all 3.1.0+dfsg-1 [496 kB]
Fetched 496 kB in 2s (294 kB/s)
Selecting previously unselected package python-opengl.
(Reading database ... 155203 files and directories currently installed.)
Preparing to unpack .../python-opengl_3.1.0+dfsg-1_all.deb ...
Unpacking python-opengl (3.1.0+dfsg-1) ...
Setting up python-opengl (3.1.0+dfsg-1) ...
Reading package lists... Done
Building dependency tree       
Reading s

<pyvirtualdisplay.display.Display at 0x7f03a4036ed0>

In [3]:
# install dependencies
!pip install gym[box2d]
!pip install stable-baselines3[extra]
!pip install huggingface_sb3
!pip install pyglet
!pip install ale-py==0.7.4 # To overcome an issue with gym (https://github.com/DLR-RM/stable-baselines3/issues/875)

Collecting box2d-py~=2.3.5
  Downloading box2d_py-2.3.8-cp37-cp37m-manylinux1_x86_64.whl (448 kB)
Installing collected packages: box2d-py
Successfully installed box2d-py-2.3.8
Collecting stable-baselines3[extra]
  Downloading stable_baselines3-1.5.0-py3-none-any.whl (177 kB)
Collecting gym==0.21
  Downloading gym-0.21.0.tar.gz (1.5 MB)
Collecting autorom[accept-rom-license]~=0.4.2
  Downloading AutoROM-0.4.2-py3-none-any.whl (16 kB)
Collecting ale-py~=0.7.4
  Downloading ale_py-0.7.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Collecting AutoROM.accept-rom-license
  Downloading AutoROM.accept-rom-license-0.4.2.tar.gz (9.8 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ...

In [4]:
import gym

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

In [5]:
env = gym.make('LunarLander-v2')

observation = env.reset()

for _ in range(20):
    action = env.action_space.sample()
    print(f'Action taken: {action}')

    observation, reward, done, info = env.step(action)
    if done:
        print('Environment is reset')
        observation = env.reset()

Action taken: 3
Action taken: 2
Action taken: 0
Action taken: 0
Action taken: 0
Action taken: 2
Action taken: 3
Action taken: 0
Action taken: 1
Action taken: 3
Action taken: 2
Action taken: 2
Action taken: 3
Action taken: 3
Action taken: 1
Action taken: 0
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 1


# Create the LunarLander environment
The agent must land the lander on the moon. To do so, it has to control its speed and position (horizontal, vertical and angular).

In [6]:
env = gym.make('LunarLander-v2')
env.reset()
print('OBSERVATION SPACE:')
print(f'Observation space shape: {env.observation_space.shape}')
print(f'Sample observation: {env.observation_space.sample()}')

OBSERVATION SPACE:
Observation space shape: (8,)
Sample observation: [-1.341686   -0.3168316  -0.420112    0.40027493  0.3433369   1.7649101
  0.00942347 -0.11004459]


Each observation is a vector of 8 values:
- horizontal coordinate of the lander (x); the landing pad is at the origin
- vertical coordinate of the lander (y)
- horizontal speed (x)
- vertical speed (y)
- angle
- angular speed
- whether the left leg is in contact with the ground
- whether the right leg is in contact with the ground

Docs: https://www.gymlibrary.ml/environments/box2d/lunar_lander/
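As a minimal sketch, the eight entries listed above can be given names so that later code reads more clearly. `LanderObservation` and `label_observation` are hypothetical helpers, not part of Gym; the example values are made up, and treating a contact flag as true when it exceeds 0.5 is an assumption.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Named view of the 8-value LunarLander observation (hypothetical helper)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def label_observation(obs: Sequence[float]) -> LanderObservation:
    """Attach field names to a raw observation vector of length 8."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, left, right = (float(v) for v in obs)
    # Assumption: the contact entries are 0.0 or 1.0, so 0.5 separates them.
    return LanderObservation(x, y, vx, vy, angle, angular_velocity,
                             left > 0.5, right > 0.5)


# Made-up values for illustration only (not taken from a real episode).
example = label_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example)
```

The same function should also accept the array returned by `env.reset()`, since it only iterates over the values.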

In [7]:
# Now check the action space
print('ACTION SPACE')
print(f'Action space shape: {env.action_space.n}')
print(f'Action space sample: {env.action_space.sample()}')

ACTION SPACE
Action space shape: 4
Action space sample: 1


There are 4 possible actions:
- do nothing
- fire the left orientation engine
- fire the main engine
- fire the right orientation engine

Reward function:
- moving from the top of the screen to the landing pad and coming to rest at zero speed is worth about 100-140 points
- firing the main engine costs 0.3 points each frame
- each leg in contact with the ground is worth +10 points
- the episode finishes if the lander crashes (-100 points) or comes to rest (+100 points)
- the game is considered solved if your agent scores 200 points
## Create vectorized environment
We create a vectorized environment (a way of stacking multiple independent environments into a single env) made of 16 envs, so that the agent collects more diverse experience during training.

In [8]:
env = make_vec_env('LunarLander-v2', n_envs=16)
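To illustrate what "stacking" means, here is a toy sketch in plain Python. It is not how Stable Baselines3 implements vectorized environments (the real ones return NumPy arrays and per-env info dicts); `CounterEnv` and `ToyVecEnv` are hypothetical classes written only for this example. The point is that one `step` call advances every sub-environment, and a sub-environment that finishes is reset automatically.

```python
class CounterEnv:
    """Toy env: the observation is a step counter; the episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class ToyVecEnv:
    """Steps several independent envs together and returns one list per quantity."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, _ = env.step(action)
            if done:
                # Auto-reset so the batch always holds a valid observation.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones


# Three envs with different episode lengths, so they finish at different times.
vec_env = ToyVecEnv([lambda h=h: CounterEnv(h) for h in (2, 3, 4)])
first_obs = vec_env.reset()
obs1, rew1, done1 = vec_env.step([0, 0, 0])
obs2, rew2, done2 = vec_env.step([0, 0, 0])
print(obs2, done2)
```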

To solve the problem, we use the Stable Baselines3 library, which is built on PyTorch.

More specifically, we use the Proximal Policy Optimization (PPO) algorithm to train the agent: https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example

PPO is a combination of:
- a value-based RL method: learn a value function that tells us how valuable it is to take a given action in a given state
- a policy-based RL method: learn a policy that gives us a probability distribution over actions.
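As a rough illustration of the policy-based part, the sketch below turns four raw scores (one per action) into a probability distribution with a softmax and samples an action from it. This is a simplified stand-in, not the network that `MlpPolicy` builds; the logits are made-up numbers, and `softmax` and `sample_action` are hypothetical helpers.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Equal scores give a uniform distribution over the 4 actions.
uniform = softmax([0.0, 0.0, 0.0, 0.0])

# Made-up scores: the first action is the most likely one.
probs = softmax([2.0, 0.5, 0.5, -1.0])
action = sample_action(probs, random.Random(0))
print(probs, action)
```

Sampling from the distribution corresponds to a stochastic policy; always picking the most probable action would correspond to evaluating with `deterministic=True`.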

The model is instantiated first, and then trained by calling `model.learn` with the number of training timesteps.

In [16]:
model = PPO('MlpPolicy',
            env,
            n_steps=1024,
            batch_size=64,
            n_epochs=5,
            gamma=0.999,
            gae_lambda=0.98,
            ent_coef=0.01,
            verbose=1)

Using cuda device


In [17]:
model.learn(total_timesteps=500_000)
model_name = 'ppo-LunarLander-v2'
model.save(model_name)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 93.5     |
|    ep_rew_mean     | -203     |
| time/              |          |
|    fps             | 3015     |
|    iterations      | 1        |
|    time_elapsed    | 5        |
|    total_timesteps | 16384    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 92.3        |
|    ep_rew_mean          | -155        |
| time/                   |             |
|    fps                  | 1718        |
|    iterations           | 2           |
|    time_elapsed         | 19          |
|    total_timesteps      | 32768       |
| train/                  |             |
|    approx_kl            | 0.008803392 |
|    clip_fraction        | 0.0919      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | -0.002      |
|    learning_rate        | 0.
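The `total_timesteps` values in the log (16384 after the first iteration, 32768 after the second) follow from the hyperparameters: each iteration collects `n_steps` steps from each of the 16 environments. The arithmetic below is a sketch; the iteration count assumes that training keeps collecting whole rollouts until the requested budget is reached, which is an assumption about the training loop rather than something this notebook shows.

```python
import math

# Values taken from the cells above.
n_envs = 16
n_steps = 1024
batch_size = 64
n_epochs = 5
total_timesteps = 500_000

# Transitions collected per iteration (matches the first log entry).
rollout_size = n_envs * n_steps

# Each epoch splits the rollout into minibatches of `batch_size` samples.
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs

# Assumption: whole rollouts are collected until the budget is reached.
iterations = math.ceil(total_timesteps / rollout_size)
timesteps_collected = iterations * rollout_size

print(rollout_size, minibatches_per_epoch, gradient_steps_per_iteration)
print(iterations, timesteps_collected)
```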

In [18]:
# Evaluate the trained agent on a separate, single evaluation environment
eval_env = gym.make('LunarLander-v2')
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f'Mean reward: {mean_reward} +/- {std_reward}')

Mean reward: 183.0706624 +/- 72.22611624444878
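The printed value has the form mean +/- standard deviation over the evaluation episodes. The sketch below shows how such a summary can be computed from a list of episode returns and compared with the 200-point criterion mentioned earlier. `summarize_returns` is a hypothetical helper, the returns are made-up numbers, and using the population standard deviation is an assumption.

```python
import statistics


def summarize_returns(episode_returns, solved_threshold=200.0):
    """Return (mean, std, solved) for a list of per-episode total rewards."""
    mean = statistics.mean(episode_returns)
    # Assumption: population standard deviation, as in NumPy's default `std`.
    std = statistics.pstdev(episode_returns)
    return mean, std, mean >= solved_threshold


# Made-up episode returns, for illustration only.
mean, std, solved = summarize_returns([250.0, 150.0, 200.0, 100.0])
print(f"Mean reward: {mean} +/- {std:.2f}, solved: {solved}")
```

By the same criterion, the mean of about 183 reported above falls short of the 200 points needed to call the game solved, and the standard deviation of about 72 shows that individual episodes vary widely.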


In [19]:
notebook_login()
!git config --global credential.helper store

Login successful
Your token has been saved to /root/.huggingface/token


In [20]:
from stable_baselines3.common.vec_env import DummyVecEnv

# Environment id and algorithm name passed to package_to_hub
env_id = 'LunarLander-v2'
model_architecture = 'PPO'

# Target repository on the Hub and the commit message for the upload
repo_id = 'lc-col/FirstTest-LunarLander-v2'
commit_msg = 'PPO for LunarLander-v2 trained agent, first test'

# Evaluation environment used to evaluate the agent and record the replay video
eval_env = DummyVecEnv([lambda: gym.make(env_id)])

package_to_hub(model=model,
               model_name=model_name,
               model_architecture=model_architecture,
               env_id=env_id,
               eval_env=eval_env,
               repo_id=repo_id,
               commit_message=commit_msg)

ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: If you encounter a bug, please open an issue and use
push_to_hub instead.


Cloning https://huggingface.co/lc-col/FirstTest-LunarLander-v2 into local empty directory.


Saving video to /content/-step-0-to-step-1000.mp4
ℹ Pushing repo FirstTest-LunarLander-v2 to the Hugging Face Hub


Upload file replay.mp4:   1%|1         | 3.34k/234k [00:00<?, ?B/s]

Upload file ppo-LunarLander-v2.zip:   2%|2         | 3.34k/141k [00:00<?, ?B/s]

Upload file ppo-LunarLander-v2/policy.optimizer.pth:   4%|4         | 3.34k/82.8k [00:00<?, ?B/s]

Upload file ppo-LunarLander-v2/policy.pth:   8%|7         | 3.34k/42.2k [00:00<?, ?B/s]

Upload file ppo-LunarLander-v2/pytorch_variables.pth: 100%|##########| 431/431 [00:00<?, ?B/s]

remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/lc-col/FirstTest-LunarLander-v2
   0239bb0..6c23218  main -> main



ℹ Your model is pushed to the hub. You can view your model here:
https://huggingface.co/lc-col/FirstTest-LunarLander-v2


'https://huggingface.co/lc-col/FirstTest-LunarLander-v2'