<a href="https://colab.research.google.com/github/migolan/RL-notebooks/blob/main/HF_RL_unit1_distilled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on the Hugging Face Deep RL course (https://huggingface.co/learn/deep-rl-course). It uses:
* gymnasium to create the [lunar lander](https://huggingface.co/sb3/ppo-LunarLander-v2/resolve/main/replay.mp4) environment
* stable_baselines3 to create a PPO RL agent
* huggingface_hub and huggingface_sb3 to upload models to and download models from the Hugging Face Hub

# Installations

In [None]:
# System packages: build tools (swig is needed to compile Box2D) and headless rendering support
!sudo apt-get update
!sudo apt-get install -y swig cmake python3-opengl ffmpeg xvfb
# Python packages
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt
!pip install pyvirtualdisplay
!pip install --upgrade ipykernel
import os
# Uncomment to force a runtime restart so that the newly installed packages are used
# os.kill(os.getpid(), 9)

# Imports

In [3]:
# RL environments library
import gymnasium as gym

# RL agents library https://stable-baselines3.readthedocs.io/en/master/
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

from huggingface_hub import notebook_login # log in to our Hugging Face account so that we can upload models to the Hub
from huggingface_sb3 import load_from_hub, package_to_hub # upload and download trained models from the Hub
# Available deep reinforcement learning models are listed at https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads

# for visualization
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7c32112b7c40>

# Explore the LunarLander-v2 environment

Create environment

In [8]:
env_id = "LunarLander-v2"  # https://gymnasium.farama.org/environments/box2d/lunar_lander/
env = gym.make(env_id)
observation, info = env.reset()

State and action spaces

In [5]:
print("Observation Space Shape:", env.observation_space.shape)
print("Observation Space Sample:", env.observation_space.sample()) # Get a random observation
print("Action Space Shape:", env.action_space.n)
print("Action Space Sample:", env.action_space.sample()) # sample a random action

Observation Space Shape: (8,)
Observation Space Sample: [15.688143   35.103764   -3.477983   -3.7590947   1.2247837  -4.8451624
  0.81280357  0.29216114]
Action Space Shape: 4
Action Space Sample: 0


Environment rollout

In [9]:
observation, info = env.reset()

for _ in range(5):
  # sample a random action
  action = env.action_space.sample()
  print("Action taken:", action)

  # perform action and observe state and reward
  observation, reward, terminated, truncated, info = env.step(action)
  # https://gymnasium.farama.org/api/env/#gymnasium.Env.step
  print(f"Observation: {observation}")
  print(f"Reward: {reward}")
  if info:
    print(f"Info: {info}")

  if terminated or truncated:
      print("Environment is reset")
      observation, info = env.reset()

env.close()

Action taken: 2
Observation: [ 0.00320358  1.3872495   0.15871033 -0.5169701  -0.004113   -0.04441812
  0.          0.        ]
Reward: 2.583969659896792
Action taken: 3
Observation: [ 0.0048398   1.3750271   0.16767475 -0.5432326  -0.00812759 -0.08029964
  0.          0.        ]
Reward: -1.9834527836584346
Action taken: 3
Observation: [ 0.00655556  1.3621925   0.17766216 -0.5704695  -0.01414298 -0.12031905
  0.          0.        ]
Reward: -2.246094417031627
Action taken: 1
Observation: [ 0.00819874  1.3487482   0.1685482  -0.59757644 -0.01832837 -0.08371522
  0.          0.        ]
Reward: -1.4447394261140107
Action taken: 0
Observation: [ 0.00984211  1.3347039   0.16856004 -0.62424624 -0.02251434 -0.08372692
  0.          0.        ]
Reward: -1.5865053614883777
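
The loop above relies on the five values returned by `env.step`. In Gymnasium, `terminated` means the episode ended by the environment's own rules (for the lander, a landing or a crash), while `truncated` means it was cut off from outside, typically by a time limit. The sketch below mimics those return shapes with a toy environment; `CountdownEnv` and `run_episode` are hypothetical names made up for this illustration and are not part of Gymnasium.

```python
import random
from typing import Dict, Tuple


class CountdownEnv:
    """Hypothetical stand-in that mimics the reset()/step() return shapes."""

    def __init__(self, start: int = 3, max_steps: int = 10) -> None:
        self.start = start
        self.max_steps = max_steps
        self.remaining = start
        self.steps = 0

    def reset(self) -> Tuple[int, Dict]:
        self.remaining = self.start
        self.steps = 0
        return self.remaining, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict]:
        self.steps += 1
        self.remaining -= action  # action 1 counts down, action 0 waits
        # The task itself reached its end state.
        terminated = self.remaining <= 0
        # The step limit was hit before the task ended.
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else -0.1
        return self.remaining, reward, terminated, truncated, {}


def run_episode(env: CountdownEnv, rng: random.Random) -> Tuple[float, str]:
    """Run one episode with random actions; return the total reward and why it ended."""
    observation, info = env.reset()
    total = 0.0
    while True:
        action = rng.choice([0, 1])
        observation, reward, terminated, truncated, info = env.step(action)
        total += reward
        if terminated:
            return total, "terminated"
        if truncated:
            return total, "truncated"


print(run_episode(CountdownEnv(), random.Random(0)))
```

Either flag ends the episode, which is why the rollout above resets on `terminated or truncated`.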


**Lunar lander environment**

**Observation** is a vector of size 8, where each value contains different information about the lander:
- Horizontal coordinate of the lander relative to the landing pad (x)
- Vertical coordinate of the lander relative to the landing pad (y)
- Horizontal speed (x)
- Vertical speed (y)
- Angle
- Angular speed
- Whether the left leg is in contact with the ground (boolean)
- Whether the right leg is in contact with the ground (boolean)
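
To make the raw vector easier to read, the list above can be turned into named fields. This is a minimal sketch: the field names are labels chosen for this example, not identifiers from Gymnasium, and the order is assumed to match the list.

```python
from typing import Dict, Sequence

# Field names follow the order of the list above; the names themselves are
# illustrative labels chosen for this sketch.
OBSERVATION_FIELDS = (
    "x", "y", "vx", "vy", "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
)


def describe_observation(observation: Sequence[float]) -> Dict[str, object]:
    """Map an 8-element LunarLander observation to named fields."""
    if len(observation) != len(OBSERVATION_FIELDS):
        raise ValueError(
            f"expected {len(OBSERVATION_FIELDS)} values, got {len(observation)}"
        )
    named: Dict[str, object] = {
        name: float(value) for name, value in zip(OBSERVATION_FIELDS, observation)
    }
    # The last two entries are 0.0/1.0 flags, so expose them as booleans.
    named["left_leg_contact"] = bool(named["left_leg_contact"])
    named["right_leg_contact"] = bool(named["right_leg_contact"])
    return named


# First observation printed by the rollout above.
first_obs = [0.00320358, 1.3872495, 0.15871033, -0.5169701,
             -0.004113, -0.04441812, 0.0, 0.0]
print(describe_observation(first_obs))
```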

**The action space** (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:

- Action 0: Do nothing,
- Action 1: Fire left orientation engine,
- Action 2: Fire the main engine,
- Action 3: Fire right orientation engine.

**Reward function** (the function that will give a reward at each timestep) 💰:

After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.

For each step, the reward:

- Is increased/decreased the closer/further the lander is to the landing pad.
- Is increased/decreased the slower/faster the lander is moving.
- Is decreased the more the lander is tilted (angle not horizontal).
- Is increased by 10 points for each leg that is in contact with the ground.
- Is decreased by 0.03 points each frame a side engine is firing.
- Is decreased by 0.3 points each frame the main engine is firing.

The episode receives an **additional reward of -100 or +100 points for crashing or landing safely respectively.**

An episode is **considered a solution if it scores at least 200 points.**
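
The bookkeeping described above can be sketched in a few lines. The snippet only covers the parts that are stated as fixed numbers (the engine penalties, the episode sum, and the 200-point threshold); the distance, speed, and tilt terms are left out because their exact form is not given here. The function names are hypothetical.

```python
import math
from typing import Iterable, Sequence

SOLVED_THRESHOLD = 200.0  # score at which an episode counts as a solution
MAIN_ENGINE_COST = 0.3    # per frame while the main engine fires (action 2)
SIDE_ENGINE_COST = 0.03   # per frame while a side engine fires (actions 1 and 3)


def engine_penalty(actions: Iterable[int]) -> float:
    """Total engine-firing penalty for a sequence of discrete actions."""
    penalty = 0.0
    for action in actions:
        if action == 2:
            penalty += MAIN_ENGINE_COST
        elif action in (1, 3):
            penalty += SIDE_ENGINE_COST
    return penalty


def episode_return(rewards: Sequence[float]) -> float:
    """Total reward of an episode: the sum of its per-step rewards."""
    return math.fsum(rewards)


def is_solved(rewards: Sequence[float]) -> bool:
    """True if the episode scores at least the solution threshold."""
    return episode_return(rewards) >= SOLVED_THRESHOLD


# Actions taken in the five-step rollout above.
print(engine_penalty([2, 3, 3, 1, 0]))
```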


# Train an RL agent on the environment

In [None]:
# create a vectorized environment - stack multiple independent environments into
# a single environment, to have more diverse experiences during the training.
env = make_vec_env(env_id, n_envs=16)

# Create an agent that uses the PPO learning algorithm.
# We use a multilayer perceptron policy (MlpPolicy) because the input is a vector;
# if we had frames as input we would use CnnPolicy.
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1)

Training takes a while, so run it on a GPU runtime if one is available:

In [None]:
# train PPO agent
model.learn(total_timesteps=int(1e6))
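
The settings in the two cells above determine how much experience each PPO update sees. The following is a rough budget calculation, assuming that SB3 collects `n_steps` transitions from each of the `n_envs` environments per rollout and keeps collecting whole rollouts until `total_timesteps` is reached:

```python
import math

# Values copied from the cells above.
n_envs = 16
n_steps = 1024
batch_size = 64
n_epochs = 4
total_timesteps = int(1e6)

# Transitions gathered before each policy update.
rollout_size = n_envs * n_steps
# Minibatches needed to sweep the rollout once.
minibatches_per_epoch = rollout_size // batch_size
# Gradient steps taken on each rollout (n_epochs sweeps).
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs
# Whole rollouts needed to reach the timestep budget.
rollouts = math.ceil(total_timesteps / rollout_size)
# Because rollouts are collected whole, the final count can overshoot the budget.
timesteps_collected = rollouts * rollout_size

print(f"rollout size:               {rollout_size}")
print(f"minibatches per epoch:      {minibatches_per_epoch}")
print(f"gradient steps per rollout: {gradient_steps_per_rollout}")
print(f"rollouts:                   {rollouts}")
print(f"timesteps collected:        {timesteps_collected}")
```

Since `rollout_size` is divisible by `batch_size`, every minibatch is full; changing `n_envs`, `n_steps`, or `batch_size` independently can break that property.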

In [None]:
model_name = "ppo-LunarLander-v2" # A good name is {model_architecture}-{env_id}
model.save(model_name)

# Push the model to the HF hub

In [None]:
notebook_login()
!git config --global credential.helper store

In [None]:
# Create the evaluation env and set the render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

package_to_hub(model=model,
               model_name=model_name,
               model_architecture="PPO",
               env_id=env_id,
               eval_env=eval_env,
               repo_id=f"migolan/{model_name}", # A good name is {username}/{model_architecture}-{env_id}
               commit_message="Upload PPO LunarLander-v2 trained agent")

The cell above should display a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. On that page you can:
* See a video preview of your agent on the right.
* Click "Files and versions" to see all the files in the repository.
* Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
* Read the model card (`README.md` file), which gives a description of the model.
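
The naming convention mentioned in the upload cell, `{username}/{model_architecture}-{env_id}`, can be captured in a small helper. `build_repo_id` is a hypothetical function written for this sketch, and lowercasing the architecture is an assumption made to match the `ppo-LunarLander-v2` model name used above.

```python
def build_repo_id(username: str, model_architecture: str, env_id: str) -> str:
    """Build a Hub repo id of the form {username}/{model_architecture}-{env_id}."""
    if not username or "/" in username:
        raise ValueError("username must be non-empty and must not contain '/'")
    # Lowercase the architecture so that "PPO" becomes "ppo" (assumed convention).
    return f"{username}/{model_architecture.lower()}-{env_id}"


print(build_repo_id("migolan", "PPO", "LunarLander-v2"))
```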

Compare the results of your LunarLander-v2 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

# Load a saved model from the Hub

In [10]:
!pip install shimmy # conversion tool that will help us run the environment correctly https://github.com/Farama-Foundation/Shimmy

Collecting shimmy
  Downloading Shimmy-2.0.0-py3-none-any.whl.metadata (3.5 kB)
Collecting gymnasium>=1.0.0a1 (from shimmy)
  Using cached gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Downloading Shimmy-2.0.0-py3-none-any.whl (30 kB)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
Installing collected packages: gymnasium, shimmy
  Attempting uninstall: gymnasium
    Found existing installation: gymnasium 0.28.1
    Uninstalling gymnasium-0.28.1:
      Successfully uninstalled gymnasium-0.28.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
stable-baselines3 2.0.0a5 requires gymnasium==0.28.1, but you have gym

In [12]:
# If this doesn't work, make sure you've installed shimmy
repo_id = "Classroom-workshop/assignment2-omar" # The repo_id
filename = "ppo-LunarLander-v2.zip" # The model filename.zip
# Go to https://huggingface.co/models?library=stable-baselines3 to see the list of all the Stable-baselines3 saved models.
checkpoint = load_from_hub(repo_id, filename)

# When a model was trained on Python 3.8, the pickle protocol is 5,
# but Python 3.6 and 3.7 use protocol 4.
# To stay compatible we need to:
# 1. Install pickle5 (we did this at the beginning of the colab)
# 2. Create a custom_objects dict and pass it as a parameter to PPO.load()
custom_objects = {
            "learning_rate": 0.0,
            "lr_schedule": lambda _: 0.0,
            "clip_range": lambda _: 0.0,
}
model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)

== CURRENT SYSTEM INFO ==
- OS: Linux-6.1.85+-x86_64-with-glibc2.35 # 1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
- Python: 3.10.12
- Stable-Baselines3: 2.0.0a5
- PyTorch: 2.5.1+cu121
- GPU Enabled: False
- Numpy: 1.26.4
- Cloudpickle: 3.1.0
- Gymnasium: 0.28.1
- OpenAI Gym: 0.25.2

== SAVED MODEL SYSTEM INFO ==
OS: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic #1 SMP Sun Apr 24 10:03:06 PDT 2022
Python: 3.7.13
Stable-Baselines3: 1.5.0
PyTorch: 1.11.0+cu113
GPU Enabled: True
Numpy: 1.21.6
Gym: 0.21.0



  th_object = th.load(file_content, map_location=device)
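
The lambdas in `custom_objects` above are constant schedules: callables that ignore their argument and always return 0.0. The sketch below assumes the SB3 convention that a schedule receives the remaining training progress, going from 1.0 at the start to 0.0 at the end. `constant_schedule` and `linear_schedule` are hypothetical helpers written for this illustration, and the value `3e-4` is an arbitrary example.

```python
from typing import Callable

Schedule = Callable[[float], float]


def constant_schedule(value: float) -> Schedule:
    """Return a schedule that ignores progress and always yields `value`."""
    def schedule(progress_remaining: float) -> float:
        return value
    return schedule


def linear_schedule(initial_value: float) -> Schedule:
    """Return a schedule that decays linearly from `initial_value` to 0."""
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule


frozen = constant_schedule(0.0)   # same behaviour as `lambda _: 0.0` above
decaying = linear_schedule(3e-4)  # arbitrary example value

for progress in (1.0, 0.5, 0.0):
    print(progress, frozen(progress), decaying(progress))
```

Constant zero schedules are enough for a model that is only loaded for evaluation, since no further training step will use them.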


# Evaluate the agent
When you evaluate your agent, you should not use your training environment but create an evaluation environment.

In [13]:
# Create a new environment for evaluation, with a monitor
eval_env = Monitor(gym.make(env_id, render_mode='rgb_array'))

# Evaluate the model
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=271.37 +/- 64.693128743265
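
The two numbers printed above summarize the per-episode returns. The sketch below assumes that `evaluate_policy` reports the mean and the population standard deviation (as `numpy.std` does by default); the episode returns are made-up values for illustration, not results from this notebook.

```python
import statistics
from typing import Sequence, Tuple


def summarize_returns(episode_returns: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-episode returns."""
    mean_reward = statistics.fmean(episode_returns)
    std_reward = statistics.pstdev(episode_returns)
    return mean_reward, std_reward


# Hypothetical per-episode returns, made up for this sketch.
returns = [250.0, 300.0, 200.0, 280.0, 270.0]
mean_reward, std_reward = summarize_returns(returns)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

A high mean with a large standard deviation, as in the output above, suggests that some evaluation episodes scored far below the average, so more evaluation episodes may give a steadier estimate.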


# Improve Agent
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?

Here are some ideas to get to the top of the leaderboard:
* Train for more steps
* Try different hyperparameters for `PPO`. You can see them at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters.
* Check the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) and try another model such as DQN.
* **Push your new trained model** on the Hub 🔥

Other possible environments:
* MountainCar-v0
* CartPole-v1
* CarRacing-v2 (image observations, so use `CnnPolicy`)

Check how they work at https://gymnasium.farama.org.