<a href="https://colab.research.google.com/github/migolan/HF-DRLC/blob/main/06_A2C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro
This notebook is based on https://huggingface.co/learn/deep-rl-course/unit6/hands-on.

* PandaReachDense-v3 environment from [Panda-Gym](https://github.com/qgallouedec/panda-gym)
* A2C from [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)

# Installations

In [None]:
%%capture
!apt install python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

!pip install stable-baselines3[extra]
!pip install gymnasium

!pip install huggingface_sb3
!pip install huggingface_hub
!pip install panda_gym

# Imports

In [None]:
import os

import gymnasium as gym
import panda_gym

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.env_util import make_vec_env

from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login

from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

# The PandaReachDense-v3 environment

The agent we're going to train is a robotic arm that has to learn control: moving the arm and using its end-effector.

In robotics, the *end-effector* is the device at the end of a robotic arm designed to interact with the environment.

In `PandaReach`, the robot must place its end-effector at a target position (green ball).

We're going to use the dense version of this environment. This means we get a *dense reward function* that provides a reward at each timestep (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function*, where the environment returns a reward if and only if the task is completed.
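To make the difference concrete, here is a minimal sketch of the two reward styles. It assumes the dense reward is the negative Euclidean distance between the end-effector and the target. The 0/-1 convention for the sparse case, the `DISTANCE_THRESHOLD` value, and both function names are assumptions made for illustration, not taken from the environment's source.

```python
import math

DISTANCE_THRESHOLD = 0.05  # assumed success radius, for illustration only


def dense_reward(achieved_goal, desired_goal):
    """Reward at every step: the closer to the target, the higher (less negative)."""
    return -math.dist(achieved_goal, desired_goal)


def sparse_reward(achieved_goal, desired_goal):
    """Reward changes only on success: 0.0 inside the threshold, -1.0 otherwise."""
    distance = math.dist(achieved_goal, desired_goal)
    return 0.0 if distance <= DISTANCE_THRESHOLD else -1.0


target = (0.1, 0.0, 0.2)
far = (0.4, 0.0, 0.2)   # roughly 0.3 away from the target
near = (0.2, 0.0, 0.2)  # roughly 0.1 away from the target

# The dense signal improves as the arm approaches the target...
print(dense_reward(far, target), dense_reward(near, target))
# ...while the sparse signal is identical until the threshold is reached.
print(sparse_reward(far, target), sparse_reward(near, target))
```

With the dense version the agent gets useful feedback from every step, which is why it tends to be easier to learn from than the sparse version.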

We're also going to use *end-effector displacement control*: the action corresponds to the displacement of the end-effector, rather than to the individual motion of each joint (joint control). This makes training easier.
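One way to picture displacement control is to read the action as a requested move (dx, dy, dz) of the end-effector. The sketch below assumes actions are clipped to [-1, 1] and scaled by a fixed step size. `apply_displacement` and `MAX_STEP` are hypothetical names, and the simulator's real scaling and inverse kinematics are not modeled here.

```python
MAX_STEP = 0.05  # assumed maximum displacement per step, illustrative value


def apply_displacement(position, action, max_step=MAX_STEP):
    """Move the end-effector by a clipped, scaled (dx, dy, dz) action."""
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    return [p + max_step * a for p, a in zip(position, clipped)]


position = [0.0, 0.0, 0.2]
action = [1.0, -0.5, 2.0]  # the last component is outside [-1, 1] and gets clipped
new_position = apply_displacement(position, action)
print(new_position)
```

The policy only has to output three numbers, and the mapping from end-effector motion to joint motion is handled for it.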


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg"  alt="Robotics" width=50%/>

The observation space is a dictionary with 3 different elements:
- `achieved_goal`: (x,y,z) the current position of the end-effector.
- `desired_goal`: (x,y,z) the target position for the end-effector.
- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).

Since the observation is a dictionary, we need to use `MultiInputPolicy` instead of `MlpPolicy`.
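Conceptually, a multi-input policy has to turn the dictionary into a single feature vector before feeding it to the network. Here is a minimal sketch of that idea, assuming each entry is simply flattened and concatenated. The function name and sample values are made up, and this is not the Stable-Baselines3 implementation.

```python
def flatten_dict_observation(observation):
    """Concatenate all entries of a dict observation into a single flat vector."""
    flat = []
    for key in sorted(observation):  # a fixed key order keeps the layout stable
        flat.extend(observation[key])
    return flat


# Hypothetical observation with the three keys described above
observation = {
    "achieved_goal": [0.04, 0.00, 0.20],
    "desired_goal": [0.10, -0.05, 0.15],
    "observation": [0.04, 0.00, 0.20, 0.0, 0.0, 0.0],
}
features = flatten_dict_observation(observation)
print(len(features))  # 3 + 3 + 6 = 12 values
```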

The action space is a vector with 3 values:
- Control x, y, z movement





In [None]:
env_id = "PandaReachDense-v3"
env = gym.make(env_id)

print(f"Observation space: {env.observation_space}")
print(f"Sample observation: {env.observation_space.sample()}")
print(f"Action space: {env.action_space}")
print(f"Sample action: {env.action_space.sample()}")

# Normalize observation and rewards

A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).

For that purpose, we'll use the [VecNormalize](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize) wrapper, which computes a running average and standard deviation of the input features and also normalizes rewards.
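The idea behind running normalization can be sketched for a single scalar feature. This is an illustration under simplifying assumptions, not the `VecNormalize` code: the class name `RunningNormalizer` is hypothetical, statistics are updated one value at a time with Welford's algorithm, and the `clip` argument plays the role of `clip_obs`.

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance of one scalar feature (Welford's algorithm)."""

    def __init__(self, clip=10.0, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.epsilon = epsilon
        self.training = True  # when False, statistics are frozen

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, value):
        if self.training:
            self.update(value)
        variance = self.m2 / self.count if self.count > 0 else 1.0
        scaled = (value - self.mean) / math.sqrt(variance + self.epsilon)
        return max(-self.clip, min(self.clip, scaled))


normalizer = RunningNormalizer(clip=10.0)
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    normalizer.normalize(value)  # statistics are updated while training

normalizer.training = False          # freeze the statistics, as done at test time
centered = normalizer.normalize(3.0)     # 3.0 is the running mean
clipped = normalizer.normalize(1000.0)   # extreme values are clipped
print(centered, clipped)
```

Freezing the statistics matters because the policy was trained on inputs scaled with those exact values, which is also why the statistics are saved alongside the model.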

In [None]:
env = make_vec_env(env_id, n_envs=4)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

# Create the A2C Model

For more information about the A2C implementation in Stable-Baselines3, see: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes

To find good hyperparameters, I checked the [official trained agents by the Stable-Baselines3 team](https://huggingface.co/sb3).

In [None]:
model = A2C(policy="MultiInputPolicy", env=env, verbose=1)

# Train the A2C agent

In [None]:
model.learn(1_000_000)

In [None]:
# Save the model together with the VecNormalize statistics
model.save("a2c-PandaReachDense-v3")
env.save("vec_normalize.pkl")

# Evaluate the agent

In [None]:
# Load the saved statistics
eval_env = DummyVecEnv([lambda: gym.make("PandaReachDense-v3")])
eval_env = VecNormalize.load("vec_normalize.pkl", eval_env)

# We need to override the render_mode
eval_env.render_mode = "rgb_array"

# Do not update the normalization statistics at test time
eval_env.training = False
# reward normalization is not needed at test time
eval_env.norm_reward = False

# Load the agent
model = A2C.load("a2c-PandaReachDense-v3")

mean_reward, std_reward = evaluate_policy(model, eval_env)

print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")

# Publish the model on the HF hub

In [None]:
notebook_login()

In [None]:
package_to_hub(
    model=model,
    model_name=f"a2c-{env_id}",
    model_architecture="A2C",
    env_id=env_id,
    eval_env=eval_env,
    repo_id=f"migolan/a2c-{env_id}",
    commit_message="Initial commit",
)

# Some additional challenges
The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?

If you want to try more advanced tasks in panda-gym, check what was done using **TQC or SAC** (more sample-efficient algorithms suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: unlike in a simulation, **if you move your robotic arm too much, you risk breaking it**.

PandaPickAndPlace-v1 (this model uses the v1 version of the environment): https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1

Don't hesitate to check the panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html

Here are the steps to train another agent (optional):

1. Define the environment called "PandaPickAndPlace-v3"
2. Make a vectorized environment
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
4. Create the A2C model (don't forget `verbose=1` to print the training logs).
5. Train it for 1M timesteps
6. Save the model together with the VecNormalize statistics
7. Evaluate your agent
8. Publish your trained model on the Hub 🔥 with `package_to_hub`
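Steps 1 to 6 could be sketched as follows, reusing the same calls as in the rest of this notebook. The function name `train_pick_and_place` and the statistics filename are hypothetical. The imports are repeated inside the function only to keep the sketch self-contained, and the hyperparameters are the defaults, not tuned values. Evaluation and publishing (steps 7 and 8) would follow the same pattern as above.

```python
def train_pick_and_place(total_timesteps=1_000_000):
    """Train A2C on PandaPickAndPlace-v3 and save the model and normalization statistics."""
    import panda_gym  # noqa: F401  (importing registers the Panda environments)
    from stable_baselines3 import A2C
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.vec_env import VecNormalize

    # 1. Define the environment
    env_id = "PandaPickAndPlace-v3"

    # 2. Make a vectorized environment
    env = make_vec_env(env_id, n_envs=4)

    # 3. Normalize observations and rewards
    env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

    # 4. Create the A2C model (dict observations need MultiInputPolicy)
    model = A2C(policy="MultiInputPolicy", env=env, verbose=1)

    # 5. Train the agent
    model.learn(total_timesteps)

    # 6. Save the model together with the VecNormalize statistics
    model.save(f"a2c-{env_id}")
    env.save("vec_normalize_pick_and_place.pkl")  # hypothetical filename

    return model, env
```

Calling `train_pick_and_place()` starts the full 1M-timestep run. Passing a smaller `total_timesteps` is a quick way to check that the pipeline works before committing to a long training session.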
