# Train your Second Agent

In this notebook, we will explore another environment, train an agent on it, and then share the result using the Hugging Face ecosystem.


#### Running in Colab/Kaggle

If you are running this notebook on Colab or Kaggle, uncomment the cells below and run them to install the required dependencies.



In [1]:
## Uncomment and execute this cell to install all the dependencies if running in Google Colab or Kaggle

# !apt-get update 
# !apt-get install -y swig cmake ffmpeg freeglut3-dev xvfb

In [2]:
## Uncomment and execute this cell to install all the dependencies if running in Google Colab or Kaggle

# !pip install "box2d-py==2.3.8"
# !pip install "stable-baselines3[extra]==2.1.0"
# !pip install "huggingface_sb3>=3.0"
# !pip install "moviepy==1.0.3"

## Imports

In [43]:
import gymnasium as gym
import ale_py as ale

from stable_baselines3.common.logger import HumanOutputFormat, KVWriter, Logger
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3 import A2C

In [44]:
from typing import Any, Dict, Tuple, Union
import numpy as np
import mlflow
import sys


gym.register_envs(ale)

class MLflowOutputFormat(KVWriter):
    """
    Dumps key/value pairs into MLflow's numeric format.
    """

    def write(
        self,
        key_values: Dict[str, Any],
        key_excluded: Dict[str, Union[str, Tuple[str, ...]]],
        step: int = 0,
    ) -> None:

        for (key, value), (_, excluded) in zip(
            sorted(key_values.items()), sorted(key_excluded.items())
        ):

            if excluded is not None and "mlflow" in excluded:
                continue

            if isinstance(value, np.ScalarType):
                if not isinstance(value, str):
                    mlflow.log_metric(key, value, step)


loggers = Logger(
    folder=None,
    output_formats=[HumanOutputFormat(sys.stdout), MLflowOutputFormat()],
)
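
To make the filtering in `MLflowOutputFormat.write` easier to follow, here is a minimal pure-Python sketch of the same selection logic, without MLflow or NumPy. The helper name `select_loggable` and the sample keys are hypothetical, and the numeric check is simplified to `int` and `float`, so treat it as an illustration rather than a drop-in replacement.

```python
def select_loggable(key_values, key_excluded, target="mlflow"):
    """Return the numeric entries that are not excluded for `target`.

    Hypothetical helper: it mirrors the two checks in the writer above
    (exclusion list, then "is this a plain number?").
    """
    selected = {}
    for key, value in sorted(key_values.items()):
        excluded = key_excluded.get(key)
        # Skip keys that were explicitly excluded for this output format
        if excluded is not None and target in excluded:
            continue
        # Keep plain numbers only; strings and booleans are skipped
        # (simplification: the original checks against np.ScalarType)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            selected[key] = float(value)
    return selected


sample_values = {"train/value_loss": 0.0074, "time/fps": 285, "run/note": "warmup"}
sample_excluded = {"train/value_loss": None, "time/fps": ("mlflow",), "run/note": None}
print(select_loggable(sample_values, sample_excluded))  # {'train/value_loss': 0.0074}
```

Only `train/value_loss` survives: `time/fps` is excluded for the `mlflow` target and `run/note` is not numeric.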

In [None]:


# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("Pong")

2025/02/16 18:43:31 INFO mlflow.tracking.fluent: Experiment with name 'Pong' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/160692338975497610', creation_time=1739749411464, experiment_id='160692338975497610', last_update_time=1739749411464, lifecycle_stage='active', name='Pong', tags={}>

## Create the Gym env and instantiate the agent

For this example, we will use an environment from the Atari simulator. Atari environments are simulated by the [Arcade Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment) (ALE), which is built on top of the [Stella](https://github.com/stella-emu/stella) emulator.

![Pong](https://gymnasium.farama.org/_images/pong.gif)


We will use [Pong](https://gymnasium.farama.org/environments/atari/pong/) from the Atari game simulator, specifically the `ALE/Pong-v5` variant. We will not delve into the details for now, except to note that the state is an image of the game screen and the actions are the game controller actions. Because the state is an image, we will use a different kind of policy network called `CnnPolicy`, which is based on convolutional neural networks (CNNs).

In the last notebook, we used DQN as the algorithm to train the agent. In this notebook, we will use the [A2C algorithm](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html). A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic ([A3C](https://arxiv.org/abs/1602.01783)).
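
To give a feel for the "advantage" in the algorithm's name, the following is a minimal pure-Python sketch of how an actor-critic method can turn a short rollout into advantage estimates: discounted returns are bootstrapped from the critic's value of the last state, and the critic's own predictions are then subtracted. The function names and the toy numbers are assumptions made for illustration; this is not Stable-Baselines3's implementation.

```python
def n_step_returns(rewards, dones, last_value, gamma=0.99):
    """Discounted returns for one rollout, bootstrapped from `last_value`.

    dones[t] is True when the episode ended at step t, in which case
    nothing from later steps is carried back past that point.
    """
    returns = []
    running = last_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        running = reward + gamma * running * (0.0 if done else 1.0)
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = observed (bootstrapped) return minus the critic's prediction."""
    return [g - v for g, v in zip(returns, values)]


# Toy rollout of three steps with a reward on the last one
toy_returns = n_step_returns([0.0, 0.0, 1.0], [False, False, False],
                             last_value=0.0, gamma=0.5)
print(toy_returns)  # [0.25, 0.5, 1.0]
print(advantages(toy_returns, [0.25, 0.25, 0.25]))  # [0.0, 0.25, 0.75]
```

A positive advantage means the action turned out better than the critic expected, so the actor is nudged towards it; a negative advantage nudges the actor away.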

In [47]:
# There already exists an environment generator
# that will make and wrap Atari environments correctly.
# Here we also use multi-worker training (n_envs=4 => 4 environments)
env_id = "ALE/Pong-v5"

vec_env = make_atari_env(env_id, n_envs=4, seed=0)
# Frame-stacking with 4 frames
vec_env = VecFrameStack(vec_env, n_stack=4)

# device="mps" targets Apple Silicon GPUs; use "cuda" for NVIDIA GPUs
# or "auto" to let Stable-Baselines3 pick the device
model = A2C("CnnPolicy", vec_env, verbose=1, device="mps")


Using mps device
Wrapping the env in a VecTransposeImage.
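
A single Pong frame does not reveal which way the ball is moving, which is why the cell above stacks the last four frames with `VecFrameStack`. The idea can be sketched in plain Python with a fixed-length buffer. The `FrameStack` class below is hypothetical and uses strings as stand-ins for image frames; it pads with the first frame on reset, whereas real wrappers may pad differently (for example with zeros).

```python
from collections import deque


class FrameStack:
    """Keep the most recent `n_stack` frames as one observation (illustrative only)."""

    def __init__(self, n_stack=4):
        self.n_stack = n_stack
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Assumption of this sketch: fill the buffer with copies of the first frame
        self.frames.clear()
        for _ in range(self.n_stack):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame
        self.frames.append(new_frame)
        return tuple(self.frames)


stack = FrameStack(n_stack=3)
print(stack.reset("f0"))  # ('f0', 'f0', 'f0')
print(stack.step("f1"))   # ('f0', 'f0', 'f1')
stack.step("f2")
print(stack.step("f3"))   # ('f1', 'f2', 'f3')
```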


## Train the Agent

It is advisable to run this in an environment with a GPU; otherwise, training may take a long time.

In [48]:
# If you have a GPU, you can increase the `total_timesteps` to something like 1_000_000 i.e. one million
# It would take about 45 mins to train
# The more you train the better the result will be

with mlflow.start_run():
    model.set_logger(loggers)
    model.learn(total_timesteps=20_000, log_interval=500, progress_bar=True)


------------------------------------
| rollout/              |          |
|    ep_len_mean        | 765      |
|    ep_rew_mean        | -21      |
| time/                 |          |
|    fps                | 285      |
|    iterations         | 500      |
|    time_elapsed       | 34       |
|    total_timesteps    | 10000    |
| train/                |          |
|    entropy_loss       | -0.00241 |
|    explained_variance | 0.896    |
|    learning_rate      | 0.0007   |
|    n_updates          | 499      |
|    policy_loss        | 1.69e-06 |
|    value_loss         | 0.0074   |
------------------------------------


------------------------------------
| rollout/              |          |
|    ep_len_mean        | 769      |
|    ep_rew_mean        | -21      |
| time/                 |          |
|    fps                | 308      |
|    iterations         | 1000     |
|    time_elapsed       | 64       |
|    total_timesteps    | 20000    |
| train/                |          |
|    entropy_loss       | -0.633   |
|    explained_variance | 0.196    |
|    learning_rate      | 0.0007   |
|    n_updates          | 999      |
|    policy_loss        | 0.0905   |
|    value_loss         | 0.15     |
------------------------------------


🏃 View run placid-vole-899 at: http://127.0.0.1:8080/#/experiments/160692338975497610/runs/594bef164b7e4d418d807476222af6a9
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/160692338975497610
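
The logs above can be cross-checked with a little arithmetic. Each A2C iteration collects `n_steps` transitions from each of the `n_envs` environments. The sketch below assumes A2C's default rollout length of `n_steps=5`, since the notebook does not set it explicitly; the helper names are hypothetical.

```python
def timesteps_per_iteration(n_envs, n_steps=5):
    """Transitions gathered in one iteration (n_steps=5 is assumed, see lead-in)."""
    return n_envs * n_steps


def iterations_for(total_timesteps, n_envs, n_steps=5):
    """Number of iterations needed to reach `total_timesteps` (rounded up)."""
    per_iteration = timesteps_per_iteration(n_envs, n_steps)
    return -(-total_timesteps // per_iteration)  # ceiling division


print(timesteps_per_iteration(n_envs=4))       # 20
print(iterations_for(20_000, n_envs=4))        # 1000
print(iterations_for(1_000_000, n_envs=4))     # 50000
```

With 20 timesteps per iteration, `log_interval=500` prints a table every 10,000 timesteps, which matches the two tables shown for the 20,000-timestep run.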


## Generate a Video

In [49]:
from stable_baselines3.common.vec_env import VecVideoRecorder
import os

video_length = 1000
video_folder = "logs/videos/"
# env_id is "ALE/Pong-v5", so the prefix contains a "/" and the video
# is written inside the sub-folder logs/videos/a2c-agent-ALE/
video_prefix = f"a2c-agent-{env_id}"
# Create the video folder if it does not exist
os.makedirs(f"{video_folder}/{video_prefix}", exist_ok=True)

# Create base environment
vec_env = make_atari_env(env_id, n_envs=1, seed=0)
# Frame-stacking with 4 frames
vec_env = VecFrameStack(vec_env, n_stack=4)

# Wrap with video recorder; recording is triggered at step 0
recorded_env = VecVideoRecorder(vec_env, video_folder,
                                record_video_trigger=lambda x: x == 0,
                                video_length=video_length,
                                name_prefix=video_prefix)

try:
    obs = recorded_env.reset()
    for step in range(video_length + 1):
        action, _state = model.predict(obs)
        # Vectorized environments reset automatically when an episode ends,
        # so no manual reset is needed inside the loop
        obs, rewards, dones, info = recorded_env.step(action)
finally:
    # Ensure proper cleanup; closing the wrapper also closes the wrapped vec_env
    recorded_env.close()

Saving video to /Users/mjm/projects/rhlf-book/02/logs/videos/a2c-agent-ALE/Pong-v5-step-0-to-step-1000.mp4
MoviePy - Building video /Users/mjm/projects/rhlf-book/02/logs/videos/a2c-agent-ALE/Pong-v5-step-0-to-step-1000.mp4.
MoviePy - Writing video /Users/mjm/projects/rhlf-book/02/logs/videos/a2c-agent-ALE/Pong-v5-step-0-to-step-1000.mp4

MoviePy - Done !
MoviePy - video ready /Users/mjm/projects/rhlf-book/02/logs/videos/a2c-agent-ALE/Pong-v5-step-0-to-step-1000.mp4


## Watch the Video

If you have a GPU, you can get better results by increasing `total_timesteps` to something like `1_000_000` in the `model.learn` call above.

In [50]:
from IPython.display import HTML
from base64 import b64encode

# Read the recorded video and embed it in the notebook as a base64 data URL
with open('./logs/videos/a2c-agent-ALE/Pong-v5-step-0-to-step-1000.mp4', 'rb') as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)
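
The cell above works because a browser can play a video straight from a `data:` URL, in which the file's bytes are base64-encoded into the URL itself. The following standalone sketch shows the encoding step on a tiny byte string instead of a real video; the helper name `to_data_url` is hypothetical.

```python
import base64


def to_data_url(data: bytes, mime_type: str = "video/mp4") -> str:
    """Encode raw bytes as a data URL that can be used as a `src` attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Two bytes stand in for the contents of an .mp4 file
print(to_data_url(b"hi"))  # data:video/mp4;base64,aGk=
```

Base64 inflates the payload by roughly a third, so this approach is best kept for short clips like the one recorded here.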

## Share the Video on Your Hugging Face Account

### First, let us log in to our Hugging Face account

In [8]:
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.

notebook_login()
!git config --global credential.helper store


**IMPORTANT**
Some users have reported the following error while running the `package_to_hub` upload function.

```
"Token is required (write-access action) but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens."
```

In such a case, the following command will help you overcome the issue:

```
import huggingface_hub

huggingface_hub.login(token="<YOUR_HF_TOKEN>",
                      write_permission=True,
                      add_to_git_credential=True)
```

Alternatively, run the following command from a shell where the `venv` or `conda` environment for this repository has been activated, and then follow the instructions to set the Hugging Face token.

```
huggingface-cli login

```
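
If you would rather not paste the token into a notebook cell, one option is to read it from an environment variable. The sketch below assumes the token has been exported as `HF_TOKEN` before the notebook was started; the helper name `read_hf_token` is hypothetical.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or fail with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"No token found in ${env_var}; export it before starting the notebook."
        )
    return token


# Possible usage (requires huggingface_hub and a valid token):
# import huggingface_hub
# huggingface_hub.login(token=read_hf_token(), add_to_git_credential=True)
```

This keeps the secret out of the saved notebook, which matters if the notebook is later committed or shared.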

### Push to HuggingFace

You can execute the code below to push the trained agent to the Hugging Face Hub. Towards the end of the output, there will be a link to the hosted model that you can share with your friends and family. Share the URL without the trailing path `tree/main` so that it takes them to the Model Card tab, where they can see the animation. For example, in my case it will be:<br/>
`https://huggingface.co/nsanghi/a2c-Atari-Pong/`<br/>
instead of<br/>
`https://huggingface.co/nsanghi/a2c-Atari-Pong/tree/main/`
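
Trimming the link by hand is easy, but the same step can be sketched as a small helper applied to the URL that the upload returns. The function name `model_card_url` is hypothetical and only handles the `tree/main` suffix described above.

```python
def model_card_url(repo_url: str) -> str:
    """Drop a trailing 'tree/main' so the link opens the Model Card tab."""
    suffix = "tree/main"
    trimmed = repo_url.rstrip("/")
    if trimmed.endswith(suffix):
        trimmed = trimmed[: -len(suffix)].rstrip("/")
    return trimmed + "/"


print(model_card_url("https://huggingface.co/nsanghi/a2c-Atari-Pong/tree/main/"))
# https://huggingface.co/nsanghi/a2c-Atari-Pong/
```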

NOTE: At the time of writing this book, there is a bug in `push_to_hub` which results in the sample video not being created. You can refer to the bug report here: https://github.com/huggingface/huggingface_sb3/issues/33



In [10]:
eval_env = make_atari_env(env_id, n_envs=1, seed=0)
# Frame-stacking with 4 frames
eval_env = VecFrameStack(eval_env, n_stack=4)

# package_to_hub saves and evaluates the model, generates a model card and
# records a replay video of your agent before pushing the repo to the hub

# Please note repo_id is of the form <huggingface_id>/<name of repo>
# you will need to change this to "<your_huggingface_id>/a2c-Atari-Pong"

package_to_hub(model=model, # Our trained model
               model_name="A2C-Atari-Pong", # The name of our trained model
               model_architecture="A2C", # The model architecture we used: in our case A2C
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id="nsanghi/a2c-Atari-Pong", # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name})
               commit_message="Push to Hub")

Moviepy - Building video /home/nsanghi/sandbox/apress/drl-2ed/chapter2/logs/videos/a2c-agent-PongNoFrameskip-v4-step-1001-to-step-2001.mp4.
Moviepy - Writing video /home/nsanghi/sandbox/apress/drl-2ed/chapter2/logs/videos/a2c-agent-PongNoFrameskip-v4-step-1001-to-step-2001.mp4



                                                                                                                        

Moviepy - Done !
Moviepy - video ready /home/nsanghi/sandbox/apress/drl-2ed/chapter2/logs/videos/a2c-agent-PongNoFrameskip-v4-step-1001-to-step-2001.mp4
ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.

Saving video to /tmp/tmp2rall_56/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmp2rall_56/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmp2rall_56/-step-0-to-step-1000.mp4

Moviepy - Done !
Moviepy - video ready /tmp/tmp2rall_56/-step-0-to-step-1000.mp4

ℹ Pushing repo nsanghi/a2c-Atari-Pong to the Hugging Face Hub

A2C-Atari-Pong.zip:   0%|          | 0.00/13.8M [00:00<?, ?B/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/nsanghi/a2c-Atari-Pong/tree/main/


'https://huggingface.co/nsanghi/a2c-Atari-Pong/tree/main/'