# Hyperparameter tuning with Optuna

GitHub repo: https://github.com/araffin/tools-for-robotic-rl-icra2022

Optuna: https://github.com/optuna/optuna

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

QR-DQN: https://advancedoracademy.medium.com/quantile-regression-dqn-pushing-the-boundaries-of-value-distribution-approximation-in-620af75ec5f3

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the importance of tuning hyperparameters. You will first try to optimize the parameters manually and then we will see how to automate the search using Optuna.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [16]:
!pip install stable-baselines3



In [17]:
# Optional: install SB3 contrib to have access to additional algorithms
!pip install sb3-contrib



In [8]:
# Optuna will be used in the last part when doing hyperparameter tuning
!pip install optuna



In [None]:
!apt install swig cmake
!pip install swig

!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

!pip install moviepy==1.0.3

!pip install huggingface_sb3

!pip install gym[classic_control]

!pip install gymnasium[atari]
!pip install gymnasium[accept-rom-license]

!pip install ffmpeg --upgrade

In [34]:
# For now we install this update of RL-Baselines3 Zoo
!pip install git+https://github.com/DLR-RM/rl-baselines3-zoo@update/hf

Collecting git+https://github.com/DLR-RM/rl-baselines3-zoo@update/hf
  Cloning https://github.com/DLR-RM/rl-baselines3-zoo (to revision update/hf) to /tmp/pip-req-build-iinqxvci
  Running command git clone --filter=blob:none --quiet https://github.com/DLR-RM/rl-baselines3-zoo /tmp/pip-req-build-iinqxvci
  Running command git checkout -b update/hf --track origin/update/hf
  Switched to a new branch 'update/hf'
  Branch 'update/hf' set up to track remote branch 'update/hf' from 'origin'.
  Resolved https://github.com/DLR-RM/rl-baselines3-zoo to commit 7dcbff7e74e7a12c052452181ff353a4dbed313a
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pytablewriter~=0.64 (from rl_zoo3==2.0.0a9)
  Downloading pytablewriter-0.64.2-py3-none-any.whl.metadata (32 kB)
Collecting DataProperty<2,>=0.55.0 (from pytablewrite

## Imports

In [19]:
import gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithm can be used on which kind of problem.

In [21]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

In [22]:
# Algorithms from the contrib repo
# https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
from sb3_contrib import QRDQN, TQC

ImportError: cannot import name 'PyTorchObs' from 'stable_baselines3.common.type_aliases' (/opt/conda/lib/python3.10/site-packages/stable_baselines3/common/type_aliases.py)

In [25]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

In [None]:
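import os
# Kill the current process to force a runtime restart
# (presumably so that the newly installed packages are picked up)
os.kill(os.getpid(), 9)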

In [26]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x79708f392230>

# Part I: The Importance Of Tuned Hyperparameters



Compared with supervised learning, deep reinforcement learning is far more sensitive to the choice of hyperparameters such as the learning rate, the number of neurons, the number of layers, or the optimizer.

A poor choice of hyperparameters can lead to poor or unstable convergence. This challenge is compounded by the variability in performance across random seeds (used to initialize the network weights and the environment).

In addition to hyperparameters, selecting the appropriate algorithm is also an important choice. We will demonstrate it on the simple Pendulum task.

See [gym doc](https://gym.openai.com/envs/Pendulum-v0/): "The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright."


Let's try first with PPO and a small budget of 4000 steps (20 episodes):

In [9]:
env_id = "Pendulum-v1"
# Env used only for evaluation
eval_envs = make_vec_env(env_id, n_envs=10)
# 4000 training timesteps
budget_pendulum = 4000
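
As a quick sanity check on the budget, the number of episodes follows from the episode length. The sketch below assumes a fixed episode length of 200 steps for `Pendulum-v1` (the value SB3 logs as `ep_len_mean` for this env); `EPISODE_LENGTH` and `episodes_in_budget` are illustrative names, not part of the notebook.

```python
# Episode length for Pendulum-v1, assumed to be fixed at 200 steps
EPISODE_LENGTH = 200


def episodes_in_budget(n_timesteps: int, episode_length: int = EPISODE_LENGTH) -> int:
    """Number of complete episodes that fit in a training budget."""
    return n_timesteps // episode_length


print(episodes_in_budget(4_000))   # 20 episodes for the small budget
print(episodes_in_budget(40_000))  # 200 episodes for 10x the budget
```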

### PPO

In [10]:
ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [11]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1179.35 +/- 227.73


### A2C

In [12]:
# Define and train an A2C model
a2c_model = A2C("MlpPolicy", env_id, seed=8, verbose=0).learn(budget_pendulum)

In [13]:
# Evaluate the trained A2C model
mean_reward, std_reward = evaluate_policy(a2c_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"A2C Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

A2C Mean episode reward: -1637.66 +/- 171.42


Both are far from solving the env (which corresponds to a mean reward of around -200).

### Training PPO longer?

Maybe training longer would help?

You can try with 10x the budget, but in the case of A2C/PPO, training longer won't help much; better hyperparameters are needed instead.

In [14]:
# train longer
new_budget = 10 * budget_pendulum

ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(new_budget)

In [15]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1076.18 +/- 221.73


### PPO - Tuned Hyperparameters

Using Optuna, we can in fact tune the hyperparameters and find a working solution (from the [RL Zoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml)):

In [16]:
tuned_params = {
    "gamma": 0.9,
    "use_sde": True,
    "sde_sample_freq": 4,
    "learning_rate": 1e-3,
}

# Budget: 50 000 steps (12.5x budget_pendulum)
ppo_tuned_model = PPO("MlpPolicy", env_id, seed=1, verbose=1, **tuned_params).learn(50_000, log_interval=5)

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 200         |
|    ep_rew_mean          | -1.2e+03    |
| time/                   |             |
|    fps                  | 507         |
|    iterations           | 5           |
|    time_elapsed         | 20          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.016115531 |
|    clip_fraction        | 0.182       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.67       |
|    explained_variance   | 0.78        |
|    learning_rate        | 0.001       |
|    loss                 | 14.5        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0128     |
|    std                  | 0.

In [17]:
mean_reward, std_reward = evaluate_policy(ppo_tuned_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"Tuned PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Tuned PPO Mean episode reward: -203.23 +/- 178.43


Note: if you try SAC on the simple MountainCarContinuous environment, you will encounter some issues without tuned hyperparameters: https://github.com/rail-berkeley/softlearning/issues/76

Simple environments can be challenging even for SOTA algorithms.

# Part II: Grad Student Descent


### Challenge (10 minutes): "Grad Student Descent"
The challenge is to find the best hyperparameters (max performance) for A2C on `CartPole-v1` with a limited budget of 20 000 training steps.


Maximum reward: 500 on `CartPole-v1`

The hyperparameters should work for different random seeds.
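
One way to check that requirement is to repeat training with several seeds and look at the spread of the final scores. The helper below is a minimal sketch of that idea: `score_across_seeds` and `fake_train_and_eval` are hypothetical names, and the fake function only stands in for a real "train, then evaluate" run so that the example executes on its own.

```python
from statistics import mean, pstdev
from typing import Callable, Iterable, Tuple


def score_across_seeds(
    train_and_eval: Callable[[int], float],
    seeds: Iterable[int],
) -> Tuple[float, float]:
    """Run one training + evaluation per seed and return (mean, std) of the scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return mean(scores), pstdev(scores)


# Toy stand-in for "train A2C with this seed, then evaluate it":
# it only makes the helper runnable here and is NOT a real training run.
def fake_train_and_eval(seed: int) -> float:
    return 500.0 if seed % 2 == 0 else 400.0


avg, std = score_across_seeds(fake_train_and_eval, seeds=[0, 1, 2, 3])
print(f"mean={avg:.1f} std={std:.1f}")  # mean=450.0 std=50.0
```

In the notebook, `train_and_eval` would wrap `A2C(..., seed=seed, **hyperparams).learn(budget)` followed by `evaluate_policy`; a configuration with a high mean and a low standard deviation across seeds is the one to prefer.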

In [18]:
budget = 20_000

#### The baseline: default hyperparameters

In [19]:
eval_envs_cartpole = make_vec_env("CartPole-v1", n_envs=10)

In [20]:
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1).learn(budget)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 25.4     |
|    ep_rew_mean        | 25.4     |
| time/                 |          |
|    fps                | 438      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.518   |
|    explained_variance | 0.571    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 2.16     |
|    value_loss         | 7.77     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 26.4     |
|    ep_rew_mean        | 26.4     |
| time/                 |          |
|    fps

In [21]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:499.80 +/- 1.40


**Your goal is to beat that baseline and get closer to the optimal score of 500**

Time to tune!

In [27]:
import torch.nn as nn

In [23]:
policy_kwargs = dict(
    net_arch=[
        dict(vf=[64, 64], pi=[64, 64]),  # network architectures for actor/critic
    ],
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=5,  # number of steps to collect data before updating policy
    learning_rate=7e-4,
    gamma=0.99,  # discount factor
    max_grad_norm=0.5,  # The maximum value for the gradient clipping
    ent_coef=0.0,  # Entropy coefficient for the loss calculation
    policy_kwargs=policy_kwargs,  # pass the network settings defined above to the model
)

model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1, **hyperparams).learn(budget)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 25.4     |
|    ep_rew_mean        | 25.4     |
| time/                 |          |
|    fps                | 415      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.518   |
|    explained_variance | 0.571    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 2.16     |
|    value_loss         | 7.77     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 26.4     |
|    ep_rew_mean        | 26.4     |
| time/                 |          |
|    fps

In [24]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:499.50 +/- 3.50


Hint - Recommended Hyperparameter Range

```python
gamma = trial.suggest_float("gamma", 0.9, 0.99999, log=True)
max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
# from 2**3 = 8 to 2**10 = 1024
n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)
learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
# net_arch tiny: {"pi": [64], "vf": [64]}
# net_arch default: {"pi": [64, 64], "vf": [64, 64]}
# activation_fn = nn.Tanh / nn.ReLU
```
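
Several of the ranges in the hint use `log=True`. The sketch below illustrates what sampling on a log scale means: the logarithm of the value is drawn uniformly, so every decade of the range is equally likely. It is a standalone illustration using only the standard library, not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniformly distributed in [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
# Same range as the learning rate in the hint: [1e-5, 1]
samples = [sample_log_uniform(1e-5, 1.0, rng) for _ in range(10_000)]

# The range spans 5 decades, so the 2 decades below 1e-3 should hold about 2/5 of the samples
share_below_1e_3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"share of samples below 1e-3: {share_below_1e_3:.2f}")
```

With a plain uniform distribution over the same range, almost no sample would fall below 1e-3, which is why a log scale is the usual choice for learning rates and entropy coefficients.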

# Part III: Automatic Hyperparameter Tuning





In this part we will create a script that searches for the best hyperparameters automatically.

### Imports

In [28]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances, plot_parallel_coordinate

## CartPole-v1

### Config

In [19]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}
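
To make the evaluation schedule concrete, the sketch below lists the steps at which an evaluation would be triggered by a check of the form `n_calls % EVAL_FREQ == 0`. It assumes one callback call per timestep, i.e. a single training environment; with several training environments each call would advance several timesteps at once.

```python
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)

# Steps where the modulo check fires, assuming one callback call per timestep
eval_steps = [step for step in range(1, N_TIMESTEPS + 1) if step % EVAL_FREQ == 0]
print(EVAL_FREQ)   # 10000
print(eval_steps)  # [10000, 20000]
```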

### Exercise (5 minutes): Define the search space

In [18]:
from typing import Any, Dict
import torch
import torch.nn as nn

def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for A2C hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 8, 16, 32, ... 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)

    ### YOUR CODE HERE
    # TODO:
    # - define the learning rate search space [1e-5, 1] (log) -> `suggest_float`
    # - define the network architecture search space ["tiny", "small"] -> `suggest_categorical`
    # - define the activation function search space ["tanh", "relu"]
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    ### END OF YOUR CODE

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    net_arch = [
        {"pi": [64], "vf": [64]}
        if net_arch == "tiny"
        else {"pi": [64, 64], "vf": [64, 64]}
    ]

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
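
The sampler draws `1 - gamma` on a log scale rather than `gamma` itself, which spreads the trials evenly between 0.9, 0.99, 0.999 and 0.9999. The sketch below shows that mapping together with the common rule of thumb that the effective horizon is about `1 / (1 - gamma)` steps; the helper names are illustrative and not part of the notebook.

```python
def gamma_from_sample(one_minus_gamma: float) -> float:
    """Map the sampled value (1 - gamma) back to the discount factor."""
    return 1.0 - one_minus_gamma


def effective_horizon(gamma: float) -> float:
    """Rough number of steps over which rewards matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


for one_minus_gamma in (0.1, 0.01, 0.001, 0.0001):
    gamma = gamma_from_sample(one_minus_gamma)
    horizon = effective_horizon(gamma)
    print(f"1 - gamma = {one_minus_gamma} -> gamma = {gamma:.4f}, horizon ~ {horizon:.0f} steps")

# n_steps candidates produced by 2 ** exponent for exponents 3..10
n_steps_candidates = [2 ** exponent for exponent in range(3, 11)]
print(n_steps_candidates)  # [8, 16, 32, 64, 128, 256, 512, 1024]
```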

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [17]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq:   Evaluate the agent every ``eval_freq`` call of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose:
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
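
The callback only reports values; the decision to stop a trial is made by the pruner. The sketch below is a simplified version of a median rule for a maximization problem, written with the standard library only. It is an assumption-laden illustration (Optuna's actual `MedianPruner` handles more details such as warmup steps), and `should_prune_median` is a hypothetical name.

```python
from statistics import median
from typing import List


def should_prune_median(
    current_value: float,
    previous_values_at_step: List[float],
    n_startup_trials: int = 5,
) -> bool:
    """
    Simplified median rule for a maximization problem: prune when the current
    intermediate value is below the median of earlier trials at the same step.
    """
    # Not enough history yet: never prune during the startup phase
    if len(previous_values_at_step) < n_startup_trials:
        return False
    return current_value < median(previous_values_at_step)


# Made-up mean rewards reported by earlier trials at the same evaluation index
history = [120.0, 300.0, 95.0, 480.0, 210.0]
print(should_prune_median(150.0, history))      # True: 150 is below the median (210)
print(should_prune_median(400.0, history))      # False: 400 is above the median
print(should_prune_median(10.0, history[:3]))   # False: fewer than 5 earlier trials
```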

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which is in charge of sampling hyperparameters, creating the model, and returning the result to Optuna.

In [16]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it will sample hyperparameters,
    evaluate it and report the result (mean episodic reward after training)

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`    

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_a2c_params(trial))

    # Create the RL model
    model = A2C(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_envs = make_vec_env(ENV_ID, n_envs=N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose)
    eval_callback = TrialEvalCallback(eval_envs, trial, n_eval_episodes=N_EVAL_EPISODES, eval_freq=EVAL_FREQ, deterministic=True, verbose=0)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [None]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
# Do not prune before 1/3 of the evaluations have been reported
# (note: N_EVALUATIONS // 3 is 0 when N_EVALUATIONS = 2, so there is no warmup here)
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_a2c_cartpole.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

Complete example: https://github.com/DLR-RM/rl-baselines3-zoo

### Best Params for CartPole

In [21]:
policy_kwargs = dict(
    net_arch=[
      dict(vf=[64, 64], pi=[64, 64]), # network architectures for actor/critic
    ],
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=32, # number of steps to collect data before updating policy
    learning_rate=0.00036981751573814277,
    gamma=0.9866042885147209, # discount factor
    max_grad_norm=1.6202256587200317, # The maximum value for the gradient clipping
    ent_coef=0.0, # Entropy coefficient for the loss calculation
    policy_kwargs=policy_kwargs,
)

model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=1, **hyperparams).learn(5_000_000)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [14]:
eval_envs_cartpole = make_vec_env("CartPole-v1", n_envs=10)

mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:500.00 +/- 0.00


In [15]:
# Save the model
model_name = "a2c-CartPole-v1"
model.save(model_name)

### Push to Huggingface

In [4]:
import gym
env = gym.make("CartPole-v1", render_mode="rgb_array")
model = A2C.load("/kaggle/working/a2c-CartPole-v1.zip", env=env)



Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [5]:
from huggingface_sb3 import load_from_hub, package_to_hub
from huggingface_hub import notebook_login, login

In [6]:
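# TOKEN is assumed to hold your Hugging Face access token with write permission
# (it must be defined beforehand, it is not set anywhere in this notebook)
login(token=TOKEN)
!git config --global credential.helper store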

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [7]:
from huggingface_sb3 import package_to_hub
from stable_baselines3.common.vec_env import DummyVecEnv

model_name = "a2c-CartPole-v1"

# PLACE the variables you've just defined two cells above
# Define the name of the environment
env_id = "CartPole-v1"

# TODO: Define the model architecture we used
model_architecture = "A2C"

## Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
## CHANGE WITH YOUR REPO ID
repo_id = "hishamcse/a2c-CartPole-v1" # Change with your repo id, you can't push with mine 😄

## Define the commit message
commit_message = "Upload A2C CartPole-v1 trained agent by hishamcse"

# Create the evaluation env and set the render_mode="rgb_array"
eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

# PLACE the package_to_hub function you've just filled here
package_to_hub(model=model, # Our trained model
               model_name=model_name, # The name of our trained model
               model_architecture=model_architecture, # The model architecture we used: in our case A2C
               env_id=env_id, # Name of the environment
               eval_env=eval_env, # Evaluation Environment
               repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}), for instance ThomasSimonini/ppo-LunarLander-v2
               commit_message=commit_message)



ℹ This function will save, evaluate, generate a video of your agent,
create a model card and push everything to the hub. It might take up to 1min.
This is a work in progress: if you encounter a bug, please open an issue.




Saving video to /tmp/tmpa5b2u3k8/-step-0-to-step-1000.mp4
Moviepy - Building video /tmp/tmpa5b2u3k8/-step-0-to-step-1000.mp4.
Moviepy - Writing video /tmp/tmpa5b2u3k8/-step-0-to-step-1000.mp4




Moviepy - Done !
Moviepy - video ready /tmp/tmpa5b2u3k8/-step-0-to-step-1000.mp4


ℹ Pushing repo hishamcse/a2c-CartPole-v1 to the Hugging Face Hub


a2c-CartPole-v1.zip:   0%|          | 0.00/98.7k [00:00<?, ?B/s]

ℹ Your model is pushed to the Hub. You can view your model here:
https://huggingface.co/hishamcse/a2c-CartPole-v1/tree/main/


CommitInfo(commit_url='https://huggingface.co/hishamcse/a2c-CartPole-v1/commit/6d79699307880c5bac7f03173e009d6251b3123a', commit_message='Upload A2C CartPole-v1 trained agent by hishamcse', commit_description='', oid='6d79699307880c5bac7f03173e009d6251b3123a', pr_url=None, pr_revision=None, pr_num=None)

## PongNoFrameskip-v4

### Config 

In [29]:
import warnings
warnings.filterwarnings('ignore')

N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 10  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 8
N_EVAL_EPISODES = 5
TIMEOUT = int(60 * 50)

ENV_ID = "PongNoFrameskip-v4"

DEFAULT_HYPERPARAMS = {
    "policy": "CnnPolicy",
    "env": ENV_ID,
}

### Define the search space

https://github.com/CppMaster/SC2-AI/blob/master/optuna_utils/sample_params/ppo.py

In [30]:
from typing import Any, Dict
import torch
import torch.nn as nn

def sample_ppo_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for PPO hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)
    n_epochs = trial.suggest_int("n_epochs", 3, 20)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32, 64, 128, 256, 512])
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
    clip_range = trial.suggest_categorical("clip_range", [0.1, 0.2, 0.3, 0.4])
    gae_lambda = trial.suggest_float("gae_lambda", 0.8, 1, log=True)
    vf_coef = trial.suggest_float("vf_coef", 0, 1)

    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small", "medium"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu", "elu", "leaky_relu"])

    # The mini-batch cannot be larger than the rollout buffer (n_steps with a single env)
    if batch_size > n_steps:
        batch_size = n_steps

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    net_arch = {
        "tiny": [dict(pi=[8, 8], vf=[8, 8])],
        "small": [dict(pi=[64, 64], vf=[64, 64])],
        "medium": [dict(pi=[256, 256], vf=[256, 256])],
    }[net_arch]

    activation_fn = {
        "tanh": nn.Tanh, 
        "relu": nn.ReLU, 
        "elu": nn.ELU, 
        "leaky_relu": nn.LeakyReLU
    }[activation_fn]

    return {
        "n_steps": n_steps,
        "batch_size": batch_size,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "ent_coef": ent_coef,
        "clip_range": clip_range,
        "n_epochs": n_epochs,
        "gae_lambda": gae_lambda,
        "max_grad_norm": max_grad_norm,
        "vf_coef": vf_coef,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
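
The `batch_size > n_steps` check in the sampler keeps the mini-batch from exceeding the rollout buffer. The sketch below spells out that constraint under the assumption of `n_envs` parallel training environments (1 in this notebook); the two helpers are hypothetical and only illustrate the arithmetic.

```python
def clamp_batch_size(batch_size: int, n_steps: int, n_envs: int = 1) -> int:
    """Limit the mini-batch size to the rollout buffer size (n_steps * n_envs)."""
    return min(batch_size, n_steps * n_envs)


def has_truncated_minibatch(batch_size: int, n_steps: int, n_envs: int = 1) -> bool:
    """True when the buffer size is not a multiple of the mini-batch size."""
    return (n_steps * n_envs) % batch_size != 0


# Largest sampled batch size combined with the smallest sampled n_steps
batch_size = clamp_batch_size(batch_size=512, n_steps=2 ** 3)
print(batch_size)  # 8
print(has_truncated_minibatch(batch_size, n_steps=2 ** 3))  # False
```

Because both `n_steps` and the batch-size candidates are powers of two, the clamped batch size always divides the buffer size evenly in this search space.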

### Define the callback function

https://github.com/CppMaster/SC2-AI/blob/master/optuna_utils/trial_eval_callback.py

In [31]:
from stable_baselines3.common.callbacks import EvalCallback, BaseCallback, StopTrainingOnNoModelImprovement
from typing import Optional

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq:   Evaluate the agent every ``eval_freq`` call of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose:
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
        callback_after_eval: Optional[BaseCallback] = None
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
            callback_after_eval=callback_after_eval
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True

### Define the objective function

https://github.com/CppMaster/SC2-AI/blob/master/minigames/move_to_beacon/src/optuna_search.py

In [32]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it will sample hyperparameters,
    evaluate them and report the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`    

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_ppo_params(trial))

    # Create the RL model
    model = PPO(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_envs = make_vec_env(ENV_ID, n_envs=N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
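    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose, callback_after_eval)

    # Stop early once the best mean reward has not improved for more than 30
    # consecutive evaluations (evaluations are only counted after the first 50)
    stop_callback = StopTrainingOnNoModelImprovement(max_no_improvement_evals=30,
                                                     min_evals=50, verbose=1)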
    
    eval_callback = TrialEvalCallback(eval_envs, trial, n_eval_episodes=N_EVAL_EPISODES, 
                                      eval_freq=EVAL_FREQ, deterministic=True, verbose=0,
                                      callback_after_eval=stop_callback)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_envs.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [None]:
import torch as th

# Set PyTorch num threads to 1 for faster training
th.set_num_threads(1)

# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
pruner = MedianPruner(n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=10)

# Create the study and start the hyperparameter optimization
study = optuna.create_study(study_name='ppo-pong', sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_ppo_pong.csv")

try:
    fig1 = plot_optimization_history(study)
    fig2 = plot_param_importances(study)
    fig3 = plot_parallel_coordinate(study)

    fig1.show()
    fig2.show()
    fig3.show()

except (ValueError, ImportError, RuntimeError) as e:
    print("Error during plotting")
    print(e)

[I 2024-06-15 09:24:17,364] A new study created in memory with name: ppo-pong
[I 2024-06-15 09:38:24,802] Trial 2 finished with value: -21.0 and parameters: {'gamma': 0.00032860123733997106, 'max_grad_norm': 2.0445457512289225, 'exponent_n_steps': 5, 'n_epochs': 20, 'batch_size': 32, 'lr': 0.0034186389772942836, 'ent_coef': 0.00387076461028788, 'clip_range': 0.1, 'gae_lambda': 0.8057481336267502, 'vf_coef': 0.5256451599723027, 'net_arch': 'small', 'activation_fn': 'elu'}. Best is trial 0 with value: -21.0.
[I 2024-06-15 09:49:12,698] Trial 4 finished with value: -21.0 and parameters: {'gamma': 0.0007523048951526965, 'max_grad_norm': 0.5101104998439433, 'exponent_n_steps': 6, 'n_epochs': 16, 'batch_size': 256, 'lr': 0.0365322257156689, 'ent_coef': 6.56136192130792e-07, 'clip_range': 0.4, 'gae_lambda': 0.840191139530352, 'vf_coef': 0.5629588411917277, 'net_arch': 'tiny', 'activation_fn': 'elu'}. Best is trial 1 with value: -21.0.
[I 2024-06-15 09:57:01,304] Trial 5 finished with value: -
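
The `MedianPruner` used above stops a trial when its intermediate result is worse than the median of what earlier trials reported at the same step. The following is a simplified, standard-library sketch of that rule for a maximization study. It is not Optuna's implementation (Optuna compares the trial's best intermediate value so far, among other details), and the function name and data layout are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def should_prune_median(
    current_value: float,
    step: int,
    previous_trials: List[Dict[int, float]],
    n_startup_trials: int = 5,
    n_warmup_steps: int = 10,
) -> bool:
    """Simplified median rule for a study that maximizes its objective.

    ``previous_trials`` maps, for each finished trial, a step index to the
    intermediate value reported at that step (hypothetical layout).
    """
    # Never prune while too few trials have finished
    if len(previous_trials) < n_startup_trials:
        return False
    # Never prune during the warm-up steps of the current trial
    if step < n_warmup_steps:
        return False
    # Values that earlier trials reported at the same step
    values_at_step = [t[step] for t in previous_trials if step in t]
    if not values_at_step:
        return False
    return current_value < median(values_at_step)


# Made-up history of three finished trials
history = [{10: -21.0, 11: -20.0}, {10: -19.0, 11: -18.0}, {10: -20.0, 11: -19.5}]
print(should_prune_median(-20.5, step=10, previous_trials=history, n_startup_trials=3))
print(should_prune_median(-20.5, step=5, previous_trials=history, n_startup_trials=3))
```

When every trial reports the same value, as with the `-21.0` results in the log above, no trial falls strictly below the median under this rule, so nothing would be pruned.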

### Best Params for PongNoFrameskip-v4

In [45]:
!rm -rf /kaggle/working/ppo.yml

In [46]:
%%writefile ppo.yml
PongNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 5e6
  learning_rate: !!float 0.009929843682975054
  batch_size: 256
  gamma: .999
  clip_range: 0.4
  ent_coef: 1.6077823351479547e-08
  n_envs: 8
  n_epochs: 9
  n_steps: 128
  vf_coef: 0.7945615838365445
  gae_lambda: 0.9342974216877361
  policy_kwargs: "dict(net_arch=[dict(pi=[64, 64], vf=[64, 64]),],activation_fn=nn.Tanh)"
  normalize: False

Writing ppo.yml
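
The trial log stores the raw sampled values (for example `exponent_n_steps` and a very small `gamma`), while the config file uses `n_steps` and a `gamma` close to 1. The sketch below shows one way to convert between the two. It assumes the sampler stores `1 - gamma` under `"gamma"`, the base-2 exponent of `n_steps` under `"exponent_n_steps"`, and the learning rate under `"lr"`. The sampler is not shown in this section, so treat this mapping as an assumption and check it against your own `sample_ppo_params`; the helper name is hypothetical.

```python
from typing import Any, Dict


def trial_params_to_ppo_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Convert raw sampled values into PPO-style keyword arguments.

    Assumed convention (verify against your sampler):
    - "gamma" stores 1 - gamma
    - "exponent_n_steps" stores log2(n_steps)
    - "lr" stores the learning rate
    """
    kwargs = dict(params)  # do not modify the caller's dictionary
    kwargs["gamma"] = 1.0 - kwargs.pop("gamma")
    kwargs["n_steps"] = 2 ** kwargs.pop("exponent_n_steps")
    kwargs["learning_rate"] = kwargs.pop("lr")
    return kwargs


# Values copied from trial 2 in the optimization log
raw = {
    "gamma": 0.00032860123733997106,
    "exponent_n_steps": 5,
    "lr": 0.0034186389772942836,
}
converted = trial_params_to_ppo_kwargs(raw)
print(converted)
```

Categorical choices such as `net_arch` and `activation_fn` are passed through unchanged here; they would still need to be mapped to actual layer sizes and activation classes.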


### Train Agent 

In [47]:
!python -m rl_zoo3.train --algo ppo  --env PongNoFrameskip-v4 -f logs/ -c ppo.yml

2024-06-15 12:59:27.644816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-15 12:59:27.644872: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-15 12:59:27.646195: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Seed: 3607532832
Loading hyperparameters from: ppo.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256),
             ('clip_range', 0.4),
             ('ent_coef', 1.6077823351479547e-08),
             ('env_wrapper',
              ['stable_baselines3.common.atari_wrappers.AtariWrapper']),
          

### Evaluate Agent 

In [48]:
!python -m rl_zoo3.enjoy  --algo ppo  --env PongNoFrameskip-v4  --no-render  --n-timesteps 10000  --folder logs/

Loading latest experiment, id=4
Loading logs/ppo/PongNoFrameskip-v4_4/PongNoFrameskip-v4.zip
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Stacking 4 frames
Atari Episode Score: -21.00
Atari Episode Length 3056
Atari Episode Score: -21.00
Atari Episode Length 3056
Atari Episode Score: -21.00
Atari Episode Length 3056
Atari Episode

### Push to Huggingface

In [49]:
# Log in to the Hugging Face account so that models can be uploaded to the Hub
from huggingface_hub import notebook_login
notebook_login()
!git config --global credential.helper store


In [50]:
!python -m rl_zoo3.push_to_hub  --algo ppo --env PongNoFrameskip-v4  --repo-name ppo-PongNoFrameskip-v4  -orga hishamcse  -f logs/

Loading latest experiment, id=4
Loading logs/ppo/PongNoFrameskip-v4_4/PongNoFrameskip-v4.zip
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Stacking 4 frames
Wrapping the env in a VecTransposeImage.
Uploading to hishamcse/ppo-PongNoFrameskip-v4, make sure to have the rights
ℹ This function will save, evaluate, generate a v

## BreakoutNoFrameskip-v4 

### Best Params 

In [51]:
%%writefile b_ppo.yml
BreakoutNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 5e6
  learning_rate: lin_2.5e-4
  batch_size: 256
  gamma: .999
  clip_range: lin_0.1
  ent_coef: 0.01
  n_envs: 8
  n_epochs: 4
  n_steps: 128
  vf_coef: 0.6
  gae_lambda: 0.9342974216877361
  policy_kwargs: "dict(net_arch=[dict(pi=[64, 64], vf=[64, 64]),],activation_fn=nn.Tanh)"
  normalize: False

Writing b_ppo.yml
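
The `lin_` prefix on `learning_rate` and `clip_range` requests a linear schedule: the value starts at the given number and decays towards zero as training progresses. Stable-Baselines3 calls a schedule with `progress_remaining`, which goes from 1 (start of training) to 0 (end). The following is a minimal sketch of the idea, not a copy of the zoo's code.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from ``initial_value`` to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start) to 0.0 (end of training)
        return progress_remaining * initial_value

    return schedule


# Same initial value as `learning_rate: lin_2.5e-4` in the config above
lr_schedule = linear_schedule(2.5e-4)
for progress in (1.0, 0.5, 0.0):
    print(progress, lr_schedule(progress))
```

A decaying learning rate and clip range make updates more conservative late in training, which is a common choice for PPO on Atari.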


### Train 

In [52]:
!python -m rl_zoo3.train --algo ppo --env BreakoutNoFrameskip-v4 -f b_logs/ -c b_ppo.yml

Seed: 107358395
Loading hyperparameters from: b_ppo.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256),
             ('clip_range', 'lin_0.1'),
             ('ent_coef', 0.01),
             ('env_wrapper',
              ['stable_baselines3.common.atari_wrappers.AtariWrapper']),
             ('frame_

### Evaluate 

In [53]:
!python -m rl_zoo3.enjoy  --algo ppo --env BreakoutNoFrameskip-v4 --no-render  --n-timesteps 10000  --folder b_logs/

Loading latest experiment, id=1
Loading b_logs/ppo/BreakoutNoFrameskip-v4_1/BreakoutNoFrameskip-v4.zip
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Stacking 4 frames
Atari Episode Score: 50.00
Atari Episode Length 4819
Atari Episode Score: 31.00
Atari Episode Length 4012
Atari Episode Score: 47.00
Atari Episode Length 4766
Atari 
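
To summarize such evaluation output instead of reading scores line by line, the score lines can be parsed and averaged. The sketch below uses only the three complete episodes visible above; the helper name is hypothetical and the regular expression assumes the exact `Atari Episode Score:` prefix shown in the log.

```python
import re
from statistics import mean
from typing import List


def parse_episode_scores(log_text: str) -> List[float]:
    """Extract the numbers that follow 'Atari Episode Score:' in a log."""
    pattern = r"Atari Episode Score:\s*(-?\d+(?:\.\d+)?)"
    return [float(match) for match in re.findall(pattern, log_text)]


# The three complete episodes from the evaluation output
log_text = """\
Atari Episode Score: 50.00
Atari Episode Length 4819
Atari Episode Score: 31.00
Atari Episode Length 4012
Atari Episode Score: 47.00
Atari Episode Length 4766
"""
scores = parse_episode_scores(log_text)
print(scores, mean(scores))
```

Three episodes are far too few for a reliable estimate; in practice the average should be taken over all episodes of the evaluation run.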

### Push to Huggingface 

In [54]:
# Log in to the Hugging Face account so that models can be uploaded to the Hub
from huggingface_hub import notebook_login
notebook_login()
!git config --global credential.helper store


In [55]:
!python -m rl_zoo3.push_to_hub --algo ppo --env BreakoutNoFrameskip-v4  --repo-name ppo-BreakoutNoFrameskip-v4  -orga hishamcse  -f b_logs/

Loading latest experiment, id=1
Loading b_logs/ppo/BreakoutNoFrameskip-v4_1/BreakoutNoFrameskip-v4.zip
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Stacking 4 frames
Wrapping the env in a VecTransposeImage.
Uploading to hishamcse/ppo-BreakoutNoFrameskip-v4, make sure to have the rights
ℹ This function will save, evaluate