<a href="https://colab.research.google.com/github/iject/SB3/blob/main/optuna/optuna_lab_edit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter tuning with Optuna

GitHub repo: https://github.com/araffin/tools-for-robotic-rl-icra2022

Optuna: https://github.com/optuna/optuna

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

SB3 Contrib: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.


## Introduction

In this notebook, you will learn the importance of tuning hyperparameters. You will first try to optimize the parameters manually and then we will see how to automate the search using Optuna.


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [27]:
!pip install stable-baselines3
!pip install gymnasium[mujoco]

Collecting mujoco>=2.1.5 (from gymnasium[mujoco])
  Downloading mujoco-3.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (44 kB)
Collecting glfw (from mujoco>=2.1.5->gymnasium[mujoco])
  Downloading glfw-2.9.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Downloading mujoco-3.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
Downloading glfw-2.9.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl (243 kB)
Installing collected packag

In [2]:
# Optional: install SB3 contrib to have access to additional algorithms
!pip install sb3-contrib

Collecting sb3-contrib
  Downloading sb3_contrib-2.6.0-py3-none-any.whl.metadata (4.1 kB)
Downloading sb3_contrib-2.6.0-py3-none-any.whl (92 kB)
Installing collected packages: sb3-contrib
Successfully installed sb3-contrib-2.6.0


In [3]:
# Optuna will be used in the last part when doing hyperparameter tuning
!pip install optuna

Collecting optuna
  Downloading optuna-4.3.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.15.2-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.3.0-py3-none-any.whl (386 kB)
Downloading alembic-1.15.2-py3-none-any.whl (231 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.15.2 colorlog-6.9.0 optuna-4.3.0


## Imports

In [4]:
import gymnasium as gym
import numpy as np

The first thing you need to import is the RL model. Check the documentation to see which algorithm can be used on which problem.

In [5]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN

In [6]:
# Algorithms from the contrib repo
# https://github.com/Stable-Baselines-Team/stable-baselines3-contrib
from sb3_contrib import QRDQN, TQC

In [7]:
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Part I: The Importance Of Tuned Hyperparameters



Compared with supervised learning, deep reinforcement learning is far more sensitive to the choice of hyper-parameters such as the learning rate, the number of neurons, the number of layers, the optimizer, etc.

Poor choice of hyper-parameters can lead to poor/unstable convergence. This challenge is compounded by the variability in performance across random seeds (used to initialize the network weights and the environment).
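
Because of this seed-to-seed variability, a single run says little about a configuration. A common habit is to repeat the same configuration over several seeds and report the mean and the spread. Here is a minimal sketch using only the standard library; `summarize_runs` is a hypothetical helper and the reward values are made-up placeholders, not results from this notebook:

```python
import statistics


def summarize_runs(rewards_per_seed):
    """Return (mean, sample standard deviation) of final rewards, one per seed."""
    values = list(rewards_per_seed.values())
    mean = statistics.mean(values)
    # stdev needs at least two runs; fall back to 0.0 for a single run
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


# Placeholder numbers for illustration only (seed -> mean episode reward)
example_rewards = {0: -1200.0, 1: -900.0, 2: -1500.0}
mean, std = summarize_runs(example_rewards)
print(f"{mean:.1f} +/- {std:.1f} over {len(example_rewards)} seeds")
```

In practice each entry would come from training with a different `seed` and evaluating with `evaluate_policy`.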

In addition to hyperparameters, selecting the appropriate algorithm is also an important choice. We will demonstrate it on the simple Pendulum task.

See [gym doc](https://gym.openai.com/envs/Pendulum-v0/): "The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright."


Let's try first with PPO and a small budget of 4000 steps (20 episodes):

In [8]:
env_id = "Pendulum-v1"
# Env used only for evaluation
eval_envs = make_vec_env(env_id, n_envs=10)
# 4000 training timesteps
budget_pendulum = 4000
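
The episode count quoted for a budget follows from the fixed episode length. Here is a quick check, assuming Pendulum-v1 episodes are cut off after 200 steps (the value SB3 reports as `ep_len_mean` in its training logs); `episodes_in_budget` is a hypothetical helper:

```python
# Assumption: Pendulum-v1 episodes are truncated after 200 steps
PENDULUM_EPISODE_LENGTH = 200


def episodes_in_budget(n_timesteps, episode_length=PENDULUM_EPISODE_LENGTH):
    """Number of complete episodes that fit into a training budget."""
    return n_timesteps // episode_length


print(episodes_in_budget(4000))       # the budget used in this part
print(episodes_in_budget(10 * 4000))  # ten times that budget
```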

### PPO

In [None]:
ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1177.32 +/- 255.97


### A2C

In [None]:
# Define and train an A2C model
a2c_model = A2C("MlpPolicy", env_id, seed=0, verbose=0).learn(budget_pendulum)

In [None]:
# Evaluate the trained A2C model
mean_reward, std_reward = evaluate_policy(a2c_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"A2C Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

A2C Mean episode reward: -1324.29 +/- 318.48


Both are far from solving the env (which corresponds to a mean reward of around -200).

### Training PPO longer?

Maybe training longer would help?

You can try with 10x the budget, but for A2C/PPO training longer won't help much here; better hyperparameters are needed instead.

In [None]:
# train longer
new_budget = 10 * budget_pendulum

ppo_model = PPO("MlpPolicy", env_id, seed=0, verbose=0).learn(new_budget)

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

PPO Mean episode reward: -1226.19 +/- 280.93


### PPO - Tuned Hyperparameters

Using Optuna, we can in fact tune the hyperparameters and find a working solution (from the [RL Zoo](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/ppo.yml)):

In [None]:
tuned_params = {
    "gamma": 0.9,
    "use_sde": True,
    "sde_sample_freq": 4,
    "learning_rate": 1e-3,
}

# Budget: 50 000 timesteps (12.5x the initial budget_pendulum)
ppo_tuned_model = PPO("MlpPolicy", env_id, seed=1, verbose=1, **tuned_params).learn(50_000, log_interval=5)

Using cpu device
Creating environment from the given name 'Pendulum-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 200         |
|    ep_rew_mean          | -1.2e+03    |
| time/                   |             |
|    fps                  | 661         |
|    iterations           | 5           |
|    time_elapsed         | 15          |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.033493523 |
|    clip_fraction        | 0.157       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.62       |
|    explained_variance   | 0.811       |
|    learning_rate        | 0.001       |
|    loss                 | 14          |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0132     |
|    std                  | 0.933       |
|    value_

In [None]:
mean_reward, std_reward = evaluate_policy(ppo_tuned_model, eval_envs, n_eval_episodes=100, deterministic=True)

print(f"Tuned PPO Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

Tuned PPO Mean episode reward: -158.56 +/- 100.66


Note: if you try SAC on the simple MountainCarContinuous environment, you will encounter some issues without tuned hyperparameters: https://github.com/rail-berkeley/softlearning/issues/76

Simple environments can be challenging even for SOTA algorithms.

# Part II: Grad Student Descent


### Challenge (10 minutes): "Grad Student Descent"
The challenge is to find the best hyperparameters (max performance) for A2C on `CartPole-v1` with a limited budget of 20 000 training steps.


Maximum reward: 500 on `CartPole-v1`

The hyperparameters should work for different random seeds.

In [9]:
budget = 20_000

#### The baseline: default hyperparameters

In [10]:
eval_envs_cartpole = make_vec_env("CartPole-v1", n_envs=10)

In [11]:
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=0).learn(budget)

Using cpu device
Creating environment from the given name 'CartPole-v1'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 25.4     |
|    ep_rew_mean        | 25.4     |
| time/                 |          |
|    fps                | 351      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.518   |
|    explained_variance | 0.571    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 2.16     |
|    value_loss         | 7.77     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 26.4     |
|    ep_rew_mean        | 26.4     |
| time/                 |          |
|    fps                | 411      |


In [12]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:126.12 +/- 12.20


**Your goal is to beat that baseline and get closer to the optimal score of 500**

Time to tune!

In [13]:
import torch.nn as nn

In [14]:
policy_kwargs = dict(
    # Separate network architectures for the actor (pi) and the critic (vf),
    # passed as a plain dict (the list-of-dict form is deprecated in recent SB3 versions)
    net_arch=dict(vf=[64, 64], pi=[64, 64]),
    activation_fn=nn.Tanh,
)

hyperparams = dict(
    n_steps=5, # number of steps to collect data before updating policy
    learning_rate=7e-4,
    gamma=0.99, # discount factor
    max_grad_norm=0.5, # The maximum value for the gradient clipping
    ent_coef=0.0, # Entropy coefficient for the loss calculation
)

# Pass `policy_kwargs` explicitly, otherwise the settings above are never used
model = A2C("MlpPolicy", "CartPole-v1", seed=8, verbose=0, policy_kwargs=policy_kwargs, **hyperparams).learn(budget)

In [15]:
mean_reward, std_reward = evaluate_policy(model, eval_envs_cartpole, n_eval_episodes=50, deterministic=True)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:128.70 +/- 13.47


Hint - Recommended Hyperparameter Range

```python
gamma = trial.suggest_float("gamma", 0.9, 0.99999, log=True)
max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
# from 2**3 = 8 to 2**10 = 1024
n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)
learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
ent_coef = trial.suggest_float("ent_coef", 0.00000001, 0.1, log=True)
# net_arch tiny: {"pi": [64], "vf": [64]}
# net_arch default: {"pi": [64, 64], "vf": [64, 64]}
# activation_fn = nn.Tanh / nn.ReLU
```
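
Several of these ranges use `log=True`, which samples uniformly in log space so that every order of magnitude gets roughly equal attention. The sketch below imitates that idea with the standard library only; `sample_log_uniform` is a hypothetical stand-in, not Optuna's implementation:

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(rng, 1e-5, 1.0) for _ in range(10_000)]

# For a log-uniform draw over [1e-5, 1], about half of the samples should fall
# below the geometric midpoint sqrt(1e-5 * 1), i.e. roughly 3.2e-3
below_midpoint = sum(s < math.sqrt(1e-5) for s in samples) / len(samples)
print(f"fraction below the geometric midpoint: {below_midpoint:.2f}")
```

A plain uniform draw over the same range would almost never propose a learning rate below `1e-3`, which is why the log scale matters for the learning rate and the entropy coefficient.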

# Part III: Automatic Hyperparameter Tuning





In this part, we will create a script that searches for the best hyperparameters automatically.

### Imports

In [16]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [17]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    "env": ENV_ID,
}
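
With the values above, each trial is evaluated twice, after 10 000 and after 20 000 training steps. The short check below reproduces that schedule; `evaluation_steps` is a hypothetical helper, and it assumes a single training environment so that one callback call corresponds to one timestep:

```python
N_EVALUATIONS = 2
N_TIMESTEPS = int(2e4)
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)


def evaluation_steps(n_timesteps, eval_freq):
    """Timesteps at which a periodic evaluation is triggered."""
    return list(range(eval_freq, n_timesteps + 1, eval_freq))


print(EVAL_FREQ)                                 # 10000
print(evaluation_steps(N_TIMESTEPS, EVAL_FREQ))  # [10000, 20000]
```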

### Exercise (5 minutes): Define the search space

In [20]:
from typing import Any, Dict, Optional
import torch
import torch.nn as nn

def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for A2C hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 8, 16, 32, ... 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)

    ### YOUR CODE HERE
    # TODO:
    # - define the learning rate search space [1e-5, 1] (log) -> `suggest_float`
    # - define the network architecture search space ["tiny", "small"] -> `suggest_categorical`
    # - define the activation function search space ["tanh", "relu"]
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    ### END OF YOUR CODE

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

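    # Pass the architecture as a plain dict (the list-of-dict form is deprecated in recent SB3 versions)
    net_arch = (
        {"pi": [64], "vf": [64]}
        if net_arch == "tiny"
        else {"pi": [64, 64], "vf": [64, 64]}
    )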

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
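
Note that Optuna stores the raw sampled values, so the `gamma` entry of `trial.params` is really `1 - gamma`, and `exponent_n_steps` is an exponent rather than a step count. Here is a small sketch of how the raw values could be decoded by hand; `decode_a2c_params` is a hypothetical helper that mirrors `sample_a2c_params`, and the input values are arbitrary examples:

```python
def decode_a2c_params(raw):
    """Map raw Optuna parameter values to the quantities A2C actually receives."""
    net_archs = {
        "tiny": {"pi": [64], "vf": [64]},
        "small": {"pi": [64, 64], "vf": [64, 64]},
    }
    return {
        "gamma": 1.0 - raw["gamma"],              # the study samples 1 - gamma
        "n_steps": 2 ** raw["exponent_n_steps"],  # the study samples the exponent
        "learning_rate": raw["lr"],
        "max_grad_norm": raw["max_grad_norm"],
        "net_arch": net_archs[raw["net_arch"]],
        # Still a name here; map it to nn.Tanh / nn.ReLU before building the model
        "activation_fn": raw["activation_fn"],
    }


# Arbitrary example values in the format of `trial.params`
raw_example = {
    "gamma": 0.01,
    "max_grad_norm": 1.7,
    "exponent_n_steps": 6,
    "lr": 0.006,
    "net_arch": "tiny",
    "activation_fn": "tanh",
}
decoded = decode_a2c_params(raw_example)
print(decoded["gamma"], decoded["n_steps"])
```

This is also why `sample_a2c_params` records `gamma_` and `n_steps` as user attributes: they show the true values next to the raw ones.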

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [23]:
import os

In [25]:
log_path_EC = r"log_path/EvalCallback"
os.makedirs(log_path_EC, exist_ok=True)

In [26]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level
    :param log_path: Folder where the evaluation results are saved (optional)
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
        log_path: Optional[str] = None,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
            log_path=log_path,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
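
The pruning decision itself is made by the study's pruner. For the `MedianPruner` imported above, the core idea is to stop a trial whose intermediate value is worse than the median of earlier trials at the same step. The sketch below is a simplified, standard-library illustration of that rule for a maximized objective; it ignores start-up trials and warm-up steps, `should_prune_median` is a hypothetical function rather than Optuna's code, and the reward values are illustrative:

```python
import statistics


def should_prune_median(current_value, previous_values_at_step):
    """Return True if the current trial is below the median of earlier trials."""
    if not previous_values_at_step:
        # Nothing to compare against yet
        return False
    return current_value < statistics.median(previous_values_at_step)


# Illustrative intermediate rewards of earlier trials at the same evaluation step
earlier = [9.4, 485.4, 87.0, 183.1]
print(should_prune_median(59.3, earlier))   # below the median -> candidate for pruning
print(should_prune_median(215.3, earlier))  # above the median -> keeps training
```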

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which is in charge of sampling hyperparameters, creating the model, and returning the result to Optuna.

In [27]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it will sample hyperparameters,
    evaluate it and report the result (mean episodic reward after training)

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_a2c_params(trial))
    # Create the RL model
    model = A2C(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_env = make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose)
    eval_callback = TrialEvalCallback(eval_env, trial,
                                      N_EVAL_EPISODES, EVAL_FREQ,
                                      verbose=0, log_path=log_path_EC)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [28]:
%%time
# ~15 minutes

import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)
# Intended to skip pruning during the first third of the evaluations;
# note that with N_EVALUATIONS = 2 the integer division gives 0 warm-up steps
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_a2c_cartpole.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2025-05-13 11:39:38,138] A new study created in memory with name: no-name-2b90683f-4db3-4fdf-a0ae-ecc199ce7d87


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.66
Episode length: 9.40 +/- 0.66
New best mean reward!


[I 2025-05-13 11:40:05,108] Trial 0 finished with value: 9.5 and parameters: {'gamma': 0.0012048481146359712, 'max_grad_norm': 4.707126915205051, 'exponent_n_steps': 6, 'lr': 0.007165665181062419, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 9.5.


Eval num_timesteps=20000, episode_reward=9.50 +/- 0.81
Episode length: 9.50 +/- 0.81
New best mean reward!
Eval num_timesteps=10000, episode_reward=485.40 +/- 29.25
Episode length: 485.40 +/- 29.25
New best mean reward!


[I 2025-05-13 11:40:25,481] Trial 1 finished with value: 500.0 and parameters: {'gamma': 0.012815196432391299, 'max_grad_norm': 1.7474648131234445, 'exponent_n_steps': 6, 'lr': 0.006068960116412594, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=87.00 +/- 30.78
Episode length: 87.00 +/- 30.78
New best mean reward!


[I 2025-05-13 11:40:57,461] Trial 2 finished with value: 90.5 and parameters: {'gamma': 0.0064228497085123525, 'max_grad_norm': 3.591240666518141, 'exponent_n_steps': 3, 'lr': 1.556276893934477e-05, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=90.50 +/- 28.98
Episode length: 90.50 +/- 28.98
New best mean reward!
Eval num_timesteps=10000, episode_reward=183.10 +/- 108.83
Episode length: 183.10 +/- 108.83
New best mean reward!
Eval num_timesteps=20000, episode_reward=67.60 +/- 17.00
Episode length: 67.60 +/- 17.00


[I 2025-05-13 11:41:15,171] Trial 3 finished with value: 67.6 and parameters: {'gamma': 0.00012438516109911142, 'max_grad_norm': 1.591891715542006, 'exponent_n_steps': 10, 'lr': 0.03321031464339074, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.49
Episode length: 9.40 +/- 0.49
New best mean reward!
Eval num_timesteps=20000, episode_reward=9.60 +/- 0.92
Episode length: 9.60 +/- 0.92
New best mean reward!


[I 2025-05-13 11:41:34,952] Trial 4 finished with value: 9.6 and parameters: {'gamma': 0.007086897683522574, 'max_grad_norm': 0.7086484322759414, 'exponent_n_steps': 10, 'lr': 0.9044482374674678, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.
[I 2025-05-13 11:41:43,894] Trial 5 pruned. 


Eval num_timesteps=10000, episode_reward=59.30 +/- 6.81
Episode length: 59.30 +/- 6.81
New best mean reward!
Eval num_timesteps=10000, episode_reward=215.30 +/- 47.84
Episode length: 215.30 +/- 47.84
New best mean reward!


[I 2025-05-13 11:42:13,917] Trial 6 finished with value: 322.6 and parameters: {'gamma': 0.07176419066871012, 'max_grad_norm': 1.553962456497771, 'exponent_n_steps': 3, 'lr': 0.0007663156871253216, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=322.60 +/- 75.28
Episode length: 322.60 +/- 75.28
New best mean reward!


[I 2025-05-13 11:42:22,013] Trial 7 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
Eval num_timesteps=10000, episode_reward=137.20 +/- 17.23
Episode length: 137.20 +/- 17.23
New best mean reward!


[I 2025-05-13 11:42:46,301] Trial 8 finished with value: 500.0 and parameters: {'gamma': 0.01943866299474629, 'max_grad_norm': 2.825910788434675, 'exponent_n_steps': 5, 'lr': 0.0009548613971127521, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=249.50 +/- 19.59
Episode length: 249.50 +/- 19.59
New best mean reward!


[I 2025-05-13 11:43:05,890] Trial 9 finished with value: 500.0 and parameters: {'gamma': 0.00010905236609332113, 'max_grad_norm': 2.275976124365757, 'exponent_n_steps': 8, 'lr': 0.007117544498143271, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:43:16,243] Trial 10 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.70
Episode length: 9.10 +/- 0.70
New best mean reward!


[I 2025-05-13 11:43:28,401] Trial 11 pruned. 


Eval num_timesteps=10000, episode_reward=141.20 +/- 19.53
Episode length: 141.20 +/- 19.53
New best mean reward!


[I 2025-05-13 11:43:40,089] Trial 12 pruned. 


Eval num_timesteps=10000, episode_reward=110.30 +/- 4.43
Episode length: 110.30 +/- 4.43
New best mean reward!


[I 2025-05-13 11:43:53,849] Trial 13 pruned. 


Eval num_timesteps=10000, episode_reward=55.80 +/- 14.11
Episode length: 55.80 +/- 14.11
New best mean reward!
Eval num_timesteps=10000, episode_reward=377.30 +/- 121.16
Episode length: 377.30 +/- 121.16
New best mean reward!


[I 2025-05-13 11:44:12,785] Trial 14 finished with value: 500.0 and parameters: {'gamma': 0.006844084611787624, 'max_grad_norm': 3.30332500505751, 'exponent_n_steps': 7, 'lr': 0.0028881984991066907, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:44:24,499] Trial 15 pruned. 


Eval num_timesteps=10000, episode_reward=9.70 +/- 0.64
Episode length: 9.70 +/- 0.64
New best mean reward!


[I 2025-05-13 11:44:33,765] Trial 16 pruned. 


Eval num_timesteps=10000, episode_reward=94.70 +/- 59.50
Episode length: 94.70 +/- 59.50
New best mean reward!


[I 2025-05-13 11:44:47,373] Trial 17 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.66
Episode length: 9.40 +/- 0.66
New best mean reward!


[I 2025-05-13 11:45:01,368] Trial 18 pruned. 


Eval num_timesteps=10000, episode_reward=46.30 +/- 10.17
Episode length: 46.30 +/- 10.17
New best mean reward!


[I 2025-05-13 11:45:12,125] Trial 19 pruned. 


Eval num_timesteps=10000, episode_reward=82.50 +/- 33.15
Episode length: 82.50 +/- 33.15
New best mean reward!
Eval num_timesteps=10000, episode_reward=452.30 +/- 99.89
Episode length: 452.30 +/- 99.89
New best mean reward!
Eval num_timesteps=20000, episode_reward=465.70 +/- 56.95
Episode length: 465.70 +/- 56.95
New best mean reward!


[I 2025-05-13 11:45:34,841] Trial 20 finished with value: 465.7 and parameters: {'gamma': 0.04949056778991064, 'max_grad_norm': 3.1362754418621215, 'exponent_n_steps': 9, 'lr': 0.002798266694709333, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 1 with value: 500.0.
[I 2025-05-13 11:45:43,040] Trial 21 pruned. 


Eval num_timesteps=10000, episode_reward=70.40 +/- 20.04
Episode length: 70.40 +/- 20.04
New best mean reward!


[I 2025-05-13 11:45:52,392] Trial 22 pruned. 


Eval num_timesteps=10000, episode_reward=177.30 +/- 10.54
Episode length: 177.30 +/- 10.54
New best mean reward!


[I 2025-05-13 11:46:02,848] Trial 23 pruned. 


Eval num_timesteps=10000, episode_reward=180.90 +/- 91.26
Episode length: 180.90 +/- 91.26
New best mean reward!


[I 2025-05-13 11:46:12,636] Trial 24 pruned. 


Eval num_timesteps=10000, episode_reward=9.80 +/- 0.40
Episode length: 9.80 +/- 0.40
New best mean reward!
Eval num_timesteps=10000, episode_reward=396.40 +/- 152.53
Episode length: 396.40 +/- 152.53
New best mean reward!


[I 2025-05-13 11:46:31,801] Trial 25 finished with value: 500.0 and parameters: {'gamma': 0.0136448737575126, 'max_grad_norm': 1.3005702449485217, 'exponent_n_steps': 6, 'lr': 0.007085578115128296, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:46:40,834] Trial 26 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.83
Episode length: 9.10 +/- 0.83
New best mean reward!


[I 2025-05-13 11:46:49,325] Trial 27 pruned. 


Eval num_timesteps=10000, episode_reward=168.80 +/- 99.78
Episode length: 168.80 +/- 99.78
New best mean reward!
Eval num_timesteps=10000, episode_reward=241.90 +/- 123.01
Episode length: 241.90 +/- 123.01
New best mean reward!
Eval num_timesteps=20000, episode_reward=471.90 +/- 76.33
Episode length: 471.90 +/- 76.33
New best mean reward!


[I 2025-05-13 11:47:11,050] Trial 28 finished with value: 471.9 and parameters: {'gamma': 0.0006432192926722292, 'max_grad_norm': 0.9478098001252754, 'exponent_n_steps': 8, 'lr': 0.0010423230700244402, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 1 with value: 500.0.
[I 2025-05-13 11:47:21,864] Trial 29 pruned. 


Eval num_timesteps=10000, episode_reward=9.50 +/- 0.50
Episode length: 9.50 +/- 0.50
New best mean reward!


[I 2025-05-13 11:47:32,021] Trial 30 pruned. 


Eval num_timesteps=10000, episode_reward=168.00 +/- 31.08
Episode length: 168.00 +/- 31.08
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:47:50,711] Trial 31 finished with value: 496.0 and parameters: {'gamma': 0.013566344113846203, 'max_grad_norm': 3.06861889068011, 'exponent_n_steps': 7, 'lr': 0.0024107022307455553, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=496.00 +/- 12.00
Episode length: 496.00 +/- 12.00
Eval num_timesteps=10000, episode_reward=484.20 +/- 31.60
Episode length: 484.20 +/- 31.60
New best mean reward!


[I 2025-05-13 11:48:10,489] Trial 32 finished with value: 500.0 and parameters: {'gamma': 0.006046071709291928, 'max_grad_norm': 2.8506910027773666, 'exponent_n_steps': 7, 'lr': 0.0038970101089912754, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:48:19,382] Trial 33 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.49
Episode length: 9.40 +/- 0.49
New best mean reward!


[I 2025-05-13 11:48:28,094] Trial 34 pruned. 


Eval num_timesteps=10000, episode_reward=82.20 +/- 31.96
Episode length: 82.20 +/- 31.96
New best mean reward!


[I 2025-05-13 11:48:37,375] Trial 35 pruned. 


Eval num_timesteps=10000, episode_reward=114.30 +/- 102.74
Episode length: 114.30 +/- 102.74
New best mean reward!


[I 2025-05-13 11:48:46,646] Trial 36 pruned. 


Eval num_timesteps=10000, episode_reward=166.20 +/- 7.76
Episode length: 166.20 +/- 7.76
New best mean reward!


[I 2025-05-13 11:48:54,972] Trial 37 pruned. 


Eval num_timesteps=10000, episode_reward=47.10 +/- 8.28
Episode length: 47.10 +/- 8.28
New best mean reward!


[I 2025-05-13 11:49:09,810] Trial 38 pruned. 


Eval num_timesteps=10000, episode_reward=32.10 +/- 3.88
Episode length: 32.10 +/- 3.88
New best mean reward!


[I 2025-05-13 11:49:21,226] Trial 39 pruned. 


Eval num_timesteps=10000, episode_reward=173.80 +/- 20.13
Episode length: 173.80 +/- 20.13
New best mean reward!
Eval num_timesteps=10000, episode_reward=431.20 +/- 84.91
Episode length: 431.20 +/- 84.91
New best mean reward!


[I 2025-05-13 11:49:40,225] Trial 40 pruned. 


Eval num_timesteps=20000, episode_reward=464.60 +/- 71.21
Episode length: 464.60 +/- 71.21
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:50:01,632] Trial 41 finished with value: 500.0 and parameters: {'gamma': 0.013271806737811693, 'max_grad_norm': 1.176795158642811, 'exponent_n_steps': 6, 'lr': 0.007029353789827419, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:50:24,236] Trial 42 finished with value: 9.6 and parameters: {'gamma': 0.009472707763656206, 'max_grad_norm': 1.3139517317189608, 'exponent_n_steps': 5, 'lr': 0.014363429403047565, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=9.60 +/- 0.49
Episode length: 9.60 +/- 0.49
Eval num_timesteps=10000, episode_reward=352.20 +/- 17.76
Episode length: 352.20 +/- 17.76
New best mean reward!


[I 2025-05-13 11:50:44,206] Trial 43 finished with value: 500.0 and parameters: {'gamma': 0.020884808347814424, 'max_grad_norm': 1.6430032928107483, 'exponent_n_steps': 6, 'lr': 0.006350135123736204, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=484.10 +/- 33.22
Episode length: 484.10 +/- 33.22
New best mean reward!


[I 2025-05-13 11:51:06,315] Trial 44 finished with value: 500.0 and parameters: {'gamma': 0.09481093808580765, 'max_grad_norm': 2.137423242262984, 'exponent_n_steps': 6, 'lr': 0.0028951573186738445, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:51:21,183] Trial 45 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.66
Episode length: 9.40 +/- 0.66
New best mean reward!


[I 2025-05-13 11:51:31,034] Trial 46 pruned. 


Eval num_timesteps=10000, episode_reward=124.80 +/- 17.56
Episode length: 124.80 +/- 17.56
New best mean reward!


[I 2025-05-13 11:51:43,293] Trial 47 pruned. 


Eval num_timesteps=10000, episode_reward=9.40 +/- 0.66
Episode length: 9.40 +/- 0.66
New best mean reward!


[I 2025-05-13 11:51:52,176] Trial 48 pruned. 


Eval num_timesteps=10000, episode_reward=107.80 +/- 53.74
Episode length: 107.80 +/- 53.74
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:52:15,751] Trial 49 finished with value: 500.0 and parameters: {'gamma': 0.028964143101394984, 'max_grad_norm': 1.7682151937425326, 'exponent_n_steps': 7, 'lr': 0.007294152561993862, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00


[I 2025-05-13 11:52:27,528] Trial 50 pruned. 


Eval num_timesteps=10000, episode_reward=9.30 +/- 0.46
Episode length: 9.30 +/- 0.46
New best mean reward!


[I 2025-05-13 11:52:37,490] Trial 51 pruned. 


Eval num_timesteps=10000, episode_reward=370.80 +/- 61.54
Episode length: 370.80 +/- 61.54
New best mean reward!
Eval num_timesteps=10000, episode_reward=405.10 +/- 97.96
Episode length: 405.10 +/- 97.96
New best mean reward!


[I 2025-05-13 11:52:57,608] Trial 52 pruned. 


Eval num_timesteps=20000, episode_reward=496.70 +/- 9.90
Episode length: 496.70 +/- 9.90
New best mean reward!
Eval num_timesteps=10000, episode_reward=409.50 +/- 91.49
Episode length: 409.50 +/- 91.49
New best mean reward!


[I 2025-05-13 11:53:19,486] Trial 53 finished with value: 500.0 and parameters: {'gamma': 0.005096314522168715, 'max_grad_norm': 3.3767405830408, 'exponent_n_steps': 5, 'lr': 0.004561160265931337, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:53:29,648] Trial 54 pruned. 


Eval num_timesteps=10000, episode_reward=115.40 +/- 62.70
Episode length: 115.40 +/- 62.70
New best mean reward!


[I 2025-05-13 11:53:39,165] Trial 55 pruned. 


Eval num_timesteps=10000, episode_reward=221.30 +/- 140.10
Episode length: 221.30 +/- 140.10
New best mean reward!
Eval num_timesteps=10000, episode_reward=458.40 +/- 83.62
Episode length: 458.40 +/- 83.62
New best mean reward!


[I 2025-05-13 11:54:03,509] Trial 56 pruned. 


Eval num_timesteps=20000, episode_reward=270.60 +/- 187.41
Episode length: 270.60 +/- 187.41


[I 2025-05-13 11:54:13,480] Trial 57 pruned. 


Eval num_timesteps=10000, episode_reward=343.70 +/- 129.20
Episode length: 343.70 +/- 129.20
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 11:54:34,233] Trial 58 finished with value: 9.8 and parameters: {'gamma': 0.010568363052910559, 'max_grad_norm': 1.2299475998513585, 'exponent_n_steps': 5, 'lr': 0.008023760008201592, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 1 with value: 500.0.


Eval num_timesteps=20000, episode_reward=9.80 +/- 0.60
Episode length: 9.80 +/- 0.60


[I 2025-05-13 11:54:45,036] Trial 59 pruned. 


Eval num_timesteps=10000, episode_reward=194.60 +/- 23.95
Episode length: 194.60 +/- 23.95
New best mean reward!
Number of finished trials:  60
Best trial:
  Value: 500.0
  Params: 
    gamma: 0.012815196432391299
    max_grad_norm: 1.7474648131234445
    exponent_n_steps: 6
    lr: 0.006068960116412594
    net_arch: tiny
    activation_fn: tanh
  User attrs:
    gamma_: 0.9871848035676087
    n_steps: 64


Complete example: https://github.com/DLR-RM/rl-baselines3-zoo

In [32]:
study.best_params

{'gamma': 0.012815196432391299,
 'max_grad_norm': 1.7474648131234445,
 'exponent_n_steps': 6,
 'lr': 0.006068960116412594,
 'net_arch': 'tiny',
 'activation_fn': 'tanh'}
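
Note that `study.best_params` holds the raw sampled values, not the hyperparameters A2C receives: `gamma` is stored as `1 - gamma` and `n_steps` as an exponent of two. The sketch below converts them back and should reproduce the `gamma_` and `n_steps` user attributes reported for the best trial. The helper name `to_a2c_values` is hypothetical and only used for illustration.

```python
def to_a2c_values(best_params: dict) -> dict:
    """Convert raw Optuna samples into the values A2C actually receives."""
    return {
        # The search space samples (1 - gamma) on a log scale
        "gamma": 1.0 - best_params["gamma"],
        # The search space samples the exponent of a power of two
        "n_steps": 2 ** best_params["exponent_n_steps"],
    }


# Values copied from the `study.best_params` output above
best_params = {
    "gamma": 0.012815196432391299,
    "max_grad_norm": 1.7474648131234445,
    "exponent_n_steps": 6,
    "lr": 0.006068960116412594,
    "net_arch": "tiny",
    "activation_fn": "tanh",
}

converted = to_a2c_values(best_params)
print(converted)
```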

In [47]:
display(study.best_trial)
display(len(study.trials))


FrozenTrial(number=1, state=1, values=[500.0], datetime_start=datetime.datetime(2025, 5, 13, 11, 40, 5, 113075), datetime_complete=datetime.datetime(2025, 5, 13, 11, 40, 25, 481307), params={'gamma': 0.012815196432391299, 'max_grad_norm': 1.7474648131234445, 'exponent_n_steps': 6, 'lr': 0.006068960116412594, 'net_arch': 'tiny', 'activation_fn': 'tanh'}, user_attrs={'gamma_': 0.9871848035676087, 'n_steps': 64}, system_attrs={}, intermediate_values={1: 485.4, 2: 500.0}, distributions={'gamma': FloatDistribution(high=0.1, log=True, low=0.0001, step=None), 'max_grad_norm': FloatDistribution(high=5.0, log=True, low=0.3, step=None), 'exponent_n_steps': IntDistribution(high=10, log=False, low=3, step=1), 'lr': FloatDistribution(high=1.0, log=True, low=1e-05, step=None), 'net_arch': CategoricalDistribution(choices=('tiny', 'small')), 'activation_fn': CategoricalDistribution(choices=('tanh', 'relu'))}, trial_id=1, value=None)

60

In [37]:
from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_rank
from optuna.visualization import plot_slice
from optuna.visualization import plot_timeline

In [38]:
plot_optimization_history(study)


In [39]:
plot_intermediate_values(study)


In [40]:
plot_parallel_coordinate(study)


In [42]:
optuna.importance.get_param_importances(study)

{'lr': np.float64(0.5276217753400368),
 'gamma': np.float64(0.18475264828839788),
 'max_grad_norm': np.float64(0.1256818169856007),
 'activation_fn': np.float64(0.12460765292023145),
 'exponent_n_steps': np.float64(0.037247955233250446),
 'net_arch': np.float64(8.815123248268194e-05)}
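
By default these scores are normalized so that they sum to one, which makes them easy to read as shares. A small sketch using the values printed above (copied by hand as plain floats rather than `np.float64`):

```python
# Importances copied from the output above
importances = {
    "lr": 0.5276217753400368,
    "gamma": 0.18475264828839788,
    "max_grad_norm": 0.1256818169856007,
    "activation_fn": 0.12460765292023145,
    "exponent_n_steps": 0.037247955233250446,
    "net_arch": 8.815123248268194e-05,
}

# The scores should add up to (approximately) one
total = sum(importances.values())

# Rank hyperparameters from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:>18}: {score:6.1%}")
print(f"{'total':>18}: {total:6.1%}")
```

In this run the learning rate accounts for roughly half of the total, while the network size barely matters.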

In [43]:
optuna.visualization.plot_param_importances(study)

# Conclusion

What we have seen in this notebook:
- the importance of good hyperparameters
- how to do automatic hyperparameter search with Optuna


---

# Next: my own work

# Part IV: Automatic Hyperparameter Tuning for vec_env





In this part we will create a script that searches for the best hyperparameters automatically, this time training on a vectorized environment.

In [46]:
data = np.load(r"/content/log_path/EvalCallback/evaluations.npz")
print(data)
lst = data.files
print(lst)
for item in lst:
    print(item)
    print(data[item])


NpzFile '/content/log_path/EvalCallback/evaluations.npz' with keys: timesteps, results, ep_lengths
['timesteps', 'results', 'ep_lengths']
timesteps
[10000]
results
[[174. 177. 179. 212. 220. 169. 169. 213. 192. 241.]]
ep_lengths
[[174 177 179 212 220 169 169 213 192 241]]
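
Each row of `results` holds the episode rewards of one evaluation, so the mean reward reported by the callback can be recomputed from the file. A minimal sketch using the values printed above, copied into plain Python lists so that it runs without the `.npz` file:

```python
from statistics import fmean, pstdev

# Values copied from the `evaluations.npz` dump above:
# one row per evaluation, one column per evaluation episode
timesteps = [10000]
results = [[174.0, 177.0, 179.0, 212.0, 220.0, 169.0, 169.0, 213.0, 192.0, 241.0]]

for step, episode_rewards in zip(timesteps, results):
    mean_reward = fmean(episode_rewards)
    # Population standard deviation, the same convention as `np.std`
    std_reward = pstdev(episode_rewards)
    print(f"Eval num_timesteps={step}, episode_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

The result agrees with the last evaluation line logged by the previous study (194.60 +/- 23.95), which is the run that wrote this file.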


### Imports

In [None]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [48]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1  # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
N_TIMESTEPS = int(2e4)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

# for train_env
N_ENVS = N_EVAL_ENVS
EVAL_FREQ = max(EVAL_FREQ // N_ENVS, 1)

ENV_ID = "CartPole-v1"

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
}
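
`EvalCallback` counts `eval_freq` in callback calls, and with a vectorized training env each call advances all `N_ENVS` environments by one step. Dividing by `N_ENVS` therefore keeps the evaluation interval at the intended number of timesteps. A quick check of the arithmetic with the values from the config above (the lowercase names are local to this sketch):

```python
n_timesteps = int(2e4)  # training budget, as N_TIMESTEPS
n_evaluations = 2       # as N_EVALUATIONS
n_envs = 5              # as N_ENVS

# Desired interval between evaluations, in environment timesteps
eval_freq_timesteps = n_timesteps // n_evaluations

# Interval expressed in callback calls (one call = one step in every env)
eval_freq_calls = max(eval_freq_timesteps // n_envs, 1)

# Sanity check: convert back to timesteps and count evaluations in the budget
timesteps_between_evals = eval_freq_calls * n_envs
n_evals_in_budget = n_timesteps // timesteps_between_evals

print(eval_freq_calls, timesteps_between_evals, n_evals_in_budget)
```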

### Exercise (5 minutes): Define the search space

In [49]:
from typing import Any, Dict, Optional
import torch
import torch.nn as nn

def sample_a2c_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for A2C hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    # Discount factor between 0.9 and 0.9999
    gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.1, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0, log=True)
    # 8, 16, 32, ... 1024
    n_steps = 2 ** trial.suggest_int("exponent_n_steps", 3, 10)

    ### YOUR CODE HERE
    # TODO:
    # - define the learning rate search space [1e-5, 1] (log) -> `suggest_float`
    # - define the network architecture search space ["tiny", "small"] -> `suggest_categorical`
    # - define the activation function search space ["tanh", "relu"]
    learning_rate = trial.suggest_float("lr", 1e-5, 1, log=True)
    net_arch = trial.suggest_categorical("net_arch", ["tiny", "small"])
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    ### END OF YOUR CODE

    # Display true values
    trial.set_user_attr("gamma_", gamma)
    trial.set_user_attr("n_steps", n_steps)

    # SB3 >= 1.8.0 expects a dict here, not a list containing a dict
    net_arch = (
        {"pi": [64], "vf": [64]}
        if net_arch == "tiny"
        else {"pi": [64, 64], "vf": [64, 64]}
    )

    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "policy_kwargs": {
            "net_arch": net_arch,
            "activation_fn": activation_fn,
        },
    }
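
With `log=True`, `suggest_float` samples uniformly in log space, so each decade of the range is equally likely. The sketch below imitates this with the standard library only. It is not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper introduced for illustration.

```python
import math
import random


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# Same bounds as the "gamma" parameter: the sampled quantity is (1 - gamma)
one_minus_gamma = [sample_log_uniform(rng, 1e-4, 0.1) for _ in range(1000)]
gammas = [1.0 - x for x in one_minus_gamma]

# Same bounds as "exponent_n_steps": powers of two from 8 to 1024
n_steps_choices = [2 ** exponent for exponent in range(3, 11)]

# Share of samples in each decade of (1 - gamma); roughly one third each
for low in (1e-4, 1e-3, 1e-2):
    share = sum(low <= x < 10 * low for x in one_minus_gamma) / len(one_minus_gamma)
    print(f"[{low:g}, {10 * low:g}): {share:.1%}")

print(min(gammas), max(gammas), n_steps_choices)
```

Sampling `1 - gamma` on a log scale concentrates the search near `gamma = 1`, where small changes in the discount factor matter most.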

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [50]:
import os

In [51]:
log_path_EC = r"log_path/EvalCallback"
os.makedirs(log_path_EC, exist_ok=True)

In [52]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level
    :param log_path: Folder where ``evaluations.npz`` is saved (``None`` disables logging)
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
        log_path: Optional[str] = None,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
            log_path=log_path,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
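
The callback only reports values and asks `trial.should_prune()`; the decision itself is made by the pruner attached to the study. With a median pruner, a trial is stopped when its reported value is worse than the median of what earlier trials reported at the same step. The sketch below is a simplified version of that rule for a maximization study. It is not Optuna's `MedianPruner`, which has more options, and both the function name `should_prune_median` and the reward values are illustrative.

```python
from statistics import median
from typing import List


def should_prune_median(
    current_value: float,
    previous_values: List[float],
    n_startup_trials: int = 5,
) -> bool:
    """Simplified median rule for a study that maximizes its objective."""
    # Never prune until enough earlier trials are available for comparison
    if len(previous_values) < n_startup_trials:
        return False
    # Prune when the current trial is below the median of earlier trials
    return current_value < median(previous_values)


# Illustrative mean rewards of earlier trials at their first evaluation
first_eval_rewards = [283.2, 70.1, 9.2, 9.2, 385.6]

weak_trial = should_prune_median(50.3, first_eval_rewards)
strong_trial = should_prune_median(307.9, first_eval_rewards)
too_early = should_prune_median(50.3, first_eval_rewards[:3])

print(weak_trial, strong_trial, too_early)
```

Because the comparison happens at each reported step, a weak configuration can be dropped after the first evaluation instead of consuming the whole training budget.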

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which samples hyperparameters, creates and trains the model, and returns the result to Optuna.

In [53]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters,
    evaluates them and reports the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_a2c_params(trial))
    # Create the RL model
    train_env = make_vec_env(ENV_ID, N_ENVS)
    model = A2C(env=train_env, **kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_env = make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose, log_path)
    eval_callback = TrialEvalCallback(eval_env, trial,
                                      N_EVAL_EPISODES, EVAL_FREQ,
                                      verbose=0, log_path=log_path_EC)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [54]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS, multivariate=True)
# Do not prune before 1/3 of the max budget is used
# (with N_EVALUATIONS = 2, N_EVALUATIONS // 3 is 0, so there is no warm-up here)
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_a2c_cartpole.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()


Argument ``multivariate`` is an experimental feature. The interface can change in the future.

[I 2025-05-13 12:48:41,823] A new study created in memory with name: no-name-ae91113b-e754-4d45-a61a-83251135f7a1

As shared layers in the mlp_extractor are removed since SB3 v1.8.0, you should now pass directly a dictionary and not a list (net_arch=dict(pi=..., vf=...) instead of net_arch=[dict(pi=..., vf=...)])



Eval num_timesteps=10000, episode_reward=283.20 +/- 130.66
Episode length: 283.20 +/- 130.66
New best mean reward!


[I 2025-05-13 12:48:48,303] Trial 0 finished with value: 500.0 and parameters: {'gamma': 0.002022566695992499, 'max_grad_norm': 0.9534889257609054, 'exponent_n_steps': 5, 'lr': 0.004627032684513558, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=70.10 +/- 17.48
Episode length: 70.10 +/- 17.48
New best mean reward!


[I 2025-05-13 12:48:55,851] Trial 1 finished with value: 88.7 and parameters: {'gamma': 0.024523345068315085, 'max_grad_norm': 0.3980820626059061, 'exponent_n_steps': 7, 'lr': 2.171859073238636e-05, 'net_arch': 'tiny', 'activation_fn': 'tanh'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=88.70 +/- 32.20
Episode length: 88.70 +/- 32.20
New best mean reward!
Eval num_timesteps=10000, episode_reward=9.20 +/- 0.75
Episode length: 9.20 +/- 0.75
New best mean reward!


[I 2025-05-13 12:49:01,187] Trial 2 finished with value: 54.1 and parameters: {'gamma': 0.00011136442137797863, 'max_grad_norm': 1.834734215486351, 'exponent_n_steps': 9, 'lr': 0.019416395632161135, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=54.10 +/- 24.74
Episode length: 54.10 +/- 24.74
New best mean reward!
Eval num_timesteps=10000, episode_reward=9.20 +/- 0.60
Episode length: 9.20 +/- 0.60
New best mean reward!


[I 2025-05-13 12:49:07,885] Trial 3 finished with value: 9.6 and parameters: {'gamma': 0.06288350294948113, 'max_grad_norm': 0.33768296448436336, 'exponent_n_steps': 8, 'lr': 0.09654869862445599, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=9.60 +/- 0.92
Episode length: 9.60 +/- 0.92
New best mean reward!
Eval num_timesteps=10000, episode_reward=385.60 +/- 101.39
Episode length: 385.60 +/- 101.39
New best mean reward!


[I 2025-05-13 12:49:15,045] Trial 4 finished with value: 74.4 and parameters: {'gamma': 0.003203427621611785, 'max_grad_norm': 1.0034430334733164, 'exponent_n_steps': 4, 'lr': 0.010654915246562629, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=74.40 +/- 4.34
Episode length: 74.40 +/- 4.34


[I 2025-05-13 12:49:17,590] Trial 5 pruned. 


Eval num_timesteps=10000, episode_reward=50.30 +/- 9.89
Episode length: 50.30 +/- 9.89
New best mean reward!
Eval num_timesteps=10000, episode_reward=307.90 +/- 148.64
Episode length: 307.90 +/- 148.64
New best mean reward!


[I 2025-05-13 12:49:24,664] Trial 6 finished with value: 500.0 and parameters: {'gamma': 0.00043209781191678103, 'max_grad_norm': 0.5030035401544901, 'exponent_n_steps': 5, 'lr': 0.005169111467919504, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:49:27,385] Trial 7 pruned. 


Eval num_timesteps=10000, episode_reward=88.00 +/- 25.98
Episode length: 88.00 +/- 25.98
New best mean reward!
Eval num_timesteps=10000, episode_reward=498.80 +/- 3.60
Episode length: 498.80 +/- 3.60
New best mean reward!


[I 2025-05-13 12:49:34,471] Trial 8 finished with value: 500.0 and parameters: {'gamma': 0.00047661017337397206, 'max_grad_norm': 2.9020608530750907, 'exponent_n_steps': 6, 'lr': 0.01097978197279876, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:49:37,364] Trial 9 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.54
Episode length: 9.10 +/- 0.54
New best mean reward!


[I 2025-05-13 12:49:41,098] Trial 10 pruned. 


Eval num_timesteps=10000, episode_reward=178.10 +/- 8.71
Episode length: 178.10 +/- 8.71
New best mean reward!


[I 2025-05-13 12:49:44,956] Trial 11 pruned. 


Eval num_timesteps=10000, episode_reward=184.50 +/- 20.30
Episode length: 184.50 +/- 20.30
New best mean reward!


[I 2025-05-13 12:49:47,910] Trial 12 pruned. 


Eval num_timesteps=10000, episode_reward=184.60 +/- 127.29
Episode length: 184.60 +/- 127.29
New best mean reward!


[I 2025-05-13 12:49:50,508] Trial 13 pruned. 


Eval num_timesteps=10000, episode_reward=9.20 +/- 0.75
Episode length: 9.20 +/- 0.75
New best mean reward!


[I 2025-05-13 12:49:53,066] Trial 14 pruned. 


Eval num_timesteps=10000, episode_reward=69.00 +/- 7.82
Episode length: 69.00 +/- 7.82
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:50:02,690] Trial 15 finished with value: 464.4 and parameters: {'gamma': 0.005313908409744688, 'max_grad_norm': 0.6443054938706968, 'exponent_n_steps': 3, 'lr': 0.0010147656830992661, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=464.40 +/- 57.26
Episode length: 464.40 +/- 57.26


[I 2025-05-13 12:50:05,759] Trial 16 pruned. 


Eval num_timesteps=10000, episode_reward=264.00 +/- 120.98
Episode length: 264.00 +/- 120.98
New best mean reward!
Eval num_timesteps=10000, episode_reward=414.00 +/- 90.89
Episode length: 414.00 +/- 90.89
New best mean reward!


[I 2025-05-13 12:50:13,059] Trial 17 finished with value: 371.8 and parameters: {'gamma': 0.0009658428794749642, 'max_grad_norm': 0.9958692176938595, 'exponent_n_steps': 5, 'lr': 0.0018773243242793049, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=371.80 +/- 94.67
Episode length: 371.80 +/- 94.67


[I 2025-05-13 12:50:16,455] Trial 18 pruned. 


Eval num_timesteps=10000, episode_reward=39.30 +/- 22.28
Episode length: 39.30 +/- 22.28
New best mean reward!


[I 2025-05-13 12:50:18,874] Trial 19 pruned. 


Eval num_timesteps=10000, episode_reward=66.80 +/- 14.66
Episode length: 66.80 +/- 14.66
New best mean reward!


[I 2025-05-13 12:50:21,885] Trial 20 pruned. 


Eval num_timesteps=10000, episode_reward=86.20 +/- 22.38
Episode length: 86.20 +/- 22.38
New best mean reward!


[I 2025-05-13 12:50:25,441] Trial 21 pruned. 


Eval num_timesteps=10000, episode_reward=9.20 +/- 0.75
Episode length: 9.20 +/- 0.75
New best mean reward!


[I 2025-05-13 12:50:28,294] Trial 22 pruned. 


Eval num_timesteps=10000, episode_reward=231.30 +/- 138.96
Episode length: 231.30 +/- 138.96
New best mean reward!


[I 2025-05-13 12:50:30,815] Trial 23 pruned. 


Eval num_timesteps=10000, episode_reward=107.10 +/- 52.72
Episode length: 107.10 +/- 52.72
New best mean reward!


[I 2025-05-13 12:50:33,260] Trial 24 pruned. 


Eval num_timesteps=10000, episode_reward=9.20 +/- 0.87
Episode length: 9.20 +/- 0.87
New best mean reward!


[I 2025-05-13 12:50:36,882] Trial 25 pruned. 


Eval num_timesteps=10000, episode_reward=98.80 +/- 44.25
Episode length: 98.80 +/- 44.25
New best mean reward!
Eval num_timesteps=10000, episode_reward=404.70 +/- 134.56
Episode length: 404.70 +/- 134.56
New best mean reward!


[I 2025-05-13 12:50:44,415] Trial 26 finished with value: 489.3 and parameters: {'gamma': 0.0013814325551818269, 'max_grad_norm': 3.4804928173852963, 'exponent_n_steps': 5, 'lr': 0.0014810482635905446, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=489.30 +/- 32.10
Episode length: 489.30 +/- 32.10
New best mean reward!


[I 2025-05-13 12:50:47,415] Trial 27 pruned. 


Eval num_timesteps=10000, episode_reward=38.40 +/- 19.74
Episode length: 38.40 +/- 19.74
New best mean reward!


[I 2025-05-13 12:50:51,106] Trial 28 pruned. 


Eval num_timesteps=10000, episode_reward=135.10 +/- 115.93
Episode length: 135.10 +/- 115.93
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:50:57,481] Trial 29 finished with value: 235.4 and parameters: {'gamma': 0.000809411971488431, 'max_grad_norm': 0.5082063170538985, 'exponent_n_steps': 5, 'lr': 0.018925417827107775, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=235.40 +/- 7.51
Episode length: 235.40 +/- 7.51


[I 2025-05-13 12:51:01,972] Trial 30 pruned. 


Eval num_timesteps=10000, episode_reward=381.40 +/- 156.66
Episode length: 381.40 +/- 156.66
New best mean reward!


[I 2025-05-13 12:51:05,387] Trial 31 pruned. 


Eval num_timesteps=10000, episode_reward=148.50 +/- 59.25
Episode length: 148.50 +/- 59.25
New best mean reward!


[I 2025-05-13 12:51:08,297] Trial 32 pruned. 


Eval num_timesteps=10000, episode_reward=9.50 +/- 0.92
Episode length: 9.50 +/- 0.92
New best mean reward!


[I 2025-05-13 12:51:10,676] Trial 33 pruned. 


Eval num_timesteps=10000, episode_reward=60.10 +/- 19.49
Episode length: 60.10 +/- 19.49
New best mean reward!


[I 2025-05-13 12:51:13,464] Trial 34 pruned. 


Eval num_timesteps=10000, episode_reward=26.90 +/- 4.53
Episode length: 26.90 +/- 4.53
New best mean reward!


[I 2025-05-13 12:51:17,052] Trial 35 pruned. 


Eval num_timesteps=10000, episode_reward=209.30 +/- 147.22
Episode length: 209.30 +/- 147.22
New best mean reward!


[I 2025-05-13 12:51:19,530] Trial 36 pruned. 


Eval num_timesteps=10000, episode_reward=72.20 +/- 59.59
Episode length: 72.20 +/- 59.59
New best mean reward!


[I 2025-05-13 12:51:22,034] Trial 37 pruned. 


Eval num_timesteps=10000, episode_reward=95.20 +/- 20.48
Episode length: 95.20 +/- 20.48
New best mean reward!


[I 2025-05-13 12:51:24,506] Trial 38 pruned. 


Eval num_timesteps=10000, episode_reward=59.50 +/- 21.83
Episode length: 59.50 +/- 21.83
New best mean reward!


[I 2025-05-13 12:51:27,655] Trial 39 pruned. 


Eval num_timesteps=10000, episode_reward=27.50 +/- 3.26
Episode length: 27.50 +/- 3.26
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:51:34,978] Trial 40 finished with value: 135.7 and parameters: {'gamma': 0.0011515696169602763, 'max_grad_norm': 0.40548854664844924, 'exponent_n_steps': 4, 'lr': 0.0031919736826360318, 'net_arch': 'small', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=135.70 +/- 10.56
Episode length: 135.70 +/- 10.56
Eval num_timesteps=10000, episode_reward=496.70 +/- 9.90
Episode length: 496.70 +/- 9.90
New best mean reward!


[I 2025-05-13 12:51:44,352] Trial 41 finished with value: 419.8 and parameters: {'gamma': 0.003525568837046354, 'max_grad_norm': 0.3281702995522167, 'exponent_n_steps': 3, 'lr': 0.0018295110923661096, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=419.80 +/- 84.53
Episode length: 419.80 +/- 84.53


[I 2025-05-13 12:51:48,214] Trial 42 pruned. 


Eval num_timesteps=10000, episode_reward=120.50 +/- 72.21
Episode length: 120.50 +/- 72.21
New best mean reward!


[I 2025-05-13 12:51:51,937] Trial 43 pruned. 


Eval num_timesteps=10000, episode_reward=91.40 +/- 19.49
Episode length: 91.40 +/- 19.49
New best mean reward!


[I 2025-05-13 12:51:55,267] Trial 44 pruned. 


Eval num_timesteps=10000, episode_reward=148.40 +/- 51.56
Episode length: 148.40 +/- 51.56
New best mean reward!


[I 2025-05-13 12:51:59,174] Trial 45 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.70
Episode length: 9.10 +/- 0.70
New best mean reward!


[I 2025-05-13 12:52:01,897] Trial 46 pruned. 


Eval num_timesteps=10000, episode_reward=69.90 +/- 16.25
Episode length: 69.90 +/- 16.25
New best mean reward!


[I 2025-05-13 12:52:06,737] Trial 47 pruned. 


Eval num_timesteps=10000, episode_reward=9.50 +/- 0.50
Episode length: 9.50 +/- 0.50
New best mean reward!


[I 2025-05-13 12:52:09,357] Trial 48 pruned. 


Eval num_timesteps=10000, episode_reward=85.00 +/- 33.57
Episode length: 85.00 +/- 33.57
New best mean reward!


[I 2025-05-13 12:52:12,935] Trial 49 pruned. 


Eval num_timesteps=10000, episode_reward=12.60 +/- 1.28
Episode length: 12.60 +/- 1.28
New best mean reward!


[I 2025-05-13 12:52:16,306] Trial 50 pruned. 


Eval num_timesteps=10000, episode_reward=186.30 +/- 27.96
Episode length: 186.30 +/- 27.96
New best mean reward!


[I 2025-05-13 12:52:20,567] Trial 51 pruned. 


Eval num_timesteps=10000, episode_reward=202.00 +/- 130.80
Episode length: 202.00 +/- 130.80
New best mean reward!


[I 2025-05-13 12:52:24,060] Trial 52 pruned. 


Eval num_timesteps=10000, episode_reward=25.00 +/- 5.00
Episode length: 25.00 +/- 5.00
New best mean reward!


[I 2025-05-13 12:52:27,535] Trial 53 pruned. 


Eval num_timesteps=10000, episode_reward=9.20 +/- 0.75
Episode length: 9.20 +/- 0.75
New best mean reward!


[I 2025-05-13 12:52:30,256] Trial 54 pruned. 


Eval num_timesteps=10000, episode_reward=9.10 +/- 0.83
Episode length: 9.10 +/- 0.83
New best mean reward!


[I 2025-05-13 12:52:33,654] Trial 55 pruned. 


Eval num_timesteps=10000, episode_reward=231.40 +/- 93.02
Episode length: 231.40 +/- 93.02
New best mean reward!


[I 2025-05-13 12:52:36,447] Trial 56 pruned. 


Eval num_timesteps=10000, episode_reward=156.50 +/- 82.70
Episode length: 156.50 +/- 82.70
New best mean reward!


[I 2025-05-13 12:52:39,065] Trial 57 pruned. 


Eval num_timesteps=10000, episode_reward=76.80 +/- 23.71
Episode length: 76.80 +/- 23.71
New best mean reward!


[I 2025-05-13 12:52:43,962] Trial 58 pruned. 


Eval num_timesteps=10000, episode_reward=338.40 +/- 34.66
Episode length: 338.40 +/- 34.66
New best mean reward!


[I 2025-05-13 12:52:46,545] Trial 59 pruned. 


Eval num_timesteps=10000, episode_reward=29.60 +/- 3.77
Episode length: 29.60 +/- 3.77
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:52:54,311] Trial 60 finished with value: 114.3 and parameters: {'gamma': 0.019281366837999938, 'max_grad_norm': 0.41774421869401807, 'exponent_n_steps': 3, 'lr': 0.004051366426875929, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=114.30 +/- 4.47
Episode length: 114.30 +/- 4.47
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:53:02,900] Trial 61 finished with value: 117.5 and parameters: {'gamma': 0.0023210004246074067, 'max_grad_norm': 3.5668926630547766, 'exponent_n_steps': 3, 'lr': 0.0033065111460579694, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=117.50 +/- 8.99
Episode length: 117.50 +/- 8.99


[I 2025-05-13 12:53:05,793] Trial 62 pruned. 


Eval num_timesteps=10000, episode_reward=127.60 +/- 98.46
Episode length: 127.60 +/- 98.46
New best mean reward!


[I 2025-05-13 12:53:10,235] Trial 63 pruned. 


Eval num_timesteps=10000, episode_reward=92.60 +/- 19.53
Episode length: 92.60 +/- 19.53
New best mean reward!


[I 2025-05-13 12:53:12,847] Trial 64 pruned. 


Eval num_timesteps=10000, episode_reward=123.50 +/- 38.93
Episode length: 123.50 +/- 38.93
New best mean reward!


[I 2025-05-13 12:53:16,760] Trial 65 pruned. 


Eval num_timesteps=10000, episode_reward=221.50 +/- 38.42
Episode length: 221.50 +/- 38.42
New best mean reward!
Eval num_timesteps=10000, episode_reward=495.60 +/- 13.20
Episode length: 495.60 +/- 13.20
New best mean reward!


[I 2025-05-13 12:53:24,377] Trial 66 finished with value: 500.0 and parameters: {'gamma': 0.0007345209971748903, 'max_grad_norm': 0.6425951766035796, 'exponent_n_steps': 7, 'lr': 0.014182953573358767, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:53:30,784] Trial 67 finished with value: 198.0 and parameters: {'gamma': 0.00226493689213256, 'max_grad_norm': 0.5267133446110182, 'exponent_n_steps': 7, 'lr': 0.007300768998892059, 'net_arch': 'small', 'activation_fn': 'tanh'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=198.00 +/- 91.65
Episode length: 198.00 +/- 91.65


[I 2025-05-13 12:53:33,501] Trial 68 pruned. 


Eval num_timesteps=10000, episode_reward=9.50 +/- 0.50
Episode length: 9.50 +/- 0.50
New best mean reward!


[I 2025-05-13 12:53:37,049] Trial 69 pruned. 


Eval num_timesteps=10000, episode_reward=9.60 +/- 0.66
Episode length: 9.60 +/- 0.66
New best mean reward!


[I 2025-05-13 12:53:40,435] Trial 70 pruned. 


Eval num_timesteps=10000, episode_reward=130.90 +/- 48.67
Episode length: 130.90 +/- 48.67
New best mean reward!


[I 2025-05-13 12:53:43,512] Trial 71 pruned. 


Eval num_timesteps=10000, episode_reward=419.60 +/- 114.38
Episode length: 419.60 +/- 114.38
New best mean reward!


[I 2025-05-13 12:53:46,347] Trial 72 pruned. 


Eval num_timesteps=10000, episode_reward=99.60 +/- 70.97
Episode length: 99.60 +/- 70.97
New best mean reward!


[I 2025-05-13 12:53:49,410] Trial 73 pruned. 


Eval num_timesteps=10000, episode_reward=69.80 +/- 29.88
Episode length: 69.80 +/- 29.88
New best mean reward!


[I 2025-05-13 12:53:52,138] Trial 74 pruned. 


Eval num_timesteps=10000, episode_reward=123.40 +/- 26.18
Episode length: 123.40 +/- 26.18
New best mean reward!


[I 2025-05-13 12:53:55,879] Trial 75 pruned. 


Eval num_timesteps=10000, episode_reward=453.10 +/- 56.55
Episode length: 453.10 +/- 56.55
New best mean reward!


[I 2025-05-13 12:53:58,712] Trial 76 pruned. 


Eval num_timesteps=10000, episode_reward=9.80 +/- 0.87
Episode length: 9.80 +/- 0.87
New best mean reward!


[I 2025-05-13 12:54:02,101] Trial 77 pruned. 


Eval num_timesteps=10000, episode_reward=9.50 +/- 0.67
Episode length: 9.50 +/- 0.67
New best mean reward!


[I 2025-05-13 12:54:05,707] Trial 78 pruned. 


Eval num_timesteps=10000, episode_reward=418.30 +/- 28.96
Episode length: 418.30 +/- 28.96
New best mean reward!


[I 2025-05-13 12:54:08,574] Trial 79 pruned. 


Eval num_timesteps=10000, episode_reward=158.30 +/- 94.01
Episode length: 158.30 +/- 94.01
New best mean reward!


[I 2025-05-13 12:54:11,285] Trial 80 pruned. 


Eval num_timesteps=10000, episode_reward=178.80 +/- 171.39
Episode length: 178.80 +/- 171.39
New best mean reward!


[I 2025-05-13 12:54:14,787] Trial 81 pruned. 


Eval num_timesteps=10000, episode_reward=9.20 +/- 0.60
Episode length: 9.20 +/- 0.60
New best mean reward!
Eval num_timesteps=10000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00
New best mean reward!


[I 2025-05-13 12:54:21,508] Trial 82 finished with value: 500.0 and parameters: {'gamma': 0.0005932690904182689, 'max_grad_norm': 0.5706951929750083, 'exponent_n_steps': 5, 'lr': 0.0143123319817666, 'net_arch': 'tiny', 'activation_fn': 'relu'}. Best is trial 0 with value: 500.0.


Eval num_timesteps=20000, episode_reward=500.00 +/- 0.00
Episode length: 500.00 +/- 0.00


[W 2025-05-13 12:54:22,926] Trial 83 failed with parameters: {'gamma': 0.0009902212289901145, 'max_grad_norm': 0.8793536051091422, 'exponent_n_steps': 5, 'lr': 0.004964937803482774, 'net_arch': 'tiny', 'activation_fn': 'relu'} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "<ipython-input-53-008036110673>", line 42, in objective
    model.learn(N_TIMESTEPS, callback=eval_callback)
  File "/usr/local/lib/python3.11/dist-packages/stable_baselines3/a2c/a2c.py", line 201, in learn
    return super().learn(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/stable_baselines3/common/on_policy_algorithm.py", line 324, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
                

Number of finished trials:  84
Best trial:
  Value: 500.0
  Params: 
    gamma: 0.002022566695992499
    max_grad_norm: 0.9534889257609054
    exponent_n_steps: 5
    lr: 0.004627032684513558
    net_arch: tiny
    activation_fn: relu
  User attrs:
    gamma_: 0.9979774333040075
    n_steps: 32


In [None]:
# Best parameters found above, kept for reference:
# Params:
#     gamma: 0.002022566695992499
#     max_grad_norm: 0.9534889257609054
#     exponent_n_steps: 5
#     lr: 0.004627032684513558
#     net_arch: tiny
#     activation_fn: relu
#   User attrs:
#     gamma_: 0.9979774333040075
#     n_steps: 32

hyperparams = {}

# Part V: Now for Ant-v5

In this part we will create a script that searches for the best hyperparameters automatically, this time for SAC on Ant-v5.

### Imports

In [8]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [19]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
# N_WARMUP_STEPS = N_EVALUATIONS // 3 # Do not prune before 1/3 of the max budget is used
N_WARMUP_STEPS = N_STARTUP_TRIALS + 3  # 8 warm-up steps; only N_EVALUATIONS = 2 steps are reported per trial, so the pruner stays inactive
N_TIMESTEPS = int(1e3)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 15)  # 15 minutes

# for train_env
N_ENVS = 1
EVAL_FREQ = max(EVAL_FREQ // N_ENVS, 1)

ENV_ID = "Ant-v5"
ALGO = SAC

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
}
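
To see how the warm-up setting interacts with the number of evaluations, the sketch below lists the report steps at which pruning could happen. It assumes the callback reports at steps 1 to `N_EVALUATIONS` and that a step becomes eligible once it reaches the warm-up threshold, which simplifies the pruner's actual rule; `prunable_steps` is a hypothetical helper, not part of Optuna.

```python
N_STARTUP_TRIALS = 5
N_EVALUATIONS = 2


def prunable_steps(n_evaluations: int, n_warmup_steps: int) -> list[int]:
    """Return the report steps at which pruning could happen.

    Assumption: reports are numbered 1..n_evaluations and a step is
    eligible only once it reaches n_warmup_steps.
    """
    return [step for step in range(1, n_evaluations + 1) if step >= n_warmup_steps]


# Setting used in the config above: 5 + 3 = 8 warm-up steps.
print(prunable_steps(N_EVALUATIONS, N_STARTUP_TRIALS + 3))  # []
# Commented-out alternative: 2 // 3 = 0 warm-up steps.
print(prunable_steps(N_EVALUATIONS, N_EVALUATIONS // 3))  # [1, 2]
```

Under these assumptions, the current setting leaves no step at which a trial can be pruned, while the commented-out formula would allow pruning at both evaluations.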

### Exercise (5 minutes): Define the search space

In [21]:
from typing import Any, Dict, Optional
import torch
import torch.nn as nn

def sample_SAC_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for SAC hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    one_minus_gamma = trial.suggest_float("one_minus_gamma", 0.0001, 0.03, log=True)
    # From 2**2=4 to 2**11=2048
    batch_size_pow = trial.suggest_int("batch_size_pow", 2, 11)

    learning_rate = trial.suggest_float("learning_rate", 1e-5, 0.002, log=True)
    train_freq = trial.suggest_int("train_freq", 1, 10)
    gradient_steps = trial.suggest_int("gradient_steps", 1, 10)
    # Polyak coeff
    tau = trial.suggest_float("tau", 0.001, 0.08, log=True)

    # NOTE: Add "verybig" to net_arch when tuning HER
    net_arch = trial.suggest_categorical("net_arch", ["small", "medium", "big"])
    # activation_fn = trial.suggest_categorical('activation_fn', [nn.Tanh, nn.ReLU, nn.ELU, nn.LeakyReLU])

    trial.set_user_attr("gamma", 1 - one_minus_gamma)
    trial.set_user_attr("batch_size", 2**batch_size_pow)

    net_arch = {
        "small": [64, 64],
        "medium": [256, 256],
        "big": [400, 300],
        # "large": [256, 256, 256],
        # "verybig": [512, 512, 512],
    }[net_arch]

    hyperparams = {
        "gamma": 1 - one_minus_gamma,
        "batch_size": 2**batch_size_pow,
        "learning_rate": learning_rate,
        "train_freq": train_freq,
        "gradient_steps": gradient_steps,
        "tau": tau,
        "policy_kwargs": {
            "net_arch": net_arch,
            # "activation_fn": activation_fn,
        },
    }



    return hyperparams
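
The sampler above does not draw `gamma` and `batch_size` directly: it draws `one_minus_gamma` on a log scale and an exponent for the batch size, then converts them. The following standalone sketch mimics that reparameterization with the standard library only; `sample_log_uniform` is a hypothetical stand-in for a log-scaled float suggestion and is not Optuna's implementation.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample_log_uniform(low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


one_minus_gamma = sample_log_uniform(0.0001, 0.03)
batch_size_pow = rng.randint(2, 11)  # inclusive bounds, like an integer suggestion

# Convert the sampled values into the quantities the algorithm expects
gamma = 1 - one_minus_gamma
batch_size = 2**batch_size_pow

# Bounds implied by the search space: gamma in about [0.97, 0.9999], batch size in [4, 2048]
gamma_range = (1 - 0.03, 1 - 0.0001)
batch_size_range = (2**2, 2**11)
print(gamma, batch_size, gamma_range, batch_size_range)
```

Sampling `1 - gamma` on a log scale spreads trials evenly across orders of magnitude close to 1, where small changes in the discount factor matter most.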

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [22]:
import os

In [23]:
log_path_EC = r"log_path/EvalCallback"
os.makedirs(log_path_EC, exist_ok=True)

In [24]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level (0 for no output)
    :param log_path: Optional folder where the parent ``EvalCallback`` saves evaluation results
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
        log_path: Optional[str] = None,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
            log_path=log_path,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
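
The callback only reports intermediate rewards and asks whether the trial should stop; the decision itself belongs to the pruner. As a rough illustration of the idea behind median-based pruning, here is a simplified rule for a maximization study. `should_prune_median` is a hypothetical helper that ignores startup trials and warm-up steps, so it is a sketch of the principle and not Optuna's `MedianPruner`.

```python
from statistics import median


def should_prune_median(current_value: float, previous_values_at_step: list[float]) -> bool:
    """Simplified median rule for a maximization study.

    Prune when the current intermediate value is strictly worse than the
    median of the values earlier trials reported at the same step.
    """
    if not previous_values_at_step:
        return False  # nothing to compare against yet
    return current_value < median(previous_values_at_step)


# Hypothetical intermediate rewards reported by earlier trials at step 1
history = [120.0, 300.0, 500.0]
print(should_prune_median(90.0, history))   # True: below the median (300.0)
print(should_prune_median(450.0, history))  # False: above the median
```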

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which samples hyperparameters, creates the model, trains it, and returns the result to Optuna.

In [29]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters, trains a model with
    them and reports the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    ### YOUR CODE HERE
    # TODO:
    # 1. Sample hyperparameters and update the default keyword arguments: `kwargs.update(other_params)`
    # 2. Create the evaluation envs
    # 3. Create the `TrialEvalCallback`

    # 1. Sample hyperparameters and update the keyword arguments
    kwargs.update(sample_SAC_params(trial))
    # Create the RL model
    train_env = make_vec_env(ENV_ID, N_ENVS)
    model = ALGO(env=train_env, **kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_env = make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    # and report the performance using `N_EVAL_EPISODES` every `EVAL_FREQ`
    # TrialEvalCallback signature:
    # TrialEvalCallback(eval_env, trial, n_eval_episodes, eval_freq, deterministic, verbose, log_path)
    eval_callback = TrialEvalCallback(eval_env, trial,
                                      N_EVAL_EPISODES, EVAL_FREQ,
                                      verbose=0, log_path=None)

    ### END OF YOUR CODE

    nan_encountered = False
    try:
        # Train the model
        print("Start train model")
        model.learn(N_TIMESTEPS, callback=eval_callback)
        print("End train model\n")
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        print(e)
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [30]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS, multivariate=True)
# Skip pruning for the first N_STARTUP_TRIALS trials and, within each trial, before N_WARMUP_STEPS
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_WARMUP_STEPS
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv("study_results_sac_ant-v5.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()

[I 2025-05-14 04:39:28,677] A new study created in memory with name: no-name-ffb80845-2626-4210-aae1-3bc797eb3670


Start train model


[I 2025-05-14 04:40:23,260] Trial 0 finished with value: 914.4960729999999 and parameters: {'one_minus_gamma': 0.012477726234843946, 'batch_size_pow': 8, 'learning_rate': 0.001975504325557571, 'train_freq': 6, 'gradient_steps': 4, 'tau': 0.002330761938808824, 'net_arch': 'big'}. Best is trial 0 with value: 914.4960729999999.


End train model

Start train model


[I 2025-05-14 04:43:41,910] Trial 1 finished with value: -1441.8796455 and parameters: {'one_minus_gamma': 0.004391686162792903, 'batch_size_pow': 10, 'learning_rate': 0.00022409356822235054, 'train_freq': 4, 'gradient_steps': 5, 'tau': 0.05764443682687115, 'net_arch': 'big'}. Best is trial 0 with value: 914.4960729999999.


End train model

Start train model


[I 2025-05-14 04:44:17,327] Trial 2 finished with value: 992.1687992000001 and parameters: {'one_minus_gamma': 0.0002621068077748957, 'batch_size_pow': 3, 'learning_rate': 1.5332264405427608e-05, 'train_freq': 8, 'gradient_steps': 9, 'tau': 0.028773590411915368, 'net_arch': 'medium'}. Best is trial 2 with value: 992.1687992000001.


End train model

Start train model


[I 2025-05-14 04:44:42,771] Trial 3 finished with value: 993.4691006999999 and parameters: {'one_minus_gamma': 0.005305849458695914, 'batch_size_pow': 2, 'learning_rate': 3.3961630755715313e-05, 'train_freq': 8, 'gradient_steps': 3, 'tau': 0.014274594389315528, 'net_arch': 'medium'}. Best is trial 3 with value: 993.4691006999999.


End train model

Start train model


[I 2025-05-14 04:45:21,431] Trial 4 finished with value: 983.4404080999999 and parameters: {'one_minus_gamma': 0.012846876862291103, 'batch_size_pow': 7, 'learning_rate': 2.1500937841900233e-05, 'train_freq': 4, 'gradient_steps': 6, 'tau': 0.07268008929488795, 'net_arch': 'small'}. Best is trial 3 with value: 993.4691006999999.


End train model

Start train model


[I 2025-05-14 04:45:49,195] Trial 5 finished with value: 994.1780402 and parameters: {'one_minus_gamma': 0.009077975944648613, 'batch_size_pow': 2, 'learning_rate': 4.3068523141333924e-05, 'train_freq': 8, 'gradient_steps': 4, 'tau': 0.011833368213639131, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:46:26,868] Trial 6 finished with value: 981.4277420999999 and parameters: {'one_minus_gamma': 0.010217048038517217, 'batch_size_pow': 5, 'learning_rate': 0.00010079084185602418, 'train_freq': 8, 'gradient_steps': 9, 'tau': 0.004393647497506277, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:46:58,413] Trial 7 finished with value: 990.3552143999999 and parameters: {'one_minus_gamma': 0.02747161607625616, 'batch_size_pow': 4, 'learning_rate': 5.832485787659468e-05, 'train_freq': 3, 'gradient_steps': 2, 'tau': 0.0014819083944463862, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:47:36,173] Trial 8 finished with value: 733.6404319999999 and parameters: {'one_minus_gamma': 0.0008483705845665281, 'batch_size_pow': 5, 'learning_rate': 0.00011449642320361032, 'train_freq': 5, 'gradient_steps': 5, 'tau': 0.016564938776729085, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:48:05,091] Trial 9 finished with value: 994.105611 and parameters: {'one_minus_gamma': 0.000699022172147523, 'batch_size_pow': 4, 'learning_rate': 2.9046038364624926e-05, 'train_freq': 6, 'gradient_steps': 3, 'tau': 0.0014040973641470304, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:48:46,327] Trial 10 finished with value: 988.1370539 and parameters: {'one_minus_gamma': 0.003989655532292222, 'batch_size_pow': 8, 'learning_rate': 1.2372182595867406e-05, 'train_freq': 8, 'gradient_steps': 5, 'tau': 0.008222985043853043, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:49:14,470] Trial 11 finished with value: 986.7968661000001 and parameters: {'one_minus_gamma': 0.0006513269515077358, 'batch_size_pow': 2, 'learning_rate': 2.446724844437169e-05, 'train_freq': 7, 'gradient_steps': 3, 'tau': 0.0014021285149663754, 'net_arch': 'big'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:49:39,479] Trial 12 finished with value: 965.0582225 and parameters: {'one_minus_gamma': 0.012455923419230349, 'batch_size_pow': 2, 'learning_rate': 0.000701857459525878, 'train_freq': 10, 'gradient_steps': 3, 'tau': 0.02786394444331373, 'net_arch': 'medium'}. Best is trial 5 with value: 994.1780402.


End train model

Start train model


[I 2025-05-14 04:50:06,215] Trial 13 finished with value: 1000.4074965000002 and parameters: {'one_minus_gamma': 0.001821742372963789, 'batch_size_pow': 8, 'learning_rate': 9.24939595501435e-05, 'train_freq': 7, 'gradient_steps': 1, 'tau': 0.001242337252706291, 'net_arch': 'medium'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Start train model


[I 2025-05-14 04:50:30,294] Trial 14 finished with value: 993.2200258999999 and parameters: {'one_minus_gamma': 0.002552596267723277, 'batch_size_pow': 6, 'learning_rate': 0.0003484932480426924, 'train_freq': 5, 'gradient_steps': 1, 'tau': 0.0011215200389354954, 'net_arch': 'medium'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Start train model


[I 2025-05-14 04:50:56,836] Trial 15 finished with value: 976.4716634 and parameters: {'one_minus_gamma': 0.0006882152462257798, 'batch_size_pow': 9, 'learning_rate': 5.146253387427615e-05, 'train_freq': 6, 'gradient_steps': 2, 'tau': 0.002021202448276771, 'net_arch': 'small'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Start train model


[I 2025-05-14 04:51:29,470] Trial 16 finished with value: 991.2677854 and parameters: {'one_minus_gamma': 0.008803282135113269, 'batch_size_pow': 7, 'learning_rate': 9.512519933163889e-05, 'train_freq': 10, 'gradient_steps': 3, 'tau': 0.004843642632611225, 'net_arch': 'big'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Start train model


[I 2025-05-14 04:53:07,068] Trial 17 finished with value: 1000.2393804 and parameters: {'one_minus_gamma': 0.00036488587408493964, 'batch_size_pow': 11, 'learning_rate': 3.94033779322373e-05, 'train_freq': 9, 'gradient_steps': 4, 'tau': 0.001014112412006081, 'net_arch': 'medium'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Start train model


[I 2025-05-14 04:55:14,909] Trial 18 finished with value: 994.0387825 and parameters: {'one_minus_gamma': 0.00045394972509737086, 'batch_size_pow': 11, 'learning_rate': 1.7482017373797255e-05, 'train_freq': 8, 'gradient_steps': 5, 'tau': 0.0015692967931188676, 'net_arch': 'medium'}. Best is trial 13 with value: 1000.4074965000002.


End train model

Number of finished trials:  19
Best trial:
  Value: 1000.4074965000002
  Params: 
    one_minus_gamma: 0.001821742372963789
    batch_size_pow: 8
    learning_rate: 9.24939595501435e-05
    train_freq: 7
    gradient_steps: 1
    tau: 0.001242337252706291
    net_arch: medium
  User attrs:
    gamma: 0.9981782576270362
    batch_size: 256


In [37]:
display(study.best_params)
display(study.best_trial.params)
display(study.best_trial.user_attrs)

{'one_minus_gamma': 0.001821742372963789,
 'batch_size_pow': 8,
 'learning_rate': 9.24939595501435e-05,
 'train_freq': 7,
 'gradient_steps': 1,
 'tau': 0.001242337252706291,
 'net_arch': 'medium'}

{'one_minus_gamma': 0.001821742372963789,
 'batch_size_pow': 8,
 'learning_rate': 9.24939595501435e-05,
 'train_freq': 7,
 'gradient_steps': 1,
 'tau': 0.001242337252706291,
 'net_arch': 'medium'}

{'gamma': 0.9981782576270362, 'batch_size': 256}

In [39]:
best_trial = dict(**study.best_trial.params, **study.best_trial.user_attrs)
display(best_trial)
tuned_hyperparams = dict(gamma=best_trial["gamma"],
                         batch_size=best_trial["batch_size"],
                         learning_rate=best_trial["learning_rate"],
                         train_freq=best_trial["train_freq"],
                         gradient_steps=best_trial["gradient_steps"],
                         tau=best_trial["tau"],
                         policy_kwargs=dict(net_arch = [256, 256]),
                         )

{'one_minus_gamma': 0.001821742372963789,
 'batch_size_pow': 8,
 'learning_rate': 9.24939595501435e-05,
 'train_freq': 7,
 'gradient_steps': 1,
 'tau': 0.001242337252706291,
 'net_arch': 'medium',
 'gamma': 0.9981782576270362,
 'batch_size': 256}
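
The tuned keyword arguments above hard-code `net_arch=[256, 256]`, which matches the `'medium'` choice of this particular best trial. A small sketch of deriving every constructor keyword from the sampled values, so that a different best trial needs no manual edits, could look as follows. `to_sac_kwargs` and `NET_ARCH` are hypothetical names; the mapping and the parameter values are copied from the cells above.

```python
import math
from typing import Any, Dict

# Same mapping as in sample_SAC_params
NET_ARCH = {"small": [64, 64], "medium": [256, 256], "big": [400, 300]}

# Values copied from the best trial printed above
best_params = {
    "one_minus_gamma": 0.001821742372963789,
    "batch_size_pow": 8,
    "learning_rate": 9.24939595501435e-05,
    "train_freq": 7,
    "gradient_steps": 1,
    "tau": 0.001242337252706291,
    "net_arch": "medium",
}


def to_sac_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Convert sampled (reparameterized) values into SAC constructor keywords."""
    return {
        "gamma": 1 - params["one_minus_gamma"],
        "batch_size": 2 ** params["batch_size_pow"],
        "learning_rate": params["learning_rate"],
        "train_freq": params["train_freq"],
        "gradient_steps": params["gradient_steps"],
        "tau": params["tau"],
        "policy_kwargs": {"net_arch": NET_ARCH[params["net_arch"]]},
    }


tuned = to_sac_kwargs(best_params)
print(tuned["batch_size"], tuned["policy_kwargs"])  # 256 {'net_arch': [256, 256]}
```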

In [46]:
vec_env = make_vec_env(ENV_ID, N_EVAL_ENVS-1)
tuned_model = ALGO("MlpPolicy", env=vec_env, **tuned_hyperparams).learn(300_000, progress_bar=True)

eval_env = make_vec_env(ENV_ID, N_EVAL_ENVS)

mean_reward, std_reward = evaluate_policy(tuned_model, eval_env)
print(f"{ALGO.__name__} Mean episode reward: {mean_reward:.2f} +/- {std_reward:.2f}")

vec_env.close()
tuned_model.env.close()
eval_env.close()

Output()

SAC Mean episode reward: 479.08 +/- 29.34


In [47]:
print(f"study_results_{ALGO.__name__}_{ENV_ID}.csv")

study_results_SAC_Ant-v5.csv


In [48]:
import pickle

study_ant = None
with open('study_Ant-v5.pickle', 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    study_ant = pickle.load(f)

In [49]:
plot_param_importances(study_ant)

In [50]:
plot_optimization_history(study_ant)

# Part VI: Now for Ant-v5 (VecEnv)

In this part we will create a script that searches for the best hyperparameters automatically, this time training SAC on Ant-v5 with a vectorized training environment (`N_ENVS` environment copies).

### Imports

In [61]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances

### Config

In [62]:
N_TRIALS = 100  # Maximum number of trials
N_JOBS = 1 # Number of jobs to run in parallel
N_STARTUP_TRIALS = 5  # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS = 2  # Number of evaluations during the training
# N_WARMUP_STEPS = N_EVALUATIONS // 3 # Do not prune before 1/3 of the max budget is used
N_WARMUP_STEPS = N_STARTUP_TRIALS + 3  # 8 warm-up steps; only N_EVALUATIONS = 2 steps are reported per trial, so the pruner stays inactive
N_TIMESTEPS = int(0.1e6)  # Training budget
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS = 5
N_EVAL_EPISODES = 10
TIMEOUT = int(60 * 60 * 5)  # 5 hours

# for train_env
N_ENVS = N_EVAL_ENVS
EVAL_FREQ = max(EVAL_FREQ // N_ENVS, 1)

ENV_ID = "Ant-v5"
ALGO = SAC

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
    # "batch_size": 256,              # default
    # "tau": 0.02,                    # try later (default=0.005)
    # "buffer_size": 32768,           # try later (default=1000000)
    # "policy_kwargs": {
    #     "net_arch": [400, 300],     # (default=[256, 256])
    #     "log_std_init": -4          # try later (default=-3)
    # },
}
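
With a vectorized training environment, the callback counter advances once per vectorized step, and each such step accounts for `N_ENVS` timesteps. That is why `EVAL_FREQ` is divided by `N_ENVS` in the config. The arithmetic for the values used here can be checked with a short standalone sketch (variable names other than the config constants are illustrative):

```python
N_TIMESTEPS = int(0.1e6)
N_EVALUATIONS = 2
N_ENVS = 5

# Evaluation interval measured in environment timesteps
eval_every_timesteps = N_TIMESTEPS // N_EVALUATIONS  # 50_000
# The callback counter advances once per vectorized step, i.e. once per N_ENVS timesteps
eval_freq = max(eval_every_timesteps // N_ENVS, 1)  # 10_000

callback_calls = N_TIMESTEPS // N_ENVS  # 20_000 calls over the whole run
evaluation_calls = [n for n in range(1, callback_calls + 1) if n % eval_freq == 0]
print(eval_freq, evaluation_calls)  # 10000 [10000, 20000]
```

Without the division, the first evaluation would be due at call 50,000, which a run of 20,000 callback calls never reaches.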

### Exercise (5 minutes): Define the search space

In [63]:
from typing import Any, Dict, Optional
import torch
import torch.nn as nn

def sample_SAC_params(trial: optuna.Trial) -> Dict[str, Any]:
    """
    Sampler for SAC hyperparameters.

    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    one_minus_gamma = trial.suggest_float("one_minus_gamma", 0.0001, 0.03, log=True)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 0.002, log=True)
    train_freq = trial.suggest_int("train_freq", 1, 15)
    learning_starts = trial.suggest_int("learning_starts", 100, 50_000)
    gradient_steps = trial.suggest_int("gradient_steps", 1, 15)
    tau = trial.suggest_float("tau", 0.001, 0.08, log=True)

    trial.set_user_attr("gamma", 1 - one_minus_gamma)

    hyperparams = {
        "gamma": 1 - one_minus_gamma,
        "learning_rate": learning_rate,
        "train_freq": train_freq,
        "learning_starts": learning_starts,
        "gradient_steps": gradient_steps,
        "tau": tau,
    }

    return hyperparams
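
Three of the sampled values (`train_freq`, `gradient_steps`, `learning_starts`) strongly affect how much computation a trial needs, which is one reason trial durations can differ widely. The sketch below gives a rough estimate of the number of gradient updates. It assumes an integer `train_freq` counts vectorized steps, that `gradient_steps` updates follow each collection phase, and that no update happens before `learning_starts` timesteps; `estimated_gradient_updates` is a hypothetical helper, not an SB3 function.

```python
def estimated_gradient_updates(
    total_timesteps: int,
    n_envs: int,
    train_freq: int,
    gradient_steps: int,
    learning_starts: int,
) -> int:
    """Rough count of gradient updates in one training run.

    Assumptions: one training phase every `train_freq` vectorized steps
    (each worth `n_envs` timesteps), `gradient_steps` updates per phase,
    and no updates before `learning_starts` timesteps were collected.
    """
    learning_timesteps = max(total_timesteps - learning_starts, 0)
    training_phases = learning_timesteps // (n_envs * train_freq)
    return training_phases * gradient_steps


# Two hypothetical configurations inside the search space above
cheap = estimated_gradient_updates(100_000, 5, train_freq=8, gradient_steps=3, learning_starts=50_000)
costly = estimated_gradient_updates(100_000, 5, train_freq=2, gradient_steps=12, learning_starts=40_000)
print(cheap, costly)  # 3750 72000
```

Under these assumptions the second configuration performs roughly 19 times more updates than the first for the same timestep budget, so a fixed `TIMEOUT` can cover very different numbers of trials depending on what the sampler draws.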

### Define the objective function

First we define a custom callback to report the results of periodic evaluations to Optuna:

In [64]:
import os
import time

In [65]:
log_path_EC = r"log_path/EvalCallback"
os.makedirs(log_path_EC, exist_ok=True)

In [66]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq: Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level (0 for no output)
    :param log_path: Optional folder where the parent ``EvalCallback`` saves evaluation results
    """

    def __init__(
        self,
        eval_env: gym.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
        log_path: Optional[str] = None,
    ):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
            log_path=log_path,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True

### Exercise (10 minutes): Define the objective function

Then we define the objective function, which samples hyperparameters, creates the model, trains it, and returns the result to Optuna.

In [67]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters, trains a model with
    them and reports the result (mean episodic reward after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward after training
    """

    kwargs = DEFAULT_HYPERPARAMS.copy()
    kwargs.update(sample_SAC_params(trial))

    train_env = make_vec_env(ENV_ID, N_ENVS)
    model = ALGO(env=train_env, **kwargs)

    eval_env = make_vec_env(ENV_ID, N_EVAL_ENVS)

    eval_callback = TrialEvalCallback(eval_env, trial,
                                      N_EVAL_EPISODES, EVAL_FREQ,
                                      verbose=0)

    nan_encountered = False
    try:
        print("Start train model")
        start_time = time.time()
        model.learn(N_TIMESTEPS, callback=eval_callback)
        print(f"End train model. Train time: {time.time() - start_time}\n")
    except AssertionError as e:
        print(e)
        nan_encountered = True
    finally:
        # Free memory (`model.env` wraps `train_env`, so one close is enough)
        model.env.close()
        eval_env.close()

    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward

### The optimization loop

In [68]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)
# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS, multivariate=True)
# Skip pruning for the first N_STARTUP_TRIALS trials and, within each trial, before N_WARMUP_STEPS
pruner = MedianPruner(
    n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_WARMUP_STEPS
)
# Create the study and start the hyperparameter optimization
study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")

try:
    study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except KeyboardInterrupt:
    pass

print("Number of finished trials: ", len(study.trials))

print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")

print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

print("  User attrs:")
for key, value in trial.user_attrs.items():
    print(f"    {key}: {value}")

# Write report
study.trials_dataframe().to_csv(f"study_results_{ALGO.__name__}_{ENV_ID}.csv")

fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)

fig1.show()
fig2.show()


Argument ``multivariate`` is an experimental feature. The interface can change in the future.

[I 2025-05-14 06:43:25,489] A new study created in memory with name: no-name-9e4c5605-62ad-4b43-ba4b-061f01b3ed5d


Start train model


[I 2025-05-14 07:03:00,114] Trial 0 finished with value: 777.8675183 and parameters: {'one_minus_gamma': 0.00016750007051566018, 'learning_rate': 0.0012840522387973963, 'train_freq': 1, 'learning_starts': 32603, 'gradient_steps': 2, 'tau': 0.004633077233065415}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 1174.2450001239777

Start train model


[I 2025-05-14 07:09:14,081] Trial 1 finished with value: -2574.1259835000005 and parameters: {'one_minus_gamma': 0.0040461805868500155, 'learning_rate': 9.505465313707192e-05, 'train_freq': 8, 'learning_starts': 45988, 'gradient_steps': 5, 'tau': 0.07793480371287047}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 373.8698272705078

Start train model


[I 2025-05-14 07:15:36,752] Trial 2 finished with value: 109.66953900000001 and parameters: {'one_minus_gamma': 0.0001589430879215898, 'learning_rate': 0.00089696170646981, 'train_freq': 3, 'learning_starts': 1822, 'gradient_steps': 1, 'tau': 0.05981792922745191}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 382.5422022342682

Start train model


[I 2025-05-14 07:19:47,941] Trial 3 finished with value: 197.9263046 and parameters: {'one_minus_gamma': 0.012157182443670998, 'learning_rate': 1.8568570124451012e-05, 'train_freq': 8, 'learning_starts': 49853, 'gradient_steps': 3, 'tau': 0.020739864177558586}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 251.11895418167114

Start train model


[I 2025-05-14 08:07:35,327] Trial 4 finished with value: 670.4293459 and parameters: {'one_minus_gamma': 0.0022329798914613144, 'learning_rate': 0.00020133933216518506, 'train_freq': 2, 'learning_starts': 40420, 'gradient_steps': 12, 'tau': 0.0049506203876494865}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 2867.2956862449646

Start train model


[I 2025-05-14 08:21:02,491] Trial 5 finished with value: 91.3013344 and parameters: {'one_minus_gamma': 0.00030420208905859925, 'learning_rate': 0.00038415468457731155, 'train_freq': 4, 'learning_starts': 30346, 'gradient_steps': 5, 'tau': 0.012847469779156087}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 807.0488126277924

Start train model


[I 2025-05-14 09:00:31,862] Trial 6 finished with value: 340.21945560000006 and parameters: {'one_minus_gamma': 0.00032527622004957795, 'learning_rate': 0.0019188421657142505, 'train_freq': 2, 'learning_starts': 37907, 'gradient_steps': 9, 'tau': 0.0010842423027341902}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 2369.2687559127808

Start train model


[I 2025-05-14 09:11:43,191] Trial 7 finished with value: 492.8056063 and parameters: {'one_minus_gamma': 0.00025381282110940434, 'learning_rate': 0.001384028379181535, 'train_freq': 1, 'learning_starts': 30718, 'gradient_steps': 1, 'tau': 0.003075293569211277}. Best is trial 0 with value: 777.8675183.


End train model. Train time: 671.2271797657013

Start train model


[W 2025-05-14 09:14:09,919] Trial 8 failed with parameters: {'one_minus_gamma': 0.007558668778922603, 'learning_rate': 0.0014212258837454621, 'train_freq': 5, 'learning_starts': 25333, 'gradient_steps': 2, 'tau': 0.006187097413002806} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "<ipython-input-67-bdb47f0c5964>", line 29, in objective
    model.learn(N_TIMESTEPS, callback=eval_callback)
  File "/usr/local/lib/python3.11/dist-packages/stable_baselines3/sac/sac.py", line 308, in learn
    return super().learn(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/stable_baselines3/common/off_policy_algorithm.py", line 328, in learn
    rollout = self.collect_rollouts(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/st

Number of finished trials:  9
Best trial:
  Value: 777.8675183
  Params: 
    one_minus_gamma: 0.00016750007051566018
    learning_rate: 0.0012840522387973963
    train_freq: 1
    learning_starts: 32603
    gradient_steps: 2
    tau: 0.004633077233065415
  User attrs:
    gamma: 0.9998324999294843
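The `gamma` user attribute printed above is derived from the sampled `one_minus_gamma`. The sketch below reproduces that conversion for the best trial, assuming the objective computes `gamma = 1 - one_minus_gamma` (which matches the printed values). The helper names are hypothetical, and the `1 / (1 - gamma)` effective horizon is a common rule of thumb added for illustration, not something the notebook computes.

```python
def gamma_from_one_minus_gamma(one_minus_gamma: float) -> float:
    """Recover the discount factor from the sampled quantity 1 - gamma."""
    return 1.0 - one_minus_gamma


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


# Value of one_minus_gamma reported for the best trial (trial 0).
best_one_minus_gamma = 0.00016750007051566018

gamma = gamma_from_one_minus_gamma(best_one_minus_gamma)
horizon = effective_horizon(gamma)

print(f"gamma: {gamma}")
print(f"approximate effective horizon: {horizon:.0f} steps")
```

Searching over `1 - gamma` rather than `gamma` lets the sampler distinguish values such as 0.999 and 0.9999, which are close on a linear scale but imply very different horizons.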


In [69]:
import pickle

with open('study_Ant-v5(100k, colab).pickle', 'wb') as f:
    # Pickle the whole study object using the highest protocol available.
    pickle.dump(study, f, pickle.HIGHEST_PROTOCOL)
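A saved study is only useful if it can be read back in a later session. The sketch below shows the matching load step with a hypothetical helper, `load_pickled_object`. To stay runnable without Optuna, it round-trips a plain dictionary as a stand-in for the study. Loading the real file requires `optuna` to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def load_pickled_object(path: Path) -> Any:
    """Load whatever object was pickled at `path`."""
    with path.open("rb") as f:
        return pickle.load(f)


# Stand-in for the study object (assumption: a plain dict, for illustration only).
stand_in = {"best_value": 777.8675183, "n_trials": 9}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_study.pickle"
    # Same dump call as in the cell above, applied to the stand-in.
    with demo_path.open("wb") as f:
        pickle.dump(stand_in, f, pickle.HIGHEST_PROTOCOL)
    restored = load_pickled_object(demo_path)

print(restored == stand_in)
```

With Optuna installed, the same helper could be called as `load_pickled_object(Path('study_Ant-v5(100k, colab).pickle'))` to get the study back and inspect `best_trial` again.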



---



# Optuna usage example from araffin's [post](https://araffin.github.io/post/optuna/)

In [None]:
"""Optuna example that optimizes the hyperparameters of
a reinforcement learning agent using PPO implementation from Stable-Baselines3
on a Gymnasium environment.

This is a simplified version of what can be found in https://github.com/DLR-RM/rl-baselines3-zoo.

You can run this example as follows:
    $ python optimize_ppo.py

"""

from typing import Any

import gymnasium
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env
import torch
import torch.nn as nn


N_TRIALS = 500
N_STARTUP_TRIALS = 10
N_EVALUATIONS = 2
N_TIMESTEPS = 40_000
EVAL_FREQ = int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_EPISODES = 10

ENV_ID = "Pendulum-v1"
N_ENVS = 5

DEFAULT_HYPERPARAMS = {
    "policy": "MlpPolicy",
}


def sample_ppo_params(trial: optuna.Trial) -> dict[str, Any]:
    """Sampler for PPO hyperparameters."""
    # From 2**5=32 to 2**12=4096
    n_steps_pow = trial.suggest_int("n_steps_pow", 5, 12)
    gamma = trial.suggest_float("gamma", 0.97, 0.9999)
    learning_rate = trial.suggest_float("learning_rate", 3e-5, 3e-3, log=True)
    activation_fn = trial.suggest_categorical("activation_fn", ["tanh", "relu"])

    n_steps = 2**n_steps_pow
    # Display true values
    trial.set_user_attr("n_steps", n_steps)
    # Convert to PyTorch objects
    activation_fn = {"tanh": nn.Tanh, "relu": nn.ReLU}[activation_fn]

    return {
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "policy_kwargs": {
            "activation_fn": activation_fn,
        },
    }


class TrialEvalCallback(EvalCallback):
    """Callback used for evaluating and reporting a trial."""

    def __init__(
        self,
        eval_env: gymnasium.Env,
        trial: optuna.Trial,
        n_eval_episodes: int = 5,
        eval_freq: int = 10000,
        deterministic: bool = True,
        verbose: int = 0,
    ):
        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            super()._on_step()
            self.eval_idx += 1
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune the trial if needed.
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True


def objective(trial: optuna.Trial) -> float:
    vec_env = make_vec_env(ENV_ID, n_envs=N_ENVS)
    kwargs = DEFAULT_HYPERPARAMS.copy()
    # Sample hyperparameters.
    kwargs.update(sample_ppo_params(trial))
    # Create the RL model.
    model = PPO(env=vec_env, **kwargs)
    # Create env used for evaluation.
    eval_env = make_vec_env(ENV_ID, n_envs=N_ENVS)
    # Create the callback that will periodically evaluate and report the performance.
    eval_callback = TrialEvalCallback(
        eval_env,
        trial,
        n_eval_episodes=N_EVAL_EPISODES,
        eval_freq=max(EVAL_FREQ // N_ENVS, 1),
        deterministic=True,
    )

    nan_encountered = False
    try:
        model.learn(N_TIMESTEPS, callback=eval_callback)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN.
        print(e)
        nan_encountered = True
    finally:
        # Free memory.
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed.
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    return eval_callback.last_mean_reward


if __name__ == "__main__":
    # Set the number of PyTorch threads to 1 for faster training.
    torch.set_num_threads(1)

    sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS, multivariate=True)
    # Do not prune before 1/3 of the max budget is used.
    pruner = MedianPruner(
        n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_EVALUATIONS // 3
    )

    study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")
    try:
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
    except KeyboardInterrupt:
        pass

    # len(study.trials) counts every trial, including failed or pruned ones.
    print(f"Number of finished trials: {len(study.trials)}")

    print("Best trial:")
    trial = study.best_trial

    print("  Value: ", trial.value)

    print("  Params: ")
    for key, value in trial.params.items():
        print(f"    {key}: {value}")

    print("  User attrs:")
    for key, value in trial.user_attrs.items():
        print(f"    {key}: {value}")

[I 2025-05-07 06:23:50,486] A new study created in memory with name: no-name-ec51315d-b57f-4e1d-8172-06178f875f08
[I 2025-05-07 06:24:22,000] Trial 0 finished with value: -1104.8031417 and parameters: {'n_steps_pow': 11, 'gamma': 0.9717752594505243, 'learning_rate': 0.0009183829962213949, 'activation_fn': 'tanh'}. Best is trial 0 with value: -1104.8031417.
[I 2025-05-07 06:24:52,831] Trial 1 finished with value: -1403.2242927 and parameters: {'n_steps_pow': 8, 'gamma': 0.9956854702881278, 'learning_rate': 0.00017373306505887276, 'activation_fn': 'relu'}. Best is trial 0 with value: -1104.8031417.
[I 2025-05-07 06:25:24,747] Trial 2 finished with value: -1461.2066082000003 and parameters: {'n_steps_pow': 6, 'gamma': 0.9735116678394656, 'learning_rate': 3.3852430611972214e-05, 'activation_fn': 'relu'}. Best is trial 0 with value: -1104.8031417.
[I 2025-05-07 06:25:55,779] Trial 3 finished with value: -1257.453818 and parameters: {'n_steps_pow': 11, 'gamma': 0.9799202074205143, 'learning_

Number of finished trials: 19
Best trial:
  Value:  -163.77188689999997
  Params: 
    n_steps_pow: 8
    gamma: 0.9749712609115107
    learning_rate: 0.0018765871621028618
    activation_fn: relu
  User attrs:
    n_steps: 256
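The best trial stores the raw search-space values (`n_steps_pow`, the activation name), not the arguments PPO expects. The sketch below mirrors the conversion done in `sample_ppo_params` to turn the printed parameters back into keyword arguments. The helper `ppo_kwargs_from_params` is hypothetical, and the activation map uses placeholder strings so the block runs without PyTorch. In practice the map would be `{"tanh": nn.Tanh, "relu": nn.ReLU}`.

```python
from typing import Any


def ppo_kwargs_from_params(
    params: dict[str, Any], activation_map: dict[str, Any]
) -> dict[str, Any]:
    """Convert raw trial parameters into PPO keyword arguments."""
    return {
        # The search samples the exponent, so n_steps = 2**n_steps_pow.
        "n_steps": 2 ** params["n_steps_pow"],
        "gamma": params["gamma"],
        "learning_rate": params["learning_rate"],
        "policy_kwargs": {
            "activation_fn": activation_map[params["activation_fn"]],
        },
    }


# Parameters of the best trial, copied from the output above.
best_params = {
    "n_steps_pow": 8,
    "gamma": 0.9749712609115107,
    "learning_rate": 0.0018765871621028618,
    "activation_fn": "relu",
}

# Placeholder strings stand in for nn.Tanh / nn.ReLU (assumption for this sketch).
activation_map = {"tanh": "nn.Tanh", "relu": "nn.ReLU"}

kwargs = ppo_kwargs_from_params(best_params, activation_map)
print(kwargs)
```

With the real activation classes in the map, the result could be passed on as `PPO("MlpPolicy", env, **kwargs)` to retrain with the tuned values, ideally with a larger training budget than the 40,000 timesteps used during the search.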
