# Lab 15: Benchmarking with Stable-Baselines3 and TensorBoard

## Introduction

In this lab, you will transition from **hand-implemented reinforcement learning algorithms** to a **standardized, research-grade deep reinforcement learning workflow** using **Stable-Baselines3 (SB3)** and **TensorBoard**.

In earlier labs, you implemented algorithms such as **TRPO** and **PPO** from scratch in order to understand their mathematical foundations, optimization objectives, and implementation details. While such low-level implementations are essential for building intuition, **modern reinforcement learning research and applications almost always rely on well-tested libraries and systematic experiment logging tools**.

This lab is designed to bridge that gap by introducing a practical and reproducible RL experimentation pipeline.

## Using Stable-Baselines3

Stable-Baselines3 (SB3) is a widely used Python library that provides **reliable and standardized implementations of modern reinforcement learning algorithms**. Instead of manually implementing training loops, rollout buffers, and optimization steps, SB3 allows you to focus on **algorithm selection, hyperparameter tuning, and experimental analysis**.

Using SB3 typically follows a simple workflow:

1. **Create an environment** (often a vectorized environment for efficient data collection)
2. **Instantiate a model** by selecting an algorithm and policy type
3. **Train the model** using a fixed number of timesteps
4. **Save, load, and evaluate** trained policies
5. **Monitor training progress** using TensorBoard

In [4]:
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import CheckpointCallback
import os
import torch

ENV_ID = "Hopper-v4"
SEED = 42
N_ENVS = 8
TOTAL_TIMESTEPS = 2_000_000
SAVE_FREQ = 100_000


CHECKPOINT_DIR = "./checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

vec_env = make_vec_env(
    ENV_ID,
    n_envs=N_ENVS,
    seed=SEED,
)

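checkpoint_callback = CheckpointCallback(
    # save_freq counts callback calls, and each call advances all N_ENVS
    # environments by one step, so divide by N_ENVS to checkpoint roughly
    # every SAVE_FREQ total timesteps.
    save_freq=max(SAVE_FREQ // N_ENVS, 1),
    save_path=CHECKPOINT_DIR,
    name_prefix="ppo_hopper",
)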


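policy_kwargs = dict(
    activation_fn=torch.nn.Tanh,  # Tanh activations in the hidden layers
    # Separate two-layer MLPs (256 units each) for the policy (pi)
    # and the value function (vf)
    net_arch=dict(pi=[256, 256], vf=[256, 256]),
)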

model = PPO(
    "MlpPolicy",
    vec_env,
    policy_kwargs=policy_kwargs,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=256,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
    tensorboard_log="./tb_logs",
)


model.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=checkpoint_callback,
    tb_log_name="PPO_Hopper_base"
)

# Save the final model once more after training
model.save("ppo_hopper_final")

vec_env.close()

Using cuda device
Logging to ./tb_logs\PPO_Hopper_base_2
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21       |
|    ep_rew_mean     | 16.9     |
| time/              |          |
|    fps             | 1212     |
|    iterations      | 1        |
|    time_elapsed    | 13       |
|    total_timesteps | 16384    |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 27.1        |
|    ep_rew_mean          | 27.2        |
| time/                   |             |
|    fps                  | 867         |
|    iterations           | 2           |
|    time_elapsed         | 37          |
|    total_timesteps      | 32768       |
| train/                  |             |
|    approx_kl            | 0.014268925 |
|    clip_fraction        | 0.21        |
|    clip_range           | 0.2         |
|    entropy_loss         | -4.21       |
|    explained_

KeyboardInterrupt: 
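
The `total_timesteps` values in the log above (16384, then 32768) follow from the rollout size: each PPO iteration collects `n_steps` transitions from each of the `n_envs` environments before updating. The sketch below works through that bookkeeping with the values from the training cell. The helper name `timesteps_between_checkpoints` is hypothetical and only illustrates how `CheckpointCallback` interprets `save_freq` when several environments run in parallel.

```python
# Values copied from the training cell.
N_ENVS = 8
N_STEPS = 2048
BATCH_SIZE = 256
N_EPOCHS = 10
TOTAL_TIMESTEPS = 2_000_000

# One iteration collects N_STEPS transitions from every environment.
rollout_size = N_ENVS * N_STEPS

# Each epoch splits the rollout into minibatches of BATCH_SIZE samples.
minibatches_per_epoch = rollout_size // BATCH_SIZE
gradient_steps_per_iteration = minibatches_per_epoch * N_EPOCHS

# Training stops after the first rollout that reaches the budget,
# so the iteration count is a ceiling division.
iterations = -(-TOTAL_TIMESTEPS // rollout_size)


def timesteps_between_checkpoints(save_freq: int, n_envs: int) -> int:
    """Hypothetical helper: the callback counts calls, and each call
    advances all n_envs environments by one step."""
    return save_freq * n_envs


print(rollout_size)                                # 16384
print(gradient_steps_per_iteration)                # 640
print(iterations)                                  # 123
print(timesteps_between_checkpoints(100_000, 8))   # 800000
print(timesteps_between_checkpoints(12_500, 8))    # 100000
```

In other words, passing `save_freq=100_000` with 8 environments produces a checkpoint every 800,000 timesteps, while passing `100_000 // 8` produces one every 100,000 timesteps.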

To monitor training progress, launch TensorBoard from the project directory by running `tensorboard --logdir tb_logs`. This command starts a local web server (typically at `http://localhost:6006`) that visualizes the logs generated during training. Each run appears as a separate entry and can be enabled or disabled for comparison. TensorBoard lets you inspect learning curves such as episode reward, loss terms, and training speed.
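
The curves in TensorBoard come from training rollouts. A saved policy can also be checked directly by loading it and running a few evaluation episodes, which is step 4 of the workflow listed earlier. The following is a minimal sketch, not a definitive implementation: the helper names `checkpoint_steps`, `list_checkpoints`, and `evaluate_checkpoint` are hypothetical, and the filename pattern `<name_prefix>_<timesteps>_steps.zip` is assumed to match what `CheckpointCallback` wrote into `./checkpoints`.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed checkpoint naming: "<name_prefix>_<timesteps>_steps.zip".
_STEP_PATTERN = re.compile(r"_(\d+)_steps\.zip$")


def checkpoint_steps(filename: str) -> Optional[int]:
    """Return the timestep count encoded in a checkpoint filename, if any."""
    match = _STEP_PATTERN.search(filename)
    return int(match.group(1)) if match else None


def list_checkpoints(checkpoint_dir: str) -> List[Tuple[int, Path]]:
    """Return (timesteps, path) pairs sorted by timesteps."""
    found = []
    for path in Path(checkpoint_dir).glob("*.zip"):
        steps = checkpoint_steps(path.name)
        if steps is not None:
            found.append((steps, path))
    return sorted(found)


def evaluate_checkpoint(path, env_id="Hopper-v4", n_eval_episodes=10, seed=0):
    """Load a PPO checkpoint and return (mean_reward, std_reward)."""
    # Imported here so the filename helpers above work without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.evaluation import evaluate_policy

    eval_env = make_vec_env(env_id, n_envs=1, seed=seed)
    try:
        model = PPO.load(path, env=eval_env)
        return evaluate_policy(
            model,
            eval_env,
            n_eval_episodes=n_eval_episodes,
            deterministic=True,  # use the mean action instead of sampling
        )
    finally:
        eval_env.close()
```

A typical use would be to loop over `list_checkpoints("./checkpoints")` and call `evaluate_checkpoint(path)` on each entry to see how evaluation return changes over training. The evaluation seed and episode count here are placeholders.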

## Algorithm Comparison Task

In this part of the lab, you will use **Stable-Baselines3** to solve the same continuous control task using four different reinforcement learning algorithms: **PPO, TRPO, DDPG, and SAC**. Note that TRPO is not part of the core SB3 package; it is provided by the companion package SB3-Contrib (`sb3_contrib`). All algorithms should be trained under a **consistent experimental setup**, including the same environment, comparable network architectures, and a fixed training budget.
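
One way to keep the setup consistent is to describe every run with the same small record and derive the TensorBoard run name from it. The sketch below is only a starting point under stated assumptions: `RunSpec`, `build_run_specs`, and `train` are hypothetical names, the environment, seed, and budget are copied from the earlier PPO cell, and all algorithm hyperparameters and network architectures are left at library defaults for you to choose.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical record describing one training run."""
    algo: str
    env_id: str
    seed: int
    total_timesteps: int

    @property
    def tb_log_name(self) -> str:
        # One TensorBoard entry per algorithm, environment, and seed.
        return f"{self.algo}_{self.env_id}_seed{self.seed}"


def build_run_specs(algos: Sequence[str], env_id: str, seed: int,
                    total_timesteps: int) -> List[RunSpec]:
    """Give every algorithm the same environment, seed, and budget."""
    return [RunSpec(algo, env_id, seed, total_timesteps) for algo in algos]


def train(spec: RunSpec, tensorboard_log: str = "./tb_logs"):
    """Train one run with default hyperparameters and return the model."""
    # Imported here so the spec helpers work without the RL libraries.
    from stable_baselines3 import DDPG, PPO, SAC
    from sb3_contrib import TRPO  # TRPO is distributed in SB3-Contrib

    algo_classes = {"PPO": PPO, "TRPO": TRPO, "DDPG": DDPG, "SAC": SAC}
    model = algo_classes[spec.algo](
        "MlpPolicy",
        spec.env_id,
        seed=spec.seed,
        tensorboard_log=tensorboard_log,
        verbose=0,
    )
    model.learn(total_timesteps=spec.total_timesteps,
                tb_log_name=spec.tb_log_name)
    return model


specs = build_run_specs(["PPO", "TRPO", "DDPG", "SAC"],
                        env_id="Hopper-v4", seed=42,
                        total_timesteps=2_000_000)
print([spec.tb_log_name for spec in specs])
```

Calling `train(spec)` for each entry of `specs` would write all four runs into the same `tb_logs` directory, so they can be overlaid in TensorBoard.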

You are required to log all training runs using **TensorBoard** and compare their performance by **plotting the episode reward curves**. Through this comparison, you should analyze differences in learning speed, training stability, and final performance. The goal is not only to determine which algorithm performs best, but also to understand *why* different algorithmic designs lead to different learning behaviors.
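
If you prefer to draw the comparison plot yourself rather than in the TensorBoard UI, the logged scalars can be read back from the event files. The sketch below assumes the `tensorboard` package is installed and that the reward curve is stored under the tag `rollout/ep_rew_mean`, as in the training log shown earlier. The helper names `load_scalar` and `smooth` are hypothetical, and `smooth` is a plain exponential moving average, not TensorBoard's exact smoothing.

```python
from typing import List, Sequence, Tuple


def load_scalar(run_dir: str,
                tag: str = "rollout/ep_rew_mean") -> Tuple[List[int], List[float]]:
    """Read one scalar series (steps, values) from a TensorBoard run directory."""
    # Imported here so `smooth` can be used without TensorBoard installed.
    from tensorboard.backend.event_processing.event_accumulator import (
        EventAccumulator,
    )

    accumulator = EventAccumulator(run_dir)
    accumulator.Reload()  # parse the event files on disk
    events = accumulator.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]


def smooth(values: Sequence[float], weight: float = 0.9) -> List[float]:
    """Exponential moving average; a higher weight gives a smoother curve."""
    smoothed: List[float] = []
    previous = None
    for value in values:
        if previous is None:
            previous = value  # the first point is kept unchanged
        else:
            previous = weight * previous + (1.0 - weight) * value
        smoothed.append(previous)
    return smoothed


print(smooth([0.0, 10.0, 10.0], weight=0.5))  # [0.0, 5.0, 7.5]
```

For example, `load_scalar("./tb_logs/PPO_Hopper_base_2")` would read the run directory named in the log above. Plotting the smoothed values of each algorithm against their steps on shared axes gives the reward-curve comparison.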


In [None]:
# Your turn: implement the algorithm comparison here