# Project Situated AI Assignment 2: Robotics and Reinforcement Learning

This notebook provides a quickstart for training RL agents on Gymnasium Robotics environments using RL Baselines3 Zoo.

## Part 1: Setup

Uncomment and run these cells once to install dependencies.

### Colab Setup

In [2]:
# # Install system dependencies (Colab only - skip if running locally)
# !apt-get update && apt-get install -q -y swig cmake ffmpeg freeglut3-dev xvfb

In [4]:
# # Setup virtual display for video recording (Colab only)
# import os
# os.system("Xvfb :1 -screen 0 1024x768x24 &")
# os.environ['DISPLAY'] = ':1'

In [5]:
# # Mount your drive to the session
# from google.colab import drive
# drive.mount('/content/drive')

### General Setup

In [6]:
# # Install Python packages
# !pip install -q rl-zoo3
# !pip install -q -e git+https://github.com/Farama-Foundation/Gymnasium-Robotics.git#egg=gymnasium-robotics

In [9]:
# # Create wrapper for record_video (rl_zoo3.record_video doesn't support --gym-packages)
# with open('record_video.py', 'w') as f:
#     f.write(
#         '#!/usr/bin/env python\n'
#         'import gymnasium_robotics\n'
#         'import runpy\n'
#         'runpy.run_module("rl_zoo3.record_video", run_name="__main__")\n'
#     )

## Part 2: Configure Hyperparameters

RL Zoo reads hyperparameters from a YAML file keyed by environment ID. Edit the values below to experiment with different settings.

In [31]:
import optuna

import gymnasium as gym
import gymnasium_robotics

from optuna import trial, study


In [None]:
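def n_timesteps(trial: trial.Trial) -> int:
    # TODO: choose the training budget for this trial (left unimplemented).
    raise NotImplementedError("n_timesteps is not implemented yet")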

In [33]:
def objective(trial: trial.Trial):
    x = trial.suggest_float("x", 0, 10)
    return x

study = optuna.create_study()
study.optimize(objective, n_trials=3)

[I 2026-01-20 15:39:00,518] A new study created in memory with name: no-name-7e1cf8d2-ed72-45d7-859d-2927720c438c
[I 2026-01-20 15:39:00,520] Trial 0 finished with value: 1.7150606414445912 and parameters: {'x': 1.7150606414445912}. Best is trial 0 with value: 1.7150606414445912.
[I 2026-01-20 15:39:00,520] Trial 1 finished with value: 0.43349053614892497 and parameters: {'x': 0.43349053614892497}. Best is trial 1 with value: 0.43349053614892497.
[I 2026-01-20 15:39:00,522] Trial 2 finished with value: 0.5046264956700108 and parameters: {'x': 0.5046264956700108}. Best is trial 1 with value: 0.43349053614892497.


In [36]:
import yaml

hyperparams = {
    'FetchReachDense-v4': {
        'n_timesteps': 1000,
        'policy': 'MultiInputPolicy',
        'noise_type': 'ornstein-uhlenbeck',
        'noise_std': 0.5,
        'gradient_steps': 1,
        'train_freq': 1,
        'learning_rate': 1e-3,
        'batch_size': 256,
        'policy_kwargs': "dict(net_arch=[32, 32])",
    }
}

with open('hyperparams.yaml', 'w') as f:
    yaml.dump(hyperparams, f, sort_keys=False)

print("Hyperparameters saved to hyperparams.yaml")

Hyperparameters saved to hyperparams.yaml
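
The `noise_type: ornstein-uhlenbeck` and `noise_std` entries above request temporally correlated exploration noise. The sketch below shows, using only the standard library, how such a process might evolve under a simple Euler-Maruyama discretization. `ou_noise_path` is a hypothetical helper, and the `theta` and `dt` values are illustrative assumptions rather than settings read from the config.

```python
import math
import random


def ou_noise_path(n_steps, sigma, mean=0.0, theta=0.15, dt=1e-2, seed=0):
    """Simulate a 1-D Ornstein-Uhlenbeck process with Euler-Maruyama steps.

    theta and dt are assumed values chosen for illustration only.
    """
    rng = random.Random(seed)
    x = 0.0  # start the process at zero
    path = []
    for _ in range(n_steps):
        # Drift pulls x back toward the mean; diffusion adds scaled Gaussian noise.
        drift = theta * (mean - x) * dt
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift + diffusion
        path.append(x)
    return path


# sigma plays the role of noise_std from the config above.
path = ou_noise_path(n_steps=5, sigma=0.5)
print([round(value, 4) for value in path])
```

Because each sample depends on the previous one, consecutive actions are perturbed in a similar direction, which is the usual motivation for this noise type over independent Gaussian noise.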


In [37]:
import optuna
import gymnasium as gym
import gymnasium_robotics
from stable_baselines3 import DDPG, A2C, PPO
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise
from stable_baselines3.common.evaluation import evaluate_policy
import numpy as np

gym.register_envs(gymnasium_robotics)

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [64, 128, 256]),
        "tau": trial.suggest_float("tau", 0.001, 0.05, log=True),
        "gamma": trial.suggest_float("gamma", 0.95, 0.995),
        "noise_std": trial.suggest_float("noise_std", 0.05, 0.5),
        "net_arch": trial.suggest_categorical("net_arch", ["small", "medium", "large"]),
    }
    
    net_arch_map = {"small": [64, 64], "medium": [256, 256], "large": [400, 300]}
    
    env = gym.make("FetchReachDense-v4")
    noise = OrnsteinUhlenbeckActionNoise(
        mean=np.zeros(env.action_space.shape[0]),
        sigma=params["noise_std"] * np.ones(env.action_space.shape[0])
    )
    
    model = DDPG(
        policy="MultiInputPolicy",
        env=env,
        learning_rate=params["learning_rate"],
        batch_size=params["batch_size"],
        tau=params["tau"],
        gamma=params["gamma"],
        buffer_size=100000,
        learning_starts=1000,
        action_noise=noise,
        policy_kwargs=dict(net_arch=net_arch_map[params["net_arch"]]),
        verbose=0,
    )
    
    model.learn(total_timesteps=25000)
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    env.close()
    
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30, show_progress_bar=True)

print(f"Best reward: {study.best_trial.value}")
print(f"Best params: {study.best_trial.params}")

[I 2026-01-20 16:02:38,117] A new study created in memory with name: no-name-e0e3f097-8b8f-47ee-91d5-c2dca2cfbb79
Best trial: 0. Best value: -1.74981:   3%|▎         | 1/30 [00:51<24:48, 51.34s/it]

[I 2026-01-20 16:03:29,475] Trial 0 finished with value: -1.749806179665029 and parameters: {'learning_rate': 0.004668562400418924, 'batch_size': 256, 'tau': 0.0015164643247002746, 'gamma': 0.9862852046927062, 'noise_std': 0.22829117172612534, 'net_arch': 'small'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:   7%|▋         | 2/30 [01:33<21:21, 45.77s/it]

[I 2026-01-20 16:04:11,341] Trial 1 finished with value: -6.676471281796694 and parameters: {'learning_rate': 2.582549765640076e-05, 'batch_size': 64, 'tau': 0.0013009718473785015, 'gamma': 0.9773592450065514, 'noise_std': 0.07071599744468342, 'net_arch': 'small'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:  10%|█         | 3/30 [02:17<20:16, 45.06s/it]

[I 2026-01-20 16:04:55,567] Trial 2 finished with value: -2.961957621015608 and parameters: {'learning_rate': 0.0004556591488958037, 'batch_size': 128, 'tau': 0.0026721702977186095, 'gamma': 0.9596857722937998, 'noise_std': 0.1298309701377535, 'net_arch': 'small'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:  13%|█▎        | 4/30 [03:05<20:03, 46.29s/it]

[I 2026-01-20 16:05:43,740] Trial 3 finished with value: -8.914393394440413 and parameters: {'learning_rate': 3.0074405002963306e-05, 'batch_size': 128, 'tau': 0.007990577762987245, 'gamma': 0.9739521664159131, 'noise_std': 0.4368345022218849, 'net_arch': 'small'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:  17%|█▋        | 5/30 [04:25<24:21, 58.44s/it]

[I 2026-01-20 16:07:03,733] Trial 4 finished with value: -4.552997749112547 and parameters: {'learning_rate': 4.190342217258564e-05, 'batch_size': 128, 'tau': 0.017219851816823308, 'gamma': 0.9565638417359934, 'noise_std': 0.10320732044803821, 'net_arch': 'large'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:  20%|██        | 6/30 [05:11<21:38, 54.09s/it]

[I 2026-01-20 16:07:49,358] Trial 5 finished with value: -4.091273233853281 and parameters: {'learning_rate': 0.00012011667665234712, 'batch_size': 128, 'tau': 0.0018127545012462058, 'gamma': 0.9628910029911988, 'noise_std': 0.10373343357040972, 'net_arch': 'small'}. Best is trial 0 with value: -1.749806179665029.

Best trial: 0. Best value: -1.74981:  20%|██        | 6/30 [05:56<23:46, 59.43s/it]

[W 2026-01-20 16:08:34,711] Trial 6 failed with parameters: {'learning_rate': 0.0015114807441251564, 'batch_size': 256, 'tau': 0.0015576873553844327, 'gamma': 0.9888236347993589, 'noise_std': 0.23073320173292075, 'net_arch': 'small'} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/Users/jaydenkm/workspaces/school/proj_situated/robotics_rl/.venv/lib/python3.13/site-packages/optuna/study/_optimize.py", line 206, in _run_trial
    value_or_values = func(trial)
  File "/var/folders/s1/fhmv_2b17yd1pk7g4y8sw2240000gn/T/ipykernel_31320/1191017117.py", line 43, in objective
    model.learn(total_timesteps=25000)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jaydenkm/workspaces/school/proj_situated/robotics_rl/.venv/lib/python3.13/site-packages/stable_baselines3/ddpg/ddpg.py", line 126, in learn
    return 

KeyboardInterrupt: 
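
The search space above passes `log=True` for `learning_rate` and `tau`, which means values are drawn in the log domain rather than linearly. The sketch below illustrates that idea with plain random sampling from the standard library. `sample_log_uniform` is a hypothetical helper, and this is not Optuna's sampler, which may bias later trials toward promising regions.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# Same bounds as the learning_rate range in the objective above.
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(10_000)]

# The range spans three decades, so each decade should receive a similar share.
share_lowest_decade = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of samples below 1e-4: {share_lowest_decade:.2f}")
```

Under linear sampling on the same range, fewer than 1% of draws would fall below `1e-4`, so small learning rates would almost never be tried.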

## Part 3: Train the Agent

Train a DDPG agent on the FetchReachDense-v4 environment. Training logs are saved to `logs/`.

Run the following command in a terminal from `~/project_path/src/robotics_rl` to train the agent:

```
python -m rl_zoo3.train --algo ddpg --env FetchReachDense-v4 \
    --gym-packages gymnasium_robotics -c hyperparams.yaml
```

## Part 4: Evaluate the Agent

Run the trained agent and see its performance metrics.

Run the following command to evaluate the trained agent:

```
python -m rl_zoo3.enjoy --algo ddpg --env FetchReachDense-v4 \
    --gym-packages gymnasium_robotics -f logs/
```

## Part 5: Record and View Video

Record a video of the trained policy to visually evaluate performance.

In [None]:
# record_video.py is a wrapper that pre-loads gymnasium_robotics
# (rl_zoo3.record_video doesn't support --gym-packages)
!python record_video.py --algo ddpg --env FetchReachDense-v4 -f logs/ -n 1000

In [None]:
import base64
from pathlib import Path
from IPython import display as ipythondisplay

def show_videos(video_path, prefix=""):
    """Display MP4 videos from a folder in the notebook."""
    html = []
    for mp4 in Path(video_path).glob(f"{prefix}*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes()).decode('ascii')
        html.append(f'''<video alt="{mp4}" autoplay loop controls style="height: 400px;">
            <source src="data:video/mp4;base64,{video_b64}" type="video/mp4" />
        </video>''')
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [None]:
# Display the recorded video
# Update the path if your experiment ID differs (check logs/ddpg/ folder)
show_videos('logs/ddpg/FetchReachDense-v4_1/videos/')
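
Rather than editing the experiment ID by hand, the newest run folder can be looked up. The sketch below assumes run folders are named `<env_id>_<number>` under `logs/<algo>/`, as in the path above. `latest_run_dir` is a hypothetical helper, and the demo uses a temporary directory rather than real training logs.

```python
import re
import tempfile
from pathlib import Path


def latest_run_dir(log_root, algo, env_id):
    """Return the run folder with the highest numeric suffix, or None if absent."""
    algo_dir = Path(log_root) / algo
    if not algo_dir.is_dir():
        return None
    pattern = re.compile(rf"^{re.escape(env_id)}_(\d+)$")
    runs = []
    for child in algo_dir.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            runs.append((int(match.group(1)), child))
    if not runs:
        return None
    # Compare numerically so that run 10 sorts after run 2.
    return max(runs, key=lambda item: item[0])[1]


# Demo on a throwaway folder layout that mimics logs/ddpg/.
with tempfile.TemporaryDirectory() as tmp:
    for exp_id in (1, 2, 10):
        (Path(tmp) / "ddpg" / f"FetchReachDense-v4_{exp_id}").mkdir(parents=True)
    print(latest_run_dir(tmp, "ddpg", "FetchReachDense-v4").name)
```

With real logs, something like `show_videos(latest_run_dir('logs', 'ddpg', 'FetchReachDense-v4') / 'videos')` would then display the most recent recording, provided that run actually contains a `videos/` folder.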

### Useful Commands

```bash
# Train with a different algorithm
python -m rl_zoo3.train --algo sac --env FetchReachDense-v4 --gym-packages gymnasium_robotics -c hyperparams.yaml

# Train with a specific seed (for reproducibility)
python -m rl_zoo3.train --algo ddpg --env FetchReachDense-v4 --gym-packages gymnasium_robotics -c hyperparams.yaml --seed 42

# Load best model instead of final model
python -m rl_zoo3.enjoy --algo ddpg --env FetchReachDense-v4 --gym-packages gymnasium_robotics -f logs/ --load-best
```
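
To compare runs across several seeds, the seeded train command above can be generated programmatically. The sketch below only builds and prints the commands and uses only flags that already appear in this notebook. `build_train_command` is a hypothetical helper, and the seed values are arbitrary.

```python
import shlex


def build_train_command(algo, env_id, seed, config="hyperparams.yaml"):
    """Build the rl_zoo3.train command as an argument list for one seed."""
    return [
        "python", "-m", "rl_zoo3.train",
        "--algo", algo,
        "--env", env_id,
        "--gym-packages", "gymnasium_robotics",
        "-c", config,
        "--seed", str(seed),
    ]


# Print one command per seed; nothing is executed here.
for seed in (0, 1, 2):
    print(shlex.join(build_train_command("ddpg", "FetchReachDense-v4", seed)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.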