<a href="https://colab.research.google.com/github/maguid28/Deep-Reinforcement-Learning/blob/main/SAC_with_HER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Baselines3 - Hindsight Experience Replay on Highway Env

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Highway env: [https://github.com/eleurent/highway-env](https://github.com/eleurent/highway-env)

[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

## Install Dependencies and Stable Baselines Using Pip


```
pip install stable-baselines3[extra]
```

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
# Install the latest version of stable-baselines3
!pip install "stable-baselines3[extra]>=2.0.0a4"

Collecting stable-baselines3>=2.0.0a4 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading stable_baselines3-2.4.0a11-py3-none-any.whl.metadata (4.5 kB)
Collecting gymnasium<1.1.0,>=0.29.1 (from stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting ale-py>=0.9.0 (from stable-baselines3[extra]>=2.0.0a4)
  Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium<1.1.0,>=0.29.1->stable-baselines3>=2.0.0a4->stable-baselines3[extra]>=2.0.0a4)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading stable_baselines3-2.4.0a11-py3-none-any.whl (183 kB)
Downloading ale_py-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)

In [3]:
# Install highway-env
!pip install highway-env

Collecting highway-env
  Downloading highway_env-1.10.1-py3-none-any.whl.metadata (16 kB)
Downloading highway_env-1.10.1-py3-none-any.whl (104 kB)
Installing collected packages: highway-env
Successfully installed highway-env-1.10.1


## Import policy, RL agent, ...

In [4]:
import gymnasium as gym
import highway_env
import numpy as np

from stable_baselines3 import HerReplayBuffer, SAC, DDPG
from stable_baselines3.common.noise import NormalActionNoise

## Create the Gym env and instantiate the agent

For this example, we will be using the parking environment from the [highway-env](https://github.com/Farama-Foundation/HighwayEnv) repo by @eleurent.

The parking env is a goal-conditioned continuous control task, in which the vehicle must park in a given space with the appropriate heading.


![parking-env](https://raw.githubusercontent.com/eleurent/highway-env/gh-media/docs/media/parking-env.gif)
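
To make "goal-conditioned" concrete, here is a minimal sketch of the interface such an environment exposes: a dict observation with the keys `observation`, `achieved_goal` and `desired_goal`, plus a `compute_reward` method that can score any (achieved, desired) goal pair. `ToyParkingEnv` is a hypothetical stand-in written for illustration only. It is not the highway-env parking environment, and its dynamics and distance-based reward are assumptions.

```python
import math


class ToyParkingEnv:
    """Hypothetical goal-conditioned toy env (NOT highway-env's parking env)."""

    def __init__(self, goal=(3.0, 0.0), tolerance=0.1):
        self.goal = goal            # desired (position, heading)
        self.tolerance = tolerance  # distance below which we call it a success
        self.state = (0.0, 0.0)

    def _obs(self):
        # Goal-conditioned envs return a dict with these three keys.
        return {
            "observation": self.state,
            "achieved_goal": self.state,
            "desired_goal": self.goal,
        }

    def compute_reward(self, achieved_goal, desired_goal, info=None):
        # Assumed reward: negative Euclidean distance to the goal.
        return -math.dist(achieved_goal, desired_goal)

    def reset(self):
        self.state = (0.0, 0.0)
        return self._obs(), {}

    def step(self, action):
        dx, dheading = action
        self.state = (self.state[0] + dx, self.state[1] + dheading)
        obs = self._obs()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"])
        success = -reward < self.tolerance
        # Returns (obs, reward, terminated, truncated, info).
        return obs, reward, success, False, {"is_success": success}


toy_env = ToyParkingEnv()
toy_obs, _ = toy_env.reset()
print(sorted(toy_obs.keys()))
```

Because `compute_reward` takes the goal as an argument instead of reading it from the env state, a replay buffer can recompute rewards for goals other than the one the agent was actually pursuing, which is what HER relies on.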



### Train Soft Actor-Critic (SAC) agent

Here, we use HER's "future" goal selection strategy: for each real transition, 4 artificial transitions are created by replacing the desired goal with a goal that was achieved later in the same episode.

Note: the hyperparameters (network architecture, discount factor, ...) were tuned for this task.
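
The "future" strategy can be sketched in plain Python as follows. This is an illustrative sketch, not SB3's `HerReplayBuffer` implementation: `relabel_future` and `sparse_reward` are hypothetical helpers, the goals are scalars for simplicity, and sampling "at step t or later" is an assumption of this sketch.

```python
import random


def sparse_reward(achieved_goal, desired_goal, tolerance=1e-6):
    # Hypothetical toy reward: 0 when the goal is reached, -1 otherwise.
    return 0.0 if abs(achieved_goal - desired_goal) <= tolerance else -1.0


def relabel_future(episode, n_sampled_goal=4, rng=None):
    """Create n_sampled_goal relabeled copies of every transition in one episode."""
    rng = rng or random.Random(0)
    relabeled = []
    last = len(episode) - 1
    for t, transition in enumerate(episode):
        for _ in range(n_sampled_goal):
            # "future": take a goal achieved at step t or later in the same episode.
            future_t = rng.randint(t, last)
            new_goal = episode[future_t]["achieved_goal"]
            new_transition = dict(transition, desired_goal=new_goal)
            # The reward must be recomputed for the substituted goal.
            new_transition["reward"] = sparse_reward(
                transition["achieved_goal"], new_goal
            )
            relabeled.append(new_transition)
    return relabeled


# Toy episode with scalar goals: the agent reaches 1.0, 2.0, ... but wanted 10.0.
episode = [
    {"achieved_goal": float(i + 1), "desired_goal": 10.0, "reward": -1.0}
    for i in range(5)
]
artificial = relabel_future(episode)
print(len(episode), "real ->", len(artificial), "artificial transitions")
```

With `n_sampled_goal=4`, the 5 real transitions yield 20 artificial ones. Some of these carry a reward of 0, so the agent sees successful outcomes even though the original episode never reached its goal.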

In [5]:
env = gym.make("parking-v0")

In [6]:
# SAC hyperparams:
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
# Train for 1e5 steps
model.learn(int(1e5))
# Save the trained agent
model.save('her_sac_highway')

Streaming output truncated to the last 5000 lines.
|    success_rate    | 0.94     |
| time/              |          |
|    episodes        | 1896     |
|    fps             | 39       |
|    time_elapsed    | 1930     |
|    total_timesteps | 76815    |
| train/             |          |
|    actor_loss      | 1.5      |
|    critic_loss     | 0.039    |
|    ent_coef        | 0.0046   |
|    ent_coef_loss   | -0.358   |
|    learning_rate   | 0.001    |
|    n_updates       | 76714    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 20.2     |
|    ep_rew_mean     | -6.77    |
|    success_rate    | 0.94     |
| time/              |          |
|    episodes        | 1900     |
|    fps             | 39       |
|    time_elapsed    | 1932     |
|    total_timesteps | 76884    |
| train/             |          |
|    actor_loss      | 1.61     |
|    critic_loss     | 0.00549  |
|    ent_coef    

In [8]:
# Load saved model
model = SAC.load('her_sac_highway', env=env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


#### Evaluate the agent

In [9]:
# We use the Gymnasium (Gym >= 0.26) API here. Note that you could also wrap the env
# in a DummyVecEnv, which allows you to use a simplified API
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()

Reward: -2.355129741427345 Success? True
Reward: -9.081994387895351 Success? True
Reward: -2.2852601007293827 Success? True
Reward: -4.675070981726117 Success? True
Reward: -5.115426798700607 Success? True
Reward: -4.259585992408998 Success? True
Reward: -3.3722954832556766 Success? True
Reward: -4.427494844392133 Success? True
Reward: -12.054428444829757 Success? True
Reward: -7.695538667586793 Success? True
Reward: -5.34783251796113 Success? True
Reward: -7.879838793682838 Success? True
Reward: -7.199446795778505 Success? True
Reward: -4.743820351324023 Success? True
Reward: -10.057539438522307 Success? True
Reward: -7.434510674777588 Success? True
Reward: -6.686455333443665 Success? True
Reward: -4.730047082631324 Success? True
Reward: -10.604326538825276 Success? True
Reward: -3.45660062655118 Success? True
Reward: -7.785878768166735 Success? True
Reward: -9.495165920763398 Success? True
Reward: -5.054197758135411 Success? True
Reward: -11.12064554140158 Success? False
Reward: -5.5

### Train DDPG agent

In [10]:
# Create the action noise object that will be used for exploration
n_actions = env.action_space.shape[0]
noise_std = 0.2
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions)
)

model = DDPG(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
    ),
    verbose=1,
    buffer_size=int(1e6),
    learning_rate=1e-3,
    action_noise=action_noise,
    gamma=0.95,
    batch_size=256,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [None]:
# Train for 2e5 steps
model.learn(int(2e5))
# Save the trained agent
model.save('her_ddpg_highway')

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 56.2     |
|    ep_rew_mean     | -33.9    |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 4        |
|    fps             | 71       |
|    time_elapsed    | 3        |
|    total_timesteps | 225      |
| train/             |          |
|    actor_loss      | 0.338    |
|    critic_loss     | 0.0322   |
|    learning_rate   | 0.001    |
|    n_updates       | 124      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 56.6     |
|    ep_rew_mean     | -31.9    |
|    success_rate    | 0        |
| time/              |          |
|    episodes        | 8        |
|    fps             | 56       |
|    time_elapsed    | 8        |
|    total_timesteps | 453      |
| train/             |          |
|    actor_loss      | 0.528    |
|    critic_loss     | 0.0192   |
|    learning_

In [None]:
# Load saved model
model = DDPG.load('her_ddpg_highway', env=env)

#### Evaluate the agent

In [None]:
# We use the Gymnasium (Gym >= 0.26) API here. Note that you could also wrap the env
# in a DummyVecEnv, which allows you to use a simplified API
obs, _ = env.reset()

# Evaluate the agent
episode_reward = 0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = truncated or terminated
    episode_reward += reward
    if done or info.get("is_success", False):
        print("Reward:", episode_reward, "Success?", info.get("is_success", False))
        episode_reward = 0.0
        obs, _ = env.reset()
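
The evaluation loop above prints one line per episode. To compare the SAC and DDPG agents with a single number each, the same loop can be wrapped in a small helper that returns the mean episode reward and the success rate. `evaluate` is a hypothetical helper, not part of SB3. It only assumes the `model.predict` and Gymnasium `env.reset` / `env.step` calls already used in this notebook, and that `info` may contain an `is_success` entry. SB3 also ships `evaluate_policy` in `stable_baselines3.common.evaluation` if only the mean and standard deviation of the reward are needed.

```python
def evaluate(model, env, n_episodes=10):
    """Run n_episodes full episodes and return (mean_reward, success_rate)."""
    rewards, successes = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode_reward, done, info = 0.0, False, {}
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += float(reward)
            done = terminated or truncated
        rewards.append(episode_reward)
        # Treat a missing "is_success" entry as a failure.
        successes.append(bool(info.get("is_success", False)))
    return sum(rewards) / n_episodes, sum(successes) / n_episodes
```

For example, `mean_reward, success_rate = evaluate(model, env, n_episodes=20)` can be run once after loading each saved agent.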