# Lab12: Deep Deterministic Policy Gradient (DDPG) for Continuous Control

In this lab, you will implement and train a **Deep Deterministic Policy Gradient (DDPG)** agent to solve a **continuous control task** using the MuJoCo physics simulator.  
The target environment is:

> **Hopper-v4** — a one-legged robot that must learn to hop forward as fast and as stably as possible.

## 1. Environment: Hopper-v4

The [Hopper environment](https://gymnasium.farama.org/environments/mujoco/hopper/) is a physics-based locomotion task simulated using **MuJoCo**.

- **Observation Space:** 11-dimensional continuous state  
- **Action Space:** 3-dimensional continuous torque control  
- **Objective:** Move forward as fast as possible without falling  
- **Episode Ends When:** The robot falls or leaves its healthy state range (termination), or the 1000-step time limit is reached (truncation)  

This task represents a realistic robotic control problem with:
- Nonlinear dynamics  
- High-dimensional state space  
- Continuous actions  
- Long-term credit assignment  

In [3]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider

env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
print("Env id:", env.spec.id)  
obs, info = env.reset(seed=0)

frames = []
num_steps = 300

for t in range(num_steps):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)

    frame = env.render()
    frames.append(frame.copy())   

    if terminated or truncated:
        obs, info = env.reset()

env.close()
print("Collected frames:", len(frames))

def show_frame(i):
    plt.figure(figsize=(4, 4))
    plt.imshow(frames[i])
    plt.axis("off")
    plt.title(f"{env.spec.id} frame {i}")  # label with the environment that was actually loaded
    plt.show()

interact(
    show_frame,
    i=IntSlider(0, min=0, max=len(frames)-1, step=1, description="Frame")
)

Env id: HalfCheetah-v4
Collected frames: 300



## 2. Background

Many real-world control problems involve **continuous actions**, such as:
- Robot joint torques  
- Vehicle steering and acceleration  
- Control forces in physical systems  

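Classical Deep Q-Networks (DQN) cannot be applied directly to these problems: they pick actions by taking the maximum of Q over a finite set of **discrete** actions, and that maximization has no cheap equivalent when actions are continuous.  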
DDPG extends Q-learning to **continuous control** by combining:

- A **policy network (Actor)** that outputs continuous actions.
- A **value network (Critic)** that estimates the Q-function.
- **Target networks** for stable training.
- A **replay buffer** for off-policy learning.

DDPG is one of the foundational algorithms for continuous reinforcement learning and serves as the basis for more advanced methods such as **TD3** and **SAC**.
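
The way the four components listed above interact can be summarized in a few update rules. The notation is an assumption introduced for this summary only: $\mu_\theta$ is the actor, $Q_\phi$ is the critic, primed parameters belong to the target networks, $d$ is the done flag, $\gamma$ is the discount factor, $\tau$ is the soft-update rate, and $N$ is the size of a minibatch drawn from the replay buffer.

$$
\begin{aligned}
y_i &= r_i + \gamma\,(1 - d_i)\,Q_{\phi'}\bigl(s'_i,\ \mu_{\theta'}(s'_i)\bigr) \\
L(\phi) &= \frac{1}{N}\sum_{i=1}^{N}\bigl(Q_\phi(s_i, a_i) - y_i\bigr)^2 \\
J(\theta) &= -\frac{1}{N}\sum_{i=1}^{N} Q_\phi\bigl(s_i,\ \mu_\theta(s_i)\bigr) \\
\theta' &\leftarrow \tau\,\theta + (1 - \tau)\,\theta', \qquad
\phi' \leftarrow \tau\,\phi + (1 - \tau)\,\phi'
\end{aligned}
$$

The critic minimizes $L(\phi)$, the actor minimizes $J(\theta)$ (equivalently, it maximizes the critic's estimate of its own actions), and the target networks trail the online networks through the last line.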


## 3. Algorithm Implementation: DDPG

DDPG consists of the following components:

- **Actor Network**  
  Outputs a deterministic continuous action given a state.

- **Critic Network**  
  Estimates the Q-value of state–action pairs.

- **Target Networks**  
  Slowly updated copies of the actor and critic for stable learning.

- **Replay Buffer**  
  Stores past transitions for off-policy training.

- **Exploration Noise**  
  Added to the actor’s output during training; without it, a deterministic policy would keep repeating its current best guess and never try other actions.
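
As a rough illustration of the exploration-noise component, the sketch below adds zero-mean Gaussian noise to a fixed action vector and clips the result to the valid range. The helper name `noisy_action`, the example action, and the noise scale of 0.1 are assumptions chosen for illustration; only NumPy is needed.

```python
import numpy as np

def noisy_action(mean_action, noise_std, max_action, rng):
    """Add zero-mean Gaussian noise to a deterministic action, then clip to the valid range."""
    noise = rng.normal(0.0, noise_std, size=mean_action.shape)
    return np.clip(mean_action + noise, -max_action, max_action)

rng = np.random.default_rng(0)
mean_action = np.array([0.95, -0.20, 0.0])  # stand-in for the actor's output
samples = np.stack([noisy_action(mean_action, 0.1, 1.0, rng) for _ in range(1000)])

print("min per dimension: ", samples.min(axis=0))
print("max per dimension: ", samples.max(axis=0))
print("mean per dimension:", samples.mean(axis=0))
```

Clipping matters most near the bounds: the first dimension sits at 0.95, so a sizeable share of its noisy samples land exactly on the limit, which biases exploration there toward the boundary.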



In [4]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
import time

In [5]:
class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
            nn.Tanh()
        )
        self.max_action = max_action

    def forward(self, x):
        return self.max_action * self.net(x)


class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))
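
The final `Tanh` followed by multiplication with `max_action` is what keeps the actor's output inside the action bounds. The NumPy sketch below isolates that output stage; the function name `squash` and the sample inputs are illustrative and not part of the lab code. It assumes the bounds are symmetric and identical across action dimensions, which is also what using a single scalar `max_action` implies.

```python
import numpy as np

def squash(pre_activation, max_action):
    """Map unbounded values into [-max_action, max_action], mirroring max_action * tanh(x)."""
    return max_action * np.tanh(pre_activation)

pre_activation = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
for max_action in (1.0, 2.0):
    out = squash(pre_activation, max_action)
    print(f"max_action={max_action}: {out}")
```

Very large pre-activations saturate at the bounds, where the gradient of `tanh` is close to zero, so learning can slow down if the actor's last layer drifts to large magnitudes.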

In [6]:
class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s2, d):
        self.buffer.append((s, a, r, s2, d))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, d = map(np.array, zip(*batch))
        return (
            torch.FloatTensor(s),
            torch.FloatTensor(a),
            torch.FloatTensor(r).unsqueeze(1),
            torch.FloatTensor(s2),
            torch.FloatTensor(d).unsqueeze(1),
        )

    def __len__(self):
        return len(self.buffer)
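
Indexing into the middle of a large `deque` is not constant-time, so sampling from it can become a bottleneck as the buffer fills. A common alternative is a ring buffer over preallocated arrays. The sketch below is one hypothetical way to write it (the class name `ArrayReplayBuffer` and its arguments are assumptions, not part of the lab code); it returns NumPy arrays and samples with replacement, unlike `random.sample`.

```python
import numpy as np

class ArrayReplayBuffer:
    """Fixed-size ring buffer backed by preallocated NumPy arrays (hypothetical alternative)."""

    def __init__(self, obs_dim, act_dim, capacity, seed=None):
        self.capacity = capacity
        self.s = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.a = np.zeros((capacity, act_dim), dtype=np.float32)
        self.r = np.zeros((capacity, 1), dtype=np.float32)
        self.s2 = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.d = np.zeros((capacity, 1), dtype=np.float32)
        self.pos = 0    # index of the next write
        self.size = 0   # number of valid transitions stored
        self.rng = np.random.default_rng(seed)

    def push(self, s, a, r, s2, d):
        i = self.pos
        self.s[i], self.a[i], self.r[i], self.s2[i], self.d[i] = s, a, r, s2, d
        self.pos = (self.pos + 1) % self.capacity      # wrap around: overwrite the oldest entry
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform sampling with replacement over the filled part of the buffer
        idx = self.rng.integers(0, self.size, size=batch_size)
        return self.s[idx], self.a[idx], self.r[idx], self.s2[idx], self.d[idx]

    def __len__(self):
        return self.size


# Usage: push more transitions than the capacity, then draw a batch
buf = ArrayReplayBuffer(obs_dim=4, act_dim=2, capacity=100, seed=0)
for t in range(250):
    buf.push(np.full(4, t), np.zeros(2), float(t), np.full(4, t + 1), 0.0)

s, a, r, s2, d = buf.sample(32)
print(len(buf), s.shape, a.shape, r.shape)
```

Because the reward here equals the step index, every sampled reward comes from the 100 most recent pushes, which shows that older transitions were overwritten.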

In [7]:
env = gym.make("HalfCheetah-v4")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

actor = Actor(obs_dim, act_dim, max_action).to(device)
critic = Critic(obs_dim, act_dim).to(device)
actor_target = Actor(obs_dim, act_dim, max_action).to(device)
critic_target = Critic(obs_dim, act_dim).to(device)

actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

actor_opt = optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

buffer = ReplayBuffer()

gamma = 0.99
tau = 0.005
batch_size = 256
exploration_noise = 0.1

total_steps = 300_000
warmup_steps = 10_000

state, _ = env.reset()
episode_reward = 0
episode_length = 0
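
Two of the hyperparameters above have a simple rule-of-thumb reading: `gamma` sets an effective horizon of roughly `1 / (1 - gamma)` steps, and `tau` sets how quickly a target network catches up with its online counterpart. The sketch below checks both numerically on a scalar stand-in for a network parameter; the setup (online value frozen at 1.0, target starting at 0.0) is an assumption made purely for illustration.

```python
import numpy as np

gamma = 0.99
tau = 0.005

# Effective horizon: the discount weights gamma**k sum to 1 / (1 - gamma)
horizon = 1.0 / (1.0 - gamma)

# Soft update on a scalar "parameter": the online value is held at 1.0,
# the target starts at 0.0 and follows target <- tau * online + (1 - tau) * target
online, target = 1.0, 0.0
trace = []
for _ in range(1000):
    target = tau * online + (1.0 - tau) * target
    trace.append(target)

# Closed form for comparison: after n updates the remaining gap is (1 - tau)**n
closed_form = 1.0 - (1.0 - tau) ** np.arange(1, 1001)

print(f"effective horizon ~ {horizon:.0f} steps")
print(f"target after 200 updates:  {trace[199]:.3f}")
print(f"target after 1000 updates: {trace[999]:.3f}")
```

With `tau = 0.005`, the target closes about 63% of the gap after `1 / tau = 200` updates, so the critic's bootstrap targets change slowly relative to the online networks.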

In [9]:
SAVE_PATH = "ddpg_cheetah_actor_class.pth"

In [None]:
episode = 0

for step in range(total_steps):

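    # --- select action ---
    if step < warmup_steps:
        # Uniform random actions fill the buffer before the actor is used
        action = env.action_space.sample()
    else:
        # Deterministic actor output plus Gaussian exploration noise
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        with torch.no_grad():  # no gradients are needed for action selection
            action = actor(state_tensor).cpu().numpy().flatten()
        action = action + np.random.normal(0, exploration_noise, size=act_dim)

    action = np.clip(action, -max_action, max_action)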

    # --- step env ---
    next_state, reward, terminated, truncated, _ = env.step(action)
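    done = terminated or truncated

    # Store only true termination: a time-limit truncation should not zero out the bootstrap term
    buffer.push(state, action, reward, next_state, float(terminated))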

    state = next_state
    episode_reward += reward
    episode_length += 1
    
    # --- reset if done ---
    if done:
        print(f"Episode {episode} | Reward: {episode_reward:.1f} | Length: {episode_length}")
        state, _ = env.reset()
        episode_reward = 0
        episode_length = 0
        episode += 1

    # --- update ---
    if len(buffer) > batch_size:

        s, a, r, s2, d = buffer.sample(batch_size)
        s = s.to(device)
        a = a.to(device)
        r = r.to(device)
        s2 = s2.to(device)
        d = d.to(device)

        # Critic update
        with torch.no_grad():
            a2 = actor_target(s2)
            q_target = r + gamma * (1 - d) * critic_target(s2, a2)

        q_val = critic(s, a)
        critic_loss = nn.MSELoss()(q_val, q_target)

        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor update
        actor_loss = -critic(s, actor(s)).mean()

        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Target update
        for p, p_t in zip(actor.parameters(), actor_target.parameters()):
            p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)

        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)

    # --- occasionally show progress ---
    if step % 10_000 == 0:
        print(f"Step {step}/{total_steps}")
        torch.save(actor.state_dict(), SAVE_PATH)
        
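torch.save(actor.state_dict(), SAVE_PATH)  # save the final weights; the last in-loop checkpoint is older
print("Training finished and model saved!")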

Step 0/300000
Episode 0 | Reward: -185.6 | Length: 1000
Episode 1 | Reward: -256.7 | Length: 1000
Episode 2 | Reward: -324.6 | Length: 1000
Episode 3 | Reward: -404.8 | Length: 1000
Episode 4 | Reward: -275.0 | Length: 1000
Episode 5 | Reward: -260.0 | Length: 1000
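
The critic target in the update step, `r + gamma * (1 - d) * critic_target(s2, a2)`, depends heavily on what is stored as `d`. The NumPy sketch below works through three transitions by hand; the reward and next-state Q-values are made-up numbers, and treating a time-limit truncation as `d = 0` is a common convention that is assumed here rather than prescribed by the lab.

```python
import numpy as np

gamma = 0.99
r = np.array([[1.0], [1.0], [1.0]])          # rewards
q_next = np.array([[50.0], [50.0], [50.0]])  # stand-in for critic_target(s2, a2)

# Row 0: ordinary mid-episode step.
# Row 1: true termination (e.g. the robot fell): there is no future return to bootstrap.
# Row 2: time-limit truncation, stored with d = 0 so bootstrapping continues.
d = np.array([[0.0], [1.0], [0.0]])

q_target = r + gamma * (1.0 - d) * q_next
print(q_target.ravel())
```

Only the terminated row loses its bootstrap term. If truncated steps were stored with `d = 1`, the critic would be taught that the return drops to the immediate reward at step 1000, even though the task itself could have continued.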
