# Lab 13: Twin Delayed Deep Deterministic Policy Gradient (TD3) on HalfCheetah

In this lab, we extend the previous **DDPG experiment** to its improved variant:  
**Twin Delayed Deep Deterministic Policy Gradient (TD3)**, using the **HalfCheetah** continuous control task.

HalfCheetah is a standard MuJoCo benchmark that requires learning stable, high-speed locomotion with continuous actions, making it well suited for evaluating actor–critic algorithms.

TD3 was proposed to address several well-known failure modes of DDPG, including:
- **Q-value overestimation**
- **Training instability**
- **High sensitivity to hyperparameters**

TD3 introduces three key improvements over DDPG:

1. **Twin Critics**  
   Two independent Q-networks are trained, and the TD target uses the minimum of the two target critics' estimates, which counteracts overestimation bias.

2. **Target Policy Smoothing**  
   Clipped Gaussian noise is added to the target action when computing the TD target, which smooths the value estimate over nearby actions and makes the critic less sensitive to sharp changes in the policy.

3. **Delayed Policy Updates**  
   The actor and all target networks are updated less frequently than the critics (once every two critic updates in this lab, `policy_delay = 2`), so the policy is trained against lower-variance value estimates.
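
The first two improvements both act on the TD target. As a minimal sketch, the target for a single transition can be written in plain Python so that it runs without PyTorch. The function names `smooth_target_action` and `clipped_double_q_target` and all numbers below are illustrative assumptions, not part of the lab code.

```python
import random


def clip(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def smooth_target_action(target_action, policy_noise, noise_clip, max_action, rng):
    """Target policy smoothing: add clipped Gaussian noise, then clip to the action bounds."""
    smoothed = []
    for a in target_action:
        noise = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
        smoothed.append(clip(a + noise, -max_action, max_action))
    return smoothed


def clipped_double_q_target(reward, done, target_q1, target_q2, discount):
    """Twin critics: bootstrap from the smaller of the two target-critic estimates."""
    return reward + (1.0 - done) * discount * min(target_q1, target_q2)


rng = random.Random(0)
target_action = [0.9, -0.95, 0.0]  # hypothetical output of the target actor
smoothed = smooth_target_action(target_action, 0.2, 0.5, 1.0, rng)

# Hypothetical target-critic values for the smoothed action.
y_nonterminal = clipped_double_q_target(1.0, 0.0, 10.0, 8.0, discount=0.99)
y_terminal = clipped_double_q_target(2.0, 1.0, 5.0, 7.0, discount=0.99)
print(smoothed, y_nonterminal, y_terminal)
```

For the non-terminal transition the target bootstraps from the smaller estimate (8.0 rather than 10.0). For the terminal transition the bootstrap term vanishes and the target equals the reward.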

In this lab, you will:
- Implement TD3 on the **HalfCheetah environment**.
- Reuse the **same Actor and Critic network structures** from the previous lab.
- Compare **DDPG vs TD3** in terms of learning speed, stability, and final performance.
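
One possible way to make the three comparison criteria concrete is to reduce each run's per-episode returns to a few numbers. The sketch below is an assumption about how such a summary could look: the helper `summarize_returns`, the threshold, and the window size are hypothetical, and the two curves are made-up values for illustration only, not results from either algorithm.

```python
from statistics import mean, pstdev


def summarize_returns(returns, threshold, window=10):
    """Summarize a list of per-episode returns with three simple metrics."""
    # Learning speed: first episode (1-indexed) at which the mean of the
    # trailing `window` returns reaches the threshold, or None if it never does.
    first_hit = None
    for i in range(window, len(returns) + 1):
        if mean(returns[i - window:i]) >= threshold:
            first_hit = i
            break
    tail = returns[-window:]
    return {
        "episodes_to_threshold": first_hit,  # learning speed
        "final_mean": mean(tail),            # final performance
        "final_std": pstdev(tail),           # stability at the end of training
    }


# Made-up return curves, used only to exercise the helper.
curve_a = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
curve_b = [0.0, 30.0, 10.0, 40.0, 20.0, 50.0]
summary_a = summarize_returns(curve_a, threshold=35.0, window=2)
summary_b = summarize_returns(curve_b, threshold=35.0, window=2)
print(summary_a)
print(summary_b)
```

In practice the same helper would be applied to the logged returns of the DDPG and TD3 runs, ideally averaged over several seeds, since single-seed curves in continuous control can be noisy.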

By keeping the network architecture unchanged and modifying only the learning algorithm, this lab isolates how much of any performance difference comes from **algorithmic design alone**.


In [6]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym

In [7]:
env_name = "HalfCheetah-v4"   
SAVE_PATH = "TD3_Cheetah_actor.pth"

In [8]:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = torch.relu(self.l1(state))
        x = torch.relu(self.l2(x))
        x = torch.tanh(self.l3(x))       
        return x * self.max_action       


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        q = self.l3(x)
        return q

In [9]:
class ReplayBuffer:
    def __init__(self, state_dim, action_dim, max_size=int(1e6)):
        self.max_size = max_size
        self.ptr = 0
        self.size = 0

        self.state = np.zeros((max_size, state_dim), dtype=np.float32)
        self.action = np.zeros((max_size, action_dim), dtype=np.float32)
        self.next_state = np.zeros((max_size, state_dim), dtype=np.float32)
        self.reward = np.zeros((max_size, 1), dtype=np.float32)
        self.done = np.zeros((max_size, 1), dtype=np.float32)

    def add(self, s, a, r, s2, d):
        self.state[self.ptr] = s
        self.action[self.ptr] = a
        self.reward[self.ptr] = r
        self.next_state[self.ptr] = s2
        self.done[self.ptr] = d

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size, device):
        idx = np.random.randint(0, self.size, size=batch_size)

        state = torch.as_tensor(self.state[idx], dtype=torch.float32, device=device)
        action = torch.as_tensor(self.action[idx], dtype=torch.float32, device=device)
        reward = torch.as_tensor(self.reward[idx], dtype=torch.float32, device=device)
        next_state = torch.as_tensor(self.next_state[idx], dtype=torch.float32, device=device)
        done = torch.as_tensor(self.done[idx], dtype=torch.float32, device=device)

        return state, action, reward, next_state, done

In [10]:
env = gym.make(env_name)
eval_env = gym.make(env_name)

seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
env.reset(seed=seed)
env.action_space.seed(seed)   # makes the random warm-up actions reproducible
eval_env.reset(seed=seed + 1)

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


max_episodes = 500
max_steps_per_episode = 1000

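start_timesteps = 25000   # initial steps that use uniformly random actions
expl_noise = 0.1          # std of the Gaussian exploration noise added to the actor's action

batch_size = 256
discount = 0.99
tau = 0.005               # soft-update rate for the target networks

policy_noise = 0.2        # std of the target policy smoothing noise
noise_clip = 0.5          # smoothing noise is clipped to [-noise_clip, noise_clip]
policy_delay = 2          # one actor/target update per policy_delay critic updates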

actor = Actor(state_dim, action_dim, max_action).to(device)
actor_target = Actor(state_dim, action_dim, max_action).to(device)
actor_target.load_state_dict(actor.state_dict())

critic1 = Critic(state_dim, action_dim).to(device)
critic2 = Critic(state_dim, action_dim).to(device)
critic1_target = Critic(state_dim, action_dim).to(device)
critic2_target = Critic(state_dim, action_dim).to(device)
critic1_target.load_state_dict(critic1.state_dict())
critic2_target.load_state_dict(critic2.state_dict())

actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic1_optimizer = optim.Adam(critic1.parameters(), lr=1e-3)
critic2_optimizer = optim.Adam(critic2.parameters(), lr=1e-3)
replay_buffer = ReplayBuffer(state_dim, action_dim)
total_steps = 0
gradient_step = 0  

mse_loss = nn.MSELoss()

In [11]:
for episode in range(1, max_episodes + 1):
    state, _ = env.reset()
    episode_reward = 0.0

    for t in range(max_steps_per_episode):
        total_steps += 1

        if total_steps < start_timesteps:
            # Warm-up: uniform random actions to fill the replay buffer
            action = env.action_space.sample()
        else:
            # Deterministic actor output plus Gaussian exploration noise
            with torch.no_grad():
                s_tensor = torch.as_tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
                a_tensor = actor(s_tensor)
                action = a_tensor.cpu().numpy().flatten()
                action = action + expl_noise * np.random.randn(*action.shape)
                action = np.clip(action, -max_action, max_action)

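        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store only `terminated` as the done flag: a time-limit truncation is not a
        # terminal state, so the TD target should still bootstrap from next_state.
        replay_buffer.add(state, action, reward, next_state, float(terminated))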
        state = next_state
        episode_reward += reward


        if replay_buffer.size >= batch_size:
            gradient_step += 1

            state_b, action_b, reward_b, next_state_b, done_b = replay_buffer.sample(
                batch_size, device
            )

            with torch.no_grad():
                # Target policy smoothing: clipped Gaussian noise on the target action
                noise = (torch.randn_like(action_b) * policy_noise).clamp(
                    -noise_clip, noise_clip
                )

                next_action_b = actor_target(next_state_b)
                next_action_b = (next_action_b + noise).clamp(-max_action, max_action)

                # Clipped double-Q: bootstrap from the smaller of the two target critics
                target_Q1 = critic1_target(next_state_b, next_action_b)
                target_Q2 = critic2_target(next_state_b, next_action_b)
                target_Q = torch.min(target_Q1, target_Q2)
                target = reward_b + (1.0 - done_b) * discount * target_Q


            current_Q1 = critic1(state_b, action_b)
            current_Q2 = critic2(state_b, action_b)

            critic1_loss = mse_loss(current_Q1, target)
            critic2_loss = mse_loss(current_Q2, target)
            critic_loss = critic1_loss + critic2_loss

            critic1_optimizer.zero_grad()
            critic2_optimizer.zero_grad()
            critic_loss.backward()
            critic1_optimizer.step()
            critic2_optimizer.step()


            if gradient_step % policy_delay == 0:
                # Actor loss = - E[ Q1(s, π(s)) ]
                actor_loss = -critic1(state_b, actor(state_b)).mean()
                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()

                # Soft (Polyak) update of the actor and both critic target networks
                with torch.no_grad():
                    for net, target_net in (
                        (actor, actor_target),
                        (critic1, critic1_target),
                        (critic2, critic2_target),
                    ):
                        for param, target_param in zip(net.parameters(), target_net.parameters()):
                            target_param.data.copy_(
                                tau * param.data + (1.0 - tau) * target_param.data
                            )

        if done:
            break


    print(f"Episode {episode:4d} | Steps: {t+1:4d} | Reward: {episode_reward:8.2f}")

Episode    1 | Steps: 1000 | Reward:  -301.23
Episode    2 | Steps: 1000 | Reward:  -213.73
Episode    3 | Steps: 1000 | Reward:  -141.26
Episode    4 | Steps: 1000 | Reward:  -220.39
Episode    5 | Steps: 1000 | Reward:  -235.70
Episode    6 | Steps: 1000 | Reward:  -242.73
Episode    7 | Steps: 1000 | Reward:  -195.29
Episode    8 | Steps: 1000 | Reward:  -182.95
Episode    9 | Steps: 1000 | Reward:  -322.25
Episode   10 | Steps: 1000 | Reward:  -339.32
Episode   11 | Steps: 1000 | Reward:  -199.36
Episode   12 | Steps: 1000 | Reward:  -397.89
Episode   13 | Steps: 1000 | Reward:  -273.43
Episode   14 | Steps: 1000 | Reward:  -205.63
Episode   15 | Steps: 1000 | Reward:  -263.46
Episode   16 | Steps: 1000 | Reward:  -162.17
Episode   17 | Steps: 1000 | Reward:  -344.87
Episode   18 | Steps: 1000 | Reward:  -222.37
Episode   19 | Steps: 1000 | Reward:  -380.59
Episode   20 | Steps: 1000 | Reward:  -280.07
Episode   21 | Steps: 1000 | Reward:  -352.40
Episode   22 | Steps: 1000 | Rewar

KeyboardInterrupt: 
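
The notebook defines `eval_env` and `SAVE_PATH` but the cells above never use them. A minimal sketch of a noise-free evaluation loop is given below. It assumes only the Gymnasium `reset`/`step` return signatures used in the training cell. The `evaluate_policy` helper and the `CountingEnv` stand-in are hypothetical and exist only so the example runs without MuJoCo.

```python
def evaluate_policy(env, policy_fn, episodes=5):
    """Average undiscounted return of policy_fn over a few episodes, with no exploration noise."""
    total = 0.0
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            state, reward, terminated, truncated, _ = env.step(policy_fn(state))
            total += float(reward)
            done = terminated or truncated
    return total / episodes


class CountingEnv:
    """Tiny stand-in environment (hypothetical) that mimics the Gymnasium reset/step signatures."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        # The reward simply echoes the action; the episode is truncated after `horizon` steps.
        self.t += 1
        truncated = self.t >= self.horizon
        return float(self.t), float(action), False, truncated, {}


avg_return = evaluate_policy(CountingEnv(horizon=3), lambda state: 1.0, episodes=2)
print(avg_return)
```

In the lab itself, `policy_fn` would wrap the trained `actor` (convert the state to a tensor, run it under `torch.no_grad()`, and convert the action back to NumPy) and `env` would be `eval_env`. The actor weights could then be stored with `torch.save(actor.state_dict(), SAVE_PATH)`.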