# Policy Gradient on Racetrack Environment (Optimized)

This notebook implements a Policy Gradient method for a custom racetrack environment using PyTorch. It includes:
- Dynamic input size handling
- Entropy-based regularization
- Advantage normalization
- Gradient clipping
- **Performance optimization** by disabling rendering

We use the REINFORCE algorithm, which is a Monte Carlo policy gradient method for optimizing the policy network.

### 1. Imports and Device Configuration

We import the required libraries and set the device to CUDA if available.

In [3]:
import gymnasium as gym
import highway_env
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from torch.distributions import Normal
import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### 2. Training Mode and Render Settings

We define a simple flag to control whether the environment should render for visualization or run headlessly for training.

In [5]:
training = True
render_mode = None if training else "human"

### 3. Environment Configuration

The configuration dictionary defines both the observation and action spaces, along with various simulation and reward parameters. Key features include:
- OccupancyGrid observation type with vehicle presence and road occupancy
- Continuous action space with lateral-only control (no acceleration)
- Reward structure penalizing collisions, drifting from the lane center, and overactive control
- Rendering is disabled when training to improve performance

In [14]:
config = {
    "observation": {
        "type": "OccupancyGrid",
        "features": ['presence', 'on_road', 'velocity', 'heading'],
        "grid_size": [[-18, 18], [-18, 18]],
        "grid_step": [2, 2],
        "as_image": False,
        "align_to_vehicle_axes": True
    },
    "action": {
        "type": "ContinuousAction",
        "longitudinal": True,
        "lateral": True,
    },
    "simulation_frequency": 15,
    "policy_frequency": 10,
    "duration": 100,
    "collision_reward": -1,
    "lane_centering_cost": 2,
    # "action_reward": -0.05,
    "controlled_vehicles": 1,
    "other_vehicles": 0,
    "screen_width": 600,
    "screen_height": 600,
    "centering_position": [0.5, 0.5],
    "reward_speed_range": [0.2, 0.8],
    "reward_speed_weight": 0.2,
    "scaling": 7,
    "show_trajectories": False,
    "offroad_terminal": True,
    "road": {
        "type": "sine_curve"},
    "render_agent": False if training else True,
    "offscreen_rendering": True if training else False,
    "offroad_terminal": True,
    "offroad_reward": -1.0  
}

### 4. Environment Initialization

We instantiate the racetrack environment and inject the configuration directly into the unwrapped environment. A sample observation is retrieved to infer the shape of the state space and action dimensionality for the policy network.

In [10]:
env = gym.make("racetrack-v0", render_mode=render_mode)
env.unwrapped.configure(config)
obs_sample, _ = env.reset()

obs_shape = obs_sample.shape
act_dim = env.action_space.shape[0] if len(env.action_space.shape) > 0 else 1

### Policy and Value Networks

We use a classic actor-critic architecture in which:
- The Actor learns a stochastic policy modeled by a Normal distribution, allowing for continuous actions.
- The Critic estimates the value function V(s), used to reduce the variance of the policy gradient via advantage estimation.
- Both models use a fully connected feedforward network with ReLU activations and are designed to support dynamic input shapes.

### Actor Network

The actor outputs the parameters of a Normal distribution from which actions are sampled. We use log_std as a learnable parameter to allow the network to control its exploration behavior dynamically.


In [7]:
class Actor(nn.Module):
    def __init__(self, obs_shape, act_dim, hidden=128):
        super().__init__()
        self.flatten = nn.Flatten()

        with torch.no_grad():
            dummy = torch.zeros(1, *obs_shape)
            in_dim = self.flatten(dummy).shape[1]

        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU()
        )

        # cabeça de média
        self.mu_head   = nn.Linear(hidden, act_dim)
        # cabeça de log-std (inicializada em –1 ⇒ std≈0.37)
        self.logstd_head = nn.Linear(hidden, act_dim)
        nn.init.constant_(self.logstd_head.bias, -1.0)

    def forward(self, obs):
        x  = self.backbone(self.flatten(obs))
        mu = torch.tanh(self.mu_head(x))        # [-1, 1]

        log_std = self.logstd_head(x).clamp(-3, 1)  # std∈[0.05, 2.7]
        std = log_std.exp()

        return torch.distributions.Normal(mu, std)

### Critic Network

The critic estimates the value of each state using a separate feedforward network. Its output is a scalar value for each state input.

In [8]:
class Critic(nn.Module):
    def __init__(self, obs_shape, hidden_size=64):
        super().__init__()
        self.flatten = nn.Flatten()

        with torch.no_grad():
            dummy = torch.zeros(1, *obs_shape)
            n_flatten = self.flatten(dummy).shape[1]

        self.v = nn.Sequential(
            nn.Linear(n_flatten, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

    def forward(self, obs):
        x = self.flatten(obs)
        return self.v(x).squeeze(-1)  # Output shape: (batch,)

### Model Initialization and Optimizers

We instantiate the networks and their optimizers. AdamW is used for better weight decay handling compared to standard Adam.

In [15]:
# Model instantiation
actor = Actor(obs_shape, act_dim).to(device)
critic = Critic(obs_shape).to(device)

# Optimizers
opt_actor = optim.AdamW(actor.parameters(), lr=1e-3)
opt_critic = optim.AdamW(critic.parameters(), lr=1e-3)

## REINFORCE with Baseline: Training Loop

This section implements the training procedure for a policy gradient method (REINFORCE) using the actor-critic architecture defined earlier.

Key elements of the implementation:
- The actor samples actions from a Normal distribution.
- The critic estimates the value function V(s) to compute the advantage.
- Rewards are accumulated using Monte Carlo returns.
- The policy is optimized using the advantage-weighted log-probabilities.
- We include entropy regularization to encourage exploration.
- Gradients are clipped to stabilize training.


In [16]:
# ------ hiperparâmetros ------
gamma   = 0.95
epochs  = 100
act_range = torch.tensor([5.0, 1.0], device=device)  # escala das ações (lon, lat)

# ------ métricas ------
ret_hist, len_hist, ent_hist = [], [], []

for epoch in range(epochs):
    obs, _ = env.reset()
    obs = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)

    logps, values, rewards = [], [], []
    done, step, ep_ret = False, 0, 0.0

    while not done and step < 200:
        dist  = actor(obs)
        value = critic(obs)

        raw_a = dist.rsample()                    # reparameterização
        squashed_a = torch.tanh(raw_a)
        env_a = (squashed_a * act_range).cpu().detach().numpy()[0]

        # log-prob com correção do squash (ver Appendix de SAC)
        logp = (dist.log_prob(raw_a) - torch.log(1 - squashed_a.pow(2) + 1e-6)).sum(-1)

        obs_next, r, done, _, _ = env.step(env_a)

        if hasattr(env, "vehicle") and not env.vehicle.on_road:
            r -= 1.0;  done = True

        logps.append(logp);  values.append(value);  rewards.append(r)
        ep_ret += r
        obs = torch.tensor(obs_next, dtype=torch.float32, device=device).unsqueeze(0)
        step += 1

    # -------- GAE simples --------
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32, device=device)
    values  = torch.cat(values)
    adv     = (returns - values)
    adv     = (adv - adv.mean()) / (adv.std() + 1e-8)

    # -------- perdas --------
    logps   = torch.cat(logps)
    policy_loss = -(logps * adv.detach()).mean()
    value_loss  = 0.5 * adv.pow(2).mean()
    entropy     = dist.entropy().mean()
    ent_hist.append(entropy.item())

    ent_coef = 0.01 * (0.995 ** epoch)
    loss = policy_loss + value_loss - ent_coef * entropy

    # -------- otimização --------
    opt_actor.zero_grad();  opt_critic.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(actor.parameters(), 0.5)
    torch.nn.utils.clip_grad_norm_(critic.parameters(), 0.5)
    opt_actor.step();  opt_critic.step()

    ret_hist.append(ep_ret);  len_hist.append(step)
    if epoch % 10 == 0:
        print(f"Ep {epoch:>3} | Return {ep_ret:>6.1f} | Ent {entropy:.3f}")
        
# Salvar o checkpoint
torch.save({
    'actor_state_dict': actor.state_dict(),
    'critic_state_dict': critic.state_dict(),
    'opt_actor_state_dict': opt_actor.state_dict(),
    'opt_critic_state_dict': opt_critic.state_dict(),
    'epoch': epoch,
}, 'models/checkpoint_task2.pth')

# 📊 Plotagens
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(ret_hist)
plt.xlabel("Epoch")
plt.ylabel("Total Reward")
plt.title("Recompensa por Episódio")

plt.subplot(1, 3, 2)
plt.plot(ent_hist)
plt.xlabel("Epoch")
plt.ylabel("Entropia")
plt.title("Entropia da Política")

plt.subplot(1, 3, 3)
plt.plot(len_hist)
plt.xlabel("Epoch")
plt.ylabel("Duração do Episódio")
plt.title("Passos por Episódio")

plt.tight_layout()
plt.show()

Ep   0 | Return   21.1 | Ent 0.428
Ep  10 | Return   14.5 | Ent 0.444
Ep  20 | Return    3.9 | Ent 0.485


### Rendering the Trained Agent

After training the policy, we can visualize its performance by running the agent in the environment with rendering enabled. This helps qualitatively assess how well the agent learned the task, such as staying on track or making smooth lateral movements.

#### Key Points
- The environment is re-initialized with "human" render mode to display a GUI window.
- The trained policy is used in deterministic mode by taking the mean of the action distribution.
- Rendering is done for a limited number of steps or until the episode terminates.

In [13]:
# Model instantiation
actor = Actor(obs_shape, act_dim).to(device)
critic = Critic(obs_shape).to(device)

opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

checkpoint = torch.load('models/checkpoint_task2.pth', map_location=device)

actor.load_state_dict(checkpoint['actor_state_dict'])
critic.load_state_dict(checkpoint['critic_state_dict'])

opt_actor.load_state_dict(checkpoint['opt_actor_state_dict'])
opt_critic.load_state_dict(checkpoint['opt_critic_state_dict'])

# Environment instantiation for rendering
env_render = gym.make("racetrack-v0", render_mode="human")
env_render.unwrapped.configure(config)

obs, _ = env_render.reset()
obs = torch.tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)

done = False
step = 0

while not done and step < 300:
    dist = actor(obs)                                          # Forward pass
    action = dist.mean                                          # Deterministic policy (mean of Normal)
    obs_next, reward, done, truncated, _ = env_render.step(action.detach().cpu().numpy()[0])
    
    obs = torch.tensor(obs_next, dtype=torch.float32, device=device).unsqueeze(0)
    step += 1

env_render.close()