## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Fall 2025</center>

Welcome to Assignment 3, Part 1: Introduction to Actor-Critic Methods! It covers the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [1]:
!pip install ale-py -q
!pip install shimmy -q
!pip install "gymnasium[atari]" -q
!pip install "gymnasium[accept-rom-license]" -q
!apt-get update -y
!apt-get install -y swig
!pip install "gymnasium[box2d]" box2d box2d-py
!pip install "gymnasium[mujoco]" mujoco

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque
import ale_py
import shimmy
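# Import the wrappers from Gymnasium rather than the unmaintained `gym` package.
# Gymnasium >= 1.0 renamed FrameStack to FrameStackObservation; the alias keeps later cells unchanged.
from gymnasium.wrappers import AtariPreprocessing, FrameStackObservation as FrameStack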

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)


<torch._C.Generator at 0x7dfea34a36b0>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use separate optimizers for the Actor and the Critic. In this case, keep the actor and critic losses separate and backpropagate each one on its own.

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [2]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE
class SeparateActorCritic(nn.Module):
  def __init__(self, input_size, output_size, hidden_size = 128):
    super(SeparateActorCritic, self).__init__()

    # Actor
    self.actor_fc1 = nn.Linear(input_size, hidden_size)
    self.actor_fc2 = nn.Linear(hidden_size, output_size)

    # Critic
    self.critic_fc1 = nn.Linear(input_size, hidden_size)
    self.critic_fc2 = nn.Linear(hidden_size, 1)

    self.softmax = nn.Softmax(dim=-1)

  def forward(self, x):
    actor = F.relu(self.actor_fc1(x))
    actor_out = self.softmax(self.actor_fc2(actor))

    critic = F.relu(self.critic_fc1(x))
    critic_out = self.critic_fc2(critic)

    return actor_out, critic_out

# END_YOUR_CODE

#### Simulate training using dummy tensors

In [3]:
log_prob = torch.randn(7, requires_grad=True)
returns = torch.randn(7)
values = torch.randn(7, requires_grad=True)
entropy = torch.rand(1)

# Advantage
advantage = returns - values.detach()

# Loss
actor_loss = -(log_prob * advantage).mean()
critic_loss = F.mse_loss(values, returns)

#### Single optimizer for both actor and critic

In [4]:
total_loss = actor_loss + critic_loss
optimizer = optim.Adam([log_prob, values], lr=0.001)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print(f"Total loss for the single optimizer:  {total_loss.item():.3f}")

Total loss for the single optimizer:  3.693


#### Separate optimizers for actor and critic

In [5]:
actor_loss = -(log_prob * advantage).mean()
critic_loss = F.mse_loss(values, returns)

# Separate optimizers
actor_optimizer = optim.Adam([log_prob], lr=0.001)
critic_optimizer = optim.Adam([values], lr=0.001)

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()

print(f"Actor loss:  {actor_loss.item():.3f}")
print(f"Critic loss: {critic_loss.item():.3f}")

Actor loss:  1.298
Critic loss: 2.392


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

Sharing some layers has several advantages, which are discussed below after the implementation of `SharedActorCritic`. However, a completely separate architecture can make learning more stable, because the actor and critic gradients are isolated from each other.
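
The isolation claim can be checked directly. The following is a minimal sketch, assuming two small stand-in networks (`actor` and `critic` are hypothetical names, not the `SeparateActorCritic` class above) and `Categorical(logits=...)` in place of an explicit softmax layer. It shows that backpropagating the actor loss leaves the critic's parameters without gradients, provided the advantage is computed from detached values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for fully separate actor and critic networks.
actor = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
critic = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

obs = torch.randn(5, 8)
returns = torch.randn(5)

# Categorical(logits=...) applies the softmax internally.
dist = torch.distributions.Categorical(logits=actor(obs))
actions = dist.sample()
values = critic(obs).squeeze(-1)

# Detaching the values keeps the actor loss from reaching the critic.
advantage = returns - values.detach()
actor_loss = -(dist.log_prob(actions) * advantage).mean()
critic_loss = F.mse_loss(values, returns)

actor_loss.backward()
actor_has_grad = all(p.grad is not None for p in actor.parameters())
critic_has_grad_after_actor = any(p.grad is not None for p in critic.parameters())

critic_loss.backward()
critic_has_grad_after_critic = all(p.grad is not None for p in critic.parameters())

print(actor_has_grad, critic_has_grad_after_actor, critic_has_grad_after_critic)
```

With a shared base layer the same check would not hold, since both losses would write gradients into the shared weights.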

---

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [6]:
# BEGIN_YOUR_CODE

class SharedActorCritic(nn.Module):
  def __init__(self, input_size, output_size, hidden_size = 128):
    super(SharedActorCritic, self).__init__()

    self.shared_fc = nn.Linear(input_size, hidden_size)
    self.shared_relu = nn.ReLU()

    self.actor_head = nn.Linear(hidden_size, output_size)
    self.critic_head = nn.Linear(hidden_size, 1)
    self.softmax = nn.Softmax(dim=-1)

  def forward(self, x):
    shared = self.shared_relu(self.shared_fc(x))
    actor_out = self.softmax(self.actor_head(shared))
    critic_out = self.critic_head(shared)

    return actor_out, critic_out

# END_YOUR_CODE

In [7]:
batch_size = 7
input_size = 8
output_size = 2

model = SharedActorCritic(input_size, output_size)
obs = torch.randn(batch_size, input_size)

# Forward pass
actor_out, critic_out = model(obs)

returns = torch.randn(batch_size)
advantage = returns - critic_out.detach().squeeze()

# losses
log_probs = torch.log(actor_out)
actor_loss = -(log_probs.mean(dim=1) * advantage).mean()
critic_loss = F.mse_loss(critic_out.squeeze(), returns)

total_loss = actor_loss + critic_loss
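
The cell above averages the log-probabilities over all actions. In a policy-gradient update the actor loss is more commonly built from the log-probability of the action that was actually taken. The sketch below is one way to do that with `torch.distributions.Categorical`; the tensors `probs` and `advantage` are random stand-ins for `actor_out` and the advantage, and the variable names are illustrative only.

```python
import torch

torch.manual_seed(0)

# Stand-ins for the actor output (7 states, 2 actions) and the advantage.
probs = torch.softmax(torch.randn(7, 2), dim=-1)
advantage = torch.randn(7)

dist = torch.distributions.Categorical(probs=probs)
actions = dist.sample()                    # one sampled action per state, shape (7,)
log_prob_taken = dist.log_prob(actions)    # log pi(a_t | s_t), shape (7,)

# The same quantity computed by hand: pick the taken action's column.
manual = torch.log(probs).gather(1, actions.unsqueeze(1)).squeeze(1)

# Policy-gradient style actor loss using only the taken actions.
actor_loss = -(log_prob_taken * advantage).mean()
print(log_prob_taken.shape, actor_loss.item())
```

`dist.log_prob(actions)` and the manual `gather` agree, so either form can be used; the distribution object also provides `dist.entropy()` if an entropy bonus is wanted.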

In [8]:
optimizer = optim.Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print("Actor output (probabilities):")
print(actor_out)
print("\nCritic output:")
print(critic_out)
print(f"\nActor loss:  {actor_loss.item():.3f}")
print(f"Critic loss: {critic_loss.item():.3f}")
print(f"Total loss:  {total_loss.item():.3f}")

Actor output (probabilities):
tensor([[0.4978, 0.5022],
        [0.3885, 0.6115],
        [0.4719, 0.5281],
        [0.6003, 0.3997],
        [0.5541, 0.4459],
        [0.5080, 0.4920],
        [0.6093, 0.3907]], grad_fn=<SoftmaxBackward0>)

Critic output:
tensor([[ 0.1806],
        [-0.2898],
        [ 0.0814],
        [ 0.1792],
        [ 0.0350],
        [-0.0008],
        [ 0.2132]], grad_fn=<AddmmBackward0>)

Actor loss:  0.082
Critic loss: 0.322
Total loss:  0.404


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In this `SharedActorCritic` setup, the actor and critic share the base layers and have separate output heads. Sharing the base reduces the total number of learnable parameters, which can improve efficiency and shorten training time.
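
To make the parameter saving concrete, here is a small sketch that counts parameters for the two layouts using the sizes from the cells above (8 inputs, 128 hidden units, 2 actions). The `nn.ModuleDict` containers and the helper `count_params` are illustrative stand-ins, not the classes defined in this notebook.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size, output_size = 8, 128, 2

# Two independent networks, each with its own hidden layer.
separate = nn.ModuleDict({
    "actor": nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU(),
                           nn.Linear(hidden_size, output_size)),
    "critic": nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU(),
                            nn.Linear(hidden_size, 1)),
})

# One shared hidden layer feeding two heads.
shared = nn.ModuleDict({
    "base": nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU()),
    "actor_head": nn.Linear(hidden_size, output_size),
    "critic_head": nn.Linear(hidden_size, 1),
})

n_separate = count_params(separate)  # 1152 + 258 + 1152 + 129 = 2691
n_shared = count_params(shared)      # 1152 + 258 + 129 = 1539
print(n_separate, n_shared)
```

The saving comes entirely from the hidden layer that is no longer duplicated, so it grows with the size of the shared base.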

---

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```
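
For the continuous case, the mean and log std are typically turned into a Gaussian policy from which actions are sampled. The following is a minimal sketch under stated assumptions: the `mean` tensor is a random stand-in for a mean head's output, the log std is state-independent, and an action dimension of 6 is used for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

batch_size, action_dim = 4, 6  # illustrative sizes

mean = torch.randn(batch_size, action_dim)           # stand-in for the mean head output
log_std = torch.zeros(action_dim).expand_as(mean)    # state-independent log std, here log(1)

# exp() turns the unconstrained log std into a positive standard deviation.
dist = Normal(mean, log_std.exp())
action = dist.sample()

# Action dimensions are treated as independent, so per-dimension terms are summed.
log_prob = dist.log_prob(action).sum(dim=-1)   # shape (batch_size,)
entropy = dist.entropy().sum(dim=-1)           # shape (batch_size,)
print(action.shape, log_prob.shape, entropy[0].item())
```

Parameterizing the log std instead of the std itself keeps the standard deviation positive without any explicit constraint on the parameter.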

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [9]:
# BEGIN_YOUR_CODE

def create_shared_network(env):
  observ_space = env.observation_space
  action_space = env.action_space

  if isinstance(observ_space, gym.spaces.Discrete):
    input_size = observ_space.n
    one_hot_encode = True
  else:
    input_size = int(np.prod(observ_space.shape))
    one_hot_encode = False

  if isinstance(action_space, gym.spaces.Discrete):
    output_size = action_space.n
    continuous = False
  else:
    output_size = action_space.shape[0]
    continuous = True

  class SharedActorCritic(nn.Module):
    def __init__(self, input_size, output_size, hidden_size = 128):
      super(SharedActorCritic, self).__init__()
      self.input_size = input_size
      self.output_size = output_size
      self.one_hot_encode = one_hot_encode
      self.continuous = continuous

      # Shared layers
      self.shared_fc = nn.Linear(input_size, hidden_size)
      self.shared_relu = nn.ReLU()

      # actor
      if self.continuous:
        self.mean_head = nn.Linear(hidden_size, output_size)
        self.log_std = nn.Parameter(torch.zeros(output_size))
      else:
        self.actor_head = nn.Linear(hidden_size, output_size)
        self.softmax = nn.Softmax(dim=-1)

      # critic
      self.critic_head = nn.Linear(hidden_size, 1)

    def forward(self, x):

      if self.one_hot_encode:
        if x.dim() > 1:
          x = x.squeeze(-1)
        x = F.one_hot(x.long(), num_classes=self.input_size).float()
      else:
        x = x.view(x.size(0), -1)

      # shared
      shared = self.shared_relu(self.shared_fc(x))

      # actor
      if self.continuous:
        mean = self.mean_head(shared)
        log_std = self.log_std.expand_as(mean)
        actor_out = (mean, log_std)
      else:
        actor_out = self.softmax(self.actor_head(shared))

      # critic
      critic_out = self.critic_head(shared)

      return actor_out, critic_out

  return SharedActorCritic(input_size, output_size)

# END_YOUR_CODE

In [12]:
print(gym.make("CliffWalking-v1").observation_space)
print(gym.make("LunarLander-v3").observation_space)
print(gym.make("PongNoFrameskip-v4").observation_space)
print(gym.make("HalfCheetah-v5").observation_space)

Discrete(48)
Box([ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ], [ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ], (8,), float32)
Box(0, 255, (210, 160, 3), uint8)
Box(-inf, inf, (17,), float64)


In [13]:
env_names = ["CliffWalking-v1", "LunarLander-v3", "HalfCheetah-v5"]

for name in env_names:
    print(f"\n\nTesting the {name} Environment")

    if name == "PongNoFrameskip-v4":
        env = gym.make(name,  frameskip=1)
        env = AtariPreprocessing(env)
        env = FrameStack(env, 4)
    else:
        env = gym.make(name)

    network = create_shared_network(env)
    state, info = env.reset()

    # Get state as tensor
    state_tensor = torch.tensor(state).unsqueeze(0).float()

    # Forward pass
    actor_out, critic_out = network(state_tensor)

    print("Actor Output:", actor_out)
    print("\nCritic Output:", critic_out)
    print(f'\n{network}')



Testing the CliffWalking-v1 Environment
Actor Output: tensor([[0.2535, 0.2293, 0.2620, 0.2553]], grad_fn=<SoftmaxBackward0>)

Critic Output: tensor([[-0.0076]], grad_fn=<AddmmBackward0>)

SharedActorCritic(
  (shared_fc): Linear(in_features=48, out_features=128, bias=True)
  (shared_relu): ReLU()
  (actor_head): Linear(in_features=128, out_features=4, bias=True)
  (softmax): Softmax(dim=-1)
  (critic_head): Linear(in_features=128, out_features=1, bias=True)
)


Testing the LunarLander-v3 Environment
Actor Output: tensor([[0.2551, 0.2392, 0.2566, 0.2491]], grad_fn=<SoftmaxBackward0>)

Critic Output: tensor([[-0.1628]], grad_fn=<AddmmBackward0>)

SharedActorCritic(
  (shared_fc): Linear(in_features=8, out_features=128, bias=True)
  (shared_relu): ReLU()
  (actor_head): Linear(in_features=128, out_features=4, bias=True)
  (softmax): Softmax(dim=-1)
  (critic_head): Linear(in_features=128, out_features=1, bias=True)
)


Testing the HalfCheetah-v5 Environment
Actor Output: (tensor([[ 0.11

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [14]:
gym.register_envs(ale_py)

from gymnasium.envs.registration import registry
pong_envs = [env_id for env_id in registry if "Pong" in env_id]
print("Available Pong environments:", pong_envs)

Available Pong environments: ['Pong-v0', 'Pong-v4', 'PongNoFrameskip-v0', 'PongNoFrameskip-v4', 'ALE/Pong-v5']


In [15]:
# BEGIN_YOUR_CODE

def normalize_observation(obs, env):
    obs_space = env.observation_space

    if not isinstance(obs_space, gym.spaces.Box):
        return obs

    is_tensor = isinstance(obs, torch.Tensor)
    if is_tensor:
        obs_np = obs.detach().cpu().numpy().astype(np.float32)
    else:
        obs_np = np.array(obs, dtype=np.float32)

    low = obs_space.low
    high = obs_space.high

    if not (np.all(np.isfinite(low)) and np.all(np.isfinite(high))):
        return obs

    is_pixel_space = (
        np.issubdtype(obs_space.dtype, np.integer)
        or (low.min() == 0 and high.max() == 255)
    )
    if is_pixel_space:
        norm = obs_np / 255.0
    else:
        denom = (high - low)
        denom[denom == 0] = 1.0
        norm = 2.0 * (obs_np - low) / denom - 1.0

    if is_tensor:
        norm_tensor = torch.from_numpy(norm).to(dtype=obs.dtype, device=obs.device)
        return norm_tensor
    else:
        return norm.astype(np.float32)

# LunarLander-v3
ll_env = gym.make("LunarLander-v3")
ll_obs, ll_info = ll_env.reset()
ll_obs_norm = normalize_observation(ll_obs, ll_env)
print("LunarLander-v3:")
print("  raw obs:        ", ll_obs)
print("  normalized obs: ", ll_obs_norm)
print("  normalized range: [", ll_obs_norm.min(), ",", ll_obs_norm.max(), "]\n")

# PongNoFrameskip-v4
pong_env = gym.make("PongNoFrameskip-v4")
pong_obs, pong_info = pong_env.reset()

print(f"PongNoFrameskip-v4:")
print(f"Raw observation shape:  {pong_obs.shape}")
print(f"Raw observation dtype:  {pong_obs.dtype}")
print(f"Raw observation range:  [{pong_obs.min()}, {pong_obs.max()}]")
print(f"Sample pixel at [100,80]: {pong_obs[100, 80]}")

pong_obs_norm = normalize_observation(pong_obs, pong_env)

print(f"\nNormalized shape:       {pong_obs_norm.shape}")
print(f"Normalized dtype:       {pong_obs_norm.dtype}")
print(f"Normalized range:       [{pong_obs_norm.min():.3f}, {pong_obs_norm.max():.3f}]")
print(f"Sample normalized pixel at [100,80]: {pong_obs_norm[100, 80]}")
print(f"Observation space:      {pong_env.observation_space}")

pong_env.close()


# END_YOUR_CODE

LunarLander-v3:
  raw obs:         [ 0.00409327  1.4108183   0.41459098 -0.00452601 -0.0047363  -0.09391101
  0.          0.        ]
  normalized obs:  [ 1.6372204e-03  5.6432736e-01  4.1459084e-02 -4.5263767e-04
 -7.5381994e-04 -9.3911290e-03 -1.0000000e+00 -1.0000000e+00]
  normalized range: [ -1.0 , 0.56432736 ]

PongNoFrameskip-v4:
Raw observation shape:  (210, 160, 3)
Raw observation dtype:  uint8
Raw observation range:  [0, 228]
Sample pixel at [100,80]: [109 118  43]

Normalized shape:       (210, 160, 3)
Normalized dtype:       float32
Normalized range:       [0.000, 0.894]
Sample normalized pixel at [100,80]: [0.42745098 0.4627451  0.16862746]
Observation space:      Box(0, 255, (210, 160, 3), uint8)


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

The normalization function is built to handle two common types of observations in reinforcement learning: low-dimensional state vectors and high-dimensional image observations. It first checks whether the environment's observation space is a Box with finite lower and upper bounds. If the space is not a Box or has infinite bounds, the function simply returns the original observation. This avoids making unsafe assumptions about unbounded or non-numeric observations and keeps the preprocessing focused on cases where scaling is well defined.

When the observation comes from a vector-valued Box space, such as in LunarLander-v3, the function uses the environment's low and high arrays to normalize each dimension into the range [−1, 1]. This is useful because different state variables (position, velocity, angle, leg contacts, etc.) can naturally live on very different scales. Mapping all of them into a common range helps neural networks train more smoothly: gradients are less dominated by any single large-scale feature, and learning tends to be more stable. This setup is preferred for classic control and continuous-state environments where the state is a relatively small numeric vector with meaningful physical bounds.

For pixel-based environments like PongNoFrameskip-v4, the observation space is a Box(0, 255, (210, 160, 3), uint8), representing RGB images. In this case, the function detects an integer Box with 0–255 bounds and normalizes by dividing by 255, which maps pixel values into the range [0, 1] while preserving the image shape. This is a standard preprocessing step in vision tasks and makes the input more suitable for gradient-based optimization, since working with large raw integers can slow down or destabilize training. This setup is preferred whenever the agent receives raw images from the environment, especially in Atari-style games.

Overall, the two setups, [−1, 1] scaling for vector states and [0, 1] scaling for pixel observations, provide a simple and practical normalization strategy that aligns with how these different types of inputs are usually handled in practice. The same function can be reused across both types of environments, reducing code duplication and making it easier to plug different tasks into the same training pipeline.
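
The two vector scalings differ only by an affine step. The sketch below applies both to the LunarLander-v3 observation printed above, using the `low` and `high` bounds shown earlier for that environment. The variable names `unit` and `symmetric` are illustrative.

```python
import numpy as np

# Bounds and sample observation copied from the LunarLander-v3 output above.
low = np.array([-2.5, -2.5, -10.0, -10.0, -6.2831855, -10.0, 0.0, 0.0], dtype=np.float32)
high = np.array([2.5, 2.5, 10.0, 10.0, 6.2831855, 10.0, 1.0, 1.0], dtype=np.float32)
obs = np.array([0.00409327, 1.4108183, 0.41459098, -0.00452601,
                -0.0047363, -0.09391101, 0.0, 0.0], dtype=np.float32)

# Min-max scaling: maps each dimension from [low, high] to [0, 1].
unit = (obs - low) / (high - low)

# Shift and stretch to [-1, 1], the range used by normalize_observation above.
symmetric = 2.0 * unit - 1.0

print(unit)
print(symmetric)
```

The leg-contact flags (last two entries) sit at their lower bound, so they map to 0 under the [0, 1] scaling and to -1 under the [-1, 1] scaling, which matches the normalized output shown earlier.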


## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it's applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [16]:
# BEGIN_YOUR_CODE

batch_size = 7
input_size = 8
output_size = 2
model = SharedActorCritic(input_size, output_size)

obs = torch.randn(batch_size, input_size)
returns = torch.randn(batch_size)

actor_out, critic_out = model(obs)

advantage = returns - critic_out.detach().squeeze()
log_probs = torch.log(actor_out + 1e-8)
actor_loss = -(log_probs.mean(dim=1) * advantage).mean()
critic_loss = F.mse_loss(critic_out.squeeze(), returns)
total_loss = actor_loss + critic_loss

optimizer = optim.Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()
total_loss.backward()

print("Single Optimizer Training Step:")
print(f"Actor Loss: {actor_loss.item()}")
print(f"Critic Loss: {critic_loss.item()}")
print(f"Total Loss: {total_loss.item()}")

total_norm_before = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm_before += param_norm.item() ** 2
total_norm_before = total_norm_before ** 0.5

print(f"Gradient norm before clipping: {total_norm_before}")

max_grad_norm = 0.5
grad_norm_reported = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

print(f"Gradient norm reported by clip_grad_norm_ (before clipping): {grad_norm_reported}")

total_norm_after = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm_after += param_norm.item() ** 2
total_norm_after = total_norm_after ** 0.5

print(f"Gradient norm after clipping: {total_norm_after}")

optimizer.step()

# END_YOUR_CODE

Single Optimizer Training Step:
Actor Loss: 0.2765311300754547
Critic Loss: 1.071527361869812
Total Loss: 1.3480584621429443
Gradient norm before clipping: 4.646477602687968
Gradient norm reported by clip_grad_norm_ (before clipping): 4.646477699279785
Gradient norm after clipping: 0.4999998964919928


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:


The code uses the `SharedActorCritic` network and a single optimizer step to show how gradient clipping works. The network takes an 8-dimensional observation, passes it through shared layers, and then splits into two heads: one head outputs action probabilities (actor), and the other predicts a value estimate (critic). The actor loss is computed using a policy-gradient style term with an advantage (return - value), and the critic loss is the mean-squared error between the predicted values and the target returns. These two losses are added into a single total loss, and `total_loss.backward()` is called so that gradients are computed for all parts of the model.

After backpropagation, the code calculates the global L2 norm of all gradients before clipping, applies `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)`, and then measures the gradient norm again. In the example run, the gradient norm starts at about 4.65 and is reduced to about 0.5 after clipping, which shows that gradient clipping is limiting the size of the update. This kind of setup, a shared actor–critic network with a single optimizer and gradient norm clipping, is useful in reinforcement learning when gradients can suddenly become large due to noisy returns or advantages. Clipping keeps the training step under control and helps prevent unstable updates, while still allowing the model to learn from both the actor and critic losses at the same time.
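
The rescaling itself is simple: every gradient is multiplied by roughly `max_norm / global_norm` whenever the global norm exceeds `max_norm`. The sketch below illustrates this on two hypothetical parameters with hand-set gradients, chosen so the global norm is exactly 5. It is an illustration of the arithmetic, not a replacement for the cell above.

```python
import torch

# Two hypothetical parameters with hand-set gradients (illustration only).
w1 = torch.nn.Parameter(torch.zeros(3))
w2 = torch.nn.Parameter(torch.zeros(4))
w1.grad = torch.tensor([3.0, 0.0, 0.0])
w2.grad = torch.tensor([0.0, 4.0, 0.0, 0.0])

max_norm = 0.5

# Global L2 norm over all gradients: sqrt(3^2 + 4^2) = 5.
global_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in (w1, w2)))
expected_scale = max_norm / global_norm.item()   # about 0.1

# clip_grad_norm_ rescales the gradients in place and returns the norm before clipping.
reported = torch.nn.utils.clip_grad_norm_([w1, w2], max_norm=max_norm)
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in (w1, w2)))

print(reported.item(), clipped_norm.item(), w1.grad[0].item())
```

Because all gradients are scaled by the same factor, clipping changes the length of the update but not its direction.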



If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
| Ziyad Shahin, Jahnavi Gubbala  | Task 1 |  100% |
| Ziyad Shahin, Jahnavi Gubbala  | Task 2 |  100%  |
|  Ziyad Shahin, Jahnavi Gubbala | Task 3 |  100% |
|  Ziyad Shahin, Jahnavi Gubbala | Task 4 |  100% |
|   | **Total** |   |
