## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to Assignment 3, Part 1: Introduction to Actor-Critic Methods! It covers the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7976783b6e50>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use separate optimizers for the Actor and the Critic. In this case, keep the actor and critic losses separate and backpropagate each one on its own.

```python
# TODO: Simulate loss computation and backpropagation
```
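When simulating the losses with dummy tensors, tensor shapes deserve a quick check. The sketch below uses hypothetical dummy values (not the assignment's solution) to show why the advantage is usually squeezed before it is multiplied with the log-probabilities: a `(B,)` tensor times a `(B, 1)` tensor silently broadcasts to `(B, B)`.

```python
import torch

torch.manual_seed(0)
batch_size = 5  # hypothetical batch size, chosen only for illustration

log_probs = torch.randn(batch_size)     # shape (5,), like Categorical.log_prob output
returns = torch.randn(batch_size, 1)    # shape (5, 1)
values = torch.randn(batch_size, 1)     # shape (5, 1), like a critic output

advantages = returns - values           # shape (5, 1)

# (5,) * (5, 1) broadcasts to (5, 5): every log-prob is paired with every
# advantage, which is not the intended per-sample product.
wrong = log_probs * advantages

# Squeezing the trailing dimension restores the per-sample product.
right = log_probs * advantages.squeeze(-1)

print(wrong.shape)  # torch.Size([5, 5])
print(right.shape)  # torch.Size([5])
```

Taking `.mean()` of the `(5, 5)` product would still return a scalar, so this mistake does not raise an error; checking shapes is the only way to catch it.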

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [2]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE

class SeparateActorCritic(nn.Module):
    def __init__(self, stdim, actdim, hiddensiz=128):
        super(SeparateActorCritic, self).__init__()
        self.actor = nn.Sequential(
            nn.Linear(stdim, hiddensiz),
            nn.ReLU(),
            nn.Linear(hiddensiz, actdim),
            nn.Softmax(dim=-1)
        )

        self.critic = nn.Sequential(
            nn.Linear(stdim, hiddensiz),
            nn.ReLU(),
            nn.Linear(hiddensiz, 1)
        )

    def forward(self, state):
        actn_p = self.actor(state)
        stva = self.critic(state)
        return actn_p, stva

# END_YOUR_CODE

In [3]:

stdim = 4
actdim = 2
batchsiz = 5
hiddensiz = 128

model = SeparateActorCritic(stdim, actdim, hiddensiz)
states = torch.randn(batchsiz, stdim)
actions = torch.randint(0, actdim, (batchsiz,))
returns = torch.randn(batchsiz, 1)
actn_p, values = model(states)
dist = torch.distributions.Categorical(actn_p)
logp = dist.log_prob(actions)
entropies = dist.entropy()


advantages = returns - values.detach()


In [4]:

actoloss = -(logp * advantages.squeeze()).mean()
critiloss = F.mse_loss(values, returns)
entropyb = entropies.mean()
totalloss = actoloss + critiloss - 0.01 * entropyb
optimizer = optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()
totalloss.backward()
optimizer.step()

print("completed.")


completed.


In [5]:

actn_p, values = model(states)
dist = torch.distributions.Categorical(actn_p)
logp = dist.log_prob(actions)
entropies = dist.entropy()
advantages = returns - values.detach()

actoloss = -(logp * advantages.squeeze()).mean()
critiloss = F.mse_loss(values, returns)
actorparams = list(model.actor.parameters())
criticparams = list(model.critic.parameters())
actoropti = optim.Adam(actorparams, lr=1e-3)
critiopti = optim.Adam(criticparams, lr=1e-3)
actoropti.zero_grad()
actoloss.backward(retain_graph=True)
actoropti.step()
critiopti.zero_grad()
critiloss.backward()
critiopti.step()

print(" separate optimizers completed.")


 separate optimizers completed.


In [6]:
print(model)
print("\nActor Network:")
print(model.actor)

print("\nCritic Network:")
print(model.critic)

SeparateActorCritic(
  (actor): Sequential(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=2, bias=True)
    (3): Softmax(dim=-1)
  )
  (critic): Sequential(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=1, bias=True)
  )
)

Actor Network:
Sequential(
  (0): Linear(in_features=4, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=2, bias=True)
  (3): Softmax(dim=-1)
)

Critic Network:
Sequential(
  (0): Linear(in_features=4, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=1, bias=True)
)


In [7]:
print(f"Action Probabilities: \n{actn_p}")
print(f"Log Probabilities: \n{logp}")
print(f"Actor Loss: {actoloss.item()}")
print(f"Critic Loss: {critiloss.item()}")
print(f"Entropy Bonus: {entropyb.item()}")
print(f"Total Loss (Actor + Critic + Entropy): {totalloss.item()}")


Action Probabilities: 
tensor([[0.5427, 0.4573],
        [0.4347, 0.5653],
        [0.4420, 0.5580],
        [0.4095, 0.5905],
        [0.3245, 0.6755]], grad_fn=<SoftmaxBackward0>)
Log Probabilities: 
tensor([-0.6113, -0.8332, -0.5833, -0.8927, -0.3924],
       grad_fn=<SqueezeBackward1>)
Actor Loss: 0.23915183544158936
Critic Loss: 1.0585072040557861
Entropy Bonus: 0.6783283352851868
Total Loss (Actor + Critic + Entropy): 1.365321397781372


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [8]:
# With a single (shared) optimizer, the actor and critic losses are combined into one total loss,
# and both networks are updated together from that loss.
# The critic loss is the MSE between predicted values and returns, which gives
# a stable objective for value-function approximation.
# A single optimizer simplifies training and works well when the actor and critic dynamics are closely related.
# Separate optimizers offer more control, such as independent learning rates for the actor and the critic,
# which is useful in complex environments.
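
A middle ground between the two setups is a single optimizer with per-network parameter groups. The sketch below is illustrative only: the small `actor` and `critic` modules and both learning rates are assumptions, not values taken from the assignment.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical small actor and critic, defined only for this example.
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

# One optimizer, two parameter groups, each with its own (assumed) learning rate.
optimizer = optim.Adam([
    {"params": actor.parameters(), "lr": 3e-4},
    {"params": critic.parameters(), "lr": 1e-3},
])

for i, group in enumerate(optimizer.param_groups):
    print(f"group {i}: lr={group['lr']}")
```

This keeps one `zero_grad()` / `backward()` / `step()` cycle on the combined loss while still letting the critic learn at a different rate than the actor.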


### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [9]:
# BEGIN_YOUR_CODE
class SharedActorCritic(nn.Module):
    def __init__(self, stdim, actdim, hiddensiz=128):
        super(SharedActorCritic, self).__init__()

        self.shared_base = nn.Sequential(
            nn.Linear(stdim, hiddensiz),
            nn.ReLU(),
        )
        self.actor_head = nn.Linear(hiddensiz, actdim)
        self.critic_head = nn.Linear(hiddensiz, 1)

    def forward(self, state):
        so = self.shared_base(state)
        actn_p = torch.softmax(self.actor_head(so), dim=-1)
        stva = self.critic_head(so)
        return actn_p, stva


# END_YOUR_CODE

In [10]:
import torch
import torch.nn.functional as F
import torch.optim as optim

stdim = 4
actdim = 2
batchsiz = 5
hiddensiz = 128
model = SharedActorCritic(stdim, actdim, hiddensiz)
states = torch.randn(batchsiz, stdim)
actions = torch.randint(0, actdim, (batchsiz,))
returns = torch.randn(batchsiz, 1)
actn_p, values = model(states)
dist = torch.distributions.Categorical(actn_p)
logp = dist.log_prob(actions)
entropies = dist.entropy()
advantages = returns - values.detach()
actoloss = -(logp * advantages.squeeze()).mean()
critiloss = F.mse_loss(values, returns)
entropyb = entropies.mean()
totalloss = actoloss + critiloss - 0.01 * entropyb
optimizer = optim.Adam(model.parameters(), lr=1e-3)
optimizer.zero_grad()
totalloss.backward()
optimizer.step()

print("Shared network loss computation and backpropagation completed.")


Shared network loss computation and backpropagation completed.


In [11]:
print("Actor Loss:", actoloss.item())
print("Critic Loss:", critiloss.item())
entropyb = entropies.mean()
print("Entropy Bonus:", entropyb.item())
totalloss = actoloss + critiloss - 0.01 * entropyb
print("Total Loss:", totalloss.item())
for name, param in model.named_parameters():
    print(f"{name}: {param.data}")

Actor Loss: 0.15495535731315613
Critic Loss: 0.8752533793449402
Entropy Bonus: 0.6888728141784668
Total Loss: 1.023319959640503
shared_base.0.weight: tensor([[ 0.0361,  0.2982, -0.4571, -0.4232],
        [-0.2571,  0.3197,  0.3343, -0.3405],
        [ 0.0943, -0.2639, -0.2331,  0.4322],
        [ 0.2405, -0.1357,  0.0726,  0.0959],
        [ 0.1161, -0.1701, -0.2954,  0.3320],
        [ 0.0662,  0.2103, -0.1904,  0.2412],
        [ 0.3629,  0.3051,  0.1349,  0.4909],
        [-0.0011,  0.1064,  0.4751,  0.4859],
        [-0.2121,  0.4514,  0.1835, -0.0550],
        [ 0.4376, -0.2600,  0.3259,  0.4965],
        [-0.3439, -0.2960, -0.4945, -0.0910],
        [-0.4531, -0.1665,  0.1528,  0.3524],
        [ 0.2795,  0.2765,  0.2084, -0.3091],
        [-0.3970,  0.0207,  0.4748,  0.0282],
        [-0.0430, -0.3394, -0.2479, -0.3854],
        [ 0.4013, -0.0825, -0.4407,  0.0546],
        [ 0.0353, -0.0608,  0.1780,  0.2699],
        [-0.4692, -0.3820, -0.0230, -0.4646],
        [-0.4552,  0.4

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

The shared actor-critic architecture is parameter-efficient. Instead of two separate networks, the policy (actor) and value (critic) heads share one feature extractor, so there are fewer parameters and the model can learn faster.
It is also computationally cheaper, because the shared features are computed once per forward pass. Sharing helps most in environments where value estimation and policy decisions rely on the same aspects of the input. Training one representation on both objectives also acts as a form of regularization, which can reduce overfitting.
In more complex environments, however, separate networks may perform better.
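
The coupling in a shared network can be made visible by backpropagating only the critic loss and checking which parameters receive gradients. This is a minimal sketch using a hypothetical `TinyShared` module (smaller than the notebook's model): the shared base receives gradients from the critic loss, while the actor head does not.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyShared(nn.Module):
    """Hypothetical shared-base model, used only to illustrate gradient flow."""
    def __init__(self, state_dim=4, action_dim=2, hidden=16):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor_head = nn.Linear(hidden, action_dim)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.base(x)
        return torch.softmax(self.actor_head(h), dim=-1), self.critic_head(h)

model = TinyShared()
states = torch.randn(5, 4)
returns = torch.randn(5, 1)

# Backpropagate the critic loss alone.
_, values = model(states)
critic_loss = F.mse_loss(values, returns)
critic_loss.backward()

base_grad = model.base[0].weight.grad
actor_grad = model.actor_head.weight.grad
print("shared base received a gradient:", base_grad is not None)
print("actor head received a gradient: ", actor_grad is not None)
```

Because the value loss reshapes the features the policy also depends on, combined losses in shared networks are often weighted; any such weight is a tuning choice rather than something prescribed here.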



## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [12]:
%pip install ale-py
%pip install autorom==0.6.1
!AutoROM --accept-license
%pip install "gymnasium[atari,accept-rom-license]"
%pip show gymnasium
%pip install "gymnasium[mujoco]"
%pip install swig
%pip install "gymnasium[box2d]"

import gymnasium as gym
import ale_py

gym.register_envs(ale_py)
%pip install "gymnasium[other]"

Collecting autorom==0.6.1
  Downloading AutoROM-0.6.1-py3-none-any.whl.metadata (2.4 kB)
Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Installing collected packages: autorom
Successfully installed autorom-0.6.1
AutoROM will download the Atari 2600 ROMs.
They will be installed to:
	/usr/local/lib/python3.11/dist-packages/AutoROM/roms

Existing ROMs will be overwritten.
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/adventure.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/air_raid.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/alien.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/amidar.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/assault.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/asterix.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/asteroids.bin
Installed /usr/local/lib/python3.11/dist-packages/AutoROM/roms/atlantis.bin
Installed /usr/local

In [13]:
import torch
import torch.nn as nn
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Box, Discrete
from gymnasium.wrappers import AtariPreprocessing, ResizeObservation

class SharedActorCritic(nn.Module):
    def __init__(self, input_shape, action_space, hidden_dim=128):
        super().__init__()
        self.action_space = action_space
        self.use_cnn = len(input_shape) == 3

        if self.use_cnn:
            self.conv = nn.Sequential(
                nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1),
                nn.ReLU(),
                nn.Flatten()
            )

            with torch.no_grad():
                dummy = torch.zeros(1, *input_shape)
                conv_out_size = self.conv(dummy).shape[1]
            self.shared = nn.Sequential(
                nn.Linear(conv_out_size, hidden_dim),
                nn.ReLU()
            )
        else:
            self.shared = nn.Sequential(
                nn.Linear(int(np.prod(input_shape)), hidden_dim),
                nn.ReLU()
            )

        if isinstance(action_space, Discrete):
            self.actor = nn.Sequential(
                nn.Linear(hidden_dim, action_space.n),
                nn.Softmax(dim=-1)
            )
        else:
            act_dim = action_space.shape[0]
            self.actor_mean = nn.Linear(hidden_dim, act_dim)
            self.actor_log_std = nn.Parameter(torch.zeros(act_dim))
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        if self.use_cnn:
            x = x / 255.0
            x = self.conv(x)
        else:
            x = x.view(x.size(0), -1)
        x = self.shared(x)

        if isinstance(self.action_space, Discrete):
            return self.actor(x), self.critic(x)
        else:
            mean = self.actor_mean(x)
            std = self.actor_log_std.exp().expand_as(mean)
            return (mean, std), self.critic(x)


In [14]:
def create_shared_network(env):
    obs_space = env.observation_space
    act_space = env.action_space

    if isinstance(obs_space, Discrete):
        input_shape = (obs_space.n,)
        one_hot = True
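    elif isinstance(obs_space, Box):
        # Note: a 2-D grayscale frame such as (84, 84) has no channel axis, so
        # SharedActorCritic treats it as a flat vector (MLP branch), not as an image.
        input_shape = obs_space.shape
        one_hot = False
    else:
        raise NotImplementedError(f"Unsupported observation space: {type(obs_space).__name__}")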

    model = SharedActorCritic(input_shape, act_space)
    return model, one_hot


In [15]:
def preprocess_obs(obs, one_hot, obs_space):
    if one_hot:
        vec = np.zeros(obs_space.n, dtype=np.float32)
        vec[obs] = 1.0
        return torch.tensor(vec).unsqueeze(0)
    elif len(obs.shape) == 2:
        obs = np.expand_dims(obs, axis=0)
        return torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
    elif len(obs.shape) == 3:
        obs = np.transpose(obs, (2, 0, 1))
        return torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
    else:
        return torch.tensor(obs, dtype=torch.float32).unsqueeze(0)


In [16]:
env_ids = [
    "CliffWalking-v0",
    "LunarLander-v3",
    "PongNoFrameskip-v4",
    "HalfCheetah-v5"
]

for env_id in env_ids:
    print(f"\nTesting {env_id}")

    if "Pong" in env_id:
        env = gym.make(env_id)
        env = AtariPreprocessing(env, grayscale_obs=True)
        env = ResizeObservation(env, shape=(84, 84))
    else:
        env = gym.make(env_id)

    model, one_hot = create_shared_network(env)
    obs, _ = env.reset()
    input_tensor = preprocess_obs(obs, one_hot, env.observation_space)

    with torch.no_grad():
        output = model(input_tensor)

    if isinstance(env.action_space, Discrete):
        print("Action probabilities:", output[0])
    else:
        print(" Action mean/std:", output[0])

    print("Value estimate:", output[1])
    env.close()



Testing CliffWalking-v0
Action probabilities: tensor([[0.2409, 0.2486, 0.2339, 0.2766]])
Value estimate: tensor([[-0.0473]])

Testing LunarLander-v3
Action probabilities: tensor([[0.2224, 0.2819, 0.2542, 0.2415]])
Value estimate: tensor([[0.0667]])

Testing PongNoFrameskip-v4
Action probabilities: tensor([[1.3875e-36, 1.0000e+00, 2.1152e-40, 3.0886e-19, 2.4227e-20, 1.5432e-35]])
Value estimate: tensor([[3.8886]])

Testing HalfCheetah-v5
 Action mean/std: (tensor([[ 0.0579, -0.0120,  0.1169, -0.0969, -0.0237, -0.0091]]), tensor([[1., 1., 1., 1., 1., 1.]]))
Value estimate: tensor([[-0.0988]])


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

The observation and action spaces determine how each network is built:

CliffWalking-v0: One-hot encoding avoids imposing a false ordering on the discrete states, so every state has a clearly separated input.

LunarLander-v3: The standard Box observations are fed directly into the network, which suits low-dimensional continuous input.

PongNoFrameskip-v4: The inputs are images. Preprocessing (grayscale conversion and resizing to 84×84) reduces complexity while preserving spatial structure.

HalfCheetah-v5: Continuous actions require outputting a mean and a log std for a Gaussian policy, which suits smooth control.

In all four cases, the shared base with separate actor and critic heads improves efficiency and feature learning.
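
For the continuous case, the mean and std returned by the network can be turned into actions with `torch.distributions.Normal`. The sketch below assumes a 6-dimensional action and uses zero-valued placeholder outputs instead of a real network; log-probabilities and entropies are summed over action dimensions, which treats the dimensions as independent.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Placeholder network outputs (assumed values, not produced by a trained model).
mean = torch.zeros(1, 6)     # batch of 1, six action dimensions
log_std = torch.zeros(6)     # state-independent log std, one per dimension
std = log_std.exp().expand_as(mean)

dist = Normal(mean, std)
action = dist.sample()                         # shape (1, 6)
log_prob = dist.log_prob(action).sum(dim=-1)   # joint log-prob, shape (1,)
entropy = dist.entropy().sum(dim=-1)           # joint entropy, shape (1,)

print(action.shape, log_prob.shape, entropy.shape)
```

The summed `log_prob` plays the same role in the actor loss as `Categorical.log_prob` does for discrete actions.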



### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [17]:
# BEGIN_YOUR_CODE
import numpy as np

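def normalize_observation(obs, env):
    obs_space = env.observation_space

    # Only Box spaces expose the `low`/`high` bounds needed for scaling.
    if not isinstance(obs_space, gym.spaces.Box):
        return obs

    # Non-image spaces whose upper bounds are all <= 1 are returned unchanged.
    if obs_space.dtype != np.uint8 and np.max(obs_space.high) <= 1.0:
        return obs

    # Image observations: map uint8 pixel values from [0, 255] to [0, 1].
    if obs.dtype == np.uint8:
        return obs.astype(np.float32) / 255.0

    # Vector observations: per-feature min-max scaling to [0, 1];
    # the small epsilon guards against division by zero.
    return (obs - obs_space.low) / (obs_space.high - obs_space.low + 1e-8)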


# END_YOUR_CODE

In [18]:
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing, ResizeObservation


env1 = gym.make("LunarLander-v3")
obs1, _ = env1.reset()
norm_obs1 = normalize_observation(obs1, env1)

print("\nLunarLander-v3")
print("Raw observation:", obs1)
print("Normalized:", norm_obs1)


env2 = gym.make("PongNoFrameskip-v4")
env2 = AtariPreprocessing(env2, grayscale_obs=True)
env2 = ResizeObservation(env2, shape=(84, 84))
obs2, _ = env2.reset()
norm_obs2 = normalize_observation(obs2, env2)

print("\nPongNoFrameskip-v4")
print("Raw observation shape:", obs2.shape, "dtype:", obs2.dtype)
print("Normalized shape:", norm_obs2.shape, "min:", np.min(norm_obs2), "max:", np.max(norm_obs2))



LunarLander-v3
Raw observation: [ 4.2390823e-04  1.4205443e+00  4.2914189e-02  4.2773551e-01
 -4.8433049e-04 -9.7207259e-03  0.0000000e+00  0.0000000e+00]
Normalized: [0.50008476 0.7841088  0.5021457  0.52138674 0.49996144 0.49951395
 0.         0.        ]

PongNoFrameskip-v4
Raw observation shape: (84, 84) dtype: uint8
Normalized shape: (84, 84) min: 0.20392157 max: 0.9254902


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

Normalization is done to keep training stable and efficient. Different environments call for different normalization:

LunarLander-v3 gives a vector of continuous values, each on a different scale. Without normalization, large-magnitude features can dominate learning. Scaling with the environment's `low` and `high` bounds into [0, 1] makes all features contribute comparably.

PongNoFrameskip-v4 gives image-based observations as uint8 pixel arrays ranging from 0 to 255, which are divided by 255 to map them into [0, 1].
This setup prevents uneven weight updates caused by scale differences and lets the model converge more smoothly.
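
Min-max scaling assumes finite bounds, but some Box spaces declare infinite `low`/`high` values for certain features. The following sketch is a hypothetical `minmax_scale` helper (separate from the notebook's `normalize_observation`) that scales only the features with finite bounds and leaves the rest unchanged.

```python
import numpy as np

def minmax_scale(obs, low, high, eps=1e-8):
    """Scale features with finite bounds to [0, 1]; pass the others through."""
    obs = np.asarray(obs, dtype=np.float32)
    low = np.asarray(low, dtype=np.float32)
    high = np.asarray(high, dtype=np.float32)

    finite = np.isfinite(low) & np.isfinite(high)
    # Substitute placeholder bounds where they are infinite to avoid inf - inf.
    safe_low = np.where(finite, low, 0.0)
    safe_high = np.where(finite, high, 1.0)

    scaled = (obs - safe_low) / (safe_high - safe_low + eps)
    return np.where(finite, scaled, obs)

# Hypothetical bounds: the first feature is bounded, the second is not.
low = np.array([-2.5, -np.inf], dtype=np.float32)
high = np.array([2.5, np.inf], dtype=np.float32)
result = minmax_scale([0.0, 3.0], low, high)
print(result)
```

Unbounded features usually need a different treatment, such as running mean and variance statistics; that choice is outside what this sketch covers.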


## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [19]:
# BEGIN_YOUR_CODE
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class SharedActorCritic(nn.Module):
    def __init__(self, stdim, actdim, hiddensiz=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(stdim, hiddensiz),
            nn.ReLU()
        )
        self.actor = nn.Sequential(
            nn.Linear(hiddensiz, actdim),
            nn.Softmax(dim=-1)
        )
        self.critic = nn.Linear(hiddensiz, 1)

    def forward(self, x):
        x = self.shared(x)
        return self.actor(x), self.critic(x)

# Dummy problem dimensions
stdim = 4
actdim = 2
batchsiz = 5
model = SharedActorCritic(stdim, actdim)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch of states, actions, and returns
states = torch.randn(batchsiz, stdim)
actions = torch.randint(0, actdim, (batchsiz,))
returns = torch.randn(batchsiz, 1)

# Forward pass; squeeze values to shape (batch,) so they match logp
actn_p, values = model(states)
values = values.squeeze(-1)
dist = torch.distributions.Categorical(actn_p)
logp = dist.log_prob(actions)
entropies = dist.entropy()

# Losses: the advantage is detached so the actor loss does not train the critic
advantages = returns.squeeze() - values.detach()
actoloss = -(logp * advantages).mean()
critiloss = F.mse_loss(values, returns.squeeze())
entropyb = entropies.mean()
totalloss = actoloss + critiloss - 0.01 * entropyb

optimizer.zero_grad()
totalloss.backward()

# Total L2 norm over all parameter gradients, measured before clipping
total_norm_before = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None])
).item()
print(f"Gradient norm before clipping: {total_norm_before:.4f}")

# Rescale gradients in place so that their total norm is at most 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

total_norm_after = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None])
).item()
print(f"Gradient norm after clipping:  {total_norm_after:.4f}")
optimizer.step()


# END_YOUR_CODE

Gradient norm before clipping: 3.8468
Gradient norm after clipping:  0.5000


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

Gradient clipping is used to prevent exploding gradients during training, especially when training is unstable.

When gradients become too large, they can cause the model to diverge. Clipping ensures that the total gradient norm stays at or below a chosen threshold.

In actor-critic methods, gradient clipping is preferred when:
- The policy and value losses can create large, unbalanced updates.
- The environment has sparse or delayed rewards, which can inflate the loss.

Calling `clip_grad_norm_()` after `loss.backward()` scales down oversized updates before they can derail training, while gradients already below the threshold pass through unchanged. This helps the model learn more smoothly.
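
The rescaling can be checked by hand on a tiny case. In this sketch a single hypothetical parameter is given a hand-set gradient of `[3, 4]` (L2 norm 5); `clip_grad_norm_` returns the norm measured before clipping and rescales the gradient in place, preserving its direction.

```python
import torch

# One hypothetical parameter with a hand-set gradient of L2 norm 5.
weight = torch.nn.Parameter(torch.zeros(2))
weight.grad = torch.tensor([3.0, 4.0])

# Returns the total norm before clipping; gradients are rescaled in place.
norm_before = torch.nn.utils.clip_grad_norm_([weight], max_norm=0.5)
norm_after = weight.grad.norm(2)

print(f"before: {norm_before.item():.4f}")
print(f"after:  {norm_after.item():.4f}")
print(weight.grad)  # roughly [0.3, 0.4]: same direction, smaller length
```

Since the return value is already the pre-clipping norm, it can be logged directly instead of recomputing the norm manually.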


If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
|   | Task 1 |   |
|   | Task 2 |   |
|   | Task 3 |   |
|   | Task 4 |   |
|   | **Total** |   |
