# 🧩 Lab 10: REINFORCE on CartPole

In the previous labs, we solved the CartPole control task using a Monte Carlo approach to estimate the value function. We discretized the state space, collected full trajectories, computed returns, and used those returns to update a tabular estimate of $Q(s,a)$.

In this lab, we will revisit the CartPole environment, but instead of estimating a value function, we will directly learn a **parameterized policy** using a neural network. This approach is known as **policy gradient**. Rather than selecting actions based on a Q-table, the policy network outputs a probability distribution over actions, and we update its parameters so that actions leading to higher returns become more likely.

Our goal is to implement **REINFORCE**, one of the simplest policy-gradient algorithms:
- collect full episodes under the current policy,
- compute Monte Carlo returns for each time step,
- and adjust the policy parameters in the direction that increases the log-probability of good actions.

In [None]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.nn.functional as F

In [None]:
env = gym.make("CartPole-v1")   # no need to discretize now
obs, info = env.reset(seed=0)
obs_dim = env.observation_space.shape[0]  # 4 for CartPole
n_actions = env.action_space.n            # 2 for CartPole

### Task 1: Define a Policy Network `PolicyNet`

In this part, you will implement a small neural network that represents the policy  
$\pi_\theta(a \mid s)$ for CartPole.

The observation space of `CartPole-v1` is a 4-dimensional vector:
- cart position
- cart velocity
- pole angle
- pole angular velocity

The action space has 2 discrete actions:
- `0`: push cart to the left  
- `1`: push cart to the right  

We will use a **multi-layer perceptron (MLP)** that:
- takes the observation $s \in \mathbb{R}^4$ as input,
- outputs **logits** over the 2 actions (these will go into a `Categorical` distribution),
- uses **two hidden layers** with ReLU activations.

Hints for architecture:
- Use `nn.Sequential` to stack layers.
- A common choice for CartPole is:
  - First hidden layer: around 100–150 units (e.g., `128`).
  - Second hidden layer: smaller, e.g., about half of the first layer (e.g., `64`).
- The final linear layer should map from the second hidden layer to `n_actions`
  (no activation on the output layer; the `Categorical` distribution will handle the softmax internally).
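
To make the last hint concrete, here is a minimal pure-Python sketch of the normalization that a categorical distribution built from logits applies: a softmax to obtain probabilities, and a log-softmax to obtain log-probabilities. The helper names `softmax` and `log_prob` and the example logits are illustrative assumptions, not part of the lab code; `torch.distributions.Categorical(logits=...)` performs the equivalent computation on tensors.

```python
import math

def softmax(logits):
    """Turn unnormalized scores (logits) into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    """Log-probability of `action`: its logit minus the log-sum-exp of all logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[action] - lse

# Hypothetical raw network outputs for the two CartPole actions (left, right).
example_logits = [0.0, math.log(3.0)]
probs = softmax(example_logits)
print(probs)                        # probabilities for actions 0 and 1
print(log_prob(example_logits, 1))  # log-probability of action 1
```

With logits `0` and `log 3`, the probabilities come out as roughly 0.25 and 0.75. This is why the network's last layer can stay linear: any real-valued output is a valid logit.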

In [None]:
# ----- Policy network π_θ(a | s) -----
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # Your turn: build the MLP described above (e.g., with nn.Sequential)
        self.net = ####

    def forward(self, x):
        logits = self.net(x)  # shape: (batch, n_actions)
        return torch.distributions.Categorical(logits=logits)

In [None]:
policy = PolicyNet(obs_dim, n_actions)
optimizer = optim.Adam(policy.parameters(), lr=1e-4)
gamma = 0.99
num_episodes = 200000

### Task 2: Implement the REINFORCE Loss Function

After collecting one full episode and computing the Monte Carlo returns $G_t$ for
each time step, the final step is to update the policy parameters.  
In REINFORCE, we adjust the policy in the direction that increases the
log-probability of actions that resulted in high returns.

Recall the update rule:

$$
\theta \leftarrow \theta + \alpha \, \nabla_\theta 
\log \pi_\theta(a_t \mid s_t) \, G_t.
$$

In practice, instead of applying this update manually, we construct a **loss
function** such that performing gradient descent on the loss produces the same
update as gradient ascent on the expected return $J(\theta)$.
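
The sign flip behind this idea can be checked on a toy problem that has nothing to do with CartPole. The sketch below assumes a made-up scalar objective $J(\theta) = -(\theta - 3)^2$ and hand-written gradients; the names `grad_J` and `grad_loss` are hypothetical. It shows that ascending $J$ and descending the loss $-J$ produce the same parameter values.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)

def grad_loss(theta):
    """Gradient of the loss L(theta) = -J(theta)."""
    return -grad_J(theta)

alpha = 0.1            # step size
theta_ascent = 0.0     # updated by gradient ascent on J
theta_descent = 0.0    # updated by gradient descent on L = -J

for _ in range(100):
    theta_ascent = theta_ascent + alpha * grad_J(theta_ascent)
    theta_descent = theta_descent - alpha * grad_loss(theta_descent)

print(theta_ascent, theta_descent)  # both approach the maximizer theta = 3
```

An optimizer such as `optim.Adam` only minimizes, so the quantity passed to `backward()` must be the negative of the objective you want to maximize.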

Your task:

1. You have a list of `log_probs`, one for each action taken in the episode.
2. You have a list of `returns`, containing the Monte Carlo return $G_t$ for each step.
3. Combine them into a single scalar loss.
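
The returns in item 2 follow the recursion $G_t = r_t + \gamma G_{t+1}$, with the return past the final step taken as zero. The sketch below is a standalone, pure-Python version of that backward pass; the function name `discounted_returns` and the example rewards are illustrative assumptions, chosen so the arithmetic is easy to check by hand.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):   # walk backward from the last reward
        G = r + gamma * G
        returns.append(G)
    returns.reverse()             # restore chronological order
    return returns

# Three steps with reward 1 each and gamma = 0.5:
# G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Earlier time steps get larger returns because more future reward is still ahead of them, which is what you would expect in CartPole where every surviving step earns a reward of 1.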

In [None]:
returns_history = []

for ep in range(1, num_episodes + 1):
    obs, _ = env.reset()
    done = False

    log_probs = []   # store log π_θ(a_t | s_t)
    rewards = []     # store r_t

    # Generate an episode
    while not done:
        s_t = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)   # shape (1, obs_dim)
        dist = policy(s_t)                                          # π_θ(. | s_t)
        a_t = dist.sample()                                         # sample action
        log_prob_t = dist.log_prob(a_t)                             # log π_θ(a_t | s_t)

        obs_next, r, term, trunc, _ = env.step(a_t.item())
        done = term or trunc

        log_probs.append(log_prob_t)
        rewards.append(r)

        obs = obs_next

    # Compute discounted Monte Carlo returns G_t, working backward from the last step
    T = len(rewards)
    G = 0.0
    returns = []

    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    returns = torch.tensor(returns, dtype=torch.float32)
    log_probs = torch.cat(log_probs)        # shape (T,); each entry has shape (1,), so torch.stack would give (T, 1)

    # Your turn: combine log_probs and returns into a single scalar loss
    loss = ####

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ep_return = sum(rewards)
    returns_history.append(ep_return)

    if ep % 100 == 0:
        avg = np.mean(returns_history[-50:])
        print(f"Episode {ep:4d} | Return: {ep_return:4.1f} | "
              f"Avg(50): {avg:5.1f}")

env.close()
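
Per-episode returns from REINFORCE are noisy, so a smoothed curve is usually easier to read than the raw values. The following is a minimal sketch of a trailing moving average in pure Python; the helper name `moving_average` is hypothetical and not part of the lab code. Its output could be passed to `plt.plot` alongside `returns_history`.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` entries (fewer at the start)."""
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        averages.append(running / min(i + 1, window))
    return averages

# Small hand-checkable example with a window of 2.
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
print(smoothed)  # [1.0, 1.5, 2.5, 3.5]
```

With `window=50`, the last entry of the smoothed list matches the `Avg(50)` value printed by the training loop once at least 50 episodes have been recorded, up to floating-point rounding.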