# Policy gradient methods cont.
### REINFORCE with baseline and actor-critic methods
RLDMUU, UniNE 2025, jakub.tluczek@unine.ch

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np
from matplotlib import pyplot as plt
from tqdm import tqdm

First, let's set up PyTorch. As usual, we select the device on which everything will be computed, and we fix the random seeds to make the results reproducible:

In [2]:
torch.manual_seed(123)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
np.random.seed(123)

In [3]:
# running mean function for the purpose of visualization
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / float(N)

### REINFORCE with baseline

Last time we saw how to implement the REINFORCE algorithm, and how standardizing the returns $G$ made learning more stable. Another way to stabilize learning by reducing variance is to subtract a baseline $b(s)$ from the return; a natural choice is an estimate of the expected return from state $s$, i.e. the state value. We approximate this value with another parametrized, differentiable function. We can reuse the policy $\pi(a|s, \boldsymbol{\theta})$ from the previous exercise and add a state-value function $\hat{v}(s, \mathbf{w})$ with its own parameters, using separate learning rates $\alpha_\theta$ and $\alpha_{\mathbf{w}}$. Then, just as in the previous exercise, for each episode we collect its trajectory $\tau$ and compute the following:

- generate a trajectory $\tau$ following policy $\pi(\cdot | \cdot, \theta)$
- for each $t$ in $\tau$:
    - $G_t \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
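
The return computation above can also be sketched as a single backward pass, using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$. This is only a minimal sketch: the helper name `discounted_returns` is hypothetical, and it assumes that `rewards[t]` holds the reward received after the action taken at step $t$.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one episode.

    Assumption: rewards[t] is the reward received after acting at step t
    (r_{t+1} in the notation above).
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    # walk backwards so that each G_t reuses the already computed G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns


example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Compared with summing the discounted rewards separately for every $t$, this needs one pass over the episode instead of a nested loop.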

Now, however, instead of weighting the policy update by the discounted return alone, we compute the advantage term $\delta$ for each timestep $t$:

- $\delta \leftarrow G_t - \hat{v}(s_t, \mathbf{w})$

The value network can then be optimized to shrink this advantage term (for example by minimizing $\delta^2$), while the policy network is updated as follows:

- $\theta \leftarrow \theta + \alpha_\theta \gamma^t \delta \nabla \ln \pi(a_t | s_t, \theta)$
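
One possible way to turn these updates into PyTorch losses is sketched below. It is not the reference solution of the exercise: the two small `nn.Sequential` networks, the fake episode data and the learning rates are all assumptions chosen so that the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins, not the notebook's networks
policy_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
value_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-2)

# Fake episode data (assumption: 5 steps, 4-dimensional states, 2 actions)
states = torch.randn(5, 4)
actions = torch.tensor([0, 1, 1, 0, 1])
returns = torch.tensor([4.1, 3.4, 2.7, 1.9, 1.0])  # G_t for each step
gamma = 0.9

log_probs = torch.log_softmax(policy_net(states), dim=-1)
log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # ln π(a_t|s_t)
values = value_net(states).squeeze(1)  # v̂(s_t, w)

# δ_t = G_t - v̂(s_t, w); detached so that the policy loss does not
# back-propagate into the value network
deltas = returns - values.detach()
discounts = gamma ** torch.arange(len(returns), dtype=torch.float32)  # γ^t

# gradient ascent on γ^t δ ln π is gradient descent on its negative
policy_loss = -(discounts * deltas * log_pi_a).sum()
# minimizing the squared error moves v̂ along δ ∇v̂ (up to a constant factor)
value_loss = ((returns - values) ** 2).mean()

policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()

value_opt.zero_grad()
value_loss.backward()
value_opt.step()
```

Detaching the value estimate when forming `deltas` keeps the two updates separate: the policy loss only changes $\theta$, and the value loss only changes $\mathbf{w}$.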

In [4]:
class PolicyNetwork(nn.Module):
    def __init__(self, n_inputs, n_outputs, hidden_dim_size):
        super(PolicyNetwork, self).__init__()
        # 2 fully connected layers
        self.linear1 = nn.Linear(n_inputs, hidden_dim_size)
        self.linear2 = nn.Linear(hidden_dim_size, n_outputs)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        # instead of returning one output, let's return logπ together with π
        logits = self.linear2(x)
        probs = F.softmax(logits, dim=-1)
        # log_softmax is numerically more stable than torch.log(softmax(...))
        log_probs = F.log_softmax(logits, dim=-1)

        return probs, log_probs
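
A minimal usage sketch for outputs of this kind: sampling an action from the probabilities and keeping the matching log-probability for the later update. The hard-coded logits stand in for the output of the last layer and are an assumption, and `torch.distributions.Categorical` is just one common way to sample, not necessarily the one intended by the exercise.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the output of the policy's last layer (assumption: 2 actions)
logits = torch.tensor([0.2, -0.1])
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)

# sample an action index according to π(·|s)
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()

# ln π(a|s) of the sampled action, needed later for the policy update
log_prob_action = log_probs[action]
```

The sampled `action` is a 0-dimensional tensor, so `action.item()` would be needed before passing it to `env.step`.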

In [None]:
# TODO: Implement the value network
class ValueNetwork(nn.Module):
    def __init__(self, num_states, hidden_dim):
        pass

    def forward(self, state):
        pass

In [None]:
# TODO: initialize the value network and its parameters

We are still going to use the `CartPole` environment:

In [9]:
env = gym.make('CartPole-v1')

REINFORCE with baseline main loop:

In [None]:
NUM_TRAJECTORIES = 2000
MAX_EPISODE_LENGTH = 500
gamma = 0.9
# placeholders for rewards for each episode
rewards = []
policy_losses = []
value_losses = []
# iterating through trajectories
for tau in tqdm(range(NUM_TRAJECTORIES)):
    # resetting the environment
    state, info = env.reset()
    # setting done to False for while loop 
    done = False
    # storing trajectory and logπ(a_t|s_t, θ)
    transition_buffer = []
    log_probs = []
    state_values = []

    t = 0
    while not done and t < MAX_EPISODE_LENGTH:
        # TODO: play the episode and  collect the data
        pass
    # logging the episode length as a cumulative reward
    rewards.append(t)
    returns = []
    for t_prime in range(t):
        # computing discounted rewards in future for every timestep
        G = 0
        for i, tick in enumerate(transition_buffer[t_prime:]):
            G += (gamma ** i) * tick
        returns.append(G)

    # turning the returns vector into a tensor
    returns = torch.tensor(returns, dtype=torch.float32, device=device)
    # TODO: compute the advantage term δ
    deltas = ...
    
    # TODO: perform update for both policy and value network


100%|██████████| 2000/2000 [02:36<00:00, 12.76it/s]


In [None]:
# visualize the results
plt.figure(figsize=(12,9))
plt.plot(running_mean(rewards, 50))
plt.grid()
plt.title("REINFORCE with baseline cumulative rewards")

### 1-step actor-critic

Another approach is to update the policy not at the end of each trajectory but at every timestep, using the 1-step return. When computing the advantage term $\delta$, we then replace the discounted sum of rewards up to the terminal state with a bootstrapped estimate:

- $G_{t:t+1} \leftarrow r + \gamma \hat{v}(s', \mathbf{w})$
- $\delta \leftarrow G_{t:t+1} - \hat{v}(s, \mathbf{w})$

More specifically, while following the current policy, we can compute the advantage term online after every transition $(s, a, r, s')$, taking $\hat{v}(s', \mathbf{w}) = 0$ when $s'$ is terminal:

- $\delta \leftarrow r + \gamma \hat{v}(s', \mathbf{w}) - \hat{v}(s, \mathbf{w})$
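
As a small worked example of this formula, the sketch below evaluates $\delta$ for one made-up transition. The helper name `td_error` and all numbers are hypothetical; the only assumption about the method is that the value of a terminal state is treated as zero.

```python
def td_error(reward, v_s, v_next, gamma, terminated):
    """One-step advantage estimate δ = r + γ v̂(s') - v̂(s).

    Assumption: v̂(s') is treated as 0 when s' is terminal.
    """
    bootstrap = 0.0 if terminated else gamma * v_next
    return reward + bootstrap - v_s


# same transition, once in the middle of an episode and once at its end
delta_mid = td_error(reward=1.0, v_s=9.5, v_next=10.0, gamma=0.99, terminated=False)
delta_end = td_error(reward=1.0, v_s=9.5, v_next=10.0, gamma=0.99, terminated=True)
```

Here `delta_mid` is about $1.4$ (the transition turned out better than the critic expected), while `delta_end` is $-8.5$ (the critic expected much more reward than the single step before termination delivered).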

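Then we can update the value network with:

$\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla \hat{v}(s, \mathbf{w})$

which is a gradient step on the squared error $\delta^2$ with the target $r + \gamma \hat{v}(s', \mathbf{w})$ held fixed, and the policy network with:

$\theta \leftarrow \theta + \alpha_\theta \gamma^t \delta \nabla \ln \pi(a_t | s_t, \theta)$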
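
A single online update of this kind could look roughly as follows. This is a sketch under stated assumptions, not the reference solution: `actor`, `critic`, `actor_critic_step` and the fake transition are hypothetical names and data, and the bootstrapped target is kept out of the gradient so that only $\hat{v}(s, \mathbf{w})$ is differentiated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins, not the notebook's networks
actor = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
critic = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


def actor_critic_step(state, action, reward, next_state, terminated, gamma, discount):
    """Apply one online update; `discount` plays the role of γ^t."""
    v_s = critic(state).squeeze(-1)
    with torch.no_grad():
        # no bootstrapping from a terminal state
        v_next = torch.zeros(()) if terminated else critic(next_state).squeeze(-1)
        target = reward + gamma * v_next
    delta = (target - v_s).detach()

    log_pi = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -discount * delta * log_pi
    critic_loss = (target - v_s) ** 2

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return delta.item()


# one fake transition (assumption: 4-dimensional state, action index 1)
s, s_next = torch.randn(4), torch.randn(4)
delta = actor_critic_step(s, 1, 1.0, s_next, False, gamma=0.99, discount=1.0)
```

In a full training loop, `discount` would start at 1 at the beginning of each episode and be multiplied by `gamma` after every step.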


First, let's create fresh instances of the policy and value networks from the previous exercise, each with its own optimizer:

In [16]:
policy = PolicyNetwork(n_inputs=4, n_outputs=2, hidden_dim_size=128).to(device)
value = ValueNetwork(num_states=4, hidden_dim=128).to(device)
policy_optimizer = torch.optim.Adam(params=policy.parameters(), lr=1e-4)
value_optimizer = torch.optim.Adam(params=value.parameters(), lr=1e-3)

In [None]:
NUM_TRAJECTORIES = 1000
MAX_EPISODE_LENGTH = 500
gamma = 0.99
# placeholders for rewards for each episode
rewards = []
policy_losses = []
value_losses = []
# iterating through trajectories
for tau in tqdm(range(NUM_TRAJECTORIES)):
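    # resetting the environment; seeding only the first reset keeps the run
    # reproducible without restarting every episode from the same initial state
    state, info = env.reset(seed=123 if tau == 0 else None)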
    # setting done to False for while loop 
    done = False

    t = 0
    while not done and t < MAX_EPISODE_LENGTH:
        # TODO: perform the actor-critic update
        t += 1
    rewards.append(t)

In [None]:
# visualize the results
plt.figure(figsize=(12,9))
plt.plot(running_mean(rewards, 50))
plt.grid()
plt.title("1-step actor-critic cumulative rewards")