# Policy Gradient

```{note}
Since the beginning of the course, we have only studied value-based methods, where we estimate a value function as an intermediate step towards finding an optimal policy. Finding an optimal value function leads to having an optimal policy:

$$\pi^{\ast}(s) = \arg\max_{a}Q^{\ast}(s, a)$$

With policy-based methods, we want to optimize the policy directly without having an intermediate step of learning a value function.
```

## The policy-gradient methods

In policy-based methods, we directly learn to approximate $\pi^{\ast}$ without having to learn a value function. 

* The idea is to parameterize the policy, for instance with a neural network $\pi_{\theta}$ that outputs a probability distribution over actions (a stochastic policy).

* Our objective is then to maximize the performance of the parameterized policy using gradient ascent. To do that, we define an objective function $J(\theta)$, the expected cumulative reward, and look for the parameters $\theta$ that maximize it.

![](images/policy2.png)
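
As a minimal sketch of what "parameterizing the policy" means, the snippet below uses a linear softmax policy in plain Python instead of a neural network. The weight matrix `theta`, the two-feature `state`, and the helper names are illustrative assumptions, not part of the course code.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def policy(theta, state):
    """Toy linear softmax policy: one row of weights per action (hypothetical model)."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(scores)

def sample_action(probs, rng):
    """Draw an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

theta = [[0.5, -0.2], [0.1, 0.3]]  # 2 actions, 2 state features (made-up values)
state = [1.0, 2.0]
probs = policy(theta, state)       # a distribution over the 2 actions
action = sample_action(probs, random.Random(0))
```

Changing `theta` changes the distribution returned by `policy`, which is exactly the handle that gradient ascent on $J(\theta)$ uses.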

## The advantages and disadvantages of policy-gradient methods

There are multiple advantages over value-based methods. Let’s see some of them:

* Policy-gradient methods can learn a stochastic policy, whereas acting greedily with respect to a value function yields a deterministic policy.

* Policy-gradient methods are more effective in high-dimensional and continuous action spaces. Deep Q-learning has to output a score for every possible action at each time step given the current state, which becomes impractical when the number of actions is very large or infinite. A policy-gradient method instead outputs a probability distribution over actions.

* Policy-gradient methods have better convergence properties. Value-based methods use an aggressive operator to update the value function: the maximum over Q-estimates. Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value. With a parameterized stochastic policy, the action probabilities instead change smoothly as the parameters are updated.
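
The convergence argument can be made concrete with a small numerical sketch. The two-action values below are made up for illustration, and the softmax over the same numbers stands in for a parameterized policy; this is an assumption of the example, not a claim about any particular algorithm.

```python
import math

def greedy_probs(q_values):
    """Greedy policy: all probability mass on the arg-max action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]

def softmax_probs(preferences):
    """Softmax policy over action preferences."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

before = [1.00, 1.01]
after = [1.02, 1.01]  # the estimate for action 0 moved by only 0.02

# How much does the probability of action 0 change under each policy?
greedy_shift = abs(greedy_probs(after)[0] - greedy_probs(before)[0])
softmax_shift = abs(softmax_probs(after)[0] - softmax_probs(before)[0])
```

Here `greedy_shift` is a full jump from 0 to 1 because the arg-max flips, while `softmax_shift` stays below 0.01.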
    
Naturally, policy-gradient methods also have some disadvantages:

* Frequently, policy-gradient methods converge to a local maximum instead of the global optimum.

* Policy-gradient methods improve the policy in small steps, so training can take longer.

* Policy-gradient estimates can have high variance.

## The Policy Gradient Theorem

The objective function outputs the expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau\sim\pi}[R(\tau)]$$

It can be formulated as:

![](images/policy3.png)

Policy-gradient learning is an optimization problem: we want to find the values of $\theta$ that maximize our objective function $J(\theta)$, so we use gradient ascent:

$$\theta\leftarrow\theta + \alpha \nabla_{\theta} J(\theta)$$
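
To see the update rule in isolation, here is a sketch on a toy objective $J(\theta) = -(\theta - 3)^{2}$ whose gradient is known in closed form. The objective, the step size `alpha`, and the iteration count are illustrative assumptions; the real $J(\theta)$ of a policy has no such closed-form gradient.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
alpha = 0.1   # learning rate
for _ in range(100):
    # Gradient ascent: move in the direction that increases J.
    theta = theta + alpha * grad_J(theta)
```

After the loop, `theta` sits very close to the maximizer 3.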

However, there are two problems with computing the derivative of $J(\theta)$:

1. We cannot calculate the true gradient of the objective function, since that requires the probability of every possible trajectory, which is prohibitively expensive to compute. Instead, we estimate the gradient from sampled trajectories.

2. To differentiate this objective function, we would need to differentiate the state distribution, which is determined by the environment dynamics. We cannot do so because those dynamics may be unknown.

Fortunately, the Policy Gradient Theorem lets us rewrite the gradient of the objective function in a form that does not involve differentiating the state distribution.

![](images/policy4.png)

````{prf:proof}
We have:

```{math}
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta}\sum_{\tau}P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}\nabla_{\theta}P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\frac{\nabla_{\theta}P(\tau;\theta)}{P(\tau;\theta)}R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\nabla_{\theta}\log P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\nabla_{\theta}\log\left[\mu(s_{0})\prod_{t=0}^{H}P(s_{t+1}|s_{t}, a_{t})\pi_{\theta}(a_{t}|s_{t})\right]R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\sum_{t=0}^{H}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)\\
&=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{H}\nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t})R(\tau)\right]
\end{aligned}
```

The initial-state distribution $\mu(s_{0})$ and the transition probabilities $P(s_{t+1}|s_{t}, a_{t})$ drop out because the logarithm turns the product into a sum and neither term depends on $\theta$, so their gradients are zero.

````
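
The identity can be checked numerically on a problem small enough to enumerate every trajectory. The sketch below assumes a one-step, two-armed bandit with a softmax policy and made-up rewards, so each "trajectory" is a single action; it compares the exact sum over trajectories with a sample-based average of $\nabla_{\theta}\log\pi_{\theta}(a)R(\tau)$.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    """d/d theta_k of log pi(action) for a softmax policy: 1[k == action] - pi(k)."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

theta = [0.2, -0.1]    # policy logits (illustrative values)
rewards = [1.0, 0.0]   # R(tau) for the trajectories "pull arm 0" and "pull arm 1"
probs = softmax(theta)

# Exact gradient: sum over every trajectory of P(tau) * grad log P(tau) * R(tau).
exact = [0.0, 0.0]
for a, p in enumerate(probs):
    g = grad_log_pi(probs, a)
    exact = [e + p * gk * rewards[a] for e, gk in zip(exact, g)]

# Sample-based estimate: average grad log pi(a) * R(tau) over sampled trajectories.
rng = random.Random(0)
n = 20000
estimate = [0.0, 0.0]
for _ in range(n):
    a = rng.choices([0, 1], weights=probs, k=1)[0]
    g = grad_log_pi(probs, a)
    estimate = [e + gk * rewards[a] / n for e, gk in zip(estimate, g)]
```

With enough samples `estimate` approaches `exact`, and it only ever needs $\nabla_{\theta}\log\pi_{\theta}$ and the observed return, never the environment's dynamics.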

## The Reinforce algorithm (Monte Carlo Reinforce)

In a loop:

* Use the policy $\pi_{\theta}$ to collect some episodes.

* Use these episodes to estimate the gradient.

![](images/policy5.png)

We can interpret this update as follows: $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ is the direction of steepest increase of the (log) probability of selecting action $a_{t}$ in state $s_{t}$. This tells us:

* If the return $R(\tau)$ is high, the update pushes up the probabilities of the (state, action) pairs visited in the episode.

* If the return $R(\tau)$ is low (negative), the update pushes down the probabilities of those (state, action) pairs.
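
This interpretation can be illustrated with a single update step on a two-action softmax policy. The sketch below assumes one state, logits `theta` initialized to zero, and an arbitrary step size; `reinforce_step` is a hypothetical helper, not part of the PyTorch example that follows.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, action, ret, alpha=0.5):
    """One update theta <- theta + alpha * grad log pi(action) * return."""
    probs = softmax(theta)
    return [t + alpha * ((1.0 if k == action else 0.0) - p) * ret
            for k, (t, p) in enumerate(zip(theta, probs))]

theta = [0.0, 0.0]
p_before = softmax(theta)[0]  # 0.5 for both actions at the start

# Same sampled action, opposite returns.
p_high = softmax(reinforce_step(theta, action=0, ret=+1.0))[0]
p_low = softmax(reinforce_step(theta, action=0, ret=-1.0))[0]
```

A positive return raises the probability of the sampled action (`p_high > p_before`), and a negative return lowers it (`p_low < p_before`).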

## PyTorch example

### Policy network

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Policy(nn.Module):
    """MLP"""
    def __init__(self):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, 2)
        
        self.saved_log_probs = []
        self.rewards = []

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)

In [2]:
policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

* `self.saved_log_probs` stores $\left[\log\pi_{\theta}(a_{0}|s_{0}), \log\pi_{\theta}(a_{1}|s_{1}),\dots,\log\pi_{\theta}(a_{T}|s_{T})\right]$

* `self.rewards` stores $\left[r_{0},r_{1},\dots,r_{T}\right]$

### Environment

In [3]:
import gym

env = gym.make('CartPole-v1')
env.reset(seed=1)

(array([ 0.00118216,  0.04504637, -0.03558404,  0.04486495], dtype=float32),
 {})

### One episode

In [4]:
import numpy as np
from torch.distributions import Categorical

def select_action(state):
    state = torch.from_numpy(state).float().unsqueeze(0)
    probs = policy(state)
    m = Categorical(probs)
    action = m.sample()
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()

In [5]:
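from collections import deque

gamma = 0.99  # discount factor

def finish_episode():
    R = 0
    eps = np.finfo(np.float32).eps.item()  # avoids division by zero below
    policy_loss = []
    returns = deque()
    # Walk the rewards backwards to accumulate the discounted returns R_t.
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.appendleft(R)
    returns = torch.tensor(returns)
    # Standardize the returns to reduce the variance of the gradient estimate.
    returns = (returns - returns.mean()) / (returns.std() + eps)

    # Minimizing -log_prob * R is gradient ascent on the expected return.
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    # Clear the episode buffers before the next rollout.
    del policy.rewards[:]
    del policy.saved_log_probs[:]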

`returns` = $\left[R_{0}, R_{1}, \dots, R_{T}\right]$ where $R_{t} = r_{t} + \gamma r_{t+1} + \gamma^{2}r_{t+2} + \dots$. The code then standardizes these values to zero mean and unit variance.

`policy_loss` = $\sum_{t=0}^{T}-\log\pi_{\theta}(a_{t}|s_{t})R_{t}$
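
The backward recursion $R_{t} = r_{t} + \gamma R_{t+1}$ used to build `returns` can be sketched on its own in plain Python. The function name `discounted_returns` and the short reward list are illustrative assumptions, and the standardization step is left out.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_T] with R_t = r_t + gamma * R_{t+1}."""
    returns = []
    running = 0.0
    # Iterate from the last reward to the first so each R_t reuses R_{t+1}.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Computing the returns back to front takes one pass over the episode, instead of re-summing the remaining rewards at every time step.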

### Main loop

In [6]:
from itertools import count
from collections import deque

def main():
    running_reward = 10
    for i_episode in count(1):
        state, _ = env.reset()
        ep_reward = 0
        for t in range(1, 10000):  # Don't infinite loop while learning
            action = select_action(state)
            # env.step returns (obs, reward, terminated, truncated, info); `truncated` is ignored here.
            state, reward, done, _, _ = env.step(action)
            policy.rewards.append(reward)
            ep_reward += reward
            if done:
                break

        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        finish_episode()
        if i_episode % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

In [7]:
main()

Episode 10	Last reward: 12.00	Average reward: 15.65
Episode 20	Last reward: 30.00	Average reward: 21.50
Episode 30	Last reward: 57.00	Average reward: 26.75
Episode 40	Last reward: 34.00	Average reward: 34.43
Episode 50	Last reward: 70.00	Average reward: 55.36
Episode 60	Last reward: 72.00	Average reward: 103.52
Episode 70	Last reward: 132.00	Average reward: 119.66
Episode 80	Last reward: 140.00	Average reward: 117.84
Episode 90	Last reward: 27.00	Average reward: 96.78
Episode 100	Last reward: 12.00	Average reward: 76.47
Episode 110	Last reward: 103.00	Average reward: 81.30
Episode 120	Last reward: 105.00	Average reward: 77.09
Episode 130	Last reward: 93.00	Average reward: 83.25
Episode 140	Last reward: 128.00	Average reward: 107.52
Episode 150	Last reward: 180.00	Average reward: 155.75
Episode 160	Last reward: 483.00	Average reward: 219.93
Episode 170	Last reward: 62.00	Average reward: 200.44
Episode 180	Last reward: 192.00	Average reward: 196.48
Episode 190	Last reward: 808.00	Average