# Policy Gradient

## The Simplest Policy Gradient

Here, we consider the case of a stochastic, parameterized policy, $\pi_{\theta}$. We aim to maximize the expected return $J(\pi_{\theta}) = \mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]$. For the purposes of this derivation, we’ll take $R(\tau)$ to give the finite-horizon undiscounted return, but the derivation for the infinite-horizon discounted return setting is almost identical.

We would like to optimize the policy by gradient ascent:

$$\theta_{k+1} = \theta_{k} + \alpha\nabla_{\theta}J(\pi_{\theta})|_{\theta_{k}}.$$

The gradient of policy performance, $\nabla_{\theta}J(\pi_{\theta})$, is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms.

To actually use this algorithm, we need an expression for the policy gradient which we can numerically compute. This involves two steps:
1. deriving the analytical gradient of policy performance, which turns out to have the form of an expected value.
2. forming a sample estimate of that expected value, which can be computed with data from a finite number of agent-environment interaction steps.

In this subsection, we’ll find the simplest form of that expression. In later subsections, we’ll show how to improve on the simplest form to get the version we actually use in standard policy gradient implementations.
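The derivation leans on two standard facts, sketched here under the usual Markov decision process assumptions. The symbols $\rho_{0}$ (start-state distribution) and $P(s_{t+1}|s_{t},a_{t})$ (environment transition dynamics) are not defined elsewhere in this section and are introduced only for this sketch. First, the probability of a trajectory $\tau = (s_{0}, a_{0}, \dots, s_{T+1})$ under $\pi_{\theta}$ factorizes as

$$P(\tau|\theta) = \rho_{0}(s_{0})\prod_{t=0}^{T}P(s_{t+1}|s_{t},a_{t})\pi_{\theta}(a_{t}|s_{t}).$$

Second, the log-derivative trick, $\nabla_{\theta}P(\tau|\theta) = P(\tau|\theta)\nabla_{\theta}\log P(\tau|\theta)$, follows from the chain rule. Because neither $\rho_{0}$ nor the transition dynamics depend on $\theta$, taking the gradient of the log of the product leaves only the policy terms:

$$\nabla_{\theta}\log P(\tau|\theta) = \sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}).$$

With these in hand, the policy gradient can be derived as follows: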

$$
\begin{split}
\nabla_{\theta}J(\pi_{\theta}) &= \nabla_{\theta}\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]\\
&= \nabla_{\theta}\int_{\tau}P(\tau|\theta)R(\tau) \\
&= \int_{\tau}\nabla_{\theta}P(\tau|\theta)R(\tau) \\
&= \int_{\tau}P(\tau|\theta)\nabla_{\theta}\log P(\tau|\theta)R(\tau)\\
&= \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\nabla_{\theta}\log P(\tau|\theta)R(\tau)\right]\\
&= \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)\right]
\end{split}
$$

This is an expectation, which means that we can estimate it with a sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_{i}\}_{i=1,\dots,N}$ where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_{\theta}$, the policy gradient can be estimated with

$$\hat{g} = \frac{1}{|\mathcal{D}|}\sum_{\tau\in\mathcal{D}}\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau),$$

where $|\mathcal{D}|$ is the number of trajectories in $\mathcal{D}$ (here, $N$).

This last expression is the simplest version of the computable expression we desired. Assuming that we have represented our policy in a way which allows us to calculate $\log\pi_{\theta}(a_{t}|s_{t})$, and if we are able to run the policy in the environment to collect the trajectory dataset, we can compute the policy gradient and take an update step.
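As a rough illustration of the estimate $\hat{g}$ and the gradient ascent update, here is a minimal sketch in plain Python. Everything specific in it is an assumption made for the example and not part of the derivation: a state-independent softmax policy with logits `theta`, a hypothetical single-state environment that pays reward 1 for action 0 and 0 otherwise, and the step size `alpha`. For this toy setup the true gradient at `theta = [0, 0]` with horizon 5 works out to `[1.25, -1.25]`, so the sample estimate should land near that.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy.

    For softmax, d/d theta_k log pi(a) = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def sample_trajectory(logits, horizon, rng):
    """Hypothetical toy environment: one state, reward 1 for action 0, else 0."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=horizon)
    rewards = [1.0 if a == 0 else 0.0 for a in actions]
    return actions, rewards


def policy_gradient_estimate(logits, num_trajectories, horizon, rng):
    """Sample mean of sum_t grad log pi(a_t | s_t) * R(tau)."""
    g_hat = [0.0] * len(logits)
    for _ in range(num_trajectories):
        actions, rewards = sample_trajectory(logits, horizon, rng)
        total_return = sum(rewards)  # R(tau), the undiscounted return
        for a in actions:
            for k, g in enumerate(grad_log_prob(logits, a)):
                g_hat[k] += g * total_return
    return [g / num_trajectories for g in g_hat]


rng = random.Random(0)
theta = [0.0, 0.0]
g_hat = policy_gradient_estimate(theta, num_trajectories=2000, horizon=5, rng=rng)

# One gradient ascent step: theta_{k+1} = theta_k + alpha * g_hat.
alpha = 0.1  # assumed step size, chosen only for illustration
theta_next = [t + alpha * g for t, g in zip(theta, g_hat)]
```

The update raises the logit of the rewarded action and lowers the other, which is the expected direction. A practical implementation would use a neural network policy and automatic differentiation in place of the hand-written `grad_log_prob`.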

## Expected Grad-Log-Prob Lemma

In this subsection, we will derive an intermediate result which is extensively used throughout the theory of policy gradients. We will call it the Expected Grad-Log-Prob (EGLP) lemma.

**EGLP Lemma.** Suppose that $P_{\theta}$ is a parameterized probability distribution over a random variable, $x$. Then:

$$\underset{x\sim P_{\theta}}{\mathbb{E}}[\nabla_{\theta}\log P_{\theta}(x)] = 0$$

**Proof.** Recall that all probability distributions are normalized:

$$\int_{x}P_{\theta}(x) = 1$$

Take the gradient of both sides of the normalization condition and use the log derivative trick to get:

$$
\begin{split}
0 &= \nabla_{\theta}1 \\
&= \nabla_{\theta}\int_{x}P_{\theta}(x) \\
&= \int_{x}\nabla_{\theta}P_{\theta}(x) \\
&= \int_{x}P_{\theta}(x)\nabla_{\theta}\log P_{\theta}(x) \\
&= \underset{x\sim P_{\theta}}{\mathbb{E}}[\nabla_{\theta}\log P_{\theta}(x)]
\end{split}
$$
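The lemma can be checked numerically on a small discrete distribution. The sketch below assumes a softmax (categorical) distribution over three outcomes, parameterized directly by its logits; that choice is only for illustration. Because the outcome space is finite, the expectation is computed exactly as a weighted sum, with no sampling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grad_log_prob(logits):
    """Exact E_{x ~ P_theta}[grad_theta log P_theta(x)] for a softmax distribution."""
    probs = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for x in range(n):  # sum over every outcome x
        for k in range(n):  # each component of the gradient
            grad_k = (1.0 if k == x else 0.0) - probs[k]
            expectation[k] += probs[x] * grad_k
    return expectation


result = expected_grad_log_prob([0.3, -1.2, 2.0])
```

Every component of `result` is zero up to floating-point rounding, whichever logits are supplied.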

## Don’t Let the Past Distract You

Examine our most recent expression for the policy gradient:

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)\right]$$

Taking a step with this gradient pushes up the log-probabilities of each action in proportion to $R(\tau)$, the sum of all rewards ever obtained. But this doesn’t make much sense.

Agents should really only reinforce actions on the basis of their consequences. Rewards obtained before taking an action have no bearing on how good that action was; only the rewards that come after it do.

It turns out that this intuition shows up in the math, and we can show that the policy gradient can also be expressed by

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\sum_{t'=t}^{T}R(s_{t'}, a_{t'}, s_{t'+1})\right]$$

In this form, actions are only reinforced based on rewards obtained after they are taken.
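The inner sum, often called the reward-to-go from time $t$, can be computed for a whole trajectory in a single backward pass. The following is a minimal sketch; the function name `reward_to_go` and the plain-list representation of rewards are assumptions made for the example.

```python
def reward_to_go(rewards):
    """Return a list whose entry t is the sum of rewards[t:] (undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


example = reward_to_go([1.0, 0.0, 2.0, 3.0])  # [6.0, 5.0, 5.0, 3.0]
```

In the estimator, entry `t` of this list replaces $R(\tau)$ as the weight on $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$.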

## Baselines in Policy Gradients

An immediate consequence of the EGLP lemma is that for any function $b$ which only depends on state,

$$\underset{a_{t}\sim\pi_{\theta}}{\mathbb{E}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})b(s_{t})\right] = 0$$
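To see why, note that $b(s_{t})$ does not depend on the action, so it factors out of the expectation over $a_{t}$, and the EGLP lemma applies with $P_{\theta} = \pi_{\theta}(\cdot|s_{t})$:

$$\underset{a_{t}\sim\pi_{\theta}}{\mathbb{E}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})b(s_{t})\right] = b(s_{t})\underset{a_{t}\sim\pi_{\theta}}{\mathbb{E}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\right] = b(s_{t})\cdot 0 = 0$$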

This allows us to add or subtract any number of terms like this from our expression for the policy gradient, without changing it in expectation:

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\left(\sum_{t'=t}^{T}R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_{t})\right)\right]$$

Any function $b$ used in this way is called a **baseline.**

The most common choice of baseline is the on-policy value function $V^{\pi}(s_{t})$. Recall that this is the average return an agent gets if it starts in state $s_t$ and then acts according to policy $\pi$ for the rest of its life.

Empirically, the choice $b(s_t) = V^{\pi}(s_t)$ has the desirable effect of reducing variance in the sample estimate for the policy gradient. This results in faster and more stable policy learning. It is also appealing from a conceptual angle: it encodes the intuition that if an agent gets what it expected, it should “feel” neutral about it.
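The effect of a baseline can be seen exactly in a very small case. The sketch below assumes a hypothetical one-step, single-state problem with two actions, a softmax policy over logits, and rewards of 11 and 10; none of these come from the text above. It enumerates both actions to get the exact mean and variance of the single-sample gradient estimator, with and without the baseline $b = V^{\pi}$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_mean_and_variance(logits, rewards, baseline):
    """Exact per-component mean and variance of grad log pi(a) * (r(a) - baseline)."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a in range(n):  # enumerate every action instead of sampling
        weight = rewards[a] - baseline
        for k in range(n):
            g = ((1.0 if k == a else 0.0) - probs[k]) * weight
            mean[k] += probs[a] * g
            second_moment[k] += probs[a] * g * g
    variance = [m2 - m * m for m, m2 in zip(mean, second_moment)]
    return mean, variance


logits = [0.0, 0.0]
rewards = [11.0, 10.0]  # assumed toy rewards
# With a single state, the on-policy value is just the expected reward.
value = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = estimator_mean_and_variance(logits, rewards, 0.0)
mean_base, var_base = estimator_mean_and_variance(logits, rewards, value)
```

Both versions have the same mean, as the EGLP lemma requires, while the variance of each component falls from 27.5625 to 0 once the baseline is subtracted. The drop to exactly zero is an artifact of this symmetric two-action example; in general the baseline reduces variance without eliminating it, and this single instance is not a proof of that.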