# Trust Region Policy Optimization

```{note}
TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-divergence, a measure of (something like, but not exactly) distance between probability distributions.

This is different from the normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.
```

## Quick Facts

* TRPO is an on-policy algorithm.
* TRPO can be used for environments with either discrete or continuous action spaces.

## Key Equations

Let $\pi_{\theta}$ denote a policy with parameters $\theta$. The theoretical TRPO update is:

$$
\begin{split}
\theta_{k+1} = \arg \max_{\theta} \; & {\mathcal L}(\theta_k, \theta) \\
\text{s.t.} \; & \bar{D}_{KL}(\theta || \theta_k) \leq \delta
\end{split}
$$

where ${\mathcal L}(\theta_k, \theta)$ is the surrogate advantage, a measure of how policy $\pi_{\theta}$ performs relative to the old policy $\pi_{\theta_k}$ using data from the old policy:

$$
{\mathcal L}(\theta_k, \theta) = \underset{s,a \sim \pi_{\theta_k}}{\mathbb{E}}\left[{
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)
    }\right],
$$

and $\bar{D}_{KL}(\theta || \theta_k)$ is an average KL-divergence between policies across states visited by the old policy:

$$
\bar{D}_{KL}(\theta || \theta_k) = \underset{s \sim \pi_{\theta_k}}{\mathbb{E}}\left[{
    D_{KL}\left(\pi_{\theta}(\cdot|s) || \pi_{\theta_k} (\cdot|s) \right)
}\right].
$$
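The two quantities above can be estimated from a batch of samples. The following is a minimal sketch, assuming a discrete action space and strictly positive action probabilities. The function names `surrogate_advantage` and `mean_kl` and the toy numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def surrogate_advantage(logp_new, logp_old, adv):
    """Sample estimate of L(theta_k, theta): mean of (probability ratio * advantage)."""
    ratio = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_k(a|s)
    return float(np.mean(ratio * adv))

def mean_kl(probs_new, probs_old):
    """Sample estimate of the average D_KL(pi_theta || pi_theta_k).

    Each row holds the action probabilities for one visited state.
    Assumes every probability is strictly positive.
    """
    kl_per_state = np.sum(probs_new * (np.log(probs_new) - np.log(probs_old)), axis=1)
    return float(np.mean(kl_per_state))

# Toy batch (made-up numbers): two states, two actions, collected under the old policy.
probs_old = np.array([[0.5, 0.5], [0.9, 0.1]])
probs_new = np.array([[0.6, 0.4], [0.8, 0.2]])
actions = np.array([0, 1])    # actions sampled from the old policy
adv = np.array([1.0, -1.0])   # advantage estimates for those actions

idx = np.arange(len(actions))
logp_old = np.log(probs_old[idx, actions])
logp_new = np.log(probs_new[idx, actions])

surr = surrogate_advantage(logp_new, logp_old, adv)
kl = mean_kl(probs_new, probs_old)
print(f"surrogate advantage: {surr:.4f}, mean KL: {kl:.4f}")
```

When the new and old probabilities coincide, every ratio is 1 and the KL estimate is exactly zero, while the surrogate estimate reduces to the mean of the sampled advantages.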

```{note}
The objective and constraint are both zero when $\theta = \theta_k$. Furthermore, the gradient of the constraint with respect to $\theta$ is zero when $\theta = \theta_k$.
```

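The theoretical TRPO update isn’t the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around $\theta_k$ (first order for the objective; second order for the constraint, since its value and gradient both vanish at $\theta_k$):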

$$
\begin{split}
{\mathcal L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k) \\
\bar{D}_{KL}(\theta || \theta_k) & \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)
\end{split}
$$

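where $g$ is the gradient of the surrogate advantage and $H$ is the Hessian of the average KL-divergence, both evaluated at $\theta = \theta_k$. This results in an approximate optimization problem,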

$$
\begin{split}
\theta_{k+1} = \arg \max_{\theta} \; & g^T (\theta - \theta_k) \\
\text{s.t.} \; & \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.
\end{split}
$$

```{note}
By happy coincidence, the gradient $g$ of the surrogate advantage function with respect to $\theta$, evaluated at $\theta = \theta_k$, is exactly equal to the policy gradient, $\nabla_{\theta} J(\pi_{\theta})$!
```
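To see why, differentiate the surrogate advantage under the expectation, which is taken over data from $\pi_{\theta_k}$ and therefore does not depend on $\theta$. The following is a sketch that glosses over discounting and the exact weighting of states:

$$
\begin{split}
\nabla_{\theta} {\mathcal L}(\theta_k, \theta) \Big|_{\theta_k}
&= \underset{s,a \sim \pi_{\theta_k}}{\mathbb{E}}\left[{
    \frac{\nabla_{\theta} \pi_{\theta}(a|s) \big|_{\theta_k}}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)
    }\right] \\
&= \underset{s,a \sim \pi_{\theta_k}}{\mathbb{E}}\left[{
    \nabla_{\theta} \log \pi_{\theta}(a|s) \big|_{\theta_k} A^{\pi_{\theta_k}}(s,a)
    }\right],
\end{split}
$$

which is the advantage-based form of the policy gradient evaluated at $\theta_k$.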

This approximate problem can be analytically solved by the methods of Lagrangian duality, yielding the solution:

$$
\theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.
$$
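Intuitively, maximizing a linear objective over an ellipsoid places the step along $H^{-1} g$, scaled so that the quadratic constraint holds with equality. The following is a minimal numerical sketch of that formula, assuming $H$ is small, symmetric, and positive definite so that it can be formed explicitly and solved directly. The function name `natural_gradient_step` and the values of `g`, `H`, and `delta` are made up for illustration.

```python
import numpy as np

def natural_gradient_step(g, H, delta):
    """Return sqrt(2 * delta / (g^T H^{-1} g)) * H^{-1} g.

    Assumes H is symmetric positive definite and small enough to solve directly.
    """
    x = np.linalg.solve(H, g)                   # x = H^{-1} g, without forming the inverse
    step_size = np.sqrt(2.0 * delta / (g @ x))  # scale that makes the constraint tight
    return step_size * x

# Toy problem (made-up numbers).
g = np.array([1.0, -2.0])
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
delta = 0.01

step = natural_gradient_step(g, H, delta)
quad_kl = 0.5 * step @ H @ step  # quadratic approximation of the KL constraint
print("step:", step)
print("quadratic KL:", quad_kl, "target delta:", delta)
```

The quadratic KL estimate of the resulting step matches `delta` up to floating-point error, which is what the closed-form solution predicts.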

If we were to stop here and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. The problem is that, due to the approximation errors introduced by the Taylor expansion, this update may not satisfy the KL constraint or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

$$
\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,
$$

where $\alpha \in (0,1)$ is the backtracking coefficient, and $j$ is the smallest nonnegative integer such that $\pi_{\theta_{k+1}}$ satisfies the KL constraint and produces a positive surrogate advantage.
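The line search can be sketched as a short loop. This is an illustrative sketch, not a reference implementation: the function name `backtracking_line_search`, the default `alpha=0.8`, the `max_backtracks` cap, and the fallback of keeping $\theta_k$ when no step is accepted are all assumptions, and the toy `kl_fn` and `surrogate_fn` stand in for quantities that would be evaluated on sampled data.

```python
import numpy as np

def backtracking_line_search(theta_k, full_step, surrogate_fn, kl_fn, delta,
                             alpha=0.8, max_backtracks=10):
    """Find the smallest j with KL <= delta and a positive surrogate advantage.

    Returns (theta_new, j), or (theta_k, None) if no candidate is accepted
    (an assumed fallback: leave the policy unchanged).
    """
    for j in range(max_backtracks):
        theta_new = theta_k + alpha**j * full_step
        if kl_fn(theta_new) <= delta and surrogate_fn(theta_new) > 0:
            return theta_new, j
    return theta_k, None

# Toy problem (made-up): H = I, so the proposed step is sqrt(2 * delta / g^T g) * g.
theta_k = np.zeros(2)
g = np.array([1.0, 0.0])
delta = 0.02
full_step = np.sqrt(2.0 * delta / (g @ g)) * g

def kl_fn(theta):
    # Quadratic term plus a cubic term that mimics Taylor approximation error.
    d = np.linalg.norm(theta - theta_k)
    return 0.5 * d**2 + d**3

def surrogate_fn(theta):
    # Linear term plus a curvature penalty the linearization ignores.
    d = theta - theta_k
    return g @ d - d @ d

theta_next, j = backtracking_line_search(theta_k, full_step, surrogate_fn, kl_fn, delta)
print("accepted exponent j:", j, "new parameters:", theta_next)
```

In this toy setup the full step overshoots the true KL limit because of the cubic term, so the search shrinks the step before accepting it.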