# TRPO

```{note}
Policy gradient (PG) methods are popular in reinforcement learning (RL). The basic idea is to use gradient ascent to move the policy in the direction of the steepest increase in expected reward. However, a first-order optimizer is not very accurate in regions where the objective is strongly curved. An overconfident step can produce a bad policy update that ruins the training progress. TRPO (Trust Region Policy Optimization) is one of the most frequently cited papers addressing this issue.
```

To understand TRPO, it helps to discuss three key concepts first.

## Minorize-Maximization algorithm

Can we guarantee that every policy update improves the expected reward? The MM algorithm achieves this iteratively by maximizing a lower-bound function (the blue line below) that approximates the expected reward locally.

![](images/trpo1.webp)

We start with an initial policy guess. We find a lower bound $M$ that approximates the expected reward $\eta$ locally at the current guess. We locate the optimal point of $M$ and use it as the next guess. We then construct a new lower bound at that point and repeat. Because each new guess maximizes a function that lies below $\eta$ and touches it at the current guess, $\eta$ never decreases from one iteration to the next, and the guesses converge toward a (locally) optimal policy. To make this work, $M$ should be easier to optimize than $\eta$. As a preview, $M$ is a quadratic function of the form

$$ax^{2} + bx + c$$

but in vector form:

$$g\cdot(\theta - \theta_{\text{old}}) - \frac{\beta}{2}(\theta - \theta_{\text{old}})^{T}F(\theta - \theta_{\text{old}})$$

When $F$ is positive definite, this is a concave quadratic (its negation is convex), and maximizing such a function is a well-studied problem.
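
To make the iteration concrete, here is a minimal sketch of MM on a toy problem. Everything in it is an assumption chosen for illustration: `eta` is a stand-in objective (a sum of sines, not an actual expected reward), `F` is the identity matrix, and `beta = 1` is picked so that the quadratic $g\cdot\Delta - \frac{\beta}{2}\Delta^{T}F\Delta$ really is a lower bound on the change in `eta`. Each step jumps to the maximizer of the quadratic, $\Delta = F^{-1}g/\beta$.

```python
import numpy as np

def eta(theta):
    """Toy stand-in for the expected reward (an assumption for illustration)."""
    return np.sin(theta).sum()

def eta_grad(theta):
    """Gradient of the toy objective."""
    return np.cos(theta)

# The Hessian of eta is diag(-sin(theta)) >= -I, so with F = I and beta = 1
# the quadratic g.d - (beta/2) d^T F d is a valid lower bound on
# eta(theta + d) - eta(theta).
beta = 1.0
F = np.eye(2)

theta = np.array([0.0, 2.5])   # initial guess
history = [eta(theta)]
for _ in range(20):
    g = eta_grad(theta)
    # Maximizer of the quadratic lower bound: d = F^{-1} g / beta.
    delta = np.linalg.solve(F, g) / beta
    theta = theta + delta
    history.append(eta(theta))

print(history[:4])
```

Because every step maximizes a lower bound that touches `eta` at the current guess, the values in `history` should never decrease. In TRPO itself the objective, $g$, and $F$ come from the policy rather than from a closed-form function like this one.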

## Trust region

There are two major families of optimization methods: line search and trust region. Gradient descent is a line search method: we determine the descent direction first and then take a step in that direction.

![](images/trpo2.webp)

In the trust region method, we first determine the maximum step size that we are willing to explore, and then we locate the optimal point within this trust region.
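
The difference in the order of decisions can be sketched in a few lines. This is an illustrative sketch, not TRPO's actual procedure: the function names, the test function `f`, and the matrix `A` are all assumptions. The line search fixes the direction and then shrinks the step (backtracking), while the trust-region step fixes the radius and then minimizes a quadratic model inside it, here only along the negative gradient (the so-called Cauchy point).

```python
import numpy as np

def backtracking_line_search(f, x, g, step=1.0, shrink=0.5, c=1e-4):
    """Line search: fix the direction (-g), then shrink the step
    until f decreases sufficiently (Armijo condition)."""
    direction = -g
    while f(x + step * direction) > f(x) + c * step * (g @ direction):
        step *= shrink
    return x + step * direction

def cauchy_point_step(g, B, radius):
    """Trust region: fix the radius, then minimize the model
    m(p) = g.p + 0.5 p^T B p along -g subject to ||p|| <= radius."""
    g_norm = np.linalg.norm(g)
    curvature = g @ B @ g
    if curvature <= 0:
        tau = 1.0                      # model keeps decreasing: go to the boundary
    else:
        tau = min(g_norm**3 / (radius * curvature), 1.0)
    return -tau * radius / g_norm * g

# Hypothetical ill-conditioned quadratic used only for this demo.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
g = A @ x                              # gradient of f at x

x_line = backtracking_line_search(f, x, g)
radius = 0.1
p = cauchy_point_step(g, A, radius)
x_trust = x + p

print(f(x), f(x_line), f(x_trust), np.linalg.norm(p))
```

Both steps reduce `f`, but only the trust-region step is guaranteed to stay within the chosen radius of the current point.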

## Importance sampling

Importance sampling estimates the expected value of $f(x)$, where $x$ follows a distribution $p$:

$$\mathbb{E}_{x\sim p}\left[f(x)\right]$$

In importance sampling, we sample data from a different distribution $q$ instead of $p$ and use the probability ratio between $p$ and $q$ to reweight the result:

$$\mathbb{E}_{x\sim q}\left[\frac{f(x)p(x)}{q(x)}\right]$$
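
A quick numerical check of this identity, as a minimal sketch: the choices of $p = \mathcal{N}(1, 1)$, $q = \mathcal{N}(0, 2^{2})$, and $f(x) = x^{2}$ are assumptions made only for illustration. For this $p$, the exact value is $\mathbb{E}_{x\sim p}[x^{2}] = \mu^{2} + \sigma^{2} = 2$.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
n = 200_000
f = lambda x: x ** 2

# Sample from q, then reweight each sample by p(x) / q(x).
x_q = rng.normal(0.0, 2.0, size=n)
weights = normal_pdf(x_q, 1.0, 1.0) / normal_pdf(x_q, 0.0, 2.0)
is_estimate = np.mean(f(x_q) * weights)

# Reference: sample from p directly.
x_p = rng.normal(1.0, 1.0, size=n)
direct_estimate = np.mean(f(x_p))

exact = 1.0 ** 2 + 1.0 ** 2   # mean^2 + variance for p = N(1, 1)
print(is_estimate, direct_estimate, exact)
```

Here $q$ is deliberately wider than $p$. If $q$ puts very little mass where $p(x)f(x)$ is large, the weights become extreme and the estimate gets noisy.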

Let’s look in detail at how the importance sampling concept applies to PG. The standard policy gradient objective is:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_{t}\left[\log\pi_{\theta}(a_{t}|s_{t})\hat{A}_{t}\right]$$

This can also be expressed with importance sampling (IS), where the actions are sampled from the old policy $\pi_{\theta_{\text{old}}}$:

$$L_{\theta_{\text{old}}}^{IS} = \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}|s_{t})}\hat{A}_{t}\right]$$

The gradients of both objectives are the same when evaluated at $\theta_{\text{old}}$, i.e. the two objectives agree to first order around the current policy:

$$\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big|_{\theta_{\text{old}}} = \frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})\Big|_{\theta_{\text{old}}}}{\pi_{\theta_{\text{old}}}(a_{t}|s_{t})} = \nabla_{\theta}\left(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}|s_{t})}\right)\Big|_{\theta_{\text{old}}}$$
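
The identity can be checked numerically. The sketch below assumes a hypothetical softmax policy over three actions in a single state, with the logits themselves as the parameters $\theta$, and uses central finite differences rather than an autodiff library. It compares the gradient of $\log\pi_{\theta}(a)$ with the gradient of the ratio $\pi_{\theta}(a)/\pi_{\theta_{\text{old}}}(a)$, first at $\theta_{\text{old}}$ and then at a different point.

```python
import numpy as np

def softmax(theta):
    """Action probabilities of a toy softmax policy (logits = parameters)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def numerical_grad(fn, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (fn(theta + step) - fn(theta - step)) / (2 * eps)
    return grad

theta_old = np.array([0.2, -0.5, 1.0])   # hypothetical old parameters
action = 1                               # hypothetical sampled action
pi_old = softmax(theta_old)[action]

log_prob = lambda th: np.log(softmax(th)[action])
ratio = lambda th: softmax(th)[action] / pi_old

# At theta_old the two gradients coincide.
grad_log = numerical_grad(log_prob, theta_old)
grad_ratio = numerical_grad(ratio, theta_old)

# Away from theta_old they differ by the factor pi_theta(a) / pi_old(a).
theta_new = theta_old + np.array([0.5, 0.0, -0.5])
grad_log_new = numerical_grad(log_prob, theta_new)
grad_ratio_new = numerical_grad(ratio, theta_new)

print(grad_log, grad_ratio)
print(grad_log_new, grad_ratio_new)
```

The match holds only at $\theta_{\text{old}}$. Once $\theta$ moves away, the ratio's gradient is scaled by $\pi_{\theta}(a)/\pi_{\theta_{\text{old}}}(a)$, so the two objectives are interchangeable only locally.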