# Generalized Advantage Estimation

```{note}
The two main challenges of policy gradient methods are the large number of samples typically required and the difficulty
of obtaining stable and steady improvement despite the nonstationarity of the incoming data.

We address the first challenge by using value functions to substantially
reduce the variance of policy gradient estimates at the cost of some bias. We address the second challenge by using a trust region optimization
procedure for both the policy and the value function, which are represented by
neural networks.
```

## Preliminaries

There are several different related expressions for the policy gradient, which
have the form

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty}\Psi_{t}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\right]$$

where $\Psi_{t}$ may be one of the following:

1. $\sum_{t=0}^{\infty}r_{t}$: total reward of the trajectory.
2. $\sum_{t'=t}^{\infty}r_{t'}$: reward following action $a_t$.
3. $\sum_{t'=t}^{\infty}r_{t'} - b(s_t)$: baselined version of the previous formula.
4. $Q_{\pi}(s_t, a_t)$: state-action value function.
5. $A_{\pi}(s_t, a_t)$: advantage function.
6. $r_{t} + V_{\pi}(s_{t+1}) - V_{\pi}(s_t)$: TD residual.

The choice $\Psi_{t} = A_{\pi}(s_t, a_t)$ yields almost the lowest possible variance, though in practice, the
advantage function is not known and must be estimated.
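
To make the list above concrete, here is a minimal NumPy sketch that computes choices 1, 2, 3 and 6 for a single finite trajectory. The function name `psi_choices`, the finite horizon, and the `values` array of value estimates are assumptions made for illustration. Choices 4 and 5 need the true $Q_{\pi}$ and $A_{\pi}$, which are generally unknown, so they are left out.

```python
import numpy as np

def psi_choices(rewards, values, baseline=None):
    """Compute several candidate weights Psi_t for one finite trajectory.

    rewards:  shape (T,), rewards r_0 .. r_{T-1}
    values:   shape (T+1,), hypothetical estimates V(s_0) .. V(s_T);
              use V(s_T) = 0 when s_T is terminal
    baseline: optional shape (T,), b(s_t); defaults to values[:-1]
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)

    # Choice 1: total reward of the trajectory, identical for every t.
    total = np.full(T, rewards.sum())
    # Choice 2: reward following action a_t (reverse cumulative sum).
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    # Choice 3: reward-to-go minus a state-dependent baseline.
    b = values[:-1] if baseline is None else np.asarray(baseline, dtype=float)
    baselined = reward_to_go - b
    # Choice 6: TD residual r_t + V(s_{t+1}) - V(s_t).
    td_residual = rewards + values[1:] - values[:-1]

    return {
        "total": total,
        "reward_to_go": reward_to_go,
        "baselined": baselined,
        "td_residual": td_residual,
    }

psi = psi_choices(rewards=[1.0, 2.0, 3.0], values=[5.0, 4.0, 2.0, 0.0])
```

In this toy trajectory the total-reward weight is the same at every step, while the reward-to-go and TD-residual weights depend only on what happens from step $t$ onward, which is the source of their lower variance.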

We will introduce a parameter $\gamma$ that allows us to reduce variance by downweighting rewards corresponding
to delayed effects, at the cost of introducing bias. This parameter corresponds to the
discount factor used in discounted formulations of MDPs, but we treat it as a variance reduction
parameter in an undiscounted problem.

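The discounted value functions are given by

$$V_{\pi,\gamma}(s_t) := \mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left[\sum_{l=0}^{\infty}\gamma^{l}r_{t+l}\right]$$

$$Q_{\pi,\gamma}(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty}\gamma^{l}r_{t+l}\right]$$

$$A_{\pi,\gamma}(s_t, a_t) := Q_{\pi,\gamma}(s_t, a_t) - V_{\pi,\gamma}(s_t)$$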
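
As a rough illustration, the sum inside the expectation can be evaluated on one sampled trajectory with a backward recursion. This is a minimal sketch: the function name `discounted_returns` and the truncation to a finite horizon are assumptions, and each output entry is only a single-sample Monte Carlo estimate of the discounted value at $s_t$.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return sum_{l>=0} gamma^l * r_{t+l} for every t of a finite trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Work backwards so each step reuses the return of the next step:
    # G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

undiscounted = discounted_returns([1.0, 1.0, 1.0], gamma=1.0)
discounted = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

With $\gamma = 1$ this reduces to the plain reward-to-go. With $\gamma < 1$, rewards far in the future contribute less, which lowers variance but biases the estimate of the undiscounted return.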

Before proceeding, we will introduce the notion of a $\gamma$-just estimator of the advantage function,
which is an estimator that does not introduce bias when we use it in place of $A_{\pi,\gamma}$.

**Definition 1.** The estimator $\hat{A}_{t}$ is $\gamma$-just if

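$$\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[\hat{A}_{t}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\right]=\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[A_{\pi,\gamma}(s_t, a_t)\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\right]$$

It follows that if $\hat{A}_{t}$ is $\gamma$-just for all $t$, then summing over $t$ recovers the discounted policy gradient:

$$\mathbb{E}_{s_{0:\infty},a_{0:\infty}}\left[\sum_{t=0}^{\infty}\hat{A}_{t}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\right] = g^{\gamma}$$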

**Proposition 1.** Suppose that $\hat{A}_{t}$ can be written in the form $\hat{A}_{t} = Q_t(s_{t:\infty},a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$ such that for all $(s_t, a_t)$, $\mathbb{E}_{s_{t+1:\infty},a_{t+1:\infty}\mid s_t,a_t}\left[Q_{t}(s_{t:\infty},a_{t:\infty})\right]=Q_{\pi,\gamma}(s_t, a_t)$. Then $\hat{A}_{t}$ is $\gamma$-just.
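
The key step behind Proposition 1 is that a baseline which does not depend on $a_t$ contributes zero to the expected gradient. The sketch below checks this exactly in the simplest special case, a single state with a softmax policy over three actions. The helper names, the example logits, the per-action values `q`, and the constant `b` are all hypothetical. It does not cover the general case, where $b_t$ may depend on the past trajectory.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # shift for numerical stability
    return z / z.sum()

def expected_score_weighted(logits, weights):
    """Exact E_{a ~ pi}[weights[a] * grad_logits log pi(a)] for a softmax policy."""
    pi = softmax(logits)
    n = len(logits)
    # For a softmax policy, grad_logits log pi(a) = onehot(a) - pi.
    score = np.eye(n) - pi            # row a holds the score of action a
    # Sum over actions of pi[a] * weights[a] * score[a].
    return (pi * weights) @ score

logits = np.array([0.2, -1.0, 0.7])   # hypothetical policy parameters
q = np.array([1.0, -2.0, 0.5])        # hypothetical per-action values
b = 3.7                               # arbitrary action-independent baseline

grad_with_q = expected_score_weighted(logits, q)
grad_with_advantage = expected_score_weighted(logits, q - b)
baseline_term = expected_score_weighted(logits, np.full(3, b))
```

Here `baseline_term` is zero up to floating-point error because $\sum_a \pi(a)\nabla\log\pi(a) = \nabla\sum_a \pi(a) = 0$, so `grad_with_q` and `grad_with_advantage` agree. Subtracting the baseline changes the variance of sampled estimates but not their expectation.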

## Advantage function estimation