# Policy Gradient

```{note}
Since the beginning of the course, we have only studied value-based methods, where we estimate a value function as an intermediate step towards finding an optimal policy. Finding an optimal value function leads to having an optimal policy:

$$\pi^{\ast}(s) = \arg\max_{a}Q^{\ast}(s, a)$$

With policy-based methods, we want to optimize the policy directly without having an intermediate step of learning a value function.
```
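
To make the value-based recipe concrete, here is a minimal sketch of reading a greedy policy off a Q-table. The states, actions, and Q-values are made up for illustration; only the $\arg\max$ rule comes from the formula in the note.

```python
# Hypothetical Q-table for a toy problem: Q[state][action] (values are made up).
Q = {
    "s0": {"left": 0.1, "stay": 0.5, "right": 0.4},
    "s1": {"left": 0.7, "stay": 0.2, "right": 0.3},
}


def greedy_policy(q_table, state):
    """Return argmax_a Q(state, a), i.e. the action with the highest Q-value."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


for state in Q:
    print(state, "->", greedy_policy(Q, state))
```

The policy here is never stored or learned on its own: it is recomputed from the value estimates whenever an action is needed.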

## The policy-gradient methods

In policy-based methods, we directly learn to approximate $\pi^{\ast}$ without having to learn a value function. 

* The idea is to parameterize the policy, for instance with a neural network $\pi_{\theta}$ that outputs a probability distribution over actions (a stochastic policy).

* Our objective is then to maximize the performance of the parameterized policy using gradient ascent. To do that, we define an objective function $J(\theta)$, the expected cumulative reward, and we look for the parameters $\theta$ that maximize it.

![](images/policy2.png)
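
A parameterized stochastic policy can be sketched in a few lines. The example below assumes a linear model followed by a softmax instead of a neural network, to keep it self-contained; the feature vector, the parameter values, and the function names are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, state_features):
    """pi_theta(. | s): one linear score per action, followed by a softmax."""
    logits = [
        sum(w * x for w, x in zip(theta_a, state_features)) for theta_a in theta
    ]
    return softmax(logits)


def sample_action(theta, state_features, rng):
    """Draw an action index according to pi_theta(. | s)."""
    probs = policy_probs(theta, state_features)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up parameters: 3 actions, 2 state features.
theta = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
state = [1.0, 2.0]

probs = policy_probs(theta, state)
action = sample_action(theta, state, random.Random(0))
print(probs, action)
```

Changing `theta` changes the action probabilities smoothly, which is what makes gradient-based optimization of the policy possible.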

## The advantages and disadvantages of policy-gradient methods

There are multiple advantages over value-based methods. Let’s see some of them:

* Policy-gradient methods can learn a stochastic policy directly, while value-based methods, which act greedily with respect to a value function, cannot do so naturally.

* Policy-gradient methods are more effective in high-dimensional and continuous action spaces. The problem with Deep Q-learning is that it assigns a score to each possible action, at each time step, given the current state, which becomes impractical when the number of actions is very large or infinite. Instead, with policy-gradient methods, we output a probability distribution over actions.
    
* Policy-gradient methods have better convergence properties. In value-based methods, we use an aggressive operator to change the value function: we take the maximum over Q-estimates. Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
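
The last point can be illustrated numerically. In the sketch below the two value estimates are made up, and a softmax over the same numbers is used as a stand-in for a smoothly parameterized policy (an assumption for illustration, not a specific algorithm from this section).

```python
import math


def greedy_probs(q_values):
    """Action probabilities of the greedy (argmax) policy."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def softmax_probs(preferences):
    """Action probabilities of a softmax policy over the same numbers."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up estimates for two actions, before and after a tiny update.
before = [1.000, 1.001]
after = [1.002, 1.001]

greedy_before, greedy_after = greedy_probs(before), greedy_probs(after)
soft_before, soft_after = softmax_probs(before), softmax_probs(after)

print("greedy :", greedy_before, "->", greedy_after)
print("softmax:", soft_before, "->", soft_after)
```

A change of 0.002 in one estimate flips the greedy policy from one action to the other, while the softmax probabilities barely move.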
    
Naturally, policy-gradient methods also have some disadvantages:

* Frequently, policy-gradient methods converge to a local maximum instead of the global optimum.

* Policy-gradient methods improve the policy in small steps, so training can take longer.

* Policy-gradient estimates can have high variance.

## The Policy Gradient Theorem

The objective function outputs the expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau\sim\pi}[R(\tau)]$$

It can be formulated as:

![](images/policy3.png)

Policy-gradient is an optimization problem: we want to find the values of $\theta$ that maximize our objective function $J(\theta)$, so we use gradient ascent:

$$\theta\leftarrow\theta + \alpha \nabla_{\theta} J(\theta)$$
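
As a minimal sketch of this update rule, the example below runs gradient ascent on a toy one-dimensional objective, $J(\theta) = -(\theta - 3)^2$, chosen only because its gradient is known in closed form. It is not a reinforcement-learning objective; the step size and number of steps are arbitrary.

```python
def objective(theta):
    """Toy objective with a single maximum at theta = 3 (illustration only)."""
    return -(theta - 3.0) ** 2


def objective_gradient(theta):
    """Closed-form derivative of the toy objective."""
    return -2.0 * (theta - 3.0)


def gradient_ascent(theta, alpha=0.1, num_steps=100):
    """Repeat theta <- theta + alpha * grad J(theta)."""
    for _ in range(num_steps):
        theta = theta + alpha * objective_gradient(theta)
    return theta


theta_star = gradient_ascent(theta=0.0)
print(theta_star, objective(theta_star))
```

In policy-gradient methods the update has the same form, but the gradient is not available in closed form and has to be estimated.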

However, there are two problems with computing the derivative of $J(\theta)$:

1. We can’t calculate the true gradient of the objective function, since it requires calculating the probability of each possible trajectory, which is computationally very expensive. So we instead estimate the gradient from sampled trajectories.

2. To differentiate this objective function, we need to differentiate the state distribution, which is determined by the environment dynamics. The problem is that we can’t differentiate it because we may not know these dynamics.
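
The first problem can be illustrated on the objective itself. The sketch below uses a made-up toy problem (10 steps, two actions, a fixed probability of picking action 1, and a return equal to the number of times action 1 is picked) and compares summing over all $2^{10}$ trajectories with averaging over sampled ones. The same sampling idea is what a gradient estimate relies on.

```python
import itertools
import random

H = 10            # number of steps per trajectory (made up)
P_ACTION_1 = 0.7  # fixed probability of choosing action 1 at every step (made up)


def trajectory_return(actions):
    """R(tau): here simply the number of times action 1 was chosen."""
    return sum(actions)


def trajectory_prob(actions):
    """P(tau): product of the per-step action probabilities."""
    prob = 1.0
    for a in actions:
        prob *= P_ACTION_1 if a == 1 else 1.0 - P_ACTION_1
    return prob


def exact_objective():
    """Sum over every possible trajectory: 2**H terms."""
    return sum(
        trajectory_prob(tau) * trajectory_return(tau)
        for tau in itertools.product([0, 1], repeat=H)
    )


def sampled_objective(num_samples, rng):
    """Monte Carlo estimate: average the return of sampled trajectories."""
    total = 0.0
    for _ in range(num_samples):
        tau = [1 if rng.random() < P_ACTION_1 else 0 for _ in range(H)]
        total += trajectory_return(tau)
    return total / num_samples


exact = exact_objective()
estimate = sampled_objective(5000, random.Random(0))
print(exact, estimate)
```

The exact sum grows exponentially with the horizon, while the cost of the sampled estimate depends only on the number of trajectories drawn.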

Fortunately, we can use the Policy Gradient Theorem, which reformulates the gradient of the objective function into an expression that does not involve differentiating the state distribution.

![](images/policy4.png)

````{prf:proof}
We have:

```{math}
\begin{aligned}
\nabla J(\theta) &= \nabla_{\theta}\sum_{\tau}P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}\nabla_{\theta}P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\frac{\nabla_{\theta}P(\tau;\theta)}{P(\tau;\theta)}R(\tau)\\
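&=\sum_{\tau}P(\tau;\theta)\nabla_{\theta}\log P(\tau;\theta)R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\nabla_{\theta}\log\left[\mu(s_{0})\prod_{t=0}^{H}P(s_{t+1}|s_{t}, a_{t})\pi_{\theta}(a_{t}|s_{t})\right]R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\nabla_{\theta}\left[\log\mu(s_{0}) + \sum_{t=0}^{H}\log P(s_{t+1}|s_{t}, a_{t}) + \sum_{t=0}^{H}\log\pi_{\theta}(a_{t}|s_{t})\right]R(\tau)\\
&=\sum_{\tau}P(\tau;\theta)\sum_{t=0}^{H}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)\\
&=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{H}\nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t})R(\tau)\right]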
\end{aligned}
```

````
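
The identity can be checked numerically on a problem small enough to enumerate. The sketch below assumes a one-step problem with three actions, a softmax policy with one parameter per action, and made-up rewards. It compares the score-function expression $\sum_{a}\pi_{\theta}(a)\nabla_{\theta}\log\pi_{\theta}(a)R(a)$ with a finite-difference gradient of $J(\theta)$.

```python
import math

REWARDS = [1.0, 2.0, 5.0]  # made-up return for each of the three actions


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))


def grad_log_prob(theta, action):
    """Gradient of log pi_theta(action) for a softmax policy: one_hot(action) - pi."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(theta))]


def score_function_gradient(theta):
    """E_pi[ grad log pi(a) * R(a) ], computed exactly by summing over actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for action, (p, r) in enumerate(zip(probs, REWARDS)):
        g = grad_log_prob(theta, action)
        for k in range(len(theta)):
            grad[k] += p * g[k] * r
    return grad


def finite_difference_gradient(theta, eps=1e-6):
    """Central finite differences of J, used only as a reference."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(up) - objective(down)) / (2 * eps))
    return grad


theta = [0.2, -0.1, 0.4]
analytic = score_function_gradient(theta)
numeric = finite_difference_gradient(theta)
print(analytic)
print(numeric)
```

The two gradients agree up to finite-difference error, and the score-function version only needs $\nabla_{\theta}\log\pi_{\theta}$, never the derivative of the environment dynamics.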

## The REINFORCE algorithm (Monte Carlo REINFORCE)

In a loop:

* Use the policy $\pi_{\theta}$ to collect some episodes.

* Use these episodes to estimate the gradient, then update $\theta$ with a gradient-ascent step.

![](images/policy5.png)

We can interpret this update as follows: $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ is the direction of steepest increase of the (log) probability of selecting action $a_{t}$ from state $s_{t}$. This tells us:

* If the return $R(\tau)$ is high, the update pushes up the probabilities of the (state, action) pairs visited in the episode.

* If the return is low, these probabilities are pushed up less, and they are pushed down when the return is negative.
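
The loop above can be sketched in plain Python. Everything about the environment here is a made-up assumption for illustration: the state is just the time step, and the agent gets a reward of 1 whenever its action matches a fixed target action for that step. The policy is a tabular softmax, and the gradient estimate averages $\sum_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)$ over a batch of episodes. The batch size, learning rate, and iteration count are arbitrary choices.

```python
import math
import random

NUM_STATES = 4                 # the state is simply the time step 0..3
NUM_ACTIONS = 2
TARGET_ACTION = [0, 1, 1, 0]   # hypothetical environment: reward 1 for matching this


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng):
    """Roll out pi_theta once; return the (state, action) pairs and R(tau)."""
    steps, episode_return = [], 0.0
    for state in range(NUM_STATES):
        probs = softmax(theta[state])
        action = rng.choices(range(NUM_ACTIONS), weights=probs, k=1)[0]
        episode_return += 1.0 if action == TARGET_ACTION[state] else 0.0
        steps.append((state, action))
    return steps, episode_return


def reinforce(num_iterations=500, batch_size=10, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(num_iterations):
        grad = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
        # 1. Collect episodes with the current policy.
        for _ in range(batch_size):
            steps, episode_return = run_episode(theta, rng)
            # 2. Accumulate grad log pi(a_t | s_t) * R(tau), averaged over the batch.
            for state, action in steps:
                probs = softmax(theta[state])
                for k in range(NUM_ACTIONS):
                    score = (1.0 if k == action else 0.0) - probs[k]
                    grad[state][k] += score * episode_return / batch_size
        # 3. Gradient-ascent step on the policy parameters.
        for state in range(NUM_STATES):
            for k in range(NUM_ACTIONS):
                theta[state][k] += alpha * grad[state][k]
    return theta


theta = reinforce()
learned = [softmax(theta[s]) for s in range(NUM_STATES)]
for s, probs in enumerate(learned):
    print(s, [round(p, 3) for p in probs])
```

Every action in an episode is weighted by the same return $R(\tau)$, so an action is reinforced even when the reward came from other steps. This is one source of the high variance mentioned earlier.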