# Policy Gradients
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/11_policy_gradients.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" pytorch-ignite

## Values and Policy
Unlike the value iteration methods (Q-Learning), which try to estimate the state values (or state-action values), the *policy gradient* technique focuses directly on the policy $\pi(s)$.

Direct policy modeling has several advantages:
* From a certain point of view, we don't care that much about the expected discounted rewards but rather about the decision/action $\pi(s)$ to take in each state $s$
* As we saw earlier with the *Categorical DQN*, learning a distribution helps to better capture the underlying MDP (especially in stochastic environments)
* It becomes quite hard to determine the best action to take when the action space is large or even continuous. The DQN model of $Q(s, a)$ is highly non-linear, so the optimization problem $a^* = \arg\max_a Q(s, a)$ can be hard to solve.

In the value iteration case, our DQN parametrized the state-action values as $DQN(s) \to Q_\mathbf{w}(s, \cdot)$. Similarly, we will represent the policy as a probability distribution over actions $\pi_\mathbf{w}(s)$ parametrized by the NN weights $\mathbf{w}$.

*Modelling the output as action (class) probabilities is a typical technique in classification tasks that gives us a smooth representation (intuitively, changing NN weights $\mathbf{w}$ a bit changes $\pi$ a bit as well - compared to the case with discrete action labels which would change in steps).*
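The smoothness argument can be illustrated with a small sketch in plain Python (standard library only, to keep it dependency-free). The logits below are hypothetical network outputs for a single state, invented for illustration. A tiny change in them flips the discrete action label, while the softmax probabilities barely move.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(logits):
    """Discrete action label: index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical outputs for one state, before and after a tiny weight update
before = [1.00, 1.01, 0.50]
after = [1.02, 1.01, 0.50]

probs_before = softmax(before)
probs_after = softmax(after)

# Largest change in any action probability caused by the update
max_shift = max(abs(p - q) for p, q in zip(probs_before, probs_after))

# The discrete label, in contrast, changes in a single step
label_before = greedy_action(before)
label_after = greedy_action(after)

print(f"max probability shift: {max_shift:.4f}")
print(f"greedy label: {label_before} -> {label_after}")
```

Here the probabilities shift by less than one percentage point, whereas the greedy label switches from one action to another.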

## Gradients of the Policy

*Policy Gradient* methods are closely related to the *Cross-Entropy Method* introduced earlier. The gradient gives the direction in which we want to change the NN weights to maximize the accumulated reward. Its scale is proportional to the state-action value $Q(s, a)$ and its direction is given by the gradient of the log-probability of the action taken:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log \pi(a | s)]
$$
where the expectation means that we average the gradient over several steps.
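As a sketch of what this averaging looks like, assume we have collected $N$ steps $(s_i, a_i)$ while following the policy. The index $i$ and the count $N$ are notation introduced here for illustration. The expectation is then approximated by a sample mean:
$$
\nabla J \approx \frac{1}{N} \sum_{i=1}^{N} Q(s_i, a_i) \nabla \log \pi(a_i | s_i)
$$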

Equivalently, we can say that we minimize the loss function $\mathcal{L} = -Q(s, a) \log \pi(a | s)$ (Note: SGD minimizes the loss function but we want to maximize the objective $J$, hence the minus sign).
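To make the loss concrete, here is a minimal sketch in plain Python for a single $(s, a)$ sample. It assumes a softmax policy whose logits are optimized directly, standing in for the NN weights. The values of `action`, `q_value` and `lr` are made up for illustration. For a softmax policy, the gradient of $\log \pi(a | s)$ with respect to logit $i$ is $1[i = a] - \pi(i | s)$, which the code uses instead of automatic differentiation.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pg_loss(logits, action, q_value):
    """L = -Q(s, a) * log pi(a | s) for one sampled (s, a) pair."""
    return -q_value * math.log(softmax(logits)[action])


def pg_loss_grad(logits, action, q_value):
    """Analytic gradient of the loss with respect to the logits."""
    probs = softmax(logits)
    return [
        -q_value * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


# Illustrative values (assumptions): three actions, uniform initial policy
logits = [0.0, 0.0, 0.0]
action, q_value, lr = 2, 1.5, 0.1

p_before = softmax(logits)[action]
loss_before = pg_loss(logits, action, q_value)

# One plain SGD step: move against the gradient of the loss
grad = pg_loss_grad(logits, action, q_value)
logits = [x - lr * g for x, g in zip(logits, grad)]

p_after = softmax(logits)[action]
loss_after = pg_loss(logits, action, q_value)

print(f"pi(a|s): {p_before:.4f} -> {p_after:.4f}")
print(f"loss:    {loss_before:.4f} -> {loss_after:.4f}")
```

With a positive $Q(s, a)$, the step lowers the loss and raises the probability of the chosen action. A negative $Q(s, a)$ would push that probability down instead.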