# Policy Gradients
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/11_policy_gradients.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" pytorch-ignite

## Values and Policy
Unlike value-based methods such as Q-Learning, which estimate state (or state-action) values, the *policy gradient* technique focuses directly on the policy $\pi(s)$.

Direct policy modeling has several advantages:
* From a certain point of view, we don't care that much about the expected discounted rewards but rather about the decision/action $\pi(s)$ to take in each state $s$
* As we saw earlier with the *Categorical DQN*, learning a distribution helps to better capture the underlying MDP (especially in stochastic environments)
* It becomes quite hard to determine the best action to take when the action space is large or even continuous. The DQN model of $Q(s, a)$ is highly non-linear and the optimization problem $a^* = \arg\max_a Q(s, a)$ can be hard to solve.

In the value iteration case our DQN parametrized the state-action values as $DQN(s) \to Q_\mathbf{w}(s, \cdot)$. Similarly, we will represent the policy as a probability distribution over actions $\pi_\mathbf{w}(s)$ parametrized by the NN.

*Modelling the output as action (class) probabilities is a typical technique in classification tasks that gives us a smooth representation (intuitively, changing NN weights $\mathbf{w}$ a bit changes $\pi$ a bit as well - compared to the case with discrete action labels which would change in steps).*
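
To make the "smooth representation" point concrete, here is a minimal, framework-free sketch of turning raw network outputs (logits) into action probabilities and sampling from them. The logit values and the helper names `softmax` and `sample_action` are illustrative assumptions, not part of the notebook's code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical network outputs for some state s (made-up numbers)
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# A small change of the weights shifts the logits slightly,
# which shifts the probabilities slightly as well.
nudged = softmax([2.01, 1.0, 0.1])
max_shift = max(abs(p - q) for p, q in zip(probs, nudged))

action = sample_action(probs)
```

Here `max_shift` stays small, whereas a hard action label (the index of the largest logit) would either not change at all or jump to a different action.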

## Gradients of the Policy

*Policy Gradient* methods are closely related to the *Cross-Entropy Method* introduced earlier. The policy gradient is the direction in which we want to change the NN weights to maximize the accumulated reward. Its scale is proportional to the state-action value $Q(s, a)$ and its direction is that of the gradient of the log-probability of the action taken:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log(\pi(a | s))]
$$
where the expectation means that we average the gradient over several steps.

Equivalently, we can say that we optimize the loss function $\mathcal{L} = -Q(s, a) \log(\pi(a | s))$ (Note: SGD minimizes the loss function, but we want to maximize the objective $J$, hence the minus sign).
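
As a rough illustration of this loss, the sketch below evaluates $-Q(s, a) \log(\pi(a | s))$ for a softmax policy and also gives its gradient with respect to the logits, $Q \cdot (\pi_j - \mathbb{1}[j = a])$. The function names and the sample numbers are assumptions made for the example, and averaging over the batch is one possible way to estimate the expectation.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a softmax policy, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def pg_loss(batch):
    """Mean of -Q(s, a) * log pi(a | s) over (logits, action, q) triples."""
    total = 0.0
    for logits, action, q in batch:
        total += -q * log_softmax(logits)[action]
    return total / len(batch)

def pg_loss_grad_logits(logits, action, q):
    """Gradient of -q * log softmax(logits)[action] w.r.t. the logits."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [q * (p - (1.0 if j == action else 0.0)) for j, p in enumerate(probs)]

# Made-up transitions: (logits for the state, action taken, Q value)
batch = [([0.0, 0.0], 0, 2.0), ([1.0, -1.0], 1, 0.5)]
loss = pg_loss(batch)

# For a positive Q the gradient is negative on the taken action's logit,
# so a gradient-descent step raises the probability of that action.
grad = pg_loss_grad_logits([0.0, 0.0], 0, 2.0)
```

In a deep learning framework the gradient would come from automatic differentiation; the closed form is shown only to make the direction of the update visible.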

Recall that in the *Cross-Entropy Method* we sampled the environment for a few episodes and trained only on transitions from the above-average ones. This corresponds to having $Q(s, a) = 1$ for the good transitions and $Q(s, a) = 0$ otherwise. In general, policy gradient methods differ in how the $Q$ values are treated, but in any case we want finer-grained values of $Q(s, a)$ than just the binary $\{0, 1\}$:
1. for better separation of episodes
1. to incorporate the discount factor and thus the uncertainty about future rewards
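
The binary weighting of the *Cross-Entropy Method* can be sketched as follows. The helper `cem_weights` and the episode returns are made up for illustration, and the plain mean is used as the threshold only because the text speaks of above-average episodes.

```python
def cem_weights(episode_returns):
    """Cross-Entropy-style weights: 1 for above-average episodes, 0 otherwise."""
    mean_return = sum(episode_returns) / len(episode_returns)
    return [1.0 if r > mean_return else 0.0 for r in episode_returns]

# Made-up total rewards of four episodes
returns = [10.0, 50.0, 200.0, 20.0]
weights = cem_weights(returns)  # [0.0, 0.0, 1.0, 0.0]
```

The episodes with returns 10 and 50 receive the same weight even though one is five times better than the other, which is the loss of separation that the first point refers to.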

## The REINFORCE method
The outline of the *REINFORCE* method is as follows:
1. Initialize NN weights randomly
1. Play $N$ full episodes and collect experiences $(s, a, r, s')$
1. Compute actual $Q$ values for every played episode $k$ and step $t$: $Q_{k, t} = \sum_{i \geq 0} \gamma^i r_{k, t+i}$, where the sum runs to the end of episode $k$
1. Compute the loss for all transitions: $\mathcal{L} = - \sum_{k, t} Q_{k, t} \log(\pi(a_{k, t} | s_{k, t}))$
1. Perform one SGD step on the loss to update the NN weights
1. Repeat from step 2. until convergence
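
The steps above can be sketched without any deep learning framework. The example below is only a toy under stated assumptions: the "environment" has a single state and two actions (action 1 pays a reward of 1, action 0 pays nothing), the policy is a softmax over two logits instead of a neural network, the logits start at zero rather than at random values, and the loss is averaged over the batch instead of summed. All function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calc_qvals(rewards, gamma):
    """Step 3: discounted return from every step to the end of the episode."""
    qvals, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        qvals.append(running)
    return qvals[::-1]

def play_episode(logits, rng, steps=3):
    """Toy environment (an assumption): one state, action 1 pays 1, action 0 pays 0."""
    probs = softmax(logits)
    actions = [rng.choices([0, 1], weights=probs)[0] for _ in range(steps)]
    rewards = [float(a) for a in actions]
    return actions, rewards

def reinforce(iterations=500, n_episodes=16, gamma=0.9, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # step 1 (zeros instead of random values, for brevity)
    for _ in range(iterations):  # step 6
        grad, count = [0.0, 0.0], 0
        probs = softmax(logits)
        for _ in range(n_episodes):  # step 2
            actions, rewards = play_episode(logits, rng)
            for a, q in zip(actions, calc_qvals(rewards, gamma)):
                # step 4: gradient of -q * log pi(a) w.r.t. the logits
                for j in range(2):
                    grad[j] += q * (probs[j] - (1.0 if j == a else 0.0))
                count += 1
        # step 5: one SGD step on the batch-averaged loss
        logits = [z - lr * g / count for z, g in zip(logits, grad)]
    return softmax(logits)

final_probs = reinforce()
```

With these settings the probability of the rewarding action should end up well above that of the other one. Because no baseline is subtracted from the returns, individual updates are noisy.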

Properties of the REINFORCE method:
* We **don't need an explicit exploration policy** because we explore automatically using the policy our NN outputs.
* **On-policy** method, therefore no experience replay (ER) buffer is needed because we can't train on data from old policies. On the other hand, value methods typically converge faster (they need fewer interactions with the environment).
* We train on actual Q values obtained from experience and not on estimated ones, so we **don't need a target NN** to break the correlation in the Q-value approximation either.