### Policy Gradient Algorithms

Many policy gradient algorithms have been proposed in recent years, and there is no way for me to cover them all. This section introduces the ones I happen to know and have read about.

#### REINFORCE

REINFORCE (Monte-Carlo policy gradient) uses returns estimated by Monte-Carlo methods from episode samples to update the policy parameter $\theta$. REINFORCE works because the expectation of the sample gradient is equal to the actual gradient:


$$ \nabla_{\theta}J(\theta) = \boldsymbol{E}_{\pi} [ Q^{\pi}(s,a) \nabla_{\theta} \ln \pi_{\theta}(a|s)] $$
$$ \nabla_{\theta}J(\theta) = \boldsymbol{E}_{\pi} [ G_t \nabla_{\theta} \ln \pi_{\theta}(A_t|S_t)] \quad \text{because } Q^{\pi}(S_t, A_t) = \boldsymbol{E}_{\pi}[G_t \vert S_t, A_t] $$

Therefore we can measure $G_t$ from real sample trajectories and use it to update the policy parameters. The update needs a full trajectory, and that’s why REINFORCE is a Monte-Carlo method.

The process is pretty straightforward:

1. Initialize the policy parameter $\theta$ at random.
2. Generate one trajectory on policy $\pi_\theta$: $S_1,A_1,R_2,S_2,A_2,…,S_T$.
3. For $t=1, 2, … , T$:
    1. Estimate the return $G_t$;
    2. Update policy parameters: $\theta \leftarrow \theta+\alpha\gamma^tG_t\nabla_\theta \ln \pi_\theta(A_t \vert S_t)$
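
The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the four-state corridor environment, the tabular softmax policy, the hyperparameters, and the names `softmax`, `step`, `run_episode`, and `reinforce` are all made up for illustration. Time is indexed from 0 here instead of 1.

```python
import math
import random

# Hypothetical corridor: states 0..3, start at 0, goal (terminal) at 3.
N_STATES, GOAL, MAX_STEPS = 4, 3, 50


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def step(state, action):
    """Action 0 moves left, action 1 moves right; every step costs -1."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, -1.0, nxt == GOAL


def run_episode(theta, rng):
    """Generate one trajectory as a list of (state, action, reward)."""
    state, traj = 0, []
    for _ in range(MAX_STEPS):
        action = rng.choices([0, 1], weights=softmax(theta[state]))[0]
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    return traj


def reinforce(episodes=3000, alpha=0.02, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Step 1: small random initial preferences theta[s][a].
    theta = [[rng.uniform(-0.01, 0.01) for _ in range(2)]
             for _ in range(N_STATES)]
    for _ in range(episodes):
        # Step 2: one trajectory under the current policy.
        traj = run_episode(theta, rng)
        # Step 3.1: returns G_t, accumulated backwards through the episode.
        g, returns = 0.0, []
        for _, _, r in reversed(traj):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        # Step 3.2: theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t|S_t).
        for t, ((s, a, _), g_t) in enumerate(zip(traj, returns)):
            probs = softmax(theta[s])
            for b in range(2):
                # For a tabular softmax: d ln pi(a|s) / d theta[s][b] = 1[a=b] - pi(b|s).
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * (gamma ** t) * g_t * grad_log
    return theta
```

Calling `reinforce()` returns the learned preferences; `softmax(theta[s])` then gives the action probabilities in state `s`, which in this toy corridor should lean toward moving right.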

A widely used variation of REINFORCE is to subtract a baseline value from the return $G_t$ to reduce the variance of gradient estimation while keeping the bias unchanged (remember, we always want to do this when possible). For example, a common baseline is the state-value function: subtracting it from the action-value gives the advantage $A(s,a)=Q(s,a)-V(s)$, which is then used in the gradient ascent update. This [post](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/) nicely explains why a baseline reduces the variance, in addition to covering a set of policy gradient fundamentals.
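
To see why a baseline leaves the expected gradient unchanged, consider a baseline $b(s)$ that depends only on the state. The symbol $b$ is introduced here for illustration, and a discrete action space is assumed so that the expectation is a sum. Because the action probabilities sum to one, the baseline term contributes nothing in expectation:

$$
\boldsymbol{E}_{a \sim \pi_\theta} [ b(s) \nabla_\theta \ln \pi_\theta(a|s) ]
= b(s) \sum_a \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}
= b(s) \nabla_\theta \sum_a \pi_\theta(a|s)
= b(s) \nabla_\theta 1
= 0
$$

So replacing $G_t$ with $G_t - b(S_t)$ keeps the expected gradient the same, and only the variance of the estimate changes.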

#### Actor-Critic

Two main components in policy gradient are the policy model and the value function. It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, such as by reducing gradient variance in vanilla policy gradients, and that is exactly what the **Actor-Critic** method does.

Actor-critic methods consist of two models, which may optionally share parameters:

- **Critic** updates the value function parameters $w$; depending on the algorithm, the value function could be action-value $Q_w(s,a)$ or state-value $V_w(s)$.
- **Actor** updates the policy parameters $\theta$ for $\pi_\theta(a|s)$, in the direction suggested by the critic.

Let’s see how it works in a simple action-value actor-critic algorithm.

1. Initialize $s$, $\theta$, $w$ at random; sample $a ∼ \pi_\theta(a|s)$.
2. For $t=1…T$:
    1. Sample reward $r_t∼R(s,a)$ and next state $s′∼P(s′|s,a)$;
    2. Then sample the next action $a′∼ \pi_\theta(a′|s′)$;
    3. Update the policy parameters: $\theta \leftarrow \theta + \alpha_\theta Q_w(s,a) \nabla_\theta \ln \pi_\theta(a|s)$;
    4. Compute the correction (TD error) for action-value at time t: $$\delta_t=r_t+\gamma Q_w(s′,a′)−Q_w(s,a)$$ and use it to update the parameters of action-value function: $$w\leftarrow w+\alpha_w \delta_t \nabla_w Q_w(s,a)$$
    5. Update $a \leftarrow a′$ and $s \leftarrow s′$.

Two learning rates, $\alpha_\theta$ and $\alpha_w$, are predefined for policy and value function parameter updates respectively.
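
The action-value actor-critic loop above can be sketched in plain Python. This is a rough illustration under assumptions that are not in the original: a four-state corridor environment, tabular $\theta$ and $w$ (so $\nabla_w Q_w(s,a)$ is a one-hot vector), zero initialization instead of random, a terminal state whose action-value is treated as 0, and hypothetical names such as `sample_action` and `actor_critic`.

```python
import math
import random

# Hypothetical corridor: states 0..3, start at 0, goal (terminal) at 3.
N_STATES, GOAL = 4, 3


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def step(state, action):
    """Action 0 moves left, action 1 moves right; every step costs -1."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, -1.0, nxt == GOAL


def sample_action(theta, state, rng):
    return rng.choices([0, 1], weights=softmax(theta[state]))[0]


def actor_critic(episodes=3000, alpha_theta=0.02, alpha_w=0.1,
                 gamma=0.9, max_steps=50, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: preferences
    w = [[0.0, 0.0] for _ in range(N_STATES)]      # critic: tabular Q_w(s, a)
    for _ in range(episodes):
        s = 0
        a = sample_action(theta, s, rng)
        for _ in range(max_steps):
            # Sample the reward, the next state, and the next action.
            s_next, r, done = step(s, a)
            a_next = sample_action(theta, s_next, rng)
            # Actor: theta <- theta + alpha_theta * Q_w(s,a) * grad ln pi(a|s).
            q_sa = w[s][a]
            probs = softmax(theta[s])
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_theta * q_sa * grad_log
            # Critic: TD error, with Q at the terminal state taken as 0 (assumption).
            q_next = 0.0 if done else w[s_next][a_next]
            delta = r + gamma * q_next - q_sa
            w[s][a] += alpha_w * delta  # grad_w Q_w(s,a) is one-hot in the tabular case
            if done:
                break
            s, a = s_next, a_next
    return theta, w
```

In this toy setting the critic's estimate for moving right from the state next to the goal should settle near $-1$, the reward of the final step, and the actor should come to prefer moving right.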



#### A3C

Asynchronous Advantage Actor-Critic (Mnih et al., 2016), or A3C for short, is a classic policy gradient method with a special focus on parallel training.

In A3C, the critics learn the value function while multiple actors are trained in parallel and get synced with global parameters from time to time. This periodic syncing is what makes A3C well suited to parallel training.

Let’s use the state-value function as an example. The loss function for the state value is the mean squared error, $J_v(w)=(G_t−V_w(s))^2$, and gradient descent can be applied to find the optimal $w$. This state-value function is used as the baseline in the policy gradient update.

Here is the algorithm outline:

1. We have global parameters, $\theta$ and $w$, and matching thread-specific parameters, $\theta'$ and $w'$.
2. Initialize the time step $t = 1$.
3. While $T \leq T_{MAX}$:
    1. Reset gradient: $d\theta = 0$ and $dw = 0$.
    2. Synchronize thread-specific parameters with global ones: $\theta' = \theta$ and $w' = w$.
    3. $t_{start} = t$ and sample a starting state $s_t$.
    4. While ($s_t \neq$ TERMINAL) and $t - t_{start} \leq t_{max}$:
        1. Pick the action $a_t \sim \pi_{\theta'}(a_t|s_t)$ and receive a new reward $R_t$ and a new state $s_{t+1}$.
        2. Update $t = t + 1$ and $T = T + 1$.
    5. Initialize the variable that holds the return estimation:
    $$
    R = \begin{cases}
    0 & \text{if } s_t \text{ is TERMINAL} \\
    V_{w'}(s_t) & \text{otherwise}
    \end{cases}
    $$
    6. For $i = t-1, \dots, t_{start}$:
        1. $R \leftarrow \gamma R + R_i$; here $R$ is a MC measure of $G_i$.
        2. Accumulate gradients w.r.t. $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi_{\theta'}(a_i|s_i)(R - V_{w'}(s_i))$;
        3. Accumulate gradients w.r.t. $w'$: $dw \leftarrow dw + 2(R - V_{w'}(s_i))\nabla_{w'}(R - V_{w'}(s_i))$.
    7. Update asynchronously $\theta$ using $d\theta$, and $w$ using $dw$.
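
The return bootstrapping and gradient accumulation inside one worker (the steps that initialize $R$ and then loop backwards over the rollout) can be sketched in plain Python. This is only an illustration under assumptions: a tabular softmax policy `theta[s][a]`, a tabular state-value `w[s]`, and a hypothetical function name `worker_gradients`. Threading and the global parameter update are left out.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def worker_gradients(states, actions, rewards, final_state, terminal,
                     theta, w, gamma=0.9):
    """Accumulate d_theta and dw over one rollout segment.

    states/actions/rewards hold the visited s_i, a_i, R_i in time order;
    final_state is the state reached after the last action.
    """
    n_actions = len(theta[0])
    dtheta = [[0.0] * n_actions for _ in theta]
    dw = [0.0] * len(w)
    # R starts at 0 for a terminal state, else at the bootstrap value V_{w'}(s_t).
    R = 0.0 if terminal else w[final_state]
    # Walk the segment backwards: i = t-1, ..., t_start.
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = gamma * R + r
        advantage = R - w[s]
        probs = softmax(theta[s])
        for b in range(n_actions):
            # Tabular softmax: d log pi(a|s) / d theta[s][b] = 1[a=b] - pi(b|s).
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            dtheta[s][b] += grad_log * advantage
        # Gradient of (R - V(s))^2 w.r.t. w[s], with R held constant:
        # 2 * (R - V(s)) * d(R - V(s))/dw[s] = 2 * advantage * (-1).
        dw[s] += 2.0 * advantage * (-1.0)
    return dtheta, dw
```

Since `dw` is the gradient of the squared error, a global update under this sketch would move $w$ against `dw` and $\theta$ along `dtheta`, each scaled by a learning rate.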
