# Advantage Actor-Critic (A2C)

Recall that in Reinforce, we increase the probability of the actions in a trajectory in proportion to how high the return is.

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})R(\tau)\right]$$

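Because the return $R(\tau)$ is estimated from sampled trajectories, this gradient estimate has high variance. This is the Problem of Variance in Reinforce.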
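To make the estimator concrete, here is a minimal sketch of the gradient estimate from a single sampled trajectory. It assumes a tabular softmax policy (one list of action preferences per state) and an undiscounted return; the names `softmax` and `reinforce_gradient` are illustrative and not part of the original text.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(theta, trajectory):
    """Single-trajectory estimate of the Reinforce gradient.

    theta:      list of per-state preference lists (assumed tabular policy)
    trajectory: list of (state, action, reward) tuples
    """
    # R(tau): here the undiscounted sum of rewards (an assumption of this sketch)
    ret = sum(reward for _, _, reward in trajectory)

    grad = [[0.0] * len(row) for row in theta]
    for state, action, _ in trajectory:
        probs = softmax(theta[state])
        for i, p in enumerate(probs):
            # d/d theta[state][i] of log pi(action | state) = 1[i == action] - pi(i | state)
            grad_log = (1.0 if i == action else 0.0) - p
            grad[state][i] += grad_log * ret
    return grad
```

Every action in the trajectory is weighted by the same sampled return, so trajectories with very different returns produce very different gradient estimates.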

## Reducing variance with Actor-Critic methods

The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: the Actor-Critic method.

To understand the Actor-Critic, imagine you’re playing a video game. You can play with a friend that will provide you with some feedback. You’re the Actor and your friend is the Critic.

You don’t know how to play at the beginning, so you try some actions randomly. The Critic observes your action and provides feedback. Learning from this feedback, you update your policy and get better at playing the game. Meanwhile, your friend (the Critic) also updates the way they give feedback so that it is more useful next time.

This is the idea behind Actor-Critic. We learn two function approximations:

* A policy that controls how our agent acts: $\pi_{\theta}(s)$

* A value function to assist the policy update by measuring how good the action taken is: $\hat{q}_{w}(s, a)$

## The Actor-Critic Process

Now that we have seen the Actor-Critic’s big picture, let’s dive deeper to understand how the Actor and Critic improve together during training.

As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):

* Actor, a **policy function** parameterized by theta: $\pi_{\theta}(s)$

* Critic, a **value function** parameterized by w: $\hat{q}_{w}(s, a)$

Let’s see the training process to understand how the Actor and Critic are optimized:

* At each timestep $t$, we get the current state $S_{t}$ from the environment and pass it as input through our Actor and Critic.

* Our Policy takes the state and outputs an action $A_{t}$

* The Critic also takes that action as input and, using $S_{t}$ and $A_{t}$, computes the value of taking that action at that state: the Q-value.

* The action $A_{t}$ performed in the environment outputs a new state $S_{t+1}$ and a reward $R_{t+1}$

* The Actor updates its policy parameters using the Q-value.

$$\Delta\theta = \alpha\nabla_{\theta}(\log\pi_{\theta}(a_{t}|s_{t}))\hat{q}_{w}(s_{t}, a_{t})$$

* Thanks to its updated parameters, the Actor produces the next action $A_{t+1}$ to take given the new state $S_{t+1}$

* The Critic then updates its value parameters by TD learning.

$$\Delta w = \beta(r_{t+1} + \gamma\hat{q}_{w}(s_{t+1}, a_{t+1}) - \hat{q}_{w}(s_{t}, a_{t}))\nabla_{w}\hat{q}_{w}(s_{t}, a_{t})$$
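The training steps above can be sketched as follows. This is a minimal illustration that assumes tabular parameters (a preference table `theta` for the Actor and a table `q` for the Critic) rather than neural networks; the function names `sample_action`, `actor_update`, and `critic_update` and the default step sizes are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(theta, s, rng):
    """Sample an action from the Actor's softmax policy at state s."""
    probs = softmax(theta[s])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def actor_update(theta, q, s, a, alpha=0.1):
    """Delta theta = alpha * grad_theta log pi(a|s) * q(s, a)."""
    probs = softmax(theta[s])
    for i, p in enumerate(probs):
        grad_log = (1.0 if i == a else 0.0) - p  # softmax log-gradient
        theta[s][i] += alpha * grad_log * q[s][a]


def critic_update(q, s, a, r, s_next, a_next, beta=0.1, gamma=0.99):
    """Delta w = beta * TD error * grad_w q(s, a).

    For a table, grad_w q(s, a) is 1 for the entry (s, a) and 0 elsewhere.
    """
    td_error = r + gamma * q[s_next][a_next] - q[s][a]
    q[s][a] += beta * td_error
    return td_error
```

In a training loop, one would call `actor_update`, then `sample_action` on the new state to obtain the next action, and finally `critic_update`, mirroring the order of the steps in the list.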

## Adding Advantage in Actor-Critic (A2C)

We can stabilize learning further by using the Advantage function as the Critic instead of the action-value function.

The idea is that the Advantage function calculates the relative advantage of an action compared to the other actions possible at a state: how much better taking that action at a state is compared to the average value of the state.

$$A(s, a) = Q(s, a) - V(s)$$

In other words, this function calculates the extra reward we get if we take this action at that state compared to the mean reward we get at that state. The extra reward is what’s beyond the expected value of that state.

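* If $A(s, a) > 0$, the action is better than average at that state, so the gradient pushes the policy toward that action.

* If $A(s, a) < 0$, the action is worse than average at that state, so the gradient pushes the policy away from that action.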
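As a small worked example of the definition $A(s, a) = Q(s, a) - V(s)$, the sketch below assumes that $V(s)$ is the policy-weighted average of the Q-values at that state. The function name `advantages` and the numbers are illustrative.

```python
def advantages(q_values, probs):
    """Return A(s, a) = Q(s, a) - V(s) for every action at one state.

    q_values: Q(s, a) for each action a
    probs:    pi(a | s) for each action a
    """
    # V(s) as the expected Q-value under the policy (assumption of this sketch)
    v = sum(p * q for p, q in zip(probs, q_values))
    return [q - v for q in q_values]


# Two actions, chosen with equal probability, with Q-values 1.0 and 3.0:
# V(s) = 2.0, so the first action is below average and the second above.
example = advantages([1.0, 3.0], [0.5, 0.5])
```

Here the first action has a negative advantage and the second a positive one, so the policy update would shift probability from the first action to the second.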

The problem with implementing this advantage function is that it requires two value functions: $Q(s, a)$ and $V(s)$. Fortunately, we can use the TD error as a good estimator of the advantage function, which only requires $V$.

$$A(s, a) \approx r + \gamma V(s') - V(s)$$
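To see why the TD error is a reasonable estimator, assume for a moment that $V$ and $Q$ are the true value functions of the current policy. The action value is then the expected one-step reward plus the discounted value of the next state:

$$Q(s, a) = \mathbb{E}\left[r + \gamma V(s') \mid s, a\right]$$

Taking the expectation of the TD error over the next state and reward therefore gives exactly the advantage:

$$\mathbb{E}\left[r + \gamma V(s') - V(s) \mid s, a\right] = Q(s, a) - V(s) = A(s, a)$$

In practice $V$ is a learned approximation, so the TD error is only an estimate of the advantage.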

In A2C, at each time step $t$, we generate the transition $(s_{t},a_{t},r_{t+1},s_{t+1})$, then:

* Advantage (TD error): $\delta_{t} = r_{t+1} + \gamma v(s_{t+1}, w_{t}) - v(s_{t},w_{t})$. Here, $r_{t+1} + \gamma v(s_{t+1}, w_{t})$ is used to approximate $q(s_{t},a_{t})$.

* Actor (policy update): $\theta_{t+1} = \theta_{t} + \alpha_{\theta}\delta_{t}\nabla_{\theta}\ln\pi(a_{t}|s_{t},\theta_{t})$

* Critic (value update): $w_{t+1} = w_{t} + \alpha_{w}\delta_{t}\nabla_{w}v(s_{t},w_{t})$
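The three updates above can be sketched as a single step function. This is a minimal illustration that assumes a tabular softmax Actor (`theta`) and a tabular state-value Critic (`v`); the name `a2c_step` and the default step sizes are hypothetical, and terminal states are not handled.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_step(theta, v, s, a, r, s_next,
             alpha_theta=0.1, alpha_w=0.1, gamma=0.99):
    """Apply one A2C update for the transition (s, a, r, s_next).

    theta: per-state action preferences (assumed tabular Actor)
    v:     per-state value estimates (assumed tabular Critic)
    """
    # Advantage (TD error), computed with the current Critic parameters
    delta = r + gamma * v[s_next] - v[s]

    # Actor: theta <- theta + alpha_theta * delta * grad_theta ln pi(a|s)
    probs = softmax(theta[s])
    for i, p in enumerate(probs):
        grad_log = (1.0 if i == a else 0.0) - p
        theta[s][i] += alpha_theta * delta * grad_log

    # Critic: w <- w + alpha_w * delta * grad_w v(s); the gradient is 1 for entry s
    v[s] += alpha_w * delta
    return delta
```

Both updates reuse the same `delta`, which is computed before either set of parameters changes, matching the use of $w_{t}$ and $\theta_{t}$ on the right-hand side of the update rules.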