# Asynchronous Advantage Actor-Critic
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/13_a3c.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" tensorboardX

## Correlation and Sample Efficiency
Let's briefly summarize the value-based and policy-based methods.

**Value-based methods (DQN & extensions)**
* Off-policy methods that use a different policy to sample the environment than the one being trained
* They can therefore reuse old experiences, collected in an experience replay buffer, to approximate state-action values
* A single experience is thus used several times before it's retired from the replay buffer
* For this reason they are *sample efficient*, meaning that they need fewer interactions with the environment
* As such, they are suitable in situations where such interaction is costly
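
The reuse of experiences described above can be illustrated with a minimal replay buffer. This is only a sketch using the standard library; the `ReplayBuffer` class and its method names are hypothetical and are not taken from `ptan`.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of transitions (illustration only)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full,
        # which is how an experience gets "retired".
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch; the same
        # transition can still be drawn again in later batches.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because sampling does not remove items, a transition stays available for many training batches until newer experiences push it out.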

**Policy-based methods (PG & Actor-Critic)**
* A2C and related methods are on-policy: the same policy is used both to sample the environment and for training
* In this case one can't reuse old experiences because these would yield a gradient for the old policy that sampled them
* Instead, these methods use multiple concurrent environments to break sample correlations, which would otherwise violate the i.i.d. assumption of SGD methods
* The benefit is inherent parallelism and modelling the policy directly as a distribution over actions
* On the other hand, because each experience can be used only once, these methods are *sample inefficient* and are typically used only when the environment interaction is cheap

## A3C Method
As mentioned before, A2C uses samples from multiple environments to break correlations. It does so, however, in a sequential (round-robin) and fully synchronous manner. In contrast, A3C introduces multiple parallel agents, each with its own NN copy, that:
* Sample one or a few environments in parallel
* Push data to the training process running an SGD-like optimization
* Receive updated NN parameters back from the trainer
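
The agent/trainer cycle above can be sketched with standard-library threads and a queue. This is a toy under loud assumptions: real implementations typically use separate processes, and here the "parameters" are reduced to a version counter and the "data" to placeholder tuples. All names (`agent`, `trainer`, `run_demo`) are hypothetical.

```python
import queue
import threading


def agent(agent_id, shared, data_queue, n_steps):
    """Hypothetical agent: samples its environment and pushes data."""
    for step in range(n_steps):
        # Read the most recent parameters published by the trainer.
        with shared["lock"]:
            version = shared["version"]
        # Stand-in for a transition collected with that policy version.
        data_queue.put((agent_id, step, version))


def trainer(shared, data_queue, total, batch_size):
    """Hypothetical trainer: consumes batches, publishes new parameters."""
    consumed = 0
    while consumed < total:
        batch = [data_queue.get() for _ in range(batch_size)]
        consumed += len(batch)
        # Stand-in for one SGD step; bumping the version "sends" new weights.
        with shared["lock"]:
            shared["version"] += 1


def run_demo(n_agents=4, n_steps=8, batch_size=8):
    total = n_agents * n_steps
    if total % batch_size != 0:
        raise ValueError("total samples must be a multiple of batch_size")
    shared = {"lock": threading.Lock(), "version": 0}
    data_queue = queue.Queue()
    threads = [
        threading.Thread(target=agent, args=(i, shared, data_queue, n_steps))
        for i in range(n_agents)
    ]
    threads.append(
        threading.Thread(
            target=trainer, args=(shared, data_queue, total, batch_size)
        )
    )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Number of optimization steps the trainer performed.
    return shared["version"]
```

Calling `run_demo()` returns the number of parameter updates, which equals the total number of collected samples divided by the batch size. Note that agents may act on slightly stale parameters, which is the asynchronous aspect of the method.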

Now, there are two basic variants depending on what the "data" definition is:
* **Data parallelism** - The data produced by each agent are simply the experienced transitions
* **Gradient parallelism** - Extends the responsibility of an agent, which in this setup additionally computes the loss (forward pass) and gradients (backpropagation). The gradients are then collected by the trainer and applied in the optimization step.

The difference might seem subtle, but the forward pass and backpropagation (loss and gradient computation) are the most costly part of an SGD update, so pushing them to the agent processes can significantly increase throughput, especially if we have multiple GPUs distributed across a cluster. If, however, we have just one beefy GPU, the *data parallelism* approach is usually better and also easier to implement.
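
To make the distinction concrete, here is a toy comparison on a one-parameter model `y = w * x` with a mean squared error loss. It is a synchronous, pure-Python sketch with hypothetical function names, and it assumes all workers hold equally sized batches and the same parameter value; it ignores the staleness that asynchronous agents would introduce.

```python
def grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, worker_batches, lr):
    # Workers ship raw samples; the trainer merges them and does
    # the forward and backward pass itself.
    merged = [sample for batch in worker_batches for sample in batch]
    return w - lr * grad(w, merged)


def gradient_parallel_step(w, worker_batches, lr):
    # Each worker computes the gradient on its own batch; the trainer
    # only averages the gradients and applies the update.
    grads = [grad(w, batch) for batch in worker_batches]
    return w - lr * sum(grads) / len(grads)
```

With equally sized batches, both functions yield the same updated `w`, so the variants differ in where the expensive computation runs rather than in the resulting update.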