# The Actor-Critic Method
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/12_a2c.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" tensorboardX

## Variance Reduction
Let's start by recalling the policy gradient defined by the *Policy Gradients (PG)* method:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log(\pi(a|s))]
$$

One of the weak points of the PG method is that the gradient scales $Q(s, a)$ can have quite significant variance\*, which hurts training. We addressed this issue by introducing a fixed *baseline* value (e.g. the mean reward) that was subtracted from the gradient scales $Q(s, a)$.

\* Recall the formal definition: $\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2]$

Let's illustrate this problem and its solution with a simple example:
* Assume there are three actions, with $Q_1$ and $Q_2$ being small positive values and $Q_3$ being a large negative value
* In this case there will be a small positive gradient towards the first two actions and a large negative one repelling the policy from the third one
* Now imagine $Q_1$ and $Q_2$ were large positive values and $Q_3$ a small but still positive value. The gradient would still push the policy towards the first two actions, but it would also push it slightly towards the third one (instead of away from it)!

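Now it's a bit clearer why subtracting a constant value, which we called the *baseline*, helps: with a suitable baseline the scale of the third action becomes negative again while the first two stay positive.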
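
The effect of a baseline on the sign of the gradient scales can be checked numerically. The following is a minimal sketch with made-up Q-values; the numbers and the helper name `subtract_baseline` are illustrative assumptions, and the mean is used as the baseline:

```python
import numpy as np


def subtract_baseline(q_values: np.ndarray) -> np.ndarray:
    """Center the gradient scales by subtracting their mean (the baseline)."""
    return q_values - q_values.mean()


# Hypothetical Q-values: Q1 and Q2 are large positive, Q3 is small but positive
q_values = np.array([101.0, 102.0, 1.0])

# Without a baseline all three scales are positive,
# so the gradient pushes the policy towards every action
print("raw scales:     ", q_values)

# With the mean as the baseline only the third scale turns negative,
# so the policy is pushed away from the third action again
centered = subtract_baseline(q_values)
print("centered scales:", centered)
```

Note that the ordering of the actions is unchanged; only the reference point that decides which actions are encouraged and which are discouraged moves.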

## Advantage Actor-Critic (A2C)
The *Advantage Actor-Critic (A2C)* method can be viewed as a combination of PG and DQN, with a simple idea that extends the variance reduction theme discussed above. Until now we treated the *baseline* value as a single constant that we subtracted from all $Q(s, a)$ values. A2C pushes this further and uses a different baseline for each state $s$.

Recall the *Dueling DQN*, which exploited the fact that $Q(s, a) = V(s) + A(s, a)$, i.e. state-action values are composed of *baseline* state values $V(s)$ and action advantages $A(s, a)$ in these states. From there it is quite straightforward to figure out which values A2C uses as state baselines: the state values $V(s)$!

The *Advantage* Actor-Critic name then comes from the fact that our gradient scales turn into action advantages after subtracting the state values:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log(\pi(a|s))] \to \mathbb{E}[A(s, a) \nabla \log(\pi(a|s))]
$$

Finally, how do we obtain $V(s)$? This is the second part, the combination with the DQN approach: we simply train a DQN-like network that estimates $V(s)$ alongside our policy gradient network (PGN).

*Notes*:
* *There'll actually be just a single NN that learns both the policy and the state values (discussed below)*
* *Improvements from both methods are still applicable (also mentioned and shown in the following sections)*
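
To make the two training signals concrete, here is a minimal sketch of how the advantage-scaled policy loss and the value loss could be computed for one batch. The function name `a2c_losses`, its arguments, and the use of return estimates as a stand-in for $Q(s, a)$ are assumptions made for illustration, not this notebook's actual training code; refinements such as an entropy bonus are left out.

```python
from typing import Tuple

import torch
import torch.nn.functional as F


def a2c_losses(
    logits: torch.Tensor,       # (batch, n_actions) policy logits
    values: torch.Tensor,       # (batch,) predicted state values V(s)
    actions: torch.Tensor,      # (batch,) indices of the actions taken
    q_estimates: torch.Tensor,  # (batch,) assumed estimates of Q(s, a)
) -> Tuple[torch.Tensor, torch.Tensor]:
    # A(s, a) = Q(s, a) - V(s); detach V(s) so that the policy loss
    # does not propagate gradients into the value estimate
    advantages = q_estimates - values.detach()

    # log pi(a|s) of the actions that were actually taken
    log_probs = F.log_softmax(logits, dim=1)
    chosen_log_probs = log_probs[torch.arange(len(actions)), actions]

    # Minimizing the negated objective follows E[A(s, a) * grad log pi(a|s)]
    policy_loss = -(advantages * chosen_log_probs).mean()

    # The value estimate is regressed towards the Q estimates
    value_loss = F.mse_loss(values, q_estimates)
    return policy_loss, value_loss


# Toy batch of two states and three actions (made-up numbers)
logits = torch.zeros(2, 3, requires_grad=True)
values = torch.tensor([0.5, 1.0], requires_grad=True)
actions = torch.tensor([0, 2])
q_estimates = torch.tensor([1.5, 0.0])

policy_loss, value_loss = a2c_losses(logits, values, actions, q_estimates)
(policy_loss + value_loss).backward()
```

In this toy batch the first action has a positive advantage and the second a negative one, so after `backward()` the gradient on `logits` favours action 0 in the first state and disfavours action 2 in the second.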

### Common Imports

In [2]:
# flake8: noqa: E402,I001

import sys
import time
from collections import deque
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional, Tuple

import gym
import numpy as np
import ptan
import torch
import torch.nn as nn
from ptan.experience import ExperienceFirstLast
from tensorboardX import SummaryWriter

### Reward Tracker

In [3]:
class RewardTracker:
    def __init__(
        self,
        writer: SummaryWriter,
        stop_reward: float,
        window_size: int = 100,
    ) -> None:
        self.writer = writer
        self.stop_reward = stop_reward
        self.window_size = window_size

    def __enter__(self) -> "RewardTracker":
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        return self

    def __exit__(self, *args: Any) -> None:
        self.writer.close()

    def add_reward(self, reward: float, frame: int) -> bool:
        """
        Returns an indication of whether a termination condition was reached.
        """

        self.total_rewards.append(reward)

        fps = (frame - self.ts_frame) / (time.time() - self.ts)

        self.ts_frame = frame
        self.ts = time.time()

        mean_reward = np.mean(self.total_rewards[-self.window_size :])

        if frame % self.window_size == 0:
            print(
                f"{frame}: done {len(self.total_rewards)} games, "
                f"mean reward {mean_reward:.3}, speed {fps:.2} fps"
            )

        sys.stdout.flush()

        self.writer.add_scalar("fps", fps, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)

        # Cast NumPy's bool_ to a plain bool to match the annotated return type
        return bool(mean_reward > self.stop_reward)

### Atari A2C PG Network
This NN is quite similar to the *Dueling DQN* architecture, but with an important difference. In the Dueling DQN we also have two parts:
1. Part for the state values $V(s)$
1. Part for the action advantages $A(s, a)$

But as with any other DQN, we still output $Q(s, a) = V(s) + A(s, a)$. Here we have two separate outputs on top of a common base network:
1. Policy network that outputs action logits - basically policy $\pi(a|s)$ when one converts them to probabilities using softmax
1. Value network which computes $V(s)$
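
The two-headed layout can be sketched on a much smaller scale. The class below is a hypothetical toy model (`TinyTwoHeadNet`, with made-up layer sizes and a dense base instead of convolutions); it only illustrates how one shared base feeds a policy head and a value head, and how the logits are turned into probabilities with softmax:

```python
from typing import Tuple

import torch
import torch.nn as nn


class TinyTwoHeadNet(nn.Module):
    """Hypothetical toy network: shared base, policy head and value head."""

    def __init__(self, obs_size: int, n_actions: int, hidden: int = 16) -> None:
        super().__init__()
        # Base shared by both heads
        self.base = nn.Sequential(nn.Linear(obs_size, hidden), nn.ReLU())
        # Policy head outputs one logit per action
        self.policy = nn.Linear(hidden, n_actions)
        # Value head outputs a single state value V(s)
        self.value = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        features = self.base(x)
        return self.policy(features), self.value(features)


net = TinyTwoHeadNet(obs_size=4, n_actions=3)
states = torch.zeros(5, 4)  # batch of 5 dummy observations

with torch.no_grad():
    logits, values = net(states)

# Softmax over the action dimension converts logits to pi(a|s)
probs = torch.softmax(logits, dim=1)

# Sample one action per state from the resulting distribution
actions = torch.multinomial(probs, num_samples=1).squeeze(1)
```

Unlike in the Dueling DQN, the two outputs are never summed: the logits drive action selection and the value output serves only as the per-state baseline.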

In [4]:
class AtariA2C(nn.Module):
    """
    A2C network with 2D convolutional base for Atari envs. and two heads:
    1. Policy - dense network that outputs action logits
    2. Value - dense network that models state values `V(s)`
    """

    def __init__(self, input_shape: Tuple[int, ...], n_actions: int) -> None:
        super().__init__()

        # 2D conv. base network common to both heads
        #  - This way both nets share commonly learned basic features
        #  - Also helps with convergence (compared to having two separate NNs)
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_out(input_shape)

        # Policy NN - outputs action logits
        self.policy = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

        # Value NN - outputs state value
        self.value = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def _get_conv_out(self, shape: Tuple[int, ...]) -> int:
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        For an input batch of states, returns pair of tensors
          1. Policy logits for all actions in these states
          2. Values of these states
        """
        inputs = x.float() / 256
        conv_out = self.conv(inputs).view(inputs.size()[0], -1)
        return self.policy(conv_out), self.value(conv_out)
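
The `_get_conv_out` helper above infers the flattened size of the convolutional output by passing a dummy batch through the base network. As a small worked example, the sketch below repeats that trick on the same convolutional stack, assuming an input of 4 stacked 84x84 frames (a common Atari preprocessing setup, but an assumption here, since the input shape is not fixed in this section):

```python
import torch
import torch.nn as nn

# Assumed input shape: 4 stacked grayscale frames of 84x84 pixels
input_shape = (4, 84, 84)

# Same convolutional stack as in AtariA2C
conv = nn.Sequential(
    nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.ReLU(),
)

# A single dummy observation is enough to read off the output shape
with torch.no_grad():
    out = conv(torch.zeros(1, *input_shape))

# Spatial size per layer: 84 -> 20 -> 9 -> 7, with 64 output channels
conv_out_size = out.numel()
print(tuple(out.shape), conv_out_size)
```

Under this assumption each head's first `nn.Linear` layer receives 64 * 7 * 7 = 3136 input features; a different input resolution changes this number, which is why it is computed rather than hard-coded.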