# The Actor-Critic Method
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/12_a2c.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [8]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" tensorboardX

Running on Google Colab, therefore installing dependencies...


## Variance Reduction
Let's start by recalling the policy gradient defined by the *Policy Gradients (PG)* method:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log(\pi(a|s))]
$$

One of the weak points of the PG method is that the gradient scales $Q(s, a)$ can have quite significant variance*, which makes training slower and less stable. We mitigated this issue by introducing a fixed *baseline* value (e.g. the mean reward) that was subtracted from the gradient scales $Q$.

\* Recall the formal definition: $\mathbb{V}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2]$

Let's illustrate this problem and its solution with a simple example:
* Assume there are three actions, where $Q_1$ and $Q_2$ are small positive values and $Q_3$ is a large negative value
* In this case there will be a small positive gradient towards the first two actions and a large negative one repelling the policy from the third one
* Now imagine that $Q_1$ and $Q_2$ are large positive values and $Q_3$ is a small but still positive value. The gradient would still push the policy towards the first two actions, but it would also push it a bit towards the third one (instead of pushing it away from it)!

Now it's a bit clearer why subtracting a constant value, which we called the *baseline*, helps.
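
The effect can be checked numerically. The following is a minimal sketch with made-up $Q$ values (not taken from any experiment) that uses their mean as the constant baseline:

```python
import numpy as np

# Hypothetical action values: all positive, the third one clearly the worst
q_values = np.array([101.0, 102.0, 98.0])

# Use the mean as a constant baseline
baseline = q_values.mean()
centered = q_values - baseline

# Without the baseline every action gets a positive gradient scale
print("raw scales:     ", q_values)
# With the baseline the worst action gets a negative scale again
print("centered scales:", centered)

# The mean squared scale (which drives the gradient magnitude) drops sharply
raw_second_moment = float(np.mean(q_values**2))
centered_second_moment = float(np.mean(centered**2))
print(raw_second_moment, centered_second_moment)
```

Note that the ordering of the actions is unchanged; only the sign and the size of the scales differ.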

## Advantage Actor-Critic (A2C)
The *Advantage Actor-Critic (A2C)* method can be viewed as a combination of PG and DQN, built on a simple idea that extends the variance reduction theme discussed above. Until now we treated the *baseline* as a single constant that we subtracted from all $Q(s, a)$ values. A2C pushes this further and uses a different baseline for each state $s$.

If one recalls the *Dueling DQN*, which exploited the fact that $Q(s, a) = V(s) + A(s, a)$ (i.e. state-action values are composed of *baseline* state values $V(s)$ and action advantages $A(s, a)$ in these states), it is quite straightforward to figure out which values A2C uses as state baselines: the state values $V(s)$!

The *Advantage* Actor-Critic name then comes from the fact that our gradient scales turn into action advantages after subtracting the state values:
$$
\mathbb{E}[Q(s, a) \nabla \log(\pi(a|s))] \to \mathbb{E}[A(s, a) \nabla \log(\pi(a|s))]
$$
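
Subtracting a state-dependent baseline does not bias the gradient estimate. A short derivation, assuming a discrete action space and a baseline $b(s)$ that does not depend on the action:
$$
\mathbb{E}_{a \sim \pi}[b(s) \nabla \log(\pi(a|s))] = b(s) \sum_a \pi(a|s) \nabla \log(\pi(a|s)) = b(s) \sum_a \nabla \pi(a|s) = b(s) \nabla \sum_a \pi(a|s) = b(s) \nabla 1 = 0
$$

Taking $b(s) = V(s)$ therefore changes only the variance of the estimate, not its expectation.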

Finally, the question is how to obtain $V(s)$. Here comes the second part, the combination with the DQN approach: we train a network that approximates $V(s)$ alongside our PGN, much like we trained a DQN.

*Notes*:
* *There will actually be just a single NN that learns both the policy and the state values (discussed below)*
* *Improvements from both methods are still applicable (also mentioned and shown in the following sections)*

### Common Imports

In [None]:
# flake8: noqa: E402,I001

import time
from typing import Any, List, Sequence, Tuple

import gym
import numpy as np
import ptan
import torch
import torch.nn as nn
from ptan.experience import ExperienceFirstLast
from tensorboardX import SummaryWriter

### Reward Tracker

In [None]:
class RewardTracker:
    def __init__(
        self,
        writer: SummaryWriter,
        stop_reward: float,
        window_size: int = 100,
    ) -> None:
        self.writer = writer
        self.stop_reward = stop_reward
        self.window_size = window_size
        self.best_mean_reward = float("-inf")

    def __enter__(self) -> "RewardTracker":
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        return self

    def __exit__(self, *args: Any) -> None:
        self.writer.close()

    def add_reward(self, reward: float, frame: int) -> bool:
        """
        Returns an indication of whether a termination condition was reached.
        """

        self.total_rewards.append(reward)

        fps = (frame - self.ts_frame) / (time.time() - self.ts)

        self.ts_frame = frame
        self.ts = time.time()

        mean_reward = np.mean(self.total_rewards[-self.window_size :])

        if mean_reward > self.best_mean_reward:
            self.best_mean_reward = mean_reward
            print(
                f"{frame}: done {len(self.total_rewards)} games, "
                f"mean reward {mean_reward:.3f}, speed {fps:.2f} fps"
            )

        self.writer.add_scalar("fps", fps, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)

        return mean_reward > self.stop_reward

### Atari A2C PG Network
This NN is quite similar to the *Dueling DQN* architecture, but with an important difference. The Dueling DQN also has two parts:
1. A part for the state values $V(s)$
1. A part for the action advantages $A(s, a)$

But like any other DQN, it still outputs $Q(s, a) = V(s) + A(s, a)$. Here we have two separate outputs on top of a common base network:
1. Policy network that outputs action logits, which become the policy $\pi(a|s)$ once they are converted to probabilities using softmax
1. Value network which computes $V(s)$

In [None]:
class AtariA2C(nn.Module):
    """
    A2C network with 2D convolutional base for Atari envs. and two heads:
    1. Policy - dense network that outputs action logits
    2. Value - dense network that models state values `V(s)`
    """

    def __init__(self, input_shape: Tuple[int, ...], n_actions: int) -> None:
        super().__init__()

        # 2D conv. base network common to both heads
        #  - This way both nets share commonly learned basic features
        #  - Also helps with convergence (compared to having two separate NNs)
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_out(input_shape)

        # Policy NN - outputs action logits
        self.policy = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

        # Value NN - outputs state value
        self.value = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def _get_conv_out(self, shape: Tuple[int, ...]) -> int:
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        For an input batch of states, returns pair of tensors
          1. Policy logits for all actions in these states
          2. Values of these states
        """
        inputs = x.float() / 256
        conv_out = self.conv(inputs).view(inputs.size()[0], -1)
        return self.policy(conv_out), self.value(conv_out)
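
As a sanity check of the convolutional base, the flattened output size can be worked out by hand. The sketch below assumes an input of 4 stacked 84x84 frames (a common Atari preprocessing; the actual shape comes from `envs[0].observation_space.shape`) and uses a hypothetical helper `conv_out_dim` that is not part of the notebook:

```python
import torch
import torch.nn as nn


def conv_out_dim(size: int, kernel_size: int, stride: int) -> int:
    """Output length of an unpadded convolution along one spatial axis."""
    return (size - kernel_size) // stride + 1


# Assumed input: 4 stacked frames of 84x84 pixels
side = 84
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    side = conv_out_dim(side, kernel_size, stride)

# The last conv layer has 64 output channels
expected_size = 64 * side * side

# Same layer configuration as the conv base, evaluated on a dummy input
conv = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.ReLU(),
)
with torch.no_grad():
    actual_size = conv(torch.zeros(1, 4, 84, 84)).numel()

print(expected_size, actual_size)
```

Under this assumption the spatial size shrinks 84 → 20 → 9 → 7, so both heads receive a vector of 64 · 7 · 7 = 3136 features.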

### Batch Unpacking
Similarly to implementations of other methods, we will use a function that unpacks an experience batch into its components.

One important difference here is that we'll use the A2C NN to evaluate the final states from the batch to get $V(s_N)$. Here we assume that we make $N$ steps ahead in the environment(s).

Using all the experienced rewards $r_i$ from the batch we can compute target Q values as
$$
Q(s, a) = \sum_{i = 0}^{N - 1} \gamma^i r_i + \gamma^N V(s_N)
$$
* The first part (the sum) is the total discounted reward collected over the $N$ steps
* The second part is the discounted future reward from the $N$-th step onwards (predicted by the NN)
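
The target can be traced on a few made-up numbers. This is a minimal sketch with a hypothetical helper `n_step_target`. In the notebook code the discounted sum appears to be accumulated already in `exp.reward`, so `unpack_batch` only adds the bootstrapped term.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    last_value: float,
    gamma: float,
) -> float:
    """Discounted sum of N rewards plus the bootstrapped value of s_N."""
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**n * last_value


# Made-up numbers: N = 4 steps, V(s_N) = 2.0, gamma = 0.5
target = n_step_target([1.0, 0.0, 0.0, 1.0], last_value=2.0, gamma=0.5)

# If the episode ended within the N steps there is nothing to bootstrap from
terminal_target = n_step_target([1.0, 0.0, 0.0, 1.0], last_value=0.0, gamma=0.5)

print(target, terminal_target)
```

With these numbers the reward sum is $1 + 0.5^3 = 1.125$ and the bootstrapped part is $0.5^4 \cdot 2 = 0.125$, giving a target of $1.25$.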

In [None]:
def unpack_batch(
    batch: Sequence[ExperienceFirstLast],
    net: AtariA2C,
    gamma: float,
    reward_steps: int,
    device: str = "cpu",
) -> Tuple[torch.FloatTensor, torch.LongTensor, torch.FloatTensor]:
    """
    Convert batch into training tensors

    :param batch: Experiences from environment(s)
    :param net: A2C network that can approximate state values
    :returns: states, actions, target Q values as tensors
    """
    states, actions, rewards, not_done_exps, last_states = [], [], [], [], []

    # Unwrap each transition from the batch
    #  - And mark entries from unfinished episodes
    for i, exp in enumerate(batch):

        states.append(np.array(exp.state, copy=False))
        actions.append(int(exp.action))
        rewards.append(exp.reward)

        if exp.last_state is not None:
            not_done_exps.append(i)
            last_states.append(np.array(exp.last_state, copy=False))

    # Note: Wrapping states into np array is to fix PyTorch performance issue

    # Convert states and actions to tensors
    states = torch.FloatTensor(np.array(states, copy=False)).to(device)
    actions = torch.LongTensor(actions).to(device)

    # Compute target state values V(s) for training the net
    #  - Uses given A2C NN to predict the state values

    # Init target Q(s, a) to (discounted) rewards
    #  - This will be the final value at the end of an episode
    target_values = np.array(rewards, dtype=np.float32)

    if not_done_exps:

        # Convert next states to a tensor
        last_states = torch.FloatTensor(np.array(last_states, copy=False)).to(
            device
        )

        # Use given A2C net to predict future values
        _, last_state_values = net(last_states)
        last_state_values = last_state_values.data.cpu().numpy()[:, 0]

        # Add future values to the discounted rewards if episode is not done
        last_state_values *= gamma ** reward_steps
        target_values[not_done_exps] += last_state_values

    # Convert target values to a tensor
    target_values = torch.FloatTensor(target_values).to(device)

    return states, actions, target_values

### A2C Training *(Atari Pong)*
The training loop builds on and extends the PG example from the previous chapter.

The loss function now has three components:
1. **Policy loss** - similar to the PG method but with the gradient scales $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is obtained from the batch unpacking described above (i.e. from the experienced rewards and the net's $V(s')$) and $V(s)$ is the net's prediction for the current state(s)
1. **Value loss** - simply the MSE between the current values $V(s)$ and the TD targets (the same $Q(s, a)$ values from the experience batch that are used for the policy loss)
1. **Entropy bonus** - the same technique as used for PG, which adds an entropy component $\mathcal{L}_H = \beta \sum_i \pi(s_i) \log(\pi(s_i))$ to the loss. This term is the negative entropy scaled by $\beta$, so minimizing it pushes the policy towards the uniform distribution, which favours exploration
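
The three components can be sketched on a tiny hand-made batch. All tensors below are made up for illustration, and `beta` mirrors the `ENTROPY_BETA` hyperparameter used later in the notebook:

```python
import torch
import torch.nn.functional as F

# Made-up batch of 3 states with 2 actions each
logits = torch.tensor([[2.0, 0.5], [0.1, 0.3], [1.0, 1.0]], requires_grad=True)
values = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)  # V(s)
actions = torch.tensor([0, 1, 0])
target_q = torch.tensor([1.0, -1.0, 0.0])  # N-step targets Q(s, a)
beta = 0.01  # entropy coefficient

log_prob = F.log_softmax(logits, dim=1)
prob = F.softmax(logits, dim=1)

# 1. Policy loss: advantages are treated as constants (detached from V)
advantage = target_q - values.detach()
chosen_log_prob = log_prob[torch.arange(len(actions)), actions]
policy_loss = -(advantage * chosen_log_prob).mean()

# 2. Value loss: MSE between V(s) and the targets
value_loss = F.mse_loss(values, target_q)

# 3. Entropy term: beta * sum(p * log p), i.e. negative entropy times beta
entropy_loss = beta * (prob * log_prob).sum(dim=1).mean()

total_loss = policy_loss + value_loss + entropy_loss
total_loss.backward()

print(policy_loss.item(), value_loss.item(), entropy_loss.item())
```

Because the advantage is detached, `values.grad` comes from the value loss only, while `logits.grad` comes from the policy and entropy terms.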

Finally, we'll use multiple copies of the same environment to sample our experience batch from. This is the same technique for breaking correlations that was used for the vanilla PG method. There are two basic variants of the *Actor-Critic* method:
* A2C which uses parallel environments with synchronized policy gradient updates
* A3C which stands for *Asynchronous Advantage Actor-Critic* and will be described in the next chapter

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

In [None]:
# Hyperparameters
GAMMA = 0.99
LEARNING_RATE = 0.001
ADAM_EPS = 1e-3
ENTROPY_BETA = 0.01
BATCH_SIZE = 128
NUM_ENVS = 50
REWARD_STEPS = 4
CLIP_GRAD = 0.1
STOP_REWARD = 18
SEED = 42

# Set RNG state
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)

# Determine where the computations will take place
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Make multiple instances of the Atari Pong environment


def make_pong_env(i: int, seed: int) -> gym.Env:
    env = ptan.common.wrappers.wrap_dqn(gym.make("PongNoFrameskip-v4"))
    env.seed(seed + i)
    return env


envs = [make_pong_env(i, seed=SEED) for i in range(NUM_ENVS)]

# Create the A2C network for Atari environments
net = AtariA2C(
    input_shape=envs[0].observation_space.shape,
    n_actions=envs[0].action_space.n,
).to(device)
print(net)

# Initialize the policy agent
# - Instead of passing the whole NN, we use a callback over it returning just
#   the action logits
agent = ptan.agent.PolicyAgent(
    model=lambda x: net(x)[0],
    apply_softmax=True,
    device=device,
)

# Create `REWARD_STEPS`-ahead experience source over all environments
exp_source = ptan.experience.ExperienceSourceFirstLast(
    env=envs,
    agent=agent,
    gamma=GAMMA,
    steps_count=REWARD_STEPS,
)

# Create Adam optimizer
#  - Note: We use larger epsilon to make the training converge
optimizer = torch.optim.Adam(net.parameters(), lr=LEARNING_RATE, eps=ADAM_EPS)

# Create TensorBoard writer for metrics collection
writer = SummaryWriter(comment="-pong-a2c")

# Create TensorBoard trackers
with RewardTracker(writer, stop_reward=STOP_REWARD) as tracker:
    with ptan.common.utils.TBMeanTracker(writer, batch_size=10) as tb_tracker:

        batch = []

        # Run the training loop consuming experiences from the source
        for i, exp in enumerate(exp_source):

            # Add new experience to current batch
            batch.append(exp)

            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:

                # Record new reward and check for termination
                solved = tracker.add_reward(reward=new_rewards[0], frame=i)

                # Stop if the mean reward was good enough
                if solved:
                    print(f"Solved in {i} steps!")
                    break

            # Let the batch fill up
            if len(batch) < BATCH_SIZE:
                continue

            # Unpack and clear current batch
            states, actions, target_values = unpack_batch(
                batch=batch,
                net=net,
                gamma=GAMMA,
                reward_steps=REWARD_STEPS,
                device=device,
            )
            batch.clear()

            # Clear gradients
            optimizer.zero_grad()

            # Compute action logits and values of current states
            action_logits, values = net(states)

            # Compute V(s) part of the loss function
            value_loss = nn.functional.mse_loss(
                input=values.squeeze(-1),
                target=target_values,
            )

            # Compute the policy part of the loss function
            #  - We use A(s, a) = Q(s, a) - V(s) as the gradient scales
            #  - Note: We detach values from the autograd graph (no grad flow)
            log_action_prob = nn.functional.log_softmax(action_logits, dim=1)
            advantage = target_values - values.detach()
            scaled_log_action_prob = (
                advantage * log_action_prob[range(BATCH_SIZE), actions]
            )
            policy_loss = -scaled_log_action_prob.mean()

            # Compute entropy bonus to the loss function
            action_prob = nn.functional.softmax(action_logits, dim=1)
            entropy_loss = (
                ENTROPY_BETA
                * (action_prob * log_action_prob).sum(dim=1).mean()
            )

            # First calculate policy gradients only
            policy_loss.backward(retain_graph=True)
            grads = np.concatenate(
                [
                    param.grad.data.cpu().numpy().flatten()
                    for param in net.parameters()
                    if param.grad is not None
                ]
            )

            # Apply entropy and value gradients
            loss = entropy_loss + value_loss
            loss.backward()

            # Use gradient clipping (by l2 norm) before making next step
            nn.utils.clip_grad_norm_(net.parameters(), CLIP_GRAD)
            optimizer.step()

            # Get total loss for tracking
            loss += policy_loss

            # Track metrics
            tb_tracker.track("advantage", advantage, i)
            tb_tracker.track("values", values, i)
            tb_tracker.track("batch_rewards", target_values, i)
            tb_tracker.track("loss_entropy", entropy_loss, i)
            tb_tracker.track("loss_policy", policy_loss, i)
            tb_tracker.track("loss_value", value_loss, i)
            tb_tracker.track("loss_total", loss, i)
            tb_tracker.track("grad_l2", np.sqrt(np.mean(np.square(grads))), i)
            tb_tracker.track("grad_max", np.max(np.abs(grads)), i)
            tb_tracker.track("grad_var", np.var(grads), i)
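
The gradient clipping step from the loop can also be looked at in isolation. This is a minimal sketch on a toy linear layer. The layer and its inputs are made up, and `max_norm=0.1` mirrors `CLIP_GRAD`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)

# Produce deliberately large gradients
loss = (layer(torch.ones(8, 4)) * 100.0).sum()
loss.backward()

# clip_grad_norm_ rescales all gradients in place so that their joint L2 norm
# is at most max_norm, and returns the norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm=0.1)
norm_after = torch.sqrt(sum((p.grad**2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

Only the magnitude of the update is limited this way; the direction of the joint gradient is preserved.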