# Actor-critic (A2C) on Pong

In [1]:
import sys
sys.path.append("../Chapter11/")

In [2]:
import gym
import ptan
import numpy as np
import argparse
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.nn.utils as nn_utils
import torch.nn.functional as F
import torch.optim as optim

from lib import common

We start, as usual, by defining hyperparameters. These values
are not tuned, as we will do this in the next section of this chapter. 

We have one new value here: CLIP_GRAD. This hyperparameter sets the threshold for
gradient clipping, which prevents our gradients from becoming too large at the
optimization stage and pushing our policy too far. Clipping is implemented with
PyTorch's nn_utils.clip_grad_norm_(), and the idea is simple: if the L2 norm of the
full gradient vector is larger than this hyperparameter, the vector is scaled down
so that its norm equals this value.
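
The clipping rule described above can be sketched in plain Python. This is only an illustration of the idea: the helper name `clip_by_l2_norm` and the sample gradient values are made up for this example, and PyTorch's own routine works in place on the combined gradient of all parameters.

```python
import math

def clip_by_l2_norm(grads, max_norm):
    """Return (clipped gradients, norm before clipping) for a flat list of values."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Small gradients pass through unchanged.
        return list(grads), total_norm
    # Rescale the whole vector so its L2 norm equals max_norm;
    # the direction of the gradient is preserved.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

CLIP_GRAD = 0.1
clipped, norm_before = clip_by_l2_norm([0.3, -0.4], CLIP_GRAD)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Note that the whole vector is rescaled by one factor, so the relative size of the individual components does not change.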

The REWARD_STEPS hyperparameter determines how many steps ahead we will take
to approximate the total discounted reward for every action.

In the policy gradient methods, we used about 10 steps, but in A2C, we will use
our value approximation to get a state value for further steps, so it will be fine to
decrease the number of steps.

In [3]:
GAMMA = 0.99
LEARNING_RATE = 0.003
ENTROPY_BETA = 0.03
BATCH_SIZE = 32
NUM_ENVS = 50

REWARD_STEPS = 4
CLIP_GRAD = 0.1

Our network architecture has a shared convolution body and two heads: the first
returns the policy with the probability distribution over our actions and the second
head returns one single number, which will approximate the state's value. It might
look similar to our dueling DQN architecture from Chapter 8, DQN Extensions, but
our training procedure is different.

The forward pass through the network returns a tuple of two tensors: the policy
logits and the value. Alongside the network we need a large and important function,
unpack_batch() (defined further below), which takes the batch of environment
transitions and returns three tensors: the batch of states, the batch of actions
taken, and the batch of Q-values calculated using the formula
$Q(s, a) = \sum_{i=0}^{N-1} \gamma^i r_i + \gamma^N V(s_N)$.
This Q-value will be used in two places: to calculate the mean squared error (MSE)
loss that improves the value approximation, in the same way as DQN, and to calculate
the advantage of the action.
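
As a small worked example of the Q-value formula and the advantage derived from it, here is a plain-Python sketch. The function name `n_step_q`, the rewards, and the two state values are invented for illustration; in the notebook the state values come from the network's value head.

```python
GAMMA = 0.99

def n_step_q(rewards, last_value, gamma=GAMMA):
    """Discounted sum of N rewards plus the discounted value of the state after them.

    last_value stands for V(s_N); pass None if the episode ended within these steps.
    """
    q = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if last_value is not None:
        q += (gamma ** len(rewards)) * last_value
    return q

rewards = [0.0, 0.0, 1.0, 0.0]          # made-up rewards for N = 4 steps
q = n_step_q(rewards, last_value=0.5)   # 0.5 is a made-up V(s_N)
advantage = q - 0.4                     # 0.4 is a made-up V(s) for the first state
print(q, advantage)
```

A positive advantage means the action did better than the critic expected from that state, so the policy gradient pushes its probability up.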

In [4]:
class AtariA2C(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(AtariA2C, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        self.policy = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

        self.value = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        fx = x.float() / 256
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.policy(conv_out), self.value(conv_out)

In the first loop of unpack_batch(), we just walk through our batch of transitions
and copy their fields into lists. Note that the reward value already contains the
discounted reward for REWARD_STEPS, as we use the ptan.ExperienceSourceFirstLast
class. We also need to handle episode-ending situations and remember the indices of
batch entries whose episode did not end within those steps.

After the loop, we convert the gathered states and actions into PyTorch tensors and
copy them to the graphics processing unit (GPU) if needed. The extra conversion of
the list into a single NumPy array might look redundant, but without it, the
performance of tensor creation degrades 5-10x. At the time of writing, this issue in
PyTorch (https://github.com/pytorch/pytorch/issues/13918) had not been resolved, so
one solution is to pass a single NumPy array instead of a list of arrays.

The rest of the function calculates the Q-values. For transitions whose episode has
not ended, it adds the value of the last state, discounted by GAMMA ** REWARD_STEPS;
for terminal transitions, the accumulated reward is used alone.
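
The index bookkeeping for terminal and non-terminal transitions can be sketched with plain lists. The batch below is invented for illustration: each entry pairs an already discounted n-step reward with a stand-in for the value of the last state, or None when the episode ended.

```python
GAMMA = 0.99
REWARD_STEPS = 4

# (discounted n-step reward, value of the last state or None if the episode ended)
batch = [(1.0, 0.5), (-1.0, None), (0.0, 0.2)]

rewards = [r for r, _ in batch]
# Remember which entries still have a state to bootstrap from.
not_done_idx = [i for i, (_, v) in enumerate(batch) if v is not None]
last_vals = [v for _, v in batch if v is not None]

# Only non-terminal entries receive the discounted value of their last state.
ref_vals = list(rewards)
for idx, val in zip(not_done_idx, last_vals):
    ref_vals[idx] += (GAMMA ** REWARD_STEPS) * val

print(not_done_idx, ref_vals)
```

The terminal entry keeps its reward unchanged, because there is no future value to add after the episode ends.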

In [5]:
def unpack_batch(batch, net, device='cpu'):
    """
    Convert a batch of transitions into training tensors
    :param batch: list of transitions from ptan.ExperienceSourceFirstLast
    :param net: network whose value head is used for bootstrapping
    :param device: device on which the tensors are created
    :return: states tensor, actions tensor, reference values tensor
    """
    states = []
    actions = []
    rewards = []
    not_done_idx = []
    last_states = []
    for idx, exp in enumerate(batch):
        # np.asarray copies only when it has to (np.array(..., copy=False)
        # raises in NumPy 2 if a copy is needed)
        states.append(np.asarray(exp.state))
        actions.append(int(exp.action))
        rewards.append(exp.reward)
        # last_state is None when the episode ended within REWARD_STEPS
        if exp.last_state is not None:
            not_done_idx.append(idx)
            last_states.append(np.asarray(exp.last_state))

    # a single ndarray converts to a tensor much faster than a list of arrays
    states_v = torch.FloatTensor(np.asarray(states)).to(device)
    actions_t = torch.LongTensor(actions).to(device)

    # handle rewards: add the discounted value of the last state
    # to every transition whose episode has not ended
    rewards_np = np.array(rewards, dtype=np.float32)
    if not_done_idx:
        last_states_v = torch.FloatTensor(np.asarray(last_states)).to(device)
        last_vals_v = net(last_states_v)[1]
        last_vals_np = last_vals_v.data.cpu().numpy()[:, 0]
        last_vals_np *= GAMMA ** REWARD_STEPS
        rewards_np[not_done_idx] += last_vals_np

    ref_vals_v = torch.FloatTensor(rewards_np).to(device)

    return states_v, actions_t, ref_vals_v

The preparation code for the training loop is the same as usual, except that we now
use the array of environments to gather experience, instead of one environment.

One very important detail here is passing the eps parameter to the optimizer. If
you're familiar with the Adam algorithm, you may know that epsilon is a small
number added to the denominator to prevent division by zero. Normally, this value
is set to some small number such as 1e-8 or 1e-10, but in our case, these values
turned out to be too small. I have no mathematically strict explanation for this,
but with the default value of epsilon, the method does not converge at all. Very
likely, dividing by a value as small as 1e-8 makes the parameter updates too large,
which turns out to be fatal for training stability. For this reason, we pass
eps=1e-3.
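
To get a feel for what epsilon does, here is a rough single-parameter sketch of the Adam update under the simplifying assumption of a constant gradient. It is not PyTorch's implementation, and the helper name `adam_step_size` and the gradient values are made up; it only illustrates the intuition given above and is not a proof of it.

```python
import math

def adam_step_size(grad, lr, eps, beta1=0.9, beta2=0.999, steps=1000):
    """Size of one Adam update after `steps` updates with a constant gradient."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** steps)                # bias correction
    v_hat = v / (1 - beta2 ** steps)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

LEARNING_RATE = 0.003
tiny_default = adam_step_size(1e-5, LEARNING_RATE, eps=1e-8)
tiny_large_eps = adam_step_size(1e-5, LEARNING_RATE, eps=1e-3)
big_default = adam_step_size(1.0, LEARNING_RATE, eps=1e-8)
big_large_eps = adam_step_size(1.0, LEARNING_RATE, eps=1e-3)
print(tiny_default, tiny_large_eps, big_default, big_large_eps)
```

In this sketch, a parameter with a tiny gradient still moves by almost the full learning rate when eps=1e-8, while eps=1e-3 damps that step heavily and leaves steps for large gradients nearly unchanged.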

In [None]:
# fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

make_env = lambda: ptan.common.wrappers.wrap_dqn(gym.make("PongNoFrameskip-v4"))
envs = [make_env() for _ in range(NUM_ENVS)]
writer = SummaryWriter(comment="-pong-a2c_" + "test")

net = AtariA2C(envs[0].observation_space.shape, envs[0].action_space.n).to(device)
print(net)

agent = ptan.agent.PolicyAgent(lambda x: net(x)[0], apply_softmax=True, device=device)
exp_source = ptan.experience.ExperienceSourceFirstLast(envs, agent, gamma=GAMMA, steps_count=REWARD_STEPS)

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE, eps=1e-3)

batch = []

with common.RewardTracker(writer, stop_reward=18) as tracker:
    with ptan.common.utils.TBMeanTracker(writer, batch_size=10) as tb_tracker:
        for step_idx, exp in enumerate(exp_source):
            batch.append(exp)

            # handle new rewards
            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:
                if tracker.reward(new_rewards[0], step_idx):
                    break

            if len(batch) < BATCH_SIZE:
                continue

            states_v, actions_t, vals_ref_v = unpack_batch(batch, net, device=device)
            batch.clear()

            optimizer.zero_grad()
            logits_v, value_v = net(states_v)
            loss_value_v = F.mse_loss(value_v.squeeze(-1), vals_ref_v)

            log_prob_v = F.log_softmax(logits_v, dim=1)
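            # squeeze the (batch, 1) value tensor first; otherwise the subtraction
            # broadcasts to a (batch, batch) matrix instead of a 1-D advantage
            adv_v = vals_ref_v - value_v.squeeze(-1).detach()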
            log_prob_actions_v = adv_v * log_prob_v[range(BATCH_SIZE), actions_t]
            loss_policy_v = -log_prob_actions_v.mean()

            prob_v = F.softmax(logits_v, dim=1)
            entropy_loss_v = ENTROPY_BETA * (prob_v * log_prob_v).sum(dim=1).mean()

            # calculate policy gradients only
            loss_policy_v.backward(retain_graph=True)
            grads = np.concatenate([p.grad.data.cpu().numpy().flatten()
                                    for p in net.parameters()
                                    if p.grad is not None])

            # apply entropy and value gradients
            loss_v = entropy_loss_v + loss_value_v
            loss_v.backward()
            nn_utils.clip_grad_norm_(net.parameters(), CLIP_GRAD)
            optimizer.step()
            # get full loss
            loss_v += loss_policy_v

            tb_tracker.track("advantage",       adv_v, step_idx)
            tb_tracker.track("values",          value_v, step_idx)
            tb_tracker.track("batch_rewards",   vals_ref_v, step_idx)
            tb_tracker.track("loss_entropy",    entropy_loss_v, step_idx)
            tb_tracker.track("loss_policy",     loss_policy_v, step_idx)
            tb_tracker.track("loss_value",      loss_value_v, step_idx)
            tb_tracker.track("loss_total",      loss_v, step_idx)
            tb_tracker.track("grad_l2",         np.sqrt(np.mean(np.square(grads))), step_idx)
            tb_tracker.track("grad_max",        np.max(np.abs(grads)), step_idx)
            tb_tracker.track("grad_var",        np.var(grads), step_idx)

AtariA2C(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (policy): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
  (value): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=1, bias=True)
  )
)
37517: done 1 games, mean reward -21.000, speed 367.91 f/s
37570: done 2 games, mean reward -21.000, speed 333.56 f/s
37651: done 3 games, mean reward -21.000, speed 386.47 f/s
37682: done 4 games, mean reward -21.000, speed 318.37 f/s
38555: done 5 games, mean reward -21.000, speed 372.31 f/s
38566: done 6 games, mean reward -21.000, speed 174.43 f/s
39030: done 7 games, mean reward -21.000, speed 371

137059: done 127 games, mean reward -20.250, speed 366.45 f/s
137173: done 128 games, mean reward -20.240, speed 392.05 f/s
137647: done 129 games, mean reward -20.260, speed 358.93 f/s
137826: done 130 games, mean reward -20.250, speed 357.62 f/s
137923: done 131 games, mean reward -20.230, speed 344.84 f/s
138112: done 132 games, mean reward -20.230, speed 347.16 f/s
139480: done 133 games, mean reward -20.220, speed 379.55 f/s
139714: done 134 games, mean reward -20.220, speed 354.85 f/s
139818: done 135 games, mean reward -20.220, speed 369.22 f/s
140354: done 136 games, mean reward -20.210, speed 373.29 f/s
140393: done 137 games, mean reward -20.200, speed 385.98 f/s
140990: done 138 games, mean reward -20.200, speed 365.14 f/s
141842: done 139 games, mean reward -20.190, speed 373.09 f/s
142453: done 140 games, mean reward -20.180, speed 370.56 f/s
144832: done 141 games, mean reward -20.180, speed 370.55 f/s
144840: done 142 games, mean reward -20.180, speed 305.10 f/s
145618: 

265077: done 260 games, mean reward -20.270, speed 381.56 f/s
265654: done 261 games, mean reward -20.270, speed 367.66 f/s
266021: done 262 games, mean reward -20.270, speed 372.53 f/s
266098: done 263 games, mean reward -20.280, speed 386.11 f/s
266543: done 264 games, mean reward -20.290, speed 359.00 f/s
267758: done 265 games, mean reward -20.290, speed 374.33 f/s
267793: done 266 games, mean reward -20.310, speed 380.37 f/s
267835: done 267 games, mean reward -20.320, speed 374.03 f/s
270273: done 268 games, mean reward -20.320, speed 376.45 f/s
271000: done 269 games, mean reward -20.320, speed 372.91 f/s
272416: done 270 games, mean reward -20.340, speed 367.20 f/s
273288: done 271 games, mean reward -20.350, speed 360.33 f/s
273613: done 272 games, mean reward -20.360, speed 354.11 f/s
273931: done 273 games, mean reward -20.350, speed 351.47 f/s
274069: done 274 games, mean reward -20.330, speed 348.58 f/s
274298: done 275 games, mean reward -20.330, speed 356.08 f/s
275119: 

380408: done 393 games, mean reward -20.450, speed 356.70 f/s
382816: done 394 games, mean reward -20.440, speed 373.50 f/s
384011: done 395 games, mean reward -20.450, speed 371.35 f/s
385037: done 396 games, mean reward -20.430, speed 377.89 f/s
386034: done 397 games, mean reward -20.440, speed 382.25 f/s
386790: done 398 games, mean reward -20.440, speed 373.24 f/s
391727: done 399 games, mean reward -20.440, speed 372.70 f/s
391942: done 400 games, mean reward -20.430, speed 363.76 f/s
393299: done 401 games, mean reward -20.430, speed 378.50 f/s
393723: done 402 games, mean reward -20.430, speed 376.03 f/s
393909: done 403 games, mean reward -20.430, speed 357.65 f/s
395643: done 404 games, mean reward -20.430, speed 378.39 f/s
396607: done 405 games, mean reward -20.440, speed 369.55 f/s
397225: done 406 games, mean reward -20.450, speed 360.00 f/s
398772: done 407 games, mean reward -20.460, speed 371.78 f/s
399403: done 408 games, mean reward -20.470, speed 366.94 f/s
401602: 