## DQN on Pong

Before we jump into the code, some introduction is needed. Our examples are
becoming increasingly challenging and complex, which is not surprising, as
the complexity of the problems that we are trying to tackle is also growing. The
examples are as simple and concise as possible, but some of the code may be
difficult to understand at first.

Another thing to note is performance. Our previous examples for FrozenLake and
CartPole were not demanding from a performance perspective: observations were
small, NN parameters were few, and shaving off extra milliseconds in the training
loop wasn't important. From now on, that's no longer the case. A single observation
from the Atari environment is about 100k values (210 × 160 pixels × 3 color
channels), which have to be rescaled, converted to floats, and stored in the replay
buffer. One extra copy of this data array can cost you training speed, which is no
longer measured in seconds and minutes, but could be hours on even the fastest
graphics processing unit (GPU) available.
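
To put a number on the size of one observation, here is a small sketch that counts the values and bytes in a raw frame. It assumes the 210×160 RGB resolution handled by the wrappers in this section; the variable names are illustrative only.

```python
import numpy as np

# One raw Atari frame: 210 rows x 160 columns x 3 color channels, one byte each.
frame = np.zeros((210, 160, 3), dtype=np.uint8)

n_values = frame.size                             # number of values per observation
bytes_uint8 = frame.nbytes                        # 1 byte per value
bytes_float32 = frame.astype(np.float32).nbytes   # 4 bytes per value after conversion

print(n_values, bytes_uint8, bytes_float32)
```

Converting to floats multiplies the memory footprint by four, which is one reason an unnecessary copy of a whole batch is expensive.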

The NN training loop could also be a bottleneck. Of course, RL models are not
such huge monsters as state-of-the-art ImageNet models, but even the DQN model
from 2015 has more than 1.5M parameters, which is a lot for a GPU to crunch.
So, to make a long story short, performance matters, especially when you are
experimenting with hyperparameters and need to wait not for a single model to
train, but for dozens of them.
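
The parameter count can be checked by hand. The sketch below uses the layer sizes of the DQN class shown later in this section, and assumes four stacked input frames and six actions (the Pong action count in the printed model). The helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a square 2D convolution."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


total = (
    conv_params(4, 32, 8)        # first convolution
    + conv_params(32, 64, 4)     # second convolution
    + conv_params(64, 64, 3)     # third convolution
    + linear_params(3136, 512)   # first fully connected layer
    + linear_params(512, 6)      # output layer, one Q-value per action
)
print(total)
```

Almost all of the parameters sit in the first fully connected layer, which is why the size of the flattened convolution output matters so much.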

PyTorch is quite expressive, so reasonably efficient processing code can look
much less cryptic than optimized TensorFlow graphs, but there is still plenty of
opportunity to do things slowly and to make mistakes. For example, a naïve
version of DQN loss computation, which loops over every batch sample, is about two
times slower than a parallel version. However, a single extra copy of the data batch
could make the same code 13 times slower, which is quite significant.

This example has been split into three modules due to its length, logical structure,
and reusability. The modules are as follows:

• Chapter06/lib/wrappers.py: These are Atari environment wrappers,
mostly taken from the OpenAI Baselines project.

• Chapter06/lib/dqn_model.py: This is the DQN neural network, with the same
architecture as the DeepMind DQN from the Nature paper.

• Chapter06/02_dqn_pong.py: This is the main module, with the training
loop, loss function calculation, and experience replay buffer.

### Wrappers

To make things faster, several transformations are applied to the Atari platform
interaction, which are described in DeepMind's paper. Some of these transformations
influence only performance, but some address Atari platform features that make
learning long and unstable. Transformations are usually implemented as OpenAI
Gym wrappers of various kinds. The full list is quite lengthy and there are several
implementations of the same wrappers in various sources. My personal favorite is
in the OpenAI Baselines repository, which is a set of RL methods and algorithms
implemented in TensorFlow and applied to popular benchmarks to establish the
common ground for comparing methods. The repository is available from
https://github.com/openai/baselines, and wrappers are available in this file:
https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py.

Sometimes, when the DQN is not converging, the
problem is not in the code but in the wrongly wrapped environment. I've spent
several days debugging convergence issues caused by missing the FIRE button press
at the beginning of a game!

In [1]:
import cv2
import gym
import gym.spaces
import numpy as np
import collections

In [2]:
class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        """For environments where the user needs to press FIRE for the game to start."""
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs

The preceding wrapper presses the FIRE button in environments that require that
for the game to start. In addition to pressing FIRE, this wrapper checks for several
corner cases that are present in some games.

In [3]:
class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        """Return only every `skip`-th frame"""
        super(MaxAndSkipEnv, self).__init__(env)
        # most recent raw observations (for max pooling across time steps)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        """Clear past frame buffer and init. to first obs. from inner env."""
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs

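This wrapper combines two things: it repeats the chosen action for K frames
(summing the rewards), and it returns the pixel-wise maximum of the last two
frames, which hides the flickering of objects drawn only on alternating frames.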
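
As an illustration of the pixel-wise maximum, here is a minimal sketch with two tiny made-up frames. The arrays are hypothetical; the stack-then-max operation is the same one used in MaxAndSkipEnv.step().

```python
import numpy as np

# Two tiny hypothetical frames: each object is drawn in only one of them,
# which is how flickering sprites can look on consecutive Atari frames.
frame_a = np.array([[0, 200], [0, 0]], dtype=np.uint8)
frame_b = np.array([[0, 0], [90, 0]], dtype=np.uint8)

# Stack along a new time axis, then take the maximum over that axis.
max_frame = np.max(np.stack([frame_a, frame_b]), axis=0)
print(max_frame)
```

Both objects are visible in the combined frame, even though neither source frame contains both.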

In [4]:
class ProcessFrame84(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

    def observation(self, obs):
        return ProcessFrame84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(
                np.float32)
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(
                np.float32)
        else:
            assert False, "Unknown resolution."
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + \
              img[:, :, 2] * 0.114
        resized_screen = cv2.resize(
            img, (84, 110), interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)

The goal of this wrapper is to convert input observations from the emulator, which
normally has a resolution of 210×160 pixels with RGB color channels, to a grayscale
84×84 image. It does this using a colorimetric grayscale conversion (which is closer
to human color perception than a simple averaging of color channels), resizing the
image, and cropping the top and bottom parts of the result.
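
The difference between the weighted conversion and a plain channel average is easy to see on pure colors. The following sketch uses a made-up three-pixel image and the same weights as ProcessFrame84.process(). The resize step is only represented by a zero array of the resulting shape, so that the example does not depend on OpenCV.

```python
import numpy as np

# A hypothetical 1x3 image: one pure red, one pure green, one pure blue pixel.
img = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.float32)

# Weighted conversion with the coefficients used in the wrapper.
weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
luma = img @ weights            # shape (1, 3)

# Plain channel average, for comparison: all three pixels look the same.
average = img.mean(axis=2)

# After resizing to 84 columns x 110 rows, rows 18..101 are kept.
resized = np.zeros((110, 84), dtype=np.float32)
cropped = resized[18:102, :]

print(luma, average, cropped.shape)
```

With the weighted version, green comes out brightest and blue darkest, while simple averaging maps all three colors to the same gray level.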

In [5]:
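class BufferWrapper(gym.ObservationWrapper):
    def __init__(self, env, n_steps, dtype=np.float32):
        super(BufferWrapper, self).__init__(env)
        self.dtype = dtype
        old_space = env.observation_space
        self.observation_space = gym.spaces.Box(
            old_space.low.repeat(n_steps, axis=0),
            old_space.high.repeat(n_steps, axis=0), dtype=dtype)

    # reset() and observation() are required by gym.ObservationWrapper;
    # this rolling-buffer version is one straightforward way to provide them
    def reset(self):
        self.buffer = np.zeros_like(
            self.observation_space.low, dtype=self.dtype)
        return self.observation(self.env.reset())

    def observation(self, observation):
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer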

This class creates a stack of subsequent frames along the first dimension and returns
them as an observation. The purpose is to give the network an idea about the
dynamics of the objects, such as the speed and direction of the ball in Pong or how
enemies are moving. This is very important information, which it is not possible to
obtain from a single image.
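
One way to picture the stacking is as a rolling buffer that drops the oldest frame and appends the newest. The sketch below is an illustration with tiny 2×2 frames and a hypothetical push() helper, not necessarily how the wrapper itself is written.

```python
import numpy as np

n_steps = 4
buffer = np.zeros((n_steps, 2, 2), dtype=np.float32)   # tiny 2x2 "frames"


def push(buffer, frame):
    """Shift frames toward the front and write the newest into the last slot."""
    buffer[:-1] = buffer[1:]
    buffer[-1] = frame
    return buffer


# Feed five frames, each filled with its own time step number.
for t in range(1, 6):
    push(buffer, np.full((2, 2), t, dtype=np.float32))

print(buffer[:, 0, 0])   # frame ids from oldest to newest
```

After five pushes into a four-slot buffer, the first frame has been discarded and the stack holds frames 2 to 5 in order.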

In [6]:
class ImageToPyTorch(gym.ObservationWrapper):
    def __init__(self, env):
        super(ImageToPyTorch, self).__init__(env)
        old_shape = self.observation_space.shape
        new_shape = (old_shape[-1], old_shape[0], old_shape[1])
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=new_shape, dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)

**This simple wrapper changes the shape of the observation from HWC (height, width,
channel) to the CHW (channel, height, width) format required by PyTorch. The
input shape of the tensor has a color channel as the last dimension, but PyTorch's
convolution layers assume the color channel to be the first dimension.**
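
A quick shape check of the axis move, using an empty array with the 84×84×1 shape produced by ProcessFrame84:

```python
import numpy as np

hwc = np.zeros((84, 84, 1), dtype=np.uint8)   # height, width, channel
chw = np.moveaxis(hwc, 2, 0)                  # channel, height, width

print(hwc.shape, "->", chw.shape)
print(np.shares_memory(hwc, chw))             # moveaxis returns a view, not a copy
```

Because np.moveaxis() returns a view, this step does not add an extra copy of the frame.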

In [7]:
class ScaledFloatFrame(gym.ObservationWrapper):
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0

The final wrapper we have in the library converts observation data from bytes to
floats, and scales every pixel's value to the range [0.0...1.0].

In [8]:
def make_env(env_name):
    env = gym.make(env_name)
    env = MaxAndSkipEnv(env)
    env = FireResetEnv(env)
    env = ProcessFrame84(env)
    env = ImageToPyTorch(env)
    env = BufferWrapper(env, 4)
    return ScaledFloatFrame(env)

At the end of the file is a simple function that creates an environment by its name
and applies all the required wrappers to it. That's it for wrappers, so let's look at our
model.

### DQN Model

The model published in Nature has three convolution layers followed by two
fully connected layers. All layers are separated by rectified linear unit (ReLU)
nonlinearities. The output of the model is Q-values for every action available in
the environment, without nonlinearity applied (as Q-values can have any value).
The approach of having all Q-values calculated with one pass through the network
helps us to increase speed significantly in comparison to treating Q(s, a) literally and
feeding observations and actions to the network to obtain the value of the action.

In [9]:
import torch
import torch.nn as nn
import numpy as np


class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)


**To keep the network generic, it is implemented in two parts: a convolutional part
and a fully connected part, each wrapped in nn.Sequential. The code does not use a
"flatten" layer to turn the 3D output of the convolutions into the 1D vector that
the fully connected layer requires (recent PyTorch versions provide nn.Flatten for
this). Instead, the reshaping is done in the forward() function, where the batch of
3D tensors becomes a batch of 1D vectors.**

**Another small problem is that we don't know the exact number of values in the
output from the convolution layers for an input of the given shape. However, we
need to pass this number to the constructor of the first fully connected layer.
One possible solution would be to hard-code this number, which is a function of
the input shape (for 84×84 input, the output from the convolution layers will have
3136 values); however, it's not the best way, as our code would become less robust
to input shape changes. The better solution is a simple function, _get_conv_out(),
that accepts the input shape and applies the convolution layers to a fake tensor of
that shape. The result of the function is the number of values returned by this
application. It is fast, as this call is done once on model creation, and it keeps
our code generic.**
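
The figure of 3136 values can also be derived by hand. The sketch below applies the standard output-size formula for an unpadded convolution to the three layers of the model; conv_out() is a hypothetical helper, not part of the module.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


size = 84
sizes = [size]
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat = 64 * size * size   # 64 channels in the last convolution layer
print(sizes, flat)
```

This matches the in_features value of the first fully connected layer. The advantage of _get_conv_out() is that the same number is obtained automatically for any input shape.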

The application of transformations is done in two steps:
- First, we apply the convolution layers to the input and obtain a 4D tensor as output. This result is flattened to two dimensions: the batch size and, for each batch entry, all the values returned by the convolution as one long vector of numbers. This is done by the view() function of the tensor, which accepts -1 for one single dimension as a wildcard for the remaining elements. For example, if we have a tensor, T, of shape (2, 3, 4), which is a 3D tensor of 24 elements, we can reshape it into a 2D tensor with six rows and four columns using T.view(6, 4). This operation doesn't create a new memory object or move the data in memory; it just changes the higher-level shape of the tensor. The same result could be obtained by T.view(-1, 4) or T.view(6, -1), which is very convenient when your tensor has a batch size in the first dimension.
- Finally, we pass this flattened 2D tensor to our fully connected layers to obtain Q-values for every batch input.
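
The view() behavior described in the first step can be tried on a small tensor. This is a minimal sketch; the batch size of 32 and the 64×7×7 convolution output are taken from the hyperparameters and model of this section.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)   # 3D tensor with 24 elements
a = t.view(6, 4)
b = t.view(-1, 4)                       # -1 is inferred as 24 / 4 = 6
c = t.view(6, -1)                       # -1 is inferred as 24 / 6 = 4

# All views share the storage of the original tensor.
same_storage = a.data_ptr() == t.data_ptr()

# The same trick as in forward(): keep the batch dimension, flatten the rest.
batch = torch.zeros(32, 64, 7, 7)
flat = batch.view(batch.size(0), -1)

print(a.shape, same_storage, flat.shape)
```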

### Training

The third module contains the experience replay buffer, the agent, the loss function
calculation, and the training loop itself. 

Before going into the code, something needs
to be said about the training hyperparameters. DeepMind's Nature paper contained
a table with all the details about hyperparameters used to train its model on all 49
Atari games used for evaluation. DeepMind kept all those parameters the same for
all games (but trained individual models for every game), and it was the team's
intention to show that the method is robust enough to solve lots of games with
varying complexity, action space, reward structure, and other details using one
single model architecture and hyperparameters. However, our goal here is much
more modest: we want to solve just the Pong game.
Pong is quite simple and straightforward in comparison to other games in the Atari
test set, so the hyperparameters in the paper are overkill for our task. For example,
to get the best result on all 49 games, DeepMind used a million-observations replay
buffer, which requires approximately 20 GB of RAM to keep and lots of samples
from the environment to populate.

The epsilon decay schedule that was used is also not the best for a single Pong game.
In the training, DeepMind linearly decayed epsilon from 1.0 to 0.1 during the first
million frames obtained from the environment. However, my own experiments
have shown that for Pong, it's enough to decay epsilon over the first 150k frames
and then keep it stable. The replay buffer also can be much smaller: 10k transitions
will be enough.

In the following example, I've used my parameters. These differ from the parameters
in the paper but will allow us to solve Pong about 10 times faster. On a GeForce GTX
1080 Ti, the following version converges to a mean score of 19.0 in one to two hours,
but with DeepMind's hyperparameters, it will require at least a day.
This speedup, of course, is fine-tuning for one particular environment and can break
convergence on other games. You are free to play with the options and other games
from the Atari set.

In [10]:
import argparse
import time
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.optim as optim

from tensorboardX import SummaryWriter


DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19

GAMMA = 0.99
BATCH_SIZE = 32
REPLAY_SIZE = 10000
LEARNING_RATE = 1e-4
SYNC_TARGET_FRAMES = 1000
REPLAY_START_SIZE = 10000

EPSILON_DECAY_LAST_FRAME = 150000
EPSILON_START = 1.0
EPSILON_FINAL = 0.01

These parameters define the following:
- Our gamma value used for Bellman approximation (GAMMA)
- The batch size sampled from the replay buffer (BATCH_SIZE)
- The maximum capacity of the buffer (REPLAY_SIZE)
- The count of frames we wait for before starting training to populate the
replay buffer (REPLAY_START_SIZE)
- The learning rate used in the Adam optimizer, which is used in this example
(LEARNING_RATE)
- How frequently we sync model weights from the training model to the
target model, which is used to get the value of the next state in the Bellman
approximation (SYNC_TARGET_FRAMES)

The last batch of hyperparameters is related to the epsilon decay schedule. To
achieve proper exploration, we start with epsilon = 1.0 at the early stages of training,
which causes all actions to be selected randomly. Then, during the first 150,000
frames, epsilon is linearly decayed to 0.01, which corresponds to a random action
taken in 1% of steps. A similar scheme was used in the original DeepMind paper,
but the decay lasted almost seven times longer and stopped at a higher value
(epsilon = 0.1 was reached after a million frames).
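
The schedule can be written as a small function and evaluated at a few frame indices. The constants are the ones defined above and the formula mirrors the one used in the training loop; epsilon_at() is a hypothetical helper name.

```python
EPSILON_START = 1.0
EPSILON_FINAL = 0.01
EPSILON_DECAY_LAST_FRAME = 150000


def epsilon_at(frame_idx):
    """Linear decay from EPSILON_START, clipped from below at EPSILON_FINAL."""
    return max(EPSILON_FINAL,
               EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)


for frame_idx in (0, 75000, 148500, 150000, 1000000):
    print(frame_idx, round(epsilon_at(frame_idx), 3))
```

With this formula, the straight line would reach zero at frame 150,000, so the floor of 0.01 takes over slightly earlier, at about frame 148,500.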

The next chunk of code defines our experience replay buffer, the purpose of which
is to keep the transitions obtained from the environment (tuples of the observation,
action, reward, done flag, and the next state). Each time we do a step in the
environment, we push the transition into the buffer, keeping only a fixed number of
steps (in our case, 10k transitions). For training, we randomly sample the batch of
transitions from the replay buffer, which allows us to break the correlation between
subsequent steps in the environment.

In [11]:
Experience = collections.namedtuple(
    'Experience', field_names=['state', 'action', 'reward',
                               'done', 'new_state'])

class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size,
                                   replace=False)
        states, actions, rewards, dones, next_states = \
            zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), \
               np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), \
               np.array(next_states)

Most of the experience replay buffer code is quite straightforward: it basically
exploits the capability of the deque class to maintain the given number of entries
in the buffer. In the sample() method, we create a list of random indices and then
repack the sampled entries into NumPy arrays for more convenient loss calculation.
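
The two mechanisms the buffer relies on can be seen in isolation: a deque with maxlen silently drops its oldest entries, and np.random.choice() with replace=False picks distinct indices. The sketch uses plain integers in place of transitions.

```python
import collections

import numpy as np

buffer = collections.deque(maxlen=3)
for step in range(5):
    buffer.append(step)      # steps 0 and 1 are evicted once the deque is full

# Draw two distinct positions, as ExperienceBuffer.sample() does.
indices = np.random.choice(len(buffer), 2, replace=False)
sample = [buffer[idx] for idx in indices]

print(list(buffer), sample)
```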

The next class we need to have is an Agent, which interacts with the environment
and saves the result of the interaction into the experience replay buffer that you
have just seen:

In [12]:
class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()

    def _reset(self):
        self.state = self.env.reset()
        self.total_reward = 0.0

    @torch.no_grad()
    def play_step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None

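        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            # np.asarray adds the batch dimension; np.array(..., copy=False)
            # raises an error on NumPy 2.x when a copy is unavoidable
            state_a = np.asarray([self.state])
            state_v = torch.tensor(state_a).to(device)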
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

        # do step in the environment
        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

        exp = Experience(self.state, action, reward,
                         is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward

During the agent's initialization, we store references to the environment and the
experience replay buffer. The _reset() method tracks the current observation and
the total reward accumulated so far.

The main method of the agent performs a step in the environment and stores its
result in the buffer. To do this, we need to select the action first. With the probability
epsilon (passed as an argument), we take a random action; otherwise, we use the
passed model to obtain the Q-values for all possible actions and choose the best.

Once the action has been chosen, we pass it to the environment to get the next
observation and reward, store the data in the experience buffer, and then handle
the end-of-episode situation. The result of the function is the total accumulated
reward if we have reached the end of the episode with this step, or None if not.
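
The action selection rule can be isolated into a few lines. The sketch below works on a plain array of Q-values instead of a network; select_action() and the example values are hypothetical.

```python
import numpy as np


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over a 1D array of Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: best-valued action


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])

greedy = select_action(q_values, epsilon=0.0, rng=rng)
random_actions = {select_action(q_values, epsilon=1.0, rng=rng)
                  for _ in range(100)}

print(greedy, sorted(random_actions))
```

With epsilon = 0.0 the best action is always taken, and with epsilon = 1.0 every choice is random, which corresponds to the two ends of the decay schedule.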

Now it is time for the last function in the training module, which calculates the loss
for the sampled batch. This function is written in a form to maximally exploit GPU
parallelism by processing all batch samples with vector operations, which makes
it harder to understand when compared with a naïve loop over the batch. Yet this
optimization pays off: the parallel version is more than two times faster than an
explicit loop over the batch.

In [13]:
def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    states_v = torch.tensor(np.array(
        states, copy=False)).to(device)
    next_states_v = torch.tensor(np.array(
        next_states, copy=False)).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.BoolTensor(dones).to(device)

    state_action_values = net(states_v).gather(
        1, actions_v.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        next_state_values = tgt_net(next_states_v).max(1)[0]
        next_state_values[done_mask] = 0.0
        next_state_values = next_state_values.detach()

    expected_state_action_values = next_state_values * GAMMA + \
                                   rewards_v
    return nn.MSELoss()(state_action_values,
                        expected_state_action_values)

**state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)**

In the preceding line, we pass observations to the first model and extract the
specific Q-values for the taken actions using the gather() tensor operation. The
first argument to the gather() call is a dimension index that we want to perform
gathering on (in our case, it is equal to 1, which corresponds to actions). 

The second argument is a tensor of indices of elements to be chosen. Extra
unsqueeze() and squeeze() calls are required to compute the index argument
for the gather functions, and to get rid of the extra dimensions that we created,
respectively. (The index should have the same number of dimensions as the data we
are processing.)

Keep in mind that gather() is a differentiable operation, so gradients of the
final loss value flow back through the selected Q-values.
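
Here is a small sketch of the same gather() pattern on a made-up batch of two states with three actions each. It also shows that gradients reach only the selected entries.

```python
import torch

# Hypothetical Q-values for a batch of two states and three actions.
q_values = torch.tensor([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]], requires_grad=True)
actions = torch.tensor([2, 0])   # action taken in each state

# unsqueeze(-1) turns the indices into shape (2, 1); squeeze(-1) undoes it.
chosen = q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)

chosen.sum().backward()
print(chosen, q_values.grad)
```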

**next_state_values = tgt_net(next_states_v).max(1)[0]**

In the preceding line, we apply the target network to our next state observations
and calculate the maximum Q-value along the same action dimension, 1. Function
max() returns both maximum values and indices of those values (so it calculates
both max and argmax), which is very convenient. However, in this case, we are
interested only in values, so we take the first entry of the result.

**next_state_values[done_mask] = 0.0**

Here we make one simple, but very important, point: if a transition in the batch
is from the last step in the episode, then our value of the action doesn't include the
discounted value of the next state, as there is no next state from which to gather
the reward. This may look minor, but it is very important in practice: without this,
training will not converge.
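
The max() call and the terminal-state masking can be followed on a tiny batch. All numbers below are made up; the second transition is marked as the last step of its episode.

```python
import torch

GAMMA = 0.99

# Hypothetical target-network outputs for three next states, two actions each.
next_q = torch.tensor([[0.5, 1.5],
                       [2.0, 1.0],
                       [3.0, 4.0]])
done_mask = torch.tensor([False, True, False])
rewards = torch.tensor([0.0, 1.0, 0.0])

values, indices = next_q.max(1)     # maximum values and their argmax indices
next_values = values.clone()
next_values[done_mask] = 0.0        # no future value after a terminal step

targets = rewards + GAMMA * next_values
print(indices, next_values, targets)
```

For the terminal transition, the target is just the immediate reward.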

**next_state_values = next_state_values.detach()**

In this line, we detach the value from its computation graph to prevent gradients
from flowing into the NN used to calculate Q approximation for the next states. This is important, as without this, our backpropagation of the loss will start to affect
both predictions for the current state and the next state. However, we don't want
to touch predictions for the next state, as they are used in the Bellman equation
to calculate reference Q-values. To block gradients from flowing into this branch
of the graph, we are using the detach() method of the tensor, which returns the
tensor without connection to its calculation history.
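
The effect of detach() on gradients can be checked with a single scalar parameter. This is a toy sketch, unrelated to the real network: the detached value is treated as a constant during backpropagation.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

y = w * 3.0                 # depends on w, so gradients would flow through it
y_detached = y.detach()     # same value (6.0), but cut off from the graph

loss = (w - y_detached) ** 2
loss.backward()

print(y_detached.requires_grad, w.grad)
```

The gradient is 2 * (w - 6) = -8, as if the reference value were a fixed number. Without detach(), the loss would be (w - 3w)² and the gradient would be 16.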

**expected_state_action_values = next_state_values * GAMMA + rewards_v
return nn.MSELoss()(state_action_values, expected_state_action_values)**

Finally, we calculate the Bellman approximation value and the mean squared error
loss.
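
As a sanity check of the vectorized form, the following sketch computes the same mean squared error both with an explicit loop and with array operations. It uses NumPy and made-up numbers rather than the real networks.

```python
import numpy as np

GAMMA = 0.99

q_taken = np.array([1.0, 2.0, 0.5])      # Q(s, a) for the actions taken
next_best = np.array([1.5, 0.0, 4.0])    # best next-state values, 0 for terminal
rewards = np.array([0.0, 1.0, 0.0])

# Naive version: one sample at a time.
loop_loss = 0.0
for q, nxt, r in zip(q_taken, next_best, rewards):
    loop_loss += (q - (r + GAMMA * nxt)) ** 2
loop_loss /= len(q_taken)

# Vectorized version: the whole batch at once.
vector_loss = np.mean((q_taken - (rewards + GAMMA * next_best)) ** 2)

print(loop_loss, vector_loss)
```

Both versions give the same value; the vectorized one simply lets the library process the batch in a single call.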

In arguments, we pass our batch as a tuple of arrays (repacked by the sample()
method in the experience buffer), the network that we are training, and the target
network, which is periodically synced with the trained one.

The first model (passed as the net argument) is used to calculate gradients;
the second model, passed as the tgt_net argument, is used to calculate values for
the next states, and this calculation shouldn't affect gradients. To achieve this, we
use the detach() function of the PyTorch tensor to prevent gradients from flowing
into the target network's graph.

We wrap the individual NumPy arrays with batch data in PyTorch tensors and copy
them to the GPU if the CUDA device was specified in arguments.

In [14]:
DEFAULT_ENV_NAME

'PongNoFrameskip-v4'

In [15]:
# fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env = make_env(DEFAULT_ENV_NAME)

net = DQN(env.observation_space.shape,
          env.action_space.n).to(device)
tgt_net = DQN(env.observation_space.shape,
              env.action_space.n).to(device)
writer = SummaryWriter(comment="-" + DEFAULT_ENV_NAME)
print(net)

buffer = ExperienceBuffer(REPLAY_SIZE)
agent = Agent(env, buffer)
epsilon = EPSILON_START

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_idx = 0
ts_frame = 0
ts = time.time()
best_m_reward = None

while True:
    frame_idx += 1
    epsilon = max(EPSILON_FINAL, EPSILON_START -
                  frame_idx / EPSILON_DECAY_LAST_FRAME)

    reward = agent.play_step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)
        speed = (frame_idx - ts_frame) / (time.time() - ts)
        ts_frame = frame_idx
        ts = time.time()
        m_reward = np.mean(total_rewards[-100:])
        print("%d: done %d games, reward %.3f, "
              "eps %.2f, speed %.2f f/s" % (
            frame_idx, len(total_rewards), m_reward, epsilon,
            speed
        ))
        writer.add_scalar("epsilon", epsilon, frame_idx)
        writer.add_scalar("speed", speed, frame_idx)
        writer.add_scalar("reward_100", m_reward, frame_idx)
        writer.add_scalar("reward", reward, frame_idx)
        if best_m_reward is None or best_m_reward < m_reward:
            # args is not defined in this notebook; use the environment name constant
            torch.save(net.state_dict(), DEFAULT_ENV_NAME +
                       "-best_%.0f.dat" % m_reward)
            if best_m_reward is not None:
                print("Best reward updated %.3f -> %.3f" % (
                    best_m_reward, m_reward))
            best_m_reward = m_reward
        if m_reward > MEAN_REWARD_BOUND:
            print("Solved in %d frames!" % frame_idx)
            break

    if len(buffer) < REPLAY_START_SIZE:
        continue

    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.load_state_dict(net.state_dict())

    optimizer.zero_grad()
    batch = buffer.sample(BATCH_SIZE)
    loss_t = calc_loss(batch, net, tgt_net, device=device)
    loss_t.backward()
    optimizer.step()
writer.close()

DQN(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (fc): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
)

