# Basic DQN

To get started, we will implement the same DQN method as in Chapter 6, Deep
Q-Networks, but using the high-level libraries described in Chapter 7,
Higher-Level RL Libraries. This makes our code much more compact, so
irrelevant details won't distract us from the method's logic.

## Model

In [1]:
import torch
import torch.nn as nn

import numpy as np


class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        fx = x.float() / 256
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.fc(conv_out)
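
The _get_conv_out helper finds the flattened feature size by pushing a dummy tensor through the convolution stack. As a cross-check, the same number can be worked out by hand with the standard formula for an unpadded convolution, out = floor((in - kernel) / stride) + 1. The sketch below assumes an 84x84 input frame, which is the usual output of the Atari wrappers but is not stated in this section. The names conv_out_side and flattened_conv_size are hypothetical helpers, not part of the chapter's code.

```python
def conv_out_side(side: int, kernel: int, stride: int) -> int:
    """Output side length of an unpadded, undilated convolution."""
    return (side - kernel) // stride + 1


# (kernel_size, stride) of the three Conv2d layers in the DQN class
LAYERS = [(8, 4), (4, 2), (3, 1)]


def flattened_conv_size(side: int, out_channels: int = 64) -> int:
    """Number of features fed into the first Linear layer."""
    for kernel, stride in LAYERS:
        side = conv_out_side(side, kernel, stride)
    return out_channels * side * side


# 84 -> 20 -> 9 -> 7, so the flattened size is 64 * 7 * 7
print(flattened_conv_size(84))  # 3136
```

Computing the size from a dummy forward pass, as the class does, avoids hard-coding this number and keeps the model usable with other input resolutions.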

## Common library

First, we have the hyperparameters for the Pong environment from the previous
chapter. They are stored in a SimpleNamespace object, a class from the Python
standard library that provides attribute-style access to a set of keys and values.
This makes it easy to add configuration sets for other, more complicated Atari
games and to experiment with hyperparameters:

In [2]:
import warnings
from datetime import timedelta, datetime
from types import SimpleNamespace
from typing import Iterable, Tuple, List

import ptan
import ptan.ignite as ptan_ignite
from ignite.engine import Engine
from ignite.metrics import RunningAverage
from ignite.contrib.handlers import tensorboard_logger as tb_logger


SEED = 123

In [3]:
HYPERPARAMS = {
    'pong': SimpleNamespace(**{
        'env_name':         "PongNoFrameskip-v4",
        'stop_reward':      18.0,
        'run_name':         'pong',
        'replay_size':      100000,
        'replay_initial':   10000,
        'target_net_sync':  1000,
        'epsilon_frames':   10**5,
        'epsilon_start':    1.0,
        'epsilon_final':    0.02,
        'learning_rate':    0.0001,
        'gamma':            0.99,
        'batch_size':       32
    }),
    'breakout-small': SimpleNamespace(**{
        'env_name':         "BreakoutNoFrameskip-v4",
        'stop_reward':      500.0,
        'run_name':         'breakout-small',
        'replay_size':      3*10 ** 5,
        'replay_initial':   20000,
        'target_net_sync':  1000,
        'epsilon_frames':   10 ** 6,
        'epsilon_start':    1.0,
        'epsilon_final':    0.1,
        'learning_rate':    0.0001,
        'gamma':            0.99,
        'batch_size':       64
    }),
    'breakout': SimpleNamespace(**{
        'env_name':         "BreakoutNoFrameskip-v4",
        'stop_reward':      500.0,
        'run_name':         'breakout',
        'replay_size':      10 ** 6,
        'replay_initial':   50000,
        'target_net_sync':  10000,
        'epsilon_frames':   10 ** 6,
        'epsilon_start':    1.0,
        'epsilon_final':    0.1,
        'learning_rate':    0.00025,
        'gamma':            0.99,
        'batch_size':       32
    }),
    'invaders': SimpleNamespace(**{
        'env_name': "SpaceInvadersNoFrameskip-v4",
        'stop_reward': 500.0,
        'run_name': 'invaders',
        'replay_size': 10 ** 6,
        'replay_initial': 50000,
        'target_net_sync': 10000,
        'epsilon_frames': 10 ** 6,
        'epsilon_start': 1.0,
        'epsilon_final': 0.1,
        'learning_rate': 0.00025,
        'gamma': 0.99,
        'batch_size': 32
    }),
}
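
As a small illustration of why SimpleNamespace is convenient here, the following sketch shows attribute access and one possible way to derive an experimental variant from an existing configuration. The variant and the way it is built are assumptions made for illustration; the chapter's code only reads values from HYPERPARAMS.

```python
from types import SimpleNamespace

base = SimpleNamespace(
    env_name="PongNoFrameskip-v4",
    learning_rate=0.0001,
    batch_size=32,
)

# Fields are read as attributes rather than with dictionary indexing
print(base.learning_rate)

# Hypothetical variant: copy all fields and override a single one
variant = SimpleNamespace(**{**vars(base), "batch_size": 64})
print(variant.batch_size, base.batch_size)
```

Because vars() returns the underlying dictionary, the original configuration stays untouched and the variant can be passed anywhere a params object is expected.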

The next function from lib/common.py, unpack_batch, takes a batch of transitions
and converts it into a set of NumPy arrays suitable for training. Every transition
from ExperienceSourceFirstLast has the type ExperienceFirstLast, which is a
namedtuple with the following fields:
- state: observation from the environment.
- action: integer action taken by the agent.
- reward: if we have created ExperienceSourceFirstLast with the
argument steps_count=1, it's just the immediate reward. For larger step
counts, it contains the discounted sum of rewards for this number of steps.
- last_state: if the transition corresponds to the final step in the
environment, then this field is None; otherwise, it contains the last
observation in the experience chain.
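
To make the reward field concrete, here is a minimal sketch of the discounted sum for a three-step chain. Transition is a hypothetical stand-in that mirrors the four fields described above; it is not the PTAN class, and discounted_sum is an illustrative helper.

```python
from collections import namedtuple

# Hypothetical stand-in with the same four fields as ExperienceFirstLast
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "last_state"])


def discounted_sum(rewards, gamma):
    """r0 + gamma * r1 + gamma^2 * r2 + ... computed back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three-step chain: the stored reward folds all three rewards together
step_rewards = [1.0, 0.0, 2.0]
chain = Transition(state="s0", action=1,
                   reward=discounted_sum(step_rewards, 0.99),
                   last_state="s3")
print(round(chain.reward, 4))  # 2.9602

# Terminal transition: there is no last state to bootstrap from
terminal = Transition(state="s7", action=0, reward=-1.0, last_state=None)
print(terminal.last_state is None)  # True
```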

In [4]:
def unpack_batch(batch: List[ptan.experience.ExperienceFirstLast]):
    states, actions, rewards, dones, last_states = [],[],[],[],[]
    for exp in batch:
        state = np.array(exp.state)
        states.append(state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        dones.append(exp.last_state is None)
        if exp.last_state is None:
            lstate = state  # the result will be masked anyway
        else:
            lstate = np.array(exp.last_state)
        last_states.append(lstate)
    # np.asarray also works on NumPy 2.x, where copy=False raises if a copy is needed
    return np.asarray(states), np.array(actions), \
           np.array(rewards, dtype=np.float32), \
           np.array(dones, dtype=np.uint8), \
           np.asarray(last_states)

Note how we handle the final transitions in the batch. To avoid the special handling
of such cases, for terminal transitions, we store the initial state in the last_states
array. To make our calculations of the Bellman update correct, we can mask such
batch entries during the loss calculation using the dones array. Another solution
would be to calculate the value of the last states only for non-terminal transitions,
but it would make our loss function logic a bit more complicated.
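
The effect of the mask can be seen in a framework-free sketch. The function bellman_targets below is a hypothetical helper written in plain Python for illustration; the numbers are made up, and the real code performs the same steps on tensors.

```python
def bellman_targets(rewards, next_values, dones, gamma):
    """Per-sample target: r + gamma * V(s'), with V(s') forced to 0 at episode end."""
    targets = []
    for r, v, done in zip(rewards, next_values, dones):
        bootstrap = 0.0 if done else v  # the mask: ignore the placeholder state
        targets.append(r + gamma * bootstrap)
    return targets


rewards = [0.0, 1.0, -1.0]
next_values = [2.0, 3.0, 5.0]   # the last value comes from a placeholder state
dones = [False, False, True]

targets = bellman_targets(rewards, next_values, dones, gamma=0.99)
print([round(t, 2) for t in targets])  # [1.98, 3.97, -1.0]
```

The third sample shows why the placeholder is harmless: whatever value the network assigns to it, the target reduces to the immediate reward.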

Calculation of the DQN loss function is provided by the function calc_loss_dqn,
and the code is almost the same as in Chapter 6, Deep Q-Networks. One small addition
is torch.no_grad(), which stops the PyTorch calculation graph from being recorded.

In [5]:
def calc_loss_dqn(batch, net, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = \
        unpack_batch(batch)

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.BoolTensor(dones).to(device)

    actions_v = actions_v.unsqueeze(-1)
    state_action_vals = net(states_v).gather(1, actions_v)
    state_action_vals = state_action_vals.squeeze(-1)
    with torch.no_grad():
        next_state_vals = tgt_net(next_states_v).max(1)[0]
        next_state_vals[done_mask] = 0.0

    bellman_vals = next_state_vals.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_vals, bellman_vals)

Besides those core DQN functions, common.py provides several utilities related to our
training loop, data generation, and TensorBoard tracking. The first such utility is a
small class that implements epsilon decay during the training. Epsilon defines the
probability of taking the random action by the agent. It should be decayed from 1.0 in
the beginning (fully random agent) to some small number, like 0.02 or 0.01. The code
is trivial but needed in almost any DQN, so it is provided by the following little class:

In [6]:
class EpsilonTracker:
    def __init__(self, selector: ptan.actions.EpsilonGreedyActionSelector,
                 params: SimpleNamespace):
        self.selector = selector
        self.params = params
        self.frame(0)

    def frame(self, frame_idx: int):
        eps = self.params.epsilon_start - \
              frame_idx / self.params.epsilon_frames
        self.selector.epsilon = max(self.params.epsilon_final, eps)
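
The schedule implemented by frame() can be checked in isolation. The sketch below repeats the same formula as a standalone function (linear_epsilon is a hypothetical name) with the Pong values for epsilon_start, epsilon_final, and epsilon_frames as defaults.

```python
def linear_epsilon(frame_idx, start=1.0, final=0.02, frames=10**5):
    """Same formula as EpsilonTracker.frame, returned instead of stored."""
    return max(final, start - frame_idx / frames)


for frame in (0, 50_000, 98_000, 200_000):
    print(frame, round(linear_epsilon(frame), 3))
```

Note that the slope is 1 / epsilon_frames, so with these values epsilon reaches its floor of 0.02 after about 98,000 frames and stays there.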

Another small function is batch_generator, which takes an ExperienceReplayBuffer
(the PTAN class described in Chapter 7, Higher-Level RL Libraries) and infinitely
generates training batches sampled from the buffer. In the beginning, the function
ensures that the buffer contains the required number of samples.

In [7]:
def batch_generator(buffer: ptan.experience.ExperienceReplayBuffer,
                    initial: int, batch_size: int):
    buffer.populate(initial)
    while True:
        buffer.populate(1)
        yield buffer.sample(batch_size)
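
The behaviour of the generator is easy to verify with a toy buffer. ToyBuffer below is a hypothetical stand-in that exposes only populate() and sample(); it stores integers instead of transitions and is not the PTAN class.

```python
import itertools
import random


class ToyBuffer:
    """Hypothetical stand-in: stores consecutive integers as 'transitions'."""

    def __init__(self):
        self.items = []
        self._next = 0

    def populate(self, n):
        for _ in range(n):
            self.items.append(self._next)
            self._next += 1

    def sample(self, batch_size):
        return random.sample(self.items, batch_size)


def batch_generator(buffer, initial, batch_size):
    buffer.populate(initial)        # warm-up, runs once
    while True:
        buffer.populate(1)          # one new sample per training batch
        yield buffer.sample(batch_size)


buf = ToyBuffer()
gen = batch_generator(buf, initial=10, batch_size=4)
batches = list(itertools.islice(gen, 3))
print(len(buf.items))  # 13: ten initial samples plus one per batch
```

Because the generator is lazy, nothing is collected until the first batch is requested, and each further batch costs exactly one additional environment step.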

Finally, a lengthy, but nevertheless very useful, function called setup_ignite
attaches the needed Ignite handlers, showing the training progress and writing
metrics to TensorBoard.

In [8]:
def setup_ignite(engine: Engine, params: SimpleNamespace,
                 exp_source, run_name: str,
                 extra_metrics: Iterable[str] = ()):
    # get rid of missing metrics warning
    warnings.simplefilter("ignore", category=UserWarning)

    handler = ptan_ignite.EndOfEpisodeHandler(
        exp_source, bound_avg_reward=params.stop_reward)
    handler.attach(engine)
    ptan_ignite.EpisodeFPSHandler().attach(engine)

    @engine.on(ptan_ignite.EpisodeEvents.EPISODE_COMPLETED)
    def episode_completed(trainer: Engine):
        passed = trainer.state.metrics.get('time_passed', 0)
        print("Episode %d: reward=%.0f, steps=%s, "
              "speed=%.1f f/s, elapsed=%s" % (
            trainer.state.episode, trainer.state.episode_reward,
            trainer.state.episode_steps,
            trainer.state.metrics.get('avg_fps', 0),
            timedelta(seconds=int(passed))))

    @engine.on(ptan_ignite.EpisodeEvents.BOUND_REWARD_REACHED)
    def game_solved(trainer: Engine):
        passed = trainer.state.metrics['time_passed']
        print("Game solved in %s, after %d episodes "
              "and %d iterations!" % (
            timedelta(seconds=int(passed)),
            trainer.state.episode, trainer.state.iteration))
        trainer.should_terminate = True

    now = datetime.now().isoformat(timespec='minutes')
    logdir = f"runs/{now}-{params.run_name}-{run_name}"
    tb = tb_logger.TensorboardLogger(log_dir=logdir)
    run_avg = RunningAverage(output_transform=lambda v: v['loss'])
    run_avg.attach(engine, "avg_loss")

    metrics = ['reward', 'steps', 'avg_reward']
    handler = tb_logger.OutputHandler(
        tag="episodes", metric_names=metrics)
    event = ptan_ignite.EpisodeEvents.EPISODE_COMPLETED
    tb.attach(engine, log_handler=handler, event_name=event)

    # write to tensorboard every 100 iterations
    ptan_ignite.PeriodicEvents().attach(engine)
    metrics = ['avg_loss', 'avg_fps']
    metrics.extend(extra_metrics)
    handler = tb_logger.OutputHandler(
        tag="train", metric_names=metrics,
        output_transform=lambda a: a)
    event = ptan_ignite.PeriodEvents.ITERS_100_COMPLETED
    tb.attach(engine, log_handler=handler, event_name=event)

Initially, setup_ignite attaches two Ignite handlers provided by PTAN:
- EndOfEpisodeHandler, which emits the Ignite event every time a game
episode ends. It can also fire an event when the averaged reward for
episodes crosses some boundary. We use this to detect when the game
is finally solved.
- EpisodeFPSHandler, a small class that tracks the time the episode has taken
and the number of interactions that we have had with the environment.
From this, we calculate frames per second (FPS), which is an important
performance metric to track.

Then we install two event handlers, with one being called at the end of an episode.
It will show information about the completed episode on the console. Another
function will be called when the average reward grows above the boundary defined
in the hyperparameters (18.0 in the case of Pong). This function shows a message
about the solved game and stops the training.

The rest of the function is related to the TensorBoard data that we want to track.
First, we create a TensorboardLogger, a special class provided by Ignite to write
into TensorBoard. Our processing function will return the loss value, so we attach
the RunningAverage transformation (also provided by Ignite) to get a smoothed
version of the loss over time.
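
The smoothing itself is an exponential moving average. The sketch below is a plain-Python approximation of that idea, not Ignite's implementation; the default of alpha=0.98 is assumed to match Ignite's default, and running_average is a hypothetical helper.

```python
def running_average(values, alpha=0.98):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    avg = None
    for v in values:
        avg = v if avg is None else alpha * avg + (1.0 - alpha) * v
        smoothed.append(avg)
    return smoothed


# A small alpha is used here only to make the decay visible
losses = [1.0, 0.0, 0.0, 0.0]
print(running_average(losses, alpha=0.5))  # [1.0, 0.5, 0.25, 0.125]
```

With alpha close to 1, a single noisy batch barely moves the curve, which is what makes the logged avg_loss readable in TensorBoard.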

TensorboardLogger can track two groups of values from Ignite: outputs (values
returned by the transformation function) and metrics (calculated during the training
and kept in the engine state). EndOfEpisodeHandler and EpisodeFPSHandler
provide metrics, which are updated at the end of every game episode. So, we attach
OutputHandler, which will write into TensorBoard information about the episode
every time it is completed.

Another group of values that we want to track are metrics from the training process:
loss, FPS, and, possibly, some custom metrics. Those values are updated every
training iteration, but we are going to do millions of iterations, so we will store
values in TensorBoard every 100 training iterations; otherwise, the data files will be
huge. All this functionality might look too complicated, but it provides us with the
unified set of metrics gathered from the training process. In fact, Ignite is not very
complicated and provides a very flexible framework.

## Implementation

In [9]:
import gym
import ptan
import argparse
import random

import torch
import torch.optim as optim

from ignite.engine import Engine

NAME = "01_baseline"

First, we create the environment and apply a set of standard wrappers. We have
already discussed them in Chapter 6, Deep Q-Networks and will also touch upon them
in the next chapter, when we optimize the performance of the Pong solver. Then, we
create the DQN model and the target network.

In [10]:
random.seed(SEED)
torch.manual_seed(SEED)
params = HYPERPARAMS['pong']

# fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env = gym.make(params.env_name)
env = ptan.common.wrappers.wrap_dqn(env)
env.seed(SEED)

net = DQN(env.observation_space.shape,
          env.action_space.n).to(device)

tgt_net = ptan.agent.TargetNet(net)

Next, we create the agent, passing it an epsilon-greedy action selector. During
the training, epsilon will be decreased by the EpsilonTracker class that we have
already discussed.

In [13]:
selector = ptan.actions.EpsilonGreedyActionSelector(
        epsilon=params.epsilon_start)
epsilon_tracker = EpsilonTracker(selector, params)
agent = ptan.agent.DQNAgent(net, selector, device=device)

The next two very important objects are ExperienceSource and
ExperienceReplayBuffer. The first one takes the agent and environment and
provides transitions over game episodes. Those transitions will be kept in the
experience replay buffer.

In [14]:
exp_source = ptan.experience.ExperienceSourceFirstLast(
        env, agent, gamma=params.gamma)
buffer = ptan.experience.ExperienceReplayBuffer(
        exp_source, buffer_size=params.replay_size)
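
The core idea of a replay buffer, a fixed-capacity store from which random batches are drawn, can be sketched in a few lines. ToyReplayBuffer is a hypothetical illustration built on collections.deque; it does not reproduce the PTAN class or its interface.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Hypothetical fixed-capacity buffer; the oldest entries are evicted first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


buf = ToyReplayBuffer(capacity=5)
for step in range(8):
    buf.add(step)

print(sorted(buf.data))  # [3, 4, 5, 6, 7]
print(len(buf.sample(3)))  # 3
```

Sampling at random from a large window of past transitions breaks the correlation between consecutive steps, which is why replay_size is several orders of magnitude larger than batch_size in the hyperparameters.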

Then we create an optimizer and define the processing function, which will
be called for every batch of transitions to train the model. To do this, we call
the calc_loss_dqn function and then backpropagate on the result.

In [17]:
optimizer = optim.Adam(net.parameters(),
                       lr=params.learning_rate)

def process_batch(engine, batch):
    optimizer.zero_grad()
    loss_v = calc_loss_dqn(
        batch, net, tgt_net.target_model,
        gamma=params.gamma, device=device)
    loss_v.backward()
    optimizer.step()
    epsilon_tracker.frame(engine.state.iteration)
    if engine.state.iteration % params.target_net_sync == 0:
        tgt_net.sync()
    return {
        "loss": loss_v.item(),
        "epsilon": selector.epsilon,
    }

This function also asks EpsilonTracker to decrease the epsilon and performs periodic
target network synchronization.
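
The modulo check in process_batch determines when the target network is refreshed. As a quick illustration, the hypothetical helper below lists the iterations on which a sync would happen, assuming the iteration counter starts at 1 for the first batch.

```python
def sync_iterations(total_iterations, sync_every):
    """Iterations on which `iteration % sync_every == 0` holds."""
    return [i for i in range(1, total_iterations + 1)
            if i % sync_every == 0]


# Pong uses target_net_sync=1000
print(sync_iterations(3500, 1000))  # [1000, 2000, 3000]
```

Since batch_generator collects one new environment step per batch, the iteration counter also serves as an approximate frame counter for the epsilon schedule, offset by the initial replay_initial steps.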

In [18]:
engine = Engine(process_batch)
setup_ignite(engine, params, exp_source, NAME)
engine.run(batch_generator(buffer, params.replay_initial,
                           params.batch_size))

Episode 1: reward=-19, steps=1083, speed=0.0 f/s, elapsed=0:00:27
Episode 2: reward=-19, steps=1037, speed=0.0 f/s, elapsed=0:00:27
Episode 3: reward=-20, steps=898, speed=0.0 f/s, elapsed=0:00:27
Episode 4: reward=-21, steps=1000, speed=0.0 f/s, elapsed=0:00:27
Episode 5: reward=-21, steps=819, speed=0.0 f/s, elapsed=0:00:27
Episode 6: reward=-21, steps=814, speed=0.0 f/s, elapsed=0:00:27
Episode 7: reward=-21, steps=903, speed=0.0 f/s, elapsed=0:00:27
Episode 8: reward=-21, steps=844, speed=0.0 f/s, elapsed=0:00:27
Episode 9: reward=-21, steps=881, speed=0.0 f/s, elapsed=0:00:27
Episode 10: reward=-18, steps=1141, speed=0.0 f/s, elapsed=0:00:27
Episode 11: reward=-21, steps=816, speed=0.0 f/s, elapsed=0:00:27
Episode 12: reward=-20, steps=991, speed=0.0 f/s, elapsed=0:00:27
Episode 13: reward=-21, steps=867, speed=0.0 f/s, elapsed=0:00:27
Episode 14: reward=-19, steps=1150, speed=0.0 f/s, elapsed=0:00:27
Episode 15: reward=-21, steps=847, speed=0.0 f/s, elapsed=0:00:27
Episode 16: re

Episode 124: reward=-21, steps=896, speed=62.0 f/s, elapsed=0:25:29
Episode 125: reward=-21, steps=897, speed=62.0 f/s, elapsed=0:25:43
Episode 126: reward=-21, steps=843, speed=62.0 f/s, elapsed=0:25:57
Episode 127: reward=-21, steps=846, speed=62.0 f/s, elapsed=0:26:10
Episode 128: reward=-21, steps=1034, speed=62.0 f/s, elapsed=0:26:27
Episode 129: reward=-21, steps=784, speed=62.0 f/s, elapsed=0:26:40
Episode 130: reward=-20, steps=1430, speed=62.0 f/s, elapsed=0:27:03
Episode 131: reward=-21, steps=820, speed=62.0 f/s, elapsed=0:27:16
Episode 132: reward=-21, steps=898, speed=62.0 f/s, elapsed=0:27:30
Episode 133: reward=-21, steps=943, speed=62.0 f/s, elapsed=0:27:45
Episode 134: reward=-19, steps=1138, speed=62.0 f/s, elapsed=0:28:04
Episode 135: reward=-20, steps=1275, speed=62.0 f/s, elapsed=0:28:24
Episode 136: reward=-20, steps=1160, speed=62.0 f/s, elapsed=0:28:43
Episode 137: reward=-21, steps=1182, speed=62.0 f/s, elapsed=0:29:02
Episode 138: reward=-21, steps=879, speed=

Episode 243: reward=-11, steps=2145, speed=62.0 f/s, elapsed=1:17:07
Episode 244: reward=-10, steps=2600, speed=62.0 f/s, elapsed=1:17:50
Episode 245: reward=-13, steps=2275, speed=62.0 f/s, elapsed=1:18:27
Episode 246: reward=-9, steps=2374, speed=61.9 f/s, elapsed=1:19:05
Episode 247: reward=-17, steps=1774, speed=61.9 f/s, elapsed=1:19:34
Episode 248: reward=-10, steps=2101, speed=61.9 f/s, elapsed=1:20:09
Episode 249: reward=-13, steps=1990, speed=61.9 f/s, elapsed=1:20:41
Episode 250: reward=-11, steps=1981, speed=61.9 f/s, elapsed=1:21:13
Episode 251: reward=-3, steps=3589, speed=61.9 f/s, elapsed=1:22:11
Episode 252: reward=-10, steps=2282, speed=61.9 f/s, elapsed=1:22:49
Episode 253: reward=-20, steps=1518, speed=61.9 f/s, elapsed=1:23:13
Episode 254: reward=-20, steps=1458, speed=61.9 f/s, elapsed=1:23:37
Episode 255: reward=-12, steps=1842, speed=61.9 f/s, elapsed=1:24:07
Episode 256: reward=-6, steps=2755, speed=61.9 f/s, elapsed=1:24:52
Episode 257: reward=-5, steps=2513, s
