# Dueling DQN

This improvement to DQN was proposed in 2015, in the paper called Dueling
Network Architectures for Deep Reinforcement Learning ([8] Wang et al., 2015). The
core observation of this paper is that the Q-values, Q(s, a), that our network is trying
to approximate can be divided into two quantities: the value of the state, V(s), and the
advantage of actions in this state, A(s, a).
You have seen the quantity V(s) before, as it was the core of the value iteration
method from Chapter 5, Tabular Learning and the Bellman Equation. It is just equal to
the discounted expected reward achievable from this state. The advantage A(s, a)
is supposed to bridge the gap from V(s) to Q(s, a), as, by definition, Q(s, a) = V(s) +
A(s, a). In other words, the advantage A(s, a) is just the delta, saying how much extra
reward some particular action from the state brings us. The advantage could be
positive or negative and, in general, can have any magnitude. For example, at some
tipping point, the choice of one action over another can cost us a lot of the total reward.
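
To make the relation Q(s, a) = V(s) + A(s, a) concrete, here is a minimal sketch in plain Python. It assumes that V(s) is the policy-weighted average of the Q-values; the Q-values, the policy probabilities, and the helper name `state_value` are all made up for illustration.

```python
def state_value(q_values, policy):
    """V(s) as the policy-weighted average of Q(s, a)."""
    return sum(p * q for p, q in zip(policy, q_values))


# Hypothetical "tipping point" state: the third action is a costly mistake
q_values = [10.0, 10.0, -50.0]
policy = [0.5, 0.25, 0.25]  # assumed action probabilities in this state

v = state_value(q_values, policy)
advantages = [q - v for q in q_values]

print(v)           # -5.0
print(advantages)  # [15.0, 15.0, -45.0]
```

The advantages take both signs, and their policy-weighted average is zero, which is the property the dueling architecture relies on.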

The Dueling paper's contribution was an explicit separation of the value and the
advantage in the network's architecture, which brought better training stability,
faster convergence, and better results on the Atari benchmark. The architecture
difference from the classic DQN network is shown in the following illustration. The
classic DQN network (top) takes features from the convolution layer and, using fully
connected layers, transforms them into a vector of Q-values, one for each action. On
the other hand, dueling DQN (bottom) takes convolution features and processes
them using two independent paths: one path is responsible for V(s) prediction, which
is just a single number, and another path predicts individual advantage values,
having the same dimension as Q-values in the classic case. After that, we add V(s)
to every value of A(s, a) to obtain Q(s, a), which is used and trained as normal.

These changes in the architecture are not enough to make sure that the network will
learn V(s) and A(s, a) as we want it to. Nothing prevents the network, for example,
from predicting, for some state, V(s) = 0 and A(s) = [1, 2, 3, 4], which is completely
wrong, as the predicted V(s) is not the expected value of the state. We have yet
another constraint to set: we want the mean value of the advantage of any state
to be zero. In that case, the correct prediction for the preceding example will be
V(s) = 2.5 and A(s) = [-1.5, -0.5, 0.5, 1.5].
This constraint could be enforced in various ways, for example, via the loss
function; but in the Dueling paper, the authors proposed a very elegant
solution of subtracting the mean value of the advantage from the Q expression
in the network, which effectively pulls the mean for the advantage to zero:
$$Q(s, a) = V(s) + A(s, a) - \frac{1}{N} \sum_{k} A(s, k)$$
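
The effect of the mean subtraction can be checked on the example above with a few lines of plain Python. This is only an illustrative sketch; the helper names `naive_q` and `dueling_q` are hypothetical and do not come from the book's code.

```python
def naive_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a), with no constraint on the advantages."""
    return [value + a for a in advantages]


def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_k A(s, k)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


wrong = (0.0, [1.0, 2.0, 3.0, 4.0])      # V(s) = 0, uncentered advantages
right = (2.5, [-1.5, -0.5, 0.5, 1.5])    # V(s) = 2.5, zero-mean advantages

# Without the constraint, both decompositions give identical Q-values,
# so the loss cannot tell them apart.
print(naive_q(*wrong) == naive_q(*right))

# With the mean subtracted, only the second one reproduces Q = [1, 2, 3, 4].
q_wrong = dueling_q(*wrong)
q_right = dueling_q(*right)
print(q_wrong, q_right)
```

With this aggregation, the mean of the resulting Q-values over actions always equals the value output, so the value path has to carry the state's overall level.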

This keeps the changes that need to be made
in the classic DQN very simple: to convert it to the dueling DQN, you need to change
only the network architecture, without affecting other pieces of the implementation.

As all the changes
sit in the network architecture, here I'll show only the network class (which is
in the lib/dqn_extra.py module).


``` python
import numpy as np
import torch
import torch.nn as nn


class DuelingDQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DuelingDQN, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32,
                      kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        self.fc_adv = nn.Sequential(
            nn.Linear(conv_out_size, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions)
        )
        self.fc_val = nn.Sequential(
            nn.Linear(conv_out_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

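    def forward(self, x):
        adv, val = self.adv_val(x)
        # Q(s, a) = V(s) + A(s, a) - mean of A(s, a) over actions
        return val + (adv - adv.mean(dim=1, keepdim=True))

    def adv_val(self, x):
        # Scale pixel values (assumed to be bytes in 0..255) to roughly [0, 1)
        fx = x.float() / 256
        # Flatten the convolution output to (batch, features)
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.fc_adv(conv_out), self.fc_val(conv_out)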
```

Instead of defining a single path of fully connected layers, we create two different
transformations: one for advantages and one for value prediction. Also, to keep the
number of parameters in the model comparable to the original network, the inner
dimension in both paths is decreased from 512 to 256.
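
The claim about a comparable model size can be checked by counting the parameters of the fully connected layers by hand. The sketch below assumes a flattened convolution output of 3136 features (what the three convolution layers above produce for an 84×84 input) and 6 actions. Both numbers are assumptions for illustration, and the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def classic_head_params(conv_out, n_actions, hidden=512):
    """Single path: conv features -> hidden -> Q-values."""
    return linear_params(conv_out, hidden) + linear_params(hidden, n_actions)


def dueling_head_params(conv_out, n_actions, hidden=256):
    """Two paths: one for advantages, one for the scalar value."""
    adv = linear_params(conv_out, hidden) + linear_params(hidden, n_actions)
    val = linear_params(conv_out, hidden) + linear_params(hidden, 1)
    return adv + val


CONV_OUT = 64 * 7 * 7  # assumed: 84x84 input through the conv stack above
N_ACTIONS = 6          # assumed number of actions

classic = classic_head_params(CONV_OUT, N_ACTIONS)
dueling = dueling_head_params(CONV_OUT, N_ACTIONS)
print(classic, dueling)  # 1609222 1607943
```

Under these assumptions, the two heads differ by well under one percent, whereas keeping 512 hidden units in both paths would roughly double the count.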

The changes in the forward() function are also very simple, thanks to PyTorch's
expressiveness: we calculate the value and advantage for our batch of samples
and add them together, subtracting the mean of the advantage to obtain the final
Q-values. A subtle, but important, difference lies in calculating the mean along the
second dimension of the tensor, which produces a vector of the mean advantage for
every sample in our batch.
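
The role of `dim=1` and `keepdim=True` is easier to see on a tiny tensor. The following is a minimal sketch with a made-up batch of two states and four actions; the numbers are illustrative only.

```python
import torch

# Hypothetical network outputs for a batch of 2 states with 4 actions each
adv = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                    [0.0, 0.0, 8.0, 0.0]])
val = torch.tensor([[0.0],
                    [1.0]])

# Mean over the action dimension, one value per sample, shape (2, 1)
per_sample_mean = adv.mean(dim=1, keepdim=True)

# Without arguments, mean() collapses the whole batch into a single scalar,
# which would mix advantages of unrelated states
global_mean = adv.mean()

# (2, 1) broadcasts against (2, 4), so each row is centered by its own mean
q = val + (adv - per_sample_mean)

print(per_sample_mean.shape)  # torch.Size([2, 1])
print(q)
```

Keeping the reduced dimension is what makes the subtraction broadcast across actions; dropping `keepdim=True` would give a tensor of shape `(2,)`, which does not line up with the `(2, 4)` advantages.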

``` python
import sys
sys.path.append("../Chapter08/")
```

``` python
import random

import gym
import numpy as np
import ptan
import torch
import torch.optim as optim
from ignite.engine import Engine

from lib import common, dqn_extra

NAME = "06_dueling"
STATES_TO_EVALUATE = 1000  # number of replay buffer states used for monitoring
EVAL_EVERY_FRAME = 100     # evaluate them every this many iterations


@torch.no_grad()
def evaluate_states(states, net, device, engine):
    # Record the mean advantage and mean value over a fixed set of states
    s_v = torch.tensor(states).to(device)
    adv, val = net.adv_val(s_v)
    engine.state.metrics['adv'] = adv.mean().item()
    engine.state.metrics['val'] = val.mean().item()


random.seed(common.SEED)
torch.manual_seed(common.SEED)
params = common.HYPERPARAMS['pong']
device = torch.device("cuda")

env = gym.make(params.env_name)
env = ptan.common.wrappers.wrap_dqn(env)
env.seed(common.SEED)

net = dqn_extra.DuelingDQN(
    env.observation_space.shape, env.action_space.n).to(device)

tgt_net = ptan.agent.TargetNet(net)
selector = ptan.actions.EpsilonGreedyActionSelector(
    epsilon=params.epsilon_start)
epsilon_tracker = common.EpsilonTracker(selector, params)
agent = ptan.agent.DQNAgent(net, selector, device=device)

exp_source = ptan.experience.ExperienceSourceFirstLast(
    env, agent, gamma=params.gamma)
buffer = ptan.experience.ExperienceReplayBuffer(
    exp_source, buffer_size=params.replay_size)
optimizer = optim.Adam(net.parameters(), lr=params.learning_rate)


def process_batch(engine, batch):
    optimizer.zero_grad()
    loss_v = common.calc_loss_dqn(batch, net, tgt_net.target_model,
                                  gamma=params.gamma, device=device)
    loss_v.backward()
    optimizer.step()
    epsilon_tracker.frame(engine.state.iteration)
    if engine.state.iteration % params.target_net_sync == 0:
        tgt_net.sync()
    if engine.state.iteration % EVAL_EVERY_FRAME == 0:
        # Sample the evaluation states once and reuse them afterward
        eval_states = getattr(engine.state, "eval_states", None)
        if eval_states is None:
            eval_states = buffer.sample(STATES_TO_EVALUATE)
            # np.asarray avoids a copy where possible; np.array(copy=False)
            # raises in NumPy 2.x when a copy is required
            eval_states = [np.asarray(transition.state)
                           for transition in eval_states]
            eval_states = np.asarray(eval_states)
            engine.state.eval_states = eval_states
        evaluate_states(eval_states, net, device, engine)
    return {
        "loss": loss_v.item(),
        "epsilon": selector.epsilon,
    }


engine = Engine(process_batch)
common.setup_ignite(engine, params, exp_source, NAME,
                    extra_metrics=('adv', 'val'))
engine.run(common.batch_generator(
    buffer, params.replay_initial, params.batch_size))
```

```
Episode 1: reward=-20, steps=1063, speed=0.0 f/s, elapsed=0:00:33
Episode 2: reward=-20, steps=834, speed=0.0 f/s, elapsed=0:00:33
Episode 3: reward=-21, steps=846, speed=0.0 f/s, elapsed=0:00:33
Episode 4: reward=-20, steps=891, speed=0.0 f/s, elapsed=0:00:33
Episode 5: reward=-20, steps=982, speed=0.0 f/s, elapsed=0:00:33
Episode 6: reward=-21, steps=882, speed=0.0 f/s, elapsed=0:00:33
Episode 7: reward=-20, steps=862, speed=0.0 f/s, elapsed=0:00:33
Episode 8: reward=-21, steps=877, speed=0.0 f/s, elapsed=0:00:33
Episode 9: reward=-21, steps=760, speed=0.0 f/s, elapsed=0:00:33
Episode 10: reward=-21, steps=818, speed=0.0 f/s, elapsed=0:00:33
Episode 11: reward=-21, steps=894, speed=0.0 f/s, elapsed=0:00:33
Episode 12: reward=-20, steps=985, speed=54.5 f/s, elapsed=0:00:46

KeyboardInterrupt:
```