# Dueling Deep Q-Learning

## Table of Contents
- [1 - Motivation](#1)
- [2 - Dueling network architecture](#2)
- [3 - Advantages](#3)

Full paper: [Dueling Network Architectures for Deep Reinforcement Learning (2016)](https://arxiv.org/pdf/1511.06581.pdf)

<a name='1'></a>
# 1 - Motivation

The authors set out to design a neural network architecture that is better suited to model-free reinforcement learning and that can be easily combined with existing and future algorithms.

<a name='2'></a>
# 2 - Dueling network architecture

The main idea behind the dueling network architecture is to separate the representation of state values from the representation of (state-dependent) action advantages. Instead of the common single-stream architecture, which estimates the Q-value of each action directly, the dueling network architecture consists of two streams that represent the value and advantage functions while sharing a common convolutional feature learning module. The two streams are then combined by a special aggregating layer to produce an estimate of the state-action value function Q.

<img src="images/network_architecture.png">
<caption><center><b>Figure 1</b>: Single-stream Q-network (top) and dueling Q-network (bottom)</center></caption>
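
The two-stream layout can be sketched in a few lines of dependency-free Python. This is only an illustration under several assumptions: the class `DuelingHeads`, the helpers `dense`, `relu` and `init_layer`, the layer sizes, and the random initialization are all hypothetical, and a single fully-connected layer stands in for the convolutional module. The sketch returns the two stream outputs separately; how they are combined is the topic of the next subsection.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer on plain lists: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialization only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class DuelingHeads:
    """Shared feature module followed by separate value and advantage streams."""

    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        # Assumption: one dense layer stands in for the convolutional module (theta).
        self.shared = init_layer(rng, n_inputs, n_hidden)
        # Value stream (beta): a single scalar output V(s).
        self.value = init_layer(rng, n_hidden, 1)
        # Advantage stream (alpha): one output A(s, a) per action.
        self.advantage = init_layer(rng, n_hidden, n_actions)

    def forward(self, state):
        features = relu(dense(state, *self.shared))   # shared representation
        v = dense(features, *self.value)[0]           # scalar state value
        a = dense(features, *self.advantage)          # list of advantages
        return v, a


net = DuelingHeads(n_inputs=4, n_hidden=8, n_actions=3)
v, a = net.forward([0.1, -0.2, 0.3, 0.4])
print("V(s) =", v, "A(s, .) =", a)
```

Both streams read the same `features`, so any gradient flowing through either stream also trains the shared module.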
    
**Combination of the two streams**
    
To combine the two streams of fully-connected layers into a single Q estimate, we cannot simply sum their outputs:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$$

This equation is unidentifiable in the sense that, given Q, we cannot recover V and A uniquely: adding a constant to V and subtracting the same constant from A leaves Q unchanged. To address this issue, we can force the advantage function estimator to have zero advantage at the chosen action:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \max_{a' \in \mathcal{A}} A(s, a'; \theta, \alpha)\right)$$

An alternative approach is to replace the max operator with an average:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{\vert \mathcal{A} \vert}\sum_{a'} A(s, a'; \theta, \alpha)\right)$$

In the equations above, $\theta$ denotes the parameters of the convolutional layers, $\alpha$ and $\beta$ are the parameters of the advantage and value streams of fully-connected layers, respectively, and $\mathcal{A}$ is the set of actions.
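
The three ways of combining the streams can be compared on a small hand-made example. The sketch below is plain Python and purely illustrative: the function names `aggregate_naive`, `aggregate_max` and `aggregate_mean`, as well as the numbers, are assumptions chosen for the demonstration and do not come from the paper.

```python
def aggregate_naive(v, advantages):
    """Plain sum Q = V + A: V and A cannot be recovered from Q."""
    return [v + a for a in advantages]


def aggregate_max(v, advantages):
    """Subtract the largest advantage, so the best action has Q equal to V."""
    top = max(advantages)
    return [v + (a - top) for a in advantages]


def aggregate_mean(v, advantages):
    """Subtract the mean advantage, so the Q-values average to V."""
    mean = sum(advantages) / len(advantages)
    return [v + (a - mean) for a in advantages]


v = 2.0
advantages = [1.0, 3.0, -1.0]

# Shift a constant from the advantage stream into the value stream.
c = 5.0
shifted_v = v + c
shifted_adv = [a - c for a in advantages]

# Naive sum: two different (V, A) pairs produce identical Q-values.
print(aggregate_naive(v, advantages) == aggregate_naive(shifted_v, shifted_adv))

# Max aggregator: V is pinned to the largest Q-value.
q_max = aggregate_max(v, advantages)
print(max(q_max) == v)

# Mean aggregator: V is pinned to the average Q-value.
q_mean = aggregate_mean(v, advantages)
print(sum(q_mean) / len(q_mean) == v)
```

With either the max or the mean subtracted, shifting a constant between the streams now changes Q, which is what removes the ambiguity of the naive sum.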

<a name='3'></a>
# 3 - Advantages

Separating the representation of state values from the (state-dependent) action advantages lets the network learn the state-value function more efficiently. In a single-stream architecture, each update changes only the value of the one action that was taken. In the dueling network architecture, by contrast, the value stream V is updated at every step, whichever action was taken. This allocates more resources to the representation of the state values and leads to a better approximation of them.
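
One informal way to see why the value stream is always updated is to differentiate the mean-based aggregation from Section 2 with respect to the two stream outputs. This treats $V(s)$ and $A(s, a')$ as free variables and writes $N$ for the number of actions, which is a simplification for illustration rather than a statement from the paper:

$$\frac{\partial Q(s, a)}{\partial V(s)} = 1 \quad \text{for every action } a, \qquad \frac{\partial Q(s, a)}{\partial A(s, a')} = \mathbb{1}[a = a'] - \frac{1}{N}$$

The first derivative does not depend on $a$, so the error signal of any sampled action reaches the value stream at full strength.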

As shown in Figure 2, the dueling network architecture performs better than the traditional single-stream Q-network architecture, especially when the number of actions is large.

<img src="images/nb_actions.png">
<caption><center><b>Figure 2</b>: Performance comparison with increasing number of actions</center></caption>