[Deep Q-Networks](https://www.datahubbs.com/deep-q-learning-101/) are great, but they have a slight problem: they tend to overestimate their Q-values. An easy way to address this is to extend the ideas developed for [double Q-learning](https://www.datahubbs.com/double-q-learning/) to DQNs. This gives us **Double Deep Q-Networks** (DDQN), which use a second network to learn a less biased estimate of the Q-values. Reducing the overestimation also [improves performance](https://arxiv.org/abs/1509.06461).

## TL;DR

We build a Double DQN agent, which reduces the overestimation bias during training, and use it to learn a simple CartPole task.

## DQN Recap

DQNs burst onto the scene when they cracked the Atari code for DeepMind a few years back. The key was to take [Q-learning](https://www.datahubbs.com/intro-to-q-learning/), but estimate the Q-function with a deep neural network. Simply using the Q-learning update equation to change the weights and biases of a neural network wasn't quite enough, so a few tricks had to be introduced to help the network train consistently.

The first trick was adding **experience replay**. This is a memory buffer that stores previously observed (state, action, reward, next state) tuples. During updates we sample from it at random, which breaks the dependence between consecutive steps.
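
As a rough illustration of the idea, here is a minimal sketch of a replay buffer. The names `Transition` and `SimpleReplayBuffer` are hypothetical and are not the `experienceReplayBuffer` class used later in this post; the sketch only assumes a fixed capacity and uniform random sampling.

```python
import random
from collections import deque, namedtuple

# Hypothetical names for illustration; not the post's experienceReplayBuffer.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "done", "next_state"])


class SimpleReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def append(self, state, action, reward, done, next_state):
        self.memory.append(Transition(state, action, reward, done, next_state))

    def sample_batch(self, batch_size):
        # Uniform sampling without replacement mixes up consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SimpleReplayBuffer(capacity=3)
for t in range(5):
    buffer.append(state=t, action=0, reward=1.0, done=False, next_state=t + 1)

batch = buffer.sample_batch(batch_size=2)
print(len(buffer), [tr.state for tr in batch])
```

With a capacity of 3, only the three most recent transitions survive, and the sampled batch is drawn from those in random order.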

The second trick was adding a **target network** to the system. This target network is just a copy of the neural network that we are training, and it only gets updated periodically by the algorithm. This matters because it gives us a relatively stable baseline to measure against. In reinforcement learning, we don't have labeled data to tell us when we're right or wrong; instead, we have a reward signal. If we're maximizing a reward, then the bigger the better as far as our algorithm is concerned, but we don't know how big it can get.

For example, your RL agent is chugging along and picking up a few rewards here and there while completing its task. Lots of 0's and 1's, then suddenly, it gets a big reward of 10. Is that good? Well, it's better than anything we've seen thus far for sure, but perhaps if we took a different action we would have gotten a reward of 50 or 100 instead. We just don't know. 

This is part of the challenge of RL: we have to explore and learn a good policy, but also learn a baseline to compare our performance against. This is where the target network comes in. The target serves as our baseline, but if the baseline changes too much, it becomes very hard to learn, so we only update it intermittently. 
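
The intermittent update can be sketched with a toy example. This is only an illustration under simplifying assumptions: the "network" is a plain dictionary with a single weight, each training step is faked by nudging that weight, and `sync_frequency` is a made-up value.

```python
from copy import deepcopy

# Toy stand-in for network weights; a real agent holds a neural network here.
online_params = {"w": 0.0}
target_params = deepcopy(online_params)

sync_frequency = 4  # hypothetical value, chosen only for this demo
sync_steps = []

for step in range(1, 11):
    # Pretend each training step nudges the online weights.
    online_params["w"] += 0.1

    # Hard update: overwrite the target with a snapshot of the online weights.
    if step % sync_frequency == 0:
        target_params = deepcopy(online_params)
        sync_steps.append(step)

print(sync_steps)  # steps at which the target was refreshed
print(round(online_params["w"], 2), round(target_params["w"], 2))
```

Between syncs the target stays frozen while the online weights keep moving, which is what keeps the baseline stable.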

## Leveraging the Target Network

[Double Q-learning](https://www.datahubbs.com/double-q-learning/) is able to learn a better estimate because of its second Q-function approximator. We can do the same, but instead of a second, independent DQN, we can just tap our target network for double duty and use it to reduce the bias in our estimates. This is quite easy to do: we let our DQN pick the action and use the target network to estimate the value of that action. 
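
To see why a second estimator helps, consider a small simulation. It is a sketch under stated assumptions: every action has a true value of 0, the two sets of estimates carry independent Gaussian noise, and `noise_std`, `n_actions`, and `n_trials` are arbitrary illustrative choices. In a real DDQN the target network is a lagged copy of the DQN rather than a fully independent estimator, so the bias is reduced rather than eliminated.

```python
import random

random.seed(0)

n_actions = 5
n_trials = 10_000
noise_std = 1.0  # assumed noise level, picked only for illustration

single_total = 0.0
double_total = 0.0
for _ in range(n_trials):
    # Every action has a true value of 0, so the true maximum is also 0.
    estimates_a = [random.gauss(0.0, noise_std) for _ in range(n_actions)]
    estimates_b = [random.gauss(0.0, noise_std) for _ in range(n_actions)]

    # Single estimator: select and evaluate with the same noisy estimates.
    single_total += max(estimates_a)

    # Double estimator: select with A, evaluate with the independent B.
    best_action = max(range(n_actions), key=lambda a: estimates_a[a])
    double_total += estimates_b[best_action]

single_mean = single_total / n_trials
double_mean = double_total / n_trials
print(f"single estimator: {single_mean:.3f}")
print(f"double estimator: {double_mean:.3f}")
```

The single estimator averages clearly above the true value of 0 because the maximum picks out whichever action happened to get the luckiest noise, while the double estimator stays close to 0.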

For a typical DQN, we calculate the target value for the loss using:

$$y_t^{DQN} = R_t + \gamma \max_a Q(s_{t+1}, a; \theta_T)$$
  
Here $\theta_T$ represents the weights of our target network (and $\theta$ those of our DQN), so the target network both picks the best next action and estimates its value. To transform this into the target for the double DQN ($y_t^{DDQN}$), we simply update the function as follows:

$$y_t^{DDQN} = R_t + \gamma Q\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta_T \big)$$

The difference here is that we're using our DQN to select the greedy action, then passing that action into the target network to yield our value estimate. In this way, we've changed our target value to take the best from both networks, just like we did in the [tabular version](https://www.datahubbs.com/double-q-learning/).
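
A worked example with made-up numbers shows how the two targets differ for a single transition. The Q-values, reward, and `gamma` below are hypothetical and chosen only so the two networks disagree about the best action.

```python
gamma = 0.99
reward = 1.0

# Hypothetical Q-values for the next state s_{t+1}, one entry per action.
q_online = [1.0, 2.5, 2.0]  # Q(s_{t+1}, a; theta)
q_target = [1.2, 1.8, 3.0]  # Q(s_{t+1}, a; theta_T)

# DQN: the target network both selects and evaluates the action.
y_dqn = reward + gamma * max(q_target)

# DDQN: the online network selects, the target network evaluates.
best_action = max(range(len(q_online)), key=lambda a: q_online[a])
y_ddqn = reward + gamma * q_target[best_action]

print(f"action={best_action} dqn={y_dqn:.3f} ddqn={y_ddqn:.3f}")
# prints: action=1 dqn=3.970 ddqn=2.782
```

Because the maximum of the target network's Q-values is at least as large as its value for any particular action, the DDQN target can never exceed the DQN target computed from the same two networks.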

The full DDQN algorithm is exactly the same as DQN, just with an updated target in the loss function.

> Initialize replay memory $D$ with $M$ samples and select minibatch sample size $B$

> Initialize networks with weights $\theta$ and $\theta_T = \theta$

> Select parameters $\alpha, \gamma \in (0, 1]$

> **FOR** each episode:

>> Initialize $s_0$

>>> **FOR** each step $t$ in the episode:

>>>> **IF** $p < \epsilon$ (with $p$ drawn uniformly from $[0, 1)$) select a random action $a_t$

>>>> **ELSE** select $a_t = \arg\max_a Q(s_t, a; \theta)$

>>>> Take action $a_t$ and observe reward $R_t$ and new state $s_{t+1}$

>>>> Store transition ($s_t$, $a_t$, $R_t$, $s_{t+1}$) in replay buffer $D$

>>>> Sample random minibatch of $B$ transitions from $D$

>>>> Calculate the target and the loss for all samples: 
$$y_t^{DDQN} = R_t + \gamma Q\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta_T \big)$$
$$\mathcal{L}(\theta) = 
\begin{cases}
  \bigg(R_t - Q(s_t, a_t; \theta) \bigg)^2 & \text{if } s_{t+1} \text{ is a terminal state} \\
  \bigg(y_t^{DDQN} - Q(s_t, a_t; \theta) \bigg)^2 & \text{otherwise}
\end{cases}
$$


>>>> Update parameters $\theta$ with gradient descent

>>>> Every $N$ steps, set $\theta_T = \theta$
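
The target and loss step of the algorithm above can be sketched in plain Python for a tiny minibatch. The helper names `ddqn_targets` and `mse_loss` are hypothetical, the Q-values are made-up numbers rather than network outputs, and averaging the squared errors over the batch is an assumption that matches a mean squared error loss.

```python
def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Return the DDQN target for each transition in a minibatch."""
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if done:
            # Terminal transition: there is no future value to bootstrap from.
            targets.append(r)
        else:
            # Online network selects the action, target network evaluates it.
            a_star = max(range(len(q_on)), key=lambda a: q_on[a])
            targets.append(r + gamma * q_tg[a_star])
    return targets


def mse_loss(q_taken, targets):
    """Mean squared error between Q(s_t, a_t; theta) and the targets."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


# Two hypothetical transitions; the second one ends the episode.
rewards = [1.0, 1.0]
dones = [False, True]
q_online_next = [[0.5, 2.0], [3.0, 1.0]]
q_target_next = [[1.0, 1.5], [4.0, 4.0]]
q_taken = [2.0, 1.5]

targets = ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.5)
loss = mse_loss(q_taken, targets)
print(targets, loss)
# prints: [1.75, 1.0] 0.15625
```

The second transition is terminal, so its target is just the reward, regardless of what either network predicts for the next state.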


## DDQN Implementation

Our new and improved algorithm uses all of the same parts as the original DQN, so for brevity, I'm not going to reproduce the code for the `QNetwork` or `experienceReplayBuffer` here. You can get those details on the [DQN post](https://www.datahubbs.com/deep-q-learning-101/) or on [GitHub](https://github.com/hubbs5/rl_blog/blob/master/q_learning/deep/ddqn.py). Here, we'll just run the new algorithm for our double DQN agent.

The only change is to the `calculate_loss` method, which contains our new DDQN target calculation as shown above.

from copy import deepcopy

import numpy as np
import torch
import torch.nn as nn

class DDQNAgent:
    
    def __init__(self, env, network, buffer, epsilon=0.05, batch_size=32):
        
        self.env = env
        self.network = network
        self.target_network = deepcopy(network)
        self.buffer = buffer
        self.epsilon = epsilon
        self.batch_size = batch_size
        self.window = 100
        self.reward_threshold = 195 # Avg reward before CartPole is "solved"
        self.initialize()
    
    def take_step(self, mode='train'):
        if mode == 'explore':
            action = self.env.action_space.sample()
        else:
            action = self.network.get_action(self.s_0, epsilon=self.epsilon)
            self.step_count += 1
        s_1, r, done, _ = self.env.step(action)
        self.rewards += r
        self.buffer.append(self.s_0, action, r, done, s_1)
        self.s_0 = s_1.copy()
        if done:
            self.s_0 = self.env.reset()
        return done
        
    # Implement DDQN training algorithm
    def train(self, gamma=0.99, max_episodes=10000, 
              batch_size=32,
              network_update_frequency=4,
              network_sync_frequency=2000):
        self.gamma = gamma
        # Populate replay buffer
        while self.buffer.burn_in_capacity() < 1:
            self.take_step(mode='explore')
            
        ep = 0
        training = True
        while training:
            self.s_0 = self.env.reset()
            self.rewards = 0
            done = False
            while not done:
                done = self.take_step(mode='train')
                # Update network
                if self.step_count % network_update_frequency == 0:
                    self.update()
                # Sync networks
                if self.step_count % network_sync_frequency == 0:
                    self.target_network.load_state_dict(
                        self.network.state_dict())
                    self.sync_eps.append(ep)
                    
                if done:
                    ep += 1
                    self.training_rewards.append(self.rewards)
                    self.training_loss.append(np.mean(self.update_loss))
                    self.update_loss = []
                    mean_rewards = np.mean(
                        self.training_rewards[-self.window:])
                    self.mean_training_rewards.append(mean_rewards)
                    print("\rEpisode {:d} Mean Rewards {:.2f}\t\t".format(
                        ep, mean_rewards), end="")
                    
                    if ep >= max_episodes:
                        training = False
                        print('\nEpisode limit reached.')
                        break
                    if mean_rewards >= self.reward_threshold:
                        training = False
                        print('\nEnvironment solved in {} episodes!'.format(
                            ep))
                        break
                        
    def calculate_loss(self, batch):
        states, actions, rewards, dones, next_states = batch
        rewards_t = torch.FloatTensor(rewards).to(device=self.network.device).reshape(-1,1)
        actions_t = torch.LongTensor(np.array(actions)).reshape(-1,1).to(
            device=self.network.device)
        # Boolean mask; indexing with a uint8 (Byte) tensor is deprecated
        dones_t = torch.BoolTensor(dones).to(device=self.network.device)
        
        qvals = torch.gather(self.network.get_qvals(states), 1, actions_t)
        
        #################################################################
        # DDQN Update
        # Online network selects the greedy action for each next state
        next_actions_t = torch.argmax(
            self.network.get_qvals(next_states), dim=-1).reshape(-1, 1)
        # Target network evaluates the selected actions
        target_qvals = self.target_network.get_qvals(next_states)
        qvals_next = torch.gather(target_qvals, 1, next_actions_t).detach()
        #################################################################
        qvals_next[dones_t] = 0 # Zero-out terminal states
        expected_qvals = self.gamma * qvals_next + rewards_t
        loss = nn.MSELoss()(qvals, expected_qvals)
        return loss
    
    def update(self):
        self.network.optimizer.zero_grad()
        batch = self.buffer.sample_batch(batch_size=self.batch_size)
        loss = self.calculate_loss(batch)
        loss.backward()
        self.network.optimizer.step()
        # .item() returns a Python float on both CPU and GPU
        self.update_loss.append(loss.item())
        
    def initialize(self):
        self.training_rewards = []
        self.training_loss = []
        self.update_loss = []
        self.mean_training_rewards = []
        self.sync_eps = []
        self.rewards = 0
        self.step_count = 0
        self.s_0 = self.env.reset()

With our new `DDQNAgent`, we can go ahead and test it to see how it performs.

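import gym

# The agent uses the classic gym API (gym < 0.26), where reset() returns
# only the state and step() returns four values.
env = gym.make('CartPole-v0')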
buffer = experienceReplayBuffer(memory_size=10000, burn_in=1000)
ddqn = QNetwork(env, learning_rate=1e-3)
agent = DDQNAgent(env, ddqn, buffer)
agent.train(max_episodes=5000, network_update_frequency=4, 
            network_sync_frequency=1000)

It learns!

<img src="https://www.datahubbs.com/wp-content/uploads/2019/09/ddqn_rewards.png">

You can also compare it to the performance of a DQN, as was done in the original paper. 

<img src="https://www.datahubbs.com/wp-content/uploads/2019/09/ddqn_paper_results.png">

Here, you can see that the value estimates (top row) are lower for the DDQN than for the DQN. Additionally, the authors got better results with the DDQN thanks to these more accurate value estimates. 

DDQN is an easy way to boost the performance of your DQN, so go ahead, give it a shot!