# Deep Q-learning

```{note}
The main problem with Q-Learning is that it does not scale well to large MDPs with many states and actions. The solution is to find a function $Q_{\theta}(s,a)$ that approximates the Q-Value of any state-action
pair $(s,a)$ using a manageable number of parameters, given by the parameter vector $\theta$. A
deep neural network (DNN) used to estimate Q-Values is called a Deep Q-Network (DQN), and using a
DQN for Approximate Q-Learning is called Deep Q-Learning.
```

## Basic Deep Q-learning

Now, how can we train a DQN? Well, consider the approximate Q-Value computed
by the DQN for a given state-action pair $(s, a)$. Thanks to Bellman, we know we want
this approximate Q-Value to be as close as possible to the reward $r$ that we actually
observe after playing action $a$ in state $s$, plus the discounted value of playing optimally from then on.

To estimate this sum of future discounted rewards, we can simply execute
the DQN on the next state $s'$ for all possible actions $a'$. This gives an approximate
future Q-Value for each possible action. We then pick the highest one and discount it by $\gamma$, which gives us an estimate of
the sum of future discounted rewards. By summing the reward $r$ and this discounted
estimate, we get a target Q-Value $Q_{\text{target}}(s, a)$ for the state-action pair $(s, a)$:

$$Q_{\text{target}}(s, a) = r + \gamma\max_{a'}Q_{\theta}(s', a')$$

With this target Q-Value, we can run a training step using any Gradient Descent algorithm. Specifically, we generally try to minimize the squared error between the estimated
Q-Value $Q_{\theta}(s, a)$ and the target Q-Value $Q_{\text{target}}(s, a)$. And that’s all for the basic Deep Q-Learning
algorithm!
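As a minimal sketch of the target and the squared error described above, the snippet below uses plain Python lists in place of a real network's output. The function names, the `done` flag (which drops the future term at the end of an episode), and the numeric values are illustrative assumptions, not part of the original text.

```python
def q_target(reward, next_q_values, gamma=0.95, done=False):
    """Target Q-Value: r + gamma * max over a' of Q_theta(s', a')."""
    if done:
        # Assumption: a terminal state has no future reward to add.
        return reward
    return reward + gamma * max(next_q_values)


def squared_error(q_estimate, target):
    """Squared error between the DQN's estimate and the target Q-Value."""
    return (q_estimate - target) ** 2


# Hypothetical DQN outputs for the next state s', one value per action a'.
next_q_values = [1.0, 3.0, 2.0]

target = q_target(reward=1.0, next_q_values=next_q_values, gamma=0.5)
loss = squared_error(q_estimate=2.0, target=target)
print(target, loss)
```

In a real training step, the loss would be averaged over a batch and its gradient taken with respect to $\theta$ only through the estimate $Q_{\theta}(s, a)$, with the target treated as a constant.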

The first thing we need is a Deep Q-Network. In theory, you need a neural net that
takes a state-action pair and outputs an approximate Q-Value, but in practice it’s
much more efficient to use a neural net that takes a state and outputs one approximate
Q-Value for each possible action.

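To select an action using this DQN, we pick the action with the largest predicted Q-Value.
To ensure that the agent explores the environment, we will use an $\epsilon$-greedy
policy: with probability $\epsilon$ the agent picks a random action, and otherwise it picks the action with the largest predicted Q-Value.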
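An $\epsilon$-greedy policy can be sketched as follows. Here `q_values` stands for the DQN's output for the current state (one value per action); the function name and the example numbers are assumptions made for illustration.

```python
import random


def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        # Explore: choose any action index uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: choose the index of the largest predicted Q-Value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-Values predicted by the DQN for three actions.
q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy_action(q_values, epsilon=0.0)
print(greedy_action)
```

A common design choice is to start with a high $\epsilon$ and decrease it during training, so that the agent explores a lot early on and exploits its estimates later.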

```{tip}
Instead of training the DQN based only on the latest experiences, we will store all
experiences in a replay buffer, and we will sample a random training
batch from it at each training iteration. This helps reduce the correlations
between the experiences in a training batch, which tremendously helps training.
```
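A replay buffer along these lines might look like the sketch below. The class name, the experience tuple layout `(state, action, reward, next_state, done)`, and the bounded `max_size` are assumptions. The text above stores all experiences, while this sketch discards the oldest ones once the buffer is full to keep memory bounded.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores experiences and returns uniformly sampled training batches."""

    def __init__(self, max_size=2000):
        # Assumption: a bounded deque, so the oldest experiences are dropped.
        self.buffer = deque(maxlen=max_size)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling at random breaks up the ordering of consecutive steps.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(max_size=5)
for step in range(8):
    # Hypothetical experiences: integer states, a single action, reward 1.0.
    buffer.append(step, 0, 1.0, step + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(3)
print(len(buffer), states)
```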

## Fixed Q-Value Targets

In the basic Deep Q-Learning algorithm, the model is used both to make predictions
and to set its own targets. This can lead to a situation analogous to a dog chasing its
own tail. This feedback loop can make the network unstable.

To solve this problem, in their 2013 paper the DeepMind researchers
used two DQNs instead of one: the first is the online model, which learns at each
step and is used to move the agent around, and the other is the target model, used only
to define the targets. The target model is just a clone of the online model: its weights are not trained directly, but are copied over from the online model at regular intervals.

Since the target model is updated much less often than the online model, the Q-Value
targets are more stable.
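The periodic cloning can be sketched as below. A dictionary of integers stands in for the model weights, and the function name, the `sync_every` interval of 50 steps, and the fake "training step" are all assumptions chosen for illustration.

```python
import copy


def maybe_sync_target(step, online_weights, target_weights, sync_every=50):
    """Return a fresh copy of the online weights every `sync_every` steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights


online_weights = {"w": [0]}
target_weights = copy.deepcopy(online_weights)  # target starts as a clone

snapshots = []
for step in range(1, 101):
    online_weights["w"][0] += 1  # stand-in for one gradient descent step
    target_weights = maybe_sync_target(step, online_weights, target_weights)
    snapshots.append(target_weights["w"][0])

# The target weights stay frozen between syncs while the online model changes.
print(snapshots[48], snapshots[49], snapshots[98], snapshots[99])
```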

## Double DQN

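In a 2015 paper, DeepMind researchers tweaked their DQN algorithm, increasing
its performance and somewhat stabilizing training. They called this variant Double
DQN. The update was based on the observation that the target network is prone to
overestimating Q-Values. Because its estimates are only approximations, some are too high by pure chance, and taking the maximum over actions systematically picks out these inflated values.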

To fix this, they proposed
using the online model instead of the target model when selecting the best
actions for the next states, and using the target model only to estimate the Q-Values
for these best actions.
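The difference from the basic target can be sketched as follows, again with plain lists standing in for network outputs. The function name, the `done` flag, and the example Q-Values are assumptions made for illustration.

```python
def double_dqn_target(reward, online_next_q, target_next_q,
                      gamma=0.95, done=False):
    """Double DQN target: the online model selects, the target model evaluates."""
    if done:
        # Assumption: a terminal state has no future reward to add.
        return reward
    # The online model selects the best action for the next state...
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    # ...and the target model supplies the Q-Value of that action.
    return reward + gamma * target_next_q[best_action]


# Hypothetical Q-Values for the next state from each model.
online_next_q = [1.0, 5.0, 2.0]
target_next_q = [4.0, 2.0, 3.0]

double_target = double_dqn_target(1.0, online_next_q, target_next_q, gamma=0.5)
# For comparison: taking the max directly over the target model's estimates.
plain_target = 1.0 + 0.5 * max(target_next_q)
print(double_target, plain_target)
```

In this example the target model's largest estimate (4.0) belongs to an action the online model does not consider best, so the Double DQN target is lower than the one obtained from the plain maximum.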

paper: https://arxiv.org/pdf/1509.06461.pdf