# Double DQN (DDQN): A Comprehensive Tutorial

## Introduction

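Double DQN (DDQN) is an enhancement of the standard Deep Q-Network (DQN) algorithm that reduces the overestimation bias of Q-learning methods. In standard DQN, the same network both selects and evaluates the greedy action inside the bootstrap target, so the max operator tends to pick up positive estimation errors. DDQN decouples these two roles: the online network selects the action and the target network evaluates it. This gives more accurate value estimates and often leads to improved policy performance in reinforcement learning tasks.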
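
To make the overestimation bias concrete, the following minimal sketch compares a single estimator (select and evaluate with the same noisy values) against a double estimator (select with one set of values, evaluate with an independent one). The setup is an assumption chosen for illustration: ten actions whose true value is zero, with unit-variance Gaussian noise on every estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Assumption for illustration: every action has true value 0, and each
# estimate is the true value plus independent unit-variance Gaussian noise.
true_q = np.zeros(n_actions)
est_a = true_q + rng.normal(size=(n_trials, n_actions))  # used to select
est_b = true_q + rng.normal(size=(n_trials, n_actions))  # independent estimate

# Single estimator: select and evaluate with the same noisy values.
single = est_a.max(axis=1).mean()

# Double estimator: select with est_a, evaluate the chosen action with est_b.
best = est_a.argmax(axis=1)
double = est_b[np.arange(n_trials), best].mean()

print(f"single estimator: {single:.3f}")
print(f"double estimator: {double:.3f}")
```

The single estimator comes out clearly above the true value of zero, while the double estimator stays close to it. In DDQN the two networks are not fully independent, because the target network is a lagged copy of the online network, so the bias is reduced rather than removed.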

## Mathematical Background

### Markov Decision Process (MDP)

An MDP is defined by:
- **S**: A set of states.
- **A**: A set of actions.
- **P**: State transition probabilities, where $P(s' \mid s, a)$ represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
- **R**: Reward function, where $R(s, a)$ is the expected reward received after taking action $a$ in state $s$.
- **$\gamma$**: Discount factor, $0 \leq \gamma < 1$, which controls how strongly future rewards are weighted relative to immediate rewards.
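
As a rough illustration of these components, a small MDP can be written down directly as arrays. The two-state, two-action MDP below is hypothetical and invented only for this sketch, as are the names `P`, `R`, and `step`.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
n_states, n_actions = 2, 2

# P[s, a, s_next] = probability of moving to s_next after taking a in s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1
])

# R[s, a] = expected immediate reward for taking a in s.
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])

gamma = 0.9  # discount factor

# Every (s, a) row of P must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)

rng = np.random.default_rng(0)

def step(s, a):
    """Sample a next state from P(. | s, a); return (expected reward, next state)."""
    s_next = int(rng.choice(n_states, p=P[s, a]))
    return R[s, a], s_next
```

Calling `step(0, 1)` samples one transition from state 0 under action 1 and returns the reward `R[0, 1]` together with the sampled next state.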

### Q-Function

The Q-function, or action-value function, $Q^\pi(s, a)$, represents the expected discounted cumulative reward (the return) obtained by taking action $a$ in state $s$ and then following a policy $\pi$:

$$
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \bigg| s_0 = s, a_0 = a, \pi \right]
$$
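
When the transition probabilities and rewards are known and the state space is small, this expectation can be computed exactly, because $Q^\pi$ satisfies a linear Bellman equation. The sketch below does so for a hypothetical two-state MDP and a uniform random policy; the arrays `P`, `R`, and `pi` are assumptions made up for the example.

```python
import numpy as np

# Hypothetical MDP and policy, invented for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
])                            # P[s, a, s_next]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])    # R[s, a]
gamma = 0.9
pi = np.full((2, 2), 0.5)     # pi[s, a]: uniform random policy

n_states, n_actions = R.shape
n_pairs = n_states * n_actions

# Bellman expectation equation in matrix form: q = r + gamma * P_pi q, where
# P_pi[(s, a), (s_next, a_next)] = P[s, a, s_next] * pi[s_next, a_next].
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_pairs, n_pairs)

# Solve (I - gamma * P_pi) q = r for the flattened Q-values.
q = np.linalg.solve(np.eye(n_pairs) - gamma * P_pi, R.reshape(-1))
Q = q.reshape(n_states, n_actions)

print(Q)
```

DQN-style methods exist for the case where this direct solve is not available, either because the dynamics are unknown or because the state space is too large to enumerate.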

### Double DQN Update Rule

In DDQN, we maintain two separate networks:
- **Online Network**: Used to select actions.
- **Target Network**: Used to evaluate actions.

The update rule is as follows:

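1. Select the greedy action for the next state using the online network:
   $$ a^* = \text{argmax}_a Q_{\text{online}}(s_{t+1}, a) $$

2. Compute the target value $y_t$ by evaluating that action with the target network:
   $$ y_t = r_t + \gamma Q_{\text{target}}(s_{t+1}, a^*) = r_t + \gamma Q_{\text{target}}(s_{t+1}, \text{argmax}_a Q_{\text{online}}(s_{t+1}, a)) $$
   If $s_{t+1}$ is terminal, the bootstrap term is dropped and $y_t = r_t$.

3. Update the online network toward the target:
   $$ Q_{\text{online}}(s_t, a_t) \leftarrow Q_{\text{online}}(s_t, a_t) + \alpha \left[ y_t - Q_{\text{online}}(s_t, a_t) \right] $$
   When $Q_{\text{online}}$ is a neural network, this step is implemented as a gradient descent step on the squared error $\left( y_t - Q_{\text{online}}(s_t, a_t) \right)^2$, with $y_t$ held fixed.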
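
The target computation above can be sketched for a mini-batch with NumPy. The function names `ddqn_targets` and `dqn_targets`, the `dones` mask for terminal next states, and the example numbers are all assumptions introduced for illustration; the standard DQN target is included only for comparison.

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN targets for a batch of transitions.

    q_online_next, q_target_next: shape (batch, n_actions), the two networks'
    Q-values at the next states.
    rewards, dones: shape (batch,); dones is 1.0 where the next state is terminal.
    """
    rows = np.arange(len(rewards))
    best_actions = q_online_next.argmax(axis=1)      # selection: online network
    next_values = q_target_next[rows, best_actions]  # evaluation: target network
    return rewards + gamma * (1.0 - dones) * next_values

def dqn_targets(q_target_next, rewards, dones, gamma=0.99):
    """Standard DQN targets: the target network both selects and evaluates."""
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

# Made-up Q-values for two transitions; the second one ends the episode.
q_online_next = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target_next = np.array([[3.0, 1.5], [0.2, 0.9]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])

y_ddqn = ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.9)
y_dqn = dqn_targets(q_target_next, rewards, dones, gamma=0.9)
print(y_ddqn, y_dqn)
```

In the first transition the online network prefers action 1, so DDQN bootstraps from the target network's value for action 1 (1.5) rather than from its maximum (3.0). The DDQN target can never exceed the DQN target computed from the same target network.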

### Algorithm Steps

1. **Initialize**:
   - Initialize the online network with random weights and copy those weights to the target network, so both start out identical.
   - Initialize the replay memory to store experiences.

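2. **Action Selection**:
   - Choose an action $a_t$ for the current state $s_t$ using an epsilon-greedy policy based on the online network: with probability $\epsilon$ take a random action, otherwise take $\text{argmax}_a Q_{\text{online}}(s_t, a)$.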

3. **Experience Replay**:
   - Store the experience $(s_t, a_t, r_t, s_{t+1})$ in replay memory.

4. **Sample Mini-Batch**:
   - Sample a mini-batch of experiences from the replay memory.

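5. **Target Calculation**:
   - For each sampled experience, select the greedy next action with the online network and evaluate that action with the target network to form the target.

6. **Network Update**:
   - Update the online network to reduce the error between its predictions $Q_{\text{online}}(s_t, a_t)$ and the calculated targets.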

7. **Update Target Network**:
   - Periodically update the target network weights to match the online network weights.
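
The steps above can be sketched end to end on a toy problem. To keep the example self-contained, Q-tables stand in for the two networks, so the update in step 6 is a tabular one rather than a gradient step. The chain environment, the epsilon decay schedule, and all hyperparameter values are assumptions chosen for illustration, not recommended settings.

```python
from collections import deque

import numpy as np

# Hypothetical toy environment: a 5-state chain. Action 1 moves right,
# action 0 moves left; reaching the last state gives reward 1 and ends
# the episode.
N_STATES, N_ACTIONS = 5, 2

def env_step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, float(done), done

rng = np.random.default_rng(0)

# Step 1: initialize. The target table starts as a copy of the online table.
q_online = rng.normal(scale=0.01, size=(N_STATES, N_ACTIONS))
q_target = q_online.copy()
replay = deque(maxlen=5000)

gamma, alpha = 0.9, 0.5
batch_size, sync_every = 16, 50
total_steps = 0

for episode in range(300):
    epsilon = max(0.1, 1.0 - episode / 100)  # assumed linear decay schedule
    s = 0
    for _ in range(100):  # cap on episode length
        # Step 2: epsilon-greedy action from the online estimates.
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(q_online[s].argmax())

        # Step 3: act and store the experience (with a terminal flag).
        s_next, r, done = env_step(s, a)
        replay.append((s, a, r, s_next, done))
        s = s_next
        total_steps += 1

        # Steps 4-6: sample a mini-batch (with replacement), compute the
        # Double DQN target for each experience, and update the online table.
        if len(replay) >= batch_size:
            for i in rng.integers(len(replay), size=batch_size):
                bs, ba, br, bs_next, bdone = replay[int(i)]
                a_star = q_online[bs_next].argmax()           # select: online
                bootstrap = 0.0 if bdone else gamma * q_target[bs_next, a_star]
                y = br + bootstrap                            # evaluate: target
                q_online[bs, ba] += alpha * (y - q_online[bs, ba])

        # Step 7: periodically copy the online values into the target table.
        if total_steps % sync_every == 0:
            q_target[:] = q_online

        if done:
            break

greedy_policy = q_online.argmax(axis=1)
print(greedy_policy[:N_STATES - 1])  # greedy action in each non-terminal state
```

With neural networks, the tabular update would be replaced by a gradient step on the squared error between the online network's prediction and the target, and the copy in step 7 by loading the online weights into the target network.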

### Pseudocode

```python
Initialize online network with random weights
Copy online network weights to target network
Initialize replay memory
Repeat for each episode:
    Initialize state s
    Repeat for each step of the episode:
        Choose action a in s using policy derived from online network (epsilon-greedy)
        Take action a, observe r, s'
        Store (s, a, r, s') in replay memory
        Sample mini-batch from replay memory
        For each (s_j, a_j, r_j, s'_j) in mini-batch:
            a* = argmax over a' of Q_online(s'_j, a')
            y_j = r_j if s'_j is terminal, else r_j + gamma * Q_target(s'_j, a*)
        Update online network to reduce (y_j - Q_online(s_j, a_j))^2 over the mini-batch
        Update target network periodically (copy online network weights)
        s = s'
    until s is terminal
