# Deep Q-Learning

In deep Q-learning, we replace the table of Q-values with a neural network approximation, denoted as $Q(s,a; \theta)$, where $\theta$ represents the network parameters. The modifications to the standard Q-learning algorithm are as follows:

1. **Initialization:**  
   Initialize the Q-network $Q(s, a; \theta)$ with some initial approximation (typically random weights).

2. **Experience Sampling:**  
   Interact with the environment to obtain the tuple $(s, a, r, s')$. In practice, experiences are often stored in a replay buffer to decorrelate samples.

3. **Loss Calculation:**  
   Compute the loss $\mathcal{L}$:
   - If the episode has ended:
     $$
     \mathcal{L} = \left( Q(s, a; \theta) - r \right)^2
     $$
   - Otherwise:
     $$
     \mathcal{L} = \left( Q(s, a; \theta) - \left( r + \gamma \max_{a' \in A} Q(s', a'; \theta) \right) \right)^2
     $$

4. **Parameter Update:**  
   Update the network parameters $\theta$ using stochastic gradient descent (SGD) to minimize the loss $\mathcal{L}$.

5. **Iteration:**  
   Repeat from step 2 until convergence.

*Note:* In practical implementations, additional techniques such as target networks and experience replay are used to improve stability and performance.
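
The loss from step 3 can be sketched in PyTorch as follows. This is a minimal illustration, not the training code of this notebook: `dqn_loss` is a hypothetical helper, the `nn.Linear` layer stands in for a real Q-network, and the squared error is averaged over the batch. As in the formulas above, the same parameters $\theta$ are used for both the prediction and the target.

```python
import torch
import torch.nn as nn


def dqn_loss(net, states, actions, rewards, dones, next_states, gamma=0.99):
    """Mean squared Bellman error for a batch of transitions (hypothetical helper)."""
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        # max over a' of Q(s', a'; theta), treated as a constant target
        next_q = net(next_states).max(dim=1).values
        # Terminal transitions: the target reduces to r
        next_q[dones] = 0.0
        target = rewards + gamma * next_q
    return nn.functional.mse_loss(q_sa, target)


# Toy batch: 3 transitions, 4-dimensional states, 2 actions (all values made up)
torch.manual_seed(0)
net = nn.Linear(4, 2)
states = torch.randn(3, 4)
next_states = torch.randn(3, 4)
actions = torch.tensor([0, 1, 0])
rewards = torch.tensor([1.0, 0.0, -1.0])
dones = torch.tensor([False, False, True])

loss = dqn_loss(net, states, actions, rewards, dones, next_states)
loss.backward()  # gradients for the SGD step in step 4
```

Computing the target under `torch.no_grad()` keeps gradients from flowing through the bootstrapped term, so only $Q(s, a; \theta)$ is pushed toward the target.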

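Random behavior is useful at the beginning of training, when the Q approximation is still poor, because it gives us more uniformly distributed information about the environment's states. As training progresses, random behavior becomes inefficient, and we want to rely on the Q approximation to decide how to act. The $\epsilon$-greedy method mixes the two: with probability $\epsilon$ the agent takes a random action, otherwise it takes the action with the highest Q-value, and $\epsilon$ is decreased as training proceeds.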
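
One common way to shift from random to greedy behavior is an epsilon-greedy policy whose epsilon decays over time. The sketch below assumes a linear decay schedule; `epsilon_at` and `select_action` are hypothetical helper names, and the constants mirror the hyperparameters defined in the code cells of this notebook.

```python
import random

EPSILON_START = 1.0
EPSILON_FINAL = 0.01
EPSILON_DECAY_LAST_FRAME = 150_000


def epsilon_at(frame_idx: int) -> float:
    """Linearly anneal epsilon, then hold it at EPSILON_FINAL (assumed schedule)."""
    return max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Fully random at frame 0, greedy 99% of the time late in training
early_eps = epsilon_at(0)
late_eps = epsilon_at(1_000_000)
```

With these values the agent acts randomly at the first frame and keeps a small amount of exploration once the schedule has bottomed out.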

# SGD Optimization in Deep Q-Learning

Deep Q-learning treats Q-value approximation as a supervised learning problem and optimizes it with **stochastic gradient descent (SGD)**. One of the fundamental requirements for SGD optimization is that the training data is independent and identically distributed (**i.i.d.**). RL data violates this assumption because:

1. **Non-Independent Samples:** Consecutive experiences are highly correlated as they belong to the same episode.
2. **Non-Identical Distribution:** Training data comes from a suboptimal policy (e.g., random or $\epsilon$-greedy), while the goal is to learn an optimal policy.

To address this, **Replay Buffers** store past experiences and sample from them to create more independent training batches. This stabilizes training and ensures the model learns from recent yet diverse experiences.

### **Correlation Between Steps:**  
Bootstrapping with the Bellman equation links $Q(s, a)$ to $Q(s', a')$. Because $s$ and $s'$ are only one step apart and look very similar to the network, an update to $Q(s, a)$ can also shift $Q(s', a')$, which makes training unstable. To stabilize training, **target networks** are used: these are periodically updated copies of the main network, and computing the target value from the copy breaks this correlation.  
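
A minimal sketch of the periodic synchronization, assuming a tiny stand-in network (`nn.Linear`) instead of the DQN model used in this notebook. `maybe_sync` is a hypothetical helper name, and `SYNC_TARGET_FRAMES` mirrors the hyperparameter defined in the code cells.

```python
import copy

import torch
import torch.nn as nn

SYNC_TARGET_FRAMES = 1000

net = nn.Linear(4, 2)          # stand-in for the trained Q-network
tgt_net = copy.deepcopy(net)   # frozen copy used only to compute targets
for p in tgt_net.parameters():
    p.requires_grad_(False)


def maybe_sync(frame_idx: int) -> bool:
    """Copy the main network's weights into the target network every
    SYNC_TARGET_FRAMES frames; return True if a sync happened."""
    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.load_state_dict(net.state_dict())
        return True
    return False
```

Between syncs, the Bellman target would be computed with `tgt_net(next_states)` while gradients flow only through `net(states)`.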

### **Partially Observable MDP (POMDP):**  
A **POMDP** is an MDP in which the agent observes only part of the true state, so a single observation does not satisfy the Markov property: the future can depend on more than the current observation. This occurs in games like poker, where hidden opponent cards create uncertainty. A common solution is to **use past observations** (a short history) to approximate the full state.  
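
Using past observations can be sketched as a small frame stack that returns the last few observations as a single array. This is an illustration under assumptions: `FrameStack` is a hypothetical class, not the wrapper imported from `src.wrappers`, and the history length is arbitrary.

```python
import collections

import numpy as np


class FrameStack:
    """Keep the last `n_frames` observations and expose them as one array."""

    def __init__(self, n_frames: int):
        self.frames = collections.deque(maxlen=n_frames)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        # At episode start there is no history yet, so repeat the first frame
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return np.stack(list(self.frames))

    def push(self, obs: np.ndarray) -> np.ndarray:
        # The deque drops the oldest frame automatically
        self.frames.append(obs)
        return np.stack(list(self.frames))


stack = FrameStack(n_frames=4)
state = stack.reset(np.zeros((2, 2)))   # shape (4, 2, 2)
state = stack.push(np.ones((2, 2)))     # newest frame is last
```

The stacked array lets the network infer quantities such as velocity, which a single frame cannot show.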


In [2]:
import gymnasium as gym
from src import dqn_model
from src import wrappers

from dataclasses import dataclass
import argparse
import time
import numpy as np
import collections
import typing as tt

import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.tensorboard.writer import SummaryWriter


In [3]:
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19

GAMMA = 0.99
BATCH_SIZE = 32
REPLAY_SIZE = 10000
LEARNING_RATE = 1e-4
SYNC_TARGET_FRAMES = 1000
REPLAY_START_SIZE = 1000

EPSILON_DECAY_LAST_FRAME = 150000
EPSILON_START = 1.0
EPSILON_FINAL = 0.01

State = np.ndarray
Action = int
BatchTensors = tt.Tuple[
    torch.ByteTensor,       # current states
    torch.LongTensor,       # actions
    torch.Tensor,           # rewards
    torch.BoolTensor,       # done or truncated flags
    torch.ByteTensor,       # next states
]

@dataclass
class Experience:
    state: State
    action: Action
    reward: float
    done_trunc: bool
    new_state: State

In [4]:
class ExperienceBuffer:
    def __init__(self, capacity: int):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)
    
    def append(self, experience: Experience):
        self.buffer.append(experience)

    def sample(self, batch_size: int) -> tt.List[Experience]:
        indices = np.random.choice(len(self), batch_size, replace=False)
        return [self.buffer[idx] for idx in indices]

In [5]:
class Agent:
    def __init__(self, env: gym.Env, exp_buffer: ExperienceBuffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self.state: tt.Optional[np.ndarray] = None
        self._reset()

    def _reset(self):
        self.state, _ = self.env.reset()
        self.total_reward = 0.0

    @torch.no_grad()
    def play_step(self, net: dqn_model.DQN, device: torch.device,
                  epsilon: float = 0.0) -> tt.Optional[float]:
        done_reward = None

        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            state_v = torch.as_tensor(self.state).to(device)
            # Add a batch dimension; unsqueeze() is not in-place, so keep the result
            state_v = state_v.unsqueeze(0)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

        # Do a step in the environment

        new_state, reward, is_done, is_tr, _ = self.env.step(action)
        self.total_reward += reward

        exp = Experience(
            state = self.state, action = action, reward = float(reward),
            done_trunc=is_done or is_tr, new_state=new_state
        )

        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done or is_tr:
            done_reward = self.total_reward
            self._reset()
        return done_reward

In [None]:
def batch_to_tensor(batch: tt.List[Experience], device: torch.device) -> BatchTensors:
    states, actions, rewards, dones, new_state = [], [], [], [], []
    for e in batch:
        states.append(e.state)
        actions.append(e.action)
        rewards.append(e.reward)
        dones.append(e.done_trunc)
        new_state.append(e.new_state)
        