## 🔹 1. **Q-Learning (Vanilla Q-Learning)**

Q-learning is a **model-free reinforcement learning algorithm** that learns the value of taking an action in a given state.
It uses a **Q-table** (state-action value table) to store values.

**Update Rule**:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big]
$$

* $s$: current state
* $a$: action taken
* $r$: reward
* $s'$: next state
* $\alpha$: learning rate
* $\gamma$: discount factor
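
To make the update rule concrete, here is a minimal sketch of a single update step. The function name `q_update` and the numbers are made up for illustration; only the formula comes from the rule above.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.99):
    """Return the new Q(s, a) after one application of the update rule."""
    td_target = reward + gamma * max_next_q   # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa               # the bracketed term in the rule
    return q_sa + alpha * td_error            # move Q(s, a) a fraction alpha toward the target

# Hypothetical values: Q(s, a) = 0.2, r = 1.0, best next-state value = 0.5
new_q = q_update(0.2, 1.0, 0.5, alpha=0.1, gamma=0.9)
print(new_q)
```

With these numbers the target is 1.45, the error is 1.25, and the estimate moves from 0.2 to roughly 0.325.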

👉 Works well in **small discrete state spaces**, but struggles with large or continuous spaces since the Q-table becomes huge.

**Example:**
Suppose an agent in a grid world wants to reach a goal.

* States = grid cells
* Actions = {up, down, left, right}
* The Q-table might look like:

| State | Up  | Down | Left | Right |
| ----- | --- | ---- | ---- | ----- |
| (0,0) | 0   | 0.2  | 0    | 0.1   |
| (0,1) | 0.5 | 0.1  | 0.3  | 0.4   |

The agent updates this table until it learns the best path.
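
The grid-world idea can be sketched in plain Python. Everything below is an assumption chosen for illustration, not something fixed by the text above: the 3x3 grid, the start and goal cells, the reward of 1 at the goal, the hyperparameters, and the helper names.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
NAMES = list(ACTIONS)
SIZE, START, GOAL = 3, (0, 0), (2, 2)   # hypothetical 3x3 grid

def step(state, action):
    """Move one cell; bumping into a wall leaves the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    next_state = (row, col)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def greedy(Q, state):
    """Best-known action in `state`, breaking ties at random."""
    best = max(Q[state].values())
    return random.choice([a for a in NAMES if Q[state][a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, max_steps=100, seed=0):
    random.seed(seed)
    # Q-table: one row per grid cell, one entry per action, all starting at 0
    Q = {(r, c): {a: 0.0 for a in NAMES} for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            # epsilon-greedy: explore with probability eps, otherwise exploit
            action = random.choice(NAMES) if random.random() < eps else greedy(Q, state)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
            if done:
                break
    return Q

def greedy_path(Q, max_steps=20):
    """Follow the greedy policy from START and record the visited cells."""
    state, path = START, [START]
    while state != GOAL and len(path) <= max_steps:
        state, _, _ = step(state, greedy(Q, state))
        path.append(state)
    return path

Q = train()
print(greedy_path(Q))
```

After training, following the greedy action from the start cell should lead to the goal, which is what "learning the best path" means here.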

In [None]:
import gymnasium as gym        
import numpy as np

env = gym.make("CartPole-v1")

# Discretization: the Q-table is indexed only by pole angle (obs[2]) and
# pole angular velocity (obs[3]); cart position and velocity are ignored.
n_bins = (6, 12)   # angle, angular velocity bins
# Finite clip bounds (the raw angular-velocity bounds are infinite).
lower = np.array([-0.418, -5.0])
upper = np.array([0.418, 5.0])

def discretize(obs):
    clipped = np.clip(obs[2:4], lower, upper)
    ratios = (clipped - lower) / (upper - lower)   # scale to [0, 1]
    bins = np.round((np.array(n_bins) - 1) * ratios).astype(int)
    return tuple(int(b) for b in bins)

# Q-table
Q = np.zeros(n_bins + (env.action_space.n,))
alpha, gamma, eps = 0.1, 0.99, 1.0

for episode in range(5000):
    obs, _ = env.reset()
    state = discretize(obs)
    done = False

    while not done:
        if np.random.rand() < eps:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # Gymnasium reports the two end conditions separately

        next_state = discretize(next_obs)

        # Do not bootstrap from a terminal state: its future value is zero.
        best_next = 0.0 if terminated else np.max(Q[next_state])
        Q[state + (action,)] += alpha * (reward + gamma * best_next - Q[state + (action,)])

        state = next_state

    eps = max(0.01, eps * 0.995)  # decay epsilon

## 🔹 2. **Deep Q-Learning (DQN or D-Q Learning)**

Instead of a Q-table, we use a **neural network** to approximate the Q-function:

$$
Q(s,a;\theta) \approx Q(s,a)
$$

* $\theta$: parameters of the neural network
* Input: state (can be high-dimensional, e.g., images)
* Output: Q-values for each possible action
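
As a rough illustration of "state in, one Q-value per action out", here is a tiny two-layer network written in plain Python. The layer sizes, the random initialization, the example state, and the helper names are all hypothetical. A real DQN would use a deep learning library and learn $\theta$ by gradient descent, which this sketch does not do.

```python
import random

def init_layer(n_in, n_out, rng):
    """Small random weights (n_out x n_in) and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu=False):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def q_values(state, theta):
    """Q(s, . ; theta): a single forward pass returns a Q-value for every action."""
    hidden = dense(state, theta[0], relu=True)
    return dense(hidden, theta[1])

rng = random.Random(0)
state_dim, hidden_dim, n_actions = 4, 8, 2     # hypothetical sizes
theta = [init_layer(state_dim, hidden_dim, rng),
         init_layer(hidden_dim, n_actions, rng)]

state = [0.01, -0.02, 0.03, 0.04]              # made-up 4-dimensional state
q = q_values(state, theta)
greedy_action = max(range(n_actions), key=lambda a: q[a])
print(q, greedy_action)
```

Because the network outputs all action values at once, picking the greedy action costs one forward pass instead of one table lookup per action.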

### Key Features of DQN

1. **Experience Replay**: Store past experiences $(s,a,r,s')$ in a replay buffer, and sample mini-batches to break correlation between consecutive updates.
2. **Target Network**: Maintain a separate, periodically synced copy of the network for computing the bootstrap targets, so the targets do not shift with every gradient step.
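
Both features can be sketched without any deep learning library. The `ReplayBuffer` class, the `SYNC_EVERY` constant, and the dictionary standing in for network parameters are illustrative assumptions, and the "gradient step" is a placeholder.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        batch = random.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for t in range(5):                              # made-up transitions
    buffer.push(t, 0, 1.0, t + 1, t == 4)
states, actions, rewards, next_states, dones = buffer.sample(2)

# Target network: a frozen copy that is refreshed only every SYNC_EVERY steps
policy_params = {"w": [0.0]}
target_params = copy.deepcopy(policy_params)
SYNC_EVERY = 100
for step in range(1, 251):
    policy_params["w"][0] += 0.01               # placeholder for a gradient step
    if step % SYNC_EVERY == 0:
        target_params = copy.deepcopy(policy_params)

print(len(buffer), policy_params["w"][0], target_params["w"][0])
```

The buffer holds only the three most recent transitions, and after 250 steps the target parameters still reflect the policy as of step 200.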

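**Loss Function (with NN):**

$$
L(\theta) = \Big[ r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \Big]^2
$$

where $\theta^-$ are the parameters of the target network. Gradient descent on $L(\theta)$ updates only $\theta$; the target term $r + \gamma \max_{a'} Q(s',a';\theta^-)$ is treated as a constant, and for a terminal transition it reduces to $r$.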
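
A small worked example may help here. The sketch below computes the targets and the loss, averaged over a mini-batch, for two made-up transitions. The function names and all numbers are hypothetical, and the Q-values are supplied as plain lists rather than produced by real networks.

```python
def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with no bootstrap on terminal steps."""
    return [r + gamma * max(q_next) * (0.0 if done else 1.0)
            for r, q_next, done in zip(rewards, next_q_target, dones)]

def mse_loss(q_taken, targets):
    """Mean of the squared differences between targets and Q(s, a; theta)."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)

rewards = [1.0, 1.0]
next_q_target = [[0.5, 2.0], [3.0, 1.0]]   # target-network outputs for each s'
dones = [False, True]                      # the second transition ends the episode
q_taken = [2.0, 0.5]                       # Q(s, a; theta) for the actions actually taken

targets = td_targets(rewards, next_q_target, dones, gamma=0.9)
loss = mse_loss(q_taken, targets)
print(targets, loss)
```

The first target is 1 + 0.9 × 2.0 = 2.8, the second is just the reward 1.0 because the episode ended, and the mean squared error comes to about 0.445.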

---

### Example: Playing Atari (Breakout 🎮)

* **Q-Learning**: Not feasible (huge state space, every pixel arrangement is a state).
* **DQN**: Feed the screen image (in practice, a stack of preprocessed frames) into a convolutional neural network → output one Q-value per action, e.g. {no-op, fire, move right, move left}.

  * Example: The NN might learn that in state (ball near paddle, moving right), the Q-value for action “move right” is highest.

---

## 🔑 Key Differences

| Feature              | Q-Learning                                                       | Deep Q-Learning (DQN)                            |
| -------------------- | ---------------------------------------------------------------- | ------------------------------------------------ |
| Value Representation | **Q-table** (explicit lookup)                                    | **Neural Network** (function approximation)      |
| State Space          | Small, discrete                                                  | Large/continuous, high-dimensional               |
| Memory               | Needs table of size $\lvert S \rvert \times \lvert A \rvert$     | Needs weights of NN                              |
| Stability            | More stable, but limited                                         | Needs tricks (experience replay, target network) |
| Applications         | Gridworld, simple games                                          | Atari, robotics, real-world tasks                |
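
The memory row can be made concrete with a little arithmetic. The sketch below counts Q-table entries and fully connected network parameters. The CartPole numbers match the bin counts and layer sizes used in this notebook's code; the 84x84 black-and-white image is a hypothetical example added only to show the scale.

```python
def q_table_entries(n_states, n_actions):
    """Number of values stored in a Q-table: |S| x |A|."""
    return n_states * n_actions

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Discretized CartPole: 6 x 12 bins and 2 actions
print(q_table_entries(6 * 12, 2))          # 144

# The 4 -> 128 -> 64 -> 2 network
print(mlp_param_count([4, 128, 64, 2]))    # 9026

# Hypothetical 84x84 black-and-white image: one state per pixel arrangement
n_pixel_states = 2 ** (84 * 84)
print(len(str(n_pixel_states)), "digits in the number of possible states")
```

On a small discretized problem the table is actually the cheaper option. Once the state is an image, the number of rows is far too large to store, while the network size stays fixed.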

---

✅ **In short:**

* **Q-learning** = Good for small toy problems.
* **DQN (Deep Q-learning)** = Scales Q-learning using neural nets → can solve complex problems like playing video games or controlling robots.

In [None]:
import gymnasium as gym           
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

env = gym.make("CartPole-v1")

# Neural Network
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, action_dim)
        )
    def forward(self, x):
        return self.net(x)

# Hyperparams
state_dim, action_dim = env.observation_space.shape[0], env.action_space.n
policy_net, target_net = DQN(state_dim, action_dim), DQN(state_dim, action_dim)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=10000)
batch_size, gamma, eps = 64, 0.99, 1.0

def select_action(state):
    if random.random() < eps:
        return env.action_space.sample()
    state = torch.FloatTensor(state).unsqueeze(0)
    return policy_net(state).argmax().item()

# Training
for episode in range(500):
    state, _ = env.reset()
    done = False
    while not done:
        action = select_action(state)
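        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # end the episode on failure or on the time limit
        # Store `terminated` (not `done`): a time-limit cutoff should still bootstrap.
        replay_buffer.append((state, action, reward, next_state, terminated))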
        state = next_state

        # Update
        if len(replay_buffer) > batch_size:
            batch = random.sample(replay_buffer, batch_size)
            s, a, r, ns, d = zip(*batch)

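            # Stack into NumPy arrays first; building tensors from tuples of arrays is very slow.
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            a = torch.as_tensor(np.array(a), dtype=torch.int64).unsqueeze(1)
            r = torch.as_tensor(np.array(r), dtype=torch.float32).unsqueeze(1)
            ns = torch.as_tensor(np.array(ns), dtype=torch.float32)
            d = torch.as_tensor(np.array(d), dtype=torch.float32).unsqueeze(1)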

            q_values = policy_net(s).gather(1, a)
            max_next_q = target_net(ns).max(1, keepdim=True)[0]
            target = r + gamma * max_next_q * (1 - d)

            loss = nn.MSELoss()(q_values, target.detach())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    eps = max(0.01, eps * 0.995)
    if episode % 10 == 0:
        target_net.load_state_dict(policy_net.state_dict())

