# 📘 Day 2: Deep Reinforcement Learning

**🎯 Goal:** Master Deep RL - combining neural networks with RL (like DeepMind's Atari agents!)

**⏱️ Time:** 90-120 minutes

**🌟 Why This Matters for AI:**
- Deep Q-Networks (DQN) reshaped RL: introduced in 2013, they learned Atari games directly from pixels, and the 2015 Nature version reached human-level scores on many of them
- Policy gradient methods are widely used in robotics research (locomotion, manipulation)
- Actor-Critic methods power OpenAI Five (Dota 2), and AlphaGo combined policy and value networks in a similar spirit
- Foundation of ChatGPT's RLHF (Proximal Policy Optimization)
- Studied for self-driving cars, drone control, and robotic manipulation
- Enables learning in high-dimensional state spaces (images, sensor data)
- Applied to data center cooling at Google, recommendation systems, and game AI

---

## 🤔 Why Deep Reinforcement Learning?

**Problem with Tabular Q-Learning:**

**Yesterday we learned Q-learning with tables:**
- Grid world: 16 states × 4 actions = 64 Q-values ✅
- Chess: ~10⁴⁰ states × moves → IMPOSSIBLE to store! ❌
- Atari games: 210×160 pixels × 3 color channels = 100,800 input dimensions → far too many distinct screens to tabulate! ❌

**The Solution: Function Approximation**

Instead of a table Q(s, a), use a function approximator:
```
Q(s, a) ≈ Q(s, a; θ)   (neural network with parameters θ)
```

**Benefits:**
- ✅ Handle large/continuous state spaces
- ✅ Generalization: similar states → similar Q-values
- ✅ Learn from raw pixels (end-to-end)
- ✅ Share knowledge across states
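
To put numbers on the difference, here is a minimal sketch that counts table entries against the parameters of a small fully connected network. The layer widths (4 → 128 → 128 → 2) and the helper names `q_table_entries` and `mlp_param_count` are assumptions chosen for illustration, not part of any library.

```python
def q_table_entries(num_states, num_actions):
    """A tabular method stores one Q-value per (state, action) pair."""
    return num_states * num_actions


def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


grid_world = q_table_entries(num_states=16, num_actions=4)
# Lower bound for a chess-scale problem: pretend there is only one action per state
chess_like = q_table_entries(num_states=10**40, num_actions=1)

# Hypothetical network: 4 state features -> 128 -> 128 -> 2 actions
network = mlp_param_count([4, 128, 128, 2])

print(f"Grid world table:  {grid_world} entries")
print(f"Chess-scale table: at least {chess_like:.1e} entries")
print(f"Small MLP:         {network:,} parameters")
```

The table grows with the number of states, while the network's size is fixed by its architecture, however many states the environment has.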

### 🎯 The Deep RL Revolution

**Timeline:**
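- **2013:** DeepMind's DQN learns to play Atari games from raw pixels (arXiv preprint; the Nature paper followed in 2015)
- **2015:** The Nature DQN reaches more than 75% of a professional human tester's score on 29 of 49 Atari games
- **2016:** AlphaGo beats Lee Sedol using policy and value networks combined with Monte Carlo tree search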
- **2017:** AlphaZero masters Chess/Shogi/Go from scratch
- **2019:** OpenAI Five beats Dota 2 world champions (PPO)
- **2022:** ChatGPT fine-tuned with PPO (RLHF)
- **2024-2025:** Deep RL in robotics, autonomous vehicles, drug discovery

**Key Innovations:**
1. **DQN (2013):** CNN + Q-learning + experience replay
2. **Policy Gradients:** Directly optimize policy (not Q-values)
3. **Actor-Critic:** Combine value functions + policy gradients
4. **PPO (2017):** Stable, robust, used in ChatGPT

Let's build Deep RL from scratch! 👇

In [None]:
# Import essential libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import deque
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

# Make plots beautiful
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
print("Let's build Deep RL agents! 🚀")

## 🧠 Deep Q-Networks (DQN)

**DQN = Q-Learning + Deep Neural Networks**

### Core Idea:

**Replace Q-table with neural network:**
```
Old: Q-table[state][action] = value
New: Q(state; θ) → [Q-value for each action]
```

**Architecture:**
```
Input (state)
    ↓
Hidden Layer 1 (ReLU)
    ↓
Hidden Layer 2 (ReLU)
    ↓
Output Layer (Q-value for each action)
```

### 🎯 Key Innovations of DQN:

**1. Experience Replay:**
- Store transitions (s, a, r, s') in replay buffer
- Sample random mini-batches for training
- Breaks correlation between consecutive samples
- More sample efficient (reuse experiences)

**Why it helps:**
```
Without replay: [exp1, exp2, exp3, ...] → highly correlated
With replay:    [exp47, exp2, exp95, ...] → nearly independent samples
```
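
The effect can be measured directly. The sketch below uses a 1-D random walk as a stand-in for a trajectory of states (an assumption for illustration; real states are vectors) and compares the correlation of consecutive samples with that of randomly drawn pairs. The `pearson` helper is written out by hand so the example only needs the standard library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Toy "trajectory": each state is the previous one plus a little noise
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))

# What online training sees: each sample followed by its successor
consecutive = pearson(walk[:-1], walk[1:])

# What a replay buffer hands out: samples drawn uniformly at random
left = [rng.choice(walk) for _ in range(5000)]
right = [rng.choice(walk) for _ in range(5000)]
shuffled = pearson(left, right)

print(f"consecutive samples: correlation = {consecutive:.3f}")
print(f"random samples:      correlation = {shuffled:.3f}")
```

Consecutive samples come out almost perfectly correlated, while uniformly sampled pairs show a correlation close to zero.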

**2. Target Network:**
- Two networks: Q-network (online) and Target network (frozen)
- Target network updated every C steps
- Stabilizes training (moving target problem)
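
As a toy illustration of the "update every C steps" rule, the sketch below replaces the two networks with plain dictionaries (an assumption to keep it dependency-free) and uses a hypothetical sync interval of 4. The online parameters move every step, while the target only changes when it is copied.

```python
import copy

# Stand-ins for network weights; a real agent would hold tensors here
online_params = {"w": 0.0}
target_params = copy.deepcopy(online_params)

SYNC_EVERY = 4  # the "C" in "every C steps" (hypothetical value)

history = []
for step in range(1, 11):
    online_params["w"] += 0.1  # pretend gradient step on the online network
    if step % SYNC_EVERY == 0:
        # Hard update: copy the online parameters into the target
        target_params = copy.deepcopy(online_params)
    history.append((step, round(online_params["w"], 1), round(target_params["w"], 1)))

for step, online_w, target_w in history:
    print(f"step {step:2d}  online w = {online_w:.1f}  target w = {target_w:.1f}")
```

Between syncs the target stays frozen, so the regression target does not shift with every gradient step.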

**Loss Function:**
```
Loss = (r + γ * max_a' Q_target(s', a') - Q(s, a))²
        └────────────── TD error ───────────────┘
```
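
Here is the loss worked through for a single transition with made-up numbers. The `done` flag is an addition to the formula above, assuming the usual convention that a terminal step has no bootstrap term; the function names are illustrative.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """y = r + gamma * max_a' Q_target(s', a'), or just r on a terminal step."""
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap


def squared_td_loss(q_sa, target):
    """Squared TD error for one transition."""
    td_error = target - q_sa
    return td_error ** 2


# Made-up values for one transition
q_sa = 1.5                 # online network's estimate Q(s, a)
next_q = [1.0, 2.0]        # target network's estimates Q_target(s', a') for each a'

y = td_target(reward=1.0, next_q_values=next_q, gamma=0.9)   # 1.0 + 0.9 * 2.0
loss = squared_td_loss(q_sa, y)                               # (2.8 - 1.5) ** 2

print(f"target y = {y:.2f}, loss = {loss:.2f}")
```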

**3. CNN for Image Processing (Atari):**
- Convolutional layers extract visual features
- Input: 84×84×4 grayscale frames (4 = frame stacking)
- Output: Q-value for each action
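
To see how an 84×84×4 input shrinks on its way through the convolutions, the sketch below does the shape arithmetic by hand. The layer settings are the ones reported for the 2015 Nature DQN (32 filters of 8×8 with stride 4, 64 of 4×4 with stride 2, 64 of 3×3 with stride 1); no padding is assumed, and `conv_output_size` is an illustrative helper.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride) for the three convolutional layers
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 frames, 4 of them stacked
for out_channels, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    channels = out_channels
    print(f"{channels} feature maps of {size}x{size}")

flattened = channels * size * size
print(f"Flattened features passed to the dense layers: {flattened}")
```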

### 🎯 DQN Algorithm:

```
1. Initialize Q-network with random weights θ
2. Initialize target network with weights θ⁻ = θ
3. Initialize replay buffer D
4. For each episode:
     a. Observe initial state s
     b. For each step:
          i.   Choose action a using ε-greedy
          ii.  Execute a, observe r, s'
          iii. Store (s, a, r, s', done) in D
          iv.  Sample random mini-batch from D
          v.   Compute target: y = r if done, else r + γ*max_a' Q(s',a'; θ⁻)
          vi.  Update θ by minimizing (y - Q(s,a; θ))²
          vii. Every C steps: θ⁻ ← θ
```

Let's implement DQN!

In [None]:
# Deep Q-Network Architecture

class DQN(nn.Module):
    """
    Deep Q-Network
    
    Input: State representation
    Output: Q-values for all actions
    """
    
    def __init__(self, state_size, action_size, hidden_size=128):
        super(DQN, self).__init__()
        
        # Fully connected layers
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        
    def forward(self, x):
        """
        Forward pass
        
        Args:
            x: State tensor (batch_size, state_size)
        
        Returns:
            Q-values for all actions (batch_size, action_size)
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # No activation on output
        return x

# Test the network
state_size = 4  # Example: CartPole has 4 state variables
action_size = 2  # Example: 2 actions (left, right)

dqn = DQN(state_size, action_size).to(device)

# Test forward pass
dummy_state = torch.randn(1, state_size).to(device)
q_values = dqn(dummy_state)

print("✅ DQN Network Created!")
print(f"\nArchitecture:")
print(f"  Input size: {state_size}")
print(f"  Hidden layers: 128 → 128")
print(f"  Output size: {action_size}")
print(f"\nNetwork:")
print(dqn)
print(f"\nTest output shape: {q_values.shape}")
print(f"Q-values: {q_values.detach().cpu().numpy()[0]}")

In [None]:
# Experience Replay Buffer

class ReplayBuffer:
    """
    Experience replay buffer for DQN
    """
    
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store experience"""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        """Sample random mini-batch"""
        batch = random.sample(self.buffer, batch_size)
        
        # Separate components
        states, actions, rewards, next_states, dones = zip(*batch)
        
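        return (
            np.array(states, dtype=np.float32),
            np.array(actions, dtype=np.int64),
            np.array(rewards, dtype=np.float32),
            np.array(next_states, dtype=np.float32),
            np.array(dones, dtype=np.float32)  # float so (1 - dones) works directly in the TD target
        )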
    
    def __len__(self):
        return len(self.buffer)

# Test replay buffer
buffer = ReplayBuffer(capacity=1000)

# Add some experiences
for i in range(100):
    state = np.random.randn(4)
    action = np.random.randint(2)
    reward = np.random.randn()
    next_state = np.random.randn(4)
    done = False
    buffer.push(state, action, reward, next_state, done)

print("✅ Replay Buffer Implemented!")
print(f"\nBuffer size: {len(buffer)}")
print(f"Capacity: {buffer.buffer.maxlen}")

# Sample a mini-batch
if len(buffer) >= 32:
    states, actions, rewards, next_states, dones = buffer.sample(32)
    print(f"\nSampled mini-batch:")
    print(f"  States shape: {states.shape}")
    print(f"  Actions shape: {actions.shape}")
    print(f"  Rewards shape: {rewards.shape}")
    print(f"\n💡 Random sampling breaks temporal correlation!")

In [None]:
# DQN Agent

class DQNAgent:
    """
    Complete DQN agent with experience replay and target network
    """
    
    def __init__(self, state_size, action_size, learning_rate=0.001, gamma=0.99, 
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64, target_update=10):
        
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        self.target_update = target_update
        self.update_counter = 0
        
        # Q-network and target network
        self.q_network = DQN(state_size, action_size).to(device)
        self.target_network = DQN(state_size, action_size).to(device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.target_network.eval()  # Target network in eval mode
        
        # Optimizer and replay buffer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.memory = ReplayBuffer(buffer_size)
        
    def get_action(self, state, training=True):
        """Choose action using ε-greedy policy"""
        if training and random.random() < self.epsilon:
            return random.randint(0, self.action_size - 1)
        
        # Convert to tensor
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        with torch.no_grad():
            q_values = self.q_network(state)
        
        return q_values.argmax().item()
    
    def train(self):
        """Train on mini-batch from replay buffer"""
        if len(self.memory) < self.batch_size:
            return None
        
        # Sample mini-batch
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
        
        # Convert to tensors
        states = torch.FloatTensor(states).to(device)
        actions = torch.LongTensor(actions).to(device)
        rewards = torch.FloatTensor(rewards).to(device)
        next_states = torch.FloatTensor(next_states).to(device)
        dones = torch.FloatTensor(dones).to(device)
        
        # Current Q-values
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Target Q-values (use target network)
        with torch.no_grad():
            next_q = self.target_network(next_states).max(1)[0]
            target_q = rewards + (1 - dones) * self.gamma * next_q
        
        # Compute loss
        loss = F.mse_loss(current_q, target_q)
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Update target network
        self.update_counter += 1
        if self.update_counter % self.target_update == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        return loss.item()
    
    def decay_epsilon(self):
        """Decay exploration rate"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

print("✅ DQN Agent Implemented!")
print("\n🎯 Key Components:")
print("  1. Q-Network: Approximates Q(s,a)")
print("  2. Target Network: Stabilizes training")
print("  3. Replay Buffer: Breaks correlation")
print("  4. ε-greedy: Exploration/exploitation")
print("\n💡 This is the algorithm that mastered Atari games!")
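
One thing worth checking before training is how quickly ε actually decays. This sketch replays the multiplicative schedule from `decay_epsilon` with the agent's default values (`epsilon=1.0`, `epsilon_decay=0.995`, `epsilon_min=0.01`), assuming one decay per episode; the helper names are hypothetical.

```python
import math


def epsilon_after(steps, start=1.0, decay=0.995, floor=0.01):
    """Exploration rate after a number of multiplicative decay steps."""
    return max(floor, start * decay ** steps)


def steps_to_floor(start=1.0, decay=0.995, floor=0.01):
    """Smallest number of decay steps that brings epsilon down to the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


print(f"epsilon after 300 decays: {epsilon_after(300):.3f}")
print(f"decays needed to reach the floor: {steps_to_floor()}")
```

With these defaults the agent still explores in roughly one of five steps after 300 decays, so a faster decay or more episodes may be worth trying if learning stalls.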

## 🎮 Real AI Example: CartPole with DQN

**Task:** Balance a pole on a moving cart

**CartPole Environment:**
- **State:** [cart position, cart velocity, pole angle, pole angular velocity]
- **Actions:** 0 = push left, 1 = push right
- **Reward:** +1 for each timestep the pole stays upright
- **Done:** Pole falls beyond ±12° or cart moves beyond ±2.4 units (the environment below also stops after 500 steps)
- **Goal:** Keep pole balanced for 200+ steps

**Why CartPole?**
- Classic control problem
- Simple but non-trivial
- Tests continuous state handling
- Fast training (see results in minutes!)

**Real-World Analogy:**
- Balancing humanoid robots (Boston Dynamics)
- Drone stabilization
- Inverted pendulum control

Let's train DQN on CartPole!

In [None]:
# Simple CartPole environment (gym-like interface)

class CartPoleEnv:
    """
    Simplified CartPole environment
    """
    
    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.length = 0.5
        self.force_mag = 10.0
        self.tau = 0.02  # seconds between state updates
        
        # Thresholds
        self.theta_threshold = 12 * 2 * np.pi / 360  # ±12 degrees
        self.x_threshold = 2.4
        
        self.state = None
        self.steps = 0
        
    def reset(self):
        """Reset environment"""
        self.state = np.random.uniform(low=-0.05, high=0.05, size=(4,))
        self.steps = 0
        return self.state.copy()
    
    def step(self, action):
        """Take action"""
        x, x_dot, theta, theta_dot = self.state
        
        force = self.force_mag if action == 1 else -self.force_mag
        costheta = np.cos(theta)
        sintheta = np.sin(theta)
        
        # Physics simulation
        temp = (force + self.masspole * self.length * theta_dot**2 * sintheta) / (self.masscart + self.masspole)
        thetaacc = (self.gravity * sintheta - costheta * temp) / \
                   (self.length * (4.0/3.0 - self.masspole * costheta**2 / (self.masscart + self.masspole)))
        xacc = temp - self.masspole * self.length * thetaacc * costheta / (self.masscart + self.masspole)
        
        # Update state
        x = x + self.tau * x_dot
        x_dot = x_dot + self.tau * xacc
        theta = theta + self.tau * theta_dot
        theta_dot = theta_dot + self.tau * thetaacc
        
        self.state = np.array([x, x_dot, theta, theta_dot])
        self.steps += 1
        
        # Check termination
        done = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold
            or theta > self.theta_threshold
            or self.steps >= 500
        )
        
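        # Note: reaching the 500-step limit is reported like a failure
        # (done=True, reward 0), which keeps the interface simple.
        reward = 1.0 if not done else 0.0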
        
        return self.state.copy(), reward, done

# Test environment
env = CartPoleEnv()
state = env.reset()

print("✅ CartPole Environment Created!")
print(f"\nState: {state}")
print(f"  [cart_pos, cart_vel, pole_angle, pole_angular_vel]")
print(f"\nActions: 0 = Left, 1 = Right")
print(f"Goal: Balance pole for 200+ steps")

# Test random policy
total_reward = 0
for _ in range(100):
    action = np.random.randint(2)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break

print(f"\n🎲 Random policy reward: {total_reward}")
print(f"💡 DQN should achieve 200+ reward!")

In [None]:
# Train DQN on CartPole

def train_dqn(env, agent, num_episodes=300):
    """
    Train DQN agent on CartPole
    """
    rewards_history = []
    losses_history = []
    epsilon_history = []
    
    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        episode_losses = []
        
        for step in range(500):
            # Choose action
            action = agent.get_action(state, training=True)
            
            # Take action
            next_state, reward, done = env.step(action)
            
            # Store in replay buffer
            agent.memory.push(state, action, reward, next_state, done)
            
            # Train
            loss = agent.train()
            if loss is not None:
                episode_losses.append(loss)
            
            episode_reward += reward
            state = next_state
            
            if done:
                break
        
        # Decay epsilon
        agent.decay_epsilon()
        
        # Record metrics
        rewards_history.append(episode_reward)
        epsilon_history.append(agent.epsilon)
        if episode_losses:
            losses_history.append(np.mean(episode_losses))
        
        # Print progress
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(rewards_history[-50:])
            print(f"Episode {episode + 1}/{num_episodes} - Avg Reward: {avg_reward:.2f}, ε: {agent.epsilon:.3f}")
            
            if avg_reward >= 195:
                print(f"\n🎉 Solved! Average reward {avg_reward:.2f} >= 195")
                break
    
    return rewards_history, losses_history, epsilon_history

# Create agent and train
env = CartPoleEnv()
agent = DQNAgent(
    state_size=4,
    action_size=2,
    learning_rate=0.001,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01,
    buffer_size=10000,
    batch_size=64,
    target_update=10
)

print("🚀 Training DQN on CartPole...\n")
rewards, losses, epsilons = train_dqn(env, agent, num_episodes=300)

print("\n✅ Training complete!")

In [None]:
# Visualize DQN training

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Episode rewards
ax = axes[0, 0]
ax.plot(rewards, alpha=0.3, color='blue', label='Raw')
window = 20
if len(rewards) >= window:
    moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(rewards)), moving_avg, color='red', linewidth=2, label=f'{window}-episode avg')
ax.axhline(195, color='green', linestyle='--', linewidth=2, label='Solved threshold')
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Total Reward', fontsize=12)
ax.set_title('üìà DQN Learning Progress', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

# Plot 2: Training loss
ax = axes[0, 1]
if losses:
    ax.plot(losses, color='orange', alpha=0.5)
    if len(losses) >= 20:
        loss_smooth = np.convolve(losses, np.ones(20)/20, mode='valid')
        ax.plot(range(19, len(losses)), loss_smooth, color='red', linewidth=2, label='Smoothed')
ax.set_xlabel('Episode', fontsize=12)  # losses holds one mean loss per episode
ax.set_ylabel('Loss (MSE)', fontsize=12)
ax.set_title('📉 Training Loss', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

# Plot 3: Epsilon decay
ax = axes[1, 0]
ax.plot(epsilons, color='purple', linewidth=2)
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Epsilon (ε)', fontsize=12)
ax.set_title('🔍 Exploration Rate Decay', fontsize=13, fontweight='bold')
ax.grid(alpha=0.3)

# Plot 4: Performance distribution
ax = axes[1, 1]
if len(rewards) >= 50:
    final_rewards = rewards[-50:]
    ax.hist(final_rewards, bins=20, color='skyblue', edgecolor='black', alpha=0.7)
    ax.axvline(np.mean(final_rewards), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(final_rewards):.1f}')
    ax.axvline(195, color='green', linestyle='--', linewidth=2, label='Solved: 195')
ax.set_xlabel('Total Reward', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('üìä Final 50 Episodes Performance', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Training Summary:")
print(f"  Total episodes: {len(rewards)}")
print(f"  Final 10 episodes avg: {np.mean(rewards[-10:]):.2f}")
print(f"  Best episode: {max(rewards):.0f}")
if np.mean(rewards[-50:]) >= 195:
    print(f"\n🎉 Problem SOLVED! Agent consistently balances pole!")
else:
    print(f"\n💡 Agent learning but needs more training...")

## 🎯 Policy Gradients

**Policy Gradients = Directly optimize the policy (not Q-values)**

### Key Difference:

**Value-based (DQN):**
```
Learn Q(s,a) → Derive policy: π(s) = argmax_a Q(s,a)
```

**Policy-based (Policy Gradient):**
```
Directly learn policy: π(a|s; θ) = probability of action a in state s
```
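
The contrast shows up in how an action is picked. In this minimal sketch (made-up numbers, standard library only) the value-based agent always takes the argmax, while the policy-based agent samples from a softmax over its outputs. `greedy_action`, `softmax` and `sample_action` are illustrative helpers.

```python
import math
import random


def greedy_action(q_values):
    """Value-based: pick the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Policy-based: draw an action according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
q_values = [0.2, 1.1, 0.7]   # made-up Q-values for three actions
logits = [0.2, 1.1, 0.7]     # made-up policy outputs for the same three actions

greedy = greedy_action(q_values)
probs = softmax(logits)
samples = [sample_action(probs, rng) for _ in range(1000)]

print(f"greedy choice: action {greedy} every time")
print("policy probabilities:", [round(p, 3) for p in probs])
print("sampled action counts:", [samples.count(a) for a in range(3)])
```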

### 🎯 Why Policy Gradients?

**Advantages:**
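1. **Continuous actions:** DQN needs a max over all actions, which is awkward when actions are continuous; policy gradients handle them directly
   - Example: Robot arm angles, steering wheel position
2. **Stochastic policies:** PG naturally outputs probabilities
   - Example: Rock-paper-scissors (need randomness!)
3. **Smoother convergence:** Action probabilities change gradually as the parameters change, which helps on some problems (e.g., robotics)
4. **Simplicity:** No need for Q-function approximation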

**Disadvantages:**
1. **Sample inefficient:** Needs many episodes
2. **High variance:** Gradients can be noisy
3. **Local optima:** Can get stuck

### 🎯 REINFORCE Algorithm:

**Core Idea:** Increase probability of good actions, decrease bad ones

**Objective:**
```
J(θ) = E[Σ r_t]  (expected total reward)
```

**Policy Gradient Theorem:**
```
∇J(θ) = E[Σ_t ∇ log π(a_t|s_t; θ) * G_t]
```

**In words:**
- If action led to high return G_t → increase its probability
- If action led to low return → decrease its probability

**Update Rule:**
```
θ ← θ + α * ∇ log π(a_t|s_t; θ) * G_t
```
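
To watch the update rule push probabilities around, here is a tiny sketch for a softmax policy with one logit per action in a single state (a simplifying assumption; the real policy is a network). For such a policy the gradient of log π(a) with respect to the logits is the one-hot vector of the chosen action minus the probabilities. `reinforce_step` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, G, alpha=0.1):
    """One update theta <- theta + alpha * G * grad log pi(action)."""
    probs = softmax(logits)
    # d/d(logit_a) log pi(action) = 1[a == action] - pi(a)
    grad_log_pi = [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]
    return [theta + alpha * G * g for theta, g in zip(logits, grad_log_pi)]


logits = [0.0, 0.0]  # start with a uniform policy over two actions
before = softmax(logits)[1]
after_good = softmax(reinforce_step(logits, action=1, G=5.0))[1]
after_bad = softmax(reinforce_step(logits, action=1, G=-5.0))[1]

print(f"P(action 1) before:               {before:.3f}")
print(f"P(action 1) after return G = +5:  {after_good:.3f}")
print(f"P(action 1) after return G = -5:  {after_bad:.3f}")
```

A positive return raises the probability of the action that was taken, and a negative return lowers it.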

### 🎯 REINFORCE Pseudocode:

```
1. Initialize policy network π(a|s; θ)
2. For each episode:
     a. Generate episode using π: (s_0,a_0,r_1), (s_1,a_1,r_2), ...
     b. For each step t:
          i.  Calculate return: G_t = Σ_{k=t+1}^T γ^(k-t-1) * r_k
          ii. Update: θ ← θ + α * ∇ log π(a_t|s_t; θ) * G_t
```

**Key Insight:** This is Monte Carlo - wait until episode ends, then update!
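
Because the returns are only known once the episode has ended, they are usually computed in one backward pass over the rewards. A minimal sketch, with `discounted_returns` as an illustrative helper and a small discount factor chosen so the arithmetic is easy to follow:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step, using G_t = r_{t+1} + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):   # walk backwards from the last reward
        G = r + gamma * G
        returns.append(G)
    returns.reverse()             # put the returns back in time order
    return returns


# Three steps with reward 1 each and gamma = 0.5:
# G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Early actions collect credit for everything that follows, which is one reason the gradient estimate is noisy.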

Let's implement REINFORCE!

In [None]:
# Policy Network

class PolicyNetwork(nn.Module):
    """
    Policy network that outputs action probabilities
    """
    
    def __init__(self, state_size, action_size, hidden_size=128):
        super(PolicyNetwork, self).__init__()
        
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        
    def forward(self, x):
        """
        Forward pass
        
        Returns:
            Action probabilities (softmax output)
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.softmax(x, dim=-1)  # Convert to probabilities

# Test policy network
policy_net = PolicyNetwork(state_size=4, action_size=2).to(device)

dummy_state = torch.randn(1, 4).to(device)
action_probs = policy_net(dummy_state)

print("✅ Policy Network Created!")
print(f"\nOutput: Action probabilities (sum to 1.0)")
print(f"Action probs: {action_probs.detach().cpu().numpy()[0]}")
print(f"Sum: {action_probs.sum().item():.4f}")
print(f"\n💡 Stochastic policy: sample from this distribution!")

In [None]:
# REINFORCE Agent

class REINFORCEAgent:
    """
    REINFORCE policy gradient agent
    """
    
    def __init__(self, state_size, action_size, learning_rate=0.001, gamma=0.99):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        
        # Policy network
        self.policy = PolicyNetwork(state_size, action_size).to(device)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)
        
        # Episode memory
        self.log_probs = []
        self.rewards = []
    
    def get_action(self, state):
        """
        Sample action from policy
        """
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        action_probs = self.policy(state)
        
        # Sample action from probability distribution
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        
        # Store log probability for training
        self.log_probs.append(action_dist.log_prob(action))
        
        return action.item()
    
    def store_reward(self, reward):
        """Store reward"""
        self.rewards.append(reward)
    
    def train_episode(self):
        """
        Train on collected episode
        """
        # Calculate returns (discounted cumulative rewards)
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        # Normalize returns (reduces variance)
        returns = torch.tensor(returns, dtype=torch.float32).to(device)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Calculate policy loss
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)  # Negative for gradient ascent
        
        policy_loss = torch.stack(policy_loss).sum()
        
        # Optimize
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear episode memory
        self.log_probs = []
        self.rewards = []
        
        return policy_loss.item()

print("✅ REINFORCE Agent Implemented!")
print("\n🎯 Key differences from DQN:")
print("  1. Outputs action probabilities (not Q-values)")
print("  2. Samples actions stochastically")
print("  3. Updates after full episode (Monte Carlo)")
print("  4. No replay buffer or target network")
print("\n💡 The same idea extends to continuous actions in robotics (e.g., Gaussian policies)!")

## 🎭 Actor-Critic Methods

**Actor-Critic = Combine value-based + policy-based approaches**

### The Best of Both Worlds:

**Two components (separate networks, or two heads on a shared network):**
1. **Actor (Policy):** Chooses actions with π(a|s; θ)
2. **Critic (Value):** Evaluates states with V(s; w)

**Why both?**
- **Policy gradients:** High variance, slow learning
- **Value functions:** Lower variance estimates
- **Combination:** Actor learns policy, Critic reduces variance!

### 🎯 How It Works:

```
┌─────────┐
│  State  │
└────┬────┘
     │
     ├────────┐
     │        │
     ↓        ↓
┌────────┐ ┌────────┐
│ Actor  │ │ Critic │
│(Policy)│ │(Value) │
└───┬────┘ └───┬────┘
    │          │
    ↓          ↓
  Action    Advantage
             (TD Error)
```

**Update Process:**
1. **Actor** takes action a
2. **Critic** evaluates: Œ¥ = r + Œ≥V(s') - V(s)  (TD error)
3. **Actor** updates: Œ∏ ‚Üê Œ∏ + Œ± * Œ¥ * ‚àá log œÄ(a|s; Œ∏)
4. **Critic** updates: w ‚Üê w + Œ≤ * Œ¥ * ‚àáV(s; w)

**Key Insight:** TD error Œ¥ acts as "advantage" - how much better action was than expected!
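
A worked example with made-up critic estimates shows the sign of the TD error doing its job. The sketch assumes a tabular critic, so the gradient of V(s) with respect to its own entry is 1, and it treats V(s') as 0 on a terminal step; `td_error` and `critic_update` are illustrative names.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s), with V(s') counted as 0 after the episode ends."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def critic_update(value_s, delta, beta=0.1):
    """Tabular critic step: V(s) <- V(s) + beta * delta."""
    return value_s + beta * delta


# Two made-up transitions that both start in a state the critic values at 2.0
delta_good = td_error(reward=1.0, value_s=2.0, value_next=3.0, gamma=0.5)  # 1.0 + 1.5 - 2.0
delta_bad = td_error(reward=0.0, value_s=2.0, value_next=1.0, gamma=0.5)   # 0.0 + 0.5 - 2.0

print(f"better than expected: delta = {delta_good:+.2f}")
print(f"worse than expected:  delta = {delta_bad:+.2f}")
print(f"critic estimate after the good step: {critic_update(2.0, delta_good):.2f}")
```

A positive δ makes the actor more likely to repeat the action and nudges the critic's estimate up; a negative δ does the opposite.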

### 🎯 Advantages:

**vs DQN:**
- ✅ Works with continuous actions
- ✅ Learns a stochastic policy directly
- ✅ Can learn online (no replay buffer needed)

**vs REINFORCE:**
- ✅ Lower variance (critic provides baseline)
- ✅ Can update after each step (not just episodes)
- ✅ Faster, more stable learning

### 🌟 Famous Actor-Critic Algorithms:

1. **A3C (2016):** Asynchronous Advantage Actor-Critic
2. **A2C:** Synchronous version of A3C
3. **PPO (2017):** Proximal Policy Optimization (ChatGPT!)
4. **SAC (2018):** Soft Actor-Critic (robotics)
5. **TD3 (2018):** Twin Delayed DDPG (continuous control)

**PPO is a popular default for:**
- RLHF fine-tuning of language models (InstructGPT, ChatGPT)
- OpenAI Five (Dota 2)
- Many robotics applications

Let's implement a simple Actor-Critic!

In [None]:
# Actor-Critic Agent

class ActorCritic(nn.Module):
    """
    Combined Actor-Critic network
    """
    
    def __init__(self, state_size, action_size, hidden_size=128):
        super(ActorCritic, self).__init__()
        
        # Shared layers
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        
        # Actor head (policy)
        self.actor = nn.Linear(hidden_size, action_size)
        
        # Critic head (value)
        self.critic = nn.Linear(hidden_size, 1)
    
    def forward(self, x):
        """
        Forward pass
        
        Returns:
            action_probs: Policy output
            state_value: Value estimate
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        
        # Actor output
        action_probs = F.softmax(self.actor(x), dim=-1)
        
        # Critic output
        state_value = self.critic(x)
        
        return action_probs, state_value

class ActorCriticAgent:
    """
    Actor-Critic agent
    """
    
    def __init__(self, state_size, action_size, learning_rate=0.001, gamma=0.99):
        self.gamma = gamma
        
        # Combined network
        self.model = ActorCritic(state_size, action_size).to(device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        
        # Episode memory
        self.log_probs = []
        self.values = []
        self.rewards = []
    
    def get_action(self, state):
        """
        Sample action from policy and get value estimate
        """
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        action_probs, state_value = self.model(state)
        
        # Sample action
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        
        # Store for training
        self.log_probs.append(action_dist.log_prob(action))
        self.values.append(state_value)
        
        return action.item()
    
    def store_reward(self, reward):
        """Store reward"""
        self.rewards.append(reward)
    
    def train_episode(self):
        """
        Train on collected episode using advantage
        """
        # Calculate returns
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        returns = torch.tensor(returns, dtype=torch.float32).to(device)
        
        # Calculate advantages
        # squeeze(-1) keeps shape (T,) even for a one-step episode
        values = torch.cat(self.values).squeeze(-1)
        advantages = returns - values.detach()
        
        # Actor loss (policy gradient with advantage)
        actor_loss = []
        for log_prob, advantage in zip(self.log_probs, advantages):
            actor_loss.append(-log_prob * advantage)
        actor_loss = torch.stack(actor_loss).sum()
        
        # Critic loss (MSE between value and return)
        critic_loss = F.mse_loss(values, returns)
        
        # Total loss
        loss = actor_loss + critic_loss
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear memory
        self.log_probs = []
        self.values = []
        self.rewards = []
        
        return loss.item()

print("✅ Actor-Critic Agent Implemented!")
print("\n🎭 Architecture:")
print("  Actor: Learns policy π(a|s)")
print("  Critic: Learns value V(s)")
print("  Advantage: A = G - V(s) (how much better than expected)")
print("\n💡 This is the foundation of AlphaGo and PPO (ChatGPT)!")

## 🎯 Interactive Exercises

Test your understanding of Deep RL!

### Exercise 1: DQN vs Tabular Q-Learning

**Question:** Why can't we use tabular Q-learning for Atari games?

**Consider:**
- State space size
- Memory requirements
- Generalization needs

<details>
<summary>📖 Click here for answer</summary>

**Why DQN is necessary for Atari:**

1. **Massive state space:**
   - Atari screen: 210×160 pixels × 3 RGB channels = 100,800 values
   - Each value: 0-255 (256 levels)
   - Upper bound on distinct screens: 256^100,800 ≈ 10^242,750 (because 100,800 × log10(256) ≈ 242,750)
   - Atoms in the observable universe: ~10^80, so a table is out of the question!

2. **Memory requirements:**
   - Table: Need to store Q-value for each state-action
   - With 18 actions: 18 × 10^242,750 values
   - Even 1 byte per value → impossible to store!

3. **Never see same state twice:**
   - Pixel-perfect states rarely repeat
   - Tabular: No generalization
   - DQN: Similar screens → similar Q-values

4. **CNN advantages:**
   - Learns visual features (edges, objects)
   - Generalizes across similar situations
   - Compact representation (millions of parameters vs an astronomically large table)

**Key Insight:** Function approximation (neural networks) is essential for high-dimensional state spaces!
</details>

### Exercise 2: Experience Replay Benefits

**Task:** Explain why experience replay improves DQN training

**Hint:** Think about:
- Correlation between consecutive samples
- Data efficiency
- Stability

<details>
<summary>📖 Click here for answer</summary>

**Why Experience Replay Works:**

1. **Breaks temporal correlation:**
   ```
   Without replay:
   [s1→s2→s3→s4]  (highly correlated)
   Network overfits to recent trajectory!
   
   With replay:
   [s47, s2, s95, s12]  (random samples)
   Independent samples → better generalization
   ```

2. **Data efficiency:**
   - Each experience used multiple times
   - Sampled in different mini-batches
   - Better use of expensive interactions

3. **Stability:**
   - Reduces variance in updates
   - Smooths out noisy gradients
   - Mitigates forgetting of older experiences (they keep reappearing in batches)

4. **Batch learning:**
   - Mini-batch gradient descent more stable than online
   - Better hardware utilization (GPU)

**Real Impact:** In the ablations reported in the DQN Nature paper, removing experience replay sharply reduced scores on the tested Atari games!
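
The correlation point can be illustrated with a toy buffer. This is a minimal sketch using only the standard library; `ReplayBuffer`, `push`, and `sample` are illustrative names (not necessarily those used earlier in this notebook), and the transitions are placeholder integers rather than real environment data.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

random.seed(0)
buffer = ReplayBuffer(capacity=100)
for t in range(100):
    buffer.push(t, 0, 1.0, t + 1, False)  # the "state" is just the time step here

online_batch = list(range(96, 100))  # last 4 steps: consecutive, highly correlated
replay_batch = [state for (state, _, _, _, _) in buffer.sample(4)]

print("Online (consecutive) states:", online_batch)
print("Replay (sampled) states:    ", replay_batch)
```

The sampled batch mixes transitions from different parts of the history, which is the property the mini-batch update relies on.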
</details>

### Exercise 3: When to Use Which Algorithm?

**Scenario:** You're choosing an RL algorithm for these problems. Which would you use?

1. **Atari Pong:** Discrete actions, image observations
2. **Robot arm control:** Continuous joint angles
3. **Poker bot:** Needs randomness in strategy
4. **Self-driving car:** High-dimensional continuous control

**Choices:** DQN, Policy Gradients, Actor-Critic (PPO)

<details>
<summary>📖 Click here for answer</summary>

**Algorithm Selection Guide:**

1. **Atari Pong → DQN**
   - Discrete actions (up/down)
   - Image processing (CNN)
   - DQN proven to work well
   - Sample efficient

2. **Robot arm control → Actor-Critic (SAC/TD3)**
   - Continuous actions (joint angles)
   - DQN can't handle continuous actions directly
   - SAC/TD3 designed for robotics
   - Stable learning

3. **Poker bot → Policy Gradients**
   - Needs stochastic policy (randomness)
   - DQN is deterministic (after training)
   - Mixed strategy required (like rock-paper-scissors)
   - Policy gradient naturally stochastic

4. **Self-driving car → Actor-Critic (PPO)**
   - Continuous control (steering, acceleration)
   - High-dimensional observations
   - PPO: stable, robust, proven
   - Common baseline in driving-simulator research (production use by specific companies is not publicly confirmed)

**General Rules:**
- Discrete actions + images → **DQN**
- Continuous actions → **Actor-Critic (PPO/SAC/TD3)**
- Need stochastic policy → **Policy Gradients**
- Sample efficiency matters → **DQN**
- Stability crucial → **PPO**
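
The "need stochastic policy" rule can be made concrete with rock-paper-scissors. This is a minimal sketch with made-up Q-values and policy probabilities: a greedy policy over Q-values always plays the same move, while sampling from a policy's probabilities mixes moves.

```python
import random
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]

def greedy_action(q_values):
    """DQN-style selection after training: always the argmax action."""
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return ACTIONS[best_index]

def sample_action(probabilities, rng):
    """Policy-gradient-style selection: draw an action from pi(a|s)."""
    return rng.choices(ACTIONS, weights=probabilities, k=1)[0]

rng = random.Random(0)
q_values = [0.01, 0.00, -0.01]         # hypothetical, nearly tied Q-values
policy_probs = [1 / 3, 1 / 3, 1 / 3]   # hypothetical learned mixed strategy

greedy_counts = Counter(greedy_action(q_values) for _ in range(3000))
sampled_counts = Counter(sample_action(policy_probs, rng) for _ in range(3000))

print("Greedy: ", dict(greedy_counts))   # one action only, so an opponent can exploit it
print("Sampled:", dict(sampled_counts))  # spread across all three moves
```

Even when the Q-values are almost tied, the argmax collapses to a single move, which is why a value-based greedy policy is a poor fit for games that require mixed strategies.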
</details>

## 🎓 Key Takeaways

**You just learned:**

### 1. **Deep Q-Networks (DQN)**
   - ✅ Q-learning + neural networks
   - ✅ Experience replay breaks correlation
   - ✅ Target network stabilizes training
   - ✅ First to reach human-level play on many Atari games (2013 preprint, 2015 Nature paper)
   - **Used in:** Game AI, discrete control
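
The role of the target network is easiest to see in the TD target itself. This is a minimal sketch in plain Python, with hypothetical Q-value lists standing in for real network outputs; `td_target` is an illustrative helper, not a function defined earlier.

```python
def td_target(reward, next_q_target, done, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

# Hypothetical outputs of the frozen target network for the next state s'.
next_q_target = [1.0, 2.5, 0.3]

y_mid_episode = td_target(reward=1.0, next_q_target=next_q_target, done=False)
y_terminal = td_target(reward=1.0, next_q_target=next_q_target, done=True)

# The online network's prediction Q(s, a) is regressed toward y.
q_online_sa = 2.0
td_error = y_mid_episode - q_online_sa

print(f"Target (non-terminal): {y_mid_episode:.3f}")
print(f"Target (terminal):     {y_terminal:.3f}")
print(f"TD error:              {td_error:.3f}")
```

Because `next_q_target` comes from a copy of the network that is refreshed only periodically, the regression target does not move on every gradient step.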

### 2. **Policy Gradients**
   - ✅ Directly optimize policy π(a|s; θ)
   - ✅ REINFORCE algorithm (Monte Carlo)
   - ✅ Handles continuous actions naturally
   - ✅ Stochastic policies (needed for some games)
   - **Used in:** Robotics, continuous control

### 3. **Actor-Critic Methods**
   - ✅ Combines value + policy approaches
   - ✅ Actor learns policy, Critic evaluates
   - ✅ Lower variance than pure policy gradients
   - ✅ Foundation of modern algorithms (PPO, SAC)
   - **Used in:** ChatGPT (RLHF), robotics, games
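
The way a critic lowers variance can be sketched numerically with the advantage A = G - V(s). This is a minimal example in plain Python, assuming a three-step episode with made-up rewards and critic estimates; `discounted_returns` is an illustrative helper.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1}, working backward from the episode end."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1.0, 1.0, 1.0]         # hypothetical episode rewards
critic_values = [2.5, 2.0, 0.5]   # hypothetical critic estimates V(s_t)

returns = discounted_returns(rewards)
# Advantage A_t = G_t - V(s_t): positive means the outcome beat the critic's expectation.
advantages = [g - v for g, v in zip(returns, critic_values)]

for t, (g, v, a) in enumerate(zip(returns, critic_values, advantages)):
    print(f"t={t}: G={g:.4f}  V={v:.2f}  A={a:+.4f}")
```

REINFORCE weights each log-probability by the raw return G; an actor-critic weights it by the advantage instead, so actions are reinforced only to the extent that they did better than the critic predicted.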

### 4. **Real-World Applications**
   - ✅ CartPole: Classic control benchmark
   - ✅ DQN learns to balance pole
   - ✅ A simplified version of the balance-control problem in legged robots

### 🌟 Real-World Impact (2024-2025):

**What You Can Build:**
- 🎮 **Game AI:** DQN for Atari, board games
- 🤖 **Robotics:** Actor-Critic for manipulation
- 🚗 **Autonomous vehicles:** PPO for driving
- 💬 **LLM fine-tuning:** PPO for RLHF (ChatGPT)
- ✈️ **Drone control:** Continuous control
- 🏭 **Industrial automation:** Process optimization

**Modern Algorithms:**
- **PPO (2017):** ChatGPT fine-tuning, OpenAI Five
- **SAC (2018):** Robotics, continuous control
- **Rainbow DQN (2017):** Combines 6 DQN improvements
- **TD3 (2018):** Twin delayed DDPG for continuous
- **AlphaZero (2017):** Self-play + MCTS + policy networks

### 📊 Algorithm Comparison:

| Feature | DQN | REINFORCE | Actor-Critic | PPO |
|---------|-----|-----------|--------------|-----|
| **Action space** | Discrete | Both | Both | Both |
| **Sample efficiency** | ✅ Good | ❌ Poor | ⚠️ Medium | ⚠️ Medium |
| **Stability** | ⚠️ Medium | ❌ Poor | ✅ Good | ✅ Excellent |
| **Variance** | ✅ Low | ❌ High | ⚠️ Medium | ✅ Low |
| **On-policy** | ❌ No (off-policy, replay) | ✅ Yes | ✅ Yes | ✅ Yes |
| **Best for** | Atari games | Simple problems | Robotics | General-purpose default |
| **Used in** | DeepMind Atari | Research | AlphaGo | ChatGPT |

---

**🎉 Congratulations!** You now understand:
- How DeepMind's DQN mastered Atari
- The algorithms powering modern robotics
- How ChatGPT uses PPO for RLHF
- Deep RL foundations

**Next:** Advanced RL - Multi-agent, AlphaGo, Real-world applications! 🚀

## 🚀 Next Steps

**Practice Exercises:**
1. Implement Double DQN (addresses overestimation bias)
2. Add dueling architecture to DQN
3. Try prioritized experience replay
4. Implement PPO (used in ChatGPT!)
5. Train on Gymnasium environments (try MountainCar, LunarLander)
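
For practice exercise 1, the change from DQN to Double DQN is confined to how the target is computed. This is a minimal sketch in plain Python, with hypothetical Q-value lists for a non-terminal next state standing in for real network outputs; the function names are illustrative.

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]

# Hypothetical Q-values for the next state s' (three actions).
next_q_online = [1.2, 0.9, 1.0]
next_q_target = [1.0, 2.0, 0.8]   # pretend the estimate for action 1 is inflated

y_dqn = dqn_target(1.0, next_q_target)
y_double = double_dqn_target(1.0, next_q_online, next_q_target)

print(f"DQN target:        {y_dqn:.3f}")
print(f"Double DQN target: {y_double:.3f}")
```

Since the evaluated action is chosen by a different network, the Double DQN target can never exceed the plain DQN target for the same inputs, which is how it dampens overestimation.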

**Coming Next:**
- **Day 3:** Advanced RL Applications - Multi-agent, AlphaGo, RL for robotics, game playing, and optimization

---

**💡 Deep Dive Resources:**
- DeepMind DQN paper (Nature 2015)
- Spinning Up in Deep RL (OpenAI)
- Gymnasium (OpenAI Gym successor)
- Stable-Baselines3 (pre-implemented algorithms)
- David Silver's RL Course (YouTube)

**Try It Yourself:**
```bash
pip install gymnasium
pip install stable-baselines3
```

---

*Remember: Deep RL powers game AI, robotics, and LLM fine-tuning. You now know the core algorithms!* 🌟

**🎯 You understand how DeepMind built agents that beat human experts!**