# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/09_reinforcement_learning/09_demo_dqn.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '09_demo_dqn.ipynb'  # Replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q "gymnasium[classic-control]"

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Chapter 09 - Deep Q-Network (DQN)

This notebook demonstrates the DQN algorithm on environments with continuous state spaces.

## Objectives
- Understand the limitations of tabular Q-Learning
- Implement DQN with PyTorch
- Use Experience Replay and a Target Network
- Train an agent on CartPole-v1
- Visualize learning and performance

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gymnasium as gym  # maintained successor of gym; the package installed by the setup cell
from collections import deque, namedtuple
import random
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

## 1. CartPole Environment

CartPole: balance a pole on a moving cart.

- State: 4 continuous values (cart position, cart velocity, pole angle, pole angular velocity)
- Actions: 2 discrete actions (push left, push right)
- Reward: +1 for every timestep the pole stays upright
- Terminal: the pole falls (|angle| > 12°) or the cart leaves the track (|position| > 2.4); CartPole-v1 also truncates episodes after 500 steps
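
The termination rule can be sketched in plain Python. The thresholds used here (2.4 for the cart position, 12° for the pole angle) are the ones documented for CartPole; the helper name `is_terminal` is hypothetical and is not part of the environment's API, which applies this check internally.

```python
import math

# Thresholds documented for CartPole (restated here as an assumption;
# the environment enforces them internally).
X_THRESHOLD = 2.4                      # cart position limit
THETA_THRESHOLD = math.radians(12.0)   # pole angle limit, in radians


def is_terminal(state):
    """Return True if a (position, velocity, angle, angular velocity) state ends the episode."""
    x, _, theta, _ = state
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_terminal((0.0, 0.0, 0.05, 0.0)))   # upright and centered
print(is_terminal((0.0, 0.0, 0.25, 0.0)))   # pole tilted by roughly 14 degrees
print(is_terminal((2.5, 0.0, 0.0, 0.0)))    # cart off the track
```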

In [None]:
env = gym.make('CartPole-v1')

print("CartPole Environment:")
print(f"  State space: {env.observation_space}")
print(f"  State shape: {env.observation_space.shape}")
print(f"  Action space: {env.action_space}")
print(f"  Number of actions: {env.action_space.n}")
print("\nState variables:")
print("  0: Cart Position")
print("  1: Cart Velocity")
print("  2: Pole Angle")
print("  3: Pole Angular Velocity")
print("\nActions: 0=Push Left, 1=Push Right")

## 2. Neural Network for the Q-Function

Instead of a Q-table, we use a neural network to approximate Q(s, a).
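
To make the limitation of a Q-table concrete, the sketch below counts how many entries a table would need if each of the 4 state variables were discretized into a fixed number of bins, and compares that with the parameter count of a 4-128-128-2 fully connected network (the layout used in this notebook). The helper names `q_table_entries` and `mlp_parameters` are illustrative assumptions.

```python
def q_table_entries(bins_per_dim, state_dim=4, n_actions=2):
    """Number of Q-table cells when each state dimension is split into bins_per_dim bins."""
    return (bins_per_dim ** state_dim) * n_actions


def mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


for bins in (10, 50, 100):
    print(f"{bins:>3} bins per dimension -> {q_table_entries(bins):,} table entries")

print(f"MLP 4-128-128-2 -> {mlp_parameters([4, 128, 128, 2]):,} parameters")
```

The table grows as a power of the resolution, while the network size is independent of how finely states are resolved; the network also generalizes across nearby states instead of learning each cell separately.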

In [None]:
class DQN(nn.Module):
    """Deep Q-Network."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(DQN, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state):
        """Forward pass.
        
        Args:
            state: (batch_size, state_dim)
        Returns:
            Q-values: (batch_size, action_dim)
        """
        return self.network(state)

# Test the model
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

test_network = DQN(state_dim, action_dim)
print(f"\nDQN Architecture:")
print(test_network)
print(f"\nTotal parameters: {sum(p.numel() for p in test_network.parameters()):,}")

## 3. Experience Replay Buffer

Stores transitions and samples them at random, which breaks the temporal correlation between consecutive steps and improves training stability.

In [None]:
# Transition structure
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))

class ReplayBuffer:
    """Experience Replay Buffer."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, next_state, reward, done):
        """Add a transition."""
        self.buffer.append(Transition(state, action, next_state, reward, done))
    
    def sample(self, batch_size):
        """Sample a random batch."""
        transitions = random.sample(self.buffer, batch_size)
        batch = Transition(*zip(*transitions))
        
        # Convert to tensors
        states = torch.FloatTensor(np.array(batch.state)).to(device)
        actions = torch.LongTensor(batch.action).to(device)
        next_states = torch.FloatTensor(np.array(batch.next_state)).to(device)
        rewards = torch.FloatTensor(batch.reward).to(device)
        dones = torch.FloatTensor(batch.done).to(device)
        
        return states, actions, next_states, rewards, dones
    
    def __len__(self):
        return len(self.buffer)

print("Replay Buffer implemented!")
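
As a small illustration of the buffer mechanics, the sketch below uses only the standard library to show how `deque(maxlen=capacity)` silently evicts the oldest transitions and how `zip(*batch)` regroups sampled transitions by field. The dummy string states and the capacity of 3 are assumptions chosen for readability.

```python
import random
from collections import deque

random.seed(0)

capacity = 3
buffer = deque(maxlen=capacity)

# Push five dummy transitions; only the three most recent survive.
for step in range(5):
    buffer.append((f"s{step}", step % 2, f"s{step + 1}", 1.0, False))

print(len(buffer))                 # 3
print([t[0] for t in buffer])      # ['s2', 's3', 's4']

# Uniform sampling without replacement, then regroup the fields column-wise.
batch = random.sample(buffer, 2)
states, actions, next_states, rewards, dones = zip(*batch)
print(states, rewards)
```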

## 4. DQN Agent

In [None]:
class DQNAgent:
    """DQN agent with Experience Replay and a Target Network."""
    def __init__(self, state_dim, action_dim, learning_rate=0.001, gamma=0.99,
                 epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995,
                 buffer_size=10000, batch_size=64, target_update_freq=10):
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        
        # Networks: policy and target
        self.policy_net = DQN(state_dim, action_dim).to(device)
        self.target_net = DQN(state_dim, action_dim).to(device)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()  # Target network in eval mode
        
        # Optimizer
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=learning_rate)
        
        # Replay buffer
        self.replay_buffer = ReplayBuffer(buffer_size)
        
        # Counter used to schedule target-network updates
        self.update_count = 0
    
    def select_action(self, state, training=True):
        """Epsilon-greedy action selection."""
        if training and random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        else:
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
                q_values = self.policy_net(state_tensor)
                return q_values.argmax().item()
    
    def update(self):
        """Update the policy network with a batch from the replay buffer."""
        if len(self.replay_buffer) < self.batch_size:
            return None
        
        # Sample batch
        states, actions, next_states, rewards, dones = self.replay_buffer.sample(self.batch_size)
        
        # Compute current Q-values
        current_q_values = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Compute target Q-values with the target network
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0]
            target_q_values = rewards + (1 - dones) * self.gamma * next_q_values
        
        # Loss (MSE)
        loss = nn.MSELoss()(current_q_values, target_q_values)
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1.0)
        self.optimizer.step()
        
        # Periodically sync the target network (every target_update_freq gradient steps)
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        
        return loss.item()
    
    def decay_epsilon(self):
        """Decay epsilon."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)

print("DQN Agent implemented!")
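
The target used in `update` can be checked by hand on a single transition. The sketch below is a scalar, pure-Python version of `rewards + (1 - dones) * gamma * next_q_values`; the function name `td_target` and the numeric Q-values are made up for illustration.

```python
def td_target(reward, done, next_q_values, gamma=0.99):
    """One-step DQN target: r + (1 - done) * gamma * max over a' of Q_target(s', a')."""
    return reward + (1.0 - done) * gamma * max(next_q_values)


# Non-terminal transition: bootstrap from the best next action (1 + 0.99 * 12).
print(td_target(reward=1.0, done=0.0, next_q_values=[10.0, 12.0]))

# Terminal transition: the bootstrap term is masked out, leaving only the reward.
print(td_target(reward=1.0, done=1.0, next_q_values=[10.0, 12.0]))
```

The `(1 - done)` mask is what prevents the agent from crediting future value to a state from which no further reward can be collected.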

## 5. Training

In [None]:
def train_dqn(env, agent, num_episodes=500, eval_interval=10):
    """Train the DQN agent."""
    episode_rewards = []
    episode_lengths = []
    losses = []
    eval_rewards = []
    epsilons = []
    
    for episode in range(num_episodes):
        state, _ = env.reset()  # Gymnasium returns (observation, info)
        done = False
        episode_reward = 0
        episode_length = 0
        episode_losses = []
        
        while not done:
            # Select action
            action = agent.select_action(state)
            
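            # Execute action (Gymnasium API: terminated = failure, truncated = time limit)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            
            # Store transition; only a true termination disables bootstrapping
            agent.replay_buffer.push(state, action, next_state, reward, float(terminated))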
            
            # Update
            loss = agent.update()
            if loss is not None:
                episode_losses.append(loss)
            
            state = next_state
            episode_reward += reward
            episode_length += 1
        
        # Decay epsilon
        agent.decay_epsilon()
        
        # Logging
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
        if episode_losses:
            losses.append(np.mean(episode_losses))
        
        # Evaluation
        if (episode + 1) % eval_interval == 0:
            eval_reward = evaluate_agent(env, agent, num_eval=5)
            eval_rewards.append(eval_reward)
            epsilons.append(agent.epsilon)
            
            print(f"Episode {episode+1}/{num_episodes} - "
                  f"Reward: {episode_reward:.1f} - "
                  f"Eval Reward: {eval_reward:.1f} - "
                  f"Epsilon: {agent.epsilon:.3f}")
    
    return episode_rewards, episode_lengths, losses, eval_rewards, epsilons

def evaluate_agent(env, agent, num_eval=10):
    """Evaluate the agent greedily (no exploration)."""
    total_reward = 0
    for _ in range(num_eval):
        state, _ = env.reset()
        done = False
        
        while not done:
            action = agent.select_action(state, training=False)
            state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total_reward += reward
    
    return total_reward / num_eval

In [None]:
# Create the agent
agent = DQNAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    learning_rate=0.001,
    gamma=0.99,
    epsilon_start=1.0,
    epsilon_end=0.01,
    epsilon_decay=0.995,
    buffer_size=10000,
    batch_size=64,
    target_update_freq=10
)

# Train
print("Training DQN Agent on CartPole-v1...\n")
rewards, lengths, losses, eval_rewards, epsilons = train_dqn(
    env, agent, num_episodes=500, eval_interval=10
)
print("\nTraining completed!")
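
It can help to know where the exploration rate ends up with these settings. The sketch below replays the multiplicative schedule from `decay_epsilon` (one decay per episode) in plain Python; the helper names `epsilon_after` and `episodes_to_floor` are illustrative assumptions.

```python
def epsilon_after(episodes, start=1.0, end=0.01, decay=0.995):
    """Epsilon after a given number of per-episode decay steps."""
    eps = start
    for _ in range(episodes):
        eps = max(end, eps * decay)
    return eps


def episodes_to_floor(start=1.0, end=0.01, decay=0.995):
    """Number of episodes before epsilon is clamped at its floor."""
    eps, n = start, 0
    while eps > end:
        eps = max(end, eps * decay)
        n += 1
    return n


print(f"epsilon after 500 episodes: {epsilon_after(500):.3f}")
print(f"episodes needed to reach the floor: {episodes_to_floor()}")
```

With `epsilon_decay=0.995`, epsilon is still about 0.08 after 500 episodes and would only reach `epsilon_end=0.01` after roughly 900 episodes, so training rewards stay noisier than the greedy evaluation rewards for the whole run.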

## 6. Visualizing the Results

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Episode rewards
axes[0, 0].plot(rewards, alpha=0.3, label='Episode Reward')
# Moving average
window = 20
if len(rewards) >= window:
    ma_rewards = np.convolve(rewards, np.ones(window)/window, mode='valid')
    axes[0, 0].plot(np.arange(window-1, len(rewards)), ma_rewards, 
                    linewidth=2, label=f'MA({window})')
axes[0, 0].axhline(y=475, color='green', linestyle='--', label='Solved (475)', linewidth=2)  # CartPole-v1 threshold (195 applies to v0)
axes[0, 0].set_xlabel('Episode', fontsize=12)
axes[0, 0].set_ylabel('Reward', fontsize=12)
axes[0, 0].set_title('Training Rewards', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Evaluation rewards
eval_episodes = np.arange(10, 501, 10)
axes[0, 1].plot(eval_episodes, eval_rewards, linewidth=2, marker='o')
axes[0, 1].axhline(y=475, color='green', linestyle='--', label='Solved (475)', linewidth=2)  # CartPole-v1 threshold (195 applies to v0)
axes[0, 1].set_xlabel('Episode', fontsize=12)
axes[0, 1].set_ylabel('Average Reward', fontsize=12)
axes[0, 1].set_title('Evaluation Performance', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Loss
if losses:
    axes[1, 0].plot(losses, linewidth=1, alpha=0.7)
    axes[1, 0].set_xlabel('Episode', fontsize=12)
    axes[1, 0].set_ylabel('Loss', fontsize=12)
    axes[1, 0].set_title('Training Loss (MSE)', fontsize=14, fontweight='bold')
    axes[1, 0].grid(True, alpha=0.3)

# Epsilon decay
axes[1, 1].plot(eval_episodes, epsilons, linewidth=2, color='purple')
axes[1, 1].set_xlabel('Episode', fontsize=12)
axes[1, 1].set_ylabel('Epsilon', fontsize=12)
axes[1, 1].set_title('Exploration Rate', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
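
The moving-average curve in the first panel relies on `np.convolve(..., mode='valid')`, which only returns values where the window fits entirely inside the data. The sketch below reproduces that behaviour in plain Python on a toy list, to show why the x-axis starts at `window - 1`; the helper name `moving_average` and the toy values are assumptions.

```python
def moving_average(values, window):
    """Mean of each full window, equivalent to a 'valid' convolution with a uniform kernel."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


rewards_demo = [10, 20, 30, 40, 50]
window_demo = 3

ma = moving_average(rewards_demo, window_demo)
x = list(range(window_demo - 1, len(rewards_demo)))  # x-positions used when plotting

print(ma)  # [20.0, 30.0, 40.0]
print(x)   # [2, 3, 4]
```

Each average is plotted at the index of the last episode in its window, so the smoothed curve never uses rewards from the future.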

## 7. Testing the Trained Agent

In [None]:
# Test final
test_rewards = []
num_tests = 100

for _ in range(num_tests):
    state, _ = env.reset()  # Gymnasium returns (observation, info)
    done = False
    episode_reward = 0
    
    while not done:
        action = agent.select_action(state, training=False)
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_reward += reward
    
    test_rewards.append(episode_reward)

print(f"\nFinal Test Results ({num_tests} episodes):")
print(f"  Mean Reward: {np.mean(test_rewards):.2f}")
print(f"  Std Reward: {np.std(test_rewards):.2f}")
print(f"  Min Reward: {np.min(test_rewards):.2f}")
print(f"  Max Reward: {np.max(test_rewards):.2f}")
print(f"  Success Rate (>=475): {(np.array(test_rewards) >= 475).mean():.2%}")  # CartPole-v1 threshold

# Distribution
plt.figure(figsize=(10, 5))
plt.hist(test_rewards, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(x=475, color='green', linestyle='--', label='Solved Threshold', linewidth=2)  # CartPole-v1 threshold
plt.axvline(x=np.mean(test_rewards), color='red', linestyle='--', label='Mean', linewidth=2)
plt.xlabel('Reward', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Test Rewards Distribution', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

## 8. Q-Value Analysis

In [None]:
# Analyze the Q-values over sampled initial states
def analyze_q_values(agent, num_samples=100):
    """Analyze the distribution of Q-values."""
    all_q_values = []
    
    for _ in range(num_samples):
        state, _ = env.reset()  # Gymnasium returns (observation, info)
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        with torch.no_grad():
            q_values = agent.policy_net(state_tensor).cpu().numpy()[0]
            all_q_values.append(q_values)
    
    all_q_values = np.array(all_q_values)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Distribution per action
    for action in range(action_dim):
        axes[0].hist(all_q_values[:, action], bins=20, alpha=0.6, 
                     label=f'Action {action}')
    axes[0].set_xlabel('Q-Value', fontsize=12)
    axes[0].set_ylabel('Frequency', fontsize=12)
    axes[0].set_title('Q-Values Distribution by Action', fontsize=14, fontweight='bold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Boxplot
    axes[1].boxplot([all_q_values[:, i] for i in range(action_dim)],
                     labels=[f'Action {i}' for i in range(action_dim)])
    axes[1].set_ylabel('Q-Value', fontsize=12)
    axes[1].set_title('Q-Values Boxplot', fontsize=14, fontweight='bold')
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    return all_q_values

q_values = analyze_q_values(agent, num_samples=200)

## 9. Comparison with a Random Agent

In [None]:
# Random agent
random_rewards = []
for _ in range(100):
    state, _ = env.reset()
    done = False
    episode_reward = 0
    
    while not done:
        action = env.action_space.sample()
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        episode_reward += reward
    
    random_rewards.append(episode_reward)

# Comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histograms
axes[0].hist(random_rewards, bins=20, alpha=0.6, label='Random', edgecolor='black')
axes[0].hist(test_rewards, bins=20, alpha=0.6, label='DQN', edgecolor='black')
axes[0].set_xlabel('Reward', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Rewards Comparison', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Boxplot
axes[1].boxplot([random_rewards, test_rewards], labels=['Random', 'DQN'])
axes[1].axhline(y=475, color='green', linestyle='--', label='Solved (475)', linewidth=2)  # CartPole-v1 threshold (195 applies to v0)
axes[1].set_ylabel('Reward', fontsize=12)
axes[1].set_title('Performance Comparison', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\nRandom Agent: {np.mean(random_rewards):.2f} ± {np.std(random_rewards):.2f}")
print(f"DQN Agent: {np.mean(test_rewards):.2f} ± {np.std(test_rewards):.2f}")
print(f"Improvement: {((np.mean(test_rewards) - np.mean(random_rewards)) / np.mean(random_rewards) * 100):.1f}%")

## Conclusion

### What we learned:
1. Deep Q-Networks handle continuous state spaces
2. Experience Replay breaks the temporal correlation between samples
3. A Target Network stabilizes learning
4. Epsilon-greedy balances exploration and exploitation
5. Gradient clipping improves stability

### DQN improvements:
- **Double DQN**: Reduces the overestimation of Q-values
- **Dueling DQN**: Separates state value and advantage
- **Prioritized Experience Replay**: Samples important transitions more often
- **Rainbow DQN**: Combines these improvements with several others
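
As a rough illustration of the first item, the sketch below contrasts the vanilla DQN target with the Double DQN target on one transition, in plain Python. The function names and the Q-values are invented for the example; in the agent from this notebook the same change would be made on batched tensors inside `update`.

```python
def dqn_target(reward, done, target_q_next, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + (1.0 - done) * gamma * max(target_q_next)


def double_dqn_target(reward, done, policy_q_next, target_q_next, gamma=0.99):
    """Double DQN: the policy network selects the action, the target network evaluates it."""
    best_action = max(range(len(policy_q_next)), key=policy_q_next.__getitem__)
    return reward + (1.0 - done) * gamma * target_q_next[best_action]


# Made-up next-state Q-values for two actions.
policy_q_next = [5.0, 4.0]   # policy network prefers action 0
target_q_next = [3.0, 6.0]   # target network assigns a high value to action 1

print(dqn_target(1.0, 0.0, target_q_next))                        # bootstraps from 6.0
print(double_dqn_target(1.0, 0.0, policy_q_next, target_q_next))  # bootstraps from 3.0
```

Decoupling action selection from action evaluation means a single noisy, overly optimistic estimate is less likely to be propagated through the max operator.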

### Limitations:
- Discrete actions only
- Can be unstable on some environments
- Requires many environment interactions

### Going further:
- Policy Gradient methods (REINFORCE, A2C, PPO)
- Actor-Critic algorithms (DDPG, TD3, SAC)
- Multi-agent RL
- Model-based RL