# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/09_reinforcement_learning/09_demo_qlearning.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '09_demo_qlearning.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; the packages are already installed.')

# Chapter 09 - Q-Learning: Introduction to Reinforcement Learning

This notebook demonstrates the Q-Learning algorithm on two variants of the Gym FrozenLake environment: deterministic and slippery.

## Objectives
- Understand the core RL concepts: state, action, reward, policy
- Implement Q-Learning with a Q-table
- Train an agent on FrozenLake, first deterministic and then slippery
- Visualize learning and performance

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gym
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

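## 1. Introduction to Gym

OpenAI Gym provides standardized environments for testing RL algorithms. The code in this notebook follows the classic Gym API (releases before 0.26), in which `env.reset()` returns only the observation and `env.step()` returns four values.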
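The `reset`/`step` signatures differ between Gym releases: before 0.26, `env.reset()` returns only the observation and `env.step()` returns four values, whereas Gym 0.26+ and Gymnasium return `(observation, info)` and five values. The following is a minimal sketch of two helpers that hide this difference. `reset_env` and `step_env` are hypothetical names, not part of Gym, and the cells of this notebook do not use them.

```python
def reset_env(env):
    """Return only the initial observation, whichever API the env follows."""
    result = env.reset()
    # Newer API: reset() returns (observation, info) with info being a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Classic API: reset() returns the observation directly.
    return result


def step_env(env, action):
    """Return (observation, reward, done, info) for both step() conventions."""
    result = env.step(action)
    if len(result) == 5:
        # Newer API: (obs, reward, terminated, truncated, info).
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    # Classic API: already (obs, reward, done, info).
    return tuple(result)
```

With such helpers, a loop written as `state = reset_env(env)` and `state, reward, done, info = step_env(env, action)` would behave the same under either convention. The tuple check in `reset_env` assumes the observation itself is not a `(value, dict)` pair, which holds for FrozenLake, where observations are integers.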

In [None]:
# Create a FrozenLake environment
env = gym.make('FrozenLake-v1', is_slippery=False)  # Deterministic to start with

print("FrozenLake Environment:")
print(f"  State space: {env.observation_space}")
print(f"  Action space: {env.action_space}")
print(f"  Number of states: {env.observation_space.n}")
print(f"  Number of actions: {env.action_space.n}")
print("\nActions: 0=Left, 1=Down, 2=Right, 3=Up")
print("\nGrid: S=Start, F=Frozen, H=Hole, G=Goal")
env.reset()
env.render()

## 2. Random Agent (Baseline)

Let's first test an agent that picks actions at random.

In [None]:
def test_random_agent(env, num_episodes=100):
    """Test a random agent and return its win rate and per-episode rewards."""
    wins = 0
    total_rewards = []
    
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        
        while not done:
            action = env.action_space.sample()  # Random action
            state, reward, done, info = env.step(action)
            episode_reward += reward
        
        total_rewards.append(episode_reward)
        if reward > 0:  # Win
            wins += 1
    
    return wins / num_episodes, total_rewards

random_win_rate, random_rewards = test_random_agent(env, 1000)
print(f"Random Agent Win Rate: {random_win_rate:.2%}")

## 3. Q-Learning Algorithm

### Update rule (derived from the Bellman optimality equation):
$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$

Where:
- $\alpha$ = learning rate
- $\gamma$ = discount factor
- $r$ = reward
- $s, a$ = current state, action
- $s', a'$ = next state, candidate next action
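To make the update rule concrete, here is a small worked example with made-up numbers. The Q-values below are illustrative assumptions, not outputs of this notebook, and `q_update` is a hypothetical helper that applies the rule once.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one Q-learning update and return the new Q(s, a)."""
    # On a terminal transition there is no future value to bootstrap from.
    target = reward if terminal else reward + gamma * max_next_q
    return q_sa + alpha * (target - q_sa)


# Illustrative values: Q(s, a) = 0.5, r = 0, max_a' Q(s', a') = 1.0
new_q = q_update(q_sa=0.5, reward=0.0, max_next_q=1.0)
print(f"{new_q:.3f}")  # 0.5 + 0.1 * (0.99 - 0.5) = 0.549
```

The estimate moves only a fraction $\alpha$ of the way toward the target $r + \gamma \max_{a'} Q(s', a')$, which is why many visits to the same state-action pair are needed before the value settles.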

In [None]:
class QLearningAgent:
    def __init__(self, n_states, n_actions, learning_rate=0.1, discount_factor=0.99, 
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.n_states = n_states
        self.n_actions = n_actions
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        # Q-table: [states x actions]
        self.q_table = np.zeros((n_states, n_actions))
    
    def choose_action(self, state, training=True):
        """Epsilon-greedy action selection."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit
    
    def update(self, state, action, reward, next_state, done):
        """Update the Q-table with the Q-learning (Bellman) update rule."""
        current_q = self.q_table[state, action]
        
        if done:
            # Terminal transition: no future reward
            target_q = reward
        else:
            # Value of the best action in the next state
            max_future_q = np.max(self.q_table[next_state])
            target_q = reward + self.gamma * max_future_q
        
        # Q-learning update
        self.q_table[state, action] += self.lr * (target_q - current_q)
    
    def decay_epsilon(self):
        """Decay epsilon gradually, never going below epsilon_min."""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

print("Q-Learning Agent defined!")

## 4. Training on FrozenLake
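Because epsilon is multiplied by `epsilon_decay` once per episode, the length of the exploration phase can be estimated before training. The sketch below computes how many episodes it takes for epsilon to reach `epsilon_min`, assuming the multiplicative schedule implemented in `decay_epsilon`. `episodes_until_min` is a hypothetical helper, not part of the agent class.

```python
import math


def episodes_until_min(epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Smallest n such that epsilon * epsilon_decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))


# Default agent settings (epsilon_decay=0.995, epsilon_min=0.01)
print(episodes_until_min())  # 919

# A slower schedule with a higher floor (epsilon_decay=0.999, epsilon_min=0.05)
print(episodes_until_min(epsilon_decay=0.999, epsilon_min=0.05))  # 2995
```

With the default settings, epsilon reaches its floor after roughly 900 episodes, so a 5000-episode run spends most of its time almost fully greedy.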

In [None]:
def train_q_learning(env, agent, num_episodes=5000, eval_interval=100):
    """Train the Q-Learning agent."""
    rewards_history = []
    win_rates = []
    epsilons = []
    
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        
        while not done:
            # Choose an action
            action = agent.choose_action(state)
            
            # Execute the action
            next_state, reward, done, info = env.step(action)
            
            # Update the Q-table
            agent.update(state, action, reward, next_state, done)
            
            state = next_state
            episode_reward += reward
        
        # Decay epsilon
        agent.decay_epsilon()
        
        rewards_history.append(episode_reward)
        
        # Periodic evaluation
        if (episode + 1) % eval_interval == 0:
            win_rate = evaluate_agent(env, agent, num_eval=100)
            win_rates.append(win_rate)
            epsilons.append(agent.epsilon)
            
            if (episode + 1) % 1000 == 0:
                print(f"Episode {episode+1}/{num_episodes} - Win Rate: {win_rate:.2%} - Epsilon: {agent.epsilon:.3f}")
    
    return rewards_history, win_rates, epsilons

def evaluate_agent(env, agent, num_eval=100):
    """Evaluate the agent without exploration."""
    wins = 0
    for _ in range(num_eval):
        state = env.reset()
        done = False
        
        while not done:
            action = agent.choose_action(state, training=False)
            state, reward, done, info = env.step(action)
        
        if reward > 0:
            wins += 1
    
    return wins / num_eval

In [None]:
# Create and train the agent
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearningAgent(
    n_states=env.observation_space.n,
    n_actions=env.action_space.n,
    learning_rate=0.1,
    discount_factor=0.99,
    epsilon=1.0,
    epsilon_decay=0.995
)

print("Training Q-Learning Agent on FrozenLake...\n")
rewards, win_rates, epsilons = train_q_learning(env, agent, num_episodes=5000)
print("\nTraining completed!")

## 5. Visualizing the Results

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Win rate progression
eval_episodes = np.arange(100, 5001, 100)
axes[0].plot(eval_episodes, win_rates, linewidth=2)
axes[0].axhline(y=random_win_rate, color='red', linestyle='--', label='Random Agent', linewidth=2)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Win Rate', fontsize=12)
axes[0].set_title('Win Rate Evolution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Epsilon decay
axes[1].plot(eval_episodes, epsilons, linewidth=2, color='green')
axes[1].set_xlabel('Episode', fontsize=12)
axes[1].set_ylabel('Epsilon', fontsize=12)
axes[1].set_title('Exploration Rate (Epsilon)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Rewards distribution (last 1000 episodes)
recent_rewards = rewards[-1000:]
axes[2].hist(recent_rewards, bins=20, edgecolor='black', alpha=0.7)
axes[2].set_xlabel('Reward', fontsize=12)
axes[2].set_ylabel('Frequency', fontsize=12)
axes[2].set_title('Rewards Distribution (Last 1000 Episodes)', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 6. Visualizing the Q-Table
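For the deterministic map, the max Q-values plotted in this section can be checked by hand. The only reward is 1 on reaching the goal, so under an optimal policy a state that is $k$ moves from the goal has a max Q-value of $\gamma^{k-1}$. The sketch below tabulates this, assuming the default 4x4 map (where the start state is 6 moves from the goal) and the `discount_factor=0.99` used for the agent.

```python
gamma = 0.99  # discount factor used by the agent in this notebook

# k = number of moves still needed to reach the goal under an optimal policy.
expected_max_q = {k: gamma ** (k - 1) for k in range(1, 7)}

for k, value in expected_max_q.items():
    print(f"{k} move(s) from goal: max Q = {value:.4f}")
```

The start state should therefore approach $0.99^5 \approx 0.951$ after enough training. Entries for states the agent rarely visits may stay below these values, and holes and the goal keep a value of 0 because no action is ever updated from them.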

In [None]:
# Reshape the per-state max Q-values into the 4x4 grid for visualization
q_table_grid = agent.q_table.max(axis=1).reshape(4, 4)

plt.figure(figsize=(8, 7))
sns.heatmap(q_table_grid, annot=True, fmt='.2f', cmap='YlGnBu', 
            cbar_kws={'label': 'Max Q-Value'})
plt.title('Q-Table Visualization (Max Q-Value per State)', fontsize=14, fontweight='bold')
plt.xlabel('Column', fontsize=12)
plt.ylabel('Row', fontsize=12)
plt.show()

# Learned policy (best action per state)
action_names = ['←', '↓', '→', '↑']
policy = np.array([action_names[a] for a in agent.q_table.argmax(axis=1)]).reshape(4, 4)

fig, ax = plt.subplots(figsize=(8, 7))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=policy, cellLoc='center', loc='center',
                colWidths=[0.2]*4, cellColours=[['lightblue']*4]*4)
table.auto_set_font_size(False)
table.set_fontsize(20)
table.scale(1, 3)
plt.title('Learned Policy (Best Action per State)', fontsize=14, fontweight='bold', pad=20)
plt.show()

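## 7. Test on Slippery FrozenLake

A stochastic environment in which the agent can slip: with `is_slippery=True`, the intended move is executed with probability 1/3, and each of the two perpendicular moves with probability 1/3.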
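The slip mechanism can be sketched as a standalone function, written here independently of Gym rather than as a call into it: the executed action is the intended one or one of its two perpendicular neighbours, each with probability 1/3. `slippery_action` is a hypothetical helper name.

```python
import random


def slippery_action(intended, rng=random):
    """Sample the action actually executed on slippery ice.

    Actions are encoded as 0=Left, 1=Down, 2=Right, 3=Up.
    """
    # The intended action and its two perpendicular neighbours are equally likely.
    candidates = [(intended - 1) % 4, intended, (intended + 1) % 4]
    return rng.choice(candidates)


rng = random.Random(0)
samples = [slippery_action(1, rng) for _ in range(3000)]  # always intend "Down"
print({a: samples.count(a) for a in sorted(set(samples))})
```

When the agent intends to go Down, it ends up going Left, Down, or Right about equally often, and never Up. This is why a policy that is optimal on the deterministic map can fail often on the slippery one.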

In [None]:
# Environment with slipping
env_slippery = gym.make('FrozenLake-v1', is_slippery=True)

agent_slippery = QLearningAgent(
    n_states=env_slippery.observation_space.n,
    n_actions=env_slippery.action_space.n,
    learning_rate=0.1,
    discount_factor=0.99,
    epsilon=1.0,
    epsilon_decay=0.999,  # Slower decay for the stochastic environment
    epsilon_min=0.05
)

print("Training on Slippery FrozenLake...\n")
rewards_slip, win_rates_slip, epsilons_slip = train_q_learning(
    env_slippery, agent_slippery, num_episodes=10000, eval_interval=200
)
print("\nTraining completed!")

In [None]:
# Comparison: deterministic vs slippery
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Win rates
eval_episodes_det = np.arange(100, 5001, 100)
eval_episodes_slip = np.arange(200, 10001, 200)

axes[0].plot(eval_episodes_det, win_rates, label='Deterministic', linewidth=2)
axes[0].plot(eval_episodes_slip, win_rates_slip, label='Slippery', linewidth=2)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Win Rate', fontsize=12)
axes[0].set_title('Win Rate Comparison', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Q-tables comparison
q_diff = np.abs(agent.q_table - agent_slippery.q_table).max(axis=1).reshape(4, 4)
sns.heatmap(q_diff, annot=True, fmt='.2f', cmap='Reds', ax=axes[1],
            cbar_kws={'label': 'Q-Value Difference'})
axes[1].set_title('Q-Table Difference (Abs Max)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

## 8. Demonstrating the Trained Agent

In [None]:
def demonstrate_agent(env, agent, num_demos=5):
    """Show a few episodes of the trained agent."""
    action_names = ['Left', 'Down', 'Right', 'Up']
    
    for demo in range(num_demos):
        print(f"\n{'='*50}")
        print(f"Demo {demo + 1}")
        print('='*50)
        
        state = env.reset()
        done = False
        total_reward = 0
        steps = 0
        
        print(f"Initial state: {state}")
        
        while not done and steps < 100:
            action = agent.choose_action(state, training=False)
            next_state, reward, done, info = env.step(action)
            
            print(f"Step {steps+1}: Action={action_names[action]}, Next State={next_state}, Reward={reward}")
            
            state = next_state
            total_reward += reward
            steps += 1
        
        result = "WIN!" if total_reward > 0 else "LOSS"
        print(f"\nResult: {result} - Total Reward: {total_reward} - Steps: {steps}")

demonstrate_agent(env, agent, num_demos=3)

## 9. Convergence Analysis

In [None]:
# Moving average of the win rate
def moving_average(data, window=10):
    return np.convolve(data, np.ones(window)/window, mode='valid')

# Convert training rewards into a win rate per 100 episodes (exploration included)
win_rate_per_100 = []
for i in range(0, len(rewards), 100):
    batch = rewards[i:i+100]
    win_rate_per_100.append(sum(batch) / len(batch))

ma_win_rate = moving_average(win_rate_per_100, window=5)

plt.figure(figsize=(12, 5))
plt.plot(win_rate_per_100, alpha=0.3, label='Win Rate (per 100 episodes)')
plt.plot(np.arange(4, len(ma_win_rate)+4), ma_win_rate, linewidth=2, label='Moving Average (5)')
plt.axhline(y=1.0, color='green', linestyle='--', label='Optimal', linewidth=2)
plt.xlabel('Batch (100 episodes)', fontsize=12)
plt.ylabel('Win Rate', fontsize=12)
plt.title('Learning Convergence Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Conclusion

### What we learned:
1. Fundamental RL concepts: state, action, reward, policy
2. The Q-Learning algorithm and the Bellman equation
3. Exploration vs exploitation (epsilon-greedy)
4. The difference between deterministic and stochastic environments
5. Visualizing learning and the learned policy

### Key points:
- Tabular Q-Learning converges to the optimal policy, provided every state-action pair keeps being visited and the learning rate is decayed appropriately
- Epsilon must decay to shift from exploration to exploitation
- Stochastic environments require more exploration
- The Q-table stores a value for each (state, action) pair

### Limitations of Q-Learning:
- Limited to small, discrete state spaces
- Does not scale to complex environments
- Solution: Deep Q-Learning (DQN) for large or continuous state spaces
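To see why a Q-table does not scale, consider discretizing a continuous observation so that it can index a table. The sketch below is an illustration with assumed bounds and bin counts (they are not CartPole's actual limits), and `discretize` is a hypothetical helper.

```python
import numpy as np

n_bins = 10                                  # bins per dimension (assumed)
low = np.array([-1.0, -1.0, -1.0, -1.0])     # assumed lower bounds
high = np.array([1.0, 1.0, 1.0, 1.0])        # assumed upper bounds

# Interior bin edges per dimension: n_bins - 1 edges delimit n_bins bins.
edges = [np.linspace(l, h, n_bins + 1)[1:-1] for l, h in zip(low, high)]


def discretize(obs):
    """Map a continuous observation to a single Q-table row index."""
    idx = tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))
    return int(np.ravel_multi_index(idx, (n_bins,) * len(idx)))


n_states = n_bins ** len(low)
print(n_states)                               # 10 bins on 4 dimensions -> 10000 rows
print(discretize([-1.0, -1.0, -1.0, -1.0]))   # first cell of the grid -> 0
```

Ten bins on four dimensions already give 10,000 rows, and the count grows exponentially with the number of dimensions, which is what motivates function approximation such as DQN.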

### Going further:
- Deep Q-Network (DQN) for CartPole, Atari
- Policy gradient methods (REINFORCE, A3C)
- Actor-Critic algorithms
- Multi-agent RL