# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/10_algorithmes_genetiques/10_demo_applications.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '10_demo_applications.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Demo: Advanced Applications of Genetic Algorithms

This notebook explores practical applications of genetic algorithms (GAs) in machine learning:
1. **Hyperparameter Tuning**: optimizing the hyperparameters of a RandomForest
2. **Feature Selection**: automatically selecting the relevant features
3. **Comparison**: GA vs GridSearch vs RandomSearch

**Datasets**: Iris, Breast Cancer (scikit-learn)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
import warnings
warnings.filterwarnings('ignore')

# Plotting configuration
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 1. Generic Genetic Algorithm Class

In [None]:
class GeneticAlgorithm:
    """Generic genetic algorithm for optimization."""
    
    def __init__(self, fitness_func, bounds, pop_size=50, generations=30, 
                 mutation_rate=0.1, crossover_rate=0.8, elitism=0.1):
        """
        Args:
            fitness_func: Fitness function (to maximize)
            bounds: List of (min, max) tuples, one per parameter
            pop_size: Population size
            generations: Number of generations
            mutation_rate: Per-gene mutation probability
            crossover_rate: Probability of applying crossover to a pair of parents
            elitism: Fraction of the population preserved as elite
        """
        self.fitness_func = fitness_func
        self.bounds = bounds
        self.n_params = len(bounds)
        self.pop_size = pop_size
        self.generations = generations
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.elitism = elitism
        self.n_elite = int(pop_size * elitism)
        
        # History
        self.best_fitness_history = []
        self.avg_fitness_history = []
        self.best_individual = None
        self.best_fitness = -np.inf
    
    def initialize_population(self):
        """Initializes the population at random."""
        population = []
        for _ in range(self.pop_size):
            individual = [np.random.uniform(low, high) for low, high in self.bounds]
            population.append(individual)
        return np.array(population)
    
    def evaluate_population(self, population):
        """Evaluates the fitness of the whole population."""
        fitness_values = []
        for individual in population:
            fitness = self.fitness_func(individual)
            fitness_values.append(fitness)
        return np.array(fitness_values)
    
    def selection(self, population, fitness_values):
        """Tournament selection."""
        tournament_size = 3
        selected_idx = np.random.choice(len(population), tournament_size, replace=False)
        tournament_fitness = fitness_values[selected_idx]
        winner_idx = selected_idx[np.argmax(tournament_fitness)]
        return population[winner_idx]
    
    def crossover(self, parent1, parent2):
        """Uniform crossover."""
        if np.random.rand() > self.crossover_rate:
            return parent1.copy(), parent2.copy()
        
        child1, child2 = parent1.copy(), parent2.copy()
        for i in range(self.n_params):
            if np.random.rand() < 0.5:
                child1[i], child2[i] = child2[i], child1[i]
        return child1, child2
    
    def mutate(self, individual):
        """Gaussian mutation."""
        for i in range(self.n_params):
            if np.random.rand() < self.mutation_rate:
                low, high = self.bounds[i]
                mutation = np.random.normal(0, (high - low) * 0.1)
                individual[i] = np.clip(individual[i] + mutation, low, high)
        return individual
    
    def evolve(self, verbose=True):
        """Runs the genetic algorithm."""
        population = self.initialize_population()
        
        for gen in range(self.generations):
            # Evaluation
            fitness_values = self.evaluate_population(population)
            
            # Statistics
            best_idx = np.argmax(fitness_values)
            best_gen_fitness = fitness_values[best_idx]
            avg_gen_fitness = np.mean(fitness_values)
            
            self.best_fitness_history.append(best_gen_fitness)
            self.avg_fitness_history.append(avg_gen_fitness)
            
            if best_gen_fitness > self.best_fitness:
                self.best_fitness = best_gen_fitness
                self.best_individual = population[best_idx].copy()
            
            if verbose and gen % 5 == 0:
                print(f"Gen {gen:3d} | Best: {best_gen_fitness:.4f} | Avg: {avg_gen_fitness:.4f}")
            
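            # Elitism: slice from the front so that n_elite == 0 keeps nobody
            # (a [-0:] slice would select the whole population)
            elite_indices = np.argsort(fitness_values)[len(fitness_values) - self.n_elite:]
            elite = population[elite_indices]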
            
            # New generation
            new_population = list(elite)
            
            while len(new_population) < self.pop_size:
                parent1 = self.selection(population, fitness_values)
                parent2 = self.selection(population, fitness_values)
                child1, child2 = self.crossover(parent1, parent2)
                child1 = self.mutate(child1)
                child2 = self.mutate(child2)
                new_population.extend([child1, child2])
            
            population = np.array(new_population[:self.pop_size])
        
        if verbose:
            print(f"\nBest individual: {self.best_individual}")
            print(f"Best fitness: {self.best_fitness:.4f}")
        
        return self.best_individual, self.best_fitness
    
    def plot_convergence(self):
        """Plots the algorithm's convergence."""
        plt.figure(figsize=(10, 6))
        plt.plot(self.best_fitness_history, label='Best Fitness', linewidth=2)
        plt.plot(self.avg_fitness_history, label='Average Fitness', linewidth=2, alpha=0.7)
        plt.xlabel('Generation')
        plt.ylabel('Fitness (Accuracy)')
        plt.title('Genetic Algorithm Convergence')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

print("GeneticAlgorithm class created!")

## 2. Application 1: Hyperparameter Tuning with a GA

Let's optimize the hyperparameters of a **RandomForest** on the Iris dataset:
- `n_estimators`: number of trees [10, 200]
- `max_depth`: maximum depth [2, 20]
- `min_samples_split`: minimum samples required to split a node [2, 20]
- `min_samples_leaf`: minimum samples per leaf [1, 10]
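
The GA in this notebook evolves real-valued genes, while these four hyperparameters are integers. The sketch below shows one possible decoding step under that assumption: each gene is clipped to its range and truncated with `int()`. The helper name `decode_individual` and the `BOUNDS` dictionary are illustrative and do not appear in the notebook.

```python
import numpy as np

# Search ranges taken from the list above: (low, high) per hyperparameter.
BOUNDS = {
    "n_estimators": (10, 200),
    "max_depth": (2, 20),
    "min_samples_split": (2, 20),
    "min_samples_leaf": (1, 10),
}

def decode_individual(genes):
    """Map a real-valued chromosome to integer RandomForest hyperparameters.

    Hypothetical helper: each gene is clipped to its range, then truncated.
    """
    params = {}
    for gene, (name, (low, high)) in zip(genes, BOUNDS.items()):
        params[name] = int(np.clip(gene, low, high))
    return params

example = decode_individual([57.9, 7.2, 2.0, 3.99])
print(example)
```

Because truncation maps every gene in `[57, 58)` to the same integer, many distinct individuals describe the same model, which is worth keeping in mind when reading the convergence curves.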

In [None]:
# Load the Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

print("Iris dataset:")
print(f"  Train: {X_train_iris.shape}, Test: {X_test_iris.shape}")
print(f"  Classes: {np.unique(y_iris)}")

In [None]:
# Fitness function for hyperparameter tuning
def fitness_rf_hyperparams(params):
    """Fitness = cross-validated accuracy of the RandomForest."""
    n_estimators = int(params[0])
    max_depth = int(params[1])
    min_samples_split = int(params[2])
    min_samples_leaf = int(params[3])
    
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42,
        n_jobs=-1
    )
    
    # 3-fold cross-validation
    scores = cross_val_score(rf, X_train_iris, y_train_iris, cv=3, scoring='accuracy')
    return scores.mean()

# Bounds for the hyperparameters
bounds_rf = [
    (10, 200),   # n_estimators
    (2, 20),     # max_depth
    (2, 20),     # min_samples_split
    (1, 10)      # min_samples_leaf
]

print("RF fitness function defined!")
print(f"Hyperparameters to optimize: {['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf']}")

In [None]:
# Optimization with the GA
print("=" * 60)
print("HYPERPARAMETER OPTIMIZATION WITH A GENETIC ALGORITHM")
print("=" * 60)

start_time = time.time()

ga_rf = GeneticAlgorithm(
    fitness_func=fitness_rf_hyperparams,
    bounds=bounds_rf,
    pop_size=20,
    generations=20,
    mutation_rate=0.15,
    crossover_rate=0.8,
    elitism=0.1
)

best_params_ga, best_fitness_ga = ga_rf.evolve(verbose=True)

ga_time = time.time() - start_time

print(f"\nExecution time: {ga_time:.2f}s")
print(f"\nBest hyperparameters (GA):")
print(f"  n_estimators: {int(best_params_ga[0])}")
print(f"  max_depth: {int(best_params_ga[1])}")
print(f"  min_samples_split: {int(best_params_ga[2])}")
print(f"  min_samples_leaf: {int(best_params_ga[3])}")
print(f"  Accuracy (CV): {best_fitness_ga:.4f}")

In [None]:
# Convergence plot
ga_rf.plot_convergence()

In [None]:
# Evaluate on the test set
rf_best_ga = RandomForestClassifier(
    n_estimators=int(best_params_ga[0]),
    max_depth=int(best_params_ga[1]),
    min_samples_split=int(best_params_ga[2]),
    min_samples_leaf=int(best_params_ga[3]),
    random_state=42
)

rf_best_ga.fit(X_train_iris, y_train_iris)
y_pred_ga = rf_best_ga.predict(X_test_iris)
acc_ga = accuracy_score(y_test_iris, y_pred_ga)

print(f"Accuracy on test set: {acc_ga:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_ga, target_names=iris.target_names))

## 3. Comparison: GA vs GridSearch vs RandomSearch

In [None]:
# GridSearchCV
print("=" * 60)
print("GRID SEARCH CV")
print("=" * 60)

param_grid = {
    'n_estimators': [10, 50, 100, 150, 200],
    'max_depth': [2, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 5, 10]
}

start_time = time.time()

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X_train_iris, y_train_iris)

grid_time = time.time() - start_time

print(f"Execution time: {grid_time:.2f}s")
print(f"Number of combinations tested: {len(grid_search.cv_results_['params'])}")
print(f"\nBest hyperparameters (GridSearch):")
print(grid_search.best_params_)
print(f"Best score (CV): {grid_search.best_score_:.4f}")

y_pred_grid = grid_search.predict(X_test_iris)
acc_grid = accuracy_score(y_test_iris, y_pred_grid)
print(f"Accuracy on test set: {acc_grid:.4f}")

In [None]:
# RandomizedSearchCV
print("=" * 60)
print("RANDOMIZED SEARCH CV")
print("=" * 60)

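from scipy.stats import randint

# randint(low, high) excludes `high`, so use high + 1 to cover the same
# inclusive ranges as the GA bounds and the grid above
param_distributions = {
    'n_estimators': randint(10, 201),
    'max_depth': randint(2, 21),
    'min_samples_split': randint(2, 21),
    'min_samples_leaf': randint(1, 11)
}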

start_time = time.time()

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions,
    n_iter=100,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

random_search.fit(X_train_iris, y_train_iris)

random_time = time.time() - start_time

print(f"Execution time: {random_time:.2f}s")
print(f"Number of combinations tested: {len(random_search.cv_results_['params'])}")
print(f"\nBest hyperparameters (RandomSearch):")
print(random_search.best_params_)
print(f"Best score (CV): {random_search.best_score_:.4f}")

y_pred_random = random_search.predict(X_test_iris)
acc_random = accuracy_score(y_test_iris, y_pred_random)
print(f"Accuracy on test set: {acc_random:.4f}")

In [None]:
# Comparison table
comparison_df = pd.DataFrame({
    'Method': ['Genetic Algorithm', 'Grid Search', 'Random Search'],
    'Time (s)': [ga_time, grid_time, random_time],
    'CV Score': [best_fitness_ga, grid_search.best_score_, random_search.best_score_],
    'Test Accuracy': [acc_ga, acc_grid, acc_random],
    'No. of Evaluations': [ga_rf.pop_size * ga_rf.generations,
                           len(grid_search.cv_results_['params']),
                           len(random_search.cv_results_['params'])]
})

print("\n" + "=" * 80)
print("COMPARISON OF OPTIMIZATION METHODS")
print("=" * 80)
print(comparison_df.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Execution time
axes[0].bar(comparison_df['Method'], comparison_df['Time (s)'], color=['#2ecc71', '#3498db', '#e74c3c'])
axes[0].set_ylabel('Time (seconds)')
axes[0].set_title('Execution Time')
axes[0].tick_params(axis='x', rotation=15)

# CV Score
axes[1].bar(comparison_df['Method'], comparison_df['CV Score'], color=['#2ecc71', '#3498db', '#e74c3c'])
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Cross-Validation Score')
axes[1].set_ylim([0.9, 1.0])
axes[1].tick_params(axis='x', rotation=15)

# Number of evaluations
axes[2].bar(comparison_df['Method'], comparison_df['No. of Evaluations'], color=['#2ecc71', '#3498db', '#e74c3c'])
axes[2].set_ylabel('Number of evaluations')
axes[2].set_title('Number of Evaluations')
axes[2].tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.show()

## 4. Application 2: Feature Selection with a GA

Automatic selection of the most relevant features on the **Breast Cancer** dataset (30 features).
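
As a minimal sketch of the encoding used in this section, the snippet below thresholds a real-valued chromosome at 0.5 to obtain a boolean mask, uses the mask to pick columns, and subtracts a parsimony penalty weighted by 0.01, mirroring the values in this notebook's fitness function. The helper names `decode_mask` and `penalized_score` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def decode_mask(genes, threshold=0.5):
    """Turn real-valued genes in [0, 1] into a boolean feature mask."""
    return np.asarray(genes) > threshold

def penalized_score(accuracy, mask, weight=0.01):
    """Accuracy minus a penalty proportional to the share of features kept."""
    return accuracy - weight * (mask.sum() / mask.size)

genes = np.array([0.9, 0.1, 0.7, 0.3])
mask = decode_mask(genes)

X = np.arange(12).reshape(3, 4)  # toy data: 3 samples, 4 features
X_selected = X[:, mask]          # keeps columns 0 and 2

print(mask)
print(X_selected.shape)
print(penalized_score(0.95, mask))
```

With a weight of 0.01, keeping all features costs at most one accuracy point, so the penalty mainly acts as a tie-breaker between masks of similar accuracy.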

In [None]:
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=42, stratify=y_cancer
)

print("Breast Cancer dataset:")
print(f"  Train: {X_train_cancer.shape}, Test: {X_test_cancer.shape}")
print(f"  Number of features: {X_cancer.shape[1]}")
print(f"  Classes: {np.unique(y_cancer)} (0=malignant, 1=benign)")

In [None]:
# Fitness function for feature selection
def fitness_feature_selection(binary_mask):
    """
    Fitness = accuracy with a penalty for selecting too many features.
    binary_mask: real-valued vector in [0, 1], thresholded at 0.5
                 (above 0.5 = feature selected, otherwise rejected)
    """
    # Convert to a boolean mask
    mask = (binary_mask > 0.5).astype(bool)
    
    # At least one feature must be selected
    if mask.sum() == 0:
        return 0.0
    
    # Select the features
    X_train_selected = X_train_cancer[:, mask]
    
    # Train a RandomForest
    rf = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
    scores = cross_val_score(rf, X_train_selected, y_train_cancer, cv=3, scoring='accuracy')
    accuracy = scores.mean()
    
    # Penalty for too many features (encourages parsimony)
    n_features_selected = mask.sum()
    penalty = 0.01 * (n_features_selected / len(mask))
    
    return accuracy - penalty

# Bounds: [0, 1] for each feature (binarized with a 0.5 threshold)
n_features = X_cancer.shape[1]
bounds_features = [(0, 1)] * n_features

print("Feature selection fitness function defined!")
print(f"Number of features: {n_features}")

In [None]:
# Optimization with the GA
print("=" * 60)
print("FEATURE SELECTION WITH A GENETIC ALGORITHM")
print("=" * 60)

start_time = time.time()

ga_fs = GeneticAlgorithm(
    fitness_func=fitness_feature_selection,
    bounds=bounds_features,
    pop_size=30,
    generations=25,
    mutation_rate=0.1,
    crossover_rate=0.8,
    elitism=0.15
)

best_mask_ga, best_fitness_fs = ga_fs.evolve(verbose=True)

fs_time = time.time() - start_time

# Convert to a boolean mask
best_mask_binary = (best_mask_ga > 0.5).astype(bool)
selected_features = np.array(cancer.feature_names)[best_mask_binary]

print(f"\nExecution time: {fs_time:.2f}s")
print(f"Number of features selected: {best_mask_binary.sum()} / {n_features}")
print(f"Fitness: {best_fitness_fs:.4f}")
print(f"\nSelected features:")
for i, feature in enumerate(selected_features, 1):
    print(f"  {i}. {feature}")

In [None]:
# Convergence plot
ga_fs.plot_convergence()

In [None]:
# Evaluate on the test set
X_train_selected = X_train_cancer[:, best_mask_binary]
X_test_selected = X_test_cancer[:, best_mask_binary]

rf_fs = RandomForestClassifier(n_estimators=100, random_state=42)
rf_fs.fit(X_train_selected, y_train_cancer)
y_pred_fs = rf_fs.predict(X_test_selected)
acc_fs = accuracy_score(y_test_cancer, y_pred_fs)

# Comparison with all features
rf_all = RandomForestClassifier(n_estimators=100, random_state=42)
rf_all.fit(X_train_cancer, y_train_cancer)
y_pred_all = rf_all.predict(X_test_cancer)
acc_all = accuracy_score(y_test_cancer, y_pred_all)

print("\n" + "=" * 60)
print("COMPARISON: SELECTED FEATURES VS ALL FEATURES")
print("=" * 60)
print(f"Selected features ({best_mask_binary.sum()}):")
print(f"  Accuracy: {acc_fs:.4f}")
print(f"\nAll features ({n_features}):")
print(f"  Accuracy: {acc_all:.4f}")
print(f"\nFeature reduction: {(1 - best_mask_binary.sum() / n_features) * 100:.1f}%")
print(f"Accuracy difference: {(acc_fs - acc_all) * 100:+.2f} percentage points")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
methods = ['Selected Features', 'All Features']
accuracies = [acc_fs, acc_all]
colors = ['#2ecc71', '#95a5a6']
axes[0].bar(methods, accuracies, color=colors)
axes[0].set_ylabel('Accuracy')
axes[0].set_ylim([0.9, 1.0])
axes[0].set_title('Accuracy Comparison')
for i, v in enumerate(accuracies):
    axes[0].text(i, v + 0.005, f"{v:.4f}", ha='center', fontweight='bold')

# Feature count
feature_counts = [best_mask_binary.sum(), n_features]
axes[1].bar(methods, feature_counts, color=colors)
axes[1].set_ylabel('Number of Features')
axes[1].set_title('Number of Features Used')
for i, v in enumerate(feature_counts):
    axes[1].text(i, v + 0.5, f"{int(v)}", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

cm_fs = confusion_matrix(y_test_cancer, y_pred_fs)
cm_all = confusion_matrix(y_test_cancer, y_pred_all)

sns.heatmap(cm_fs, annot=True, fmt='d', cmap='Greens', ax=axes[0], cbar=False)
axes[0].set_title(f'Selected Features ({best_mask_binary.sum()})')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('True')

sns.heatmap(cm_all, annot=True, fmt='d', cmap='Blues', ax=axes[1], cbar=False)
axes[1].set_title(f'All Features ({n_features})')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('True')

plt.tight_layout()
plt.show()

## 5. Conclusion

### Key Points

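**Hyperparameter Tuning**:
- GAs offer a good trade-off between exploration and exploitation
- They can be faster than GridSearch on large search spaces, where an exhaustive grid grows multiplicatively with each added hyperparameter
- They can reach performance comparable to RandomSearch, though not with fewer evaluations in this notebook: the GA runs 20 × 20 = 400 fitness evaluations versus `n_iter=100` for RandomSearch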
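
To make the evaluation budgets behind this comparison explicit, the following sketch recomputes them from the settings used in the cells of this notebook (population size, number of generations, grid values, `n_iter`, and 3-fold cross-validation). The variable names are illustrative.

```python
from math import prod

# Settings copied from the cells of this notebook.
ga_pop_size, ga_generations = 20, 20
param_grid = {
    "n_estimators": [10, 50, 100, 150, 200],
    "max_depth": [2, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}
random_search_n_iter = 100
cv_folds = 3

# Number of candidate configurations scored by each method.
budgets = {
    "Genetic Algorithm": ga_pop_size * ga_generations,
    "Grid Search": prod(len(values) for values in param_grid.values()),
    "Random Search": random_search_n_iter,
}

for method, n_candidates in budgets.items():
    print(f"{method:18s} candidates: {n_candidates:4d}  model fits: {n_candidates * cv_folds}")
```

A fairer comparison would give each method the same number of candidates, for example by setting `n_iter` to the GA budget.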

**Feature Selection**:
- Can substantially reduce the number of features with little or no loss of accuracy; the exact outcome varies from run to run because the GA's random draws are not seeded
- Simpler, more interpretable models
- Helps limit overfitting and reduce inference time

### Advantages of GAs
1. **Global exploration**: less prone to getting trapped in local optima than purely local search
2. **Flexibility**: can optimize complex objective functions
3. **Parallelizable**: the population can be evaluated in parallel
4. **Few assumptions**: no need for gradients or continuity
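
The parallelization point can be sketched as follows, assuming the fitness function is independent across individuals. The helper `evaluate_population_parallel` and the toy fitness are hypothetical stand-ins; the `GeneticAlgorithm` class in this notebook evaluates individuals sequentially.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def evaluate_population_parallel(fitness_func, population, max_workers=4):
    """Evaluate every individual concurrently; results keep the population order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return np.array(list(executor.map(fitness_func, population)))

def toy_fitness(individual):
    """Stand-in fitness: negative squared distance to the origin (maximum at 0)."""
    return -float(np.sum(np.square(individual)))

rng = np.random.default_rng(0)
population = rng.uniform(-1, 1, size=(8, 3))

parallel = evaluate_population_parallel(toy_fitness, population)
sequential = np.array([toy_fitness(ind) for ind in population])
print(np.allclose(parallel, sequential))
```

Threads only help when the fitness function releases the GIL, and a fitness that already uses `n_jobs=-1` internally, as in this notebook, may gain little or oversubscribe the CPU if parallelized a second time.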

### Limitations
1. **Computation time**: can be slow when each evaluation is expensive
2. **Hyperparameters**: require tuning of their own (population size, mutation rate, etc.)
3. **Convergence**: no guarantee of finding the global optimum
4. **Comparison**: GridSearch remains more exhaustive on small search spaces
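
One possible way to soften the computation-time limitation, offered here as a suggestion rather than something the notebook does, is to cache fitness values. Since genes are truncated to integers before training, different individuals often decode to the same configuration. In the sketch below, `expensive_score` is a hypothetical stand-in for a cross-validation run.

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def expensive_score(n_estimators, max_depth):
    """Stand-in for a costly cross-validation run (hypothetical scoring rule)."""
    global call_count
    call_count += 1
    return 1.0 / (1.0 + abs(n_estimators - 100) + abs(max_depth - 8))

def cached_fitness(genes):
    """Truncate genes to integers first so equivalent individuals share a cache entry."""
    return expensive_score(int(genes[0]), int(genes[1]))

individuals = [(100.2, 8.9), (100.7, 8.1), (57.3, 4.4)]
scores = [cached_fitness(ind) for ind in individuals]
print(scores, call_count)
```

The first two individuals decode to the same pair of integers, so the expensive function runs only twice for three evaluations.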