# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/XX_CHAPTER/XX_NOTEBOOK.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '05_demo_reduction_dimensionnalite.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 05 - Dimensionality Reduction Demo

Exploring PCA, t-SNE, and UMAP for dimensionality reduction and visualization.

## Objectives
- Master PCA and interpret the principal components
- Use t-SNE for visualization
- Apply UMAP for 2D/3D projections
- Compare the methods

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_wine, fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import umap
from time import time
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: PCA (Principal Component Analysis)

In [None]:
# 1.1 PCA on the Digits dataset
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Original shape: {X_digits.shape}")

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_digits)

# PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Explained variance
explained_var = pca_full.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Variance per component
axes[0].bar(range(1, len(explained_var)+1), explained_var, alpha=0.7)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance')
axes[0].set_title('Explained Variance per Component')
axes[0].set_xlim(0, 20)

# Cumulative variance
axes[1].plot(range(1, len(cumulative_var)+1), cumulative_var, 'o-', linewidth=2)
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% variance')
axes[1].axhline(y=0.90, color='orange', linestyle='--', label='90% variance')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Number of components needed for 95% of the variance
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"\nComponents for 95% of the variance: {n_components_95}/{X_digits.shape[1]}")
print(f"Dimensionality reduction: {100*(1-n_components_95/X_digits.shape[1]):.1f}%")
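
The manual threshold search above can also be delegated to scikit-learn: passing a float between 0 and 1 as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch, assuming the same standardized Digits data as in the cell above (the variable names `n_manual` and `pca_95` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the Digits features, as in the cell above
X_scaled = StandardScaler().fit_transform(load_digits().data)

# Manual search on the cumulative explained variance
cumulative_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative_var >= 0.95) + 1)

# A float in (0, 1) lets PCA pick the number of components itself
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_scaled)

print(f"Manual search: {n_manual} components")
print(f"PCA(n_components=0.95): {pca_95.n_components_} components")
```

Both approaches should agree; the float form is convenient inside pipelines, where the threshold is easier to reason about than a hard-coded component count.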

In [None]:
# 1.2 2D visualization with PCA
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_digits, 
                      cmap='tab10', s=20, alpha=0.6, edgecolors='k', linewidths=0.5)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Digits Dataset - PCA 2D')
plt.colorbar(scatter, label='Digit')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Explained variance (PC1+PC2): {pca_2d.explained_variance_ratio_.sum():.1%}")

In [None]:
# 1.3 Visualizing the principal components
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for i in range(8):
    axes[i].imshow(pca_full.components_[i].reshape(8, 8), cmap='RdBu', 
                   vmin=-0.2, vmax=0.2)
    axes[i].set_title(f'PC{i+1} ({explained_var[i]:.1%})')
    axes[i].axis('off')

plt.suptitle('First Principal Components (Digits)', y=1.02)
plt.tight_layout()
plt.show()
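
Each image above is one row of `components_`: a unit-length direction in the 64-pixel space. The rows are mutually orthogonal, and projecting a centered sample onto them gives the same coordinates as `transform`. A quick check, offered as a sketch on the same standardized Digits data (the names `W`, `gram`, and `manual_scores` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
pca = PCA(n_components=8, svd_solver='full').fit(X_scaled)

# Rows of components_ are orthonormal: W @ W.T is the identity matrix
W = pca.components_
gram = W @ W.T
print(np.allclose(gram, np.eye(8)))

# Without whitening, transform() is a centering followed by a projection
manual_scores = (X_scaled - pca.mean_) @ W.T
print(np.allclose(manual_scores, pca.transform(X_scaled)))
```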

## Part 2: t-SNE (t-Distributed Stochastic Neighbor Embedding)

In [None]:
# 2.1 t-SNE on Digits (subsample for speed)
n_samples = 1000
indices = np.random.RandomState(42).choice(len(X_digits), n_samples, replace=False)
X_sample = X_scaled[indices]
y_sample = y_digits[indices]

# PCA preprocessing (recommended before t-SNE)
pca_50 = PCA(n_components=50)
X_pca_50 = pca_50.fit_transform(X_sample)

# t-SNE
print("Computing t-SNE...")
start = time()
tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_jobs=-1)
X_tsne = tsne.fit_transform(X_pca_50)
print(f"Time: {time()-start:.2f}s")

# Visualization
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, 
                      cmap='tab10', s=30, alpha=0.7, edgecolors='k', linewidths=0.5)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('Digits Dataset - t-SNE 2D')
plt.colorbar(scatter, label='Digit')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# 2.2 Effect of perplexity
perplexity_values = [5, 30, 50, 100]

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, perp in enumerate(perplexity_values):
    print(f"t-SNE with perplexity={perp}...")
    tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42, n_jobs=-1)
    X_tsne_p = tsne_p.fit_transform(X_pca_50)
    
    axes[idx].scatter(X_tsne_p[:, 0], X_tsne_p[:, 1], c=y_sample,
                      cmap='tab10', s=20, alpha=0.6, edgecolors='k', linewidths=0.5)
    axes[idx].set_title(f'perplexity={perp}')
    axes[idx].set_xlabel('t-SNE 1')
    axes[idx].set_ylabel('t-SNE 2')

plt.tight_layout()
plt.show()

print("\nPerplexity:")
print("- Small (5-10): emphasizes local structure")
print("- Medium (30-50): recommended, good trade-off")
print("- Large (>50): emphasizes global structure")

## Part 3: UMAP (Uniform Manifold Approximation and Projection)

In [None]:
# 3.1 UMAP on Digits
print("Computing UMAP...")
start = time()
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X_pca_50)
print(f"Time: {time()-start:.2f}s")

# Visualization
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y_sample,
                      cmap='tab10', s=30, alpha=0.7, edgecolors='k', linewidths=0.5)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('Digits Dataset - UMAP 2D')
plt.colorbar(scatter, label='Digit')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# 3.2 Comparison: PCA vs t-SNE vs UMAP
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

methods = [
    ('PCA', X_pca_2d[indices]),
    ('t-SNE', X_tsne),
    ('UMAP', X_umap)
]

for idx, (name, X_proj) in enumerate(methods):
    scatter = axes[idx].scatter(X_proj[:, 0], X_proj[:, 1], c=y_sample,
                                cmap='tab10', s=20, alpha=0.6, 
                                edgecolors='k', linewidths=0.5)
    axes[idx].set_title(f'{name}', fontsize=14)
    axes[idx].set_xlabel('Dimension 1')
    axes[idx].set_ylabel('Dimension 2')
    axes[idx].grid(True, alpha=0.3)

plt.colorbar(scatter, ax=axes, label='Digit')
plt.tight_layout()
plt.show()
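
The side-by-side plots give a visual impression only. One way to put a number on local-structure preservation is `sklearn.manifold.trustworthiness`, which scores between 0 and 1 how far the nearest neighbors in the embedding were also neighbors in the original space. The sketch below is an illustration rather than part of the original comparison: it assumes a smaller 500-point sample for speed and leaves UMAP out to keep dependencies minimal, although any 2D embedding can be scored the same way.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Small random sample of standardized Digits (500 points, for speed)
X = StandardScaler().fit_transform(load_digits().data)
rng = np.random.RandomState(42)
X_small = X[rng.choice(len(X), 500, replace=False)]

embeddings = {
    'PCA': PCA(n_components=2).fit_transform(X_small),
    't-SNE': TSNE(n_components=2, perplexity=30,
                  random_state=42).fit_transform(X_small),
}

# Score how well each 2D embedding keeps the 10 nearest neighbors
scores = {name: trustworthiness(X_small, emb, n_neighbors=10)
          for name, emb in embeddings.items()}

for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

On Digits one would typically expect t-SNE to score higher than PCA, which matches the tighter clusters seen in the plots. The score only measures local neighborhoods, so it says nothing about global layout.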

In [None]:
# 3.3 UMAP 3D
umap_3d = umap.UMAP(n_components=3, random_state=42)
X_umap_3d = umap_3d.fit_transform(X_pca_50)

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(X_umap_3d[:, 0], X_umap_3d[:, 1], X_umap_3d[:, 2],
                     c=y_sample, cmap='tab10', s=30, alpha=0.6,
                     edgecolors='k', linewidths=0.5)

ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
ax.set_zlabel('UMAP 3')
ax.set_title('Digits Dataset - UMAP 3D')
plt.colorbar(scatter, label='Digit', shrink=0.5)
plt.show()

## Part 4: Application to Images (Olivetti Faces)

In [None]:
# 4.1 Load Olivetti Faces
faces = fetch_olivetti_faces()
X_faces = faces.data
y_faces = faces.target

print(f"Shape: {X_faces.shape}")
print(f"Number of subjects: {len(np.unique(y_faces))}")

# Visualization
fig, axes = plt.subplots(2, 10, figsize=(18, 4))
axes = axes.ravel()

for i in range(20):
    axes[i].imshow(faces.images[i], cmap='gray')
    axes[i].set_title(f'P{y_faces[i]}')
    axes[i].axis('off')

plt.suptitle('Olivetti Faces Dataset')
plt.tight_layout()
plt.show()

In [None]:
# 4.2 PCA on faces (Eigenfaces)
n_components = 150
pca_faces = PCA(n_components=n_components, whiten=True, random_state=42)
X_pca_faces = pca_faces.fit_transform(X_faces)

print(f"Explained variance: {pca_faces.explained_variance_ratio_.sum():.1%}")

# Eigenfaces
fig, axes = plt.subplots(3, 8, figsize=(18, 8))
axes = axes.ravel()

for i in range(24):
    axes[i].imshow(pca_faces.components_[i].reshape(64, 64), cmap='gray')
    axes[i].set_title(f'Eigenface {i+1}')
    axes[i].axis('off')

plt.suptitle('Eigenfaces (Principal Components)', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# 4.3 Reconstruction with PCA
n_components_list = [10, 50, 150]
sample_idx = 0
original = faces.images[sample_idx]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Original
axes[0].imshow(original, cmap='gray')
axes[0].set_title('Original')
axes[0].axis('off')

# Reconstructions
for idx, n_comp in enumerate(n_components_list):
    pca_rec = PCA(n_components=n_comp)
    X_proj = pca_rec.fit_transform(X_faces)
    X_reconstructed = pca_rec.inverse_transform(X_proj)
    
    axes[idx+1].imshow(X_reconstructed[sample_idx].reshape(64, 64), cmap='gray')
    axes[idx+1].set_title(f'{n_comp} components\n({pca_rec.explained_variance_ratio_.sum():.1%} variance)')
    axes[idx+1].axis('off')

plt.tight_layout()
plt.show()
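
Refitting a PCA for every value of `n_comp` is simple but redundant: because principal components are ordered by explained variance, the first k rows of a larger fit form the k-component model. The sketch below illustrates this on the bundled Digits data rather than Olivetti Faces (an assumption made to avoid a download), and it assumes `whiten=False`. `reconstruct` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Fit once with the largest number of components needed
pca_max = PCA(n_components=40, svd_solver='full').fit(X)


def reconstruct(pca, X, k):
    """Project X onto the first k components of a fitted PCA and map back."""
    W = pca.components_[:k]            # shape (k, n_features)
    scores = (X - pca.mean_) @ W.T     # coordinates in the k-dim subspace
    return scores @ W + pca.mean_      # back to pixel space


errors = {}
for k in [5, 20, 40]:
    X_rec = reconstruct(pca_max, X, k)
    errors[k] = float(np.mean((X - X_rec) ** 2))  # mean squared error
    print(f"{k:>2} components: MSE = {errors[k]:.3f}")

# Compare with a dedicated 20-component PCA fitted from scratch
pca_20 = PCA(n_components=20, svd_solver='full').fit(X)
X_rec_20 = pca_20.inverse_transform(pca_20.transform(X))
print(np.allclose(reconstruct(pca_max, X, 20), X_rec_20))
```

The reconstruction error shrinks as components are added, which is the numerical counterpart of the progressively sharper images above.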

## Summary

### PCA

**Advantages:**
- Fast and deterministic
- Preserves global variance
- Interpretable (principal components)
- Supports reconstruction

**Disadvantages:**
- Linear only
- Sensitive to feature scale (standardize features measured in different units)
- Can miss nonlinear structure

**Use cases:**
- Dimensionality reduction
- Data compression
- Denoising
- Feature extraction
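
The scale sensitivity listed among PCA's drawbacks is easy to observe on the Wine dataset, which the notebook imports and whose 13 features are measured in very different units. The following is a minimal sketch; `pc1_raw` and `pc1_std` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = load_wine().data  # 13 features with very different units

# PCA on raw features: the largest-scale feature dominates PC1
pc1_raw = PCA(n_components=2).fit(X_wine).explained_variance_ratio_[0]

# PCA after standardization: every feature contributes on equal footing
X_std = StandardScaler().fit_transform(X_wine)
pc1_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_[0]

print(f"PC1 share of variance, raw features:          {pc1_raw:.1%}")
print(f"PC1 share of variance, standardized features: {pc1_std:.1%}")
```

A first component that captures almost all of the variance on raw data is usually a sign that one large-valued feature is driving the result, not that the data is nearly one-dimensional.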

### t-SNE

**Advantages:**
- Excellent 2D/3D visualization
- Preserves local structure
- Reveals complex clusters

**Disadvantages:**
- Slow on large datasets
- Stochastic: results change between runs unless `random_state` is fixed
- No transform for new data (scikit-learn's `TSNE` has no `transform` method)
- Global distances are not preserved

**Hyperparameters:**
- `perplexity`: 5-50, balances local and global structure
- `n_iter` (renamed `max_iter` in recent scikit-learn releases): 1000+ for convergence
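
The lack of a transform for new data has a practical consequence: scikit-learn's `TSNE` can only embed the points it is fitted on, so new points have to be embedded together with the old ones in a fresh run. A small sketch of that workaround, assuming a 300-point slice of Digits for speed (the old/new split is purely illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:300]
X_old, X_new = X[:250], X[250:]

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
has_transform = hasattr(tsne, 'transform')
print(f"TSNE has a transform() method: {has_transform}")

# Workaround: embed old and new points together in a single run.
# The coordinates of the old points will generally differ from those
# obtained by an earlier run on X_old alone.
X_all_2d = tsne.fit_transform(np.vstack([X_old, X_new]))
X_new_2d = X_all_2d[len(X_old):]
print(X_new_2d.shape)
```

Because every run produces a new layout, this approach suits exploratory plots but not a pipeline that must map incoming samples into a fixed space.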

### UMAP

**Advantages:**
- Faster than t-SNE
- Preserves local structure and much of the global structure
- Can transform new data
- Scalable

**Disadvantages:**
- More recent method, less extensively studied
- Sensitive to hyperparameters

**Hyperparameters:**
- `n_neighbors`: 5-50, size of the local neighborhood (small values emphasize local structure)
- `min_dist`: 0.0-1.0, how tightly points are packed in the embedding
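
The two hyperparameters and the ability to transform new data can be seen together in a short sketch. It assumes `umap-learn` is installed and skips the example otherwise. The values `n_neighbors=15` and `min_dist=0.1` are UMAP's defaults, written out only to show where they go, and the train/test split is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

X_train, X_test = train_test_split(
    load_digits().data, test_size=0.2, random_state=42)

if umap is not None:
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1,
                        n_components=2, random_state=42)
    emb_train = reducer.fit_transform(X_train)  # learn the embedding
    emb_test = reducer.transform(X_test)        # place unseen points in it
    print(emb_train.shape, emb_test.shape)
else:
    print("umap-learn is not installed; skipping the example.")
```

Lowering `n_neighbors` or `min_dist` tends to produce tighter, more fragmented clusters, so it is worth plotting a few combinations before settling on one.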

### Comparison

| Criterion | PCA | t-SNE | UMAP |
|-----------|-----|-------|------|
| Speed | Very fast | Slow | Fast |
| Scalability | Excellent | Poor | Good |
| Local structure | No | Yes | Yes |
| Global structure | Yes | No | Yes |
| New points | Yes | No | Yes |
| Deterministic | Yes | No | No |

### Recommended workflow

1. **Initial exploration:** PCA, for a fast view of the explained variance
2. **Visualization:** t-SNE or UMAP to reveal clusters
3. **Production:** PCA or UMAP, since both can transform new data
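
For the production step, one common pattern is to wrap the scaler, the reducer, and the downstream model in a single scikit-learn pipeline, so that the projection learned on training data is reapplied unchanged to new samples. The sketch below is one possible setup, not a prescribed one: the logistic-regression classifier and the 95% variance threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scaling and PCA are fitted on the training split only, then reused as-is
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver='full'),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps['pca'].n_components_
accuracy = model.score(X_test, y_test)
print(f"Components kept: {n_kept} of {X.shape[1]}")
print(f"Test accuracy: {accuracy:.3f}")
```

Keeping the reducer inside the pipeline also prevents test data from leaking into the fitted projection during cross-validation.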