# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/08_deep_learning_rnn/08_demo_transformers_huggingface.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '08_demo_transformers_huggingface.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 08 - Transformers with Hugging Face

This notebook demonstrates how to use pre-trained Transformers via Hugging Face.

## Objectives
- Load and use pre-trained models (BERT, GPT)
- Fine-tune a model for a specific task
- Understand tokenizers and the Hugging Face pipeline
- Evaluate performance on NLP tasks

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Hugging Face Transformers
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, pipeline
)
from datasets import Dataset

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Simple Pipeline - Sentiment Analysis

Using the Hugging Face pipeline to solve a task quickly.

In [None]:
# Create a sentiment-analysis pipeline
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Test on a few sentences
test_sentences = [
    "This movie is absolutely fantastic!",
    "I hated this film, it was terrible.",
    "The acting was great but the plot was boring.",
    "Best movie I've ever seen, highly recommended!",
    "Waste of time and money, very disappointing."
]

print("Sentiment Analysis with Pre-trained Model:\n")
for sentence in test_sentences:
    result = sentiment_analyzer(sentence)[0]
    print(f"Text: {sentence}")
    print(f"  Label: {result['label']}, Score: {result['score']:.4f}\n")
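
The `label` and `score` fields come from a softmax over the classifier's output logits: the most probable class gives the label, and its probability gives the score. Here is a minimal sketch in plain Python, assuming made-up logits (not real model outputs). `to_prediction` is a hypothetical helper, and the `NEGATIVE`/`POSITIVE` names mirror the labels of the SST-2 checkpoint loaded above.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-2.0, 3.0]  # hypothetical logits for a clearly positive review

prediction = to_prediction(example_logits, ID2LABEL)
print(prediction)
```

A score close to 0.5 means the two logits are nearly equal, which is what one might expect for a mixed review such as the third test sentence.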

## 2. Data Preparation for Fine-tuning

In [None]:
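# Create a synthetic dataset (replace with a real dataset in production).
# Note: each sentence is repeated 50 times, so the test split contains the same
# sentences as the training split and the measured accuracy is optimistic.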
positive_samples = [
    "excellent product highly recommended",
    "amazing quality great value for money",
    "best purchase ever very satisfied",
    "outstanding service quick delivery",
    "love it perfect exactly what i needed"
] * 50

negative_samples = [
    "terrible quality waste of money",
    "disappointed poor service never again",
    "worst purchase broke immediately",
    "horrible experience bad quality",
    "not recommended cheap materials poor design"
] * 50

# Combine
texts = positive_samples + negative_samples
labels = [1] * len(positive_samples) + [0] * len(negative_samples)

# Split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Train samples: {len(train_texts)}")
print(f"Test samples: {len(test_texts)}")
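
Because the dataset is built by repeating a handful of sentences, a random split places the same sentences on both sides, so test accuracy mostly measures memorization. The sketch below checks this on a reduced copy of the data (two sentences per class). It uses a standard-library shuffle as a stand-in for `train_test_split`, which is an assumption made to keep the example dependency-free.

```python
import random

positive = ["excellent product highly recommended", "best purchase ever very satisfied"] * 50
negative = ["terrible quality waste of money", "worst purchase broke immediately"] * 50
texts = positive + negative

# Stand-in for train_test_split: shuffle, then cut at 80%
rng = random.Random(42)
shuffled = texts[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Count test sentences that also occur in the training split
train_set = set(train)
leaked = sum(text in train_set for text in test)
print(f"Unique sentences: {len(set(texts))}")
print(f"Test sentences also seen in training: {leaked}/{len(test)}")

# One way to avoid the leak: split the unique sentences first,
# then repeat or augment each side separately.
unique = sorted(set(texts))
rng.shuffle(unique)
train_unique, test_unique = unique[:3], unique[3:]
```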

## 3. Tokenization

In [None]:
# Load the tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)

# Create Hugging Face datasets
train_dataset = Dataset.from_dict({
    'input_ids': train_encodings['input_ids'],
    'attention_mask': train_encodings['attention_mask'],
    'labels': train_labels
})

test_dataset = Dataset.from_dict({
    'input_ids': test_encodings['input_ids'],
    'attention_mask': test_encodings['attention_mask'],
    'labels': test_labels
})

print(f"Train dataset: {train_dataset}")
print(f"Test dataset: {test_dataset}")
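
With `padding=True`, every sequence in the call is padded to the length of the longest one, and `attention_mask` marks real tokens with 1 and padding with 0. `truncation=True` caps the length at `max_length`. The toy encoder below sketches that behavior only: it splits on whitespace and invents word ids (starting at 1000, an arbitrary choice), whereas the real tokenizer uses WordPiece. The ids 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids of the BERT uncased vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in the BERT uncased vocabulary

def toy_encode(texts, max_length=128):
    """Whitespace 'tokenizer' that mimics padding, truncation and attention masks."""
    vocab = {}  # hypothetical vocabulary, built on the fly
    encoded = []
    for text in texts:
        ids = [vocab.setdefault(word, 1000 + len(vocab)) for word in text.lower().split()]
        ids = ids[: max_length - 2]  # truncation leaves room for [CLS] and [SEP]
        encoded.append([CLS_ID] + ids + [SEP_ID])

    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_encode(["excellent product highly recommended", "terrible quality"])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

The second sentence is shorter, so its row ends with padding ids and its mask ends with zeros. The model uses the mask to ignore those positions.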

## 4. Loading the Model and Fine-tuning

In [None]:
# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
model = model.to(device)

print(f"Model loaded: {model_name}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=10,  # the run has only 75 optimizer steps (400 examples / 16 per batch x 3 epochs)
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

print("Trainer configured successfully!")
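
The number of optimizer steps follows from the dataset size, the batch size and the number of epochs, and the warmup length should be judged against it. The sketch below assumes a single device, no gradient accumulation, and the linear warmup followed by linear decay that `Trainer` uses by default. `total_training_steps` and `linear_schedule_lr` are hypothetical helpers, not library functions. A warmup longer than the whole run means the peak learning rate is never reached.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# 400 training examples (80% of 500), batch size 16, 3 epochs
total = total_training_steps(num_examples=400, batch_size=16, num_epochs=3)
peak_lr = 5e-5  # default learning_rate of TrainingArguments

for warmup in (10, 100):
    highest = max(linear_schedule_lr(s, peak_lr, warmup, total) for s in range(total))
    print(f"warmup_steps={warmup}: {total} steps, highest LR reached = {highest:.2e}")
```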

In [None]:
# Fine-tune the model
print("Starting fine-tuning...\n")
train_result = trainer.train()

print("\nFine-tuning completed!")
print(f"Training loss: {train_result.training_loss:.4f}")

## 5. Evaluation

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

print("Evaluation Results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

In [None]:
# Predictions on the test set
predictions = trainer.predict(test_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)

# Classification report
print("\nClassification Report:")
print(classification_report(test_labels, pred_labels, target_names=['Negative', 'Positive']))

# Confusion matrix
cm = confusion_matrix(test_labels, pred_labels)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
plt.title('Confusion Matrix - Fine-tuned Model', fontsize=14, fontweight='bold')
plt.show()
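
The figures in the classification report can be derived by hand from the confusion matrix. Here is a small sketch, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and hypothetical counts rather than results from this notebook. `binary_metrics` is an illustrative helper.

```python
def binary_metrics(cm):
    """Compute metrics for the positive class from a 2x2 confusion matrix.

    cm follows scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

example_cm = [[45, 5], [10, 40]]  # hypothetical counts, not notebook output
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```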

## 6. Visualizing the Training History

In [None]:
# Extract the logs
log_history = trainer.state.log_history

# Separate training and evaluation logs
train_logs = [log for log in log_history if 'loss' in log and 'eval_loss' not in log]
eval_logs = [log for log in log_history if 'eval_loss' in log]

if train_logs and eval_logs:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Training loss
    train_steps = [log['step'] for log in train_logs]
    train_loss = [log['loss'] for log in train_logs]
    axes[0].plot(train_steps, train_loss, linewidth=2, marker='o')
    axes[0].set_xlabel('Steps', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Training Loss', fontsize=14, fontweight='bold')
    axes[0].grid(True, alpha=0.3)
    
    # Evaluation metrics
    eval_epochs = [log['epoch'] for log in eval_logs]
    eval_acc = [log['eval_accuracy'] for log in eval_logs]
    axes[1].plot(eval_epochs, eval_acc, linewidth=2, marker='s', color='green')
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Accuracy', fontsize=12)
    axes[1].set_title('Evaluation Accuracy', fontsize=14, fontweight='bold')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("No training logs available for visualization.")
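
The filters above rely on the dictionary keys present in each log entry. The sketch below uses a hand-written `log_history` with realistic keys but hypothetical values. It shows that the final summary entry, which carries `train_loss` rather than `loss`, is excluded from both lists.

```python
# Hypothetical log entries; the values are made up, the keys follow Trainer's conventions
log_history = [
    {"loss": 0.652, "learning_rate": 5e-05, "epoch": 0.4, "step": 10},
    {"loss": 0.310, "learning_rate": 4.2e-05, "epoch": 0.8, "step": 20},
    {"eval_loss": 0.120, "eval_accuracy": 0.98, "epoch": 1.0, "step": 25},
    {"train_runtime": 12.3, "train_loss": 0.401, "epoch": 1.0, "step": 25},
]

# `'loss' in log` tests for an exact key, so 'train_loss' and 'eval_loss' do not match
train_logs = [log for log in log_history if "loss" in log and "eval_loss" not in log]
eval_logs = [log for log in log_history if "eval_loss" in log]

print("train steps:", [log["step"] for log in train_logs])
print("eval epochs:", [log["epoch"] for log in eval_logs])
```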

## 7. Using the Fine-tuned Model

In [None]:
# Create a pipeline with the fine-tuned model
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test on new texts
new_texts = [
    "This product exceeded my expectations, absolutely wonderful!",
    "Complete waste of money, very poor quality.",
    "Good value for the price, would buy again.",
    "Terrible experience, broken on arrival.",
    "Perfect gift, everyone loved it!"
]

print("Predictions on New Texts:\n")
for text in new_texts:
    result = classifier(text)[0]
    sentiment = "Positive" if result['label'] == 'LABEL_1' else "Negative"
    print(f"Text: {text}")
    print(f"  Prediction: {sentiment} (score: {result['score']:.4f})\n")
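
`LABEL_0` and `LABEL_1` are the generic names a model config receives when no class names are supplied. The index matches the integer labels used for training (0 = negative, 1 = positive). Here is a small sketch of a more defensive mapping. `label_to_sentiment` is a hypothetical helper, not part of the library.

```python
def label_to_sentiment(label, names=("Negative", "Positive")):
    """Map a generic 'LABEL_<index>' name to a readable class name."""
    prefix, _, index = label.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"unexpected label format: {label!r}")
    return names[int(index)]

for raw in ("LABEL_0", "LABEL_1"):
    print(raw, "->", label_to_sentiment(raw))
```

Raising on an unexpected format avoids silently reporting "Negative" if the model is later given custom label names.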

## 8. Embedding Analysis (Bonus)

In [None]:
# Extract embeddings for a few examples
sample_texts = [
    "excellent product",
    "terrible quality",
    "good value",
    "waste of money"
]

model.eval()
embeddings = []

with torch.no_grad():
    for text in sample_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model.distilbert(**inputs)
        # Take the mean of the last hidden state
        embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
        embeddings.append(embedding[0])

embeddings = np.array(embeddings)

# PCA for visualization
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

plt.figure(figsize=(10, 7))
colors = ['green', 'red', 'green', 'red']
for i, text in enumerate(sample_texts):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], 
                c=colors[i], s=200, alpha=0.6, edgecolors='black', linewidth=2)
    plt.annotate(text, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                 fontsize=11, fontweight='bold', ha='center')

plt.xlabel('PC1', fontsize=12)
plt.ylabel('PC2', fontsize=12)
plt.title('DistilBERT Embeddings Visualization (PCA)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
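
Averaging the last hidden state with `.mean(dim=1)` works here because each text is encoded on its own, so there is no padding. If several texts were encoded in one padded batch, the padding vectors would be averaged in as well. A common remedy is to weight the average by the attention mask. The sketch below uses plain Python and made-up 2-dimensional vectors. `masked_mean_pool` is a hypothetical helper.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]

# Made-up token vectors for one sequence; the last position is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
plain_mean = [sum(column) / len(column) for column in zip(*hidden)]
print("masked mean:", pooled)
print("plain mean: ", plain_mean)
```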

## 9. Text Generation with GPT-2 (Bonus)

In [None]:
# Text-generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Prompts
prompts = [
    "Once upon a time in a distant galaxy",
    "The future of artificial intelligence is",
    "In the year 2050, technology will"
]

print("Text Generation with GPT-2:\n")
for prompt in prompts:
    generated = generator(
        prompt,
        max_length=50,
        num_return_sequences=1,
        do_sample=True,  # temperature only has an effect when sampling
        temperature=0.7
    )
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated[0]['generated_text']}\n")
    print("-" * 80 + "\n")
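
The `temperature` parameter divides the next-token logits before the softmax. Values below 1 sharpen the distribution toward the most likely tokens, and values above 1 flatten it. The sketch below uses a toy four-word vocabulary and made-up logits (nothing here comes from GPT-2). `apply_temperature` is a hypothetical helper.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "a", "robots", "banana"]  # toy vocabulary
logits = [2.0, 1.5, 0.5, -1.0]            # made-up next-token scores

for t in (1.0, 0.7):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])

# Sample one next token from the sharpened distribution
rng = random.Random(0)
next_token = rng.choices(vocab, weights=apply_temperature(logits, 0.7), k=1)[0]
print("sampled token:", next_token)
```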

## Conclusion

### What we learned:
1. Using Hugging Face pipelines for quick tasks
2. Fine-tuning a pre-trained BERT-family model (DistilBERT)
3. Using the Trainer API for training
4. Evaluating Transformer models and using them for inference
5. Visualizing embeddings and generating text

### Going further:
- Try other models (RoBERTa, ALBERT, ELECTRA)
- Fine-tune for other tasks (NER, QA, summarization)
- Use real datasets (GLUE, SuperGLUE)
- Optimize with quantization and distillation
- Deploy to production with FastAPI/TensorFlow Serving