# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/XX_CHAPTER/XX_NOTEBOOK.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '08_demo_lstm_sentiment.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 08 - LSTM for Sentiment Analysis

This notebook demonstrates how to use an LSTM to classify movie reviews as positive or negative.

## Objectives
- Prepare text data (tokenization, padding)
- Implement a bidirectional LSTM with PyTorch
- Train and evaluate a sentiment analysis model
- Visualize the training curves and the predictions

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Loading and Preparing the Data

We use a simplified, hand-written stand-in for the IMDB dataset; the full dataset can be loaded through torchtext.

In [None]:
# Simplified IMDB-style dataset (demo version)
# For the full dataset, use: from torchtext.datasets import IMDB

positive_reviews = [
    "this movie is absolutely fantastic and amazing",
    "i loved this film it was brilliant and well made",
    "great performance outstanding cinematography highly recommended",
    "excellent storyline wonderful acting must see",
    "best movie ever incredible experience enjoyed every minute",
    "superb direction beautiful visuals truly masterpiece",
    "amazing cast perfect script loved everything about it",
    "wonderful film highly entertaining great fun",
    "outstanding performances breathtaking scenes absolute gem",
    "brilliant movie loved the plot and characters"
]

negative_reviews = [
    "this movie is terrible waste of time awful",
    "horrible film boring script poor acting disappointing",
    "worst movie ever seen complete disaster terrible",
    "bad storyline weak performances not recommended",
    "dreadful film poor quality waste of money",
    "awful direction terrible acting horrible experience",
    "boring plot uninteresting characters complete waste",
    "terrible movie poorly executed bad script",
    "disappointing performances weak storyline not worth watching",
    "horrible film terrible acting waste of time"
]

# Replicate to get more samples (these are exact copies of the 20 sentences above)
reviews = (positive_reviews * 100) + (negative_reviews * 100)
labels = [1] * (len(positive_reviews) * 100) + [0] * (len(negative_reviews) * 100)

print(f"Total reviews: {len(reviews)}")
print(f"Positive: {sum(labels)}, Negative: {len(labels) - sum(labels)}")

## 2. Tokenization and Vocabulary

In [None]:
def build_vocab(texts, min_freq=1):
    """Build a vocabulary from the texts."""
    word_freq = Counter()
    for text in texts:
        words = text.lower().split()
        word_freq.update(words)
    
    # Filter by minimum frequency
    vocab = {'<PAD>': 0, '<UNK>': 1}
    for word, freq in word_freq.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    
    return vocab

def text_to_sequence(text, vocab, max_len=50):
    """Convert a text into a sequence of indices."""
    words = text.lower().split()
    sequence = [vocab.get(word, vocab['<UNK>']) for word in words]
    
    # Pad or truncate to max_len
    if len(sequence) < max_len:
        sequence += [vocab['<PAD>']] * (max_len - len(sequence))
    else:
        sequence = sequence[:max_len]
    
    return sequence

# Build the vocabulary
vocab = build_vocab(reviews)
vocab_size = len(vocab)
max_len = 50

print(f"Vocabulary size: {vocab_size}")
print(f"Sample words: {list(vocab.keys())[:10]}")
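
To make the encoding step concrete, here is a small self-contained sketch on a hand-made vocabulary. The names `encode`, `decode`, and `idx_to_word` are illustrative and not part of the notebook; `encode` is assumed to follow the same steps as `text_to_sequence` (lowercase, whitespace split, lookup, pad or truncate).

```python
# Tiny hand-made vocabulary in the same layout as build_vocab's output.
vocab = {'<PAD>': 0, '<UNK>': 1, 'great': 2, 'movie': 3, 'boring': 4}
idx_to_word = {idx: word for word, idx in vocab.items()}

def encode(text, vocab, max_len):
    """Lowercase, split on whitespace, look up indices, then pad or truncate."""
    ids = [vocab.get(w, vocab['<UNK>']) for w in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab['<PAD>']] * (max_len - len(ids))

def decode(ids, idx_to_word):
    """Map indices back to tokens, dropping padding (index 0)."""
    return [idx_to_word[i] for i in ids if i != 0]

ids = encode("Great movie, truly great", vocab, max_len=6)
print(ids)                       # [2, 1, 1, 2, 0, 0]
print(decode(ids, idx_to_word))  # ['great', '<UNK>', '<UNK>', 'great']
```

Note that `movie,` maps to `<UNK>`: a plain whitespace split keeps punctuation attached to the word, so real reviews would need punctuation handling before the lookup.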

In [None]:
# Convert all texts to sequences
sequences = [text_to_sequence(text, vocab, max_len) for text in reviews]
X = np.array(sequences)
y = np.array(labels)

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set: {X_train.shape}, Test set: {X_test.shape}")
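
Because the 2,000 reviews are 100 copies of 20 sentences, a random split puts every sentence in both the training and the test set, so test accuracy says little about generalization. One possible way around this, sketched in plain Python, is to split the unique sentences first and replicate afterwards. `split_then_replicate` is a hypothetical helper, and unlike the cell above it does not stratify by label.

```python
import random

def split_then_replicate(texts, labels, test_fraction=0.2, copies=100, seed=42):
    """Split unique (text, label) pairs first, then replicate each side."""
    pairs = sorted(set(zip(texts, labels)))   # deduplicate, deterministic order
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_fraction))
    test_pairs, train_pairs = pairs[:n_test], pairs[n_test:]
    return train_pairs * copies, test_pairs * copies

# Toy stand-in for the notebook's data: 4 unique sentences, each repeated 50 times.
texts = ["great film", "awful film", "loved it", "hated it"] * 50
labels = [1, 0, 1, 0] * 50

train, test = split_then_replicate(texts, labels, test_fraction=0.25, copies=10)
overlap = {t for t, _ in train} & {t for t, _ in test}
print(len(train), len(test), len(overlap))   # 30 10 0
```

With no sentence shared between the two sides, the test score reflects sentences the model has not seen, although with only 20 distinct reviews it remains a toy measurement.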

## 3. PyTorch Dataset

In [None]:
class SentimentDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.LongTensor(sequences)
        self.labels = torch.FloatTensor(labels)
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

# Create the datasets
train_dataset = SentimentDataset(X_train, y_train)
test_dataset = SentimentDataset(X_test, y_test)

# DataLoaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## 4. Bidirectional LSTM Model

In [None]:
class BiLSTM_Sentiment(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout=0.5):
        super().__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
        # Fully connected (x2 for bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, 1)
        
    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embedding_dim)
        
        # LSTM
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch_size, seq_len, hidden_dim*2)
        # hidden: (num_layers*2, batch_size, hidden_dim)
        
        # Take the final hidden states of the last layer (forward and backward).
        # Note: inputs are right-padded, so the forward state has also read the <PAD> steps.
        hidden_fwd = hidden[-2, :, :]  # (batch_size, hidden_dim)
        hidden_bwd = hidden[-1, :, :]  # (batch_size, hidden_dim)
        hidden_concat = torch.cat([hidden_fwd, hidden_bwd], dim=1)
        
        # Dropout and classification
        hidden_dropout = self.dropout(hidden_concat)
        output = self.fc(hidden_dropout)  # (batch_size, 1)
        
        return output.squeeze(1)  # (batch_size,)

# Hyperparameters
embedding_dim = 100
hidden_dim = 128
num_layers = 2
dropout = 0.5

# Create the model
model = BiLSTM_Sentiment(vocab_size, embedding_dim, hidden_dim, num_layers, dropout)
model = model.to(device)

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
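
The parameter total printed above can be checked by hand. The sketch below counts parameters in plain Python, assuming the standard `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors, no projection). The helper names are illustrative, and `vocab_size=80` is a made-up value, since the real one depends on the data.

```python
def lstm_param_count(input_dim, hidden_dim, num_layers, bidirectional=True):
    """Parameter count of an LSTM with biases and no projection."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the concatenated outputs of all directions.
        in_dim = input_dim if layer == 0 else hidden_dim * directions
        # 4 gates: input weights, recurrent weights, and two bias vectors each.
        per_direction = 4 * (in_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
        total += per_direction * directions
    return total

def model_param_count(vocab_size, embedding_dim=100, hidden_dim=128, num_layers=2):
    """Embedding + bidirectional LSTM + final linear layer."""
    embedding = vocab_size * embedding_dim
    lstm = lstm_param_count(embedding_dim, hidden_dim, num_layers)
    fc = hidden_dim * 2 * 1 + 1   # weights + bias of nn.Linear(hidden_dim * 2, 1)
    return embedding + lstm + fc

print(lstm_param_count(100, 128, 2))      # 630784
print(model_param_count(vocab_size=80))   # 639041 (vocab_size=80 is hypothetical)
```

With these hyperparameters the LSTM dominates the count; the embedding only adds `100 * vocab_size`, which is small for a vocabulary this size.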

## 5. Training

In [None]:
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for sequences, labels in loader:
        sequences, labels = sequences.to(device), labels.to(device)
        
        # Forward
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = criterion(outputs, labels)
        
        # Backward
        loss.backward()
        optimizer.step()
        
        # Metrics
        total_loss += loss.item()
        predictions = (torch.sigmoid(outputs) > 0.5).float()
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(loader), correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    all_predictions = []
    all_labels = []
    
    with torch.no_grad():
        for sequences, labels in loader:
            sequences, labels = sequences.to(device), labels.to(device)
            
            outputs = model(sequences)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            predictions = (torch.sigmoid(outputs) > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
            
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    return total_loss / len(loader), correct / total, all_predictions, all_labels

In [None]:
# Training configuration
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 20

# History
history = {
    'train_loss': [], 'train_acc': [],
    'test_loss': [], 'test_acc': []
}

# Training loop
print("Starting training...\n")
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc, _, _ = evaluate(model, test_loader, criterion, device)
    
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['test_loss'].append(test_loss)
    history['test_acc'].append(test_acc)
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
        print(f"  Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}\n")

print("Training completed!")
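
The model returns raw logits, and `nn.BCEWithLogitsLoss` applies the sigmoid internally before averaging over the batch. As a rough per-sample illustration in plain Python (the function names are illustrative, not PyTorch API), the numerically stable form of the loss and its naive counterpart can be compared like this:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    # Algebraically equal to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Direct formula; breaks down once sigmoid(logit) rounds to 0 or 1."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    stable, naive = bce_with_logits(logit, target), naive_bce(logit, target)
    predicted = 1.0 if logit > 0 else 0.0   # same decision as sigmoid(logit) > 0.5
    print(f"logit={logit:+.1f} target={target:.0f} "
          f"stable={stable:.4f} naive={naive:.4f} pred={predicted:.0f}")
```

The two columns agree for moderate logits. The stable form also shows why thresholding `sigmoid(outputs)` at 0.5 is equivalent to checking whether the logit is positive.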

## 6. Visualizing the Results

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history['train_loss'], label='Train Loss', linewidth=2)
axes[0].plot(history['test_loss'], label='Test Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training and Test Loss', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history['train_acc'], label='Train Accuracy', linewidth=2)
axes[1].plot(history['test_acc'], label='Test Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Training and Test Accuracy', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
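
When reading curves like these, it is often useful to know at which epoch the held-out loss was lowest and how long it has gone without improving. The sketch below works on a dictionary shaped like `history`; the helper names are hypothetical and the loss values are invented for illustration, not taken from a run of this notebook.

```python
def best_epoch(history, key='test_loss'):
    """Return (1-based epoch, value) where history[key] is lowest."""
    values = history[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

def epochs_since_improvement(history, key='test_loss'):
    """Number of epochs run after the best one."""
    epoch, _ = best_epoch(history, key)
    return len(history[key]) - epoch

# Invented values, only to exercise the helpers.
example_history = {'test_loss': [0.69, 0.41, 0.22, 0.25, 0.31]}
print(best_epoch(example_history))                # (3, 0.22)
print(epochs_since_improvement(example_history))  # 2
```

Selecting an epoch this way on the test set is itself a form of leakage; a separate validation split is the usual choice for that decision.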

## 7. Final Evaluation

In [None]:
# Full evaluation
test_loss, test_acc, predictions, true_labels = evaluate(
    model, test_loader, criterion, device
)

print("Final Test Results:")
print(f"  Loss: {test_loss:.4f}")
print(f"  Accuracy: {test_acc:.4f}\n")

# Classification report
print("Classification Report:")
print(classification_report(true_labels, predictions, target_names=['Negative', 'Positive']))

# Confusion matrix
cm = confusion_matrix(true_labels, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()
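
The figures in the classification report can be derived directly from the confusion matrix. Here is a minimal sketch in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predictions). `binary_metrics` is an illustrative helper, and the counts are made up rather than produced by this notebook.

```python
def binary_metrics(cm):
    """Precision, recall, F1 (positive class) and accuracy from a 2x2 matrix.

    With labels ordered [0, 1], the layout is cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Made-up counts for illustration only.
example_cm = [[180, 20], [10, 190]]
for name, value in binary_metrics(example_cm).items():
    print(f"{name}: {value:.4f}")
```

Swapping the roles of the two classes (treating 0 as the positive label) gives the "Negative" row of the report.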

## 8. Predictions on New Texts

In [None]:
def predict_sentiment(text, model, vocab, device, max_len=50):
    """Predict the sentiment of a text."""
    model.eval()
    
    # Preprocessing
    sequence = text_to_sequence(text, vocab, max_len)
    sequence_tensor = torch.LongTensor([sequence]).to(device)
    
    # Prediction
    with torch.no_grad():
        output = model(sequence_tensor)
        probability = torch.sigmoid(output).item()
    
    sentiment = "Positive" if probability > 0.5 else "Negative"
    return sentiment, probability

# Tests
test_texts = [
    "this movie is absolutely brilliant and amazing",
    "terrible waste of time horrible acting",
    "great film loved every minute of it",
    "boring and disappointing not recommended",
    "outstanding performances wonderful experience"
]

print("Sentiment Predictions:\n")
for text in test_texts:
    sentiment, prob = predict_sentiment(text, model, vocab, device)
    print(f"Text: {text}")
    print(f"  Sentiment: {sentiment} (P(positive): {prob:.4f})\n")
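
The value returned by `predict_sentiment` is the probability of the positive class, so a confident negative prediction shows up as a number close to 0. If a confidence in the predicted label is wanted instead, one possible conversion is sketched below; `label_with_confidence` is a hypothetical helper, not part of the notebook.

```python
def label_with_confidence(prob_positive, threshold=0.5):
    """Turn P(positive) into a label and the probability of that label."""
    if prob_positive > threshold:
        return "Positive", prob_positive
    return "Negative", 1.0 - prob_positive

# Hand-picked probabilities, not model outputs.
for p in (0.97, 0.50, 0.08):
    label, confidence = label_with_confidence(p)
    print(f"P(positive)={p:.2f} -> {label} ({confidence:.2f})")
```

A probability of exactly 0.5 falls on the negative side here, matching the `probability > 0.5` test used in `predict_sentiment`.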

## Conclusion

### What we learned:
1. Text preprocessing: tokenization, vocabulary, padding
2. Bidirectional LSTM architecture for NLP
3. Training and evaluating a sentiment model
4. Using embeddings and recurrent layers

### Going further:
- Use pretrained embeddings (GloVe, Word2Vec)
- Try more complex architectures (attention, Transformers)
- Fine-tune pretrained models (BERT, RoBERTa)
- Work with real datasets (full IMDB, Twitter)
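
As a starting point for the first suggestion, pretrained vectors in the GloVe text format (one token per line followed by space-separated floats) can be aligned with a vocabulary like the one built in this notebook. The sketch below is plain Python; `load_embeddings` is a hypothetical helper, the vectors are invented, and it assumes vocabulary indices run from 0 to `len(vocab) - 1`.

```python
import io
import random

def load_embeddings(lines, vocab, dim, seed=0):
    """Build a len(vocab) x dim matrix; words absent from `lines` keep random rows."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    matrix[vocab['<PAD>']] = [0.0] * dim   # keep the padding row at zero
    found = 0
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = [float(v) for v in values]
            found += 1
    return matrix, found

# Stand-in for an opened GloVe text file; these vectors are invented.
fake_file = io.StringIO("great 0.1 0.2 0.3\nmovie -0.4 0.5 0.0\nunseen 1.0 1.0 1.0\n")
toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'great': 2, 'movie': 3, 'boring': 4}
matrix, found = load_embeddings(fake_file, toy_vocab, dim=3)
print(found, matrix[2])   # 2 [0.1, 0.2, 0.3]
```

The resulting matrix could then be converted to a tensor and handed to `nn.Embedding.from_pretrained(..., padding_idx=0)`, with `freeze=False` if the vectors should keep training.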