# 🍽️ Opinion Mining Project - OPTIMIZED Version V2

**Improvements over V1 (83.83%):**
- ✅ **5 epochs** instead of 3
- ✅ **Mean pooling** instead of the CLS token
- ✅ **Extra hidden layer** (768 → 256 → 4)
- ✅ **Cosine scheduler** with warmup
- ✅ **Label smoothing** (0.1)
- ✅ **Gradient clipping** for stability
- ✅ **Early stopping** on validation accuracy
- ✅ Optional **camembert-large** for better performance

**Goal: reach roughly 87-92% accuracy**
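
To make the label-smoothing item concrete, here is a minimal pure-Python sketch of the smoothed cross-entropy for a single example with 4 classes. It assumes the usual formulation, in which the target distribution puts `eps / K` on every class plus `1 - eps` on the true class. The helper name `smoothed_cross_entropy` and the logits are illustrative and not part of the notebook.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution (hypothetical helper)."""
    k = len(logits)
    log_probs = log_softmax(logits)
    # Smoothed target: eps/k on every class, plus (1 - eps) on the true class
    target_dist = [eps / k + (1.0 - eps if i == target else 0.0) for i in range(k)]
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))

logits = [4.0, 0.5, 0.2, -1.0]   # illustrative logits for the 4 classes
hard = smoothed_cross_entropy(logits, target=0, eps=0.0)
smooth = smoothed_cross_entropy(logits, target=0, eps=0.1)
print(f"hard-label loss:     {hard:.4f}")
print(f"smoothed-label loss: {smooth:.4f}")
```

For a confident, correct prediction the smoothed loss stays above the hard-label loss, which discourages the model from pushing its logits to extremes.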

## 1. Installation and GPU check

In [None]:
!pip install -q transformers datasets torch pandas numpy tqdm matplotlib sentencepiece

In [None]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2. Configuration - EDIT HERE

In [None]:
# ============================================
# CONFIGURATION - Edit these values to experiment
# ============================================

CONFIG = {
    # Model - options: "camembert-base", "camembert/camembert-large", "flaubert/flaubert_base_cased"
    "model_name": "camembert-base",  # Switch to "camembert/camembert-large" for potentially better accuracy
    
    # Training hyperparameters
    "num_epochs": 5,           # More passes over the data; early stopping guards against overfitting
    "batch_size": 16,          # Reduce to 8 on OOM with camembert-large
    "learning_rate": 2e-5,     # Peak learning rate (reached after warmup)
    "max_length": 256,         # Maximum sequence length in tokens (longer texts are truncated)
    
    # Regularization
    "dropout": 0.2,            # Dropout (raised from 0.1)
    "label_smoothing": 0.1,    # Label smoothing to reduce overfitting
    "weight_decay": 0.01,      # Decoupled weight decay (AdamW)
    "max_grad_norm": 1.0,      # Gradient clipping threshold (global norm)
    
    # Scheduler
    "warmup_ratio": 0.1,       # Fraction of total steps used for warmup
    "scheduler_type": "cosine", # "linear" or "cosine"
    
    # Architecture
    "use_mean_pooling": True,  # Mean pooling instead of the CLS token
    "hidden_dim": 256,         # Width of the intermediate hidden layer
    
    # Early stopping
    "patience": 2,             # Stop after N epochs without improvement
}

print("📋 Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")
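
The `warmup_ratio` and `scheduler_type` settings control how the learning rate changes at each optimizer step. The sketch below reproduces, in plain Python, the multiplier shape usually associated with linear warmup followed by a half-cosine decay. The step counts (100 batches per epoch) are made-up values, and `lr_multiplier` is a hypothetical helper, not the library function.

```python
import math

def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup from 0 to 1, then half-cosine decay from 1 to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Assumed values for illustration: 100 batches per epoch, 5 epochs, 10% warmup
steps_per_epoch, num_epochs, warmup_ratio, base_lr = 100, 5, 0.1, 2e-5
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * warmup_ratio)

for step in (0, 25, 50, 275, 500):
    lr = base_lr * lr_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With these numbers the learning rate rises to `2e-5` over the first 50 steps, is halved midway through the decay phase, and reaches zero at the final step.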

## 3. Data upload

In [None]:
import os
from google.colab import files

os.makedirs('/content/data', exist_ok=True)

print("Upload your data files (ftdataset_train.tsv and ftdataset_val.tsv):")
uploaded = files.upload()

for filename in uploaded.keys():
    os.rename(filename, f'/content/data/{filename}')
    print(f"✅ {filename} moved to /content/data/")

!ls -la /content/data/

## 4. Data utilities

In [None]:
from typing import Optional
import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

LABEL_TO_IDX = {"Positive": 0, "Négative": 1, "Neutre": 2, "NE": 3}
IDX_TO_LABEL = {v: k for k, v in LABEL_TO_IDX.items()}
ASPECTS = ["Prix", "Cuisine", "Service"]


class OpinionDataset(Dataset):
    def __init__(self, texts, tokenizer, labels=None, max_length=256):
        self.encodings = tokenizer(
            texts, truncation=True, padding=True,
            max_length=max_length, return_tensors="pt"
        )
        self.labels = labels

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        item = {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
        }
        if self.labels is not None:
            for aspect in ASPECTS:
                item[f"label_{aspect.lower()}"] = torch.tensor(self.labels[aspect][idx], dtype=torch.long)
        return item


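def prepare_labels(data):
    """Convert per-aspect string labels to class indices."""
    labels = {aspect: [] for aspect in ASPECTS}
    for item in data:
        for aspect in ASPECTS:
            # Label strings missing from LABEL_TO_IDX fall back to index 3 ("NE")
            labels[aspect].append(LABEL_TO_IDX.get(item[aspect], 3))
    return labels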


def get_texts(data):
    return [item["Avis"] for item in data]


def collate_fn(features):
    batch = {
        "input_ids": torch.stack([f["input_ids"] for f in features]),
        "attention_mask": torch.stack([f["attention_mask"] for f in features]),
    }
    for aspect in ASPECTS:
        key = f"label_{aspect.lower()}"
        if key in features[0]:
            batch[key] = torch.stack([f[key] for f in features])
    return batch


print("✅ Data utilities loaded")
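
As a quick illustration of how records become training targets, the sketch below encodes two made-up reviews in the same way as `prepare_labels`. The records are invented for the example, and `encode_labels` is a stand-in name. The column names (`Avis`, `Prix`, `Cuisine`, `Service`) and the label set follow the notebook.

```python
LABEL_TO_IDX = {"Positive": 0, "Négative": 1, "Neutre": 2, "NE": 3}
ASPECTS = ["Prix", "Cuisine", "Service"]

def encode_labels(records):
    """Map each aspect's string label to its class index, column by column."""
    encoded = {aspect: [] for aspect in ASPECTS}
    for record in records:
        for aspect in ASPECTS:
            # Unknown label strings fall back to index 3 ("NE")
            encoded[aspect].append(LABEL_TO_IDX.get(record[aspect], 3))
    return encoded

# Invented records for illustration only
records = [
    {"Avis": "Très bon, mais cher.", "Prix": "Négative", "Cuisine": "Positive", "Service": "NE"},
    {"Avis": "Correct.", "Prix": "Neutre", "Cuisine": "Neutre", "Service": "positif"},  # misspelled label
]
encoded = encode_labels(records)
print(encoded)
```

Because unknown strings silently become `NE`, it can be worth comparing the distinct label strings in the data against `LABEL_TO_IDX` before training.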

## 5. OPTIMIZED model

In [None]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoTokenizer, AutoModel, get_cosine_schedule_with_warmup, get_linear_schedule_with_warmup
from tqdm.auto import tqdm
import math


class OptimizedClassifier(nn.Module):
    """Optimized classifier with mean pooling and a hidden layer."""

    def __init__(self, model_name, num_classes=4, hidden_dim=256, dropout=0.2, use_mean_pooling=True):
        super().__init__()
        self.config = AutoConfig.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        self.use_mean_pooling = use_mean_pooling

        hidden_size = self.config.hidden_size

        # Classification heads with a hidden layer
        self.classifiers = nn.ModuleDict({
            aspect: nn.Sequential(
                nn.Dropout(dropout),
                nn.Linear(hidden_size, hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, num_classes)
            ) for aspect in ASPECTS
        })

    def mean_pooling(self, token_embeddings, attention_mask):
        """Masked mean pooling over tokens (often a stronger sentence representation than the CLS token)."""
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
        sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
        return sum_embeddings / sum_mask

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)

        if self.use_mean_pooling:
            pooled = self.mean_pooling(outputs.last_hidden_state, attention_mask)
        else:
            pooled = outputs.last_hidden_state[:, 0, :]  # CLS token

        return {aspect: self.classifiers[aspect](pooled) for aspect in ASPECTS}


class OptimizedTrainer:
    """Optimized trainer combining all the improvements."""

    def __init__(self, config):
        self.config = config
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        print(f"🔧 Loading {config['model_name']}...")
        self.tokenizer = AutoTokenizer.from_pretrained(config['model_name'])
        self.model = OptimizedClassifier(
            model_name=config['model_name'],
            hidden_dim=config['hidden_dim'],
            dropout=config['dropout'],
            use_mean_pooling=config['use_mean_pooling']
        ).to(self.device)

        # Label smoothing cross entropy
        self.criterion = nn.CrossEntropyLoss(label_smoothing=config['label_smoothing'])

        # Count parameters
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"✅ Model loaded on {self.device}")
        print(f"   Parameters: {total_params/1e6:.1f}M (trainable: {trainable_params/1e6:.1f}M)")

    def train(self, train_data, val_data):
        config = self.config
        print(f"\n📊 Training on {self.device}")
        print(f"   Train: {len(train_data)} | Val: {len(val_data)}")

        # Datasets
        train_dataset = OpinionDataset(
            get_texts(train_data), self.tokenizer,
            labels=prepare_labels(train_data), max_length=config['max_length']
        )
        val_dataset = OpinionDataset(
            get_texts(val_data), self.tokenizer,
            labels=prepare_labels(val_data), max_length=config['max_length']
        )

        train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, collate_fn=collate_fn)
        val_loader = DataLoader(val_dataset, batch_size=config['batch_size'], shuffle=False, collate_fn=collate_fn)

        # Optimizer with weight decay
        optimizer = AdamW(
            self.model.parameters(),
            lr=config['learning_rate'],
            weight_decay=config['weight_decay']
        )

        # Scheduler
        total_steps = len(train_loader) * config['num_epochs']
        warmup_steps = int(total_steps * config['warmup_ratio'])

        if config['scheduler_type'] == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
        else:
            scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

        # Training loop with early stopping
        best_val_acc = 0.0
        best_model_state = None
        patience_counter = 0
        history = {'train_loss': [], 'val_acc': []}

        for epoch in range(config['num_epochs']):
            print(f"\n{'='*50}")
            print(f"Epoch {epoch + 1}/{config['num_epochs']}")
            print(f"{'='*50}")

            # Training
            self.model.train()
            total_loss = 0.0
            progress = tqdm(train_loader, desc="Training")

            for batch in progress:
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)

                logits = self.model(input_ids, attention_mask)

                loss = sum(
                    self.criterion(logits[aspect], batch[f"label_{aspect.lower()}"].to(self.device))
                    for aspect in ASPECTS
                )

                optimizer.zero_grad()
                loss.backward()

                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), config['max_grad_norm'])

                optimizer.step()
                scheduler.step()

                total_loss += loss.item()
                progress.set_postfix({"loss": f"{loss.item():.4f}", "lr": f"{scheduler.get_last_lr()[0]:.2e}"})

            avg_loss = total_loss / len(train_loader)
            history['train_loss'].append(avg_loss)
            print(f"\n📉 Train Loss: {avg_loss:.4f}")

            # Validation
            val_acc, val_details = self._evaluate(val_loader)
            history['val_acc'].append(val_acc)

            print(f"\n📊 Validation Accuracy: {val_acc:.2f}%")
            for aspect, acc in val_details.items():
                print(f"   {aspect}: {acc:.2f}%")

            # Early stopping check
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_model_state = {k: v.cpu().clone() for k, v in self.model.state_dict().items()}
                patience_counter = 0
                print(f"   ⭐ New best model!")
            else:
                patience_counter += 1
                print(f"   ⏳ Patience: {patience_counter}/{config['patience']}")

            if patience_counter >= config['patience']:
                print(f"\n⚠️ Early stopping at epoch {epoch + 1}")
                break

        # Restore the best model
        if best_model_state is not None:
            self.model.load_state_dict(best_model_state)
            self.model.to(self.device)

        print(f"\n{'='*50}")
        print(f"🏆 BEST ACCURACY: {best_val_acc:.2f}%")
        print(f"{'='*50}")

        return history, best_val_acc

    def _evaluate(self, dataloader):
        self.model.eval()
        correct = {aspect: 0 for aspect in ASPECTS}
        total = 0

        with torch.no_grad():
            for batch in dataloader:
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                logits = self.model(input_ids, attention_mask)

                for aspect in ASPECTS:
                    labels = batch[f"label_{aspect.lower()}"].to(self.device)
                    preds = torch.argmax(logits[aspect], dim=-1)
                    correct[aspect] += (preds == labels).sum().item()

                total += input_ids.size(0)

        details = {aspect: 100 * correct[aspect] / total for aspect in ASPECTS}
        avg_acc = sum(details.values()) / len(ASPECTS)
        return avg_acc, details

    def predict(self, texts):
        self.model.eval()
        predictions = []

        for i in range(0, len(texts), 32):
            batch_texts = texts[i:i+32]
            encodings = self.tokenizer(
                batch_texts, truncation=True, padding=True,
                max_length=self.config['max_length'], return_tensors="pt"
            )

            input_ids = encodings["input_ids"].to(self.device)
            attention_mask = encodings["attention_mask"].to(self.device)

            with torch.no_grad():
                logits = self.model(input_ids, attention_mask)

            for j in range(len(batch_texts)):
                pred = {aspect: IDX_TO_LABEL[torch.argmax(logits[aspect][j]).item()] for aspect in ASPECTS}
                predictions.append(pred)

        return predictions


print("✅ Optimized model defined")
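
To see what the masked mean in `mean_pooling` computes, here is a tiny pure-Python analogue on nested lists (no PyTorch needed). The embeddings and mask are invented, and `masked_mean` is a hypothetical helper that mirrors the tensor version: padded positions (mask 0) are excluded from both the sum and the count.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; token_embeddings is [seq_len][dim]."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for d in range(dim):
            sums[d] += vector[d] * keep
    count = max(sum(attention_mask), 1e-9)  # avoid division by zero on an all-padding row
    return [s / count for s in sums]

# One sequence of 4 tokens with 2-dimensional embeddings; the last two tokens are padding
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(embeddings, mask)
first_token = embeddings[0]   # what CLS-style pooling would use: the first position only
print("mean pooled:", pooled)
print("first token:", first_token)
```

The padding vectors do not affect the result, which is the reason the attention mask is passed to the pooling step instead of averaging over the full padded sequence.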

## 6. Loading the data

In [None]:
import pandas as pd

# Regex separator: a tab with optional surrounding spaces (requires the python engine)
df_train = pd.read_csv("/content/data/ftdataset_train.tsv", sep=r' *\t *', encoding='utf-8', engine='python')
df_val = pd.read_csv("/content/data/ftdataset_val.tsv", sep=r' *\t *', encoding='utf-8', engine='python')

train_data = df_train.to_dict(orient='records')
val_data = df_val.to_dict(orient='records')

print(f"✅ Data loaded: Train={len(train_data)}, Val={len(val_data)}")

# Show the class distribution
print("\n📊 Class distribution:")
for aspect in ASPECTS:
    counts = df_train[aspect].value_counts()
    print(f"  {aspect}: {dict(counts)}")
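
The separator passed to `read_csv` is a regular expression: a tab optionally surrounded by spaces, so stray padding around fields is dropped. Here is a small stdlib-only sketch of the same splitting rule on an invented two-line sample (the real files may differ in columns and content):

```python
import re

SEPARATOR = re.compile(r" *\t *")  # a tab with optional spaces on either side

# Invented sample mimicking a TSV whose fields carry stray spaces around the tabs
sample = "Avis \t Prix\tCuisine \tService\nTrès bon repas \t Neutre \t Positive\t NE"

rows = [SEPARATOR.split(line) for line in sample.split("\n")]
header, first = rows[0], rows[1]
record = dict(zip(header, first))
print(record)
```

Spaces inside a field (as in the review text) are preserved; only the spaces adjacent to a tab are consumed by the separator.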

## 7. Training 🚀

In [None]:
# Create and train the model
trainer = OptimizedTrainer(CONFIG)
history, best_acc = trainer.train(train_data, val_data)
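
The early-stopping rule applied during training can be traced on a made-up sequence of validation accuracies. This sketch mirrors the loop's logic: a strict improvement resets the counter, and `patience` consecutive epochs without improvement end training. `run_early_stopping` and the accuracy values are hypothetical.

```python
def run_early_stopping(val_accuracies, patience):
    """Return (best_accuracy, best_epoch, last_epoch_run) using 1-based epochs."""
    best_acc, best_epoch, counter = 0.0, None, 0
    last_epoch = len(val_accuracies)
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # Strict improvement: remember it and reset the patience counter
            best_acc, best_epoch, counter = acc, epoch, 0
        else:
            counter += 1
        if counter >= patience:
            last_epoch = epoch
            break
    return best_acc, best_epoch, last_epoch

# Invented validation accuracies (%) for 5 epochs
accs = [82.1, 84.7, 84.5, 84.6, 85.0]
print(run_early_stopping(accs, patience=2))
print(run_early_stopping(accs, patience=3))
```

On this invented curve, `patience=2` stops after epoch 4 and keeps the epoch-2 model, while `patience=3` runs long enough to reach the later improvement at epoch 5.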

## 8. Visualization

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(history['train_loss'], 'b-', marker='o')
axes[0].set_title('Train Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].grid(True)

axes[1].plot(history['val_acc'], 'g-', marker='o')
axes[1].set_title('Validation Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].grid(True)

plt.tight_layout()
plt.savefig('/content/training_history.png', dpi=150)
plt.show()

print(f"\n📈 Plot saved to /content/training_history.png")

## 9. Final evaluation

In [None]:
print("\n📈 Final evaluation...")

val_texts = get_texts(val_data)
predictions = trainer.predict(val_texts)

correct = {aspect: 0 for aspect in ASPECTS}
n = len(val_data)

for pred, ref in zip(predictions, val_data):
    for aspect in ASPECTS:
        if pred[aspect] == ref[aspect]:
            correct[aspect] += 1

print("\n" + "="*50)
print("📊 FINAL RESULTS")
print("="*50)
for aspect in ASPECTS:
    acc = 100 * correct[aspect] / n
    print(f"  {aspect}: {acc:.2f}%")

macro_acc = sum(100 * correct[aspect] / n for aspect in ASPECTS) / len(ASPECTS)
print(f"\n🎯 MACRO ACCURACY: {macro_acc:.2f}%")
print("="*50)
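
The macro accuracy reported here is the unweighted mean of the three per-aspect accuracies. A minimal sketch on invented predictions and references (`macro_accuracy` is a hypothetical helper):

```python
ASPECTS = ["Prix", "Cuisine", "Service"]

def macro_accuracy(predictions, references):
    """Per-aspect accuracy (%) and the unweighted mean across aspects."""
    n = len(references)
    per_aspect = {
        aspect: 100 * sum(p[aspect] == r[aspect] for p, r in zip(predictions, references)) / n
        for aspect in ASPECTS
    }
    return per_aspect, sum(per_aspect.values()) / len(ASPECTS)

# Invented labels for 4 reviews
refs = [
    {"Prix": "NE", "Cuisine": "Positive", "Service": "Positive"},
    {"Prix": "Négative", "Cuisine": "Négative", "Service": "NE"},
    {"Prix": "NE", "Cuisine": "Neutre", "Service": "Positive"},
    {"Prix": "Positive", "Cuisine": "Positive", "Service": "Neutre"},
]
preds = [
    {"Prix": "NE", "Cuisine": "Positive", "Service": "Positive"},
    {"Prix": "Négative", "Cuisine": "Neutre", "Service": "NE"},
    {"Prix": "NE", "Cuisine": "Neutre", "Service": "Neutre"},
    {"Prix": "Positive", "Cuisine": "Positive", "Service": "Neutre"},
]
per_aspect, macro = macro_accuracy(preds, refs)
print(per_aspect, f"macro = {macro:.2f}%")
```

Each aspect counts equally in the mean regardless of how its classes are distributed, so a high score on an aspect dominated by `NE` can mask weaker results elsewhere.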

## 10. Manual test

In [None]:
test_texts = [
    "Excellente cuisine, plats savoureux et copieux. Le service était un peu lent mais correct. Prix raisonnables.",
    "Très déçu par ce restaurant. La nourriture était froide et le serveur désagréable. Bien trop cher.",
    "Bon rapport qualité-prix. Service efficace et souriant. La cuisine était correcte sans être exceptionnelle.",
    "Restaurant moyen, rien d'exceptionnel. Les prix sont corrects."
]

print("\n🧪 Testing on a few examples:\n")
predictions = trainer.predict(test_texts)

for text, pred in zip(test_texts, predictions):
    print(f"📝 \"{text[:70]}...\"")
    print(f"   → Prix: {pred['Prix']}, Cuisine: {pred['Cuisine']}, Service: {pred['Service']}\n")

## 11. Saving the model

In [None]:
# Save the model
torch.save({
    'model_state_dict': trainer.model.state_dict(),
    'config': CONFIG,
    'best_accuracy': best_acc
}, '/content/model_optimized_v2.pt')

print(f"✅ Model saved (accuracy: {best_acc:.2f}%)")

# Download
from google.colab import files
files.download('/content/model_optimized_v2.pt')

## 12. 🔬 Experiments (optional)

Edit the configuration above and rerun to try different combinations:

**Suggestions:**
1. `model_name: "camembert/camembert-large"` + `batch_size: 8`
2. `num_epochs: 7` + `patience: 3`
3. `learning_rate: 1e-5` (smaller is usually more stable, but converges more slowly)
4. `hidden_dim: 512` (wider hidden layer)
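
One way to organize these experiments is to leave the base configuration untouched and derive each variant as a set of overrides. The sketch below only builds the configuration dictionaries. `BASE_CONFIG` is a trimmed stand-in for `CONFIG`, `make_config` is a hypothetical helper, and the commented-out training call assumes the `OptimizedTrainer` class from section 5.

```python
# Trimmed stand-in for CONFIG (only the keys touched by the suggestions)
BASE_CONFIG = {
    "model_name": "camembert-base",
    "num_epochs": 5,
    "batch_size": 16,
    "learning_rate": 2e-5,
    "hidden_dim": 256,
    "patience": 2,
}

# One override dict per suggestion
EXPERIMENTS = {
    "large_model": {"model_name": "camembert/camembert-large", "batch_size": 8},
    "longer_training": {"num_epochs": 7, "patience": 3},
    "lower_lr": {"learning_rate": 1e-5},
    "wider_head": {"hidden_dim": 512},
}

def make_config(base, overrides):
    """Return a new config dict; the base dict is left unmodified."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return {**base, **overrides}

for name, overrides in EXPERIMENTS.items():
    config = make_config(BASE_CONFIG, overrides)
    print(name, {k: config[k] for k in overrides})
    # history, best_acc = OptimizedTrainer(config).train(train_data, val_data)
```

Rejecting unknown keys catches typos such as `lr` instead of `learning_rate`, which would otherwise leave the default value silently in place.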