# Fine-tuning an Embedding Model for Pocket Arbiter

**ISO Reference**: ISO/IEC 42001 A.6.2.2, ISO/IEC 25010 S4.2

This notebook fine-tunes a multilingual embedding model on the French chess arbiting domain.

**Instructions**: Click **Runtime** → **Run all** (that's it!)

**ZERO configuration required** - everything is automatic.

In [None]:
# Install dependencies (NO LOGIN REQUIRED)
!pip install -q sentence-transformers datasets accelerate
print("Installation complete!")

In [None]:
# GPU check
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("WARNING: No GPU! Go to Runtime > Change runtime type > GPU")

In [None]:
# Automatic download of the training data from GitHub
# -O overwrites the file on re-run instead of creating triplets_training.jsonl.1
!wget -q -O triplets_training.jsonl https://raw.githubusercontent.com/pierrealexandreguillemin-a11y/pocket_arbiter/main/data/training/triplets_training.jsonl
print("Data downloaded!")

In [None]:
# Load the triplets
import json

triplets = []
with open("triplets_training.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            triplets.append(json.loads(line))

print(f"Triplets loaded: {len(triplets)}")
print(f"Example: {triplets[0]['anchor'][:50]}...")
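
The loading cell accepts any JSON object per line. A minimal validation sketch follows, assuming each record carries `anchor`, `positive`, and `negative` string fields. Only `anchor` is confirmed by the notebook; the other two names and the helper `parse_triplets` are hypothetical. Keeping exactly these three keys in a fixed order matters because the sentence-transformers trainer treats every non-label column as a model input, in column order.

```python
import json

# Assumed field names: only "anchor" appears in the notebook;
# "positive" and "negative" are hypothetical.
REQUIRED_KEYS = ("anchor", "positive", "negative")


def parse_triplets(lines):
    """Parse JSONL lines and keep only well-formed triplets.

    Returns (valid, rejected) where `valid` holds dicts restricted to
    REQUIRED_KEYS in that order, and `rejected` holds the 1-based line
    numbers of records that are missing a key or contain an empty value.
    """
    valid, rejected = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines, as the loading cell does
        record = json.loads(line)
        is_complete = all(
            isinstance(record.get(key), str) and record[key].strip()
            for key in REQUIRED_KEYS
        )
        if is_complete:
            valid.append({key: record[key] for key in REQUIRED_KEYS})
        else:
            rejected.append(number)
    return valid, rejected


# Illustrative data, not taken from the real training file.
sample_lines = [
    '{"anchor": "q1", "positive": "p1", "negative": "n1", "source": "doc"}',
    "",
    '{"anchor": "q2", "positive": "p2"}',
]
valid_triplets, rejected_lines = parse_triplets(sample_lines)
print(f"valid={len(valid_triplets)} rejected_lines={rejected_lines}")
```

With the real file, the same function could be called as `parse_triplets(open("triplets_training.jsonl", encoding="utf-8"))` before building the dataset.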

In [None]:
# Configuration
# Multilingual model, 768 dimensions - NO AUTH REQUIRED
MODEL_ID = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
OUTPUT_DIR = "embedding-chess-fr"
EPOCHS = 3
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1
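
To see what these values imply for the schedule, here is a rough arithmetic sketch. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The dataset size of 1000 is a made-up figure, and `schedule_summary` is a hypothetical helper, not part of the notebook.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps).

    Assumes one optimizer step per batch (single device, no gradient
    accumulation) and that the final partial batch is not dropped.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


# 1000 examples is a hypothetical dataset size, used only for illustration.
summary = schedule_summary(num_examples=1000, batch_size=16, epochs=3, warmup_ratio=0.1)
print(summary)
```

Under these assumptions the learning rate ramps up linearly over the warmup steps before the decay phase begins.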

In [None]:
# Load the model (automatic download, no login)
from sentence_transformers import SentenceTransformer

print(f"Loading {MODEL_ID}...")
model = SentenceTransformer(MODEL_ID)
print(f"Dimension: {model.get_sentence_embedding_dimension()}")
print("Model loaded successfully!")

In [None]:
# Prepare the dataset
from datasets import Dataset

dataset = Dataset.from_list(triplets)
print(f"Dataset: {len(dataset)} examples")
print(dataset)

In [None]:
# Training configuration
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Loss function
loss = MultipleNegativesRankingLoss(model)

# Arguments
args = SentenceTransformerTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=WARMUP_RATIO,
    fp16=True,  # Mixed precision (requires a GPU)
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

print("Training configured!")
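
To give an intuition for what `MultipleNegativesRankingLoss` optimizes, the following is a simplified pure-Python sketch, not the library's implementation. It assumes cosine similarity scaled by 20 (the library defaults) and treats every positive and negative in the batch as a candidate for each anchor. The function name `in_batch_ranking_loss` and the 2-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_ranking_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must rank positive i first.

    Candidates for every anchor are all positives followed by all
    negatives in the batch, so other examples act as extra negatives.
    """
    candidates = positives + negatives
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidate scores.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        # Cross-entropy with the matching positive (index i) as the target.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings: each anchor points the same way as its positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

good_loss = in_batch_ranking_loss(anchors, positives, negatives)
# Swapping the positives makes each anchor prefer the wrong candidate.
bad_loss = in_batch_ranking_loss(anchors, positives[::-1], negatives)
print(f"aligned={good_loss:.6f} swapped={bad_loss:.2f}")
```

Because other examples in the batch serve as negatives, a larger `BATCH_SIZE` gives each anchor more candidates to discriminate against, at the cost of more VRAM.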

In [None]:
# Training
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    loss=loss,
)

print("Starting training...")
trainer.train()
print("Training finished!")

In [None]:
# Save the model
model.save(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}/")

In [None]:
# Test the model (queries stay in French, the target language)
test_queries = [
    "Quelle est la regle du toucher-jouer ?",
    "Comment fonctionne le roque ?",
    "Que faire en cas de partie nulle ?",
]

print("Testing the fine-tuned model:")
for q in test_queries:
    emb = model.encode(q)
    print(f"  {q[:40]}... -> dim={emb.shape[0]}")

In [None]:
# Download the model
import shutil
from google.colab import files

# Create an archive
shutil.make_archive(OUTPUT_DIR, 'zip', OUTPUT_DIR)
print(f"Archive created: {OUTPUT_DIR}.zip")

# Download
files.download(f"{OUTPUT_DIR}.zip")
print("Download started! The file will download automatically.")
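
Because `save_strategy="epoch"` writes `checkpoint-<step>` folders into the same `OUTPUT_DIR` that `model.save` uses, the archive above also contains those checkpoints, which can make the download much larger. A possible alternative, sketched here with the standard library only, skips them. The helper name `archive_without_checkpoints` is hypothetical, and the sketch assumes the zip file is written outside the model directory.

```python
import os
import zipfile


def archive_without_checkpoints(model_dir, zip_path):
    """Zip `model_dir` into `zip_path`, skipping `checkpoint-*` subfolders.

    Paths inside the archive are relative to `model_dir`, matching what
    shutil.make_archive(base, "zip", model_dir) would produce.
    `zip_path` is assumed to live outside `model_dir`.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, filenames in os.walk(model_dir):
            # Pruning `dirs` in place stops os.walk from descending into them.
            dirs[:] = [d for d in dirs if not d.startswith("checkpoint-")]
            for name in filenames:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, model_dir))
    return zip_path
```

In the notebook this could be called as `archive_without_checkpoints(OUTPUT_DIR, f"{OUTPUT_DIR}.zip")` in place of `shutil.make_archive`.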

## Next steps

1. The `embedding-chess-fr.zip` file downloads automatically
2. Extract it into `models/`
3. Run `python -m scripts.training.evaluate_finetuned`
4. If recall >= 80%, regenerate the corpus embeddings
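
The evaluation script is not shown in this notebook, so the following is only a generic sketch of how a recall threshold like the one in step 4 might be checked. It assumes one relevant passage per query and a cutoff `k`. The function `recall_at_k`, the passage ids, and `k=2` are all hypothetical.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=5):
    """Fraction of queries whose relevant passage appears in the top k.

    `ranked_ids_per_query[i]` is the retrieval ranking for query i,
    best match first; `relevant_id_per_query[i]` is its single gold id.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Made-up rankings for five queries, for illustration only.
rankings = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
    ["m", "n", "o"],
]
gold = ["a", "e", "z", "l", "m"]

recall = recall_at_k(rankings, gold, k=2)
meets_threshold = recall >= 0.80
print(f"recall@2={recall:.2f} meets_threshold={meets_threshold}")
```

In this toy case three of the five gold passages land in the top 2, so the 80% bar is not met and the corpus embeddings would not be regenerated.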