# Synthetic Triplet Generation - Pocket Arbiter

> **Goal**: Generate synthetic questions to fine-tune an embedding model
>
> **API**: Cerebras (free, 24M tokens/day, ~2600 tok/s)
>
> **Corpus**: FFE (French) + FIDE (English)

## Kaggle Setup

1. **Add Secret**: Settings > Secrets > Add `CEREBRAS_API_KEY`
2. **Upload Data**: Upload `chunks_for_embedding_fr.json` and `chunks_for_embedding_intl.json`
3. **Run All**: Run all cells

In [None]:
# Install dependencies
!pip install -q cerebras-cloud-sdk

In [None]:
import json
import time
import os
from pathlib import Path
from datetime import datetime
from typing import Optional

from cerebras.cloud.sdk import Cerebras

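# Kaggle Secrets, falling back to the environment outside Kaggle
try:
    from kaggle_secrets import UserSecretsClient
    secrets = UserSecretsClient()
    CEREBRAS_API_KEY = secrets.get_secret("CEREBRAS_API_KEY")
    print("API key loaded from Kaggle Secrets")
except Exception:
    CEREBRAS_API_KEY = os.environ.get("CEREBRAS_API_KEY")
    if CEREBRAS_API_KEY:
        print("API key loaded from environment")
    else:
        print("WARNING: CEREBRAS_API_KEY not found in Kaggle Secrets or environment")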

# Configuration
MODEL = "llama-3.3-70b"  # Fast, capable model
QUESTIONS_PER_CHUNK = 3
MAX_RETRIES = 3
RETRY_DELAY = 2

## Domain Question Categories

Three categories per corpus, tailored to the real needs of chess arbiters:

In [None]:
# Question categories per corpus (FFE entries stay in French: they seed the French prompt)
QUESTION_CATEGORIES = {
    "ffe": {
        "arbitre_terrain": {
            "description": "Questions d'un arbitre experimente sur le terrain (cas particuliers)",
            "examples": [
                "Quel sera l'Elo d'un joueur apres ce tournoi avec ces resultats?",
                "Quels sont les affichages obligatoires en salle de jeu?",
                "Un portable sonne pendant la partie, quelle sanction?",
                "Le joueur arrive 10 minutes apres le debut, est-il forfait?",
                "Comment gerer une reclamation sur un mat illegal?",
                "Quelle est la procedure si un joueur refuse de signer la feuille?",
            ],
        },
        "arbitre_organisateur": {
            "description": "Questions pour l'organisation d'un tournoi ou formation arbitrale",
            "examples": [
                "Quelles sont les conditions pour devenir arbitre AF2?",
                "Quel est le delai d'homologation d'un tournoi FFE?",
                "Quel materiel est obligatoire pour un tournoi homologue?",
                "Comment calculer les droits d'engagement?",
                "Quelle est la procedure pour homologuer un open?",
                "Combien d'arbitres faut-il pour un tournoi de 100 joueurs?",
            ],
        },
        "question_joueur": {
            "description": "Questions posees par des joueurs (langage oral/verbal)",
            "examples": [
                "J'ai le droit de proposer nulle quand exactement?",
                "Il a touche une piece, il doit la jouer non?",
                "C'est quoi le departage au juste?",
                "Je peux aller aux toilettes pendant ma partie?",
                "Mon adversaire ecrit ses coups avant de jouer, c'est legal?",
                "J'ai oublie d'appuyer sur la pendule, qu'est-ce qui se passe?",
            ],
        },
    },
    "fide": {
        "arbiter_field": {
            "description": "Questions from an experienced arbiter during a tournament",
            "examples": [
                "What is the penalty for a mobile phone ringing?",
                "How to handle a claim of threefold repetition?",
                "When can a player claim a draw under the 50-move rule?",
                "What happens if a player makes an illegal move in time trouble?",
                "How to proceed when both flags have fallen?",
            ],
        },
        "arbiter_organizer": {
            "description": "Questions for tournament organization or arbiter certification",
            "examples": [
                "What are the requirements to become a FIDE Arbiter?",
                "What equipment is mandatory for a FIDE-rated tournament?",
                "How to submit results for FIDE rating?",
                "What is the time control for classical FIDE games?",
            ],
        },
        "player_question": {
            "description": "Questions asked by players (natural spoken language)",
            "examples": [
                "Can I offer a draw before making my move?",
                "My opponent touched a piece, does he have to move it?",
                "What's the tiebreak system in this tournament?",
                "Am I allowed to leave the playing hall?",
            ],
        },
    },
}

## Generation Prompts

In [None]:
def build_prompt_ffe(categories: dict, num_questions: int) -> str:
    """Build system prompt for FFE corpus (French)."""
    cat_terrain = categories["arbitre_terrain"]
    cat_orga = categories["arbitre_organisateur"]
    cat_joueur = categories["question_joueur"]

    examples_terrain = "\n".join(f"  - {ex}" for ex in cat_terrain["examples"][:3])
    examples_orga = "\n".join(f"  - {ex}" for ex in cat_orga["examples"][:3])
    examples_joueur = "\n".join(f"  - {ex}" for ex in cat_joueur["examples"][:3])

    return f"""Tu es un arbitre d'echecs FFE experimente (AF3 minimum).
Tu generes des questions REALISTES que l'on te pose sur le terrain ou en formation.

TROIS CATEGORIES DE QUESTIONS (varier obligatoirement):

1. ARBITRE TERRAIN - Cas particuliers concrets en competition:
{examples_terrain}

2. ARBITRE ORGANISATEUR - Organisation tournoi ou formation arbitrale:
{examples_orga}

3. JOUEUR - Questions orales d'un joueur (langage familier, verbal):
{examples_joueur}

REGLES STRICTES:
- Langue: FRANCAIS uniquement
- La reponse DOIT etre trouvable dans le texte fourni
- Style: questions naturelles, pas academiques
- Jargon FFE: Elo, cadence, forfait, appariement, homologation, departage
- Genere {num_questions} questions: au moins 1 de chaque categorie si possible

FORMAT JSON UNIQUEMENT (pas de texte autour):
{{"questions": [{{"question": "...", "category": "arbitre_terrain|arbitre_organisateur|question_joueur", "difficulty": "easy|medium|hard"}}]}}"""


def build_prompt_fide(categories: dict, num_questions: int) -> str:
    """Build system prompt for FIDE corpus (English)."""
    cat_field = categories["arbiter_field"]
    cat_orga = categories["arbiter_organizer"]
    cat_player = categories["player_question"]

    examples_field = "\n".join(f"  - {ex}" for ex in cat_field["examples"][:3])
    examples_orga = "\n".join(f"  - {ex}" for ex in cat_orga["examples"][:3])
    examples_player = "\n".join(f"  - {ex}" for ex in cat_player["examples"][:3])

    return f"""You are an experienced FIDE arbiter (IA or FA level).
Generate REALISTIC questions that arbiters or players ask during tournaments.

THREE CATEGORIES OF QUESTIONS (must vary):

1. ARBITER ON FIELD - Specific cases during competition:
{examples_field}

2. ARBITER/ORGANIZER - Tournament organization or certification:
{examples_orga}

3. PLAYER - Oral questions from players (casual spoken language):
{examples_player}

STRICT RULES:
- Language: ENGLISH only
- The answer MUST be found in the provided text
- Style: natural questions, not academic
- FIDE terminology: rating, time control, forfeit, pairing, tiebreak
- Generate {num_questions} questions: at least 1 from each category if possible

JSON FORMAT ONLY (no surrounding text):
{{"questions": [{{"question": "...", "category": "arbiter_field|arbiter_organizer|player_question", "difficulty": "easy|medium|hard"}}]}}"""

## Generation Function

In [None]:
def generate_questions(
    client: Cerebras,
    chunk_text: str,
    chunk_id: str,
    corpus: str = "ffe",
    num_questions: int = 3,
) -> list[dict]:
    """
    Generate synthetic questions for a chunk using Cerebras API.

    Args:
        client: Cerebras client
        chunk_text: Text content of the chunk
        chunk_id: Unique identifier for the chunk
        corpus: "ffe" (French) or "fide" (English)
        num_questions: Number of questions to generate

    Returns:
        List of question dicts with keys: question, category, difficulty, chunk_id
    """
    categories = QUESTION_CATEGORIES.get(corpus, QUESTION_CATEGORIES["ffe"])

    if corpus == "fide":
        system_prompt = build_prompt_fide(categories, num_questions)
        user_prompt = f"""FIDE regulatory text:
\"\"\"
{chunk_text[:2500]}
\"\"\"

Generate exactly {num_questions} varied questions based on this text."""
    else:
        system_prompt = build_prompt_ffe(categories, num_questions)
        user_prompt = f"""Texte reglementaire FFE:
\"\"\"
{chunk_text[:2500]}
\"\"\"

Genere exactement {num_questions} questions variees basees sur ce texte."""

    for attempt in range(MAX_RETRIES):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                max_completion_tokens=800,
                temperature=0.7,
                top_p=1,
            )

            content = response.choices[0].message.content
            if content is None:
                continue
            
            content = content.strip()

            # Extract JSON
            if "```json" in content:
                content = content.split("```json")[1].split("```")[0]
            elif "```" in content:
                parts = content.split("```")
                if len(parts) >= 2:
                    content = parts[1]

            # Find JSON object
            start = content.find("{")
            if start >= 0:
                depth = 0
                for i, c in enumerate(content[start:]):
                    if c == "{":
                        depth += 1
                    elif c == "}":
                        depth -= 1
                        if depth == 0:
                            content = content[start:start + i + 1]
                            break

            data = json.loads(content)
            questions = data.get("questions", [])

            # Add metadata
            for q in questions:
                q["chunk_id"] = chunk_id
                q["corpus"] = corpus

            return questions

        except json.JSONDecodeError as e:
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY)
            else:
                print(f"  JSON error {chunk_id}: {e}")
                return []
        except Exception as e:
            if attempt < MAX_RETRIES - 1:
                time.sleep(RETRY_DELAY)
            else:
                print(f"  Error {chunk_id}: {type(e).__name__}: {e}")
                return []

    return []
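
The JSON extraction step inside `generate_questions` can be exercised on its own, without calling the API. The following is a minimal sketch, assuming a hypothetical helper name `extract_json_object` (it is not part of the notebook); it mirrors the fence stripping and brace matching used above. Like the original, it counts braces naively, so a `{` or `}` inside a string value could truncate the object.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def extract_json_object(raw: str) -> dict:
    """Hypothetical helper: parse the first balanced {...} object in an LLM reply."""
    content = raw.strip()
    # Drop a Markdown code fence if the model wrapped its answer in one.
    if FENCE + "json" in content:
        content = content.split(FENCE + "json")[1].split(FENCE)[0]
    elif FENCE in content:
        parts = content.split(FENCE)
        if len(parts) >= 2:
            content = parts[1]
    # Walk forward from the first "{" until the braces balance.
    start = content.find("{")
    if start >= 0:
        depth = 0
        for i, c in enumerate(content[start:]):
            if c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    content = content[start:start + i + 1]
                    break
    return json.loads(content)


fenced_reply = (
    "Voici le JSON:\n" + FENCE + "json\n"
    '{"questions": [{"question": "Q1", "category": "question_joueur", "difficulty": "easy"}]}\n'
    + FENCE
)
parsed_fenced = extract_json_object(fenced_reply)
parsed_plain = extract_json_object('noise {"questions": []} trailing text')
```

Isolating this step makes it possible to test malformed replies offline before spending API quota.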

## Generation Pipeline

In [None]:
def run_generation(
    chunks: list[dict],
    corpus: str = "ffe",
    max_chunks: Optional[int] = None,
    questions_per_chunk: int = 3,
    checkpoint_every: int = 100,
) -> list[dict]:
    """
    Run question generation pipeline.

    Args:
        chunks: List of chunks with 'id' and 'text' keys
        corpus: "ffe" (French) or "fide" (English)
        max_chunks: Limit number of chunks (None = all)
        questions_per_chunk: Questions to generate per chunk
        checkpoint_every: Save checkpoint every N chunks

    Returns:
        List of all generated questions
    """
    # Setup client
    client = Cerebras(api_key=CEREBRAS_API_KEY)

    if max_chunks:
        chunks = chunks[:max_chunks]

    lang = "French" if corpus == "ffe" else "English"
    print("=" * 60)
    print("CEREBRAS TRIPLET GENERATION")
    print("=" * 60)
    print(f"Model: {MODEL}")
    print(f"Corpus: {corpus.upper()} ({lang})")
    print(f"Chunks: {len(chunks)}")
    print(f"Questions/chunk: {questions_per_chunk}")
    print(f"Estimated total: {len(chunks) * questions_per_chunk}")
    print("=" * 60)
    print()

    all_questions: list[dict] = []
    errors = 0
    start_time = time.time()

    for i, chunk in enumerate(chunks):
        chunk_id = chunk.get("id", f"chunk-{i}")
        chunk_text = chunk.get("text", "")

        # Progress
        if i % 10 == 0:
            elapsed = time.time() - start_time
            rate = len(all_questions) / elapsed * 60 if elapsed > 0 else 0
            print(f"[{i:4d}/{len(chunks)}] {len(all_questions):4d} questions ({rate:.1f}/min)")

        questions = generate_questions(
            client=client,
            chunk_text=chunk_text,
            chunk_id=chunk_id,
            corpus=corpus,
            num_questions=questions_per_chunk,
        )

        if questions:
            all_questions.extend(questions)
        else:
            errors += 1

        # Checkpoint
        if (i + 1) % checkpoint_every == 0:
            checkpoint_path = f"checkpoint_{corpus}_{i+1}.json"
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(all_questions, f, ensure_ascii=False, indent=2)
            print(f"  >> Checkpoint saved: {checkpoint_path}")

    # Final stats
    elapsed = time.time() - start_time
    print()
    print("=" * 60)
    print(f"DONE ({corpus.upper()})")
    print("=" * 60)
    print(f"Total questions: {len(all_questions)}")
    print(f"Errors: {errors}")
    print(f"Time: {elapsed/60:.1f} min")
    # Guard against a zero elapsed time (e.g. an empty chunk list)
    rate = len(all_questions) / elapsed * 60 if elapsed > 0 else 0
    print(f"Rate: {rate:.1f} questions/min")

    return all_questions
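
Each checkpoint written by `run_generation` holds the cumulative question list up to that chunk index, so an interrupted run can in principle be resumed from the newest file. The sketch below assumes a hypothetical helper `load_latest_checkpoint` (not part of the notebook) that relies only on the `checkpoint_{corpus}_{N}.json` naming used above.

```python
import json
import re
import tempfile
from pathlib import Path


def load_latest_checkpoint(directory: str, corpus: str) -> tuple[list[dict], int]:
    """Hypothetical helper: return (questions, chunks_done) from the newest checkpoint."""
    pattern = re.compile(rf"checkpoint_{re.escape(corpus)}_(\d+)\.json$")
    best_n, best_path = 0, None
    for path in Path(directory).glob(f"checkpoint_{corpus}_*.json"):
        match = pattern.match(path.name)
        if match and int(match.group(1)) > best_n:
            best_n, best_path = int(match.group(1)), path
    if best_path is None:
        return [], 0  # no checkpoint yet: start from scratch
    with open(best_path, encoding="utf-8") as f:
        return json.load(f), best_n


# Demo with two fake checkpoints in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for n in (100, 200):
        payload = [{"question": f"q{k}", "chunk_id": f"chunk-{k}"} for k in range(n // 100)]
        (Path(tmp) / f"checkpoint_ffe_{n}.json").write_text(json.dumps(payload), encoding="utf-8")
    resumed, done = load_latest_checkpoint(tmp, "ffe")
    missing, zero = load_latest_checkpoint(tmp, "fide")
```

The remaining work would then be `run_generation(chunks[done:], ...)`, with `resumed` prepended to its result. Note that the second run numbers its checkpoints from 100 again, so earlier checkpoint files should be moved aside first.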

## Loading the Chunks

**Option 1**: Upload the JSON files to Kaggle

**Option 2**: Load from GitHub (cell below)

In [None]:
# Option 1: Load from uploaded files
def load_chunks_from_file(path: str) -> list[dict]:
    """Load chunks from uploaded JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Handle both formats: list or {"chunks": [...]}
    return data.get("chunks", data) if isinstance(data, dict) else data

# Option 2: Load from GitHub
def load_chunks_from_github(corpus: str = "fr") -> list[dict]:
    """Load chunks directly from GitHub repo."""
    import urllib.request
    
    base_url = "https://raw.githubusercontent.com/YOUR_USERNAME/pocket_arbiter/main/corpus/processed"
    filename = f"chunks_for_embedding_{corpus}.json"
    url = f"{base_url}/{filename}"
    
    print(f"Loading from: {url}")
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))
    
    chunks = data.get("chunks", data) if isinstance(data, dict) else data
    print(f"Loaded {len(chunks)} chunks")
    return chunks
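
Both loaders accept either a bare list or a `{"chunks": [...]}` wrapper, but a dict without a `chunks` key is returned unchanged and would later be iterated as keys. A stricter normalization could look like the sketch below; `normalize_chunks` is a hypothetical name, and dropping entries with empty text is an assumption about what is useful here, not the notebook's behavior.

```python
def normalize_chunks(data) -> list[dict]:
    """Hypothetical stricter variant of the normalization used by the loaders."""
    if isinstance(data, dict):
        if "chunks" not in data:
            raise ValueError("expected a top-level 'chunks' key")
        data = data["chunks"]
    if not isinstance(data, list):
        raise ValueError("chunks must be a list")
    # Keep only dict entries with non-empty text (assumption: empty chunks are useless).
    return [c for c in data if isinstance(c, dict) and c.get("text")]


as_list = normalize_chunks([{"id": "a", "text": "Article 1"}, {"id": "b", "text": ""}])
as_dict = normalize_chunks({"chunks": [{"id": "c", "text": "Article 2"}]})
```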

In [None]:
# ============================================================
# LOAD THE CHUNKS (uncomment the chosen option)
# ============================================================

# Option 1: Files uploaded to Kaggle
chunks_fr = load_chunks_from_file("/kaggle/input/pocket-arbiter-chunks/chunks_for_embedding_fr.json")
# chunks_intl = load_chunks_from_file("/kaggle/input/pocket-arbiter-chunks/chunks_for_embedding_intl.json")

# Option 2: From GitHub (replace YOUR_USERNAME)
# chunks_fr = load_chunks_from_github("fr")
# chunks_intl = load_chunks_from_github("intl")

print(f"Chunks FR: {len(chunks_fr)}")

## Test on 5 Chunks

In [None]:
# Quick test on 5 chunks
test_questions = run_generation(
    chunks=chunks_fr,
    corpus="ffe",
    max_chunks=5,
    questions_per_chunk=3,
)

# Display the results
print("\n" + "=" * 60)
print("SAMPLE GENERATED QUESTIONS")
print("=" * 60)
for q in test_questions[:9]:
    print(f"\n[{q.get('category', 'N/A')}] {q.get('difficulty', 'N/A')}")
    print(f"  Q: {q['question']}")
    print(f"  Chunk: {q['chunk_id']}")

## Full FFE Generation

In [None]:
# ============================================================
# FULL FFE GENERATION (uncomment to run)
# Estimated time: ~30-60 min for 1827 chunks
# ============================================================

# questions_ffe = run_generation(
#     chunks=chunks_fr,
#     corpus="ffe",
#     max_chunks=None,  # All chunks
#     questions_per_chunk=3,
#     checkpoint_every=100,
# )

# # Save
# output_path = "synthetic_questions_ffe.json"
# with open(output_path, "w", encoding="utf-8") as f:
#     json.dump(questions_ffe, f, ensure_ascii=False, indent=2)
# print(f"\nSaved to: {output_path}")
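
The figures quoted in this notebook allow a quick back-of-the-envelope check before launching the full run. The sketch below uses only numbers stated here (1827 chunks, 3 questions per chunk, `max_completion_tokens=800`, a 24M tokens/day quota, a 30-60 min estimate). Prompt tokens are not counted because their size is not known from the notebook, so the token figure is a ceiling on completions only.

```python
N_CHUNKS = 1827
QUESTIONS_PER_CHUNK = 3
MAX_COMPLETION_TOKENS = 800
MAX_RETRIES = 3
DAILY_TOKEN_QUOTA = 24_000_000

# Upper bound on the number of questions if every chunk succeeds.
expected_questions = N_CHUNKS * QUESTIONS_PER_CHUNK

# Completion-token ceiling: one call per chunk, then the worst case with all retries used.
completion_ceiling = N_CHUNKS * MAX_COMPLETION_TOKENS
completion_ceiling_with_retries = completion_ceiling * MAX_RETRIES

# Seconds per chunk implied by the 30-60 min estimate.
seconds_per_chunk = (30 * 60 / N_CHUNKS, 60 * 60 / N_CHUNKS)

print(f"Expected questions: {expected_questions}")
print(f"Completion ceiling: {completion_ceiling} tokens "
      f"({completion_ceiling / DAILY_TOKEN_QUOTA:.1%} of the daily quota)")
print(f"Implied pace: {seconds_per_chunk[0]:.2f}-{seconds_per_chunk[1]:.2f} s per chunk")
```

Even in the worst case where every chunk uses all three attempts, completions stay well below the quoted daily quota; prompt tokens would need to be added for a full budget.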

## Full FIDE Generation (optional)

In [None]:
# ============================================================
# FULL FIDE GENERATION (uncomment to run)
# ============================================================

# chunks_intl = load_chunks_from_file("/kaggle/input/pocket-arbiter-chunks/chunks_for_embedding_intl.json")

# questions_fide = run_generation(
#     chunks=chunks_intl,
#     corpus="fide",
#     max_chunks=None,
#     questions_per_chunk=3,
#     checkpoint_every=100,
# )

# # Save
# output_path = "synthetic_questions_fide.json"
# with open(output_path, "w", encoding="utf-8") as f:
#     json.dump(questions_fide, f, ensure_ascii=False, indent=2)
# print(f"\nSaved to: {output_path}")

## Statistics and Export

In [None]:
def compute_stats(questions: list[dict]) -> dict:
    """Compute statistics on generated questions."""
    stats = {
        "total": len(questions),
        "by_category": {},
        "by_difficulty": {},
        "unique_chunks": len(set(q.get("chunk_id", "") for q in questions)),
    }
    
    for q in questions:
        cat = q.get("category", "unknown")
        diff = q.get("difficulty", "unknown")
        stats["by_category"][cat] = stats["by_category"].get(cat, 0) + 1
        stats["by_difficulty"][diff] = stats["by_difficulty"].get(diff, 0) + 1
    
    return stats

# Example usage
if 'test_questions' in dir() and test_questions:
    stats = compute_stats(test_questions)
    print("\nStatistics:")
    print(f"  Total: {stats['total']}")
    print(f"  Unique chunks: {stats['unique_chunks']}")
    print(f"  By category: {stats['by_category']}")
    print(f"  By difficulty: {stats['by_difficulty']}")
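
Because the `category` and `difficulty` labels come from the model, some may fall outside the values requested in the prompt and show up as unexpected keys in the statistics. A possible filter is sketched below; `split_valid` is a hypothetical helper, and the allowed sets are copied from the FFE prompt format above.

```python
ALLOWED_CATEGORIES = {"arbitre_terrain", "arbitre_organisateur", "question_joueur"}
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def split_valid(questions: list[dict]) -> tuple[list[dict], list[dict]]:
    """Hypothetical helper: separate questions with expected labels from the rest."""
    valid, rejected = [], []
    for q in questions:
        ok = (
            isinstance(q.get("question"), str)
            and q["question"].strip() != ""
            and q.get("category") in ALLOWED_CATEGORIES
            and q.get("difficulty") in ALLOWED_DIFFICULTIES
        )
        (valid if ok else rejected).append(q)
    return valid, rejected


sample = [
    {"question": "C'est quoi le departage au juste?", "category": "question_joueur", "difficulty": "easy"},
    {"question": "Quelle sanction?", "category": "arbitre", "difficulty": "medium"},  # unknown category
    {"question": "", "category": "arbitre_terrain", "difficulty": "hard"},  # empty question
]
valid, rejected = split_valid(sample)
```

For the FIDE corpus, the category set would be swapped for `arbiter_field`, `arbiter_organizer`, and `player_question`.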

## Downloading the Results

After generation, download the JSON files from Kaggle's **Output** tab,
then commit them to the GitHub repo under `data/synthetic_triplets/`.
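
The generated files contain questions tagged with a `chunk_id`, not finished triplets. One way to turn them into (anchor, positive, negative) records is sketched below. Everything here is an assumption for illustration: the helper name `build_triplets`, the output keys, the random choice of negative, and the requirement that every chunk carries an `id` and a `text`.

```python
import random


def build_triplets(questions: list[dict], chunks: list[dict], seed: int = 0) -> list[dict]:
    """Hypothetical helper: pair each question with its source chunk and a random other chunk."""
    rng = random.Random(seed)
    text_by_id = {c["id"]: c["text"] for c in chunks}
    ids = list(text_by_id)
    triplets = []
    for q in questions:
        pos_id = q["chunk_id"]
        if pos_id not in text_by_id or len(ids) < 2:
            continue  # need a known positive and at least one distinct negative
        neg_id = rng.choice([i for i in ids if i != pos_id])
        triplets.append({
            "anchor": q["question"],
            "positive": text_by_id[pos_id],
            "negative": text_by_id[neg_id],
        })
    return triplets


demo_chunks = [{"id": "c1", "text": "T1"}, {"id": "c2", "text": "T2"}]
demo_questions = [
    {"question": "Q?", "chunk_id": "c1"},
    {"question": "Orphan", "chunk_id": "zz"},  # unknown chunk: skipped
]
triplets = build_triplets(demo_questions, demo_chunks)
```

Random negatives are the simplest option; a randomly drawn chunk may occasionally also answer the question, so harder or filtered negatives may be preferable in practice.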

In [None]:
# List the generated files
import os
print("Generated files:")
for f in os.listdir("."):
    if f.endswith(".json"):
        size = os.path.getsize(f) / 1024
        print(f"  {f} ({size:.1f} KB)")