# Concept GNN Pipeline – Colab GPU A100
This notebook orchestrates the full pipeline (generation, training, evaluation) from the GitHub repo **pascalseppey/concept-gnn-pipeline**. Each step is isolated so it can be validated before a large-scale launch.

## 1. Prerequisites

- Colab runtime: **GPU A100** (Runtime > Change runtime type).
- GitHub repo: `https://github.com/pascalseppey/concept-gnn-pipeline.git`.

In [None]:
!nvidia-smi
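The `nvidia-smi` output above has to be read by eye. As a minimal sketch, the same check can be made programmatically so that later cells can warn when the runtime is not an A100. The helper names `list_gpu_names` and `has_a100` are illustrative and not part of the repo; the substring match on `"A100"` is an assumption about how the device name is reported.

```python
import subprocess


def list_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # No driver / no GPU runtime attached.
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


def has_a100(gpu_names):
    """True if at least one reported device name mentions A100 (assumed naming)."""
    return any("A100" in name for name in gpu_names)


gpu_names = list_gpu_names()
if not has_a100(gpu_names):
    print(f"Warning: no A100 detected (found: {gpu_names or 'no GPU'})")
```

If the warning appears, switch the runtime type before running the heavier cells.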

## 2. Clone the repo and install dependencies

In [None]:
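import os  # required for os.path.exists below

REPO_URL = "https://github.com/pascalseppey/concept-gnn-pipeline.git"
REPO_DIR = "/content/concept-gnn-pipeline"

# Clone on the first run; otherwise update the existing checkout.
if not os.path.exists(REPO_DIR):
    !git clone --depth 1 {REPO_URL} {REPO_DIR}
else:
    %cd {REPO_DIR}
    !git pull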

%cd {REPO_DIR}
!pip install -U pip
!pip install -r requirements.txt

## 3. Inspect / adapt the configs

In [None]:
!ls config
!cat config/bins.yml

## 4. Deterministic metric sweep

In [None]:
!python scripts/effect_metric_sweep.py --config config/bins.yml --output data/logs/effect_metric_sweep.csv
!head -n 5 data/logs/effect_metric_sweep.csv

## 5. Dataset generation (controlled coverage)

In [None]:
DATASET_PATH = 'data/logs/train_dataset.jsonl'
!python scripts/generate_dataset.py --config config/generator.yml --max-samples 20000 --coverage-threshold 0.7 --min-per-bin 20 --output {DATASET_PATH}
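The exact semantics of `--coverage-threshold` and `--min-per-bin` live in `scripts/generate_dataset.py` and are not spelled out here. One plausible reading, shown below purely as an assumption, is that a bin counts as "covered" once it holds at least `min_per_bin` samples, and that generation may stop once the fraction of covered bins reaches the threshold. The function `bin_coverage` and the toy bin names are hypothetical.

```python
from collections import Counter


def bin_coverage(bin_ids, all_bins, min_per_bin):
    """Fraction of bins holding at least `min_per_bin` samples (assumed definition)."""
    counts = Counter(bin_ids)
    filled = sum(1 for b in all_bins if counts[b] >= min_per_bin)
    return filled / len(all_bins)


# Toy example: 4 bins, only two of them reach 20 samples.
all_bins = ["low", "mid", "high", "extreme"]
samples = ["low"] * 25 + ["mid"] * 20 + ["high"] * 5
coverage = bin_coverage(samples, all_bins, min_per_bin=20)
print(coverage)  # 0.5, below a 0.7 threshold
```

Under this reading, a run with `--coverage-threshold 0.7 --min-per-bin 20` would keep sampling the toy dataset above, since only half of the bins are filled.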

## 6. Quick dataset inspection

In [None]:
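import json
import pandas as pd

MAX_RECORDS = 5000  # inspect only the first records to keep this cell fast
records = []
with open(DATASET_PATH) as f:
    for idx, line in enumerate(f):
        if idx >= MAX_RECORDS:
            break
        rec = json.loads(line)
        # Flatten: one row per sequence, with its metrics as columns.
        metrics = dict(rec['metrics'])
        metrics['sequence_id'] = rec['sequence_id']
        records.append(metrics)

df = pd.DataFrame(records)
df.describe()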
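The loading loop above assumes every line is valid JSON with both a `metrics` and a `sequence_id` key, and fails on the first line that is not. A minimal sketch of a pre-check is given below; `scan_jsonl` is an illustrative helper, not part of the repo, and the sample lines are made up for the demonstration.

```python
import io
import json

REQUIRED_KEYS = ("sequence_id", "metrics")  # keys read by the inspection cell


def scan_jsonl(lines, required=REQUIRED_KEYS):
    """Return (n_valid, problems); problems is a list of (line_no, reason)."""
    n_valid, problems = 0, []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(rec, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = [k for k in required if k not in rec]
        if missing:
            problems.append((line_no, f"missing keys: {missing}"))
        else:
            n_valid += 1
    return n_valid, problems


# Made-up sample: one good record, one incomplete record, one broken line.
sample = io.StringIO(
    '{"sequence_id": 1, "metrics": {"fft_anisotropy": 0.3}}\n'
    '{"sequence_id": 2}\n'
    'not json\n'
)
n_valid, problems = scan_jsonl(sample)
print(n_valid, problems)
```

On the real file, the same function can be called with the open file handle, for example `scan_jsonl(open(DATASET_PATH))` inside a `with` block.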

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6,4))
sns.histplot(df['fft_anisotropy'], bins=30)
plt.title('FFT anisotropy')
plt.show()

## 7. GNN training

In [None]:
LOG_DIR = 'data/logs/run_colab'
!python scripts/train_gnn.py --config config/train.yml --dataset {DATASET_PATH} --log-dir {LOG_DIR} --epochs 20 --checkpoint-every 5

## 8. Metrics tracking

In [None]:
import json
import matplotlib.pyplot as plt
# One JSON record per line; the with-block closes the file once it is read.
with open(f'{LOG_DIR}/metrics_log.jsonl') as f:
    records = [json.loads(line) for line in f]
epochs = [r['epoch'] for r in records]
train_loss = [r['train_loss'] for r in records]
train_acc = [r['train_acc'] for r in records]
val_acc = [r['val_acc'] for r in records]
fig, ax1 = plt.subplots(figsize=(7,4))
# Loss on the left axis, accuracies on the right axis.
ax1.plot(epochs, train_loss, color='tab:red', label='Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax2 = ax1.twinx()
ax2.plot(epochs, train_acc, color='tab:blue', label='Train Acc')
ax2.plot(epochs, val_acc, color='tab:green', label='Val Acc')
ax2.set_ylabel('Accuracy')
fig.legend(loc='upper center', bbox_to_anchor=(0.5,1.1), ncol=3)
plt.show()
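Beyond the plot, it can help to pull out the epoch with the best validation accuracy and the train/validation gap at that point. The sketch below assumes the same record fields as the cell above (`epoch`, `train_acc`, `val_acc`); the helper `best_epoch` and the numbers in `demo_log` are invented for illustration and are not results from this pipeline.

```python
def best_epoch(log, key="val_acc"):
    """Return the log record with the highest value for `key`."""
    return max(log, key=lambda r: r[key])


# Illustrative values only, not real training output.
demo_log = [
    {"epoch": 1, "train_loss": 1.20, "train_acc": 0.40, "val_acc": 0.38},
    {"epoch": 2, "train_loss": 0.80, "train_acc": 0.65, "val_acc": 0.60},
    {"epoch": 3, "train_loss": 0.50, "train_acc": 0.85, "val_acc": 0.58},
]

best = best_epoch(demo_log)
gap = best["train_acc"] - best["val_acc"]
print(f"best epoch: {best['epoch']}, val_acc: {best['val_acc']}, gap: {gap:.2f}")
```

A validation accuracy that peaks early while training accuracy keeps rising, as in the made-up epoch 3 above, is a common sign of overfitting.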

## 9. Evaluation & inversion

In [None]:
# Assumes the 20-epoch run above wrote ckpt_epoch20.pt into LOG_DIR.
!python scripts/evaluate_gnn.py --checkpoint {LOG_DIR}/ckpt_epoch20.pt --dataset {DATASET_PATH} --topk 5
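The checkpoint name `ckpt_epoch20.pt` is hard-coded, so it breaks if the number of epochs changes. A minimal sketch for picking the latest checkpoint by epoch number follows. It assumes all checkpoints follow the `ckpt_epoch<N>.pt` naming seen above; `latest_checkpoint` is an illustrative helper, not part of the repo. A numeric comparison is used because a plain string sort would rank `ckpt_epoch5.pt` after `ckpt_epoch20.pt`.

```python
import re

# Naming pattern assumed from the checkpoint path used in this notebook.
CKPT_PATTERN = re.compile(r"ckpt_epoch(\d+)\.pt$")


def latest_checkpoint(names):
    """Return the name with the highest epoch number, or None if none match."""
    best = None
    for name in names:
        match = CKPT_PATTERN.search(str(name))
        if match is None:
            continue  # skip logs and other files
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, str(name))
    return best[1] if best else None


demo_files = ["ckpt_epoch5.pt", "ckpt_epoch20.pt", "ckpt_epoch10.pt", "metrics_log.jsonl"]
print(latest_checkpoint(demo_files))  # ckpt_epoch20.pt
```

In the notebook, the list of names would come from `os.listdir(LOG_DIR)`, and the result can be passed to `--checkpoint` after joining it with `LOG_DIR`.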

## 10. Export artifacts

In [None]:
!tar -czf run_colab_artifacts.tar.gz {LOG_DIR} data/logs/effect_metric_sweep.csv
from google.colab import files
files.download('run_colab_artifacts.tar.gz')
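Before downloading, it can be worth confirming that the archive actually contains the expected files. The sketch below uses Python's standard `tarfile` module; `archive_members` is an illustrative helper, and the demo builds a throwaway archive in a temporary directory rather than touching the real run artifacts.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_members(archive_path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo on a throwaway archive containing a single made-up log file.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "metrics_log.jsonl").write_text('{"epoch": 1}\n')
    demo_archive = tmp / "demo.tar.gz"
    with tarfile.open(demo_archive, "w:gz") as tar:
        tar.add(tmp / "metrics_log.jsonl", arcname="run/metrics_log.jsonl")
    members = archive_members(demo_archive)

print(members)  # ['run/metrics_log.jsonl']
```

On the real archive, `archive_members('run_colab_artifacts.tar.gz')` should list the contents of the log directory and the sweep CSV.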

## 11. Next steps

- Tune the generation (max-samples, coverage).
- Extend the architecture (multi-scale attention, multi-head).
- Move to multi-GPU streaming (GH200/H100).
- Integrate WebP inversion/estimation into a production service.