## T5-Nano (Python → C++) Master Pipeline

This notebook runs the complete project pipeline:

- Data prep (XLCoST snippet-level Python↔C++)
- Tokenizer training (Byte-Level BPE, vocab=16k)
- Model init (T5-Nano, random weights)
- Training (Seq2SeqTrainer)
- Inference demo

It also includes visualizations at each stage (length histograms, tokenizer stats, training curves).

**Assumptions**
- Run from the repo root.
- Dependencies installed from `requirements.txt`.
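
Because every step below calls scripts by relative path, a quick pre-flight check can catch a wrong working directory early. The following is a minimal sketch: `missing_repo_files` is a hypothetical helper, and the file list is simply the set of files this notebook refers to.

```python
from pathlib import Path
from typing import List

# Files this notebook references by relative path.
EXPECTED_FILES = (
    "requirements.txt",
    "data_prep.py",
    "train_tokenizer.py",
    "model_config.py",
    "train.py",
    "inference.py",
)


def missing_repo_files(root: Path) -> List[str]:
    """Return the expected files that are absent under `root`."""
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_repo_files(Path.cwd())
if missing:
    print("Not at the repo root? Missing:", ", ".join(missing))
```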


In [None]:
from __future__ import annotations

import json
import os
import subprocess
from pathlib import Path

REPO_ROOT = Path.cwd()
DATA_RAW = REPO_ROOT / "data" / "raw"
DATA_PROCESSED = REPO_ROOT / "data" / "processed"
TOKENIZER_DIR = REPO_ROOT / "custom_tokenizer"
CHECKPOINT_DIR = REPO_ROOT / "t5_nano_checkpoints"
FINAL_MODEL_DIR = REPO_ROOT / "final_model"

# Toggle this to run fast sanity-checks.
QUICK_RUN = True
MAX_SAMPLES = 2000 if QUICK_RUN else None  # per split cap in data_prep.py
EPOCHS = 1 if QUICK_RUN else 30
BATCH_SIZE = 8 if QUICK_RUN else 32

print("repo:", REPO_ROOT)
print("quick_run:", QUICK_RUN)


## 0) (Optional) Install dependencies

If you’re in a fresh environment:

```bash
pip install -r requirements.txt
```

If the install fails with SSL or certificate errors, fix your Python certificate setup first. On macOS, the python.org installer ships an `Install Certificates.command` script that has to be run once after installation.

## 1) Data prep (XLCoST → `data/processed/`)

This runs `data_prep.py` to:
- Download + extract XLCoST (snippet-level)
- Build `train/validation/test` parallel pairs
- Write `corpus.txt` for tokenizer training
- Save the Arrow dataset (if `datasets` is installed)
- Always export JSONL
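
The two export steps can be pictured with a short sketch. This is not the code in `data_prep.py`: `write_split` and `write_corpus` are hypothetical helpers, and the only detail taken from this notebook is that each record has `source` and `target` fields. Writing both sides of every pair to `corpus.txt`, flattened to one line each, is an assumption.

```python
import json
from pathlib import Path
from typing import Dict, Iterable

Pair = Dict[str, str]  # {"source": <Python snippet>, "target": <C++ snippet>}


def write_split(pairs: Iterable[Pair], path: Path) -> int:
    """Write one JSON object per line (JSONL); return the number of rows."""
    rows = 0
    with path.open("w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
            rows += 1
    return rows


def write_corpus(pairs: Iterable[Pair], path: Path) -> None:
    """Write both sides of every pair as plain text, one snippet per line.

    Assumption: newlines inside a snippet are flattened to spaces so that
    each snippet occupies exactly one line of the corpus.
    """
    with path.open("w", encoding="utf-8") as f:
        for pair in pairs:
            for side in ("source", "target"):
                f.write(pair[side].replace("\n", " ") + "\n")
```

A file written by `write_split` can be read back with `pd.read_json(path, lines=True)`, which is how the inspection cell below loads the JSONL fallback.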

In [None]:
import sys

# Use the notebook's own interpreter so the script runs in the same environment.
cmd = [sys.executable, "data_prep.py"]
if MAX_SAMPLES is not None:
    cmd += ["--max_samples", str(MAX_SAMPLES)]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)

print("\nProduced:")
for p in sorted(DATA_PROCESSED.glob("*")):
    print("-", p)


In [None]:
import pandas as pd

# Load dataset for inspection.
# Prefer Arrow dataset (fast); fallback to JSONL if needed.
arrow_dir = DATA_PROCESSED / "xlcost_py_cpp_snippet"

if arrow_dir.exists():
    from datasets import load_from_disk

    ds = load_from_disk(str(arrow_dir))
    train_df = ds["train"].to_pandas()
    val_df = ds["validation"].to_pandas() if "validation" in ds else None
    test_df = ds["test"].to_pandas() if "test" in ds else None
else:
    train_df = pd.read_json(DATA_PROCESSED / "train.jsonl", lines=True)
    val_df = pd.read_json(DATA_PROCESSED / "validation.jsonl", lines=True)
    test_df = pd.read_json(DATA_PROCESSED / "test.jsonl", lines=True)

print("train rows:", len(train_df))
print("val rows:", len(val_df) if val_df is not None else None)
print("test rows:", len(test_df) if test_df is not None else None)
train_df.head(3)


In [None]:
# Visualize dataset lengths (character length)
import matplotlib.pyplot as plt

train_df["source_len"] = train_df["source"].astype(str).map(len)
train_df["target_len"] = train_df["target"].astype(str).map(len)

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(train_df["source_len"], bins=50)
ax[0].set_title("Train source char length")
ax[0].set_xlabel("chars")

ax[1].hist(train_df["target_len"], bins=50)
ax[1].set_title("Train target char length")
ax[1].set_xlabel("chars")

plt.tight_layout()
plt.show()

print(train_df[["source_len", "target_len"]].describe(percentiles=[0.5, 0.9, 0.95, 0.99]))


## 2) Train tokenizer (Byte-Level BPE)

This runs `train_tokenizer.py` to create:
- `custom_tokenizer/vocab.json`
- `custom_tokenizer/merges.txt`

Then we load it and inspect basic stats.
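
To make the contents of `merges.txt` less abstract, here is a toy sketch of the BPE merge loop over raw UTF-8 bytes. It is an illustration only, not what `train_tokenizer.py` does: `learn_bpe_merges` is a hypothetical helper, and it skips the pre-tokenization step and the vocabulary-size stopping rule that real byte-level BPE trainers use.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe_merges(text: str, num_merges: int) -> List[Tuple[bytes, bytes]]:
    """Greedily learn up to `num_merges` merges of the most frequent adjacent pair."""
    # Start with one symbol per UTF-8 byte.
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    merges: List[Tuple[bytes, bytes]] = []

    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not compress anything
        merges.append((left, right))

        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    return merges


print(learn_bpe_merges("abababab", num_merges=5))
# [(b'a', b'b'), (b'ab', b'ab')]
```

Each learned pair corresponds to one line of a merges file; the vocabulary is the 256 base bytes plus one entry per merge (and any special tokens).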

In [None]:
import sys

# Use the notebook's own interpreter so the script runs in the same environment.
cmd = [sys.executable, "train_tokenizer.py"]
print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)

print("Tokenizer files:")
for p in sorted(TOKENIZER_DIR.glob("*")):
    print("-", p)


In [None]:
import model_config

tok = model_config.load_tokenizer()
print("vocab_size:", tok.vocab_size)
print("special tokens:")
print({
    "bos": tok.bos_token,
    "pad": tok.pad_token,
    "eos": tok.eos_token,
    "unk": tok.unk_token,
    "mask": tok.mask_token,
})

sample = train_df.iloc[0]["source"]
enc = tok(sample)
print("\nsample chars:", len(sample))
print("sample tokens:", len(enc["input_ids"]))
print("first 30 token ids:", enc["input_ids"][:30])


In [None]:
# Token length distribution
import matplotlib.pyplot as plt

def token_len(s: str) -> int:
    return len(tok(s, truncation=False)["input_ids"])

lens = train_df["source"].astype(str).head(2000).map(token_len)
plt.figure(figsize=(8, 4))
plt.hist(lens, bins=50)
plt.title("Token length distribution (first 2k sources)")
plt.xlabel("tokens")
plt.ylabel("count")
plt.show()

print(lens.describe(percentiles=[0.5, 0.9, 0.95, 0.99]))


## 3) Build T5-Nano (random init) + verify parameter count

This uses `model_config.py` (no pretrained weights).
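
Before building the model, it can help to estimate how a given set of dimensions maps to a parameter count. The sketch below is a back-of-the-envelope formula for an original-style T5 (tied input/output embeddings, non-gated feed-forward, no bias terms). `t5_param_count` is a hypothetical helper, and the "nano" dimensions in the usage example are illustrative assumptions, not the values in `model_config.py`.

```python
from typing import Optional


def t5_param_count(
    vocab_size: int,
    d_model: int,
    d_ff: int,
    num_heads: int,
    d_kv: int,
    num_layers: int,
    num_decoder_layers: Optional[int] = None,
    num_buckets: int = 32,
) -> int:
    """Estimate parameters of a T5 with tied embeddings and a non-gated FFN."""
    if num_decoder_layers is None:
        num_decoder_layers = num_layers

    inner = num_heads * d_kv
    attn = 4 * d_model * inner        # q, k, v, o projections (no biases)
    ffn = 2 * d_model * d_ff          # wi and wo
    rel_bias = num_buckets * num_heads  # relative position bias, first layer only

    enc_layer = attn + ffn + 2 * d_model      # self-attn + FFN + 2 layer norms
    dec_layer = 2 * attn + ffn + 3 * d_model  # self-attn + cross-attn + FFN + 3 layer norms

    encoder = num_layers * enc_layer + rel_bias + d_model  # + final layer norm
    decoder = num_decoder_layers * dec_layer + rel_bias + d_model
    return vocab_size * d_model + encoder + decoder


# Sanity check against the published t5-small configuration (60,506,624 parameters).
assert t5_param_count(32128, 512, 2048, 8, 64, 6) == 60_506_624

# Hypothetical "nano" dimensions, chosen only to illustrate the range check.
nano = t5_param_count(vocab_size=16_000, d_model=512, d_ff=2048,
                      num_heads=8, d_kv=64, num_layers=4)
print(f"{nano:,}", 20_000_000 <= nano <= 40_000_000)
```

The authoritative number is still the one reported by `model_config.count_parameters(model)` in the cell below.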

In [None]:
model = model_config.build_t5_nano(tok)
params = model_config.count_parameters(model)
print(f"T5-Nano parameter count: {params:,}")
print("in_expected_range:", 20_000_000 <= params <= 40_000_000)


## 4) Train

This runs `train.py`.

Notes:
- `fp16=True` requires a CUDA GPU. If you’re on CPU, edit `train.py` to set `fp16=False`.
- In `QUICK_RUN` mode, training uses 1 epoch and a batch size of 8 (instead of 30 epochs and a batch size of 32).
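
Rather than editing the `fp16` setting by hand, the value could be derived from the hardware. The following is a minimal sketch; `cuda_available` is a hypothetical helper, and whether `train.py` exposes a way to receive the result is not known from this notebook. Inside `train.py`, the value would typically be passed as `fp16=use_fp16` to `Seq2SeqTrainingArguments`.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


use_fp16 = cuda_available()
print("fp16 recommended:", use_fp16)
```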

In [None]:
import sys

# Use the notebook's own interpreter so the script runs in the same environment.
cmd = [
    sys.executable,
    "train.py",
    "--per_device_batch_size",
    str(BATCH_SIZE),
    "--num_train_epochs",
    str(EPOCHS),
]
print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)

print("Final model dir exists:", FINAL_MODEL_DIR.exists())


In [None]:
# Plot training curves (train/eval loss) from Trainer state
import matplotlib.pyplot as plt

trainer_states = list(CHECKPOINT_DIR.glob("checkpoint-*/trainer_state.json"))
if not trainer_states:
    # Sometimes Trainer writes trainer_state.json in the root output_dir
    root_state = CHECKPOINT_DIR / "trainer_state.json"
    trainer_states = [root_state] if root_state.exists() else []

if not trainer_states:
    print("No trainer_state.json found yet in", CHECKPOINT_DIR)
else:
    # Use the most recent state file
    state_path = max(trainer_states, key=lambda p: p.stat().st_mtime)
    print("Using:", state_path)

    state = json.loads(state_path.read_text())
    logs = state.get("log_history", [])

    steps, train_losses = [], []
    eval_steps, eval_losses = [], []
    for item in logs:
        if "loss" in item and "eval_loss" not in item:
            steps.append(item.get("step"))
            train_losses.append(item["loss"])
        if "eval_loss" in item:
            eval_steps.append(item.get("step"))
            eval_losses.append(item["eval_loss"])

    plt.figure(figsize=(10, 4))
    if train_losses:
        plt.plot(steps, train_losses, label="train_loss")
    if eval_losses:
        plt.plot(eval_steps, eval_losses, label="eval_loss")
    plt.title("Training curves")
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.grid(True, alpha=0.2)
    plt.show()


## 5) Inference demo

This loads from `./final_model` and runs beam search (`num_beams=4`).
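
For intuition about what `num_beams` controls, here is a toy beam search over a hand-written next-token table. It is a sketch of the general technique, not the generation code in `inference.py`: `beam_search` and `toy_model` are hypothetical, and length normalization is left out for brevity.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, summed log-probability)


def beam_search(
    step_fn: Callable[[List[str]], Dict[str, float]],
    bos: str,
    eos: str,
    num_beams: int = 4,
    max_len: int = 10,
) -> List[str]:
    """Keep the `num_beams` best partial sequences at every step."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []

    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, prob in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)

        # Keep the top candidates; those ending in EOS are complete.
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break

    finished.extend(beams)  # fall back to unfinished beams if max_len was hit
    return max(finished, key=lambda c: c[1])[0]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Next-token probabilities where the greedy first choice is a trap."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.5, "y": 0.5},
        ("<s>", "b"): {"z": 0.9, "w": 0.1},
    }
    return table.get(tuple(tokens), {"</s>": 1.0})


print(beam_search(toy_model, "<s>", "</s>", num_beams=1))  # greedy: picks "a" first
print(beam_search(toy_model, "<s>", "</s>", num_beams=2))  # finds the likelier "b z"
```

With one beam the search commits to `a` (probability 0.6) and ends with total probability 0.30; with two beams it keeps `b` alive and finds `b z` at 0.36.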

In [None]:
import inference

sample_python = """\
def factorial(n):
    out = 1
    for i in range(2, n + 1):
        out *= i
    return out
"""

print("=== Python ===")
print(sample_python)
print("=== C++ (generated) ===")
inference.translate(sample_python)


In [None]:
# (Optional) Make results more visual: sample a few random translations from the validation set
import random

if val_df is None:
    print("No validation split loaded")
else:
    for idx in random.sample(range(len(val_df)), k=min(3, len(val_df))):
        py = val_df.iloc[idx]["source"]
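        # Strip the task prefix for prettier display.
        # TASK_PREFIX is not defined anywhere in this notebook; looking it up on
        # the project modules is an assumption, with "" as a no-op fallback.
        task_prefix = getattr(inference, "TASK_PREFIX", None) or getattr(model_config, "TASK_PREFIX", "")
        if task_prefix and py.startswith(task_prefix):
            py = py[len(task_prefix):]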
        print("\n--- Example", idx, "---")
        print("[Python]")
        print(py)
        print("[Model C++]")
        inference.translate(py)
        print("[Reference C++]")
        print(val_df.iloc[idx]["target"])
