## Colab Master Pipeline: T5-Nano (Python → C++)

End-to-end pipeline:
- Clone repo
- Install deps
- XLCoST data prep → `data/processed/`
- Train tokenizer → `custom_tokenizer/`
- Build T5-Nano (random init)
- Train → `t5_nano_checkpoints/` + `final_model/`
- Inference demo
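
To see how far a session has progressed, a small helper can report which of the artifact directories listed above already exist. This is only a sketch: the helper name `artifact_status` and the constant `EXPECTED_ARTIFACTS` are invented here, and "non-empty directory" is an assumed stand-in for "step completed".

```python
from pathlib import Path

# Artifact directories named in the pipeline list, relative to the repo root.
EXPECTED_ARTIFACTS = [
    "data/processed",
    "custom_tokenizer",
    "t5_nano_checkpoints",
    "final_model",
]


def artifact_status(repo_root):
    """Map each expected artifact directory to True if it exists and is non-empty."""
    root = Path(repo_root)
    status = {}
    for rel in EXPECTED_ARTIFACTS:
        path = root / rel
        status[rel] = path.is_dir() and any(path.iterdir())
    return status


# Report against the current working directory (the repo root after cloning).
for name, ready in artifact_status(".").items():
    print(f"{'ok     ' if ready else 'missing'}  {name}/")
```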

**Colab GPU**: Runtime → Change runtime type → GPU


In [None]:
# --- Storage setup (VS Code + Colab compatible) ---
# drive.mount() doesn't work in VS Code extension - we save locally
# After training, run the "Upload to Drive" cell to persist your model

DRIVE_MOUNTED = False  # Set to True only if using Colab web UI with drive.mount()
print("Models will be saved locally to final_model/")
print("After training, run the 'Upload to Drive' cell to save to Google Drive.")

In [4]:
# --- 0) Clone repo (idempotent) ---
%cd /content

REPO_URL = "https://github.com/ns-1456/NMT.git"
REPO_DIR = "NMT"
BRANCH = "python-to-cpp-transpiler"  # change if needed

import os

if not os.path.isdir(REPO_DIR):
    !git clone --depth 1 -b {BRANCH} {REPO_URL} {REPO_DIR}

%cd /content/{REPO_DIR} 
!git status -sb || true


In [5]:
# --- 1) Install deps (avoid reinstalling torch in Colab) ---
!pip -q install -U pip
!pip -q install transformers datasets tokenizers pandas scikit-learn accelerate gdown tqdm matplotlib


In [6]:
from __future__ import annotations

import json
import os
import subprocess
from pathlib import Path

os.environ["TOKENIZERS_PARALLELISM"] = "false"

REPO_ROOT = Path.cwd()
RAW_DIR = REPO_ROOT / "data" / "raw"
PROCESSED_DIR = REPO_ROOT / "data" / "processed"
TOKENIZER_DIR = REPO_ROOT / "custom_tokenizer"

# Save models locally (upload to Drive after training)
CHECKPOINT_DIR = REPO_ROOT / "t5_nano_checkpoints"
FINAL_MODEL_DIR = REPO_ROOT / "final_model"

# QUICK_RUN shrinks the dataset and training schedule for a fast smoke test
QUICK_RUN = False
MAX_SAMPLES = 2000 if QUICK_RUN else None
EPOCHS = 1 if QUICK_RUN else 30
BATCH_SIZE = 8 if QUICK_RUN else 32

print(f"repo: {REPO_ROOT}")
print(f"checkpoints: {CHECKPOINT_DIR}")
print(f"final model: {FINAL_MODEL_DIR}")
print(f"quick_run: {QUICK_RUN}")


## 2) Data prep (XLCoST)

This downloads + extracts XLCoST and writes:
- `data/processed/corpus.txt`
- `data/processed/train.jsonl`, `validation.jsonl`, `test.jsonl`
- `data/processed/xlcost_py_cpp_snippet/` (Arrow dataset, if `datasets` is installed)
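
Before training, it can help to confirm that each JSONL split actually contains usable pairs. The sketch below assumes every record is a JSON object with `source` and `target` fields (the same column names the inspection cell reads); the helper name `check_pairs` is made up for illustration and is not part of `data_prep.py`.

```python
import json
from pathlib import Path


def check_pairs(jsonl_path, required=("source", "target")):
    """Count records and collect line numbers where a required field is missing or empty."""
    total, bad_lines = 0, []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            record = json.loads(line)
            if any(not str(record.get(key) or "").strip() for key in required):
                bad_lines.append(lineno)
    return total, bad_lines


# Check whichever splits exist under data/processed/ (relative to the repo root).
for split in ("train", "validation", "test"):
    path = Path("data/processed") / f"{split}.jsonl"
    if path.exists():
        total, bad = check_pairs(path)
        print(f"{split}: {total} records, {len(bad)} with a missing field")
```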


In [7]:
# Clean stale artifacts that commonly cause confusion
subprocess.run(["rm", "-rf", str(RAW_DIR / "XLCoST_data")], check=False)
subprocess.run(["rm", "-rf", str(RAW_DIR / "__MACOSX")], check=False)
subprocess.run(["rm", "-f", str(RAW_DIR / "XLCoST_data.zip")], check=False)

cmd = ["python", "-u", "data_prep.py"]
if MAX_SAMPLES is not None:
    cmd += ["--max_samples", str(MAX_SAMPLES)]

print("Running:", " ".join(cmd))
proc = subprocess.run(cmd, text=True, capture_output=True)

print("\n--- data_prep.py stdout ---\n")
print(proc.stdout)

if proc.returncode != 0:
    print("\n--- data_prep.py stderr ---\n")
    print(proc.stderr)
    raise RuntimeError(f"data_prep.py failed with exit code {proc.returncode}")

print("\nProduced:")
for p in sorted(PROCESSED_DIR.glob("*")):
    print("-", p)


In [8]:
# Quick sanity: locate where pair_data_tok_1 ended up (debug helper)
import os

hits = []
for root, dirs, _files in os.walk(RAW_DIR):
    if "pair_data_tok_1" in dirs:
        hits.append(Path(root))

print("Found roots containing pair_data_tok_1:")
for h in hits[:10]:
    print("-", h)


In [9]:
# Inspect dataset + basic visualization
import pandas as pd
import matplotlib.pyplot as plt

arrow_dir = PROCESSED_DIR / "xlcost_py_cpp_snippet"
if arrow_dir.exists():
    from datasets import load_from_disk
    ds = load_from_disk(str(arrow_dir))
    train_df = pd.DataFrame(ds["train"])
else:
    train_df = pd.read_json(PROCESSED_DIR / "train.jsonl", lines=True)

print("train rows:", len(train_df))
train_df["source_len"] = train_df["source"].astype(str).map(len)
train_df["target_len"] = train_df["target"].astype(str).map(len)

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(train_df["source_len"], bins=50)
ax[0].set_title("Train source char length")
ax[1].hist(train_df["target_len"], bins=50)
ax[1].set_title("Train target char length")
plt.tight_layout()
plt.show()

train_df.head(3)


## 3) Train tokenizer (Byte-Level BPE)
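
Byte-level BPE starts from the 256 possible byte values and repeatedly merges the most frequent adjacent pair into a new token. The toy sketch below illustrates that merge loop in plain Python. It is not what `train_tokenizer.py` runs (that script's internals are not shown in this notebook), and the names `most_frequent_pair`, `merge_pair`, and `train_toy_bpe` are invented for this illustration.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent token pair across all sequences, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `texts`."""
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = {}
    next_id = 256  # ids 0..255 are reserved for raw bytes
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break  # nothing left to merge
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


merges, encoded = train_toy_bpe(
    ["for i in range(n):", "for j in range(m):"], num_merges=5
)
print(len(merges), "merges learned")
print("first sequence length:", len(encoded[0]))
```

Because the base vocabulary covers every byte, any input string can be encoded without an unknown token, which suits source code with unusual identifiers and symbols.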


In [10]:
subprocess.run(["python", "-u", "train_tokenizer.py"], check=True)
print("Tokenizer dir:", TOKENIZER_DIR)
!ls -la custom_tokenizer | head


## 4) Build T5-Nano + verify parameter count


In [11]:
import model_config

tok = model_config.load_tokenizer()
model = model_config.build_t5_nano(tok)
params = model_config.count_parameters(model)
print(f"T5-Nano parameter count: {params:,}")


## 5) Train

`train.py` uses `fp16=True`, so this requires a GPU.
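
Since training fails without a GPU, a quick check before launching can save a wasted run. The sketch below is an assumption-laden helper, not part of the repo: `gpu_available` is a made-up name, it prefers `torch.cuda.is_available()` when PyTorch is importable, and otherwise falls back to asking `nvidia-smi -L` whether any device is listed.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if a CUDA GPU appears usable from this runtime."""
    try:
        import torch  # preinstalled on Colab; optional elsewhere

        return bool(torch.cuda.is_available())
    except ImportError:
        pass
    # Fallback: ask the NVIDIA driver tool, if it is on PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected: fp16 training should be able to start.")
else:
    print("No GPU detected: switch the runtime type to GPU before training.")
```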


In [12]:
!git pull

In [13]:
# Train model (saves locally, upload to Drive after)
# Batch size and epochs come from the config cell, so QUICK_RUN also shortens training
!python -u train.py \
    --output_dir "{CHECKPOINT_DIR}" \
    --final_model_dir "{FINAL_MODEL_DIR}" \
    --per_device_batch_size {BATCH_SIZE} \
    --gradient_accumulation_steps 1 \
    --num_train_epochs {EPOCHS}

In [34]:
# --- Upload model to Google Drive (VS Code workaround) ---
# Run this after training to persist your model to Drive

import shutil

# Zip the contents of final_model/ (files end up at the root of the archive)
model_zip = "final_model.zip"
if FINAL_MODEL_DIR.exists():
    print(f"Zipping {FINAL_MODEL_DIR}...")
    shutil.make_archive("final_model", "zip", FINAL_MODEL_DIR)
    print(f"Created {model_zip}")

    # Nothing is uploaded automatically: download the zip or copy it to Drive yourself
    print("\nTo save to Google Drive, either:")
    print("1. Download locally: from google.colab import files; files.download('final_model.zip')")
    print("2. Or copy to Drive folder if you have it mounted elsewhere")
    print(f"\nModel size: {Path(model_zip).stat().st_size / 1024 / 1024:.1f} MB")
else:
    print(f"Model not found at {FINAL_MODEL_DIR}. Train first!")

In [38]:
!ls -lh /content
!ls -lh /content/NMT
!ls -lh /content/NMT/final_model.zip

In [35]:
# Plot training curves (if trainer_state.json exists)
import matplotlib.pyplot as plt

trainer_states = list(CHECKPOINT_DIR.glob("checkpoint-*/trainer_state.json"))
if not trainer_states:
    root_state = CHECKPOINT_DIR / "trainer_state.json"
    trainer_states = [root_state] if root_state.exists() else []

if not trainer_states:
    print("No trainer_state.json found")
else:
    state_path = max(trainer_states, key=lambda p: p.stat().st_mtime)
    state = json.loads(state_path.read_text())
    logs = state.get("log_history", [])

    steps, train_losses = [], []
    eval_steps, eval_losses = [], []
    for item in logs:
        if "loss" in item and "eval_loss" not in item:
            steps.append(item.get("step"))
            train_losses.append(item["loss"])
        if "eval_loss" in item:
            eval_steps.append(item.get("step"))
            eval_losses.append(item["eval_loss"])

    plt.figure(figsize=(10, 4))
    if train_losses:
        plt.plot(steps, train_losses, label="train_loss")
    if eval_losses:
        plt.plot(eval_steps, eval_losses, label="eval_loss")
    plt.title("Training curves")
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.grid(True, alpha=0.2)
    plt.show()


## 6) Inference demo


In [2]:
# Demo cell - run the cell below to see multiple translation examples

In [None]:
import sys
import os
from pathlib import Path

# Clone repo if not already cloned
repo_url = "https://github.com/ns-1456/NMT.git"
repo_dir = Path("/content/NMT")
branch = "python-to-cpp-transpiler"

if not repo_dir.exists():
    print(f"📥 Cloning repository from {repo_url}...")
    os.chdir("/content")
    os.system(f"git clone --depth 1 -b {branch} {repo_url} NMT")
    print("✅ Repository cloned")
else:
    print(f"✅ Repository already exists at {repo_dir}")

# Change to the NMT directory
os.chdir(repo_dir)
print(f"📁 Current directory: {os.getcwd()}")

# Add repo directory to Python path
if str(repo_dir) not in sys.path:
    sys.path.insert(0, str(repo_dir))

# Check if inference.py exists
inference_file = repo_dir / "inference.py"
if not inference_file.exists():
    print(f"❌ Error: inference.py not found at {inference_file}")
    print(f"Directory contents: {list(repo_dir.iterdir())[:10]}")
    raise FileNotFoundError(f"inference.py not found at {inference_file}")

# Import inference module
import inference
print("✅ Successfully imported inference module")

# Point inference at the trained model location
try:
    # Try to use FINAL_MODEL_DIR from earlier cells
    inference.FINAL_MODEL_DIR = FINAL_MODEL_DIR
    print(f"✅ Using FINAL_MODEL_DIR: {FINAL_MODEL_DIR}")
except NameError:
    # Default to local final_model directory
    inference.FINAL_MODEL_DIR = repo_dir / "final_model"
    print(f"✅ Using default FINAL_MODEL_DIR: {inference.FINAL_MODEL_DIR}")

# If the model directory is missing, try to restore it before running the demo
if not inference.FINAL_MODEL_DIR.exists():
    print("⚠️  Local model not found. Checking for a saved copy...")

    # Option 1: Copy from a mounted Google Drive (Colab web UI only)
    # Uncomment and set the path to your saved model:
    # from google.colab import drive
    # drive.mount('/content/drive')
    # !cp -r /content/drive/MyDrive/path/to/final_model {inference.FINAL_MODEL_DIR}

    # Option 2: Download a shared folder with gdown (if you've uploaded to cloud storage)
    # Uncomment and set your model URL:
    # import gdown
    # model_url = "YOUR_CLOUD_STORAGE_URL_HERE"  # e.g., Google Drive shareable link
    # gdown.download_folder(model_url, output=str(inference.FINAL_MODEL_DIR), quiet=False)

    # Option 3: Extract final_model.zip if it is present in the repo directory
    model_zip = repo_dir / "final_model.zip"
    if model_zip.exists():
        print("📦 Found final_model.zip, extracting...")
        import zipfile
        with zipfile.ZipFile(model_zip, 'r') as zip_ref:
            # The zip holds the contents of final_model/ at its root,
            # so extract into the model directory rather than the repo root
            zip_ref.extractall(inference.FINAL_MODEL_DIR)
        print("✅ Model extracted")
    else:
        print("⚠️  Model not found locally. Please:")
        print("   1. Upload final_model.zip to the repo directory, or")
        print("   2. Upload final_model/ folder to cloud storage and download it here")

print("🚀 T5-Nano Python → C++ Translator Demo")
print("=" * 80)

# Example 1: Simple function
print("\n" + "=" * 80)
print("📝 Example 1: Sum Function")
print("\n--- Python Input ---")
python_code_1 = """def sum_upto(n):
    s = 0
    for i in range(n + 1):
        s += i
    return s"""
print(python_code_1)
print("\n--- Generated C++ ---")
cpp_1 = inference.translate(python_code_1)

# Example 2: List operations
print("\n" + "=" * 80)
print("📝 Example 2: Find Maximum")
print("\n--- Python Input ---")
python_code_2 = """def find_max(arr):
    if not arr:
        return None
    max_val = arr[0]
    for val in arr:
        if val > max_val:
            max_val = val
    return max_val"""
print(python_code_2)
print("\n--- Generated C++ ---")
cpp_2 = inference.translate(python_code_2)

# Example 3: String manipulation
print("\n" + "=" * 80)
print("📝 Example 3: String Reversal")
print("\n--- Python Input ---")
python_code_3 = """def reverse_string(s):
    result = ""
    for i in range(len(s) - 1, -1, -1):
        result += s[i]
    return result"""
print(python_code_3)
print("\n--- Generated C++ ---")
cpp_3 = inference.translate(python_code_3)

# Example 4: Conditional logic
print("\n" + "=" * 80)
print("📝 Example 4: Even Check")
print("\n--- Python Input ---")
python_code_4 = """def is_even(n):
    if n % 2 == 0:
        return True
    else:
        return False"""
print(python_code_4)
print("\n--- Generated C++ ---")
cpp_4 = inference.translate(python_code_4)

print("\n" + "=" * 80)
print("✨ Demo complete!")


📥 Cloning repository from https://github.com/ns-1456/NMT.git...
