# Colab GPU Training for this repo (Transformer + Lightning + MLflow)

Run this notebook in **Google Colab** with a GPU runtime to train the model in this repo and keep artifacts (dataset, MLflow runs) in Google Drive.

It also doubles as a quick VS Code notebook sanity-check template.

> Tip: In Colab, go to **Runtime → Change runtime type → GPU** before running any cells.

In [None]:
# 1) Minimal notebook skeleton (idempotent cell)
seed = 42
msg = "hello from notebook"

def add(a, b):
    return a + b

add(1, 2), seed, msg

## 2) Verify environment (Python, GPU, paths)

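This prints the Python version, interpreter path, and working directory, then reports the installed versions of torch, Lightning, and MLflow and whether CUDA is available.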

In [None]:
import os, sys
print("python:", sys.version)
print("executable:", sys.executable)
print("cwd:", os.getcwd())

try:
    import torch
    print("torch:", torch.__version__)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("gpu:", torch.cuda.get_device_name(0))
except Exception as e:
    print("Torch import failed:", e)

try:
    import lightning
    print("lightning:", lightning.__version__)
except Exception as e:
    print("Lightning import failed:", e)

try:
    import mlflow
    print("mlflow:", mlflow.__version__)
except Exception as e:
    print("MLflow import failed:", e)
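The check above imports each library, which can be slow for torch. As a lighter complement, here is a minimal sketch that reads installed versions from package metadata without importing anything. The helper name `installed_versions` is hypothetical, and the distribution names are assumed to match what `requirements.txt` installs.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI (assumed to match requirements.txt).
report = installed_versions(["torch", "lightning", "mlflow"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

This only tells you what is installed; it does not replace the CUDA availability check, which still needs `torch` to be imported.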

## 3) Sanity-check stdout/stderr

This intentionally throws and catches an exception to show tracebacks in notebooks.

In [None]:
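import sys
import traceback

print("stdout: hello")
print("stderr: hello", file=sys.stderr)

try:
    1 / 0
except ZeroDivisionError as e:
    print("caught error:", repr(e))
    traceback.print_exc()  # writes the full traceback to stderr

print("still running")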

## 4) Persist and reload notebook state (basic I/O)

Writes a small JSON file and reads it back.

In [None]:
import json
from pathlib import Path

Path("data").mkdir(exist_ok=True)
state_path = Path("data/state.json")

state = {"seed": seed, "msg": msg, "sum": add(10, 20)}
state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")

loaded = json.loads(state_path.read_text(encoding="utf-8"))
assert loaded["sum"] == 30
loaded

## 5) Add a simple unit test cell

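Uses `unittest` in a notebook-friendly way: `argv=[""]` stops `unittest.main` from parsing the kernel's own command-line arguments, and `exit=False` stops it from calling `sys.exit()` when the tests finish.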

In [None]:
import unittest


def add2(a, b):
    return a + b


class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add2(1, 2), 3)

    def test_zero(self):
        self.assertEqual(add2(0, 0), 0)


unittest.main(argv=[""], exit=False)
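`unittest.main` collects every `TestCase` defined in the notebook's namespace, including ones left over from earlier cells. If you want to run one class only, a minimal sketch is to build the suite yourself. `TestAddExample` is a stand-in class for illustration.

```python
import unittest


class TestAddExample(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 2, 3)


# Load only this class, so tests defined in other cells are not picked up.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAddExample)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("passed:", result.wasSuccessful())
```

The returned `result` object also exposes `failures` and `errors`, which is handy if a later cell should stop when tests fail.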

## 6) Capture output to a file and display it

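Appends three lines to `logs/run.log` and prints the last 10 lines of the file. The file is opened in append mode, so re-running the cell adds to the existing log.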

In [None]:
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
log_path = Path("logs/run.log")

with log_path.open("a", encoding="utf-8") as f:
    f.write("starting run\n")
    f.write(f"seed={seed}\n")
    f.write("done\n")

print("Last 10 lines:")
print("\n".join(log_path.read_text(encoding="utf-8").splitlines()[-10:]))
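For longer runs, the standard `logging` module may be more convenient than manual `write` calls, since it adds timestamps and levels. The sketch below is one possible setup, not something the repo necessarily uses. The helper name `make_file_logger` and the logger name are hypothetical, and a temporary folder stands in for `logs/`.

```python
import logging
import tempfile
from pathlib import Path


def make_file_logger(name, log_path):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the notebook's root logger
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


log_dir = Path(tempfile.mkdtemp())  # stand-in for the notebook's logs/ folder
demo_log = log_dir / "run.log"

logger = make_file_logger("notebook_run_demo", demo_log)
logger.info("starting run")
logger.info("seed=%d", 42)

print(demo_log.read_text(encoding="utf-8"))
```

The `if not logger.handlers` guard matters in notebooks: without it, each re-run of the cell attaches another handler and every message is written multiple times.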

---

# Colab section (GPU training + MLflow on Drive)

The cells below are specifically for Google Colab.

They will:

1. Mount Google Drive
2. Clone your repo
3. Install requirements
4. Point MLflow tracking to Drive (so runs persist)
5. Run training on GPU

> MLflow UI: On Colab, a full web UI is awkward to expose. The typical workflow is: log to Drive, then run `mlflow ui` locally against that same `mlruns` folder (or use a remote tracking server).
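Because the cells below only make sense in Colab, it can help to detect the environment first when the same notebook is also opened in VS Code. The sketch below uses a common heuristic and assumes that Colab has already loaded the `google.colab` package into the kernel. The function name `in_colab` is hypothetical.

```python
import sys


def in_colab():
    """Return True when the google.colab package is already loaded in this process."""
    return "google.colab" in sys.modules


if in_colab():
    print("Colab detected: the Drive mount and clone cells apply.")
else:
    print("Not in Colab: skip the Colab-only cells.")
```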

In [None]:
# (Colab) 1) Mount Drive
from google.colab import drive

drive.mount('/content/drive')

In [None]:
# (Colab) 2) Clone repo
# Replace with your repo URL.
REPO_URL = "https://github.com/<owner>/<repo>.git"

import os

# Use an absolute path so that re-running this cell from inside the repo
# does not clone a nested copy.
REPO_DIR = "/content/AttentionIsAllYouNeed"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL} {REPO_DIR}
%cd {REPO_DIR}

In [None]:
# (Colab) 3) Install dependencies
!pip -q install -r requirements.txt

In [None]:
# (Colab) 4) Point MLflow + data to Drive for persistence
from pathlib import Path

DRIVE_ROOT = Path("/content/drive/MyDrive")
PROJECT_ROOT = DRIVE_ROOT / "attention_is_all_you_need"
DATA_DIR = PROJECT_ROOT / "data" / "raw" / "wmt14_en_de"
MLRUNS_DIR = PROJECT_ROOT / "mlruns"

DATA_DIR.mkdir(parents=True, exist_ok=True)
MLRUNS_DIR.mkdir(parents=True, exist_ok=True)

print("DATA_DIR:", DATA_DIR)
print("MLRUNS_DIR:", MLRUNS_DIR)

# Copy your prepared dataset folder into Drive once.
# If you already have a local folder in the repo, you can copy it like:
# !cp -r data/raw/wmt14_en_de/* "{DATA_DIR}/"

# Check expected files
for fn in ["train.en.bpe32000", "train.de.bpe32000", "valid.en.bpe32000", "valid.de.bpe32000"]:
    print(fn, "exists?", (DATA_DIR / fn).exists())
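The loop above only prints the status of each file. If you would rather stop before a long training run starts, a minimal sketch is to collect the missing names and raise. The helper name `missing_files` is hypothetical, and a temporary folder stands in for `DATA_DIR`.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = [
    "train.en.bpe32000",
    "train.de.bpe32000",
    "valid.en.bpe32000",
    "valid.de.bpe32000",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [fn for fn in expected if not (data_dir / fn).is_file()]


# Demo against a temporary folder that contains only one of the four files.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "train.en.bpe32000").write_text("", encoding="utf-8")

missing = missing_files(demo_dir)
print("missing:", missing)

# In the notebook you might fail fast instead of printing:
# if missing_files(DATA_DIR):
#     raise FileNotFoundError(f"Missing dataset files in {DATA_DIR}")
```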

In [None]:
# (Colab) 5) Run repo smoke test using Drive-backed data
import os

os.environ["MLFLOW_TRACKING_URI"] = f"file:{MLRUNS_DIR}"

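# The repo reads data from data/raw/wmt14_en_de, so replace that path with a symlink to the Drive copy.
# Warning: this deletes any local data/raw/wmt14_en_de folder; copy it to Drive first (see step 4).
!mkdir -p data/raw
!rm -rf data/raw/wmt14_en_de
!ln -s "{DATA_DIR}" data/raw/wmt14_en_de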

!python -m src.smoke_test
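The tracking URI above is built by string formatting. As an alternative sketch, `pathlib` can produce the file URI itself, which yields the three-slash `file:///` form and percent-encodes characters such as spaces. The helper name `tracking_uri` is hypothetical, and the example assumes a POSIX system such as Colab, where the path is absolute.

```python
from pathlib import Path


def tracking_uri(mlruns_dir):
    """Build a file:// URI from an absolute directory path."""
    return Path(mlruns_dir).as_uri()


uri = tracking_uri("/content/drive/MyDrive/attention_is_all_you_need/mlruns")
print(uri)
```

`as_uri()` raises `ValueError` for relative paths, which catches the mistake of passing `mlruns` instead of the full Drive path.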

In [None]:
# (Colab) 6) Train on GPU
# This runs the repo entrypoint; it will use MLflow tracking via the env var set above.
# Ensure src/train.py has accelerator='gpu' and devices=1 when running in Colab.

!python -m src.train
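Rather than hard-coding the accelerator, the training script could choose it at run time. The sketch below is an assumption about how that might look, since the contents of `src/train.py` are not shown here. The helper name `trainer_kwargs` is hypothetical. `accelerator` and `devices` are the Lightning `Trainer` arguments named in the comment above.

```python
def trainer_kwargs(cuda_available):
    """Return accelerator/devices settings for a Lightning Trainer."""
    if cuda_available:
        return {"accelerator": "gpu", "devices": 1}
    return {"accelerator": "cpu", "devices": 1}


try:
    import torch

    has_cuda = torch.cuda.is_available()
except ImportError:
    has_cuda = False  # torch not installed; fall back to CPU settings

kwargs = trainer_kwargs(has_cuda)
print(kwargs)

# Hypothetical use inside src/train.py:
# trainer = lightning.Trainer(**kwargs)
```

With this approach the same script runs unchanged on a Colab GPU runtime and on a CPU-only machine.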

In [None]:
# (Colab) 7) Export latest MLflow run summary to the repo docs folder
# Note: adjust --experiment if your training run logs to a different experiment name.
!python scripts/export_mlflow_run.py --tracking-uri "file:{MLRUNS_DIR}" --experiment attention_is_all_you_need_cpu --out docs/assets/latest_run_summary.json
!ls -lah docs/assets | head