# Notebook 04 — Pig Latin LoRA Multi-Model Benchmark (Tinker + W&B)

**Goal:** Compare multiple base models on the same dataset and training setup, tracking:
- training loss curves
- wall time / step time
- a simple quality metric (normalized exact match on a small validation set)
- a simple “cost proxy” (completion tokens per epoch)

**Models tested:**
- meta-llama/Llama-3.2-3B
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-30B-A3B-Instruct-2507

**Workflow:**
1) Load environment + initialize Tinker client  
2) Load dataset + sanity check  
3) Define experiment matrix  
4) Define helpers + training/eval function  
5) Run a single smoke test  
6) Run the full matrix + log to W&B (optional)

In [4]:
import os
import json
import time
import random
from datetime import datetime
from pathlib import Path

import numpy as np
from dotenv import load_dotenv

import tinker
from tinker import types

import wandb

## 1) Environment + Tinker client

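We explicitly load `.env` from the repo root (not the notebook's current working directory), then create a `service_client`.
Resolving the path this way avoids “API key not found” errors when the notebook is launched from a subfolder. Note that `run_one()` in section 5 builds its own client from the environment instead of reusing this one.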

In [5]:
def find_repo_root(start=None):
    p = Path(start or Path.cwd()).resolve()
    for parent in [p, *p.parents]:
        if (parent / ".git").exists() or (parent / "pyproject.toml").exists():
            return parent
    return p

REPO_ROOT = find_repo_root()
ENV_PATH = REPO_ROOT / ".env"

print("REPO_ROOT:", REPO_ROOT)
print(".env exists:", ENV_PATH.exists())

load_dotenv(dotenv_path=ENV_PATH, override=True)

api_key = os.getenv("TINKER_API_KEY")
print("TINKER_API_KEY present:", bool(api_key))
print("TINKER_API_KEY startswith:", (api_key or "")[:6])

assert api_key, "Missing TINKER_API_KEY. Put it in repo-root .env as TINKER_API_KEY=..."

service_client = tinker.ServiceClient(api_key=api_key)
print("ServiceClient ready ✅")

REPO_ROOT: C:\Users\user\Desktop\tinker-hello-world
.env exists: True
TINKER_API_KEY present: True
TINKER_API_KEY startswith: tml-CD
ServiceClient ready ✅


## 2) Dataset

We load `data/piglatin/sample.jsonl` and verify it contains `{ "input": ..., "output": ... }` rows.
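
The check in the next cell only inspects the first row. As a minimal sketch of a stricter pass over every row, the snippet below uses a hypothetical `validate_rows` helper and two made-up in-memory rows standing in for `sample.jsonl`; neither is part of the notebook.

```python
import io
import json

REQUIRED_KEYS = ("input", "output")  # the schema described above


def validate_rows(lines):
    """Parse JSONL lines and return (rows, problems).

    `problems` holds (line_number, message) tuples for rows that are not
    JSON objects with non-empty string 'input' and 'output' fields.
    """
    rows, problems = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, like read_jsonl does
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "row is not a JSON object"))
            continue
        bad = [
            k for k in REQUIRED_KEYS
            if not isinstance(row.get(k), str) or not row[k].strip()
        ]
        if bad:
            problems.append((lineno, f"missing or empty: {bad}"))
            continue
        rows.append(row)
    return rows, problems


# Hypothetical sample data (illustration only)
sample = io.StringIO(
    '{"input": "Translate this to Pig Latin:\\nhello", "output": "ello-hay"}\n'
    "\n"
    '{"input": "no output field"}\n'
)
good_rows, problems = validate_rows(sample)
print(len(good_rows), problems)
```

Reporting line numbers rather than asserting on the first failure makes it easier to fix a hand-edited dataset in one pass.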

In [6]:
DATA_PATH = REPO_ROOT / "data" / "piglatin" / "sample.jsonl"
print("DATA_PATH:", DATA_PATH)
print("Exists:", DATA_PATH.exists())
assert DATA_PATH.exists(), f"Missing dataset at {DATA_PATH}"

def read_jsonl(path):
    path = Path(path)
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows

rows = read_jsonl(DATA_PATH)
print("Rows:", len(rows))
print("First row keys:", rows[0].keys())
print("First row sample:", rows[0])

assert "input" in rows[0] and "output" in rows[0], "Dataset rows must contain 'input' and 'output'"

DATA_PATH: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl
Exists: True
Rows: 300
First row keys: dict_keys(['input', 'output'])
First row sample: {'input': 'Translate this to Pig Latin:\nHaving assembled the senate, he reminded them of the injustice of his', 'output': 'aving-Hay assembled-ay e-thay enate-say, e-hay eminded-ray em-thay of-ay e-thay injustice-ay of-ay is-hay'}


## 3) Experiment matrix

We define:
- which base models to test
- training presets (light vs baseline)
- safe overrides for the 30B model
- a run naming scheme
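
To make the preset and override interaction concrete, here is a small sketch that resolves a configuration and predicts the number of optimizer steps. The helper names `resolve_config` and `planned_steps` are illustrative; the dictionaries are trimmed copies of the ones defined in the next cell, and `val_rows=25` is the default used by `run_one()` later on. It assumes the dataset has at least `max_train_rows` rows.

```python
import math

# Trimmed copies of the notebook's presets and overrides (relevant keys only)
PRESETS = {
    "light": dict(max_train_rows=120, num_epochs=2, batch_size=8),
    "baseline": dict(max_train_rows=300, num_epochs=3, batch_size=8),
}
MODEL_OVERRIDES = {
    "Qwen/Qwen3-30B-A3B-Instruct-2507": dict(batch_size=4, max_train_rows=120),
}


def resolve_config(base_model, preset_name):
    """Start from a copy of the preset, then apply per-model overrides."""
    cfg = dict(PRESETS[preset_name])  # copy so the preset itself is not mutated
    cfg.update(MODEL_OVERRIDES.get(base_model, {}))
    return cfg


def planned_steps(cfg, val_rows=25):
    """Total optimizer steps: the validation rows come out of max_train_rows."""
    train_rows = cfg["max_train_rows"] - val_rows
    steps_per_epoch = math.ceil(train_rows / cfg["batch_size"])
    return cfg["num_epochs"] * steps_per_epoch


for model, preset in [
    ("meta-llama/Llama-3.2-3B", "light"),
    ("Qwen/Qwen3-30B-A3B-Instruct-2507", "light"),
    ("meta-llama/Llama-3.2-3B", "baseline"),
]:
    cfg = resolve_config(model, preset)
    print(model, preset, cfg["batch_size"], planned_steps(cfg))
```

The predicted totals (24, 48, and 105 steps) agree with the `Step N/total` counters in the run logs further down.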

In [7]:
WANDB_PROJECT = "tinker-hello-world"
WANDB_ENTITY = None
WANDB_MODE = "online"        # use "offline" if you want
WANDB_GROUP = "piglatin-multimodel-benchmark-v1"
USE_WANDB = False            # keep False for smoke test; turn True later

BASE_MODELS = [
    "meta-llama/Llama-3.2-3B",
    "Qwen/Qwen3-4B-Instruct-2507",
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
]

PRESETS = {
    "light": dict(max_train_rows=120, lora_rank=8,  lora_alpha=16, learning_rate=2e-4, num_epochs=2, batch_size=8),
    "baseline": dict(max_train_rows=300, lora_rank=16, lora_alpha=32, learning_rate=1e-4, num_epochs=3, batch_size=8),
}

MODEL_OVERRIDES = {
    "Qwen/Qwen3-30B-A3B-Instruct-2507": dict(batch_size=4, max_train_rows=120),
}

EXPERIMENT_MATRIX = []
for m in BASE_MODELS:
    EXPERIMENT_MATRIX.append((m, "light"))
for m in ["meta-llama/Llama-3.2-3B", "Qwen/Qwen3-4B-Instruct-2507"]:
    EXPERIMENT_MATRIX.append((m, "baseline"))

def make_run_name(base_model: str, preset: str) -> str:
    short = base_model.split("/")[-1]
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"piglatin-{preset}-{short}-{ts}"

print("Planned runs:")
for m, p in EXPERIMENT_MATRIX:
    print(" -", m, p)

Planned runs:
 - meta-llama/Llama-3.2-3B light
 - Qwen/Qwen3-4B-Instruct-2507 light
 - Qwen/Qwen3-30B-A3B-Instruct-2507 light
 - meta-llama/Llama-3.2-3B baseline
 - Qwen/Qwen3-4B-Instruct-2507 baseline


## 4) Training example format

We train next-token prediction where:
- prompt tokens are weighted `0`
- completion tokens are weighted `1`

This makes the loss focus on the Pig Latin answer portion.
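
The shift-by-one alignment is easy to get wrong, so here is a toy walk-through with made-up token IDs (a real tokenizer would produce different values). It shows that after the shift, each weight lines up with the token being predicted, and that the weights sum to the number of completion tokens.

```python
# Toy token IDs; the values are invented for illustration
prompt_tokens = [101, 7, 8, 9]     # stands in for "English: ...\nPig Latin:"
completion_tokens = [20, 21, 22]   # stands in for " ello-hay\n"

tokens = prompt_tokens + completion_tokens
weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)

# Position i of the input predicts token i + 1 of the full sequence
input_tokens = tokens[:-1]
target_tokens = tokens[1:]
target_weights = weights[1:]  # drop the first weight so weights align with targets

for inp, tgt, w in zip(input_tokens, target_tokens, target_weights):
    print(f"input={inp:>3}  target={tgt:>3}  weight={w}")

# Every completion token is a weighted target, and no prompt token is
print(sum(target_weights), len(completion_tokens))
```

The last prompt token (`9` here) is the input position that predicts the first completion token, which is why its target carries weight 1.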

In [8]:
def process_example(example: dict, tokenizer) -> types.Datum:
    prompt = f"English: {example['input']}\nPig Latin:"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    prompt_weights = [0] * len(prompt_tokens)

    completion_tokens = tokenizer.encode(f" {example['output']}\n\n", add_special_tokens=False)
    completion_weights = [1] * len(completion_tokens)

    tokens = prompt_tokens + completion_tokens
    weights = prompt_weights + completion_weights

    # Shift for next-token prediction
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]

    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens),
    )

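def decode_completion(decoded: str) -> str:
    # Keep only the text after the last "Pig Latin:" marker, then its first line.
    if "Pig Latin:" in decoded:
        decoded = decoded.split("Pig Latin:")[-1]
    lines = decoded.strip().splitlines()
    return lines[0].strip() if lines else ""  # empty generations return ""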

## 5) `run_one()` — train + evaluate one configuration

Key design decisions:
- `data_path=None` (prevents NameError during function definition)
- cost proxy counts the weight-1 (completion) tokens of the tokenized training examples, not anything summed from Tinker internals
- W&B is optional and can be enabled after smoke test
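
The first point relies on how Python evaluates default arguments: they are computed once, when the `def` statement runs, so a default that names a not-yet-defined global fails immediately. The sketch below illustrates this with hypothetical names (`UNDEFINED_PATH`, `DEFAULT_PATH`, `run`) that are not part of the notebook.

```python
def make_bad():
    # Defaults are evaluated when `def` executes, so defining `run` here
    # raises NameError because UNDEFINED_PATH has never been assigned.
    def run(data_path=UNDEFINED_PATH):  # noqa: F821 (intentionally undefined)
        return data_path
    return run


try:
    make_bad()
    raised = False
except NameError:
    raised = True


def run(data_path=None):
    # With a None sentinel, the global is looked up only at call time.
    if data_path is None:
        data_path = DEFAULT_PATH
    return data_path


DEFAULT_PATH = "data/piglatin/sample.jsonl"  # assigned after `run` is defined

print(raised, run(), run("other.jsonl"))
```

A side benefit of the sentinel is that re-running an earlier cell with a new path changes what later calls pick up, without redefining the function.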

In [9]:
# =========================
# Core training + eval cell
# =========================

from __future__ import annotations

import os
import json
import time
import random
from datetime import datetime
from pathlib import Path

import numpy as np

import tinker
from tinker import types

# Optional: W&B
try:
    import wandb
except Exception:
    wandb = None

# Optional: dotenv (loads TINKER_API_KEY if you use a .env file)
try:
    from dotenv import load_dotenv
except Exception:
    load_dotenv = None


# ---------- Paths / env ----------
def find_repo_root(start: str | Path | None = None) -> Path:
    p = Path(start or Path.cwd()).resolve()
    for parent in [p, *p.parents]:
        if (parent / ".git").exists() or (parent / "pyproject.toml").exists():
            return parent
    return p


REPO_ROOT = find_repo_root()
ENV_PATH = REPO_ROOT / ".env"

if load_dotenv is not None and ENV_PATH.exists():
    load_dotenv(dotenv_path=ENV_PATH, override=True)

DATA_PATH = str(REPO_ROOT / "data" / "piglatin" / "sample.jsonl")


# ---------- Config ----------
WANDB_PROJECT = "tinker-hello-world"
WANDB_ENTITY = None
WANDB_MODE = "online"
WANDB_GROUP = "piglatin-multimodel-benchmark-v1"
USE_WANDB = True  # flip to False for smoke tests / debugging

BASE_MODELS = [
    "meta-llama/Llama-3.2-3B",
    "Qwen/Qwen3-4B-Instruct-2507",
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
]

PRESETS = {
    "light": dict(max_train_rows=120, lora_rank=8,  lora_alpha=16, learning_rate=2e-4, num_epochs=2, batch_size=8),
    "baseline": dict(max_train_rows=300, lora_rank=16, lora_alpha=32, learning_rate=1e-4, num_epochs=3, batch_size=8),
}

# Keep the 30B run safer on first pass
MODEL_OVERRIDES = {
    "Qwen/Qwen3-30B-A3B-Instruct-2507": dict(batch_size=4, max_train_rows=120),
}

# Run matrix: light for all, baseline for smaller ones
EXPERIMENT_MATRIX = []
for m in BASE_MODELS:
    EXPERIMENT_MATRIX.append((m, "light"))
for m in ["meta-llama/Llama-3.2-3B", "Qwen/Qwen3-4B-Instruct-2507"]:
    EXPERIMENT_MATRIX.append((m, "baseline"))


def make_run_name(base_model: str, preset: str) -> str:
    short = base_model.split("/")[-1]
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"piglatin-{preset}-{short}-{ts}"


# ---------- Data helpers ----------
def read_jsonl(path: str | Path) -> list[dict]:
    path = Path(path)
    rows = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows


# ---------- Tinker helpers ----------
def process_example(example: dict, tokenizer) -> types.Datum:
    """
    Build a next-token prediction datum:
    - Prompt tokens are weight 0 (no loss)
    - Completion tokens are weight 1 (loss)
    """
    prompt = f"English: {example['input']}\nPig Latin:"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    prompt_weights = [0] * len(prompt_tokens)

    completion_tokens = tokenizer.encode(f" {example['output']}\n", add_special_tokens=False)
    completion_weights = [1] * len(completion_tokens)

    tokens = prompt_tokens + completion_tokens
    weights = prompt_weights + completion_weights

    # Shift for next-token prediction
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]  # align with targets

    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens),
    )


def _as_1d(x) -> np.ndarray:
    """Force scalars / tuples / lists / arrays into a 1D numpy array."""
    if hasattr(x, "tolist"):
        x = x.tolist()
    arr = np.asarray(x)
    return arr.reshape(-1)


def norm(s: str) -> str:
    return " ".join(s.strip().lower().split())


# ---------- Main experiment ----------
def run_one(
    base_model: str,
    preset_name: str,
    data_path: str | None = None,
    seed: int = 7,
    val_rows: int = 25,
):
    if data_path is None:
        data_path = DATA_PATH

    # Client (reads TINKER_API_KEY from env)
    service_client = tinker.ServiceClient()

    cfg = dict(PRESETS[preset_name])
    cfg.update(MODEL_OVERRIDES.get(base_model, {}))

    run_name = make_run_name(base_model, preset_name)

    random.seed(seed)
    np.random.seed(seed)

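    # max_train_rows caps the rows loaded; the validation split is carved out
    # of that cap, so the effective train size is max_train_rows - val_rows.
    examples = read_jsonl(data_path)[: cfg["max_train_rows"]]
    random.shuffle(examples)

    val = examples[:val_rows]
    train = examples[val_rows:]
    assert len(train) > 0, "Train split is empty; increase max_train_rows or reduce val_rows."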

    print("\n==============================")
    print("RUN:", run_name)
    print("MODEL:", base_model)
    print("CFG:", cfg)
    print("DATA:", data_path)
    print("==============================")

    wandb_run = None
    if USE_WANDB and wandb is not None:
        wandb_run = wandb.init(
            project=WANDB_PROJECT,
            entity=WANDB_ENTITY,
            mode=WANDB_MODE,
            name=run_name,
            group=WANDB_GROUP,
            tags=[preset_name, base_model.split("/")[-1]],
            config=dict(
                base_model=base_model,
                preset=preset_name,
                **cfg,
                seed=seed,
                train_rows=len(train),
                val_rows=len(val),
            ),
        )

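    # Training client + tokenizer
    # Only the rank is passed here; lora_alpha from the preset is recorded in
    # the W&B config but is not applied to the training client.
    training_client = service_client.create_lora_training_client(
        base_model=base_model,
        rank=cfg["lora_rank"],
    )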
    tokenizer = training_client.get_tokenizer()

    processed_train = [process_example(ex, tokenizer) for ex in train]

    # Cost proxy (weighted tokens == completion tokens)
    completion_tokens_per_epoch = int(
        sum(_as_1d(d.loss_fn_inputs["weights"]).sum() for d in processed_train)
    )

    num_examples = len(processed_train)
    steps_per_epoch = int(np.ceil(num_examples / cfg["batch_size"]))
    total_steps = cfg["num_epochs"] * steps_per_epoch

    t0 = time.perf_counter()
    global_step = 0
    step_times = []

    for epoch in range(cfg["num_epochs"]):
        random.shuffle(processed_train)

        for start in range(0, num_examples, cfg["batch_size"]):
            batch = processed_train[start : start + cfg["batch_size"]]

            st = time.perf_counter()

            fwdbwd_future = training_client.forward_backward(batch, loss_fn="cross_entropy")
            optim_future = training_client.optim_step(types.AdamParams(learning_rate=cfg["learning_rate"]))

            fwdbwd_result = fwdbwd_future.result()
            optim_future.result()

            # Robust weighted token-level loss (same spirit as Notebook 03)
            logprobs = np.concatenate([_as_1d(out["logprobs"]) for out in fwdbwd_result.loss_fn_outputs])
            weights = np.concatenate([_as_1d(ex.loss_fn_inputs["weights"]) for ex in batch])

            if logprobs.shape[0] != weights.shape[0]:
                raise ValueError(f"Length mismatch: logprobs={logprobs.shape} weights={weights.shape}")

            loss = -float(np.dot(logprobs, weights) / weights.sum())

            global_step += 1
            dt = time.perf_counter() - st
            step_times.append(dt)

            if USE_WANDB and wandb_run is not None:
                wandb.log(
                    {
                        "train/loss": loss,
                        "train/epoch": epoch + 1,
                        "train/step": global_step,
                        "perf/step_time_s": dt,
                        "perf/completion_tokens_per_epoch": completion_tokens_per_epoch,
                        "perf/est_completion_tokens_total": completion_tokens_per_epoch * cfg["num_epochs"],
                    },
                    step=global_step,
                )

            print(f"Epoch {epoch+1}/{cfg['num_epochs']} Step {global_step}/{total_steps} Loss {loss:.4f} StepTime {dt:.2f}s")

    wall_time_s = time.perf_counter() - t0
    avg_step_time_s = float(np.mean(step_times)) if step_times else None

    # Save adapter + sampling client
    sampling_client = training_client.save_weights_and_get_sampling_client(name=run_name)

    # Eval (decode only completion tokens)
    params = types.SamplingParams(max_tokens=60, temperature=0.0, top_p=1.0)

    correct = 0
    rows_out = []

    for ex in val:
        prompt = f"English: {ex['input']}\nPig Latin:"
        prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
        model_input = types.ModelInput.from_ints(tokens=prompt_tokens)

        result = sampling_client.sample(
            prompt=model_input,
            sampling_params=params,
            num_samples=1
        ).result()

        seq_tokens = result.sequences[0].tokens

        # Strip prompt prefix if present
        if len(seq_tokens) >= len(prompt_tokens) and seq_tokens[:len(prompt_tokens)] == prompt_tokens:
            new_tokens = seq_tokens[len(prompt_tokens):]
        else:
            new_tokens = seq_tokens

        decoded_new = tokenizer.decode(new_tokens).strip()
        pred = decoded_new.splitlines()[0].strip() if decoded_new else ""

        gold = ex["output"].strip()
        is_ok = (norm(pred) == norm(gold))

        correct += int(is_ok)
        rows_out.append([ex["input"], gold, pred, is_ok])

        # Print first 5 samples for sanity
        if len(rows_out) <= 5:
            print("----")
            print("IN  :", ex["input"])
            print("GOLD:", gold)
            print("PRED:", pred if pred else "<EMPTY>")
            print("OK? :", is_ok)

    exact_match = correct / len(val) if val else 0.0

    if USE_WANDB and wandb_run is not None:
        table = wandb.Table(columns=["input", "expected", "pred", "ok"], data=rows_out[:10])
        wandb.log(
            {
                "eval/exact_match_norm": exact_match,
                "eval/samples": table,
                "perf/wall_time_s": wall_time_s,
                "perf/avg_step_time_s": avg_step_time_s,
            }
        )
        wandb_run.summary["perf/wall_time_s"] = wall_time_s
        wandb_run.summary["eval/exact_match_norm"] = exact_match
        wandb.finish()

    print(f"Done. wall_time_s={wall_time_s:.1f} eval_exact_match_norm={exact_match:.3f}")
    return {
        "run_name": run_name,
        "wall_time_s": wall_time_s,
        "avg_step_time_s": avg_step_time_s,
        "exact_match_norm": exact_match,
    }
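
The per-step loss reported by `run_one()` is a weighted mean of negative log-probabilities over the completion tokens. Below is a dependency-free sketch of the same arithmetic with invented log-probabilities. The helper name `weighted_nll` and the zero-weight guard are additions for illustration; the notebook itself uses NumPy and has no such guard.

```python
def weighted_nll(logprobs, weights):
    """Mean negative log-probability over positions with non-zero weight."""
    if len(logprobs) != len(weights):
        raise ValueError(
            f"Length mismatch: logprobs={len(logprobs)} weights={len(weights)}"
        )
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: treat a batch with no weighted tokens as an error
        raise ValueError("No weighted tokens in batch")
    weighted_sum = sum(lp * w for lp, w in zip(logprobs, weights))
    return -weighted_sum / total_weight


# Invented per-token log-probabilities: 2 prompt targets, 3 completion targets
logprobs = [-5.0, -4.0, -0.5, -1.0, -1.5]
weights = [0, 0, 1, 1, 1]

loss = weighted_nll(logprobs, weights)
print(loss)  # 1.0: the prompt positions (-5.0, -4.0) do not contribute
```

Because prompt positions carry weight 0, a model that is poor at predicting the prompt but good at the completion still reports a low loss, which is the intended behavior for this task.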

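## 6) Smoke test, then full matrix

Before the full sweep, run exactly one small experiment without W&B (for example `run_one(BASE_MODELS[0], "light")` after setting `USE_WANDB = False`) to verify:
- client works
- dataset loads
- training loop runs
- evaluation runs

The cell below then enables W&B and runs every entry in `EXPERIMENT_MATRIX`.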

In [11]:
USE_WANDB = True
for m, p in EXPERIMENT_MATRIX:
    print("\n\n### RUNNING:", m, p)
    run_one(m, p)



### RUNNING: meta-llama/Llama-3.2-3B light

RUN: piglatin-light-Llama-3.2-3B-20251211_224809
MODEL: meta-llama/Llama-3.2-3B
CFG: {'max_train_rows': 120, 'lora_rank': 8, 'lora_alpha': 16, 'learning_rate': 0.0002, 'num_epochs': 2, 'batch_size': 8}
DATA: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl


wandb: Currently logged in as: nick99 (itprodirect) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Epoch 1/2 Step 1/24 Loss 3.2464 StepTime 1.34s
Epoch 1/2 Step 2/24 Loss 2.9514 StepTime 1.21s
Epoch 1/2 Step 3/24 Loss 2.3668 StepTime 1.22s
Epoch 1/2 Step 4/24 Loss 1.9308 StepTime 1.17s
Epoch 1/2 Step 5/24 Loss 1.7277 StepTime 1.18s
Epoch 1/2 Step 6/24 Loss 1.6012 StepTime 1.27s
Epoch 1/2 Step 7/24 Loss 1.3938 StepTime 1.26s
Epoch 1/2 Step 8/24 Loss 1.0912 StepTime 1.18s
Epoch 1/2 Step 9/24 Loss 1.0122 StepTime 1.20s
Epoch 1/2 Step 10/24 Loss 0.8204 StepTime 1.24s
Epoch 1/2 Step 11/24 Loss 0.5806 StepTime 1.19s
Epoch 1/2 Step 12/24 Loss 0.5921 StepTime 1.19s
Epoch 2/2 Step 13/24 Loss 0.4818 StepTime 1.16s
Epoch 2/2 Step 14/24 Loss 0.3271 StepTime 1.26s
Epoch 2/2 Step 15/24 Loss 0.3340 StepTime 1.27s
Epoch 2/2 Step 16/24 Loss 0.3452 StepTime 1.12s
Epoch 2/2 Step 17/24 Loss 0.2143 StepTime 1.25s
Epoch 2/2 Step 18/24 Loss 0.3833 StepTime 1.24s
Epoch 2/2 Step 19/24 Loss 0.3267 StepTime 1.30s
Epoch 2/2 Step 20/24 Loss 0.4300 StepTime 1.19s
Epoch 2/2 Step 21/24 Loss 0.3194 StepTime 1.11s
[... output truncated ...]

Run history:
eval/exact_match_norm,▁
perf/avg_step_time_s,▁
perf/completion_tokens_per_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/est_completion_tokens_total,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/step_time_s,█▄▄▃▃▆▅▃▄▅▄▃▂▆▆▁▅▅▇▃▁▃▃▅
perf/wall_time_s,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▁▁████████████
train/loss,█▇▆▅▅▄▄▃▃▂▂▂▂▁▁▁▁▁▁▂▁▁▁▁
train/step,▁▁▂▂▂▃▃▃▃▄▄▄▅▅▅▆▆▆▆▇▇▇██

Run summary:
eval/exact_match_norm,0.0
perf/avg_step_time_s,1.21561
perf/completion_tokens_per_epoch,3855.0
perf/est_completion_tokens_total,7710.0
perf/step_time_s,1.25344
perf/wall_time_s,29.27488
train/epoch,2.0
train/loss,0.18381
train/step,24.0


Done. wall_time_s=29.3 eval_exact_match_norm=0.000


### RUNNING: Qwen/Qwen3-4B-Instruct-2507 light

RUN: piglatin-light-Qwen3-4B-Instruct-2507-20251211_224952
MODEL: Qwen/Qwen3-4B-Instruct-2507
CFG: {'max_train_rows': 120, 'lora_rank': 8, 'lora_alpha': 16, 'learning_rate': 0.0002, 'num_epochs': 2, 'batch_size': 8}
DATA: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl


Epoch 1/2 Step 1/24 Loss 2.3160 StepTime 1.54s
Epoch 1/2 Step 2/24 Loss 2.4552 StepTime 1.50s
Epoch 1/2 Step 3/24 Loss 1.4947 StepTime 1.46s
Epoch 1/2 Step 4/24 Loss 1.2732 StepTime 1.69s
Epoch 1/2 Step 5/24 Loss 0.9144 StepTime 1.56s
Epoch 1/2 Step 6/24 Loss 0.8122 StepTime 1.47s
Epoch 1/2 Step 7/24 Loss 0.7046 StepTime 1.46s
Epoch 1/2 Step 8/24 Loss 0.5047 StepTime 1.54s
Epoch 1/2 Step 9/24 Loss 0.5939 StepTime 1.53s
Epoch 1/2 Step 10/24 Loss 0.5199 StepTime 1.59s
Epoch 1/2 Step 11/24 Loss 0.2109 StepTime 1.47s
Epoch 1/2 Step 12/24 Loss 0.1673 StepTime 1.71s
Epoch 2/2 Step 13/24 Loss 0.1878 StepTime 1.48s
Epoch 2/2 Step 14/24 Loss 0.2043 StepTime 1.51s
Epoch 2/2 Step 15/24 Loss 0.1281 StepTime 1.47s
Epoch 2/2 Step 16/24 Loss 0.1277 StepTime 1.74s
Epoch 2/2 Step 17/24 Loss 0.0882 StepTime 1.41s
Epoch 2/2 Step 18/24 Loss 0.1461 StepTime 1.45s
Epoch 2/2 Step 19/24 Loss 0.1831 StepTime 1.47s
Epoch 2/2 Step 20/24 Loss 0.3136 StepTime 1.55s
Epoch 2/2 Step 21/24 Loss 0.2143 StepTime 1.47s
[... output truncated ...]

Run history:
eval/exact_match_norm,▁
perf/avg_step_time_s,▁
perf/completion_tokens_per_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/est_completion_tokens_total,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/step_time_s,▄▃▂▇▄▃▂▄▄▅▃▇▃▄▃█▂▂▃▄▃▁▃▇
perf/wall_time_s,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▁▁████████████
train/loss,██▅▅▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁
train/step,▁▁▂▂▂▃▃▃▃▄▄▄▅▅▅▆▆▆▆▇▇▇██

Run summary:
eval/exact_match_norm,0.04
perf/avg_step_time_s,1.52617
perf/completion_tokens_per_epoch,3862.0
perf/est_completion_tokens_total,7724.0
perf/step_time_s,1.69472
perf/wall_time_s,36.7316
train/epoch,2.0
train/loss,0.14582
train/step,24.0


Done. wall_time_s=36.7 eval_exact_match_norm=0.040


### RUNNING: Qwen/Qwen3-30B-A3B-Instruct-2507 light

RUN: piglatin-light-Qwen3-30B-A3B-Instruct-2507-20251211_225143
MODEL: Qwen/Qwen3-30B-A3B-Instruct-2507
CFG: {'max_train_rows': 120, 'lora_rank': 8, 'lora_alpha': 16, 'learning_rate': 0.0002, 'num_epochs': 2, 'batch_size': 4}
DATA: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



Epoch 1/2 Step 1/48 Loss 0.7406 StepTime 2.30s
Epoch 1/2 Step 2/48 Loss 0.7771 StepTime 2.09s
Epoch 1/2 Step 3/48 Loss 0.4513 StepTime 2.86s
Epoch 1/2 Step 4/48 Loss 0.9059 StepTime 2.38s
Epoch 1/2 Step 5/48 Loss 0.4594 StepTime 2.00s
Epoch 1/2 Step 6/48 Loss 0.1700 StepTime 1.98s
Epoch 1/2 Step 7/48 Loss 0.3442 StepTime 2.07s
Epoch 1/2 Step 8/48 Loss 0.3202 StepTime 2.44s
Epoch 1/2 Step 9/48 Loss 0.2918 StepTime 1.99s
Epoch 1/2 Step 10/48 Loss 0.1661 StepTime 2.05s
Epoch 1/2 Step 11/48 Loss 0.1953 StepTime 3.19s
Epoch 1/2 Step 12/48 Loss 0.3437 StepTime 2.88s
Epoch 1/2 Step 13/48 Loss 0.2982 StepTime 4.13s
Epoch 1/2 Step 14/48 Loss 0.2668 StepTime 2.90s
Epoch 1/2 Step 15/48 Loss 0.2236 StepTime 1.85s
Epoch 1/2 Step 16/48 Loss 0.1160 StepTime 3.81s
Epoch 1/2 Step 17/48 Loss 0.1929 StepTime 1.96s
Epoch 1/2 Step 18/48 Loss 0.1847 StepTime 2.09s
Epoch 1/2 Step 19/48 Loss 0.1256 StepTime 2.02s
Epoch 1/2 Step 20/48 Loss 0.1304 StepTime 2.09s
Epoch 1/2 Step 21/48 Loss 0.1068 StepTime 2.05s
[... output truncated ...]

Run history:
eval/exact_match_norm,▁
perf/avg_step_time_s,▁
perf/completion_tokens_per_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/est_completion_tokens_total,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/step_time_s,▂▂▄▃▁▂▃▁▂▅█▄▁▇▁▂▂▂▂▂▂▂▅▂▂▂▂▁▂▁▂▁▁▄▁▂▂▂▃▁
perf/wall_time_s,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁████████████████████
train/loss,▇▇▄█▅▄▃▃▂▂▃▃▃▂▂▂▂▂▁▁▁▁▁▁▂▁▂▂▁▁▁▁▁▁▂▁▁▂▁▁
train/step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███

Run summary:
eval/exact_match_norm,0.32
perf/avg_step_time_s,2.26508
perf/completion_tokens_per_epoch,3862.0
perf/est_completion_tokens_total,7724.0
perf/step_time_s,2.00046
perf/wall_time_s,108.94527
train/epoch,2.0
train/loss,0.02038
train/step,48.0


Done. wall_time_s=108.9 eval_exact_match_norm=0.320


### RUNNING: meta-llama/Llama-3.2-3B baseline

RUN: piglatin-baseline-Llama-3.2-3B-20251211_225651
MODEL: meta-llama/Llama-3.2-3B
CFG: {'max_train_rows': 300, 'lora_rank': 16, 'lora_alpha': 32, 'learning_rate': 0.0001, 'num_epochs': 3, 'batch_size': 8}
DATA: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl


Epoch 1/3 Step 1/105 Loss 3.5147 StepTime 1.09s
Epoch 1/3 Step 2/105 Loss 3.2103 StepTime 1.21s
Epoch 1/3 Step 3/105 Loss 2.8060 StepTime 1.19s
Epoch 1/3 Step 4/105 Loss 2.2765 StepTime 1.25s
Epoch 1/3 Step 5/105 Loss 2.0698 StepTime 1.07s
Epoch 1/3 Step 6/105 Loss 2.0122 StepTime 1.13s
Epoch 1/3 Step 7/105 Loss 1.6839 StepTime 1.21s
Epoch 1/3 Step 8/105 Loss 1.9408 StepTime 1.19s
Epoch 1/3 Step 9/105 Loss 1.6102 StepTime 1.22s
Epoch 1/3 Step 10/105 Loss 1.2944 StepTime 1.24s
Epoch 1/3 Step 11/105 Loss 1.3189 StepTime 1.13s
Epoch 1/3 Step 12/105 Loss 1.0346 StepTime 1.20s
Epoch 1/3 Step 13/105 Loss 0.9026 StepTime 1.21s
Epoch 1/3 Step 14/105 Loss 0.9712 StepTime 1.07s
Epoch 1/3 Step 15/105 Loss 0.8658 StepTime 1.20s
Epoch 1/3 Step 16/105 Loss 0.9484 StepTime 1.10s
Epoch 1/3 Step 17/105 Loss 0.7535 StepTime 1.19s
Epoch 1/3 Step 18/105 Loss 0.6817 StepTime 1.19s
Epoch 1/3 Step 19/105 Loss 0.7975 StepTime 1.32s
Epoch 1/3 Step 20/105 Loss 0.7419 StepTime 1.26s
[... output truncated ...]

Run history:
eval/exact_match_norm,▁
perf/avg_step_time_s,▁
perf/completion_tokens_per_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/est_completion_tokens_total,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/step_time_s,▅▃▅▅▆▂▄█▆▄▅▆▂▇▃▅▇▁▆▃▇▁▄▁▆▃▆▄▇▅▅▅▂▄▄▆▃▁▂▃
perf/wall_time_s,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅█████████████
train/loss,█▆▅▅▄▄▃▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/step,▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇████

Run summary:
eval/exact_match_norm,0.2
perf/avg_step_time_s,1.18341
perf/completion_tokens_per_epoch,11166.0
perf/est_completion_tokens_total,33498.0
perf/step_time_s,1.24072
perf/wall_time_s,124.7234
train/epoch,3.0
train/loss,0.0269
train/step,105.0


Done. wall_time_s=124.7 eval_exact_match_norm=0.200


### RUNNING: Qwen/Qwen3-4B-Instruct-2507 baseline

RUN: piglatin-baseline-Qwen3-4B-Instruct-2507-20251211_230009
MODEL: Qwen/Qwen3-4B-Instruct-2507
CFG: {'max_train_rows': 300, 'lora_rank': 16, 'lora_alpha': 32, 'learning_rate': 0.0001, 'num_epochs': 3, 'batch_size': 8}
DATA: C:\Users\user\Desktop\tinker-hello-world\data\piglatin\sample.jsonl


Epoch 1/3 Step 1/105 Loss 3.1657 StepTime 1.53s
Epoch 1/3 Step 2/105 Loss 2.2812 StepTime 1.48s
Epoch 1/3 Step 3/105 Loss 1.8943 StepTime 1.56s
Epoch 1/3 Step 4/105 Loss 1.4823 StepTime 1.78s
Epoch 1/3 Step 5/105 Loss 0.9688 StepTime 1.50s
Epoch 1/3 Step 6/105 Loss 1.1643 StepTime 1.56s
Epoch 1/3 Step 7/105 Loss 1.0727 StepTime 1.47s
Epoch 1/3 Step 8/105 Loss 1.0806 StepTime 1.61s
Epoch 1/3 Step 9/105 Loss 0.9097 StepTime 1.49s
Epoch 1/3 Step 10/105 Loss 0.6108 StepTime 1.52s
Epoch 1/3 Step 11/105 Loss 0.6487 StepTime 5.90s
Epoch 1/3 Step 12/105 Loss 0.4989 StepTime 7.38s
Epoch 1/3 Step 13/105 Loss 0.5551 StepTime 9.81s
Epoch 1/3 Step 14/105 Loss 0.5976 StepTime 14.98s
Epoch 1/3 Step 15/105 Loss 0.4659 StepTime 1.45s
Epoch 1/3 Step 16/105 Loss 0.5718 StepTime 1.47s
Epoch 1/3 Step 17/105 Loss 0.3572 StepTime 1.55s
Epoch 1/3 Step 18/105 Loss 0.3796 StepTime 1.70s
Epoch 1/3 Step 19/105 Loss 0.4370 StepTime 1.48s
Epoch 1/3 Step 20/105 Loss 0.4340 StepTime 1.40s
[... output truncated ...]

Run history:
eval/exact_match_norm,▁
perf/avg_step_time_s,▁
perf/completion_tokens_per_epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/est_completion_tokens_total,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/step_time_s,▁▁▁▁█▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
perf/wall_time_s,▁
train/epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅██████████
train/loss,█▆▄▄▄▃▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████

Run summary:
eval/exact_match_norm,0.36
perf/avg_step_time_s,1.83279
perf/completion_tokens_per_epoch,11191.0
perf/est_completion_tokens_total,33573.0
perf/step_time_s,1.45524
perf/wall_time_s,192.90679
train/epoch,3.0
train/loss,0.01466
train/step,105.0


Done. wall_time_s=192.9 eval_exact_match_norm=0.360
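
The loop above discards the dict that `run_one()` returns, so the five results are only visible scattered through the logs. As a sketch, the snippet below ranks them in one table. The numbers are copied from the `Done.` lines printed above; the `summarize` helper and the tuple layout are illustrative choices. In a live session the list could instead be built from the return values of `run_one()`.

```python
# (model, preset, wall_time_s, exact_match_norm), copied from the "Done." lines above
RESULTS = [
    ("Llama-3.2-3B", "light", 29.3, 0.000),
    ("Qwen3-4B-Instruct-2507", "light", 36.7, 0.040),
    ("Qwen3-30B-A3B-Instruct-2507", "light", 108.9, 0.320),
    ("Llama-3.2-3B", "baseline", 124.7, 0.200),
    ("Qwen3-4B-Instruct-2507", "baseline", 192.9, 0.360),
]
VAL_ROWS = 25  # default val_rows in run_one()


def summarize(results, val_rows=VAL_ROWS):
    """Return (ranked results, formatted table lines), best exact match first."""
    ranked = sorted(results, key=lambda r: r[3], reverse=True)
    table = []
    for model, preset, wall, em in ranked:
        correct = round(em * val_rows)  # recover the integer hit count
        table.append(
            f"{model:<30} {preset:<9} {wall:>7.1f}s  {em:.3f} ({correct}/{val_rows})"
        )
    return ranked, table


ranked, table = summarize(RESULTS)
print("\n".join(table))
```

Because `run_one()` truncates the dataset to `max_train_rows` before shuffling, the light and baseline presets are scored on different 25-row validation samples, so comparisons across presets are indicative only.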
