# Fine‑tuning **Llama‑4‑Maverick 17B‑128E** with QLoRA on Google Colab  

This end‑to‑end notebook shows how to:
1. Install **[Unsloth](https://github.com/unslothai/unsloth)** for fast 4‑bit loading & LoRA training.
2. Convert our **Teach** human‑labelled CSV + framework JSON into **ChatML `.jsonl`**.
3. Fine‑tune Meta’s *Llama‑4‑Maverick* with **Transformers ✕ PEFT ✕ bitsandbytes** (QLoRA under the hood).
4. Push the LoRA adapter **and** a merged FP16 model to the Hugging Face Hub.

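> **Colab requirements** · A100‑40 GB, L4‑24 GB or A10G‑24 GB tier (Pro/Pro+).  
> The notebook assumes the 4‑bit checkpoint loads in ∼19 GB VRAM, leaving room for LoRA params. Check this against your GPU first: Maverick is a mixture‑of‑experts model with 17 B *active* parameters per token but roughly 400 B in total across its 128 experts, and every expert's weights must be loaded.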

References: Unsloth Llama‑4 guide, Colab install doc.
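
The VRAM figure above comes down to arithmetic on parameter count and bit width. The sketch below is a rough estimate only: it counts weight storage and ignores quantisation constants, activations, the KV cache and optimizer state. The function name and the example parameter count are illustrative assumptions.

```python
def quantized_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# 17 B parameters stored at 4 bits each
print(quantized_weight_gb(17e9, 4))    # 8.5
# the same parameters stored at 16 bits each
print(quantized_weight_gb(17e9, 16))   # 34.0
```

For a mixture‑of‑experts model, pass the total parameter count across all experts, since every expert has to be resident in memory even though only some are active per token.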

In [None]:
# ⚙️ 0. Environment check
import os, subprocess, json, torch, sys
print(f"Torch version: {torch.__version__}")
!nvidia-smi -L
!nvidia-smi --query-gpu=name,memory.total --format=csv


In [None]:
# ⚙️ 1. Install libraries (takes 2–3 min)
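# NOTE: The unsloth[colab] meta‑package chooses the right CUDA/torch wheels automatically.
# Llama 4 support landed in transformers 4.51.0, so an older pin cannot load this architecture.
# huggingface_hub is left unpinned so pip can pick a release compatible with that transformers version.
!pip install --quiet --upgrade "unsloth[colab]" \
                                 "transformers>=4.51.0" \
                                 peft==0.10.0 \
                                 bitsandbytes==0.43.1 \
                                 datasets==2.19.1 \
                                 trl==0.8.6 \
                                 accelerate==0.29.3 \
                                 huggingface_hub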

import importlib, platform, torch; print("✔️ Install complete on", platform.platform())

In [None]:
# 🔑 2. Login to the Hugging Face Hub (one‑time per session)
from huggingface_hub import notebook_login
notebook_login()

## 3 Data upload / mounting
Place the following files in **Google Drive** → `/MyDrive/teach_data/`:
```
Teach_1.json                     # framework spec
peru_cleaned_transcripts.csv     # human labels + transcripts
```
Alternatively, upload them directly to the Colab *Files* pane and adjust paths below.
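
Because the files may live either on Drive or in the Colab working directory, it can help to resolve the data directory once instead of editing paths by hand. A minimal sketch, assuming a hypothetical helper `resolve_data_dir` and a fallback to `/content` (neither is part of the original notebook):

```python
from pathlib import Path


def resolve_data_dir(candidates):
    """Return the first candidate directory that exists, or raise if none do."""
    candidates = list(candidates)
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    raise FileNotFoundError(f"None of these directories exist: {candidates}")


# Hypothetical usage: prefer Drive, fall back to the Colab working directory.
# data_dir = resolve_data_dir(["/content/drive/MyDrive/teach_data", "/content"])
```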

In [None]:
# ▶️ 3a (Optional) Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 📂 3b Define paths
DATA_DIR = "/content/drive/MyDrive/teach_data"   # change if not using Drive
CSV_PATH = f"{DATA_DIR}/peru_cleaned_transcripts.csv"
FRAME_PATH = f"{DATA_DIR}/Teach_1.json"

assert os.path.exists(CSV_PATH), "❌ CSV not found – check path!"
assert os.path.exists(FRAME_PATH), "❌ Framework JSON not found!"

## 4 Materialise ChatML `.jsonl`
Every training example is a triple of *system / user / assistant* messages, serialised as one JSON object per line.  
The prompt builder below mirrors the one used in the evaluation pipeline.
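
To make the record shape concrete, here is a small structural check for one `.jsonl` line. It is a sketch rather than part of the pipeline; the function name `is_valid_chatml_record` and the sample record are illustrative assumptions.

```python
import json

ROLES = ("system", "user", "assistant")


def is_valid_chatml_record(line: str) -> bool:
    """Check that one JSONL line holds a system/user/assistant triple with string content."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(ROLES):
        return False
    return all(
        isinstance(m, dict) and m.get("role") == role and isinstance(m.get("content"), str)
        for m, role in zip(messages, ROLES)
    )


# Made-up sample record in the same shape the notebook writes
sample = json.dumps({
    "messages": [
        {"role": "system", "content": "You are an expert Teach framework scorer."},
        {"role": "user", "content": "### Component: Example\n\nTranscript:\n..."},
        {"role": "assistant", "content": json.dumps({"score": "Y", "analysis": ""})},
    ]
})
print(is_valid_chatml_record(sample))  # True
```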

In [None]:
# 🛠️ 4. Build train/val JSONL
import pandas as pd, json, random, math, tqdm, numpy as np

RNG = random.Random(42)
TRAIN_OUT = "train_chatml.jsonl"
VAL_OUT   = "val_chatml.jsonl"

df = pd.read_csv(CSV_PATH, dtype=str)
with open(FRAME_PATH, encoding='utf-8') as f:  # close the file handle once parsed
    framework = json.load(f)

# Extract IDs
clip_info = df['School_Clip'].str.extract(r'(?P<base_id>\d{6,7})\s*Clip\s*(?P<clip>[12])')
df['base_id'] = clip_info['base_id']
df['clip_number'] = clip_info['clip'].map({'1':'first','2':'last'})

# Helper: prompt builder mirroring evaluation pipeline
def make_prompt(comp, transcript):
    system = "You are an expert Teach framework scorer. Reply ONLY with valid JSON: {\"score\":<label>,\"analysis\":<text>}"
    user = (
        f"### Component: {comp['name']}\n"
        f"### Allowed labels: {', '.join(map(str, comp.get('scoreList',['Y','N'])))}\n\n"
        f"Transcript:\n{transcript}\n"
    )
    return system, user

rows_train, rows_val = [], []

for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):
    if pd.isna(row['clip_number']):
        continue  # School_Clip did not match the ID pattern, so the clip is unknown
    transcript = (
        row.get('First Audio Transcript Text', '') if row['clip_number']=='first'
        else row.get('Last Audio Transcript Text', '')
    )
    if not isinstance(transcript, str) or len(transcript.strip())==0:
        continue  # skip blank transcripts
    # 80‑20 split on the fly, drawn once per transcript so that no transcript
    # ends up in both the training and the validation file
    target = rows_val if RNG.random()<0.2 else rows_train
    for domain in framework['structure']['domains']:
        for comp in domain['components']:
            cname = comp['name']
            if cname not in row or pd.isna(row[cname]):
                continue
            label = str(row[cname]).strip()
            sys_msg, user_msg = make_prompt(comp, transcript)
            assistant_msg = json.dumps({"score": label, "analysis": ""}, ensure_ascii=False)
            example = {
                "messages": [
                    {"role": "system", "content": sys_msg},
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": assistant_msg}
                ]
            }
            target.append(example)

# Write files
for path, rows in [(TRAIN_OUT, rows_train),(VAL_OUT,rows_val)]:
    with open(path,'w',encoding='utf-8') as f:
        for ex in rows:
            f.write(json.dumps(ex, ensure_ascii=False)+'\n')
    print(f"Wrote {len(rows):,} examples → {path}")
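
The ID extraction in the cell above depends on the `School_Clip` values matching one regular expression. The standalone sketch below applies the same pattern with the standard `re` module so its behaviour can be checked in isolation. The sample strings are made up, and the helper name `parse_school_clip` is an assumption.

```python
import re

CLIP_PATTERN = re.compile(r'(?P<base_id>\d{6,7})\s*Clip\s*(?P<clip>[12])')
CLIP_NAMES = {'1': 'first', '2': 'last'}


def parse_school_clip(value):
    """Return (base_id, clip_name), or None when the value does not match."""
    match = CLIP_PATTERN.search(value)
    if match is None:
        return None
    return match['base_id'], CLIP_NAMES[match['clip']]


print(parse_school_clip("1234567 Clip 1"))   # ('1234567', 'first')
print(parse_school_clip("123456Clip2"))      # ('123456', 'last')
print(parse_school_clip("12345 Clip 3"))     # None
```

Values that return `None` here correspond to rows whose `clip_number` is missing in the DataFrame.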

## 5 Training hyper‑parameters
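* **r** = 64, **α** = 16 → LoRA updates are scaled by α / r = 0.25; a reasonable starting point for a model of this size.  
* **batch_size** is fixed at 1 per device; gradient accumulation over 8 steps gives an effective batch of 8 without extra activation memory.  
* **packing** = `True` packs multiple short conversations into one `MAX_LEN`‑token window (a TRL `SFTTrainer` feature).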
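
The numbers behind these settings are easy to reproduce. The sketch below assumes standard LoRA scaling (α / r, not the rank‑stabilised variant); the 4096‑wide layer is a hypothetical example, not a dimension taken from the model.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def effective_batch(per_device: int, accumulation_steps: int, n_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer update."""
    return per_device * accumulation_steps * n_devices


print(lora_scaling(64, 16))          # 0.25
print(lora_params(4096, 4096, 64))   # 524288
print(effective_batch(1, 8))         # 8
```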

In [None]:
# ⚙️ 6. Load model (4‑bit) & attach LoRA
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer

MODEL_NAME = "unsloth/Llama-4-Maverick-17B-128E-bnb-4bit"  # 4‑bit checkpoint
MAX_LEN = 2048

# load_in_4bit=True makes Unsloth set up the bitsandbytes quantisation itself,
# so no separate BitsAndBytesConfig is passed alongside it.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_LEN,
    dtype=None,               # Autodetect bfloat16 support
    load_in_4bit=True,
)

# Unsloth's get_peft_model takes the LoRA settings as keyword arguments
# rather than a peft.LoraConfig object.
model = FastLanguageModel.get_peft_model(
    model,
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","down_proj","up_proj"],
)

tokenizer.pad_token = tokenizer.eos_token  # safer packing
print(f"Model loaded – trainable params: {model.num_parameters(only_trainable=True):,}")

In [None]:
# 📚 7. Load dataset into 🤗 Datasets
train_ds = load_dataset('json', data_files=TRAIN_OUT, split='train')
val_ds   = load_dataset('json', data_files=VAL_OUT,   split='train')
print(train_ds[0]['messages'][0].keys(), "| total train ex:", len(train_ds))
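
Before training, it is worth looking at how the labels are distributed, since a heavily skewed split affects agreement metrics later on. A minimal sketch using only the standard library; the helper name `label_counts` is an assumption, and it expects records in the shape written by the JSONL cell.

```python
import json
from collections import Counter


def label_counts(jsonl_lines):
    """Count the "score" values found in the assistant message of each record."""
    counts = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        assistant = record["messages"][-1]["content"]
        counts[json.loads(assistant)["score"]] += 1
    return counts


# Usage with the training file (assumes train_chatml.jsonl exists):
# with open("train_chatml.jsonl", encoding="utf-8") as f:
#     print(label_counts(f).most_common())
```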

In [None]:
# 🚀 8. Configure SFTTrainer
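# Written in the TRL 0.8‑style signature; newer TRL releases move
# max_seq_length and packing into SFTConfig.
from transformers import TrainingArguments

trainer = SFTTrainer(
    model                 = model,
    tokenizer             = tokenizer,
    train_dataset         = train_ds,
    eval_dataset          = val_ds,
    # No dataset_text_field: "messages" holds a list of dicts, not a string.
    # TRL detects the conversational format and applies the tokenizer's chat template.
    max_seq_length        = MAX_LEN,
    packing               = True,
    # Optimisation settings belong to TrainingArguments, not to SFTTrainer itself.
    args = TrainingArguments(
        output_dir            = "outputs",   # checkpoint directory (arbitrary name)
        gradient_checkpointing= True,
        bf16                  = torch.cuda.is_bf16_supported(),
        logging_steps         = 25,
        eval_strategy         = "steps",     # needed for eval_steps to take effect
        eval_steps            = 200,
        save_steps            = 200,
        num_train_epochs      = 3,
        learning_rate         = 2e-4,
        per_device_train_batch_size = 1,   # fits 24 GB with 4‑bit
        gradient_accumulation_steps = 8,   # effective batch 8
        warmup_ratio          = 0.05,
    ),
)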
print("Trainer initialised – starting fine‑tuning …")
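
Packing concatenates tokenised conversations, separated by an end‑of‑sequence token, and cuts the stream into fixed‑length windows. The sketch below is a simplified illustration of that idea, not TRL's implementation; the function name and the toy token ids are assumptions, and the incomplete final window is simply dropped.

```python
def pack_sequences(sequences, max_len, eos_id):
    """Concatenate token sequences with EOS separators and split into full windows."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    # Keep only windows that are exactly max_len tokens long
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]


toy = [[1, 2, 3], [4, 5], [6]]
print(pack_sequences(toy, max_len=4, eos_id=0))  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Note that a window can start or end in the middle of a conversation, which is the price paid for never padding.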

In [None]:
# 🏋️ 9. Train (≈ 1–3 hrs depending on GPU & dataset size)
trainer.train()

# Save adapter locally
ADAPTER_DIR = "lora-teach-llama4-maverick"
trainer.model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
print(f"Adapter saved to {ADAPTER_DIR}")
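
Whether `eval_steps = 200` and `save_steps = 200` fire at all depends on how many optimizer updates the run contains. The sketch below gives an approximate count; the exact figure depends on how the trainer handles a partial final accumulation step. The sequence count of 4,000 is a made‑up example, and with packing enabled it should be the number of packed windows, not raw conversations.

```python
import math


def optimizer_steps(n_sequences, per_device_batch, accumulation_steps, epochs, n_devices=1):
    """Approximate number of optimizer updates over a whole run."""
    sequences_per_update = per_device_batch * accumulation_steps * n_devices
    return math.ceil(n_sequences / sequences_per_update) * epochs


print(optimizer_steps(4000, per_device_batch=1, accumulation_steps=8, epochs=3))  # 1500
```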

In [None]:
# ☁️ 10. Push to the Hub (LoRA only)
HF_REPO = "mattkrasnow/llama4-maverick-teach-lora"
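# Trainer.push_to_hub() takes a commit message as its first argument, not a repo id,
# so push the adapter weights and the tokenizer explicitly.
trainer.model.push_to_hub(HF_REPO)
tokenizer.push_to_hub(HF_REPO)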

In [None]:
# ➕ 11. (Optional) Merge LoRA into FP16 base & push full model
MERGED = "llama4-maverick-teach-merged"
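# Merging straight into the 4‑bit weights would not produce an FP16 model.
# Unsloth's save_pretrained_merged dequantises the base, merges the adapter
# and writes 16‑bit weights together with the tokenizer.
model.save_pretrained_merged(MERGED, tokenizer, save_method="merged_16bit")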

from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id=HF_REPO+"-merged", exist_ok=True)
api.upload_folder(folder_path=MERGED, repo_id=HF_REPO+"-merged")
print("Merged model pushed ✔️")

In [None]:
# 🔬 12. Quick inference sanity‑check
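# Generate with the in‑memory fine‑tuned model, using the same
# system/user chat format as the training data.
FastLanguageModel.for_inference(model)

comp = {"name": "Supportive Learning Environment", "scoreList": ["Y", "N", "N/A"]}
transcript = "Teacher greets each student by name and checks how they feel about their homework…"
sys_msg, user_msg = make_prompt(comp, transcript)
input_ids = tokenizer.apply_chat_template(
    [{"role": "system", "content": sys_msg}, {"role": "user", "content": user_msg}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(input_ids=input_ids, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True))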
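
The system prompt asks for a JSON object with a `score` and an `analysis`, but generated text can still contain extra words around it. A minimal sketch of a tolerant parser follows; the function name `parse_score` and the rule of returning `None` on failure are assumptions, not part of the evaluation pipeline.

```python
import json


def parse_score(generated: str, allowed_labels):
    """Return the first JSON object in the text whose score is allowed, else None."""
    decoder = json.JSONDecoder()
    start = generated.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(generated, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and str(obj.get("score")) in allowed_labels:
            return obj
        # Try the next opening brace, if any
        start = generated.find("{", start + 1)
    return None


labels = {"Y", "N", "N/A"}
print(parse_score('Answer: {"score": "Y", "analysis": ""}', labels))  # {'score': 'Y', 'analysis': ''}
print(parse_score("no JSON here", labels))                            # None
```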

## 13 Next steps
* Evaluate the fine‑tuned model with your existing *`run_evaluation()`* pipeline.  
* Perform hyper‑parameter sweeps (rank, learning rate) to maximise Cohen’s κ and Teach reliability pass‑rate.  
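
For reference, Cohen's κ compares observed agreement with the agreement expected by chance from each rater's label frequencies. The pure‑Python sketch below is for illustration and is not the `run_evaluation()` implementation; the label lists are made up.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["Y", "Y", "N", "N"]
predicted = ["Y", "N", "N", "N"]
print(cohen_kappa(human, predicted))  # 0.5
```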