# GPT-2-XL Full Fine-Tuning — Voice

Fine-tune GPT-2-XL (1.5B parameters) on essay pairs to capture writing voice.
Full fine-tuning — every weight gets updated, no LoRA adapter.

**Why GPT-2-XL?** It's pre-RLHF (released in 2019, before RLHF-style alignment training
became standard for released language models), so it's a "raw" model. Comparing it against
Llama (post-RLHF) answers the question: does alignment training help or hurt voice capture?

**Why full fine-tuning instead of LoRA?** GPT-2-XL is small enough (1.5B params) to
fine-tune fully on a T4 GPU when paired with the 8-bit Adam optimizer used in Cell 7.
Full fine-tuning gives the model maximum capacity to learn voice, and on a small dataset
it trains in minutes. The tradeoff: with so few examples, the overfitting risk is higher.
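
A rough back-of-the-envelope shows why the 8-bit optimizer matters on a 16 GB T4. The sketch assumes fp32 weights and gradients (4 bytes each per parameter), two Adam moment buffers per parameter, and a round 1.5B parameter count. It ignores activations and CUDA overhead, so real usage will be higher.

```python
def training_memory_gb(n_params, optimizer_bytes_per_param):
    """Estimate GB for weights + gradients + optimizer state (activations excluded)."""
    weights = n_params * 4   # fp32 weights
    grads = n_params * 4     # fp32 gradients
    optim = n_params * optimizer_bytes_per_param
    return (weights + grads + optim) / 1e9


N_PARAMS = 1.5e9  # assumed round figure for GPT-2-XL

adam_fp32 = training_memory_gb(N_PARAMS, 8)  # two fp32 moments = 8 bytes/param
adam_8bit = training_memory_gb(N_PARAMS, 2)  # two 8-bit moments = 2 bytes/param

print(f"fp32 Adam:  {adam_fp32:.1f} GB")
print(f"8-bit Adam: {adam_8bit:.1f} GB")
print(f"Saved:      {adam_fp32 - adam_8bit:.1f} GB")
```

Under these assumptions the 8-bit estimate still sits close to the card's capacity, so if training runs out of memory, shorter sequences or gradient checkpointing (`gradient_checkpointing=True` in `TrainingArguments`) are the usual first things to try.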

**Context window:** 1,024 tokens. This limits GPT-2 to short pieces (Notes, openings).
The constraint is actually useful — it forces a clean 1:1 comparison with Llama on
short pairs, isolating voice at the sentence level without essay-level structure.
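
A minimal sketch of the token budget this implies. The 512-token generation cap mirrors `max_new_tokens` in the config cell; the helper name `fits_context` and the example prompt lengths are illustrative assumptions.

```python
CONTEXT_WINDOW = 1024   # GPT-2 positional limit
MAX_NEW_TOKENS = 512    # generation cap used later in this notebook


def fits_context(prompt_tokens, max_new_tokens=MAX_NEW_TOKENS, window=CONTEXT_WINDOW):
    """Return True if prompt plus generated tokens stay inside the context window."""
    return prompt_tokens + max_new_tokens <= window


print(fits_context(120))  # a short Note-style prompt
print(fits_context(600))  # a long prompt that would overflow during generation
```

With a 512-token cap, any prompt of up to 512 tokens (separator included) leaves room for a full-length generation.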

## Prerequisites

1. **Set runtime to GPU** (Runtime → Change runtime type → T4 GPU).
2. That's it. No license, no token, no approval wait. GPT-2-XL is fully open.

## Cell Map

| Cell | What it does | Notes |
|------|-------------|--------------|
| 0 | This intro | — |
| 1 | Install dependencies | — |
| 2 | Upload training data | Skip for a baseline-only run |
| 3 | Config + mount Drive | — |
| 4 | Load base GPT-2-XL | — |
| 5 | Canary baselines | **STOP HERE for baseline** |
| 5b | Save baselines to Drive | — |
| 6 | Load training data | — |
| 7 | Train | — |
| 8 | Canary on fine-tuned | — |
| 8b | Save fine-tuned outputs to Drive | — |
| 9 | Side-by-side comparison | — |
| 10 | Save model to Drive | — |

In [None]:
# === Cell 1: Install dependencies ===
# bitsandbytes is needed for the 8-bit Adam optimizer (saves ~9GB VRAM during training)
!pip install -q transformers datasets bitsandbytes

In [None]:
# === Cell 2: Upload training data (SKIP this cell for baseline-only run) ===
# ---------------------------------------------------------------------------
# Upload training data from your local machine
#
# This opens a file picker. Select your JSONL files:
#   - gpt2_train.jsonl
#   - gpt2_val.jsonl
#
# These were generated locally by running:
#   python3 2_scripts/format_jsonl.py
#
# The files live in Colab's temporary storage for this session.
# They'll disappear when the runtime disconnects, but that's fine —
# the trained model gets saved to Google Drive.
# ---------------------------------------------------------------------------

import os
from google.colab import files

print("Select your JSONL files (gpt2_train.jsonl and gpt2_val.jsonl):")
uploaded = files.upload()

DATA_DIR = "/content/data"
os.makedirs(DATA_DIR, exist_ok=True)

for filename in uploaded:
    dest = os.path.join(DATA_DIR, filename)
    os.rename(filename, dest)
    print(f"  {filename} → {dest}")

print(f"\nUploaded {len(uploaded)} file(s) to {DATA_DIR}/")
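
Before moving on, it can help to confirm that each uploaded file is valid JSONL with the `text` field that the data-loading cell reads. A minimal sketch follows; `check_jsonl` is a hypothetical helper, not part of `format_jsonl.py`.

```python
import json


def check_jsonl(path, field="text"):
    """Return the number of records; raise ValueError on a bad line or missing field."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
            if not isinstance(record, dict) or not isinstance(record.get(field), str):
                raise ValueError(f"{path}:{line_no}: missing string field '{field}'")
            count += 1
    return count
```

For example, `check_jsonl(f"{DATA_DIR}/gpt2_train.jsonl")` should return the same count that Cell 6 later prints as the number of training examples.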

In [None]:
# === Cell 3: Config + mount Google Drive ===

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset
from google.colab import drive

# Mount Google Drive for saving the trained model
drive.mount("/content/drive")

# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------

MODEL_ID = "gpt2-xl"

# Training hyperparameters
LR = 5e-5
EPOCHS = 3
BATCH_SIZE = 1
MAX_SEQ_LEN = 1024

# Generation hyperparameters — identical across all 4 models for fair comparison
GEN_KWARGS = dict(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=512,
    do_sample=True,
)

N_SAMPLES = 5

# Paths
DATA_DIR = "/content/data"
DRIVE_BASE = "/content/drive/MyDrive/voice-ft"
CHECKPOINT_DIR = f"{DRIVE_BASE}/checkpoints/gpt2"

print(f"Model: {MODEL_ID}")
print(f"LR={LR}, epochs={EPOCHS}, batch={BATCH_SIZE}, max_seq_len={MAX_SEQ_LEN}")
print(f"Data dir: {DATA_DIR}")
print(f"Checkpoints: {CHECKPOINT_DIR}")

In [None]:
# === Cell 4: Load base GPT-2-XL ===

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPT-2 has no pad token by default. Same fix as Llama: use eos_token.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"  ({trainable_params / total_params:.0%} trainable: full fine-tuning, no LoRA)")
print(f"Model on: {device}")

In [None]:
# === Cell 5: Generate BASELINE canary outputs (STOP HERE for baseline-only run) ===
# ---------------------------------------------------------------------------
# Canary prompts — fixed prompts run after every training iteration.
# They diagnose what the model learned vs memorized.
#
# Source of truth: 4_experiments/canary-prompts.md
#
# Only A and B for GPT-2 (both short). Canary C is Llama-only because
# it requires essay-length output that won't fit in 1,024 tokens.
# ---------------------------------------------------------------------------

canary_prompts = {
    "A": "Write a personal Substack Note/Tweet about class in America, told from the perspective of a Chinese first generation immigrant whose family is lower-middle class.",
    "B": "Write a personal Substack Note/Tweet about Eileen Gu and Alyssa Liu, both winter Olympic gold medalists. Both grew up in the Bay Area, are half-asian and half-white, conceived via anonymous egg donor, and raised by a single parent. Eileen competed for China in skiing and is maximizing her influencer career while studying at Stanford. Meanwhile, Alyssa competed for the United States, took breaks from skating, and is inactive on social media.",
}


def generate(model, tokenizer, prompt, n=N_SAMPLES):
    """Generate n samples from the model for a given prompt."""
    model.eval()  # make sure dropout is off, e.g. when called after training
    # Format to match training data: prompt, then separator
    input_text = f"{prompt}\n\n---\n\n"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    outputs_list = []
    for i in range(n):
        with torch.no_grad():
            output = model.generate(**inputs, **GEN_KWARGS, pad_token_id=tokenizer.eos_token_id)
        # Decode only the NEW tokens (skip the prompt)
        generated_text = tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
        outputs_list.append(generated_text)

    return outputs_list


# ---------------------------------------------------------------------------
# Generate BASELINE outputs (before fine-tuning)
# ---------------------------------------------------------------------------

baseline_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"BASELINE — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    baseline_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nBaseline generation complete. {N_SAMPLES} samples per canary saved.")

In [None]:
# === Cell 5b: Save baselines to Drive (insurance against disconnection) ===
import json, os

baseline_path = f"{DRIVE_BASE}/baselines/gpt2_baselines.json"
os.makedirs(f"{DRIVE_BASE}/baselines", exist_ok=True)

with open(baseline_path, "w") as f:
    json.dump(baseline_outputs, f, indent=2)

print(f"Baselines saved to: {baseline_path}")
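
If the runtime disconnects after this point, the in-memory `baseline_outputs` dict is lost but the JSON file on Drive survives. A minimal sketch of the round trip follows; `save_outputs` and `load_outputs` are hypothetical helper names, and the demo writes to a temporary directory rather than Drive.

```python
import json
import os
import tempfile


def save_outputs(outputs, path):
    """Write a {canary_name: [samples]} dict as JSON, creating parent folders."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(outputs, f, indent=2, ensure_ascii=False)


def load_outputs(path):
    """Read the dict back, e.g. after a runtime restart."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder data standing in for real canary samples
demo = {"A": ["sample one", "sample two"], "B": ["sample three"]}
demo_path = os.path.join(tempfile.mkdtemp(), "baselines", "demo.json")

save_outputs(demo, demo_path)
restored = load_outputs(demo_path)
print(restored == demo)
```

In this notebook, the equivalent reload would point `load_outputs` at `baseline_path` and assign the result back to `baseline_outputs` before running the comparison cell.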

In [None]:
# === Cell 6: Load training data ===

train_dataset = load_dataset("json", data_files=f"{DATA_DIR}/gpt2_train.jsonl", split="train")
val_dataset = load_dataset("json", data_files=f"{DATA_DIR}/gpt2_val.jsonl", split="train")

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"\nFirst training example (truncated):")
print(train_dataset[0]["text"][:300])


def tokenize(example):
    """Tokenize a completion-style example."""
    tokenized = tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding=False,
    )
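    # No "labels" key here: DataCollatorForLanguageModeling (mlm=False) in Cell 7
    # builds labels from input_ids, and an unpadded labels list can break
    # batching if BATCH_SIZE is ever raised above 1.
    return tokenized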


train_dataset = train_dataset.map(tokenize, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(tokenize, remove_columns=val_dataset.column_names)

print(f"\nTokenized. First example length: {len(train_dataset[0]['input_ids'])} tokens")
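
One side effect of reusing EOS as the pad token deserves a look. It appears that `DataCollatorForLanguageModeling` with `mlm=False` builds labels from `input_ids` and sets every position equal to `pad_token_id` to -100, so a genuine EOS token in an example would be ignored by the loss. The sketch imitates that rule in plain Python; `mask_pad_labels` is a hypothetical stand-in, not the library code, and the token ids other than EOS are arbitrary.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def mask_pad_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, replacing every pad-id position with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 50256                      # GPT-2's <|endoftext|> id
sequence = [464, 3290, 13, EOS_ID]  # arbitrary token ids ending in EOS

labels = mask_pad_labels(sequence, pad_token_id=EOS_ID)
print(labels)
```

If the training text does end each example with EOS and the fine-tuned model tends to run on until `max_new_tokens`, this masking is one plausible cause worth checking.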

In [None]:
# === Cell 7: Train ===

training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    warmup_steps=10,
    logging_steps=1,
    eval_strategy="epoch",
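    save_strategy="epoch",
    save_total_limit=3,  # each checkpoint holds the full ~6GB model plus optimizer state; lower this if Drive space is tight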
    fp16=True,
    optim="adamw_bnb_8bit",
    report_to="none",
    gradient_accumulation_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("Starting training...")
steps_per_epoch = -(-len(train_dataset) // BATCH_SIZE)  # ceiling division; gradient accumulation is 1
print(f"  {len(train_dataset)} training examples x {EPOCHS} epochs = {steps_per_epoch * EPOCHS} steps")
print(f"  Checkpoints saved to: {CHECKPOINT_DIR}")

trainer.train()

print("\nTraining complete!")
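
`eval_loss` is roughly the mean cross-entropy per token, so exponentiating it gives perplexity, which is often easier to compare across epochs. The sketch below applies that to a list shaped like `trainer.state.log_history`; the numbers are made-up placeholders, not results from this run.

```python
import math


def eval_perplexities(log_history):
    """Return (epoch, perplexity) pairs for every entry that has an eval_loss."""
    return [
        (entry.get("epoch"), math.exp(entry["eval_loss"]))
        for entry in log_history
        if "eval_loss" in entry
    ]


# Placeholder log entries for illustration only
fake_history = [
    {"loss": 3.2, "epoch": 0.5},
    {"eval_loss": 3.0, "epoch": 1.0},
    {"eval_loss": 2.8, "epoch": 2.0},
    {"eval_loss": 2.9, "epoch": 3.0},
]

for epoch, ppl in eval_perplexities(fake_history):
    print(f"epoch {epoch}: perplexity {ppl:.1f}")
```

After training, `eval_perplexities(trainer.state.log_history)` should give one pair per epoch. Validation perplexity rising while training loss keeps falling is the overfitting signal flagged in the intro.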

In [None]:
# === Cell 8: Generate from FINE-TUNED model on canary prompts ===

finetuned_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"FINE-TUNED — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    finetuned_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nFine-tuned generation complete. {N_SAMPLES} samples per canary saved.")

In [None]:
# === Cell 8b: Save fine-tuned outputs to Drive ===
import json, os

finetuned_path = f"{DRIVE_BASE}/baselines/gpt2_finetuned.json"
os.makedirs(os.path.dirname(finetuned_path), exist_ok=True)  # in case Cell 5b was skipped

with open(finetuned_path, "w") as f:
    json.dump(finetuned_outputs, f, indent=2)

print(f"Fine-tuned outputs saved to: {finetuned_path}")

In [None]:
# === Cell 9: Side-by-side comparison ===

for name, prompt in canary_prompts.items():
    print(f"\n{'#'*70}")
    print(f"  CANARY {name}")
    print(f"  Prompt: {prompt}")
    print(f"{'#'*70}")

    print(f"\n{'─'*35} BASELINE {'─'*35}")
    print(baseline_outputs[name][0])

    print(f"\n{'─'*33} FINE-TUNED {'─'*33}")
    print(finetuned_outputs[name][0])

    print()

print("\n" + "="*70)
print("What to look for:")
print("  - Does the fine-tuned version sound more like you?")
print("  - Is Canary B (novel topic) closer to your voice, or only A (known topic)?")
print("  - If only A improved → memorized content, not voice.")
print("  - Compare these against the Llama notebook's outputs — which model")
print("    captured voice better? That's the RLHF question.")
print("="*70)
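
The memorization question can also get a rough numeric check: the share of word n-grams in a generated sample that appear verbatim in the training text. This is a crude proxy under simple whitespace tokenization, not a metric this notebook defines; `ngram_overlap` and the example strings are illustrative.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(generated, training_text, n=4):
    """Fraction of the sample's n-grams that also occur in the training text."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(training_text, n)) / len(gen)


# Made-up strings standing in for training text and generated samples
train = "class in america is a quiet ladder that nobody admits to climbing"
copied = "class in america is a quiet ladder"
fresh = "two skaters from the bay area chose very different paths"

print(ngram_overlap(copied, train))
print(ngram_overlap(fresh, train))
```

A high overlap on Canary A alongside little stylistic change on Canary B would be consistent with the "memorized content" reading above, though a manual read remains the real test.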

In [None]:
# === Cell 10: Save fine-tuned model to Google Drive ===

save_path = f"{DRIVE_BASE}/models/gpt2-voice-v1"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Fine-tuned model saved to: {save_path}")
print(f"\nTo reload this model later:")
print(f"""\n\
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{save_path}")
tokenizer = AutoTokenizer.from_pretrained("{save_path}")
""")
print("Note: This is the full model (~6GB), not a small adapter like Llama's LoRA.")