# GPT-2-XL Full Fine-Tuning — Voice

Fine-tune GPT-2-XL (1.5B parameters) on essay pairs to capture writing voice.
Full fine-tuning — every weight gets updated, no LoRA adapter.

**Why GPT-2-XL?** It was released in 2019 and trained only on next-token prediction,
with no instruction tuning or RLHF, so it's a "raw" model. Comparing it against Llama
(post-RLHF) answers the question: does alignment training help or hurt voice capture?

**Why full fine-tuning instead of LoRA?** At 1.5B parameters, GPT-2-XL is small enough
to fine-tune fully on a T4 GPU, provided the optimizer state is compressed with 8-bit
Adam. Full fine-tuning gives maximum capacity to learn voice, and on a small dataset a
run should take minutes rather than hours. The tradeoff: higher overfitting risk on a
small dataset.

**Context window:** 1,024 tokens. This limits GPT-2 to short pieces (Notes, openings).
The constraint is actually useful — it forces a clean 1:1 comparison with Llama on
short pairs, isolating voice at the sentence level without essay-level structure.
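
The 1,024-token limit covers the prompt and the generated text together, so the room left for output depends on prompt length. A minimal sketch of that budget follows; the function name and the example token counts are illustrative and not part of the notebook.

```python
# Sketch: how much room a prompt leaves for generated text.
# CONTEXT_WINDOW matches GPT-2's 1,024-token limit; max_output_tokens and
# the example numbers are hypothetical, used only for illustration.
CONTEXT_WINDOW = 1024


def max_output_tokens(prompt_tokens: int, requested: int, window: int = CONTEXT_WINDOW) -> int:
    """Return how many new tokens can be generated without exceeding the window."""
    if prompt_tokens >= window:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, window - prompt_tokens)


print(max_output_tokens(prompt_tokens=60, requested=512))   # 512
print(max_output_tokens(prompt_tokens=700, requested=512))  # 324
```

With short canary prompts, a request for 512 new tokens fits with room to spare; the cap only bites when the prompt itself runs past about 512 tokens.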

## Prerequisites

1. **Set runtime to GPU** (Runtime → Change runtime type → T4 GPU).
2. That's it. No license, no token, no approval wait. GPT-2-XL is fully open.

In [None]:
# bitsandbytes is needed for the 8-bit Adam optimizer (saves ~9GB VRAM during training)
!pip install -q transformers datasets bitsandbytes

In [None]:
# ---------------------------------------------------------------------------
# Upload training data from your local machine
#
# This opens a file picker. Select your JSONL files:
#   - gpt2_train.jsonl
#   - gpt2_val.jsonl
#
# These were generated locally by running:
#   python3 2_scripts/format_jsonl.py
#
# The files live in Colab's temporary storage for this session.
# They'll disappear when the runtime disconnects, but that's fine —
# the trained model gets saved to Google Drive.
# ---------------------------------------------------------------------------

import os
from google.colab import files

print("Select your JSONL files (gpt2_train.jsonl and gpt2_val.jsonl):")
uploaded = files.upload()

DATA_DIR = "/content/data"
os.makedirs(DATA_DIR, exist_ok=True)

for filename in uploaded:
    dest = os.path.join(DATA_DIR, filename)
    os.rename(filename, dest)
    print(f"  {filename} → {dest}")

print(f"\nUploaded {len(uploaded)} file(s) to {DATA_DIR}/")
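
Before spending GPU time, it can help to confirm that the uploaded files match the completion format this notebook expects: one JSON object per line with a `text` field containing the `---` separator. The checker below is a sketch under that assumption; `check_jsonl` and `SEPARATOR` are hypothetical names, not part of `format_jsonl.py`.

```python
import json

# Assumed to match the separator used in the training data.
SEPARATOR = "\n\n---\n\n"


def check_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a completion-style JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            text = record.get("text") if isinstance(record, dict) else None
            if not isinstance(text, str):
                problems.append(f"line {lineno}: missing string field 'text'")
            elif SEPARATOR not in text:
                problems.append(f"line {lineno}: separator not found")
    return problems
```

In this notebook it could be called as `check_jsonl("/content/data/gpt2_train.jsonl")`; an empty list means no problems were detected.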

In [None]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset
from google.colab import drive

# Mount Google Drive for saving the trained model
drive.mount("/content/drive")

# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------

# GPT-2-XL: 1.5B parameters, 1,024 token context window.
# No quantization needed for inference — it fits on a T4 as-is.
# No LoRA — we fine-tune every weight directly.
# No chat template — GPT-2 is a plain completion model. It sees:
#   "Write about X...\n\n---\n\nThe essay text..."
# and learns to predict what comes after the ---.
#
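# Memory during training is tighter than you'd think:
#   Model weights (fp32): ~6GB
#   Gradients (fp32): ~6GB
#   Optimizer states (Adam keeps 2 per parameter): ~12GB in fp32
#   Total in fp32: ~24GB, which exceeds the T4's 16GB before activations.
#
# Fix: 8-bit Adam (via bitsandbytes) shrinks the optimizer states from
# ~12GB to ~3GB, and fp16 mixed precision reduces activation memory.
# Under mixed precision the weights and gradients stay in fp32, so the
# static total is still ~15GB. That is a tight fit on a T4: if training
# runs out of memory, gradient checkpointing
# (gradient_checkpointing=True in TrainingArguments) is the usual next step.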
MODEL_ID = "gpt2-xl"

# Training hyperparameters
LR = 5e-5          # Lower than Llama's 2e-4 because full fine-tuning updates
                    # ALL weights, not just a small adapter. Too high = catastrophic
                    # forgetting (the model forgets how to write English).
EPOCHS = 3
BATCH_SIZE = 1
MAX_SEQ_LEN = 1024  # GPT-2's hard limit. Pairs longer than this get truncated.

# Generation hyperparameters: identical across all 4 models for fair comparison,
# except max_new_tokens, which is capped by GPT-2's context window (see below).
GEN_KWARGS = dict(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=512,   # Shorter than Llama (1024) because GPT-2's total window is 1024.
                          # Prompt + output must fit in 1024 tokens total.
    do_sample=True,
)

N_SAMPLES = 5

# Paths
DATA_DIR = "/content/data"
DRIVE_BASE = "/content/drive/MyDrive/voice-ft"
CHECKPOINT_DIR = f"{DRIVE_BASE}/checkpoints/gpt2"

print(f"Model: {MODEL_ID}")
print(f"LR={LR}, epochs={EPOCHS}, batch={BATCH_SIZE}, max_seq_len={MAX_SEQ_LEN}")
print(f"Data dir: {DATA_DIR}")
print(f"Checkpoints: {CHECKPOINT_DIR}")
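
The memory budget in the config comments can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes weights and gradients are held in fp32 and ignores activations and framework overhead; `training_memory_gb` is a hypothetical helper, and 1.5e9 is the rounded parameter count used throughout this notebook.

```python
# Rough VRAM estimate for full fine-tuning, ignoring activations and
# framework overhead. Treat the totals as approximate.
GB = 1e9


def training_memory_gb(n_params: float, optimizer_bytes_per_state: int) -> dict:
    """Estimate static training memory in GB for an Adam-style optimizer."""
    weights = n_params * 4 / GB                                # fp32 weights
    grads = n_params * 4 / GB                                  # fp32 gradients
    optimizer = n_params * 2 * optimizer_bytes_per_state / GB  # Adam: 2 states per parameter
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


fp32_adam = training_memory_gb(1.5e9, optimizer_bytes_per_state=4)
adam_8bit = training_memory_gb(1.5e9, optimizer_bytes_per_state=1)
print(f"fp32 Adam:  {fp32_adam['total']:.0f} GB")   # 24 GB
print(f"8-bit Adam: {adam_8bit['total']:.0f} GB")   # 15 GB
```

The 8-bit figure leaves very little headroom on a 16GB card once activations are added, which is why batch size stays at 1.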

In [None]:
# ---------------------------------------------------------------------------
# Load GPT-2-XL
#
# No quantization, no special preparation. Just load and go.
# Compare this to the Llama notebook's Cell 4 — that one needs:
#   - HuggingFace login
#   - BitsAndBytesConfig for 4-bit quantization
#   - prepare_model_for_kbit_training()
# GPT-2-XL skips all of that because it's small enough to fit as-is.
# ---------------------------------------------------------------------------

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# GPT-2 has no pad token by default. Same fix as Llama: use eos_token.
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"  (100% trainable — full fine-tuning, no LoRA)")
print(f"Model on: {device}")

In [None]:
# ---------------------------------------------------------------------------
# Canary prompts — same prompts as the Llama notebook.
# Replace these with your actual canary prompts from
# 4_experiments/canary-prompts.md
#
# Only A and B for GPT-2 (both short). Canary C is Llama-only because
# it requires essay-length output that won't fit in 1,024 tokens.
# ---------------------------------------------------------------------------

canary_prompts = {
    "A": "Write a short essay about why writing transforms how you think. Thesis: writing is not a record of thought — it is the act of thinking itself.",
    "B": "Write a short essay about what rock climbing teaches you about fear. Thesis: the hardest part is not the wall — it is trusting your hands after they've slipped.",
}


def generate(model, tokenizer, prompt, n=N_SAMPLES):
    """Generate n samples from the model for a given prompt.

    GPT-2 is a plain completion model — no chat template.
    We feed the prompt directly and the model continues from there.

    Compare to the Llama notebook's generate():
      - Llama uses tokenizer.apply_chat_template() to wrap the prompt
        in the model's chat format (special tokens and role markers).
      - GPT-2 just sees raw text. The prompt format matches the training
        data: 'prompt text\n\n---\n\n' and the model continues after that.
    """
    # Dropout must be off while sampling; the Trainer may leave the model in train mode.
    model.eval()

    # Format to match training data: prompt, then separator
    input_text = f"{prompt}\n\n---\n\n"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[1]

    outputs_list = []
    for i in range(n):
        with torch.no_grad():
            output = model.generate(**inputs, **GEN_KWARGS, pad_token_id=tokenizer.eos_token_id)
        # Decode only the NEW tokens (skip the prompt)
        generated_text = tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
        outputs_list.append(generated_text)

    return outputs_list


# ---------------------------------------------------------------------------
# Generate BASELINE outputs (before fine-tuning)
# ---------------------------------------------------------------------------

baseline_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"BASELINE — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    baseline_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nBaseline generation complete. {N_SAMPLES} samples per canary held in baseline_outputs (in memory only).")
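
The samples above live only in Python variables, so a runtime disconnect loses them. One way to keep them is to write each dictionary to a JSON file. The helpers below are a sketch; `save_samples`, `load_samples`, and any output path are hypothetical and not part of the original notebook.

```python
import json
import os


def save_samples(samples_by_canary: dict, path: str) -> None:
    """Write {canary_name: [sample, ...]} to a JSON file, creating parent dirs."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples_by_canary, f, ensure_ascii=False, indent=2)


def load_samples(path: str) -> dict:
    """Read back a file written by save_samples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_samples(baseline_outputs, f"{DRIVE_BASE}/outputs/gpt2_baseline.json")` would store the baseline on Drive, assuming Drive is mounted and that folder name suits your layout.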

In [None]:
# ---------------------------------------------------------------------------
# Load training data
#
# GPT-2 training format (completion-style JSONL):
#   {"text": "Write about X...\n\n---\n\nThe essay text..."}
#
# Compare to Llama's format (chat-style JSONL):
#   {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
#
# GPT-2 doesn't know about roles or chat. It just sees a stream of text
# and learns to predict the next token. The --- separator is the only
# signal that says "everything after this is the response."
# ---------------------------------------------------------------------------

train_dataset = load_dataset("json", data_files=f"{DATA_DIR}/gpt2_train.jsonl", split="train")
val_dataset = load_dataset("json", data_files=f"{DATA_DIR}/gpt2_val.jsonl", split="train")

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"\nFirst training example (truncated):")
print(train_dataset[0]["text"][:300])


def tokenize(example):
    """Tokenize a completion-style example.

    Much simpler than Llama's format_chat():
    - No chat template to apply
    - No special role tokens
    - Just tokenize the raw text
    """
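    # No "labels" column here: DataCollatorForLanguageModeling(mlm=False)
    # builds labels from input_ids at batch time and masks padding positions.
    # A ragged, unpadded labels column would also break batching if
    # BATCH_SIZE were raised above 1. Because pad_token is eos_token,
    # EOS tokens in the text are masked out of the loss as well.
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding=False,
    )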


train_dataset = train_dataset.map(tokenize, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(tokenize, remove_columns=val_dataset.column_names)

print(f"\nTokenized. First example length: {len(train_dataset[0]['input_ids'])} tokens")
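
Truncation at `MAX_SEQ_LEN` is silent: an over-long pair simply loses the end of its essay. A quick scan of the raw texts shows which examples are affected. The sketch below takes the token counter as a parameter so it runs without a model; `find_truncated` is a hypothetical helper, and the demo uses a whitespace split as a stand-in tokenizer.

```python
from typing import Callable, Iterable


def find_truncated(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_len: int = 1024,
) -> list[tuple[int, int]]:
    """Return (index, token_count) for every text longer than max_len tokens."""
    over = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > max_len:
            over.append((i, n))
    return over


# Stand-in tokenizer for demonstration: one token per whitespace-separated word.
demo = find_truncated(["a b c", "a b c d e"], lambda t: len(t.split()), max_len=4)
print(demo)  # [(1, 5)]
```

In the notebook, the counter could be `lambda t: len(tokenizer(t)["input_ids"])`, applied to the raw `text` column before the `map` call removes it.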

In [None]:
# ---------------------------------------------------------------------------
# Training
#
# Key differences from the Llama notebook:
# - LR is 5e-5 (vs Llama's 2e-4). Lower because full fine-tuning updates
#   ALL 1.5B parameters. Too high and you get catastrophic forgetting.
# - No LoRA — the Trainer updates the model directly.
# - fp16=True enables mixed precision, which reduces activation memory
#   (weights and gradients stay in fp32).
# - adamw_bnb_8bit: 8-bit Adam via bitsandbytes. Standard Adam keeps 2 state
#   tensors per parameter (~12GB for GPT-2-XL). 8-bit Adam compresses those
#   to ~3GB, which is what makes the run feasible on a T4.
# - Should train MUCH faster than Llama (smaller model, no quantization overhead).
# ---------------------------------------------------------------------------

training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    warmup_steps=10,
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
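    save_total_limit=3,             # Each checkpoint holds the full weights plus optimizer state
                                    # (several GB); lower this if Drive space is limited.
    fp16=True,                      # Mixed-precision training: reduces activation memory.
    optim="adamw_bnb_8bit",         # 8-bit Adam: compresses optimizer states from ~12GB to ~3GB.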
    report_to="none",
    gradient_accumulation_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("Starting training...")
steps_per_epoch = -(-len(train_dataset) // BATCH_SIZE)  # ceiling division
print(f"  {len(train_dataset)} training examples / batch size {BATCH_SIZE} x {EPOCHS} epochs = {steps_per_epoch * EPOCHS} steps")
print(f"  Checkpoints saved to: {CHECKPOINT_DIR}")

trainer.train()

print("\nTraining complete!")
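
With a small dataset, the fixed `warmup_steps=10` can be a large share of the whole run. The sketch below estimates the number of optimizer steps and the warmup fraction. It is an approximation for a single device (the Trainer's handling of a partial final accumulation group may differ), and the example of 40 training pairs is hypothetical, not this project's actual dataset size.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates for a full run on one device."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


def warmup_fraction(warmup_steps: int, total_steps: int) -> float:
    """Share of the run spent warming up the learning rate (capped at 1.0)."""
    return min(warmup_steps / total_steps, 1.0)


# Hypothetical example: 40 training pairs with this notebook's settings.
total = optimizer_steps(n_examples=40, batch_size=1, grad_accum=1, epochs=3)
print(total)                                  # 120
print(f"{warmup_fraction(10, total):.1%}")    # 8.3%
```

If the fraction comes out very high, most updates happen below the target learning rate, and a smaller `warmup_steps` may be worth trying.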

In [None]:
# ---------------------------------------------------------------------------
# Generate from FINE-TUNED model
# Same canary prompts, same generate() function, same GEN_KWARGS.
# ---------------------------------------------------------------------------

finetuned_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"FINE-TUNED — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    finetuned_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nFine-tuned generation complete. {N_SAMPLES} samples per canary held in finetuned_outputs (in memory only).")

In [None]:
# ---------------------------------------------------------------------------
# Side-by-side comparison
# ---------------------------------------------------------------------------

for name, prompt in canary_prompts.items():
    print(f"\n{'#'*70}")
    print(f"  CANARY {name}")
    print(f"  Prompt: {prompt}")
    print(f"{'#'*70}")

    print(f"\n{'─'*35} BASELINE {'─'*35}")
    print(baseline_outputs[name][0])

    print(f"\n{'─'*33} FINE-TUNED {'─'*33}")
    print(finetuned_outputs[name][0])

    print()

print("\n" + "="*70)
print("What to look for:")
print("  - Does the fine-tuned version sound more like you?")
print("  - Is Canary B (novel topic) closer to your voice, or only A (known topic)?")
print("  - If only A improved → memorized content, not voice.")
print("  - Compare these against the Llama notebook's outputs — which model")
print("    captured voice better? That's the RLHF question.")
print("="*70)
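
The "memorized content, not voice" check can be given a rough number by measuring how many word n-grams in a generated sample also appear verbatim in the training essays. This is a crude heuristic, not a voice metric, and it is not something the original notebook computes; `ngrams` and `copied_fraction` are hypothetical helpers.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(sample: str, training_texts: list[str], n: int = 5) -> float:
    """Fraction of the sample's word n-grams that also occur in the training texts."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    return len(sample_grams & train_grams) / len(sample_grams)


print(copied_fraction("a b c x y", ["a b c d"], n=2))  # 0.5
```

A high fraction on Canary A but not on Canary B would be consistent with memorization of known content. Running it here requires the raw training texts, for example by re-reading `gpt2_train.jsonl`, since the tokenized dataset no longer holds the `text` column.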

In [None]:
# ---------------------------------------------------------------------------
# Save fine-tuned model to Google Drive
#
# Unlike Llama (which saves only a ~25MB LoRA adapter), this saves the
# FULL 1.5B parameter model (~6GB). That's the tradeoff of full fine-tuning:
# you get maximum learning capacity, but the artifact is the whole model.
# ---------------------------------------------------------------------------

save_path = f"{DRIVE_BASE}/models/gpt2-voice-v1"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Fine-tuned model saved to: {save_path}")
print(f"\nTo reload this model later:")
print(f"""\n\
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{save_path}")
tokenizer = AutoTokenizer.from_pretrained("{save_path}")
""")
print("Note: This is the full model (~6GB), not a small adapter like Llama's LoRA.")