# Llama 3.1 8B LoRA Fine-Tuning — Voice

Fine-tune Llama 3.1 8B on essay pairs (prompt → finished piece) to capture writing voice.
Uses LoRA, so only a small fraction of the parameters is trained (under 0.1% with the rank-16 `q_proj`/`v_proj` adapter from Cell 6); the rest stay frozen.

**Four-way comparison:** base Llama vs fine-tuned Llama vs base GPT-2-XL vs fine-tuned GPT-2-XL.
This notebook handles the Llama half.

## Prerequisites

1. **Accept the Meta license** at [huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Approval usually takes a few hours.
2. **Create a Hugging Face access token** at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Save it in Colab Secrets (the key icon in the left sidebar) under the name `HF_TOKEN` and enable notebook access; Cell 4 reads it from there.
3. **Set runtime to GPU** (Runtime → Change runtime type → T4 GPU).

Training data is pulled from GitHub in Cell 2 — no manual upload needed.

## Cell Map

| Cell | What it does | When to stop |
|------|-------------|--------------|
| 0 | This intro | — |
| 1 | Install dependencies | — |
| 2 | Pull training data from GitHub | SKIP for baseline only |
| 3 | Config + mount Drive | — |
| 4 | Load base Llama (HF login + quantization) | — |
| 5 | Canary baselines | **STOP HERE for baseline** |
| 5b | Save baselines to Drive | — |
| 6 | Apply LoRA adapter | — |
| 7 | Load training data | — |
| 8 | Train | — |
| 9 | Canary on fine-tuned | — |
| 9b | Save fine-tuned outputs to Drive | — |
| 10 | Side-by-side comparison | — |
| 11 | Save LoRA adapter to Drive | — |

In [None]:
# === Cell 1: Install dependencies ===
!pip install -q transformers datasets peft bitsandbytes accelerate

In [None]:
# === Cell 2: Pull training data from GitHub ===

import os

REPO_URL = "https://github.com/lowyelling/voice-fine-tuning.git"
BRANCH = "15-pairs"  # Change this when you merge to main
REPO_DIR = "/content/voice-fine-tuning"
DATA_DIR_LOCAL = "/content/data"

if os.path.exists(REPO_DIR):
    !cd {REPO_DIR} && git checkout {BRANCH} && git pull
    print(f"Pulled latest from {BRANCH}.")
else:
    !git clone -b {BRANCH} {REPO_URL} {REPO_DIR}
    print(f"Cloned repo on branch {BRANCH}.")

# Copy JSONL files to the expected data dir
os.makedirs(DATA_DIR_LOCAL, exist_ok=True)
!cp {REPO_DIR}/1_data/jsonl/llama_*.jsonl {DATA_DIR_LOCAL}/

print("\nTraining data ready:")
!ls -la {DATA_DIR_LOCAL}/
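
Before training, it can help to confirm that each JSONL line has the shape the later cells rely on: a `messages` list whose first entry is the user prompt and whose second is the finished piece. The sketch below is a minimal check under that assumption. `validate_chat_jsonl` is a hypothetical helper, not part of the repo, and the schema is inferred from how Cell 7 indexes `messages`, not confirmed against the data files.

```python
import json


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples for malformed records.

    Assumed schema (inferred from Cell 7, not confirmed by the repo):
    {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or len(messages) < 2:
            problems.append((lineno, "expected a 'messages' list with at least 2 entries"))
            continue
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages[:2]]
        if roles != ["user", "assistant"]:
            problems.append((lineno, f"expected roles ['user', 'assistant'], got {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages[:2]):
            problems.append((lineno, "empty or non-string 'content'"))
    return problems


# Toy records for illustration only
sample_lines = [
    '{"messages": [{"role": "user", "content": "Write a note."}, {"role": "assistant", "content": "Done."}]}',
    '{"messages": [{"role": "user", "content": "Only a prompt."}]}',
    'not json',
]
problems = validate_chat_jsonl(sample_lines)
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

In the notebook, the same function could be pointed at a real file, for example `validate_chat_jsonl(open(f"{DATA_DIR_LOCAL}/llama_train.jsonl"))`. An empty result means every record matched the assumed shape.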

In [None]:
# === Cell 3: Config + mount Google Drive ===

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from google.colab import drive

# Mount Google Drive for saving checkpoints and the final adapter
drive.mount("/content/drive")

# ---------------------------------------------------------------------------
# Config
# ---------------------------------------------------------------------------

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# LoRA hyperparameters
LORA_R = 16
LORA_ALPHA = 32

# Training hyperparameters
LR = 1e-4
EPOCHS = 3
BATCH_SIZE = 1
MAX_SEQ_LEN = 8192

# Generation hyperparameters — identical across all 4 models for fair comparison
GEN_KWARGS = dict(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=1024,
    do_sample=True,
)

N_SAMPLES = 5

# Paths
DATA_DIR = "/content/data"
DRIVE_BASE = "/content/drive/MyDrive/voice-ft"
CHECKPOINT_DIR = f"{DRIVE_BASE}/checkpoints/llama"

print(f"Model: {MODEL_ID}")
print(f"LoRA rank={LORA_R}, alpha={LORA_ALPHA}")
print(f"LR={LR}, epochs={EPOCHS}, batch={BATCH_SIZE}, max_seq_len={MAX_SEQ_LEN}")
print(f"Data dir: {DATA_DIR}")
print(f"Checkpoints: {CHECKPOINT_DIR}")
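
To make the sampling settings in `GEN_KWARGS` less opaque, here is a rough pure-Python sketch of how temperature, top-k, and top-p narrow a next-token distribution before a token is drawn. It is a simplified illustration, not the library's implementation: it omits `repetition_penalty`, and `filter_distribution` and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # 1. Temperature: divide logits; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # 2. Top-k: keep the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = ranked[:top_k]

    # Softmax over the surviving tokens (subtract the max for stability).
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative, cutoff = 0.0, len(kept)
    for position, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            cutoff = position + 1
            break
    kept, probs = kept[:cutoff], probs[:cutoff]

    # Renormalize so the remaining probabilities sum to 1.
    total = sum(probs)
    return {i: p / total for i, p in zip(kept, probs)}


toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0]  # made-up scores for a 6-token vocabulary
filtered = filter_distribution(toy_logits, temperature=0.8, top_k=4, top_p=0.9)
print(filtered)
```

With these toy numbers only the two most likely tokens survive, which shows why the same settings have to be used for every model in the comparison: changing any of them changes how much of the distribution is ever sampled.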

In [None]:
# === Cell 4: Load base Llama (HF login + 4-bit quantization) ===

from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get("HF_TOKEN"))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("Loading model in 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

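# Freezes the base weights and prepares the quantized model for training
# (for example, by enabling gradient checkpointing). Baseline generation in
# Cell 5 runs on this prepared model.
model = prepare_model_for_kbit_training(model)

# Note: 4-bit weights are stored packed, so the total below undercounts the
# logical 8B parameters. Nothing is trainable until LoRA is applied in Cell 6.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters (as stored): {total_params:,}")
print(f"Trainable parameters (before LoRA): {trainable_params:,}")
print(f"Model loaded on: {model.device}")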


In [None]:
# === Cell 5: Generate BASELINE canary outputs (STOP HERE for baseline-only run) ===
# ---------------------------------------------------------------------------
# Source of truth: 4_experiments/canary-prompts.md
#
#   A: known topic, short → compare against actual Note
#   B: novel topic, short → tests voice generalization
#   C: known topic, long  → tests essay-level architecture (Llama only)
# ---------------------------------------------------------------------------

canary_prompts = {
    "A": "Write a personal Substack Note/Tweet about class in America, told from the perspective of a Chinese first generation immigrant whose family is lower-middle class.",
    "B": "Write a personal Substack Note/Tweet about Eileen Gu and Alyssa Liu, both winter Olympic gold medalists. Both grew up in the Bay Area, are half-asian and half-white, conceived via anonymous egg donor, and raised by a single parent. Eileen competed for China in skiing and is maximizing her influencer career while studying at Stanford. Meanwhile, Alyssa competed for the United States, took breaks from skating, and is inactive on social media.",
    "C": "Write an essay about Jacques Ellul as a forgotten prophet of propaganda and technological conformity.",
}


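def generate(model, tokenizer, prompt, n=N_SAMPLES):
    """Generate n independent samples from the model for a given prompt."""
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # The chat template already emits the BOS token, so skip the tokenizer's own.
    inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)
    input_len = inputs["input_ids"].shape[1]

    model.eval()  # disable dropout (matters after training, when LoRA dropout is active)
    outputs_list = []
    for _ in range(n):
        with torch.no_grad():
            output = model.generate(
                **inputs,
                **GEN_KWARGS,
                pad_token_id=tokenizer.pad_token_id,  # avoids the missing-pad-token warning
            )
        # Decode only the newly generated tokens, not the prompt.
        generated_text = tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
        outputs_list.append(generated_text)

    return outputs_list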


# ---------------------------------------------------------------------------
# Generate BASELINE outputs (before fine-tuning)
# ---------------------------------------------------------------------------

baseline_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"BASELINE — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    baseline_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nBaseline generation complete. {N_SAMPLES} samples per canary saved.")

In [None]:
# === Cell 5b: Save baselines to Drive (insurance against disconnection) ===
import json, os

baseline_path = f"{DRIVE_BASE}/baselines/llama_baselines.json"
os.makedirs(f"{DRIVE_BASE}/baselines", exist_ok=True)

with open(baseline_path, "w") as f:
    json.dump(baseline_outputs, f, indent=2)

print(f"Baselines saved to: {baseline_path}")

In [None]:
# === Cell 6: Apply LoRA adapter ===

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Expect roughly 0.08% trainable (about 6.8M parameters) for r=16 on q_proj and v_proj.
# If it prints 0, the adapter was not attached.
model.print_trainable_parameters()
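
The trainable share can be estimated by hand as a sanity check on the printed figure. The sketch below does the arithmetic for a rank-16 adapter on `q_proj` and `v_proj`. The layer dimensions and total parameter count are assumptions taken from the published Llama 3.1 8B configuration, not from anything this notebook prints, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed Llama 3.1 8B dimensions (from the published model config, not from this notebook)
HIDDEN = 4096          # hidden size; q_proj maps 4096 -> 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dim; v_proj maps 4096 -> 1024
N_LAYERS = 32
TOTAL_PARAMS = 8.03e9  # approximate parameter count of the base model

r = 16  # matches LORA_R in the config cell
per_layer = lora_param_count(HIDDEN, HIDDEN, r) + lora_param_count(HIDDEN, KV_DIM, r)
trainable = per_layer * N_LAYERS
share = 100 * trainable / TOTAL_PARAMS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters total:     {trainable:,}")
print(f"Share of base model:       {share:.3f}%")
```

Doubling the rank or adding more target modules scales the count linearly, so the same function can be used to budget other adapter configurations.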

In [None]:
# === Cell 7: Load training data ===

train_path = f"{DATA_DIR}/llama_train.jsonl"
val_path = f"{DATA_DIR}/llama_val.jsonl"

train_dataset = load_dataset("json", data_files=train_path, split="train")
val_dataset = load_dataset("json", data_files=val_path, split="train")

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"\nFirst training example messages[0] (user prompt):")
print(train_dataset[0]["messages"][0]["content"][:200])


def format_chat(example):
    """Convert a chat-style example into tokenized input for training.
    Only compute loss on the assistant response, not the prompt/template."""
    # Full conversation (user + assistant)
    full_text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )

    # Prompt only (user turn + generation prompt marker)
    prompt_text = tokenizer.apply_chat_template(
        example["messages"][:1],  # just the user message
        tokenize=False,
        add_generation_prompt=True,
    )

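    # The chat template already emits the BOS token, so skip the tokenizer's own.
    tokenized = tokenizer(
        full_text,
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding=False,
        add_special_tokens=False,
    )

    # Mask loss on prompt tokens so only the assistant response is trained on.
    prompt_ids = tokenizer(
        prompt_text,
        truncation=True,
        max_length=MAX_SEQ_LEN,
        add_special_tokens=False,
    )["input_ids"]
    # Clamp so a truncated example can never grow the label list.
    prompt_len = min(len(prompt_ids), len(tokenized["input_ids"]))
    labels = tokenized["input_ids"].copy()
    labels[:prompt_len] = [-100] * prompt_len
    tokenized["labels"] = labels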

    return tokenized


train_dataset = train_dataset.map(format_chat, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(format_chat, remove_columns=val_dataset.column_names)

# Verify masking worked — show total vs trained tokens
total_tokens = len(train_dataset[0]["input_ids"])
trained_tokens = sum(1 for l in train_dataset[0]["labels"] if l != -100)
print(f"\nTokenized. First example: {total_tokens} total tokens, {trained_tokens} trained (assistant only)")
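
The masking step in `format_chat` is easier to see on a toy sequence. The sketch below uses made-up token IDs and a hypothetical `mask_prompt` helper; it only illustrates the labeling rule and does not call the tokenizer.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels and blank out the first prompt_len positions."""
    prompt_len = min(prompt_len, len(input_ids))  # guard against over-long prompts
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Made-up token IDs: 4 prompt/template tokens followed by 3 assistant tokens
toy_input_ids = [1, 11, 12, 13, 21, 22, 23]
toy_labels = mask_prompt(toy_input_ids, prompt_len=4)

print(toy_labels)
trained = sum(1 for label in toy_labels if label != IGNORE_INDEX)
print(f"{trained} of {len(toy_labels)} tokens contribute to the loss")
```

If the "trained" count printed by the cell above equals the total token count, the prompt was not masked. If it is zero, the example was probably truncated before the assistant response began.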


In [None]:
# === Cell 8: Train ===

from transformers import DataCollatorForSeq2Seq

training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    warmup_steps=2,
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    fp16=True,
    report_to="none",
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("Starting training...")
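steps_per_epoch = -(-len(train_dataset) // BATCH_SIZE)  # ceiling division; assumes gradient_accumulation_steps=1 on a single GPU
print(f"  {len(train_dataset)} training examples x {EPOCHS} epochs = {steps_per_epoch * EPOCHS} optimizer steps")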
print(f"  Warmup: 2 steps")
print(f"  Gradient checkpointing: ON (trades compute for VRAM)")
print(f"  Checkpoints saved to: {CHECKPOINT_DIR}")

trainer.train()

print("\nTraining complete!")
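
With so few optimizer steps, the learning-rate schedule shapes the whole run. The sketch below reproduces a linear warmup followed by linear decay, which is what `Trainer` should apply here on the assumption that `lr_scheduler_type` is left at its default of `"linear"` (the notebook does not set it). The example count of 12 is hypothetical, and `linear_schedule_lr` is an illustrative helper, not a library function.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 12 training examples, batch size 1, 3 epochs -> 36 optimizer steps
n_examples, epochs = 12, 3
total_steps = n_examples * epochs
schedule = [
    linear_schedule_lr(s, base_lr=1e-4, warmup_steps=2, total_steps=total_steps)
    for s in range(total_steps)
]

for step in (0, 1, 2, 18, 35):
    print(f"step {step:>2}: lr = {schedule[step]:.2e}")
```

The rate peaks at step 2 and falls steadily afterwards, so most of the third epoch runs at a fraction of `LR`. That is worth keeping in mind when comparing per-epoch eval losses.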

In [None]:
# === Cell 9: Generate from FINE-TUNED model on canary prompts ===

finetuned_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"FINE-TUNED — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    finetuned_outputs[name] = samples

    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nFine-tuned generation complete. {N_SAMPLES} samples per canary saved.")

In [None]:
# === Cell 9b: Save fine-tuned outputs to Drive ===
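import json, os

finetuned_path = f"{DRIVE_BASE}/baselines/llama_finetuned.json"
# Create the folder in case Cell 5b was skipped or ran in an earlier session
os.makedirs(os.path.dirname(finetuned_path), exist_ok=True)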

with open(finetuned_path, "w") as f:
    json.dump(finetuned_outputs, f, indent=2)

print(f"Fine-tuned outputs saved to: {finetuned_path}")

In [None]:
# === Cell 10: Side-by-side comparison ===

for name, prompt in canary_prompts.items():
    print(f"\n{'#'*70}")
    print(f"  CANARY {name}")
    print(f"  Prompt: {prompt}")
    print(f"{'#'*70}")

    print(f"\n{'─'*35} BASELINE {'─'*35}")
    print(baseline_outputs[name][0])

    print(f"\n{'─'*33} FINE-TUNED {'─'*33}")
    print(finetuned_outputs[name][0])

    print()

print("\n" + "="*70)
print("What to look for:")
print("  - Does the fine-tuned version sound more like you?")
print("  - Is Canary B (novel topic) closer to your voice, or only A (known topic)?")
print("  - If only A improved → the model memorized content, not voice.")
print("  - If A and B improved but C (long) didn't → learned sentence voice, not structure.")
print("="*70)
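
The comparison above shows only the first of the samples per canary and relies on both output dictionaries still being in memory. If the runtime disconnected in between, the JSON files written in Cells 5b and 9b can be reloaded and paired instead. The sketch below shows one way to do that; `load_outputs` and `pair_samples` are hypothetical helpers, and the demo writes throwaway files, not the Drive paths.

```python
import json
import os
import tempfile


def load_outputs(path):
    """Load a {canary_name: [samples]} dict saved with json.dump, or None if missing."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def pair_samples(baseline, finetuned):
    """Yield (canary, sample_index, baseline_text, finetuned_text) for every shared sample."""
    for name in sorted(baseline.keys() & finetuned.keys()):
        for i, (base_text, tuned_text) in enumerate(zip(baseline[name], finetuned[name])):
            yield name, i, base_text, tuned_text


# Demo with throwaway files; in the notebook the paths would be baseline_path and finetuned_path
with tempfile.TemporaryDirectory() as tmp:
    demo_base = os.path.join(tmp, "llama_baselines.json")
    demo_tuned = os.path.join(tmp, "llama_finetuned.json")
    with open(demo_base, "w") as f:
        json.dump({"A": ["base a1", "base a2"], "B": ["base b1"]}, f)
    with open(demo_tuned, "w") as f:
        json.dump({"A": ["tuned a1", "tuned a2"], "B": ["tuned b1"]}, f)

    pairs = list(pair_samples(load_outputs(demo_base), load_outputs(demo_tuned)))
    missing = load_outputs(os.path.join(tmp, "does_not_exist.json"))

for name, i, base_text, tuned_text in pairs:
    print(f"Canary {name} sample {i + 1}: {base_text!r} vs {tuned_text!r}")
```

Reading all samples side by side gives a better sense of variance than a single pair, since one sampled generation can sound closer to or further from the target voice by chance.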

In [None]:
# === Cell 11: Save LoRA adapter to Google Drive ===

adapter_path = f"{DRIVE_BASE}/adapters/llama-voice-v1"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

print(f"LoRA adapter saved to: {adapter_path}")
print(f"\nTo reload this adapter later:")
print(f"""\n\
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "{MODEL_ID}",
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, "{adapter_path}")
tokenizer = AutoTokenizer.from_pretrained("{adapter_path}")
""")