# Llama 3.1 8B LoRA Fine-Tuning — Voice

Fine-tune Llama 3.1 8B on essay pairs (prompt → finished piece) to capture writing voice.
Uses LoRA so we only train ~0.1-0.5% of parameters — the rest stay frozen.

**Four-way comparison:** base Llama vs fine-tuned Llama vs base GPT-2-XL vs fine-tuned GPT-2-XL.
This notebook handles the Llama half.

## Prerequisites

1. **Accept the Meta license** at [huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Approval is not instant and can take a few hours.
2. **Create a Hugging Face access token** at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). You'll paste it when prompted below.
3. **Set runtime to GPU** (Runtime → Change runtime type → T4 GPU).

Training data gets uploaded in the cells below — no need to manually put files on Drive first.

In [None]:
!pip install -q transformers datasets peft bitsandbytes accelerate

In [None]:
# ---------------------------------------------------------------------------
# Upload training data from your local machine
#
# This opens a file picker. Select your JSONL files:
#   - llama_train.jsonl
#   - llama_val.jsonl
#
# These were generated locally by running:
#   python3 2_scripts/format_jsonl.py
#
# The files live in Colab's temporary storage (/content/) for this session.
# They'll disappear when the runtime disconnects, but that's fine — the
# trained adapter gets saved to Google Drive, and you can re-upload data
# for the next run.
# ---------------------------------------------------------------------------

import os
from google.colab import files

print("Select your JSONL files (llama_train.jsonl and llama_val.jsonl):")
uploaded = files.upload()

# Move uploaded files to a consistent location
DATA_DIR_LOCAL = "/content/data"
os.makedirs(DATA_DIR_LOCAL, exist_ok=True)

for filename in uploaded:
    dest = os.path.join(DATA_DIR_LOCAL, filename)
    os.rename(filename, dest)
    print(f"  {filename} → {dest}")

print(f"\nUploaded {len(uploaded)} file(s) to {DATA_DIR_LOCAL}/")
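
Before moving on, it can help to confirm that each uploaded file matches the chat-style format the later cells expect. The sketch below is one possible check. `validate_chat_jsonl` is a hypothetical helper, not part of the original pipeline, and it assumes every record is exactly one user turn followed by one assistant turn.

```python
import json


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, f"invalid JSON: {err.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((lineno, "missing 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((lineno, "every message must be a JSON object"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != ["user", "assistant"]:
            problems.append((lineno, f"expected roles ['user', 'assistant'], got {roles}"))
            continue
        if not all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages):
            problems.append((lineno, "empty or non-string 'content'"))
    return problems


# Toy records for illustration only (not real training data)
sample = [
    '{"messages": [{"role": "user", "content": "Write about X"}, {"role": "assistant", "content": "The essay"}]}',
    '{"messages": [{"role": "user", "content": "No reply here"}]}',
]
print(validate_chat_jsonl(sample))
```

In the notebook, the same function could be called on an open file handle, for example `validate_chat_jsonl(open(f"{DATA_DIR_LOCAL}/llama_train.jsonl"))`, since iterating a file yields its lines.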

In [None]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from google.colab import drive

# Mount Google Drive for saving checkpoints and the final adapter
drive.mount("/content/drive")

# ---------------------------------------------------------------------------
# Config — all hyperparameters in one place so you can tweak without hunting
# ---------------------------------------------------------------------------

# We use Instruct (not base) because our training data is chat-style JSONL:
# {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
# The Instruct model was trained with the chat template, so the special tokens
# that apply_chat_template() inserts already mean something to it. The base model
# was not trained on that format, so it would not respond to it reliably.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# LoRA hyperparameters
LORA_R = 16       # Rank — how many dimensions the adapter adds. 16 is moderate.
LORA_ALPHA = 32   # Scaling factor. Rule of thumb: alpha = 2 * rank.

# Training hyperparameters
LR = 2e-4         # Learning rate for LoRA. Higher than full fine-tuning because
                   # we're only updating a tiny fraction of weights.
EPOCHS = 3
BATCH_SIZE = 1     # T4 has 16GB VRAM — batch of 1 is safest with 8B model.
MAX_SEQ_LEN = 2048 # Llama supports 128K, but 2048 covers most essays and saves memory.

# Generation hyperparameters — identical across all 4 models for fair comparison
GEN_KWARGS = dict(
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    max_new_tokens=1024,
    do_sample=True,
)

N_SAMPLES = 5  # Generate 5 samples per prompt, compare the best from each model.

# Paths
# Training data: uploaded to Colab local storage (cell above)
DATA_DIR = "/content/data"
# Checkpoints and adapters: saved to Google Drive (persists across sessions)
DRIVE_BASE = "/content/drive/MyDrive/voice-ft"
CHECKPOINT_DIR = f"{DRIVE_BASE}/checkpoints/llama"

print(f"Model: {MODEL_ID}")
print(f"LoRA rank={LORA_R}, alpha={LORA_ALPHA}")
print(f"LR={LR}, epochs={EPOCHS}, batch={BATCH_SIZE}, max_seq_len={MAX_SEQ_LEN}")
print(f"Data dir: {DATA_DIR}")
print(f"Checkpoints: {CHECKPOINT_DIR}")
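
To make the sampling settings in `GEN_KWARGS` more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p narrow a next-token distribution. It is an illustration under assumptions, not the `transformers` implementation: the function name and toy logits are hypothetical, the order (temperature, then top-k, then top-p) is assumed, and `repetition_penalty` is left out.

```python
import math


def filter_next_token_probs(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    # Temperature: divide logits before softmax (values below 1 sharpen the distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # Unnormalized softmax weights

    # Top-k: keep only the k highest-weight tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_k_mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this.
    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}


# Toy 4-token vocabulary (hypothetical logits)
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

With these toy values, top-k drops the weakest token and top-p then drops one more, so only tokens 0 and 1 remain candidates. Keeping the same settings across all four models means each one samples from a distribution that was narrowed in the same way.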

In [None]:
from huggingface_hub import login

# This opens a prompt where you paste your HF access token.
# The token is NOT stored in the notebook — it lives in your Colab session only.
login()

# 4-bit quantization config: loads the 8B model into roughly 5-6GB of VRAM instead of ~16GB in fp16.
# nf4 (4-bit NormalFloat) is the data type introduced with QLoRA and the usual choice for fine-tuning.
# double_quant also quantizes the quantization constants for a little extra memory savings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Llama doesn't have a pad token by default. We reuse eos_token so padding
# doesn't introduce a new token the model hasn't seen.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Right-padding for training. Batched generation would need left-padding.

print("Loading model in 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically places layers across available devices
)

# prepare_model_for_kbit_training gets the quantized model ready for training:
# 1. Freezes the base weights (the 4-bit weights are never updated directly)
# 2. Casts the remaining fp16/bf16 parameters (e.g. layer norms) to float32 for stable gradients
# 3. Enables gradient checkpointing by default to reduce activation memory
# Skipping it tends to make training on a quantized model unstable or fail outright.
model = prepare_model_for_kbit_training(model)

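# Count parameters as a sanity check. bitsandbytes packs two 4-bit weights per byte,
# so numel() reports about half the true count for the quantized layers.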
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters (before LoRA): {trainable_params:,}")
print(f"Model loaded on: {model.device}")
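
The memory figures quoted for 4-bit loading can be sanity-checked with simple arithmetic. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 32 layers, vocabulary 128256, key/value projection width 1024) and assumes that only the decoder's linear layers are quantized while the embedding and output matrices stay in fp16. It ignores quantization constants, activations, and CUDA overhead, so treat the result as a lower bound.

```python
def estimate_weight_memory_gb(quantized_params, fp16_params, bits_per_quantized_weight=4.0):
    """Rough weight-only memory estimate in GB (1 GB = 1e9 bytes)."""
    quantized_bytes = quantized_params * bits_per_quantized_weight / 8
    fp16_bytes = fp16_params * 2
    return (quantized_bytes + fp16_bytes) / 1e9


# Assumed Llama 3.1 8B shapes (not read from the model config here)
HIDDEN, MLP, LAYERS, VOCAB, KV_DIM = 4096, 14336, 32, 128256, 1024

# Per layer: q_proj and o_proj (HIDDEN x HIDDEN), k_proj and v_proj (HIDDEN x KV_DIM),
# plus gate_proj, up_proj, down_proj (HIDDEN x MLP each)
linear_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * MLP
linear_params = LAYERS * linear_per_layer
embedding_params = 2 * VOCAB * HIDDEN  # Input embeddings + output head, assumed untied

fp16_gb = estimate_weight_memory_gb(0, linear_params + embedding_params)
nf4_gb = estimate_weight_memory_gb(linear_params, embedding_params)
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions, the unquantized embedding and output matrices account for a sizeable share of the 4-bit footprint, which is why the total does not shrink by a full factor of four.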

In [None]:
# ---------------------------------------------------------------------------
# Canary prompts — fixed prompts run after every training iteration.
# They diagnose what the model learned vs memorized.
#
#   A: known topic, short → compare against actual essay (did it learn your voice?)
#   B: novel topic, short → tests generalization (voice without memorized content?)
#   C: known topic, long  → tests essay-level architecture (Llama only)
#
# If A sounds like you but B doesn't → memorized content, not voice.
# If A and B sound right but C doesn't → learned sentence-level voice, not structure.
# ---------------------------------------------------------------------------

canary_prompts = {
    "A": "Write a short essay about why writing transforms how you think. Thesis: writing is not a record of thought — it is the act of thinking itself.",
    "B": "Write a short essay about what rock climbing teaches you about fear. Thesis: the hardest part is not the wall — it is trusting your hands after they've slipped.",
    "C": "Write an essay about the difference between choosing and deciding. Thesis: decisions are what you make when the options are legible. Choices are what you make when they aren't. Most of what matters in life is choosing, not deciding.",
}


def generate(model, tokenizer, prompt, n=N_SAMPLES):
    """Generate n samples from the model for a given prompt.

    Uses apply_chat_template() to format the prompt with the model's expected
    special tokens. This is better than hand-writing the header tokens because:
    - The template ships with the tokenizer, so it stays in sync with the model version
    - It handles system prompts, BOS/EOS tokens, etc. automatically
    """
    # Format as a single-turn chat: user asks, assistant responds
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,  # Adds the assistant turn header so model knows to generate
    )

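    # The chat template already starts with the BOS token, so skip the tokenizer's
    # automatic special tokens to avoid a doubled <|begin_of_text|>.
    inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)
    input_len = inputs["input_ids"].shape[1]

    model.eval()  # Make sure dropout (including LoRA dropout) is off while sampling
    outputs_list = []
    for _ in range(n):
        with torch.no_grad():
            output = model.generate(**inputs, **GEN_KWARGS)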
        # Decode only the NEW tokens (skip the prompt)
        generated_text = tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
        outputs_list.append(generated_text)

    return outputs_list


# ---------------------------------------------------------------------------
# Generate BASELINE outputs (before fine-tuning)
# Save these so we can compare against fine-tuned outputs later.
# ---------------------------------------------------------------------------

baseline_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"BASELINE — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    baseline_outputs[name] = samples

    # Print just the first sample as a preview
    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nBaseline generation complete. {N_SAMPLES} samples per canary saved.")
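
The baseline samples above live only in memory, so a runtime disconnect would lose them. One way to persist them is a small JSON round-trip, sketched below. `save_samples` and `load_samples` are hypothetical helpers, and any file name used with them is an assumption rather than part of the original layout.

```python
import json
import os


def save_samples(outputs, path):
    """Write a {canary_name: [sample, ...]} dict to a JSON file, creating parent folders."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(outputs, f, ensure_ascii=False, indent=2)


def load_samples(path):
    """Read back a dict written by save_samples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

In this notebook the call might look like `save_samples(baseline_outputs, f"{DRIVE_BASE}/outputs/llama_baseline.json")`, which would place the file on Drive next to the checkpoints.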

In [None]:
# ---------------------------------------------------------------------------
# Apply LoRA adapter
#
# LoRA injects small trainable matrices into specific layers of the frozen model.
# We target q_proj and v_proj (the query and value projections in attention).
# This is conservative — good for small datasets where you don't want to overfit.
# If voice capture is weak, you can expand to ["q_proj", "v_proj", "k_proj", "o_proj"]
# or even add "gate_proj", "up_proj", "down_proj" (the MLP layers).
# ---------------------------------------------------------------------------

lora_config = LoraConfig(
    r=LORA_R,                        # Rank 16: moderate. Lower = less capacity, higher = more overfitting risk.
    lora_alpha=LORA_ALPHA,            # Scaling factor. The adapter update is multiplied by alpha / r (2.0 here).
    target_modules=["q_proj", "v_proj"],  # Conservative: just attention Q and V.
    lora_dropout=0.05,                # Light dropout to regularize on small dataset.
    bias="none",                      # Don't train bias terms — not enough data to justify it.
    task_type="CAUSAL_LM",            # Tells PEFT this is autoregressive generation.
)

model = get_peft_model(model, lora_config)

# This prints something like "trainable params: 6,815,744 || all params: 8,030,261,248 || trainable%: 0.0849"
# ~0.1% trainable is expected. If it says 0%, the adapter did not attach: check that target_modules matches the model's layer names.
model.print_trainable_parameters()
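
The trainable-parameter count can be reproduced by hand. For each targeted linear layer, LoRA adds two matrices: one of shape `rank x in_features` and one of shape `out_features x rank`. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, 32 layers, and a 1024-wide value projection because of grouped-query attention). These values are assumptions, not read from the loaded model.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Assumed Llama 3.1 8B shapes
HIDDEN = 4096      # Model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 1024      # v_proj maps HIDDEN -> KV_DIM (8 key/value heads x 128 dims)
NUM_LAYERS = 32
RANK = 16

per_layer = lora_param_count(HIDDEN, HIDDEN, RANK) + lora_param_count(HIDDEN, KV_DIM, RANK)
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Approximate size in fp32:  {total * 4 / 1e6:.1f} MB")
```

Adding `k_proj` and `o_proj`, or the MLP projections, to `target_modules` would raise the total by the same formula applied to those layers' shapes.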

In [None]:
# ---------------------------------------------------------------------------
# Load training data from Colab local storage (uploaded in the first code cell)
#
# Expected format (chat-style JSONL):
# {"messages": [{"role": "user", "content": "Write about X..."}, {"role": "assistant", "content": "The essay..."}]}
# ---------------------------------------------------------------------------

train_path = f"{DATA_DIR}/llama_train.jsonl"
val_path = f"{DATA_DIR}/llama_val.jsonl"

train_dataset = load_dataset("json", data_files=train_path, split="train")
val_dataset = load_dataset("json", data_files=val_path, split="train")

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"\nFirst training example messages[0] (user prompt):")
print(train_dataset[0]["messages"][0]["content"][:200])


def format_chat(example):
    """Convert a chat-style example into tokenized input for training.

    apply_chat_template converts the messages list into the model's expected format
    with all the right special tokens (e.g., <|begin_of_text|>, <|start_header_id|>, etc.).
    """
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,  # False because the assistant response is already in the messages
    )

    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding=False,  # DataCollator handles padding per-batch
        add_special_tokens=False,  # The chat template already added BOS; avoid a doubled token
    )

    # labels = input_ids means the model trains on predicting EVERY token, including
    # the prompt tokens. This is simpler and fine for small datasets.
    # For advanced usage: mask prompt tokens by setting their labels to -100,
    # so the model only learns to predict the assistant's response.
    tokenized["labels"] = tokenized["input_ids"].copy()

    return tokenized


# Map the formatting function over both datasets
train_dataset = train_dataset.map(format_chat, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(format_chat, remove_columns=val_dataset.column_names)

print(f"\nTokenized. First example length: {len(train_dataset[0]['input_ids'])} tokens")
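
The prompt-masking variant mentioned in `format_chat` can be sketched on plain token lists. The helper below is hypothetical and assumes the prompt tokens form a prefix of the full sequence; `prompt_len` could be obtained, for instance, by tokenizing the user turn alone with `add_generation_prompt=True` and counting the tokens. Masking like this only takes effect with a collator that preserves the provided labels (such as `DataCollatorForSeq2Seq`); `DataCollatorForLanguageModeling` rebuilds labels from `input_ids`.

```python
IGNORE_INDEX = -100  # Label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, replacing the first prompt_len positions with IGNORE_INDEX."""
    prompt_len = max(0, min(prompt_len, len(input_ids)))
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy token IDs: the first three stand in for the prompt, the rest for the response
toy_input_ids = [101, 7, 8, 42, 43, 44]
print(mask_prompt_labels(toy_input_ids, prompt_len=3))
```

With the mask in place, the loss is computed only on the assistant's tokens, so gradient updates come from reproducing the essay rather than from predicting the prompt.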

In [None]:
# ---------------------------------------------------------------------------
# Training
# ---------------------------------------------------------------------------

training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    warmup_steps=10,
    logging_steps=1,             # Log every single step. Small dataset = want full-resolution loss curve.
    eval_strategy="epoch",       # Evaluate once per epoch (not more — val set is tiny).
    save_strategy="epoch",       # Save checkpoint each epoch to Drive for recovery.
    save_total_limit=3,          # Keep only the 3 most recent checkpoints to save Drive space.
    fp16=True,                   # Mixed precision — faster training, less VRAM.
    report_to="none",            # No wandb/tensorboard — we're logging to notebook output.
    gradient_accumulation_steps=1,  # Effective batch size = BATCH_SIZE * this. It does not reduce memory; on OOM, lower MAX_SEQ_LEN.
    load_best_model_at_end=True, # After training, load the checkpoint with lowest eval loss.
    metric_for_best_model="eval_loss",
)

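# DataCollatorForSeq2Seq pads input_ids, attention_mask, and labels within each batch
# (labels are padded with -100, which the loss ignores) and keeps the labels built in
# format_chat. DataCollatorForLanguageModeling(mlm=False) would instead rebuild labels
# and mask every pad_token_id. Because pad_token == eos_token here, that would also mask
# real EOS tokens, leaving the model with no loss signal for when to stop.
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    label_pad_token_id=-100,
)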

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("Starting training...")
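steps_per_epoch = -(-len(train_dataset) // BATCH_SIZE)  # Ceiling division
print(f"  {len(train_dataset)} training examples x {EPOCHS} epochs = ~{steps_per_epoch * EPOCHS} steps (assuming gradient_accumulation_steps=1)")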
print(f"  Checkpoints saved to: {CHECKPOINT_DIR}")

trainer.train()

print("\nTraining complete!")
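
With `warmup_steps=10` and the Trainer's default linear scheduler, the learning rate ramps up over the first 10 optimizer steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python so the effect on a short run is easy to see. The function name is hypothetical, and the 120-step total assumes a made-up dataset of 40 examples trained for 3 epochs at batch size 1.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=2e-4, warmup_steps=10):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 40 examples x 3 epochs at batch size 1 = 120 optimizer steps
TOTAL_STEPS = 120
for step in (0, 5, 10, 65, 120):
    print(f"step {step:>3}: lr = {linear_warmup_decay_lr(step, TOTAL_STEPS):.2e}")
```

On a run this short, warmup covers a noticeable share of training, so the peak learning rate of `2e-4` applies only briefly before the decay begins.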

In [None]:
# ---------------------------------------------------------------------------
# Generate from FINE-TUNED model
# Same canary prompts, same generate() function, same GEN_KWARGS.
# The only thing that changed is the model weights (LoRA adapter active).
# ---------------------------------------------------------------------------

finetuned_outputs = {}
for name, prompt in canary_prompts.items():
    print(f"\n{'='*60}")
    print(f"FINE-TUNED — Canary {name}")
    print(f"Prompt: {prompt[:80]}...")
    print(f"{'='*60}")

    samples = generate(model, tokenizer, prompt)
    finetuned_outputs[name] = samples

    # Print just the first sample as a preview
    print(f"\nSample 1 of {N_SAMPLES}:")
    print(samples[0][:500])
    print("..." if len(samples[0]) > 500 else "")

print(f"\nFine-tuned generation complete. {N_SAMPLES} samples per canary saved.")

In [None]:
# ---------------------------------------------------------------------------
# Side-by-side comparison: baseline vs fine-tuned
#
# For each canary, print the first sample from baseline and fine-tuned.
# Sample 0 stands in for "best" here; in practice you'd read all 5 and pick.
# ---------------------------------------------------------------------------

for name, prompt in canary_prompts.items():
    print(f"\n{'#'*70}")
    print(f"  CANARY {name}")
    print(f"  Prompt: {prompt}")
    print(f"{'#'*70}")

    print(f"\n{'─'*35} BASELINE {'─'*35}")
    print(baseline_outputs[name][0])

    print(f"\n{'─'*33} FINE-TUNED {'─'*33}")
    print(finetuned_outputs[name][0])

    print()  # Blank line between canaries

print("\n" + "="*70)
print("What to look for:")
print("  - Does the fine-tuned version sound more like you?")
print("  - Is Canary B (novel topic) closer to your voice, or only A (known topic)?")
print("  - If only A improved → the model memorized content, not voice.")
print("  - If A and B improved but C (long) didn't → learned sentence voice, not structure.")
print("="*70)
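
The canary reading rules printed above can also be written as a small decision function, which may help when logging results across several training runs. This is only a sketch: `diagnose_canaries` is a hypothetical helper, the three booleans are your own manual judgments after reading the samples, and any combination the two rules do not cover is reported as inconclusive rather than guessed at.

```python
def diagnose_canaries(a_improved, b_improved, c_improved):
    """Map manual judgments on canaries A, B, C to the notebook's two diagnostic rules."""
    if a_improved and not b_improved:
        return "memorized content, not voice"
    if a_improved and b_improved and not c_improved:
        return "learned sentence-level voice, not structure"
    # Anything else is outside the two rules, so make no claim
    return "not covered by the two rules; read more samples"


print(diagnose_canaries(a_improved=True, b_improved=False, c_improved=False))
print(diagnose_canaries(a_improved=True, b_improved=True, c_improved=False))
```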

In [None]:
# ---------------------------------------------------------------------------
# Save LoRA adapter to Google Drive
#
# This saves ONLY the LoRA weights (~25MB), not the full 8B model.
# To use later, you load the base model + this adapter.
# ---------------------------------------------------------------------------

adapter_path = f"{DRIVE_BASE}/adapters/llama-voice-v1"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

print(f"LoRA adapter saved to: {adapter_path}")
print(f"\nTo reload this adapter later:")
print(f"""\n\
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Load base model (same quantization config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "{MODEL_ID}",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load LoRA adapter on top
model = PeftModel.from_pretrained(base_model, "{adapter_path}")
tokenizer = AutoTokenizer.from_pretrained("{adapter_path}")
""")