
# LLM Fine-Tuning: Step-by-Step Tutorial (HF Transformers + PEFT)

This notebook walks through **multiple fine-tuning approaches** using Hugging Face:
- Full fine-tuning (sequence classification)
- Instruction tuning (causal LM on `(instruction, input, output)` examples)
- Parameter-Efficient Fine-Tuning (PEFT):
  - **LoRA**
  - **Prefix Tuning**
  - **BitFit**

> **Note:** Choose *small models* for CPU training (e.g., `distilbert-base-uncased`, `distilgpt2`). A GPU is strongly recommended.
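
A rough way to see why small models matter on CPU: full fine-tuning with AdamW in fp32 keeps weights, gradients, and two optimizer moments for every parameter, roughly 16 bytes each before activations. The sketch below applies that rule of thumb; the parameter counts (about 66M for DistilBERT, about 82M for DistilGPT2) are approximate figures assumed for illustration.

```python
# Rule-of-thumb memory for full fine-tuning with AdamW in fp32 (activations excluded).
BYTES_PER_PARAM = 4 + 4 + 8  # weights + gradients + Adam first/second moments

def training_memory_gb(num_params: int) -> float:
    """Approximate GiB needed for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1024**3

# Approximate parameter counts, assumed for illustration only.
approx_params = {"distilbert-base-uncased": 66_000_000, "distilgpt2": 82_000_000}

for name, n in approx_params.items():
    print(f"{name}: ~{training_memory_gb(n):.2f} GB")
```

Activations and data loading add to this, so treat the result as a lower bound.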



## 0. Environment Setup

Uncomment if needed:
```bash
# pip install -U transformers datasets accelerate peft
# pip install -U bitsandbytes  # optional, for 4-bit/8-bit quant (GPU only)
```


In [1]:
%pip install -U transformers datasets accelerate peft


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## 1. Utilities

In [2]:

import os, math, torch, importlib
from dataclasses import dataclass
from typing import Dict, Any

print("PyTorch:", torch.__version__, "CUDA available:", torch.cuda.is_available())

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def is_available(pkg: str) -> bool:
    try:
        importlib.import_module(pkg)
        return True
    except Exception:
        return False


PyTorch: 2.8.0+cpu CUDA available: False



## 2. Datasets & Tokenizers
We'll demo with:
- **IMDB** (binary sentiment classification) for full fine-tuning of an encoder model.
- **WikiText-2** as an optional corpus for plain language-model fine-tuning (it is not loaded in the cells below).
- **Dolly-15k**-style instruction data for instruction tuning. The cells below use a tiny toy dataset in that format.

> Replace with your own dataset as needed.


In [3]:

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer

# Load IMDB for classification
imdb = load_dataset("imdb")

# Tokenizers
tok_bert = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tok_gpt  = AutoTokenizer.from_pretrained("distilgpt2")
tok_gpt.pad_token = tok_gpt.eos_token  # GPT2 family has no PAD by default

print(imdb, tok_bert.__class__.__name__, tok_gpt.__class__.__name__)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
}) DistilBertTokenizerFast GPT2TokenizerFast



### 2.1 Preprocess IMDB (Sequence Classification)


In [4]:

def tokenize_imdb(batch):
    return tok_bert(batch["text"], truncation=True, padding="max_length", max_length=256)

imdb_tok = imdb.map(tokenize_imdb, batched=True)
imdb_tok = imdb_tok.remove_columns(["text"])
imdb_tok = imdb_tok.rename_column("label", "labels")
imdb_tok.set_format(type="torch")

small_train = imdb_tok["train"].shuffle(seed=42).select(range(2000))   # small subset for demo
small_test  = imdb_tok["test"].shuffle(seed=42).select(range(1000))
len(small_train), len(small_test)


(2000, 1000)
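
`padding="max_length"` pads every review to 256 tokens, which is simple but wasteful when many inputs are shorter. The sketch below compares it with per-batch (dynamic) padding. The token lengths and batch size are made-up values for illustration, not measurements from IMDB.

```python
MAX_LENGTH = 256

def padded_tokens_fixed(lengths, max_length=MAX_LENGTH):
    """Total tokens processed when every example is truncated/padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size, max_length=MAX_LENGTH):
    """Total tokens when each batch is padded only to its longest (truncated) example."""
    clipped = [min(n, max_length) for n in lengths]
    total = 0
    for i in range(0, len(clipped), batch_size):
        batch = clipped[i:i + batch_size]
        total += len(batch) * max(batch)
    return total

lengths = [40, 310, 95, 120, 60, 150, 33, 180]  # hypothetical token counts
fixed = padded_tokens_fixed(lengths)
dynamic = padded_tokens_dynamic(lengths, batch_size=4)
print(f"fixed: {fixed} tokens, dynamic: {dynamic} tokens")
```

With Hugging Face, dynamic padding is usually done by tokenizing without `padding` and passing `DataCollatorWithPadding` to the `Trainer`.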

In [5]:
%pip install transformers --upgrade
import transformers
print(transformers.__version__)

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
4.56.2



## 3. Full Fine-Tuning (Sequence Classification)

We will fine-tune **DistilBERT** on IMDB:
- Update **all** model weights.
- Use Hugging Face **`Trainer`** for simplicity.


In [6]:
# ===== DistilBERT Fine-Tuning on IMDB (version-agnostic) =====
# - Falls back to basic TrainingArguments if this build rejects a keyword
# - Small subsets for quick CPU runs
# - Prints final accuracy & F1

import os, sys, inspect, math, random
import numpy as np
import torch

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

print("Transformers:", transformers.__version__)
print("Python:", sys.version)
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# ---------- Reproducibility ----------
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", DEVICE)

# ---------- Data & Tokenizer ----------
imdb = load_dataset("imdb")
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_imdb(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

imdb_tok = imdb.map(tokenize_imdb, batched=True)
imdb_tok = imdb_tok.remove_columns(["text"]).rename_column("label", "labels")
imdb_tok.set_format(type="torch")

# Small subsets for speed (adjust up if you have GPU)
small_train = imdb_tok["train"].shuffle(seed=SEED).select(range(2000))
small_test  = imdb_tok["test"].shuffle(seed=SEED).select(range(1000))

# ---------- Model ----------
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
).to(DEVICE)

# ---------- Metrics ----------
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
    }

# ---------- Version-agnostic TrainingArguments ----------
base_kwargs = dict(
    output_dir="./ft_cls",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,         # lower to 2-3 for a faster CPU run
    logging_steps=50,
    seed=SEED,
    report_to="none",           # avoid W&B or other reporters unless you use them
    save_total_limit=2,
)

# Try the `evaluation_strategy` keyword; fall back if this build rejects it
try:
    # `evaluation_strategy` was renamed to `eval_strategy` in later transformers
    # releases (Section 4 uses the new name), so recent builds raise TypeError here.
    training_args = TrainingArguments(
        **base_kwargs,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,  # requires matching eval/save strategies
        metric_for_best_model="f1",
        greater_is_better=True,
    )
except TypeError:
    # Keyword not accepted: train without per-epoch evaluation or best-checkpoint
    # selection, and run a single explicit evaluation after training.
    training_args = TrainingArguments(
        **base_kwargs,
    )
    print(
        "[Info] Using legacy-compatible TrainingArguments (no evaluation during training). "
        "We'll run an explicit evaluation at the end."
    )

# ---------- Trainer ----------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_test,      # used if evaluation is active
    tokenizer=tok,
    compute_metrics=compute_metrics,
)

# ---------- Train & Evaluate ----------
trainer.train()

# Always run a final evaluation; works for all versions
metrics = trainer.evaluate()
print("Final metrics:", metrics)

# Optional: save final model (and tokenizer)
trainer.save_model("./ft_cls/final")
tok.save_pretrained("./ft_cls/final")

# Quick test inference (sanity check)
from transformers import pipeline
clf = pipeline("text-classification", model="./ft_cls/final", tokenizer=tok, device=0 if DEVICE=="cuda" else -1)
print(clf("This movie was absolutely wonderful and inspiring!"))
print(clf("This was painfully boring. I would not recommend it."))


Transformers: 4.56.2
Python: 3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, 16:37:03) [MSC v.1929 64 bit (AMD64)]
PyTorch: 2.8.0+cpu | CUDA available: False
Using device: cpu


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[Info] Using legacy-compatible TrainingArguments (no evaluation during training). We'll run an explicit evaluation at the end.


Step,Training Loss
50,0.6681
100,0.4924
150,0.4225
200,0.4169
250,0.313
300,0.2501
350,0.3162
400,0.1852
450,0.2154
500,0.2697




Final metrics: {'eval_loss': 0.802811861038208, 'eval_accuracy': 0.87, 'eval_f1': 0.8715415019762845, 'eval_runtime': 36.9387, 'eval_samples_per_second': 27.072, 'eval_steps_per_second': 3.384, 'epoch': 10.0}


Device set to use cpu


[{'label': 'LABEL_1', 'score': 0.9997630715370178}]
[{'label': 'LABEL_0', 'score': 0.9995478987693787}]
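
`compute_metrics` delegates to scikit-learn. As a sanity check on what the reported numbers mean, here is a dependency-free sketch of accuracy and binary F1 (positive class = 1) computed from logits. The logits and labels are toy values, and `accuracy_and_f1` is a hypothetical helper, not part of the notebook.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 from per-example logits and integer labels."""
    preds = [argmax(r) for r in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    acc = sum(p == y for p, y in pairs) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return acc, f1

toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4], [0.8, 0.1]]
toy_labels = [0, 1, 1, 1, 0]
acc, f1 = accuracy_and_f1(toy_logits, toy_labels)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```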



## 4. Instruction Tuning (Causal LM)

We simulate instruction tuning with a tiny toy dataset of `(instruction, input, output)` triples.
You can replace `toy_instr_data` with a larger dataset (e.g., Dolly-15k or your own JSON).

We'll fine-tune **DistilGPT2** using a simple formatting function.


In [7]:
import transformers, inspect
print("Transformers version:", transformers.__version__)
from transformers import TrainingArguments
print("Signature:", inspect.signature(TrainingArguments.__init__))

Transformers version: 4.56.2
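
The signature inspected above is what decides which keyword to pass: this build accepts `eval_strategy`, while older releases used `evaluation_strategy`. A small helper can pick the right name at runtime. In the sketch below, `pick_eval_kwarg` is a hypothetical helper, and `NewArgs`/`OldArgs` are stand-in classes so the example runs without `transformers`; with the real library you would pass `TrainingArguments`.

```python
import inspect

def pick_eval_kwarg(args_cls, value="epoch"):
    """Return {name: value} using whichever evaluation keyword args_cls accepts."""
    params = inspect.signature(args_cls.__init__).parameters
    for name in ("eval_strategy", "evaluation_strategy"):
        if name in params:
            return {name: value}
    return {}  # neither keyword is supported; evaluate manually after training

class NewArgs:  # stand-in for a recent TrainingArguments
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir, self.eval_strategy = output_dir, eval_strategy

class OldArgs:  # stand-in for an older TrainingArguments
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir, self.evaluation_strategy = output_dir, evaluation_strategy

print(pick_eval_kwarg(NewArgs))
print(pick_eval_kwarg(OldArgs))
```

The returned dict can then be unpacked into the constructor, e.g. `NewArgs("out", **pick_eval_kwarg(NewArgs))`.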


In [8]:
from datasets import Dataset
from transformers import AutoModelForCausalLM, DataCollatorForLanguageModeling, TrainingArguments, Trainer

# --- toy data & formatting ---
toy_instr_data = [
    {"instruction":"Translate to French", "input":"Hello world!", "output":"Bonjour le monde !"},
    {"instruction":"Summarize", "input":"Transformers are powerful sequence models for NLP.", "output":"Transformers are strong NLP sequence models."},
    {"instruction":"Give a title", "input":"A beginner guide to PEFT methods.", "output":"PEFT Methods: A Beginner's Guide"},
]

def format_example(ex):
    instruction = ex["instruction"].strip()
    inp = ex.get("input", "").strip()
    out = ex["output"].strip()
    if inp:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, out

def build_text(example):
    prompt, out = format_example(example)
    return {"text": prompt + out + tok_gpt.eos_token}   # assumes tok_gpt defined earlier

toy_ds = Dataset.from_list(toy_instr_data).map(build_text)
toy_ds = toy_ds.train_test_split(test_size=0.3, seed=42)

def tokenize_lm(batch):
    return tok_gpt(batch["text"], truncation=True, padding="max_length", max_length=256)

toy_tok = toy_ds.map(tokenize_lm, batched=True, remove_columns=["text"])
# For causal LM, labels mirror input_ids; the collator below rebuilds them and sets pad positions to -100
toy_tok = toy_tok.map(lambda examples: {"labels": examples["input_ids"]})
toy_tok.set_format(type="torch")

# --- model & collator ---
model_lm = AutoModelForCausalLM.from_pretrained("distilgpt2").to(DEVICE)  # assumes DEVICE defined earlier
data_collator = DataCollatorForLanguageModeling(tokenizer=tok_gpt, mlm=False)

# --- IMPORTANT: this transformers build expects `eval_strategy` (not `evaluation_strategy`) ---
args_lm = TrainingArguments(
    output_dir="./ft_instr",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    learning_rate=5e-5,
    eval_strategy="epoch",     # keyword name accepted by this transformers build
    save_strategy="epoch",     # must match the eval strategy when using load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model=None,  # None falls back to eval loss when load_best_model_at_end=True
    logging_steps=10,
    report_to="none",
)

trainer_lm = Trainer(
    model=model_lm,
    args=args_lm,
    train_dataset=toy_tok["train"],
    eval_dataset=toy_tok["test"],
    data_collator=data_collator,
)

trainer_lm.train()
lm_metrics = trainer_lm.evaluate()
print(lm_metrics)


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,No log,4.596352
2,No log,4.44927
3,No log,4.324956
4,No log,4.244764
5,No log,4.184104
6,No log,4.134472
7,No log,4.091203
8,No log,4.06017
9,No log,4.041215
10,3.160600,4.031079


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


{'eval_loss': 4.031078815460205, 'eval_runtime': 0.1121, 'eval_samples_per_second': 8.921, 'eval_steps_per_second': 8.921, 'epoch': 10.0}
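
Validation loss for a causal LM is the mean token-level cross-entropy (in nats), so it converts directly to perplexity with `exp`. The snippet below applies that to the `eval_loss` reported above. With only three toy examples the number is illustrative, not a meaningful benchmark.

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity is exp of the mean cross-entropy loss."""
    return math.exp(eval_loss)

lm_eval_loss = 4.031078815460205  # copied from the metrics printed above
print(f"perplexity ~ {perplexity(lm_eval_loss):.1f}")
```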



## 5. PEFT: LoRA (on Causal LM)

We'll adapt the instruction-tuned setup to use **LoRA**, training only a small number of parameters.

> 4-bit QLoRA requires `bitsandbytes` and a CUDA GPU. The cell below enables it only when both are available.
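
For intuition, LoRA keeps the pretrained weight `W` frozen and learns a low-rank update, so a layer computes `W x + (alpha / r) * B (A x)`. The NumPy sketch below shows that computation on small random matrices. The shapes and initialisation scale are illustrative assumptions, and this is not how PEFT is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8   # toy sizes; r is the LoRA rank

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero at start

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# After "training" (here: a random B), the update can be merged into one matrix.
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W + (alpha / r) * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W_merged @ x)

print("trainable values:", A.size + B.size, "| frozen values:", W.size)
```

Only `A` and `B` would receive gradients; `r` and `alpha` correspond to `r` and `lora_alpha` in `LoraConfig`.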


In [22]:
import torch
from peft import LoraConfig, get_peft_model, TaskType

# Check if CUDA is available
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Optional: 4/8-bit quantization if bitsandbytes + CUDA are available
bnb_available = False
try:
    import bitsandbytes
    bnb_available = DEVICE == "cuda"
except ImportError:
    pass

quant_kwargs = {}
if bnb_available:
    from transformers import BitsAndBytesConfig
    quant_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
    quant_kwargs["device_map"] = {"": 0}  # specify device map

base_lm = AutoModelForCausalLM.from_pretrained("distilgpt2", **quant_kwargs)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn","c_proj"],  # common GPT-2 modules
    fan_in_fan_out=True,
)

lora_model = get_peft_model(base_lm, lora_cfg)

args_lora = TrainingArguments(
    output_dir="./ft_lora",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=20,
    learning_rate=1e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    optim="adamw_torch",
)

trainer_lora = Trainer(
    model=lora_model,
    args=args_lora,
    train_dataset=toy_tok["train"],
    eval_dataset=toy_tok["test"],
    data_collator=data_collator,
)

trainer_lora.train()
lora_metrics = trainer_lora.evaluate()
print(lora_metrics)  # a bare expression in the middle of a cell is not displayed

# Save the adapter weights
lora_model.save_pretrained("./ft_lora_adapter")

Epoch,Training Loss,Validation Loss
1,No log,4.828366
2,No log,4.822859
3,No log,4.817143
4,No log,4.811232
5,No log,4.805022
6,No log,4.79882
7,No log,4.792397
8,No log,4.785789
9,No log,4.779376
10,4.542600,4.773182
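
To put "a small number of parameters" in numbers: each adapted weight of shape `(d_in, d_out)` gains `r * (d_in + d_out)` trainable values. The sketch below applies that to the config above, assuming DistilGPT2's dimensions (6 blocks, hidden size 768, MLP width 3072) and assuming that `"c_proj"` matches both the attention and the MLP output projections. `lora_model.print_trainable_parameters()` reports the authoritative figure.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Values LoRA adds to one (d_in x d_out) weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

R = 8
HIDDEN, MLP, LAYERS = 768, 3072, 6   # assumed DistilGPT2 dimensions

per_block = {
    "attn.c_attn": lora_params(HIDDEN, 3 * HIDDEN, R),  # fused Q, K, V projection
    "attn.c_proj": lora_params(HIDDEN, HIDDEN, R),
    "mlp.c_proj": lora_params(MLP, HIDDEN, R),
}
total_lora = LAYERS * sum(per_block.values())
print(per_block)
print(f"estimated LoRA parameters: {total_lora:,}")
```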





## 6. PEFT: Prefix Tuning (on Causal LM)
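Prefix tuning freezes the base model and trains a small set of **learnable prefix vectors**, prepended to the attention keys and values in each layer, to steer the model.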


In [20]:
from peft import PrefixTuningConfig, get_peft_model, TaskType

base_lm_prefix = AutoModelForCausalLM.from_pretrained("distilgpt2").to(DEVICE)

prefix_cfg = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20
)
prefix_model = get_peft_model(base_lm_prefix, prefix_cfg)

args_prefix = TrainingArguments(
    output_dir="./ft_prefix",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=100,
    learning_rate=1e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
)

trainer_prefix = Trainer(
    model=prefix_model,
    args=args_prefix,
    train_dataset=toy_tok["train"],
    eval_dataset=toy_tok["test"],
    data_collator=data_collator,
)

trainer_prefix.train()
prefix_metrics = trainer_prefix.evaluate()
print(prefix_metrics)  # a bare expression in the middle of a cell is not displayed

# Save the prefix weights
prefix_model.save_pretrained("./ft_prefix_adapter")

Epoch,Training Loss,Validation Loss
1,No log,7.529821
2,No log,7.4995
3,No log,7.471971
4,No log,7.441835
5,No log,7.413912
6,No log,7.38704
7,No log,7.361664
8,No log,7.337343
9,No log,7.31496
10,8.578400,7.296068
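
Without a reparameterisation MLP, prefix tuning learns one key vector and one value vector per virtual token per layer. Assuming DistilGPT2's dimensions (6 blocks, hidden size 768), the trainable size for `num_virtual_tokens=20` works out as follows. Treat it as an estimate and check `prefix_model.print_trainable_parameters()` for the real count.

```python
def prefix_params(num_virtual_tokens: int, num_layers: int, hidden: int) -> int:
    """One key and one value vector (2 * hidden) per virtual token per layer."""
    return num_virtual_tokens * num_layers * 2 * hidden

# Layer count and hidden size are assumed DistilGPT2 dimensions.
total_prefix = prefix_params(num_virtual_tokens=20, num_layers=6, hidden=768)
print(f"estimated prefix parameters: {total_prefix:,}")
```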





## 7. PEFT: BitFit (on Causal LM)
BitFit freezes all weights and updates **only the bias terms** throughout the model.


In [11]:
import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling

# 1) Load model + align PAD/EOS
bitfit_model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(DEVICE)
tok_gpt.pad_token = tok_gpt.eos_token
bitfit_model.config.pad_token_id = tok_gpt.pad_token_id
bitfit_model.config.eos_token_id = tok_gpt.eos_token_id

# 2) Freeze all params, then unfreeze only biases
for name, p in bitfit_model.named_parameters():
    p.requires_grad = False

trainable = []
for name, p in bitfit_model.named_parameters():
    if p.ndim > 0 and name.endswith(".bias"):
        p.requires_grad = True
        trainable.append(name)

print(f"Trainable (bias-only) params: {len(trainable)}")
# Note: LayerNorm biases (ln_1, ln_2, ln_f) also end in ".bias", so the loop above
# already unfreezes them. To train only the linear-layer biases, re-freeze them:
# for name, p in bitfit_model.named_parameters():
#     if "ln_" in name and name.endswith(".bias"):
#         p.requires_grad = False

# 3) Collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tok_gpt, mlm=False)

# 4) Training args (`eval_strategy` is the keyword this transformers build accepts)
args_bitfit = TrainingArguments(
    output_dir="./ft_bitfit",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    learning_rate=5e-4,              # a bit higher lr is typical for BitFit
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=10,
    report_to="none",
)

# 5) Trainer
trainer_bitfit = Trainer(
    model=bitfit_model,
    args=args_bitfit,
    train_dataset=toy_tok["train"],
    eval_dataset=toy_tok["test"],
    processing_class=tok_gpt,        # future-proof vs tokenizer=
    data_collator=data_collator,
)

trainer_bitfit.train()
bitfit_metrics = trainer_bitfit.evaluate()
print(bitfit_metrics)


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


Trainable (bias-only) params: 37


Epoch,Training Loss,Validation Loss
1,No log,4.767668
2,No log,4.720472
3,No log,4.677702
4,No log,4.645629
5,No log,4.61758
6,No log,4.596546
7,No log,4.580127
8,No log,4.56789
9,No log,4.559289
10,4.187900,4.555153


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


{'eval_loss': 4.555152893066406, 'eval_runtime': 0.1069, 'eval_samples_per_second': 9.354, 'eval_steps_per_second': 9.354, 'epoch': 10.0}
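
The count of 37 printed above can be reproduced from parameter names alone. The sketch below generates names in the style a 6-block GPT-2 model exposes (the naming layout is an assumption based on Hugging Face's GPT-2 implementation) and applies the same `.endswith(".bias")` filter. It also shows that LayerNorm biases are part of the selection.

```python
def gpt2_style_param_names(num_layers: int):
    """Yield parameter names following an assumed Hugging Face GPT-2 layout."""
    yield "transformer.wte.weight"
    yield "transformer.wpe.weight"
    for i in range(num_layers):
        for module in ("ln_1", "attn.c_attn", "attn.c_proj", "ln_2", "mlp.c_fc", "mlp.c_proj"):
            yield f"transformer.h.{i}.{module}.weight"
            yield f"transformer.h.{i}.{module}.bias"
    yield "transformer.ln_f.weight"
    yield "transformer.ln_f.bias"

names = list(gpt2_style_param_names(num_layers=6))
bias_names = [n for n in names if n.endswith(".bias")]
layernorm_biases = [n for n in bias_names if ".ln_" in n]
print(len(bias_names), "bias tensors, of which", len(layernorm_biases), "belong to LayerNorm")
```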



## 8. Inference Utility (for Causal LM)


In [25]:
from transformers import pipeline

# Create pipeline once
text_gen = pipeline(
    "text-generation",
    model=lora_model,
    tokenizer=tok_gpt,
    device=0 if DEVICE=="cuda" else -1,
)

def generate_response(prompt, max_new_tokens=60):
    out = text_gen(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
    )
    return out[0]["generated_text"]

# Instruction-style prompt in the fine-tuning format (uncomment to try)
#test_prompt = "### Instruction:\nTranslate to French\n\n### Input:\nGood morning\n\n### Response:\n"
#print(generate_response(test_prompt))

# Free-form prompt
test_prompt = "Once upon a time in a land far, far away"
print(generate_response(test_prompt))

Device set to use cpu


Once upon a time in a land far, far away in space, far away from any land-bound object, the universe in which we live could see it. In any case, the universe in which we lived could exist and, while it could exist in the distant future, in a universe far away from any ground, far away from any object,
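
`temperature` and `top_p` shape the distribution the next token is drawn from. The pure-Python sketch below shows one common formulation (temperature-scaled softmax followed by nucleus filtering) on a toy four-token vocabulary. It is an illustration of the idea, not the library's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [4.0, 3.0, 1.0, -2.0]
probs = softmax(toy_logits, temperature=0.9)
nucleus = top_p_filter(probs, top_p=0.95)
print("token ids kept for sampling:", sorted(nucleus))
```

Sampling then draws from `nucleus` only, which trims the low-probability tail while keeping some randomness.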



## 9. Saving & Loading

Save:
```python
trainer.save_model("path")
```

Load:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("path")
```
For PEFT models, save and load adapters with:
```python
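from peft import PeftModel
from transformers import AutoModelForCausalLM

peft_model.save_pretrained("path_to_adapter")  # writes the adapter weights only
base = AutoModelForCausalLM.from_pretrained("distilgpt2")  # reload the base model first
model = PeftModel.from_pretrained(base, "path_to_adapter")  # then attach the adapter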
```



## 10. What to Report

- **Setup:** model, dataset, tokenizer, hardware (CPU/GPU), hyperparameters.
- **Learning curves:** training/validation loss & (for classification) accuracy/F1.
- **Comparison:** Full FT vs LoRA vs Prefix vs BitFit (final metrics, parameters trained, wall-clock time).
- **Qualitative:** example generations before/after tuning.
- **Reproducibility:** seeds, versions, config.
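
For the "parameters trained" column, applying one helper to every model keeps the comparison consistent. The sketch below relies only on each parameter exposing `numel()` and `requires_grad`, as PyTorch parameters do. `FakeParam` is a stand-in so the example runs without PyTorch; with a real model you would pass `model.parameters()`.

```python
from dataclasses import dataclass

def count_parameters(params):
    """Return (trainable, total) element counts for an iterable of parameters."""
    trainable = total = 0
    for p in params:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total

@dataclass
class FakeParam:  # stand-in for torch.nn.Parameter
    size: int
    requires_grad: bool

    def numel(self) -> int:
        return self.size

demo = [FakeParam(768 * 768, False), FakeParam(768, True)]  # frozen weight, trainable bias
trainable, total = count_parameters(demo)
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```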
