# Parameter-Efficient Fine-Tuning (PEFT) ⚙️🧩

In the previous notebook, you explored how **large language models (LLMs)** can be made smaller and faster through **model compression techniques** such as **Knowledge Distillation** and **Quantization**.  
You learned how to preserve most of a model's capability while drastically reducing its size and computational footprint.

In this notebook, we'll take the next step toward **efficient model adaptation**: learning how to **fine-tune large models for new tasks without retraining all their parameters**.  
This is where **Parameter-Efficient Fine-Tuning (PEFT)** comes in.

### 🧠 What You'll Learn

You'll discover how PEFT allows us to adapt large pretrained models (like **DistilBERT**) to specific tasks while keeping **most of the model frozen** and only training a *small number* of additional parameters.  
This results in:
- **Smaller memory footprint** 🪶  
- **Faster training** ⚡  
- **Comparable performance** 🎯 to full fine-tuning

In this notebook, we'll explore **LoRA (Low-Rank Adaptation)**, which adds lightweight trainable matrices inside attention layers.  

### 📚 Dataset

Just like before, we'll use the **Yelp Polarity** dataset, a real-world collection of positive and negative user reviews.  
This time, instead of masked word prediction, we'll fine-tune our model for **sentiment classification**.

### 🚀 By the End of This Notebook...

You'll have:
- Compared **zero-shot**, **full fine-tuning**, and **PEFT** approaches  
- Measured their **accuracy, training cost, and parameter efficiency**  
- Understood how modern LLM systems achieve rapid, low-cost adaptation at scale

Let's dive in and see how **PEFT bridges the gap between power and efficiency** in modern NLP!


In [None]:
from tqdm import tqdm
import numpy as np

from typing import Dict
from transformers import EvalPrediction
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from peft import LoraConfig, get_peft_model, TaskType

from transformers import Trainer, TrainingArguments
import torch

from sklearn.metrics import accuracy_score, f1_score

In [None]:
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

**`TODO:`** Since we're using `distilbert-base-uncased` as our model, we should only keep samples that fit in its context size. Complete the implementation of the `filter_long_examples` function, which is used to filter our dataset.

In [None]:
# Load Yelp dataset
dataset = load_dataset("yelp_polarity", split={"train": "train[:10000]", "test": "test[:2000]"})

def filter_long_examples(example: Dict) -> bool:
    """Filters out text samples that exceed the maximum token length supported by the model tokenizer.

    Args:
        example (dict):
            A single dataset example containing at least a `"text"` field.
            Example structure:
            {
                "text": "This is a review example.",
                ...
            }

    Returns:
        bool: True if the example should be kept, False if it should be dropped.
    """
    # TODO: Complete the function to filter out examples that exceed the model's max token length
    ...

filtered_dataset = dataset.filter(filter_long_examples)



train_ds = filtered_dataset["train"].select(range(5000))   # small subset for speed
test_ds  = filtered_dataset["test"].select(range(1000))
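
The shape of such a length filter can be sketched without downloading anything. This is only an illustration under stated assumptions: `toy_tokenize`, `fits_in_context`, and `MAX_TOKENS` are hypothetical names that do not appear in the notebook, and a whitespace split stands in for the real tokenizer. With the actual tokenizer, the count would more likely come from `len(tokenizer(example["text"])["input_ids"])`, compared against `tokenizer.model_max_length`.

```python
from typing import Dict, List

MAX_TOKENS = 512  # assumed context size (what DistilBERT's tokenizer reports)


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer: splits on whitespace and
    adds the two special tokens that BERT-style models wrap around inputs."""
    return ["[CLS]"] + text.split() + ["[SEP]"]


def fits_in_context(example: Dict, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True when the example's text fits within max_tokens tokens."""
    return len(toy_tokenize(example["text"])) <= max_tokens


short_review = {"text": "Great food and friendly staff.", "label": 1}
long_review = {"text": "word " * 600, "label": 0}  # 600 words + 2 special tokens

# A filter predicate keeps the examples for which it returns True.
kept = [ex for ex in (short_review, long_review) if fits_in_context(ex)]
print(len(kept), "of 2 examples kept")
```

A predicate of this shape is what `dataset.filter(...)` expects: one example in, one boolean out.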


## 🔍 Evaluating the Base Model (Zero-Shot)

Before we start fine-tuning (either with standard training or using **LoRA**), it's important to understand **how well our base model performs out-of-the-box** on the downstream task.

Our downstream task here is **sentiment classification** using the **Yelp Polarity** dataset.  
Each review is labeled as either:
- `0` → **Negative**
- `1` → **Positive**

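We'll use the pretrained **DistilBERT** model directly, without any additional training, to see how accurately it can classify reviews into positive or negative sentiment.  
This kind of evaluation is often called a **zero-shot test**, since the model hasn't been fine-tuned on this specific dataset yet. Keep in mind that the classification head added by `AutoModelForSequenceClassification` is newly initialized, so accuracy close to chance would not be surprising here.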

**`TODO:`** Use the [Hugging Face pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for text classification to evaluate the original model on the test set. Print the final accuracy. For more information about HF pipelines, refer to the official documentation or revisit the examples from our previous lab sessions.
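
Once predictions are available, turning them into an accuracy number is a small step. The sketch below assumes the pipeline returns one dictionary per input with the default label names `LABEL_0` and `LABEL_1`. The outputs are hand-written stand-ins rather than real model predictions, and `label_to_id` and `accuracy_from_outputs` are hypothetical helper names.

```python
from typing import Dict, List


def label_to_id(label: str) -> int:
    """Convert a default label name such as 'LABEL_1' into the integer 1."""
    return int(label.rsplit("_", 1)[-1])


def accuracy_from_outputs(outputs: List[Dict], gold: List[int]) -> float:
    """Fraction of outputs whose predicted label id matches the gold label."""
    preds = [label_to_id(o["label"]) for o in outputs]
    correct = sum(int(p == g) for p, g in zip(preds, gold))
    return correct / len(gold)


# Hand-written stand-in for text-classification pipeline outputs (not real results).
fake_outputs = [
    {"label": "LABEL_1", "score": 0.53},
    {"label": "LABEL_0", "score": 0.51},
    {"label": "LABEL_1", "score": 0.52},
    {"label": "LABEL_1", "score": 0.50},
]
gold_labels = [1, 0, 0, 1]

acc = accuracy_from_outputs(fake_outputs, gold_labels)
print(f"accuracy = {acc:.2f}")
```

If the model config defines custom label names, the mapping step would need to change accordingly.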

## 🧠 Fine-Tuning the Base Model (Full Fine-Tuning)

Now that we've seen how the pretrained model performs in a **zero-shot** setting, let's improve its accuracy by **fine-tuning** it directly on our downstream task: **sentiment classification** using the Yelp Polarity dataset.

In full fine-tuning, **all model parameters are updated** during training.  
This means the model learns task-specific representations by adjusting every layer of the pretrained network.  
While this typically leads to excellent performance, it also comes with notable downsides:
- 🐢 **Slower training**
- 💾 **Higher memory consumption**
- ⚙️ **More parameters to update and store**

We'll fine-tune our **DistilBERT** model on a subset of the Yelp dataset, using standard training parameters (learning rate, batch size, etc.), and then evaluate its accuracy on the test set.

This will serve as our **baseline for comparison** when we later explore **Parameter-Efficient Fine-Tuning (PEFT)** methods like **LoRA**, where only a small number of parameters are trained.


### 🧩 Preparing the Dataset for Fine-Tuning

Before we can fine-tune the model, we need to preprocess our text data so it's ready for training.  
As you know, transformer models like **DistilBERT** require inputs in the form of **token IDs** rather than raw text.

**`TODO:`**
1. **Tokenize** each review using the pretrained tokenizer to convert text into token IDs.  
2. **Truncate and pad** each sequence to a fixed length of 128 tokens.
3. **Format for PyTorch** by setting the dataset columns to:
   - `input_ids` → tokenized text  
   - `attention_mask` → marks real tokens vs. padding  
   - `label` → sentiment label (0 = negative, 1 = positive)
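
To make step 2 concrete, here is a minimal pure-Python sketch of what truncating and padding to a fixed length produces. It is an illustration only: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a padding id of 0 is an assumption. In the notebook itself the tokenizer would normally do this, for example via `tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)`.

```python
from typing import Dict, List

MAX_LENGTH = 128
PAD_ID = 0  # assumed padding id, as used by BERT-style vocabularies


def pad_or_truncate(token_ids: List[int], max_length: int = MAX_LENGTH) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then right-pad it and build the mask.

    Simplification: a real tokenizer keeps the closing special token when it
    truncates; this sketch just slices the list.
    """
    ids = token_ids[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 2307, 2833, 102])   # illustrative ids
long = pad_or_truncate(list(range(1, 301)))       # 300 ids, gets truncated

print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

Every example ends up with the same length, which is what lets them be stacked into a single tensor per batch.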

**`TODO:`** Complete the `compute_metrics` function. This function should calculate evaluation metrics (such as accuracy and F1 score) from the model's predictions and true labels.
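
The underlying arithmetic can be sketched in plain Python. The logits and labels below are made up, and `argmax_rows`, `accuracy`, and `binary_f1` are hypothetical helpers. In the notebook the same steps would more naturally use `np.argmax(..., axis=-1)` together with `accuracy_score` and `f1_score` from scikit-learn, which are already imported.

```python
from typing import List


def argmax_rows(logits: List[List[float]]) -> List[int]:
    """Pick the index of the largest logit in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def binary_f1(preds: List[int], labels: List[int], positive: int = 1) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for p, y in zip(preds, labels) if p == positive and y == positive)
    fp = sum(1 for p, y in zip(preds, labels) if p == positive and y != positive)
    fn = sum(1 for p, y in zip(preds, labels) if p != positive and y == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up logits for four reviews, two classes each.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

toy_preds = argmax_rows(toy_logits)
metrics = {"accuracy": accuracy(toy_preds, toy_labels), "f1": binary_f1(toy_preds, toy_labels)}
print(metrics)
```

The returned dictionary has the same shape that `Trainer` expects from a `compute_metrics` callback: metric names mapped to floats.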

In [None]:
def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
    """
    Compute evaluation metrics (accuracy and F1 score) for model predictions.

    Args:
        eval_pred (EvalPrediction): 
            An object containing model predictions (`eval_pred.predictions`) and true labels (`eval_pred.label_ids`).

    Returns:
        Dict[str, float]: 
            A dictionary with computed "accuracy" and "f1" scores.
    """
    # TODO: Complete the function to compute accuracy and F1 score
    ...

training_args = TrainingArguments(
    output_dir="distilbert-full-yelp",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),
    logging_steps=50,
    save_steps=500,
    eval_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=test_enc,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


**`TODO:`** Use the `trainer` to train the model and, once that's done, save the trained model.

**`TODO:`** Use the trainer to evaluate the fine-tuned model on the test set.

## ⚙️ Fine-Tuning with LoRA (Parameter-Efficient Fine-Tuning)

Now that we've trained the full model and seen how resource-intensive it can be, let's explore a more efficient approach: **LoRA (Low-Rank Adaptation)**.

LoRA is a **Parameter-Efficient Fine-Tuning (PEFT)** method that allows us to adapt large pretrained models to new tasks **without updating all their parameters**.  
Instead, LoRA introduces a few small trainable matrices into the model's attention layers.  
During training:
- The original model weights remain **frozen** 🧊  
- Only the LoRA parameters are **trained** 🧠  
- The number of trainable parameters is reduced by **orders of magnitude**

Despite this massive reduction in trainable parameters, LoRA often achieves performance **comparable to full fine-tuning**, making it an ideal choice when working with limited compute or storage.

Let's see how much performance we can retain while training only a tiny fraction of the model!
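
To make the idea of "small trainable matrices" concrete, the standard LoRA formulation from Hu et al. computes `h = W x + (alpha / r) * B (A x)`, where `W` stays frozen and only the low-rank factors `A` and `B` are trained. The sketch below illustrates that formula with tiny hand-picked matrices in plain Python. It is not how the PEFT library implements it (dropout and other details are omitted), and all names and values are illustrative.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float, r: int) -> Vector:
    """Frozen projection plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape 2 x 3
A = [[0.5, -0.5, 1.0]]                   # trainable, shape r x 3 with r = 1
B_zero = [[0.0], [0.0]]                  # B starts at zero, so training begins from the base model
B_trained = [[1.0], [-2.0]]              # made-up values standing in for a trained B
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_zero, x, alpha=2.0, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0, r=1)
print(out_init, out_trained)
```

With `B` at zero, the adapted layer reproduces the frozen layer exactly; only as `A` and `B` are trained does the output move away from the pretrained behavior.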


In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

### 🧩 The PEFT Library and LoRA Configuration

To make LoRA integration simple and flexible, we use the **[PEFT (Parameter-Efficient Fine-Tuning)](https://huggingface.co/docs/peft/index)** library from Hugging Face.  
This library provides a unified interface for applying efficient fine-tuning techniques such as **LoRA**, **Prefix Tuning**, **Adapters**, and more, all built on top of the `transformers` ecosystem.

Instead of modifying the model architecture manually, PEFT lets us define a **configuration** that describes *where* and *how* LoRA layers should be inserted.  
Once configured, we simply wrap the base model using `get_peft_model()`, and PEFT handles all the internal modifications automatically.

**LoRA Configuration Parameters:**

- **`task_type`**: Specifies the task type. Here, `TaskType.SEQ_CLS` indicates a *sequence classification* task (e.g., sentiment analysis).

- **`r`**: The **rank** of the LoRA matrices. Controls how many additional parameters are introduced; lower values mean lighter adapters.

- **`lora_alpha`**: A **scaling factor** for the LoRA updates. In the standard formulation, the update is multiplied by `lora_alpha / r` before it's added to the output of the frozen base weights.

- **`lora_dropout`**: The **dropout rate** applied to LoRA layers during training to reduce overfitting.

- **`bias`**: Determines whether bias terms are trainable. Setting it to `"none"` means only LoRA parameters are updated.

- **`target_modules`**: Specifies which model submodules receive LoRA adapters. `"k_lin"` and `"v_lin"` refer to the **key** and **value** projection layers in the attention mechanism.
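
A quick back-of-the-envelope calculation shows what these settings imply. The sketch assumes DistilBERT's hidden size of 768 and its 6 transformer blocks, and counts only the adapter matrices (`A` of shape `r x d_in` and `B` of shape `d_out x r`). The number reported by `print_trainable_parameters()` may be larger, since PEFT can also keep the classification head trainable for `SEQ_CLS` tasks. The helper name is hypothetical.

```python
HIDDEN_SIZE = 768  # DistilBERT hidden dimension
NUM_LAYERS = 6     # DistilBERT transformer blocks


def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


r = 8
lora_alpha = 16
target_modules = ["k_lin", "v_lin"]

per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, r)
adapter_total = per_module * len(target_modules) * NUM_LAYERS
frozen_per_module = HIDDEN_SIZE * HIDDEN_SIZE  # weight matrix of one frozen projection
scaling = lora_alpha / r                       # factor applied to the LoRA update

print(f"LoRA params per projection: {per_module:,}")
print(f"LoRA params in total:       {adapter_total:,}")
print(f"Adapter vs. frozen weight:  {per_module / frozen_per_module:.1%}")
print(f"Scaling (lora_alpha / r):   {scaling}")
```

Doubling `r` doubles the adapter size, while adding `"q_lin"` to `target_modules` would add another set of adapters per layer.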


In [None]:
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["k_lin", "v_lin"],  # attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


**`TODO:`** As before, define the `TrainingArguments` and the `Trainer`. Once that's done, train the `model` (notice that it is now wrapped by PEFT) and save it again.

**`TODO:`** Use the trainer to evaluate the fine-tuned model on the test set.

## Conclusion

**`Discussion:`** Now that you've fine-tuned your model once with and once without LoRA, what do you notice? What are your key takeaways?

\[Your Answer\]

## 🧩 Beyond LoRA: A Whole Family of Parameter-Efficient Fine-Tuning (PEFT) Methods

While **LoRA (Low-Rank Adaptation)** has become the most popular parameter-efficient fine-tuning method, it's far from the only one.  
The PEFT ecosystem includes several complementary techniques that balance **compute efficiency**, **modularity**, and **adaptability**, each suited to different fine-tuning goals.

Here are some key approaches worth exploring:

| Method | Core Idea | When to Use |
|:--|:--|:--|
| **LoRA** | Injects low-rank trainable matrices into attention layers | General-purpose fine-tuning with small GPU footprint |
| **QLoRA** | Combines LoRA with 4-bit quantization for ultra-efficient fine-tuning | Fine-tuning very large models (33B–70B) on a single GPU |
| **Prefix Tuning** | Learns virtual "prefix" tokens prepended to transformer layers | When you need adaptation without modifying base weights |
| **Prompt Tuning** | Learns continuous prompt embeddings instead of textual prompts | For lightweight control or domain-specific prompting |
| **Adapter Tuning** | Adds small trainable "adapter" modules between frozen layers | When you want modular, composable task adapters |
| **BitFit** | Fine-tunes only bias terms across layers | For extremely fast, low-cost adaptation |
| **IA³** | Learns multiplicative scaling vectors for hidden states | For memory-limited setups or stability-focused fine-tuning |

#### 🧰 Useful Libraries

| Library | Description |
|:--|:--|
| [🤗 **PEFT**](https://github.com/huggingface/peft) | The official Hugging Face library implementing LoRA, QLoRA, Prefix, Prompt, and Adapter tuning. Integrates seamlessly with `transformers` and `accelerate`. |
| [**Adapter-Transformers**](https://github.com/adapter-hub/adapter-transformers) | AdapterHub's modular framework for Adapter, Prefix, and BitFit tuning on top of Hugging Face models. |
| [**Colossal-AI**](https://github.com/hpcaitech/ColossalAI) | Large-scale training toolkit supporting LoRA and hybrid PEFT with distributed optimization. |
| [**DeepSpeed**](https://github.com/microsoft/DeepSpeed) | Microsoft's library that includes LoRA/QLoRA support and efficient parameter partitioning. |
| [**OpenDelta**](https://github.com/thunlp/OpenDelta) | Research-oriented toolkit supporting multiple delta-tuning methods (LoRA, Adapter, BitFit, Prefix, IA³). |

These frameworks make it straightforward to experiment with PEFT methods in just a few lines of code, so you can swap fine-tuning strategies without rewriting your training loop.

#### üìö Key References

- Hu et al., *LoRA: Low-Rank Adaptation of Large Language Models* (2021)  
- Dettmers et al., *QLoRA: Efficient Finetuning of Quantized LLMs* (2023)  
- Li & Liang, *Prefix-Tuning: Optimizing Continuous Prompts for Generation* (2021)  
- Lester et al., *The Power of Scale: Parameter-Efficient Prompt Tuning* (2021)  
- Houlsby et al., *Parameter-Efficient Transfer Learning for NLP* (2019)  
- Ben Zaken et al., *BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models* (2021)  
- Liu et al., *Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning* (2022)


These methods form a toolkit of strategies for **scaling fine-tuning without retraining full models**, enabling rapid experimentation and deployment across diverse domains.
