# Week 3 Practical: LLM Efficiency Techniques

## Training and Inference Optimization

### Learning Objectives

By the end of this practical session, you will:

1. **Implement LoRA from scratch** (Exercise 1)
   - Understand low-rank adaptation mechanics (W' = W + BA)
   - Apply LoRA to DistilBERT attention layers
   - Understand how to patch a neural networks (torch)
   - Compare to full fine-tuning (~0.5% parameters, similar accuracy)

2. **Implement INT8 Quantization** (Exercise 2)
   - Quantize FFN layers manually using symmetric quantization
   - Analyze accuracy/memory trade-offs (4x compression)
   - Understand post-training quantization techniques

3. **Use HuggingFace PEFT Library** (Exercise 3)
   - Experiment with production PEFT methods (LoRA, Adapters, Prefix Tuning)
   - Compare different parameter-efficient fine-tuning approaches
   - Learn configuration patterns for real-world use

4. **Benchmark Performance** (Exercise 4)
   - Measure memory usage, training time, inference speed
   - Compare all efficiency techniques systematically
   - Understand real-world trade-offs for deployment decisions

## Setup

The next cell imports the necessary modules â€“ just run it, no need to look at
the details!

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import math
import pandas as pd
import matplotlib.pyplot as plt
import evaluate
from typing import Tuple, List
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

2025-12-15 16:35:22.714651: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 16:35:22.832038: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-12-15 16:35:25.092966: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [4]:
def get_best_device():
    """Returns the best device on this computer"""

    if torch.cuda.is_available():
        device = torch.device("cuda")
        total_memory = torch.cuda.get_device_properties(device).total_memory
        print(f"GPU Memory: {total_memory / 1e9:.1f} GB")
        print(f"GPU Name: {torch.cuda.get_device_name(device)}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"Found device: {device}")
    return device


device = get_best_device()

GPU Memory: 6.4 GB
GPU Name: NVIDIA GeForce RTX 3050 6GB Laptop GPU
Found device: cuda


### Load IMDB Dataset

We'll use a subset of IMDB for faster experimentation with efficiency techniques.

In [5]:
print("Loading IMDb dataset...")
dataset = load_dataset("imdb", split="train")
dataset = dataset.map(lambda x: {"labels": x.pop("label")})

# Use a subset (for speed in efficiency experiments)
dataset = dataset.shuffle(seed=42)

# Default dataset configuration (can be modified for faster experimentation)
dataset_size = 10_000
eval_split = 0.05


dataset = dataset.select(range(dataset_size))

# Split into train/val
train_val = dataset.train_test_split(test_size=eval_split, seed=42)
train_dataset = train_val["train"]
eval_dataset = train_val["test"]

print(f"Train: {len(train_dataset)}, Eval: {len(eval_dataset)}")

Loading IMDb dataset...
Train: 9500, Eval: 500


### Load Pre-trained Sentiment Classifier

Instead of training from scratch, we'll use a pre-trained DistilBERT model
fine-tuned on SST-2 (Stanford Sentiment Treebank). This saves time and provides
a strong baseline for our efficiency experiments.

In [6]:
print("Loading pre-trained sentiment classifier...")
classifier_model_name = "distilbert-base-uncased-finetuned-sst-2-english"
classifier_tokenizer = AutoTokenizer.from_pretrained(classifier_model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    classifier_model_name
).to(device)

print(f"Model: {classifier_model_name}")
print(f"Parameters: {sum(p.numel() for p in classifier_model.parameters()):,}")

Loading pre-trained sentiment classifier...
Model: distilbert-base-uncased-finetuned-sst-2-english
Parameters: 66,955,010


### Tokenize Datasets

In [7]:
print("Tokenizing datasets...")


def preprocess_function(examples):
    return classifier_tokenizer(
        examples["text"], truncation=True, max_length=256, padding="max_length"
    )


train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
eval_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print(
    f"Datasets tokenized with {len(train_dataset[0]['input_ids'])} tokens per example"
)

Tokenizing datasets...


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Datasets tokenized with 256 tokens per example


In [8]:
# Evaluate baseline accuracy on IMDB dataset
print("Evaluating baseline accuracy...")
classifier_model.eval()

# Compute baseline accuracy
accuracy_metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    output_dir="./trained-models/tmp",
    per_device_eval_batch_size=32,
    report_to=None,
)

trainer = Trainer(
    model=classifier_model,
    args=training_args,
    compute_metrics=compute_metrics,
)

prediction_output = trainer.predict(eval_dataset)
classifier_accuracy = compute_metrics(
    (prediction_output.predictions, prediction_output.label_ids)
)["accuracy"]

print(f"Baseline accuracy on IMDB: {classifier_accuracy:.4f}")

Evaluating baseline accuracy...


Baseline accuracy on IMDB: 0.8960


# IMDB statistics and baseline accuracy

In [9]:
print(f"Device: {device}")
print(f"Train samples: {len(train_dataset)}")
print(f"Test samples: {len(eval_dataset)}")
print(f"Model: {classifier_model_name}")
print(f"Baseline accuracy: {classifier_accuracy:.4f}")

Device: cuda
Train samples: 9500
Test samples: 500
Model: distilbert-base-uncased-finetuned-sst-2-english
Baseline accuracy: 0.8960


---
# Exercise 1: Manual LoRA Implementation

## What is LoRA?

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning method
that freezes the pre-trained model weights and injects trainable low-rank
matrices into each layer.

**Key Idea:** Instead of updating $W \in \mathbb{R}^{d \times d}$ directly, we
learn:

$$W' = W + \Delta W = W + BA$$

where:
- $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$
- $r \ll d$ (rank, typically 4-16)
- $W$ is frozen, only $A$ and $B$ are trained

**Paper:** [LoRA: Low-Rank Adaptation of Large Language
Models](https://arxiv.org/abs/2106.09685) (Hu et al., 2021)

## Step 1.1: Implement LoRALayer

We'll implement a LoRA layer that wraps a standard PyTorch Linear layer.

In [99]:
class LoRALayer(nn.Module):
    """
    LoRA (Low-Rank Adaptation) layer.

    Implements: output = W @ x + (B @ A) @ x * (alpha / r)
    where W is frozen, A and B are trainable.

    Args:
        original_layer: The linear layer to adapt
        r: Rank of adaptation matrices (default: 8)
        alpha: Scaling factor (default: 16, effective scale is alpha/r)
    """

    def __init__(
        self,
        original_layer: nn.Linear,
        r: int = 8,
        alpha: float = 16,
    ):
        super(LoRALayer, self).__init__()

        self.original_layer = original_layer
        self.in_features = original_layer.in_features
        self.out_features = original_layer.out_features
        
        # Initialize LoRA matrices A and B
        
        # Initialize A with kaiming_uniform (standard init)
        self.A = torch.nn.Linear(original_layer.in_features, r)
        torch.nn.init.kaiming_uniform_(self.A.weight)
        
        # Initialize B with zeros (important: starts as identity)
        self.B = torch.nn.Linear(original_layer.in_features, r)
        torch.nn.init.zeros_(self.B.weight)
        
        # Freeze original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Implement LoRA forward pass

        # Compute original output
        W =  self.original_layer @ x
        
        # Compute LoRA adaptation: B @ A @ x, scaled by alpha/r
        LoRA_adapt = (self.B @ self.A) @ x * (alpha / r)
        
        return W + LoRA_adapt


    def merge_weights(self):
        """Merge LoRA weights into original layer for deployment."""
        # Optional: Implement weight merging

        assert False, 'Not implemented yet'

## Step 1.2: Helper Functions

These helper functions will apply LoRA to DistilBERT's attention layers.

In [100]:
from tqdm import tqdm
def apply_lora_to_distilbert(
    model: nn.Module,
    r: int = 8,
    alpha: float = 16,
    target_modules: List[str] = None,
) -> List[nn.Parameter]:
    """
    Apply LoRA to specific modules in DistilBERT.

    Args:
        model: DistilBERT model
        r: LoRA rank
        alpha: LoRA alpha
        target_modules: Which attention matrices to adapt (q_lin, k_lin, v_lin, out_lin)

    Returns:
        List of LoRA parameters to optimize
    """
    if target_modules is None:
        target_modules = ["q_lin", "v_lin"]

    lora_params = []

    # Iterate through transformer layers
    for i, layer in tqdm(enumerate(model.distilbert.transformer.layer)):
        attention = layer.attention

        for module_name in target_modules:
            if hasattr(attention, module_name):
                original_layer = getattr(attention, module_name)

                # Replace with LoRA layer
                lora_layer = LoRALayer(original_layer, r=r, alpha=alpha)
                setattr(attention, module_name, lora_layer)

                # Collect LoRA parameters
                lora_params.extend([lora_layer.A.weight, lora_layer.B.weight])

                print(f"Applied LoRA to layer {i} -> {module_name}")

    return lora_params


def count_trainable_parameters(model: nn.Module) -> Tuple[int, int]:
    """
    Count trainable vs total parameters.

    Returns:
        (trainable_params, total_params)
    """
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

In [101]:
# Load base model
lora_model = AutoModelForSequenceClassification.from_pretrained(
    classifier_model_name,
    num_labels=2,
).to(device)

# And look at its parameters
lora_model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## Step 1.3: Apply LoRA and Train

Now let's apply LoRA to a DistilBERT model and fine-tune it.

In [102]:
# Setup LoRA training

# Freeze ALL model parameters first
for param in lora_model.parameters():
    param.requires_grad = False
    
# Apply LoRA to attention layers (q_lin and v_lin)
apply_lora_to_distilbert(lora_model)

# Count trainable parameters
trainable, total = count_trainable_parameters(lora_model)
print("Ratio of parameters trained: ", trainable, total)

6it [00:00, 339.14it/s]

Applied LoRA to layer 0 -> q_lin
Applied LoRA to layer 0 -> v_lin
Applied LoRA to layer 1 -> q_lin
Applied LoRA to layer 1 -> v_lin
Applied LoRA to layer 2 -> q_lin
Applied LoRA to layer 2 -> v_lin
Applied LoRA to layer 3 -> q_lin
Applied LoRA to layer 3 -> v_lin
Applied LoRA to layer 4 -> q_lin
Applied LoRA to layer 4 -> v_lin
Applied LoRA to layer 5 -> q_lin
Applied LoRA to layer 5 -> v_lin
Ratio of parameters trained:  147648 67102658





In [103]:
# Train LoRA model

# Default training configuration (can be reduced for faster experimentation)
num_epochs = 2
logging_steps = 100


print(f"\nTraining LoRA model ({num_epochs} epochs)...")

training_args = TrainingArguments(
    output_dir="./trained-models/lora-distillbert-imdb",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=logging_steps,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to=None,
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# Evaluate
lora_results = trainer.evaluate()
print(f"\nLoRA model accuracy: {lora_results['eval_accuracy']:.4f}")


Training LoRA model (2 epochs)...


TypeError: unsupported operand type(s) for @: 'Linear' and 'Tensor'

## Step 1.4: Compare to Baseline

Let's compare LoRA to full fine-tuning.

In [None]:
# Compare results
print("\n" + "=" * 80)
print("COMPARISON: LoRA vs Full Fine-tuning")
print("=" * 80)

# Full fine-tuning baseline (from distillbert_imdb.py)
baseline_accuracy = classifier_accuracy
print(f"Full fine-tuning: {baseline_accuracy:.4f} accuracy, 66M trainable params")
print(
    f"LoRA (r=8):       {lora_results['eval_accuracy']:.4f} accuracy, "
    f"{trainable:,} trainable params"
)
print(f"\nParameter reduction: {100*(1-trainable/66e6):.2f}%")
print(f"Accuracy drop: {100*(baseline_accuracy - lora_results['eval_accuracy']):.2f}%")

In [None]:
# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Parameters comparison
methods = ["Full FT", "LoRA (r=8)"]
params = [66e6, trainable]
colors = ["steelblue", "coral"]

ax1.bar(methods, params, color=colors)
ax1.set_ylabel("Trainable Parameters")
ax1.set_title("Parameter Efficiency")
ax1.set_yscale("log")
for i, (method, param) in enumerate(zip(methods, params)):
    ax1.text(i, param, f"{param/1e6:.1f}M", ha="center", va="bottom")

# Accuracy comparison
accuracies = [baseline_accuracy, lora_results["eval_accuracy"]]
ax2.bar(methods, accuracies, color=colors)
ax2.set_ylabel("Accuracy")
ax2.set_title("Model Accuracy")
ax2.set_ylim([0.80, 0.95])
for i, (method, acc) in enumerate(zip(methods, accuracies)):
    ax2.text(i, acc, f"{acc:.3f}", ha="center", va="bottom")

plt.tight_layout()
plt.show()

## Exercise 1 Extension: Experiment with Rank

Try different LoRA ranks and see the trade-off between parameters and accuracy.

In [None]:
# Experiment with different ranks

    # Apply LoRA
    
    # Count params
    
    # Train (1 epoch for speed)
    
    # Evaluate
    
# Plot results

assert False, 'Not implemented yet'


---
# Exercise 2: Manual INT8 Quantization

## What is Quantization?

**Quantization** converts high-precision floating-point weights (FP32: 32
bits) to low-precision integers (INT8: 8 bits).

**Symmetric Quantization Formula:**

$$\text{quantized} =
\text{round}\left(\frac{\text{original}}{\text{scale}}\right)$$

where:

$$\text{scale} = \frac{\max(|\text{original}|)}{127}$$

**Dequantization:**

$$\text{dequantized} = \text{quantized} \times \text{scale}$$

## Step 2.1: Implement Quantization Functions

In [None]:
def quantize_tensor(
    tensor: torch.Tensor,
    bits: int = 8,
) -> Tuple[torch.Tensor, float]:
    """
    Quantize tensor to INT8 using symmetric quantization.

    Args:
        tensor: FP32 tensor to quantize
        bits: Number of bits (default: 8)

    Returns:
        quantized: INT8 tensor
        scale: Scale factor for dequantization
    """
    # Implement symmetric INT8 quantization

    # Compute scale factor
    
    # Quantize: divide by scale, round, clamp to INT8 range
    
    assert False, 'Not implemented yet'



def dequantize_tensor(
    quantized: torch.Tensor,
    scale: float,
) -> torch.Tensor:
    """
    Dequantize INT8 tensor back to FP32.

    Args:
        quantized: INT8 tensor
        scale: Scale factor from quantization

    Returns:
        dequantized: FP32 tensor
    """
    # Implement dequantization

    # Convert to float and multiply by scale
    
    assert False, 'Not implemented yet'


## Step 2.2: Test Quantization

Let's test our quantization on a sample tensor.

In [None]:
# Test quantization
test_tensor = torch.randn(1000, 768)
print(f"Original tensor: shape={test_tensor.shape}, dtype={test_tensor.dtype}")
print(f"Original range: [{test_tensor.min():.4f}, {test_tensor.max():.4f}]")
print(f"Original memory: {test_tensor.nbytes / 1e6:.2f} MB")

# Quantize
quantized, scale = quantize_tensor(test_tensor)
print(f"\nQuantized tensor: shape={quantized.shape}, dtype={quantized.dtype}")
print(f"Quantized range: [{quantized.min()}, {quantized.max()}]")
print(f"Quantized memory: {quantized.nbytes / 1e6:.2f} MB")
print(f"Scale factor: {scale:.6f}")
print(f"Memory reduction: {test_tensor.nbytes / quantized.nbytes:.1f}x")

# Dequantize
dequantized = dequantize_tensor(quantized, scale)
print(f"\nDequantized tensor: shape={dequantized.shape}, dtype={dequantized.dtype}")
print(f"Dequantized range: [{dequantized.min():.4f}, {dequantized.max():.4f}]")

# Measure error
mse = ((test_tensor - dequantized) ** 2).mean()
print(f"\nReconstruction MSE: {mse:.6f}")
print(f"Relative error: {(mse / test_tensor.var()).sqrt():.4f}")

## Step 2.3: Visualize Quantization Error

In [None]:
# Visualize weight distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original
axes[0].hist(test_tensor.flatten().numpy(), bins=100, alpha=0.7, color="blue")
axes[0].set_title("Original (FP32)")
axes[0].set_xlabel("Value")
axes[0].set_ylabel("Frequency")

# Quantized
axes[1].hist(quantized.flatten().numpy(), bins=50, alpha=0.7, color="orange")
axes[1].set_title("Quantized (INT8)")
axes[1].set_xlabel("Value")

# Dequantized vs Original
axes[2].scatter(
    test_tensor.flatten().numpy()[::100],
    dequantized.flatten().numpy()[::100],
    alpha=0.3,
    s=1,
)
axes[2].plot([-4, 4], [-4, 4], "r--", label="Perfect reconstruction")
axes[2].set_title("Dequantized vs Original")
axes[2].set_xlabel("Original")
axes[2].set_ylabel("Dequantized")
axes[2].legend()

plt.tight_layout()
plt.show()

## Step 2.4: Implement Quantized Linear Layer

In [None]:
class QuantizedLinear(nn.Module):
    """
    Quantized linear layer for FFN.

    Stores weights in INT8, dequantizes during forward pass.
    This is post-training quantization (PTQ).
    """

    def __init__(self, linear: nn.Linear):
        # Initialize quantized linear layer

        # Quantize weight matrix
        
        # Store as buffer (not parameter - don't train)
        
        # Keep bias as float (small, not worth quantizing)
        
        assert False, 'Not implemented yet'


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass with dequantization

        # Dequantize weights on the fly
        
        # Apply linear transformation
        
        assert False, 'Not implemented yet'


    def get_memory_size(self) -> float:
        """Returns memory size in MB."""
        size = self.weight_quantized.nbytes  # INT8 weights
        if self.bias is not None:
            size += self.bias.nbytes
        return size / 1e6

## Step 2.5: Quantize DistilBERT FFN Layers

In [None]:
def quantize_distilbert_ffn(model: nn.Module) -> nn.Module:
    """
    Quantize all FFN layers (lin1, lin2) in DistilBERT.

    Args:
        model: DistilBERT model

    Returns:
        Model with quantized FFN layers
    """
    for i, layer in enumerate(model.distilbert.transformer.layer):
        ffn = layer.ffn

        # Quantize lin1 (768 -> 3072)
        ffn.lin1 = QuantizedLinear(ffn.lin1)

        # Quantize lin2 (3072 -> 768)
        ffn.lin2 = QuantizedLinear(ffn.lin2)

        print(f"Quantized layer {i} FFN (lin1, lin2)")

    return model


def measure_model_size(model: nn.Module) -> float:
    """
    Measure model size in MB.

    Returns:
        Size in MB
    """
    size = 0
    for param in model.parameters():
        size += param.numel() * param.element_size()
    for buffer in model.buffers():
        size += buffer.numel() * buffer.element_size()
    return size / 1e6

## Step 2.6: Quantize and Evaluate

In [None]:
# Quantize model and analyze results

# Measure original size
# Quantize FFN layers

# Measure quantized size
assert False, 'Not implemented yet'


# Evaluate accuracy
quantized_model.eval()
from transformers import Trainer

trainer_quant = Trainer(
    model=quantized_model,
    args=TrainingArguments(output_dir="./trained-models/tmp", report_to=None),
    compute_metrics=compute_metrics,
)
prediction_output = trainer_quant.predict(eval_dataset)
quantized_accuracy = compute_metrics(
    (prediction_output.predictions, prediction_output.label_ids)
)["accuracy"]

print(f"\nOriginal accuracy: {baseline_accuracy:.4f}")
print(f"Quantized accuracy: {quantized_accuracy:.4f}")
print(f"Accuracy drop: {100*(baseline_accuracy - quantized_accuracy):.2f}%")

In [None]:
# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Memory comparison
methods = ["FP32", "INT8"]
sizes = [size_before, size_after]
colors = ["steelblue", "coral"]

ax1.bar(methods, sizes, color=colors)
ax1.set_ylabel("Model Size (MB)")
ax1.set_title("Memory Efficiency")
for i, (method, size) in enumerate(zip(methods, sizes)):
    ax1.text(i, size, f"{size:.1f} MB", ha="center", va="bottom")

# Accuracy comparison
accuracies = [baseline_accuracy, quantized_accuracy]
ax2.bar(methods, accuracies, color=colors)
ax2.set_ylabel("Accuracy")
ax2.set_title("Model Accuracy")
ax2.set_ylim([0.80, 0.95])
for i, (method, acc) in enumerate(zip(methods, accuracies)):
    ax2.text(i, acc, f"{acc:.3f}", ha="center", va="bottom")

plt.tight_layout()
plt.show()

---
# Exercise 3: HuggingFace PEFT Library

## Introduction to PEFT

The [PEFT (Parameter-Efficient Fine-Tuning) library by
HuggingFace](https://huggingface.co/blog/peft) provides production-ready
implementations of various efficiency methods:

- **LoRA**: Low-rank adaptation (we implemented this in Exercise 1)
- **Adapters**: Bottleneck layers inserted after transformer blocks
- **Prefix Tuning**: Learnable "virtual tokens" prepended to input
- **IA3**: Learned element-wise rescaling of activations

**Benefits:**
- Battle-tested implementations
- Easy to switch between methods
- Modular: One base model, many task-specific adapters
- Active development and community support

In [None]:
from peft import (
    LoraConfig,
    IA3Config,
    get_peft_model,
    PeftModel,
)

## Step 3.2: LoRA with PEFT

In [None]:
# Configure LoRA using PEFT library

# Configure LoRA

# Apply LoRA

assert False, 'Not implemented yet'


In [None]:
# Train PEFT LoRA model
training_args_peft = TrainingArguments(
    output_dir="./trained-models/peft-lora-distillbert-imdb",
    num_train_epochs=num_epochs,  # Use same config as manual LoRA
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=logging_steps,  # Use same config as manual LoRA
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to=None,
)

trainer_peft = Trainer(
    model=peft_lora_model,
    args=training_args_peft,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer_peft.train()
peft_lora_results = trainer_peft.evaluate()
print(f"PEFT LoRA accuracy: {peft_lora_results['eval_accuracy']:.4f}")

## Step 3.3: Compare Other PEFT Methods

In [None]:
# Compare different PEFT methods

# Define configurations

    # Load fresh base model
    
    # Apply PEFT
    
    # Train
    
    # Evaluate
    
# Display results

assert False, 'Not implemented yet'


In [None]:
# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Trainable parameters
methods = df_peft["Method"].tolist()
params = df_peft["Trainable Params"].tolist()
colors = plt.cm.Set3(range(len(methods)))

ax1.barh(methods, params, color=colors)
ax1.set_xlabel("Trainable Parameters")
ax1.set_title("Parameter Efficiency")
ax1.set_xscale("log")
for i, (method, param) in enumerate(zip(methods, params)):
    ax1.text(param, i, f"  {param/1e6:.2f}M", va="center")

# Accuracy
accuracies = df_peft["Accuracy"].tolist()
ax2.barh(methods, accuracies, color=colors)
ax2.set_xlabel("Accuracy")
ax2.set_title("Model Accuracy")
ax2.set_xlim([0.80, 0.95])
for i, (method, acc) in enumerate(zip(methods, accuracies)):
    ax2.text(acc, i, f"  {acc:.3f}", va="center")

plt.tight_layout()
plt.show()

## Step 3.4: Save and Load PEFT Adapters

In [None]:
# Save and load PEFT adapter

# Save adapter (only the adapter weights, not the base model)

# Load adapter back

assert False, 'Not implemented yet'


---
# Project: Performance Benchmarking

## Introduction

Now that we've explored various efficiency techniques, you can
study their effect on:

- **Memory**: Model size (MB)
- **Training Time**: Time per epoch (seconds)
- **Inference Speed**: Samples per second
- **Accuracy**: Test set accuracy

This will help us understand real-world trade-offs and make informed
decisions.