# Fine-Tuning Language Models: From Theory to Practice

This notebook provides a hands-on exploration of fine-tuning pre-trained language models. We'll build on our knowledge of the transformer architecture to adapt these models to specific tasks using efficient fine-tuning techniques.

## Learning Objectives

By the end of this notebook, you'll understand:
- The concept of transfer learning and why fine-tuning works
- Different fine-tuning approaches and when to use each
- Parameter-efficient fine-tuning (PEFT) techniques like LoRA
- How to fine-tune a model for classification tasks
- How to evaluate fine-tuned models effectively
- How to create a simple interface for testing your models
- Common challenges and best practices in fine-tuning

Let's start by setting up our environment with the necessary libraries.

In [4]:
# Install required packages
%pip install transformers datasets peft evaluate accelerate gradio scikit-learn matplotlib pandas seaborn numpy torch bitsandbytes -q

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install scikit-learn evaluate gradio transformers datasets accelerate huggingface_hub sentence-transformers langchain langchain-community matplotlib seaborn




In [None]:
%pip install scikit-learn==1.3.0 evaluate==0.4.0 gradio==3.50.2




In [None]:
# Import necessary libraries
import os
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
from datasets import load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
    pipeline
)
from peft import get_peft_model, LoraConfig, TaskType, PeftModel, PeftConfig
import evaluate
import gradio as gr

# Set a consistent random seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

# Enable plotting in the notebook
%matplotlib inline
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")



Using device: cpu


## 1. Understanding Transfer Learning and Fine-Tuning

### What is Transfer Learning?

Transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second task. In the context of NLP, this means taking a pre-trained language model (trained on a massive corpus of text) and adapting it to a specific task or domain.

### Why Fine-Tuning Works

Pre-trained language models like BERT, RoBERTa, T5, or GPT have learned rich representations of language through self-supervised training on large text corpora. These models have internalized:

1. **Linguistic structure**: Grammar, syntax, and language patterns
2. **Semantic knowledge**: Word meanings, entity relationships, and conceptual understanding
3. **World knowledge**: Facts about the world embedded in the text they were trained on

Fine-tuning leverages this knowledge by making comparatively small adjustments to the pre-trained weights to adapt them to a specific task, rather than training from scratch. As a result, it needs far less labeled data and compute than training a comparable model from scratch.

### Different Fine-Tuning Approaches

1. **Full Fine-Tuning**: Update all parameters of the pre-trained model
   - Pros: Can achieve best performance
   - Cons: Resource-intensive, potential for catastrophic forgetting

2. **Parameter-Efficient Fine-Tuning (PEFT)**: Train only a small number of parameters (a subset of the existing weights or a few newly added ones) while the rest stay frozen
   - Examples: LoRA (Low-Rank Adaptation), Adapters, Prompt Tuning
   - Pros: Memory efficient, faster training, often better generalization with limited data
   - Cons: May not reach full fine-tuning performance in all cases

3. **Prompt-Based Fine-Tuning**: Prepend learnable task-specific prompt tokens to the input and train only those, keeping the model itself frozen
   - Pros: Can be very effective with large models, less prone to overfitting
   - Cons: Requires careful prompt design, works best with very large models

Let's visualize these approaches to better understand them:

In [None]:
# Visualize different fine-tuning approaches
def visualize_fine_tuning_approaches():
    # Create a figure with multiple subplots
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # Colors
    colors = {
        'frozen': 'lightgray',
        'trained': 'skyblue',
        'adapter': 'lightgreen',
        'prompt': 'lightsalmon'
    }
    
    # 1. Full Fine-Tuning
    ax = axes[0]
    layer_count = 12
    for i in range(layer_count):
        rect = plt.Rectangle((0.1, i*0.5), 0.8, 0.4, facecolor=colors['trained'], edgecolor='black')
        ax.add_patch(rect)
        ax.text(0.5, i*0.5 + 0.2, f"Layer {i+1}", ha='center', va='center')
    
    # Add embeddings layer
    rect = plt.Rectangle((0.1, layer_count*0.5), 0.8, 0.4, facecolor=colors['trained'], edgecolor='black')
    ax.add_patch(rect)
    ax.text(0.5, layer_count*0.5 + 0.2, "Embeddings", ha='center', va='center')
    
    # Add training indicator
    ax.text(0.5, -0.5, "All parameters updated", ha='center', va='center', fontweight='bold')
    
    ax.set_xlim(0, 1)
    ax.set_ylim(-1, layer_count*0.5 + 1)
    ax.axis('off')
    ax.set_title("Full Fine-Tuning", fontsize=14)
    
    # 2. Parameter-Efficient Fine-Tuning (LoRA)
    ax = axes[1]
    for i in range(layer_count):
        # Main layer (frozen)
        rect = plt.Rectangle((0.1, i*0.5), 0.6, 0.4, facecolor=colors['frozen'], edgecolor='black')
        ax.add_patch(rect)
        ax.text(0.4, i*0.5 + 0.2, f"Layer {i+1}", ha='center', va='center')
        
        # LoRA adapter (drawn on every third layer to keep the figure readable;
        # in practice LoRA is usually applied to the targeted modules in every layer)
        if i % 3 == 0:
            rect = plt.Rectangle((0.7, i*0.5), 0.2, 0.4, facecolor=colors['adapter'], edgecolor='black')
            ax.add_patch(rect)
            ax.text(0.8, i*0.5 + 0.2, "LoRA", ha='center', va='center', fontsize=8)
            
            # Add connecting line
            ax.plot([0.7, 0.7], [i*0.5, i*0.5 + 0.4], 'k-')
            ax.plot([0.1, 0.7], [i*0.5 + 0.2, i*0.5 + 0.2], 'k-', alpha=0.3)
    
    # Add embeddings layer
    rect = plt.Rectangle((0.1, layer_count*0.5), 0.8, 0.4, facecolor=colors['frozen'], edgecolor='black')
    ax.add_patch(rect)
    ax.text(0.5, layer_count*0.5 + 0.2, "Embeddings", ha='center', va='center')
    
    # Add training indicator
    ax.text(0.5, -0.5, "Only adapters updated", ha='center', va='center', fontweight='bold')
    
    ax.set_xlim(0, 1)
    ax.set_ylim(-1, layer_count*0.5 + 1)
    ax.axis('off')
    ax.set_title("Parameter-Efficient Fine-Tuning (LoRA)", fontsize=14)
    
    # 3. Prompt-Based Fine-Tuning
    ax = axes[2]
    
    # All layers frozen
    for i in range(layer_count):
        rect = plt.Rectangle((0.1, i*0.5), 0.8, 0.4, facecolor=colors['frozen'], edgecolor='black')
        ax.add_patch(rect)
        ax.text(0.5, i*0.5 + 0.2, f"Layer {i+1}", ha='center', va='center')
    
    # Add embeddings layer
    rect = plt.Rectangle((0.1, layer_count*0.5), 0.8, 0.4, facecolor=colors['frozen'], edgecolor='black')
    ax.add_patch(rect)
    ax.text(0.5, layer_count*0.5 + 0.2, "Embeddings", ha='center', va='center')
    
    # Add prompt tokens
    for i in range(5):
        rect = plt.Rectangle((0.1 + i*0.15, layer_count*0.5 + 0.6), 0.1, 0.3, facecolor=colors['prompt'], edgecolor='black')
        ax.add_patch(rect)
        ax.text(0.15 + i*0.15, layer_count*0.5 + 0.75, f"P{i+1}", ha='center', va='center', fontsize=8)
    
    # Add training indicator
    ax.text(0.5, -0.5, "Only prompt tokens updated", ha='center', va='center', fontweight='bold')
    
    ax.set_xlim(0, 1)
    ax.set_ylim(-1, layer_count*0.5 + 2)
    ax.axis('off')
    ax.set_title("Prompt-Based Fine-Tuning", fontsize=14)
    
    plt.suptitle("Different Fine-Tuning Approaches", fontsize=16)
    plt.tight_layout()
    plt.subplots_adjust(top=0.85)
    plt.show()
    
    # Create a legend
    fig, ax = plt.subplots(figsize=(10, 2))
    legend_elements = [
        plt.Rectangle((0, 0), 1, 1, facecolor=colors['trained'], edgecolor='black', label='Trainable Parameters'),
        plt.Rectangle((0, 0), 1, 1, facecolor=colors['frozen'], edgecolor='black', label='Frozen Parameters'),
        plt.Rectangle((0, 0), 1, 1, facecolor=colors['adapter'], edgecolor='black', label='LoRA Adapters'),
        plt.Rectangle((0, 0), 1, 1, facecolor=colors['prompt'], edgecolor='black', label='Learnable Prompt Tokens')
    ]
    ax.legend(handles=legend_elements, loc='center', ncol=4)
    ax.axis('off')
    plt.tight_layout()
    plt.show()

# Call the visualization function
visualize_fine_tuning_approaches()

## 2. Understanding Parameter-Efficient Fine-Tuning (PEFT)

While full fine-tuning updates all parameters of a pre-trained model, PEFT techniques aim to achieve similar performance by updating only a small subset of parameters. This is especially important for large models, where full fine-tuning may be prohibitively expensive or lead to overfitting on small datasets.

### Low-Rank Adaptation (LoRA)

LoRA is one of the most popular PEFT techniques. It works by inserting trainable low-rank matrices into the pre-trained model to capture task-specific information. Let's understand how LoRA works:

1. **The Mathematical Idea**: 
   - In a neural network, weight matrices in each layer typically have full rank
   - LoRA assumes that the weight updates learned during fine-tuning have low intrinsic rank, so they can be captured by a low-rank decomposition
   - Original weight update: ΔW, a full (d × k) matrix
   - LoRA approximation: ΔW ≈ BA, where B has shape (d × r) and A has shape (r × k)
   - r is the "rank" of the adaptation and is typically much smaller than d and k

2. **Implementation**:
   - Freeze the original pre-trained weights W
   - Add a parallel path with low-rank matrices A and B
   - Forward pass becomes: h = Wx + BAx (in practice the BAx term is scaled by α/r, where α is the `lora_alpha` hyperparameter)
   - Only train A and B, leaving W frozen

3. **Benefits**:
   - Dramatically reduces the number of trainable parameters
   - Enables fine-tuning of models that wouldn't fit in GPU memory otherwise
   - Easy to switch between tasks by swapping LoRA matrices
   - Can be applied selectively to specific layers or components
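
The forward pass and the "swap or merge" property described above can be checked numerically. The following is a minimal NumPy sketch, not the PEFT implementation: the toy dimensions, the `alpha` value, and the helper name `lora_forward` are illustrative assumptions, and the α/r scaling and zero initialization of B follow the common LoRA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 48, 4      # output dim, input dim, LoRA rank (toy sizes)
alpha = 8                # scaling hyperparameter (illustrative value)
scale = alpha / r

W = rng.normal(size=(d, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init so BA = 0 at the start

def lora_forward(x, W, A, B, scale):
    """h = Wx + scale * B(Ax); the d x k product BA is never formed explicitly."""
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=k)

# At initialization the adapter is a no-op, so training starts from pre-trained behavior.
h_init = lora_forward(x, W, A, B, scale)

# Random values stand in for a B matrix that has been updated by training.
B_trained = rng.normal(scale=0.01, size=(d, r))
h_adapted = lora_forward(x, W, A, B_trained, scale)

# Merging: fold the adapter into W once, then use a plain matrix multiply.
W_merged = W + scale * (B_trained @ A)
h_merged = W_merged @ x

full_params = W.size
lora_params = A.size + B.size
print(f"Full update: {full_params} parameters, LoRA: {lora_params} parameters")
print("No-op at init:", np.allclose(h_init, W @ x))
print("Merged weights match adapter path:", np.allclose(h_adapted, h_merged))
```

Because the adapter can be folded into W, a merged model adds no extra computation at inference time, while keeping A and B separate makes it cheap to store one small adapter per task.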

Let's visualize how LoRA works in a transformer model:

In [None]:
# Visualize how LoRA works
def visualize_lora():
    # Create a figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))
    
    # First subplot: LoRA concept
    ax1.set_xlim(0, 10)
    ax1.set_ylim(0, 10)
    
    # Original weight matrix W
    rect_W = plt.Rectangle((1, 6), 3, 3, facecolor='lightgray', edgecolor='black')
    ax1.add_patch(rect_W)
    ax1.text(2.5, 7.5, "W\n(Frozen)", ha='center', va='center')
    
    # LoRA matrices: B is tall (d × r) and A is wide (r × k), drawn side by side as the product BA
    rect_B = plt.Rectangle((6, 6.5), 1, 3, facecolor='lightblue', edgecolor='black')
    rect_A = plt.Rectangle((7.2, 8), 2.5, 1, facecolor='lightblue', edgecolor='black')
    ax1.add_patch(rect_A)
    ax1.add_patch(rect_B)
    ax1.text(8.45, 8.5, "A", ha='center', va='center')
    ax1.text(6.5, 8, "B", ha='center', va='center')
    
    # Input and output vectors
    rect_x = plt.Rectangle((1, 3), 1, 3, facecolor='lightyellow', edgecolor='black')
    rect_h = plt.Rectangle((5, 3), 1, 3, facecolor='lightyellow', edgecolor='black')
    ax1.add_patch(rect_x)
    ax1.add_patch(rect_h)
    ax1.text(1.5, 4.5, "x", ha='center', va='center')
    ax1.text(5.5, 4.5, "h", ha='center', va='center')
    
    # Arrows for the original path
    ax1.arrow(2, 4.5, 2.5, 0, head_width=0.2, head_length=0.2, fc='black', ec='black')
    ax1.text(3.25, 4.8, "Wx", ha='center', va='center')
    
    # Arrows for the LoRA path
    ax1.arrow(2, 4, 5, 3.5, head_width=0.2, head_length=0.2, fc='blue', ec='blue', linestyle=':')
    ax1.text(5, 5.2, "BAx", ha='center', va='center', color='blue')
    
    # Final addition
    ax1.plot([4.5, 5], [4.5, 4.5], 'k-')
    ax1.plot([4.75, 4.75], [4.25, 4.75], 'k-')
    
    # Equation
    ax1.text(3, 2, "h = Wx + BAx\nOnly train A and B", ha='center', va='center', fontsize=12, bbox=dict(facecolor='white', alpha=0.7))
    
    ax1.set_title("Low-Rank Adaptation (LoRA) Concept", fontsize=14)
    ax1.axis('off')
    
    # Second subplot: LoRA in a transformer
    ax2.set_xlim(0, 10)
    ax2.set_ylim(0, 10)
    
    # Draw attention block
    rect_attn = plt.Rectangle((1, 6), 8, 3, facecolor='lightgray', edgecolor='black', alpha=0.3)
    ax2.add_patch(rect_attn)
    ax2.text(5, 9.5, "Multi-Head Attention", ha='center', va='center', fontsize=12)
    
    # Draw QKV projections
    q_rect = plt.Rectangle((2, 7), 1.5, 1, facecolor='lightgray', edgecolor='black')
    k_rect = plt.Rectangle((4, 7), 1.5, 1, facecolor='lightgray', edgecolor='black')
    v_rect = plt.Rectangle((6, 7), 1.5, 1, facecolor='lightgray', edgecolor='black')
    ax2.add_patch(q_rect)
    ax2.add_patch(k_rect)
    ax2.add_patch(v_rect)
    ax2.text(2.75, 7.5, "WQ", ha='center', va='center')
    ax2.text(4.75, 7.5, "WK", ha='center', va='center')
    ax2.text(6.75, 7.5, "WV", ha='center', va='center')
    
    # Add LoRA adapters
    q_lora = plt.Rectangle((2, 6.2), 1.5, 0.5, facecolor='lightblue', edgecolor='black')
    k_lora = plt.Rectangle((4, 6.2), 1.5, 0.5, facecolor='lightblue', edgecolor='black')
    v_lora = plt.Rectangle((6, 6.2), 1.5, 0.5, facecolor='lightblue', edgecolor='black')
    ax2.add_patch(q_lora)
    ax2.add_patch(k_lora)
    ax2.add_patch(v_lora)
    ax2.text(2.75, 6.45, "LoRA", ha='center', va='center', fontsize=8)
    ax2.text(4.75, 6.45, "LoRA", ha='center', va='center', fontsize=8)
    ax2.text(6.75, 6.45, "LoRA", ha='center', va='center', fontsize=8)
    
    # Draw FFN block
    rect_ffn = plt.Rectangle((1, 3), 8, 2, facecolor='lightgray', edgecolor='black', alpha=0.3)
    ax2.add_patch(rect_ffn)
    ax2.text(5, 5.2, "Feed-Forward Network", ha='center', va='center', fontsize=12)
    
    # FFN weights
    ffn1_rect = plt.Rectangle((2, 3.5), 2, 1, facecolor='lightgray', edgecolor='black')
    ffn2_rect = plt.Rectangle((6, 3.5), 2, 1, facecolor='lightgray', edgecolor='black')
    ax2.add_patch(ffn1_rect)
    ax2.add_patch(ffn2_rect)
    ax2.text(3, 4, "W1", ha='center', va='center')
    ax2.text(7, 4, "W2", ha='center', va='center')
    
    # FFN LoRA adapters
    ffn1_lora = plt.Rectangle((2, 3), 2, 0.3, facecolor='lightblue', edgecolor='black')
    ffn2_lora = plt.Rectangle((6, 3), 2, 0.3, facecolor='lightblue', edgecolor='black')
    ax2.add_patch(ffn1_lora)
    ax2.add_patch(ffn2_lora)
    ax2.text(3, 3.15, "LoRA", ha='center', va='center', fontsize=8)
    ax2.text(7, 3.15, "LoRA", ha='center', va='center', fontsize=8)
    
    # Arrows
    ax2.arrow(5, 8.5, 0, -0.5, head_width=0.2, head_length=0.2, fc='black', ec='black')
    ax2.arrow(5, 6, 0, -0.5, head_width=0.2, head_length=0.2, fc='black', ec='black')
    ax2.arrow(5, 3, 0, -0.5, head_width=0.2, head_length=0.2, fc='black', ec='black')
    
    # Parameter counts
    ax2.text(5, 1.5, "Trainable: <1% of parameters\nFrozen: >99% of parameters", 
             ha='center', va='center', fontsize=12, bbox=dict(facecolor='white', alpha=0.7))
    
    ax2.set_title("LoRA Applied to Transformer", fontsize=14)
    ax2.axis('off')
    
    plt.suptitle("Understanding Low-Rank Adaptation (LoRA)", fontsize=16)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

# Call the visualization function
visualize_lora()

## 3. Loading and Preparing a Dataset

For our fine-tuning example, we'll use a sentiment analysis task. The goal is to classify text as expressing positive or negative sentiment. We'll use the IMDB movie reviews dataset, which contains 50,000 movie reviews labeled as positive or negative.

Let's load and prepare the dataset:

In [None]:
# Load the IMDB dataset
def load_and_prepare_imdb_dataset(max_samples=None):
    """
    Load and prepare the IMDB dataset for sentiment analysis.
    
    Args:
        max_samples: Maximum number of samples to use (for faster experimentation)
    
    Returns:
        train_dataset, test_dataset: Prepared datasets for training and evaluation
    """
    print("Loading IMDB dataset...")
    
    # Load the dataset
    dataset = load_dataset("imdb")
    
    # If max_samples is provided, reduce the dataset size for faster experimentation
    if max_samples is not None:
        # Ensure we take a balanced sample
        train_pos = dataset["train"].filter(lambda example: example["label"] == 1).select(range(max_samples // 2))
        train_neg = dataset["train"].filter(lambda example: example["label"] == 0).select(range(max_samples // 2))
        train_dataset = concatenate_datasets([train_pos, train_neg])
        train_dataset = train_dataset.shuffle(seed=seed)
        
        test_pos = dataset["test"].filter(lambda example: example["label"] == 1).select(range(max_samples // 10))
        test_neg = dataset["test"].filter(lambda example: example["label"] == 0).select(range(max_samples // 10))
        test_dataset = concatenate_datasets([test_pos, test_neg])
        test_dataset = test_dataset.shuffle(seed=seed)
    else:
        train_dataset = dataset["train"]
        test_dataset = dataset["test"]
    
    # Print some statistics
    print(f"Train dataset size: {len(train_dataset)}")
    print(f"Test dataset size: {len(test_dataset)}")
    
    # Check class balance
    train_labels = [example["label"] for example in train_dataset]
    train_positive = sum(train_labels)
    train_negative = len(train_labels) - train_positive
    
    print(f"Training set class distribution:")
    print(f"  Positive: {train_positive} ({train_positive / len(train_labels) * 100:.1f}%)")
    print(f"  Negative: {train_negative} ({train_negative / len(train_labels) * 100:.1f}%)")
    
    # Sample a few examples
    print("\nSample reviews:")
    for i in range(2):
        sentiment = "Positive" if train_dataset[i]["label"] == 1 else "Negative"
        truncated_text = train_dataset[i]["text"][:200] + "..." if len(train_dataset[i]["text"]) > 200 else train_dataset[i]["text"]
        print(f"\n[{sentiment}] {truncated_text}")
    
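    # Visualize the length distribution of reviews (on at most 1,000 examples,
    # so smaller subsets do not raise an index error)
    sample_size = min(1000, len(train_dataset))
    text_lengths = [len(example["text"].split()) for example in train_dataset.select(range(sample_size))]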
    
    plt.figure(figsize=(10, 5))
    plt.hist(text_lengths, bins=50, alpha=0.7)
    plt.xlabel("Review Length (words)")
    plt.ylabel("Count")
    plt.title("Distribution of Review Lengths")
    plt.axvline(x=np.median(text_lengths), color='red', linestyle='--', label=f"Median: {np.median(text_lengths):.0f} words")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()
    
    return train_dataset, test_dataset

# concatenate_datasets is used inside load_and_prepare_imdb_dataset, so import it before the call below
from datasets import concatenate_datasets

# Load a small subset of the data for faster experimentation
# For a real project, you might want to use the full dataset or a larger subset
train_dataset, test_dataset = load_and_prepare_imdb_dataset(max_samples=1000)

## 4. Setting Up the Model and Tokenizer

Now, let's set up our pre-trained model. We'll use DistilBERT, a smaller and faster version of BERT that retains much of its performance. First, we need to prepare our data by tokenizing the text.

In [None]:
# Define model name
model_name = "distilbert-base-uncased"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    """
    Tokenize the text examples using the loaded tokenizer.
    
    Args:
        examples: Batch of examples from the dataset
        
    Returns:
        Tokenized examples
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Tokenize the datasets
print("Tokenizing datasets...")
train_tokenized = train_dataset.map(tokenize_function, batched=True)
test_tokenized = test_dataset.map(tokenize_function, batched=True)

# Set the format for PyTorch
train_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

print("Dataset preparation complete!")

# Quick look at the tokenized input
print("\nExample of tokenized input:")
sample_ids = train_tokenized[0]["input_ids"][:20].tolist()  # First 20 tokens
print(f"Token IDs: {sample_ids}")
print(f"Decoded: {tokenizer.decode(sample_ids)}")
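
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not how the tokenizer is implemented: the helper `pad_or_truncate`, the token IDs, and the padding ID of 0 are illustrative assumptions, and a real tokenizer also keeps special tokens such as the final separator when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    ids = token_ids[:max_length]        # truncation: drop everything past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)     # number of pad positions to add (0 if truncated)
    return ids + [pad_id] * padding, mask + [0] * padding

short_review = [101, 2023, 3185, 102]   # hypothetical IDs for a 4-token input
long_review = list(range(1, 11))        # hypothetical 10-token input

ids_s, mask_s = pad_or_truncate(short_review, max_length=6)
ids_l, mask_l = pad_or_truncate(long_review, max_length=6)

print(ids_s, mask_s)   # padded to 6 entries, mask ends in zeros
print(ids_l, mask_l)   # cut to 6 entries, mask is all ones
```

Padding every example to 512 tokens is simple but spends compute on pad positions when most reviews are shorter. Padding each batch only to its longest example (for instance with `DataCollatorWithPadding`) is a common alternative.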

## 5. Full Fine-Tuning Approach

Let's start with the traditional full fine-tuning approach, where we update all parameters of the pre-trained model. This will serve as a baseline for comparison with PEFT methods.

First, let's define the evaluation metrics and training arguments:

In [None]:
# Load the metrics once, so they are not reloaded on every evaluation pass
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

# Define the metrics for evaluation
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for the model.
    
    Args:
        eval_pred: Tuple of predictions and labels
        
    Returns:
        Dictionary of metrics
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    
    return {"accuracy": accuracy, "f1": f1}

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results/full_fine_tuning",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",  # Disable wandb or other integrations
)
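
To see what the two metrics in `compute_metrics` measure, the same quantities can be worked out by hand. This is a small pure-Python sketch with made-up logits and labels. The helper names are hypothetical, and it assumes binary F1 with label 1 as the positive class.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)  # true positives
    fp = sum(p == positive and y != positive for p, y in pairs)  # false positives
    fn = sum(p != positive and y == positive for p, y in pairs)  # false negatives
    accuracy = sum(p == y for p, y in pairs) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up model outputs for five reviews: [negative logit, positive logit]
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.5], [-0.3, 0.2], [0.0, 3.0]]
labels = [0, 1, 1, 0, 1]

result = accuracy_and_f1(logits, labels)
print(result)
```

Here the predictions are `[0, 1, 0, 1, 1]`, so three of five are correct (accuracy 0.6), and with two true positives, one false positive, and one false negative, precision and recall are both 2/3, giving an F1 of 2/3.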

In [None]:
# Load the pre-trained model
def load_full_model():
    """
    Load the pre-trained model for full fine-tuning.
    
    Returns:
        Pre-trained model initialized for sequence classification
    """
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,  # Binary classification: positive or negative
    )
    
    # Calculate the number of trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Model: {model_name}")
    print(f"Trainable parameters: {trainable_params:,} ({trainable_params / total_params:.2%} of total)")
    print(f"Total parameters: {total_params:,}")
    
    return model

# Load the model
full_model = load_full_model()

Now, let's set up the trainer and run the full fine-tuning:

In [None]:
# Initialize the Trainer
full_trainer = Trainer(
    model=full_model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

# Train the model
print("Starting full fine-tuning...")
full_trainer.train()

# Evaluate the model
print("\nEvaluating the fully fine-tuned model...")
full_eval_results = full_trainer.evaluate()
print(f"Evaluation results: {full_eval_results}")

# Save the model
full_trainer.save_model("./results/full_fine_tuning/final")
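
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration, not the `EarlyStoppingCallback` source: the function name and the per-epoch losses are hypothetical, and it assumes that a lower evaluation loss is better and that any improvement resets the counter.

```python
def epochs_until_stop(eval_losses, patience=2):
    """Return (number of evaluations run, index of the best evaluation)."""
    best_loss = float("inf")
    best_index = -1
    bad_evals = 0  # consecutive evaluations without improvement
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_index, bad_evals = loss, i, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i + 1, best_index  # stop right after this evaluation
    return len(eval_losses), best_index

losses = [0.52, 0.41, 0.43, 0.45, 0.40]  # hypothetical per-epoch evaluation losses
ran, best = epochs_until_stop(losses, patience=2)
print(f"Stopped after {ran} evaluations; best was evaluation {best + 1}")
```

In this example training stops after the fourth evaluation and never sees the better fifth one, which is the usual trade-off of a small patience. Under this rule, a run with only three epochs and a patience of 2 could stop early only at the third evaluation, so early stopping matters mainly for longer runs.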

## 6. Parameter-Efficient Fine-Tuning with LoRA

Now, let's implement LoRA fine-tuning to see how it compares to full fine-tuning in terms of efficiency and performance.

In [None]:
# Set up training arguments for LoRA
lora_training_args = TrainingArguments(
    output_dir="./results/lora_fine_tuning",
    learning_rate=5e-4,  # Usually higher than full fine-tuning
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,  # May need more epochs
    weight_decay=0.01,
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",
)

# Load and configure the model with LoRA
def load_lora_model():
    """
    Load the pre-trained model and configure it for LoRA fine-tuning.
    
    Returns:
        Model configured with LoRA adapters
    """
    # Load the base model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
    )
    
    # Configure LoRA
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,  # Sequence classification
        r=8,                         # Rank of the update matrices
        lora_alpha=32,               # Alpha parameter for scaling
        lora_dropout=0.1,            # Dropout probability for LoRA layers
        # Which layers to apply LoRA to
        target_modules=["q_lin", "v_lin"],  # For DistilBERT, target query and value projection
    )
    
    # Apply LoRA to the model
    lora_model = get_peft_model(base_model, peft_config)
    
    # Calculate parameter counts
    trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in lora_model.parameters())
    print(f"LoRA Configuration:")
    print(f"  Rank (r): {peft_config.r}")
    print(f"  Alpha: {peft_config.lora_alpha}")
    print(f"  Target modules: {peft_config.target_modules}")
    print(f"\nTrainable parameters: {trainable_params:,} ({trainable_params / total_params:.2%} of total)")
    print(f"Total parameters: {total_params:,}")
    
    # Print PEFT's built-in summary of trainable vs. total parameters
    lora_model.print_trainable_parameters()
    
    return lora_model

# Load the LoRA model
lora_model = load_lora_model()
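
As a rough sanity check on the adapter size, the number of LoRA parameters for this configuration can be estimated by hand. The sketch below is plain arithmetic under stated assumptions: a hidden size of 768 and 6 transformer layers for DistilBERT-base, with square query and value projections. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

hidden_size = 768       # assumed hidden size of DistilBERT-base
num_layers = 6          # assumed number of transformer layers
targets_per_layer = 2   # q_lin and v_lin
r, lora_alpha = 8, 32

per_adapter = lora_param_count(hidden_size, hidden_size, r)
adapter_total = per_adapter * targets_per_layer * num_layers
dense_weight = hidden_size * hidden_size  # weights in one frozen projection matrix

print(f"Per adapted matrix: {per_adapter:,} LoRA vs {dense_weight:,} frozen weights")
print(f"All adapters: {adapter_total:,} parameters")
print(f"Scaling factor lora_alpha / r: {lora_alpha / r}")
```

The count reported by `print_trainable_parameters()` is likely to be larger than this estimate, since for sequence classification the task head is typically kept trainable in addition to the adapters.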

In [None]:
# Initialize the Trainer for LoRA
lora_trainer = Trainer(
    model=lora_model,
    args=lora_training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

# Train the model
print("Starting LoRA fine-tuning...")
lora_trainer.train()

# Evaluate the model
print("\nEvaluating the LoRA fine-tuned model...")
lora_eval_results = lora_trainer.evaluate()
print(f"Evaluation results: {lora_eval_results}")

# Save the model
lora_trainer.save_model("./results/lora_fine_tuning/final")

## 7. Comparing Fine-Tuning Methods

Now, let's compare the results of full fine-tuning and LoRA fine-tuning:

In [None]:
# Compare the two approaches
def compare_fine_tuning_approaches():
    """
    Compare the results of full fine-tuning and LoRA fine-tuning.
    """
    # Create a comparison table
    comparison_data = {
        "Method": ["Full Fine-Tuning", "LoRA Fine-Tuning"],
        "Accuracy": [full_eval_results["eval_accuracy"], lora_eval_results["eval_accuracy"]],
        "F1 Score": [full_eval_results["eval_f1"], lora_eval_results["eval_f1"]],
        "Trainable Parameters": [
            sum(p.numel() for p in full_model.parameters() if p.requires_grad),
            sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
        ],
        "Training Time (per epoch)": [
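            # train_runtime is logged in the end-of-training summary entry, not in log_history[0]
            next(log["train_runtime"] for log in full_trainer.state.log_history if "train_runtime" in log) / full_trainer.state.epoch,
            next(log["train_runtime"] for log in lora_trainer.state.log_history if "train_runtime" in log) / lora_trainer.state.epoch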
        ]
    }
    
    # Create a DataFrame
    comparison_df = pd.DataFrame(comparison_data)
    
    # Format the numeric columns
    comparison_df["Accuracy"] = comparison_df["Accuracy"].map("{:.4f}".format)
    comparison_df["F1 Score"] = comparison_df["F1 Score"].map("{:.4f}".format)
    comparison_df["Trainable Parameters"] = comparison_df["Trainable Parameters"].map("{:,}".format)
    comparison_df["Training Time (per epoch)"] = comparison_df["Training Time (per epoch)"].map("{:.2f} sec".format)
    
    # Set Method as index
    comparison_df.set_index("Method", inplace=True)
    
    # Calculate parameter reduction percentage
    full_params = sum(p.numel() for p in full_model.parameters() if p.requires_grad)
    lora_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
    reduction = (full_params - lora_params) / full_params * 100
    
    # Display the comparison
    print("Fine-Tuning Methods Comparison:")
    display(comparison_df)
    
    print(f"\nParameter Reduction with LoRA: {reduction:.2f}%")
    
    # Visualize the comparison
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Performance comparison
    metrics = ["Accuracy", "F1 Score"]
    full_vals = [float(full_eval_results["eval_accuracy"]), float(full_eval_results["eval_f1"])]
    lora_vals = [float(lora_eval_results["eval_accuracy"]), float(lora_eval_results["eval_f1"])]
    
    x = np.arange(len(metrics))
    width = 0.35
    
    ax1.bar(x - width/2, full_vals, width, label='Full Fine-Tuning')
    ax1.bar(x + width/2, lora_vals, width, label='LoRA Fine-Tuning')
    
    ax1.set_ylabel('Score')
    ax1.set_title('Performance Metrics')
    ax1.set_xticks(x)
    ax1.set_xticklabels(metrics)
    ax1.set_ylim(0, 1)
    ax1.legend()
    
    # Parameter and time comparison (log scale)
    metrics2 = ["Trainable Parameters", "Training Time (sec/epoch)"]
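    # Look up train_runtime in the end-of-training summary entry of each log history
    full_vals2 = [full_params, next(log["train_runtime"] for log in full_trainer.state.log_history if "train_runtime" in log) / full_trainer.state.epoch]
    lora_vals2 = [lora_params, next(log["train_runtime"] for log in lora_trainer.state.log_history if "train_runtime" in log) / lora_trainer.state.epoch]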
    
    ax2.set_yscale('log')
    ax2.bar(x - width/2, full_vals2, width, label='Full Fine-Tuning')
    ax2.bar(x + width/2, lora_vals2, width, label='LoRA Fine-Tuning')
    
    # Add value labels on bars
    for i, v in enumerate(full_vals2):
        if i == 0:  # Parameters
            ax2.text(i - width/2, v * 1.1, f"{v:,}", ha='center', va='bottom', rotation=90, fontsize=8)
        else:  # Time
            ax2.text(i - width/2, v * 1.1, f"{v:.1f}s", ha='center', va='bottom', fontsize=8)
    
    for i, v in enumerate(lora_vals2):
        if i == 0:  # Parameters
            ax2.text(i + width/2, v * 1.1, f"{v:,}", ha='center', va='bottom', rotation=90, fontsize=8)
        else:  # Time
            ax2.text(i + width/2, v * 1.1, f"{v:.1f}s", ha='center', va='bottom', fontsize=8)
    
    ax2.set_ylabel('Value (log scale)')
    ax2.set_title('Resource Usage (lower is better)')
    ax2.set_xticks(x)
    ax2.set_xticklabels(metrics2)
    ax2.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Print conclusions
    print("\nKey Observations:")
    
    # Compare accuracy
    acc_diff = float(lora_eval_results["eval_accuracy"]) - float(full_eval_results["eval_accuracy"])
    if abs(acc_diff) < 0.01:
        print("1. LoRA achieved comparable accuracy to full fine-tuning")
    elif acc_diff > 0:
        print(f"1. LoRA surprisingly achieved higher accuracy (+{acc_diff:.4f}) than full fine-tuning")
    else:
        print(f"1. LoRA had slightly lower accuracy ({acc_diff:.4f}) compared to full fine-tuning")
    
    # Compare parameter count
    print(f"2. LoRA used only {lora_params:,} trainable parameters ({lora_params/full_params:.2%} of full fine-tuning)")
    
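    # Compare training time ("train_runtime" is written to the log when training ends)
    full_time = next(e["train_runtime"] for e in reversed(full_trainer.state.log_history) if "train_runtime" in e)
    lora_time = next(e["train_runtime"] for e in reversed(lora_trainer.state.log_history) if "train_runtime" in e)
    time_diff = (full_time - lora_time) / full_time * 100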
    print(f"3. LoRA training was {abs(time_diff):.1f}% {'faster' if time_diff > 0 else 'slower'} than full fine-tuning")
    
    # Overall assessment
    if acc_diff > -0.03 and time_diff > 10:  # Accuracy within 3 percentage points and training >10% faster
        print("\nConclusion: LoRA offers an excellent trade-off, providing similar performance")
        print("with significantly fewer parameters and faster training time.")
    elif acc_diff > 0:  # If LoRA is actually better
        print("\nConclusion: In this case, LoRA is clearly superior, offering both better performance")
        print("and efficiency. This sometimes happens when full fine-tuning overfits.")
    else:  # If full fine-tuning is notably better
        print("\nConclusion: Full fine-tuning achieved better results, but at a much higher computational cost.")
        print("The trade-off depends on your specific requirements for accuracy vs. efficiency.")

try:
    # Only run if both training processes completed successfully
    from IPython.display import display
    compare_fine_tuning_approaches()
except Exception as e:
    print(f"Could not run comparison due to an error: {e}")
    print("Please ensure both training processes completed successfully.")
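
The comparison above reads the training time from `trainer.state.log_history`. As a minimal sketch of how that lookup can be done defensively, the helper below scans a Trainer-style log (a list of dicts) for the entry that holds `train_runtime`. The helper name `get_train_runtime` and the sample log are hypothetical and only illustrate the shape of the data; they are not taken from an actual run.

```python
def get_train_runtime(log_history):
    """Return the most recent 'train_runtime' value found in a Trainer-style log history."""
    for entry in reversed(log_history):
        if "train_runtime" in entry:
            return entry["train_runtime"]
    raise KeyError("no 'train_runtime' entry found; has training finished?")

# Hypothetical log history, shaped like the list of dicts kept in trainer.state.log_history
sample_log = [
    {"loss": 0.52, "epoch": 1.0, "step": 100},
    {"eval_accuracy": 0.86, "epoch": 1.0, "step": 100},
    {"train_runtime": 240.0, "train_loss": 0.41, "epoch": 3.0, "step": 300},
    {"eval_accuracy": 0.89, "epoch": 3.0, "step": 300},
]

runtime = get_train_runtime(sample_log)
num_epochs = 3  # assumed number of epochs for this made-up log
seconds_per_epoch = runtime / num_epochs
print(f"{seconds_per_epoch:.1f} s/epoch")
```

Searching for the key rather than indexing a fixed position keeps the lookup working whether or not evaluation entries were logged after training.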

## 8. Building an Evaluation Dashboard

Let's create a simple interface to interact with our fine-tuned models and evaluate their predictions. We'll use Gradio to build this dashboard.

In [None]:
# Create a dashboard to compare models
def create_evaluation_dashboard():
    """
    Create an interactive dashboard to compare the fine-tuned models.
    """
    # Load models for inference
    try:
        full_model_path = "./results/full_fine_tuning/final"
        lora_model_path = "./results/lora_fine_tuning/final"
        
        # Create inference pipelines
        full_pipeline = pipeline(
            "text-classification", 
            model=full_model_path, 
            tokenizer=tokenizer,
            device=0 if torch.cuda.is_available() else -1
        )
        
        # For LoRA, we need to load the PEFT model
        peft_config = PeftConfig.from_pretrained(lora_model_path)
        base_model = AutoModelForSequenceClassification.from_pretrained(
            peft_config.base_model_name_or_path, 
            num_labels=2
        )
        lora_model = PeftModel.from_pretrained(base_model, lora_model_path)
        
        # Create a pipeline for LoRA model
        lora_pipeline = pipeline(
            "text-classification", 
            model=lora_model, 
            tokenizer=tokenizer,
            device=0 if torch.cuda.is_available() else -1
        )
        
        # Function to make predictions
        def predict(text, model_type):
            if model_type == "Full Fine-Tuned Model":
                result = full_pipeline(text)[0]
            else:  # LoRA Fine-Tuned Model
                result = lora_pipeline(text)[0]
                
            label = result["label"]
            score = result["score"]
            sentiment = "Positive" if label == "LABEL_1" else "Negative"
            
            # Create the output message
            message = f"Sentiment: {sentiment} (Confidence: {score:.2%})"
            
            return message
        
        # Create the interface
        iface = gr.Interface(
            fn=predict,
            inputs=[
                gr.Textbox(lines=5, placeholder="Enter text to analyze sentiment...", label="Text Input"),
                gr.Radio(["Full Fine-Tuned Model", "LoRA Fine-Tuned Model"], label="Model Type", value="Full Fine-Tuned Model")
            ],
            outputs="text",
            title="Sentiment Analysis Model Comparison",
            description="Compare the predictions of fully fine-tuned and LoRA fine-tuned models.",
            examples=[
                ["This movie was absolutely brilliant! The acting was superb and the plot kept me engaged throughout.", "Full Fine-Tuned Model"],
                ["This movie was absolutely brilliant! The acting was superb and the plot kept me engaged throughout.", "LoRA Fine-Tuned Model"],
                ["What a waste of time. The plot was confusing and the characters were poorly developed.", "Full Fine-Tuned Model"],
                ["What a waste of time. The plot was confusing and the characters were poorly developed.", "LoRA Fine-Tuned Model"],
                ["The movie had its moments, but overall it was just average. Nothing special but not terrible either.", "Full Fine-Tuned Model"],
                ["The movie had its moments, but overall it was just average. Nothing special but not terrible either.", "LoRA Fine-Tuned Model"]
            ]
        )
        
        # Launch the interface
        iface.launch(share=True)
        
    except Exception as e:
        print(f"Failed to create evaluation dashboard: {e}")
        print("Please ensure that both models were trained and saved properly.")
        
        # Create a simpler interface with sample data
        def demo_predict(text, model_type):
            import random
            sentiment = random.choice(["Positive", "Negative"])
            score = random.uniform(0.7, 0.99)
            return f"[DEMO MODE] Sentiment: {sentiment} (Confidence: {score:.2%})"
        
        demo_iface = gr.Interface(
            fn=demo_predict,
            inputs=[
                gr.Textbox(lines=5, placeholder="Enter text to analyze sentiment...", label="Text Input"),
                gr.Radio(["Full Fine-Tuned Model", "LoRA Fine-Tuned Model"], label="Model Type", value="Full Fine-Tuned Model")
            ],
            outputs="text",
            title="Sentiment Analysis Model Comparison (DEMO MODE)",
            description="This is a demonstration with random outputs since the actual models are not available.",
        )
        
        demo_iface.launch(share=True)

# Create the dashboard
create_evaluation_dashboard()
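
The `predict` function above treats `LABEL_1` as positive and everything else as negative. A small sketch of making that mapping explicit is shown below. It assumes the pipeline returns a dict with `label` and `score` keys, as in the code above; the `LABEL_NAMES` table and the `format_prediction` helper are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from default pipeline labels to readable sentiment names
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}

def format_prediction(result, label_names=LABEL_NAMES):
    """Turn a {'label': ..., 'score': ...} dict into a display string."""
    # Fall back to the raw label so an unexpected label stays visible
    # instead of being silently reported as "Negative"
    sentiment = label_names.get(result["label"], result["label"])
    return f"Sentiment: {sentiment} (Confidence: {result['score']:.2%})"

# Example with a hand-written result dict (not real model output)
print(format_prediction({"label": "LABEL_1", "score": 0.9731}))
```

If the model config defines its own `id2label` names, the table would need to use those names instead.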

## 9. Exploring Advanced LoRA Configurations

Let's experiment with different LoRA configurations to understand how they affect performance. LoRA has several hyperparameters that can be tuned:

1. **Rank (r)**: Controls the complexity of the adaptations; higher rank means more capacity but more parameters
2. **Alpha (α)**: Scaling factor for the LoRA update, which is multiplied by α/r; higher alpha gives more weight to the adaptations
3. **Target Modules**: Which layers to apply LoRA to (e.g., attention only, FFN only, or both)
4. **Dropout**: Dropout applied to the LoRA layers as regularization against overfitting
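
To make the rank and target-module trade-off concrete: each adapted weight matrix of shape (d_out, d_in) gains a matrix A of shape (r, d_in) and a matrix B of shape (d_out, r), which adds r * (d_in + d_out) trainable parameters. The sketch below counts only these adapter matrices and ignores any classification head. The 6-layer, 768-dimensional shape is an assumption (it resembles a DistilBERT-sized encoder), and the helper names are hypothetical.

```python
def lora_adapter_params(r, d_in, d_out, modules_per_layer, num_layers):
    """Trainable parameters added by LoRA: A is (r x d_in), B is (d_out x r) per adapted matrix."""
    per_matrix = r * (d_in + d_out)
    return per_matrix * modules_per_layer * num_layers

def lora_scaling(alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return alpha / r

# Assumed dimensions: 6 layers with 768x768 attention projections
for r, alpha, n_modules in [(4, 16, 1), (8, 32, 2), (16, 64, 3)]:
    n = lora_adapter_params(r, 768, 768, n_modules, num_layers=6)
    print(f"r={r:>2}, modules/layer={n_modules}: {n:,} adapter params, scaling={lora_scaling(alpha, r)}")
```

Doubling the rank doubles the adapter size, and so does doubling the number of targeted modules, so the two choices multiply.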

Let's create a function to analyze how these parameters affect model performance:

In [None]:
# Analyze the impact of different LoRA configurations
def analyze_lora_configs():
    """
    Experiment with different LoRA configurations and analyze their impact.
    """
    # Define configurations to test
    configs = [
        {"r": 4, "alpha": 16, "target": ["q_lin"], "description": "Low rank, query only"},
        {"r": 8, "alpha": 32, "target": ["q_lin", "v_lin"], "description": "Medium rank, query and value"},
        {"r": 16, "alpha": 64, "target": ["q_lin", "k_lin", "v_lin"], "description": "High rank, all attention"}
    ]
    
    # Training every configuration would take too long in this notebook, so the
    # values below (including the parameter counts) are illustrative placeholders
    # chosen to mimic common trends. They are not measured results.
    results = [
        {
            "config": configs[0],
            "accuracy": 0.875,
            "f1": 0.872,
            "params": 12_000,
            "train_time": 82
        },
        {
            "config": configs[1],
            "accuracy": 0.888,
            "f1": 0.886,
            "params": 24_000,
            "train_time": 85
        },
        {
            "config": configs[2],
            "accuracy": 0.892,
            "f1": 0.890,
            "params": 72_000,
            "train_time": 91
        }
    ]
    
    # Create a DataFrame for display
    results_df = pd.DataFrame({
        "Configuration": [r["config"]["description"] for r in results],
        "Rank (r)": [r["config"]["r"] for r in results],
        "Alpha": [r["config"]["alpha"] for r in results],
        "Target Modules": [', '.join(r["config"]["target"]) for r in results],
        "Accuracy": [r["accuracy"] for r in results],
        "F1 Score": [r["f1"] for r in results],
        "Parameters": [r["params"] for r in results],
        "Training Time (s)": [r["train_time"] for r in results]
    })
    
    # Display the results
    print("LoRA Configuration Analysis:")
    display(results_df)
    
    # Visualize the results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Accuracy vs Parameters
    sns.scatterplot(x="Parameters", y="Accuracy", size="Rank (r)", hue="Configuration", 
                    data=results_df, ax=ax1)
    ax1.set_title("Accuracy vs Parameters")
    
    # Training Time vs Parameters with Accuracy as color
    sns.scatterplot(x="Parameters", y="Training Time (s)", size="Rank (r)", hue="Accuracy", 
                   data=results_df, ax=ax2, palette="viridis")
    ax2.set_title("Training Time vs Parameters")
    
    plt.tight_layout()
    plt.show()
    
    # Bar chart comparing performance metrics
    fig, ax = plt.subplots(figsize=(10, 5))
    
    ind = np.arange(len(results_df))
    width = 0.35
    
    bar1 = ax.bar(ind - width/2, results_df["Accuracy"], width, label="Accuracy")
    bar2 = ax.bar(ind + width/2, results_df["F1 Score"], width, label="F1 Score")
    
    ax.set_title("Performance Metrics by Configuration")
    ax.set_xticks(ind)
    ax.set_xticklabels(results_df["Configuration"])
    ax.set_ylim(0.85, 0.90)  # Zoom in to see differences
    ax.legend()
    
    plt.tight_layout()
    plt.show()
    
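    # Key observations (these describe the illustrative numbers above, not a real experiment)
    print("\nKey Observations (from the illustrative numbers above):")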
    print("1. Higher rank (r) values generally lead to better performance but more parameters")
    print("2. Including more attention components (Q, K, V) improves results but increases complexity")
    print("3. Training time increases with model complexity, but not dramatically")
    print("4. The medium configuration (r=8, targeting Q and V) offers a good balance")
    print("   between performance and efficiency")
    
    # Recommendations
    print("\nRecommendations for LoRA Configuration:")
    print("1. Start with a moderate rank (r=8) and target query and value projections")
    print("2. If more performance is needed, increase rank before adding more target modules")
    print("3. Scale alpha proportionally with rank (typically alpha = 4*r or 2*r)")
    print("4. For very large models, start with lower ranks (r=4) and fewer target modules")
    print("   to keep memory requirements manageable")

# Run the analysis
from IPython.display import display
analyze_lora_configs()

## 10. Advanced Topic: Fine-Tuning for Text Generation

So far, we've focused on classification tasks. Let's briefly discuss how fine-tuning differs for text generation models (like GPT-2, GPT-J, or Llama).

For text generation, the key differences include:

1. **Model Architecture**: Decoder-only models instead of encoder models
2. **Training Objective**: Next-token prediction rather than classification
3. **Input Formatting**: Structured prompts and completions
4. **Evaluation Metrics**: BLEU, ROUGE, or perplexity instead of accuracy/F1
5. **Generation Parameters**: Need to tune temperature, top-k, top-p, etc.
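
Perplexity, mentioned in item 4, is the exponential of the average negative log-likelihood that the model assigns to the correct next tokens. The sketch below computes it from a list of per-token probabilities. The probabilities are made up for illustration, and the helper name `perplexity` is hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up probabilities a model assigned to each correct next token
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```

Lower perplexity means the model was less "surprised" by the reference text; a model that assigns probability 0.5 to every correct token has a perplexity of 2.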

Here's a conceptual example of how you would fine-tune a generative model with LoRA:

In [None]:
# This is conceptual code - not meant to be run
# We'll just print it to demonstrate the approach

generative_code = '''
# Load a base causal language model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType

# For text generation, use a causal language model like GPT-2
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

model = AutoModelForCausalLM.from_pretrained(model_name)

# Configure LoRA for a text generation model
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # For causal language modeling
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
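    # In GPT-2, "c_attn" is the fused query/key/value projection; "c_proj" matches
    # both the attention output projection and the MLP output projection
    target_modules=["c_attn", "c_proj"],
    fan_in_fan_out=True,  # GPT-2 uses Conv1D layers, which store weights transposed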
)

# Apply LoRA to the model
lora_model = get_peft_model(model, peft_config)

# For text generation tasks, format your data differently
# Instead of labels, you typically have prompt-completion pairs
def tokenize_generation_data(examples):
    # Format: [prompt] [completion]
    texts = [f"{prompt} {completion}" for prompt, completion in zip(examples["prompt"], examples["completion"])]
    return tokenizer(texts, padding="max_length", truncation=True, max_length=512)

# 2e-5 is a conservative learning rate; LoRA runs often use higher values (e.g. 1e-4)
training_args = TrainingArguments(
    output_dir="./results/lora_generation",
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Lower batch size due to memory constraints
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize the Trainer with a different data collator for generation
from transformers import DataCollatorForLanguageModeling

# Use a language modeling data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    data_collator=data_collator,
)

# Run training
trainer.train()

# After training, generate text with the fine-tuned model
prompt = "Write a short story about"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(lora_model.device)

# Generation parameters matter a lot for quality
generation_output = lora_model.generate(
    input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    num_return_sequences=1,
)

generated_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(generated_text)
'''

print(generative_code)
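
The `temperature` and `top_p` arguments passed to `generate` above control how the next token is drawn from the model's output distribution. The following is a simplified NumPy sketch of that idea, not the library's implementation: logits are divided by the temperature, converted to probabilities, and then only the smallest set of tokens whose cumulative probability reaches `top_p` is kept. The function name `sample_top_p` and the toy logits are hypothetical.

```python
import numpy as np

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token id using temperature scaling followed by nucleus (top-p) filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids from most to least likely
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]

    kept_probs = probs[kept] / probs[kept].sum()  # renormalize over the kept tokens
    return int(rng.choice(kept, p=kept_probs))

# Toy logits over a 5-token vocabulary (made up for illustration)
toy_logits = [4.0, 3.5, 1.0, 0.2, -1.0]
rng = np.random.default_rng(0)
samples = [sample_top_p(toy_logits, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Lower temperatures sharpen the distribution and smaller `top_p` values shrink the candidate set, so both push generation toward more predictable text.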

## 11. Key Takeaways and Best Practices

Let's summarize what we've learned about fine-tuning language models:

### Key Takeaways

1. **Transfer Learning Fundamentals**:
   - Pre-trained language models contain rich linguistic and world knowledge
   - Fine-tuning adapts this knowledge to specific tasks with less data
   - Different fine-tuning approaches offer various trade-offs between performance and efficiency

2. **LoRA and PEFT Techniques**:
   - LoRA enables efficient fine-tuning by adding small trainable low-rank matrices while the pre-trained weights stay frozen
   - Reduces the number of trainable parameters, often by 99% or more, while typically maintaining comparable performance
   - Key hyperparameters include rank, alpha, and target modules

3. **Implementation Insights**:
   - Proper data preparation is crucial (tokenization, batching, etc.)
   - Training hyperparameters differ between full fine-tuning and PEFT
   - Evaluation should include multiple metrics and comparison to baselines
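
The LoRA takeaway above can be sketched numerically. In this minimal NumPy example the frozen weight `W` is left untouched and the layer output becomes `W x + (alpha / r) * B (A x)`, where only `A` and `B` would be trained. The dimensions (768, r=8, alpha=32) are assumptions chosen to match the values used elsewhere in this notebook, and the function name `lora_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 32

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so training starts exactly at W

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# B is zero at initialization, so the adapted layer reproduces the original output
print(np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable: {lora_params:,} of {full_params:,} ({lora_params / full_params:.1%})")
```

Because `B` starts at zero, fine-tuning begins from the pre-trained behavior and the low-rank update grows only as training proceeds.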

### Best Practices

1. **Choose the Right Approach**:
   - Use full fine-tuning when you have ample compute and need maximum performance
   - Use LoRA when resources are limited or for multiple specialized models
   - Start with established models and adapt incrementally

2. **Hyperparameter Selection**:
   - Start with standard values (r=8, alpha=32) and adjust based on results
   - Learning rates are typically higher for LoRA than full fine-tuning
   - Include early stopping to prevent overfitting

3. **Data Quality**:
   - Clean, balanced data often matters more than sheer quantity
   - Consider augmentation for small datasets
   - Ensure evaluation data reflects real-world use cases

4. **Practical Workflow**:
   - Start with small experiments to validate approach
   - Build evaluation tools early to compare variations
   - Save models properly with metadata for reproducibility
   - Consider ensemble methods for critical applications
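
The early-stopping advice above boils down to a patience counter on the validation loss. The sketch below shows that logic in plain Python; the function name `early_stopping_epoch` and the loss values are hypothetical and made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls after epoch 3
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.39]
print(early_stopping_epoch(losses, patience=2))
```

With the Hugging Face `Trainer`, the same idea is available through `EarlyStoppingCallback`, which is typically combined with `load_best_model_at_end=True`.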

### Next Steps in Your Learning Journey

1. **More Advanced PEFT Techniques**:
   - Explore adapters, prompt tuning, and other PEFT methods
   - Combine multiple PEFT techniques for optimal results

2. **Scaling to Larger Models**:
   - Adapt these techniques to models like Llama, Falcon, etc.
   - Learn quantization methods (4-bit, 8-bit) for even larger models

3. **Real-World Applications**:
   - Move beyond classification to generation, summarization, etc.
   - Develop complete end-to-end systems with fine-tuned models
   - Integrate with Retrieval-Augmented Generation (RAG)

4. **Advanced Evaluation**:
   - Human evaluation and alignment techniques
   - Specialized metrics for different tasks
   - Robustness testing against adversarial inputs

Congratulations on completing this notebook on fine-tuning language models! You now have the tools and knowledge to apply these techniques to your own projects.

## 12. Exercises

To reinforce your learning, try these exercises on your own:

1. **Experiment with Different Hyperparameters**:
   - Change the learning rate, batch size, or number of epochs
   - Try different rank values for LoRA
   - Target different layers in the model

2. **Try Another Dataset**:
   - Fine-tune on a different classification dataset (e.g., AG News, SST-2)
   - Compare performance across different domains

3. **Implement Advanced Evaluation**:
   - Add confusion matrix visualization
   - Analyze performance on specific subsets of the data
   - Implement cross-validation

4. **Extend to Generation Tasks**:
   - Fine-tune a small generation model (e.g., GPT-2 small)
   - Create a prompt-completion dataset
   - Evaluate with BLEU or ROUGE metrics

5. **Experiment with Other PEFT Methods**:
   - Try adapters instead of LoRA
   - Implement prompt tuning
   - Compare multiple PEFT techniques

6. **Deploy Your Model**:
   - Export the model to ONNX format for optimized inference
   - Create a simple REST API to serve predictions
   - Measure inference latency and throughput

These exercises will help you gain practical experience and deepen your understanding of fine-tuning language models.
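
As a possible starting point for exercise 3, the sketch below builds a confusion matrix in plain Python and derives precision and recall from it. The labels are made up for illustration, and the helper name `confusion_matrix` is hypothetical (scikit-learn provides a function with the same name that could be used instead).

```python
def confusion_matrix(y_true, y_pred, num_classes=2):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for illustration (0 = negative, 1 = positive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp = cm[0]
fn, tp = cm[1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm, f"precision={precision:.2f}", f"recall={recall:.2f}")
```

Plotting `cm` with `sns.heatmap(cm, annot=True)` would give the visualization the exercise asks for.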