# Fine-tuning DistilBERT for Sentiment Analysis

This notebook walks through the process of fine-tuning a DistilBERT model for sentiment analysis using the IMDB movie reviews dataset.

## Why DistilBERT?

DistilBERT is a distilled version of BERT (Bidirectional Encoder Representations from Transformers). According to its authors, it retains 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster. This makes it well suited for:
- Production deployments where resource efficiency is crucial
- Quick experimentation and fine-tuning
- Applications requiring real-time inference

## What this Notebook Covers

1. Data preparation and preprocessing
2. Model configuration and optimization
3. Training with performance monitoring
4. Evaluation and metrics visualization
5. Model saving and deployment

## 0. Environment Setup

We'll start by setting up our environment:
- Importing the necessary libraries
- Setting random seeds for reproducibility
- Optimizing memory usage for GPU training (CUDA)


In [None]:
import torch
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from collections import Counter
import gc
import time
from pathlib import Path

# Set random seeds for reproducibility
torch.manual_seed(2025)
np.random.seed(2025)

# Setup device and optimize memory usage
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.set_per_process_memory_fraction(0.8)
    torch.backends.cudnn.benchmark = True

print(f'Using device: {device}')

## 1. Data Preparation

We'll implement our dataset handling with the following features:
- Efficient data loading using Hugging Face `datasets`
- Balanced class distribution using stratified sampling (the IMDb dataset is already balanced but we'll use a subset for demonstration purposes)
- Memory-efficient PyTorch Dataset implementation
- Label distribution visualization
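
To make the sampling arithmetic concrete, the balanced subset and stratified split can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `balanced_subset`, `stratified_split`, and the toy label list are hypothetical stand-ins, and the random selection will not match what pandas or scikit-learn would pick.

```python
import random
from collections import Counter

def balanced_subset(labels, num_samples, seed=2025):
    """Pick num_samples // 2 indices from each of the two classes (0 and 1)."""
    rng = random.Random(seed)
    per_class = num_samples // 2
    picked = []
    for cls in (0, 1):
        cls_indices = [i for i, y in enumerate(labels) if y == cls]
        picked.extend(rng.sample(cls_indices, per_class))
    return picked

def stratified_split(indices, labels, test_size=0.2, seed=2025):
    """Split indices so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    train, val = [], []
    for cls in sorted(set(labels[i] for i in indices)):
        cls_indices = [i for i in indices if labels[i] == cls]
        rng.shuffle(cls_indices)
        n_val = round(len(cls_indices) * test_size)
        val.extend(cls_indices[:n_val])
        train.extend(cls_indices[n_val:])
    return train, val

# Toy stand-in for the 25,000-review training split (12,500 reviews per class)
toy_labels = [0] * 12500 + [1] * 12500

subset = balanced_subset(toy_labels, num_samples=10000)
train_idx, val_idx = stratified_split(subset, toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(f"Train: {len(train_idx)} {dict(train_counts)}")
print(f"Validation: {len(val_idx)} {dict(val_counts)}")
```

With `num_samples=10000` and `test_size=0.2`, this yields 8,000 training and 2,000 validation indices, each split evenly between the two classes.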

In [2]:
def print_label_distribution(data, split_name):
    """Print and visualize the distribution of labels in a dataset"""
    labels = data["label"]
    label_counts = Counter(labels)
    total = len(labels)
    
    print(f"\n{split_name} Label Distribution:")
    counts = []
    names = []
    for label, count in sorted(label_counts.items()):
        label_name = "Positive" if label == 1 else "Negative"
        percentage = (count / total) * 100
        print(f"{label_name}: {count} ({percentage:.2f}%)")
        counts.append(count)
        names.append(label_name)
    
    # Visualize distribution
    plt.figure(figsize=(8, 5))
    plt.bar(names, counts)
    plt.title(f"{split_name} Label Distribution")
    plt.ylabel("Count")
    plt.show()

In [3]:
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors="pt"
        )
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [4]:
def prepare_data(num_samples=None, test_size=0.2, random_state=2025):
    """Load and prepare IMDB dataset with visualization"""
    print(f"Loading IMDB dataset{f' (using {num_samples} samples)' if num_samples else ''}")
    
    dataset = load_dataset("imdb", split="train")
    df = pd.DataFrame({"text": dataset["text"], "label": dataset["label"]})
    
    if num_samples:
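        # Draw the same number of reviews from each class; GroupBy.sample keeps
        # the "label" column, which the stratified split below depends on
        df = df.groupby("label")\
               .sample(n=num_samples // 2, random_state=random_state)\
               .reset_index(drop=True)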
    
    train_data, val_data = train_test_split(
        df, test_size=test_size, random_state=random_state, stratify=df["label"]
    )
    
    # Convert to dictionary format
    train_dict = {
        "text": train_data["text"].tolist(),
        "label": train_data["label"].tolist(),
    }
    val_dict = {
        "text": val_data["text"].tolist(),
        "label": val_data["label"].tolist(),
    }
    
    # Print and plot distributions
    print_label_distribution(train_dict, "Training")
    print_label_distribution(val_dict, "Validation")
    
    print(f"\nTrain size: {len(train_dict['text'])}, Validation size: {len(val_dict['text'])}")
    return train_dict, val_dict

Let's prepare our data with a smaller sample size for quick experimentation.

In [None]:
# Prepare data with 10000 samples for testing (use 25000 for full training)
train_data, val_data = prepare_data(num_samples=10000)

# Display a sample review
print("\nSample review:")
print(f"Text: {train_data['text'][0][:200]}...")
print(f"Label: {'Positive' if train_data['label'][0] == 1 else 'Negative'}")

## 2. Model Configuration

Now we'll set up our DistilBERT model with optimizations for fine-tuning:
- Frozen base layers to prevent catastrophic forgetting and reduce training time
- Memory optimizations for efficient training
- Gradient checkpointing for larger batch sizes
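
As a rough check on how much of the model stays trainable once the base is frozen, the size of the classification head can be worked out by hand. The sketch below assumes the head consists of two fully connected layers (a `pre_classifier` mapping the hidden state to itself and a `classifier` mapping it to the labels) and that the hidden width is 768; both are assumptions about `distilbert-base-uncased`, and `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

hidden_size = 768  # assumed hidden width of distilbert-base-uncased
num_labels = 2     # binary sentiment: negative / positive

pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
trainable = pre_classifier + classifier

print(f"pre_classifier: {pre_classifier:,}")
print(f"classifier: {classifier:,}")
print(f"Trainable head parameters: {trainable:,}")
```

Under these assumptions only about 0.6 million parameters receive gradient updates, which is why freezing the base shortens training so much.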

In [6]:
def initialize_model(model_name="distilbert-base-uncased", freeze_base_model=True):
    """Initialize and configure the model for fine-tuning"""
    # Load tokenizer and model
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
    
    if freeze_base_model:
        # Freeze base model layers
        for param in model.distilbert.parameters():
            param.requires_grad = False
            
        # Keep classification layers trainable
        for param in model.pre_classifier.parameters():
            param.requires_grad = True
        for param in model.classifier.parameters():
            param.requires_grad = True
    
    # Print parameter statistics
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params:,} / {total_params:,}")
    
    # Enable memory optimization
    model.config.use_cache = False
    model.gradient_checkpointing_enable()
    
    return model, tokenizer

## 3. Training and Evaluation Functions

We'll implement an evaluation function that reports loss, accuracy, and macro-averaged and per-class precision, recall, and F1, plus a helper to print the results. These functions are taken directly from our training scripts with minor adaptations for interactive notebook use.
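
For readers unfamiliar with macro averaging, here is a small standard-library sketch of what the macro figures represent: precision, recall, and F1 are computed per class and then averaged with equal weight. The helper names and the toy labels are hypothetical; the notebook itself relies on scikit-learn's `classification_report`.

```python
def per_class_scores(y_true, y_pred, cls):
    """Precision, recall, and F1 when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_average(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class precision, recall, and F1."""
    scores = [per_class_scores(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[k] for s in scores) / len(scores) for k in range(3))

# Toy example: 4 negative (0) and 4 positive (1) reviews
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

macro_precision, macro_recall, macro_f1 = macro_average(y_true, y_pred)
print(f"Precision (macro): {macro_precision:.4f}")
print(f"Recall (macro): {macro_recall:.4f}")
print(f"F1 Score (macro): {macro_f1:.4f}")
```

Because each class counts equally, macro averages would expose a model that performs well on one class only, even on a balanced dataset like this one.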

In [7]:
def evaluate_model(model, dataloader, device):
    """
    Evaluate the model with multiple metrics
    
    Returns:
        dict: Dictionary containing various metrics
    """
    model.eval()
    all_predictions = []
    all_labels = []
    total_loss = 0

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)

            total_loss += outputs.loss.item()
            predictions = torch.argmax(outputs.logits, dim=-1)

            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(batch["labels"].cpu().numpy())

    # Calculate metrics
    avg_loss = total_loss / len(dataloader)
    report = classification_report(
        all_labels,
        all_predictions,
        target_names=["Negative", "Positive"],
        output_dict=True,
    )

    return {
        "loss": avg_loss,
        "accuracy": report["accuracy"],
        "precision": report["macro avg"]["precision"],
        "recall": report["macro avg"]["recall"],
        "f1": report["macro avg"]["f1-score"],
        "class_metrics": {
            "negative": {
                "precision": report["Negative"]["precision"],
                "recall": report["Negative"]["recall"],
                "f1": report["Negative"]["f1-score"],
            },
            "positive": {
                "precision": report["Positive"]["precision"],
                "recall": report["Positive"]["recall"],
                "f1": report["Positive"]["f1-score"],
            },
        },
    }

def print_metrics(metrics, split="Validation"):
    """Pretty print the metrics"""
    print(f"\n{split} Metrics:")
    print(f"Loss: {metrics['loss']:.4f}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Precision (macro): {metrics['precision']:.4f}")
    print(f"Recall (macro): {metrics['recall']:.4f}")
    print(f"F1 Score (macro): {metrics['f1']:.4f}")

    print("\nPer-Class Metrics:")
    for class_name, class_metrics in metrics["class_metrics"].items():
        print(f"\n{class_name.capitalize()}:")
        print(f"  Precision: {class_metrics['precision']:.4f}")
        print(f"  Recall: {class_metrics['recall']:.4f}")
        print(f"  F1 Score: {class_metrics['f1']:.4f}")

## 4. Training Loop

Now we'll implement our training loop with settings chosen for training speed and memory efficiency (my laptop GPU is not very powerful):
- Learning rate scheduling
- Gradient accumulation
- Memory optimization
- Progress tracking

Note: This implementation is adapted from the training script (train.py) with modifications for interactive use in the notebook. The core functionality remains the same.
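
To see how gradient accumulation and the linear schedule interact, the step counts can be worked out with plain arithmetic. The sketch below assumes the settings used elsewhere in this notebook (8,000 training reviews, batch size 16, accumulation over 4 batches, 3 epochs, no warmup); `linear_decay_factor` is a hypothetical helper that mirrors a linear warmup-then-decay schedule, not the library function itself.

```python
def linear_decay_factor(step, total_steps, warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

num_train_samples = 8000  # assumed: 80% of the 10,000-review subset
batch_size = 16
accumulation_steps = 4
num_epochs = 3
base_lr = 2e-5

batches_per_epoch = -(-num_train_samples // batch_size)  # ceiling division
steps_per_epoch = batches_per_epoch // accumulation_steps
total_steps = steps_per_epoch * num_epochs
effective_batch_size = batch_size * accumulation_steps

print(f"Batches per epoch: {batches_per_epoch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Effective batch size: {effective_batch_size}")

# Learning rate at the start of each epoch and at the end of training
for step in range(0, total_steps + 1, steps_per_epoch):
    lr = base_lr * linear_decay_factor(step, total_steps)
    print(f"Step {step}: lr = {lr:.2e}")
```

With these numbers there are 500 batches but only 125 optimizer steps per epoch (375 in total), each step averaging gradients over 64 reviews. The scheduler must be sized in optimizer steps rather than batches, otherwise the learning rate would decay only a quarter of the way to zero.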

In [23]:
def train_model(
    model,
    train_loader,
    val_loader,
    num_epochs=3,
    learning_rate=2e-5,
    output_dir="/root/freelance-labs/movie_review_service/app/ml/models/best_model"
):
    """Train the model with comprehensive logging"""
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=learning_rate)
    total_steps = (len(train_loader) // 4) * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    
    # Create output directory
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Training loop
    best_accuracy = 0
    start_time = time.time()
    
    metrics_history = {
        'train_loss': [], 'val_loss': [],
        'train_accuracy': [], 'val_accuracy': [],
        'train_f1': [], 'val_f1': []
    }
    
    print(f"\nStarting training for {num_epochs} epochs...")
    
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch + 1}/{num_epochs}")
        epoch_start = time.time()
        
        # Training
        model.train()
        total_loss = 0
        optimizer.zero_grad()
        
        for i, batch in enumerate(tqdm(train_loader, desc="Training")):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss / 4  # gradient accumulation
            loss.backward()
            
            if (i + 1) % 4 == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
            
            total_loss += loss.item() * 4
            
            # Clear cache periodically
            if i % 100 == 0:
                torch.cuda.empty_cache()
        
        # Evaluate
        train_metrics = evaluate_model(model, train_loader, device)
        val_metrics = evaluate_model(model, val_loader, device)
        
        # Store metrics
        metrics_history['train_loss'].append(train_metrics['loss'])
        metrics_history['val_loss'].append(val_metrics['loss'])
        metrics_history['train_accuracy'].append(train_metrics['accuracy'])
        metrics_history['val_accuracy'].append(val_metrics['accuracy'])
        metrics_history['train_f1'].append(train_metrics['f1'])
        metrics_history['val_f1'].append(val_metrics['f1'])
        
        # Print metrics
        print_metrics(train_metrics, "Training")
        print_metrics(val_metrics, "Validation")
        
        # Save best model
        if val_metrics["accuracy"] > best_accuracy:
            best_accuracy = val_metrics["accuracy"]
            print(f"\nNew best accuracy: {best_accuracy:.4f}! Saving model...")
            model.save_pretrained(output_dir)
            # Relies on the notebook-level `tokenizer` created before train_model is called
            tokenizer.save_pretrained(output_dir)
        
        epoch_time = time.time() - epoch_start
        print(f"\nEpoch time: {epoch_time:.2f} seconds")
        
        # Clear cache between epochs
        torch.cuda.empty_cache()
        gc.collect()
    
    total_time = time.time() - start_time
    print(f"\nTraining completed in {total_time:.2f} seconds")
    print(f"Best accuracy: {best_accuracy:.4f}")
    
    return metrics_history

## 5. Train the Model

Let's train our model and visualize the results. Note that we're using a smaller dataset for demonstration purposes. For production use, you should use the full dataset (25,000 samples).

In [None]:
# Initialize model and tokenizer
model, tokenizer = initialize_model()
model = model.to(device)

# Create datasets
train_dataset = IMDBDataset(
    texts=train_data["text"],
    labels=train_data["label"],
    tokenizer=tokenizer
)

val_dataset = IMDBDataset(
    texts=val_data["text"],
    labels=val_data["label"],
    tokenizer=tokenizer
)

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True
)

val_loader = DataLoader(
    val_dataset,
    batch_size=16,
    shuffle=False
)

# Train the model
metrics_history = train_model(model, train_loader, val_loader, num_epochs=3)

# Plot training progress
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(metrics_history['train_loss'], label='Training')
plt.plot(metrics_history['val_loss'], label='Validation')
plt.title('Loss')
plt.legend()

plt.subplot(1, 3, 2)
plt.plot(metrics_history['train_accuracy'], label='Training')
plt.plot(metrics_history['val_accuracy'], label='Validation')
plt.title('Accuracy')
plt.legend()

plt.subplot(1, 3, 3)
plt.plot(metrics_history['train_f1'], label='Training')
plt.plot(metrics_history['val_f1'], label='Validation')
plt.title('F1 Score')
plt.legend()

plt.tight_layout()
plt.show()

## 6. Test the Model

We would probably get better results with a larger dataset and more epochs, but this is just a demo. Let's test our model on some sample reviews.

## Note on Three-Class Classification

While this model is trained on binary labels (positive/negative), our inference pipeline supports three-class classification (positive/negative/neutral) by applying a confidence threshold:

- If prediction confidence < 0.55: Classify as "neutral"
- Otherwise: Use model's binary prediction (positive/negative)

This approach lets us flag reviews with ambiguous sentiment without requiring three-class training data. The threshold is set to 0.55 because the confidence is the larger of two softmax probabilities and therefore can never fall below 0.50, so a threshold at or below 0.50 would never produce a "neutral" result. Confidence scores would presumably be higher with a larger dataset and more epochs.
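
The thresholding rule can be tried out without loading the model. The following sketch applies it to hand-picked logit pairs using only the standard library; `softmax`, `classify`, and the example logits are hypothetical and are not taken from the fine-tuned model.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, confidence_threshold=0.55):
    """Return (sentiment, confidence) for [negative_logit, positive_logit]."""
    probabilities = softmax(logits)
    confidence = max(probabilities)
    prediction = probabilities.index(confidence)
    if confidence < confidence_threshold:
        return "Neutral", confidence
    return ("Positive" if prediction == 1 else "Negative"), confidence

# Made-up logits: nearly tied, clearly positive, clearly negative
for logits in ([0.05, -0.05], [-1.2, 1.5], [2.0, -0.3]):
    sentiment, confidence = classify(logits)
    print(f"{logits} -> {sentiment} (Confidence: {confidence:.2%})")
```

For two classes, a confidence below 0.55 corresponds to the two logits differing by less than about 0.2, so only near-ties are labelled neutral.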

In [None]:
def predict_sentiment(text, model, tokenizer, confidence_threshold=0.55):
    """Make a prediction with confidence score"""
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.softmax(outputs.logits, dim=1)
        confidence, prediction = torch.max(probabilities, dim=1)
        
    confidence = confidence.item()
    if confidence < confidence_threshold:
        sentiment = "Neutral"
    else:
        sentiment = "Positive" if prediction.item() == 1 else "Negative"
    
    return sentiment, confidence

# Test some sample reviews
sample_reviews = [
    "Best movie I've seen in a long time!",
    "I was really disappointed with this film. The story was confusing and the characters were poorly developed.",
    "While not perfect, the movie had its moments. Some scenes were great while others fell flat."
]

print("Sample Predictions:\n")
for review in sample_reviews:
    sentiment, confidence = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review}")
    print(f"Prediction: {sentiment} (Confidence: {confidence:.2%})\n")

### Key Differences from Production Script

1. The notebook version omits Hugging Face Hub model pushing (which requires authentication)
2. We've added interactive visualizations not present in the production script
3. The training loop includes additional metrics tracking for visualization
4. Memory management is adapted for interactive use

The core model architecture, training process, and inference logic remain identical to the production implementation.