# Fine-tuning DistilBERT for Sentiment Analysis

This notebook walks through the process of fine-tuning a DistilBERT model for sentiment analysis using the IMDB movie reviews dataset.

## Why DistilBERT?

DistilBERT is a distilled version of BERT (Bidirectional Encoder Representations from Transformers). According to its authors, it retains 97% of BERT's language understanding capabilities while being 40% smaller and 60% faster, which makes it a practical choice for fine-tuning on modest hardware.

## What this Notebook Covers

1. Data preparation and preprocessing
2. Model configuration and training
3. Evaluation and testing


In [None]:
import torch
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from datasets import load_dataset

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 1. Data Preparation

First, we'll create our dataset class and load the IMDB dataset.

In [None]:
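class IMDBDataset(Dataset):
    """Tokenizes all reviews once and serves them as tensors."""

    def __init__(self, texts, labels, tokenizer, max_length=512):
        # `texts` should be a list of strings and `labels` a list of ints.
        # The whole corpus is tokenized up front: padding=True pads every
        # review to the longest one in `texts`, and truncation caps the
        # length at max_length tokens.
        self.encodings = tokenizer(
            texts,
            truncation=True,
            padding=True,
            max_length=max_length,
            return_tensors='pt'
        )
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        # One example: input_ids and attention_mask plus the label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)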
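
The class above expects plain Python lists of review texts and integer labels. A minimal sketch of loading the data and turning it into those lists might look like the following. The helper names `to_texts_and_labels` and `load_imdb_split`, and the `limit`/`seed` subsampling parameters, are illustrative assumptions rather than part of the original notebook. The sketch also assumes the `imdb` dataset on the Hugging Face Hub exposes `text` and `label` fields, with 0 = negative and 1 = positive.

```python
import random


def to_texts_and_labels(records, limit=None, seed=42):
    """Turn records with 'text' and 'label' keys into two parallel lists.

    `records` can be any iterable of dict-like rows. If `limit` is given,
    a reproducible random subset of that size is kept.
    """
    rows = [(row["text"], int(row["label"])) for row in records]
    if limit is not None and limit < len(rows):
        rows = random.Random(seed).sample(rows, limit)
    texts = [text for text, _ in rows]
    labels = [label for _, label in rows]
    return texts, labels


def load_imdb_split(split="train", limit=None, seed=42):
    """Hypothetical helper: fetch one IMDB split and convert it to lists."""
    # Imported lazily; the first call downloads the dataset and needs network access.
    from datasets import load_dataset

    dataset = load_dataset("imdb", split=split)
    return to_texts_and_labels(dataset, limit=limit, seed=seed)
```

With a tokenizer in hand, the result could be passed on as `IMDBDataset(*load_imdb_split('train', limit=2000), tokenizer)`. Subsampling keeps the up-front tokenization in `__init__` quick while experimenting.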

## 2. Model Configuration

Now we'll set up our DistilBERT model and configure it for fine-tuning.

In [None]:
def initialize_model(model_name='distilbert-base-uncased'):
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    model = DistilBertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2  # Binary head; "neutral" is derived from confidence at inference
    )

    # Note: This model outputs binary predictions (positive/negative).
    # Neutral sentiment is determined at inference time using a confidence
    # threshold: a review is labeled neutral when
    # confidence < CONFIDENCE_THRESHOLD (default 0.50).

    # Freeze the DistilBERT encoder so that only the classification head
    # (pre_classifier and classifier) is updated during training
    for param in model.distilbert.parameters():
        param.requires_grad = False
    
    return model, tokenizer
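
Because the loop above sets `requires_grad = False` on every parameter under `model.distilbert`, only the classification head (the `pre_classifier` and `classifier` layers in this model class) should remain trainable. One quick way to confirm that is to count parameters. The sketch below is an illustrative check; `count_parameters` and `describe_trainable` are hypothetical helper names, not something the notebook defines.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a torch.nn.Module.

    Only relies on `model.parameters()`, `param.numel()` and
    `param.requires_grad`, which PyTorch modules and parameters provide.
    """
    trainable = 0
    total = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total


def describe_trainable(model):
    """Format the trainable share of parameters as a readable string."""
    trainable, total = count_parameters(model)
    share = 100.0 * trainable / total if total else 0.0
    return f"{trainable:,} of {total:,} parameters trainable ({share:.1f}%)"
```

After `model, tokenizer = initialize_model()`, calling `print(describe_trainable(model))` should report a small trainable fraction. If it reports 100%, the freeze did not take effect.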

## Note on Three-Class Classification

The model is trained on binary labels (positive/negative), but our inference pipeline
supports three-class classification (positive/negative/neutral) by applying a confidence threshold:

- If prediction confidence < 0.50: classify as "neutral"
- Otherwise: use the model's binary prediction (positive/negative)

This approach lets us flag reviews with ambiguous sentiment without requiring
three-class training data. One caveat: if confidence is taken as the highest softmax
probability over the two classes, it can never fall below 0.50, so the threshold has to
be set above 0.50 for the "neutral" class to ever be assigned.
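
The thresholding rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's actual inference pipeline: it assumes confidence means the highest softmax probability over the two logits, that label index 0 is negative and 1 is positive, and it uses a hypothetical threshold of 0.70 because a two-class top probability is never below 0.50. The function name `classify_with_neutral` is likewise made up for this example.

```python
import math

# Assumed label order: index 0 = negative, index 1 = positive
LABELS = ("negative", "positive")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_neutral(logits, threshold=0.70):
    """Map two raw logits to (label, confidence), with a neutral fallback.

    The threshold default of 0.70 is an illustrative assumption; the
    notebook's CONFIDENCE_THRESHOLD may differ.
    """
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return "neutral", confidence
    return LABELS[probs.index(confidence)], confidence
```

In practice the two logits would come from the model's output for a single review, for example `outputs.logits[0].tolist()`. Raising the threshold widens the neutral band, at the cost of withholding more positive and negative predictions.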