# **Fine-Tuning DistilBERT for Sentiment Analysis**
In this notebook, we fine-tune the `DistilBERT` model on the IMDB movie reviews dataset for binary sentiment classification (positive/negative). This involves preprocessing the data, setting up a PyTorch dataset and dataloader, training the model, and evaluating its performance.

### Importing Libraries
1. **pandas**: For data manipulation.
2. **scikit-learn**: To split the dataset into training and validation sets.
3. **transformers**: For DistilBERT model, tokenizer, and optimizer setup.
4. **torch**: For PyTorch-based model training and evaluation.
5. **tqdm**: To display progress bars during training and evaluation.
6. **sklearn.metrics**: To calculate model accuracy.
7. **kagglehub**: Simplifies downloading and managing Kaggle datasets.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW, get_scheduler
import torch
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import kagglehub


### Dataset Loading and Preprocessing
- **kagglehub**: Downloads the IMDB dataset from Kaggle.
- Converts sentiment labels:
  - `positive` → `1`
  - `negative` → `0`
- The dataset contains 50,000 movie reviews with binary sentiment labels.



In [3]:
# Download dataset using kagglehub
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

# Define the path to the dataset
PATH = "/kaggle/input/imdb-dataset-of-50k-movie-reviews"
data = pd.read_csv(PATH + '/IMDB Dataset.csv')

# Convert sentiment labels to binary
data.loc[data['sentiment'] == 'positive', 'sentiment'] = 1
data.loc[data['sentiment'] == 'negative', 'sentiment'] = 0

### Splitting the Dataset
- **Training Set**: 99% of the data.
- **Validation Set**: 1% of the data.
- The split is reproducible using a fixed `random_state`.


In [4]:
# Split dataset into train and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(
    data["review"], data["sentiment"], test_size=0.01, random_state=1
)


### Custom Dataset Class
- Converts raw text reviews into tokenized inputs for DistilBERT.
- Each sample includes:
  - `input_ids`: Tokenized review.
  - `attention_mask`: Attention mask for padding.
  - `label`: Binary sentiment label (0 or 1).


In [5]:
class IMDBDataset(Dataset):
    """
    Custom PyTorch Dataset class to preprocess and handle IMDB reviews.

    Args:
        reviews (pd.Series): A pandas Series containing the text of the reviews.
        labels (pd.Series): A pandas Series containing the sentiment labels (binary).
        tokenizer (transformers.PreTrainedTokenizer): Tokenizer instance to convert text to token IDs.
        max_length (int): Maximum length for padding/truncation of reviews. Default is 512.
    """

    def __init__(self, reviews, labels, tokenizer, max_length=512):
        # Convert reviews and labels to numpy arrays for indexing
        self.reviews = reviews.values
        self.labels = labels.values
        self.tokenizer = tokenizer  # Tokenizer for text processing
        self.max_length = max_length  # Maximum sequence length for input
        
    def __len__(self):
        """
        Returns the number of samples in the dataset.
        """
        return len(self.reviews)  # Total number of reviews

    def __getitem__(self, idx):
        """
        Retrieves a single sample (review and its corresponding label) at the specified index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            dict: A dictionary containing the following keys:
                - "input_ids" (torch.Tensor): Token IDs of the review.
                - "attention_mask" (torch.Tensor): Mask indicating padding (0) vs actual tokens (1).
                - "label" (torch.Tensor): Sentiment label as a PyTorch tensor.
        """
        # Fetch the review text and its corresponding label
        review = self.reviews[idx]
        label = self.labels[idx]

        # Tokenize the review:
        # - Truncates text to the maximum length if it exceeds it.
        # - Pads text to the maximum length if it's shorter.
        # - Converts text into PyTorch-compatible tensors.
        encoding = self.tokenizer(
            review,                 # Input text to tokenize
            truncation=True,        # Enable truncation to max_length
            padding="max_length",   # Pad to max_length
            max_length=self.max_length,  # Maximum token length
            return_tensors="pt"     # Return PyTorch tensors
        )

        # Return a dictionary with tokenized inputs and the label
        return {
            "input_ids": encoding["input_ids"].squeeze(0),         # Token IDs of the review
            "attention_mask": encoding["attention_mask"].squeeze(0),  # Padding mask
            "label": torch.tensor(label, dtype=torch.long)         # Sentiment label as a tensor
        }


### Tokenizer and DataLoader Setup
- **Tokenizer**: Converts text into token IDs and attention masks.
- **DataLoader**:
  - `train_loader`: Shuffles and batches the training data.
  - `valid_loader`: Batches the validation data without shuffling.
- Batch size: 16 for efficient processing.


In [6]:
# Initialize tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Create dataset objects
train_dataset = IMDBDataset(x_train, y_train, tokenizer)
valid_dataset = IMDBDataset(x_valid, y_valid, tokenizer)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=16)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

### Model Fine-Tuning
- **Model**: `DistilBertForSequenceClassification` with 2 labels.
- **Device**: Runs on GPU (if available) or CPU.
- **Optimizer**: `AdamW` with learning rate `5e-5`.
- **Scheduler**: Linearly adjusts learning rate during training.
- **Training Loop**:
  - Trains the model for 3 epochs.
  - Computes loss, updates weights, and displays progress using `tqdm`.


In [7]:
# Load the pre-trained DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # We have 2 labels: positive (1) and negative (0)
)

# Set the device to CUDA (GPU) if available, otherwise default to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device (GPU or CPU)
model.to(device)

# Set up the optimizer (AdamW) with the model's parameters and a learning rate of 5e-5
optimizer = AdamW(model.parameters(), lr=5e-5)

# Calculate the total number of training steps based on the length of the training data loader and the number of epochs
num_training_steps = len(train_loader) * 3  # 3 epochs, change if more epochs are used

# Set up the learning rate scheduler with linear warmup (zero warmup steps in this case)
lr_scheduler = get_scheduler(
    "linear",  # Linear learning rate decay
    optimizer=optimizer,  # The optimizer to be scheduled
    num_warmup_steps=0,  # No warmup steps
    num_training_steps=num_training_steps  # Total number of training steps (to schedule accordingly)
)

# Start training loop for 3 epochs
epochs = 3  # Number of times to iterate over the entire dataset

# Set the model to training mode
model.train()

# Training loop: Iterate over the number of epochs
for epoch in range(epochs):
    # Create a progress bar for tracking batch progress in the current epoch
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}")
    
    # Iterate over each batch of data in the training set
    for batch in progress_bar:
        # Move the batch data to the selected device (GPU or CPU)
        input_ids = batch["input_ids"].to(device)  # Tokenized inputs
        attention_mask = batch["attention_mask"].to(device)  # Attention mask for padding
        labels = batch["label"].to(device)  # Labels (sentiment: 0 or 1)
        
        # Perform a forward pass through the model
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # Extract the loss value from the model's output
        
        # Zero out the gradients from the previous step
        optimizer.zero_grad()

        # Perform backpropagation to calculate gradients for the current batch
        loss.backward()

        # Update the model parameters based on the gradients
        optimizer.step()

        # Update the progress bar with the current loss value
        progress_bar.set_postfix(loss=loss.item())  # Display loss during training

    # Update the learning rate scheduler at the end of each epoch
    lr_scheduler.step()



model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1: 100%|██████████| 3094/3094 [26:24<00:00,  1.95it/s, loss=0.586]  
Epoch 2: 100%|██████████| 3094/3094 [26:26<00:00,  1.95it/s, loss=0.00869]
Epoch 3: 100%|██████████| 3094/3094 [26:28<00:00,  1.95it/s, loss=0.00213] 


### Model Evaluation
- **Evaluation**:
  - Uses the validation dataset.
  - Computes predictions without gradients (`torch.no_grad()`).
- **Metric**: Calculates accuracy using `accuracy_score` from sklearn.


In [8]:
# Set the model to evaluation mode
model.eval()

# Initialize empty lists to store predictions and true labels for evaluation
predictions, true_labels = [], []

# Disable gradient calculations as we are only performing inference (evaluation)
with torch.no_grad():
    # Iterate over the validation data loader, which provides batches of validation data
    for batch in tqdm(valid_loader, desc="Validating"):
        # Move input data (input_ids, attention_mask) and labels to the selected device (GPU/CPU)
        input_ids = batch["input_ids"].to(device)  # Tokenized input
        attention_mask = batch["attention_mask"].to(device)  # Attention mask to handle padding
        labels = batch["label"].to(device)  # True sentiment labels (0 or 1)

        # Perform a forward pass through the model with the validation inputs
        outputs = model(input_ids, attention_mask=attention_mask)
        
        # Get the predicted class for each input by selecting the class with the highest score
        preds = torch.argmax(outputs.logits, dim=1)

        # Append the predicted labels and true labels to their respective lists
        predictions.extend(preds.cpu().numpy())  # Move predictions to CPU and convert to numpy
        true_labels.extend(labels.cpu().numpy())  # Move true labels to CPU and convert to numpy

# Calculate the accuracy by comparing true labels and predictions
accuracy = accuracy_score(true_labels, predictions)

# Print the validation accuracy
print(f"Validation Accuracy: {accuracy:.4f}")



Validating: 100%|██████████| 32/32 [00:06<00:00,  5.15it/s]

Validation Accuracy: 0.9340





### Saving the Fine-Tuned Model
- The fine-tuned model and tokenizer are saved locally for reuse.
- Saved in the directory: `./fine_tuned_imdb_distilbert`.


In [9]:
# Save the model
model.save_pretrained("./fine_tuned_imdb_distilbert")
tokenizer.save_pretrained("./fine_tuned_imdb_distilbert")

('./fine_tuned_imdb_distilbert/tokenizer_config.json',
 './fine_tuned_imdb_distilbert/special_tokens_map.json',
 './fine_tuned_imdb_distilbert/vocab.txt',
 './fine_tuned_imdb_distilbert/added_tokens.json')