# **Fine-Tuning DistilBERT for Sentiment Analysis on IMDB Reviews**

In this code segment,we are going to demonstrate how to fine-tune a pre-trained DistilBERT model for sentiment analysis. We will work with the IMDB movie reviews dataset, process and tokenize the data, train the model on a balanced dataset, evaluate it, and finally make predictions on new reviews. We will also utilize GPU acceleration in Google Colab for faster training and inference.

**Setup & Install Dependencies**

The first step is to install all the necessary libraries. We need Hugging Face Transformers for the pre-trained DistilBERT model, Datasets to load and manage the IMDB data, and Evaluate to compute metrics like accuracy and F1 score. Installing these ensures that all the tools we need for training, evaluating, and making predictions are available in our Colab environment.

In [1]:
# Install Hugging Face Transformers, Datasets, and Evaluate
!pip install --quiet transformers datasets evaluate


[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/84.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25h

**Import Libraries**

Next, we import the required Python libraries. Torch is used for tensor operations and GPU computations. The DistilBERT tokenizer converts text into numerical tokens, while the DistilBERT model itself is used for sequence classification. Trainer and TrainingArguments simplify the training process, and the datasets library allows us to load and manipulate our data easily. We also import metrics and the softmax function to evaluate and interpret the model’s predictions.

In [2]:
import torch
import numpy as np
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from torch.nn.functional import softmax


**Check GPU Availability**

Here, we check if a GPU is available in Colab. Using a GPU significantly speeds up both training and inference because it can handle large matrix operations much faster than a CPU. If a GPU is not available, the code will default to using the CPU, which will still work but will be slower.

In [3]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Using device:", device)


Using device: cuda


**Load Dataset**

We then load the IMDB dataset, which contains 50,000 movie reviews labeled as positive or negative. To ensure the model does not become biased towards one class, we create a balanced training set by selecting an equal number of positive and negative reviews. Shuffling the dataset ensures that the model sees the data in a random order during training, which helps with generalization.

In [4]:
# Load IMDB dataset
dataset = load_dataset("imdb")

# Separate positive and negative reviews for balancing
train_full = dataset["train"]

positive_samples = [x for x in train_full if x["label"] == 1][:4000]
negative_samples = [x for x in train_full if x["label"] == 0][:4000]

# Combine and shuffle
balanced_train = Dataset.from_list(positive_samples + negative_samples).shuffle(seed=42)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

plain_text/unsupervised-00000-of-00001.p(…):   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [17]:
# Display 5 sample reviews from the training dataset
for i, sample in enumerate(dataset["train"].select(range(5))):
    review_text = sample["text"]
    label = "Positive 😀" if sample["label"] == 1 else "Negative 😠"
    print(f"Sample {i+1}:")
    print(f"Review: {review_text}")
    print(f"Label: {label}\n")


Sample 1:
Review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few an

**Tokenize Dataset**

Tokenization converts text data into numerical representations that the model can understand. Here, we use the DistilBERT tokenizer to truncate sequences longer than 256 tokens and pad shorter sequences to a uniform length. We apply this to both the full dataset for evaluation and the balanced training set. Setting the format to 'torch' prepares the dataset to be compatible with PyTorch tensors, which the model requires.

In [5]:
# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Tokenization function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

# Tokenize full dataset for evaluation
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Tokenize balanced training dataset
balanced_encoded = balanced_train.map(preprocess_function, batched=True)

# Set format for PyTorch
balanced_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
encoded_dataset["test"].set_format("torch", columns=["input_ids", "attention_mask", "label"])


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

**Define Metrics Function**

Before training, we define a function to compute evaluation metrics. This function calculates accuracy, precision, recall, and F1 score by comparing the model’s predicted labels to the true labels. These metrics help us understand how well the model is performing on both positive and negative reviews.

In [6]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}


**Load Model**

Next, we load the pre-trained DistilBERT model for sequence classification. We specify two output labels because we are performing binary classification. Moving the model to the device ensures that all computations happen on the GPU if available. Some weights in the classifier head are randomly initialized because this layer is task-specific and needs to be trained on our sentiment dataset.

In [7]:
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
).to(device)  # move model to GPU if available


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Define Training Arguments**

We then define the training arguments, which control how the model is trained. This includes specifying the output directory for checkpoints, evaluation strategy, learning rate, batch sizes, number of epochs, and weight decay to prevent overfitting. We also enable logging and instruct the trainer to load the best model at the end of training. These settings are important for ensuring efficient and effective training.

In [8]:
training_args = TrainingArguments(
    output_dir="./distilbert-imdb",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    report_to="none"
)


**Initialize Trainer**

The Trainer class simplifies the training loop. Here, we provide the model, training arguments, the balanced training dataset, a subset of the test set for evaluation, the tokenizer, and our metrics function. The Trainer handles forward and backward passes, optimization, evaluation, and logging automatically, so we don’t have to write the training loop manually.

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=balanced_encoded,
    eval_dataset=encoded_dataset["test"].select(range(2000)),  # subset for quick evaluation
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


  trainer = Trainer(


**Train and Evaluate**

Now we start the training process. The trainer fine-tunes DistilBERT on our balanced dataset. After training, the model is evaluated on the test set, and metrics like accuracy, precision, recall, and F1 score are computed. This step allows us to see how well the model has learned to classify positive and negative reviews.

In [10]:
# Train the model
trainer.train()

# Evaluate
trainer.evaluate()


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3337,0.308887,0.874,0.0,0.0,0.0
2,0.1773,0.336152,0.89,0.0,0.0,0.0
3,0.1018,0.305132,0.9115,0.0,0.0,0.0


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 0.30513155460357666,
 'eval_accuracy': 0.9115,
 'eval_f1': 0.0,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_runtime': 13.8246,
 'eval_samples_per_second': 144.669,
 'eval_steps_per_second': 9.042,
 'epoch': 3.0}

**Make Predictions on New Reviews**

Finally, we demonstrate how to use the trained model to predict the sentiment of new movie reviews. We tokenize the input text, feed it through the model, and apply a softmax function to convert the outputs into probabilities. By comparing the positive and negative probabilities, we determine the predicted sentiment and print it along with the review text. This shows how the model can be used in real-world applications.

In [11]:
# Example reviews
texts = [
    "I absolutely loved this movie! The story was heartwarming and the acting was fantastic.",
    "This was one of the worst films I've ever seen. Total waste of time.",
    "The cinematography was beautiful, but the plot was confusing and slow.",
    "An outstanding performance by the lead actor — I was hooked from start to finish!"
]

# Tokenize and move to GPU
inputs = tokenizer(texts, truncation=True, padding=True, return_tensors="pt").to(device)

# Get logits from model
outputs = model(**inputs)

# Convert logits to probabilities
probs = softmax(outputs.logits, dim=1)

# Print predictions
for t, p in zip(texts, probs):
    sentiment = "Positive 😀" if p[1] > p[0] else "Negative 😠"
    print(f"Review: {t}\nPositive: {p[1]:.4f}, Negative: {p[0]:.4f} → {sentiment}\n")


Review: I absolutely loved this movie! The story was heartwarming and the acting was fantastic.
Positive: 0.9944, Negative: 0.0056 → Positive 😀

Review: This was one of the worst films I've ever seen. Total waste of time.
Positive: 0.0038, Negative: 0.9962 → Negative 😠

Review: The cinematography was beautiful, but the plot was confusing and slow.
Positive: 0.0045, Negative: 0.9955 → Negative 😠

Review: An outstanding performance by the lead actor — I was hooked from start to finish!
Positive: 0.9914, Negative: 0.0086 → Positive 😀



In [22]:

# Get user input for a review
user_review = input("Enter a movie review: ")

# Tokenize the user input and move it to the same device as the model
user_inputs = tokenizer([user_review], padding=True, truncation=True, return_tensors="pt").to(device)

# Get the model's output (logits)
user_outputs = model(**user_inputs)

# Apply softmax to get probabilities
user_probs = softmax(user_outputs.logits, dim=1)

# Determine sentiment
user_sentiment = "Positive 😀" if user_probs[0][1] > user_probs[0][0] else "Negative 😠"

# Print result
print(f"\nReview: {user_review}")
print(f"Positive: {user_probs[0][1]:.4f}, Negative: {user_probs[0][0]:.4f} → {user_sentiment}\n")


Enter a movie review: I was bored from start to finish. Don't waste your time.

Review: I was bored from start to finish. Don't waste your time.
Positive: 0.0106, Negative: 0.9894 → Negative 😠



In [14]:
example_reviews = [
    "This movie was absolutely amazing! I loved every minute of it.",
    "The plot was confusing and the acting was terrible.",
    "A heartwarming story with great performances.",
    "I was bored from start to finish. Don't waste your time."
]

print("Here are some example reviews to test:")
for review in example_reviews:
    print(f"- {review}")

Here are some example reviews to test:
- This movie was absolutely amazing! I loved every minute of it.
- The plot was confusing and the acting was terrible.
- A heartwarming story with great performances.
- I was bored from start to finish. Don't waste your time.


We successfully demonstrated how to fine-tune a pre-trained DistilBERT model for sentiment analysis on the IMDB movie reviews dataset. We prepared and tokenized a balanced training dataset, defined evaluation metrics, trained the model using Hugging Face’s Trainer, and evaluated its performance. Finally, we used the trained model to predict sentiment for new reviews, showing how it can classify them as positive or negative with probabilities. This workflow highlights the power of transfer learning and pre-trained models, allowing us to quickly build accurate NLP models for real-world tasks without training from scratch.