# Introduction

Transformer-based models such as DistilBERT, a smaller distilled variant of BERT, are now a standard choice for natural language processing tasks. In this project, we fine-tune DistilBERT for sentiment analysis on 50,000 IMDb movie reviews. Our aim is to train a model that accurately predicts whether a review is positive or negative.

# Libraries

In [1]:
try:
    import torch, transformers, datasets
    print("✅ All libraries imported!")
    print("torch", torch.__version__)
    print("transformers", transformers.__version__)
    print("datasets", datasets.__version__)
except ModuleNotFoundError as e:
    print("❌", e)


✅ All libraries imported!
torch 2.1.2
transformers 4.39.3
datasets 2.18.0


In [2]:
from datasets import load_dataset

# Point to the mounted CSV
data_path = "/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"

imdb_ds = load_dataset("csv", data_files={"train": data_path}, split="train")
print(imdb_ds)
print("\nSample ➜", imdb_ds[0])


Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})

Sample ➜ {'review': "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agree

In [4]:
# 1. Split 90 % train / 10 % validation
split_ds = imdb_ds.train_test_split(test_size=0.1, seed=42)
train_ds = split_ds["train"]
val_ds   = split_ds["test"]

print("Train rows:", len(train_ds))
print("Val rows  :", len(val_ds))


Train rows: 45000
Val rows  : 5000
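
As a rough illustration of what a seeded 90/10 split does, the sketch below shuffles row indices with a fixed seed and sets aside 10 % of them for validation. It uses only Python's `random` module. The helper name `split_indices` is hypothetical, and this is not how `datasets` implements `train_test_split` internally, so the rows it selects will differ from the ones above.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices reproducibly and split them into train/validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same order on every run
    n_val = round(n_rows * test_size)      # 10 % of 50 000 -> 5 000 rows
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(50_000)
print(len(train_idx), len(val_idx))
```

The two index lists are disjoint and together cover every row, which is the property that matters for a validation split.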


In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label2id = {"negative": 0, "positive": 1}

def preprocess_batch(batch):
    batch["label"] = [label2id[s] for s in batch["sentiment"]]
    enc = tokenizer(batch["review"], padding="max_length",
                    truncation=True, max_length=256)
    batch.update(enc)
    return batch

tokenized_train = train_ds.map(preprocess_batch, batched=True,
                               remove_columns=["review", "sentiment"])
tokenized_val   =  val_ds.map(preprocess_batch, batched=True,
                               remove_columns=["review", "sentiment"])
tokenized_train.set_format("torch",
                           columns=["input_ids", "attention_mask", "label"])
tokenized_val.set_format("torch",
                         columns=["input_ids", "attention_mask", "label"])
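
To make the effect of `padding="max_length"` and `truncation=True` more concrete, here is a minimal sketch that works on plain lists of token ids. The function `pad_or_truncate`, the pad id `0`, and the token ids are illustrative assumptions. The real tokenizer also handles special tokens when truncating, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)         # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

# Made-up token ids: one short sequence and one that is too long
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102])
long_ids, long_mask = pad_or_truncate(list(range(1, 13)))
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with exactly `max_length` entries, which is why the notebook can stack the encoded reviews into fixed-size tensors.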



In [8]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")


# 1️⃣  load pretrained model (num_labels=2 -> binary sentiment)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)


# 2️⃣  metric function: accuracy from the arg-max of the logits
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=preds, references=labels)

# 3️⃣  training hyper-parameters
args = TrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42,
    report_to=["none"]
)

# 4️⃣  Trainer object
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)

# 5️⃣  start fine-tuning (the run logged below took ≈29 min on a Kaggle GPU; train_runtime ≈ 1720 s)
trainer.train()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2202        | 0.229818        | 0.9178   |
| 2     | 0.1451        | 0.302815        | 0.9124   |
| 3     | 0.0949        | 0.307398        | 0.9262   |


TrainOutput(global_step=8439, training_loss=0.1759010926145049, metrics={'train_runtime': 1720.0413, 'train_samples_per_second': 78.486, 'train_steps_per_second': 4.906, 'total_flos': 8941549409280000.0, 'train_loss': 0.1759010926145049, 'epoch': 3.0})
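
The `global_step` and `train_runtime` values in the `TrainOutput` above can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 16 reviews.

```python
import math

train_rows = 45_000            # size of the training split
batch_size = 16                # per_device_train_batch_size, single device assumed
epochs = 3
train_runtime_s = 1720.0413    # 'train_runtime' reported by TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)   # the last partial batch still counts
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
print(round(train_runtime_s / 60, 1), "minutes")
```

This gives 2813 steps per epoch and 8439 steps in total, matching `global_step=8439`, and a wall-clock time of roughly 28.7 minutes.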

In [9]:
# 1️⃣  Evaluate on the validation split
eval_metrics = trainer.evaluate()
print("Validation metrics ➜", eval_metrics)

# 2️⃣  Confirm which checkpoint was judged 'best'
print("Best checkpoint path ➜", trainer.state.best_model_checkpoint)

# 3️⃣  Save that best model for inference
trainer.save_model("distilbert-imdb")      # writes folder in /kaggle/working
print("✅ Model saved to distilbert-imdb/")


Validation metrics ➜ {'eval_loss': 0.3073979616165161, 'eval_accuracy': 0.9262, 'eval_runtime': 19.2223, 'eval_samples_per_second': 260.115, 'eval_steps_per_second': 16.283, 'epoch': 3.0}
Best checkpoint path ➜ checkpoints/checkpoint-8439
✅ Model saved to distilbert-imdb/
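
The best checkpoint is the last one here because `metric_for_best_model="accuracy"` was used. The sketch below mimics that selection rule on the three logged epochs and contrasts it with selecting by validation loss. It is a plain-Python illustration, not the `Trainer`'s internal code, and `steps_per_epoch` assumes one device with batch size 16.

```python
# (epoch, validation loss, accuracy) copied from the training log
history = [
    (1, 0.229818, 0.9178),
    (2, 0.302815, 0.9124),
    (3, 0.307398, 0.9262),
]
steps_per_epoch = 2813  # ceil(45000 / 16)

best_by_accuracy = max(history, key=lambda row: row[2])   # higher accuracy is better
best_by_loss = min(history, key=lambda row: row[1])       # lower loss is better

print("accuracy picks checkpoint-%d" % (best_by_accuracy[0] * steps_per_epoch))
print("eval_loss would pick checkpoint-%d" % (best_by_loss[0] * steps_per_epoch))
```

Selecting by accuracy keeps epoch 3 (`checkpoint-8439`), whereas selecting by validation loss would keep epoch 1, so the choice of selection metric changes which weights are saved.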


In [10]:
!zip -r distilbert-imdb.zip distilbert-imdb


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  adding: distilbert-imdb/ (stored 0%)
  adding: distilbert-imdb/training_args.bin (deflated 51%)
  adding: distilbert-imdb/model.safetensors (deflated 8%)
  adding: distilbert-imdb/config.json (deflated 46%)


In [11]:
!ls -R /kaggle/working/distilbert-imdb



/kaggle/working/distilbert-imdb:
config.json  model.safetensors	training_args.bin


In [12]:
# 1️⃣  create the project directory
!mkdir -p /kaggle/working/imdb-sentiment

# 2️⃣  copy the model folder into that project
!cp -r /kaggle/working/distilbert-imdb /kaggle/working/imdb-sentiment/

# 3️⃣  show the new layout
!ls -R /kaggle/working/imdb-sentiment



/kaggle/working/imdb-sentiment:
distilbert-imdb

/kaggle/working/imdb-sentiment/distilbert-imdb:
config.json  model.safetensors	training_args.bin


In [13]:
from transformers import AutoTokenizer

# 1️⃣  load the same tokenizer you used for training
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 2️⃣  save it next to the model files
tok.save_pretrained("/kaggle/working/imdb-sentiment/distilbert-imdb")

# 3️⃣  show the final contents
import os, glob, textwrap
files = glob.glob("/kaggle/working/imdb-sentiment/distilbert-imdb/*")
print(textwrap.fill('\n'.join(os.path.basename(f) for f in files), width=80))


training_args.bin model.safetensors tokenizer_config.json vocab.txt
special_tokens_map.json config.json tokenizer.json


In [15]:
%%bash
mkdir -p /kaggle/working/imdb-sentiment/api

cat > /kaggle/working/imdb-sentiment/api/main.py <<'PY'
from pathlib import Path

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# --- load model & tokenizer -----------------------------------------------
# Resolve the model folder relative to this file, so the same code works from
# /kaggle/working/imdb-sentiment and from /app inside the Docker image.
MODEL_PATH = str(Path(__file__).resolve().parent.parent / "distilbert-imdb")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model     = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval().to("cpu")                          # GPU not needed for demo

label_map = {0: "negative", 1: "positive"}

# --- FastAPI app ----------------------------------------------------------
app = FastAPI(title="IMDb Sentiment API")

class Item(BaseModel):
    text: str

@app.post("/predict")
def predict(item: Item):
    inputs = tokenizer(
        item.text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
        pred   = int(torch.argmax(logits, dim=-1))
        score  = float(torch.softmax(logits, dim=-1)[0, pred])
    return {"label": label_map[pred], "confidence": round(score, 4)}
PY
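
The `confidence` returned by the endpoint is the softmax probability of the predicted class. As a minimal sketch of that step, using only the standard library and made-up logits (the `softmax` helper here is an illustrative stand-in for `torch.softmax`):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                              # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_map = {0: "negative", 1: "positive"}
logits = [-2.0, 2.0]                  # made-up logits for a single review
probs = softmax(logits)
pred = probs.index(max(probs))        # same index as the arg-max of the logits
print({"label": label_map[pred], "confidence": round(probs[pred], 4)})
```

Because softmax is monotonic, the arg-max of the probabilities and of the raw logits always agree. The probability only adds a score on a 0 to 1 scale, and it should not be read as a calibrated likelihood.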


In [16]:
ls /kaggle/working/imdb-sentiment/api/main.py



/kaggle/working/imdb-sentiment/api/main.py


In [17]:
!pip install -q fastapi uvicorn[standard] "requests>=2.31"



In [18]:
from fastapi.testclient import TestClient
import importlib.util, sys, pathlib

# Dynamically import the api module we just wrote
spec = importlib.util.spec_from_file_location(
    "imdb_api", pathlib.Path("/kaggle/working/imdb-sentiment/api/main.py")
)
api_module = importlib.util.module_from_spec(spec)
sys.modules["imdb_api"] = api_module
spec.loader.exec_module(api_module)

client = TestClient(api_module.app)

resp = client.post("/predict", json={"text": "A surprisingly fun movie!"})
print("Status code:", resp.status_code)
print("Response JSON:", resp.json())


Status code: 200
Response JSON: {'label': 'positive', 'confidence': 0.9908}


In [19]:
%%bash
cd /kaggle/working/imdb-sentiment

# ── requirements.txt ──
cat > requirements.txt <<'REQ'
fastapi
uvicorn[standard]
torch>=2.2
transformers>=4.41
REQ

# ── Dockerfile ──
cat > Dockerfile <<'DOCK'
# ---- base image ----
FROM python:3.11-slim

# ---- install deps ----
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# ---- copy app & model ----
COPY api/ /app/api/
COPY distilbert-imdb/ /app/distilbert-imdb/

# ---- expose + run ----
WORKDIR /app
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "80"]
DOCK

ls -1


Dockerfile
api
distilbert-imdb
requirements.txt


In [22]:
!zip -r imdb-sentiment.zip imdb-sentiment



  adding: imdb-sentiment/ (stored 0%)
  adding: imdb-sentiment/requirements.txt (stored 0%)
  adding: imdb-sentiment/Dockerfile (deflated 36%)
  adding: imdb-sentiment/distilbert-imdb/ (stored 0%)
  adding: imdb-sentiment/distilbert-imdb/training_args.bin (deflated 51%)
  adding: imdb-sentiment/distilbert-imdb/model.safetensors (deflated 8%)
  adding: imdb-sentiment/distilbert-imdb/tokenizer_config.json (deflated 76%)
  adding: imdb-sentiment/distilbert-imdb/vocab.txt (deflated 53%)
  adding: imdb-sentiment/distilbert-imdb/special_tokens_map.json (deflated 42%)
  adding: imdb-sentiment/distilbert-imdb/config.json (deflated 46%)
  adding: imdb-sentiment/distilbert-imdb/tokenizer.json (deflated 71%)
  adding: imdb-sentiment/api/ (stored 0%)
  adding: imdb-sentiment/api/main.py (deflated 52%)
  adding: imdb-sentiment/api/__pycache__/ (stored 0%)
  adding: imdb-sentiment/api/__pycache__/main.cpython-310.pyc (deflated 31%)


In [23]:
from datasets import load_dataset
import numpy as np, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "/kaggle/working/imdb-sentiment/distilbert-imdb"

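# 1. Load test split (25 000 reviews)
# NOTE: the Kaggle 50k CSV used for fine-tuning appears to combine the original
# IMDb train and test sets, so most of these reviews were likely seen during
# training; treat the resulting accuracy as optimistic, not as a held-out score.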
test_ds = load_dataset("imdb", split="test")

# 2. Load tokenizer & model
tok   = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH).to("cuda").eval()

# 3. Batched inference
batch_size = 64
correct = 0
for i in range(0, len(test_ds), batch_size):
    texts  = test_ds[i : i + batch_size]["text"]
    labels = test_ds[i : i + batch_size]["label"]
    enc    = tok(texts, padding=True, truncation=True, max_length=256, return_tensors="pt").to("cuda")
    with torch.no_grad():
        preds = torch.argmax(model(**enc).logits, dim=-1).cpu().numpy()
    correct += int(np.sum(preds == labels))

test_acc = correct / len(test_ds)
print(f"✅ Test accuracy: {test_acc:.4f}")


✅ Test accuracy: 0.9806
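
This figure is well above the 0.9262 validation accuracy, and the Kaggle CSV and the Hugging Face `imdb` dataset may share reviews. It is therefore worth measuring how much of the test set was seen during fine-tuning before trusting the number. The sketch below shows one way to do that on toy data. `normalise` and `overlap_fraction` are hypothetical helpers, and with the real data the two lists would be `train_ds["review"]` and `test_ds["text"]`.

```python
import re

def normalise(text):
    """Lower-case, replace HTML line breaks with spaces and collapse whitespace."""
    text = re.sub(r"<br\s*/?>", " ", text.lower())
    return " ".join(text.split())

def overlap_fraction(train_texts, test_texts):
    """Share of test texts that also occur (after normalisation) in the training texts."""
    seen = {normalise(t) for t in train_texts}
    hits = sum(normalise(t) in seen for t in test_texts)
    return hits / len(test_texts)

# Toy data: the first test review duplicates a training review, the second is new
train_texts = ["Great film!<br /><br />Loved it.", "Dull and slow."]
test_texts = ["great film!  Loved it.", "A new unseen review."]
print(overlap_fraction(train_texts, test_texts))
```

A fraction close to zero would support reading the test accuracy as a held-out score. A large fraction would mean the model is mostly being evaluated on reviews it was trained on.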


In [None]:
import sys, torch, transformers, datasets
print("python:", sys.version)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("transformers location:", transformers.__file__)


In [7]:
!pip install -U evaluate --quiet



In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict, load_metric
from transformers import pipeline

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Loading Dataset & EDA

In [None]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.sample(5)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()

## Visualising Target Values

In [None]:
df['sentiment'].unique()

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(3, 6))
sns.countplot(x='sentiment', data=df, palette='Set2')
plt.title('Distribution of Sentiment Labels', fontsize=16)
plt.xlabel('Sentiment', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

In [None]:
label_counts = df['sentiment'].value_counts()
colors = ['#66c2a5', '#fc8d62']

plt.figure(figsize=(8, 6))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Distribution of Sentiment Labels')
plt.axis('equal')
plt.show()

# Model (DistilBERT)

## Preprocessing

### Label Encoding

In [None]:
reviews = df['review'].tolist()
labels = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).tolist()

### Splitting the Dataset

In [None]:
train_reviews, val_reviews, train_labels, val_labels = train_test_split(reviews, labels, test_size=0.2, random_state=42)

### Tokenization

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert/distilbert-base-uncased-finetuned-sst-2-english')

In [None]:
# Function for tokenizing the reviews

def tokenize_function(texts):
    return tokenizer(texts, padding="max_length", truncation=True)

In [None]:
train_encodings = tokenize_function(train_reviews)
val_encodings = tokenize_function(val_reviews)

In [None]:
# Convert to Hugging Face Dataset format

train_dataset = Dataset.from_dict({
                                    'input_ids': train_encodings['input_ids'],
                                    'attention_mask': train_encodings['attention_mask'],
                                    'labels': train_labels
                                    })

val_dataset = Dataset.from_dict({
                                    'input_ids': val_encodings['input_ids'],
                                    'attention_mask': val_encodings['attention_mask'],
                                    'labels': val_labels
                                    })

dataset = DatasetDict({
                        'train': train_dataset,
                        'validation': val_dataset
                        })

## Model

In [None]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased-finetuned-sst-2-english', num_labels=2).to(device)

In [None]:
# Load F1 metric (datasets.load_metric is deprecated; the evaluate library replaces it)
import evaluate
f1_metric = evaluate.load("f1")

In [None]:
# Define the evaluation metric

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    # Calculate accuracy
    accuracy = (preds == labels).mean()
    
    # Calculate F1-score (weighted)
    f1 = f1_metric.compute(predictions=preds, references=labels, average="weighted")
    
    return {"accuracy": accuracy, "f1": f1["f1"]}
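
For readers who want to see what the weighted F1-score measures, the following is a small pure-Python sketch of the calculation on toy labels. The helper `weighted_f1` is written here for illustration only. It is intended to follow the usual definition (per-class F1, weighted by the number of true samples per class) and is not the metric library's own code.

```python
def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in sorted(set(labels)):
        pairs = list(zip(preds, labels))
        tp = sum(p == cls and y == cls for p, y in pairs)   # true positives
        fp = sum(p == cls and y != cls for p, y in pairs)   # false positives
        fn = sum(p != cls and y == cls for p, y in pairs)   # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)                             # weight by support
    return total / len(labels)

labels = [1, 1, 1, 0, 0, 0]
preds  = [1, 1, 0, 0, 0, 1]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(round(accuracy, 4), round(weighted_f1(labels, preds), 4))
```

On a balanced dataset with symmetric errors, as in this toy case, accuracy and weighted F1 come out the same, which is why the two metrics tend to track each other closely on a balanced sentiment dataset.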

## Hyperparameters Settings

In [None]:
# Define training arguments

training_args = TrainingArguments(
    output_dir='./results',                     # Output directory for model checkpoints
    num_train_epochs=5,                         # Number of training epochs
    per_device_train_batch_size=16,             # Batch size per device (GPU/CPU)
    per_device_eval_batch_size=16,              # Evaluation batch size
    warmup_steps=100,                           # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,                          # Weight decay for regularization
    logging_dir='./logs',                       # Directory for logging
    logging_steps=10,                           # Interval for logging updates
    evaluation_strategy='epoch',                # Evaluate at each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,                # Load the best model based on eval_loss
    metric_for_best_model="eval_loss",          # Use validation loss to determine the best model
    greater_is_better=False,                    # Lower validation loss is better
    report_to="none",                           # Disable logging integrations (e.g. W&B, TensorBoard)
    push_to_hub=False,                          # Do not push to Hugging Face Hub
    fp16=torch.cuda.is_available(),             # Enable mixed precision only when a CUDA GPU is present
)
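
These arguments set `warmup_steps=100` but leave the learning rate and scheduler unspecified. The sketch below shows what a linear warmup followed by linear decay looks like under a few assumptions: a base learning rate of 5e-5 and a linear schedule (the documented `TrainingArguments` defaults), 40,000 training rows (80 % of 50,000), and a single device. The function `linear_warmup_decay` is an illustrative re-implementation, not the library's scheduler.

```python
import math

def linear_warmup_decay(step, base_lr=5e-5, warmup_steps=100, total_steps=12_500):
    """Learning rate at an optimizer step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = math.ceil(40_000 / 16) * 5     # 2 500 steps per epoch x 5 epochs
for step in (0, 50, 100, 6_300, total_steps):
    print(step, linear_warmup_decay(step, total_steps=total_steps))
```

With 100 warmup steps out of 12,500, the warmup covers less than 1 % of training, so the schedule is dominated by the decay phase.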

# Fine-Tuning the Model

In [None]:
# Initialize Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    compute_metrics=compute_metrics
)

In [None]:
# Fine-tune the model
trainer.train()

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)
print()
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"F1-Score: {eval_results['eval_f1']:.4f}")

# Conclusion

In our experiment with DistilBERT on IMDb reviews, we achieved a peak validation accuracy of 0.9371 and a weighted F1-score of 0.9371, which shows that a compact transformer model is effective for sentiment analysis. We fine-tuned DistilBERT for up to five epochs and used validation loss to select the best checkpoint, which guards against keeping an overfitted later epoch.

Accuracy and F1-score continued to improve even as validation loss rose slightly, which is why it pays to monitor several metrics rather than a single one. DistilBERT is efficient and adapts well to this task, making it a solid starting point for further exploration and refinement in text classification.