This notebook explores sentiment analysis using a pre-trained DistilBERT model on the IMDB movie review dataset.

**Dataset:**

The IMDB dataset contains 50,000 labeled movie reviews, split evenly into training and test sets, each labeled as positive (1) or negative (0). This notebook uses a random subset of 1,000 training reviews for fine-tuning and 200 test reviews for evaluation to keep the demonstration fast.

**Methodology:**

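1. **Initialization:** A pre-trained DistilBERT model (`distilbert-base-uncased`) is loaded with a new, randomly initialized two-class classification head.
2. **Initial Evaluation:** The model's performance on sentiment classification is measured before any fine-tuning, as a baseline.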
3. **Fine-tuning:** The model is fine-tuned on the training subset of IMDB reviews to enhance its sentiment classification capabilities.
4. **Evaluation:** The fine-tuned model is evaluated using metrics such as accuracy, precision, recall, and F1 score.

**Results:**

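Before fine-tuning, the model reached an accuracy of 52% on the 200-review evaluation subset, with precision, recall, and F1 score all at 0.0, which is consistent with it predicting the negative class for every review. After three epochs of fine-tuning, accuracy rose to 86%, with precision of 0.88, recall of 0.82, and an F1 score of 0.85. These results show that even brief fine-tuning on a small subset is effective for this sentiment analysis task.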
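
The pre-fine-tuning numbers logged further down (accuracy 0.52, precision, recall, and F1 all 0.0, and a loss near 0.694) are what a classifier that always predicts the negative class with roughly 50/50 confidence would produce. The sketch below reproduces that baseline in plain Python. The class counts (96 positive, 104 negative) are an assumption inferred from the logged metrics rather than read from the dataset, and `binary_metrics` is a hypothetical helper, not part of the notebook.

```python
import math

def binary_metrics(labels, preds):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    # Undefined ratios are reported as 0.0, which is what scikit-learn
    # does by default (while emitting a warning).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Assumed class balance of the 200-review evaluation subset (inferred, not measured).
labels = [1] * 96 + [0] * 104
always_negative = [0] * len(labels)

baseline = binary_metrics(labels, always_negative)
chance_loss = math.log(2)  # cross-entropy when both classes get probability 0.5

print(baseline)
print(f"chance-level loss: {chance_loss:.4f}")
```

Accuracy alone hides this failure mode because the subset is nearly balanced, which is why the notebook also reports precision, recall, and F1.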

In [1]:
# !pip install datasets
# !pip install transformers
# !pip install torch
# !pip install scikit-learn
# !pip install wandb

In [12]:
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the IMDB dataset for sentiment classification
dataset = load_dataset("imdb")

# Initialize tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Create smaller dataset for demonstration
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))

# Define metrics computation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    accuracy = accuracy_score(labels, predictions)
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Training arguments for initial evaluation
initial_args = TrainingArguments(
    output_dir="./initial_eval",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    logging_strategy="epoch",
    logging_dir='./logs_initial',
)

# Evaluate model before fine-tuning
initial_trainer = Trainer(
    model=model,
    args=initial_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

print("Metrics before fine-tuning:")
initial_metrics = initial_trainer.evaluate()
print(initial_metrics)

# Training arguments for fine-tuning
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,  # NOTE: exceeds the ~189 total steps (63 per epoch x 3), so warmup never completes
    weight_decay=0.01,
    logging_dir='./logs',
    logging_strategy="epoch",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

print("\nFine-tuning the model...")
train_result = trainer.train()

# Print training metrics
print("\nTraining metrics:")
print(train_result.metrics)

# Evaluate model after fine-tuning
print("\nMetrics after fine-tuning:")
final_metrics = trainer.evaluate()
print(final_metrics)

# Print improvement summary
print("\nImprovement Summary:")
for metric in ['accuracy', 'f1', 'precision', 'recall']:
    initial_value = initial_metrics[f'eval_{metric}']
    final_value = final_metrics[f'eval_{metric}']
    improvement = final_value - initial_value
    print(f"{metric}: {improvement:.4f} improvement (from {initial_value:.4f} to {final_value:.4f})")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





Metrics before fine-tuning:


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 0.694140613079071, 'eval_model_preparation_time': 0.0031, 'eval_accuracy': 0.52, 'eval_f1': 0.0, 'eval_precision': 0.0, 'eval_recall': 0.0, 'eval_runtime': 0.9198, 'eval_samples_per_second': 217.448, 'eval_steps_per_second': 27.181}

Fine-tuning the model...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.6905 | 0.682695 | 0.55 | 0.415584 | 0.551724 | 0.333333 |
| 2 | 0.5931 | 0.432336 | 0.835 | 0.835821 | 0.8 | 0.875 |
| 3 | 0.2942 | 0.342153 | 0.86 | 0.849462 | 0.877778 | 0.822917 |



Training metrics:
{'train_runtime': 104.4183, 'train_samples_per_second': 28.731, 'train_steps_per_second': 1.81, 'total_flos': 397402195968000.0, 'train_loss': 0.5259117853073847, 'epoch': 3.0}

Metrics after fine-tuning:


{'eval_loss': 0.3421531617641449, 'eval_accuracy': 0.86, 'eval_f1': 0.8494623655913979, 'eval_precision': 0.8777777777777778, 'eval_recall': 0.8229166666666666, 'eval_runtime': 1.6737, 'eval_samples_per_second': 119.496, 'eval_steps_per_second': 7.767, 'epoch': 3.0}

Improvement Summary:
accuracy: 0.3400 improvement (from 0.5200 to 0.8600)
f1: 0.8495 improvement (from 0.0000 to 0.8495)
precision: 0.8778 improvement (from 0.0000 to 0.8778)
recall: 0.8229 improvement (from 0.0000 to 0.8229)
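
One detail of the fine-tuning configuration is worth checking against these results: `warmup_steps=500` is larger than the total number of optimizer steps. The training log is consistent with about 189 steps (1.81 steps per second over roughly 104 seconds). The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, the `TrainingArguments` default learning rate of 5e-5, and the default linear schedule with warmup. `linear_warmup_factor` is a hypothetical helper that mirrors the shape of that schedule; it is not the library's code.

```python
import math

def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

train_examples, batch_size, epochs = 1000, 16, 3
steps_per_epoch = math.ceil(train_examples / batch_size)  # 63
total_steps = steps_per_epoch * epochs                     # 189

base_lr = 5e-5      # assumed: TrainingArguments default
warmup_steps = 500  # value used in the notebook

# Multiplier at the last optimizer step; warmup is still in progress there.
peak_factor = linear_warmup_factor(total_steps - 1, warmup_steps, total_steps)
print(f"total steps: {total_steps}")
print(f"highest learning rate reached: {base_lr * peak_factor:.2e}")
```

Under these assumptions the learning rate only ever climbs and stays below 40% of its nominal value. A shorter warmup, for example around 10% of the total steps, may train faster, though changing it would also change the metrics reported above.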
