# IMDB movie review sentiment classification using Hugging Face models

In this notebook, we'll test pre-trained sentiment analysis models and then fine-tune a DistilBERT model to classify the sentiment of IMDB movie reviews. This notebook is adapted from [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python).

Import the libraries

In [1]:
import numpy as np
import torch
from datasets import load_dataset, load_metric
from huggingface_hub import notebook_login
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)

Check if PyTorch is using the GPU

In [2]:
print('Using PyTorch version:', torch.__version__)
if torch.cuda.is_available():
    print('Using GPU, device name:', torch.cuda.get_device_name(0))
    device = torch.device('cuda')
else:
    print('No GPU found, using CPU instead.') 
    device = torch.device('cpu')

Using PyTorch version: 2.4.1+rocm6.1
Using GPU, device name: AMD Instinct MI250X


## Use Pre-trained Sentiment Analysis Models

In [3]:
sentiment_pipeline = pipeline("sentiment-analysis", device=device)
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

- The code snippet above uses the **[pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)** function to generate predictions with models from the Hub. Because no model is specified, it applies the [default sentiment analysis model](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) to the provided list of texts.
- The predictions are **POSITIVE** for the first entry and **NEGATIVE** for the second entry.
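The `score` in each result is the probability assigned to the predicted label. For a two-class model like this one, the pipeline obtains it by applying a softmax to the model's raw logits. The sketch below illustrates that step in plain Python; the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, ordered as [NEGATIVE, POSITIVE].
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.7]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A score close to 1 therefore means the two logits are far apart, not that the prediction is guaranteed to be correct.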

You can also use a specific sentiment analysis model by passing its model id. For example, to use a model trained on tweets:

In [5]:
specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis", device=device)
specific_model(data)

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


[{'label': 'POS', 'score': 0.9916695356369019},
 {'label': 'NEG', 'score': 0.9806600213050842}]
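Note that the two models name their labels differently (`POSITIVE`/`NEGATIVE` versus `POS`/`NEG`). When comparing outputs across models, it can help to map them onto a common vocabulary first. The helper below is a hypothetical sketch, not part of the `transformers` API; the `NEU` entry assumes the tweet model can also return a neutral class.

```python
# Hypothetical mapping from model-specific label names to a shared vocabulary.
LABEL_ALIASES = {
    "POSITIVE": "positive", "POS": "positive",
    "NEGATIVE": "negative", "NEG": "negative",
    "NEU": "neutral",
}

def normalize(predictions):
    """Return a copy of the predictions with harmonized label names."""
    return [{"label": LABEL_ALIASES[p["label"]], "score": p["score"]}
            for p in predictions]

# Outputs copied from the two cells above.
default_out = [{"label": "POSITIVE", "score": 0.9998656511306763},
               {"label": "NEGATIVE", "score": 0.9991129040718079}]
tweet_out = [{"label": "POS", "score": 0.9916695356369019},
             {"label": "NEG", "score": 0.9806600213050842}]

# Check, sentence by sentence, whether the two models agree.
agree = [a["label"] == b["label"]
         for a, b in zip(normalize(default_out), normalize(tweet_out))]
print(agree)
```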

## Fine-tuning a DistilBERT model on the IMDB dataset

- The [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb) dataset contains 50,000 labeled movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. Half of the reviews are positive and half are negative. The Hugging Face version also includes a separate unlabeled split of 50,000 reviews, which we do not use here.

- The IMDB dataset is relatively large, so let's use 5,000 training samples to speed up this exercise. We still evaluate on the full test split.
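The subsampling idea is to shuffle with a fixed seed and keep the first 5,000 indices, which gives a random subset that is identical on every run. The plain-Python sketch below illustrates the principle only: `shuffled_subset` is a hypothetical helper, and it will not reproduce the exact permutation that the `datasets` library generates for the same seed.

```python
import random

def shuffled_subset(n_total, n_keep, seed):
    """Return n_keep distinct indices drawn from a seeded shuffle of range(n_total)."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # a local RNG leaves the global state untouched
    return indices[:n_keep]

subset_a = shuffled_subset(25_000, 5_000, seed=0)
subset_b = shuffled_subset(25_000, 5_000, seed=0)

print(len(subset_a), subset_a == subset_b)
```

Shuffling before selecting matters whenever the stored split is ordered, for example by label: taking the first 5,000 rows without shuffling could then yield reviews of a single class.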

In [6]:
imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=0).select(range(5000))
test_dataset = imdb["test"]


To preprocess our data, we will use the DistilBERT tokenizer:

In [7]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


- Next, we will prepare the text inputs for the model for both splits of our dataset (training and test) by using the map method:

In [8]:
def preprocess_function(examples):
    # Truncate reviews that exceed the model's maximum input length.
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)
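With `truncation=True` and no explicit `max_length`, the tokenizer cuts each review down to the model's maximum input length, which is 512 tokens for DistilBERT, special tokens included. The toy function below sketches that behaviour. It splits on whitespace instead of using WordPiece, so `toy_encode` is a stand-in for illustration and its token counts will differ from the real tokenizer's.

```python
def toy_encode(text, max_length=512):
    """Whitespace 'tokenizer' that mimics truncation to a fixed maximum length."""
    tokens = text.lower().split()
    # Reserve two positions for the special tokens wrapped around the sequence.
    tokens = tokens[: max_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]

short = toy_encode("A great movie")
long = toy_encode("word " * 1000)  # a 1000-word "review"

print(short)
print(len(long))
```

Long reviews therefore lose their tail, which is a trade-off to keep in mind if the verdict of a review tends to appear at the end.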


- To speed up training, let's use a data collator that converts the training samples to PyTorch tensors and pads each batch only to the length of its longest sequence:

In [9]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
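Dynamic padding can be sketched in a few lines: every sequence in a batch is extended with the padding id up to the longest sequence in that batch, and an attention mask marks which positions hold real tokens. The function below is an illustrative stand-in that returns Python lists, not the collator's actual implementation. Ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the uncased BERT vocabulary; the other ids are arbitrary placeholders.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 2023, 102],              # 3 tokens
    [101, 2023, 3185, 2001, 102],  # 5 tokens
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch avoids padding every review to 512 tokens, so batches of short reviews cost far less computation.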

### Training the model
- We discard the pretraining head of the DistilBERT model and replace it with a randomly initialized classification head, which we then fine-tune for sentiment analysis. This lets us transfer the knowledge stored in DistilBERT's pretrained layers to our custom model.

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
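The warning lists the four tensors that make up the new classification head. As a rough size estimate, assuming DistilBERT's hidden size of 768 and that `pre_classifier` is a 768-to-768 linear layer followed by a 768-to-2 `classifier`, the number of freshly initialized parameters works out as follows:

```python
hidden_size = 768  # hidden dimension of distilbert-base-uncased
num_labels = 2     # negative / positive

# Each linear layer has a weight matrix plus one bias per output unit.
pre_classifier_params = hidden_size * hidden_size + hidden_size
classifier_params = hidden_size * num_labels + num_labels
new_params = pre_classifier_params + classifier_params

print(pre_classifier_params, classifier_params, new_params)
```

That is a small fraction of the 268 MB checkpoint downloaded above, so almost all of the model starts from pretrained weights.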


- Then, let's define the metrics we will use to evaluate how good the fine-tuned model is (accuracy and F1 score):

In [11]:
def compute_metrics(eval_pred):
    # Load the accuracy and F1 metric implementations.
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    # eval_pred holds the model's raw logits and the reference labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
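To make explicit what these two metrics measure, here is a dependency-free sketch that computes them by hand on a tiny, made-up batch. `accuracy_and_f1` is a hypothetical helper written for illustration; it treats class 1 as the positive class for the binary F1 score.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up logits for four reviews and their true labels.
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
labels = [0, 1, 1, 1]

metrics = accuracy_and_f1(logits, labels)
print(metrics)
```

Here three of four predictions are correct (accuracy 0.75), precision is 1.0 and recall is 2/3, giving an F1 of 0.8.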

- Define the training arguments

In [16]:
repo_name = "finetuning-sentiment-model-5000-samples"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

- Start training

In [17]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.1289 |
| 1000 | 0.0806 |
| 1500 | 0.0234 |
| 2000 | 0.0168 |
| 2500 | 0.0036 |
| 3000 | 0.0021 |


TrainOutput(global_step=3130, training_loss=0.041074030658307545, metrics={'train_runtime': 447.4617, 'train_samples_per_second': 111.741, 'train_steps_per_second': 6.995, 'total_flos': 6541136655478080.0, 'train_loss': 0.041074030658307545, 'epoch': 10.0})
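The numbers in `TrainOutput` can be checked against the training arguments. The sketch below assumes training ran on a single device, so that the effective batch size equals `per_device_train_batch_size`:

```python
import math

train_samples = 5000
batch_size = 16            # per_device_train_batch_size, single device assumed
epochs = 10
train_runtime = 447.4617   # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_samples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The results match the reported `global_step`, `train_samples_per_second` and `train_steps_per_second`, which is consistent with the single-device assumption.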

- Evaluate the model

In [18]:
trainer.evaluate()

{'eval_loss': 0.6160064339637756,
 'eval_accuracy': 0.90988,
 'eval_f1': 0.9106909263883934,
 'eval_runtime': 74.5624,
 'eval_samples_per_second': 335.29,
 'eval_steps_per_second': 20.962,
 'epoch': 10.0}
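Two quick calculations help put these numbers in context: how many of the 25,000 test reviews were classified correctly, and how the evaluation loss compares with the mean training loss reported by `trainer.train()`. Reading a large ratio as a sign of overfitting or overconfidence is an interpretation, not something this run measured directly.

```python
eval_samples = 25_000
eval_accuracy = 0.90988
eval_loss = 0.6160064339637756
mean_train_loss = 0.041074030658307545  # training_loss from TrainOutput

correct = round(eval_accuracy * eval_samples)
wrong = eval_samples - correct
loss_ratio = eval_loss / mean_train_loss

print(correct, wrong)
print(round(loss_ratio, 1))
```

The evaluation loss is roughly 15 times the mean training loss, even though about 91% of test reviews are classified correctly. A high loss with high accuracy usually means that the wrong predictions are made with very high confidence.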

- Model inference

In [19]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)
pipe(["I love this move", "This movie sucks!"])

[{'label': 'LABEL_1', 'score': 0.999527096748352},
 {'label': 'LABEL_0', 'score': 0.9998001456260681}]
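The generic names `LABEL_0` and `LABEL_1` appear because the model config carries no human-readable class names. In the IMDB dataset, label 0 is negative and label 1 is positive, so the output can be translated after the fact. The helper below is a hypothetical post-processing sketch, not part of the `transformers` API.

```python
# IMDB convention: 0 = negative review, 1 = positive review.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def readable(predictions):
    """Replace generic 'LABEL_<id>' names with human-readable class names."""
    out = []
    for p in predictions:
        class_id = int(p["label"].rsplit("_", 1)[1])
        out.append({"label": ID2LABEL[class_id], "score": p["score"]})
    return out

# Output copied from the cell above.
raw = [{"label": "LABEL_1", "score": 0.999527096748352},
       {"label": "LABEL_0", "score": 0.9998001456260681}]

named = readable(raw)
print(named)
```

Alternatively, passing `id2label` and `label2id` dictionaries to `AutoModelForSequenceClassification.from_pretrained` when creating the model should make the pipeline emit the readable names directly.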

## Task: Compare the test accuracy of the fine-tuned DistilBERT model with that of the previous RNN models. What do you notice?

In my experiment with sentiment analysis on the IMDB dataset, I tested three different models: a Simple RNN, a GRU-based RNN, and a fine-tuned DistilBERT model. Here’s a breakdown of each model’s performance and my observations.

**Model Performance Summary**

1. **Model 1: Simple RNN**
-  **Architecture**: This model uses a single-layer LSTM with 32 hidden units, a dropout layer for regularization, and a sigmoid activation for binary classification.

-  **Test Accuracy**: 77.4%

-  **Test Loss**: 0.5666

-  **Observation**: The Simple RNN had decent performance, but it showed signs of overfitting. Despite the dropout layer, the model’s test accuracy plateaued, and the gap between training and test accuracy suggested it wasn’t generalizing well to new data.

2. **Model 2: GRU-based RNN**

-  **Architecture**: This model is a more complex two-layer bidirectional GRU with additional dropout layers to help combat overfitting.

-  **Test Accuracy**: 78.0%

-  **Test Loss**: 0.9487

-  **Observation**: The GRU-based model provided a slight accuracy improvement over the Simple RNN (78.0% versus 77.4%), likely due to the bidirectional structure and additional layer, which allow it to capture both forward and backward context. However, its test loss was higher (0.9487 versus 0.5666), so like the Simple RNN it struggled with overfitting. This indicates that adding complexity to the RNN architecture did not significantly improve generalization.

3. **Model 3: Fine-tuned DistilBERT**

-  **Architecture**: The DistilBERT model, pretrained on a large corpus and fine-tuned on 5,000 IMDB training reviews, uses a transformer architecture with attention mechanisms to capture complex relationships in text.

-  **Test Accuracy**: 90.98%

-  **Test Loss**: 0.6160

-  **Observation**: DistilBERT outperformed both RNN-based models by a large margin, achieving an accuracy of 90.98%, about 13 percentage points above the GRU model. It is not free of overfitting, however: the training loss fell to 0.0021 by step 3000 while the test loss was 0.6160, which suggests the model became overconfident over 10 epochs even though its test accuracy stayed high. The transformer-based architecture, with its attention mechanism, likely allowed it to capture nuanced patterns in the text that the RNNs struggled with, especially on longer sequences.
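One way to quantify the gap is to compare error rates instead of accuracies. The short calculation below uses only the test accuracies listed above:

```python
# Test accuracies (in percent) reported for the three models.
results = {
    "Simple RNN": 77.4,
    "GRU-based RNN": 78.0,
    "Fine-tuned DistilBERT": 90.98,
}

# Error rate = share of test reviews classified incorrectly.
error = {name: round(100 - acc, 2) for name, acc in results.items()}

best_rnn_error = min(error["Simple RNN"], error["GRU-based RNN"])
relative_reduction = (best_rnn_error - error["Fine-tuned DistilBERT"]) / best_rnn_error

print(error)
print(round(relative_reduction, 2))
```

Relative to the better of the two RNNs, DistilBERT removes about 59% of the test errors.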
  

**Key Takeaways**
 

1. **Overfitting in RNNs**: Both the Simple RNN and GRU models showed signs of overfitting, as their test accuracy didn’t improve significantly despite adding layers and dropout. This suggests that RNNs may struggle to generalize on complex text data without extensive regularization or larger datasets.

2. **Effectiveness of Transformers**: DistilBERT, as a pretrained model, demonstrated a clear advantage. It benefited from large-scale pretraining, which allowed it to start with a better understanding of language structure, whereas the RNNs had to learn from scratch on the IMDB dataset alone. Fine-tuning this model allowed it to adapt to sentiment-specific nuances in the data, leading to superior performance.

3. **Concluding Observation**: From this experiment, I noticed that a pretrained transformer model like DistilBERT was markedly more accurate than the RNNs on IMDB sentiment classification (90.98% versus 77.4% and 78.0%). This suggests that leveraging pretrained language models can substantially improve accuracy on nuanced NLP tasks like sentiment analysis, even when fine-tuning on only 5,000 reviews.