# PyTorch for Text Classification: Classifying IMDB Movie Reviews with TinyBERT Transformers
This project explores the use of PyTorch for text classification, specifically classifying IMDB movie reviews by aspect, first with a simple neural network and then with a fine-tuned TinyBERT transformer.

NOTE: This project is based on Codecademy's [text classification project](https://www.codecademy.com/content-items/29838c7636654e48ac72458af6373d5d).

## Dataset
The dataset consists of movie reviews and can be found at [Hugging Face](https://huggingface.co/datasets/Lowerated/lm6-movies-reviews-aspects); the files in the `datasets` folder have already been cleaned and preprocessed.

## Setup 

In [1]:
import numpy as np
import pandas as pd

In [89]:
import re 

from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
torch.manual_seed(42) # set random seed 

import torch.nn.functional as F

from sklearn.metrics import confusion_matrix, classification_report

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import logging
logging.set_verbosity_error() # remove warning

## Import and Inspect the Movie Reviews
The datasets are imported and analyzed preliminarily, to understand what datatypes are present, the meaning of each column, and the possible values. 

In [None]:
train_review_df = pd.read_csv("datasets/imdb_movie_reviews_train.csv")

train_review_df.head()

| Unnamed: 0 | review | aspect | aspect_encoded |
|---|---|---|---|
| 0 | Ibiza filming location looks very enchanting | Cinematography | 0 |
| 1 | RANDOLPH SCOTT always played men you could loo... | Characters | 1 |
| 2 | interesting and promising basic idea', 'some p... | Story | 2 |
| 3 | the film could explore very powerful politics,... | Story | 2 |
| 4 | The animation is nice, and the use of color ma... | Cinematography | 0 |


In [4]:
train_review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 369 entries, 0 to 368
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   review          369 non-null    object
 1   aspect          369 non-null    object
 2   aspect_encoded  369 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 8.8+ KB


The dataset has the following 3 columns:
* "review": text content of the review
* "aspect": thematic category that the review addresses
* "aspect_encoded": integer encoding of the "aspect" column (0 = Cinematography, 1 = Characters, 2 = Story)

In [5]:
# look at the possible values of aspect column
print(train_review_df["aspect"].value_counts())

n_aspects = train_review_df["aspect"].nunique() # save the number of unique aspects
print(f"\n There are {n_aspects} possible labels.\n")

# confirm that the counts are equal to the aspect_encoded column
print(train_review_df["aspect_encoded"].value_counts())

aspect
Cinematography    125
Characters        123
Story             121
Name: count, dtype: int64

 There are 3 possible labels.

aspect_encoded
0    125
1    123
2    121
Name: count, dtype: int64


Since there are 3 possible aspects, this is a multi-class classification task. 

In [7]:
# extract the training texts and their integer labels
train_texts = train_review_df["review"].tolist()
train_labels = train_review_df["aspect_encoded"].tolist()

## Pre-processing the Data
The text data has to be pre-processed into a numerical representation before it can be fed to the model. This is done through tokenization, encoding against a vocabulary, and truncation/padding to a fixed length. The preprocessed texts are then converted to tensors.

In [None]:
def tokenize_review(review):
    return re.findall(r'\b\w+\b', review.lower())

# tokenize all texts
tokenized_train_texts = [tokenize_review(text) for text in train_texts]
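To make the tokenizer's behavior concrete, here is a small sketch that runs the same regular expression on a made-up sentence (the sample text is an assumption for illustration, not taken from the dataset). It shows that punctuation is dropped and that contractions and possessives are split at the apostrophe.

```python
import re

def tokenize_review(review):
    # lowercase the text, then keep runs of word characters between word boundaries
    return re.findall(r'\b\w+\b', review.lower())

# hypothetical sample sentence, not from the dataset
sample = "The film's pacing isn't great, but it's well-shot!"
tokens = tokenize_review(sample)
print(tokens)
```

Because apostrophes and hyphens are not word characters, `film's` becomes `film` and `s`, and `well-shot` becomes `well` and `shot`. This is one reason a standalone `s` can show up among frequent tokens.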

In [20]:
# count the number of occurrences of each token

# list of all tokens
combined_corpus = []
for text in tokenized_train_texts:
    for token in text:
        combined_corpus.append(token)

word_freqs = Counter(combined_corpus) # frequency of all tokens

Now, the vocabulary is created from the 1000 most frequently occurring word tokens.

In [None]:
# find the most common words
MAX_VOCAB_SIZE = 1000
most_common_words = word_freqs.most_common(MAX_VOCAB_SIZE)

print("The 10 most common words are: ")
print(most_common_words[:10])

The 10 most common words are: 
[('the', 732), ('a', 307), ('and', 306), ('of', 296), ('is', 218), ('to', 213), ('in', 177), ('it', 134), ('s', 109), ('that', 105)]


In [29]:
# vocab maps each of the most common words to an integer id, starting at 2;
# id 0 is reserved for '<unk>' (tokens outside the vocabulary) and id 1 for '<pad>'
vocab = {word: id + 2 for id, (word, freq) in enumerate(most_common_words)}
vocab['<unk>'] = 0
vocab['<pad>'] = 1 
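The same vocabulary-building steps can be traced on a toy corpus. The sketch below uses made-up tokenized reviews and a vocabulary size of 3 (both assumptions for illustration) to show how ids 0 and 1 stay reserved while frequent words receive ids from 2 upward.

```python
from collections import Counter

# hypothetical tokenized reviews, standing in for tokenized_train_texts
toy_corpus = [
    ["the", "plot", "is", "thin"],
    ["the", "cast", "is", "great"],
    ["the", "plot", "drags"],
]

# count every token across all reviews
toy_freqs = Counter(token for review in toy_corpus for token in review)

TOY_VOCAB_SIZE = 3  # illustrative value; the notebook uses 1000
toy_common = toy_freqs.most_common(TOY_VOCAB_SIZE)

# reserve 0 and 1 for the special tokens, then number the common words from 2
toy_vocab = {"<unk>": 0, "<pad>": 1}
for idx, (word, _freq) in enumerate(toy_common):
    toy_vocab[word] = idx + 2

print(toy_vocab)
```

Words that fall outside the top `TOY_VOCAB_SIZE` (here `thin`, `cast`, `great`, `drags`) get no entry and will later be encoded as `<unk>`.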

The texts can now be tokenized, encoded, and expressed as tensors. 

In [30]:
def encode_text(text, vocab):
    """
    Tokenizes and encodes each review text into a sequence of token IDs. 

    Args:
    text (str) - review text to be tokenized and encoded
    vocab (dict) - vocabulary as {token: code}

    Returns:
    encoded_text (list of ints)
    """

    tokenized_text = tokenize_review(text)

    encoded_text = []
    for token in tokenized_text:
        if token in vocab:
            encoded_text.append(vocab[token])
        else:
            encoded_text.append(vocab['<unk>'])
    
    return encoded_text

In [31]:
def pad_or_truncate(encoded_text, max_len):
    """
    Pads or truncates the encoded text so that its length equals max_len.

    Args:
    encoded_text (list of ints) - input encoded review text
    max_len (int) - maximum length specified

    Returns:
    list of ints of length max_len (encoded_text truncated, or extended with '<pad>' ids)
    """

    if len(encoded_text) > max_len:
        return encoded_text[:max_len]
    else:
        return encoded_text + [1]*(max_len - len(encoded_text)) # 1 corresponds to value for '<pad>' token
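A quick way to check the two helpers is to run them on a short sentence with a tiny vocabulary. The sketch below restates the helpers in compact form (equivalent in behavior under the assumption that `<pad>` has id 1) and uses a hypothetical five-entry vocabulary.

```python
import re

def tokenize_review(review):
    return re.findall(r'\b\w+\b', review.lower())

def encode_text(text, vocab):
    # unknown tokens fall back to the '<unk>' id
    return [vocab.get(token, vocab['<unk>']) for token in tokenize_review(text)]

def pad_or_truncate(encoded_text, max_len, pad_id=1):
    # cut to max_len, then fill any remaining positions with the pad id
    return encoded_text[:max_len] + [pad_id] * max(0, max_len - len(encoded_text))

# hypothetical vocabulary for illustration only
toy_vocab = {'<unk>': 0, '<pad>': 1, 'the': 2, 'plot': 3, 'is': 4}

encoded = encode_text("The plot is weak", toy_vocab)  # 'weak' is out of vocabulary
padded = pad_or_truncate(encoded, max_len=6)          # shorter than 6, so padded
truncated = pad_or_truncate(encoded, max_len=2)       # longer than 2, so cut

print(encoded, padded, truncated)
```

Every sequence comes out with exactly `max_len` entries, which is what allows the reviews to be stacked into a single tensor afterwards.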


In [33]:
MAX_SEQ_LENGTH = 128

# take all texts, encode them and pad/truncate based on MAX_SEQ_LENGTH using the helper functions defined above
padded_text_seqs = []
for text in train_texts:
    encoded_text = encode_text(text, vocab)
    padded_text = pad_or_truncate(encoded_text, MAX_SEQ_LENGTH)
    padded_text_seqs.append(padded_text)

In [35]:
# convert text sequences to tensors
X_tensor = torch.tensor(padded_text_seqs)
y_tensor = torch.tensor(train_labels, dtype=torch.long)

# organize the input and label tensors to be loaded in batches
train_dataset = TensorDataset(X_tensor, y_tensor)
train_dataloader = DataLoader(dataset=train_dataset, batch_size=16, shuffle=True)
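As a rough check on what the loader produces, the number of batches per epoch follows from the dataset size and batch size. The arithmetic below assumes the 369 training rows reported by `info()` and the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

n_samples = 369   # rows in the training set, from train_review_df.info()
batch_size = 16   # as passed to the DataLoader

# with the last partial batch kept, the batch count is the ceiling of the ratio
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - batch_size * (n_batches - 1)

print(n_batches, last_batch_size)
```

This gives 24 batches, the last of which holds a single review, so `len(train_dataloader)` should be 24 under these assumptions.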

## Training a Simple Neural Network

The first text classification model is a simple neural network with an embedding layer and a hidden layer. 

In [80]:
torch.manual_seed(42) # set random seed --do not change!

class SimpleNNWithEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        # initialize model with an embedding layer, a hidden layer, and the output
        super(SimpleNNWithEmbedding, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_size)
        self.fc1 = nn.Linear(embed_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x is the input
        x = self.embedding(x)       # 1. input -> embedding layer
        x = torch.mean(x, dim=1)    # 2. average the embedding into 1 representation
        x = self.fc1(x)             # 3. averaged embedding output -> layer 1
        x = torch.relu(x)           # 4. apply activation function
        x = self.fc2(x)             # 5. relu output -> layer 2
        return x                    # output as class
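Step 2 of the forward pass averages over all sequence positions, including the `<pad>` positions added during pre-processing. The sketch below is a variant, not what the notebook does: it shows on a tiny made-up batch how the plain mean and a padding-aware (masked) mean differ. The embedding sizes and token ids are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 1  # same '<pad>' id as in the vocabulary above

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# two hypothetical encoded reviews: the first has two real tokens and two pads
x = torch.tensor([[5, 6, 1, 1],
                  [7, 8, 9, 2]])           # shape (batch=2, seq_len=4)

emb = embedding(x)                          # shape (2, 4, 4)
plain_mean = emb.mean(dim=1)                # averages pad positions too

mask = (x != PAD_ID).unsqueeze(-1).float()  # shape (2, 4, 1); 1 for real tokens
masked_mean = (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

print(plain_mean.shape, masked_mean.shape)
```

For the second review, which has no padding, both means agree; for the first, the masked mean uses only the two real tokens. Whether masking would help on this dataset is not tested here.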

In [81]:
vocab_size = len(vocab)
embed_size = 50 
hidden_size = 100 
output_size = n_aspects

text_classifier_nn = SimpleNNWithEmbedding(vocab_size, embed_size, hidden_size, output_size)
print(text_classifier_nn)

SimpleNNWithEmbedding(
  (embedding): Embedding(1002, 50)
  (fc1): Linear(in_features=50, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=3, bias=True)
)


In [82]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(text_classifier_nn.parameters(), lr=0.005)

In [83]:
def train_model(model, train_loader, criterion, optimizer, num_epochs):
    # train the model for the given number of epochs
    for i in range(num_epochs):
        model.train() # set model to training mode
        epoch_loss = 0.0

        # pass each batch of the loader through the model to train it 
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()               # reset gradients at every pass
            output = model(batch_X)             # forward pass
            loss = criterion(output, batch_y)   # calculate loss
            loss.backward()                     # backward pass
            optimizer.step()                    # change weights and biases
            epoch_loss += loss.item()

        avg_loss = epoch_loss / len(train_loader)
        if (i + 1) % 5 == 0:
            print(f"[Epoch {i + 1}/{num_epochs}], Average CE Loss: {avg_loss:.5f}")

In [84]:
train_model(text_classifier_nn, train_dataloader, criterion, optimizer, num_epochs=50) # train model over 50 epochs

[Epoch 5/50], Average CE Loss: 0.98320
[Epoch 10/50], Average CE Loss: 0.74018
[Epoch 15/50], Average CE Loss: 0.40746
[Epoch 20/50], Average CE Loss: 0.15354
[Epoch 25/50], Average CE Loss: 0.07105
[Epoch 30/50], Average CE Loss: 0.03629
[Epoch 35/50], Average CE Loss: 0.02796
[Epoch 40/50], Average CE Loss: 0.01198
[Epoch 45/50], Average CE Loss: 0.01633
[Epoch 50/50], Average CE Loss: 0.01190
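For context when reading these loss values, it helps to know the cross-entropy of a classifier that has learned nothing. The sketch below computes it for 3 classes in plain Python; the helper name `cross_entropy` is illustrative and is not the PyTorch function.

```python
import math

def cross_entropy(probs, true_class):
    # negative log-probability assigned to the correct class
    return -math.log(probs[true_class])

n_classes = 3
uniform = [1.0 / n_classes] * n_classes   # equal probability for every class
chance_loss = cross_entropy(uniform, true_class=0)

print(round(chance_loss, 4))
```

The result is ln 3, about 1.0986. A loss near 1.0 therefore means the model is only slightly better than guessing, while values around 0.01 mean it assigns almost all probability to the correct training label.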


### Evaluate Neural Network
The neural network will be evaluated on a test set, which is first imported and preprocessed in the same way as the training set (reusing the vocabulary built from the training data).

In [85]:
# prepare test set
test_review_df = pd.read_csv("datasets/imdb_movie_reviews_test.csv")

test_texts = test_review_df["review"].tolist()
test_labels = test_review_df["aspect_encoded"].tolist()

padded_text_seqs_test = []
for text in test_texts:
    encoded_text = encode_text(text, vocab)
    padded_text = pad_or_truncate(encoded_text, MAX_SEQ_LENGTH)
    padded_text_seqs_test.append(padded_text)

X_test_tensor = torch.tensor(padded_text_seqs_test)
y_test_tensor = torch.tensor(test_labels, dtype=torch.long)

test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_dataloader = DataLoader(dataset=test_dataset, batch_size=16, shuffle=False)

In [86]:
def get_predictions_and_probabilities(model, test_dataloader):

    model.eval() # set model to eval 

    all_probs = [] # stores all of the predicted probabilities for the testing dataset
    all_labels = [] # stores all of the predicted labels for the testing dataset

    with torch.no_grad():
        for batch_X, batch_y in test_dataloader:
            outputs = model(batch_X)                        # get outputs from forward pass
            probs = F.softmax(outputs, dim=1)               # generate predicted probabilities
            all_probs.extend(probs.cpu().numpy())
            predicted_labels = torch.argmax(outputs, dim=1) # select class label w highest probability
            all_labels.extend(predicted_labels.cpu().numpy())

    return all_probs, all_labels 
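The function takes the argmax of the raw outputs (logits) rather than of the softmax probabilities. That is safe because softmax preserves the ordering of its inputs. A minimal plain-Python sketch with made-up logits illustrates this; the `softmax` and `argmax` helpers here are illustrative stand-ins for the PyTorch calls.

```python
import math

def softmax(logits):
    # subtract the maximum first for numerical stability
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.3, 2.5]   # hypothetical model outputs for one review
probs = softmax(logits)

print(probs, argmax(logits), argmax(probs))
```

The probabilities sum to 1 and the predicted class is the same either way, so the softmax is only needed when the probabilities themselves are of interest.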

In [None]:
pred_probs, pred_labels = get_predictions_and_probabilities(text_classifier_nn, test_dataloader) # get predictions

In [None]:
# evaluate predictions
conf_matrix = confusion_matrix(test_labels, pred_labels)
print("Confusion matrix:\n", conf_matrix)

report = classification_report(test_labels, pred_labels)
print("\nClassification Report\n", report)

Confusion matrix:
 [[27 18  4]
 [ 1 35  2]
 [ 2 15 28]]

Classification Report
               precision    recall  f1-score   support

           0       0.90      0.55      0.68        49
           1       0.51      0.92      0.66        38
           2       0.82      0.62      0.71        45

    accuracy                           0.68       132
   macro avg       0.75      0.70      0.68       132
weighted avg       0.76      0.68      0.69       132



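Based on these values, the neural network reaches an overall accuracy of 68% on the test set. Most errors are reviews misclassified as class 1 (Characters), which has high recall (0.92) but low precision (0.51). Since the training loss fell to about 0.01, the gap between training and test performance suggests overfitting, so there is room for improvement.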
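The figures in the classification report can be recomputed directly from the confusion matrix, which makes it clearer where they come from. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# confusion matrix for the simple neural network, copied from the output above
conf_matrix = [
    [27, 18, 4],
    [1, 35, 2],
    [2, 15, 28],
]

n = len(conf_matrix)
precisions, recalls = [], []
for k in range(n):
    true_pos = conf_matrix[k][k]
    predicted_k = sum(conf_matrix[r][k] for r in range(n))  # column sum
    actual_k = sum(conf_matrix[k])                          # row sum
    precisions.append(true_pos / predicted_k)
    recalls.append(true_pos / actual_k)

total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[k][k] for k in range(n)) / total

for k in range(n):
    print(f"class {k}: precision={precisions[k]:.2f}, recall={recalls[k]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Column 1 shows the pattern behind the low precision for class 1: 18 reviews of class 0 and 15 of class 2 were predicted as class 1.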

## Fine-tuning a TinyBERT Transformer
A TinyBERT model will be fine-tuned to classify movie reviews, in the hope that it performs better than the simple neural network. The pre-trained model and its tokenizer are loaded from Hugging Face.

In [90]:
# load pre-trained tinyBERT model 
model_name = 'huawei-noah/TinyBERT_General_4L_312D'

tinybert_tokenizer = BertTokenizer.from_pretrained(model_name) # tokenizer
text_classifier_tinybert = BertForSequenceClassification.from_pretrained(model_name, num_labels=n_aspects) # model 



In [92]:
# freeze all parameters in pre-trained tinyBERT
for param in text_classifier_tinybert.bert.parameters():
    param.requires_grad = False

# unfreeze the classification layer added on top of the pre-trained model
for param in text_classifier_tinybert.classifier.parameters():
    param.requires_grad = True

# unfreeze the last encoder layer (layer[3], the 4th of the 4 layers in this 4L model)
for param in text_classifier_tinybert.bert.encoder.layer[3].parameters():
    param.requires_grad = True
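One way to confirm that a freeze/unfreeze pattern did what was intended is to count trainable parameters. The sketch below uses a small stand-in module (the `ToyModel` class and its layer sizes are assumptions, not TinyBERT) so that it runs without downloading anything. The `count_parameters` helper is hypothetical, but it should work on any `nn.Module`, including `text_classifier_tinybert`.

```python
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in with a 4-layer backbone and a 3-class head (not TinyBERT)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
        self.classifier = nn.Linear(8, 3)

def count_parameters(model):
    # returns (number of trainable parameters, total number of parameters)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

toy = ToyModel()

# same pattern as above: freeze the backbone, then unfreeze its last layer
for param in toy.backbone.parameters():
    param.requires_grad = False
for param in toy.backbone[3].parameters():
    param.requires_grad = True

trainable, total = count_parameters(toy)
print(f"{trainable} of {total} parameters are trainable")
```

In the toy case only the last backbone layer and the classifier head remain trainable, which mirrors the intent of the cell above.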

In [94]:
MAX_SEQ_LENGTH_TINYBERT = 124
# create tokenized training set
X_train = tinybert_tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt", max_length=MAX_SEQ_LENGTH_TINYBERT)
y_train = torch.tensor(train_labels, dtype=torch.long)

train_dataset = TensorDataset(X_train["input_ids"], X_train["attention_mask"], y_train)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
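With `padding=True`, the tokenizer pads every sequence to the longest one in the list it receives and returns an `attention_mask` marking real tokens with 1 and padding with 0. The sketch below is a conceptual plain-Python illustration of that relationship, not the tokenizer's implementation; the `pad_batch` helper, the token ids, and the pad id of 0 are assumptions.

```python
def pad_batch(sequences, pad_id=0):
    # pad every sequence to the length of the longest one and build matching masks
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# hypothetical token ids for two reviews of different lengths
batch = [[11, 42, 12], [11, 42, 57, 63, 12]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)
print(attention_mask)
```

The model uses the mask to ignore padded positions in attention, which is why both `input_ids` and `attention_mask` are stored in the `TensorDataset`.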

In [95]:
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, text_classifier_tinybert.parameters()), lr=0.0025)
criterion = nn.CrossEntropyLoss()

In [97]:
num_epochs = 10
for i in range(num_epochs):
    text_classifier_tinybert.train()
    total_loss = 0.0

    for batch_X, batch_attention_mask, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = text_classifier_tinybert(input_ids=batch_X, attention_mask=batch_attention_mask)
        logits = outputs.logits
        loss = criterion(logits, batch_y)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {i+1}/{num_epochs}, Loss: {avg_loss}')

Epoch 1/10, Loss: 1.000056820611159
Epoch 2/10, Loss: 0.5841939138869444
Epoch 3/10, Loss: 0.39991521059225005
Epoch 4/10, Loss: 0.35998631330827874
Epoch 5/10, Loss: 0.33731860884775716
Epoch 6/10, Loss: 0.23576063445458809
Epoch 7/10, Loss: 0.3799276165664196
Epoch 8/10, Loss: 0.30424073167766136
Epoch 9/10, Loss: 0.27067273389548063
Epoch 10/10, Loss: 0.21834331766391793


### Evaluate TinyBERT

In [101]:
# prepare test set
X_test = tinybert_tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt", max_length=MAX_SEQ_LENGTH_TINYBERT)
y_test = torch.tensor(test_labels, dtype=torch.long)

test_dataset = TensorDataset(X_test['input_ids'], X_test['attention_mask'], y_test)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

In [102]:
# generate predictions for test set
text_classifier_tinybert.eval()
pred_probs = []
pred_labels = []

with torch.no_grad():
    for batch_X, batch_attention_mask, batch_y in test_loader:
        outputs = text_classifier_tinybert(input_ids= batch_X, attention_mask= batch_attention_mask)
        logits = outputs.logits
        probs = F.softmax(logits, dim=1)
        pred_probs.extend(probs.cpu().numpy())
        predicted_labels = torch.argmax(logits, dim=1)
        pred_labels.extend(predicted_labels.cpu().numpy())

In [103]:
# evaluate predictions
conf_matrix = confusion_matrix(test_labels, pred_labels)
print("Confusion matrix:\n", conf_matrix)

report = classification_report(test_labels, pred_labels)
print("\nClassification Report\n", report)

Confusion matrix:
 [[45  0  4]
 [ 1 35  2]
 [ 0  2 43]]

Classification Report
               precision    recall  f1-score   support

           0       0.98      0.92      0.95        49
           1       0.95      0.92      0.93        38
           2       0.88      0.96      0.91        45

    accuracy                           0.93       132
   macro avg       0.93      0.93      0.93       132
weighted avg       0.93      0.93      0.93       132



The TinyBERT model reaches an accuracy of 93% on the test set. This is 25 percentage points higher than the simple neural network's 68%, a relative improvement of roughly 37%.
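The gap between the two models can be expressed in absolute or relative terms. The short calculation below derives both from the two confusion matrices reported above; the `accuracy` helper is illustrative.

```python
# confusion matrices copied from the two evaluation outputs above
nn_matrix = [[27, 18, 4], [1, 35, 2], [2, 15, 28]]
tinybert_matrix = [[45, 0, 4], [1, 35, 2], [0, 2, 43]]

def accuracy(matrix):
    # correct predictions lie on the diagonal
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

nn_acc = accuracy(nn_matrix)
tinybert_acc = accuracy(tinybert_matrix)

absolute_gain = tinybert_acc - nn_acc              # in accuracy points
relative_gain = (tinybert_acc - nn_acc) / nn_acc   # relative to the baseline

print(f"simple NN: {nn_acc:.3f}, TinyBERT: {tinybert_acc:.3f}")
print(f"absolute gain: {absolute_gain:.2f}, relative gain: {relative_gain:.1%}")
```

TinyBERT classifies 123 of 132 test reviews correctly versus 90 for the simple network, a gain of 33 reviews.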