# IMDB reviews sentiment analysis

Data: https://www.kaggle.com/utathya/imdb-review-dataset

The steps here are taken and adapted from https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/05_DL_PR/deep-learning-practical-lesson.ipynb

In [231]:
import torch
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

In [232]:
reviews = pd.read_csv("data/imdb.csv", encoding="ISO-8859-2", index_col=0)
reviews = reviews.drop(columns=["type", "file"])

In [233]:
reviews.head()

Unnamed: 0,review,label
0,Once again Mr. Costner has dragged out a movie...,neg
1,This is an example of why the majority of acti...,neg
2,"First of all I hate those moronic rappers, who...",neg
3,Not even the Beatles could write songs everyon...,neg
4,Brass pictures (movies is not a fitting word f...,neg


In [234]:
reviews["label"] = preprocessing.LabelEncoder().fit_transform(reviews["label"])

In [235]:
# Watch out! We have 3 classes here: neg, pos, unsup
reviews["label"].unique()

array([0, 1, 2])
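
`LabelEncoder` assigns the integer ids in sorted order of the class names, so the three ids above should correspond to neg, pos and unsup. A minimal sketch on toy labels (the list is made up for illustration) shows how the mapping can be read back from a fitted encoder:

```python
from sklearn import preprocessing

# Toy labels for illustration only; the real column comes from the dataset
toy_labels = ["neg", "pos", "unsup", "pos", "neg"]

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the original names, ordered by their integer id
label_mapping = {str(name): idx for idx, name in enumerate(encoder.classes_)}
print(label_mapping)
print(encoded.tolist())
```

Keeping the fitted encoder in a variable, instead of discarding it as in the cell above, also makes `inverse_transform` available later.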

In [236]:
# For faster debugging, uncomment the next line to work on a random subset of the reviews

#reviews = reviews.sample(10000)

In [237]:
# set a fixed random seed number for reproducibility
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [238]:
train_df, val_df = train_test_split(reviews, test_size=0.3, random_state=SEED)
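
The split above is purely random. If the class proportions should be the same in both parts, `train_test_split` also accepts a `stratify` argument. The following is only a sketch on toy labels (made up for illustration), not what this notebook does:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels for illustration only
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
toy_items = list(range(len(toy_labels)))

train_items, val_items, train_labels, val_labels = train_test_split(
    toy_items, toy_labels, test_size=0.3, random_state=1234, stratify=toy_labels
)

# With stratification each class keeps its 50/30/20 share in both parts
val_counts = Counter(val_labels)
print(sorted(val_counts.items()))
```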
    

## Transform sentences into vectors with CountVectorizer

To transform sentences into count vectors, we first need a vocabulary which assigns each word an id. Then we can use the vocabulary to transform a sentence into a vector.

In [239]:
from sklearn.feature_extraction.text import CountVectorizer

In [240]:
# Here is a small example of what we want to achieve

def minimal_example():
    # We start with a corpus of sentences
    corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    ]
    # Then we fit our CountVectorizer on the corpus
    vectorizer = CountVectorizer()
    vectorizer.fit(corpus)
    # Every distinct word in the corpus becomes one feature of the vocabulary
    print(vectorizer.get_feature_names_out())
    sorted_vocabulary = dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
    print(sorted_vocabulary)
    
    # Now we can use our mapping to transform sentences into vectors
    # Let's transform the first two sentences
    print(vectorizer.transform(corpus[:2]).toarray())
    

    
minimal_example()

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]]


For example, the first sentence contains 0 occurrences of _and_, 1 of _document_, 1 of _first_, and so on.
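
To make the counting explicit, the same two vectors can be rebuilt by hand with the standard library. This is only a sketch: `simple_tokenize` and `count_vector` are hypothetical helpers, and the regular expression is an approximation of the default tokenization (lowercasing, tokens of at least two word characters).

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def simple_tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to its column index
vocabulary = {
    word: idx
    for idx, word in enumerate(
        sorted({tok for doc in corpus for tok in simple_tokenize(doc)})
    )
}

def count_vector(text):
    counts = Counter(simple_tokenize(text))
    # Words outside the vocabulary are simply ignored
    return [counts.get(word, 0) for word in vocabulary]

print(count_vector(corpus[0]))
print(count_vector(corpus[1]))
```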

In [241]:
vectorizer = CountVectorizer()
vectorizer.fit(train_df["review"])

CountVectorizer()

In [242]:
list(vectorizer.vocabulary_.keys())[:10]

['although',
 'the',
 'plot',
 'of',
 'cover',
 'girl',
 'is',
 'very',
 'flimsy',
 'and']

We could now simply use the built-in tokenizer of `CountVectorizer` to create the vocabulary. However, this leads to many near-duplicate entries, e.g. _car_ and _cars_ (see below). To merge them we lemmatize the words; `CountVectorizer` accepts a custom tokenizer function for this.

In [243]:
sorted(list(filter(lambda x: x == "car" or x == "cars",list(vectorizer.vocabulary_.keys()))))

['car', 'cars']

We will use nltk for lemmatizing.

In [244]:
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("wordnet")
# Note: stopwords.words() additionally needs nltk.download("stopwords") if that corpus is not installed yet


from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

# Use a separate name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))
# Create the lemmatizer once instead of once per review
wnl = WordNetLemmatizer()

def tokenize(review):
    # review = one single review
    return [wnl.lemmatize(t) for t in word_tokenize(review) if t not in stop_words]


# I use the stop word filtering of nltk since sklearn states that the built-in filtering of CountVectorizer has some issues
word2idx = CountVectorizer(max_features=3000, 
                             tokenizer=tokenize).fit(train_df["review"])

[nltk_data] Downloading package punkt to /Users/max/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/max/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [245]:
# The vocab size is the length of the vocabulary, or the length of the feature vectors
VOCAB_SIZE = len(word2idx.vocabulary_)
assert VOCAB_SIZE == 3000

With this vectorizer we can transform sentences to vectors

In [246]:
word2idx.transform(["Hello my darling"]).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

What happens if a word is not in our dictionary?

In [247]:
missing_word = "Kawabunga"

assert missing_word not in word2idx.vocabulary_.keys()

vec = word2idx.transform([missing_word]).toarray()

# Since this word does not appear in our vocabulary, all entries of the vector should be 0
assert not np.any(vec)

## Batching

Now that we can represent the text as vectors, we will prepare [PyTorch dataloaders](https://pytorch.org/docs/stable/data.html), helpers that batch the data for us. They can also shuffle the data, among other conveniences.

In [248]:
# First we prepare our initial data for the model; since it is still raw text (remember, we only fitted our vectorizer 
# but did not yet transform the whole sentences) we need to transform it to vectors


train_data_vecs = word2idx.transform(train_df["review"]).toarray()
train_data_labels = train_df["label"].tolist()

val_data_vecs = word2idx.transform(val_df["review"]).toarray()
val_data_labels = val_df["label"].tolist()
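
Calling `.toarray()` turns the sparse count matrix into a dense one, so memory grows with rows × vocabulary size. A rough estimate, assuming a hypothetical 70,000 training reviews (this number is an assumption, not taken from the notebook), 64-bit integer counts, and 32-bit floats for the later tensor copy:

```python
import numpy as np

def dense_bytes(n_rows, n_features, dtype):
    # Size of a dense array: rows * columns * bytes per entry
    return n_rows * n_features * np.dtype(dtype).itemsize

VOCAB_SIZE = 3000
N_ROWS = 70_000  # hypothetical number of training reviews

as_int64 = dense_bytes(N_ROWS, VOCAB_SIZE, np.int64)
as_float32 = dense_bytes(N_ROWS, VOCAB_SIZE, np.float32)

print(f"int64 counts:   {as_int64 / 1e9:.2f} GB")
print(f"float32 tensor: {as_float32 / 1e9:.2f} GB")

# Sanity check of the formula against a real (small) array
small = np.zeros((2, VOCAB_SIZE), dtype=np.int64)
assert dense_bytes(2, VOCAB_SIZE, np.int64) == small.nbytes
```

If this becomes a problem, the commented-out `reviews.sample(...)` line near the top is the simplest way to shrink the data while debugging.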

In [249]:
# The dataloaders need the data to be a dataset; this can be either an iterable-style or a map-style dataset (check the docs for more info)
# We also need to transform the data to PyTorch tensors

# Let's take a look at an example

example = train_data_vecs[0]
print(example)
torch.FloatTensor(example)

[0 0 0 ... 0 0 0]


tensor([0., 0., 0.,  ..., 0., 0., 0.])

In [250]:
# PyTorch needs the inputs to be of type float and the labels (targets) to be of type long

train_data_vecs_tensor = torch.FloatTensor(train_data_vecs)
train_data_labels_tensor = torch.LongTensor(train_data_labels)

val_data_vecs_tensor = torch.FloatTensor(val_data_vecs)
val_data_labels_tensor = torch.LongTensor(val_data_labels)

In [251]:
# Let's combine them into lists of tuples, where each tuple is (input, label)

train_dataset = list(zip(train_data_vecs_tensor, train_data_labels_tensor))

val_dataset = list(zip(val_data_vecs_tensor, val_data_labels_tensor))


In [252]:
from torch.utils.data import DataLoader

BATCH_SIZE = 32

train_dataloader = DataLoader(
        train_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )

val_dataloader = DataLoader(
        val_dataset,
        batch_size=BATCH_SIZE,
        shuffle=False,
    )
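
What a dataloader yields is easiest to see on a toy dataset. The sketch below uses `TensorDataset` as an alternative to the `list(zip(...))` approach above; the tensors are made up (10 samples, 5 features) and only the batch shapes matter:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 "sentence vectors" with 5 features each, labels in {0, 1, 2}
toy_features = torch.zeros(10, 5)
toy_labels = torch.arange(10) % 3

toy_dataset = TensorDataset(toy_features, toy_labels)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Each batch is a pair (inputs, labels); the last batch may be smaller
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
for shapes in batch_shapes:
    print(shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one batch of 2.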

## First simple PyTorch model

Check out this awesome tutorial with more information https://pytorch.org/tutorials/beginner/nn_tutorial.html

In [253]:
from torch import nn
import torch.nn.functional as F


#Bag of Words classifier
class BoWClassifier(nn.Module):  # inheriting from nn.Module!
    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super().__init__()

        # Create one linear layer, with the input size of our vocabulary
        # (since our vector encoding is a 3000 long count-encoding)
        # and as output the number of labels, which will be 3
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, input_vec):
        # input_vec is a batch of sentence vectors with shape [batch_size, vocab_size]
        # dim=1 applies the softmax over the classes; leaving it out triggers a warning
        return F.log_softmax(self.linear(input_vec), dim=1)

In [254]:
INPUT_DIM = VOCAB_SIZE
OUTPUT_DIM = 3

In [255]:
model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)

Now we have the structure of our model: a shallow network that takes vectors of length 3000 and outputs log-softmax probabilities over 3 classes. The weights of the single linear layer still need to be learned. For this we need two things: an **optimizer** and a **loss** function. The loss function measures how far the predictions are from the true labels, its gradient tells us how each weight contributes to that error, and the optimizer uses this information to update the weights.
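
The single linear layer holds one weight per (class, vocabulary entry) pair plus one bias per class, i.e. 3000 × 3 + 3 = 9003 trainable parameters for the sizes used here. A quick check on a standalone layer of the same shape:

```python
from torch import nn

# Same sizes as the classifier above: 3000 input features, 3 classes
linear = nn.Linear(3000, 3)

n_params = sum(p.numel() for p in linear.parameters())
print(tuple(linear.weight.shape), tuple(linear.bias.shape), n_params)
```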

In [256]:
import torch.optim as optim

# The optimizer will update the weights of our model based on the loss function
# This is essential for correct training
# The _lr_ parameter is the learning rate
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()
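
`nn.NLLLoss` expects log probabilities, which is why the model ends in `log_softmax`. The combination gives the same value as cross entropy computed on the raw scores. A small check with made-up scores for two samples:

```python
import torch
import torch.nn.functional as F

# Made-up raw scores (logits) for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2])

log_probs = F.log_softmax(logits, dim=1)

# Exponentiating the log probabilities gives rows that sum to 1
row_sums = log_probs.exp().sum(dim=1)

nll = F.nll_loss(log_probs, targets)
cross_entropy = F.cross_entropy(logits, targets)

print(row_sums)
print(nll.item(), cross_entropy.item())
```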

## Training

In [257]:
from sklearn.metrics import precision_recall_fscore_support


def calculate_performance(preds, y):
    """
    Returns precision, recall and fscore of class 1 (pos) for one batch
    """
    # Get the predicted label from the log probabilities
    predicted_labels = preds.argmax(1)

    # WARNING: Tensors here could be on the GPU, so make sure to copy everything to CPU
    # sklearn expects the true labels first and the predictions second.
    # Passing labels explicitly keeps index 1 tied to class 1 even if a batch lacks a class;
    # zero_division=0 returns 0 instead of warning when a score is undefined.
    precision, recall, fscore, support = precision_recall_fscore_support(
        y.cpu(), predicted_labels.cpu(), labels=[0, 1, 2], zero_division=0
    )

    return precision[1], recall[1], fscore[1]
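
For reference, the three scores for a single class can also be computed by hand from true positives, false positives and false negatives. `precision_recall_f1` is a hypothetical helper and the labels are made up:

```python
def precision_recall_f1(y_true, y_pred, positive_class):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    # Fall back to 0.0 when a score is undefined (nothing predicted or nothing to find)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up batch: class 1 has 2 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 0, 2, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 2, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive_class=1)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")
```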

In [None]:
N_EPOCHS = 15

for epoch in range(N_EPOCHS):
    epoch_loss = 0
    epoch_prec = 0
    epoch_recall = 0
    epoch_fscore = 0
    val_epoch_loss = 0
    val_epoch_prec = 0
    val_epoch_recall = 0
    val_epoch_fscore = 0

    train_step = 0
    # Counts the validation passes of this epoch; initialised once per epoch, not once per batch
    val_step = 0
    # We calculate the error on batches so the iterator will return matrices with shape [BATCH_SIZE, VOCAB_SIZE]
    for batch in train_dataloader:
        train_step += 1
        # Set to training mode (the validation pass below switches the model to evaluation mode)
        model.train()
        # batch is a pair (text_vecs, labels)
        text_vecs = batch[0]
        labels = batch[1]

        # We reset the gradients from the last step, so they are computed for this batch only (and not added together)
        optimizer.zero_grad()

        # This runs the forward function on your model (you don't need to call it directly)
        predictions = model(text_vecs)

        # Calculate the loss and the scores on the predictions (the predictions are log probabilities, remember!)
        loss = criterion(predictions, labels)

        prec, recall, fscore = calculate_performance(predictions, labels)

        # Propagate the error back through the model:
        # calculate gradients for all parameters that require grad
        loss.backward()
        # Update the parameters
        optimizer.step()

        # We add batch-wise loss to the epoch-wise loss
        epoch_loss += loss.item()
        # We also do the same with the scores
        epoch_prec += prec.item()
        epoch_recall += recall.item()
        epoch_fscore += fscore.item()

        if train_step % 10 == 0:
            # Validate every 10th step
            val_step += 1
            # On the validation dataset we don't want training so we need to set the model to evaluation mode
            model.eval()

            # Also tell PyTorch not to track gradients
            # This is needed when you only want to make predictions and use your model in inference mode!
            with torch.no_grad():

                # The remaining part is the same, except that there is no backpropagation and no optimizer step
                for val_batch in val_dataloader:
                    val_text_vecs = val_batch[0]
                    val_labels = val_batch[1]

                    val_predictions = model(val_text_vecs)
                    # Use the validation predictions and labels here, not those of the training batch
                    val_loss = criterion(val_predictions, val_labels)

                    val_prec, val_recall, val_fscore = calculate_performance(val_predictions, val_labels)

                    val_epoch_loss += val_loss.item()
                    val_epoch_prec += val_prec.item()
                    val_epoch_recall += val_recall.item()
                    val_epoch_fscore += val_fscore.item()

    # The two sums cover different numbers of batches, so report the mean loss per batch
    n_val_batches = max(val_step * len(val_dataloader), 1)
    print(f"Epoch: {epoch+1:02}")
    print(f"\tTrain Loss: {epoch_loss / max(train_step, 1):.3f}")
    print(f"\t Val. Loss: {val_epoch_loss / n_val_batches:.3f}")
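
Why the gradients are reset before every batch can be seen on a single toy parameter: `backward()` adds to the stored gradient instead of overwriting it. A minimal sketch, independent of the model above:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# d(3*w)/dw = 3
(3 * w).sum().backward()
after_first = w.grad.clone()

# Without a reset, the second backward pass adds to the stored gradient
(3 * w).sum().backward()
after_second = w.grad.clone()

# After clearing the gradient (optimizer.zero_grad() does this for all model parameters)
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(after_first.item(), after_second.item(), after_reset.item())
```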


    

