# Recurrent Neural Networks - Sentiment Analysis

In this notebook, my goal is to implement an RNN in PyTorch that analyzes the opinion expressed in a sentence and classifies it as positive or negative.

I will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.

Okay, let's jump in!

# Loading and Preprocessing the data

For this exercise, I will use the IMDb movie reviews dataset. The multilayer RNN to be implemented will analyze those reviews and classify each one as positive (1) or negative (0).

Thankfully, the IMDb movie reviews dataset is available in PyTorch through the `torchtext.datasets` module.

In [1]:
#pip install torchtext
#pip install 'portalocker>=2.0.0'
from torchtext.datasets import IMDB

train_dataset = IMDB(split='train')
test_dataset = IMDB(split='test')

The previous two lines return iterators, specifically `ShardingFilterIterDataPipe` objects. As I am writing this, the PyTorch documentation says:

> The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`.

In case something changes that causes the code in this notebook to break, please let me know and I will update it.

With that out of the way, I would like to print the first training example, and see what it looks like:

In [2]:
list(train_dataset)[0]

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

`train_dataset` is an iterator over tuples, where _each_ tuple holds the sentiment label followed by the review text.

Let's start our preprocessing by splitting the training partition of the dataset into training and validation partitions.

In [3]:
import torch
import torch.nn as nn
from torch.utils.data.dataset import random_split
torch.manual_seed(1)

train_dataset, valid_dataset = random_split(
    list(train_dataset), [20000, 5000]
)
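To make the idea of a seeded split concrete, here is a minimal plain-Python sketch of what such a split does conceptually. It is not how `random_split` is implemented and it will not reproduce the same partition; the helper name `toy_random_split` and the toy samples are assumptions made up for illustration.

```python
import random

def toy_random_split(samples, lengths, seed=1):
    """Shuffle indices with a fixed seed, then cut them into consecutive chunks."""
    if sum(lengths) != len(samples):
        raise ValueError("lengths must sum to the number of samples")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    splits, start = [], 0
    for length in lengths:
        chunk = indices[start:start + length]
        splits.append([samples[i] for i in chunk])
        start += length
    return splits

# Hypothetical (label, text) samples standing in for the IMDb reviews.
toy_samples = [(1, f"review {i}") for i in range(10)]
toy_train, toy_valid = toy_random_split(toy_samples, [8, 2])
print(len(toy_train), len(toy_valid))
```

The two partitions never overlap, and calling the helper again with the same seed gives the same split, which is the property `torch.manual_seed(1)` is used for above.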

My second step is to **find the number of unique tokens (words)** in the training dataset, also known as the **vocab size**.

To help us achieve this goal, I use the `tokenizer` helper that I write in the following cell. Its job is to remove punctuation, HTML markup, and other non-letter characters from each review in the dataset. Let's do this:

In [4]:
import re #regular expression
from collections import Counter, OrderedDict

def tokenizer(text):
    # Remove HTML tags from "text".
    text = re.sub(r'<[^>]*>', '', text)

    # Find all emoticons in the lowercased text and collect them in a list.
    # Note: because the text is lowercased first, the 'D' and 'P' alternatives
    # (as in ":D" or ":P") never match here.
    emoticons = re.findall(
        r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
    )

    # Replace runs of non-word characters (roughly [^a-zA-Z0-9_]) in the
    # lowercased text with a single space, then append the emoticons,
    # joined by spaces and with any '-' removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    
    tokens = text.split()
    return tokens

If you still have a hard time understanding what the regular expressions do despite the comments, I suggest you go to the [Regex101](https://regex101.com/) website, paste a test string, and see what each individual regular expression in the `tokenizer` function matches. For instance, I used the training example shown earlier as my test string, added some emoticons like ":-)" and ":-(", and played around.
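As a quick illustration of the three steps, here is a small self-contained sketch. `demo_tokenizer` is a copy of the helper above under a hypothetical name (with raw-string patterns), and the sample sentence is made up for the example.

```python
import re

def demo_tokenizer(text):
    # Step 1: strip HTML tags such as "<br />".
    text = re.sub(r'<[^>]*>', '', text)
    # Step 2: collect emoticons from the lowercased text.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Step 3: replace non-word runs with a space, then append the emoticons
    # (with '-' removed) at the end of the string.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

sample = "This movie was great :-) <br />Loved it!"
print(demo_tokenizer(sample))
# ['this', 'movie', 'was', 'great', 'loved', 'it', ':)']
```

Note how the HTML tag and the punctuation disappear, everything is lowercased, and the emoticon is moved to the end of the token list without its "nose".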

With our `tokenizer` helper written, let's determine the vocab size of the training dataset.

In [5]:
token_counts = Counter()

for label, line in train_dataset:
    tokens = tokenizer(line)
    token_counts.update(tokens)

print('Training dataset vocab size:', len(token_counts))

Training dataset vocab size: 69023


Nice! Let's continue. Now, for my third step I want to **assign a unique integer to each individual token** we detected in the previous step. You may think of this unique integer as the "id" of the token.

In [6]:
from torchtext.vocab import vocab

sorted_by_freq_tuples = sorted(
    token_counts.items(), key=lambda x: x[1], reverse=True
)

ordered_dict = OrderedDict(sorted_by_freq_tuples)

#vocab object maps tokens to indices
vocab = vocab(ordered_dict)

vocab.insert_token("<pad>", 0) #padding token to adjust sequence length
vocab.insert_token("<unk>", 1) #unknown tokens are assigned integer 1
vocab.set_default_index(1)

`token_counts.items()` returns a view of tuples where each tuple is a (token, frequency) pair. The `key` parameter is a function that specifies which value is used to compare two elements. In this case, we want to compare tuples by looking at their second item (i.e. the frequency). Lastly, the `reverse` parameter specifies that we want the elements to be sorted in descending order. Using the `sorted` function, we are sorting the tokens in the vocab from the most common to the least common.

Then, we use the sorted list to create an `OrderedDict`. An "ordered" dictionary is a dictionary that preserves the insertion order of its key-value pairs. If you update the value of an existing key, the order remains unchanged. If you remove an item and reinsert it, the item is added at the end of the dictionary.
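The sorting and ordering behaviour described above can be checked on a tiny example. This is only a sketch using the standard library; the toy sentence and the `toy_*` names are assumptions for illustration.

```python
from collections import Counter, OrderedDict

toy_counts = Counter("the cat sat on the mat the end".split())

# Sort (token, frequency) pairs from most to least frequent.
toy_sorted = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)
toy_ordered = OrderedDict(toy_sorted)
print(toy_sorted[0])  # ('the', 3)

# Updating the value of an existing key keeps its position.
toy_ordered['the'] = 99
first_after_update = next(iter(toy_ordered))
print(first_after_update)  # the

# Removing a key and reinserting it moves it to the end.
del toy_ordered['the']
toy_ordered['the'] = 3
last_after_reinsert = list(toy_ordered)[-1]
print(last_after_reinsert)  # the
```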

Finally, the `vocab` object is created, which assigns an integer value to each token in the ordered dictionary. Let's print the integer values associated with some of the words found in the vocab:

In [7]:
[vocab[token] for token in ['this', 'is', 'an', 'example'] ]

[11, 7, 35, 457]
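To build some intuition for the mapping, here is a plain-Python stand-in that mirrors the index layout used above: `<pad>` at 0, `<unk>` at 1, then the tokens in frequency order. It is not the torchtext implementation; `build_toy_vocab`, `toy_lookup`, and the token list are hypothetical names and data for illustration.

```python
def build_toy_vocab(tokens_by_frequency):
    """Map special tokens to 0 and 1, then tokens in the given order to 2, 3, ..."""
    token_to_index = {"<pad>": 0, "<unk>": 1}
    for token in tokens_by_frequency:
        token_to_index[token] = len(token_to_index)
    return token_to_index

def toy_lookup(token_to_index, token):
    # Unknown tokens fall back to the index of "<unk>",
    # which plays the role of set_default_index(1).
    return token_to_index.get(token, token_to_index["<unk>"])

toy_vocab = build_toy_vocab(["the", "movie", "was", "great"])
print([toy_lookup(toy_vocab, t) for t in ["the", "great", "zzz"]])
# [2, 5, 1]
```

The most frequent token gets index 2 because the two special tokens occupy the first two slots, and a token that was never seen ("zzz") maps to 1.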

Our tokens have been assigned unique integer values. Our next step is to generate batches of examples using `DataLoader`.

I start by writing two helper functions: `text_pipeline` and `label_pipeline`. `text_pipeline` expects a string and returns a list with the integer value of EACH token in that string. `label_pipeline`, on the other hand, accepts a raw label (1 or 2), where 1 means negative and 2 means positive. It returns 1 if the label is 2 (positive) and 0 otherwise.

In [8]:
text_pipeline = lambda text: [vocab[token] for token in tokenizer(text)]
label_pipeline = lambda label: 1 if label == 2 else 0

I continue with the `collate_batch` function. This function, implemented in the next Python cell, is meant to be passed as the `collate_fn` argument of the `DataLoader` constructor. `collate_fn` is the function the data loader uses to turn the list of samples in a batch into tensors. As a result, `collate_batch` accepts a `batch` parameter, which is a list with _all_ the samples in a batch.

In [9]:
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []

    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        
        processed_text = torch.tensor(text_pipeline(_text),
                                      dtype=torch.int64)
        
        text_list.append(processed_text)
        lengths.append(processed_text.size(0)) #processed_text is a 1-D tensor
    
    label_list = torch.tensor(label_list, dtype=torch.float32)
    lengths = torch.tensor(lengths, dtype=torch.float32)
    padded_text_list = nn.utils.rnn.pad_sequence(
        text_list, batch_first=True
    )
    return padded_text_list, label_list, lengths
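To see where a `collate_fn` fits in, the following plain-Python sketch imitates, very roughly, what a data loader does without shuffling: it cuts the dataset into chunks and hands each chunk (a list of samples) to the collate function. The real `DataLoader` does much more (sampling, worker processes, and so on); `toy_batches`, `toy_collate`, and the samples are hypothetical and only for illustration.

```python
def toy_batches(samples, batch_size, collate_fn):
    """Yield collate_fn(list_of_samples) for consecutive chunks of `samples`."""
    for start in range(0, len(samples), batch_size):
        yield collate_fn(samples[start:start + batch_size])

def toy_collate(batch):
    # `batch` is a list of (label, text) tuples, one per sample.
    labels = [label for label, _ in batch]
    texts = [text for _, text in batch]
    return texts, labels

toy_samples = [(2, "good"), (1, "bad"), (2, "fine"), (1, "dull"), (2, "fun")]
for texts, labels in toy_batches(toy_samples, batch_size=2, collate_fn=toy_collate):
    print(texts, labels)
```

With five samples and a batch size of 2, the last batch contains a single sample, which is also what a data loader does by default when the dataset size is not a multiple of the batch size.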

Let's generate a small batch of 4 training examples and use our `collate_batch` function.

In [10]:
from torch.utils.data import DataLoader

dataloader = DataLoader(train_dataset, batch_size=4,
                        shuffle=False, collate_fn=collate_batch)

We created the dataloader. If I print the tensors returned by `collate_batch` for the first batch, we'll see that the function was indeed used to process the samples in the batch. Let's go ahead and do that:

In [11]:
text_batch, label_batch, length_batch = next(iter(dataloader))
print(text_batch)

tensor([[   35,  1739,     7,   449,   721,     6,   301,     4,   787,     9,
             4,    18,    44,     2,  1705,  2460,   186,    25,     7,    24,
           100,  1874,  1739,    25,     7, 34415,  3568,  1103,  7517,   787,
             5,     2,  4991, 12401,    36,     7,   148,   111,   939,     6,
         11598,     2,   172,   135,    62,    25,  3199,  1602,     3,   928,
          1500,     9,     6,  4601,     2,   155,    36,    14,   274,     4,
         42945,     9,  4991,     3,    14, 10296,    34,  3568,     8,    51,
           148,    30,     2,    58,    16,    11,  1893,   125,     6,   420,
          1214,    27, 14542,   940,    11,     7,    29,   951,    18,    17,
         15994,   459,    34,  2480, 15211,  3713,     2,   840,  3200,     9,
          3568,    13,   107,     9,   175,    94,    25,    51, 10297,  1796,
            27,   712,    16,     2,   220,    17,     4,    54,   722,   238,
           395,     2,   787,    32,    27,  5236,  

In [12]:
#label of each sequence in the batch
print(label_batch)

tensor([1., 1., 1., 0.])


In [13]:
#length of each sequence in the batch
print(length_batch)

tensor([165.,  86., 218., 145.])


From the previous output, I can see that the longest review in the batch has 218 tokens. Consequently, the other examples in the mini-batch are zero-padded so that they all have the same length and can be stored efficiently in a single tensor.

In the following Python cell for instance, I print the shape of the mini-batch to illustrate this point.

In [14]:
print(text_batch.shape)

torch.Size([4, 218])
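The zero-padding itself is easy to picture with plain lists. The sketch below right-pads every sequence to the length of the longest one, which is the effect `pad_sequence` has on our batch; `toy_pad` and the toy ids are hypothetical and only illustrate the idea.

```python
def toy_pad(sequences, padding_value=0):
    """Right-pad every list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

toy_ids = [[35, 7, 449], [11], [4, 18]]
padded = toy_pad(toy_ids)
print(padded)                       # [[35, 7, 449], [11, 0, 0], [4, 18, 0]]
print(len(padded), len(padded[0]))  # 3 3
```

This is also why index 0 was reserved for the `<pad>` token in the vocab: a 0 in the batch never collides with a real word.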


Many things have happened, at least in my opinion. We loaded the training and testing partitions of the IMDb dataset from the `torchtext.datasets` module. We then further split the training partition into training and validation partitions.

I then tokenized the training examples and assigned a unique integer to each token discovered, using torchtext's `vocab` object.

I finished by creating a small batch of 4 examples to show that everything is working.

Finally, let's wrap all three datasets (training, validation, and testing) in proper data loaders with a batch size of 32:

In [15]:
batch_size = 32

train_dl = DataLoader(train_dataset, batch_size=batch_size,
                      shuffle=True, collate_fn=collate_batch)

valid_dl = DataLoader(valid_dataset, batch_size=batch_size,
                      shuffle=False, collate_fn=collate_batch)

test_dl = DataLoader(test_dataset, batch_size=batch_size,
                     shuffle=False, collate_fn=collate_batch)

I have proper dataloaders now! The data is in a suitable format for the RNN, and I am finally ready to move on to something else. The next thing I want to discuss is feature **embedding**, an optional but highly recommended preprocessing step that is used to reduce the dimensionality of the word vectors.

# Embedding layers for sentence encoding

The previous section was dedicated to loading the dataset and preprocessing it. At this point, each sequence of words (i.e. each review) has been turned into a sequence of integer values that correspond to the indices of unique words. The datasets have then been wrapped in `DataLoader`s that we can iterate over to extract batches. But are we ready to feed those integer sequences into an RNN? 🤔

No. Really, No.