# Recurrent Neural Networks - Sentiment Analysis

In this notebook, my goal is to implement an RNN in PyTorch that analyzes the opinion expressed in a sentence and classifies it as positive or negative.

I will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.

Okay, let's jump in!

# Loading and Preprocessing the data

For this exercise, I will use the IMDb movie reviews dataset. The multilayer RNN to be implemented will analyze those reviews and classify each one as a positive (1) or a negative (0) review.

Thankfully, the IMDb movie reviews dataset is available in PyTorch through the `torchtext.datasets` module.

In [1]:
#pip install torchtext
#pip install 'portalocker>=2.0.0'
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

The previous two lines return iterators, specifically `ShardingFilterIterDataPipe` objects. As I am writing this, the PyTorch documentation says:

> The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`.

In case something changes that causes the code in this notebook to break, please let me know and I will update it.

With that out of the way, I would like to print the first training example, and see what it looks like:

In [2]:
list(train_iter)[0]

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

`train_iter` is an iterator over tuples, where _each_ tuple holds the sentiment label followed by the review text.
The output above is the very first training example.

As before, I need to prepare the dataset before using it.

I start by splitting the training partition of the dataset into training and validation partitions.

In [3]:
import torch
from torch.utils.data.dataset import random_split
torch.manual_seed(1)

train_dataset, valid_dataset = random_split(
    list(train_iter), [20000, 5000]
)
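
Conceptually, a random split shuffles the example indices and cuts them into consecutive chunks of the requested sizes. The snippet below is a plain-Python sketch of that idea on ten toy items. `split_indices` is a hypothetical helper written for illustration, not the `torch.utils.data` implementation, and its shuffle will not reproduce the same split as `random_split`.

```python
import random

def split_indices(n_items, lengths, seed=1):
    """Shuffle the indices 0..n_items-1 and cut them into consecutive chunks."""
    if sum(lengths) != n_items:
        raise ValueError("lengths must sum to the dataset size")
    indices = list(range(n_items))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for length in lengths:
        parts.append(indices[start:start + length])
        start += length
    return parts

train_idx, valid_idx = split_indices(10, [8, 2])
print(len(train_idx), len(valid_idx))  # 8 2
```

Every index lands in exactly one partition, which is why the requested lengths must sum to the dataset size (20000 + 5000 = 25000 in the cell above).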

My second step is to **find the number of unique tokens (words)** in the training dataset. It is also known as the **vocab size**.

To help achieve this goal, I use a `tokenizer` helper, written in the following cell. Its job is to strip HTML markup, punctuation, and other non-word characters from each review in the dataset while keeping emoticons such as ":)". Let's do this:

In [14]:
import re #regular expression
from collections import Counter, OrderedDict

def tokenizer(text):
    #Remove HTML tags in "text".
    text = re.sub('<[^>]*>', '', text)
    
    #Find all emoticons in the lowercased text and collect them in a list.
    #Raw strings (r'...') keep the backslashes intact and avoid invalid-escape warnings.
    emoticons = re.findall(
        r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
    )
    
    #Replace each run of non-word characters (roughly [^a-zA-Z0-9_]) in the lowercased text with one space.
    #Then append the emoticons, joined by spaces and with any '-' removed (':-)' becomes ':)').
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    
    tokens = text.split()
    return tokens

In case you still have a hard time understanding what the regular expressions do despite the comments, I suggest you go to the [Regex101](https://regex101.com/) website, paste a test string, and see what each regular expression in the `tokenizer` function matches. For instance, I used the earlier training example as my test string, added some emoticons like ":-)" and ":-(", and played around.
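
To see the three steps at work without leaving the notebook, here is a small sketch that repeats the `tokenizer` function (with raw-string patterns, otherwise unchanged) and runs it on a made-up review. The sample sentence is my own and is not taken from the dataset.

```python
import re

def tokenizer(text):
    # Step 1: remove HTML tags such as <br />.
    text = re.sub(r'<[^>]*>', '', text)
    # Step 2: collect emoticons from the lowercased text.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Step 3: replace runs of non-word characters with a space, then
    # append the emoticons with their '-' "noses" removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

sample = 'This movie was <br />great :-) but the ending... meh :('
print(tokenizer(sample))
# ['this', 'movie', 'was', 'great', 'but', 'the', 'ending', 'meh', ':)', ':(']
```

Note that the emoticon pattern is applied to the lowercased text, so emoticons written with an uppercase letter, such as ":D", are not captured: `tokenizer('Loved it :D')` returns `['loved', 'it', 'd']`.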

With our `tokenizer` helper written, let's determine the vocab size of the training dataset.

In [16]:
token_counts = Counter()

for label, line in train_dataset:
    tokens = tokenizer(line)
    token_counts.update(tokens)

print('Training dataset vocab size:', len(token_counts))

Training dataset vocab size: 69023


Nice! Let's continue. Now, for my third step I want to **assign a unique integer to each individual token** we detected in the previous step. You may think of this unique integer as the "id" of the token.

In [23]:
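#Import the factory function under an alias so that the vocab object created
#below does not shadow it (this keeps the cell safe to re-run)
from torchtext.vocab import vocab as build_vocab

sorted_by_freq_tuples = sorted(
    token_counts.items(), key=lambda x: x[1], reverse=True
)

ordered_dict = OrderedDict(sorted_by_freq_tuples)

#vocab object maps tokens to indices
vocab = build_vocab(ordered_dict)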

vocab.insert_token("<pad>", 0) #padding token to adjust sequence length
vocab.insert_token("<unk>", 1) #unknown tokens are assigned integer 1
vocab.set_default_index(1)

`token_counts.items()` returns the (token, frequency) pairs stored in the counter. The `key` parameter is a function that extracts the value used to compare two elements. In this case, we want to compare tuples by looking at their second item (i.e. the frequency). Lastly, `reverse=True` specifies that we want the elements to be sorted in descending order. Using the `sorted` function, we are sorting the tokens in the vocab from the most common to the least common.
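
As a quick illustration of that sorting step, here is a minimal sketch on a toy counter. The tokens and counts are made up for the example.

```python
from collections import Counter

# Toy counts, deliberately listed out of frequency order.
toy_counts = Counter({'superb': 1, 'the': 5, 'movie': 3})

# Compare pairs by their second item (the count), largest first.
toy_sorted = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)
print(toy_sorted)  # [('the', 5), ('movie', 3), ('superb', 1)]
```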

Then, we use the sorted list to create an `OrderedDict`. An "ordered" dictionary is a dictionary that preserves the insertion order of its key-value pairs. If you update the value of an existing key, the order remains unchanged. If you remove an item and reinsert it, the item is added at the end of the dictionary.
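
The two ordering rules can be checked with a short sketch. The toy tokens and counts are made up for the example.

```python
from collections import OrderedDict

toy_dict = OrderedDict([('the', 5), ('movie', 3), ('superb', 1)])

# Updating the value of an existing key leaves the order untouched.
toy_dict['the'] = 6
order_after_update = list(toy_dict)
print(order_after_update)  # ['the', 'movie', 'superb']

# Removing a key and inserting it again moves it to the end.
del toy_dict['the']
toy_dict['the'] = 6
order_after_reinsert = list(toy_dict)
print(order_after_reinsert)  # ['movie', 'superb', 'the']
```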

Finally, the `vocab` object is created, which assigns an index to each token in the ordered dictionary. Let's print the integer values associated with some English words:

In [25]:
[vocab[token] for token in ['this', 'is', 'an', 'example'] ]

[11, 7, 35, 457]
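
To make the role of the two special tokens concrete, here is a plain-Python sketch of the same lookup behavior. It is a stand-in written for illustration, not how torchtext implements its vocabulary, and `build_token_to_index` and `encode` are hypothetical helper names. The toy token list is assumed to be already sorted by frequency.

```python
def build_token_to_index(sorted_tokens, specials=('<pad>', '<unk>')):
    """Give the special tokens the first indices, then number the rest in order."""
    index_to_token = list(specials) + list(sorted_tokens)
    return {token: index for index, token in enumerate(index_to_token)}

def encode(tokens, token_to_index, unk_token='<unk>'):
    """Map tokens to integers, falling back to the <unk> index for unseen tokens."""
    default_index = token_to_index[unk_token]
    return [token_to_index.get(token, default_index) for token in tokens]

toy_vocab = build_token_to_index(['the', 'movie', 'was', 'great'])
encoded = encode(['the', 'movie', 'was', 'superb'], toy_vocab)
print(encoded)  # [2, 3, 4, 1]
```

Because `<pad>` and `<unk>` occupy indices 0 and 1, the most frequent token starts at index 2, and a token that never appeared in the training data ('superb' here) falls back to index 1, mirroring `vocab.set_default_index(1)` above.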