# Recurrent Neural Networks - Sentiment Analysis

In this notebook, my goal is to implement an RNN in PyTorch that analyzes the opinion expressed in a sentence and classifies it as positive or negative.

I will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.

Okay, let's jump in!

# Loading and Preprocessing the data

For this exercise, I will use the IMDb movie reviews dataset. The multilayer RNN to be implemented will analyze those reviews and classify each one as a positive (1) or a negative (0) review.

Thankfully, the IMDb movie reviews dataset is available in PyTorch through the `torchtext.datasets` module.

In [16]:
#pip install torchtext
#pip install 'portalocker>=2.0.0'
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
test_iter = IMDB(split='test')

The previous two lines return iterators, specifically `ShardingFilterIterDataPipe` objects. As I am writing this, the PyTorch documentation says:

> The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`.

In case something changes that causes the code in this notebook to break, please let me know and I will update it.

With that out of the way, I would like to print the first training example, and see what it looks like:

In [20]:
list(train_iter)[0]

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

`train_iter` is an iterator over tuples, where _each_ tuple holds the sentiment label followed by the review text.
The output above is the very first training example.
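
Calling `list(train_iter)[0]` builds a list of every review just to read one of them. A lighter way to peek is sketched below. It uses a made-up generator, `toy_reviews`, as a stand-in for `train_iter` (the reviews and label values are invented for illustration), so it runs without downloading the dataset:

```python
from itertools import islice

# Made-up stand-in for train_iter: an iterator over (label, text) tuples.
def toy_reviews():
    yield (0, "A dull, overlong film.")
    yield (1, "Sharp writing and a great cast.")
    yield (1, "I would watch it again.")

# next(iter(...)) pulls only the first tuple.
first_label, first_text = next(iter(toy_reviews()))
print(first_label, first_text)

# islice takes the first n tuples without consuming the rest.
first_two = list(islice(toy_reviews(), 2))
print(len(first_two))
```

The same two idioms should work on any iterable of tuples, so I would expect them to apply to `train_iter` as well, though I have only run them on the toy data here.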

As before, I need to prepare the dataset before using it.

I start by splitting the training partition of the dataset into a training partition and a validation partition.

In [12]:
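import torch
from torch.utils.data import random_split

# Fix the seed so the split is reproducible
torch.manual_seed(1)

# Split the training reviews: 20,000 for training, 5,000 for validation
train_dataset, valid_dataset = random_split(
    list(train_iter), [20000, 5000]
)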
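
To make the idea behind the split concrete, here is a pure-Python sketch of a seeded random split. It is not PyTorch's implementation: `seeded_split` is a hypothetical helper written for illustration, and the toy data is made up. It mirrors two properties of the cell above: the integer lengths must add up to the dataset size (20000 + 5000 = 25000), and a fixed seed gives the same split every time.

```python
import random

def seeded_split(items, lengths, seed=1):
    """Shuffle indices with a fixed seed, then cut them into consecutive chunks."""
    if sum(lengths) != len(items):
        raise ValueError("lengths must sum to the dataset size")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for n in lengths:
        splits.append([items[i] for i in indices[start:start + n]])
        start += n
    return splits

# Ten made-up (label, text) examples, split 8 / 2
toy = [(i % 2, f"review {i}") for i in range(10)]
train_part, valid_part = seeded_split(toy, [8, 2])
print(len(train_part), len(valid_part))  # 8 2
```

Because the two parts are cut from one shuffled index list, no example can land in both partitions.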

My second step is to find the number of unique tokens (words) in the training dataset.

To achieve this, I use the `tokenizer` helper defined in the following cell. Its job is to strip HTML markup, punctuation, and other non-word characters from each review in the dataset, while keeping emoticons. Let's do this:

In [None]:
import re  # regular expressions
from collections import Counter, OrderedDict

def tokenizer(text):
    # Remove HTML tags (anything matching '<[^>]*>')
    text = re.sub(r'<[^>]*>', '', text)

    # Collect every emoticon in the lowercased text, e.g. ':)', ':-(' or ';)'.
    # Note: 'D' and 'P' cannot match here because the text is lowercased first.
    emoticons = re.findall(
        r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
    )

    # Replace each run of non-word characters with a single space, then append
    # the emoticons, space-separated and with the '-' removed.
    # The explicit ' ' keeps the first emoticon from fusing with the last word.
    text = (
        re.sub(r'[\W]+', ' ', text.lower())
        + ' '
        + ' '.join(emoticons).replace('-', '')
    )

    tokenized = text.split()
    return tokenized
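
As a quick sanity check, the sketch below runs the same tokenization steps on two made-up reviews and counts the unique tokens with `Counter`. The function is a compact copy named `tokenize` (a hypothetical name, so the snippet runs on its own), and the reviews are invented for illustration:

```python
import re
from collections import Counter

def tokenize(text):
    # Compact copy of the tokenizer steps, so this snippet is self-contained.
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # A space separates the cleaned text from the appended emoticons.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text.split()

reviews = [
    "This movie was <b>great</b> :-) Loved it!",
    "Not great, not terrible.",
]

token_counts = Counter()
for review in reviews:
    token_counts.update(tokenize(review))

print(tokenize(reviews[0]))
# ['this', 'movie', 'was', 'great', 'loved', 'it', ':)']
print(len(token_counts))  # 9 unique tokens
```

The HTML tags and punctuation disappear, the emoticon moves to the end of the token list with its `-` removed, and `len(token_counts)` gives the vocabulary size for this toy corpus.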

In [28]:
import re

text = ':-)I rented I AM :-() CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.'

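# Note: punctuation is stripped *before* the emoticon search here (the reverse
# of the order used in `tokenizer`), so no emoticons are left to find and the
# result is an empty string.
text = re.sub(r'[\W]+', ' ', text.lower())
emoticons = re.findall(
    r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
)
' '.join(emoticons).replace('-', '')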

''
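
The empty string comes from the order of operations: once `[\W]+` has replaced the punctuation, there is nothing left for the emoticon pattern to match. A minimal sketch on a short made-up string shows the difference between searching before and after stripping (`EMOTICON` and `sample` are names I introduce here for illustration):

```python
import re

EMOTICON = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'
sample = ':-) I liked it :-( a bit'

# Search first, as `tokenizer` does: both emoticons are found.
before = re.findall(EMOTICON, sample.lower())

# Strip non-word characters first: the emoticons are already gone.
stripped = re.sub(r'[\W]+', ' ', sample.lower())
after = re.findall(EMOTICON, stripped)

print(before)  # [':-)', ':-(']
print(after)   # []
```

This is why `tokenizer` collects the emoticons before it cleans the text.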