## Sentiment Analysis: Determining the polarity of a text (positive or negative).

<img src='imgs/QRNN.png' width='100%'/>

## Data

[IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) Dataset
- A dataset for binary sentiment classification.
- It provides a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.


**Note**: to run the following code, download the dataset from the link above and set `data_dir` in the cells below accordingly.

## Libraries

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import os
import re
import sys
import spacy  # just for NLP
import pickle
import numpy as np

from glob import glob
import tqdm.notebook  # binds `tqdm` and loads the submodule used below as tqdm.notebook.tqdm

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

from utils import *
from data_utils import Vocabulary, tokenizer
from train_utils import train


# setup
NLP = spacy.load('en_core_web_sm')  # NLP toolkit

## Tokenizing

In [2]:
text = """
Bromwell High is a cartoon comedy. 
It ran at the same time as some other programs about school life, such as 'Teachers'. 
My 35 years in the teaching profession lead me to believe that Bromwell High's 
satire is much closer to reality than is 'Teachers'. 
The scramble to survive financially, the insightful students who can see 
right through their pathetic teachers' pomp, the pettiness of the whole situation, 
all remind me of the schools I knew and their students. 
When I saw the episode in which a student repeatedly tried to burn down the school, 
I immediately recalled ......... at .......... High. 
A classic line: INSPECTOR: I'm here to sack one of your teachers. 
STUDENT: Welcome to Bromwell High. 
I expect that many adults of my age think that Bromwell High is far fetched. 
What a pity that it isn't!!!
"""

In [3]:
'''Remove the following characters and replace them with a space'''
text = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]", " ", str(text)) 
print(text)

 Bromwell High is a cartoon comedy.  It ran at the same time as some other programs about school life, such as 'Teachers'.  My 35 years in the teaching profession lead me to believe that Bromwell High's  satire is much closer to reality than is 'Teachers'.  The scramble to survive financially, the insightful students who can see  right through their pathetic teachers' pomp, the pettiness of the whole situation,  all remind me of the schools I knew and their students.  When I saw the episode in which a student repeatedly tried to burn down the school,  I immediately recalled ......... at .......... High.  A classic line  INSPECTOR  I'm here to sack one of your teachers.  STUDENT  Welcome to Bromwell High.  I expect that many adults of my age think that Bromwell High is far fetched.  What a pity that it isn't!!! 


In [4]:
'''Collapse runs of spaces into a single space'''
text = re.sub(r"[ ]+", " ", text)
print(text)

 Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as 'Teachers'. My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is 'Teachers'. The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line INSPECTOR I'm here to sack one of your teachers. STUDENT Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!!! 


In [5]:
'''Collapse repeated ! signs into a single !'''

text = re.sub(r"\!+", "!", text)
print(text)

 Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as 'Teachers'. My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is 'Teachers'. The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line INSPECTOR I'm here to sack one of your teachers. STUDENT Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't! 


In [6]:
'''Tokenize'''

tokens = [w.text for w in NLP.tokenizer(text)]
print(tokens)

[' ', 'Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', "'", 'Teachers', "'", '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', "'", 'Teachers', "'", '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'line'

### Tokenizer and Vocabulary

We have defined a function in `utils.py` that takes the input text and splits it into a sequence of tokens. It uses the **SpaCy** toolkit for tokenization, so you need to install SpaCy to run the code.

```python
def tokenizer(text):
    text = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]", " ", str(text))
    text = re.sub(r"[ ]+", " ", text)
    text = re.sub(r"\!+", "!", text)
    text = re.sub(r"\,+", ",", text)
    text = re.sub(r"\?+", "?", text)
    return [x.text for x in NLP.tokenizer(text) if x.text != " "]
```
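
To see what the regex stages do without installing SpaCy, the sketch below applies the same substitutions and then splits on whitespace. The whitespace split is only a stand-in assumption for `NLP.tokenizer` (it does not separate punctuation from words the way SpaCy does), and `normalize` is a hypothetical helper name, not part of `utils.py`.

```python
import re

def normalize(text):
    """Apply the same regex clean-up as `tokenizer`, without SpaCy."""
    # replace selected punctuation and newlines with a space
    text = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]", " ", str(text))
    text = re.sub(r"[ ]+", " ", text)   # collapse runs of spaces
    text = re.sub(r"\!+", "!", text)    # collapse repeated '!'
    text = re.sub(r"\,+", ",", text)    # collapse repeated ','
    text = re.sub(r"\?+", "?", text)    # collapse repeated '?'
    return text.strip()

cleaned = normalize("What a pity -- that it isn't!!!\nReally??")
print(cleaned)
# stand-in for SpaCy tokenization (assumption): plain whitespace split
print(cleaned.split(" "))
```

Note that the straight apostrophe in `isn't` survives, because the character class only lists the curly quotes `‘` and `’`.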

In [2]:

data_dir = 'dataset/aclImdb/'

vocab_path = 'vocab.pkl'

# parameters
max_len = 200  # keep at most 200 tokens per text; the mean + 2 * sigma length statistic is a guide for choosing this
min_count = 10     # every token that occurs fewer than 10 times is replaced with the special token UNK = '<unk>'
batch_size = 50

In [3]:
import splitfolders
input_folder='dataset/aclImdb/test/'
splitfolders.ratio(input_folder, output="dataset/aclImdb/valid", seed=1337, ratio=(0.88, 0.1, 0.02)) 

Copying files: 25000 files [04:16, 97.52 files/s] 


In [4]:
data_dir='dataset/aclImdb/dev/'
os.listdir(data_dir)

['test', 'train']

In [5]:
os.listdir(f'{data_dir}/train')

['neg', 'pos']

### Statistics

In [6]:
all_filenames = glob(f'{data_dir}/*/*/*.txt')
num_words = [len(open(f, encoding="utf-8").read().split(' ')) for f in tqdm.notebook.tqdm(all_filenames)]

# print statistics
print('Min length =', min(num_words))
print('Max length =', max(num_words))

print('Mean = {:.2f}'.format(np.mean(num_words)))
print('Std  = {:.2f}'.format(np.std(num_words)))

print('mean + 2 * sigma = {:.2f}'.format(np.mean(num_words) + 2.0 * np.std(num_words)))

Min length = 9
Max length = 1090
Mean = 228.50
Std  = 170.35
mean + 2 * sigma = 569.19
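
The mean + 2 * sigma figure is a rule of thumb for a length that covers most texts. With the printed values it is 228.50 + 2 * 170.35 = 569.20, which matches the output above up to rounding. The sketch below computes the same statistics and, additionally, the fraction of texts that fit within a chosen `max_len`. It is a minimal illustration: `length_report` is a hypothetical helper, the lengths are toy values rather than the IMDB data, and it uses the standard library in place of NumPy.

```python
import statistics

def length_report(lengths, max_len):
    """Return mean, population std, mean + 2*std, and coverage at max_len."""
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population std, same convention as np.std
    covered = sum(n <= max_len for n in lengths) / len(lengths)
    return mean, std, mean + 2 * std, covered

lengths = [9, 120, 180, 200, 230, 260, 400, 1090]  # toy values, not the real data
mean, std, cutoff, covered = length_report(lengths, max_len=200)
print(f"mean={mean:.2f} std={std:.2f} mean+2*sigma={cutoff:.2f}")
print(f"fraction of texts not truncated at max_len=200: {covered:.2f}")
```

Keep in mind that the notebook counts whitespace-separated words here, while `max_len` is applied to SpaCy tokens, so the two lengths are only roughly comparable.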


## Dataset

In [7]:
PAD = '<pad>'  # special symbol we use for padding text
UNK = '<unk>'  # special symbol we use for rare or unknown word
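
As a small illustration of how `PAD` is used, the sketch below brings a token list to a fixed length in the same way as the dataset class in this notebook: short texts are padded on the left, long texts keep only their first `max_len` tokens. `pad_or_trim` is a hypothetical helper name introduced for this example.

```python
PAD = '<pad>'

def pad_or_trim(tokens, max_len):
    """Left-pad with PAD, or keep only the first max_len tokens."""
    if len(tokens) < max_len:
        return [PAD] * (max_len - len(tokens)) + tokens
    return tokens[:max_len]

short = pad_or_trim(['a', 'great', 'film'], 5)
long = pad_or_trim(['a', 'b', 'c', 'd', 'e', 'f'], 5)
print(short)
print(long)
```

Left padding places the real tokens at the end of the sequence, so they are the last inputs the LSTM sees.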

In [8]:
class TextClassificationDataset(Dataset):
    
    def __init__(self, path, tokenizer, 
                 split='train', 
                 vocab_path='vocab.pkl', 
                 max_len=100, min_count=10):
        
        self.path = path
        assert split in ['train', 'test']
        self.split = split
        self.vocab_path = vocab_path
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.min_count = min_count
        
        self.cache = {}
        self.vocab = None
        
        self.classes = []
        self.class_to_index = {}
        self.text_files = []
        
        split_path = f'{path}/{split}'
        
        for cls_idx, label in enumerate(os.listdir(split_path)):
            text_files = [(fname, cls_idx) for fname in glob(f'{split_path}/{label}/*.txt')]
            self.text_files += text_files
            self.classes += [label]
            self.class_to_index[label] = cls_idx
        
        self.num_classes = len(self.classes)
            
        # build vocabulary from training and validation texts
        self.build_vocab()
        
    def __getitem__(self, index):
        # read the tokenized text file and its label (neg=0, pos=1)
        fname, class_idx = self.text_files[index]
        
        if fname in self.cache:
            return self.cache[fname], class_idx
        
        # read text file 
        with open(fname, encoding="utf-8") as f:
            text = f.read()
        
        # tokenize the text file
        tokens = self.tokenizer(text.lower().strip())
        
        # padding and trimming
        if len(tokens) < self.max_len:
            num_pads = self.max_len - len(tokens)
            tokens = [PAD] * num_pads + tokens
        elif len(tokens) > self.max_len:
            tokens = tokens[:self.max_len]
            
        # numericalizing
        ids = torch.LongTensor(self.max_len)
        for i, word in enumerate(tokens):
            if word not in self.vocab.word2index:
                ids[i] = self.vocab.word2index[UNK]  # unknown words
            elif word != PAD and self.vocab.word2count[word] < self.min_count:
                ids[i] = self.vocab.word2index[UNK]  # rare words
            else:
                ids[i] = self.vocab.word2index[word]
                
        # save in cache for future use
        self.cache[fname] = ids
        
        return ids, class_idx
    
    def __len__(self):
        return len(self.text_files)
    
    def build_vocab(self):
        if not os.path.exists(self.vocab_path):
            vocab = Vocabulary(self.tokenizer)
            filenames = glob(f'{self.path}/*/*/*.txt')
            for filename in tqdm.notebook.tqdm(filenames, desc='Building Vocab'):
                with open(filename, encoding='utf8') as f:
                    for line in f:
                        vocab.add_sentence(line.lower())

            # sort words by their frequencies
            words = [(0, PAD), (0, UNK)]
            words += sorted([(c, w) for w, c in vocab.word2count.items()], reverse=True)

            self.vocab = Vocabulary(self.tokenizer)
            for i, (count, word) in enumerate(words):
                self.vocab.word2index[word] = i
                self.vocab.word2count[word] = count
                self.vocab.index2word[i] = word
                self.vocab.count += 1

            with open(self.vocab_path, 'wb') as f:
                pickle.dump(self.vocab, f)
        else:
            with open(self.vocab_path, 'rb') as f:
                self.vocab = pickle.load(f)
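
The vocabulary logic above can be condensed into a few lines with `collections.Counter`. This is a simplified sketch, not the `Vocabulary` class from `data_utils`: `build_word2index` and `numericalize` are hypothetical names, rare words are dropped when the mapping is built (the class above keeps them and maps them to `UNK` at lookup time), and ties in frequency are broken alphabetically.

```python
from collections import Counter

PAD, UNK = '<pad>', '<unk>'

def build_word2index(token_lists, min_count):
    """Map PAD -> 0, UNK -> 1, then frequent words by decreasing count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate([PAD, UNK] + kept)}

def numericalize(tokens, word2index):
    """Replace each token by its index; unknown or rare tokens become UNK."""
    unk = word2index[UNK]
    return [word2index.get(tok, unk) for tok in tokens]

corpus = [['good', 'film'], ['bad', 'film'], ['good', 'good', 'plot']]
word2index = build_word2index(corpus, min_count=2)
ids = numericalize(['good', 'plot', 'film', PAD], word2index)
print(word2index)
print(ids)
```

Here `plot` occurs only once, so with `min_count=2` it is mapped to the `UNK` index.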

In [9]:
train_ds = TextClassificationDataset(data_dir, tokenizer, 'train', vocab_path, max_len, min_count)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

valid_ds = TextClassificationDataset(data_dir, tokenizer, 'test', vocab_path, max_len, min_count)
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False)

In [10]:
len(train_ds)

2500

In [234]:
len(valid_ds)

500

In [235]:
train_ds.classes

['neg', 'pos']

In [236]:
train_ds.class_to_index

{'neg': 0, 'pos': 1}

In [237]:
ids, label = train_ds[0]

print(train_ds.classes[label])
print(ids.numpy())

neg
[   14     9    40   476     7   144     2  2236     7   213   116    30
     2   176     4  3920     5   368     3    45    16    73   171   283
   157   140     4     6   594   455     7     2   104  1174 14516  2030
     7  1788  1640     5  1788  5044     3    41   147   259 13558   118
   224   131    15    38    30  2196     7   124     3     5   124    82
     4    49    28  1273    22    14    35     3   150    74   175   660
   529     3     1    46   115   175   849  6226    21  1788  1640     3
    46 21342  2813     2  2844     3  1999  2860    46  2399    21  1788
  5044     5    74     2   154   804     4  1788  1640    16  2261  7362
   421   633   170    14    25 18465    36     2  2702     3     5    13
   155   137  1531    56     2  2143   960  5892    19   413    12    14
    25    59     5   144     2  2143    83    31   219   295     2  2539
   176   114    59    43  2062 21040     3   178    25    13   147   119
    22   960  5892    55   101   400     2  253

In [238]:
# convert back the sequence of integers into original text
print(' '.join([train_ds.vocab.index2word[i.item()] for i in ids]))

this is an example of why the majority of action films are the same . generic and boring , there 's really nothing worth watching here . a complete waste of the then barely tapped talents of ice t and ice cube , who 've each proven many times over that they are capable of acting , and acting well . do n't bother with this one , go see new jack city , <unk> or watch new york undercover for ice t , or boyz n the hood , higher learning or friday for ice cube and see the real deal . ice t 's horribly cliched dialogue alone makes this film grate at the teeth , and i 'm still wondering what the heck bill paxton was doing in this film ? and why the heck does he always play the exact same character ? from aliens onward , every film i 've seen with bill paxton has him playing the exact same irritating character , and at least in aliens his character died , which made it somewhat gratifying ... <br > < br > overall , this is second rate action trash . there are countless


In [239]:
# print the original text
print(open(train_ds.text_files[0][0]).read())

This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to se

### Vocabulary size

In [11]:
vocab = train_ds.vocab
freqs = [(count, word) for (word, count) in vocab.word2count.items() if count >= min_count]
vocab_size = len(freqs) + 2  # for PAD and UNK tokens
print(f'Vocab size = {vocab_size}')

print('\nMost common words:')
for c, w in sorted(freqs, reverse=True)[:10]:
    print(f'{w}: {c}')

Vocab size = 29506

Most common words:
the: 666713
,: 543467
.: 470130
and: 324156
a: 321800
of: 289313
to: 267961
is: 217022
>: 202243
it: 187974


## LSTM Classifier with Attention mechanism

In [15]:
# Attention computes a weighted average of the hidden states of the LSTM model.
# It produces one weight for the hidden state at each time step.

class SelfAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.projection = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 1)
        )
    
    def forward(self, encoder_outputs):
        # encoder_outputs = [batch size, sent len, hid dim]
        energy = self.projection(encoder_outputs)
        # energy = [batch size, sent len, 1]
        weights = F.softmax(energy.squeeze(-1), dim=1)
        # weights = [batch size, sent len]
        outputs = (encoder_outputs * weights.unsqueeze(-1)).sum(dim=1)
        # outputs = [batch size, hid dim]
        return outputs, weights

    
class AttentionLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        self.num_layers = n_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=bidirectional, 
                            dropout= 0 if n_layers < 2 else dropout)
        self.attention = SelfAttention(hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # x = [sent len, batch size]
        embedded = self.embedding(x)
        # embedded = [sent len, batch size, emb dim]
        output, (hidden, cell) = self.lstm(embedded)
        # pass batch_first=True to nn.LSTM if you want batch size as the first dimension
        # output = [sent len, batch size, hid dim*num directions]
        # sum the forward and backward halves (assumes bidirectional=True)
        output = output[:, :, :self.hidden_dim] + output[:, :, self.hidden_dim:]
        # output = [sent len, batch size, hid dim]
        output = output.permute(1, 0, 2)
        # output = [batch size, sent len, hid dim]
        new_embed, weights = self.attention(output)
        # new_embed = [batch size, hid dim]
        # weights = [batch size, sent len]
        new_embed = self.dropout(new_embed)
        return self.fc(new_embed)
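
To make the pooling step concrete, here is a pure-Python sketch of what `SelfAttention.forward` computes for a single sequence: a softmax over one score per time step, followed by a weighted sum of the hidden states. In the module above the scores come from the small `projection` network; in this sketch they are passed in directly, which is an assumption made to keep the example self-contained. `softmax` and `attention_pool` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores):
    """hidden_states: one [hid dim] vector per time step.
    scores: one unnormalized energy per time step."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(states, scores=[0.0, 0.0, 0.0])
print(weights)
print(pooled)
```

With equal scores every time step gets the same weight, so the result is the plain average of the hidden states; larger scores shift the average toward the corresponding time steps.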

In [13]:
vocab_size = 2 + len([w for (w, c) in train_ds.vocab.word2count.items() if c >= min_count])
print(vocab_size)

29506


## Model

In [29]:
# LSTM parameters
embed_size = 100
hidden_size = 256 
num_layers = 4

# training parameters
lr = 0.001
num_epochs = 10

In [30]:
model = AttentionLSTM(vocab_size, embed_size, hidden_size, 
                      output_dim=train_ds.num_classes, 
                      n_layers=num_layers, bidirectional=True, dropout=0.3)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [31]:
criterion = nn.CrossEntropyLoss().to(device)
    
optimizer = optim.Adam(model.parameters(), lr=lr, betas=(0.7, 0.99))
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
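
`StepLR` multiplies the learning rate by `gamma` every `step_size` scheduler steps. The sketch below reproduces that schedule in plain Python for the values used here (`lr=0.001`, `gamma=0.9`, `step_size=1`). It assumes the `train` helper calls `scheduler.step()` once at the end of each epoch, which is not shown in this notebook; `step_lr` is a hypothetical function name.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps (epoch 0 = first epoch)."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, 0.9, 1, epoch) for epoch in range(10)]
for epoch, lr_value in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr_value:.6f}")
```

Under that assumption the learning rate in the tenth epoch is 0.001 * 0.9^9, a little under 40% of its initial value.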

### Training

In [32]:
hist = train(model, train_dl, valid_dl, criterion, optimizer, device, scheduler, num_epochs)

[Epoch:  1/10] | Training Loss: 0.0139 | Testing Loss: 0.0138 | Training Acc:           50.16 | Testing Acc: 51.80
[Epoch:  2/10] | Training Loss: 0.0132 | Testing Loss: 0.0116 | Training Acc:           61.16 | Testing Acc: 71.80
[Epoch:  3/10] | Training Loss: 0.0104 | Testing Loss: 0.0113 | Training Acc:           75.12 | Testing Acc: 73.00
[Epoch:  4/10] | Training Loss: 0.0083 | Testing Loss: 0.0113 | Training Acc:           81.64 | Testing Acc: 72.20
[Epoch:  5/10] | Training Loss: 0.0064 | Testing Loss: 0.0111 | Training Acc:           86.84 | Testing Acc: 74.00
[Epoch:  6/10] | Training Loss: 0.0050 | Testing Loss: 0.0117 | Training Acc:           89.64 | Testing Acc: 75.60
[Epoch:  7/10] | Training Loss: 0.0037 | Testing Loss: 0.0166 | Training Acc:           93.08 | Testing Acc: 74.00
[Epoch:  8/10] | Training Loss: 0.0027 | Testing Loss: 0.0172 | Training Acc:           95.52 | Testing Acc: 75.00
[Epoch:  9/10] | Training Loss: 0.0022 | Testing Loss: 0.0183 | Training Acc:   
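
In the log above, training accuracy keeps rising while the testing loss bottoms out and then climbs, which is the usual sign of overfitting. The sketch below picks the best epoch from the eight complete rows of the log (the values are copied from the output; the ninth row is cut off and is left out). It only illustrates how one might choose a checkpoint and is not part of the `train` helper.

```python
# values copied from the complete rows of the training log (epochs 1-8)
valid_loss = [0.0138, 0.0116, 0.0113, 0.0113, 0.0111, 0.0117, 0.0166, 0.0172]
valid_acc = [51.80, 71.80, 73.00, 72.20, 74.00, 75.60, 74.00, 75.00]

# epochs are 1-based in the log, list indices are 0-based
best_loss_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__) + 1
best_acc_epoch = max(range(len(valid_acc)), key=valid_acc.__getitem__) + 1

print(f"lowest testing loss at epoch {best_loss_epoch}")
print(f"highest testing accuracy at epoch {best_acc_epoch}")
```

The two criteria disagree by one epoch here, so whichever is used for model selection should be fixed before training.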
