<div class="alert alert-success">
    <h1 align="center">Lesson 5: Text Classification - Sentiment Analysis</h1>
    <h3 align="center">Javad Mohammadzadeh
    </h3>
</div>

## Introduction

<h6>Text Classification:</h6> Assigning a label (from a set of pre-defined labels) to a given text.

<h6>Some applications:</h6>
- **Sentiment Analysis:** Determining the polarity of a text (`positive` or `negative`).
- **Spam detection:** email, web pages, etc.
- **Toxic text detection:** insult, racism, hatred, etc.

## Libraries

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import os
import re
import sys
import spacy
import pickle
import numpy as np

from glob import glob
from tqdm import tqdm_notebook

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

from utils import *
from data_utils import Vocabulary, tokenizer
from train_utils import train

# setup
use_gpu = torch.cuda.is_available()
NLP = spacy.load('en')  # NLP toolkit

## Tokenizing

In [3]:
text = """
Bromwell High is a cartoon comedy. 
It ran at the same time as some other programs about school life, such as 'Teachers'. 
My 35 years in the teaching profession lead me to believe that Bromwell High's 
satire is much closer to reality than is 'Teachers'. 
The scramble to survive financially, the insightful students who can see 
right through their pathetic teachers' pomp, the pettiness of the whole situation, 
all remind me of the schools I knew and their students. 
When I saw the episode in which a student repeatedly tried to burn down the school, 
I immediately recalled ......... at .......... High. 
A classic line: INSPECTOR: I'm here to sack one of your teachers. 
STUDENT: Welcome to Bromwell High. 
I expect that many adults of my age think that Bromwell High is far fetched. 
What a pity that it isn't!!!
"""

In [4]:
text = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]", " ", str(text))
print(text)

 Bromwell High is a cartoon comedy.  It ran at the same time as some other programs about school life, such as 'Teachers'.  My 35 years in the teaching profession lead me to believe that Bromwell High's  satire is much closer to reality than is 'Teachers'.  The scramble to survive financially, the insightful students who can see  right through their pathetic teachers' pomp, the pettiness of the whole situation,  all remind me of the schools I knew and their students.  When I saw the episode in which a student repeatedly tried to burn down the school,  I immediately recalled ......... at .......... High.  A classic line  INSPECTOR  I'm here to sack one of your teachers.  STUDENT  Welcome to Bromwell High.  I expect that many adults of my age think that Bromwell High is far fetched.  What a pity that it isn't!!! 


In [5]:
text = re.sub(r"[ ]+", " ", text)
print(text)

 Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as 'Teachers'. My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is 'Teachers'. The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line INSPECTOR I'm here to sack one of your teachers. STUDENT Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!!! 


In [6]:
text = re.sub(r"\!+", "!", text)
print(text)

 Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as 'Teachers'. My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is 'Teachers'. The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line INSPECTOR I'm here to sack one of your teachers. STUDENT Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't! 


In [7]:
tokens = [w.text for w in NLP.tokenizer(text)]
print(tokens)

[' ', 'Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', "'", 'Teachers', "'", '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', "'", 'Teachers', "'", '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'line'

### Tokenizer and Vocabulary

We have defined a function in `utils.py` that takes the input text and splits it into a sequence of tokens. It uses the **SpaCy** toolkit for tokenization, so you need to install SpaCy to run the code.

```python
def tokenizer(text):
    text = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]", " ", str(text))
    text = re.sub(r"[ ]+", " ", text)
    text = re.sub(r"\!+", "!", text)
    text = re.sub(r"\,+", ",", text)
    text = re.sub(r"\?+", "?", text)
    return [x.text for x in NLP.tokenizer(text) if x.text != " "]
```
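
To see what the regular-expression steps do on their own, here is a minimal sketch that applies the same clean-up without SpaCy. The `normalize` and `whitespace_tokenize` helpers are hypothetical names introduced for illustration; splitting on whitespace is only a rough stand-in for `NLP.tokenizer`, which also separates punctuation from words.

```python
import re

# Same character class as the lesson's tokenizer: these characters become spaces.
_STRIP = re.compile(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’;]")

def normalize(text):
    """Apply the lesson's regex clean-up steps, without the SpaCy tokenization."""
    text = _STRIP.sub(" ", str(text))
    text = re.sub(r"[ ]+", " ", text)   # collapse runs of spaces
    text = re.sub(r"\!+", "!", text)    # "!!!" -> "!"
    text = re.sub(r"\,+", ",", text)    # ",," -> ","
    text = re.sub(r"\?+", "?", text)    # "???" -> "?"
    return text

def whitespace_tokenize(text):
    """Hypothetical stand-in for NLP.tokenizer: split on whitespace only."""
    return normalize(text).split()

sample = 'A classic line: "Welcome to Bromwell High."\nWhat a pity that it isn\'t!!!'
cleaned = normalize(sample)
tokens = whitespace_tokenize(sample)
print(cleaned)
print(tokens)
```

Note that the ASCII apostrophe is not in the stripped character class, so contractions such as `isn't` survive the clean-up and are left for the tokenizer to split.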

### Install SpaCy 

Installation:
<pre>conda install -c conda-forge spacy</pre>

Download a language:
<pre>python -m spacy download en</pre>

Usage:
```python
import spacy
NLP = spacy.load('en')  # spaCy 3+ dropped the 'en' shortcut; download and load 'en_core_web_sm' there
```

## Data

[IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) Dataset
- A dataset for binary sentiment classification.
- It provides 25,000 highly polar movie reviews for training and 25,000 for testing.

In [8]:
data_dir = 'D:/datasets/aclImdb/dev'

vocab_path = 'vocab.pkl'

# parameters
max_len = 200
min_count = 10
batch_size = 50

In [9]:
os.listdir(data_dir)

['test', 'train']

In [10]:
os.listdir(f'{data_dir}/train')

['neg', 'pos']

### Statistics

In [11]:
all_filenames = glob(f'{data_dir}/*/*/*.txt')
num_words = [len(open(f).read().split(' ')) for f in tqdm_notebook(all_filenames)]

# print statistics
print('Min length =', min(num_words))
print('Max length =', max(num_words))

print('Mean = {:.2f}'.format(np.mean(num_words)))
print('Std  = {:.2f}'.format(np.std(num_words)))

print('mean + 2 * sigma = {:.2f}'.format(np.mean(num_words) + 2.0 * np.std(num_words)))


Min length = 10
Max length = 2470
Mean = 242.28
Std  = 189.11
mean + 2 * sigma = 620.49
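
The `mean + 2 * sigma` value gives a rough upper bound for choosing `max_len`: most reviews are shorter than it. A complementary check is to measure directly what fraction of texts fit within a candidate `max_len`. The sketch below uses a small, hypothetical list of lengths in place of `num_words`, and the `coverage` helper is a name introduced here for illustration.

```python
import statistics

# Hypothetical review lengths (in words); the real values come from num_words above.
lengths = [10, 80, 120, 150, 200, 240, 300, 450, 700, 2470]

mean = statistics.mean(lengths)
std = statistics.pstdev(lengths)      # population std, matching np.std's default
cutoff = mean + 2 * std

def coverage(lengths, max_len):
    """Fraction of texts that fit in max_len tokens without trimming."""
    return sum(n <= max_len for n in lengths) / len(lengths)

print(f'mean + 2 * sigma = {cutoff:.2f}')
for max_len in (200, 400, 620):
    print(f'max_len={max_len}: {coverage(lengths, max_len):.0%} of texts kept whole')
```

Keep in mind that the statistics cell counts words by splitting on spaces, while the dataset trims after tokenization, so token counts will usually be somewhat higher.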


## Dataset

In [19]:
PAD = '<pad>'  # special symbol we use for padding text
UNK = '<unk>'  # special symbol we use for rare or unknown words

In [12]:
class TextClassificationDataset(Dataset):
    
    def __init__(self, path, tokenizer, 
                 split='train', 
                 vocab_path='vocab.pkl', 
                 max_len=100, min_count=10):
        
        self.path = path
        assert split in ['train', 'test']
        self.split = split
        self.vocab_path = vocab_path
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.min_count = min_count
        
        self.cache = {}
        self.vocab = None
        
        self.classes = []
        self.class_to_index = {}
        self.text_files = []
        
        split_path = f'{path}/{split}'
        # sort the folder names so that class indices are deterministic (neg=0, pos=1)
        for cls_idx, label in enumerate(sorted(os.listdir(split_path))):
            text_files = [(fname, cls_idx) for fname in glob(f'{split_path}/{label}/*.txt')]
            self.text_files += text_files
            self.classes += [label]
            self.class_to_index[label] = cls_idx
        
        self.num_classes = len(self.classes)
            
        # build vocabulary from training and validation texts
        self.build_vocab()
        
    def __getitem__(self, index):
        # read the tokenized text file and its label (neg=0, pos=1)
        fname, class_idx = self.text_files[index]
        
        if fname in self.cache:
            return self.cache[fname], class_idx
        
        # read text file (same encoding as in build_vocab)
        with open(fname, encoding='utf8') as f:
            text = f.read()
        
        # tokenize the text file
        tokens = self.tokenizer(text.lower())
        
        # padding and trimming
        if len(tokens) < self.max_len:
            num_pads = self.max_len - len(tokens)
            tokens = [PAD] * num_pads + tokens
        elif len(tokens) > self.max_len:
            tokens = tokens[:self.max_len]
            
        # numericalizing
        ids = torch.LongTensor(self.max_len)
        for i, word in enumerate(tokens):
            if word not in self.vocab.word2index:
                ids[i] = self.vocab.word2index[UNK]  # unknown words
            elif word != PAD and self.vocab.word2count[word] < self.min_count:
                ids[i] = self.vocab.word2index[UNK]  # rare words
            else:
                ids[i] = self.vocab.word2index[word]
                
        # save in cache for future use
        self.cache[fname] = ids
        
        return ids, class_idx
    
    def __len__(self):
        return len(self.text_files)
    
    def build_vocab(self):
        if not os.path.exists(self.vocab_path):
            vocab = Vocabulary(self.tokenizer)
            filenames = glob(f'{self.path}/*/*/*.txt')
            for filename in tqdm_notebook(filenames, desc='Building Vocab'):
                with open(filename, encoding='utf8') as f:
                    for line in f:
                        vocab.add_sentence(line.lower())

            # sort words by their frequencies
            words = [(0, PAD), (0, UNK)]
            words += sorted([(c, w) for w, c in vocab.word2count.items()], reverse=True)

            self.vocab = Vocabulary(self.tokenizer)
            for i, (count, word) in enumerate(words):
                self.vocab.word2index[word] = i
                self.vocab.word2count[word] = count
                self.vocab.index2word[i] = word
                self.vocab.count += 1

            pickle.dump(self.vocab, open(self.vocab_path, 'wb'))
        else:
            self.vocab = pickle.load(open(self.vocab_path, 'rb'))

In [13]:
train_ds = TextClassificationDataset(data_dir, tokenizer, 'train', vocab_path, max_len, min_count)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

valid_ds = TextClassificationDataset(data_dir, tokenizer, 'test', vocab_path, max_len, min_count)
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False)

In [14]:
len(train_ds)

2000

In [15]:
len(valid_ds)

400

In [16]:
train_ds.classes

['neg', 'pos']

In [17]:
train_ds.class_to_index

{'neg': 0, 'pos': 1}

In [20]:
ids, label = train_ds[0]

print(train_ds.classes[label])
print(ids.numpy())

neg
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0    76     7     6   134    41    55  7586
  1409    21     6  5057     4   539    54    22     6   615   138    15
     9     6  1337   476     7  1910   211     4     6 10995  6497   316
     9   655    96    40  2019     3  1119  2731    39     2   940     1
     7    11    16  5607     4   483    11  2834  1910     2   226    69
    22    66   809  1355   854   239    11    47   108   133  1486     4
    68   153    43     2  1032   143    34   655   133     4     2 12830
   421    64   105  1771   313   762     8     

In [21]:
# convert back the sequence of integers into original text
print(' '.join([train_ds.vocab.index2word[i] for i in ids.tolist()]))

<pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane , violent mob by the crazy <unk> of it 's singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it 's better than you might think with some 

In [22]:
# print the original text
print(open(train_ds.text_files[0][0]).read())

Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
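
The padding, trimming, and numericalizing steps of `__getitem__` can be tried in isolation. The following is a minimal sketch in plain Python; `pad_or_trim` and `numericalize` are hypothetical helper names, and the toy vocabulary and counts are made up for illustration.

```python
PAD, UNK = '<pad>', '<unk>'

def pad_or_trim(tokens, max_len):
    """Left-pad short sequences with PAD and cut long ones to max_len."""
    if len(tokens) < max_len:
        return [PAD] * (max_len - len(tokens)) + tokens
    return tokens[:max_len]

def numericalize(tokens, word2index, word2count, min_count):
    """Map tokens to ids; unknown and rare words both map to UNK."""
    ids = []
    for word in tokens:
        if word not in word2index:
            ids.append(word2index[UNK])          # never seen in the corpus
        elif word != PAD and word2count[word] < min_count:
            ids.append(word2index[UNK])          # seen, but too rarely
        else:
            ids.append(word2index[word])
    return ids

# Toy vocabulary (hypothetical counts); ids 0 and 1 are reserved for PAD and UNK.
word2index = {PAD: 0, UNK: 1, 'a': 2, 'story': 3, 'pig': 4}
word2count = {PAD: 0, UNK: 0, 'a': 50, 'story': 12, 'pig': 3}

tokens = pad_or_trim(['a', 'story', 'of', 'a', 'pig'], max_len=8)
ids = numericalize(tokens, word2index, word2count, min_count=10)
print(tokens)
print(ids)
```

Here `of` is unknown and `pig` is rare, so both become `<unk>`, which is what happened to `chantings` in the review above.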


### Vocabulary size

In [23]:
vocab = train_ds.vocab
freqs = [(count, word) for (word, count) in vocab.word2count.items() if count >= min_count]
vocab_size = len(freqs) + 2  # for PAD and UNK tokens
print(f'Vocab size = {vocab_size}')

print('\nMost common words:')
for c, w in sorted(freqs, reverse=True)[:10]:
    print(f'{w}: {c}')

Vocab size = 29512

Most common words:
the: 666690
,: 543308
.: 469782
and: 324154
a: 321797
of: 289312
to: 267959
is: 217022
>: 202239
it: 187957
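
Because `build_vocab` assigns indices in order of decreasing frequency, every word with a count of at least `min_count` receives an index below `vocab_size`, and rarer words are mapped to `<unk>` at lookup time. The sketch below shows a closely related variant that drops rare words from the index altogether, using `collections.Counter`. It is an illustration under that assumption, not the `Vocabulary` class from the lesson, and this `build_vocab` function is a hypothetical stand-alone helper.

```python
from collections import Counter

PAD, UNK = '<pad>', '<unk>'

def build_vocab(tokenized_texts, min_count):
    """Return (word2index, counts), keeping words seen at least min_count times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties are broken alphabetically so the order is reproducible.
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    word2index = {w: i for i, w in enumerate([PAD, UNK] + kept)}
    return word2index, counts

texts = [['the', 'film', 'is', 'good'],
         ['the', 'film', 'is', 'bad'],
         ['the', 'plot', 'is', 'thin']]
word2index, counts = build_vocab(texts, min_count=2)
print(word2index)
print('vocab size =', len(word2index))
```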


## LSTM Classifier

In [24]:
class LSTMClassifier(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, num_classes, batch_size):
        super(LSTMClassifier, self).__init__()
        
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(vocab_size, embed_size) # a lookup table
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, dropout=0.3, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2*hidden_size, 100),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(100, num_classes)
        )
        self.hidden = self.init_hidden()
    
    def init_hidden(self):
        h = to_var(torch.zeros((2*self.num_layers, self.batch_size, self.hidden_size)))
        c = to_var(torch.zeros((2*self.num_layers, self.batch_size, self.hidden_size)))
        return h, c
    
    def forward(self, x):
        x = self.embedding(x)
        x, self.hidden = self.lstm(x, self.hidden)
        x = self.fc(x[-1])  # select the last output
        return x

In [25]:
vocab_size = 2 + len([w for (w, c) in train_ds.vocab.word2count.items() if c >= min_count])
print(vocab_size)

29512
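
It helps to trace the tensor shapes through the classifier. `nn.LSTM` without `batch_first=True` expects input of shape `(seq_len, batch, features)`, while a `DataLoader` batch of ids has shape `(batch, seq_len)`. The sketch below assumes the batch is transposed before the forward pass (the `train` helper is not shown here, so this is an assumption) and uses small, hypothetical sizes.

```python
import torch
import torch.nn as nn

# Small, hypothetical sizes so the shapes are easy to follow.
vocab_size, embed_size, hidden_size, num_classes = 50, 8, 16, 2
seq_len, batch = 5, 4

embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers=1, bidirectional=True)
fc = nn.Linear(2 * hidden_size, num_classes)

# A DataLoader batch is (batch, seq_len); nn.LSTM defaults to (seq_len, batch, features).
ids = torch.randint(0, vocab_size, (batch, seq_len))
x = embedding(ids.t())            # (seq_len, batch, embed_size)
out, (h_n, c_n) = lstm(x)         # out: (seq_len, batch, 2 * hidden_size)
logits = fc(out[-1])              # last time step -> (batch, num_classes)

# Alternative summary: final forward and final backward hidden states.
final = torch.cat([h_n[-2], h_n[-1]], dim=1)   # (batch, 2 * hidden_size)

print(x.shape, out.shape, logits.shape, final.shape)
```

With a bidirectional LSTM, the backward half of `out[-1]` has only seen the last token, so concatenating the two final hidden states from `h_n` is a common alternative to `x[-1]` that may be worth trying.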


## Model

In [26]:
# LSTM parameters
embed_size = 100
hidden_size = 256
num_layers = 1

# training parameters
lr = 0.001
num_epochs = 10

In [27]:
model = LSTMClassifier(embed_size=embed_size, 
                       hidden_size=hidden_size, 
                       vocab_size=vocab_size,
                       num_layers=num_layers,
                       num_classes=train_ds.num_classes, 
                       batch_size=batch_size)

if use_gpu:
    model = model.cuda()

In [28]:
criterion = nn.CrossEntropyLoss()
if use_gpu:
    criterion = criterion.cuda()
    
optimizer = optim.Adam(model.parameters(), lr=lr, betas=(0.7, 0.99))
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.975)
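
`StepLR` with `step_size=1` multiplies the learning rate by `gamma` at every scheduler step. Assuming the `train` helper calls `scheduler.step()` once per epoch (its code is not shown here), the schedule is a simple geometric decay, which this small worked example reproduces in plain Python. The `lr_at_epoch` function is a hypothetical helper, not part of PyTorch.

```python
lr0 = 0.001      # initial learning rate, as in the cell above
gamma = 0.975    # decay factor applied at every scheduler step (step_size=1)

def lr_at_epoch(epoch, lr0=lr0, gamma=gamma, step_size=1):
    """Learning rate in effect after `epoch` scheduler steps under step decay."""
    return lr0 * gamma ** (epoch // step_size)

for epoch in (0, 1, 5, 10):
    print(f'epoch {epoch:2d}: lr = {lr_at_epoch(epoch):.6f}')
```

With these settings the learning rate falls by roughly 22% over 10 epochs, so the decay is gentle.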

In [None]:
# model.load_state_dict(torch.load('models/lstm-3-400-10-256-1024-1-0.85764.pth'))

### Training

In [29]:
hist = train(model, train_dl, valid_dl, criterion, optimizer, scheduler, num_epochs)

[Epoch:   1/ 10] Training Loss: 0.014, Testing Loss: 0.014, Training Acc: 0.507, Testing Acc: 0.497
[Epoch:   2/ 10] Training Loss: 0.013, Testing Loss: 0.014, Training Acc: 0.606, Testing Acc: 0.527
[Epoch:   3/ 10] Training Loss: 0.012, Testing Loss: 0.018, Training Acc: 0.700, Testing Acc: 0.585
[Epoch:   4/ 10] Training Loss: 0.009, Testing Loss: 0.015, Training Acc: 0.797, Testing Acc: 0.570
[Epoch:   5/ 10] Training Loss: 0.007, Testing Loss: 0.017, Training Acc: 0.851, Testing Acc: 0.570
[Epoch:   6/ 10] Training Loss: 0.005, Testing Loss: 0.019, Training Acc: 0.899, Testing Acc: 0.605
[Epoch:   7/ 10] Training Loss: 0.003, Testing Loss: 0.026, Training Acc: 0.941, Testing Acc: 0.562
[Epoch:   8/ 10] Training Loss: 0.001, Testing Loss: 0.041, Training Acc: 0.977, Testing Acc: 0.590
[Epoch:   9/ 10] Training Loss: 0.001, Testing Loss: 0.039, Training Acc: 0.982, Testing Acc: 0.590
[Epoch:  10/ 10] Training Loss: 0.000, Testing Loss: 0.052, Training Acc: 0.993, Testing Acc: 0.605


## Load a trained model

In [None]:
# LSTM parameters
max_len = 400
min_count = 10
embed_size = 256
hidden_size = 1024
num_layers = 1


model = LSTMClassifier(embed_size=embed_size, 
                       hidden_size=hidden_size, 
                       vocab_size=vocab_size,
                       num_layers=num_layers,
                       num_classes=train_ds.num_classes, 
                       batch_size=batch_size)

model.load_state_dict(torch.load('models/tmp/lstm-1-400-10-256-1024-1-0.87964.pth'))

if use_gpu:
    model = model.cuda()

## Visualizing word vectors

<img src='imgs/wordvecs_persian.png' width='80%'/>

In [30]:
word_vecs = model.embedding.weight.data

In [31]:
print(word_vecs.size())

torch.Size([29512, 100])


In [32]:
print(word_vecs[0])


-1.0678
-0.8920
-0.0058
-1.3950
 0.8038
 0.9048
 0.1682
-0.2589
 2.0809
-1.5028
-0.1373
 0.9797
-1.3226
 0.1690
 0.1304
 1.2229
-1.5732
-0.0680
 0.2411
-0.5359
-1.1798
 1.5737
-0.0653
-1.5709
-0.1531
-1.6529
 0.9688
 0.0960
-0.0773
-0.0464
-1.1947
-0.5481
 0.2004
-2.5060
 0.3788
-1.3304
-0.2268
 0.5360
-0.7743
-1.7638
-0.7493
-0.5174
-0.4552
 0.3821
-0.7965
 0.0837
-0.1656
-0.3691
 1.0137
-0.6089
 0.8563
-0.4448
-0.5062
 0.9916
 1.4466
-0.2972
 0.3063
 0.3336
 0.1705
-1.5537
-0.1382
-0.5471
-1.1656
 0.6987
 0.7385
 1.0847
-1.3165
 1.1401
 0.0106
 0.5664
-0.6338
 0.0254
-0.0188
-0.9588
-0.4423
 0.2074
 0.3086
-0.6075
-1.9901
 0.5488
-1.5887
-1.1958
 0.6825
-0.4171
 0.7192
-0.7814
-1.2733
 0.7871
-0.2044
 0.6582
 1.0289
 0.9307
-0.0660
 0.9300
 0.0040
 0.5446
 1.5127
 0.8484
-0.9488
 0.6794
[torch.cuda.FloatTensor of size 100 (GPU 0)]



## Improvements

- Fine-tuning hyper-parameters
- Bidirectional LSTM
- Pre-trained word vectors ([GloVe](https://nlp.stanford.edu/projects/glove/), [FastText](https://code.facebook.com/posts/1438652669495149/fair-open-sources-fasttext/), etc.)
- Dropouts and regularization (https://arxiv.org/pdf/1708.02182.pdf)
- Attention mechanism (https://distill.pub/2016/augmented-rnns/)
- Use GRU or QRNN (https://arxiv.org/pdf/1611.01576.pdf)
- Pre-training with a language model (https://arxiv.org/pdf/1605.07725.pdf)
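
As an illustration of the pre-trained word vectors item, the sketch below fills an embedding table from lines in the GloVe text format (a word followed by its space-separated vector components). The `load_pretrained` function, the toy vocabulary, and the three-dimensional vectors are all hypothetical; a real GloVe file would be opened from disk instead of the in-memory `io.StringIO` used here.

```python
import io
import random

def load_pretrained(lines, word2index, embed_size, seed=0):
    """Build an embedding table from 'word v1 v2 ...' lines (GloVe text format).

    Words missing from the pre-trained file keep a small random vector.
    """
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(embed_size)]
             for _ in range(len(word2index))]
    found = 0
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in word2index and len(values) == embed_size:
            table[word2index[word]] = [float(v) for v in values]
            found += 1
    return table, found

# Hypothetical three-dimensional vectors standing in for a real GloVe file.
fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nzebra 0.5 0.5 0.5\n")
word2index = {'<pad>': 0, '<unk>': 1, 'good': 2, 'bad': 3, 'film': 4}

table, found = load_pretrained(fake_glove, word2index, embed_size=3)
print(found, table[2], table[3])
```

One way to use such a table would be to convert it to a tensor and pass it to `nn.Embedding.from_pretrained`, or to copy it into `model.embedding.weight.data` before training.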

### Attention mechanism

<img src='imgs/attention-example.png' width='90%'/>
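
One simple way to add attention to a classifier like this is to replace "take the last LSTM output" with a learned weighted average over all time steps. The module below is a minimal sketch of that idea with a single linear scoring layer; `AttentionPooling` is a hypothetical name, and this is only one of several attention variants, not necessarily the one shown in the figure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Learned weighted average over time steps (simple learned-score attention)."""

    def __init__(self, feature_size):
        super().__init__()
        self.score = nn.Linear(feature_size, 1)   # one relevance score per time step

    def forward(self, outputs):
        # outputs: (seq_len, batch, feature_size), as returned by nn.LSTM
        scores = self.score(outputs).squeeze(-1)                 # (seq_len, batch)
        weights = F.softmax(scores, dim=0)                       # sums to 1 over time
        pooled = (weights.unsqueeze(-1) * outputs).sum(dim=0)    # (batch, feature_size)
        return pooled, weights

torch.manual_seed(0)
outputs = torch.randn(5, 4, 32)      # hypothetical LSTM outputs
pooled, weights = AttentionPooling(32)(outputs)
print(pooled.shape, weights.sum(dim=0))
```

The returned `weights` can also be inspected to see which tokens the model relied on for a given prediction.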

### QRNN

<img src='imgs/QRNN.png' width='100%'/>

## Programming Exercise: Toxic Texts

Try to reach an accuracy of 90% or higher on the IMDB dataset.

## Further Study

<h6>LSTM</h6> 
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714

<h6>Word Vectors and NLP</h6>
- https://einstein.ai/research/learned-in-translation-contextualized-word-vectors