# Sentiment Analyzer

In this tutorial, we will build a LSTM Model that can classify a text sentence as expressing either positive sentiment or negative sentiment.

Note: This notebook has been borrowed from [GLUON-NLP Tutorials](https://gluon-nlp.mxnet.io/examples/sentiment_analysis/sentiment_analysis.html) and enhanced for this workshop

In [1]:
%%bash

# Install nltk
/home/ec2-user/anaconda3/envs/mxnet_p36/bin/pip install nltk==3.2.5 -U --quiet

# Download Spacy resources for English Language Model
sudo /home/ec2-user/anaconda3/envs/mxnet_p36/bin/python -m spacy download en

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm: started
    Running setup.py install for en-core-web-sm: finished with status 'done'
Successfully installed en-core-web-sm-2.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/en_core_web_sm
-->
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


mxnet-cu90mkl 1.4.0.post0 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.16.3 which is incompatible.
You are using pip version 10.0.1, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 10.0.1, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


### the usual: Import libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

import random
import time
import multiprocessing as mp
import numpy as np

import mxnet as mx
from mxnet import nd, gluon, autograd

import gluonnlp as nlp

from multiprocessing import cpu_count
CPU_COUNT = cpu_count()    

In [3]:
# Device Context in which to train the model. 
# Use mx.cpu() for CPU and mx.gpu(0), 0 is the index of GPU - we have 1 GPU on this instance, so we will set 
# context to mx.gpu(0)

context = mx.gpu(0)

### The Dataset: IMDb
To build the model We will use Stanford's Large Movie Review Dataset from IMDb as the data set for sentiment classification[1]. 
This data set is divided into two data sets for training and testing purposes, 
each containing 25,000 movie reviews downloaded from IMDb. 

In each data set, the number of comments labeled as "positive" and "negative" is equal.

**[Gluon-NLP](http://gluon-nlp.mxnet.io/)** provides many of the standard benchmark datasets used for research, we will download both the `train` and `test` IMDb datasets 

### Preprocessing Input

#### 1. Load the train and test IMDB movie review dataset from gluon-nlp

In [4]:
train_dataset, test_dataset = [nlp.data.IMDB(root='data/imdb', segment=segment)
                               for segment in ('train', 'test')]

Downloading data/imdb/train.json from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/imdb/train.json...
Downloading data/imdb/test.json from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/imdb/test.json...


In [5]:
!ls "data/imdb"

test.json  train.json


In [6]:
! head -c 1000 "data/imdb/train.json"

[["Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!", 9], ["Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everythin

### Download Model, Embedding and Vocabulary from Gluon-NLP
Before we proceed further to process our dataset, lets firt lets download the Vocabulary, Embedding and the model to use in our pre-processing and training

In [7]:
model, vocab = nlp.model.get_model(name='standard_lstm_lm_200',
                                      dataset_name='wikitext-2',
                                      pretrained=True,
                                      ctx=context,
                                      dropout=0)

Vocab file is not found. Downloading.
Downloading /home/ec2-user/.mxnet/models/wikitext-2-be36dc52.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/wikitext-2-be36dc52.zip...
Downloading /home/ec2-user/.mxnet/models/standard_lstm_lm_200_wikitext-2-b233c700.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/standard_lstm_lm_200_wikitext-2-b233c700.zip...


In [8]:
model

StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)

#### 2. Tokenize Sentences Into words

We use the Spacy tokenizer to split the sentences into words which are the tokens for

In [9]:
tokenizer = nlp.data.SpacyTokenizer('en')

#### 3. Clip the input sample(each review)

Since it needs a whole lot of memory to learn long sentences, we will clip the review to at most 500 words

In [10]:
seq_len = 500

In [11]:
length_clip = nlp.data.ClipSequence(seq_len)

#### 4. Preprocessing

We need to do some pre-processing of the input data, specifically we want to categorize our labels which have a review score of > 5 as positive and <=5 as negative ie., our `label=1` else `label=0` 

We will also convert our tokenized sequence using a vocabulary to indexes that can be fed to the network. The below routine will do the preprocessing job on each data sample

In [12]:
def preprocess(x):
    data, label = x
    label = int(label > 5)
    # Tokenize the data
    tokenized_data = tokenizer(data)
    # Clip the tokens
    tokenized_clipped_data = length_clip(tokenized_data)
    # Get vocabulary indexes for the tokens. Use pre-loaded 'vocab'.
    data = vocab[tokenized_clipped_data]

    return data, label

def get_length(x):
    return float(len(x[0]))

def preprocess_dataset(dataset, vocab):
    
    with mp.Pool(CPU_COUNT) as pool:
        # Each sample is processed in an asynchronous manner.
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.SimpleDataset(pool.map(get_length, dataset))
    
    return dataset, lengths

5. #### Preprocess Datasets

In [13]:
# Preprocess the dataset
print("Preparing Train dataset. This will take few minutes...")
train_dataset, train_data_lengths = preprocess_dataset(train_dataset, vocab)

print("Preparing Test dataset. This will take few minutes...")
test_dataset, test_data_lengths = preprocess_dataset(test_dataset, vocab)

print("Data is ready!!!")

Preparing Train dataset. This will take few minutes...
Preparing Test dataset. This will take few minutes...
Data is ready!!!


6. #### Batchify, Samplers and DataLoaders

##### 6.1 Batchify
we will need to create mini-batches from the sequences of data we have. Gluon-NLP provides batchify function for a given sequence_length and batch_size.   

The batchify function creates batches so that the states of the previous batch connects to the hidden state of the current batch.

we will use the batchify function in the data loader that feeds the model Training process. 

In [14]:
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(axis=0, ret_length=True), 
    nlp.data.batchify.Stack(dtype='float32'))

##### 6.2 Sampler
Since the inputs are of different lengths(not necessarily all are > seq_len), 
its not ideal to pad empty on shorter sequences.
    
we will use FixedBucketSampler which creates multiple buckets(based on a num buckets and a ratio) of different lengths and 
assigns each data sample to a fixed bucket based on its length.

In [15]:
batch_size = 32
bucket_num, bucket_ratio = 10, 0.2

In [16]:
batch_sampler = nlp.data.sampler.FixedBucketSampler(
        train_data_lengths,
        batch_size=batch_size,
        num_buckets=bucket_num,
        ratio=bucket_ratio,
        shuffle=True)

In [17]:
print(batch_sampler.stats())

FixedBucketSampler:
  sample_num=25000, batch_num=779
  key=[59, 108, 157, 206, 255, 304, 353, 402, 451, 500]
  cnt=[591, 1999, 5092, 5108, 3035, 2084, 1476, 1164, 871, 3580]
  batch_size=[54, 32, 32, 32, 32, 32, 32, 32, 32, 32]


##### 6.3 Data Loaders:
* Finally we will pad sequences which are not of the bucket lenght 
* stack data, labels and length of data for each datasample

We feed all these methods to the DataLoader object which applies the above preprocess steps every iteration before feeding the input of the network. We will iterate over the data one batch at a time

In [18]:
train_dataloader = gluon.data.DataLoader(
    dataset=train_dataset,
    batch_sampler=batch_sampler,
    batchify_fn=batchify_fn, num_workers=CPU_COUNT)

test_dataloader = gluon.data.DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    shuffle=False,
    batchify_fn=batchify_fn, num_workers=CPU_COUNT)

### Building the Model with Pre-trained Model

We will using a techique called [Transfer Learning](http://cs231n.github.io/transfer-learning/) to build this model,
where we use a pre-trained model (in this case a Language Model), 
and a pre-trained embedding to represent our input data. 

The model we use here is a standard [standard_lstm_lm_200](https://gluon-nlp.mxnet.io/api/model.html#gluonnlp.model.standard_lstm_lm_200) 

This is a 2-layer standard LSTM with 200 hidden units and an embedding size of 200.

The model and the Embedding has been trained on the **WikiText-2** dataset which consists of around 2 million words extracted from Wikipedia articles.

### The Network

**Encoder:** We will use the pre-trained Embedding, the pre-trained Model(which is a language Model) as an encoder.

**MeanPool:** We will take the LSTM representation of the LSTM and average the predictions across timestamps (since we don't need output at every step), just at the end of the sequence to know the sentiment).

**Output:** The aggregated output is fed to a Dense Layer followed by Softmax to get an probabilty across the 2 classes (positive and negative)

#### MeanPoolLayer

MeanPoolLayer aggregates output across timesteps.

In [19]:
class MeanPoolingLayer(gluon.HybridBlock):
    """A block for mean pooling of encoder features"""
    def __init__(self, prefix=None, params=None):
        super(MeanPoolingLayer, self).__init__(prefix=prefix, params=params)

    def hybrid_forward(self, F, data, valid_length):
        masked_encoded = F.SequenceMask(data,
                                        sequence_length=valid_length,
                                        use_sequence_length=True)
        agg_state = F.broadcast_div(F.sum(masked_encoded, axis=0),
                                    F.expand_dims(valid_length, axis=1))
        return agg_state

#### Model
The Model consists of an Embedding, Encoder (LSTM), MeanPool and a Dense layer

![Sentiment Analyzer](network.png)

gluon.nn.Dense?

In [20]:
class SentimentNet(gluon.HybridBlock):
    """Network for sentiment analysis."""
    def __init__(self, prefix=None, params=None):
        super(SentimentNet, self).__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.embedding = None # will set with lm embedding later
            self.encoder = None # will set with lm encoder later
            self.agg_layer = MeanPoolingLayer()
            self.output = gluon.nn.HybridSequential()
            with self.output.name_scope():
                self.output.add(gluon.nn.Dense(1, flatten=False))

    def hybrid_forward(self, F, data, valid_length): 
        embedded = self.embedding(data)
        encoded = self.encoder(embedded)
        agg_state = self.agg_layer(encoded, valid_length)
        out = self.output(agg_state)
        return out

#### Initialize Network with Pretrained Weights

In [21]:
net = SentimentNet()

# Use Pretrained Embeddings from wikitext-2
net.embedding = model.embedding

# Use Pretrained Encoder states (LSTM) from wikitext-2
net.encoder = model.encoder

#net.hybridize()

# Random initialize the last Dense Laywer
net.output.initialize(mx.init.Xavier(), ctx=context)
print(net)

SentimentNet(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2)
  (agg_layer): MeanPoolingLayer(
  
  )
  (output): HybridSequential(
    (0): Dense(None -> 1, linear)
  )
)


#### Hyperparameters

In [22]:
learning_rate = 0.005
epochs = 3

#### Evaluation Function

In [23]:
def evaluate(net, dataloader, context):
    loss = gluon.loss.SigmoidBCELoss()
    total_L = 0.0
    total_sample_num = 0
    total_correct_num = 0
    for i, ((data, valid_length), label) in enumerate(dataloader):

        data = mx.nd.transpose(data.as_in_context(context))
        valid_length = valid_length.as_in_context(context).astype(np.float32)
        label = label.as_in_context(context)
        
        output = net(data, valid_length)
        
        L = loss(output, label)
        
        pred = (output > 0.5).reshape(-1)
        total_L += L.sum().asscalar()
        total_sample_num += label.shape[0]
        total_correct_num += (pred == label).sum().asscalar()
    avg_L = total_L / float(total_sample_num)
    acc = total_correct_num / float(total_sample_num)
    return avg_L, acc

#### Training the Network

In [24]:
def train(net, context, epochs):
    # Use Follow the Moving Leader Optimizer - [7]
    trainer = gluon.Trainer(net.collect_params(), 'ftml',
                            {'learning_rate': learning_rate})
    loss = gluon.loss.SigmoidBCELoss()
    parameters = net.collect_params().values()
    start_train_time = time.time()
    print("Training the Sentiment Classification Model...")
    for epoch in range(epochs):
        epoch_L = 0.0
        start_epoch_time = time.time()        
        epoch_sent_num = 0
        for i, ((data, length), label) in enumerate(train_dataloader):
            L = 0
            with autograd.record():
                output = net(data.as_in_context(context).T,
                             length.as_in_context(context)
                                   .astype(np.float32))

                L = L + loss(output, label.as_in_context(context)).mean()
            
            L.backward()
            
            trainer.step(1)
            
            epoch_sent_num += data.shape[1]
            epoch_L += L.asscalar()
    
        print('Train Avg Loss: {:.6f}'.format(epoch_L / epoch_sent_num))
        
        test_avg_L, test_acc = evaluate(net, test_dataloader, context)
        print('Test Accuracy: {:.2f}, Test Avg Loss: {:.6f}'.format(test_acc, test_avg_L))

        print('[Epoch {}] time cost: {:.2f}'.format(epoch, time.time()-start_epoch_time))

In [25]:
# Train the model
train(net, context, epochs)

Training the Sentiment Classification Model...
Train Avg Loss: 0.001514
Test Accuracy: 0.87, Test Avg Loss: 0.319959
[Epoch 0] time cost: 48.52
Train Avg Loss: 0.000690
Test Accuracy: 0.86, Test Avg Loss: 0.318548
[Epoch 1] time cost: 41.65
Train Avg Loss: 0.000277
Test Accuracy: 0.85, Test Avg Loss: 0.404589
[Epoch 2] time cost: 41.73


##### Testing Time

**The Inventor: Out for Blood in Silicon Valley**

In [26]:
review1 = "I had read John Carreyrou's fine Wall Street Journal articles, \
as well as his thrilling book, Bad Blood, before seeing this documentary tonight. \
The first half of the documentary seems almost worshipful of Elizabeth Holmes, \
building up her mystique and putting her unique ability to attract doting followers \
to her message on display. Quite a lot of time is spent gazing into those big blue, \
unblinking eyes. By the time we get around to the cracks in the facade, \
we are more than an hour into the film. \
It is inevitable that a lot of important background was left out: \
the climate of constant firings that went on for years, the fact that \
Sunny and Elizabeth met when he was 38 (and married) and she was 19, \
that Elizabeth's dad had been a VP at Enron, etc. \
Mostly I would have appreciated a little more specific information on why the Edison machine failed. \
The examples given in the film don't seem that unsolvable, \
but I know from the book that there were some basic issues that simply \
couldn't be dreamed away owing to the tiny sample sizes from the finger pricks. \
Tyler Shultz comes off as a happy-go-lucky guy, but in fact he is one of the heroes of this story. \
It is not mentioned in this film, but not just his grandfather former Secretary of State and Theranos \
board member George Schultz, but also his parents flipped out when he told them he was quitting the company. \
His bravery in standing up for his values is truly admirable in one so young, \
especially considering the immense pressure he came under. To his parents' credit, \
they came around and ended up mortgaging their home to pay his legal bills. \
Ultimately, though, the story gets Elizabeth right: she is a zealot who is deaf to any naysayers, \
even to this day. The cautionary tale for the rest of us, is are we George Shultz or Tyler Shultz?\
Are we willing to see the truth and make a difficult decision, or are we too invested to be willing \
to give up on something we had believed in?"

[**review2**](https://www.imdb.com/review/rw4739739/?ref_=tt_urv)

In [27]:
review2 = "The beginning seems to draw a parallel between Holmes and Edison. \
His for lying for four years to investors about incandescent lightbulbs, hers \
for a one drop of blood test. Not accurate. She allowed people to make personal, \
medical decisions, but all he was doing was putting light in people's homes. \
They are completely different and not even in the same sphere.As for the rest, \
I got sick of an almost worship of this selfish woman. \
I agree with another about how major factors that went on at that company were left out of this documentary. \
The fact that her lawyers were threatening, stalking, and tapping her former employees, nearly crossed, \
if not crossed, the line in to harassment. If you want the real story, and not a lot of fluff, \
and an almost hero worship, read 'Bad Blood,' by John Carreyrou."

In [28]:
review1 = tokenizer(review1)
review2 = tokenizer(review2)
print('review1 len:{}'.format(len(review1)))
print('review2 len:{}'.format(len(review2)))

review1 len:400
review2 len:168


In [29]:
review1 = [vocab[word] for word in review1]
review2 = [vocab[word] for word in review2]

In [30]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(review1, ctx=context),
            shape=(-1, 1)), mx.nd.array([4], ctx=context)).sigmoid()
print("Sentiment - ", 'positive' if prob1[0] > 0.5 else 'negative')

Sentiment -  positive


In [31]:
prob2 = net(
            mx.nd.reshape(
            mx.nd.array(review2, ctx=context),
            shape=(-1, 1)), mx.nd.array([5], ctx=context)).sigmoid()
print("Sentiment - ", 'positive' if prob2[0] > 0.5 else 'negative')

Sentiment -  negative


**Check it out yourselves :**    

[**review 1**](https://www.imdb.com/review/rw4732229/?ref_=tt_urv)

[**review2**](https://www.imdb.com/review/rw4739739/?ref_=tt_urv)