# Creating a Sentiment Analysis Web App
### PyTorch and AWS SageMaker
_SageMaker, Lambda, API, CloudWatch_

---
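This notebook walks through building a sentiment analysis model for IMDb movie reviews: downloading and preprocessing the data, uploading it to S3, and training an LSTM classifier in PyTorch with AWS SageMaker, with the goal of deploying the trained model behind a web app.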

## Outline
1. [Download the data](#download)
2. [Process and prepare the data](#process)
3. [Upload data to S3](#upload)
4. [Build and train the PyTorch model](#train)
5. [Test the trained model](#test)
6. [Deploy the trained model](#deploy)
7. [Use the deployed model for inference](#use)


<a id='download'></a>
## Download the Data

The notebook and model use the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [3]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2020-08-03 17:35:31--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-08-03 17:35:38 (12.4 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



<a id='process'></a>
## Process and Prepare the Data

---
### Read in Data

In [1]:
# necessary imports 
import os
import glob

In [2]:
def read_imbd_data(data_dir='../data/aclImdb'):
    """ Read in IMDb data from aclImdb folder. Creates data and label dictionaries.
    
        Arguments:
        - data_dir: (str) Directory of the data
        
        Returns:
        - data: (dict) Movie review
        - labels: (dict) Movie review labels
    """
    data = {}
    labels = {}
    
    # create paths to read in review data
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            # join path names
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            # open each review and label. Append to dictionaries and label with binary vars
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                       "{}: data size does not equal {}: label size.".format(data_type, sentiment)
                    
    return data, labels

In [3]:
# read in data and display length of train and test data
data, labels = read_imbd_data()
print("IMDb Reviews: Train --> {} pos / {} neg ... Test --> {} pos / {} neg".format(len(data['train']['pos']),
                                                                                    len(data['train']['neg']),
                                                                                    len(data['test']['pos']),
                                                                                    len(data['test']['neg'])))

IMDb Reviews: Train --> 12500 pos / 12500 neg ... Test --> 12500 pos / 12500 neg


In [4]:
data['train']['pos'][0]

"I didn't know what to make of this film. I guess that is what it was all about really. I have never seen a film like it and I doubt that I really ever will again. Glover puts together something that is unique to him. I think to appreciate it you have to read some of his poetry, maybe see one of his slide shows. I really like this guy, he is just so bizarre I can't help it. Note: I saw this film before it was through its final editing, so maybe what I have seen and what others have seen are different. I will know, I guess, if I choose to view the film again. I think I will have to be properly drug influenced..."

In [5]:
labels['train']['pos'][0]

1

---
### Create Feature and Target Sets
Combine the positive and negative reviews (and their labels) within the training set and within the test set, then shuffle each to create the feature and target sets.
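
The key property of the shuffle step is that reviews and labels are permuted together, so each review keeps its own label. The sketch below illustrates that idea with the standard library only; `shuffle_together` and the toy data are hypothetical stand-ins, not part of the notebook, which uses `sklearn.utils.shuffle` for the same purpose.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with one shared permutation (illustrative helper)."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    paired = list(zip(items, labels))
    random.Random(seed).shuffle(paired)  # shuffle the pairs, not the lists separately
    if not paired:
        return [], []
    shuffled_items, shuffled_labels = zip(*paired)
    return list(shuffled_items), list(shuffled_labels)

# toy data: 1 = positive, 0 = negative
reviews = ["great film", "awful plot", "loved it", "boring"]
sentiments = [1, 0, 1, 0]
shuffled_reviews, shuffled_sentiments = shuffle_together(reviews, sentiments, seed=0)

# every review is still paired with its original label
lookup = dict(zip(reviews, sentiments))
pairs_intact = all(lookup[r] == s for r, s in zip(shuffled_reviews, shuffled_sentiments))
print(pairs_intact)
```

Shuffling the two lists independently would break the pairing, which is why they are passed to the shuffle function in a single call.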

In [6]:
# necessary imports 
from sklearn.utils import shuffle

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

import pickle

In [7]:
def combine_imdb_data(data, labels):
    """ Combine pos and neg reviews from the training and test data 
        dictionaries.
        
        Arguments:
        - data: (dict) Unprocessed reviews
        - labels: (dict) Sentiment label, 1 pos --> 0 neg
        
        Returns:
        - train_X, test_X: features
        - train_y, test_y: targets
    """
    # combine positive and negative reviews and labels
    train_data = data['train']['pos'] + data['train']['neg']
    test_data = data['test']['pos'] + data['test']['neg']
    train_labels = labels['train']['pos'] + labels['train']['neg']
    test_labels = labels['test']['pos'] + labels['test']['neg']
    
    # using sklearn shuffle data
    train_data, train_labels = shuffle(train_data, train_labels)
    test_data, test_labels = shuffle(test_data, test_labels)
    
    return train_data, test_data, train_labels, test_labels

In [8]:
train_X, test_X, train_y, test_y = combine_imdb_data(data, labels)
print("IMDb Data Length: Train data = {}, Test data = {}".format(len(train_X), len(test_X)))

IMDb Data Length: Train data = 25000, Test data = 25000


In [10]:
# take a look at a review and its corresponding label
print(train_X[20], '\n')
print(train_y[20])

I am amazed at the amount of praise that is heaped on this movie by other commentators. To me it was rather a disappointment, especially the combination of historical facts, fantasy and the main character's internal turmoil does not work at all (in Vonnegut's book Slaughterhouse Five and even in George Roy Hill's adaptation for the screen it does). Credibility is often overstretched. Too many questions are left open. Did I miss some central points? Or did I fail to spot the lines that supposedly connect the dots? <br /><br />A boy called Campbell, Jr., grows up in upstate New York. At home his father has many technical trade papers and one book. It has photographs of heaps of dead bodies in it. The boy leafs through the book, his dad doesn't like his doing that. What should this tell me? The family moves away from upstate New York to Berlin. BANG. It is 1938, the boy is a married man in Berlin and a theater playwright. What kind of plays does he write? In what language? Is he successfu

---
### Process Review
Remove the HTML formatting and convert the review into a list of stemmed words with stopwords removed.
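
As a rough, dependency-free illustration of the cleaning steps (strip tags, lowercase, keep only letters and digits, drop stopwords), consider the sketch below. `simple_tokens` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny stand-in, and stemming is omitted; the notebook itself relies on BeautifulSoup and NLTK for these steps.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
TOY_STOPWORDS = {"the", "a", "is", "it", "was", "and"}

def simple_tokens(review):
    """Strip tags, lowercase, keep alphanumerics, and drop toy stopwords."""
    text = re.sub(r"<[^>]+>", " ", review)           # replace HTML tags such as <br /> with a space
    text = re.sub(r"[^a-z0-9]", " ", text.lower())   # replace everything but letters and digits
    return [word for word in text.split() if word not in TOY_STOPWORDS]

tokens = simple_tokens("The plot was thin.<br /><br />It is a 90s movie, and it shows!")
print(tokens)  # ['plot', 'thin', '90s', 'movie', 'shows']
```

Note that the character class must be written with care: a range such as `A-z` also matches `[`, `\`, `]`, `^`, `_` and the backtick, so those characters would survive the substitution.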

In [11]:
def review_to_words(review):
    """ Converts a review string to a list of words. Removes html
        formatting, stopwords and morphological endings of common
        words.
        
        Arguments:
        - review: (str) String of words that make up review
        
        Returns:
        - words: (list) List of processed words in a review
    
    """ 
    nltk.download('stopwords', quiet=True)
    stemmer = PorterStemmer()
    
    text = BeautifulSoup(review, 'html.parser').get_text() # remove html tags
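    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # replace non-alphanumeric characters with spaces
    words = text.split() # split the string into a list of words
    stop_words = set(stopwords.words('english')) # build the stopword set once per review
    words = [word for word in words if word not in stop_words] # remove stopwords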
    words = [stemmer.stem(word) for word in words] # stem words
    
    return words

In [12]:
words = review_to_words(train_X[20])
print(words)

['amaz', 'amount', 'prais', 'heap', 'movi', 'comment', 'rather', 'disappoint', 'especi', 'combin', 'histor', 'fact', 'fantasi', 'main', 'charact', 'intern', 'turmoil', 'work', 'vonnegut', 'book', 'slaughterhous', 'five', 'even', 'georg', 'roy', 'hill', 'adapt', 'screen', 'credibl', 'often', 'overstretch', 'mani', 'question', 'left', 'open', 'miss', 'central', 'point', 'fail', 'spot', 'line', 'supposedli', 'connect', 'dot', 'boy', 'call', 'campbel', 'jr', 'grow', 'upstat', 'new', 'york', 'home', 'father', 'mani', 'technic', 'trade', 'paper', 'one', 'book', 'photograph', 'heap', 'dead', 'bodi', 'boy', 'leaf', 'book', 'dad', 'like', 'tell', 'famili', 'move', 'away', 'upstat', 'new', 'york', 'berlin', 'bang', '1938', 'boy', 'marri', 'man', 'berlin', 'theater', 'playwright', 'kind', 'play', 'write', 'languag', 'success', 'wife', 'actress', 'look', 'glamor', 'parent', 'move', 'back', 'usa', 'invit', 'son', 'grown', 'germani', 'feel', 'german', 'american', 'success', 'wife', 'like', 'life', '

In [13]:
cache_dir = os.path.join("../cache", "sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True) 

def preprocess_data(train_data, test_data, train_labels, test_labels,
                    cache_dir=cache_dir, cache_file='preprocesssed_data.pkl'):
    """ Convert each review to words and read from the cache file if 
        available. 
    
    """
    # if cache file exists try to read from it
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f: # open and read binary file
                cache_data = pickle.load(f)
                print("Reading preprocessed data from cache file: {}".format(cache_file))
        except Exception:
            pass  # cache missing or unreadable: fall through and rebuild it
        
    # if cache data does not exist create it
    if cache_data is None:
        # process data to create list of words for each review
        train_words = [review_to_words(review) for review in train_data]
        test_words = [review_to_words(review) for review in test_data]
        
        # write to cache file if it doesn't exist
        if cache_file is not None:
            cache_data = dict(train_words=train_words, test_words=test_words,
                              train_labels=train_labels, test_labels=test_labels)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file: {}".format(cache_file))
            
    else:
        # unpack data from cache file
        train_words = cache_data['train_words']
        test_words = cache_data['test_words']
        train_labels = cache_data['train_labels']
        test_labels = cache_data['test_labels']
        
    return train_words, test_words, train_labels, test_labels

In [14]:
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Reading preprocessed data from cache file: preprocesssed_data.pkl


In [15]:
len(train_X[20])
print(train_X[20])

['could', 'anyon', 'pleas', 'stop', 'john', 'carpent', 'continu', 'deliber', 'ruin', 'reput', 'low', 'go', 'seem', 'man', 'lost', 'self', 'respect', 'episod', 'look', 'like', 'done', 'film', 'student', 'even', 'worth', 'begin', 'talk', 'bad', 'borefest', 'direct', 'somebodi', 'talent', 'filmmak', 'without', 'motiv', 'come', 'mr', 'carpent', 'pleas', 'retir', 'immedi', 'rest', 'self', 'esteem', 'stop', 'spill', 'trash', 'like', 'bad', 'tradit', 'escap', 'l', 'ghost', 'mar', 'get', 'drunk', 'instead']


---
### Transform the Data
First we create a working vocabulary from the most frequently occurring words in our dataset and drop the words that occur least often. Each review is then fixed in size, with shorter reviews padded with zeros. Fixed-size inputs allow our RNN to train more efficiently in batches.
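
To make the vocabulary step concrete, here is a minimal sketch on a toy corpus. `toy_vocab` and `toy_reviews` are hypothetical names; the sketch assumes the same convention as the notebook, where integer ids start at 2 so that 0 and 1 stay free for "no word" and "infrequent word".

```python
from collections import Counter

def toy_vocab(reviews, vocab_size):
    """Map the most frequent words to integer ids starting at 2 (illustrative helper)."""
    counts = Counter(word for review in reviews for word in review)
    # keep vocab_size - 2 words, because ids 0 and 1 are reserved
    frequent = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(frequent)}

toy_reviews = [
    ["movi", "good", "movi"],
    ["film", "good", "movi"],
    ["bad", "film", "good", "movi"],
]
toy_dict = toy_vocab(toy_reviews, vocab_size=5)
print(toy_dict)  # {'movi': 2, 'good': 3, 'film': 4}
```

With `vocab_size=5` only three words receive their own id; the least frequent word, `'bad'`, is left out and would later be encoded as the infrequent-word id.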

In [16]:
# necessary imports
import numpy as np
from collections import Counter

In [17]:
def build_dict(data, vocab_size=5000):
    """ Construct and return a dictionary mapping each of the most frequently 
        appearing words to a unique integer.
        
        Arguments:
        - data: preprocessed reviews
        - vocab_size: (int) size of vocabulary
        
        Returns:
        - word_dict: dictionary of vocabulary mappings
    """
    # count and sort the words 
    word_count = Counter(np.concatenate(data, axis=0))
    sorted_vocab = sorted(word_count, key=word_count.get, reverse=True)
    
    # create a word dictionary
    word_dict = {word: idx+2 for idx, word in enumerate(sorted_vocab[:vocab_size-2])}
    
    return word_dict

Take a look at the dictionary to make sure everything looks good. 

In [18]:
word_dict = build_dict(train_X)
# print(word_dict)

In [19]:
most_freq = [key for idx, (key, val) in enumerate(word_dict.items()) if idx < 5]
print(most_freq)

['movi', 'film', 'one', 'like', 'time']


### Save `word_dict`

Later, when we construct an endpoint that processes a submitted review, we will need the `word_dict` created above, so we save it to a file now for future use.
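
The save-now, load-later pattern can be checked with a quick round trip. The sketch below uses a temporary directory and a toy dictionary (`toy_word_dict` is a hypothetical stand-in for the real `word_dict`), so it does not touch the notebook's data folder.

```python
import os
import pickle
import tempfile

toy_word_dict = {"movi": 2, "film": 3}  # stand-in for the notebook's word_dict

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word_dict.pkl")

    # save: pickle files must be opened in binary mode
    with open(path, "wb") as f:
        pickle.dump(toy_word_dict, f)

    # load: what code consuming the file would do later
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_word_dict)
```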

In [20]:
data_dir = '../data/pytorch' # folder that will store the data
if not os.path.exists(data_dir): # check if the folder exists
    os.makedirs(data_dir)

In [21]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the Reviews
Convert the reviews into their integer sequence representation. Words that are not in `word_dict` are encoded as `1` (infrequent word), and reviews shorter than 500 words are padded with `0` (no word). Longer reviews are truncated to 500 words.
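
A small worked example may help show what the encoding produces. The sketch below uses a pad length of 5 instead of 500 and a toy dictionary; `encode` and `toy_dict` are hypothetical names used only for illustration.

```python
NOWORD, INFREQ = 0, 1  # reserved ids: padding and out-of-vocabulary words

def encode(review, word_dict, pad):
    """Return (padded id list, true length capped at pad) for one review."""
    ids = [word_dict.get(word, INFREQ) for word in review[:pad]]  # truncate, then map
    return ids + [NOWORD] * (pad - len(ids)), len(ids)

toy_dict = {"movi": 2, "film": 3, "good": 4}

short_ids, short_len = encode(["good", "movi", "zzz"], toy_dict, pad=5)
long_ids, long_len = encode(["film"] * 8, toy_dict, pad=5)

print(short_ids, short_len)  # [4, 2, 1, 0, 0] 3
print(long_ids, long_len)    # [3, 3, 3, 3, 3] 5
```

The returned length records how many positions hold real words, which is why it is stored alongside each padded review.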

In [22]:
review_lens = Counter([len(review) for review in train_X])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Max length review: {}".format(max(review_lens)))
print(len(train_X))

Zero-length reviews: 0
Max length review: 1429
25000


In [23]:
def convert_and_pad(word_dict, review, pad=500):
    NOWORD = 0
    INFREQ = 1
    current_review = [NOWORD] * pad
        
    for idx, word in enumerate(review[:pad]):
        if word in word_dict:
            current_review[idx] = word_dict[word]
        else:
            current_review[idx] = INFREQ
    
    return current_review, min(len(review), pad)

def convert_reviews(word_dict, data, pad=500):
    """ Convert each review to an integer sequence representation. Truncate
        to `pad` words, pad shorter reviews with 0s and map infrequent words to 1.
        
        Arguments:
        - word_dict: (dict) word mapping dictionary
        - data: reviews
        - pad: (int) length to truncate to
        
        Returns:
        - result: (np.ndarray) integer-encoded, padded reviews
        - lengths: (np.ndarray) length of each review, capped at `pad`
    """    
    result = []
    lengths = []
    
    for review in data:
        current_review, leng = convert_and_pad(word_dict, review, pad)
        result.append(current_review)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [24]:
train_X, train_X_len = convert_reviews(word_dict, train_X)
test_X, test_X_len = convert_reviews(word_dict, test_X)

In [25]:
print(train_X[20], '\n')
print(train_X_len[20])

[  36  181  444  368  227 2516  471 2471  962 2055  295   25   39   55
  359  464  616  182   19    5  143    3  745   14  218  159  241   24
    1   97 1623  310  585  129 1128   45  326 2516  444 2579  997  302
  464    1  368 4963  976    5   24 1087  640 1224  837 2710   10 1562
  235    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

<a id='upload'></a>
## Upload the Data to S3
Save the data locally first and upload it to S3 afterwards. Each row of the saved CSV must follow the column order `label`, `length`, `review`.
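
To illustrate the row layout, the sketch below writes one toy row in `label`, `length`, `review` order and splits it back into its parts. The values are made up, and the standard `csv` module is used here only to keep the example self-contained; the notebook writes the file with pandas.

```python
import csv
import io

# toy values: a positive review of true length 3, padded to 5 ids
label, length, review_ids = 1, 3, [4, 2, 1, 0, 0]

# write one row: label first, then length, then the padded review
buffer = io.StringIO()
csv.writer(buffer).writerow([label, length] + review_ids)

# read it back and split the columns by position
buffer.seek(0)
row = [int(value) for value in next(csv.reader(buffer))]
parsed_label, parsed_length, parsed_review = row[0], row[1], row[2:]

print(parsed_label, parsed_length, parsed_review)  # 1 3 [4, 2, 1, 0, 0]
```

Because the file has no header, any code that reads it has to rely on this column order.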

### Save and process locally

In [26]:
# necessary imports
import pandas as pd
import sagemaker

In [27]:
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

### Upload the data
Create a SageMaker session, role, and bucket. Upload the data to the default S3 bucket.


In [30]:
# This is an object that represents the SageMaker session that we are currently operating in. This
# object contains some useful information that we will need to access later such as our region.
session = sagemaker.Session()

bucket = session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

# This is an object that represents the IAM role that we are currently assigned. When we construct
# and launch the training job later we will need to tell it what IAM role it should have. Since our
# use case is relatively simple we will simply assign the training job the role we currently have.
role = sagemaker.get_execution_role()

In [31]:
training_data = session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

<a id='train'></a>
## Build and Train the PyTorch Model

In [32]:
!pygmentize train/model.py

# PROGRAMMER: Justin Bellucci
# DATE CREATED: 07_31_2020
# REVISED DATE:

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """ LSTM based RNN to perform sentiment analysis

    """
    def __init__(self, embedding_dim, hidden_dim, vocab_size, n_layers=2):
        """ Initialize the model by setting up the various
            layers.
        """
        super(LSTMClassifier, self).__init__()

        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

Before training on a GPU instance in SageMaker, we load a small sample of the data and run the model locally. This makes it easier to catch mistakes early, before launching a full training job.

In [33]:
# necessary imports
import torch
import torch.utils.data
import torch.nn as nn

import torch.optim as optim


In [71]:
# %load_ext autoreload
# %autoreload 2

from train.model import LSTMClassifier

In [72]:
batch_size = 50

# read in the first 250 rows
train_sample_df = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# turn the Pandas DF into Tensors. Labels are in column 0; the remaining columns
# hold the review length followed by the padded review.
train_sample_y = torch.from_numpy(train_sample_df[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample_df.drop([0], axis=1).values).long()

# build dataset using TensorDataset() from PyTorch
train_sample_dataset = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)

# build the dataloader using DataLoader() from PyTorch
train_sample_loader = torch.utils.data.DataLoader(train_sample_dataset, batch_size=batch_size)

### Training with the small sample dataset

In [73]:
def train_sample(model, train_loader, epochs, optimizer, criterion, device):
    """ Train a sample dataset in Jupyter notebook.
    """
    for e in range(epochs):
        model.train() # put model in training mode
        total_loss = 0
#         print(batch_size)
        h = model.init_hidden(batch_size, device)
        
        for batch in train_loader:
            batch_X, batch_y = batch
            
            # move to GPU if available
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            h = tuple([each.data for each in h])
            
            # train the model
            optimizer.zero_grad() # zero gradients
            out, h = model(batch_X, h)
            loss = criterion(out, batch_y)
            loss.backward()
            
#             nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(e+1, total_loss/len(train_loader)))

In [74]:
embedding_dim = 42
hidden_dim = 100
vocab_size = 5000
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using {} to train...".format(device))

model = LSTMClassifier(embedding_dim, hidden_dim, vocab_size).to(device)
optimizer = optim.Adam(model.parameters())
criterion = torch.nn.BCELoss()

train_sample(model, train_sample_loader, epochs, optimizer, criterion, device)

Using cpu to train...
torch.Size([5000, 42])
torch.Size([2, 50, 100])
Epoch: 1, BCELoss: 0.6929988503456116


## Train the Model

In [38]:
# necessary imports
from sagemaker.pytorch import PyTorch

In [83]:
pytorch_estimator = PyTorch(entry_point="train.py",
                            source_dir="train",
                            role=role,
                            framework_version='1.0.0',
                            train_instance_count=1,
                            train_instance_type='ml.p2.xlarge',
                            hyperparameters={'epochs': 15,
                                             'hidden_dim': 300})

In [84]:
pytorch_estimator.fit({'training': training_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-08-09 20:46:20 Starting - Starting the training job...
2020-08-09 20:46:21 Starting - Launching requested ML instances......
2020-08-09 20:47:31 Starting - Preparing the instances for training......
2020-08-09 20:48:40 Downloading - Downloading input data...
2020-08-09 20:48:58 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-08-09 20:49:59,531 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-08-09 20:49:59,559 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-08-09 20:49:59,561 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-08-09 20:49:59,792 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2020-08-09 20:49:59,792 sagemaker-containers INFO

2020-08-09 21:06:36,903 sagemaker-containers INFO     Reporting training SUCCESS

2020-08-09 21:06:46 Uploading - Uploading generated training model
2020-08-09 21:06:46 Completed - Training job completed
Training seconds: 1086
Billable seconds: 1086


<a id='deploy'></a>
## Deploy the Trained Model

<a id='use'></a>
## Use the Deployed Model for Inference