# Creating a Sentiment Analysis Web App
### PyTorch and AWS SageMaker
_SageMaker, Lambda, API, CloudWatch_

---
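This notebook builds a sentiment analysis web app on the IMDb movie review dataset. It downloads and preprocesses the reviews, then trains, tests, and deploys a PyTorch model with AWS SageMaker so that the deployed model can be used for inference.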

## Outline
1. [Download the data](#download)
2. [Process and prepare the data](#process)
3. [Upload data to S3](#upload)
4. [Train a model](#train)
5. [Test the trained model](#test)
6. [Deploy the trained model](#deploy)
7. [Use the deployed model for inference](#use)


<a id='download'></a>
## Download the Data

The notebook and model use the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [3]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2020-08-03 17:35:31--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-08-03 17:35:38 (12.4 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



<a id='process'></a>
## Process and Prepare the Data

---
### Read in Data

In [4]:
# necessary imports 
import os
import glob

In [13]:
def read_imbd_data(data_dir='../data/aclImdb'):
    """ Read in IMDb data from aclImdb folder. Creates data and label dictionaries.
    
        Arguments:
        - data_dir: (str) Directory of the data
        
        Returns:
        - data: (dict) Movie reviews
        - labels: (dict) Movie review labels
    """
    data = {}
    labels = {}
    
    # create paths to read in review data
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            # join path names
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            # read each review and append it to the data dictionary with a binary label (1 = pos, 0 = neg)
            for f in files:
                with open(f, encoding='utf-8') as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                       "{}/{}: data size does not equal label size.".format(data_type, sentiment)
                    
    return data, labels

In [14]:
# read in data and display length of train and test data
data, labels = read_imbd_data()
print("IMDb Reviews: Train --> {} pos / {} neg ... Test --> {} pos / {} neg".format(len(data['train']['pos']),
                                                                                    len(data['train']['neg']),
                                                                                    len(data['test']['pos']),
                                                                                    len(data['test']['neg'])))

IMDb Reviews: Train --> 12500 pos / 12500 neg ... Test --> 12500 pos / 12500 neg


In [15]:
data['train']['pos'][0]

"I didn't know what to make of this film. I guess that is what it was all about really. I have never seen a film like it and I doubt that I really ever will again. Glover puts together something that is unique to him. I think to appreciate it you have to read some of his poetry, maybe see one of his slide shows. I really like this guy, he is just so bizarre I can't help it. Note: I saw this film before it was through its final editing, so maybe what I have seen and what others have seen are different. I will know, I guess, if I choose to view the film again. I think I will have to be properly drug influenced..."

In [20]:
labels['train']['pos'][0]

1

---
### Create Feature and Target Sets
Combine the positive and negative reviews within the training set and within the test set, then shuffle each set to create the feature and target sets.

In [19]:
# necessary imports 
from sklearn.utils import shuffle

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

import pickle

In [44]:
def combine_imdb_data(data, labels):
    """ Combine pos and neg reviews from the training and test data 
        dictionaries.
        
        Arguments:
        - data: (dict) Unprocessed reviews
        - labels: (dict) Sentiment labels, 1 = pos, 0 = neg
        
        Returns:
        - train_X, test_X: features
        - train_y, test_y: targets
    """
    # combine positive and negative reviews and labels
    train_data = data['train']['pos'] + data['train']['neg']
    test_data = data['test']['pos'] + data['test']['neg']
    train_labels = labels['train']['pos'] + labels['train']['neg']
    test_labels = labels['test']['pos'] + labels['test']['neg']
    
    # using sklearn shuffle data
    train_data, train_labels = shuffle(train_data, train_labels)
    test_data, test_labels = shuffle(test_data, test_labels)
    
    return train_data, test_data, train_labels, test_labels

In [45]:
train_X, test_X, train_y, test_y = combine_imdb_data(data, labels)
print("IMDb Data Length: Train data = {}, Test data = {}".format(len(train_X), len(test_X)))

IMDb Data Length: Train data = 25000, Test data = 25000


In [46]:
# take a look at a review and its corresponding label
print(train_X[20], '\n')
print(train_y[20])

Could anyone please stop John Carpenter from continuously and deliberately ruining his reputation? How low can you go? It seems this man has lost any self respect.<br /><br />This episode looks like it has been done by a film student, it isn't even worth beginning to talk about WHAT was bad, because it was just a borefest, directed by somebody with no talent as a filmmaker or without any motivation...<br /><br />Come on, Mr. Carpenter, please retire immediately with a rest of self-esteem and stop spilling out trash like this in a bad tradition from Escape from L.A. to Ghosts of Mars.<br /><br />Get drunk instead. 

0


---
### Process Review
Remove the HTML formatting and convert the review into a list of words.

In [47]:
def review_to_words(review):
    """ Converts a review string to a list of words. Removes html
        formatting, stopwords and morphological endings of common
        words.
        
        Arguments:
        - review: (str) String of words that make up review
        
        Returns:
        - words: (list) List of processed words in a review
    
    """ 
    nltk.download('stopwords', quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english')) # build the set once for fast lookups
    
    text = BeautifulSoup(review, 'html.parser').get_text() # remove html tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # replace non-alphanumeric characters with spaces
    words = text.split() # split the string into a list of words
    words = [word for word in words if word not in stop_words] # remove stopwords
    words = [stemmer.stem(word) for word in words] # stem words
    
    return words

In [48]:
words = review_to_words(train_X[20])
print(words)

['could', 'anyon', 'pleas', 'stop', 'john', 'carpent', 'continu', 'deliber', 'ruin', 'reput', 'low', 'go', 'seem', 'man', 'lost', 'self', 'respect', 'episod', 'look', 'like', 'done', 'film', 'student', 'even', 'worth', 'begin', 'talk', 'bad', 'borefest', 'direct', 'somebodi', 'talent', 'filmmak', 'without', 'motiv', 'come', 'mr', 'carpent', 'pleas', 'retir', 'immedi', 'rest', 'self', 'esteem', 'stop', 'spill', 'trash', 'like', 'bad', 'tradit', 'escap', 'l', 'ghost', 'mar', 'get', 'drunk', 'instead']


In [60]:
cache_dir = os.path.join("../cache", "sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True) 

def preprocess_data(train_data, test_data, train_labels, test_labels,
                    cache_dir=cache_dir, cache_file='preprocesssed_data.pkl'):
    """ Convert each review to words and read from the cache file if 
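        available. Otherwise process the reviews and write the result
        to the cache file.
        
        Arguments:
        - train_data, test_data: (list) Unprocessed reviews
        - train_labels, test_labels: (list) Sentiment labels
        - cache_dir: (str) Directory of the cache file
        - cache_file: (str) Name of the cache file, None to skip caching
        
        Returns:
        - train_words, test_words: (list) Processed reviews
        - train_labels, test_labels: (list) Sentiment labels
    """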
    # if cache file exists try to read from it
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f: # open and read binary file
                cache_data = pickle.load(f)
                print("Reading preprocessed data from cache file: {}".format(cache_file))
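        except (OSError, EOFError, pickle.UnpicklingError):
            pass # cache file missing or unreadable, so rebuild it below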
        
    # if cache data does not exist create it
    if cache_data is None:
        # process data to create list of words for each review
        train_words = [review_to_words(review) for review in train_data]
        test_words = [review_to_words(review) for review in test_data]
        
        # write to cache file if it doesn't exist
        if cache_file is not None:
            cache_data = dict(train_words=train_words, test_words=test_words,
                              train_labels=train_labels, test_labels=test_labels)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file: {}".format(cache_file))
            
    else:
        # unpack data from cache file
        train_words = cache_data['train_words']
        test_words = cache_data['test_words']
        train_labels = cache_data['train_labels']
        test_labels = cache_data['test_labels']
        
    return train_words, test_words, train_labels, test_labels

In [61]:
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Reading preprocessed data from cache file: preprocesssed_data.pkl


In [63]:
len(train_X[20])

57

---
### Transform the Data
First, we create a working vocabulary from the most frequently occurring words in the dataset and drop the words that occur least often. Each review is then fixed in size, with shorter reviews padded with zeros, which lets the RNN train more efficiently.
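
The transformation described above can be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: the function names `build_dict` and `convert_and_pad`, the default `vocab_size` of 5000, the default `pad` length of 500, the reserved ids (0 for padding, 1 for infrequent words), and the truncation of longer reviews are all assumptions made for the example.

```python
from collections import Counter

def build_dict(reviews, vocab_size=5000):
    """ Map the most frequent words to integer ids (hypothetical helper).
        Ids 0 and 1 are reserved: 0 = padding, 1 = infrequent word.
    """
    counts = Counter(word for review in reviews for word in review)
    # keep vocab_size - 2 words because two ids are reserved
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}

def convert_and_pad(word_dict, review, pad=500):
    """ Convert a list of words to a fixed-length list of ids (hypothetical helper).
        Longer reviews are truncated (an assumption), shorter ones padded with zeros.
    """
    NOWORD, INFREQ = 0, 1
    ids = [word_dict.get(word, INFREQ) for word in review[:pad]]
    length = len(ids)
    return ids + [NOWORD] * (pad - length), length

# toy data standing in for the processed reviews
toy_reviews = [['good', 'film'], ['bad', 'film'], ['good', 'act', 'good', 'plot']]
word_dict = build_dict(toy_reviews, vocab_size=4)
padded, length = convert_and_pad(word_dict, ['good', 'film', 'unseen'], pad=5)
print(word_dict)       # {'good': 2, 'film': 3}
print(padded, length)  # [2, 3, 1, 0, 0] 3
```

Returning the original length alongside the padded ids is one possible design choice: it would let a model ignore the padding positions later, if it is set up to use that information.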

<a id='upload'></a>
## Upload the Data to S3

<a id='train'></a>
## Train a Model

<a id='test'></a>
## Test the Trained Model

<a id='deploy'></a>
## Deploy the Trained Model

<a id='use'></a>
## Use the Deployed Model for Inference