# Creating a Sentiment Analysis Web App
### PyTorch and AWS SageMaker
_SageMaker, Lambda, API, CloudWatch_

---
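This notebook builds a sentiment analysis web app around a PyTorch model trained on IMDb movie reviews with AWS SageMaker. It walks through downloading and preparing the data, uploading it to S3, training and testing a model, deploying it, and using the deployed model for inference.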

## Outline
1. [Download the data](#download)
2. [Process and prepare the data](#process)
3. [Upload data to S3](#upload)
4. [Train a model](#train)
5. [Test the trained model](#test)
6. [Deploy the trained model](#deploy)
7. [Use the deployed model for inference](#use)


<a id='download'></a>
## Download the Data

The notebook and model use the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [3]:
%mkdir -p ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2020-08-03 17:35:31--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-08-03 17:35:38 (12.4 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]
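
Where shell commands are not available, the extraction step can be sketched in pure Python with the standard-library `tarfile` module. This is a minimal sketch, not the notebook's own method: the function name `extract_archive` is hypothetical, and it assumes the archive has already been downloaded to `archive_path`.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Hypothetical helper that mirrors `tar -zxf <archive> -C <dest_dir>`.
    Returns the sorted top-level entries found in dest_dir afterwards.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python versions provide extraction filters that reject
        # unsafe members; use the 'data' filter when it is available.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

Under these assumptions, `extract_archive('../data/aclImdb_v1.tar.gz', '../data')` would play the role of the `tar` command above.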



<a id='process'></a>
## Process and Prepare the Data

---
### Read in Data

In [4]:
# necessary imports 
import os
import glob

In [13]:
def read_imbd_data(data_dir='../data/aclImdb'):
    """ Read in IMDb data from the aclImdb folder. Creates data and label dictionaries.
    
        Arguments:
        - data_dir: (str) Directory of the data
        
        Returns:
        - data: (dict) Movie reviews, keyed by data type ('train'/'test') then sentiment ('pos'/'neg')
        - labels: (dict) Movie review labels (1 = pos, 0 = neg), keyed the same way
    """
    data = {}
    labels = {}
    
    # create paths to read in review data
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            # join path names
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            # open each review, append its text to data and its binary label to labels
            for f in files:
                # read as UTF-8 explicitly so the result does not depend on the platform default
                with open(f, encoding='utf-8') as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                       "{}/{}: data size does not equal label size.".format(data_type, sentiment)
                    
    return data, labels

In [14]:
# read in data and display the number of train and test reviews
data, labels = read_imbd_data()
print("IMDb Reviews: Train --> {} pos / {} neg ... Test --> {} pos / {} neg".format(len(data['train']['pos']),
                                                                                    len(data['train']['neg']),
                                                                                    len(data['test']['pos']),
                                                                                    len(data['test']['neg'])))

IMDb Reviews: Train --> 12500 pos / 12500 neg ... Test --> 12500 pos / 12500 neg


In [15]:
data['train']['pos'][0]

"I didn't know what to make of this film. I guess that is what it was all about really. I have never seen a film like it and I doubt that I really ever will again. Glover puts together something that is unique to him. I think to appreciate it you have to read some of his poetry, maybe see one of his slide shows. I really like this guy, he is just so bizarre I can't help it. Note: I saw this film before it was through its final editing, so maybe what I have seen and what others have seen are different. I will know, I guess, if I choose to view the film again. I think I will have to be properly drug influenced..."

In [20]:
labels['train']['pos'][0]

1
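
The dictionaries returned by the reader are nested: the first key is the split (`'train'` or `'test'`) and the second is the sentiment (`'pos'` or `'neg'`). As a minimal sketch, a helper such as the hypothetical `count_reviews` below summarizes that layout in one call. The toy dictionary is illustrative only and is not taken from the dataset.

```python
def count_reviews(nested):
    """Count the items under each split/sentiment key of a nested dict.

    Hypothetical helper: works on either the data or the labels dictionary,
    since both share the same {split: {sentiment: list}} layout.
    """
    return {
        split: {sentiment: len(items) for sentiment, items in by_sentiment.items()}
        for split, by_sentiment in nested.items()
    }


# Toy input with the same shape as the real dictionaries (illustrative only)
toy_data = {
    'train': {'pos': ['good', 'great'], 'neg': ['bad']},
    'test': {'pos': ['fine'], 'neg': ['poor', 'awful']},
}
toy_counts = count_reviews(toy_data)
print(toy_counts)
```

Comparing `count_reviews(data)` with `count_reviews(labels)` is a quick way to confirm that every review has a label.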

---
### Create Feature and Target Sets

In [19]:
# necessary imports 
from sklearn.utils import shuffle

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from bs4 import BeautifulSoup

import pickle

In [21]:
def combine_imdb_data(data, labels):
    """ Combine pos and neg reviews from the training and test data 
        dictionaries.
        
        Arguments:
        - data: (dict) Unprocessed reviews
        - labels: (dict) Sentiment labels, 1 = pos, 0 = neg
        
        Returns:
        - data_train, data_test: (list) Shuffled reviews
        - labels_train, labels_test: (list) Labels in the same order as the reviews
    """
    # combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    # shuffle each split with sklearn, passing reviews and labels in the
    # same call so that every review keeps its own label
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    return data_train, data_test, labels_train, labels_test

In [22]:
train_X, test_X, train_y, test_y = combine_imdb_data(data, labels)
print("IMDb Data Length: Train data = {}, Test data = {}".format(len(train_X), len(test_X)))

IMDb Data Length: Train data = 25000, Test data = 25000
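
Shuffling is only safe when each review stays paired with its label. `sklearn.utils.shuffle` applies one shared permutation to every sequence passed in the same call, so a list of reviews and its list of labels need to go through the same call. The sketch below illustrates the idea with the standard-library `random` module instead of scikit-learn; the helper name `shuffle_together` and the toy reviews are hypothetical.

```python
import random


def shuffle_together(items, targets, seed=None):
    """Shuffle two equal-length lists with a single shared permutation.

    Hypothetical stand-in for passing both lists to sklearn.utils.shuffle
    in one call; the inputs are left unchanged and new lists are returned.
    """
    if len(items) != len(targets):
        raise ValueError("items and targets must have the same length")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # one permutation for both lists
    return [items[i] for i in order], [targets[i] for i in order]


# Toy reviews and labels (illustrative only, not from the dataset)
reviews = ["great film", "loved it", "terrible plot", "fell asleep"]
sentiments = [1, 1, 0, 0]
expected = dict(zip(reviews, sentiments))

shuffled_reviews, shuffled_sentiments = shuffle_together(reviews, sentiments, seed=0)

# Every review still carries the label it started with
assert all(expected[r] == s for r, s in zip(shuffled_reviews, shuffled_sentiments))
```

Shuffling the reviews and the labels in two separate calls would draw two different permutations and silently mismatch them, which the final assertion is designed to catch.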


<a id='upload'></a>
## Upload the Data to S3

<a id='train'></a>
## Train a Model

<a id='test'></a>
## Test the Trained Model

<a id='deploy'></a>
## Deploy the Trained Model

<a id='use'></a>
## Use the Deployed Model for Inference