# Creating a Sentiment Analysis Web App
## Using PyTorch and SageMaker

_Deep Learning Nanodegree Program | Deployment_

---

Now that we have a basic understanding of how SageMaker works we will try to use it to construct a complete project from end to end. Our goal will be to have a simple web page which a user can use to enter a movie review. The web page will then send the review off to our deployed model which will predict the sentiment of the entered review.

## Instructions

Some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this notebook. You will not need to modify the included code beyond what is requested. Sections that begin with '**TODO**' in the header indicate that you need to complete or implement some portion within them. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `# TODO: ...` comment. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions for you to answer which relate to the task and your implementation. Each section where you will answer a question is preceded by a '**Question:**' header. Carefully read each question and provide your answer below the '**Answer:**' header by editing the Markdown cell.

> **Note**: Code and Markdown cells can be executed using the **Shift+Enter** keyboard shortcut. In addition, a cell can be edited by typically clicking it (double-click for Markdown cells) or by pressing **Enter** while it is highlighted.

## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.

For this project, you will be following the steps in the general outline with some modifications. 

First, you will not be testing the model in its own step. You will still be testing the model, however, you will do it by deploying your model and then using the deployed model by sending the test data to it. One of the reasons for doing this is so that you can make sure that your deployed model is working correctly before moving forward.

In addition, you will deploy and use your trained model a second time. In the second iteration you will customize the way that your trained model is deployed by including some of your own code. In addition, your newly deployed model will be used in the sentiment analysis web app.

## Step 1: Downloading the data

As in the XGBoost in SageMaker notebook, we will be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/)

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2020-04-08 22:15:50--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-04-08 22:15:55 (19.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing and Processing the data

Also, as in the XGBoost notebook, we will be doing some initial data processing. The first few steps are the same as in the XGBoost example. To begin with, we will read in each of the reviews and combine them into a single input structure. Then, we will split the dataset into a training set and a testing set.

In [1]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [2]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


Now that we've read the raw training and testing data from the downloaded dataset, we will combine the positive and negative reviews and shuffle the resulting records.

In [3]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return a unified training data, test data, training labels, test labets
    return data_train, data_test, labels_train, labels_test

In [4]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


Now that we have our training and testing sets unified and prepared, we should do a quick check and see an example of the data our model will be trained on. This is generally a good idea as it allows you to see how each of the further processing steps affects the reviews and it also ensures that the data has been loaded correctly.

In [5]:
print(train_X[100])
print(train_y[100])

"Kaabee" depicts the hardship of a woman in pre and during WWII, raising her kids alone after her husband imprisoned for "thought crime". This movie was directed by Yamada Youji, and as expected the atmosphere of this movie is really wonderful. Although the historical correctness of some scenes, most notably the beach scene, is a suspect.<br /><br />The acting in this movie is absolutely incredible. I am baffled at how they managed to gather this all-star cast for a 2008 film. Yoshinaga Sayuri, possibly the most decorated still-active actress in Japan, will undoubtedly win more individual awards for her performance in this film. Shoufukutei Tsurube in a supporting role was really nice as well. It was Asano Tadanobu though, who delivered the most impressive performance, perfectly portraying the wittiness of his character and the difficult situation he was in.<br /><br />Films with pre-war setting is not my thing, but thanks to wonderful directing and acting, I was totally absorbed by th

The first step in processing the reviews is to make sure that any html tags that appear should be removed. In addition we wish to tokenize our input, that way words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import *

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case
    words = text.split() # Split string into words
    words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
    words = [PorterStemmer().stem(w) for w in words] # stem
    
    return words

The `review_to_words` method defined above uses `BeautifulSoup` to remove any html tags that appear and uses the `nltk` package to tokenize the reviews. As a check to ensure we know how everything is working, try applying `review_to_words` to one of the reviews in the training set.

In [7]:
# TODO: Apply review_to_words to a review (train_X[100] or any other review)
review_to_words(train_X[10])

['slither',
 'horror',
 'comedi',
 'realli',
 'enough',
 'horror',
 'comedi',
 'qualifi',
 'one',
 'one',
 'scene',
 'except',
 'good',
 'number',
 'zinger',
 'work',
 'real',
 'scare',
 'enough',
 'humor',
 'maintain',
 'movi',
 'addit',
 'script',
 'focu',
 'hero',
 'heroin',
 'goe',
 'kilter',
 'sever',
 'place',
 'major',
 'fail',
 'film',
 'introduc',
 'leav',
 'hero',
 'fillion',
 'follow',
 'grant',
 'grant',
 'michael',
 'rooker',
 'first',
 'introduc',
 'becom',
 'monster',
 'whole',
 'part',
 'film',
 'drag',
 'michael',
 'rooker',
 'charact',
 'interest',
 'us',
 'person',
 'watch',
 'goe',
 'seri',
 'motion',
 'act',
 'monster',
 'interest',
 'might',
 'interest',
 'grant',
 'portrait',
 'man',
 'turn',
 'monster',
 'rather',
 'horror',
 'comedi',
 'alien',
 'invas',
 'movi',
 'final',
 'analysi',
 'movi',
 'problem',
 'script',
 'import',
 'audienc',
 'monster',
 'act',
 'propag',
 'purpos',
 'horror',
 'comedi',
 'get',
 'hero',
 'back',
 'corner',
 'shotgun',
 'throw',
 

**Question:** Above we mentioned that `review_to_words` method removes html formatting and allows us to tokenize the words found in a review, for example, converting *entertained* and *entertaining* into *entertain* so that they are treated as though they are the same word. What else, if anything, does this method do to the input?

**Answer:**

The method below applies the `review_to_words` method to each of the reviews in the training and testing datasets. In addition it caches the results. This is because performing this processing step can take a long time. This way if you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.

In [8]:
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data_full.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [9]:
# Preprocess data
#n=100
#train_X, test_X, train_y, test_y = preprocess_data(train_X[:n], test_X[:n], train_y[:n], test_y[:n])
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data_full.pkl


In [10]:
len(train_y)

25000

## Transform the data

In the XGBoost notebook we transformed the data from its word representation to a bag-of-words feature representation. For the model we are going to construct in this notebook we will construct a feature representation which is very similar. To start, we will represent each word as an integer. Of course, some of the words that appear in the reviews occur very infrequently and so likely don't contain much information for the purposes of sentiment analysis. The way we will deal with this problem is that we will fix the size of our working vocabulary and we will only include the words that appear most frequently. We will then combine all of the infrequent words into a single category and, in our case, we will label it as `1`.

Since we will be using a recurrent neural network, it will be convenient if the length of each review is the same. To do this, we will fix a size for our reviews and then pad short reviews with the category 'no word' (which we will label `0`) and truncate long reviews.

### (TODO) Create a word dictionary

To begin with, we need to construct a way to map words that appear in the reviews to integers. Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) to be `5000` but you may wish to change this to see how it affects the model.

> **TODO:** Complete the implementation for the `build_dict()` method below. Note that even though the vocab_size is set to `5000`, we only want to construct a mapping for the most frequently appearing `4998` words. This is because we want to reserve the special labels `0` for 'no word' and `1` for 'infrequent word'.

In [11]:
!pip install tqdm
from tqdm import tqdm
#Monitor the for loop like this [review_to_words(review) for review in tqdm(data_train)]

[31mfastai 1.0.60 requires nvidia-ml-py3, which is not installed.[0m
[33mYou are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [12]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
word_dict = None
def build_dict(data, vocab_size = 5000):
    
    """vectorizer = CountVectorizer(max_features=vocabulary_size)
    features_train = vectorizer.fit_transform(words_train).toarray()

        # Apply the same vectorizer to transform the test documents (ignore unknown words)
        
        
        # NOTE: Remember to convert the features using .toarray() for a compact representation
        
        # Write to cache file for future runs (store vocabulary as well)
    vocabulary = vectorizer.vocabulary_"""
    
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # TODO: Determine how often each word appears in `data`. Note that `data` is a list of sentences and that a
    #       sentence is a list of words.
    #print(set(tuple(x) for x in data))
    #flat_list = [item for sublist in data for item in sublist]
    #print(len(flat_list),len(set(flat_list)))
    print("flatten the list")
    flat_list = [item for sublist in tqdm(data) for item in sublist]
    """itr=0
    for sublist in data:
        for item in sublist:
            unique_data.append(item)
        itr+=1
        print("sentence # ",itr)"""
    print("len flat_list: ",len(flat_list))
    unique_data = list(set(flat_list))
    print("len set unique data: ",len(unique_data))#[x for x in set(data)] #tuple(x) for x in data)]
    #word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    print("word frequancies: ")
    wordfreq = [flat_list.count(p) for p in tqdm(unique_data)]
    print("max word freq:", max(wordfreq))
    word_count = {}
    #for p in tqdm(unique_data:
        #word_count.append()
    #print(unique_data,wordfreq)
    print("zip pairs of words with frequencies")
    word_count = dict(zip(unique_data,wordfreq))
    #unique_data,wordfreq = None
    # TODO: Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    #       sorted_words[-1] is the least frequently appearing word.
    sorted_words = []
    listofTuples = sorted(word_count.items() , reverse=True, key=lambda x: x[1])
    # Iterate over the sorted sequence
    for elem in listofTuples :
        print(elem[0] , " ::" , elem[1] )
    #itr=0
    sorted_words = [elem[0] for elem in listofTuples]#[(word_count[key], key) for key in tqdm(word_count)]
    print("word counts with keys")
    #for key in tqdm(word_count):
        #sorted_words.append((word_count[key], key))
        #print(itr)
        #itr+=1
    #sorted_words = aux
    #sorted_words.sort()
    #sorted_words.reverse()
    print(sorted_words)
    #sorted_words = aux
    print("dict: ")
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in tqdm(enumerate(sorted_words[:vocab_size - 2])): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2       
        #print(idx)# 'infrequent' labels
        
    return word_dict

In [47]:
%%time
word_dict = build_dict(train_X)

 22%|██▏       | 5393/25000 [00:00<00:00, 26655.09it/s]

flatten the list


100%|██████████| 25000/25000 [00:00<00:00, 25987.54it/s]
  0%|          | 1/5000 [00:00<27:39,  3.01it/s]

len flat_list:  12500000
len set unique data:  5000
word frequancies: 


100%|██████████| 5000/5000 [27:56<00:00,  2.98it/s]


max word freq: 9484769
zip pairs of words with frequencies
0  :: 9484769
1  :: 320980
2  :: 51612
3  :: 48025
4  :: 27652
5  :: 22741
6  :: 16150
7  :: 15321
8  :: 15166
9  :: 14145
10  :: 14103
11  :: 14067
12  :: 13915
13  :: 13144
14  :: 12877
15  :: 12392
16  :: 11716
17  :: 11003
18  :: 10538
19  :: 10020
20  :: 9851
21  :: 9733
22  :: 9598
23  :: 9356
24  :: 9326
25  :: 9279
26  :: 9144
27  :: 9127
28  :: 9042
29  :: 9013
30  :: 8895
31  :: 8805
32  :: 8787
33  :: 8720
34  :: 8345
35  :: 8187
36  :: 7893
37  :: 7494
38  :: 7432
39  :: 7211
40  :: 7080
41  :: 6956
42  :: 6882
43  :: 6869
44  :: 6851
45  :: 6731
47  :: 6665
46  :: 6658
48  :: 6631
50  :: 6624
49  :: 6620
51  :: 6466
52  :: 6417
53  :: 6401
54  :: 6340
55  :: 6018
56  :: 5985
57  :: 5769
58  :: 5720
59  :: 5644
60  :: 5522
61  :: 5432
62  :: 5280
63  :: 5237
64  :: 5115
67  :: 5110
65  :: 5108
66  :: 5102
68  :: 5051
69  :: 4885
70  :: 4765
71  :: 4725
72  :: 4573
73  :: 4565
74  :: 4552
75  :: 4465
76  :: 4438
77  

772  :: 725
774  :: 724
776  :: 724
777  :: 722
778  :: 722
779  :: 722
780  :: 719
781  :: 718
782  :: 716
783  :: 714
784  :: 713
787  :: 712
786  :: 710
790  :: 710
785  :: 709
788  :: 709
792  :: 709
789  :: 708
791  :: 708
793  :: 708
794  :: 706
796  :: 706
795  :: 704
797  :: 702
799  :: 701
800  :: 701
804  :: 701
805  :: 701
806  :: 701
798  :: 700
801  :: 700
802  :: 700
807  :: 699
803  :: 698
809  :: 698
808  :: 697
812  :: 697
811  :: 696
814  :: 695
810  :: 694
813  :: 692
815  :: 691
816  :: 691
819  :: 688
817  :: 686
818  :: 686
820  :: 685
822  :: 685
823  :: 684
821  :: 683
824  :: 682
825  :: 680
827  :: 676
828  :: 676
829  :: 674
826  :: 673
831  :: 672
832  :: 672
830  :: 671
833  :: 670
834  :: 669
835  :: 665
836  :: 665
838  :: 664
839  :: 664
842  :: 662
845  :: 662
837  :: 660
847  :: 660
844  :: 659
841  :: 658
846  :: 658
843  :: 657
840  :: 656
850  :: 656
851  :: 656
852  :: 656
848  :: 655
853  :: 655
854  :: 654
857  :: 654
849  :: 653
858  :: 653
855 

1940  :: 239
1941  :: 239
1938  :: 238
1943  :: 238
1944  :: 238
1945  :: 238
1939  :: 237
1942  :: 237
1949  :: 237
1950  :: 237
1951  :: 237
1952  :: 237
1946  :: 236
1947  :: 236
1948  :: 236
1953  :: 236
1957  :: 236
1960  :: 236
1961  :: 236
1962  :: 236
1955  :: 235
1956  :: 235
1958  :: 235
1959  :: 235
1965  :: 235
1967  :: 235
1970  :: 235
1934  :: 234
1954  :: 234
1963  :: 234
1964  :: 234
1966  :: 234
1968  :: 234
1969  :: 234
1971  :: 234
1973  :: 234
1972  :: 233
1975  :: 233
1976  :: 233
1978  :: 233
1974  :: 232
1980  :: 232
1982  :: 232
1984  :: 232
1985  :: 232
1979  :: 231
1981  :: 231
1983  :: 231
1986  :: 231
1977  :: 230
1989  :: 230
1992  :: 230
1994  :: 230
1997  :: 230
1998  :: 230
1988  :: 229
1993  :: 229
1995  :: 229
1999  :: 229
2002  :: 229
1987  :: 228
1991  :: 228
2000  :: 228
2003  :: 228
2004  :: 228
1990  :: 227
1996  :: 227
2001  :: 227
2005  :: 227
2009  :: 227
2011  :: 227
2015  :: 227
2017  :: 227
2006  :: 226
2008  :: 226
2010  :: 226
2012  :: 226

3031  :: 125
3045  :: 125
3049  :: 125
3058  :: 125
3071  :: 125
3072  :: 125
3074  :: 125
3075  :: 125
3076  :: 125
3081  :: 125
3082  :: 125
3084  :: 125
3065  :: 124
3068  :: 124
3069  :: 124
3077  :: 124
3078  :: 124
3079  :: 124
3080  :: 124
3085  :: 124
3086  :: 124
3087  :: 124
3089  :: 124
3090  :: 124
3093  :: 124
3094  :: 124
3097  :: 124
3098  :: 124
3099  :: 124
3102  :: 124
3104  :: 124
3105  :: 124
3108  :: 124
3109  :: 124
3067  :: 123
3070  :: 123
3083  :: 123
3088  :: 123
3091  :: 123
3092  :: 123
3100  :: 123
3101  :: 123
3103  :: 123
3106  :: 123
3107  :: 123
3111  :: 123
3112  :: 123
3113  :: 123
3114  :: 123
3116  :: 123
3117  :: 123
3118  :: 123
3119  :: 123
3120  :: 123
3121  :: 123
3123  :: 123
3124  :: 123
3128  :: 123
3129  :: 123
3130  :: 123
3073  :: 122
3096  :: 122
3115  :: 122
3122  :: 122
3125  :: 122
3126  :: 122
3131  :: 122
3132  :: 122
3133  :: 122
3135  :: 122
3136  :: 122
3137  :: 122
3138  :: 122
3139  :: 122
3140  :: 122
3141  :: 122
3142  :: 122

4258  :: 79
4259  :: 79
4260  :: 79
4261  :: 79
4262  :: 79
4263  :: 79
4264  :: 79
4265  :: 79
4266  :: 79
4267  :: 79
4268  :: 79
4270  :: 79
4272  :: 79
4274  :: 79
4275  :: 79
4276  :: 79
4277  :: 79
4278  :: 79
4279  :: 79
4280  :: 79
4281  :: 79
4282  :: 79
4283  :: 79
4126  :: 78
4205  :: 78
4252  :: 78
4271  :: 78
4273  :: 78
4285  :: 78
4286  :: 78
4287  :: 78
4288  :: 78
4289  :: 78
4290  :: 78
4291  :: 78
4292  :: 78
4293  :: 78
4294  :: 78
4295  :: 78
4298  :: 78
4300  :: 78
4301  :: 78
4302  :: 78
4304  :: 78
4305  :: 78
4306  :: 78
4308  :: 78
4310  :: 78
4311  :: 78
4312  :: 78
4313  :: 78
4296  :: 77
4297  :: 77
4307  :: 77
4309  :: 77
4314  :: 77
4317  :: 77
4319  :: 77
4320  :: 77
4321  :: 77
4322  :: 77
4323  :: 77
4324  :: 77
4325  :: 77
4326  :: 77
4330  :: 77
4331  :: 77
4332  :: 77
4333  :: 77
4334  :: 77
4335  :: 77
4337  :: 77
4338  :: 77
4341  :: 77
4343  :: 77
4344  :: 77
4345  :: 77
4346  :: 77
4348  :: 77
4349  :: 77
4350  :: 77
4351  :: 77
4353  :: 77
4354

4998it [00:00, 806863.92it/s]

 :: 64
4862  :: 64
4864  :: 64
4866  :: 64
4870  :: 64
4871  :: 64
4874  :: 64
4875  :: 64
4876  :: 64
4877  :: 64
4878  :: 64
4879  :: 64
4880  :: 64
4883  :: 64
4884  :: 64
4885  :: 64
4886  :: 64
4887  :: 64
4888  :: 64
4889  :: 64
4890  :: 64
4891  :: 64
4892  :: 64
4893  :: 64
4894  :: 64
4895  :: 64
4898  :: 64
4899  :: 64
4900  :: 64
4901  :: 64
4902  :: 64
4903  :: 64
4904  :: 64
4905  :: 64
4907  :: 64
4908  :: 64
4821  :: 63
4867  :: 63
4868  :: 63
4869  :: 63
4872  :: 63
4873  :: 63
4881  :: 63
4882  :: 63
4896  :: 63
4897  :: 63
4909  :: 63
4910  :: 63
4911  :: 63
4912  :: 63
4913  :: 63
4914  :: 63
4915  :: 63
4916  :: 63
4918  :: 63
4920  :: 63
4921  :: 63
4923  :: 63
4925  :: 63
4926  :: 63
4927  :: 63
4928  :: 63
4929  :: 63
4931  :: 63
4932  :: 63
4933  :: 63
4934  :: 63
4935  :: 63
4936  :: 63
4937  :: 63
4938  :: 63
4939  :: 63
4940  :: 63
4941  :: 63
4942  :: 63
4943  :: 63
4944  :: 63
4945  :: 63
4946  :: 63
4947  :: 63
4948  :: 63
4949  :: 63
4951  :: 63
4953  :: 




Wall time: 28min


In [44]:
#word_dict
    
listofTuples = sorted(word_dict.items() , key=lambda x: x[1])
 
# Iterate over the sorted sequence
for elem in listofTuples :
    print(elem[0] , " ::" , elem[1] )

movi  :: 2
film  :: 3
one  :: 4
like  :: 5
time  :: 6
good  :: 7
make  :: 8
charact  :: 9
get  :: 10
see  :: 11
watch  :: 12
stori  :: 13
even  :: 14
would  :: 15
realli  :: 16
well  :: 17
scene  :: 18
look  :: 19
show  :: 20
much  :: 21
end  :: 22
peopl  :: 23
bad  :: 24
go  :: 25
great  :: 26
also  :: 27
first  :: 28
love  :: 29
think  :: 30
way  :: 31
act  :: 32
play  :: 33
made  :: 34
thing  :: 35
could  :: 36
know  :: 37
say  :: 38
seem  :: 39
work  :: 40
plot  :: 41
two  :: 42
actor  :: 43
year  :: 44
come  :: 45
mani  :: 46
seen  :: 47
take  :: 48
life  :: 49
want  :: 50
never  :: 51
littl  :: 52
best  :: 53
tri  :: 54
man  :: 55
ever  :: 56
give  :: 57
better  :: 58
still  :: 59
perform  :: 60
find  :: 61
feel  :: 62
part  :: 63
back  :: 64
use  :: 65
someth  :: 66
director  :: 67
actual  :: 68
interest  :: 69
lot  :: 70
real  :: 71
old  :: 72
cast  :: 73
though  :: 74
live  :: 75
star  :: 76
enjoy  :: 77
guy  :: 78
anoth  :: 79
new  :: 80
role  :: 81
noth  :: 82
10  :: 83
funn

larg  :: 713
among  :: 714
eventu  :: 715
accept  :: 716
train  :: 717
agre  :: 718
spirit  :: 719
soundtrack  :: 720
third  :: 721
teenag  :: 722
soldier  :: 723
adventur  :: 724
sorri  :: 725
famou  :: 726
suggest  :: 727
drug  :: 728
normal  :: 729
cri  :: 730
babi  :: 731
ultim  :: 732
troubl  :: 733
contain  :: 734
certain  :: 735
cultur  :: 736
romanc  :: 737
rare  :: 738
lame  :: 739
somehow  :: 740
mix  :: 741
disney  :: 742
gone  :: 743
cartoon  :: 744
student  :: 745
reveal  :: 746
fear  :: 747
suck  :: 748
kept  :: 749
attract  :: 750
appeal  :: 751
premis  :: 752
secret  :: 753
design  :: 754
greatest  :: 755
shame  :: 756
throw  :: 757
copi  :: 758
scare  :: 759
wit  :: 760
america  :: 761
admit  :: 762
relat  :: 763
particular  :: 764
brought  :: 765
screenplay  :: 766
whatev  :: 767
pure  :: 768
70  :: 769
harri  :: 770
averag  :: 771
master  :: 772
describ  :: 773
treat  :: 774
male  :: 775
20  :: 776
issu  :: 777
fantasi  :: 778
warn  :: 779
inde  :: 780
forward  :: 78

occur  :: 1304
blame  :: 1305
shine  :: 1306
logic  :: 1307
bruce  :: 1308
mainli  :: 1309
commerci  :: 1310
forev  :: 1311
skip  :: 1312
segment  :: 1313
surround  :: 1314
teacher  :: 1315
held  :: 1316
blond  :: 1317
zero  :: 1318
trap  :: 1319
summer  :: 1320
resembl  :: 1321
satir  :: 1322
six  :: 1323
ball  :: 1324
fool  :: 1325
queen  :: 1326
sub  :: 1327
twice  :: 1328
tragedi  :: 1329
reaction  :: 1330
pack  :: 1331
bomb  :: 1332
will  :: 1333
protagonist  :: 1334
hospit  :: 1335
mile  :: 1336
sport  :: 1337
vote  :: 1338
drink  :: 1339
jerri  :: 1340
trust  :: 1341
mom  :: 1342
encount  :: 1343
plane  :: 1344
program  :: 1345
current  :: 1346
station  :: 1347
al  :: 1348
martin  :: 1349
choos  :: 1350
celebr  :: 1351
join  :: 1352
favourit  :: 1353
tragic  :: 1354
lord  :: 1355
round  :: 1356
field  :: 1357
jean  :: 1358
vision  :: 1359
robot  :: 1360
arthur  :: 1361
tie  :: 1362
roger  :: 1363
random  :: 1364
fortun  :: 1365
intern  :: 1366
dread  :: 1367
psycholog  :: 1368
e

anna  :: 1863
condit  :: 1864
sudden  :: 1865
mirror  :: 1866
sole  :: 1867
veteran  :: 1868
spectacular  :: 1869
demonstr  :: 1870
meanwhil  :: 1871
overli  :: 1872
card  :: 1873
gift  :: 1874
freedom  :: 1875
liner  :: 1876
robin  :: 1877
experienc  :: 1878
grip  :: 1879
crappi  :: 1880
brilliantli  :: 1881
colour  :: 1882
subtitl  :: 1883
section  :: 1884
circumst  :: 1885
theori  :: 1886
drew  :: 1887
sheriff  :: 1888
unabl  :: 1889
oliv  :: 1890
pile  :: 1891
matt  :: 1892
laughter  :: 1893
altern  :: 1894
sheer  :: 1895
parker  :: 1896
path  :: 1897
cook  :: 1898
lawyer  :: 1899
accident  :: 1900
treatment  :: 1901
hall  :: 1902
defin  :: 1903
wander  :: 1904
sinatra  :: 1905
relief  :: 1906
captiv  :: 1907
dragon  :: 1908
hank  :: 1909
halloween  :: 1910
gratuit  :: 1911
moor  :: 1912
cowboy  :: 1913
k  :: 1914
jacki  :: 1915
wayn  :: 1916
barbara  :: 1917
broadway  :: 1918
unintent  :: 1919
kung  :: 1920
wound  :: 1921
surreal  :: 1922
canadian  :: 1923
statement  :: 1924
winte

heavili  :: 2402
cabin  :: 2403
holiday  :: 2404
gruesom  :: 2405
racist  :: 2406
india  :: 2407
understood  :: 2408
satan  :: 2409
philip  :: 2410
indulg  :: 2411
belov  :: 2412
stalk  :: 2413
forgot  :: 2414
midnight  :: 2415
outfit  :: 2416
pregnant  :: 2417
integr  :: 2418
tongu  :: 2419
fourth  :: 2420
lay  :: 2421
obnoxi  :: 2422
garden  :: 2423
deeper  :: 2424
ticket  :: 2425
carol  :: 2426
magazin  :: 2427
17  :: 2428
restor  :: 2429
inhabit  :: 2430
slapstick  :: 2431
incid  :: 2432
shoe  :: 2433
brad  :: 2434
devot  :: 2435
lincoln  :: 2436
underground  :: 2437
sandler  :: 2438
divorc  :: 2439
elizabeth  :: 2440
disbelief  :: 2441
anticip  :: 2442
maria  :: 2443
guarante  :: 2444
benefit  :: 2445
lili  :: 2446
amazingli  :: 2447
creation  :: 2448
explod  :: 2449
slave  :: 2450
capit  :: 2451
greater  :: 2452
mildli  :: 2453
bbc  :: 2454
cring  :: 2455
princip  :: 2456
lesli  :: 2457
funnier  :: 2458
introduct  :: 2459
halfway  :: 2460
extraordinari  :: 2461
punish  :: 2462
ov

reev  :: 2950
ian  :: 2951
easier  :: 2952
updat  :: 2953
burst  :: 2954
assault  :: 2955
smash  :: 2956
fond  :: 2957
smooth  :: 2958
useless  :: 2959
astair  :: 2960
bakshi  :: 2961
outcom  :: 2962
cardboard  :: 2963
tag  :: 2964
terri  :: 2965
cox  :: 2966
divers  :: 2967
exchang  :: 2968
sketch  :: 2969
vulner  :: 2970
melodramat  :: 2971
increasingli  :: 2972
vari  :: 2973
coincid  :: 2974
fist  :: 2975
rex  :: 2976
samurai  :: 2977
resolv  :: 2978
qualifi  :: 2979
2002  :: 2980
templ  :: 2981
scratch  :: 2982
suspend  :: 2983
blast  :: 2984
tame  :: 2985
insert  :: 2986
reynold  :: 2987
brillianc  :: 2988
luckili  :: 2989
be  :: 2990
conveni  :: 2991
farm  :: 2992
strictli  :: 2993
hamilton  :: 2994
coach  :: 2995
pin  :: 2996
nuclear  :: 2997
seventi  :: 2998
ambiti  :: 2999
walker  :: 3000
meat  :: 3001
jami  :: 3002
matthew  :: 3003
gotta  :: 3004
soprano  :: 3005
fisher  :: 3006
discoveri  :: 3007
revers  :: 3008
convolut  :: 3009
spooki  :: 3010
instantli  :: 3011
timeless  

minimum  :: 4045
angela  :: 4046
peck  :: 4047
israel  :: 4048
spain  :: 4049
peril  :: 4050
campaign  :: 4051
what  :: 4052
eastern  :: 4053
unleash  :: 4054
valentin  :: 4055
preserv  :: 4056
regist  :: 4057
mon  :: 4058
valley  :: 4059
represent  :: 4060
scotland  :: 4061
calib  :: 4062
stroke  :: 4063
din  :: 4064
perpetu  :: 4065
fido  :: 4066
crawl  :: 4067
wig  :: 4068
restrain  :: 4069
rout  :: 4070
gentleman  :: 4071
sabrina  :: 4072
contradict  :: 4073
han  :: 4074
shootout  :: 4075
bro  :: 4076
buster  :: 4077
quinn  :: 4078
unimagin  :: 4079
brenda  :: 4080
cooki  :: 4081
crow  :: 4082
realm  :: 4083
kurosawa  :: 4084
miyazaki  :: 4085
exposit  :: 4086
travesti  :: 4087
empathi  :: 4088
reson  :: 4089
stake  :: 4090
climat  :: 4091
1984  :: 4092
jan  :: 4093
cream  :: 4094
ross  :: 4095
fuller  :: 4096
femm  :: 4097
abomin  :: 4098
geek  :: 4099
josh  :: 4100
businessman  :: 4101
mclaglen  :: 4102
distress  :: 4103
cloud  :: 4104
passabl  :: 4105
soderbergh  :: 4106
delic  

info  :: 4650
danish  :: 4651
28  :: 4652
inflict  :: 4653
vaniti  :: 4654
off  :: 4655
divin  :: 4656
triangl  :: 4657
interestingli  :: 4658
carey  :: 4659
sensual  :: 4660
ritchi  :: 4661
departur  :: 4662
claud  :: 4663
repris  :: 4664
archiv  :: 4665
discern  :: 4666
fishburn  :: 4667
flock  :: 4668
1945  :: 4669
brush  :: 4670
someday  :: 4671
mol  :: 4672
deer  :: 4673
cb  :: 4674
biblic  :: 4675
heartwarm  :: 4676
cycl  :: 4677
mobster  :: 4678
undermin  :: 4679
europa  :: 4680
submit  :: 4681
proclaim  :: 4682
clad  :: 4683
loretta  :: 4684
carlo  :: 4685
pixar  :: 4686
frontier  :: 4687
banter  :: 4688
artwork  :: 4689
anton  :: 4690
wendigo  :: 4691
hug  :: 4692
cher  :: 4693
dim  :: 4694
jade  :: 4695
recit  :: 4696
miracul  :: 4697
harrison  :: 4698
parson  :: 4699
casino  :: 4700
neill  :: 4701
cliffhang  :: 4702
timberlak  :: 4703
pacif  :: 4704
rot  :: 4705
bate  :: 4706
colin  :: 4707
helm  :: 4708
traffic  :: 4709
kathryn  :: 4710
melissa  :: 4711
vibrant  :: 4712
dam

In [45]:
len(word_dict)

4998

**Question:** What are the five most frequently appearing (tokenized) words in the training set? Does it makes sense that these words appear frequently in the training set?

**Answer:** movi, film, one, like, time

##### TODO: Use this space to determine the five most frequently appearing words in the training set.

In [46]:
i=0
for wd in word_dict:
    print(wd)
    i+=1
    if i==5:
        break

movi
film
one
like
time


### Save `word_dict`

Later on when we construct an endpoint which processes a submitted review we will need to make use of the `word_dict` which we have created. As such, we will save it to a file now for future use.

In [13]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)

In [48]:
# save
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

In [14]:
# load
with open(os.path.join(data_dir, 'word_dict.pkl'), "rb" ) as f:
    word_dict = pickle.load( f)

### Transform the reviews

Now that we have our word dictionary which allows us to transform the words appearing in the reviews into integers, it is time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is `500`.

In [15]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [16]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

As a quick check to make sure that things are working as intended, check to see what one of the reviews in the training set looks like after having been processeed. Does this look reasonable? What is the length of a review in the training set?

In [19]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.
print(len(train_X[1]),train_X_len)
train_X[1]

500 [ 68  75  84 ... 129  74  34]


array([ 135,    2,  996,  279, 1309,    1,   11, 3228, 3025,    1,   95,
          4, 1411,    2,   47,   44,   87,  139,  529,    1, 1394,  499,
        267,   40,  312,    4, 1876,   26, 1540, 1523, 3228, 3025, 4117,
          1,   27,   38,    2,  105,  683,  328,    6,  431,  122,  390,
        517,   37,  149,   21,    1,  111,    4,   53,    2,   47,  114,
          6,  219,  517,  372, 2600,    3,  664,   11, 1380,  279,  215,
        181,  660,    7,  105,   17,  343,  125,  116,    1,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

**Question:** In the cells above we use the `preprocess_data` and `convert_and_pad_data` methods to process both the training and testing set. Why or why not might this be a problem?

**Answer:**
the current implementation of `preprocess_data` method requires 2 labeled sets, it doesn't allow to preprocess and dump train, val, test sets at the same time. Also, preprocessing of a new unlabled test set would raise an error.
`convert_and_pad_data` doesn't cause any problems since we assume that both sets are in the same domain (therefore the words come from one dictionary), and if the test set has any words not present in train set, these words will get the key `1`.

## Step 3: Upload the data to S3

As in the XGBoost notebook, we will need to upload the training dataset to S3 in order for our training code to access it. For now we will save it locally and we will upload to S3 later on.

### Save the processed training dataset locally

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [17]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

### Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [18]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [19]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

In the XGBoost notebook we discussed what a model is in the SageMaker framework. In particular, a model comprises three objects

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interact with one another. In the XGBoost example we used training and inference code that was provided by Amazon. Here we will still be using containers provided by Amazon with the added benefit of being able to include our own custom code.

We will start by implementing our own neural network in PyTorch along with a training script. For the purposes of this project we have provided the necessary model object in the `model.py` file, inside of the `train` folder. You can see the provided implementation by running the cell below.

In [20]:
!pygmentize train/model.py

[34mimport[39;49;00m [04m[36mtorch.nn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m

[34mclass[39;49;00m [04m[32mLSTMClassifier[39;49;00m(nn.Module):
    [33m"""[39;49;00m
[33m    This is the simple RNN model we will be using to perform Sentiment Analysis.[39;49;00m
[33m    """[39;49;00m

    [34mdef[39;49;00m [32m__init__[39;49;00m([36mself[39;49;00m, embedding_dim, hidden_dim, vocab_size):
        [33m"""[39;49;00m
[33m        Initialize the model by settingg up the various layers.[39;49;00m
[33m        """[39;49;00m
        [36msuper[39;49;00m(LSTMClassifier, [36mself[39;49;00m).[32m__init__[39;49;00m()

        [36mself[39;49;00m.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=[34m0[39;49;00m)
        [36mself[39;49;00m.lstm = nn.LSTM(embedding_dim, hidden_dim)
        [36mself[39;49;00m.dense = nn.Linear(in_features=hidden_dim, out_features=[34m1[39;49;00m)
        [36mself[39;49;00m.sig = nn.Sigm

The important takeaway from the implementation provided is that there are three parameters that we may wish to tweak to improve the performance of our model. These are the embedding dimension, the hidden dimension and the size of the vocabulary. We will likely want to make these parameters configurable in the training script so that if we wish to modify them we do not need to modify the script itself. We will see how to do this later on. To start we will write some of the training code in the notebook so that we can more easily diagnose any issues that arise.

First we will load a small portion of the training data set to use as a sample. It would be very time consuming to try and train the model completely in the notebook as we do not have access to a gpu and the compute instance that we are using is not particularly powerful. However, we can work on a small bit of the data to get a feel for how our training script is behaving.

In [21]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

### (TODO) Writing the training method

Next we need to write the training code itself. This should be very similar to training methods that you have written before to train PyTorch models. We will leave any difficult aspects such as model saving / loading and parameter loading until a little later.

In [50]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # TODO: Complete this train method to train the model provided.
            #x, y = place_and_unwrap(batch, device)
            with torch.set_grad_enabled(True):
             
                out = model(batch_X)
                    
                loss = loss_fn(out, batch_y)

            #if is_training:
            optimizer.zero_grad()
                  
            loss.backward()
               
            optimizer.step()

            #phase.batch_loss = loss.item()
            #
            total_loss += loss.data.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

Supposing we have the training method above, we will test that it is working by writing a bit of code in the notebook that executes our training method on the small sample training set that we loaded earlier. The reason for doing this in the notebook is so that we have an opportunity to fix any errors that arise early when they are easier to diagnose.

In [51]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6940591692924499
Epoch: 2, BCELoss: 0.6854180932044983
Epoch: 3, BCELoss: 0.6777827024459839
Epoch: 4, BCELoss: 0.6689720630645752
Epoch: 5, BCELoss: 0.6574710845947266


In order to construct a PyTorch model using SageMaker we must provide SageMaker with a training script. We may optionally include a directory which will be copied to the container and from which our training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### (TODO) Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside of the `train` directory is a file called `train.py` which has been provided and which contains most of the necessary code to train our model. The only thing that is missing is the implementation of the `train()` method which you wrote earlier in this notebook.

**TODO**: Copy the `train()` method written above and paste it into the `train/train.py` file where required.

The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script. To see how this is done take a look at the provided `train/train.py` file.

In [25]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type= 'ml.m4.xlarge',#'ml.p2.xlarge',#
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [26]:
estimator.fit({'training': input_data})

2020-04-09 09:01:53 Starting - Starting the training job...
2020-04-09 09:01:55 Starting - Launching requested ML instances......
2020-04-09 09:03:03 Starting - Preparing the instances for training......
2020-04-09 09:04:03 Downloading - Downloading input data...
2020-04-09 09:04:53 Training - Training image download completed. Training in progress..[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2020-04-09 09:04:54,222 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2020-04-09 09:04:54,225 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-04-09 09:04:54,238 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2020-04-09 09:04:55,679 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2020-04-09 09:04:56,042 sagemaker-containers INFO    

[34mEpoch: 1, BCELoss: 0.6706423662146743[0m
[34mEpoch: 2, BCELoss: 0.6050693052155631[0m
[34mEpoch: 3, BCELoss: 0.5347774478853965[0m
[34mEpoch: 4, BCELoss: 0.4780822834190057[0m
[34mEpoch: 5, BCELoss: 0.4453286224482011[0m
[34mEpoch: 6, BCELoss: 0.3910784173984917[0m
[34mEpoch: 7, BCELoss: 0.3530127625076138[0m
[34mEpoch: 8, BCELoss: 0.3982901567099046[0m
[34mEpoch: 9, BCELoss: 0.32424689677296853[0m
[34mEpoch: 10, BCELoss: 0.2972212537210815[0m
[34m2020-04-09 10:46:45,241 sagemaker-containers INFO     Reporting training SUCCESS[0m

2020-04-09 10:46:54 Uploading - Uploading generated training model
2020-04-09 10:46:54 Completed - Training job completed
Training seconds: 6171
Billable seconds: 6171


## Step 5: Testing the model

As mentioned at the top of this notebook, we will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

## Step 6: Deploy the model for testing

Now that we have trained our model, we would like to test it to see how it performs. Currently our model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately for us, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that we need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the python file which we specified as the entry point. In our case the model loading function has been provided and so no changes need to be made.

**NOTE**: When the built-in inference code is run it must import the `model_fn()` method from the `train.py` file. This is why the training code is wrapped in a main guard ( ie, `if __name__ == '__main__':` )

Since we don't need to change anything in the code that was uploaded during training, we can simply deploy the current model as-is.

**NOTE:** When deploying a model you are asking SageMaker to launch an compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until *you* shut it down. This is important to know since the cost of a deployed endpoint depends on how long it has been running for.

In other words **If you are no longer using a deployed endpoint, shut it down!**

**TODO:** Deploy the trained model.

In [27]:
# TODO: Deploy the trained model
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------!

## Step 7 - Use the model for testing

Once deployed, we can read in the test data and send it off to our deployed model to get some results. Once we collect all of the results we can determine how accurate our model is.

In [28]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [29]:
# We split the data into chunks and send each chunk seperately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions

In [30]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [31]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.85328

**Question:** How does this model compare to the XGBoost model you created earlier? Why might these two models perform differently on this dataset? Which do *you* think is better for sentiment analysis?

**Answer:** The result of XGBoost was almost the same: 0.8536

### (TODO) More testing

We now have a trained model which has been deployed and which we can send processed reviews to and which returns the predicted sentiment. However, ultimately we would like to be able to send our model an unprocessed review. That is, we would like to send the review itself as a string. For example, suppose we wish to send the following review to our model.

In [32]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

The question we now need to answer is, how do we send this review to our model?

Recall in the first section of this notebook we did a bunch of data processing to the IMDb dataset. In particular, we did two specific things to the provided reviews.
 - Removed any html tags and stemmed the input
 - Encoded the review as a sequence of integers using `word_dict`
 
In order process the review we will need to repeat these two steps.

**TODO**: Using the `review_to_words` and `convert_and_pad` methods from section one, convert `test_review` into a numpy array `test_data` suitable to send to our model. Remember that our model expects input of the form `review_length, review[500]`.

In [33]:
test_data_review_to_words = review_to_words(test_review)
test_data = [np.array(convert_and_pad(word_dict, test_data_review_to_words)[0])]

In [34]:
predictor.predict(test_data)

array(0.60721785, dtype=float32)

Now that we have processed the review, we can send the resulting array to our model to predict the sentiment of the review.

Since the return value of our model is close to `1`, we can be certain that the review we submitted is positive.

### Delete the endpoint

Of course, just like in the XGBoost notebook, once we've deployed an endpoint it continues to run until we tell it to shut down. Since we are done using our endpoint for now, we can delete it.

In [35]:
estimator.delete_endpoint()

## Step 6 (again) - Deploy the model for the web app

Now that we know that our model is working, it's time to create some custom inference code so that we can send the model a review which has not been processed and have it determine the sentiment of the review.

As we saw above, by default the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we now wish to accept a string as input and our model expects a processed review, we need to write some custom inference code.

We will store the code that we write in the `serve` directory. Provided in this directory is the `model.py` file that we used to construct our model, a `utils.py` file which contains the `review_to_words` and `convert_and_pad` pre-processing functions which we used during the initial data processing, and `predict.py`, the file which will contain our custom inference code. Note also that `requirements.txt` is present which will tell SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, you are expected to provide four functions which the SageMaker inference container will use.
 - `model_fn`: This function is the same function that we used in the training script and it tells SageMaker how to load our model.
 - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
 - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
 - `predict_fn`: The heart of the inference script, this is where the actual prediction is done and is the function which you will need to complete.

For the simple website that we are constructing during this project, the `input_fn` and `output_fn` methods are relatively straightforward. We only require being able to accept a string as input and we expect to return a single value as output. You might imagine though that in a more complex application the input or output may be image data or some other binary data which would require some effort to serialize.

### (TODO) Writing inference code

Before writing our custom inference code, we will begin by taking a look at the code which has been provided.

In [36]:
!pygmentize serve/predict.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpickle[39;49;00m
[34mimport[39;49;00m [04m[36msys[39;49;00m
[34mimport[39;49;00m [04m[36msagemaker_containers[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.nn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.optim[39;49;00m [34mas[39;49;00m [04m[36moptim[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.utils.data[39;49;00m

[34mfrom[39;49;00m [04m[36mmodel[39;49;00m [34mimport[39;49;00m LSTMClassifier

[34mfrom[39;49;00m [04m[36mutils[39;49;00m [34mimport[39;49;00m review_to_words, 

As mentioned earlier, the `model_fn` method is the same as the one provided in the training code and the `input_fn` and `output_fn` methods are very simple and your task will be to complete the `predict_fn` method. Make sure that you save the completed file as `predict.py` in the `serve` directory.

**TODO**: Complete the `predict_fn()` method in the `serve/predict.py` file.

### Deploying the model

Now that the custom inference code has been written, we will create and deploy our model. To begin with, we need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then we can call the deploy method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In our case we want to send a string so we need to construct a simple wrapper around the `RealTimePredictor` class to accomodate simple strings. In a more complicated situation you may want to provide a serialization object, for example if you wanted to sent image data.

In [37]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------!

### Testing the model

Now that we have deployed our model with the custom inference code, we should test to see if everything is working. Here we test our model by loading the first `250` positive and negative reviews and send them to the endpoint, then collect the results. The reason for only sending some of the data is that the amount of time it takes for our model to process the input and then perform inference is quite long and so testing the entire data set would be prohibitive.

In [38]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(int(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [39]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [40]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.866

As an additional test, we can try sending the `test_review` that we looked at earlier.

In [41]:
predictor.predict(test_review)

b'1'

Now that we know our endpoint is working as expected, we can set up the web page that will interact with it. If you don't have time to finish the project now, make sure to skip down to the end of this notebook and shut down your endpoint. You can deploy it again when you come back.

## Step 7 (again): Use the model for the web app

> **TODO:** This entire section and the next contain tasks for you to complete, mostly using the AWS console.

So far we have been accessing our model endpoint by constructing a predictor object which uses the endpoint and then just using the predictor object to perform inference. What if we wanted to create a web app which accessed our model? The way things are set up currently makes that not possible since in order to access a SageMaker endpoint the app would first have to authenticate with AWS using an IAM role which included access to SageMaker endpoints. However, there is an easier way! We just need to use some additional AWS services.

<img src="Web App Diagram.svg">

The diagram above gives an overview of how the various services will work together. On the far right is the model which we trained above and which is deployed using SageMaker. On the far left is our web app that collects a user's movie review, sends it off and expects a positive or negative sentiment in return.

In the middle is where some of the magic happens. We will construct a Lambda function, which you can think of as a straightforward Python function that can be executed whenever a specified event occurs. We will give this function permission to send and recieve data from a SageMaker endpoint.

Lastly, the method we will use to execute the Lambda function is a new endpoint that we will create using API Gateway. This endpoint will be a url that listens for data to be sent to it. Once it gets some data it will pass that data on to the Lambda function and then return whatever the Lambda function returns. Essentially it will act as an interface that lets our web app communicate with the Lambda function.

### Setting up a Lambda function

The first thing we are going to do is set up a Lambda function. This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the review) to the SageMaker endpoint we've created and then return the result.

#### Part A: Create an IAM Role for the Lambda function

Since we want the Lambda function to call a SageMaker endpoint, we need to make sure that it has permission to do so. To do this, we will construct a role that we can later give the Lambda function.

Using the AWS Console, navigate to the **IAM** page and click on **Roles**. Then, click on **Create role**. Make sure that the **AWS service** is the type of trusted entity selected and choose **Lambda** as the service that will use this role, then click **Next: Permissions**.

In the search box type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, click on **Next: Review**.

Lastly, give this role a name. Make sure you use a name that you will remember later on, for example `LambdaSageMakerRole`. Then, click on **Create role**.

#### Part B: Create a Lambda function

Now it is time to actually create the Lambda function.

Using the AWS Console, navigate to the AWS Lambda page and click on **Create a function**. When you get to the next page, make sure that **Author from scratch** is selected. Now, name your Lambda function, using a name that you will remember later on, for example `sentiment_analysis_func`. Make sure that the **Python 3.6** runtime is selected and then choose the role that you created in the previous part. Then, click on **Create Function**.

On the next page you will see some information about the Lambda function you've just created. If you scroll down you should see an editor in which you can write the code that will be executed when your Lambda function is triggered. In our example, we will use the code below. 

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Once you have copy and pasted the code above into the Lambda code editor, replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that we deployed earlier. You can determine the name of the endpoint using the code cell below.

In [42]:
predictor.endpoint

'sagemaker-pytorch-2020-04-09-11-07-27-503'

Once you have added the endpoint name to the Lambda function, click on **Save**. Your Lambda function is now up and running. Next we need to create a way for our web app to execute the Lambda function.

### Setting up API Gateway

Now that our Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

Using AWS Console, navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, make sure that **New API** is selected and give the new api a name, for example, `sentiment_analysis_api`. Then, click on **Create API**.

Now we have created an API, however it doesn't currently do anything. What we want it to do is to trigger the Lambda function that we created earlier.

Select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created, select its dropdown menu and select **POST**, then click on the check mark beside it.

For the integration point, make sure that **Lambda Function** is selected and click on the **Use Lambda Proxy integration**. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object as it will also not be processed by API Gateway.

Type the name of the Lambda function you created earlier into the **Lambda Function** text entry box and then click on **Save**. Click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function you created.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. You will need to create a new Deployment stage and name it anything you like, for example `prod`.

You have now successfully set up a public API to access your SageMaker model. Make sure to copy or write down the URL provided to invoke your newly created public API as this will be needed in the next step. This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.

## Step 4: Deploying our web app

Now that we have a publicly available API, we can start using it in a web app. For our purposes, we have provided a simple static html file which can make use of the public api you created earlier.

In the `website` folder there should be a file called `index.html`. Download the file to your computer and open that file up in a text editor of your choice. There should be a line which contains **\*\*REPLACE WITH PUBLIC API URL\*\***. Replace this string with the url that you wrote down in the last step and then save the file.

Now, if you open `index.html` on your local computer, your browser will behave as a local web server and you can use the provided site to interact with your SageMaker model.

If you'd like to go further, you can host this html file anywhere you'd like, for example using github or hosting a static site on Amazon's S3. Once you have done this you can share the link with anyone you'd like and have them play with it too!

> **Important Note** In order for the web app to communicate with the SageMaker endpoint, the endpoint has to actually be deployed and running. This means that you are paying for it. Make sure that the endpoint is running when you want to use the web app but that you shut it down when you don't need it, otherwise you will end up with a surprisingly large AWS bill.

**TODO:** Make sure that you include the edited `index.html` file in your project submission.

Now that your web app is working, trying playing around with it and see how well it works.

**Question**: Give an example of a review that you entered into your web app. What was the predicted sentiment of your example review?

**Answer:** "It's a great action movie! Definitely recommended"
- "our review was POSITIVE!"

### Delete the endpoint

Remember to always shut down your endpoint if you are no longer using it. You are charged for the length of time that the endpoint is running so if you forget and leave it on you could end up with an unexpectedly large bill.

In [43]:
predictor.delete_endpoint()