# Sentiment Analysis
By the end of this notebook we will have built a simple web page where a user can enter a movie review. The page sends the review to our deployed model, which predicts the sentiment of the submitted review.

## Imports

In [1]:
import os
import glob

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

import pickle

import numpy as np
import pandas as pd

from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

## Getting the data
We'll be using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

In [2]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2020-06-21 20:48:20--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-06-21 20:48:34 (5.70 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Preprocessing data

In [3]:
def read_imdb_data(data_dir='../data/aclImdb'):
    '''
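    Read the raw IMDb reviews from disk.

    Returns two nested dicts, data and labels, keyed first by split
    ('train' / 'test') and then by sentiment ('pos' / 'neg').
    Labels are 1 for 'pos' and 0 for 'neg'.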
    '''
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [4]:
data, labels = read_imdb_data()

In [5]:
print(f"IMDB reviews: train = {len(data['train']['pos'])} pos / {len(data['train']['neg'])} neg, test = {len(data['test']['pos'])} pos / {len(data['test']['neg'])} neg")

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [6]:
def prepare_imdb_data(data, labels):
    """
    Combine the positive and negative reviews of each split into single
    lists and shuffle every split together with its labels.
    """
    
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    return data_train, data_test, labels_train, labels_test

In [7]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)

In [9]:
print(f"IMDb reviews (total): train = {len(train_X)}, test = {len(test_X)}")

IMDb reviews (total): train = 25000, test = 25000


### Taking a peek at the training data

In [16]:
print(f"Review: {train_X[100]}")
print(f"Rating: {train_y[100]}")

Review: I like bad movies. I like to rent bad movies with my friends and rip on them for their duration. Then there are abhorrent movies like this. Redline is not just a bad movie, but a telling sign that maybe the American movie industry should please, for the sake of the viewer, at least proofread scripts before funding a movie.<br /><br />If a stereotype took a crap, this movie would spawn from that. The storyline is unbearable, and the acting all around is laughable. Nadia Bjorlin and Eddie Griffin have, perhaps, the worst screen chemistry I've seen in a good while, and even individually they should be isolated from humanity and beaten with a bag of oranges until they change their profession to street merchants (about the only thing they can legitimately qualify for). Furthermore, how Angus Macfadyen got convinced to do this movie is so far beyond me that I can't even think of an analogy. I am a loyal fan of his, but this has made me question him.<br /><br />To sum it up. Several p

### Removing html tags and tokenizing text

In [17]:
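def review_to_words(review):
    """
    Convert a raw review string into a list of stemmed, lowercase words
    with HTML tags, punctuation and English stop words removed.
    """
    
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))  # build once; set lookups are fast
    
    text = BeautifulSoup(review, "html.parser").get_text()  # strip HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # lowercase, drop punctuation
    words = text.split()
    words = [w for w in words if w not in stop_words]  # remove stop words
    words = [stemmer.stem(w) for w in words]  # reuse the single stemmer
    
    return words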

In [19]:
# double checking our function works on an example review
sample_review_output = review_to_words(train_X[0])
sample_review_output[:10]

['stefan',
 'x',
 'con',
 'five',
 'year',
 'ago',
 'got',
 'marri',
 'mari',
 'marriag']
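
To make the cleaning steps easier to follow, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the function above: tags are removed with a crude regular expression instead of BeautifulSoup, there is no stemming, and `TOY_STOP_WORDS` is a hypothetical, tiny stop-word list made up for illustration.

```python
import re

# Hypothetical, tiny stop-word list for illustration only (NLTK's list is much longer).
TOY_STOP_WORDS = {"the", "a", "is", "this", "was", "and"}

def simple_review_to_words(review):
    """Strip tags, lowercase, drop punctuation and toy stop words (no stemming)."""
    text = re.sub(r"<[^>]+>", " ", review)             # crude HTML tag removal
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # keep only letters and digits
    return [w for w in text.split() if w not in TOY_STOP_WORDS]

print(simple_review_to_words("This movie was GREAT!<br /><br />Loved the ending."))
# prints ['movie', 'great', 'loved', 'ending']
```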

In [20]:
# setting a cache directory so later runs can skip the slow preprocessing step
cache_dir = os.path.join("../cache", "sentiment_analysis") 
os.makedirs(cache_dir, exist_ok=True) 

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """
    Convert each review to words; read from cache if available.
    """

    # if cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, pickle.UnpicklingError, EOFError):
            pass  # unable to read from cache, but that's okay
    
    # if cache is missing, then do the heavy lifting
    if cache_data is None:
        # preprocess training and test data to obtain words for each review
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # writing to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test
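
The caching idea in `preprocess_data` can be boiled down to a small "load or compute" helper. The sketch below is only an illustration of that pattern: `load_or_compute` and the `demo.pkl` file name are hypothetical, and it writes to a temporary directory rather than to `../cache`.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (value, came_from_cache); compute and cache the value if needed."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    except (OSError, pickle.UnpicklingError, EOFError):
        value = compute()                      # the slow step
        with open(cache_path, "wb") as f:
            pickle.dump(value, f)
        return value, False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    first = load_or_compute(path, lambda: [w.upper() for w in ["a", "b"]])
    second = load_or_compute(path, lambda: [w.upper() for w in ["a", "b"]])

print(first)   # computed on the first call
print(second)  # read back from the cache on the second call
```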

In [21]:
# reading in preprocessed data 
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file: preprocessed_data.pkl


## Building word dictionary

In [24]:
def build_dict(data, vocab_size = 5000):
    """
    Build a dict that maps the most frequent words in data to integer ids.
    Ids start at 2 and the most frequent word gets the smallest id.
    """
    
    word_count = {}
    
    for review in data:
        for word in review:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    
    sorted_words = [item[0] for item in sorted(word_count.items(), key = lambda x:x[1], reverse=True)]
    
    word_dict = {} 
    # ids 0 and 1 are reserved for 'no word' (padding) and 'infrequent word',
    # so only vocab_size - 2 words get their own id, starting at 2
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        word_dict[word] = idx + 2
        
    return word_dict

In [25]:
word_dict = build_dict(train_X)

In [26]:
# examining top words in dictionary
list(word_dict.keys())[:5]

['movi', 'film', 'one', 'like', 'time']
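
To see how the ids come out, here is a small sketch of the same idea on toy data. It uses `collections.Counter` instead of the hand-written counting loop, and `build_toy_dict` and `toy_reviews` are made-up names for illustration.

```python
from collections import Counter

def build_toy_dict(reviews, vocab_size):
    """Map the most frequent words to ids 2, 3, ...; ids 0 and 1 stay reserved."""
    counts = Counter(word for review in reviews for word in review)
    top_words = counts.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(top_words)}

toy_reviews = [["movi", "film", "movi"], ["movi", "film", "bore"]]
toy_dict = build_toy_dict(toy_reviews, vocab_size=4)
print(toy_dict)  # prints {'movi': 2, 'film': 3}
```

With `vocab_size=4` only two words fit, so the rarer word `'bore'` is left out and would later be encoded as the 'infrequent word' id.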

In [27]:
# creating the directory where we'll save the word dict
data_dir = '../data/pytorch' 
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [28]:
# saving word dict for later use
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

## Transforming reviews

In [30]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0  # id used for padding (no word at this position)
    INFREQ = 1  # id used for any word that is not in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)
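
A quick way to check our understanding of the padding and truncation is to run the same logic on toy input. This is a compact sketch, with a hypothetical `encode` helper, a made-up `toy_dict` and `pad=5` instead of 500.

```python
NOWORD, INFREQ = 0, 1  # same reserved ids as in convert_and_pad

def encode(word_dict, sentence, pad):
    """Truncate to pad words, map unknown words to INFREQ, pad with NOWORD."""
    ids = [word_dict.get(w, INFREQ) for w in sentence[:pad]]
    return ids + [NOWORD] * (pad - len(ids)), min(len(sentence), pad)

toy_dict = {"movi": 2, "film": 3}  # hypothetical toy vocabulary
short = encode(toy_dict, ["movi", "bore", "film"], pad=5)
long = encode(toy_dict, ["film"] * 7, pad=5)
print(short)  # prints ([2, 1, 3, 0, 0], 3)
print(long)   # prints ([3, 3, 3, 3, 3], 5)
```

The second return value is the number of real (non-padding) words, capped at `pad`.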

In [31]:
# pulling out converted training data
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [32]:
# making sure reviews have been processed correctly
train_X[100]

array([   5,   24,    2,    5,  443,   24,    2,  132,  943, 4826,    1,
          2,    5,    1,   24,    2,  133, 1061,  203,  183,    2, 1181,
        444, 1842,  256,  141,    1,  125, 3150,    2,  901,  493,  522,
          2,   15, 4827,  682, 2774,   32,  110, 1057,    1,    1, 1448,
          1,  332,  174,  166, 1084,   47,    7,   14, 1267, 2547,  220,
       3089, 2345, 3763,  255, 3816,  577,    1,   35, 4733, 2971, 3445,
          1,    1,  111,  548,    2,  151,  652,   14,   30,    1, 3541,
        123,   34,  416, 1634,  335,   23,   50,  981,  148,  134,  246,
        119,   37, 1773,  127,   21,   95,  498,  355,    2,    5,    1,
       2380, 3792, 3680, 1967, 4676,  585,  338,  651,  220, 2156,    2,
       1227,    4,  116,  378, 1311,   28,  252,    1,  104,    2,  302,
        564, 1157,  426,   36,   61,    4,  656,  660,    2,   30,  875,
        100,  715,  428,  837, 1440,    1,    1,  594,   42,  282,    2,
       2856,  189,   72,  321,    1, 1440,  443,   

## Sending data to S3

In [39]:
# first let's save locally to a csv (columns: label, review length, then the 500 word ids)
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
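
Each row of `train.csv` therefore holds the label, the review length and then the padded word ids, with no header. The sketch below writes and reads one such row with the standard `csv` module to show the layout. The values are toy data with 5 word ids instead of 500, and how `train.py` actually reads the file is not shown in this notebook, so treat the reading half as an assumption.

```python
import csv
import io

# One toy row: label, review length, then the padded word ids.
label, length, padded = 1, 3, [2, 1, 3, 0, 0]

buffer = io.StringIO()
csv.writer(buffer).writerow([label, length] + padded)

# Reading it back: column 0 is the label, column 1 the length, the rest are word ids.
row = [int(v) for v in next(csv.reader(io.StringIO(buffer.getvalue())))]
row_label, row_length, row_ids = row[0], row[1], row[2:]
print(row_label, row_length, row_ids)  # prints 1 3 [2, 1, 3, 0, 0]
```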

In [40]:
# now let's send it to the default SageMaker bucket in S3
sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [41]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

## Build and train PyTorch model

In [56]:
estimator = PyTorch(entry_point="train.py",
                    source_dir="../train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [57]:
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-21 22:40:06 Starting - Starting the training job...
2020-06-21 22:40:08 Starting - Launching requested ML instances.........
2020-06-21 22:41:39 Starting - Preparing the instances for training.........
2020-06-21 22:43:09 Downloading - Downloading input data...
2020-06-21 22:43:46 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-21 22:44:16,897 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-06-21 22:44:16,922 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-06-21 22:44:19,939 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-06-21 22:44:20,164 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-06-21 22:44:20,165 sagemaker-containers IN

Epoch: 1, BCELoss: 0.6638028390553533
Epoch: 2, BCELoss: 0.5903471173072348
Epoch: 3, BCELoss: 0.5269349436370694
Epoch: 4, BCELoss: 0.442237063330047
Epoch: 5, BCELoss: 0.40509845833389124
Epoch: 6, BCELoss: 0.3793247591476051
Epoch: 7, BCELoss: 0.3388176852343034
Epoch: 8, BCELoss: 0.31961560766307673
Epoch: 9, BCELoss: 0.3105925583109564
Epoch: 10, BCELoss: 0.28699194959231783
2020-06-21 22:47:41,273 sagemaker-containers INFO     Reporting training SUCCESS

2020-06-21 22:47:50 Uploading - Uploading generated training model
2020-06-21 22:47:50 Completed - Training job completed
Training seconds: 281
Billable seconds: 281


## Testing model
We'll deploy the model first so that we can also check that the deployment works as expected.

In [58]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!

In [59]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [60]:
def predict(data, rows=512):
    '''
    Send data to the endpoint in chunks of at most `rows` rows
    and return all predictions as one flat array.
    '''
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions
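
The chunking in `predict` sends the test set in several smaller requests instead of one large one. To see what the formula does with our 25000 test reviews, here is a standard-library sketch that reproduces the chunk sizes `np.array_split` would produce. `chunk_sizes` is a hypothetical helper written only for this check.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks that np.array_split would produce inside predict()."""
    n_chunks = int(n_rows / float(rows) + 1)  # same formula as in predict()
    base, extra = divmod(n_rows, n_chunks)
    # array_split gives the first `extra` chunks one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), min(sizes), sum(sizes))  # prints 49 511 510 25000
```

So the 25000 rows go out as 49 requests of 510 or 511 rows each, all below the `rows=512` limit.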

In [61]:
# getting and rounding the prediction values 
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [62]:
# getting our accuracy score
accuracy_score(test_y, predictions)

0.85104

### Testing a review

In [63]:
test_review = 'Nothing was typical about this. Everything was beautifully done in this movie, the story, the flow, the scenario, everything. I highly recommend it for mystery lovers, for anyone who wants to watch a good movie!'

In [64]:
# running our test review through the preprocessing steps
test_data_review_to_words = review_to_words(test_review)
test_data = [np.array(convert_and_pad(word_dict, test_data_review_to_words)[0])]

In [65]:
# getting our prediction
test_prediction = predictor.predict(test_data)
test_prediction

array(0.9893678, dtype=float32)

Because the return value of our model is close to 1, we can be pretty sure the review was positive!
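
Turning a score like this into a label is just a comparison against 0.5. The sketch below spells that out with a hypothetical `to_label` helper; the score 0.12 is a made-up example of a negative review.

```python
def to_label(probability, threshold=0.5):
    """Return 1 (positive) if the score is above the threshold, else 0 (negative)."""
    return 1 if probability > threshold else 0

print(to_label(0.9893678))        # prints 1
print(to_label(0.12))             # prints 0
print(round(0.5), to_label(0.5))  # prints 0 0
```

Python's built-in `round` rounds halves to the nearest even number, so `round(0.5)` is 0; an explicit threshold makes that boundary choice visible.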

In [66]:
# optional: delete the endpoint to save cost
estimator.delete_endpoint()

## Deploying endpoint again for web app
The web app will send the review as a plain string, whereas the predictor we used above sent a NumPy array. So we wrap `RealTimePredictor` in a `StringPredictor` class that sends its input with the content type `text/plain`.

In [67]:
class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')
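
On the serving side, the script has to accept that `text/plain` body. The contents of `../serve/predict.py` are not shown in this notebook, so the following is only a guess at what its input handler might look like. It assumes the usual SageMaker PyTorch serving convention of defining an `input_fn` that receives the request body and its content type.

```python
def input_fn(serialized_input_data, content_type):
    """Decode a text/plain request body into a Python string (assumed handler)."""
    if content_type == "text/plain":
        # the body may arrive as raw bytes or already as str
        if isinstance(serialized_input_data, bytes):
            return serialized_input_data.decode("utf-8")
        return serialized_input_data
    raise ValueError("Unsupported content type: " + content_type)

print(input_fn(b"A good movie!", "text/plain"))  # prints A good movie!
```

The decoded string would then still need the same preprocessing as in training (`review_to_words` and `convert_and_pad`) before it reaches the model.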

In [70]:
model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='../serve',
                     predictor_cls=StringPredictor)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [71]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!

### Testing the deployed model again on a sample of the test data

In [73]:
def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    actual = []
     
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        for f in files:
            with open(f) as review:
                if sentiment == 'pos':
                    actual.append(1)
                else:
                    actual.append(0)
                # read the review and encode it as UTF-8 for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # send the review to the predictor and store the result
                results.append(float(predictor.predict(review_input)))
                
            # we'll only send `stop` reviews per sentiment (250 by default) since this is just a test
            files_read += 1
            if files_read == stop:
                break
            
    return actual, results

In [74]:
# grabbing the actual sentiment values and the predictions from our test review function
actual, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [76]:
# getting our accuracy score. Isn't sklearn so handy?!
accuracy_score(actual, results)

0.85
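
For reference, the accuracy here is simply the fraction of reviews whose predicted label matches the actual one. A hand-rolled sketch on toy values (both lists are made up) looks like this:

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

toy_actual = [1, 1, 0, 0]
toy_predicted = [1.0, 0.0, 0.0, 0.0]  # floats, like the values collected in results
print(accuracy(toy_actual, toy_predicted))  # prints 0.75
```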

In [77]:
# let's also send the test review from earlier, since we already know its sentiment
predictor.predict(test_review)

b'1.0'

In [78]:
predictor.endpoint

'sagemaker-pytorch-2020-06-22-00-52-31-245'