# 1. Downloading the data

This notebook uses the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/):

> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.

In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2020-09-26 03:30:12--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-09-26 03:30:18 (13.7 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



# 2. Preparing and Processing the data

Initial data processing: read in each review and combine the reviews into a single input structure, organised by the dataset's training and testing splits.

In [9]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [10]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


The function below combines the positive and negative reviews and shuffles the resulting records.

In [11]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [12]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000
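The important property of the shuffle step is that reviews and labels are permuted together, so every review keeps its own label. The sketch below illustrates that idea with the standard library only; `shuffle_together` is a hypothetical helper written for this illustration and is not how `sklearn.utils.shuffle` is implemented.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with the same random permutation."""
    assert len(items) == len(labels)
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # one permutation, applied to both lists
    return [items[i] for i in order], [labels[i] for i in order]

reviews = ["great film", "awful plot", "loved it", "boring"]
labels = [1, 0, 1, 0]
pairs_before = set(zip(reviews, labels))

shuffled_reviews, shuffled_labels = shuffle_together(reviews, labels, seed=0)

# The order may change, but each review is still paired with its original label
print(set(zip(shuffled_reviews, shuffled_labels)) == pairs_before)
```
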


A quick check of one example from the data our model will be trained on confirms that the data has been loaded correctly.

In [13]:
print(train_X[100])
print(train_y[100])

The Japanese "Run Lola Run," his is one offbeat movie which will put a smile on just about anyone's face. Fans of Run Lola Run, Tampopo, Go!, and Slacker will probably like this one. It does tend to follow a formula that is increasingly popular these days of separate, seemingly unrelated vignettes, all contributing the the overall story in unexpected ways. catch it if you see it, otherwise wait for the rental.
1


As the first step in processing, we need to remove any HTML tags that appear in the raw input texts. In addition, we need to stem the words, so that *entertained* and *entertaining* are treated as the same word for sentiment analysis.

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per call
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Lower-case and replace non-alphanumerics with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word with the shared stemmer
    
    return words
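To see what the regular-expression step does on its own, here is a small standard-library sketch of just the lower-casing and character-filtering part. The helper name `normalize` is made up for this illustration; it deliberately leaves out HTML stripping, stopword removal and stemming, which rely on BeautifulSoup and NLTK.

```python
import re

def normalize(text):
    """Lower-case, replace every non-alphanumeric character with a space, then split."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

tokens = normalize("It's GREAT... 10/10!")
print(tokens)  # ['it', 's', 'great', '10', '10']
```

Note that the apostrophe splits `it's` into `it` and `s`; in the full function, short fragments like these are typically dropped by the stopword filter.
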

Check to ensure we know how everything is working.

In [15]:
review_to_words(train_X[100])

['japanes',
 'run',
 'lola',
 'run',
 'one',
 'offbeat',
 'movi',
 'put',
 'smile',
 'anyon',
 'face',
 'fan',
 'run',
 'lola',
 'run',
 'tampopo',
 'go',
 'slacker',
 'probabl',
 'like',
 'one',
 'tend',
 'follow',
 'formula',
 'increasingli',
 'popular',
 'day',
 'separ',
 'seemingli',
 'unrel',
 'vignett',
 'contribut',
 'overal',
 'stori',
 'unexpect',
 'way',
 'catch',
 'see',
 'otherwis',
 'wait',
 'rental']

The function below applies `review_to_words` to each review in the training and testing datasets and caches the results, so that later runs do not need to process the data again.

In [16]:
import pickle
import joblib # imported directly instead of via the private sklearn.utils._joblib; joblib seems to work faster than pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [17]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Read preprocessed data from cache file: preprocessed_data.pkl


# 3. Transform the data

Constructing feature representation:

To start, we will represent each word as an integer. Since some words appear very infrequently, we will fix the size of the working vocabulary and keep only the most frequently appearing words. All remaining infrequent words are grouped into a single category with the label `1` (`INFREQ`).

Next, we will pad each input review so that every sequence has the same length and can be batched for our recurrent neural network. Reviews longer than the pad length are truncated to that size, while shorter reviews are extended to it, with the label `0` (`NOWORD`) filling the remaining positions.

### Word map dictionary
The `build_dict()` function below maps the words that appear in the reviews to integers and sets the size of our vocabulary to `5000`. The resulting `word_dict` contains `4998` words, because the indices `0` and `1` are reserved for the `no word` and `infrequent word` labels.

In [19]:
import numpy as np
from collections import Counter

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Count how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    word_count = Counter()
    for sentence in data:
        word_count.update(sentence)
    
    # Sort the words so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word.
    sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word[0]] = idx + 2                              # 'infrequent' labels
        
    return word_dict
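As a toy illustration of the index layout, the sketch below builds a vocabulary of size 5 from three tiny "reviews". It uses `Counter.most_common` as a shortcut, so it is a simplified stand-in for `build_dict` rather than a copy of it; `toy_build_dict` and the sample sentences are invented for this example.

```python
from collections import Counter

def toy_build_dict(sentences, vocab_size):
    """Map the (vocab_size - 2) most frequent words to the integers 2, 3, 4, ..."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    kept = counts.most_common(vocab_size - 2)  # indices 0 and 1 stay reserved
    return {word: idx + 2 for idx, (word, _) in enumerate(kept)}

sentences = [
    ["movi", "good", "movi", "film"],
    ["film", "good", "movi"],
    ["plot", "bad", "good", "movi"],
]
toy_dict = toy_build_dict(sentences, vocab_size=5)
print(toy_dict)  # {'movi': 2, 'good': 3, 'film': 4}
```

The words `plot` and `bad` do not make it into the dictionary, so they would later be encoded with the shared `INFREQ` label `1`.
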

In [20]:
word_dict = build_dict(train_X)

In [22]:
print(list(word_dict)[:5])
list(word_dict)[:10]

['movi', 'film', 'one', 'like', 'time']


['movi',
 'film',
 'one',
 'like',
 'time',
 'good',
 'make',
 'charact',
 'get',
 'see']

The top five words are `movi`, `film`, `one`, `like`, and `time`. These, and the words further down the list, make sense given that the reviews are about movies.

### Save `word_dict`

We will save `word_dict` to a file so that the inference code behind our endpoint can load it later.

In [23]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)

In [24]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

With our word dictionary, we can convert each review into its integer sequence representation and pad or truncate it to a fixed length of 500. `convert_and_pad` handles a single review, while `convert_and_pad_data` applies it to an entire dataset.

In [25]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and use 1 to represent infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)
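To make the padding and truncation behaviour concrete, here is a compact toy version with a pad length of 5 instead of 500. `toy_convert_and_pad` and the three-word dictionary are assumptions made for this illustration; the logic is intended to mirror `convert_and_pad` above.

```python
NOWORD, INFREQ = 0, 1  # reserved labels, as in the notebook

def toy_convert_and_pad(word_dict, sentence, pad):
    """Encode a list of words as integers, then pad or truncate to `pad` entries."""
    ids = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    return ids + [NOWORD] * (pad - len(ids)), min(len(sentence), pad)

toy_dict = {"movi": 2, "good": 3, "film": 4}

# A short review is padded with NOWORD; the unknown word 'plot' becomes INFREQ
short, short_len = toy_convert_and_pad(toy_dict, ["good", "plot", "movi"], pad=5)
# A long review is truncated to the pad length
long_, long_len = toy_convert_and_pad(toy_dict, ["film"] * 7, pad=5)

print(short, short_len)  # [3, 1, 2, 0, 0] 3
print(long_, long_len)   # [4, 4, 4, 4, 4] 5
```
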

In [26]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

A quick look to make sure that things are working as intended:

In [27]:
print(len(train_X))
print(train_X[3])
print(train_X_len[3])

print(test_X[3])
print(test_X_len[3])

25000
[ 323    6  835  201  381   91  487  803 1076  181   37  152  133  438
  253  800   31    1   36  284  881    2  152 4356 1832    1   65   24
  308  448   18  155  115    3  389 1079 1304 3934    1 1361    1  228
  884  235  435 4284 1261  190   53   63    2  135 1692  990  235  676
  618 1586  297 1402    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0


# 3. Upload the data to S3

We will need to upload the training dataset to S3 in order for our training code to access it. For now we will save it locally and we will upload to S3 later on.

### Save the processed training dataset locally

For reference, each row of the dataset has the format `label`, `length`, `review[500]`, where `review[500]` is a sequence of `500` integers representing the words in the review. We will need to know this to write our training code later on.

In [28]:
import pandas as pd

pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
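A training script has to undo this layout when it reads the file. The following standard-library sketch parses two hand-written rows in the same `label, length, review[...]` order, using a toy pad length of 5 instead of 500. The sample rows are invented, and the actual `train.py` may read the file differently (for example with pandas).

```python
import csv
import io

PAD = 5  # toy pad length; the notebook uses 500

# Two rows in the order: label, length, review[PAD]
csv_text = "1,3,3,1,2,0,0\n0,5,4,4,4,4,4\n"

rows = []
for record in csv.reader(io.StringIO(csv_text)):
    values = [int(v) for v in record]
    label, length, review = values[0], values[1], values[2:]
    assert len(review) == PAD  # every row carries exactly PAD word indices
    rows.append((label, length, review))

print(rows[0])  # (1, 3, [3, 1, 2, 0, 0])
```
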

### Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [29]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()


In [30]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. We will need this later on when we create an endpoint that accepts an arbitrary review.

# 4. Build and Train the PyTorch Model

We will be using containers provided by Amazon with the added benefit of being able to include our own custom code.

We will start by implementing our own neural network in PyTorch along with a training script. The model class is defined in the `model.py` file, inside the `train` folder. Below is our model architecture.

In [31]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)


The model has three parameters (the embedding dimension, the hidden dimension and the vocabulary size) that we can tune to improve its performance.
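To get a feel for how these three values drive model size, the sketch below counts the trainable weights of the three layers shown above. It assumes a single-layer, unidirectional LSTM with two bias vectors per gate, which is the default layout of `nn.LSTM`; `count_parameters` is a hypothetical helper, and the example values (32, 100, 5000) are the ones used for the local test run later in this notebook.

```python
def count_parameters(embedding_dim, hidden_dim, vocab_size):
    """Count weights for Embedding -> single-layer LSTM -> Linear(hidden_dim, 1)."""
    embedding = vocab_size * embedding_dim
    # Four gates, each with input weights, recurrent weights and two bias vectors
    lstm = 4 * (embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    dense = hidden_dim * 1 + 1  # weights plus one bias
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

counts = count_parameters(embedding_dim=32, hidden_dim=100, vocab_size=5000)
print(counts)
# {'embedding': 160000, 'lstm': 53600, 'dense': 101, 'total': 213701}
```

With these settings the embedding table accounts for most of the weights, so the vocabulary size and embedding dimension have the largest effect on model size.
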

Next, we will load a small portion of the training data set to use as a sample. Since the notebook instance has no GPU and too little compute to process the entire dataset quickly, a small sample is a better way to check that our training script behaves as expected.

In [32]:
import torch
import torch.utils.data


# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas df into tensors
train_samples_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_samples_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build dataset
train_sample_ds = torch.utils.data.TensorDataset(train_samples_X, train_samples_y)
# Build dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

In [33]:
example = iter(train_sample_dl)
ex_X, ex_y = next(example)
print(ex_X, ex_y)

tensor([[  56,  544,    1,  ...,    0,    0,    0],
        [ 199,    1, 1953,  ...,    0,    0,    0],
        [  67,   28,  135,  ...,    0,    0,    0],
        ...,
        [ 125,   77,    2,  ...,    0,    0,    0],
        [  99,  126,    1,  ...,    0,    0,    0],
        [ 240,    2, 3710,  ...,    0,    0,    0]]) tensor([1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1.,
        1., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 1., 0.])


In [34]:
ex_X.size()

torch.Size([50, 501])

### Writing the training method

Next we need to write the training code itself. (This will be saved as a script inside `train/train.py` file for subsequent train runs.)

In [36]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):

    
    for epoch in range(1, epochs+1):
        model.train()
        total_loss = 0
        
        for batch in train_loader:
            
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            output = model(batch_X)
            
            loss = loss_fn(output, batch_y)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

A quick run of the method above lets us diagnose any errors early.

In [37]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6963516235351562
Epoch: 2, BCELoss: 0.686224639415741
Epoch: 3, BCELoss: 0.6776556372642517
Epoch: 4, BCELoss: 0.6686712384223938
Epoch: 5, BCELoss: 0.6584097862243652
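For context on the loss values above: binary cross-entropy is the mean of `-(y * log(p) + (1 - y) * log(1 - p))` over the batch, so a model that always outputs `0.5` scores `ln(2) ≈ 0.693`. The plain-Python sketch below reproduces that baseline; `bce` is a hypothetical helper for illustration, not the `torch.nn.BCELoss` implementation.

```python
import math

def bce(probs, targets):
    """Mean binary cross-entropy for predicted probabilities and 0/1 targets."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(probs, targets))
    return -total / len(probs)

targets = [1, 0, 1, 0]
chance = bce([0.5, 0.5, 0.5, 0.5], targets)     # an uninformed model
confident = bce([0.9, 0.1, 0.8, 0.2], targets)  # mostly correct predictions

print(round(chance, 4))    # 0.6931
print(confident < chance)  # True
```

Seen this way, the sample run starts near the chance baseline and improves only slightly, which is plausible for five epochs on 250 reviews.
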


When the training container is executed, SageMaker will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### Training the model

First, we need to specify an entry point through which SageMaker can pass in the hyperparameters and train the model. This will be our `train.py` file in the `train` folder.

In [38]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [39]:
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-09-29 01:35:14 Starting - Starting the training job...
2020-09-29 01:35:16 Starting - Launching requested ML instances......
2020-09-29 01:36:19 Starting - Preparing the instances for training.........
2020-09-29 01:38:15 Downloading - Downloading input data...
2020-09-29 01:38:44 Training - Downloading the training image...
2020-09-29 01:39:04 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-09-29 01:39:05,861 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-09-29 01:39:05,886 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-09-29 01:39:08,936 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-09-29 01:39:10,234 sagemaker-containers INFO     Module train does not provide a setup.py. 

# 5. Testing the model

We will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

# 6. Deploy the model for testing

Now that we have trained our model, we would like to test it to see how it performs. Currently our model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. SageMaker provides built-in inference code for models with simple inputs such as this.

Inside our `train.py` file, the `model_fn()` function takes the path to the directory where the model artifacts are stored and loads the saved model.

Since we don't need to change anything in the code that was uploaded during training, we can simply deploy the current model as-is.

In [40]:
# Deploy the trained model
from sagemaker.predictor import csv_serializer


predictor = estimator.deploy(initial_instance_count = 1,
                         instance_type = 'ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!

# 7. Use the model for testing

Once deployed, we can read in the test data and send it off to our deployed model to get some results. Once we collect all of the results we can determine how accurate our model is.

In [41]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [42]:
# Split data into chunks and send each, accumulating the results
def predict(data, rows=512):
    split_array = np.array_split(data, 
                                  int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
        
    return predictions
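The chunk count in `predict` is `int(n_rows / rows + 1)`, and `np.array_split` spreads the rows as evenly as possible across that many chunks. The standard-library sketch below reproduces the resulting chunk sizes without NumPy; `chunk_sizes` is a hypothetical helper written for this illustration.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks that predict() would send for n_rows inputs."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # array_split gives the first `extra` chunks one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = chunk_sizes(25000)
print(len(sizes), max(sizes), min(sizes), sum(sizes))  # 49 511 510 25000

# When n_rows is an exact multiple of rows, the "+ 1" adds an extra chunk
exact = chunk_sizes(1024)
print(exact)  # [342, 341, 341]
```

For the 25,000 test reviews this means 49 requests of at most 511 rows each, so every request stays below the `rows=512` limit.
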

In [43]:
predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

In [44]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)

0.8558

### More testing

We now have a trained, deployed model to which we can send processed reviews and which returns the predicted sentiment. Ultimately, however, we would like to send the model an unprocessed review as a plain string.

In [45]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

In order to send this review to our model, we must repeat the preprocessing steps:
 - Remove any HTML tags and stem the input using the `review_to_words` method
 - Encode the review as a sequence of integers with the word dictionary using the `convert_and_pad` method

Finally, we pack the result into a single array with the format `review_length, review[500]` and wrap it as the tensor `test_data`.


In [46]:
test_padded, test_len = convert_and_pad(word_dict,review_to_words(test_review))
test_pack = np.hstack((test_len, test_padded))
test_pack = test_pack.reshape(1, -1)
#test_df = pd.DataFrame(test_input_list)
test_data = torch.from_numpy(test_pack)
test_data


tensor([[  20,    1, 1376,   50,   53,    3,    4,  878,  173,  392,  682,   29,
          724,    2, 4428,  275, 2082, 1060,  760,    1,  582,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,  

Now that we have processed the review, we can send the resulting array to our model to predict the sentiment of the review.

In [47]:
predictor.predict(test_data)

array(0.8404054, dtype=float32)

Since the return value of our model is close to `1`, the model predicts that the review we submitted is positive.
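Turning the model output into a label is a thresholding step. The sketch below uses an explicit cut-off of `0.5`, which is an assumption about where to draw the line; it also shows that Python's built-in `round` sends an exact `0.5` to `0`, a small difference from the explicit comparison. `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability to 1 (positive) or 0 (negative)."""
    return 1 if probability >= threshold else 0

print(to_label(0.8404054))        # 1
print(round(0.5), to_label(0.5))  # 0 1
```
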

### Delete the endpoint


In [48]:
estimator.delete_endpoint()

estimator.delete_endpoint() will be deprecated in SageMaker Python SDK v2. Please use the delete_endpoint() function on your predictor instead.


## 6.(again) Deploy the model for the web app

Now that we know that our model is working, we would like to send it a review that has not been processed and have it determine the sentiment. Since the model expects a processed review but we now wish to accept a string as input, we need to write some custom inference code.

We will store the code that we write in the `serve` directory. It includes:
- `model.py` file to construct our model
- `utils.py` file containing the `review_to_words` and `convert_and_pad` preprocessing functions
- `predict.py` file containing our custom inference code*
- `requirements.txt` file that tells SageMaker which Python libraries are required for the inference code


*Our custom inference code for deploying the model with a SageMaker container:
> - `model_fn`: This function is the same function that we used in the training script and it tells SageMaker how to load our model.
> - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
> - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
> - `predict_fn`: The heart of the inference script; this is where the actual prediction is done.
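As a rough sketch of the two serialization hooks for a `text/plain` endpoint, the code below decodes the request body and returns the prediction as a string. This is not the content of `serve/predict.py`; the function bodies are assumptions, and only the names and the `(data, content_type)` / `(prediction, accept)` argument order follow the SageMaker PyTorch serving convention as understood here.

```python
def input_fn(serialized_input_data, content_type):
    """Decode the raw request body into a Python string."""
    if content_type == "text/plain":
        if isinstance(serialized_input_data, (bytes, bytearray)):
            return serialized_input_data.decode("utf-8")
        return str(serialized_input_data)
    raise ValueError("Unsupported content type: " + content_type)

def output_fn(prediction_output, accept):
    """Serialize the prediction (assumed to be 0 or 1) as plain text."""
    return str(prediction_output)

review = input_fn(b"A charming film.", "text/plain")
print(review)                       # A charming film.
print(output_fn(1, "text/plain"))   # 1
```

In such a setup, `predict_fn` would sit between the two: it would take the decoded string, apply `review_to_words` and `convert_and_pad`, and run the model on the result.
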


### Deploying the model

Now that the custom inference code has been written, we will create and deploy our model. To begin with, we need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then we can call the deploy method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In our case we want to send a string, so we need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings. 

In [50]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


------------------!

### Testing the model

Now that we have deployed our model with the custom inference code, we should test to see if everything is working. Here we test our model by loading the first `250` positive and `250` negative test reviews, sending them to the endpoint, and collecting the results.

In [51]:
import glob

def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(int(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
            
    return ground, results

In [52]:
ground, results = test_reviews()

Starting  pos  files
Starting  neg  files


In [53]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.866

As an additional test, we can try sending the `test_review` that we looked at earlier.

In [54]:
predictor.predict(test_review)

b'1'

Now that we know our endpoint is working as expected, we can set up the web page that will interact with it.

## Use the model for the web app

1. Create and set up a Lambda function (the cell below is our edited `lambda_handler`)
2. Set up API Gateway
3. Deploy our web app (add the invoke URL to our `index.html`)

In [None]:
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = 'sagemaker-pytorch-2020-09-29-02-20-02-941',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
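The handler above expects the raw review text in `event['body']` and answers with plain text. A client therefore only needs to POST the review as `text/plain` to the API Gateway URL. The sketch below builds such a request with the standard library without sending it; `API_URL` is a placeholder rather than a real endpoint, and `build_review_request` is a hypothetical helper.

```python
import urllib.request

API_URL = "https://example.invalid/prod"  # placeholder, replace with your invoke URL

def build_review_request(api_url, review_text):
    """Build (but do not send) a POST request carrying the review as plain text."""
    return urllib.request.Request(
        api_url,
        data=review_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

req = build_review_request(API_URL, "A charming film.")
print(req.get_method(), req.full_url)

# To actually call the API (needs a real URL and network access):
# result = urllib.request.urlopen(req).read().decode("utf-8")
```
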

In [55]:
predictor.endpoint

'sagemaker-pytorch-2020-09-29-02-20-02-941'

### Initial web app test example:

- My review: "I just tried this out. It was a bit confusing, but it was pretty cool."
- Result: "Your review was NEGATIVE!"

### Delete the endpoint
(the last cell is optional code that cleans up our data and cache directories)

In [59]:
predictor.delete_endpoint()

In [None]:
# First we will remove all of the files contained in the data_dir directory
!rm $data_dir/*

# And then we delete the directory itself
!rmdir $data_dir

# Similarly we remove the files in the cache_dir directory and the directory itself
!rm $cache_dir/*
!rmdir $cache_dir