# Creating a Sentiment Analysis Web App
## Using PyTorch and SageMaker


## General Outline

Recall the general outline for SageMaker projects using a notebook instance.

1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.



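In this project you will follow the outline with two modifications. First, the model is tested by deploying it and sending the test data to the deployed endpoint, rather than with a separate batch transform job. Second, you will deploy and use your trained model a second time. In that second iteration you will customize the way that your trained model is deployed by including some of your own code, and the newly deployed model will be used in the sentiment analysis web app.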

In [1]:
#Imports
import os
import glob
from sklearn.utils import shuffle
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
import pandas as pd
import sagemaker
import torch
import torch.utils.data
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel
from bs4 import BeautifulSoup
import pickle
from sklearn.metrics import accuracy_score
import torch.optim as optim
from sagemaker.pytorch import PyTorch
from train.model import LSTMClassifier
cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists
import numpy as np

In [59]:
#######  defining all the functions: ##################

#To begin with, we will read in each of the reviews and combine them into a single input structure. 
#Then, we will split the dataset into a training set and a testing set
def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels
###################################################################################################################################

#Now that we've read the raw training and testing data from the downloaded dataset, 
#we will combine the positive and negative reviews and shuffle the resulting records.

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test
###################################################################################################################################

#The first step in processing the reviews is to remove any HTML tags that appear.
#In addition we stem each word, so that words such as entertained and entertaining are considered
#the same with regard to sentiment analysis.

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per review
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case and drop non-alphanumerics
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word
    
    return words
###################################################################################################################################

#The method below applies the review_to_words method to each of the reviews in the training and testing datasets. 
#In addition it caches the results. This is because performing this processing step can take a long time. 
#This way if you are unable to complete the notebook in the current session, you can come back without needing 
#to process the data a second time.

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test
###################################################################################################################################

#Map words to integers, keeping only the most frequent ones (vocab_size = 5000, including two reserved labels)
def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur
    
    for sentence in data:             #`data` is a list of sentences
        for word in sentence:         #'sentence' is a list of words
            if word in word_count:    #if the word is already in the dict, add 1 to its count
                word_count[word]+=1
            else:
                word_count[word]=1    #if the word is not yet in the dict, add it with the count of 1
                
    sorted_words_list = sorted(word_count.items(),key=lambda x:x[1],reverse=True)   #ordering the dict  
    sorted_words = [k for k,_ in sorted_words_list]                                 #making a list from words in the dict
    sorted_words_list = None                                                        #freeing memory
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 saves room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels (0 and 1)
        
    return word_dict
###################################################################################################################################

# Use word_dict to convert our reviews to their integer sequence representation,
# making sure to pad or truncate to a fixed length, which in our case is 500.

def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

###################################################################################################################################

# Train the PyTorch model
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch
            
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            
            # Standard training step: reset gradients, forward pass, loss, backward pass, update
            optimizer.zero_grad()
            out = model(batch_X)
            loss = loss_fn(out, batch_y)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))
###################################################################################################################################

#estimator for pytorch
sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()
estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 20,        # tuned between runs; see Results below
                        'hidden_dim': 1,     # tuned between runs; see Results below
                    })
###################################################################################################################################

# We split the data into chunks and send each chunk separately, accumulating the results.
def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions
###################################################################################################################################

#The web app sends the raw review text, so we need a predictor that submits its input as plain text.
#This class is passed to the PyTorchModel object below through its predictor_cls argument.
class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')
###################################################################################################################################

#We should test to see if everything is working. Here we test our model by loading the first
#250 positive and the first 250 negative reviews, sending them to the endpoint, and collecting the results.
def test_reviews(data_dir='../data/aclImdb', stop=250):
    
    results = []
    ground = []
    
    # We make sure to test both positive and negative reviews    
    for sentiment in ['pos', 'neg']:
        
        path = os.path.join(data_dir, 'test', sentiment, '*.txt')
        files = glob.glob(path)
        
        files_read = 0
        
        print('Starting ', sentiment, ' files')
        # Iterate through the files and send them to the predictor
        for f in files:
            with open(f) as review:
                # First, we store the ground truth (was the review positive or negative)
                if sentiment == 'pos':
                    ground.append(1)
                else:
                    ground.append(0)
                # Read in the review and convert to 'utf-8' for transmission via HTTP
                review_input = review.read().encode('utf-8')
                # Send the review to the predictor and store the results
                results.append(float(predictor.predict(review_input)))
                
            # Sending reviews to our endpoint one at a time takes a while so we
            # only send a small number of reviews
            files_read += 1
            if files_read == stop:
                break
    return ground, results

## Step 1: Downloading the data

In [60]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2020-07-28 00:01:41--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-07-28 00:01:45 (20.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Step 2: Preparing and Processing the data

(1) The first few steps are the same as in the XGBoost example. To begin with, we will read in each of the reviews and combine them into a single input structure, keeping the dataset's existing split into a training set and a testing set.

(2) Now that we've read the raw training and testing data from the downloaded dataset, we will combine the positive and negative reviews and shuffle the resulting records.

In [61]:
#  (1)
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))
#  (2)
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

# After combining and shuffling, preprocess all data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg
IMDb reviews (combined): train = 25000, test = 25000
Read preprocessed data from cache file: preprocessed_data.pkl


The first step in processing the reviews is to remove any HTML tags that appear. In addition we wish to stem our input, so that words such as *entertained* and *entertaining* are considered the same with regard to sentiment analysis.

The `review_to_words` function defined above uses `BeautifulSoup` to remove any HTML tags that appear and uses the `nltk` package to remove stopwords and stem the remaining words.
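To make the effect of this cleaning concrete, here is a minimal standard-library sketch. It is not the notebook's implementation: the regular expression for tags, the tiny `STOPWORDS` set and the `crude_stem` helper are all simplified, hypothetical stand-ins for `BeautifulSoup`, the `nltk` stopword list and the Porter stemmer.

```python
import re

# Tiny illustrative stopword set; the notebook uses nltk's full English list.
STOPWORDS = {"the", "a", "an", "and", "was", "i", "by", "it", "is"}


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(review):
    text = re.sub(r"<[^>]+>", " ", review)           # drop HTML tags (simplified)
    text = re.sub(r"[^a-z0-9]", " ", text.lower())   # lower-case, keep alphanumerics only
    words = [w for w in text.split() if w not in STOPWORDS]
    return [crude_stem(w) for w in words]


cleaned = normalize("I was <b>entertained</b> by the film.<br />It is entertaining!")
print(cleaned)  # ['entertain', 'film', 'entertain']
```

Both *entertained* and *entertaining* collapse to the same token, which is the property the real stemmer provides in a far more careful way.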

The `preprocess_data` function defined above applies `review_to_words` to each of the reviews in the training and testing datasets. In addition it caches the results, because this processing step can take a long time. This way, if you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.
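The caching idea can be sketched in isolation as follows. This is an illustration under assumptions, not the notebook's code: `load_or_compute` and the temporary file name are hypothetical, and the lambdas stand in for the expensive preprocessing.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached result if it can be read; otherwise compute and cache it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True        # second value: result came from the cache
    except (OSError, pickle.UnpicklingError, EOFError):
        result = compute()
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result, False


cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
first, first_cached = load_or_compute(cache_path, lambda: [["great", "film"]])
second, second_cached = load_or_compute(cache_path, lambda: [["never", "called"]])
print(first_cached, second_cached, second)  # False True [['great', 'film']]
```

On the second call the `compute` callback is never invoked, which is what saves the preprocessing time when the notebook is rerun.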

## Transform the data

In [62]:
word_dict = build_dict(train_X)

### Save `word_dict` and Transform the reviews

Later on when we construct an endpoint which processes a submitted review we will need to make use of the `word_dict` which we have created. As such, we will save it to a file now for future use.
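As a small worked example of what `word_dict` contains, the sketch below builds a vocabulary from three toy reviews. The helper name `build_vocab` and the toy data are hypothetical; the logic mirrors `build_dict` above but uses `collections.Counter` for brevity.

```python
from collections import Counter


def build_vocab(tokenized_reviews, vocab_size):
    """Map the most frequent words to integers 2, 3, ...; 0 and 1 stay reserved."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}


reviews = [["good", "film"], ["good", "act"], ["bad", "film", "good"]]
vocab = build_vocab(reviews, vocab_size=4)
print(vocab)  # {'good': 2, 'film': 3}
```

With `vocab_size=4` only two real words fit, because indices `0` (no word) and `1` (infrequent word) are reserved.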

In [63]:
data_dir = '../data/pytorch' # The folder we will use for storing data
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)
#    
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

#transform    
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)
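The padding and truncation performed by `convert_and_pad` can be checked on a toy input. The sketch below is a compact re-statement for illustration only; `encode_and_pad` and the two-word vocabulary are hypothetical, and the pad length is 5 instead of 500 to keep the output readable.

```python
def encode_and_pad(vocab, words, pad):
    """Encode words as integers, then pad with 0 or truncate to exactly `pad` entries."""
    NOWORD, INFREQ = 0, 1
    encoded = [vocab.get(w, INFREQ) for w in words[:pad]]
    return encoded + [NOWORD] * (pad - len(encoded)), min(len(words), pad)


vocab = {"good": 2, "film": 3}  # hypothetical toy vocabulary

short_row, short_len = encode_and_pad(vocab, ["good", "odd", "film"], pad=5)
long_row, long_len = encode_and_pad(vocab, ["good"] * 7, pad=5)
print(short_row, short_len)  # [2, 1, 3, 0, 0] 3
print(long_row, long_len)    # [2, 2, 2, 2, 2] 5
```

The returned length records how many positions hold real words, which is why it is stored alongside each padded review.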

## Step 3: Upload the data to S3
### (1) Save the processed training dataset locally

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.
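To illustrate the row layout, the sketch below writes two toy rows in the `label`, `length`, `review[...]` order and reads them back, separating the label from the remaining columns in the same way the sampling cell later in this notebook does. The data and the pad length of 5 are made up for the example.

```python
import csv
import io

PAD = 5  # the notebook uses 500; 5 keeps the example readable

rows = [
    (1, 3, [2, 1, 3, 0, 0]),   # label, true length, padded review
    (0, 5, [4, 4, 2, 1, 1]),
]

# Write the rows without a header or index, as in train.csv
buffer = io.StringIO()
writer = csv.writer(buffer)
for label, length, review in rows:
    writer.writerow([label, length, *review])

# Read them back: column 0 is the label, everything else is model input
buffer.seek(0)
parsed = []
for fields in csv.reader(buffer):
    values = [int(v) for v in fields]
    parsed.append((values[0], values[1:]))

print(parsed)  # [(1, [3, 2, 1, 3, 0, 0]), (0, [5, 4, 4, 2, 1, 1])]
```

Each model input therefore has `PAD + 1` entries: the review length followed by the padded review.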

### (2) Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [64]:
#       (1)
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
####################################################################################################

#       (2)
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

We will start by implementing our own neural network in PyTorch along with a training script. For the purposes of this project we have provided the necessary model object in the `model.py` file, inside of the `train` folder.

The important takeaway from the implementation provided is that there are three parameters that we may wish to tweak to improve the performance of our model. These are the **embedding dimension, the hidden dimension and the size of the vocabulary**. We will likely want to make these parameters configurable in the training script so that if we wish to modify them we do not need to modify the script itself.
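To get a feel for how these three parameters change the size of the model, the sketch below counts trainable weights. It assumes an architecture of an embedding layer, a single-layer LSTM and one linear output unit, using PyTorch's parameter layout; the provided `model.py` may differ, so treat the numbers as an estimate under that assumption.

```python
def lstm_classifier_param_count(embedding_dim, hidden_dim, vocab_size):
    """Count trainable weights in an Embedding -> single-layer LSTM -> Linear(1) stack."""
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 gates; PyTorch stores two bias vectors per gate.
    lstm = 4 * (hidden_dim * embedding_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    dense = hidden_dim + 1  # weights plus one bias for the single output unit
    return embedding + lstm + dense


for hidden_dim in (1, 50, 100, 350):
    print(hidden_dim, lstm_classifier_param_count(32, hidden_dim, 5000))
```

Under this assumption the embedding table dominates for small hidden dimensions, while the LSTM term grows quadratically with `hidden_dim`.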

We will load a small portion of the training data set to use as a sample. It would be very time consuming to try and train the model completely. We will work on a small bit of the data to get a feel for how our training script is behaving.

### (2) Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside of the `train` directory is a file called `train.py` which has been provided and which contains most of the necessary code to train our model.

The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script. To see how this is done take a look at the provided `train/train.py` file.
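A minimal sketch of that argument parsing is shown below. It is an assumption about the shape of such a script, not a copy of the provided `train/train.py`: the function name and the defaults for `epochs` and `hidden_dim` are illustrative, while the defaults of 32 and 5000 match the values reported in the training logs.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters that arrive as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--vocab_size", type=int, default=5000)
    return parser.parse_args(argv)


# The estimator's hyperparameters dict {'epochs': 20, 'hidden_dim': 1}
# would reach the script as arguments like these:
args = parse_hyperparameters(["--epochs", "20", "--hidden_dim", "1"])
print(args.epochs, args.embedding_dim, args.hidden_dim, args.vocab_size)  # 20 32 1 5000
```

Anything not listed in the `hyperparameters` dictionary falls back to the script's default, so only the values being tuned need to be passed from the notebook.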

In [65]:
data_dir = '../data/pytorch' # The folder we will use for storing data
# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

#(2) - training
estimator.fit({'training': input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-07-28 00:02:01 Starting - Starting the training job...
2020-07-28 00:02:04 Starting - Launching requested ML instances......
2020-07-28 00:03:19 Starting - Preparing the instances for training......
2020-07-28 00:04:31 Downloading - Downloading input data...
2020-07-28 00:04:57 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-07-28 00:05:38,947 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-07-28 00:05:38,972 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-07-28 00:05:41,990 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-07-28 00:05:42,272 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-07-28 00:05:42,272 sagemaker-containers INFO 

Model loaded with embedding_dim 32, hidden_dim 1, vocab_size 5000.
Epoch: 1, BCELoss: 0.7925782921362896
Epoch: 2, BCELoss: 0.7732003866409769
Epoch: 3, BCELoss: 0.7566163393915916
Epoch: 4, BCELoss: 0.7384959921544912
Epoch: 5, BCELoss: 0.7184525314642458
Epoch: 6, BCELoss: 0.6988072942714302
Epoch: 7, BCELoss: 0.6811578808998575
Epoch: 8, BCELoss: 0.6639087686733324
Epoch: 9, BCELoss: 0.6504831800655443
Epoch: 10, BCELoss: 0.6380272240054851
Epoch: 11, BCELoss: 0.6257720054412375
Epoch: 12, BCELoss: 0.6141146762030465
Epoch: 13, BCELoss: 0.6017906872593627
Epoch: 14, BCELoss: 0.59041572103695
Epoch: 15, BCELoss: 0.5794022642836278
Epoch: 16, BCELoss: 0.5686059703632277
Epoch: 17, BCELoss: 0.5582511279047752
Epoch: 18, BCELoss: 0.548222757115656
Epoch: 19, BCELoss: 0.5380562860138562
Epoch: 20, BCELoss:

### Results:

| Epochs | Hidden dim | Final BCELoss | Time (min) | Billable seconds | Accuracy |
|---|---|---|---|---|---|
| 12 | 250 | 0.25462979656092977 | 9 | 375 | 0.816 |
| 15 | 250 | 0.2786020067881565 | 10 | 412 | 0.814 |
| 20 | 250 | 0.22177858012063162 | 11 | 510 | 0.77 |
| 20 | 350 | 0.1908689147355605 | 15 | 758 | 0.818 |
| 10 | 350 | 0.29813842597056406 | 10 | 421 | 0.818 |
| 15 | 350 | 0.2624467368028602 | 12 | 609 | 0.814 |
| 20 | 150 | 0.20069550525168983 | 9 | 377 | 0.822 |
| 15 | 150 | 0.2218188400171241 | 8 | 304 | 0.828 |
| 10 | 150 | 0.30455971676476146 | 7 | 239 | 0.822 |
| 20 | 100 | 0.1711716512028052 | 8 | 286 | 0.83 |
| 50 | 100 | 0.07763163669376957 | 12 | 537 | 0.824 |
| 20 | 50 | 0.1957816756805595 | 7 | 223 | 0.842 |
| 100 | 50 | 0.024636344309440072 | 13 | 580 | 0.794 |
| 20 | 1 | 0.5270818630043341 | 5 | 175 | 0.656 |
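
One way to read these results is to compare the run with the lowest training loss against the run with the highest accuracy. The sketch below does that with the figures copied from the results above; the tuple layout and variable names are just for this example.

```python
# (epochs, hidden_dim, final training BCELoss, accuracy), copied from the results above
runs = [
    (12, 250, 0.25462979656092977, 0.816),
    (15, 250, 0.2786020067881565, 0.814),
    (20, 250, 0.22177858012063162, 0.77),
    (20, 350, 0.1908689147355605, 0.818),
    (10, 350, 0.29813842597056406, 0.818),
    (15, 350, 0.2624467368028602, 0.814),
    (20, 150, 0.20069550525168983, 0.822),
    (15, 150, 0.2218188400171241, 0.828),
    (10, 150, 0.30455971676476146, 0.822),
    (20, 100, 0.1711716512028052, 0.83),
    (50, 100, 0.07763163669376957, 0.824),
    (20, 50, 0.1957816756805595, 0.842),
    (100, 50, 0.024636344309440072, 0.794),
    (20, 1, 0.5270818630043341, 0.656),
]

best_accuracy = max(runs, key=lambda run: run[3])
lowest_loss = min(runs, key=lambda run: run[2])
print("highest accuracy:", best_accuracy[:2], best_accuracy[3])
print("lowest training loss:", lowest_loss[:2], lowest_loss[3])
```

The lowest training loss (100 epochs, hidden dimension 50) comes with an accuracy of 0.794, below the 0.842 reached by the 20-epoch run with the same hidden dimension, which suggests that the longer run overfits the training data.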

Model loaded with embedding_dim 32, hidden_dim 150, vocab_size 5000.
Epoch: 1, BCELoss: 0.6784094061170306
Epoch: 2, BCELoss: 0.611366369286362
Epoch: 3, BCELoss: 0.48945855060402227
Epoch: 4, BCELoss: 0.43736129512592237
Epoch: 5, BCELoss: 0.3816710163135918
Epoch: 6, BCELoss: 0.34236725921533545
Epoch: 7, BCELoss: 0.31661797330087543
Epoch: 8, BCELoss: 0.298577758426569
Epoch: 9, BCELoss: 0.3138014509969828
Epoch: 10, BCELoss: 0.30455971676476146
Epoch: 11, BCELoss: 0.2762323821685752
Epoch: 12, BCELoss: 0.259689920410818
Epoch: 13, BCELoss: 0.23617319945169954
Epoch: 14, BCELoss: 0.2268513048789939
Epoch: 15, BCELoss: 0.2218188400171241
Epoch: 16, BCELoss: 0.20674322849633742
Epoch: 17, BCELoss: 0.21974128606368085
Epoch: 18, BCELoss: 0.21458637623154386
Epoch: 19, BCELoss: 0.22438540324872855
Epoch: 20, BCELoss: 0.20069550525168983

Model loaded with embedding_dim 32, hidden_dim 100, vocab_size 5000.
Epoch: 1, BCELoss: 0.6798512132800355
Epoch: 2, BCELoss: 0.6290153447462588
Epoch: 3, BCELoss: 0.5447559137733615
Epoch: 4, BCELoss: 0.45331359882743993
Epoch: 5, BCELoss: 0.3895978818134386
Epoch: 6, BCELoss: 0.34886146139125435
Epoch: 7, BCELoss: 0.3347627928062361
Epoch: 8, BCELoss: 0.3107057162085358
Epoch: 9, BCELoss: 0.27966263829445354
Epoch: 10, BCELoss: 0.26513474236945717
Epoch: 11, BCELoss: 0.26933389750062203
Epoch: 12, BCELoss: 0.2668301700329294
Epoch: 13, BCELoss: 0.24348120695474196
Epoch: 14, BCELoss: 0.23429017681248335
Epoch: 15, BCELoss: 0.219188604731949
Epoch: 16, BCELoss: 0.20371175481348622
Epoch: 17, BCELoss: 0.19406543672084808
Epoch: 18, BCELoss: 0.1925640057544319
Epoch: 19, BCELoss: 0.17953106091946971
Epoch: 20, BCELoss: 0.1711716512028052
Epoch: 21, BCELoss: 0.17182830781960973
Epoch: 22, BCELoss: 0.17596149459785346
Epoch: 23, BCELoss: 0.17902935281091806
Epoch: 24, BCELoss: 0.18579979256099585
Epoch: 25, BCELoss: 0.16099695450797372
Epoch: 26, BCELoss: 0.22459205665758677
Epoch: 27, BCELoss: 0.2002425674273043
Epoch: 28, BCELoss: 0.15815028061672132
Epoch: 29, BCELoss: 0.14172170189570407
Epoch: 30, BCELoss: 0.1298800323690687
Epoch: 31, BCELoss: 0.11694561249139357
Epoch: 32, BCELoss: 0.1122226594966285
Epoch: 33, BCELoss: 0.10961317993244346
Epoch: 34, BCELoss: 0.11156366210506886
Epoch: 35, BCELoss: 0.11508313726101603
Epoch: 36, BCELoss: 0.10670644812742058
Epoch: 37, BCELoss: 0.10266336068815114
Epoch: 38, BCELoss: 0.14788789400944904
Epoch: 39, BCELoss: 0.09688885684828369
Epoch: 40, BCELoss: 0.10119003118300925
Epoch: 41, BCELoss: 0.08249394876920448
Epoch: 42, BCELoss: 0.07415243139376446
Epoch: 43, BCELoss: 0.06686475492861806
Epoch: 44, BCELoss: 0.06825523352136417
Epoch: 45, BCELoss: 0.08029248717488074
Epoch: 46, BCELoss: 0.06760755926370621
Epoch: 47, BCELoss: 0.061323174978701434
Epoch: 48, BCELoss: 0.11737235840790126
Epoch: 49, BCELoss: 0.06857068897510062
Epoch: 50, BCELoss: 0.07763163669376957

Model loaded with embedding_dim 32, hidden_dim 50, vocab_size 5000.
Epoch: 1, BCELoss: 0.6850115328418965
Epoch: 2, BCELoss: 0.6306146273807604
Epoch: 3, BCELoss: 0.5421375650532392
Epoch: 4, BCELoss: 0.4553138455566095
Epoch: 5, BCELoss: 0.3968389976997765
Epoch: 6, BCELoss: 0.3702799270347673
Epoch: 7, BCELoss: 0.3315269174624462
Epoch: 8, BCELoss: 0.3071679564154878
Epoch: 9, BCELoss: 0.2951918721807246
Epoch: 10, BCELoss: 0.2832703988771049
Epoch: 11, BCELoss: 0.2584142791373389
Epoch: 12, BCELoss: 0.24631759889271795
Epoch: 13, BCELoss: 0.23265871405601501
Epoch: 14, BCELoss: 0.22190727536775628
Epoch: 15, BCELoss: 0.21546377272022013
Epoch: 16, BCELoss: 0.21191815514953768
Epoch: 17, BCELoss: 0.2142187031556149
Epoch: 18, BCELoss: 0.21939327765484246
Epoch: 19, BCELoss: 0.21449701001449506
Epoch: 20, BCELoss: 0.1957816756805595
Epoch: 21, BCELoss: 0.18053946263936102
Epoch: 22, BCELoss: 0.17261837286000348
Epoch: 23, BCELoss: 0.16618844243336697
Epoch: 24, BCELoss: 0.1546139659322038
Epoch: 25, BCELoss: 0.14969667199314857
Epoch: 26, BCELoss: 0.14619561011085705
Epoch: 27, BCELoss: 0.14425842737664982
Epoch: 28, BCELoss: 0.15357165129817263
Epoch: 29, BCELoss: 0.15664177661647602
Epoch: 30, BCELoss: 0.13171185020889556
Epoch: 31, BCELoss: 0.12470327637025289
Epoch: 32, BCELoss: 0.12145334056445531
Epoch: 33, BCELoss: 0.12245985531077093
Epoch: 34, BCELoss: 0.12619637485061372
Epoch: 35, BCELoss: 0.12095314325118552
Epoch: 36, BCELoss: 0.13082618296754603
Epoch: 37, BCELoss: 0.1263728899006941
Epoch: 38, BCELoss: 0.12489396941905119
Epoch: 39, BCELoss: 0.13398789143075748
Epoch: 40, BCELoss: 0.10819018502928773
Epoch: 41, BCELoss: 0.09017068869909461
Epoch: 42, BCELoss: 0.08388122019110894
Epoch: 43, BCELoss: 0.08159079144195634
Epoch: 44, BCELoss: 0.09056520842167796
Epoch: 45, BCELoss: 0.08047783260746878
Epoch: 46, BCELoss: 0.0726129487156868
Epoch: 47, BCELoss: 0.06892838519142598
Epoch: 48, BCELoss: 0.07058166385609277
Epoch: 49, BCELoss: 0.06437636284651804
Epoch: 50, BCELoss: 0.06315174691227017
Epoch: 51, BCELoss: 0.06272421165236405
Epoch: 52, BCELoss: 0.060773990212046373
Epoch: 53, BCELoss: 0.07006886274534829
Epoch: 54, BCELoss: 0.06104079025740526
Epoch: 55, BCELoss: 0.060380465864222875
Epoch: 56, BCELoss: 0.05693317729295516
Epoch: 57, BCELoss: 0.07303190565839106
Epoch: 58, BCELoss: 0.07425981284860446
Epoch: 59, BCELoss: 0.06147516534036519
Epoch: 60, BCELoss: 0.05536116112251671
Epoch: 61, BCELoss: 0.05737159572237609
Epoch: 62, BCELoss: 0.13633351875659155
Epoch: 63, BCELoss: 0.31577640498171045
Epoch: 64, BCELoss: 0.23022901236402746
Epoch: 65, BCELoss: 0.12245849884894429
Epoch: 66, BCELoss: 0.07548797541126913
Epoch: 67, BCELoss: 0.0594872849024072
Epoch: 68, BCELoss: 0.053032396665337135
Epoch: 69, BCELoss: 0.04777058141724187
Epoch: 70, BCELoss: 0.04333489825378875
Epoch: 71, BCELoss: 0.04051551477489423
Epoch: 72, BCELoss: 0.0374320314018702
Epoch: 73, BCELoss: 0.035625269902603965
Epoch: 74, BCELoss: 0.036811171422655484
Epoch: 75, BCELoss: 0.03277571601983236
Epoch: 76, BCELoss: 0.0310821784850286
Epoch: 77, BCELoss: 0.030101198512984782
Epoch: 78, BCELoss: 0.028505157983424713
Epoch: 79, BCELoss: 0.026686849642773063
Epoch: 80, BCELoss: 0.02529307426314573
Epoch: 81, BCELoss: 0.024149266698834847
Epoch: 82, BCELoss: 0.02315430033343787
Epoch: 83, BCELoss: 0.02222869993776691
Epoch: 84, BCELoss: 0.021444861938682745
Epoch: 85, BCELoss: 0.020644080169422895
Epoch: 86, BCELoss: 0.019857633687859894
Epoch: 87, BCELoss: 0.01909995502887332
Epoch: 88, BCELoss: 0.018358748064053302
Epoch: 89, BCELoss: 0.017625053983409793
Epoch: 90, BCELoss: 0.016913833970926245
Epoch: 91, BCELoss: 0.01622541602320817
Epoch: 92, BCELoss: 0.015555351725494375
Epoch: 93, BCELoss: 0.015426691992170349
Epoch: 94, BCELoss: 0.014779387255750445
Epoch: 95, BCELoss: 0.014147768452839583
Epoch: 96, BCELoss: 0.01352788613900086
Epoch: 97, BCELoss: 0.012495114601084165
Epoch: 98, BCELoss: 0.012381355240180785
Epoch: 99, BCELoss: 0.020358785226637005
Epoch: 100, BCELoss: 0.024636344309440072

#####      USED ONLY FOR DEBUGGING WHEN FIRST WRITING

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

## Steps 5 and 6: Deploy the model for testing

As mentioned at the top of this notebook, we will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

Now that we have trained our model, we would like to test it to see how it performs. Currently our model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately for us, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that we need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. 

This function must also be present in the Python file which we specified as the entry point. In our case the model loading function has been provided and so no changes need to be made.

In [None]:
predictor = estimator.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

## Step 7 - Use the model for testing

Once deployed, we can read in the test data and send it off to our deployed model to get some results. Once we collect all of the results we can determine how accurate our model is.

In [None]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

predictions = predict(test_X.values)
predictions = [round(num) for num in predictions]

accuracy_score(test_y, predictions)
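
The chunk, predict, threshold and score sequence used above can be tried without a live endpoint. In the sketch below `fake_endpoint` is a hypothetical stand-in for `predictor.predict`, and the rows and labels are made up; only the control flow is meant to carry over.

```python
def predict_in_chunks(rows, send, chunk_size=512):
    """Send `rows` to `send` in chunks of at most `chunk_size`, collecting the results."""
    predictions = []
    for start in range(0, len(rows), chunk_size):
        predictions.extend(send(rows[start:start + chunk_size]))
    return predictions


def fake_endpoint(chunk):
    """Hypothetical stand-in for predictor.predict: one score in [0, 1] per row."""
    return [0.9 if row[0] > 2 else 0.2 for row in chunk]


rows = [[3, 2, 1, 3], [1, 4, 0, 0], [4, 2, 2, 2], [2, 1, 1, 0]]
labels = [1, 0, 1, 1]

scores = predict_in_chunks(rows, fake_endpoint, chunk_size=3)
predicted = [1 if s >= 0.5 else 0 for s in scores]   # explicit 0.5 threshold
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(predicted, accuracy)  # [1, 0, 1, 0] 0.75
```

The sketch uses an explicit threshold rather than `round`, because Python's `round` sends exact halves to the nearest even integer, so `round(0.5)` is `0`. For sigmoid outputs this rarely matters, but the threshold makes the tie-breaking rule visible.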

In [None]:
estimator.delete_endpoint()

## Step 6 (again, after validating): Deploy the model for the web app

Now that we know that our model is working, it's time to create some custom inference code so that we can send the model a review which has not been processed and have it determine the sentiment of the review.
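
A rough sketch of what such inference code has to do is given below. The hook names `input_fn` and `output_fn` follow the SageMaker PyTorch serving convention, but the bodies, the `build_model_input` helper, the simplified tokenisation and the toy vocabulary are assumptions for illustration and are not taken from the project's actual inference script.

```python
import re


def input_fn(serialized_input_data, content_type):
    """Decode the raw request body into a Python string."""
    if content_type == "text/plain":
        return serialized_input_data.decode("utf-8")
    raise ValueError("Unsupported content type: " + content_type)


def build_model_input(review, word_dict, pad=500):
    """Turn a raw review into the `length, review[pad]` row the model expects."""
    # Simplified tokenisation; the notebook's review_to_words also strips HTML,
    # removes stopwords and stems.
    words = re.sub(r"[^a-z0-9]", " ", review.lower()).split()
    encoded = [word_dict.get(w, 1) for w in words[:pad]]   # 1 = infrequent word
    length = len(encoded)
    encoded += [0] * (pad - length)                        # 0 = no word
    return [length] + encoded


def output_fn(prediction, accept):
    """Serialise the prediction for the HTTP response."""
    return str(prediction)


word_dict = {"good": 2, "film": 3}   # hypothetical toy vocabulary
text = input_fn(b"Good film, really good!", "text/plain")
row = build_model_input(text, word_dict, pad=6)
print(row)                           # [4, 2, 3, 1, 2, 0, 0]
print(output_fn(1, "text/plain"))    # 1
```

In a real script the row would be converted to a tensor and passed through the loaded model between these two steps.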

As we saw above, by default the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we now wish to accept a string as input and our model expects a processed review, we need to write some custom inference code. This code has already been written: it is the `predict.py` script in the `serve` directory.

### (1) Deploying the model

Now that the custom inference code has been written, we will create and deploy our model.

We need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. *Then*, we can call the deploy method to launch the deployment container.

### (2) Testing the model

Now that we have deployed our model with the custom inference code, we should test to see if everything is working. Here we test our model by loading the first `250` positive and the first `250` negative test reviews, sending them to the endpoint, and collecting the results.

In [66]:
model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
#         (1)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
###########################################################################################
#       (2)
ground, results = test_reviews()
accuracy_score(ground, results)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


----------------!Starting  pos  files
Starting  neg  files


0.656

## Step 7 (again, after validating): Use the model for the web app

This step has already been done. The endpoint name printed below is the value to put in the Lambda function.

In [None]:
predictor.endpoint  #to put on the lambda function
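
For reference, a Lambda function that forwards a review to the endpoint could look roughly like the sketch below. It is an assumption-laden illustration: it presumes the event carries the review text under a `"body"` key, `ENDPOINT_NAME` is a placeholder for the value printed above, and the optional `runtime` argument and the `FakeRuntime` class exist only so the sketch can be exercised without AWS access.

```python
import io

ENDPOINT_NAME = "REPLACE-WITH-YOUR-ENDPOINT-NAME"  # placeholder, not a real endpoint


def lambda_handler(event, context, runtime=None):
    """Forward the request body to the SageMaker endpoint and return its reply."""
    if runtime is None:
        import boto3  # imported lazily so the sketch can run without AWS access
        runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/plain",
        Body=event["body"],
    )
    result = response["Body"].read().decode("utf-8")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
        "body": result,
    }


class FakeRuntime:
    """Hypothetical stub that mimics the shape of the invoke_endpoint response."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        return {"Body": io.BytesIO(b"1")}


reply = lambda_handler({"body": "Great film"}, None, runtime=FakeRuntime())
print(reply["statusCode"], reply["body"])  # 200 1
```

When deployed, Lambda calls the handler with only `event` and `context`, so the real `boto3` client is used.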

In [None]:
predictor.delete_endpoint()