#  Sentiment Analysis with TensorFlow

Sentiment analysis is a very common text analytics task that involves determining whether a text sample is positive or negative about its subject.  There are several different algorithms for performing this task, including statistical algorithms and deep learning algorithms.  With respect to deep learning, a Convolutional Neural Net (CNN) is sometimes used for this purpose.  In this notebook we'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment. Three aspects of Amazon SageMaker will be demonstrated:

- How to use Script Mode with a prebuilt TensorFlow container, along with a training script similar to one you would use outside SageMaker. 
- Local Mode training, which allows you to test your code on your notebook instance before creating a full scale training job.
- Batch Transform for offline, asynchronous predictions on large batches of data. 
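
The notebook's own `train.py` is not reproduced here, so the following is only a minimal sketch of how a Script Mode entry point commonly reads its inputs. It assumes the hyperparameters arrive as command-line flags and the input channels are exposed through `SM_CHANNEL_<NAME>` environment variables. The function name `parse_args` and the local fallback paths are illustrative assumptions, not taken from the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments a Script Mode training script typically receives."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line flags, e.g. --epochs 1 --batch_size 128.
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch_size', type=int, default=128)

    # The estimator's model_dir setting is forwarded as --model_dir.
    parser.add_argument('--model_dir', type=str, default='/opt/ml/model')

    # Each input channel is exposed through an SM_CHANNEL_<NAME> environment
    # variable; the './data/...' fallbacks are assumptions for running locally.
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', './data/train'))
    parser.add_argument('--test', type=str,
                        default=os.environ.get('SM_CHANNEL_TEST', './data/test'))

    # parse_known_args tolerates any extra flags the caller may pass.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--epochs', '3', '--batch_size', '64'])
print(args.epochs, args.batch_size, args.model_dir)
```

Because the script only depends on flags and environment variables, the same file can be run outside SageMaker by passing the flags by hand.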

#  Prepare Dataset

We'll begin by loading the reviews dataset and padding (or truncating) each review to 400 tokens so that all reviews have the same length.  Each review is represented as an array of numbers, where each number represents an indexed word.  Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files.

In [1]:
import os
import boto3
import sagemaker
import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb  # public Keras path rather than the private tensorflow.python module

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role() # we are using the notebook instance role for training in this example




In [2]:
max_features = 20000
maxlen = 400

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

25000 train sequences
25000 test sequences
x_train shape: (25000, 400)
x_test shape: (25000, 400)
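
To make the padding step concrete, here is a small pure-Python sketch that mimics what `pad_sequences` does with its default settings: zeros are added at the front of short sequences, and long sequences keep only their last `maxlen` items. The helper `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items.

    Assumes maxlen is a positive integer.
    """
    seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([5, 8, 13], maxlen=5))             # [0, 0, 5, 8, 13]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

This is why every row of `x_train` and `x_test` ends up with exactly 400 entries, and why short reviews begin with a run of zeros.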


In [3]:
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(test_dir, exist_ok=True)

csv_test_dir = os.path.join(os.getcwd(), 'data/csv-test')
os.makedirs(csv_test_dir, exist_ok=True)

In [4]:
import pandas as pd 
pd.DataFrame(x_train).to_csv(os.path.join(train_dir, 'x_train.csv'), header=None, index=False)
pd.DataFrame(y_train).to_csv(os.path.join(train_dir, 'y_train.csv'), header=None, index=False)
pd.DataFrame(x_test).to_csv(os.path.join(test_dir, 'x_test.csv'), header=None, index=False)
pd.DataFrame(y_test).to_csv(os.path.join(test_dir, 'y_test.csv'), header=None, index=False)
np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=",")

# Local Mode Training

Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full scale, hosted training. With Local Mode, you can run quick tests with just a sample of training data, and/or a small number of epochs (passes over the full training set), while avoiding the time and expense of attempting full scale hosted training using possibly buggy code.  
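
One way to run a quick Local Mode test on just a sample of the data is to draw a random subset of rows while keeping features and labels aligned. The sketch below uses plain Python lists and a hypothetical helper named `subsample`; the toy arrays stand in for `x_train` and `y_train` and are not IMDB data.

```python
import random


def subsample(features, labels, n, seed=0):
    """Return n randomly chosen (feature, label) pairs, keeping rows aligned."""
    if len(features) != len(labels):
        raise ValueError('features and labels must have the same length')
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    idx = sorted(rng.sample(range(len(features)), min(n, len(features))))
    return [features[i] for i in idx], [labels[i] for i in idx]


# Toy stand-ins for x_train / y_train (hypothetical values).
toy_x = [[i, i + 1] for i in range(10)]
toy_y = [i % 2 for i in range(10)]

small_x, small_y = subsample(toy_x, toy_y, n=4)
print(len(small_x), len(small_y))  # 4 4
```

The sampled rows could then be written to a separate directory and passed to the Local Mode estimator in place of the full training set.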

### Setup for Local Mode

To train in Local Mode, it is necessary to have docker-compose or nvidia-docker-compose (for GPU) installed in the notebook instance. Running the following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you.

In [5]:
!/bin/bash ./local_mode/local_mode_setup.sh

nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


### Setup for Estimator

The next step is to set up a TensorFlow Estimator for Local Mode training. A key parameter for the Estimator is the `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU, or to `local` if the instance has only a CPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode.

In [6]:
import sagemaker
from sagemaker.tensorflow import TensorFlow


model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 1, 'batch_size': 128}
local_estimator = TensorFlow(entry_point='train.py',
                       source_dir='./training_scripts/',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.13',
                       py_version='py3')
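
If the same notebook is run on both CPU and GPU instances, the choice between `local` and `local_gpu` can be made programmatically. The sketch below is a heuristic under a stated assumption: it treats the presence of the `nvidia-smi` executable on `PATH` as a sign that a GPU is available. The helper name `pick_local_instance_type` is hypothetical.

```python
import shutil


def pick_local_instance_type(which=shutil.which):
    """Return 'local_gpu' if nvidia-smi is found on PATH, else 'local'.

    This is only a heuristic: it checks for the driver tool, not for a
    working GPU.
    """
    return 'local_gpu' if which('nvidia-smi') else 'local'


print(pick_local_instance_type())
```

The returned string could be assigned to `train_instance_type` before constructing the estimator.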

Now we'll briefly train the model in Local Mode.  Since this is just to make sure the code is working, we'll train for only one epoch.  (Note that on a CPU-based notebook instance, this one epoch will take at least 3 or 4 minutes.)  As you'll see from the logs below the cell when training is complete, even when trained for only one epoch, the accuracy of the model on training data is already at almost 80%.  

In [7]:
inputs = {'train': f'file://{train_dir}',
          'test': f'file://{test_dir}'}

local_estimator.fit(inputs)

Creating tmpiqrefcsd_algo-1-ri42s_1 ... done
Attaching to tmpiqrefcsd_algo-1-ri42s_1
algo-1-ri42s_1  | 2020-04-17 11:37:13,458 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
algo-1-ri42s_1  | 2020-04-17 11:37:13,470 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-ri42s_1  | 2020-04-17 11:37:13,646 sagemaker-containers INFO     Installing module with the following command:
algo-1-ri42s_1  | /usr/local/bin/python3.6 -m pip install -U . -r requirements.txt
algo-1-ri42s_1  | Processing /opt/ml/code
algo-1-ri42s_1  | Collecting scikit-learn (from -r requirements.txt (line 1))
algo-1-ri42s_1  |   Downloading https://files.pythonhosted.org/packages/5e/d8/312e03adf4c78663e17d802fe2440072376fee46cada1404f1727ed77a32/scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl (7.1MB)
     |████████████████████████████████| 7.1MB 13.4MB/s e

algo-1-ri42s_1  | Using TensorFlow backend.
algo-1-ri42s_1  | x train (25000, 400) y train (25000, 1)
algo-1-ri42s_1  | x test (25000, 400) y test (25000, 1)
algo-1-ri42s_1  | Train on 25000 samples, validate on 25000 samples
algo-1-ri42s_1  |  - 186s - loss: 0.4303 - acc: 0.7871 - val_loss: 0.3001 - val_acc: 0.8696
algo-1-ri42s_1  | 2020-04-17 11:40:35,836 sagemaker-containers INFO     Reporting training SUCCESS
tmpiqrefcsd_algo-1-ri42s_1 exited with code 0
Aborting on container exit...
===== Job Complete =====


#  Hosted Training

After we've confirmed our code seems to be working using Local Mode training, we can move on to use SageMaker's hosted training, which uses compute resources separate from your notebook instance.  Hosted training spins up a cluster of one or more instances for training, and then tears the cluster down when training is complete. In general, hosted training is preferred for doing actual training, especially for large-scale, distributed training. Before starting hosted training, the data must be present in storage that can be accessed by SageMaker. The storage options are:  Amazon S3 (object storage service), Amazon EFS (elastic NFS file system service), and Amazon FSx for Lustre (high-performance file system service). For this example, we'll upload the data to S3.  

In [9]:
bucket = 'mlops-bucket-366243680492'

traindata_s3_prefix = 'data/train'
testdata_s3_prefix = 'data/test'

train_s3 = sagemaker_session.upload_data(path='./data/train/', bucket=bucket, key_prefix=traindata_s3_prefix)
test_s3 = sagemaker_session.upload_data(path='./data/test/', bucket=bucket, key_prefix=testdata_s3_prefix)

inputs = {'train':train_s3, 'test': test_s3}
print(inputs)

{'train': 's3://mlops-bucket-366243680492/data/train', 'test': 's3://mlops-bucket-366243680492/data/test'}


With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to an ML instance type instead of one of the local types used for Local Mode. Additionally, we've set the number of epochs to a number greater than one for actual training, as opposed to just testing the code.

In [10]:
from sagemaker.tensorflow import TensorFlow

train_instance_type = 'ml.p3.2xlarge'
hyperparameters = {'epochs': 10, 'batch_size': 128}
model_dir = '/opt/ml/model'

estimator = TensorFlow(entry_point='train.py',
                       source_dir='./training_scripts/',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       base_job_name='tf-keras-sentiment',
                       framework_version='1.13',
                       py_version='py3')

With the change in training instance type and increase in epochs, we simply call `fit` to start the actual hosted training.  At the end of hosted training, you'll see from the logs below the cell that accuracy on the training set has greatly increased, and accuracy on the validation set is around 90%.  The model may be overfitting now (less able to generalize to data it has not yet seen), even though we are employing dropout as a regularization technique.  In a production situation, further investigation would be necessary.

In [11]:
estimator.fit(inputs)

2020-04-17 11:44:53 Starting - Starting the training job...
2020-04-17 11:45:13 Starting - Launching requested ML instances......
2020-04-17 11:46:10 Starting - Preparing the instances for training......
2020-04-17 11:47:15 Downloading - Downloading input data...
2020-04-17 11:47:46 Training - Downloading the training image...
2020-04-17 11:48:06 Training - Training image download completed. Training in progress.
2020-04-17 11:48:10,133 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-04-17 11:48:10,547 sagemaker-containers INFO     Installing module with the following command:
/usr/local/bin/python3.6 -m pip install -U . -r requirements.txt
Processing /opt/ml/code
Collecting scikit-learn (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/5e/d8/312e03adf4c78663e17d802fe2440072376fee46cada1404f1727ed77a32/scikit_learn-0.22.2.post1-cp36-cp36m-manylinux1_x86_64.whl (

 - 4s - loss: 2.0661e-04 - acc: 1.0000 - val_loss: 0.4049 - val_acc: 0.9029
Epoch 10/10
 - 4s - loss: 1.7933e-04 - acc: 1.0000 - val_loss: 0.4148 - val_acc: 0.9029
2020-04-17 11:49:06,750 sagemaker-containers INFO     Reporting training SUCCESS

2020-04-17 11:49:14 Uploading - Uploading generated training model
2020-04-17 11:49:14 Completed - Training job completed
Training seconds: 119
Billable seconds: 119
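
The overfitting concern can be checked directly from the log. The sketch below parses the two epoch summary lines that are visible in the hosted training output and prints the gap between training and validation accuracy. The regular expression and the helper `parse_epoch_line` are illustrative assumptions about the log format, not part of SageMaker or Keras.

```python
import re

# Matches "name: number" pairs such as "val_loss: 0.4049" or "loss: 2.0661e-04".
METRIC = re.compile(r'(\w+): ([0-9.eE+-]+)')


def parse_epoch_line(line):
    """Extract metric name/value pairs from one Keras epoch summary line."""
    return {name: float(value) for name, value in METRIC.findall(line)}


# The two epoch summaries shown in the log (ANSI color codes removed).
lines = [
    ' - 4s - loss: 2.0661e-04 - acc: 1.0000 - val_loss: 0.4049 - val_acc: 0.9029',
    ' - 4s - loss: 1.7933e-04 - acc: 1.0000 - val_loss: 0.4148 - val_acc: 0.9029',
]

for metrics in map(parse_epoch_line, lines):
    gap = metrics['acc'] - metrics['val_acc']
    print('train/val accuracy gap: {:.4f}, val_loss: {:.4f}'.format(
        gap, metrics['val_loss']))
```

Training accuracy has reached 1.0 while validation loss rises from 0.4049 to 0.4148 between these two epochs, which is the pattern usually associated with overfitting.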


## Hosted Endpoint

In [12]:
sentiment_predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

-----------!

In [20]:
import re

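# Padding zeros decode to '?', so strip the leading run of '?' and whitespace.
regex = re.compile(r'^[\?\s]+')

# Map each word's integer rank back to the word itself.
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])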

In [21]:
review_index = 10
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_test[review_index]])
print(regex.sub('', decoded_review))

inspired by hitchcock's strangers on a train concept of two men swapping murders in exchange for getting rid of the two people messing up their lives throw ? from the train is an original and very inventive comedy take on the idea it's a credit to danny devito that he both wrote and starred in this minor comedy gem br br anne ramsey is the mother who inspires the film's title and it's understandable why she gets under the skin of danny devito with her sharp tongue and relentlessly putting him down for any minor ? billy crystal is the writer who's wife has stolen his book idea and is now being ? as a great new author even appearing on the oprah show to in ? he should be enjoying thus devito gets the idea of swapping murders to rid themselves of these nuisance factors br br of course everything and anything can happen when writer carl reiner lets his imagination roam with ? ideas for how the plot develops and it's amusing all the way through providing plenty of laughs and chuckles along 
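
The `i - 3` offset in the decoding step exists because the Keras IMDB encoding reserves the lowest ids: 0 for padding, 1 for the start of a sequence, and 2 for out-of-vocabulary words. The sketch below shows the effect with a toy vocabulary; the words and ranks are hypothetical stand-ins for `imdb.get_word_index()`.

```python
# Toy vocabulary standing in for imdb.get_word_index() (hypothetical ranks).
toy_word_index = {'the': 1, 'movie': 2, 'great': 3}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded):
    """Map encoded ids back to words; reserved ids 0-2 fall through to '?'."""
    return ' '.join(toy_reverse.get(i - 3, '?') for i in encoded)


# 0 = padding, 1 = start of sequence, 2 = out-of-vocabulary;
# a word with rank 1 is therefore encoded as 4.
print(decode([0, 0, 1, 4, 5, 2, 6]))  # ? ? ? the movie ? great
```

This also explains the `?` tokens inside the decoded review: they mark words outside the 20,000-word vocabulary kept by `num_words=max_features`.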

In [22]:
results = sentiment_predictor.predict(x_test[review_index])['predictions'][0][0]

def get_sentiment(score):
    return 'positive' if score > 0.5 else 'negative' 

print('Labeled sentiment for this review is {}, predicted sentiment is {}'.format(get_sentiment(y_test[review_index]), 
                                                                                  get_sentiment(results)))

Labeled sentiment for this review is positive, predicted sentiment is positive


In [None]:
sagemaker_session.delete_endpoint(sentiment_predictor.endpoint)

Training deep learning models is a stochastic process, so your results may vary: there is no guarantee that the predicted result will match the actual label. However, it is likely that the sentiment prediction agrees with the label for this review.

Again, it is likely (but not guaranteed) that the prediction agreed with the label for the test data.  Note that there is no need to clean up any Batch Transform resources:  after the transform job is complete, the cluster used to make inferences is torn down.

Now that we've reviewed some sample predictions as a sanity check, we're finished.  Of course, in a typical production situation, the data science project lifecycle is iterative, with repeated cycles of refining the model using a tool such as Amazon SageMaker's Automatic Model Tuning feature, and gathering more data.  