#  Sentiment Analysis with TensorFlow

Sentiment analysis is a very common text analytics task that involves determining whether a text sample is positive or negative about its subject.  There are several different algorithms for performing this task, including statistical algorithms and deep learning algorithms.  With respect to deep learning, a Convolutional Neural Net (CNN) is sometimes used for this purpose.  In this notebook we'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment.


#  Prepare Dataset

We'll begin by loading the reviews dataset, and padding the reviews so all reviews have the same length.  Each review is represented as an array of numbers, where each number represents an indexed word.  Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files.

In [None]:
import os
import boto3
import sagemaker
import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb  # public Keras API; tensorflow.python.* is internal

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role() # we are using the notebook instance role for training in this example

account_id = !aws sts get-caller-identity --query Account --output text # get your account id number
account_id = account_id[0]

In [None]:
max_features = 20000
maxlen = 400

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

In [None]:
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

test_dir = os.path.join(os.getcwd(), 'data/test')
os.makedirs(test_dir, exist_ok=True)

csv_test_dir = os.path.join(os.getcwd(), 'data/csv-test')
os.makedirs(csv_test_dir, exist_ok=True)

In [None]:
import pandas as pd 
pd.DataFrame(x_train).to_csv(os.path.join(train_dir, 'x_train.csv'), header=None, index=False)
pd.DataFrame(y_train).to_csv(os.path.join(train_dir, 'y_train.csv'), header=None, index=False)
pd.DataFrame(x_test).to_csv(os.path.join(test_dir, 'x_test.csv'), header=None, index=False)
pd.DataFrame(y_test).to_csv(os.path.join(test_dir, 'y_test.csv'), header=None, index=False)
np.savetxt(os.path.join(csv_test_dir, 'csv-test.csv'), np.array(x_test[:100], dtype=np.int32), fmt='%d', delimiter=",")
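
The training script is not shown in this notebook, so the following is only a sketch of how header-less CSV files like the ones written above could be read back. The helper `load_csv_matrix` and the tiny stand-in data are assumptions for illustration. It uses the standard-library `csv` module to stay dependency-free; a real script might prefer pandas or NumPy.

```python
import csv
import os
import tempfile


def load_csv_matrix(path):
    """Read a header-less CSV of integers into a list of rows (hypothetical helper)."""
    with open(path, newline="") as f:
        return [[int(cell) for cell in row] for row in csv.reader(f) if row]


# Round-trip demo with two tiny made-up "reviews" instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "x_train.csv")
    y_path = os.path.join(tmp, "y_train.csv")

    with open(x_path, "w", newline="") as f:
        csv.writer(f).writerows([[0, 0, 5, 9], [3, 7, 2, 1]])  # padded word indices
    with open(y_path, "w", newline="") as f:
        csv.writer(f).writerows([[1], [0]])                    # one label per row

    features = load_csv_matrix(x_path)
    labels = [row[0] for row in load_csv_matrix(y_path)]       # flatten the single column

print(features, labels)
```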

#  Hosted Training

After we've confirmed our code seems to be working using Local Mode training, we can move on to SageMaker's hosted training, which uses compute resources separate from your notebook instance.  Hosted training spins up a cluster of one or more instances for training, and then tears the cluster down when training is complete.  In general, hosted training is preferred for actual training, especially large-scale, distributed training.  Before starting hosted training, the data must be in storage that SageMaker can access.  The storage options are Amazon S3 (object storage service), Amazon EFS (elastic NFS file system service), and Amazon FSx for Lustre (high-performance file system service).  For this example, we'll upload the data to S3.

In [None]:
bucket = sagemaker_session.default_bucket()

traindata_s3_prefix = 'imdb/data/train'
testdata_s3_prefix = 'imdb/data/test'

train_s3 = sagemaker_session.upload_data(path='./data/train/', bucket=bucket, key_prefix=traindata_s3_prefix)
test_s3 = sagemaker_session.upload_data(path='./data/test/', bucket=bucket, key_prefix=testdata_s3_prefix)

inputs = {'train': train_s3, 'test': test_s3}
print(inputs)

With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except that `train_instance_type` is set to an ML instance type instead of the local type used for Local Mode. Additionally, we've set the number of epochs to a value greater than one for actual training, as opposed to just testing the code.

In [None]:
from sagemaker.tensorflow import TensorFlow

train_instance_type = 'ml.p3.2xlarge'
hyperparameters = {'epochs': 3, 'batch_size': 128}
model_dir = '/opt/ml/model'

estimator = TensorFlow(entry_point='train.py',
                       source_dir='./training_scripts/',
                       model_dir=model_dir,
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=role,
                       base_job_name='tf-keras-sentiment',
                       framework_version='2.1',
                       py_version='py3')

With the change in training instance type and increase in epochs, we simply call `fit` to start the actual hosted training.  At the end of hosted training, you'll see from the logs below the cell that accuracy on the training set has greatly increased, and accuracy on the validation set is around 90%.  The model may be overfitting now (less able to generalize to data it has not yet seen), even though we are employing dropout as a regularization technique.  In a production situation, further investigation would be necessary.

In [None]:
estimator.fit(inputs)

### Now let's automate the training and deployment of the model!

In [None]:
# Copy end-to-end project code into code repository folder
!cp -r ../mlops-code/* ../../mlops-code/

Once the project is copied to mlops-code, open the Terminal and push it to the CodeCommit repository:

1. `cd SageMaker/mlops-code/`
2. `git add .`
3. `git commit -m "initial commit"`
4. `git push`

### Input and Output Processing in Step Functions

A Step Functions execution receives a JSON text as input and passes that input to the first state in the workflow. Individual states receive JSON as input and usually pass JSON as output to the next state. Understanding how this information flows from state to state, and learning how to filter and manipulate this data, is key to effectively designing and implementing workflows in AWS Step Functions.
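
As a rough illustration of that flow, the sketch below simulates in plain Python how a single state could filter its input and merge its result, loosely modelled on the `InputPath`, `ResultPath` and `OutputPath` fields. It is a simplified model, not the Step Functions implementation: it only handles dot-notation paths such as `$.training`, it ignores `Parameters` and `ResultSelector`, and the helper names, the fake task and the sample input are all assumptions made for the example.

```python
import copy


def get_path(doc, path):
    """Resolve a simple reference path such as '$' or '$.a.b' (dot notation only)."""
    node = doc
    for key in path.split(".")[1:]:   # skip the leading '$'
        node = node[key]
    return node


def set_path(doc, path, value):
    """Return a copy of doc with value stored at path; '$' replaces the whole document."""
    if path == "$":
        return value
    result = copy.deepcopy(doc)
    keys = path.split(".")[1:]
    node = result
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


def run_state(state_input, task, input_path="$", result_path="$", output_path="$"):
    """Simulate one state: filter the input, run the task, merge and filter the output."""
    effective_input = get_path(state_input, input_path)     # InputPath: what the task sees
    result = task(effective_input)                          # the work done by the state
    combined = set_path(state_input, result_path, result)   # ResultPath: merge into the raw input
    return get_path(combined, output_path)                  # OutputPath: what the next state gets


def fake_training_task(params):
    """Stand-in for a real task; returns a made-up result document."""
    return {"status": "Completed", "epochs_run": int(params["epochs"])}


state_input = {"training": {"epochs": "10"}, "data": {"bucket": "example-bucket"}}
output = run_state(state_input, fake_training_task,
                   input_path="$.training", result_path="$.training_result")
print(output)
```

Here the task only sees the `training` section, while the next state receives the original input plus a new `training_result` key. With the default `result_path="$"`, the result would replace the input entirely.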

Below is an example of the parameters we will pass as input to the Step Functions state machine to trigger our model training and deployment workflow:

In [None]:
import json
execution_params = {
  "data": {
    "bucket": bucket,
    "s3_train_data": train_s3,
    "s3_test_data": test_s3
  },
  "training": {
    "container": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.13-gpu-py3",
    "model_prefix": "sentiment-analysis",
    "training_instance_type": "ml.p3.2xlarge",
    "training_instance_count": 1,
    "hyperparameters": {
      "epochs": "10",
      "batch_size": "128",
      "sagemaker_container_log_level": "20",
      "sagemaker_enable_cloudwatch_metrics": "true",
      "sagemaker_program": "\"train.py\"",
      "sagemaker_region": "\"us-east-1\"",
      "sagemaker_submit_directory": f"\"s3://{bucket}/training/source/sourcedir.tar.gz\""
    },
    "s3_output_path": f"s3://{bucket}/training/"
  },
  "validation": {
    "log_group_name": "/aws/sagemaker/TrainingJobs",
    "validation_metric": "final validation accuracy",
    "validation_minimum_value": "85"
  },
  "deployment": {
    "container": "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.13-cpu",
    "instance_type": "ml.t2.medium",
    "instance_count": "1"
  }
}
print(json.dumps(execution_params))
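
The same JSON can also be submitted programmatically instead of being pasted into the console. The sketch below assumes the boto3 `stepfunctions` client and its `start_execution` call. The state machine ARN and the trimmed-down parameter dictionary are placeholders, and the helper names are hypothetical; in this notebook you would pass `execution_params` and the ARN of the state machine created for your project.

```python
import json


def build_start_execution_kwargs(state_machine_arn, params, name=None):
    """Build keyword arguments for start_execution; the input must be a JSON string."""
    kwargs = {"stateMachineArn": state_machine_arn, "input": json.dumps(params)}
    if name is not None:
        kwargs["name"] = name   # optional execution name
    return kwargs


def start_workflow(state_machine_arn, params):
    """Start the workflow and return the execution ARN (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above can be used without AWS access

    client = boto3.client("stepfunctions")
    response = client.start_execution(
        **build_start_execution_kwargs(state_machine_arn, params)
    )
    return response["executionArn"]


# Placeholder values for illustration only.
example_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:example-workflow"
example_params = {"training": {"hyperparameters": {"epochs": "10"}}}

kwargs = build_start_execution_kwargs(example_arn, example_params)
print(kwargs["input"])
```

Calling `start_workflow(example_arn, example_params)` would only succeed against a real state machine, so the demo stops at building the request.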