# UDACITY SageMaker Essentials: Endpoint Exercise

In the last exercise, you trained a BlazingText supervised sentiment analysis model. (Let's call this model HelloBlaze.) You've recently learned how to take a previously trained model and create an endpoint that we can call to efficiently evaluate new data. Here, we'll put that into practice. You will take HelloBlaze and use it to create an endpoint. Then, you'll evaluate some sample data on that endpoint to see how well the model generalizes. (Sentiment analysis is a notoriously difficult problem, so we'll keep our expectations modest.)

In [2]:
import os
import boto3
import json
import sagemaker
import zipfile

import pandas as pd
import numpy as np

## Understanding Exercise: Preprocessing Data (again)

Before we start, we're going to preprocess a new set of data that we'll evaluate on HelloBlaze. We won't keep track of the labels here; we're just seeing how we could evaluate new data using an existing model. This code should be very familiar, and requires no modification. Something to note: it is getting tedious to manually process the data ourselves whenever we want to do something with our model. We are also doing this on our local machine. Can you think of potential limitations and dangers of the preprocessing setup we currently have? Keep this in mind when we move on to our lesson about batch-transform jobs.

In [3]:
# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName": <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <int>,
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    for l in open(input_data, 'r'):
        l_object = json.loads(l)
        helpful_votes = float(l_object['helpful'][0])
        total_votes = l_object['helpful'][1]
        reviewText = l_object['reviewText']
        if total_votes != 0:
            if helpful_votes / total_votes > .5:
                labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
            elif helpful_votes / total_votes < .5:
                labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:
        label = d.split()[0]
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Make sure sentences isn't empty. Common w/ "..."
                new_split_sentences.append(" ".join([label, s]))
    return new_split_sentences


unzip_data('reviews_Musical_Instruments_5.json.zip')
labeled_data = label_data('reviews_Musical_Instruments_5.json')
new_split_sentence_data = split_sentences(labeled_data)

print(new_split_sentence_data[0:9])

['__label__1 The product does exactly as it should and is quite affordable', '__label__1 I did not realized it was double screened until it arrived, so it was even better than I had expected', "__label__1 As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording", '__label__1  :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]', '__label__1 The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies', '__label__1  The double cloth filter blocks the pops and lets the voice through with no coloration', '__label__1  The metal clamp mount attaches to the mike stand secure enough to kee
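To make the labeling rule concrete, here is a minimal sketch that applies the same threshold and the same sentence split to a single review held in memory. The record is made up for illustration, and the helper names `label_record` and `split_labeled` are hypothetical; they are not part of the exercise code.

```python
import json

HELPFUL_LABEL = "__label__1"
UNHELPFUL_LABEL = "__label__2"

def label_record(line):
    """Return a labeled review string, or None when no label applies."""
    record = json.loads(line)
    helpful_votes, total_votes = record["helpful"]
    if total_votes == 0:
        return None  # no votes at all, so there is nothing to learn from
    ratio = helpful_votes / total_votes
    if ratio > 0.5:
        return " ".join([HELPFUL_LABEL, record["reviewText"]])
    if ratio < 0.5:
        return " ".join([UNHELPFUL_LABEL, record["reviewText"]])
    return None  # a ratio of exactly 0.5 is dropped, as in label_data

def split_labeled(labeled):
    """Split one labeled review into labeled sentences on '.'."""
    tokens = labeled.split()
    label, text = tokens[0], " ".join(tokens[1:])
    return [" ".join([label, s]) for s in text.split(".") if s]

# Hypothetical record, made up for illustration.
sample = json.dumps({"helpful": [3, 4], "reviewText": "Works well. Would buy again."})
sentences = split_labeled(label_record(sample))
print(sentences)
```

Splitting on "." leaves the space that followed the period at the start of the next sentence, which is why some sentences in the output above have two spaces after the label.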

In [4]:
import logging

import boto3
from botocore.exceptions import ClientError
# Note: This section assumes that the bucket below already exists and that you have access
# to it. Change the bucket name below to a bucket that you have write
# permissions to. The upload time depends on your internet connection; the training file is ~40 MB

BUCKET = "aws-ml-nanodegree-bucket"
s3_prefix = "lesson2/ex2"


def cycle_data(fp, data):
    for d in data:
        fp.write(d + "\n")

def write_trainfile(split_sentence_data):
    train_path = "hello_blaze_train"
    with open(train_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return train_path

def write_validationfile(split_sentence_data):
    validation_path = "hello_blaze_validation"
    with open(validation_path, 'w') as f:
        cycle_data(f, split_sentence_data)
    return validation_path 

def upload_file_to_s3(file_name, s3_prefix):
    # Build the object key; os.path.join yields forward slashes on Linux/macOS
    object_name = os.path.join(s3_prefix, file_name)
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, BUCKET, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
    
# Split the data
split_data_trainlen = int(len(new_split_sentence_data) * .9)
split_data_validationlen = int(len(new_split_sentence_data) * .1)

# Todo: write the training file
train_path = write_trainfile(new_split_sentence_data[:split_data_trainlen])
print("Training file written!")

# Todo: write the validation file
validation_path = write_validationfile(new_split_sentence_data[split_data_trainlen:])
print("Validation file written!")

upload_file_to_s3(train_path, s3_prefix)
training_s3_uri = "s3://{}/{}/{}".format(BUCKET, s3_prefix, train_path)
print("Train file uploaded!")
upload_file_to_s3(validation_path, s3_prefix)
validation_s3_uri = "s3://{}/{}/{}".format(BUCKET, s3_prefix, validation_path)
print("Validation file uploaded!")

print(" ".join([train_path, validation_path]))
print(training_s3_uri)
print(validation_s3_uri)

Training file written!
Validation file written!
Train file uploaded!
Validation file uploaded!
hello_blaze_train hello_blaze_validation
s3://aws-ml-nanodegree-bucket/lesson2/ex2/hello_blaze_train
s3://aws-ml-nanodegree-bucket/lesson2/ex2/hello_blaze_validation
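`upload_file_to_s3` builds the object key with `os.path.join`, which uses the separator of the local operating system. The sketch below shows one separator-independent way to build the same URI, using `posixpath.join`. The helper name `s3_uri` is hypothetical and is not part of the exercise code.

```python
import posixpath

def s3_uri(bucket, prefix, file_name):
    """Build an S3 URI whose key always uses forward slashes."""
    key = posixpath.join(prefix, file_name)
    return "s3://{}/{}".format(bucket, key)

uri = s3_uri("aws-ml-nanodegree-bucket", "lesson2/ex2", "hello_blaze_train")
print(uri)
```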


## Exercise: Train Model

In [5]:
from sagemaker import get_execution_role
from sagemaker import image_uris

session = sagemaker.Session()

role = get_execution_role()

# We need to get the location of the container. 
container = image_uris.retrieve('blazingtext', session.boto_region_name, version='latest')

# Bucket name for model output
bucket = BUCKET # alternatively, use the session's default bucket: session.default_bucket()
# We use this prefix to help us determine where the output will go. 
prefix = s3_prefix

# Now that we know which container to use, we can construct the estimator object.
bt = sagemaker.estimator.Estimator(container, # The image name of the training container
                                    role,      # The IAM role to use (our current role in this case)
                                    instance_count=1, # The number of instances to use for training
                                    instance_type='ml.m5.large', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),# Where to save the output (the model artifacts)
                                    sagemaker_session=session, # The current SageMaker session
                                    max_run=900) # Timeout in seconds for training

# Set algorithm hyperparameters. More information about the BlazingText hyperparameters is here: 
# https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html    
    
bt.set_hyperparameters(mode="supervised")
                        
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=training_s3_uri)
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_s3_uri)

# The fit method launches the training job. 

bt.fit({'train': s3_input_train, 'validation': s3_input_validation})
bt.model_data # The model location in S3. Only set if Estimator has been fit()

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


2022-01-16 15:42:14 Starting - Starting the training job...
2022-01-16 15:42:37 Starting - Launching requested ML instancesProfilerReport-1642347734: InProgress
...
2022-01-16 15:43:07 Starting - Preparing the instances for training.........
2022-01-16 15:44:38 Downloading - Downloading input data...
2022-01-16 15:45:06 Training - Training image download completed. Training in progress.
Arguments: train
[01/16/2022 15:45:08 INFO 140157245322880] nvidia-smi took: 0.025205373764038086 secs to identify 0 gpus
[01/16/2022 15:45:08 INFO 140157245322880] Running single machine CPU BlazingText training using supervised mode.
Number of CPU sockets found in instance is  1
[01/16/2022 15:45:08 INFO 140157245322880] Processing /opt/ml/input/data/train/hello_blaze_train . File size: 2.321709632873535 MB
[01/16/2022 15:45:08 INFO 140157245322880] Processing /opt/ml/input/data/validation/hello_blaze_validation . File size: 0.26017189025878906 MB


's3://aws-ml-nanodegree-bucket/lesson2/ex2/output/blazingtext-2022-01-16-15-42-14-294/output/model.tar.gz'

## Exercise: Deploy Model

Once you have your model, it's trivially easy to create an endpoint. All you need to do is initialize a "model" object, and call the deploy method. Fill in the method below with the proper addresses and an endpoint will be created, serving your model. Once this is done, confirm that the endpoint is live by consulting the SageMaker Console. You'll see this under "Endpoints" in the "Inference" menu on the left-hand side. If done correctly, this will take a while to get instantiated. 

You will need the following: 

* The `image_uris.retrieve` method, to get the URI of the BlazingText Docker image: https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html
* A `model_data` value, to pass the S3 location of the SageMaker model data
* The `Model` object: https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
* The execution role
* The `deploy` method of the model object, using a single instance of "ml.m5.large"
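Besides checking the console, the endpoint status can also be polled from code. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in; it relies on the `describe_endpoint` call and its `EndpointStatus` field. The helper name `wait_for_endpoint`, the polling interval, and the endpoint name in the usage comment are illustrative choices, not part of the exercise.

```python
import time

def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status != "Creating":
            return status  # for example "InService" or "Failed"
        time.sleep(poll_seconds)
    return "Creating"  # still not ready after max_polls attempts

# Usage (requires AWS credentials; the endpoint name is a placeholder):
# import boto3
# print(wait_for_endpoint(boto3.client("sagemaker"), "my-endpoint-name"))
```

Passing the client in as an argument keeps the helper easy to try out with a stub object before pointing it at a real account.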

In [14]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# get the execution role
role = get_execution_role()
# get the image URI for the "blazingtext" framework (the region is hard-coded here; use your own region)
image_uri = image_uris.retrieve(framework='blazingtext', region='eu-central-1', version='latest')
# get the S3 location of the trained model artifacts
model_data = bt.model_data
# define a model object
model = Model(image_uri=image_uri, model_data=model_data, role=role)
# deploy the model on a single "ml.m5.large" instance; deploy() returns None unless the
# Model was created with predictor_cls, so a predictor may need to be created manually
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


-----!None


In [26]:
# name of deployed model
print(model.name)
endpoint_name = model.endpoint_name
print(endpoint_name)

blazingtext-2022-01-16-17-02-08-376
blazingtext-2022-01-16-17-02-08-813


## Exercise: Evaluate Data

Alright, we now have an easy way to evaluate our data! You will want to interact with the endpoint using the predictor interface: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html

Predictor is not the endpoint itself, but instead is an interface that we can use to easily interact with our deployed model. Your task is to take `new_split_sentence_data` and evaluate it using the predictor.  

Note that BlazingText supports "application/json" as the content type for inference, and the model expects a payload containing a list of sentences under the key "instances".

The method you'll need to call is highlighted below.

Another recommendation: try evaluating a subset of the data before evaluating all of the data. This will make debugging significantly faster.
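One way to follow this recommendation is to build the request bodies in small batches, so a first run can send just one of them. This is a minimal sketch; the helper name `make_payloads`, the batch size, and the demo sentences are assumptions made for illustration.

```python
import json

def make_payloads(sentences, batch_size=100):
    """Yield JSON request bodies with at most batch_size sentences each."""
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        yield json.dumps({"instances": batch})

# Illustrative sentences; in the exercise this would be new_split_sentence_data.
demo = ["good strings", "bad cable", "works fine"]
bodies = list(make_payloads(demo, batch_size=2))
for body in bodies:
    print(body)
```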

In [28]:
from sagemaker.predictor import Predictor
import json

# Create the predictor manually if deploy() did not return one
if predictor is None:
    predictor = Predictor(
        endpoint_name,
        sagemaker_session=sagemaker.Session()
    )

print(predictor)

# load the first five sentences from new_split_sentence_data
example_sentences = new_split_sentence_data[0:5]

payload = {"instances": example_sentences}

print(json.dumps(payload))

# make predictions using the "predict" method. Set initial_args to {'ContentType': 'application/json'}
predictions = predictor.predict(json.dumps(payload), initial_args={'ContentType': 'application/json'})

print(predictions)

<sagemaker.predictor.Predictor object at 0x7f7344bfe890>
{"instances": ["__label__1 The product does exactly as it should and is quite affordable", "__label__1 I did not realized it was double screened until it arrived, so it was even better than I had expected", "__label__1 As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording", "__label__1  :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]", "__label__1 The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of volume or high frequencies"]}
b'[{"label": ["__label__1"], "prob": [0.936009407043457]}, {"label": ["__label__1"], "prob": [0.828795492649078
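The endpoint returns raw JSON bytes, so the predictions need to be decoded before they can be used. The sketch below parses a shortened sample in the same shape as the output above into (label, probability) pairs. The helper name `parse_predictions` is hypothetical, and the sample contains only the first prediction.

```python
import json

def parse_predictions(raw):
    """Decode the endpoint's JSON bytes into (label, probability) pairs."""
    results = json.loads(raw)
    return [(r["label"][0], r["prob"][0]) for r in results]

# Shortened sample in the same shape as the endpoint response.
sample = b'[{"label": ["__label__1"], "prob": [0.936009407043457]}]'
parsed = parse_predictions(sample)
print(parsed)
```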

## Make sure you delete the endpoint after completing the exercise to avoid unnecessary costs.

In [9]:
predictor.delete_endpoint()