# UDACITY SageMaker Essentials: Endpoint Exercise

In the last exercise, you trained a BlazingText supervised sentiment analysis model. (Let's call this model HelloBlaze.) You've recently learned how to take a previously trained model and create an endpoint that we can call to efficiently evaluate new data. Here, we'll put that into practice. You will take HelloBlaze and use it to create an endpoint. Then, you'll evaluate some sample data on that endpoint to see how well the model generalizes. (Sentiment analysis is a notoriously difficult problem, so we'll keep our expectations modest.)

In [2]:
import boto3
import json
import sagemaker
import zipfile

## Understanding Exercise: Preprocessing Data (again)

Before we start, we'll preprocess a new set of data to evaluate on HelloBlaze. The labels are not sent to the model; we keep them only to compare against the predictions later. The point is to see how we could evaluate new data using an existing model. This code should be very familiar, and requires no modification. Something to note: it is getting tedious to process the data manually whenever we want to do something with our model. We are also doing this on our local machine. Can you think of potential limitations and dangers of the preprocessing setup we currently have? Keep this in mind when we move on to our lesson about batch-transform jobs.

In [3]:
# Function below unzips the archive to the local directory. 

def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

# Input data is a file with a single JSON object per line with the following format: 
# {
#  "reviewerID": <string>,
#  "asin": <string>,
#  "reviewerName": <string>,
#  "helpful": [
#    <int>, (indicating number of "helpful votes")
#    <int>  (indicating total number of votes)
#  ],
#  "reviewText": "<string>",
#  "overall": <number>, (for example 5.0)
#  "summary": "<string>",
#  "unixReviewTime": <int>,
#  "reviewTime": "<string>"
# }
# 
# We are specifically interested in the fields "helpful" and "reviewText"
#

def label_data(input_data):
    labeled_data = []
    HELPFUL_LABEL = "__label__1"
    UNHELPFUL_LABEL = "__label__2"
     
    with open(input_data, 'r') as input_file:  # close the file when done
        for line in input_file:
            l_object = json.loads(line)
            helpful_votes = float(l_object['helpful'][0])
            total_votes = l_object['helpful'][1]
            reviewText = l_object['reviewText']
            # Reviews with no votes or an exact 50/50 split are skipped
            if total_votes != 0:
                if helpful_votes / total_votes > .5:
                    labeled_data.append(" ".join([HELPFUL_LABEL, reviewText]))
                elif helpful_votes / total_votes < .5:
                    labeled_data.append(" ".join([UNHELPFUL_LABEL, reviewText]))
          
    return labeled_data


# Labeled data is a list of sentences, starting with the label defined in label_data. 

def split_sentences(labeled_data):
    new_split_sentences = []
    for d in labeled_data:       
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        for s in sentences:
            if s: # Skip empty strings, which are common after "..."
                new_split_sentences.append(s)
    return new_split_sentences

def split_sentences_labels(labeled_data):
    new_split_sentences = []
    for d in labeled_data:       
        sentences = " ".join(d.split()[1:]).split(".") # Initially split to separate label, then separate sentences
        label = d.split()[0]
        for s in sentences:
            if s: # Skip empty strings, which are common after "..."
                new_split_sentences.append(label)
    return new_split_sentences


unzip_data('reviews_Musical_Instruments_5.json.zip')
labeled_data = label_data('reviews_Musical_Instruments_5.json')
new_split_sentence_data = split_sentences(labeled_data)
new_split_labels_data = split_sentences_labels(labeled_data)

print(new_split_sentence_data[39:49])
print(new_split_labels_data[39:49])
print(labeled_data[6])

['This Hosa Cable is very well made, with good quality connectors and a nice long length', ' My son is expanding his collection of amps and effects pedals so needed additional cables to get everything connected', " This 25' cable gives him flexibility to move around and it is sturdy enough that it can take being stepped on and pulled around", 'The cable works well and is a good value for a decent quality cable', 'Highly Recommended!CFH', "I didn't expect this cable to be so thin", " It's easily 1/2 the thickness of any guitar cable I've used", ' Not sure about long-term durability or signal loss/interference', " If I had the foresight I'd spend a couple extra bucks on a thicker cable", ' Still, it works and was inexpensive']
['__label__2', '__label__2', '__label__2', '__label__2', '__label__2', '__label__1', '__label__1', '__label__1', '__label__1', '__label__1']
__label__2 This Hosa Cable is very well made, with good quality connectors and a nice long length. My son is expanding his c
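
To make the labeling rule concrete, here is a minimal sketch that applies the same thresholds as `label_data` to a few in-memory vote counts instead of a file. The helper `label_for_votes` and the sample vote pairs are hypothetical and exist only for illustration.

```python
# Hypothetical helper mirroring the thresholds used in label_data above.
HELPFUL_LABEL = "__label__1"
UNHELPFUL_LABEL = "__label__2"


def label_for_votes(helpful_votes, total_votes):
    """Return the label for a review, or None if the review is skipped."""
    if total_votes == 0:
        return None  # no votes at all, so the review is dropped
    ratio = helpful_votes / total_votes
    if ratio > 0.5:
        return HELPFUL_LABEL
    if ratio < 0.5:
        return UNHELPFUL_LABEL
    return None  # an exact 50/50 split is dropped, as in label_data


# Made-up [helpful, total] pairs in the same shape as the "helpful" field.
sample_votes = [[13, 14], [0, 0], [1, 2], [0, 3]]
sample_labels = [label_for_votes(h, t) for h, t in sample_votes]
print(sample_labels)
```

Reviews that return `None` here correspond to the reviews that `label_data` leaves out of `labeled_data`.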

## Exercise: Deploy Model

Once you have your model, creating an endpoint is straightforward: initialize a `Model` object and call its `deploy` method. Fill in the cell below with the proper locations and an endpoint will be created, serving your model. Even when everything is done correctly, the endpoint takes a while to be instantiated. Once deployment finishes, confirm that the endpoint is live in the SageMaker Console, under "Endpoints" in the "Inference" menu on the left-hand side.

You will need the following:

* The `image_uris.retrieve` method, to look up the URI of the BlazingText Docker image: https://sagemaker.readthedocs.io/en/stable/api/utility/image_uris.html
* A `model_data` value holding the S3 location of the SageMaker model data
* The `Model` object: https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
* The execution role
* The `deploy` method of the model object, using a single instance of "ml.m5.large"

In [4]:
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker import image_uris

# get the execution role
role = get_execution_role()
# get the image using the "blazingtext" framework and your region
image_uri = image_uris.retrieve(framework='blazingtext',region='us-east-1')
# get the S3 location of a SageMaker model data
model_data = 's3://sagemaker-essentials-bucket/lesson2_e1_training_jobs/output_attempt1/lesson2-exercise1-blazingtext-attempt1/output/model.tar.gz'
# define a model object
model = Model(image_uri=image_uri, model_data=model_data, role=role)
# deploy the model using a single instance of "ml.m5.large"
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large", endpoint_name="l2e2-blazingtext-endpoint")

-----!
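
Besides the console, the endpoint status can also be read programmatically. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in; the helper name `endpoint_status` is hypothetical. It relies on the `describe_endpoint` call, whose response includes an `EndpointStatus` field.

```python
def endpoint_status(sm_client, endpoint_name):
    """Return the EndpointStatus string reported by describe_endpoint."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


# Usage in the notebook (requires AWS credentials), with the endpoint name from above:
#   import boto3
#   print(endpoint_status(boto3.client("sagemaker"), "l2e2-blazingtext-endpoint"))
# A live endpoint reports "InService"; one still being set up reports "Creating".
```

Taking the client as a parameter keeps the helper easy to try out with a stand-in object before pointing it at a real account.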

## Exercise: Evaluate Data

Alright, we now have an easy way to evaluate our data! You will want to interact with the endpoint using the predictor interface: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html

A `Predictor` is not the endpoint itself; it is an interface that we can use to easily interact with our deployed model. Your task is to take `new_split_sentence_data` and evaluate it using the predictor.

Note that BlazingText supports "application/json" as the content type for inference, and the model expects a payload containing a list of sentences under the key "instances".

The method you'll need to call is highlighted below.

Another recommendation: try evaluating a subset of the data before evaluating all of the data. This will make debugging significantly faster.
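
One way to act on this recommendation is to build the request bodies in small batches, so that a first test sends only a handful of sentences. This is a sketch under stated assumptions: the helper `build_payloads`, the batch size, and the demo sentences are all hypothetical, and only the "instances" key comes from the note above.

```python
import json


def build_payloads(sentences, batch_size):
    """Yield JSON request bodies, each holding at most batch_size sentences."""
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        yield json.dumps({"instances": batch})


# Made-up sentences standing in for new_split_sentence_data.
demo_sentences = ["Great cable", " Works well", "Too thin for my taste"]
demo_payloads = list(build_payloads(demo_sentences, batch_size=2))
for body in demo_payloads:
    print(body)
```

While debugging, sending only the first body is enough to check that the request format is accepted; the remaining bodies can be sent once that works.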

In [10]:
from sagemaker.predictor import Predictor
import json

predictor = Predictor(endpoint_name='l2e2-blazingtext-endpoint')

# use every sentence in new_split_sentence_data (slice it, e.g. [:5], while debugging)
example_sentences = new_split_sentence_data

payload = {"instances": example_sentences}

# print(json.dumps(payload))

# make predictions using the "predict" method. Set initial_args to {'ContentType': 'application/json'}
predictions = json.loads(predictor.predict(json.dumps(payload), initial_args={'ContentType': 'application/json'}))

# print(predictions)

In [11]:
predictions_labels = [predicted_labels['label'][0] for predicted_labels in predictions]
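
The list comprehension above assumes that the endpoint returns one object per input sentence, with the top label stored in a one-element list. The sketch below replays that parsing on a made-up response body; the "prob" field and all values are assumptions for illustration, not output from the real endpoint.

```python
import json

# Made-up response body in the shape the list comprehension expects.
# The "prob" field and its values are assumptions for illustration.
mock_response_body = json.dumps([
    {"label": ["__label__1"], "prob": [0.93]},
    {"label": ["__label__2"], "prob": [0.71]},
])

mock_predictions = json.loads(mock_response_body)
# Take the first (top) label of each prediction, as in the cell above.
mock_labels = [item["label"][0] for item in mock_predictions]
print(mock_labels)
```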

In [12]:
from sklearn.metrics import accuracy_score
accuracy_score(new_split_labels_data, predictions_labels)

0.8821700836355563
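
An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common label. The sketch below computes that baseline; the helper `majority_baseline_accuracy` is hypothetical and the demo labels are made up, so the printed number says nothing about the real dataset.

```python
from collections import Counter


def majority_baseline_accuracy(true_labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(true_labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(true_labels)


# Made-up labels standing in for new_split_labels_data.
demo_true = ["__label__1"] * 8 + ["__label__2"] * 2
demo_baseline = majority_baseline_accuracy(demo_true)
print(demo_baseline)
```

Calling `majority_baseline_accuracy(new_split_labels_data)` in the notebook would give the figure to compare against the accuracy above.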

In [25]:
import pandas as pd
df_predicted = pd.DataFrame(predictions_labels, columns=['label'])
df_predicted.to_csv('l2e2_output_labels.csv', index=False)
df_real_labels = pd.DataFrame(new_split_labels_data, columns=['label'])
df_real_labels.to_csv('l2e2_real_labels.csv', index=False)

In [24]:
new_split_labels_data

['__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__2',
 '__label__2',
 '__label__2',
 '__label__2',
 '__label__2',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__2',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label__1',
 '__label_

In [23]:
!head l2e2_output_labels.csv

label
__label__1
__label__1
__label__1
__label__1
__label__1
__label__1
__label__1
__label__1
__label__1


In [17]:
!head reviews_Musical_Instruments_5.json

{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName": "cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0, 0], "reviewText": "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime": 1393545600, "reviewTime": "02 28, 2014"}
{"reviewerID": "A14VAT5EAX3D9S", "asin": "1384719342", "reviewerName": "Jake", "helpful": [13, 14], "reviewText": "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smell

## Make sure you delete the endpoint after completing the exercise to avoid ongoing costs.

In [13]:
predictor.delete_endpoint()
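
If an earlier cell fails, the cleanup cell may never run. One possible safeguard, sketched here with a hypothetical helper named `endpoint_cleanup`, is a context manager that calls `delete_endpoint` even when the code inside it raises an error.

```python
from contextlib import contextmanager


@contextmanager
def endpoint_cleanup(predictor):
    """Yield the predictor and delete its endpoint afterwards, even on error."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


# Usage in the notebook, assuming `predictor` and `payload` from the cells above:
#   with endpoint_cleanup(predictor) as p:
#       result = p.predict(json.dumps(payload),
#                          initial_args={'ContentType': 'application/json'})
```

This suits scripts better than interactive exploration, since the endpoint is gone as soon as the `with` block ends.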