
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Lab 8: Querying from SageMaker vs from Parquet

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
 - Deploy a model to SageMaker
 - Set up querying through a Parquet file
 - Compare query times between the two methods

In [3]:
%run "./../Includes/Classroom-Setup"

## Set up SageMaker

Run the following cell to set up access to our Airbnb model on SageMaker. It defines a helper for querying the endpoint and checks the endpoint's status.

In [5]:
import json
import os
import boto3

# Set AWS credentials as environment variables
os.environ["AWS_ACCESS_KEY_ID"] = 'AKIAI4T2MLVBUB372FAA'
os.environ["AWS_SECRET_ACCESS_KEY"] = 'g1lSUmTtP2Y5TM4G3nryqg4TysUeKuJLKG0EYAZE' # READ ONLY ACCESS KEYS
os.environ["AWS_DEFAULT_REGION"] = 'us-west-2'


def query_endpoint_example(inputs, appName="airbnb-latest-0001", verbose=True):
  """Send `inputs` (a list of feature rows) to the SageMaker endpoint and return its predictions."""
  if verbose:
    print("Sending batch prediction request with inputs: {}".format(inputs))
  client = boto3.session.Session().client("sagemaker-runtime", "us-west-2")
  
  # The request body is the JSON-serialized list of rows
  response = client.invoke_endpoint(
      EndpointName=appName,
      Body=json.dumps(inputs),
      ContentType='application/json',
  )
  # The response body is a stream containing JSON-encoded predictions
  preds = response['Body'].read().decode("utf-8")
  preds = json.loads(preds)
  
  if verbose:
    print("Received response: {}".format(preds))
  return preds

def check_status(appName):
  """Return the current status string of the SageMaker endpoint `appName`."""
  sage_client = boto3.client('sagemaker', region_name="us-west-2")
  endpoint_description = sage_client.describe_endpoint(EndpointName=appName)
  endpoint_status = endpoint_description["EndpointStatus"]
  return endpoint_status

print("Application status is: {}".format(check_status(appName="airbnb-latest-0001")))
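
An endpoint that is still being created or updated cannot serve requests, so it can be useful to wait until the status reads `InService` before querying. The following is a minimal sketch of such a wait loop and is not part of the original lab: `wait_for_status` and its parameters are hypothetical names, and the status function is passed in so the loop can be tried without AWS access.

```python
import time


def wait_for_status(get_status, target="InService", timeout_s=600, poll_s=15,
                    sleep=time.sleep):
    """Poll get_status() until it returns `target`.

    get_status: zero-argument callable returning a status string
                (hypothetical hook; in the lab it would wrap check_status).
    Raises RuntimeError on a "Failed" status and TimeoutError after timeout_s.
    """
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint entered the Failed state")
        if waited >= timeout_s:
            raise TimeoutError(
                "Gave up after {}s; last status: {}".format(waited, status))
        sleep(poll_s)
        waited += poll_s


# Stand-in status source for illustration only (no AWS call is made).
fake_statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_status(lambda: next(fake_statuses), poll_s=0))
```

In the lab, this could be called as `wait_for_status(lambda: check_status("airbnb-latest-0001"))`.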

Load the Airbnb data to use as inputs for our queries.

Reference the helper function in the 08-Real-Time-Deployment notebook to complete the `random_n_samples_sagemaker` function so that it connects to the `sagemaker-runtime` client and sends the record in the appropriate JSON format.
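
As a rough illustration of that JSON format: the query helper in this notebook serializes its inputs with `json.dumps`, so the request body is a JSON array with one inner array of feature values per record. The sketch below uses made-up feature values rather than the lab's data, and `build_payload` and `example_rows` are hypothetical names used only here.

```python
import json
import random


def build_payload(rows, n, seed=None):
    """Pick n rows at random (with replacement) and serialize them as JSON."""
    rng = random.Random(seed)
    samples = [list(rng.choice(rows)) for _ in range(n)]
    return samples, json.dumps(samples)


# Made-up feature rows for illustration; real rows come from the Airbnb DataFrame.
example_rows = [
    [1.0, 2.0, 37.77, -122.42],
    [0.0, 1.0, 37.80, -122.41],
    [1.0, 3.0, 37.75, -122.44],
]

samples, payload = build_payload(example_rows, n=2, seed=0)
print(payload)                         # JSON array of arrays, one inner array per record
assert json.loads(payload) == samples  # the payload round-trips to the same rows
```

NumPy arrays are not directly JSON-serializable, which is why rows taken from a pandas DataFrame need to be converted with `.tolist()` before calling `json.dumps`.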

In [7]:
# ANSWER

import pandas as pd
import random
from sklearn.model_selection import train_test_split

df = pd.read_csv("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)

def random_n_samples_sagemaker(n, df=X_train, verbose=False):
  dfShape = df.shape[0]
  samples = []
  
  # draw n rows at random (with replacement) and convert each to a plain list
  for i in range(n):
    sample = df.iloc[[random.randint(0, dfShape-1)]].values
    samples.append(sample.flatten().tolist())
  
  return query_endpoint_example(samples, appName="airbnb-latest-0001", verbose=verbose)

Check that the output of `random_n_samples_sagemaker` is what you expect.

In [9]:
random_n_samples_sagemaker(5, verbose=False)

## Using a Parquet File

Read in the Airbnb Parquet file as a Spark DataFrame.

In [11]:
prediction_df = spark.read.parquet("/mnt/training/airbnb/sf-listings/prediction.parquet") 
display(prediction_df)

Complete the following helper function `random_n_samples_parquet` to return `n` random queries from `prediction_df`.

In [13]:
# ANSWER
from pyspark.sql.functions import col

def random_n_samples_parquet(n, df=prediction_df):

  # get n ids to query using (limit(n) takes n rows without shuffling, so this is not a true random sample)
  id_list = df.limit(n).select("id").collect()
  query_ids = [i[0] for i in id_list]
  
  # get prediction of each row of 'query_ids'
  preds = []
  for i in query_ids:
    pred = (df
      .filter(col("id") == i)
      .select("prediction")
      .first()
    )[0]
    preds.append(pred)
    
  return preds

Check that the output of `random_n_samples_parquet` is what you expect.

In [15]:
random_n_samples_parquet(5)
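
The helper above runs one Spark filter per id, so every lookup pays the overhead of a separate Spark action. If the precomputed predictions are small enough to collect to the driver (an assumption, and not something this lab does), they could instead be held in a dictionary keyed by id. The sketch below uses plain Python with made-up ids and predictions; `sample_predictions` and `pred_by_id` are hypothetical names.

```python
import random


def sample_predictions(pred_by_id, n, seed=None):
    """Return (id, prediction) pairs for n distinct ids chosen at random."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(pred_by_id), n)  # sample without replacement
    return [(i, pred_by_id[i]) for i in ids]


# Made-up lookup table; in the lab it could be built from
# prediction_df.select("id", "prediction").collect().
pred_by_id = {101: 120.5, 102: 89.0, 103: 240.0, 104: 175.25}

for listing_id, pred in sample_predictions(pred_by_id, n=2, seed=42):
    print(listing_id, pred)
```

Unlike `limit(n)`, `random.sample` picks different ids on each call unless a seed is fixed.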

## Time Comparisons

Let's compare the time it takes to get the result of a single query from SageMaker and from a Parquet file.

In [17]:
batch_size = 1

In [18]:
%timeit -n5 random_n_samples_sagemaker(batch_size)

In [19]:
%timeit -n5 random_n_samples_parquet(batch_size)

For a single query, reading from the already-loaded Parquet file is faster!

But what if we want to send multiple queries at once? Increase `batch_size` to 30 and see which method is faster with a larger number of queries.

In [21]:
batch_size = 30

In [22]:
%timeit -n5 random_n_samples_sagemaker(batch_size)

In [23]:
%timeit -n5 random_n_samples_parquet(batch_size)

Which performed better? What are the trade-offs between using the two in production?
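
To see how the two methods scale, it can help to time them over several batch sizes rather than just two. Below is a small sketch using the standard-library `timeit` module. `compare_latency` and the two stand-in functions are hypothetical and only simulate work, so the printed numbers say nothing about SageMaker or Parquet themselves.

```python
import time
import timeit


def compare_latency(methods, batch_sizes, repeats=5):
    """Return {(method_name, batch_size): mean seconds per call}."""
    results = {}
    for batch_size in batch_sizes:
        for name, fn in methods.items():
            total = timeit.timeit(lambda: fn(batch_size), number=repeats)
            results[(name, batch_size)] = total / repeats
    return results


# Stand-ins: one has a fixed per-call cost, the other a per-record cost.
def fixed_cost(n):
    time.sleep(0.002)


def per_record_cost(n):
    for _ in range(n):
        time.sleep(0.0002)


timings = compare_latency(
    {"fixed_cost": fixed_cost, "per_record_cost": per_record_cost},
    batch_sizes=[1, 30],
)
for (name, batch_size), seconds in sorted(timings.items()):
    print("{:>16} batch={:<3} {:.4f}s per call".format(name, batch_size, seconds))
```

In the notebook, the `methods` dictionary would map names to `random_n_samples_sagemaker` and `random_n_samples_parquet`.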

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>