## Pre-Processing the Data
Now that we have the raw data, let's process it. 
We first download each image, resize it to 32x32 pixels, and flatten it into a 3072-element numpy array. We then randomly split the data into train and test sets with a 75/25 split.

In [None]:
import boto3
import os
import sagemaker
import numpy as np
import urllib.request
import cv2
import csv
import ssl
from sklearn.model_selection import train_test_split

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('sagemaker-hotels50k-train') # example: energy_market_procesing
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='train/dataset.csv') # example: market/zone1/data.csv
# get the object
response = obj.get()
# read the contents of the file and split it into one string per CSV row
lines = response['Body'].read()
lines = lines.decode('utf-8')
lines = lines.splitlines()
# now iterate over those lines
iterator = 0
labels = []
pixelArray = []

# Disable SSL certificate verification for the image downloads (insecure; only for trusted hosts)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

print("Starting image download...")
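for iterator, row in enumerate(csv.reader(lines)):
    if iterator == 0:
        continue  # skip the first row (assumed to be the CSV header)
    # row[1] holds the integer label, row[2] the image URL
    resp = urllib.request.urlopen(row[2], context=ctx)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    if image is None:
        continue  # skip downloads that OpenCV cannot decode
    # resize to 32x32 and flatten to 3072 values (32 * 32 pixels * 3 channels)
    pixels = cv2.resize(image, (32,32)).flatten()
    labels.append(int(row[1]))
    pixelArray.append(pixels)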
print(labels)
    
(train_features, test_features, train_labels, test_labels) = train_test_split(pixelArray, labels, test_size=0.25, random_state=42)
print("Finished setting train/test features/labels")
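
To make the 75/25 split concrete, here is a minimal sketch of a shuffled split using only NumPy on synthetic data. It is not scikit-learn's implementation: `simple_split` and the random arrays are hypothetical stand-ins, and the 500-row size is an assumption chosen to match the shapes printed later in this notebook.

```python
import numpy as np

def simple_split(features, labels, test_size=0.25, seed=42):
    """Shuffle rows and split them into train/test parts (illustrative only)."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    n = len(labels)
    n_test = int(np.ceil(n * test_size))  # number of rows held out for testing
    order = np.random.default_rng(seed).permutation(n)  # shuffled row indices
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Synthetic stand-in for 500 flattened 32x32x3 images with integer labels.
rng = np.random.default_rng(0)
fake_pixels = rng.integers(0, 256, size=(500, 3072), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=500)

tr_x, te_x, tr_y, te_y = simple_split(fake_pixels, fake_labels)
print(tr_x.shape, te_x.shape)
```

Every row ends up in exactly one of the two parts, so the train and test sizes always add up to the original number of rows.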


## Upload to Amazon S3
Because datasets are typically large and stored in Amazon S3, we now write the data to Amazon S3 in recordio-protobuf format. We first create an in-memory io buffer wrapping the data, then upload it to Amazon S3. Note that the bucket and prefix should be changed for different users and different datasets.

In [2]:
import io
import sagemaker.amazon.common as smac

train_features = np.array(train_features)
train_labels = np.array(train_labels)
train_labels = train_labels.astype('int')
train_features = train_features.astype('int')
print('train_features shape = ', train_features.shape)
print('train_labels shape = ', train_labels.shape)
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_features, train_labels)
buf.seek(0)

train_features shape =  (375, 3072)
train_labels shape =  (375,)


0

In [3]:
import boto3
import os
import sagemaker

bucket = 'sagemaker-hotels50k'
prefix = 'testing'
key = 'feature-vectors'

boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://sagemaker-hotels50k/testing/train/feature-vectors
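
The cell above builds the object key with `os.path.join` and the URI with a separate format string. As a hedged sketch, a small helper can build both from the same pieces so they cannot drift apart. `s3_key_and_uri` is a hypothetical name, not part of boto3 or the SageMaker SDK. It uses `posixpath.join` because S3 keys use `/`, while `os.path.join` would use `\` on Windows.

```python
import posixpath

def s3_key_and_uri(bucket, prefix, channel, key):
    """Return (object_key, s3_uri) for a data channel such as 'train' or 'test'."""
    # posixpath.join always joins with '/', regardless of the local OS.
    object_key = posixpath.join(prefix, channel, key)
    return object_key, 's3://{}/{}'.format(bucket, object_key)

object_key, uri = s3_key_and_uri('sagemaker-hotels50k', 'testing', 'train', 'feature-vectors')
print(object_key)  # testing/train/feature-vectors
print(uri)         # s3://sagemaker-hotels50k/testing/train/feature-vectors
```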


It is also possible to provide test data, which lets us read an evaluation of the model's performance from the training logs. To use this capability, let's upload the test data to Amazon S3 as well.

In [4]:
test_features = np.array(test_features)
test_labels = np.array(test_labels)
test_labels = test_labels.astype('int')
test_features = test_features.astype('int')

print('test_features shape = ', test_features.shape)
print('test_labels shape = ', test_labels.shape)

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, test_features, test_labels)
buf.seek(0)

boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('uploaded test data location: {}'.format(s3_test_data))

test_features shape =  (125, 3072)
test_labels shape =  (125,)
uploaded test data location: s3://sagemaker-hotels50k/testing/test/feature-vectors


## Training

We take a moment to explain, at a high level, how Machine Learning training and prediction work in Amazon SageMaker. First, we need to train a model. This is a process that, given a labeled dataset and hyper-parameters guiding the training process, outputs a model. Once the training is done, we set up what is called an **endpoint**. An endpoint is a web service that, given a request containing an unlabeled data point or a mini-batch of data points, returns one prediction per data point.

In Amazon SageMaker the training is done via an object called an **estimator**. When setting up the estimator we specify the location (in Amazon S3) of the training data, the path (again in Amazon S3) to the output directory where the model will be serialized, generic settings such as the machine type to use during the training process, and kNN-specific hyper-parameters such as the index type. Once the estimator is initialized, we can call its **fit** method to do the actual training.

Now that we are ready for training, we start with a convenience function that starts a training job.

In [5]:
import matplotlib.pyplot as plt

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri


def trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, s3_test_data=None):
    """
    Create an Estimator from the given hyperparams, fit it to the training data
    (and the optional test data), and return the fitted estimator
    
    """
    # set up the estimator
    knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
        get_execution_role(),
        train_instance_count=1,
        train_instance_type='ml.m5.2xlarge',
        output_path=output_path,
        sagemaker_session=sagemaker.Session())
    knn.set_hyperparameters(**hyperparams)
    
    # train a model. fit_input contains the locations of the train and test data
    fit_input = {'train': s3_train_data}
    if s3_test_data is not None:
        fit_input['test'] = s3_test_data
    knn.fit(fit_input)
    return knn

Now, we run the actual training job. For now, we set only a handful of hyperparameters and leave the rest at their defaults.

In [6]:
hyperparams = {
    'feature_dim': 3072,
    'k': 10,
    'sample_size': 200,
    'predictor_type': 'classifier' 
}
output_path = 's3://' + bucket + '/' + prefix + '/default_example/output'
knn_estimator = trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path, 
                                                   s3_test_data=s3_test_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-12-02 22:37:09 Starting - Starting the training job...
2020-12-02 22:37:11 Starting - Launching requested ML instances......
2020-12-02 22:38:33 Starting - Preparing the instances for training...
2020-12-02 22:39:04 Downloading - Downloading input data...
2020-12-02 22:39:10 Training - Downloading the training image..Docker entrypoint called with argument(s): train
Running default environment configuration script
[12/02/2020 22:39:54 INFO 140716057478976] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'index_metric': u'L2', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'feature_dim': u'auto', u'faiss_index_ivf_nlists': u'auto', u'epochs': u'1', u'index_type': u'faiss.Flat', u'_faiss_index_nprobe': u'5', u'_kvstore': u'dist_async', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000'}
[12/02/2020 22:39:54 INFO 140716057478976] Merging with provid


2020-12-02 22:40:05 Uploading - Uploading generated training model
2020-12-02 22:40:05 Completed - Training job completed
Training seconds: 61
Billable seconds: 61
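
The `feature_dim` value of 3072 used above is 32 * 32 * 3, the length of one flattened image. As a minimal sketch, the hyperparameters can be derived from the feature array itself so that they stay consistent if the image size changes. `knn_hyperparams_for` is a hypothetical helper, not part of the SageMaker SDK, and capping `sample_size` at the number of rows is our own assumption rather than documented algorithm behavior.

```python
import numpy as np

def knn_hyperparams_for(features, k=10, sample_size=200, predictor_type='classifier'):
    """Build a kNN hyperparameter dict whose feature_dim matches the data."""
    features = np.asarray(features)
    if features.ndim != 2:
        raise ValueError('expected a 2-D array of shape (num_rows, feature_dim)')
    num_rows, feature_dim = features.shape
    return {
        'feature_dim': int(feature_dim),
        'k': k,
        # Assumption: never ask for more samples than there are training rows.
        'sample_size': min(sample_size, int(num_rows)),
        'predictor_type': predictor_type,
    }

# Stand-in array with the same shape as the training features in this notebook.
demo_hyperparams = knn_hyperparams_for(np.zeros((375, 3072), dtype=np.uint8))
print(demo_hyperparams)
```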


## Setting up the endpoint

Now that we have a trained model, we are ready to run inference. The **knn_estimator** object above contains all the information we need for hosting the model. Below we provide a convenience function that, given an estimator, sets up an endpoint that hosts the model. Other than the estimator object, we provide it with a name (string) for the estimator, and an **instance_type**. The **instance_type** is the machine type that will host the model. It is not restricted in any way by the parameter settings of the training job.

In [7]:
def predictor_from_estimator(knn_estimator, estimator_name, instance_type, endpoint_name=None): 
    knn_predictor = knn_estimator.deploy(initial_instance_count=1, instance_type=instance_type,
                                        endpoint_name=endpoint_name)
    
    knn_predictor.content_type = 'text/csv'
    knn_predictor.serializer = csv_serializer
    knn_predictor.deserializer = json_deserializer
    return knn_predictor

In [8]:
import time

instance_type = 'ml.t2.medium'
model_name = 'knn_%s'% instance_type
endpoint_name = 'knn-ml-t2-medium-%s'% (str(time.time()).replace('.','-'))
print('setting up the endpoint..')
predictor = predictor_from_estimator(knn_estimator, model_name, instance_type, endpoint_name=endpoint_name)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


setting up the endpoint..
-------------------------!
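
The endpoint name above is built by replacing the `.` in a timestamp with `-`. To our understanding, SageMaker endpoint names are limited to alphanumeric characters and hyphens and to 63 characters. Under that assumption, a hedged sketch of a reusable name builder follows. `make_endpoint_name` is a hypothetical helper, not an SDK function.

```python
import re
import time

def make_endpoint_name(prefix, instance_type, timestamp=None):
    """Build an endpoint name from a prefix, an instance type, and a timestamp."""
    timestamp = time.time() if timestamp is None else timestamp
    raw = '{}-{}-{}'.format(prefix, instance_type, int(timestamp))
    # Replace every run of characters other than letters, digits, or '-' with '-'.
    name = re.sub(r'[^A-Za-z0-9-]+', '-', raw).strip('-')
    # Assumed 63-character limit; drop any hyphen left dangling by the cut.
    return name[:63].rstrip('-')

demo_name = make_endpoint_name('knn', 'ml.t2.medium', timestamp=1606948629)
print(demo_name)  # knn-ml-t2-medium-1606948629
```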

## Inference

Now that we have our predictor, let's use it on our test dataset. The following code runs on the test dataset and computes the accuracy and the total prediction time. It splits the data into 50 batches; with the 125 test points used here, each batch holds 2 or 3 data points. Then, each batch is given to the inference service to obtain predictions. Once we have all predictions, we compute their accuracy given the true labels of the test set.

In [10]:
# Pass in one image? Draft of a direct endpoint call, left commented out.
# client = boto3.client('sagemaker-runtime')
# content_type = "base64"  # The MIME type of the input data in the request body.
# accept = "..."           # The desired MIME type of the inference in the response.
# payload = "..."          # Payload for inference.
#
# response = client.invoke_endpoint(
#     EndpointName=endpoint_name,  # Your endpoint name.
#     ContentType=content_type,
#     Body=payload,
# )
#
# print(response)

num_batches = 50
batches = np.array_split(test_features, num_batches)
print('data split into %d batches, of size roughly %d.' % (len(batches), batches[0].shape[0]))

# obtain an np array with the predictions for the entire test set
start_time = time.time()
predictions = []
for batch in batches:
    #this is where we will pass in a single image from Kevin
    result = predictor.predict(batch)
    # each entry of result['predictions'] holds the predicted label for one row of the batch
    cur_predictions = np.array([p['predicted_label'] for p in result['predictions']])
    predictions.append(cur_predictions)
predictions = np.concatenate(predictions)
run_time = time.time() - start_time

test_size = test_labels.shape[0]
num_correct = np.sum(predictions == test_labels)
accuracy = num_correct / float(test_size)
print('time required for predicting %d data points: %.2f seconds' % (test_size, run_time))
print('accuracy of model: %.1f%%' % (accuracy * 100) )

data split into 1 batches, of size 125.


AttributeError: 'list' object has no attribute 'shape'
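
An error like the `AttributeError` above can occur when a plain Python list reaches code that expects `.shape`, for example if the cell that converts `test_labels` to a NumPy array was not run in the current session. The sketch below shows the batching and accuracy logic offline, converting its inputs with `np.asarray` first. `FakePredictor` is a hypothetical stand-in for the deployed predictor that mimics the response layout used above, and the demo data is synthetic.

```python
import numpy as np

class FakePredictor:
    """Hypothetical stand-in: labels each row by the parity of its first value."""
    def predict(self, batch):
        return {'predictions': [{'predicted_label': float(row[0] % 2)} for row in batch]}

def batched_accuracy(predictor, features, labels, num_batches=50):
    """Predict in batches and return the fraction of correct labels."""
    features = np.asarray(features)  # accept lists as well as arrays
    labels = np.asarray(labels)
    predictions = []
    for batch in np.array_split(features, num_batches):
        if len(batch) == 0:
            continue  # array_split yields empty batches when rows < num_batches
        result = predictor.predict(batch)
        predictions.append(np.array([p['predicted_label'] for p in result['predictions']]))
    predictions = np.concatenate(predictions)
    return float(np.mean(predictions == labels))

# 125 synthetic rows whose first column is the row index.
demo_features = np.column_stack([np.arange(125), np.zeros(125, dtype=int)])
true_labels = np.arange(125) % 2
true_labels[100:] = 1 - true_labels[100:]  # make the last 25 labels wrong on purpose
demo_accuracy = batched_accuracy(FakePredictor(), demo_features, list(true_labels))
print('accuracy: %.1f%%' % (demo_accuracy * 100))
```

Passing the labels as a plain list works here because the helper converts them before comparing.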

In [None]:
print(test_features[0])

In [None]:
import numpy as np
np.set_printoptions(threshold=np.inf)
print(test_features[0])