# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Amazon SageMaker
What are our learning objectives for this lesson?
* Make an Amazon Web Services (AWS) account
* Set up a Jupyter Notebook instance on Amazon SageMaker
* Perform simple k-means clustering using SageMaker

Content used in this lesson is based upon information in the following sources:
* [Amazon SageMaker developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)

## Warm-up Task(s)
1. Make a [free AWS account](https://aws.amazon.com/free/?awsf.Free%20Tier%20Types=categories%23featured) if you don't already have one
    * Later, you can get free AWS credits as a student through AWS Educate: https://aws.amazon.com/education/awseducate/ 
1. Go to console.aws.amazon.com and sign in to your account
1. Make sure you have ClusteringFun/shirt_sizes.csv downloaded on your machine somewhere

## Helpful Links
* [Amazon SageMaker developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)
* Instance types:
    * [Amazon EC2 instance types](https://aws.amazon.com/ec2/instance-types/)
    * [Amazon SageMaker ML instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/)
* [SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/)
* [S3 pricing](https://aws.amazon.com/s3/pricing/)
* [SageMaker KMeans documentation](https://sagemaker.readthedocs.io/en/stable/kmeans.html)
* [JupyterLab vs Jupyter Notebook](https://towardsdatascience.com/jupyter-lab-evolution-of-the-jupyter-notebook-5297cacde6b)

## Setup
1. S3 Console:
    1. Create an S3 bucket
    1. Upload your dataset (in this example, shirt_sizes.csv) to the bucket
1. SageMaker Console:
    1. Create a new notebook instance
        1. Create a new IAM role
    1. Make sure your instance region is the same as your S3 bucket region
    1. Wait for it to initialize, then open it in Jupyter
1. Jupyter Dashboard:
    1. Create a new Jupyter notebook with the conda-python3 kernel
    1. Rename the notebook

## Code
### Using SageMaker from Python
SageMaker APIs can be accessed in one of two ways from Python:
1. The low-level Python SDK: this is the `boto3`-related code you might see in the tutorials
1. The high-level Python API: this is the `sagemaker`-related code you might see in the tutorials
    1. A lot of things that are customizable with option 1 are abstracted away by option 2
    1. To keep things simple, start with option 2 and move to option 1 only when you need it

In [None]:
from sagemaker import get_execution_role

role = get_execution_role() # get the IAM role you associated with this notebook when you created it
# essentially authentication... for use when accessing S3 buckets, endpoints, etc.
bucket = "sagemaker-temp2"
original_data_key = "shirt_sizes.csv"
original_data_location = "s3://" + bucket + "/" + original_data_key
print("original data path:", original_data_location)
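
The string concatenation above follows the `s3://<bucket>/<key>` convention. As a minimal sketch, the helpers below build such a path and take it apart again using only the standard library. The names `s3_uri` and `split_s3_uri` are hypothetical and are not part of `sagemaker` or `boto3`.

```python
from urllib.parse import urlparse

def s3_uri(bucket, key):
    """Build an s3:// path from a bucket name and an object key (hypothetical helper)."""
    return "s3://" + bucket + "/" + key.lstrip("/")

def split_s3_uri(uri):
    """Split an s3:// path back into (bucket, key) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 path: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")

location = s3_uri("sagemaker-temp2", "shirt_sizes.csv")
print(location)
print(split_s3_uri(location))
```

Keeping the bucket and key separate until the last moment is handy because some APIs want the full `s3://` path while others want the bucket and key as two arguments.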

### Preparing Data for Amazon's K-Means Algorithm
Next we need to read the data from S3, clean it, and convert it to a NumPy ndarray, which is one of the types the ML algorithms can work with.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(original_data_location)
df.columns = df.columns.str.strip()
df = df.drop("size(t-shirt)", axis=1)
# min-max normalize each column so its values fall in the range [0, 1]
df = (df - df.min()) / (df.max() - df.min())
print(df)
# convert the data values to be numpy float32s
df = df.astype(np.float32)
train_set = df.values
print(train_set)
print(type(train_set))
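
To make the normalization step concrete, here is a small plain-Python sketch of what `(df - df.min()) / (df.max() - df.min())` computes for each column. The function name `min_max_normalize` and the measurements are made up for illustration; they are not the contents of shirt_sizes.csv.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(column), max(column)
    # a constant column would mean dividing zero by zero
    # (the pandas expression would produce NaN in that case)
    if high == low:
        raise ValueError("cannot min-max normalize a constant column")
    return [(value - low) / (high - low) for value in column]

# made-up measurements for illustration only
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 60.0, 80.0, 90.0]

normalized_heights = min_max_normalize(heights)
normalized_weights = min_max_normalize(weights)
print(normalized_heights)
print(normalized_weights)
```

After this step the smallest value in each column is 0 and the largest is 1, so no single attribute dominates the distance calculations that k-means relies on.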

### Setting up K-Means
We can use any of the following algorithms:
1. Ones Amazon provides via `sagemaker`
1. Ones in the Amazon Marketplace
1. Ones we write ourselves
    1. Write in Notebook
    1. Write offline and upload/convert

We will use `KMeans` from `sagemaker`

In [None]:
from sagemaker import KMeans

# sagemaker will convert our numpy array to a binary format called RecordIO that it can work with more efficiently
# we need to say where to store that data
data_key = original_data_key.split(".")[0] + "_recordIO_data"
data_location = "s3://" + bucket + "/" + data_key
output_key = original_data_key.split(".")[0] + "_output"
output_location = "s3://" + bucket + "/" + output_key

print("training data will be uploaded to", data_location)
print("training artifacts will be uploaded to", output_location)

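# note: train_instance_count and train_instance_type are the parameter names in version 1 of the sagemaker library
# version 2 renamed them to instance_count and instance_type
kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type="ml.m4.xlarge",
                output_path=output_location,
                k=2,
                data_location=data_location)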



In [None]:
%%time
# magic command, times the duration to execute the cell

kmeans.fit(kmeans.record_set(train_set)) # launches a training job and, by default, waits for it to finish
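
While the training job runs, it may help to recall what k-means is doing conceptually. The sketch below is a plain-Python version of the textbook algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the implementation SageMaker runs, and the function names, the naive "first k points" initialization, and the data are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_centroid(point, centroids):
    """Return (index, distance) of the centroid nearest to point."""
    distances = [distance(point, c) for c in centroids]
    index = distances.index(min(distances))
    return index, distances[index]

def kmeans_sketch(points, k, max_iterations=100):
    # naive initialization for illustration: start from the first k points
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iterations):
        # assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            index, _dist = closest_centroid(point, centroids)
            clusters[index].append(point)
        # update step: move each centroid to the mean of its cluster
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append([sum(col) / len(cluster) for col in zip(*cluster)])
            else:
                new_centroids.append(old)  # keep a centroid that lost all its points
        if new_centroids == centroids:
            break  # no centroid moved, so the algorithm has converged
        centroids = new_centroids
    return centroids

# made-up, already-normalized points forming two obvious groups
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
centroids = kmeans_sketch(points, k=2)
print(centroids)
print(closest_centroid([0.7, 0.6], centroids))
```

With `k=2` the two centroids settle near the middle of each group, and a new point is "clustered" simply by finding which centroid it is closest to.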

### Using the Model
So the model has been written out to a model.tar.gz file in S3... kind of a pain to inspect if you ask me!

Anyway, we need to "deploy" the model to an endpoint in order to start running instances through it for prediction/clustering/etc. Predictions are returned as a JSON object (note: you can also make an HTTP request to your endpoint for a prediction and get a JSON response back... this is neat because you can use your model in your apps!).

In [None]:
%%time

kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

In [None]:
unseen = np.array([[0.5, 0.5]], dtype=np.float32)
result = kmeans_predictor.predict(unseen)
print(result)
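
If you call the endpoint over HTTP from an app, you will be building and parsing the JSON yourself. The sketch below shows roughly what that looks like, assuming the JSON request format (`instances` / `features`) and the k-means response fields (`closest_cluster`, `distance_to_cluster`) described in the developer guide. The response text here is made up, since real values depend on your trained model, so check the guide before relying on the exact shape.

```python
import json

# request body: one "features" list per row we want clustered
unseen = [[0.5, 0.5]]
request_body = json.dumps({"instances": [{"features": row} for row in unseen]})
print(request_body)

# made-up response text for illustration only
response_text = '{"predictions": [{"closest_cluster": 1.0, "distance_to_cluster": 0.25}]}'
for prediction in json.loads(response_text)["predictions"]:
    cluster = int(prediction["closest_cluster"])  # cluster labels come back as floats
    distance = prediction["distance_to_cluster"]
    print("cluster", cluster, "distance", distance)
```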

## Tips on Viewing the Algorithm Parameters
This section is for when you want to "view your model." For k-means, that means seeing the cluster centroids. This takes more work than I think it should because the `sagemaker.KMeans` class does not expose the centroids; instead, it writes them to the model file, model.tar.gz.

* You'll need to create a training job using the low-level SDK for Python so you have the job_name in your code. See this part of the developer guide for how to do this: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
    * Alternatively, you can go to your S3 bucket and manually find the job name that the high-level Python library generated automatically when you called `kmeans.fit(kmeans.record_set(train_set))`. It is the name of the folder storing the model.tar.gz file, something like `kmeans-2019-04-29-02-11-06-721`
* You'll need to load the trained model as shown in this blog post: https://aws.amazon.com/blogs/machine-learning/analyze-us-census-data-for-population-segmentation-using-amazon-sagemaker/
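
Once you know the job name, you can work out where the model file lives. The sketch below assumes the usual layout of `<output location>/<job name>/output/model.tar.gz`; the helper name `model_artifact_uri` is hypothetical, and you should confirm the actual path in the S3 console.

```python
def model_artifact_uri(output_location, job_name):
    """Guess where a training job wrote model.tar.gz (hypothetical helper)."""
    return output_location.rstrip("/") + "/" + job_name + "/output/model.tar.gz"

# the output location used earlier in this lesson
output_location = "s3://sagemaker-temp2/shirt_sizes_output"
# a job name in the format shown above; yours will have a different timestamp
job_name = "kmeans-2019-04-29-02-11-06-721"
print(model_artifact_uri(output_location, job_name))
```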