# Learning
This notebook describes how to use Neo4j and SageMaker together. It takes Neo4j graph embedding data from S3 and trains a supervised learning model using SageMaker.

Click this button to open the notebook in SageMaker Studio Lab. [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/neo4j-partners/neo4j-sagemaker/blob/main/form13/learning.ipynb)

## Credential and Prerequisites
First, let's make sure we have the packages we need. We're also going to set up our AWS credentials.

In [5]:
%pip install boto3
%pip install sagemaker

Note: you may need to restart the kernel to use updated packages.

In [2]:
# Edit these variables!
ACCESS_KEY = 'your-access-key'
SECRET_KEY = 'your-secret-key'
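
Keys typed directly into a notebook are easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads them from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables and falls back to the placeholders above. The helper name `load_credentials` is hypothetical and is not part of boto3.

```python
import os


def load_credentials(env=None):
    """Return (access_key, secret_key), preferring environment variables.

    `load_credentials` is a hypothetical helper written for this notebook.
    """
    env = os.environ if env is None else env
    # Fall back to the notebook's placeholder values when a variable is unset.
    access_key = env.get("AWS_ACCESS_KEY_ID", "your-access-key")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY", "your-secret-key")
    return access_key, secret_key


ACCESS_KEY, SECRET_KEY = load_credentials()
```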

## Download the Dataset
Now we're going to grab the dataset from the bucket we put it in earlier.

In [3]:
import boto3

file_name='embedding.csv'
object_name=file_name
bucket_name = ACCESS_KEY.lower() + '-form13'

s3 = boto3.client('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
s3.download_file(bucket_name, object_name, file_name)

Alternatively, if you don't want to create the embedding, you can just download the dataset using this command:

In [8]:
!wget https://neo4j-dataset.s3.amazonaws.com/form13/embedding.csv
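
Whichever way the file was obtained, it can help to look at the first few rows before going further. The following is a minimal sketch using only the standard library. It assumes `embedding.csv` starts with a header row, which this notebook does not confirm, and `peek_csv` is a hypothetical helper name.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=3):
    """Return the first row (assumed to be a header) and the next n_rows rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file is empty
        rows = list(islice(reader, n_rows))
    return header, rows
```

Calling `peek_csv("embedding.csv")` after the download returns the column names and a small sample without loading the whole file into memory.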


## SageMaker Connection
Let's set up our SageMaker connection. That includes:
* The S3 buckets and prefixes that you want to use for training and model data. These should be within the same region as the notebook instance, training, and hosting.
* The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

In [7]:
import boto3
import sagemaker

session = sagemaker.Session()
region = boto3.Session().region_name

bucket = session.default_bucket()
prefix = 'sagemaker/form13'

role = sagemaker.get_execution_role()

ValueError: Must setup local AWS configuration with a region supported by SageMaker.
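
The error above typically appears when no AWS region is configured for the session. One possible workaround, sketched here under assumptions, is to resolve the region explicitly and pass it in. The helper `resolve_region` and the `us-east-1` fallback are hypothetical choices for illustration; `AWS_DEFAULT_REGION` is the environment variable boto3 reads, and `AWS_REGION` is checked as a secondary convention.

```python
import os


def resolve_region(env=None, default="us-east-1"):
    """Pick a region from the environment, else fall back to `default`.

    The fallback value is an assumption; use the region that holds your data.
    """
    env = os.environ if env is None else env
    return env.get("AWS_DEFAULT_REGION") or env.get("AWS_REGION") or default


region = resolve_region()
```

The resolved value can then be passed as `boto3.Session(region_name=region)`, and that session handed to `sagemaker.Session(boto_session=...)`. When running outside SageMaker, `get_execution_role()` may also fail, in which case a full IAM role ARN string is needed instead.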

In [None]:
import re
import boto3
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()

region = boto3.Session().region_name

# S3 bucket where the original MNIST data is downloaded and stored.
downloaded_data_bucket = "sagemaker-sample-files"
downloaded_data_prefix = "datasets/image/MNIST"

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket and prefix
bucket = sess.default_bucket()
prefix = "sagemaker/DEMO-linear-mnist"

# Define IAM role
role = get_execution_role()

## Data Ingestion
Next, we read the dataset into memory for preprocessing prior to training.

In [None]:
%%time
import pickle, gzip, numpy, json

# Load the dataset
s3 = boto3.client("s3")
s3.download_file(downloaded_data_bucket, f"{downloaded_data_prefix}/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")

## Data Conversion
The Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf, whereas the data we have today is a pickled NumPy array on disk.

Most of the conversion effort is handled by the Amazon SageMaker Python SDK, imported as sagemaker below.

In [None]:
import io
import numpy as np
import sagemaker.amazon.common as smac

vectors = np.array([t.tolist() for t in train_set[0]]).astype("float32")
labels = np.where(np.array([t.tolist() for t in train_set[1]]) == 0, 1, 0).astype("float32")

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)
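
The list comprehensions in the cell above can also be written as vectorized NumPy operations. The sketch below shows the same feature cast and label binarization on a small synthetic stand-in for `train_set`. The names `images` and `digits` and their values are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for train_set: 4 "images" of 784 pixels and their digits.
images = np.zeros((4, 784), dtype="float64")
digits = np.array([0, 7, 0, 3])

# Cast features to float32 without going through Python lists.
vectors = np.asarray(images, dtype="float32")

# Binary target: 1.0 where the digit is 0, otherwise 0.0.
labels = (digits == 0).astype("float32")
```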

## Upload the Training Data
Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

In [None]:
import boto3
import os

key = "recordio-pb-data"
boto3.resource("s3").Bucket(bucket).Object(os.path.join(prefix, "train", key)).upload_fileobj(buf)
s3_train_data = f"s3://{bucket}/{prefix}/train/{key}"
print(f"uploaded training data location: {s3_train_data}")
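
S3 keys always use forward slashes, while `os.path.join` follows the local operating system's separator. As a hedged sketch, the helpers below build the key and URI with `posixpath` so the result is the same on any platform. `s3_key` and `s3_uri` are hypothetical helper names, and `my-bucket` is a placeholder.

```python
import posixpath


def s3_key(*parts):
    """Join key components with '/', regardless of the local OS."""
    return posixpath.join(*(p.strip("/") for p in parts))


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return f"s3://{bucket}/{s3_key(*parts)}"


example_uri = s3_uri("my-bucket", "sagemaker/form13", "train", "recordio-pb-data")
```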

Let's also set up an output S3 location for the model artifact that training will produce.

In [None]:
output_location = f"s3://{bucket}/{prefix}/output"
print(f"training artifacts will be uploaded to: {output_location}")

## Training the Linear Model
Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the Linear Learner training algorithm, although we have tested it on multi-terabyte datasets.

Again, we'll use the Amazon SageMaker Python SDK to kick off training, and monitor status until it is completed. In this example that takes between 7 and 11 minutes. Despite the dataset being small, provisioning hardware and loading the algorithm container take time upfront.

First, let's specify our container. The Linear Learner image is region-specific, so we use the SDK's image_uris.retrieve to look up the right image URI for the current region. More details on algorithm containers can be found in AWS documentation.

In [None]:
from sagemaker import image_uris

container = image_uris.retrieve(region=boto3.Session().region_name, framework="linear-learner")

Next we'll kick off the base estimator, making sure to pass in the necessary hyperparameters. Notice:

* feature_dim is set to 784, which is the number of pixels in each 28 x 28 image.
* predictor_type is set to 'binary_classifier' since we are trying to predict whether the image is or is not a 0.
* mini_batch_size is set to 200. This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases.
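
Rather than hardcoding 784, `feature_dim` can be derived from the shape of the training matrix so it stays correct if the input changes. The sketch below is one way to assemble the three hyperparameters listed above. The helper name is hypothetical, and the batch-size guard is a conservative check chosen for this sketch rather than a documented limit.

```python
def linear_learner_hyperparameters(n_features, n_rows, mini_batch_size=200):
    """Assemble the hyperparameters described above as a plain dict."""
    # Sanity check for this sketch: a batch larger than the dataset is
    # almost certainly a mistake.
    if mini_batch_size > n_rows:
        raise ValueError("mini_batch_size exceeds the number of training rows")
    return {
        "feature_dim": n_features,
        "predictor_type": "binary_classifier",
        "mini_batch_size": mini_batch_size,
    }


# With the MNIST vectors this would be called with vectors.shape[1] and vectors.shape[0].
hyperparameters = linear_learner_hyperparameters(n_features=784, n_rows=50000)
```

The resulting dict can be passed to the estimator as `linear.set_hyperparameters(**hyperparameters)`.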

In [None]:
import boto3
import sagemaker

sess = sagemaker.Session()

linear = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path=output_location,
    sagemaker_session=sess,
)
linear.set_hyperparameters(feature_dim=784, predictor_type="binary_classifier", mini_batch_size=200)

linear.fit({"train": s3_train_data})

## Set up Hosting for the Model
Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model dynamically.

Note, Amazon SageMaker allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target.

In [None]:
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

## Validate the Model for Use
Finally, we can now validate the model for use. We can pass HTTP POST requests to the endpoint to get back predictions. To make this easier, we'll again use the Amazon SageMaker Python SDK and specify how to serialize requests and deserialize responses that are specific to the algorithm.

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()

Now let's try getting a prediction for a single record.

In [None]:
result = linear_predictor.predict(train_set[0][30:31], initial_args={"ContentType": "text/csv"})
print(result)

OK, a single prediction works. We see that for one record our endpoint returned some JSON which contains predictions, including the score and predicted_label. In this case, score is a continuous value in [0, 1] representing the estimated probability that the digit is a 0. predicted_label is either 0 or 1, where (somewhat counterintuitively) 1 means we predict the image is a 0 and 0 means we predict it is not.
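
To make the response layout concrete, here is a minimal sketch that pulls the two fields out of a response shaped like the one described above. The `sample_response` values are invented for illustration, and `extract_predictions` is a hypothetical helper.

```python
# Made-up response with the same keys as the endpoint's JSON output.
sample_response = {
    "predictions": [
        {"score": 0.02, "predicted_label": 0},
        {"score": 0.97, "predicted_label": 1},
    ]
}


def extract_predictions(response):
    """Split a response into parallel lists of scores and 0/1 labels."""
    scores = [float(p["score"]) for p in response["predictions"]]
    labels = [int(p["predicted_label"]) for p in response["predictions"]]
    return scores, labels


scores, labels = extract_predictions(sample_response)
```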

Let's do a whole batch of images and evaluate our predictive accuracy.

In [None]:
import numpy as np

predictions = []
for array in np.array_split(test_set[0], 100):
    result = linear_predictor.predict(array)
    predictions += [r["predicted_label"] for r in result["predictions"]]

predictions = np.array(predictions)
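
The loop above relies on `np.array_split` to break the test set into 100 request-sized chunks. The sketch below shows the splitting behavior on a synthetic array, including the case where the rows do not divide evenly. The array names and sizes are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for test_set[0]: 1000 rows of 784 features.
fake_test_images = np.zeros((1000, 784), dtype="float32")

# 100 chunks of 10 rows each; each chunk would be sent as one request.
batches = np.array_split(fake_test_images, 100)

# Unlike np.split, array_split accepts sizes that do not divide evenly:
# the leading chunks simply get one extra row.
uneven_sizes = [len(chunk) for chunk in np.array_split(np.arange(10), 3)]
```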

In [None]:
import pandas as pd

pd.crosstab(
    np.where(test_set[1] == 0, 1, 0), predictions, rownames=["actuals"], colnames=["predictions"]
)

As we can see from the confusion matrix above, we predict 931 images of 0 correctly, while we predict 44 images as 0s that aren't, and miss predicting 49 images of 0.
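
The counts quoted above can be turned into precision, recall, and F1 for the "is a 0" class. This is a small sketch using the three numbers from the text. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the images predicted as 0, how many were 0
    recall = tp / (tp + fn)     # of the actual 0s, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the confusion matrix described above.
precision, recall, f1 = precision_recall_f1(tp=931, fp=44, fn=49)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```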

## Delete the Endpoint
If you're ready to be done with this notebook, please run the cell below. It deletes the model and the hosted endpoint you created, which avoids any charges from a stray instance being left on.

In [None]:
linear_predictor.delete_model()
linear_predictor.delete_endpoint()
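
If the first delete call fails, the second never runs and the endpoint keeps billing. A hedged sketch of a more defensive cleanup is shown below. It attempts both calls and reports any failures. `cleanup` is a hypothetical helper that only assumes the predictor exposes the two methods used in the cell above.

```python
def cleanup(predictor):
    """Try to delete the model and then the endpoint; always attempt both.

    Returns a list of (step_name, exception) for any step that failed.
    """
    errors = []
    for step in (predictor.delete_model, predictor.delete_endpoint):
        try:
            step()
        except Exception as exc:  # keep going so the other resource is still removed
            errors.append((step.__name__, exc))
    return errors
```

Calling `cleanup(linear_predictor)` and checking that the returned list is empty confirms that both resources were removed.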