# Introduction

This notebook outlines how to build a recommendation system using SageMaker's Factorization Machines (FM). The main goal is to show how to extend an FM model to predict the top "X" recommendations using SageMaker's KNN and Batch Transform.

There are four parts to this notebook:

1. Building an FM Model
2. Repackaging the FM Model to fit a KNN Model
3. Building a KNN Model
4. Running Batch Transform to predict the top "X" items


## Part 1 - Building an FM Model using the MovieLens dataset

Julien Simon has written a fantastic blog post that explains in detail how to build an FM model using SageMaker; see the link below for more information. This part mostly reuses his code, to provide continuity for the additional steps that follow.

Source - https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

In [1]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os

### Download movie rating data from MovieLens

In [2]:
#download data
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2019-10-29 19:07:42--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.3’


2019-10-29 19:07:45 (2.49 MB/s) - ‘ml-100k.zip.3’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating:

### Shuffle the data

In [3]:
!shuf ml-100k/ua.base -o ml-100k/ua.base.shuffled
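
Where `shuf` is not available, the same step can be sketched in Python. This is a minimal sketch and not part of the original workflow: the helper name `shuffle_lines` and its `seed` parameter are assumptions introduced here, with the seed making the shuffle reproducible.

```python
import random

def shuffle_lines(src_path, dst_path, seed=42):
    """Write the lines of src_path to dst_path in random order (hypothetical helper)."""
    with open(src_path) as src:
        lines = src.readlines()
    random.Random(seed).shuffle(lines)  # seeded so that reruns give the same order
    with open(dst_path, "w") as dst:
        dst.writelines(lines)
    return len(lines)

# Usage, mirroring the shuf command above:
# shuffle_lines('ml-100k/ua.base', 'ml-100k/ua.base.shuffled')
```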

### Load Training Data

In [4]:
user_movie_ratings_train = pd.read_csv('ml-100k/ua.base.shuffled', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

   user_id  movie_id  rating
0       68       926       1
1      887       548       1
2      857       988       2
3       77       134       4
4      894       877       3


### Load Test Data

In [5]:
user_movie_ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test.head(5)

   user_id  movie_id  rating
0        1        20       4
1        1        33       4
2        1        61       4
3        1       117       3
4        1       155       2


In [6]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_features=nb_users+nb_movies
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print(" # of users:  {}".format(nb_users))
print(" # of movies:  {}".format(nb_movies))
print(" Training Count:  {}".format(nb_ratings_train))
print(" Test Count:  {}".format(nb_ratings_test))
print(" Features (# of users + # of movies):  {}".format(nb_features))

 # of users:  943
 # of movies:  1682
 Training Count:  90570
 Test Count:  9430
 Features (# of users + # of movies):  2625


### FM Input

The input to FM is a one-hot encoded sparse matrix: each row has a 1 in the user's column and a 1 in the movie's column. Ratings of 4 and above are labeled as positive (1); ratings of 3 and below are labeled as negative (0).

In [7]:
def loadDataset(df, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    for index, row in df.iterrows():
            X[line,row['user_id']-1] = 1
            X[line, nb_users+(row['movie_id']-1)] = 1
            if int(row['rating']) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1

    Y=np.array(Y).astype('float32')            
    return X,Y


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)
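
The row layout produced by `loadDataset` can be illustrated on toy data. The sketch below assumes 2 users, 3 movies, and two made-up ratings (not taken from MovieLens), and uses a dense NumPy array for readability. The function name `one_hot_rows` is an assumption introduced for this illustration.

```python
import numpy as np

def one_hot_rows(ratings, n_users, n_movies):
    """ratings: list of (user_id, movie_id, rating) tuples with 1-based ids."""
    X = np.zeros((len(ratings), n_users + n_movies), dtype='float32')
    Y = np.zeros(len(ratings), dtype='float32')
    for line, (user_id, movie_id, rating) in enumerate(ratings):
        X[line, user_id - 1] = 1               # user block: columns 0 .. n_users-1
        X[line, n_users + movie_id - 1] = 1    # movie block: columns n_users onward
        Y[line] = 1 if rating >= 4 else 0      # same threshold as loadDataset
    return X, Y

toy_ratings = [(1, 3, 5), (2, 1, 2)]  # made-up ratings
X_toy, Y_toy = one_hot_rows(toy_ratings, n_users=2, n_movies=3)
print(X_toy)
print(Y_toy)
```

Each row contains exactly two ones: the first selects the user, the second selects the movie.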

In [8]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
# count_nonzero counts the positive labels (ones), not the zeros
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-one_labels, one_labels))

(90570, 2625)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones


### Convert to Protobuf format for saving to S3

In [9]:
%%time

role = get_execution_role()
print(role)
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'factorization-machine-sagemaker'

arn:aws:iam::349934754982:role/service-role/AmazonSageMaker-ExecutionRole-20190918T150782
CPU times: user 380 ms, sys: 20.3 ms, total: 401 ms
Wall time: 1.52 s


In [10]:
train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train3')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test3')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [11]:
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/train3/train.protobuf
s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/test3/test.protobuf
Output: s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/output


### Run training job

You can adjust the hyperparameters until you are happy with the predictions. For this dataset and hyperparameter configuration, after 100 epochs, test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.74 (1 indicates a perfect classifier). Not great, but you can fine-tune the model further.
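
For reference, the two metrics mentioned above can be computed from predictions as follows. This is a minimal sketch on made-up labels, not output from the training job, and the function name `accuracy_and_f1` is an assumption introduced here.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [1, 1, 1, 0, 0, 1]  # made-up ground truth
y_pred = [1, 0, 1, 1, 0, 1]  # made-up predictions
acc, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy=%.4f f1=%.4f" % (acc, f1))
```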

In [12]:
instance_type='ml.m5.large'
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type=instance_type,
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nb_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': train_data, 'test': test_data})

2019-10-29 19:08:46 Starting - Starting the training job...
2019-10-29 19:08:51 Starting - Launching requested ML instances......
2019-10-29 19:09:50 Starting - Preparing the instances for training...
2019-10-29 19:10:23 Downloading - Downloading input data...
2019-10-29 19:11:05 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
  from numpy.testing import nosetester
[10/29/2019 19:11:07 INFO 140085722548032] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear'

[2019-10-29 19:11:13.704] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 12, "duration": 897, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:11:13 INFO 140085722548032] #quality_metric: host=algo-1, epoch=5, train binary_classification_accuracy <score>=0.675417582418
[10/29/2019 19:11:13 INFO 140085722548032] #quality_metric: host=algo-1, epoch=5, train binary_classification_cross_entropy <loss>=0.640007660247
[10/29/2019 19:11:13 INFO 140085722548032] #quality_metric: host=algo-1, epoch=5, train binary_f_1.000 <score>=0.754392529581
#metrics {"Metrics": {"update.time": {"count": 1, "max": 899.4781970977783, "sum": 899.4781970977783, "min": 899.4781970977783}}, "EndTime": 1572376273.705473, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376272.805074}
[10/29/2019 19:11:13 INFO 140085722548032] #progress_metric: host=algo-

[2019-10-29 19:11:23.680] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 34, "duration": 854, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:11:23 INFO 140085722548032] #quality_metric: host=algo-1, epoch=16, train binary_classification_accuracy <score>=0.724186813187
[10/29/2019 19:11:23 INFO 140085722548032] #quality_metric: host=algo-1, epoch=16, train binary_classification_cross_entropy <loss>=0.586217581613
[10/29/2019 19:11:23 INFO 140085722548032] #quality_metric: host=algo-1, epoch=16, train binary_f_1.000 <score>=0.767651333512
#metrics {"Metrics": {"update.time": {"count": 1, "max": 856.6899299621582, "sum": 856.6899299621582, "min": 856.6899299621582}}, "EndTime": 1572376283.681138, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376282.823645}
[10/29/2019 19:11:23 INFO 140085722548032] #progress_metric: host=al

[2019-10-29 19:11:34.276] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 58, "duration": 866, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:11:34 INFO 140085722548032] #quality_metric: host=algo-1, epoch=28, train binary_classification_accuracy <score>=0.732912087912
[10/29/2019 19:11:34 INFO 140085722548032] #quality_metric: host=algo-1, epoch=28, train binary_classification_cross_entropy <loss>=0.560178659963
[10/29/2019 19:11:34 INFO 140085722548032] #quality_metric: host=algo-1, epoch=28, train binary_f_1.000 <score>=0.769478825817
#metrics {"Metrics": {"update.time": {"count": 1, "max": 868.3650493621826, "sum": 868.3650493621826, "min": 868.3650493621826}}, "EndTime": 1572376294.276935, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376293.407735}
[10/29/2019 19:11:34 INFO 140085722548032] #progress_metric: host=al

[2019-10-29 19:11:44.127] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 80, "duration": 869, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:11:44 INFO 140085722548032] #quality_metric: host=algo-1, epoch=39, train binary_classification_accuracy <score>=0.735956043956
[10/29/2019 19:11:44 INFO 140085722548032] #quality_metric: host=algo-1, epoch=39, train binary_classification_cross_entropy <loss>=0.546404515528
[10/29/2019 19:11:44 INFO 140085722548032] #quality_metric: host=algo-1, epoch=39, train binary_f_1.000 <score>=0.770168155644
#metrics {"Metrics": {"update.time": {"count": 1, "max": 872.089147567749, "sum": 872.089147567749, "min": 872.089147567749}}, "EndTime": 1572376304.127951, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376303.254999}
[10/29/2019 19:11:44 INFO 140085722548032] #progress_metric: host=algo-

[2019-10-29 19:11:54.710] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 104, "duration": 867, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:11:54 INFO 140085722548032] #quality_metric: host=algo-1, epoch=51, train binary_classification_accuracy <score>=0.738626373626
[10/29/2019 19:11:54 INFO 140085722548032] #quality_metric: host=algo-1, epoch=51, train binary_classification_cross_entropy <loss>=0.536034229991
[10/29/2019 19:11:54 INFO 140085722548032] #quality_metric: host=algo-1, epoch=51, train binary_f_1.000 <score>=0.771603337847
#metrics {"Metrics": {"update.time": {"count": 1, "max": 869.1310882568359, "sum": 869.1310882568359, "min": 869.1310882568359}}, "EndTime": 1572376314.711417, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376313.841333}
[10/29/2019 19:11:54 INFO 140085722548032] #progress_metric: host=a

[2019-10-29 19:12:04.466] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 126, "duration": 883, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:12:04 INFO 140085722548032] #quality_metric: host=algo-1, epoch=62, train binary_classification_accuracy <score>=0.745340659341
[10/29/2019 19:12:04 INFO 140085722548032] #quality_metric: host=algo-1, epoch=62, train binary_classification_cross_entropy <loss>=0.528975876693
[10/29/2019 19:12:04 INFO 140085722548032] #quality_metric: host=algo-1, epoch=62, train binary_f_1.000 <score>=0.776061999923
#metrics {"Metrics": {"update.time": {"count": 1, "max": 885.6830596923828, "sum": 885.6830596923828, "min": 885.6830596923828}}, "EndTime": 1572376324.466945, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376323.580427}
[10/29/2019 19:12:04 INFO 140085722548032] #progress_metric: host=a

[2019-10-29 19:12:15.088] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 150, "duration": 904, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:12:15 INFO 140085722548032] #quality_metric: host=algo-1, epoch=74, train binary_classification_accuracy <score>=0.747208791209
[10/29/2019 19:12:15 INFO 140085722548032] #quality_metric: host=algo-1, epoch=74, train binary_classification_cross_entropy <loss>=0.522961755774
[10/29/2019 19:12:15 INFO 140085722548032] #quality_metric: host=algo-1, epoch=74, train binary_f_1.000 <score>=0.777377772616
#metrics {"Metrics": {"update.time": {"count": 1, "max": 906.0549736022949, "sum": 906.0549736022949, "min": 906.0549736022949}}, "EndTime": 1572376335.089169, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376334.182304}
[10/29/2019 19:12:15 INFO 140085722548032] #progress_metric: host=a

[2019-10-29 19:12:24.846] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 172, "duration": 890, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:12:24 INFO 140085722548032] #quality_metric: host=algo-1, epoch=85, train binary_classification_accuracy <score>=0.748417582418
[10/29/2019 19:12:24 INFO 140085722548032] #quality_metric: host=algo-1, epoch=85, train binary_classification_cross_entropy <loss>=0.518437527499
[10/29/2019 19:12:24 INFO 140085722548032] #quality_metric: host=algo-1, epoch=85, train binary_f_1.000 <score>=0.778352212218
#metrics {"Metrics": {"update.time": {"count": 1, "max": 892.7009105682373, "sum": 892.7009105682373, "min": 892.7009105682373}}, "EndTime": 1572376344.847297, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1572376343.953654}
[10/29/2019 19:12:24 INFO 140085722548032] #progress_metric: host=a


2019-10-29 19:12:45 Uploading - Uploading generated training model
2019-10-29 19:12:45 Completed - Training job completed
[2019-10-29 19:12:35.457] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 196, "duration": 867, "num_examples": 91, "num_bytes": 5796480}
[10/29/2019 19:12:35 INFO 140085722548032] #quality_metric: host=algo-1, epoch=97, train binary_classification_accuracy <score>=0.750098901099
[10/29/2019 19:12:35 INFO 140085722548032] #quality_metric: host=algo-1, epoch=97, train binary_classification_cross_entropy <loss>=0.514150963081
[10/29/2019 19:12:35 INFO 140085722548032] #quality_metric: host=algo-1, epoch=97, train binary_f_1.000 <score>=0.779814293045
#metrics {"Metrics": {"update.time": {"count": 1, "max": 869.5211410522461, "sum": 869.5211410522461, "min": 869.5211410522461}}, "EndTime": 1572376355.458579, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorizatio

Training seconds: 142
Billable seconds: 142


## Part 2 - Repackaging Model data to fit a KNN Model

Now that the model has been created and stored in SageMaker, we can download it and repackage it to fit a KNN model. Note: if needed, install mxnet by uncommenting the first line of the cell below.

### Download model data

In [13]:
#!pip install mxnet
import mxnet as mx
model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print("Model Path:  {}".format(model_full_path))

#Download FM model 
os.system("aws s3 cp "+model_full_path+ " .")

#Extract model file for loading to MXNet
os.system("tar xzvf "+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

Model Path:  s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/output/factorization-machines-2019-10-29-19-08-46-681/output/model.tar.gz


0
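
The extraction steps can also be written with the Python standard library instead of shell calls. The sketch below assumes the archive layout implied by the commands above: a `model.tar.gz` holding a zip file named `model_algo-1`, which in turn contains `symbol.json` and `params`. The function name `extract_fm_model` and the `work_dir` parameter are assumptions introduced for illustration.

```python
import os
import shutil
import tarfile
import zipfile

def extract_fm_model(archive_path, work_dir="."):
    """Unpack the FM model archive and rename files to the names MXNet loads."""
    # Only use extractall on archives you trust, such as your own training output
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(work_dir)
    with zipfile.ZipFile(os.path.join(work_dir, "model_algo-1")) as zf:
        zf.extractall(work_dir)
    # Rename to the 'model' prefix and epoch 0 used when loading the module
    shutil.move(os.path.join(work_dir, "symbol.json"),
                os.path.join(work_dir, "model-symbol.json"))
    shutil.move(os.path.join(work_dir, "params"),
                os.path.join(work_dir, "model-0000.params"))
    return sorted(os.listdir(work_dir))

# Usage, after the archive has been downloaded next to the notebook:
# extract_fm_model("model.tar.gz")
```

Unlike `os.system`, which only returns an exit status, these calls raise an exception when a step fails, so a missing file is noticed immediately.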

### Extract model data to create item and user latent matrices

In [14]:
#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()
w = m._arg_params['w1_weight'].asnumpy()
b = m._arg_params['w0_weight'].asnumpy()

# item latent matrix - concat(V[i], w[i]).  
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1)

#user latent matrix - concat (V[u], 1) 
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)
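
Why the concatenation works: with one-hot inputs, the FM score for user u and item i is w0 + w_u + w_i + <v_u, v_i>. For a fixed user, w0 and w_u are the same for every item, so ranking items by the full score is equivalent to ranking them by w_i + <v_u, v_i>, which is the inner product of concat(v_u, 1) with concat(v_i, w_i). The sketch below checks this on random toy parameters (not the trained model); all names in it are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_items, n_factors = 4, 6, 3
V_toy = rng.randn(n_users + n_items, n_factors)  # factors: users first, then items
w_toy = rng.randn(n_users + n_items, 1)          # linear terms
w0_toy = 0.5                                     # bias

# Same construction as knn_item_matrix and knn_user_matrix above
item_matrix = np.concatenate((V_toy[n_users:], w_toy[n_users:]), axis=1)
user_matrix = np.concatenate((V_toy[:n_users], np.ones((n_users, 1))), axis=1)

u = 2
# Full FM score of user u against every item
full_scores = w0_toy + w_toy[u, 0] + w_toy[n_users:, 0] + V_toy[n_users:].dot(V_toy[u])
# Inner product of the augmented vectors, which is what the KNN index compares
knn_scores = item_matrix.dot(user_matrix[u])

# The two differ only by a per-user constant, so the item ranking is identical
print(np.allclose(full_scores - knn_scores, w0_toy + w_toy[u, 0]))
print(np.array_equal(np.argsort(-full_scores), np.argsort(-knn_scores)))
```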

## Part 3 - Building KNN Model

In this section, we upload the model input data to S3, train a KNN model, and save it. Saving the model makes it appear in the Models section of SageMaker. It also lets us run batch transform later on, or deploy the model as an endpoint for real-time inference.

This approach uses the default 'index_type' parameter for KNN ('faiss.Flat', as the training log below shows). It is exact but can be slow for large datasets. In such cases, you may want to use a different 'index_type' (such as 'faiss.IVFFlat' or 'faiss.IVFPQ'), which gives an approximate yet faster answer.
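
Conceptually, an exact inner-product search scores every item against the query vector and keeps the k best. The NumPy sketch below illustrates that idea on a made-up 4-item matrix. It is not how SageMaker implements the index, and the function name `top_k_inner_product` is an assumption introduced here.

```python
import numpy as np

def top_k_inner_product(user_vec, item_matrix, item_labels, k):
    """Return (labels, scores) of the k items with the largest inner product."""
    scores = item_matrix.dot(user_vec)         # one score per item: O(items * dims)
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered indices of the k largest
    idx = idx[np.argsort(-scores[idx])]        # order them best first
    return item_labels[idx], scores[idx]

toy_items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
toy_labels = np.arange(1, 5)                   # 1-based ids, like knn_train_label
top_labels, top_scores = top_k_inner_product(np.array([2.0, 1.0]), toy_items, toy_labels, k=2)
print(top_labels, top_scores)
```

Because every query touches every item, the cost grows linearly with the catalog size, which is why approximate index types become attractive for large datasets.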

In [29]:
import numpy as np
from scipy import sparse

print('KNN train features shape = ', knn_item_matrix.shape)
knn_prefix = 'knn'
knn_output_prefix  = 's3://{}/{}/output'.format(bucket, knn_prefix)
knn_train_data_path = writeDatasetToProtobuf(sparse.csr_matrix(knn_item_matrix),knn_train_label, bucket, knn_prefix, train_key)
print('uploaded KNN train data: {}'.format(knn_train_data_path))

nb_recommendations = 100

# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=knn_output_prefix,
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(feature_dim=knn_item_matrix.shape[1], k=nb_recommendations, index_metric="INNER_PRODUCT", predictor_type='classifier', sample_size=200000)
fit_input = {'train': knn_train_data_path}
knn.fit(fit_input)
knn_model_name =  knn.latest_training_job.job_name
print("created model:  {}".format(knn_model_name))

# save the model so that we can reference it in the next step during batch inference
sm = boto3.client(service_name='sagemaker')
primary_container = {
    'Image': knn.image_name,
    'ModelDataUrl': knn.model_data,
}

knn_model = sm.create_model(
        ModelName = knn.latest_training_job.job_name,
        ExecutionRoleArn = knn.role,
        PrimaryContainer = primary_container)
print("saved the model")

('KNN train features shape = ', (1682, 65))
uploaded KNN train data: s3://sagemaker-ap-southeast-1-349934754982/knn/train.protobuf
2019-10-29 19:20:56 Starting - Starting the training job...
2019-10-29 19:20:58 Starting - Launching requested ML instances......
2019-10-29 19:21:59 Starting - Preparing the instances for training...
2019-10-29 19:22:56 Downloading - Downloading input data
2019-10-29 19:22:56 Training - Downloading the training image...
2019-10-29 19:23:23 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
[10/29/2019 19:23:20 INFO 140230963902272] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'index_metric': u'L2', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'faiss_index_ivf_nlists': u'auto', u'epochs': u'1', u'index_type': u'faiss.Flat', u'_faiss_index_nprobe': u'5', u'_kvstore': u'dist_async', u'_num_kv_ser


2019-10-29 19:23:31 Completed - Training job completed
Training seconds: 62
Billable seconds: 62
created model:  knn-2019-10-29-19-20-55-926
saved the model


## Part 4 - Batch Transform

In this section, we use SageMaker's Batch Transform to predict the top "X" items for all users in a single batch job.

In [35]:
#upload inference data to S3
knn_batch_data_path = writeDatasetToProtobuf(sparse.csr_matrix(knn_user_matrix),np.array(range(len(knn_user_matrix))), bucket, knn_prefix, train_key)
print("Batch inference data path:  {}".format(knn_batch_data_path))

# Initialize the transformer object
transformer =sagemaker.transformer.Transformer(
    base_transform_job_name="knn",
    model_name=knn_model_name,
    instance_count=1,
    instance_type=instance_type,
    output_path=knn_output_prefix,
    accept="application/jsonlines; verbose=true"
)

# Start a transform job:
transformer.transform(knn_batch_data_path, content_type='application/x-recordio-protobuf')
transformer.wait()


#Download predictions 
results_file_name = "inference_output"
inference_output_file = "knn/output/train.protobuf.out"
s3_client = boto3.client('s3')
s3_client.download_file(bucket, inference_output_file, results_file_name)
with open(results_file_name) as f:
    results = f.readlines()  

Batch inference data path:  s3://sagemaker-ap-southeast-1-349934754982/knn/train.protobuf
...................
Docker entrypoint called with argument(s): serve
[10/29/2019 19:34:09 INFO 139698930906944] loaded entry point class algorithm.serve.server_config:config_api
[10/29/2019 19:34:09 INFO 139698930906944] loading entry points
[10/29/2019 19:34:09 INFO 139698930906944] loaded request iterator text/csv
[10/29/2019 19:34:09 INFO 139698930906944] loaded request iterator application/x-recordio-protobuf
[10/29/2019 19:34:09 INFO 139698930906944] loaded request iterator application/json
[10/29/2019 19:34:09 INFO 139698930906944] loaded request iterator application/jsonlines
[10/29/2019 19:34:09 INFO 139698930906944] loaded response encoder applicatio

In [36]:
import json
test_user_idx = 89
u_one_json = json.loads(results[test_user_idx])

print("Recommended movie Ids for user #{} : {}".format(test_user_idx+1, [int(movie_id) for movie_id in u_one_json['labels']]))
print("")
print("Movie distances for user #{} : {}".format(test_user_idx+1,  [round(distance, 4) for distance in u_one_json['distances']]))

Recommended movie Ids for user #90 : [656, 48, 268, 923, 87, 193, 69, 208, 192, 165, 966, 505, 23, 482, 509, 527, 705, 166, 89, 269, 173, 176, 246, 216, 180, 83, 96, 251, 124, 659, 1, 183, 493, 204, 211, 520, 168, 196, 9, 194, 132, 190, 489, 210, 56, 185, 519, 197, 435, 316, 496, 510, 484, 134, 170, 641, 302, 181, 478, 963, 136, 654, 223, 275, 285, 187, 523, 100, 1039, 313, 1142, 22, 651, 408, 498, 272, 515, 480, 79, 178, 114, 191, 603, 199, 169, 172, 513, 511, 427, 357, 127, 174, 12, 657, 318, 98, 50, 479, 483, 64]

Movie distances for user #90 : [2.511, 2.5112, 2.5234, 2.526, 2.5325, 2.5399, 2.542, 2.5593, 2.5602, 2.565, 2.6086, 2.6127, 2.617, 2.6279, 2.6401, 2.6407, 2.6483, 2.6679, 2.6946, 2.695, 2.7188, 2.7288, 2.7357, 2.7464, 2.7562, 2.7576, 2.7618, 2.7695, 2.7756, 2.7828, 2.7906, 2.7964, 2.8044, 2.8585, 2.8638, 2.8698, 2.8756, 2.8763, 2.8771, 2.9154, 2.9277, 2.9556, 2.9609, 2.9666, 2.9795, 2.9872, 2.9979, 3.0255, 3.0293, 3.0353, 3.0464, 3.0506, 3.071, 3.1396, 3.1702, 3.1895, 3.23
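
To work with the results programmatically, each output line can be turned into (movie id, distance) pairs. This is a minimal sketch, assuming each line is a JSON object with parallel `labels` and `distances` lists, as read in the cell above. The sample line is made up, and the function name `parse_recommendations` and its `limit` parameter are assumptions introduced here.

```python
import json

def parse_recommendations(line, limit=None):
    """Return (movie_id, distance) pairs in the order the transform job returned them."""
    record = json.loads(line)
    pairs = [(int(label), float(distance))
             for label, distance in zip(record['labels'], record['distances'])]
    return pairs if limit is None else pairs[:limit]

# Made-up line with the same shape as one entry of `results`
sample_line = '{"labels": [10.0, 20.0, 30.0], "distances": [0.5, 0.7, 0.9]}'
print(parse_recommendations(sample_line, limit=2))
```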