# Introduction

This notebook outlines how to build a recommendation system using SageMaker's Factorization Machines (FM). After building an FM model, we will extend that model to predict top "X" recommendations using SageMaker's KNN and Batch Transform.

The notebook has four parts:

1. Building an FM Model (Build, Train and Deploy) and Realtime Inference
2. Repackaging FM Model to fit a KNN Model
3. Building a KNN model
4. Running Batch Transform for predicting top "X" items

Source: https://aws.amazon.com/blogs/machine-learning/extending-amazon-sagemaker-factorization-machines-algorithm-to-predict-top-x-recommendations/


## Part 1 - Building an FM Model using the MovieLens dataset

Refer to the blog post linked below for a detailed explanation of how to build an FM model using SageMaker.

Source - https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

In [1]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os

### Download movie rating data from MovieLens

In [2]:
#download data
!wget -O ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2020-04-23 20:58:34--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2020-04-23 20:58:35 (12.6 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-

### Shuffle the data

In [3]:
!shuf ml-100k/ua.base -o ml-100k/ua.base.shuffled
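
Where the `shuf` utility is not available, the same shuffle can be done in pandas. The following is a minimal sketch on a toy frame; the column values and the `random_state` seed are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy ratings frame (hypothetical values, not taken from MovieLens)
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating": [5, 3, 4, 2, 4],
})

# sample(frac=1) returns every row exactly once, in random order;
# reset_index(drop=True) discards the old row order from the index
shuffled = ratings.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

Fixing `random_state` makes the shuffle reproducible between runs, which `shuf` does not do by default.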

### Load Training Data

In [4]:
user_movie_ratings_train = pd.read_csv('ml-100k/ua.base.shuffled', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

|   | user_id | movie_id | rating |
|---|---------|----------|--------|
| 0 | 59 | 606 | 4 |
| 1 | 749 | 85 | 4 |
| 2 | 393 | 67 | 3 |
| 3 | 303 | 201 | 5 |
| 4 | 625 | 23 | 4 |


### Load Test Data

In [5]:
user_movie_ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test.head(5)

|   | user_id | movie_id | rating |
|---|---------|----------|--------|
| 0 | 1 | 20 | 4 |
| 1 | 1 | 33 | 4 |
| 2 | 1 | 61 | 4 |
| 3 | 1 | 117 | 3 |
| 4 | 1 | 155 | 2 |


In [6]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_features=nb_users+nb_movies
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print (" # of users: ", nb_users)
print (" # of movies: ", nb_movies)
print (" Training Count: ", nb_ratings_train)
print (" Test Count: ", nb_ratings_test)
print (" Features (# of users + # of movies): ", nb_features)

 # of users:  943
 # of movies:  1682
 Training Count:  90570
 Test Count:  9430
 Features (# of users + # of movies):  2625


### FM Input

Input to FM is a one-hot encoded sparse matrix. We are going to build a binary recommender (like/don't like a movie): 4-star and 5-star ratings are set to 1, and lower ratings are set to 0.
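
To make the encoding concrete, here is a minimal sketch on toy data. Each rating becomes one row with exactly two non-zero entries: one in the user block of columns and one in the movie block. The function name `one_hot_encode`, the `like_threshold` parameter, and the toy ratings are illustrative assumptions; the notebook's own `loadDataset` function follows the same idea.

```python
import numpy as np
from scipy.sparse import lil_matrix

def one_hot_encode(ratings, n_users, n_movies, like_threshold=4):
    """Encode (user_id, movie_id, rating) triples as FM input.

    IDs are assumed to be 1-based, as in MovieLens.
    """
    X = lil_matrix((len(ratings), n_users + n_movies), dtype=np.float32)
    y = np.zeros(len(ratings), dtype=np.float32)
    for row, (user_id, movie_id, rating) in enumerate(ratings):
        X[row, user_id - 1] = 1.0              # user block: columns 0..n_users-1
        X[row, n_users + movie_id - 1] = 1.0   # movie block: columns n_users onward
        y[row] = 1.0 if rating >= like_threshold else 0.0
    return X, y

# Toy data: 3 users, 4 movies (hypothetical values)
toy_ratings = [(1, 2, 5), (3, 4, 2), (2, 1, 4)]
X_toy, y_toy = one_hot_encode(toy_ratings, n_users=3, n_movies=4)
print(X_toy.toarray())
print(y_toy)
```

With 943 users and 1682 movies, the same layout yields the 2625 feature columns reported above.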

In [7]:
def loadDataset(df, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    for index, row in df.iterrows():
            X[line,row['user_id']-1] = 1
            X[line, nb_users+(row['movie_id']-1)] = 1
            if int(row['rating']) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1

    Y=np.array(Y).astype('float32')            
    return X,Y


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)

In [8]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
# count_nonzero counts the labels equal to 1 (liked movies)
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-one_labels, one_labels))

(90570, 2625)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones


### Convert to Protobuf format for saving to S3
The FM implementation in Amazon SageMaker requires training and test data to be stored as float32 tensors in protobuf format.

In [9]:
#Change this value to your own bucket name
bucket = 'pv-sfdc-kf-sagemaker-workshop'
prefix = 'fm'

if bucket.strip() == '':
    raise RuntimeError("bucket name is empty.")

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [10]:
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", Y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", Y_test)    
  
print ("Training data S3 path: ",fm_train_data_path)
print ("Test data S3 path: ",fm_test_data_path)
print ("FM model output S3 path: {}".format(output_prefix))

Training data S3 path:  s3://pv-sfdc-kf-sagemaker-workshop/fm/train/train.protobuf
Test data S3 path:  s3://pv-sfdc-kf-sagemaker-workshop/fm/test/test.protobuf
FM model output S3 path: s3://pv-sfdc-kf-sagemaker-workshop/fm/output


### Run training job

You can experiment with the hyperparameters until you are satisfied with the predictions. With this dataset and hyperparameter configuration, after 100 epochs the test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.74 (1 indicates a perfect classifier). That is not great, but you can tune the model further.
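
For reference, accuracy and F1 can be computed from true and predicted labels as in the sketch below. The function name `accuracy_and_f1` and the toy labels are illustrative assumptions; the training job reports its own metrics in the log.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels given as 0/1 sequences."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)     # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)    # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)    # predicted 0, actually 1
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, float(f1)

# Toy labels (hypothetical)
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true_toy, y_pred_toy)
print("accuracy = %.3f, F1 = %.3f" % (acc, f1))
```

F1 is the harmonic mean of precision and recall, so it penalizes a classifier that does well on only one of the two.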

In [11]:
instance_type='ml.m5.2xlarge'
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type=instance_type,
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nb_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

2020-04-23 21:02:33 Starting - Starting the training job...
2020-04-23 21:02:36 Starting - Launching requested ML instances......
2020-04-23 21:03:37 Starting - Preparing the instances for training...
2020-04-23 21:04:24 Downloading - Downloading input data...
2020-04-23 21:04:35 Training - Downloading the training image...
2020-04-23 21:05:17 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from numpy.testing import nosetester
[04/23/2020 21:05:18 INFO 139963863643968] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'li

[2020-04-23 21:05:28.639] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 46, "duration": 437, "num_examples": 91, "num_bytes": 5796480}
[04/23/2020 21:05:28 INFO 139963863643968] #quality_metric: host=algo-1, epoch=22, train binary_classification_accuracy <score>=0.73032967033
[04/23/2020 21:05:28 INFO 139963863643968] #quality_metric: host=algo-1, epoch=22, train binary_classification_cross_entropy <loss>=0.570853889214
[04/23/2020 21:05:28 INFO 139963863643968] #quality_metric: host=algo-1, epoch=22, train binary_f_1.000 <score>=0.769347895558
#metrics {"Metrics": {"update.time": {"count": 1, "max": 438.94314765930176, "sum": 438.94314765930176, "min": 438.94314765930176}}, "EndTime": 1587675928.639937, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1587675928.200399}
[04/23/2020 21:05:28 INFO 139963863643968] #progress_metric: host=

[2020-04-23 21:05:38.968] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 92, "duration": 441, "num_examples": 91, "num_bytes": 5796480}
[04/23/2020 21:05:38 INFO 139963863643968] #quality_metric: host=algo-1, epoch=45, train binary_classification_accuracy <score>=0.736725274725
[04/23/2020 21:05:38 INFO 139963863643968] #quality_metric: host=algo-1, epoch=45, train binary_classification_cross_entropy <loss>=0.541353313069
[04/23/2020 21:05:38 INFO 139963863643968] #quality_metric: host=algo-1, epoch=45, train binary_f_1.000 <score>=0.770138542426
#metrics {"Metrics": {"update.time": {"count": 1, "max": 443.41588020324707, "sum": 443.41588020324707, "min": 443.41588020324707}}, "EndTime": 1587675938.968807, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1587675938.524725}
[04/23/2020 21:05:38 INFO 139963863643968] #progress_metric: host

[2020-04-23 21:05:48.820] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 136, "duration": 432, "num_examples": 91, "num_bytes": 5796480}
[04/23/2020 21:05:48 INFO 139963863643968] #quality_metric: host=algo-1, epoch=67, train binary_classification_accuracy <score>=0.745549450549
[04/23/2020 21:05:48 INFO 139963863643968] #quality_metric: host=algo-1, epoch=67, train binary_classification_cross_entropy <loss>=0.527212917789
[04/23/2020 21:05:48 INFO 139963863643968] #quality_metric: host=algo-1, epoch=67, train binary_f_1.000 <score>=0.775927305805
#metrics {"Metrics": {"update.time": {"count": 1, "max": 434.15284156799316, "sum": 434.15284156799316, "min": 434.15284156799316}}, "EndTime": 1587675948.820518, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1587675948.385762}
[04/23/2020 21:05:48 INFO 139963863643968] #progress_metric: hos

[2020-04-23 21:05:58.665] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 180, "duration": 443, "num_examples": 91, "num_bytes": 5796480}
[04/23/2020 21:05:58 INFO 139963863643968] #quality_metric: host=algo-1, epoch=89, train binary_classification_accuracy <score>=0.74832967033
[04/23/2020 21:05:58 INFO 139963863643968] #quality_metric: host=algo-1, epoch=89, train binary_classification_cross_entropy <loss>=0.518223897745
[04/23/2020 21:05:58 INFO 139963863643968] #quality_metric: host=algo-1, epoch=89, train binary_f_1.000 <score>=0.778175971485
#metrics {"Metrics": {"update.time": {"count": 1, "max": 444.78797912597656, "sum": 444.78797912597656, "min": 444.78797912597656}}, "EndTime": 1587675958.665844, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1587675958.220416}
[04/23/2020 21:05:58 INFO 139963863643968] #progress_metric: host


2020-04-23 21:06:14 Uploading - Uploading generated training model
2020-04-23 21:06:14 Completed - Training job completed
Training seconds: 110
Billable seconds: 110


### Deploy model


In [12]:
# wait=False returns immediately; the endpoint must reach InService before predict() is called
fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1, wait=False)

In [None]:
import json
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer


In [None]:
result = fm_predictor.predict(X_test[1000:1010].toarray())
print(result)
print (Y_test[1000:1010])

### Alternative: invoke with boto3

In [13]:
runtime = boto3.client('sagemaker-runtime')

In [14]:
import json
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)

# Hard-coded endpoint name; replace it with the name of your own deployed endpoint
predictor_endpoint = "fm-demo-endpoint"
response = runtime.invoke_endpoint(
    EndpointName=predictor_endpoint,
    ContentType='application/json',
    Body=fm_serializer(X_test[1000:1010].toarray())
    )

print(response['Body'].read())

{"predictions": [{"score": 0.6875320076942444, "predicted_label": 1.0}, {"score": 0.20183657109737396, "predicted_label": 0.0}, {"score": 0.23944973945617676, "predicted_label": 0.0}, {"score": 0.6187267303466797, "predicted_label": 1.0}, {"score": 0.5537686944007874, "predicted_label": 1.0}, {"score": 0.15983177721500397, "predicted_label": 0.0}, {"score": 0.4090516269207001, "predicted_label": 0.0}, {"score": 0.5249448418617249, "predicted_label": 1.0}, {"score": 0.3604322671890259, "predicted_label": 0.0}, {"score": 0.125717431306839, "predicted_label": 0.0}]}
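
The response body is JSON with one entry per input row, each carrying a `score` and a `predicted_label`. The sketch below shows one way to pull these out; the helper name `parse_fm_response` is an illustrative assumption, and the sample body reuses the first three predictions from the output above.

```python
import json

# First three predictions copied from the response shown above
raw_body = (
    b'{"predictions": ['
    b'{"score": 0.6875320076942444, "predicted_label": 1.0}, '
    b'{"score": 0.20183657109737396, "predicted_label": 0.0}, '
    b'{"score": 0.23944973945617676, "predicted_label": 0.0}]}'
)

def parse_fm_response(body):
    """Split an FM JSON response into parallel lists of scores and labels."""
    payload = json.loads(body)
    scores = [p["score"] for p in payload["predictions"]]
    labels = [int(p["predicted_label"]) for p in payload["predictions"]]
    return scores, labels

scores, labels = parse_fm_response(raw_body)
for score, label in zip(scores, labels):
    print("score=%.4f label=%d" % (score, label))
```

The extracted labels can then be compared against the matching slice of `Y_test` to spot-check the endpoint.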


## Part 2 - Repackaging Model data to fit a KNN Model

Now that the model is created and stored in SageMaker, we can download it and repackage it to fit a KNN model.

### Download model data

In [None]:
#!pip install mxnet
import mxnet as mx
model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print("Model Path: ", model_full_path)

#Download FM model 
os.system("aws s3 cp "+model_full_path+ " .")

#Extract model file for loading to MXNet
os.system("tar xzvf "+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

### Extract model data to create item and user latent matrices

In [None]:
#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()
w = m._arg_params['w1_weight'].asnumpy()
b = m._arg_params['w0_weight'].asnumpy()

# item latent matrix - concat(V[i], w[i]).  
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1)

#user latent matrix - concat (V[u], 1) 
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)
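
The reason for this layout can be checked numerically. Assuming the standard second-order FM form, the score for a one-hot (user u, item i) pair is w0 + w_u + w_i + <v_u, v_i>. For a fixed user, w0 + w_u is constant, so ranking items by the inner product of concat(V[u], 1) and concat(V[i], w[i]) gives the same order. The sketch below uses random toy parameters (all sizes and values are hypothetical) rather than the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3                 # toy sizes (hypothetical)
V = rng.normal(size=(n_users + n_items, k))   # factor matrix, users first
w = rng.normal(size=(n_users + n_items, 1))   # linear weights, users first
w0 = 0.1                                      # global bias

# Same construction as the notebook cell above
item_matrix = np.concatenate((V[n_users:], w[n_users:]), axis=1)
user_matrix = np.concatenate((V[:n_users], np.ones((n_users, 1))), axis=1)

u = 2  # pick one user (0-based row)
inner = item_matrix @ user_matrix[u]                        # <v_u, v_i> + w_i
full = w0 + w[u, 0] + w[n_users:, 0] + V[n_users:] @ V[u]   # full FM score

ranking_inner = np.argsort(-inner)
ranking_full = np.argsort(-full)
print(ranking_inner, ranking_full)
```

For a binary classifier the score is additionally passed through a sigmoid, which is monotonic and therefore does not change the ranking either.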

## Part 3 - Building KNN Model

In this section, we upload the model input data to S3, create a KNN model, and save it. Saving the model makes it appear in the Models section of SageMaker. It also lets us call batch transform later, or deploy the model as an endpoint for real-time inference.

This approach uses the default 'index_type' parameter for KNN. It is exact but can be slow for large datasets. In such cases, you may want to use a different 'index_type' value, which gives an approximate but faster answer.
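
Conceptually, an exact search with the inner-product metric scores every item against the query and keeps the k largest. The sketch below is a brute-force NumPy illustration of that idea under toy data; it is not SageMaker's implementation, and the function name `top_k_inner_product` and all values are assumptions.

```python
import numpy as np

def top_k_inner_product(user_vec, item_matrix, item_labels, k):
    """Return the k item labels with the largest inner product, best first."""
    scores = item_matrix @ user_vec
    k = min(k, len(scores))
    # argpartition finds the k best in linear time; only those k are then sorted
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return item_labels[idx], scores[idx]

# Toy item vectors and 1-based labels (hypothetical)
items = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
labels = np.arange(1, 5)
top_labels, top_scores = top_k_inner_product(np.array([1.0, 0.2]), items, labels, k=2)
print(top_labels, top_scores)
```

The cost grows with the number of items times the vector dimension per query, which is why approximate indexes become attractive for large catalogs.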

In [None]:
print('KNN train features shape = ', knn_item_matrix.shape)
knn_prefix = 'knn'
knn_output_prefix  = 's3://{}/{}/output'.format(bucket, knn_prefix)
knn_train_data_path = writeDatasetToProtobuf(knn_item_matrix, bucket, knn_prefix, train_key, "dense", knn_train_label)
print('uploaded KNN train data: {}'.format(knn_train_data_path))

nb_recommendations = 100

# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=knn_output_prefix,
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(feature_dim=knn_item_matrix.shape[1], k=nb_recommendations, index_metric="INNER_PRODUCT", predictor_type='classifier', sample_size=200000)
fit_input = {'train': knn_train_data_path}
knn.fit(fit_input)
knn_model_name =  knn.latest_training_job.job_name
print("created model: ", knn_model_name)

# save the model so that we can reference it in the next step during batch inference
sm = boto3.client(service_name='sagemaker')
primary_container = {
    'Image': knn.image_name,
    'ModelDataUrl': knn.model_data,
}

knn_model = sm.create_model(
        ModelName = knn.latest_training_job.job_name,
        ExecutionRoleArn = knn.role,
        PrimaryContainer = primary_container)
print("saved the model")

## Part 4 - Batch Transform

In this section, we use SageMaker's Batch Transform to predict the top "X" items for all users in one batch.
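
With the verbose JSON Lines output requested below, each output line corresponds to one user and is read in this notebook through its `labels` and `distances` fields. The sketch below parses such lines into per-user recommendation lists; the helper name `parse_knn_output` and the sample lines are hypothetical, not real job output.

```python
import json

def parse_knn_output(lines):
    """Map 1-based user id -> list of (movie_id, distance) pairs.

    Assumes line n of the output belongs to user n + 1, matching the
    row order of the user matrix sent to batch transform.
    """
    recommendations = {}
    for user_idx, line in enumerate(lines):
        record = json.loads(line)
        movie_ids = [int(label) for label in record["labels"]]
        recommendations[user_idx + 1] = list(zip(movie_ids, record["distances"]))
    return recommendations

# Hypothetical output lines, shaped like the fields this notebook reads
sample_lines = [
    '{"labels": [50.0, 181.0, 100.0], "distances": [0.91, 0.85, 0.80]}\n',
    '{"labels": [258.0, 50.0, 286.0], "distances": [0.77, 0.75, 0.70]}\n',
]
recs = parse_knn_output(sample_lines)
print(recs[1])
```

Keeping the labels paired with their distances makes it easy to filter or re-rank the list later.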

In [None]:
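#upload inference data to S3
# Note: train_key is reused here, so the object is written to knn/train.protobuf,
# which overwrites the KNN training file uploaded in Part 3
knn_batch_data_path = writeDatasetToProtobuf(knn_user_matrix, bucket, knn_prefix, train_key, "dense")
print("Batch inference data path: ", knn_batch_data_path)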

# Initialize the transformer object
transformer =sagemaker.transformer.Transformer(
    base_transform_job_name="knn",
    model_name=knn_model_name,
    instance_count=1,
    instance_type=instance_type,
    output_path=knn_output_prefix,
    accept="application/jsonlines; verbose=true"
)

# Start a transform job:
transformer.transform(knn_batch_data_path, content_type='application/x-recordio-protobuf')
transformer.wait()


#Download predictions 
results_file_name = "inference_output"
inference_output_file = "knn/output/train.protobuf.out"
s3_client = boto3.client('s3')
s3_client.download_file(bucket, inference_output_file, results_file_name)
with open(results_file_name) as f:
    results = f.readlines()  

In [None]:
import json
test_user_idx = 9
u_one_json = json.loads(results[test_user_idx])

print("Recommended movie Ids for user #{} : {}".format(test_user_idx+1, [int(movie_id) for movie_id in u_one_json['labels']]))
print()
print("Movie distances for user #{} : {}".format(test_user_idx+1,  [round(distance, 4) for distance in u_one_json['distances']]))