# Introduction

This notebook outlines how to build a recommendation system using a combination of SageMaker's built-in [Factorization Machines (FM)](https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html) and [k-Nearest Neighbors (k-NN)](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html) algorithms. The main goal is to show how to extend an FM model to predict the top "X" recommendations using SageMaker's k-NN algorithm. It is based on this blog post:

https://aws.amazon.com/blogs/machine-learning/extending-amazon-sagemaker-factorization-machines-algorithm-to-predict-top-x-recommendations/

This notebook has four main parts and one optional part:

1. Training a Factorization Machines model using the MovieLens dataset
2. Repackaging FM Model to fit a [k-nearest-neighbors](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html) Model (KNN)
3. Fitting the KNN model
4. Deploy a realtime inference endpoint
5. Optional - Batch Transform for predicting top "X" items


## Part 1 - Training a Factorization Machines model using the MovieLens dataset

Julien Simon has written a fantastic blog post about how to build an FM model using SageMaker, with a detailed explanation. Please see the link below for more information. In this part, I mostly reuse his code to provide continuity for the additional steps.

Source - https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

In [None]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os
import json

In [None]:
#Change this value to your own bucket name if you want to
sess = sagemaker.Session()
bucket = sess.default_bucket()
print("Using following s3 bucket: {}".format(bucket))

### Download movie rating data from MovieLens

In [None]:
#download data
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

### Shuffle the data

In [None]:
!shuf ml-100k/ua.base -o ml-100k/ua.base.shuffled

### Load Training Data

First we will load the data into a Pandas dataframe and have a look at the first rows. 

In [None]:
user_movie_ratings_train = pd.read_csv('ml-100k/ua.base.shuffled', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

Then load the test data.

In [None]:
user_movie_ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test.head(5)

Next, let's look at some basic statistics of the data.

In [None]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_features=nb_users+nb_movies
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print( " # of users: {}".format( nb_users))
print (" # of movies: {}".format(nb_movies))
print( " Training Count: {}".format(nb_ratings_train))
print (" Test Count:{} ".format(nb_ratings_test))
print (" Features (# of users + # of movies): {}".format(nb_features))


### Prepare the data.

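We will convert the data into a one-hot encoded sparse matrix, with one row per rating and one column per user and per movie. The ratings are turned into binary labels: a rating of 4 or above is labeled 1 (the user liked the movie), and a rating of 3 or below is labeled 0.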
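
To make the layout concrete, here is a minimal sketch with made-up sizes (3 users, 4 movies) and invented ratings. It uses a dense NumPy array purely for readability, whereas the notebook itself uses a sparse `lil_matrix`. The names `toy_nb_users`, `toy_nb_movies`, `toy_ratings`, `X_toy` and `Y_toy` are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np

# Toy sizes and ratings, invented for illustration only
toy_nb_users, toy_nb_movies = 3, 4
toy_ratings = [(1, 2, 5), (3, 4, 2), (2, 1, 4)]  # (user_id, movie_id, rating)

X_toy = np.zeros((len(toy_ratings), toy_nb_users + toy_nb_movies), dtype="float32")
Y_toy = np.zeros(len(toy_ratings), dtype="float32")

for line, (user_id, movie_id, rating) in enumerate(toy_ratings):
    # The first toy_nb_users columns identify the user (ids start at 1)
    X_toy[line, user_id - 1] = 1
    # The remaining columns identify the movie
    X_toy[line, toy_nb_users + movie_id - 1] = 1
    # Binary label: 1 for a rating of 4 or above, 0 otherwise
    Y_toy[line] = 1 if rating >= 4 else 0

print(X_toy)
print(Y_toy)
```

Each row contains exactly two ones, one in the user block and one in the movie block, which is why a sparse matrix is a good fit for the real dataset.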

In [None]:
def loadDataset(df, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    for index, row in df.iterrows():
            X[line,row['user_id']-1] = 1
            X[line, nb_users+(row['movie_id']-1)] = 1
            if int(row['rating']) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1

    Y=np.array(Y).astype('float32')            
    return X,Y


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)

In [None]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
# count_nonzero returns the number of ones, so zeros = total - ones
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
# count_nonzero returns the number of ones, so zeros = total - ones
one_labels_test = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-one_labels_test, one_labels_test))

### Convert to Protobuf format for saving to S3

Next, we’re going to write the training set and the test set to two protobuf files stored in Amazon S3. Fortunately, we can rely on the write_spmatrix_to_sparse_tensor() utility function. It writes our samples and labels into an in-memory protobuf-encoded sparse multi-dimensional array (AKA tensor).

Then we commit the buffer to Amazon S3. After this step is complete, we’re done with data preparation, and we can now focus on our training job.

In [None]:
prefix = 'fm'

if bucket.strip() == '':
    raise RuntimeError("bucket name is empty.")

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [None]:
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", Y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", Y_test)    
  
print ("Training data S3 path: {}".format(fm_train_data_path))
print ("Test data S3 path: {}".format(fm_test_data_path))
print ("FM model output S3 path: {}".format(output_prefix))

### Run training job

Let’s start by creating an Estimator based on the Factorization Machines container available in our AWS Region. Then we set some FM-specific hyperparameters. The hyperparameters of each built-in algorithm are described in its documentation; see [FM hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html).

You can play around with the hyperparameters until you are happy with the predictions. For this dataset and hyperparameter configuration, after 100 epochs, test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.74 (1 indicates a perfect classifier). Not great, but you can fine-tune the model further.
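
For readers unfamiliar with these two metrics, the following sketch shows how accuracy and the F1 score are derived from binary predictions. The labels and predictions are invented for illustration and are not output from the training job; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions, invented for illustration only
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = float(np.mean(y_pred == y_true))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}".format(
    accuracy, precision, recall, f1))
```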

In [None]:
instance_type='ml.m5.large'
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type=instance_type,
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nb_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

## Part 2 - Repackaging Model data to fit a K-Nearest-Neighbor Model

The model which we have just trained with the Factorization Machines algorithm allows you to predict a score for a (user, item) pair. The score represents how well the pair matches. However, for our use case we want to provide a user as input and receive a list of the top x items that best match the user’s preferences. When the number of items is moderate, you can do this by querying the model with (user, item) for all possible items. However, this approach doesn’t scale well when the number of items is large. In this scenario, you can use the Amazon SageMaker k-nearest neighbors (k-NN) algorithm to speed up top x prediction tasks.

What we will do in this step is extract the latent item and user representations (embeddings) from the trained Factorization Machines model. The assumption is that movies which are close to a user in this latent space are more favourable to that user. We will use these embeddings to fit a k-NN model, which lets us query for the movies with the "closest distance".

To do this we will first download the trained model, extract the user and item representations and use these to fit a KNN model. 
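
The sketch below illustrates why this repackaging preserves the ranking. For a one-hot (user, item) input, the FM score before the final sigmoid is `w0 + w[u] + w[i] + <V[u], V[i]>`. For a fixed user, `w0 + w[u]` is constant, so ranking items by the inner product of `concat(V[u], 1)` and `concat(V[i], w[i])` gives the same order. The parameters here are random stand-ins with made-up sizes, not the trained model, and all `toy_*` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_users, toy_items, toy_factors = 3, 5, 4  # made-up sizes

# Random stand-ins for the trained FM parameters (bias, linear terms, factors)
w0 = rng.normal()
w = rng.normal(size=(toy_users + toy_items, 1))
V = rng.normal(size=(toy_users + toy_items, toy_factors))

# Same construction as used for the k-NN item and user matrices
item_matrix = np.concatenate((V[toy_users:], w[toy_users:]), axis=1)
user_matrix = np.concatenate((V[:toy_users], np.ones((toy_users, 1))), axis=1)

u = 1  # zero-based index of an arbitrary user
# FM score of user u against every item (before the sigmoid)
fm_scores = w0 + w[u, 0] + w[toy_users:, 0] + V[toy_users:] @ V[u]
# Inner product between the user vector and every item vector
knn_scores = item_matrix @ user_matrix[u]

print("FM ranking:           ", np.argsort(-fm_scores))
print("Inner-product ranking:", np.argsort(-knn_scores))
```

Both rankings are identical because the two score vectors differ only by a per-user constant.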

### Download model data

In [None]:
import mxnet as mx
import pickle

model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print ("Model Path: {}".format( model_full_path))

#Download FM model 
os.system("aws s3 cp "+model_full_path+ " .")

#Extract model file for loading to MXNet
os.system("tar xzvf "+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

### Extract model data to create item and user latent matrices

In [None]:
#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()
w = m._arg_params['w1_weight'].asnumpy()
b = m._arg_params['w0_weight'].asnumpy()

# item latent matrix - concat(V[i], w[i]).  
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1)

#user latent matrix - concat (V[u], 1) 
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)

Save the user matrix for later, as it will be needed for inference.

In [None]:
with open('./user_embeddings.pickle', 'wb') as handle:
    pickle.dump(knn_user_matrix, handle)

user_matrix_upload_path = fm.output_path +"/"+ fm.latest_training_job.job_name + "/output/user_embeddings.pickle"
os.system("aws s3 cp user_embeddings.pickle "+user_matrix_upload_path)
user_matrix_upload_path

## Part 3 - Building KNN Model

In this section, we upload the model input data to S3, then create and save a KNN model. Saving the model makes it appear in the Models section of the SageMaker console. It also makes it possible to run batch transform later on, or to deploy the model as an endpoint for real-time inference.

This approach uses the default 'index_type' parameter for knn. It is precise but can be slow for large datasets. In such cases, you may want to use a different 'index_type' parameter leading to an approximate, yet fast answer.
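
As a rough sketch of what such a change could look like, the hyperparameters below mirror the ones used in this notebook and add an explicit `index_type`. This assumes the `faiss.IVFFlat` value listed in the k-NN hyperparameter documentation; the dictionary name is hypothetical, and you should check the documentation for which index types support the `INNER_PRODUCT` metric before using it.

```python
# Hypothetical configuration for illustration; verify against the k-NN docs
approx_knn_hyperparameters = {
    "feature_dim": 65,               # num_factors (64) + 1, as in this notebook
    "k": 100,
    "index_metric": "INNER_PRODUCT",
    "predictor_type": "classifier",
    "sample_size": 200000,
    "index_type": "faiss.IVFFlat",   # approximate index; the default is "faiss.Flat"
}

# With an Estimator such as `knn` from the next cell, this could be applied as:
# knn.set_hyperparameters(**approx_knn_hyperparameters)
print(approx_knn_hyperparameters["index_type"])
```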

In [None]:
print('KNN train features shape = ', knn_item_matrix.shape)
knn_prefix = 'knn'
knn_output_prefix  = 's3://{}/{}/output'.format(bucket, knn_prefix)
knn_train_data_path = writeDatasetToProtobuf(knn_item_matrix, bucket, knn_prefix, train_key, "dense", knn_train_label)
print('uploaded KNN train data: {}'.format(knn_train_data_path))

nb_recommendations = 100

# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=knn_output_prefix,
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(feature_dim=knn_item_matrix.shape[1], k=nb_recommendations, index_metric="INNER_PRODUCT", predictor_type='classifier', sample_size=200000)
fit_input = {'train': knn_train_data_path}
knn.fit(fit_input)
knn_model_name =  knn.latest_training_job.job_name
print ("created model: {}".format(knn_model_name))

# save the model so that we can reference it in the next step during batch inference
sm = boto3.client(service_name='sagemaker')
primary_container = {
    'Image': knn.image_name,
    'ModelDataUrl': knn.model_data,
}

knn_model = sm.create_model(
        ModelName = knn.latest_training_job.job_name,
        ExecutionRoleArn = knn.role,
        PrimaryContainer = primary_container)
print ("saved the model")

## Part 4 - Deploy a realtime inference endpoint

In [None]:
knnPredictor = knn.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)

In [None]:
user_id = 10

def getEmbeddingsForUserAsCSV(user_id):
    user= knn_user_matrix[user_id-1][:]
    user.shape=(1,user.shape[0])
    buf = io.StringIO()
    np.savetxt(buf, user, delimiter=',')
    return buf.getvalue()

user_input = getEmbeddingsForUserAsCSV(user_id)

In [None]:
#from sagemaker import RealTimePredictor 
#knnPredictor = RealTimePredictor('knn-2019-10-24-09-09-02-023')
knnPredictor.content_type = 'text/csv'
knnPredictor.accept = 'application/jsonlines; verbose=true'
result_json = json.loads(knnPredictor.predict(data=user_input))
result_json

In [None]:
print ("Recommended movie Ids for user #{} : {}".format(user_id, [int(movie_id) for movie_id in result_json['labels']]))
print ("Movie distances for user #{} : {}".format(user_id,  [round(distance, 4) for distance in result_json['distances']]))
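
If you only need the best few matches, the labels and distances can be paired and sorted. The sketch below uses a made-up response body with the same `labels` and `distances` fields read by the cells above. The helper name `top_x` is hypothetical, and it assumes that with the `INNER_PRODUCT` metric a larger value means a better match.

```python
import json

# Made-up response body in the shape read above; the values are invented
sample_response = '{"predicted_label": 3.0, "distances": [0.9, 2.4, 1.7], "labels": [12.0, 3.0, 50.0]}'

def top_x(response_body, x):
    """Return the x best (movie_id, score) pairs from a k-NN response."""
    result = json.loads(response_body)
    pairs = zip(result["labels"], result["distances"])
    # Assumption: with INNER_PRODUCT, a larger value means a better match
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [(int(label), round(score, 4)) for label, score in ranked[:x]]

print(top_x(sample_response, 2))
```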

# Summary

We have now trained and deployed a model that returns a list of movie recommendations for a given user. Please run the following cell and note down the properties it prints, as they will be required in the next step.

In [None]:
print("EMBEDDINGS_S3_PATH: {}".format(user_matrix_upload_path))
print("SAGEMAKER_ENDPOINT_NAME: {}".format(knnPredictor.endpoint))


__[now jump back into the original Lab Guidebook - Deploying the integration lambda](https://github.com/johanneslanger/recommendations-on-aws-workshop/tree/master/lab-2-recommendations-with-sagemaker#deploying-the-integration-lambda-function)__