# Introduction

This notebook shows how to build a recommendation system using SageMaker's Factorization Machines (FM) algorithm. The main goal is to show how to extend an FM model to predict the top "X" recommendations using SageMaker's KNN algorithm and Batch Transform.

There are four parts to this notebook:

1. Building an FM model
2. Repackaging the FM model to fit a KNN model
3. Building a KNN model
4. Running Batch Transform to predict the top "X" items


## Part 1 - Building an FM Model using the MovieLens dataset

Julien Simon has written an excellent, detailed blog post on building an FM model with SageMaker; see the link below for more information. This part largely reuses his code, to keep continuity with the additional steps that follow.

Source - https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

In [1]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri
import numpy as np
from scipy.sparse import lil_matrix
import pandas as pd
import boto3, io, os

### Download movie rating data from MovieLens

In [2]:
#download data
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2019-04-25 15:17:59--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.4’


2019-04-25 15:18:00 (20.7 MB/s) - ‘ml-100k.zip.4’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating:

### Shuffle the data

In [3]:
%cd ml-100k
!shuf ua.base -o ua.base.shuffled

/home/ec2-user/SageMaker/ml-100k


### Load Training Data

In [4]:
user_movie_ratings_train = pd.read_csv('ua.base.shuffled', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_train.head(5)

item_df = pd.read_csv("u.item", sep='|', lineterminator='\n', encoding='iso-8859-1', header=None, names=["movie_id", "movie_title", "release_date", "video_release_date", "IMDB_url", "unknown", "action", "adventure", "animation", "childrens","comedy","crime","documentary","drama","fantasy","film_noir","horror","musical","mystery", "romance", "sci_fi","thriller","war","western"])
filtered_item_df = item_df.drop(['movie_title', 'release_date', 'video_release_date', 'IMDB_url'], axis=1)
filtered_item_df.rename(columns={"unknown": "0",
                                 "action":"1",
                                 "adventure": "2",
                                 "animation": "3",
                                 "childrens": "4",
                                 "comedy": "5",
                                 "crime": "6",
                                 "documentary": "7",
                                 "drama": "8",
                                 "fantasy": "9",
                                 "film_noir": "10",
                                 "horror":"11",
                                 "musical":"12",
                                 "mystery": "13",
                                 "romance": "14",
                                 "sci_fi": "15",
                                 "thriller": "16",
                                 "war": "17",
                                 "western": "18"},
                       inplace=True)
user_movie_ratings_train = pd.merge(user_movie_ratings_train, filtered_item_df, left_on='movie_id', right_on='movie_id')
user_movie_ratings_train[(user_movie_ratings_train['user_id']==1) & (user_movie_ratings_train['movie_id']==606)]

Unnamed: 0,user_id,movie_id,rating,0,1,2,3,4,5,6,...,9,10,11,12,13,14,15,16,17,18


### Load Test Data

In [5]:
user_movie_ratings_test = pd.read_csv('ua.test', sep='\t', index_col=False, 
                 names=['user_id' , 'movie_id' , 'rating'])
user_movie_ratings_test = pd.merge(user_movie_ratings_test, filtered_item_df, left_on='movie_id', right_on='movie_id')

user_movie_ratings_test[user_movie_ratings_test['user_id']==1]

Unnamed: 0,user_id,movie_id,rating,0,1,2,3,4,5,6,...,9,10,11,12,13,14,15,16,17,18
0,1,20,4,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
11,1,33,4,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
19,1,61,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22,1,117,3,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
80,1,155,2,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
82,1,160,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
90,1,171,5,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
98,1,189,3,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
105,1,202,5,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
127,1,265,4,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [6]:
nb_users= user_movie_ratings_train['user_id'].max()
nb_movies=user_movie_ratings_train['movie_id'].max()
nb_genres = 19
nb_features=nb_users+nb_movies+nb_genres
nb_ratings_test=len(user_movie_ratings_test.index)
nb_ratings_train=len(user_movie_ratings_train.index)
print " # of users: ", nb_users
print " # of movies: ", nb_movies
print " # of genres: ", nb_genres
print " Training Count: ", nb_ratings_train
print " Test Count: ", nb_ratings_test
print " Features (# of users + # of movies + # of genres): ", nb_features

 # of users:  943
 # of movies:  1682
 # of genres:  19
 Training Count:  90570
 Test Count:  9430
 Features (# of users + # of movies + # of genres):  2644


### FM Input

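The input to FM is a sparse matrix in which each row one-hot encodes the user and the movie and flags the movie's genres. The label is binary: ratings of 4 and above are labeled 1 (positive), and ratings of 3 and below are labeled 0 (negative).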
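To make the encoding concrete, here is a minimal sketch that computes which columns are set to 1 for a single rating. It assumes the column order used by the loading code in this notebook (19 genre flags first, then users, then movies). The helper names `active_columns` and `label_for`, and the toy genre flags, are hypothetical and only for illustration.

```python
def active_columns(user_id, movie_id, genre_flags, nb_users, nb_genres=19):
    """Return the sorted column indices set to 1 for one rating row.

    Assumed layout (mirrors the notebook's encoding):
      [0, nb_genres)                      genre flags (multi-hot)
      [nb_genres, nb_genres + nb_users)   one-hot user ID (IDs start at 1)
      [nb_genres + nb_users, ...)         one-hot movie ID (IDs start at 1)
    """
    cols = [g for g, flag in enumerate(genre_flags) if flag == 1]
    cols.append(nb_genres + user_id - 1)
    cols.append(nb_genres + nb_users + movie_id - 1)
    return sorted(cols)


def label_for(rating):
    """Binary label: 1 for ratings of 4 or 5, otherwise 0."""
    return 1 if rating >= 4 else 0


# Toy example: user 1 rates movie 20 with a 4; only genre flag 14 is set here.
flags = [0] * 19
flags[14] = 1
cols = active_columns(user_id=1, movie_id=20, genre_flags=flags, nb_users=943)
print("active columns: {}, label: {}".format(cols, label_for(4)))
```

Every row therefore has exactly one user column and one movie column set, plus one column per genre the movie belongs to.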

In [7]:
def loadDataset(df, lines, columns):
    # Features are stored in a sparse matrix: genre flags in the first
    # nb_genres columns, then the one-hot user ID, then the one-hot movie ID
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line = 0
    for index, row in df.iterrows():
        # Genre columns are named "0" to "18" after the rename above
        for genre in range(nb_genres):
            if row[str(genre)] == 1:
                X[line, genre] = 1
        # User and movie IDs start at 1, hence the -1
        X[line, nb_genres + row['user_id'] - 1] = 1
        X[line, nb_genres + nb_users + (row['movie_id'] - 1)] = 1
        # Ratings of 4 and above are positive examples, the rest are negative
        if int(row['rating']) >= 4:
            Y.append(1)
        else:
            Y.append(0)
        line = line + 1

    Y = np.array(Y).astype('float32')
    return X, Y


X_train, Y_train = loadDataset(user_movie_ratings_train, nb_ratings_train, nb_features)
X_test, Y_test = loadDataset(user_movie_ratings_test, nb_ratings_test, nb_features)

In [8]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nb_ratings_train, nb_features)
assert Y_train.shape == (nb_ratings_train, )
# np.count_nonzero counts the labels equal to 1, not the zeros
nonzero_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nb_ratings_train-nonzero_labels, nonzero_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nb_ratings_test, nb_features)
assert Y_test.shape  == (nb_ratings_test, )
nonzero_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nb_ratings_test-nonzero_labels, nonzero_labels))

(90570, 2644)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2644)
(9430,)
Test labels: 3961 zeros, 5469 ones


### Convert to Protobuf format for saving to S3

In [9]:
#Change this value to your own bucket name
bucket = 'movie-lens-dataset'
prefix = 'fm'

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [10]:
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, Y=None):
    buf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    else:
        smac.write_numpy_to_dense_tensor(buf, X, labels=Y)
        
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", Y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", Y_test)    
  
print "Training data S3 path: ",fm_train_data_path
print "Test data S3 path: ",fm_test_data_path
print "FM model output S3 path: {}".format(output_prefix)

Training data S3 path:  s3://movie-lens-dataset/fm/train/train.protobuf
Test data S3 path:  s3://movie-lens-dataset/fm/test/test.protobuf
FM model output S3 path: s3://movie-lens-dataset/fm/output


### Run training job

You can adjust the hyperparameters until you are happy with the predictions. For this dataset and hyperparameter configuration, after 100 epochs, test accuracy was around 70% on average and the F1 score (a typical metric for a binary classifier) was around 0.74 (1 indicates a perfect classifier). That is not great, but you can tune the model further.
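For readers unfamiliar with these metrics, the following minimal sketch shows how accuracy and the F1 score are derived from binary labels and predictions. The helper `binary_metrics` and the toy labels are hypothetical; they are not the training job's output and do not reproduce the figures quoted above.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / float(len(pairs))
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Toy data: 8 ratings, 5 of them truly positive
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print("accuracy={:.2f} f1={:.2f}".format(accuracy, f1))
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the two classes are imbalanced.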

In [11]:
instance_type='ml.m5.large'
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type=instance_type,
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nb_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

INFO:sagemaker:Creating training-job with name: factorization-machines-2019-04-25-15-18-42-274


2019-04-25 15:18:42 Starting - Starting the training job...
2019-04-25 15:18:58 Starting - Launching requested ML instances......
2019-04-25 15:20:07 Starting - Preparing the instances for training......
2019-04-25 15:20:53 Downloading - Downloading input data...
2019-04-25 15:21:31 Training - Training image download completed. Training in progress.
Docker entrypoint called with argument(s): train
[04/25/2019 15:21:32 INFO 139992915281728] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_b

## Part 2 - Repackaging Model data to fit a KNN Model

Now that the model has been created and stored in SageMaker, we can download it and repackage it to fit a KNN model.
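One way to see why this repackaging can work is the following simplified sketch. It assumes that only the user and movie features are active (the genre features are left out), and that the global bias and the user's own linear weight are constant for a given user, so they do not change the ranking of movies. Under those assumptions, the part of the FM score that depends on the movie is `w_item + <v_user, v_item>`, which equals the inner product of the augmented vectors `[v_user, 1]` and `[v_item, w_item]`. All names and numbers below are hypothetical.

```python
def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fm_item_score(v_user, v_item, w_item):
    """Movie-dependent part of the FM score under the stated assumptions."""
    return w_item + dot(v_user, v_item)


def augmented_score(v_user, v_item, w_item):
    """Same quantity, written as one inner product of augmented vectors."""
    user_vec = list(v_user) + [1.0]     # concat(V[u], 1)
    item_vec = list(v_item) + [w_item]  # concat(V[i], w[i])
    return dot(user_vec, item_vec)


# Toy latent factors (3 factors instead of 64)
v_user = [0.2, -0.1, 0.4]
items = {
    "movie_a": ([0.5, 0.3, -0.2], 0.05),
    "movie_b": ([-0.1, 0.8, 0.6], -0.02),
}

scores = {}
for name, (v_item, w_item) in sorted(items.items()):
    direct = fm_item_score(v_user, v_item, w_item)
    packed = augmented_score(v_user, v_item, w_item)
    assert abs(direct - packed) < 1e-12  # both formulations agree
    scores[name] = packed
    print("{}: {:.2f}".format(name, packed))
```

This is why a nearest-neighbor search that ranks by inner product over the augmented movie vectors can stand in for scoring every (user, movie) pair with the FM model.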

### Download model data

In [12]:
import mxnet as mx

model_file_name = "model.tar.gz"
model_full_path = fm.output_path +"/"+ fm.latest_training_job.job_name +"/output/"+model_file_name
print "Model Path: ", model_full_path

#Download FM model 
%cd /home/ec2-user/SageMaker
!sudo aws s3 cp $model_full_path .

#Extract model file for loading to MXNet
os.system('tar xzvf '+model_file_name)
os.system("unzip -o model_algo-1")
os.system("mv symbol.json model-symbol.json")
os.system("mv params model-0000.params")

Model Path:  s3://movie-lens-dataset/fm/output/factorization-machines-2019-04-25-15-18-42-274/output/model.tar.gz
/home/ec2-user/SageMaker
download: s3://movie-lens-dataset/fm/output/factorization-machines-2019-04-25-15-18-42-274/output/model.tar.gz to ./model.tar.gz


0

### Extract model data to create item and user latent matrices

In [13]:
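#Extract model data
m = mx.module.Module.load('./model', 0, False, label_names=['out_label'])
V = m._arg_params['v'].asnumpy()          # latent factors, one row per input feature
w = m._arg_params['w1_weight'].asnumpy()  # linear weights, one row per input feature
b = m._arg_params['w0_weight'].asnumpy()  # global bias

# item latent matrix - concat(V[i], w[i]).
# Note: the input encoding puts the 19 genre columns first, so the rows from
# nb_users onward hold the last 19 user rows followed by all movie rows
# (1701 rows in total).
knn_item_matrix = np.concatenate((V[nb_users:], w[nb_users:]), axis=1)
knn_train_label = np.arange(1,nb_movies+1+19)

#user latent matrix - concat (V[u], 1)
# Note: for the same reason, the first nb_users rows hold the 19 genre rows
# followed by the first 924 user rows.
ones = np.ones(nb_users).reshape((nb_users, 1))
knn_user_matrix = np.concatenate((V[:nb_users], ones), axis=1)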

In [14]:
print(knn_item_matrix.shape)
print(knn_train_label.shape)

(1701, 65)
(1701,)
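The 1701 rows reported above are 1682 movies plus 19 extra rows. As a minimal sketch for checking which rows of the FM parameters belong to which feature block, the helper below computes the row ranges, assuming the parameters have one row per input feature and the feature order used when encoding the input in Part 1 (genres, then users, then movies). The name `block_ranges` is hypothetical.

```python
def block_ranges(nb_genres, nb_users, nb_movies):
    """Return half-open (start, stop) row ranges for each feature block.

    Assumes one parameter row per input feature, ordered as
    genres, then users, then movies.
    """
    user_start = nb_genres
    movie_start = user_start + nb_users
    end = movie_start + nb_movies
    return {
        "genres": (0, user_start),
        "users": (user_start, movie_start),
        "movies": (movie_start, end),
    }


ranges = block_ranges(nb_genres=19, nb_users=943, nb_movies=1682)
total_rows = ranges["movies"][1]

# A slice that starts at row nb_users (943) covers this many rows ...
rows_from_nb_users = total_rows - 943
# ... of which this many still fall inside the user block
user_rows_included = ranges["users"][1] - 943

for name in ("genres", "users", "movies"):
    print("{}: rows {} to {}".format(name, ranges[name][0], ranges[name][1] - 1))
print("rows from 943 onward: {} ({} user rows)".format(
    rows_from_nb_users, user_rows_included))
```

Under these assumptions, slicing with the `movies` range selects exactly the 1682 movie rows, and slicing with the `users` range selects exactly the 943 user rows.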


## Part 3 - Building KNN Model

In this section, we upload the model input data to S3, train a KNN model, and save it. Saving the model makes it appear in the Models section of the SageMaker console. It also lets us run Batch Transform against it later, or deploy it as an endpoint for real-time inference.

This approach uses the default 'index_type' parameter for KNN ('faiss.Flat', as the training log shows). It is exact but can be slow for large datasets. In such cases, you may want to use a different 'index_type' value that returns an approximate but faster answer.
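Conceptually, an exact search scores every stored item against the query and keeps the best k, which is why its cost grows with the number of items. The sketch below illustrates that idea in plain Python with inner-product scoring. It is not how SageMaker's KNN or faiss is implemented, and the function name, labels, and vectors are hypothetical.

```python
import heapq


def top_k_inner_product(query, items, k):
    """Brute-force top-k search by inner product.

    items maps a label to its vector. Returns a list of (label, score)
    pairs, highest score first. Every item is scored once, so the cost
    is linear in the number of items.
    """
    scored = (
        (label, sum(q * x for q, x in zip(query, vec)))
        for label, vec in items.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy augmented item vectors (last component plays the role of w[i])
items = {
    1: [0.5, 0.3, 0.05],
    2: [-0.1, 0.8, -0.02],
    3: [0.9, -0.4, 0.10],
}
# Toy augmented user vector (last component is the constant 1)
query = [0.2, -0.1, 1.0]

top2 = top_k_inner_product(query, items, k=2)
print([label for label, score in top2])
```

Approximate index types trade some of this exactness for speed by not comparing the query against every stored vector.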

In [15]:
print('KNN train features shape = ', knn_item_matrix.shape)
knn_prefix = 'matt_knn'
knn_output_prefix  = 's3://{}/{}/output'.format(bucket, knn_prefix)
knn_train_data_path = writeDatasetToProtobuf(knn_item_matrix, bucket, knn_prefix, train_key, "dense", knn_train_label)
print('uploaded KNN train data: {}'.format(knn_train_data_path))

nb_recommendations = 100

# set up the estimator
knn = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "knn"),
    get_execution_role(),
    train_instance_count=1,
    train_instance_type=instance_type,
    output_path=knn_output_prefix,
    sagemaker_session=sagemaker.Session())

knn.set_hyperparameters(feature_dim=knn_item_matrix.shape[1], k=nb_recommendations, index_metric="INNER_PRODUCT", predictor_type='classifier', sample_size=200000)
fit_input = {'train': knn_train_data_path}
knn.fit(fit_input)
knn_model_name =  knn.latest_training_job.job_name
print "created model: ", knn_model_name

# save the model so that we can reference it in the next step during batch inference
sm = boto3.client(service_name='sagemaker')
primary_container = {
    'Image': knn.image_name,
    'ModelDataUrl': knn.model_data,
}

knn_model = sm.create_model(
        ModelName = knn.latest_training_job.job_name,
        ExecutionRoleArn = knn.role,
        PrimaryContainer = primary_container)
print "saved the model"

('KNN train features shape = ', (1701, 65))
uploaded KNN train data: s3://movie-lens-dataset/matt_knn/train.protobuf


INFO:sagemaker:Creating training-job with name: knn-2019-04-25-15-23-27-982


2019-04-25 15:23:28 Starting - Starting the training job...
2019-04-25 15:23:30 Starting - Launching requested ML instances......
2019-04-25 15:24:38 Starting - Preparing the instances for training.........
2019-04-25 15:26:28 Downloading - Downloading input data
2019-04-25 15:26:28 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
[04/25/2019 15:26:31 INFO 140241282664256] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'index_metric': u'L2', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'_log_level': u'info', u'faiss_index_ivf_nlists': u'auto', u'epochs': u'1', u'index_type': u'faiss.Flat', u'_faiss_index_nprobe': u'5', u'_kvstore': u'dist_async', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000'}
[04/25/2019 15:26:31 INFO 140241282664256] Reading provided configuration from /opt/ml/input/config/hyperparameters.

## Part 4 - Batch Transform

In this section, we use SageMaker's Batch Transform to predict the top "X" items for all users in a single batch job.

In [28]:
#upload inference data to S3
knn_batch_data_path = writeDatasetToProtobuf(knn_user_matrix, bucket, knn_prefix, train_key, "dense")
print "Batch inference data path: ",knn_batch_data_path

# Initialize the transformer object
transformer =sagemaker.transformer.Transformer(
    base_transform_job_name="knn",
    model_name=knn_model_name,
    instance_count=1,
    instance_type=instance_type,
    output_path=knn_output_prefix,
    accept="application/jsonlines; verbose=true"
)

# Start a transform job:
transformer.transform(knn_batch_data_path, content_type='application/x-recordio-protobuf')
transformer.wait()


#Download predictions 
results_file_name = "inference_output"
inference_output_file = "matt_knn/output/train.protobuf.out"
s3_client = boto3.client('s3')
s3_client.download_file(bucket, inference_output_file, results_file_name)
with open(results_file_name) as f:
    results = f.readlines()  

INFO:sagemaker:Creating transform job with name: knn-2019-04-25-15-44-00-833


Batch inference data path:  s3://movie-lens-dataset/matt_knn/train.protobuf
..........................................!


In [29]:
# Build the movie lookup table.
import csv

movies = {} 
with open('ml-100k/u.item','r') as f:
    samples=csv.reader(f,delimiter='|')
    for movieItem in samples:
        movies[str(int(movieItem[0])-1)] = str(movieItem[1]) 

In [30]:
import json
test_user_idx = 0
u_one_json = json.loads(results[test_user_idx])

In [36]:
print "Recommended movie Ids for user #{}".format(test_user_idx+1)
# Show the first 10 entries of the returned label list
for i, movie_id in enumerate(u_one_json['labels'][:10], 1):
    print str(i) +") " + movies[str(int(movie_id)-1)]

Recommended movie Ids for user #1
1) Dunston Checks In (1996)
2) Sense and Sensibility (1995)
3) Brazil (1985)
4) Circle of Friends (1995)
5) Crude Oasis, The (1995)
6) Curdled (1996)
7) Addicted to Love (1997)
8) Third Man, The (1949)
9) All Things Fair (1996)
10) Nell (1994)
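The lookup above can be sketched as a small self-contained function. It assumes each line of the batch output is a JSON object with `labels` and `distances` lists, which are the two keys this notebook reads. The sample line, the titles, and the helper name `top_titles` are made up for illustration. The sketch falls back to a placeholder when a label has no entry in the lookup table.

```python
import json


def top_titles(json_line, titles_by_id, n):
    """Map the first n labels of one output line to movie titles."""
    record = json.loads(json_line)
    titles = []
    for label in record["labels"][:n]:
        movie_id = int(label)  # labels may be serialized as floats
        # Fall back to a placeholder instead of raising KeyError
        titles.append(
            titles_by_id.get(movie_id, "<unknown id {}>".format(movie_id))
        )
    return titles


# Hypothetical output line and lookup table
sample_line = '{"labels": [3.0, 1.0, 7.0], "distances": [0.54, 0.55, 0.56]}'
titles_by_id = {1: "Movie A", 2: "Movie B", 3: "Movie C"}

result = top_titles(sample_line, titles_by_id, n=3)
for rank, title in enumerate(result, 1):
    print("{}) {}".format(rank, title))
```

Keying the lookup table by integer movie ID avoids the string conversion and the off-by-one adjustments on both sides of the lookup.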


In [32]:
print(len(results))
print "Recommended movie Ids for user #{} : {}".format(test_user_idx+1, [int(movie_id) for movie_id in u_one_json['labels']])
print
print "Movie distances for user #{} : {}".format(test_user_idx+1,  [round(distance, 4) for distance in u_one_json['distances']])

943
Recommended movie Ids for user #1 : [1177, 275, 175, 724, 1340, 1081, 535, 513, 1619, 729, 453, 1312, 1415, 83, 1681, 1417, 667, 634, 887, 31, 1404, 1261, 155, 1658, 1386, 1117, 516, 665, 872, 1519, 1162, 966, 1144, 1468, 462, 1262, 873, 987, 623, 1089, 629, 1188, 229, 160, 433, 1186, 1450, 543, 1644, 1413, 447, 1391, 111, 1026, 1573, 812, 549, 676, 1307, 538, 519, 135, 1556, 1670, 1270, 1477, 182, 694, 230, 497, 764, 985, 163, 1099, 632, 261, 1518, 1683, 270, 376, 1482, 464, 737, 1038, 713, 284, 317, 546, 223, 196, 212, 337, 98, 288, 501, 227, 217, 15, 626, 187]

Movie distances for user #1 : [0.5376, 0.539, 0.5394, 0.5396, 0.5406, 0.5418, 0.542, 0.5421, 0.5421, 0.5431, 0.5435, 0.5439, 0.544, 0.5443, 0.5444, 0.5447, 0.5449, 0.5451, 0.5456, 0.5459, 0.546, 0.5461, 0.5466, 0.5467, 0.5476, 0.5485, 0.5488, 0.5492, 0.5492, 0.5494, 0.5495, 0.5501, 0.5506, 0.551, 0.5513, 0.5514, 0.5517, 0.5517, 0.5548, 0.5557, 0.5565, 0.5566, 0.558, 0.5583, 0.5583, 0.5584, 0.5593, 0.5594, 0.56, 0.5621, 0.