# Movie recommendation on Amazon SageMaker with Factorization Machines

Recommendation is one of the most popular applications of machine learning (ML). This lab is a modified version of the AWS ML blog post [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/). It shows you how to build a movie recommendation model based on factorization machines, one of the built-in algorithms of Amazon SageMaker, using the popular [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

This lab will take around 12 minutes.

## A word about factorization machines

Factorization Machines (FM) are a supervised machine learning technique introduced in 2010 ([research paper](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), PDF). FM get their name from their ability to reduce problem dimensionality thanks to matrix factorization.

Factorization machines can be used for classification or regression and are much more computationally efficient on large sparse data sets than traditional algorithms like linear regression. This property is why FM are widely used for recommendation: user count and item count are typically very large, while the actual number of ratings is very small (users don’t rate all available items!).
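To make the efficiency claim concrete, here is a minimal sketch (assuming Python 3 and NumPy) of the second-order FM score described in the paper linked above: a bias, a linear term, and pairwise interactions weighted by dot products of per-feature factor vectors. It computes the score both pair by pair and with the linear-time rewrite from the paper. The weights are random placeholders, not a trained model, and the function names are made up for this illustration.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Second-order FM score computed pair by pair: O(k * n^2)."""
    n = len(x)
    score = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score

def fm_predict_fast(x, w0, w, V):
    """Same score using the O(k * n) rewrite of the pairwise term."""
    xv = V.T @ x                    # shape (k,): sum_i V[i, f] * x[i]
    x2v2 = (V ** 2).T @ (x ** 2)    # shape (k,): sum_i V[i, f]^2 * x[i]^2
    return w0 + w @ x + 0.5 * np.sum(xv ** 2 - x2v2)

# Placeholder parameters: 6 features, 2 factors per feature.
rng = np.random.default_rng(0)
n_features, n_factors = 6, 2
w0 = 0.1
w = rng.normal(size=n_features)
V = rng.normal(size=(n_features, n_factors))

# A one-hot style input with exactly two active features,
# like the (user, movie) rows built later in this lab.
x = np.zeros(n_features)
x[1] = 1.0
x[4] = 1.0

naive = fm_predict_naive(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(naive, fast)
```

With only two active features, the score reduces to the bias, the two linear weights, and a single dot product between the two factor vectors, which is why sparse inputs are cheap to evaluate.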

Here’s a simple example, where a sparse rating matrix (dimension 4×4) is factored into a dense user matrix (dimension 4×2) and a dense item matrix (dimension 2×4). As you can see, the number of factors (2) is smaller than the number of columns of the rating matrix (4). In addition, multiplying the two dense matrices lets us fill all the blank values in the rating matrix, which we can then use to recommend new items to any user.

<img src="images/Factorization2.png" alt="Factorization" style="width: 800px;"/>
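A minimal NumPy sketch of the same idea follows. The factor values are invented for illustration and are not the ones shown in the figure.

```python
import numpy as np

# Hypothetical factors: 4 users x 2 factors, and 2 factors x 4 items.
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.5],
                         [1.2, 0.1],
                         [0.6, 0.9]])
item_factors = np.array([[4.0, 1.0, 3.0, 0.5],
                         [0.5, 3.0, 1.0, 2.5]])

# The product is a dense 4x4 matrix: every (user, item) pair gets an estimate,
# including pairs that had no rating in the original sparse matrix.
estimated_ratings = user_factors @ item_factors

print(estimated_ratings.shape)
print(np.linalg.matrix_rank(estimated_ratings))
```

The rank of the product cannot exceed the number of factors, which is the sense in which factorization reduces dimensionality.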

### The MovieLens dataset

This dataset is a great starting point for recommendation. It comes in multiple sizes. In this lab we’ll use ml100k: 100,000 ratings from 943 users on 1682 movies. As you can see, the ml100k rating matrix is quite sparse (about 93.7%) because it only holds 100,000 ratings out of a possible 1,586,126 (943*1682).
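The sparsity figure can be checked with a few lines of arithmetic, using only the counts quoted above:

```python
nb_users = 943
nb_movies = 1682
nb_ratings = 100000

# Every (user, movie) pair is a potential rating.
possible_ratings = nb_users * nb_movies
sparsity = 1.0 - nb_ratings / float(possible_ratings)

print(possible_ratings)            # 1586126
print("{:.1%}".format(sparsity))
```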

Here are the first 10 lines in the data set: user 754 gave movie 595 a 2-star rating, and so on.

<pre>
# user id, movie id, rating, timestamp
754         595         2    879452073
932         157         4    891250667
751         100         4    889132252
101         820         3    877136954
606         1277        3    878148493
581         475         4    879641850
13          50          5    882140001
457         59          5    882397575
111         321         3    891680076
123         657         4    879872066
</pre>

## Recommendation Engine Implementation

In [26]:
import time
print(time.strftime("%m/%d/%Y %H:%M:%S"))

09/05/2018 03:47:49


In [4]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer

import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

bucket = 'pilho-sagemaker-ai-workshop'
prefix = 'sagemaker/fm-movielens'

### Download ml-100k dataset

In [5]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2018-09-05 01:29:43--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2018-09-05 01:29:43 (20.2 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

In [6]:
%cd ml-100k
!shuf ua.base -o ua.base.shuffled
!head -10 ua.base.shuffled

/home/ec2-user/SageMaker/pilho-lab/HotelSila/ml-100k
707	211	3	886287051
346	55	5	874948639
488	845	3	891294853
359	323	3	886453431
443	260	1	883504818
59	58	4	888204389
351	689	4	879481386
487	288	4	883440572
318	485	5	884495921
660	209	4	891406212


In [7]:
!head -10 ua.test

1	20	4	887431883
1	33	4	878542699
1	61	4	878542420
1	117	3	874965739
1	155	2	878542201
1	160	4	875072547
1	171	5	889751711
1	189	3	888732928
1	202	5	875072442
1	265	4	878542441


### Build training set and test set

In [8]:
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies

nbRatingsTrain=90570
nbRatingsTest=9430

In [9]:
# For each user, build a list of rated movies.
# We'd need this to add random negative samples.
moviesByUser = {}
for userId in range(nbUsers):
    moviesByUser[str(userId)]=[]
 
with open('ua.base.shuffled','r') as f:
    samples=csv.reader(f,delimiter='\t')
    for userId,movieId,rating,timestamp in samples:
        moviesByUser[str(int(userId)-1)].append(int(movieId)-1) 

In [10]:
def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter='\t')
        for userId,movieId,rating,timestamp in samples:
            X[line,int(userId)-1] = 1
            X[line,int(nbUsers)+int(movieId)-1] = 1
            if int(rating) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1
            
    Y=np.array(Y).astype('float32')
    return X,Y
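
To make the encoding concrete, here is a small self-contained sketch of the same scheme applied to three in-memory samples (the first three rows of the data sample shown earlier). `encode_samples` is a hypothetical helper written for this illustration; it is not part of the lab code.

```python
import numpy as np
from scipy.sparse import lil_matrix

NB_USERS = 943
NB_MOVIES = 1682

def encode_samples(samples):
    """One-hot encode (user_id, movie_id, rating) tuples with 1-based IDs."""
    X = lil_matrix((len(samples), NB_USERS + NB_MOVIES), dtype='float32')
    Y = np.zeros(len(samples), dtype='float32')
    for row, (user_id, movie_id, rating) in enumerate(samples):
        X[row, user_id - 1] = 1                # user columns: 0 .. 942
        X[row, NB_USERS + movie_id - 1] = 1    # movie columns: 943 .. 2624
        Y[row] = 1 if rating >= 4 else 0       # 4 or 5 stars is labeled 1
    return X, Y

X, Y = encode_samples([(754, 595, 2), (932, 157, 4), (751, 100, 4)])
print(X.nonzero())
print(Y)
```

Each row ends up with exactly two non-zero entries, one in the user block and one in the movie block.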

In [11]:
X_train, Y_train = loadDataset('ua.base.shuffled', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('ua.test',nbRatingsTest,nbFeatures)

In [12]:
print(X_train[1000])

  (0, 275)	1.0
  (0, 1297)	1.0


In [13]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain, nbFeatures)
assert Y_train.shape == (nbRatingsTrain, )
# np.count_nonzero counts the positive (1) labels
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nbRatingsTrain-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nbRatingsTest, nbFeatures)
assert Y_test.shape  == (nbRatingsTest, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nbRatingsTest-one_labels, one_labels))

(90570, 2625)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones


### Convert to protobuf and save to S3

In [14]:
train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train3')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test3')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [15]:
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

s3://pilho-sagemaker-ai-workshop/sagemaker/fm-movielens/train3/train.protobuf
s3://pilho-sagemaker-ai-workshop/sagemaker/fm-movielens/test3/test.protobuf
Output: s3://pilho-sagemaker-ai-workshop/sagemaker/fm-movielens/output


### Run training job

In [16]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'factorization-machines')

In [17]:
fm = sagemaker.estimator.Estimator(container,
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nbFeatures,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': train_data, 'test': test_data})

INFO:sagemaker:Creating training-job with name: factorization-machines-2018-09-05-01-30-02-100


.....................
Docker entrypoint called with argument(s): train
[09/05/2018 01:33:20 INFO 140343674808128] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_batch_size': u'1000', u'_use_full_symbolic': u'true', u'batch_metrics_publish_interval': u'500', u'bias_init_sigma': u'0.01', u'_num_gpus': u'auto', u'_data_format': u'record', u'factors_wd': u'0.00001', u'linear_wd': u'0.001', u'_kvstore': u'auto', u'_learning_rate': u'1.0', u'_optimizer': u'adam'}
[09/05/2018 01:33:20 

[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, batch=0 train binary_classification_accuracy <score>=0.727
[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, batch=0 train binary_classification_cross_entropy <loss>=0.607448242188
[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, batch=0 train binary_f_1.000 <score>=0.75204359673
[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, train binary_classification_accuracy <score>=0.714054945055
[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, train binary_classification_cross_entropy <loss>=0.604973278004
[09/05/2018 01:33:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=11, train binary_f_1.000 <score>=0.765523766614
#metrics {"Metrics": {"update.time": {"count": 1, "max": 762.423038482666, "sum": 762.4230384

[09/05/2018 01:33:39 INFO 140343674808128] #quality_metric: host=algo-1, epoch=24, train binary_classification_accuracy <score>=0.735384615385
[09/05/2018 01:33:39 INFO 140343674808128] #quality_metric: host=algo-1, epoch=24, train binary_classification_cross_entropy <loss>=0.566906655783
[09/05/2018 01:33:39 INFO 140343674808128] #quality_metric: host=algo-1, epoch=24, train binary_f_1.000 <score>=0.772163875485
#metrics {"Metrics": {"update.time": {"count": 1, "max": 769.7751522064209, "sum": 769.7751522064209, "min": 769.7751522064209}}, "EndTime": 1536111219.645237, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111218.875026}
[09/05/2018 01:33:39 INFO 140343674808128] #progress_metric: host=algo-1, completed 25 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[09/05/2018 01:33:49 INFO 140343674808128] #quality_metric: host=algo-1, epoch=38, train binary_classification_accuracy <score>=0.740659340659
[09/05/2018 01:33:49 INFO 140343674808128] #quality_metric: host=algo-1, epoch=38, train binary_classification_cross_entropy <loss>=0.547119114467
[09/05/2018 01:33:49 INFO 140343674808128] #quality_metric: host=algo-1, epoch=38, train binary_f_1.000 <score>=0.773468995969
#metrics {"Metrics": {"update.time": {"count": 1, "max": 728.3539772033691, "sum": 728.3539772033691, "min": 728.3539772033691}}, "EndTime": 1536111229.892146, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111229.163245}
[09/05/2018 01:33:49 INFO 140343674808128] #progress_metric: host=algo-1, completed 39 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[09/05/2018 01:33:59 INFO 140343674808128] #quality_metric: host=algo-1, epoch=51, train binary_classification_accuracy <score>=0.743714285714
[09/05/2018 01:33:59 INFO 140343674808128] #quality_metric: host=algo-1, epoch=51, train binary_classification_cross_entropy <loss>=0.535659007649
[09/05/2018 01:33:59 INFO 140343674808128] #quality_metric: host=algo-1, epoch=51, train binary_f_1.000 <score>=0.775023151721
#metrics {"Metrics": {"update.time": {"count": 1, "max": 714.6210670471191, "sum": 714.6210670471191, "min": 714.6210670471191}}, "EndTime": 1536111239.39467, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111238.679573}
[09/05/2018 01:33:59 INFO 140343674808128] #progress_metric: host=algo-1, completed 52 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset": 

[09/05/2018 01:34:09 INFO 140343674808128] #quality_metric: host=algo-1, epoch=65, train binary_classification_accuracy <score>=0.74610989011
[09/05/2018 01:34:09 INFO 140343674808128] #quality_metric: host=algo-1, epoch=65, train binary_classification_cross_entropy <loss>=0.527024399642
[09/05/2018 01:34:09 INFO 140343674808128] #quality_metric: host=algo-1, epoch=65, train binary_f_1.000 <score>=0.776552737964
#metrics {"Metrics": {"update.time": {"count": 1, "max": 709.6560001373291, "sum": 709.6560001373291, "min": 709.6560001373291}}, "EndTime": 1536111249.666124, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111248.955989}
[09/05/2018 01:34:09 INFO 140343674808128] #progress_metric: host=algo-1, completed 66 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset": 

[09/05/2018 01:34:19 INFO 140343674808128] #quality_metric: host=algo-1, epoch=79, train binary_classification_accuracy <score>=0.747747252747
[09/05/2018 01:34:19 INFO 140343674808128] #quality_metric: host=algo-1, epoch=79, train binary_classification_cross_entropy <loss>=0.520592326741
[09/05/2018 01:34:19 INFO 140343674808128] #quality_metric: host=algo-1, epoch=79, train binary_f_1.000 <score>=0.777759490362
#metrics {"Metrics": {"update.time": {"count": 1, "max": 708.1468105316162, "sum": 708.1468105316162, "min": 708.1468105316162}}, "EndTime": 1536111259.778182, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111259.069576}
[09/05/2018 01:34:19 INFO 140343674808128] #progress_metric: host=algo-1, completed 80 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":

[09/05/2018 01:34:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=93, train binary_classification_accuracy <score>=0.749076923077
[09/05/2018 01:34:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=93, train binary_classification_cross_entropy <loss>=0.515383036519
[09/05/2018 01:34:29 INFO 140343674808128] #quality_metric: host=algo-1, epoch=93, train binary_f_1.000 <score>=0.77882603642
#metrics {"Metrics": {"update.time": {"count": 1, "max": 704.9980163574219, "sum": 704.9980163574219, "min": 704.9980163574219}}, "EndTime": 1536111269.948627, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1536111269.243168}
[09/05/2018 01:34:29 INFO 140343674808128] #progress_metric: host=algo-1, completed 94 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset": 


Billable seconds: 185
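
The "Max Batches Seen Between Resets" value of 91 in the log is consistent with the training set size and the `mini_batch_size` set above, assuming the final, partially filled batch is counted as a batch. A quick check:

```python
import math

nb_ratings_train = 90570
mini_batch_size = 1000

# 90 full batches of 1000 samples plus one partial batch of 570 samples.
batches_per_epoch = int(math.ceil(nb_ratings_train / float(mini_batch_size)))
print(batches_per_epoch)  # 91
```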


### Deploy model

In [18]:
fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)

INFO:sagemaker:Creating model with name: factorization-machines-2018-09-05-01-35-16-223
INFO:sagemaker:Creating endpoint with name factorization-machines-2018-09-05-01-30-02-100


---------------------------------------------------------------------------!

In [19]:
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer
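
To see the request body this serializer produces, the sketch below applies the same logic to a tiny dense array. It is a standalone copy for illustration (named `to_fm_json` here so it does not clash with the function above) and it does not call the endpoint.

```python
import json
import numpy as np

def to_fm_json(data):
    """Build the {'instances': [{'features': [...]}, ...]} JSON request body."""
    return json.dumps({'instances': [{'features': row.tolist()} for row in data]})

# Two made-up rows with four features each (the real rows have 2625).
sample = np.array([[1.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0, 0.0]], dtype='float32')
payload = to_fm_json(sample)
print(payload)
```

Note that each row is sent as a dense list, which is why the prediction cells below call `.toarray()` on the sparse matrix before invoking the predictor.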

## Run predictions

Let's test the model. We will run a batch prediction on the 10 test samples below.

In [20]:
print(X_test[1000:1010])

  (0, 100)	1.0
  (0, 1164)	1.0
  (1, 100)	1.0
  (1, 1194)	1.0
  (2, 100)	1.0
  (2, 1223)	1.0
  (3, 100)	1.0
  (3, 1224)	1.0
  (4, 100)	1.0
  (4, 1246)	1.0
  (5, 100)	1.0
  (5, 1311)	1.0
  (6, 100)	1.0
  (6, 1347)	1.0
  (7, 100)	1.0
  (7, 1413)	1.0
  (8, 100)	1.0
  (8, 1538)	1.0
  (9, 100)	1.0
  (9, 1771)	1.0


In the cell output above, each test sample contains two non-zero values, for example (0, 100) and (0, 1164). The first column index (100) identifies the user, and the second, after subtracting the total user count (1164 - nbUsers), identifies the movie. Both indices are zero-based, so add 1 to recover the original MovieLens IDs.
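
Because `loadDataset` subtracts 1 from each ID when encoding, the column numbers in the sparse matrix are zero-based. A small hypothetical helper (not part of the lab code) that maps them back to the 1-based IDs used in the MovieLens files:

```python
NB_USERS = 943

def decode_columns(user_col, movie_col):
    """Map zero-based feature columns back to 1-based MovieLens IDs."""
    user_id = user_col + 1
    movie_id = movie_col - NB_USERS + 1
    return user_id, movie_id

print(decode_columns(100, 1164))  # (101, 222)
```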

In [21]:
result = fm_predictor.predict(X_test[1000:1010].toarray())
print(result)

{u'predictions': [{u'score': 0.6599310040473938, u'predicted_label': 1.0}, {u'score': 0.1793009489774704, u'predicted_label': 0.0}, {u'score': 0.2245723009109497, u'predicted_label': 0.0}, {u'score': 0.5920941233634949, u'predicted_label': 1.0}, {u'score': 0.5138829350471497, u'predicted_label': 1.0}, {u'score': 0.1430586725473404, u'predicted_label': 0.0}, {u'score': 0.37031662464141846, u'predicted_label': 0.0}, {u'score': 0.4893360137939453, u'predicted_label': 0.0}, {u'score': 0.34443169832229614, u'predicted_label': 0.0}, {u'score': 0.11674283444881439, u'predicted_label': 0.0}]}
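
Each prediction carries a `score` and a `predicted_label`. In the response above, the label is 1.0 exactly when the score exceeds 0.5, which suggests (but does not prove) a 0.5 decision threshold. One way to turn such a response into recommendations is to rank the candidate movies by score. The sketch below does this for the first four predictions, copied from the output above and paired with their movie columns from the previous cell.

```python
NB_USERS = 943

# First four entries of the response above, with the matching movie columns.
movie_columns = [1164, 1194, 1223, 1224]
predictions = [
    {'score': 0.6599310040473938, 'predicted_label': 1.0},
    {'score': 0.1793009489774704, 'predicted_label': 0.0},
    {'score': 0.2245723009109497, 'predicted_label': 0.0},
    {'score': 0.5920941233634949, 'predicted_label': 1.0},
]

# Convert each zero-based movie column to a 1-based movie ID, then sort by score.
ranked = sorted(
    ((col - NB_USERS + 1, p['score']) for col, p in zip(movie_columns, predictions)),
    key=lambda pair: pair[1],
    reverse=True,
)
for movie_id, score in ranked:
    print(movie_id, round(score, 3))
```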


Let's test a whole batch of data and evaluate our predictive accuracy. We will first look at the predictions on the training set.

In [56]:
import numpy as np

predictions = []
for array in np.array_split(X_train[0:20000].toarray(), 100):
    result = fm_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

In [57]:
import pandas as pd

pd.crosstab(Y_train[0:20000], predictions, rownames=['actuals'], colnames=['predictions'])

predictions   0.0   1.0
actuals
0.0          6580  2451
1.0          2584  8385


Let's now check the predictions on the test set, which was not used in the training process.

In [58]:
import numpy as np

predictions = []
for array in np.array_split(X_test.toarray(), 100):
    result = fm_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

In [59]:
print(X_test.toarray().shape)

(9430, 2625)


In [60]:
print(Y_test.shape)

(9430,)


In [61]:
print(predictions.shape)

(9430,)


In [62]:
pd.crosstab(Y_test, predictions, rownames=['actuals'], colnames=['predictions'])

predictions   0.0   1.0
actuals
0.0          2655  1306
1.0          1606  3863
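
The crosstab can be summarized with the usual classification metrics. The sketch below (assuming Python 3) derives them from the four test-set counts above, treating label 1 as the positive class.

```python
# Counts from the test-set crosstab above (rows: actuals, columns: predictions).
tn, fp = 2655, 1306   # actual 0: predicted 0, predicted 1
fn, tp = 1606, 3863   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, f1))
```

The same calculation on the 20,000-row training sample gives (6580 + 8385) / 20000, or about 0.748, in line with the training accuracy reported near the end of the training log.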


In [63]:
import time
print(time.strftime("%m/%d/%Y %H:%M:%S"))

09/05/2018 07:27:50


## Online demo

Hoang and Alastair developed an online demo based on [Simon's blog](https://medium.com/@julsimon/building-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker-cedbfc8c93d8), which is available at the link below. Their demo also includes training cases that use a bigger data set (20 million).

* http://sagemaker-nab-demo.s3-website-us-west-2.amazonaws.com/

## References

* Original blog: [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/)