# Movie recommendation on Amazon SageMaker with Factorization Machines

### Step SM1: Download ml-100k data  
***We need a data set to train our Factorization Machine. We use the MovieLens ml-100k data set, which contains 100,000 movie ratings given by users.***


In [32]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

--2019-03-04 07:08:32--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.34.235
Connecting to files.grouplens.org (files.grouplens.org)|128.101.34.235|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.1’


2019-03-04 07:08:33 (20.0 MB/s) - ‘ml-100k.zip.1’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating:

### Data Information
*ua.base : data for training*  
*ua.test : data for test/validation*  
*Headers/columns :* ***user id | item id | rating (1-5) | timestamp***
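
As a minimal sketch of that record layout, the snippet below parses one tab-separated line with the standard library. The helper name `parse_rating_line` is hypothetical, the sample line uses the values of the first row of ua.base, and the timestamp is assumed to be Unix seconds in UTC, as the ml-100k README describes.

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Split one 'user id<TAB>item id<TAB>rating<TAB>timestamp' record."""
    user_id, item_id, rating, timestamp = line.rstrip('\n').split('\t')
    return {
        'user_id': int(user_id),
        'item_id': int(item_id),
        'rating': int(rating),
        # Assumption: the timestamp is Unix seconds in UTC
        'rated_at': datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }

record = parse_rating_line('1\t1\t5\t874965758\n')
print(record)
```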

### Step SM2: Let's inspect the rating data

Columns in ua.base and ua.test file:
user id | item id | rating | timestamp

***The code below loads both files into DataFrames and previews ua.base (the display is limited to 5 rows):***

In [1]:
import pandas as pd

In [3]:
train_df = pd.read_csv('./ml-100k/ua.base', sep='\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])
test_df = pd.read_csv('./ml-100k/ua.test', sep='\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])
pd.set_option('display.max_rows', 5)
train_df

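|       | USER_ID | ITEM_ID | RATING | TIMESTAMP |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 1 | 5 | 874965758 |
| 1 | 1 | 2 | 3 | 876893171 |
| ... | ... | ... | ... | ... |
| 90568 | 943 | 1228 | 3 | 888640275 |
| 90569 | 943 | 1330 | 3 | 888692465 |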


### Step SM3: Build training set and test set

***Import necessary modules***

In [4]:
import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer


***Initialize the total number of users and movies in the data set, as well as the number of training and test ratings***

In [5]:
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies

nbRatingsTrain=90570
nbRatingsTest=9430

***For each user, build a list of rated movies. We'd need this to add random negative samples.***  
This is achieved by a dictionary moviesByUser that will look like this:  
```
{
  '0':[875072546,875072441],
  '1':[887431882]
}
```
where each key is a user ID (stored as userId - 1) and each value is the list of movie IDs that user rated (each stored as movieId - 1)  

In [6]:
moviesByUser = {}
for userId in range(nbUsers):
    moviesByUser[str(userId)]=[]
 
with open('./ml-100k/ua.base','r') as f:
    samples=csv.reader(f,delimiter='\t')
    for userId,movieId,rating,timestamp in samples:
        moviesByUser[str(int(userId)-1)].append(int(movieId)-1) 
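
A per-user list like this is what random negative sampling needs: to pick movies a user has not rated, we must know which ones they have. The sketch below shows one possible way to do that on a toy dictionary. The function name `sample_unrated_movies`, the toy data, and the fixed seed are assumptions made for illustration; they are not part of the notebook.

```python
import random

def sample_unrated_movies(movies_by_user, user_key, nb_movies, n, seed=0):
    """Pick up to n zero-based movie indices that the user has not rated."""
    rated = set(movies_by_user[user_key])
    candidates = [m for m in range(nb_movies) if m not in rated]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(candidates, min(n, len(candidates)))

# Toy data in the same shape as moviesByUser (keys and values are zero-based)
toy_movies_by_user = {'0': [0, 1, 4], '1': [2]}
negatives = sample_unrated_movies(toy_movies_by_user, '0', nb_movies=6, n=2)
print(negatives)
```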

### Step SM4: Define method to load dataset

***The data will be loaded into a feature matrix X and a label vector Y***  
X is one-hot encoded: each row concatenates a one-hot encoding of the user ID with a one-hot encoding of the movie ID. It looks like the table below (the row and column labels are shown only for illustration):   

|           | user 1 | user 2 | user 3 | ... | movie 1 | movie 2 | ... |
|   :---:   | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **data0** | 1 |   |   |   |   | 1 |   |
| **data1** |   | 1 |   |   | 1 |   |   |
| ... |   |   |   |   |   |   |   |
 
It is a 2D sparse matrix whose columns are the user IDs followed by the movie IDs, and whose rows are the items in the training/test data set.
Each row represents one rating and contains exactly two 1s: one marks the user ID and the other marks the movie ID that the user rated.   
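
The column arithmetic behind this layout can be sketched in a few lines. The helper name `feature_columns` is hypothetical; the formula mirrors the one used in the notebook's `loadDataset` (user column = userId - 1, movie column = nbUsers + movieId - 1).

```python
def feature_columns(user_id, movie_id, nb_users):
    """Return the two zero-based column indices set to 1 for one rating."""
    user_col = user_id - 1                # user block starts at column 0
    movie_col = nb_users + movie_id - 1   # movie block starts right after the users
    return user_col, movie_col

nb_users, nb_movies = 943, 1682
first = feature_columns(1, 2, nb_users)                # user 1 rated movie 2
last = feature_columns(nb_users, nb_movies, nb_users)  # last user, last movie
print(first, last)
```

The largest index is one less than the number of features (943 + 1682 = 2625), so every user/movie pair fits in a row.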

Label vector Y is a 1D vector containing expected output. It looks like this below (without rows' labels):

| row | label |
| :--- | :---:|
|**data0**| 1 |
|**data1**| 1 |
|**data2**| 0 |
|**data3**| 1 |
| ... |   |


If the user's rating for that movie is 4 or 5, the label is 1; otherwise it is 0. Each element of Y corresponds to one row of X.


In [7]:
def loadDataset(filename, lines, columns):

    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter='\t')
        for userId,movieId,rating,timestamp in samples:
            X[line,int(userId)-1] = 1
            X[line,int(nbUsers)+int(movieId)-1] = 1
            if int(rating) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1
            
    Y=np.array(Y).astype('float32')
    
    return X,Y

***Now that we have defined the loadDataset method, let's load both the training and test data***

In [8]:
X_train, Y_train = loadDataset('./ml-100k/ua.base', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('./ml-100k/ua.test',nbRatingsTest,nbFeatures)

***Let's examine the dimensions of X and Y vectors***

In [9]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain, nbFeatures)
assert Y_train.shape == (nbRatingsTrain, )
# count_nonzero counts the labels equal to 1; the remainder are zeros
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nbRatingsTrain-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nbRatingsTest, nbFeatures)
assert Y_test.shape  == (nbRatingsTest, )
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nbRatingsTest-one_labels, one_labels))

(90570, 2625)
(90570,)
Training labels: 40664 zeros, 49906 ones
(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones
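
These counts give a useful reference point for the accuracy reported during training. As a rough sketch, a classifier that always predicts the larger class would be right for 49,906 of the 90,570 training ratings and 5,469 of the 9,430 test ratings. The helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(larger_class_count, total_count):
    """Accuracy of always predicting the larger of the two classes."""
    return larger_class_count / total_count

train_baseline = majority_baseline(49906, 90570)
test_baseline = majority_baseline(5469, 9430)
print("train: %.3f, test: %.3f" % (train_baseline, test_baseline))
```

That is about 55% on the training set and 58% on the test set, so a trained model should score clearly above these values.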


### Step SM5: Convert to protobuf and save to S3
**IMPORTANT:** ***Remember to change the S3 bucket name below to the name of your own S3 bucket***

In [10]:
bucket = 'product-recommendation-personalize'
prefix = 'sagemaker/fm-movielens-binary-basics'

train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train3')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test3')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [11]:
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

s3://product-recommendation-personalize/sagemaker/fm-movielens-binary-basics/train3/train.protobuf
s3://product-recommendation-personalize/sagemaker/fm-movielens-binary-basics/test3/test.protobuf
Output: s3://product-recommendation-personalize/sagemaker/fm-movielens-binary-basics/output


### Step SM6: Run training job
***We are done with the data preparation part. Let's begin training our Factorization Machine model.***  
***SageMaker provides both the container and built-in algorithm to run the training and inference.***

Below are the container images for the built-in Factorization Machines algorithm in SageMaker, listed per region:

In [15]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest',
              'ap-southeast-1': '475088953585.dkr.ecr.ap-southeast-1.amazonaws.com/factorization-machines:latest'}
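
The dictionary only covers five regions, so a plain `containers[region]` lookup raises a bare `KeyError` anywhere else. The sketch below wraps the lookup to give a clearer message. The helper name `image_for_region` is hypothetical, and the sample dictionary copies just two of the entries above to stay self-contained.

```python
def image_for_region(containers, region):
    """Return the training image URI for a region, or raise a descriptive error."""
    try:
        return containers[region]
    except KeyError:
        supported = ', '.join(sorted(containers))
        raise ValueError(
            "No Factorization Machines image listed for region %r; "
            "listed regions: %s" % (region, supported))

# Two entries copied from the dictionary above
sample_containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
    'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
}
print(image_for_region(sample_containers, 'us-east-1'))
```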

***Behind the scenes, SageMaker provisions a container to run the training and terminates it after the training job finishes. Metrics collected during training, including accuracy, are posted to CloudWatch Metrics.***    

Note: If you prefer a GUI (Graphical User Interface), you can run the training from the AWS Console too. We can interact with AWS in 3 ways: the AWS Console (GUI), the CLI, and the SDK. For this lab, we are using the SDK. After you run the step below, you can open https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs (change the region as necessary) to see the running training job.

In [19]:
fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nbFeatures,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=10)

fm.fit({'train': train_data, 'test': test_data})

INFO:sagemaker:Creating training-job with name: factorization-machines-2019-03-03-15-14-17-123


2019-03-03 15:14:17 Starting - Starting the training job...
2019-03-03 15:14:18 Starting - Launching requested ML instances......
2019-03-03 15:15:22 Starting - Preparing the instances for training...
2019-03-03 15:16:19 Downloading - Downloading input data
2019-03-03 15:16:19 Training - Downloading the training image....
Docker entrypoint called with argument(s): train
[03/03/2019 15:16:53 INFO 140396320343872] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_batch_size': u'1000', u'_use_

[03/03/2019 15:16:57 INFO 140396320343872] #quality_metric: host=algo-1, epoch=5, train binary_classification_accuracy <score>=0.601516483516
[03/03/2019 15:16:57 INFO 140396320343872] #quality_metric: host=algo-1, epoch=5, train binary_classification_cross_entropy <loss>=0.666080986274
[03/03/2019 15:16:57 INFO 140396320343872] #quality_metric: host=algo-1, epoch=5, train binary_f_1.000 <score>=0.713719545892
#metrics {"Metrics": {"update.time": {"count": 1, "max": 525.2301692962646, "sum": 525.2301692962646, "min": 525.2301692962646}}, "EndTime": 1551626217.407132, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1551626216.8815}
[03/03/2019 15:16:57 INFO 140396320343872] #progress_metric: host=algo-1, completed 6 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset": {"cou


2019-03-03 15:16:52 Training - Training image download completed. Training in progress.
[03/03/2019 15:17:07 INFO 140396320343872] #quality_metric: host=algo-1, epoch=23, train binary_classification_accuracy <score>=0.683395604396
[03/03/2019 15:17:07 INFO 140396320343872] #quality_metric: host=algo-1, epoch=23, train binary_classification_cross_entropy <loss>=0.63193313431
[03/03/2019 15:17:07 INFO 140396320343872] #quality_metric: host=algo-1, epoch=23, train binary_f_1.000 <score>=0.74930170636
#metrics {"Metrics": {"update.time": {"count": 1, "max": 512.8679275512695, "sum": 512.8679275512695, "min": 512.8679275512695}}, "EndTime": 1551626227.27941, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1551626226.766136}
[03/03/2019 15:17:07 INFO 140396320343872] #progress_metric: host=algo-1, completed 24 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {

[03/03/2019 15:17:17 INFO 140396320343872] #quality_metric: host=algo-1, epoch=42, train binary_classification_accuracy <score>=0.71521978022
[03/03/2019 15:17:17 INFO 140396320343872] #quality_metric: host=algo-1, epoch=42, train binary_classification_cross_entropy <loss>=0.607239904383
[03/03/2019 15:17:17 INFO 140396320343872] #quality_metric: host=algo-1, epoch=42, train binary_f_1.000 <score>=0.764821721888
#metrics {"Metrics": {"update.time": {"count": 1, "max": 504.9450397491455, "sum": 504.9450397491455, "min": 504.9450397491455}}, "EndTime": 1551626237.326794, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1551626236.821329}
[03/03/2019 15:17:17 INFO 140396320343872] #progress_metric: host=algo-1, completed 43 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset": 

[03/03/2019 15:17:27 INFO 140396320343872] #quality_metric: host=algo-1, epoch=61, train binary_classification_accuracy <score>=0.729175824176
[03/03/2019 15:17:27 INFO 140396320343872] #quality_metric: host=algo-1, epoch=61, train binary_classification_cross_entropy <loss>=0.585964433943
[03/03/2019 15:17:27 INFO 140396320343872] #quality_metric: host=algo-1, epoch=61, train binary_f_1.000 <score>=0.769183220477
#metrics {"Metrics": {"update.time": {"count": 1, "max": 543.4520244598389, "sum": 543.4520244598389, "min": 543.4520244598389}}, "EndTime": 1551626247.424182, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1551626246.880276}
[03/03/2019 15:17:27 INFO 140396320343872] #progress_metric: host=algo-1, completed 62 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 91, "sum": 91.0, "min": 91}, "Number of Batches Since Last Reset":


2019-03-03 15:17:49 Uploading - Uploading generated training model
[03/03/2019 15:17:37 INFO 140396320343872] #quality_metric: host=algo-1, epoch=80, train binary_classification_accuracy <score>=0.735483516484
[03/03/2019 15:17:37 INFO 140396320343872] #quality_metric: host=algo-1, epoch=80, train binary_classification_cross_entropy <loss>=0.568705273974
[03/03/2019 15:17:37 INFO 140396320343872] #quality_metric: host=algo-1, epoch=80, train binary_f_1.000 <score>=0.771459767387
#metrics {"Metrics": {"update.time": {"count": 1, "max": 514.9531364440918, "sum": 514.9531364440918, "min": 514.9531364440918}}, "EndTime": 1551626257.426866, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1551626256.911452}
[03/03/2019 15:17:37 INFO 140396320343872] #progress_metric: host=algo-1, completed 81 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max":

[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, epoch=99, train binary_classification_accuracy <score>=0.740362637363
[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, epoch=99, train binary_classification_cross_entropy <loss>=0.554843309675
[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, epoch=99, train binary_f_1.000 <score>=0.774027563913
[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, train binary_classification_accuracy <score>=0.740362637363
[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, train binary_classification_cross_entropy <loss>=0.554843309675
[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, train binary_f_1.000 <score>=0.774027563913
#metrics {"Metrics": {"update.time": {"count": 1, "max": 514.0359401702881, "sum": 514.0359401702881, "min": 514.0359401702881}}, "EndTim

***If the training was successful, you will see 'Training job completed' at the end of the output. Scroll up to see the training and test accuracy.***    

***After the training phase completes, the model parameters are stored in S3 (in the output path you specified). You can check that location in your S3 bucket to inspect what the training job output looks like.***
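
Scrolling through the log to find the accuracy is tedious, so here is a minimal sketch that pulls the per-epoch training accuracy out of `#quality_metric` lines with a regular expression. The function name `accuracy_by_epoch` is hypothetical, and the pattern is an assumption based only on the line format visible in the output above; the sample lines are copied from that output.

```python
import re

# Pattern inferred from the log lines shown in this notebook
METRIC_RE = re.compile(
    r"#quality_metric: host=\S+, epoch=(\d+), "
    r"train binary_classification_accuracy <score>=([0-9.]+)")

def accuracy_by_epoch(log_lines):
    """Map epoch number to training accuracy for every matching log line."""
    accuracy = {}
    for line in log_lines:
        match = METRIC_RE.search(line)
        if match:
            accuracy[int(match.group(1))] = float(match.group(2))
    return accuracy

sample_log = [
    "[03/03/2019 15:16:57 INFO 140396320343872] #quality_metric: host=algo-1, epoch=5, train binary_classification_accuracy <score>=0.601516483516",
    "[03/03/2019 15:16:57 INFO 140396320343872] #quality_metric: host=algo-1, epoch=5, train binary_classification_cross_entropy <loss>=0.666080986274",
    "[03/03/2019 15:17:47 INFO 140396320343872] #quality_metric: host=algo-1, epoch=99, train binary_classification_accuracy <score>=0.740362637363",
]
print(accuracy_by_epoch(sample_log))
```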

### Step SM7: Deploy model

***Now, let's deploy the model for inference using the SageMaker SDK. It will spin up a new virtual machine with a container that runs the algorithm for inference, and give us an API endpoint.***

In [20]:
fm_predictor = fm.deploy(instance_type='ml.t2.medium', initial_instance_count=1)

INFO:sagemaker:Creating model with name: factorization-machines-2019-03-03-15-20-51-428
INFO:sagemaker:Creating endpoint with name factorization-machines-2019-03-03-15-14-17-123


--------------------------------------------------------------------------!

### Step SM8: Run predictions

***After the model is deployed and given an endpoint, we can run the prediction / inference.***  
Below we define the serializer and deserializer for the prediction request/response data

In [12]:
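def fm_serializer(data):
    # Wrap each dense feature row in the JSON request format:
    # {"instances": [{"features": [...]}, ...]}
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)

# Send requests as JSON and parse the JSON response into a Python dict
fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer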

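
The request and response shapes can be illustrated without a live endpoint. The sketch below builds a request body in the same `instances`/`features` layout that `fm_serializer` produces, and reads the `score` and `predicted_label` fields that this notebook reads from the response. The helper names `build_request` and `read_predictions` are hypothetical, and the response values are made up for illustration.

```python
import json

def build_request(rows):
    """Serialize dense feature rows (lists of numbers) into the request body."""
    return json.dumps({'instances': [{'features': list(row)} for row in rows]})

def read_predictions(response):
    """Extract (score, predicted_label) pairs from a parsed response dict."""
    return [(p['score'], p['predicted_label']) for p in response['predictions']]

# Toy input: 2 rows with 5 features each (the real rows have 2625 columns)
payload = build_request([[1, 0, 0, 1, 0], [0, 1, 0, 0, 1]])
print(payload)

# Illustrative response values, not real endpoint output
sample_response = {'predictions': [{'score': 0.78, 'predicted_label': 1.0},
                                   {'score': 0.38, 'predicted_label': 0.0}]}
print(read_predictions(sample_response))
```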

***Let's test the prediction with some data from the test set***

In [22]:
index_from = 900
index_to = 910
result = fm_predictor.predict(X_test[index_from:index_to].toarray())

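***Display the predictions in a table and compare them against the actual labels from the test set.***  
A label of 1 means a rating of 4 or 5. Scores far from the 0.5 decision boundary are the most confident, but they are not always right: in the sample below, two of the ten predictions are wrong (scores 0.75 and 0.38).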

In [23]:
!pip install tabulate
import tabulate
from IPython.display import HTML, display

scores, predicted_rating = ['Score'], ['Predicted Rating']
for r in result['predictions']:
    scores.append("%.2f" % r['score'])
    predicted_rating.append(r['predicted_label'])


table = [scores, predicted_rating, ['Actual Rating'] + Y_test[index_from:index_to].tolist() ]
display(HTML(tabulate.tabulate(table, tablefmt='html')))


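| Test sample | 900 | 901 | 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Score | 0.78 | 0.86 | 0.75 | 0.38 | 0.37 | 0.77 | 0.85 | 0.82 | 0.69 | 0.46 |
| Predicted Rating | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| Actual Rating | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |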
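
To quantify how well these ten predictions match, the sketch below recomputes the accuracy from the values in the table above. The numbers are copied from that output; the variable names are chosen for this example.

```python
scores    = [0.78, 0.86, 0.75, 0.38, 0.37, 0.77, 0.85, 0.82, 0.69, 0.46]
predicted = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
actual    = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]

# Positions where the predicted label differs from the actual label
wrong = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
accuracy = 1 - len(wrong) / len(actual)
print("accuracy: %.1f" % accuracy)
print("misclassified scores:", [scores[i] for i in wrong])
```

Ten samples are far too few to judge the model, but the same comparison applies to the whole test set.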


### Step SM9: Get Movie Recommendations

***After testing the prediction, let's get real movie recommendations for a particular user***  
First, let's prepare a dictionary that maps each movie ID to its title. We use the u.item file, which contains the movies' details.

In [25]:
movies = {}
# u.item is ISO-8859-1 (Latin-1) encoded; some titles contain accented characters
with open('./ml-100k/u.item','r', encoding='ISO-8859-1') as f:
    samples=csv.reader(f,delimiter='|')
    for movieId,m_title,r_date,video_r_date,imdb_URL,unkwn,act,adv,anm,kid,cmd,crime,doc,drama,fantasy,f_noir,horror,msc,myst,rom,sfy,thriller,war,west in samples:
        movies[int(movieId)] = m_title

***Define some parameters***  
userId = the ID of the user who needs the recommendations (an integer from 1 to nbUsers)  
score_threshold = cut-off score. A value nearer to 1 means that we only consider strong predictions. The minimum is 0.5.  
maximum_recommendations = maximum number of movies to recommend. The actual result may contain fewer if not many movies are strongly recommended.

In [None]:
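def find_top_rated_movies(user_id, k):
    # Return the k highest-rated training rows for this user
    user_ratings = train_df[train_df['USER_ID'] == user_id]
    return user_ratings.sort_values('RATING', ascending=False).head(k)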

In [29]:
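# Valid user IDs are 1..nbUsers (943). A larger value such as 1000 does not
# select a user: column userId-1 then falls in the movie block of the feature vector.
userId = 1000
score_threshold = 0.50
maximum_recommendations = 20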

***Run predictions for all movies for this particular user and sort the output based on score***

In [30]:
recommended_movies=[]
# Movie IDs are 1-based, matching loadDataset and the keys of the movies dictionary
for movieId in range(1, nbMovies+1):
    test_input = lil_matrix((1, nbFeatures)).astype('float32')
    test_input[0, int(userId)-1] = 1
    test_input[0, nbUsers+int(movieId)-1] = 1
    result = fm_predictor.predict(test_input.toarray())
    result_label, result_score = int(result['predictions'][0]['predicted_label']), float(result['predictions'][0]['score'])
    if (result_label == 1) and (result_score > score_threshold):
        recommended_movies.append([int(movieId),result_score])

# Sort by score, highest first
def getVal(item):
    return item[1]
recommended_movies = sorted(recommended_movies,key=getVal,reverse=True)
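
The filter-then-rank step can be tried without an endpoint. The sketch below applies the same threshold and maximum-count rules to a few made-up scores, using `heapq.nlargest` instead of a full sort as one possible alternative. The function name `top_recommendations` and the toy scores are assumptions for illustration.

```python
import heapq

def top_recommendations(scored_movies, score_threshold, maximum_recommendations):
    """Keep (movie_id, score) pairs above the threshold, best scores first."""
    strong = [pair for pair in scored_movies if pair[1] > score_threshold]
    return heapq.nlargest(maximum_recommendations, strong, key=lambda pair: pair[1])

# Hypothetical (movie_id, score) pairs, not real endpoint output
toy_scores = [(10, 0.42), (11, 0.73), (12, 0.51), (13, 0.69), (14, 0.50)]
top = top_recommendations(toy_scores, score_threshold=0.50, maximum_recommendations=2)
print(top)
```

A score exactly equal to the threshold is excluded, which matches the strict `>` comparison in the loop above.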



***Print out the result of top recommended movies***

In [31]:
output_table = [['<strong>Movie Title</strong>','<strong>Score</strong>']]
for i in range(min(maximum_recommendations,len(recommended_movies))):
    output_table.append([movies[int(recommended_movies[i][0])],recommended_movies[i][1]])

display(HTML(tabulate.tabulate(output_table, tablefmt='html')))

| Movie Title | Score |
| :--- | :---: |
| Dante's Peak (1997) | 0.735660433769 |
| Volcano (1997) | 0.714289307594 |
| Conspiracy Theory (1997) | 0.703901529312 |
| Air Force One (1997) | 0.696575164795 |
| Broken Arrow (1996) | 0.695904254913 |
| Net, The (1995) | 0.694929182529 |
| Twister (1996) | 0.692556977272 |
| Murder at 1600 (1997) | 0.687928795815 |
| Saint, The (1997) | 0.685971915722 |


***Do you think that the recommended movies are similar?***