# Movie recommendation on Amazon SageMaker with Factorization Machines

In [1]:
import pandas as pd

### Step SM1: Download ml-100k data  
***The data sets are needed to train our Factorization Machine. We use the 100,000 movie ratings given by users from MovieLens data sets. This dataset has been pre curated on the Spark-preprocessing notebook and uploaded to an S3 bucket***

### Data Information
*ua.base (train.csv): data for training*  
*ua.test (test.csv): data for test/validation*  
*Headers/columns :* ***user id | item id | rating (1-5) | timestamp | rating_b (0,1)***

In [2]:
!aws s3 ls s3://pedro-spark-sagemaker/raw/train/

2019-05-30 16:56:09          0 _SUCCESS
2019-05-30 16:56:09    1973623 part-00000-b35d2537-f655-4056-a3f7-e2c49a0b0a9a-c000.csv


In [11]:
!aws s3 ls s3://pedro-spark-sagemaker/raw/test/

2019-05-30 17:14:30          0 _SUCCESS
2019-05-30 17:14:30     205513 part-00000-f5786d74-515f-4b6d-b42a-dd5612d611b1-c000.csv


In [12]:
!aws s3 cp s3://pedro-spark-sagemaker/raw/train/part-00000-b35d2537-f655-4056-a3f7-e2c49a0b0a9a-c000.csv train.csv

Completed 256.0 KiB/1.9 MiB (3.6 MiB/s) with 1 file(s) remainingCompleted 512.0 KiB/1.9 MiB (6.9 MiB/s) with 1 file(s) remainingCompleted 768.0 KiB/1.9 MiB (9.9 MiB/s) with 1 file(s) remainingCompleted 1.0 MiB/1.9 MiB (12.7 MiB/s) with 1 file(s) remaining Completed 1.2 MiB/1.9 MiB (15.4 MiB/s) with 1 file(s) remaining Completed 1.5 MiB/1.9 MiB (18.0 MiB/s) with 1 file(s) remaining Completed 1.8 MiB/1.9 MiB (20.6 MiB/s) with 1 file(s) remaining Completed 1.9 MiB/1.9 MiB (22.0 MiB/s) with 1 file(s) remaining download: s3://pedro-spark-sagemaker/raw/train/part-00000-b35d2537-f655-4056-a3f7-e2c49a0b0a9a-c000.csv to ./train.csv


In [14]:
!aws s3 cp s3://pedro-spark-sagemaker/raw/test/part-00000-f5786d74-515f-4b6d-b42a-dd5612d611b1-c000.csv test.csv

Completed 200.7 KiB/200.7 KiB (4.2 MiB/s) with 1 file(s) remainingdownload: s3://pedro-spark-sagemaker/raw/test/part-00000-f5786d74-515f-4b6d-b42a-dd5612d611b1-c000.csv to ./test.csv


In [17]:
import pandas as pd
import s3fs

df= pd.read_csv('train.csv',names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP','RATING_B'])

In [25]:
df_test= pd.read_csv('test.csv',names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP','RATING_B'])

In [26]:
df_test

Unnamed: 0,USER_ID,ITEM_ID,RATING,TIMESTAMP,RATING_B
0,1,33,4,878542699,1
1,1,61,4,878542420,1
...,...,...,...,...,...
9427,943,808,4,888639868,1
9428,943,1067,2,875501756,0


In [18]:
pd.set_option('display.max_rows', 5)
df

Unnamed: 0,USER_ID,ITEM_ID,RATING,TIMESTAMP,RATING_B
0,1,2,3,876893171,0
1,1,3,4,878542960,1
...,...,...,...,...,...
90567,943,1228,3,888640275,0
90568,943,1330,3,888692465,0


### Step SM3: Build training set and test set
***Import necessary modules***

In [19]:
import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer

***Set S3 bucket and prefix***

In [31]:
%%time

role = get_execution_role()
print(role)
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'factorization-machine-sagemaker'

arn:aws:iam::349934754982:role/service-role/AmazonSageMaker-ExecutionRole-20190509T114602
CPU times: user 149 ms, sys: 8.32 ms, total: 157 ms
Wall time: 3.82 s


***Initialize number of total users and movies in data set, as well as number of train and test data***

In [20]:
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies

nbRatingsTrain=90570
nbRatingsTest=9430

### Step SM4: Define method to load dataset

***The data will be loaded into 2 vectors: feature vector X and label vector Y***  
Feature vector X is a one-hot encoded vector that sticks and flattens user Ids and movie Ids together. It should look like this below (without rows' and columns' labels):   

|<pre></pre>| 1 	    |      2    |    3  	|<pre>...</pre>| 1  | 2  |<pre>...</pre>|
|   :---:   |:---:      |    :---:	|   :---:	|  :---:	|  :---:	 |  :---:     |   :---:	  |
| **data0** | 1 	    |<pre></pre>|<pre></pre>|<pre></pre>|<pre></pre> |<pre>1</pre>|<pre></pre>|
| **data1** |<pre></pre>| 1 	    |<pre></pre>|<pre></pre>|<pre>1</pre>|<pre></pre> |<pre></pre>|
|<pre>...</pre>|<pre></pre>|<pre></pre>|<pre></pre>|<pre></pre>|<pre></pre> |<pre></pre> |<pre></pre>|
 
It is a 2D sparse matrix where columns are user Ids and movie Ids, and rows are data items in the training/test data set.
One row represents 1 training/test data that has 2 ones (1s) that mark the user Id and movie Id that he/she rated.   

Label vector Y is a 1D vector containing expected output. It looks like this below (without rows' labels):

|<pre></pre>|<pre></pre>|
| :--- | :---:|
|**data0**| 1 |
|**data1**| 1 |
|**data2**| 0 |
|**data3**| 1 |
|<pre>...</pre>|<pre></pre>|


If user's rating for that movie is 4 or 5, then value is 1, otherwise 0. Each element corresponds to one data.


In [50]:

def loadDataset(filename, lines, columns):

    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines-1, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter=',')
        for userId,movieId,rating,timestamp,rating_b in samples:
            if line==0:
                Y.append(rating_b)
                line=line+1
            else:
                Y.append(rating_b)
                X[line-1,int(userId)-1] = 1
                X[line-1,int(nbUsers)+int(movieId)-1] = 1
                line=line+1
            
    Y=np.array(Y).astype('float32')
    
    return X,Y



***Now that we have defined the loadDataset method, lets load both training and test data***

In [28]:
X_train, Y_train = loadDataset('train.csv', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('test.csv',nbRatingsTest,nbFeatures)

***Let's examine the dimensions of X and Y vectors***

In [29]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain-1, nbFeatures)
assert Y_train.shape == (nbRatingsTrain-1, )
zero_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (zero_labels, nbRatingsTrain-zero_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nbRatingsTest-1, nbFeatures)
assert Y_test.shape  == (nbRatingsTest-1, )
zero_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (zero_labels, nbRatingsTest-zero_labels))

(90569, 2625)
(90569,)
Training labels: 49905 zeros, 40665 ones
(9429, 2625)
(9429,)
Test labels: 5468 zeros, 3962 ones


### Step SM5: Convert to protobuf and save to S3

In [32]:
train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train3')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test3')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [33]:
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/train3/train.protobuf
s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/test3/test.protobuf
Output: s3://sagemaker-ap-southeast-1-349934754982/factorization-machine-sagemaker/output


### Step SM6: Run training job
***We are done with the data preparation part. Let's begin training our Factorization Machine model.***  
***SageMaker provides both the container and built-in algorithm to run the training and inference.***

In [34]:
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest',
              'ap-southeast-1': '475088953585.dkr.ecr.ap-southeast-1.amazonaws.com/factorization-machines:latest'}

***Behing the scene, SageMaker provisions a container to run the training, and terminate it after training job succeeds. Metrics during training, including accuracy are posted to CloudWatch Metrics.***    

Note: If you like GUI (Graphical User Interface), you can execute the training via AWS Console too. Basically we can interact with AWS in 3 ways: AWS Console (GUI), CLI, and SDK. For this lab, we are using SDK. You can inspect https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs (change the region as necessary) to see the running training job after you run the step below

In [35]:
fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nbFeatures,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=10)

fm.fit({'train': train_data, 'test': test_data})

2019-05-30 17:22:41 Starting - Starting the training job...
2019-05-30 17:22:42 Starting - Launching requested ML instances.........
2019-05-30 17:24:17 Starting - Preparing the instances for training......
2019-05-30 17:25:19 Downloading - Downloading input data...
2019-05-30 17:26:12 Training - Training image download completed. Training in progress.
2019-05-30 17:26:12 Uploading - Uploading generated training model
[31mDocker entrypoint called with argument(s): train[0m
[31m[05/30/2019 17:26:03 INFO 140290400139072] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'b

[31m[2019-05-30 17:26:08.529] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 16, "duration": 536, "num_examples": 91, "num_bytes": 5796400}[0m
[31m[05/30/2019 17:26:08 INFO 140290400139072] #quality_metric: host=algo-1, epoch=7, train binary_classification_accuracy <score>=0.579494505495[0m
[31m[05/30/2019 17:26:08 INFO 140290400139072] #quality_metric: host=algo-1, epoch=7, train binary_classification_cross_entropy <loss>=0.674279990395[0m
[31m[05/30/2019 17:26:08 INFO 140290400139072] #quality_metric: host=algo-1, epoch=7, train binary_f_1.000 <score>=0.702404653767[0m
[31m#metrics {"Metrics": {"update.time": {"count": 1, "max": 538.672924041748, "sum": 538.672924041748, "min": 538.672924041748}}, "EndTime": 1559237168.530466, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1559237167.991013}
[0m
[31m[05/30/2019 17:26:08 INFO 140290400139072] #progress_metric: host=algo-1, 


2019-05-30 17:26:19 Completed - Training job completed
Billable seconds: 60


***If the training was successful, you wil see 'Training job completed' at the end of the output. Scroll up to see the  train and test accuracy***    

***After training phase completed, we have the model parameters stored in S3 (in the output path you specified). You can check your S3 bucket that contains the output to inspect how the training job output looks like***

### Step SM7: Deploy model

***Now, let's deploy the model for inference using SageMaker SDK. It will spin-up a new virtual machine with container containing algorithm for inference. It will give us an API endpoint for inference.***

In [36]:
fm_predictor = fm.deploy(instance_type='ml.t2.medium', initial_instance_count=1)

---------------------------------------------------------------------------------------------------------------!

### Step SM8: Run predictions

***After the model is deployed and given an endpoint, we can run the prediction / inference.***  
Below we define the serializer and deserializer for the prediction request/response data

In [53]:
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer

***Let's test the prediction with some data from the test set***

In [54]:
index_from = 900
index_to = 910
result = fm_predictor.predict(X_test[index_from:index_to].toarray())

***Display the prediction in pretty table, being compared againts the actual rating (label) from the test set.***.      


In [55]:
!pip install tabulate
import tabulate
from IPython.display import HTML, display

scores, predicted_rating = ['Score'], ['Predicted Rating']
for r in result['predictions']:
    scores.append("%.2f" % r['score'])
    predicted_rating.append(r['predicted_label'])


table = [scores, predicted_rating, ['Actual Rating'] + Y_test[index_from:index_to].tolist() ]
display(HTML(tabulate.tabulate(table, tablefmt='html')))

[33mYou are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


0,1,2,3,4,5,6,7,8,9,10
Score,0.61,0.53,0.56,0.55,0.59,0.64,0.59,0.53,0.54,0.53
Predicted Rating,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Actual Rating,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0


### Step SM9: Get Movies Recommendation

***After testing the prediction, let's get real movies recommendation for a particular user***    
First, let's prepare a dictionary that maps movie ID to its title. We use the u.item data containing movies' details

In [56]:
userId = 344
score_threshold = 0.50
maximum_recommendations = 20

In [41]:
movies = {}
with open('./ml-100k/u.item','r', encoding = "ISO-8859-1") as f:
    samples=csv.reader(f,delimiter='|')
    for movieId,m_title,r_date,video_r_date,imdb_URL,unkwn,act,adv,anm,kid,cmd,crime,doc,drama,fantasy,f_noir,horror,msc,myst,rom,sfy,thriller,war,west in samples:
        movies[int(movieId)] = m_title

***Run predictions for all movies for this particular user and sort the output based on score***

In [42]:
recommended_movies=[]
for movieId in range(nbMovies):
    test_input = lil_matrix((1, nbFeatures)).astype('float32')
    test_input[0, int(userId)-1] = 1
    test_input[0, nbUsers+int(movieId)-1] = 1
    result = fm_predictor.predict(test_input.toarray())
    result_label, result_score = int(result['predictions'][0]['predicted_label']), float(result['predictions'][0]['score'])
    if (result_label == 1) and (result_score > score_threshold):
        recommended_movies.append([int(movieId),result_score])
        
def getVal(item):
    return item[1]
recommended_movies = sorted(recommended_movies,key=getVal,reverse=True)

***Print out the result of top recommended movies***

In [43]:
output_table = [['<strong>Movie Title</strong>','<strong>Score</strong>']]
for i in range(min(maximum_recommendations,len(recommended_movies))):
    output_table.append([movies[int(recommended_movies[i][0])],recommended_movies[i][1]])

display(HTML(tabulate.tabulate(output_table, tablefmt='html')))

0,1
Movie Title,Score
"Princess Bride, The (1987)",0.6682114005088806
Fargo (1996),0.6564514636993408
Return of the Jedi (1983),0.6532304286956787
Raiders of the Lost Ark (1981),0.6506448984146118
Aliens (1986),0.6488392949104309
Snow White and the Seven Dwarfs (1937),0.6474416851997375
"Shining, The (1980)",0.6474149823188782
"Maltese Falcon, The (1941)",0.6462321877479553
"Clockwork Orange, A (1971)",0.6449532508850098


In [61]:
moviesByUser = {}
for userId in range(nbUsers):
    moviesByUser[str(userId)]=[]
 
with open('train.csv','r') as f:
    samples=csv.reader(f,delimiter=',')
    for userId,movieId,rating,timestamp, rating_b in samples:
        moviesByUser[str(int(userId)-1)].append([int(movieId)-1,rating]) 

***Compare the recommendation with the top 20 movies that are actually rated by that particular user, sorted from the highest rating***

In [62]:
def find_top_rated_movies(user_id, k):
    rated_movies = moviesByUser[str(int(user_id)-1)]
    rated_movies = sorted(rated_movies,key=getVal,reverse=True)
    results = []
    
    for movie in rated_movies:
        results.append([movies[int(movie[0]+1)],movie[1]])
    return results[0:k]

output_table = [['<strong>Movie Title</strong>','<strong>Actual Rating</strong>']]
for m in find_top_rated_movies(userId,20):
    output_table.append(m)

display(HTML(tabulate.tabulate(output_table, tablefmt='html')))


0,1
Movie Title,Actual Rating
GoldenEye (1995),5
"Usual Suspects, The (1995)",5
Clerks (1994),5
"Professional, The (1994)",5
Pulp Fiction (1994),5
"Shawshank Redemption, The (1994)",5
Forrest Gump (1994),5
"Fugitive, The (1993)",5
True Romance (1993),5
