# Building a Movie Recommendation System using SageMaker and Factorization Machines

In this workshop you will build a movie recommendation system using Factorization Machines. You will learn how SageMaker helps you build, train, and deploy models at scale. It lets you fast-track many of the troublesome tasks involved in productionising a model, so you can focus on the actual development and value. Before we get started, here is a short introduction to Factorization Machines. Ask us as many questions as you wish; we won't bite.

## Factorization Machines 
Factorization Machines are one of the newer crazes in the supervised learning world. As an extension of the linear model, they can be used for both classification and regression problems. One of their key benefits is the ability to deal with high-dimensional sparse data. A matrix of user ratings for movies, where most users have rated only a few movies, is a good example of such sparse data.

If you wish to learn more about Factorization Machines, there are plenty of mathematical resources out there, and you will be surprised how simple the whole thing is.
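To give a feel for how simple the model is, here is a minimal sketch of a standard second-order factorization machine score in plain Python. It is an illustration only, not the SageMaker implementation; the function name `fm_predict` and all weights are made up for the example.

```python
import math

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear weights
    V  : n rows of k latent factors (one row per feature)
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    # Pairwise interactions via the O(n * k) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def sigmoid(z):
    """Squash a raw score into (0, 1) for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: 2 users + 3 movies = 5 one-hot features, k = 2 factors.
x = [1, 0, 0, 1, 0]  # user 1 paired with movie 2
w0 = 0.1
w = [0.2, -0.1, 0.0, 0.3, 0.05]
V = [[0.1, 0.2], [0.0, 0.1], [0.3, -0.2], [0.4, 0.5], [-0.1, 0.2]]

score = fm_predict(x, w0, w, V)
prob = sigmoid(score)
print(score, prob)
```

With one-hot user and movie features, the pairwise term reduces to the dot product of that user's and that movie's latent factors, which is why the model copes well with sparse rating data.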

# Let's get started!
First, create a username for yourself. It will be used to run the website and to fetch your recommendations rather than someone else's. Create your username as "{your_initials}_{4 random digits}"; see the example below:

In [None]:
#"jy_1234" < example
username = "jy_1234"

You can use the <b><i>conda_amazonei_mxnet_p36</i></b> kernel

First we will load a few general-purpose packages and those required by SageMaker.

In [None]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from scipy.sparse import lil_matrix
import io

import boto3
import s3fs #!pip install s3fs

import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri



## Getting and Setting Permission
In order to use SageMaker you will need to initiate a few sessions and get your execution role. We create the SageMaker session here and use it later when we train our model. We also create the S3 resource, which we will use to upload our files to our S3 bucket.

In [None]:
role = get_execution_role()
current_region = boto3.Session().region_name
s3 = boto3.resource('s3')
sm_sess = sagemaker.Session()

## Reading in Data
In this workshop we will be using the 100k MovieLens dataset. The MovieLens data was collected by the GroupLens Research Project from the MovieLens website. The data contains users (and their demographic information), movies, and the ratings that users gave to movies.

Summary:
* ~100,000 ratings made by 943 users
* Each user has rated at least 20 movies

From the 100k dataset, we will be using the tab-delimited ua.base and ua.test files. ua.base will be our training set and ua.test our test set.
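As a small illustration of the file layout, the sketch below parses a few tab-delimited rows with the standard library. Each line in the MovieLens 100k rating files holds a user id, a movie id, a rating, and a timestamp; the sample rows here are made up for the example and are not copied from the dataset.

```python
import csv
import io

# Illustrative rows only: user id, movie id, rating, timestamp (tab-separated)
sample = "1\t1\t5\t874965758\n1\t2\t3\t876893171\n2\t1\t4\t888550871\n"

rows = []
for user_id, movie_id, rating, _timestamp in csv.reader(io.StringIO(sample), delimiter="\t"):
    # Keep the three columns the workshop uses and ignore the timestamp
    rows.append({"user_id": int(user_id), "movie_id": int(movie_id), "rating": float(rating)})

print(rows)
```

The workshop itself loads the same three columns with pandas, reading directly from S3.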

In [None]:
bucket = 'aimlws001s3-dataset'

# user movie ratings data
umr_train_key = 'Dataset/ml-100k/ua.base'
umr_train_location = 's3://{}/{}'.format(bucket, umr_train_key) 

umr_test_key = 'Dataset/ml-100k/ua.test'
umr_test_location = 's3://{}/{}'.format(bucket, umr_test_key) 

# Load training set
umr_train = pd.read_csv(
    umr_train_location, 
    sep = '\t',
    # dtype keys must match the column names given in `names`
    dtype={'user_id':'int32', 'movie_id':'int32', 'rating':'float32'},
    names = ['user_id' , 'movie_id' , 'rating'], 
    index_col = False # keep only the named columns; the trailing timestamp column is dropped
)
umr_train = shuffle(umr_train) # shuffle data

# Load test set
umr_test = pd.read_csv(
    umr_test_location, 
    sep = '\t',
    dtype={'user_id':'int32', 'movie_id':'int32', 'rating':'float32'},
    names = ['user_id' , 'movie_id' , 'rating'], 
    index_col = False
)

#### Activity: Create your own ratings. Pick movies using final_movie_metadata and append your ratings to umr_train

In [None]:
# my choice movies: toy story, bad boys, apollo13, batman forever, exotica, alien, jaws, dumbo, ransom, georgia
movie_id = [1, 27, 28, 29, 46, 183, 234, 501, 742, 950] # change this
rating= [5.0, 4.0, 5.0, 2.0, 1.0, 3.0, 4.0, 4.0, 2.0, 1.0] # change this
user_id = [944] * len(rating) # don't change this
my_data = pd.DataFrame({"user_id": user_id, "movie_id": movie_id, "rating": rating})
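# pd.concat works on both old and new pandas versions (DataFrame.append was removed in pandas 2.0)
umr_train = pd.concat([my_data, umr_train], ignore_index=True)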

## Let's Explore the data!

We can see that there are ~90k ratings in ua.base and ~9.4k ratings in ua.test (10 per user).

In [None]:
nb_users_train = umr_train['user_id'].max()
nb_movies_train = umr_train['movie_id'].max()
nb_ratings_train = umr_train.shape[0]
nb_features = nb_users_train + nb_movies_train

print("Number of users: ", nb_users_train)
print("Number of movies: ", nb_movies_train)
print("Number of ratings: ", nb_ratings_train)
print("Number of features: ", nb_features)
umr_train.head()

In [None]:
nb_users_test = umr_test['user_id'].max()
nb_movies_test = umr_test['movie_id'].max()
nb_ratings_test = umr_test.shape[0]

print("Number of users: ", nb_users_test)
print("Number of movies: ", nb_movies_test)
print("Number of ratings: ", nb_ratings_test)
umr_test.head()



## Transforming the data into protobuf
### Create one-hot encoded sparse matrix
The SageMaker Factorization Machines algorithm requires training data in RecordIO-protobuf format with Float32 tensors. Luckily, we don't have to build our own utility functions, as the SageMaker SDK provides them.

However, we will first create a one-hot encoded sparse matrix. Since we will train FM as a binary classifier, any rating >= 4 is assigned the label 1, and anything lower the label 0.
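As a small illustration of this encoding, the sketch below maps a single rating to its two active columns and its label. The helper names `encode_rating` and `to_dense` and the tiny sizes (3 users, 4 movies) are hypothetical and only serve the example.

```python
def encode_rating(user_id, movie_id, rating, nb_users):
    """Return (active column indices, label) for one rating row.

    Users occupy columns 0 .. nb_users - 1, movies follow after them.
    """
    user_col = user_id - 1
    movie_col = nb_users + (movie_id - 1)
    label = 1.0 if rating >= 4 else 0.0
    return [user_col, movie_col], label

def to_dense(cols, width):
    """Expand the active columns into a dense one-hot row."""
    row = [0.0] * width
    for c in cols:
        row[c] = 1.0
    return row

nb_users, nb_movies = 3, 4  # toy sizes, not the real dataset
cols, label = encode_rating(user_id=2, movie_id=3, rating=5, nb_users=nb_users)
dense = to_dense(cols, nb_users + nb_movies)
print(cols, label, dense)
```

Every row has exactly two non-zero entries, which is why a sparse matrix is used instead of a dense one.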

In [None]:
def create_sparse_matrix(dataframe, lines, nb_users, columns):
    # Create sparse matrix of one-hot encoded features
    X = lil_matrix((lines, columns)).astype('float32')
    Y = [] # Store labels in a vector
    
    line = 0
    for index, row in dataframe.iterrows():
        # iterrows() upcasts mixed-dtype rows to float, so cast back to int before indexing
        X[line, int(row['user_id']) - 1] = 1
        X[line, nb_users + int(row['movie_id']) - 1] = 1
            
        if int(row['rating']) >= 4:
            Y.append(1)
        else:
            Y.append(0)
            
        line = line+1

    Y = np.array(Y).astype('float32')            
    return (X, Y)

In [None]:
X_train, Y_train = create_sparse_matrix(umr_train, nb_ratings_train, nb_users_train, nb_features)
# Use nb_users_train for the test set too, so movie columns line up with the training matrix
X_test, Y_test = create_sparse_matrix(umr_test, nb_ratings_test, nb_users_train, nb_features)

### Write sparse matrix as protobuf to S3 Bucket
We will now write our sparse matrices as protobuf and upload them to S3. Be sure to use your own folder; the code below uses your <i>username</i> as the folder, with <i>pf_train_test</i> inside it.

The training and test protobuf files will be saved in the training and test folders respectively, within the 'aimlws001s3-dataset' bucket. You will also set the path where your model output (or artifact) will be saved: the output folder inside your own folder in the same bucket.

In [None]:
def write_matrix_protobuf_s3(X, bucket, prefix, key, Y=None):
    
    buffer = io.BytesIO()
    
    smac.write_spmatrix_to_sparse_tensor(buffer, X, labels=Y)
        
    buffer.seek(0)
    obj = '{}/{}'.format(prefix, key)
    
    uploaded_path = 's3://{}/{}'.format(bucket, obj)
    
    s3.Bucket(bucket).Object(obj).upload_fileobj(buffer)
    
    return (uploaded_path)

In [None]:
your_folder_name = '{}/pf_train_test'.format(username)

umr_train_key = 'training/umr.train.protobuf'
umr_test_key = 'test/umr.test.protobuf'

model_output_path = 's3://{}/{}/output'.format(bucket, your_folder_name)

In [None]:
train_protobuf_path = write_matrix_protobuf_s3(X_train, bucket, your_folder_name, umr_train_key, Y_train)    
test_protobuf_path  = write_matrix_protobuf_s3(X_test, bucket, your_folder_name, umr_test_key, Y_test)    

print('Location of your protobuf training set: ', train_protobuf_path)
print('Location of your protobuf test set: ', test_protobuf_path)
print('Location of your model output: ', model_output_path)

## Build SageMaker Factorization Machines Model

#### Algorithm Image Uri

In order to use factorization-machines as our algorithm, we first need to grab the container that holds that Amazon algorithm. To do so, we specify the region and the algorithm name to get the Amazon image URI.

#### Output Path 

SageMaker will write the model artifact to the output path that we set above.

#### Instance

A key benefit of SageMaker is that you only pay for what you use during training. You can specify the instance type (compute and memory) you want to use to train your model; we will use ml.m5.xlarge for now. Billing is per second, so you are no longer charged for the training instance once training finishes.

In [None]:
algorithm = 'factorization-machines'

fm = sagemaker.estimator.Estimator(
    get_image_uri(current_region, algorithm),
    role, 
    train_instance_count = 1, 
    train_instance_type = 'ml.m5.xlarge',
    output_path = model_output_path,
    sagemaker_session = sm_sess,
)

We can play around with the hyperparameters of the factorization machine to fine-tune the model and achieve better accuracy. For the time being we will use fixed values for the number of factors, mini-batch size, and number of epochs for our first run.

In [None]:
fm.set_hyperparameters(
    feature_dim = nb_features,
    predictor_type = 'binary_classifier',
    num_factors = 64,
    mini_batch_size = 1000,
    epochs = 100
)
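
To get a rough feel for what these hyperparameters imply, the sketch below counts the parameters of a standard second-order factorization machine and the mini-batches per epoch. The helper names are hypothetical, and the sizes are assumptions for illustration: 2,626 features (944 users plus 1,682 movies) and 90,580 training ratings. Use the values printed by your own notebook.

```python
import math

def fm_parameter_count(feature_dim, num_factors):
    """Bias + linear weights + one latent vector per feature."""
    return 1 + feature_dim + feature_dim * num_factors

def batches_per_epoch(nb_ratings, mini_batch_size):
    """Number of mini-batches needed to see every rating once."""
    return math.ceil(nb_ratings / mini_batch_size)

# Assumed sizes, see the note above
feature_dim = 2626
nb_ratings = 90580

n_params = fm_parameter_count(feature_dim, num_factors=64)
n_batches = batches_per_epoch(nb_ratings, mini_batch_size=1000)
print(n_params, n_batches)
```

Raising num_factors grows the model roughly linearly, while mini_batch_size and epochs together set how many update steps training performs.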

To train the model, simply call fit on the estimator and pass it the path of your training data. Optionally, you can also pass the test set, so that SageMaker evaluates the model against it and reports the accuracy during training. We will do that.

In [None]:
fm.fit({
    'train': train_protobuf_path,  
    'test': test_protobuf_path
})

## Deploy Model

Once you're happy with your accuracy, you can deploy the model using the command below. We will be using ml.t2.medium to host the model. You can scale it according to your needs, but we will stick to 1 instance for now.

In [None]:
fm_predictor = fm.deploy(initial_instance_count = 1,
                         instance_type = 'ml.t2.medium') # Change this to a higher instance if you want

In [None]:
print("This is your endpoint, make note of this: ", fm_predictor.endpoint)

## Make Predictions

Before we make predictions, we need to build functions that serialize the request and deserialize the response, as the SageMaker FM endpoint expects JSON in a particular shape.
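As an illustration of that shape, the sketch below builds the same `instances`/`features` payload from plain Python lists and reads a score out of a response. The helper name `dense_payload` is hypothetical and the response string is made up for the example; a real response comes from the endpoint.

```python
import json

def dense_payload(rows):
    """Build the JSON request body from a list of dense feature rows."""
    return json.dumps({"instances": [{"features": list(row)} for row in rows]})

# One toy row with 5 features (the real rows have nb_features entries)
payload = dense_payload([[1.0, 0.0, 0.0, 1.0, 0.0]])
print(payload)

# Made-up response in the shape the notebook reads: predictions -> score
fake_response = '{"predictions": [{"score": 0.87, "predicted_label": 1.0}]}'
score = json.loads(fake_response)["predictions"][0]["score"]
print(score)
```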

In [None]:
import json
from sagemaker.predictor import json_deserializer
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer

## Batch prediction using the endpoint to find your movie recommendations

In [None]:
predictions = []
user_id = 944 # leave this as it is, this is your unique userid
# Movie m lives in column nb_users_train + (m - 1), matching create_sparse_matrix
for i in range(nb_users_train, nb_features):
    X_new = lil_matrix((1, nb_features)).astype('float32')
    X_new[0, user_id - 1] = 1 # user columns are zero-based, as in training
    X_new[0, i] = 1
    
    pred = fm_predictor.predict(X_new[0].toarray())["predictions"][0]
    pred["movie_id"] = i - nb_users_train + 1
    predictions.append(pred)
# Highest score first
top_n_predictions = sorted(predictions, key = lambda p: p['score'], reverse = True)

In [None]:
list_movie_id = []
for x in top_n_predictions[:45]: # top 45 movies
    list_movie_id.append(x["movie_id"])
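
The column arithmetic and the ranking step can be checked offline, without calling the endpoint. The sketch below assumes the layout used by create_sparse_matrix (user u in column u - 1, movie m in column nb_users + m - 1). The helper names and the scores are made up for the example.

```python
def movie_columns(nb_users, nb_features):
    """Yield (column index, movie_id) pairs for every movie feature."""
    for col in range(nb_users, nb_features):
        yield col, col - nb_users + 1

def top_n(predictions, n):
    """Return the movie ids of the n highest-scoring predictions."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["movie_id"] for p in ranked[:n]]

# Toy sizes: 3 users and 4 movies, so 7 feature columns in total
pairs = list(movie_columns(nb_users=3, nb_features=7))

# Made-up scores standing in for endpoint responses
fake_scores = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
predictions = [{"movie_id": m, "score": fake_scores[m]} for _, m in pairs]

best = top_n(predictions, 2)
print(pairs, best)
```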

## Let's take the top 45 movies and load them into Servianflix

In [None]:
import boto3
from boto3.dynamodb.conditions import Key, Attr

In [None]:
movies_load = pd.read_csv(
    "final_movie_metadata.csv", 
    sep = ',',
    index_col = False
)

In [None]:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('recmovie-2itpioohnnfxhcq7yzp2oax4nq-dev') # Point this to your created Dynamo Table

for index, row in movies_load.iterrows():

    if row["movieId"] in list_movie_id:
        my_item = {
            'id': str(row["movieId"]),
            'movieUri': str(row["moviePoster"]),
            'movieTitle': str(row["movieTitle"]),
            'movieReleaseYear': str(row["movieReleaseYear"]),
            'movieDescription': str(row['overview']),
            'user': 'jeno',
            'type': 'jeno'
        }

        response = table.put_item(Item=my_item)

## Loading Movies
##### Note: If you run these cells after loading your top 45 movies, run the top 45 movies load to DynamoDB again
#### Action

In [None]:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('recmovie-2itpioohnnfxhcq7yzp2oax4nq-dev')

for index, row in movies_load.iterrows():

    if row["Action"] == 1:
        my_item = {
            'id': str(row["movieId"]),
            'movieUri': str(row["moviePoster"]),
            'movieTitle': str(row["movieTitle"]),
            'movieReleaseYear': str(row["movieReleaseYear"]),
            'movieDescription': str(row['overview']),
            'user': '',
            'type': 'action'
        }

        response = table.put_item(Item=my_item)        

In [None]:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('recmovie-2itpioohnnfxhcq7yzp2oax4nq-dev')

for index, row in movies_load.iterrows():

    if row["Comedy"] == 1:
        my_item = {
            'id': str(row["movieId"]),
            'movieUri': str(row["moviePoster"]),
            'movieTitle': str(row["movieTitle"]),
            'movieReleaseYear': str(row["movieReleaseYear"]),
            'movieDescription': str(row['overview']),
            'user': '',
            'type': 'comedy'
        }

        response = table.put_item(Item=my_item)        

In [None]:
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('recmovie-2itpioohnnfxhcq7yzp2oax4nq-dev')

for index, row in movies_load.iterrows():

    if row["Animation"] == 1:
        my_item = {
            'id': str(row["movieId"]),
            'movieUri': str(row["moviePoster"]),
            'movieTitle': str(row["movieTitle"]),
            'movieReleaseYear': str(row["movieReleaseYear"]).replace("/", ""),
            'movieDescription': str(row['overview']),
            'user': '',            
            'type': 'animation'
        }

        response = table.put_item(Item=my_item)        

## (Optional) Delete Endpoint

Run this after you're done with everything so you don't incur ongoing costs. But if you want to be evil and burn through our beer funds, leave the endpoint running ;)

In [None]:
#sagemaker.Session().delete_endpoint(fm_predictor.endpoint)

## Special Thanks to AWS

Credits to AWS - specifically Julia Simon, Zohar Karnin, Rama Thamman, Sireesha Muppala, Yuri Astashanok, David Arpin, and Guy Ernest for creating many of the baseline functionalities.