# CS 406 Movie Recommender

### Ryder McDowell

#### OSU Cascades

This matrix factorization algorithm uses AWS’s built in factorization machines and builds a model that allows clients to hit the deployed endpoint and receive predictions for new movies that a user may rate highly based on their previously rated movies and the similarity of that matrix to others’. This type of algorithm works best with extremely sparse data, so the features are one-hot encoded in order to make the vectors sparse to a degree of 2 (userID and movieID) of number of users plus number of movies.

In order to receive predictions from a client, it can make a request to this lambda endpoint:

https://github.com/osu-cascades/movie-recommender-lambda

With the request body as json containing a list of userIDs and movieIDs to get predicted:

```
{
  "samples": [
    {
      "userId": 1,
      "movieId": 20
    },
    {
      "userId": 1,
      "movieId": 33
    }
  ]
}
```

This will return a response that contains a matching list that provides whether or not the model thinks the user will rate that movie above a 3 stars and to what confidence level.

```
{
    "predictions": [
        {
            "prediction_score": 0.84,
            "predicted_label": 1
        },
        {
            "prediction_score": 0.62,
            "predicted_label": 0
        },
    ]
}
```

Dataset used: https://grouplens.org/datasets/movielens/  
Tutorial followed at: https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/

# Fetch Data

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

In [19]:
print("Training Data:")
!head -10 ./ml-100k/ua.base

print("\nTesting Data:")
!head -10 ./ml-100k/ua.test

Training Data:
1	1	5	874965758
1	2	3	876893171
1	3	4	878542960
1	4	3	876893119
1	5	3	889751712
1	6	5	887431973
1	7	4	875071561
1	8	1	875072484
1	9	5	878543541
1	10	3	875693118

Testing Data:
1	20	4	887431883
1	33	4	878542699
1	61	4	878542420
1	117	3	874965739
1	155	2	878542201
1	160	4	875072547
1	171	5	889751711
1	189	3	888732928
1	202	5	875072442
1	265	4	878542441


# Import Libraries

In [20]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer

import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

# Load Data

In [21]:
def get_samples(csv_reader):
    samples = []
    for userId,movieId,rating,timestamp in csv_reader:
        samples.append({
            'userId': userId,
            'movieId': movieId,
            'rating': rating,
            'timestamp': timestamp
        })
        
    return samples
    
    
def get_maximums(samples):
    users = []
    movies = []
    for sample in samples:
        users.append(int(sample['userId']))
        movies.append(int(sample['movieId']))

    max_user_id = max(users)
    max_movie_id = max(movies)
    
    return max_user_id, max_movie_id


def get_matrix_shape(max_user_id, max_movie_id, samples):
    total_samples = len(samples)
    total_features = max_user_id + max_movie_id

    return total_samples, total_features


def fill_data(data, labels, samples):
    row = 0
        
    # Build matrix and labels
    for sample in samples:

        # One hot-encode userId and movieId at row
        user_index = int(sample['userId']) - 1
        movie_index = 943 + int(sample['movieId']) - 1    #!!

        data[row, user_index] = 1
        data[row, movie_index] = 1

        # Append binary to labels for whether user "enjoyed" movie
        if int(sample['rating']) >= 4:
            labels.append(1)
        else:
            labels.append(0)

        row = row + 1

    # Convert labels list to float 32
    labels = np.array(labels).astype('float32')
    
    return data, labels
    

def load_dataset(training_data_file_path, testing_data_file_path):
    # Training Data
    with open(training_data_file_path, 'r') as file:
        csv_reader = csv.reader(file, delimiter='\t')
        
        # Get all training samples in form of [{}, {}, ...]
        training_samples = get_samples(csv_reader)
        
        # Get maximum number of users and movies
        max_user_id, max_movie_id = get_maximums(training_samples)
        
        # Get shape of training matrix
        training_matrix_shape = get_matrix_shape(max_user_id, max_movie_id, training_samples)
        
        # Initialize training data and labels structures
        training_data = lil_matrix(training_matrix_shape).astype('float32')
        training_labels = []

        # Fill training data and labels structures with sample training data 
        training_data, training_labels = fill_data(training_data, training_labels, training_samples)
        
    # Testing Data
    with open(testing_data_file_path, 'r') as file:
        csv_reader = csv.reader(file, delimiter='\t')
        
        # Get all testing samples in form of [{}, {}, ...]
        testing_samples = get_samples(csv_reader)
        
        #Get shape of testing matrix
        testing_matrix_shape = get_matrix_shape(max_user_id, max_movie_id, testing_samples)
        
        # Initialize testing data and labels structures
        testing_data = lil_matrix(testing_matrix_shape).astype('float32')
        testing_labels = []
        
        # Fill testing data and labels structurs with sample testing data
        testing_data, testing_labels = fill_data(testing_data, testing_labels, testing_samples)
        
    
    
    return (training_data, training_labels), (testing_data, testing_labels)

In [22]:
training_data_file_path = './ml-100k/ua.base'
testing_data_file_path = './ml-100k/ua.test'

(training_data, training_labels), (testing_data, testing_labels) = load_dataset(training_data_file_path, testing_data_file_path)

# Summary Statistics

### Shapes

In [23]:
print("(Ratings, Features)")
print(training_data.shape)
print(training_labels.shape)

print(testing_data.shape)
print(testing_labels.shape)

(Ratings, Features)
(90570, 2625)
(90570,)
(9430, 2625)
(9430,)


### Insight

In [24]:
print(training_data[1000:1005])
print(training_labels[1000:1005])

print(training_data[1000:1005])
print(testing_labels[1000:1005])

  (0, 6)	1.0
  (0, 1493)	1.0
  (1, 6)	1.0
  (1, 1494)	1.0
  (2, 6)	1.0
  (2, 1495)	1.0
  (3, 6)	1.0
  (3, 1496)	1.0
  (4, 6)	1.0
  (4, 1497)	1.0
[0. 1. 0. 0. 1.]
  (0, 6)	1.0
  (0, 1493)	1.0
  (1, 6)	1.0
  (1, 1494)	1.0
  (2, 6)	1.0
  (2, 1495)	1.0
  (3, 6)	1.0
  (3, 1496)	1.0
  (4, 6)	1.0
  (4, 1497)	1.0
[0. 0. 0. 0. 0.]


### Label Balance

In [25]:
print("{:0.2f}% Movies Rated Above 3 in Training Data".format(np.count_nonzero(training_labels) / training_data.shape[0] * 100))
print("{:0.2f}% Movies Rated Above 3 in Testing Data".format(np.count_nonzero(testing_labels) / testing_data.shape[0] * 100))

0.00% Movies Rated Above 3 in Training Data
0.00% Movies Rated Above 3 in Testing Data


### Sparcity

In [26]:
encoded_values = training_data.shape[0] * 2
total_values = training_data.shape[0] * training_data.shape[1]

print("{:0.5f}% Sparse".format(100 - (encoded_values / total_values)))

100.00000% Sparse


## Convert to protobuf and save to S3

In [27]:
bucket = 'rydermcdowell-sagemaker'
prefix = 'fm-movielens'

training_data_key = '{}/training-data/training.protobuf'.format(prefix)
testing_data_key = '{}/testing-data/testing.protobuf'.format(prefix)

output_path = 's3://{}/{}/output'.format(bucket, prefix)

In [28]:
def write_dataset_to_protobuf(data, labels, bucket, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, data, labels)
    buf.seek(0)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, key)

In [29]:
training_data_location = write_dataset_to_protobuf(training_data, training_labels, bucket, training_data_key)
testing_data_location = write_dataset_to_protobuf(testing_data, testing_labels, bucket, testing_data_key)

print('Training data written to: {}'.format(training_data_location))
print('Testing data written to: {}'.format(testing_data_location))
print('Output location: {}'.format(output_path))

Training data written to: s3://rydermcdowell-sagemaker/fm-movielens/training-data/training.protobuf
Testing data written to: s3://rydermcdowell-sagemaker/fm-movielens/testing-data/testing.protobuf
Output location: s3://rydermcdowell-sagemaker/fm-movielens/output


## Run training

In [30]:
containers = {
                'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
                'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
                'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
                'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest'
             }

In [34]:
fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                   get_execution_role(),
                                   train_instance_count = 1,
                                   train_instance_type = 'ml.m5.large',
                                   output_path = output_path,
                                   sagemaker_session = sagemaker.Session()
                                  )

fm.set_hyperparameters(feature_dim = training_data.shape[1],
                       predictor_type = 'binary_classifier',
                       mini_batch_size = 1000,
                       num_factors = 64,
                       factors_lr = 0.01,
                       epochs = 200
                      )

fm.fit({ 'train': training_data_location, 'test': testing_data_location })

2019-06-11 16:44:41 Starting - Starting the training job...
2019-06-11 16:44:43 Starting - Launching requested ML instances......
2019-06-11 16:45:53 Starting - Preparing the instances for training...
2019-06-11 16:46:27 Downloading - Downloading input data...
2019-06-11 16:47:04 Training - Downloading the training image..
[31mDocker entrypoint called with argument(s): train[0m
[31m[06/11/2019 16:47:17 INFO 140374414108480] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', u'bias_wd': u'0.01', u'use_linear': u'true', u'bias_lr': u'0.1', u'mini_batch_size': u'1000', u'_use

## Deploy

In [None]:
fm_predictor = fm.deploy(instance_type = 'ml.m5.large', initial_instance_count = 1)

------------------------------------------------------------------------------------

In [None]:
def fm_serializer(data):
    js = { 'instances': [] }
    for row in data:
        js['instances'].append({ 'features': row.tolist() })
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer

## Run predictions

In [None]:
result = fm_predictor.predict(testing_data[4].toarray())
print("{}\n".format(result))
print("{}".format(map(lambda prediction: prediction['predicted_label'], result['predictions'])))
print(testing_labels[4])