# Predictive Movie Recommendations

## Problem Overview

This scenario focuses on predictions related to movie recommendations where the goal is to predict the rating a user would give to a movie the user never rated before. Using this value, you can identify the list of most interesting movies for a particular user and recommend those movies. The dataset we are working with in this scenario consists of user actions related to movies:

- ```details```: the user viewed the details of the movie
- ```addToCart```: the user added the movie to the shopping cart
- ```buy```: the user bought the movie 

One of the most widely used approaches to this kind of problem is Collaborative Filtering, a technique based on the assumption that if two users share the same taste or opinion on one issue, they are more likely to share the same taste or opinion on a different or new issue. There are multiple types of collaborative filtering: memory-based, model-based, hybrid, and deep learning-based. In this notebook we will showcase one particular model-based collaborative filtering algorithm - SVD (Singular Value Decomposition). The ```scikit-surprise``` library provides a wide range of collaborative filtering algorithms, but you can use any other library instead.
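
As a rough illustration of what a model-based approach learns, the SVD algorithm in ```scikit-surprise``` estimates a rating as the global mean rating plus a user bias, a movie bias, and the dot product of a user factor vector and a movie factor vector. The sketch below reproduces only that prediction rule with NumPy. The function name and all numeric values are hypothetical and chosen for illustration; they are not taken from a trained model.

```python
import numpy as np

def predict_rating(global_mean, user_bias, movie_bias, user_factors, movie_factors):
    """Biased matrix-factorization estimate: mean + biases + factor dot product."""
    return global_mean + user_bias + movie_bias + float(np.dot(movie_factors, user_factors))

# Hypothetical parameters for one user and one movie (two latent factors)
est = predict_rating(
    global_mean=5.0,
    user_bias=0.5,
    movie_bias=-0.25,
    user_factors=np.array([1.0, 0.0]),
    movie_factors=np.array([0.5, 2.0]),
)
print(est)
```

During training, the library fits the biases and factor vectors so that these estimates match the known ratings as closely as possible.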

One important aspect to note is that the input dataset does not contain explicit ratings (which the SVD algorithm requires). Still, the data contains information that can be translated into ratings using a fairly simple approach; we call the resulting values implicit ratings.

The scenario details the development of a machine learning movie rating prediction model. The model is trained on a dataset containing movie-related user actions. As part of the development process, the calculation of implicit ratings is also showcased.

## Solution Overview

We will use the remote model training capabilities of the [Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml). The remote compute resources used will be provided by [Azure Machine Learning Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute). We will model our problem as a model-based collaborative filtering problem where the goal of the trained model is to predict the rating a user would give to a movie. These predictions can then be used to rank movies for that user and recommend the most interesting ones.

This notebook is organized into the following sections:

1. Basic setup

2. Data prep

3. Model training

4. Load the trained model and make movie recommendations

## Section 1. Basic setup

Make sure the ```scikit-surprise``` (Simple Python RecommendatIon System Engine) library is installed. It provides the recommender algorithm used in this notebook.

In [None]:
!pip install --upgrade scikit-surprise

Make the necessary namespace and class imports.

In [None]:
import pandas as pd
import numpy as np
import logging
import warnings
import os
# Suppress warning messages for cleaner output in the notebook
warnings.filterwarnings('ignore')

from matplotlib import pyplot as plt

from surprise import dump

import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.core import ScriptRunConfig

Before starting this step, you need to create an Azure Machine Learning service workspace ([instructions](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace)).

Let's get started by creating an experiment in your Azure Machine Learning workspace. An experiment is a named object in a workspace, which is used to do model training.

In [None]:
subscription_id = "<subscription id goes here>"
resource_group = "<resource group goes here>"
workspace_name = "<workspace name goes here>"

In [None]:
ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)

In [None]:
# choose a name for the run history container in the workspace
experiment_name = 'PredictiveMovieRecommendations'

# project folder
project_folder = './sample_projects/predictivemovierecommendations'
os.makedirs(project_folder, exist_ok=True)

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
# Show full column contents (None disables truncation in pandas 1.0 and later)
pd.set_option('display.max_colwidth', None)
pd.DataFrame(data = output, index = ['']).T

## Section 2. Data prep

Movie_Customer_Actions.csv contains 4000 movie-related customer actions (details, addToCart, and buy) performed during various sessions.

In [None]:
data = pd.read_csv("https://quickstartsws9073123377.blob.core.windows.net/amlnotebooktutorials/movie-customer-actions/movie_customer_actions.csv")
data.head(10)

### 2.1 Calculate implicit ratings

The dataset records movie-related customer actions, which do not contain explicit ratings for the movies. The first step is to derive these ratings using a simple approach: for each combination of MovieId and UserId, we count the occurrences of each action (details, addToCart, buy), combine the counts using a predefined set of weights, and then normalize the results to a 0 to 10 scale.
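
Written as a formula, and using the weights defined later in this section (100 for ```buy```, 50 for ```addToCart```, 15 for ```details```), the implicit rating of movie $m$ by user $u$ can be expressed as follows. Here $n_a(u, m)$ denotes the number of times user $u$ performed action $a$ on movie $m$; these symbols are introduced only for this illustration.

$$
\text{raw}(u, m) = 100 \, n_{\text{buy}}(u, m) + 50 \, n_{\text{addToCart}}(u, m) + 15 \, n_{\text{details}}(u, m)
$$

$$
\text{rating}(u, m) = 10 \cdot \frac{\text{raw}(u, m)}{\max_{u', m'} \text{raw}(u', m')}
$$

For example, a user who viewed the details of a movie twice and bought it once gets a raw rating of $2 \cdot 15 + 100 = 130$ for that movie.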

First, let's count the actions:

In [None]:
counts = data.groupby(['MovieId', 'UserId', 'Action'])['Action'].count()

In [None]:
# Nested dictionary: agg_data[MovieId][UserId] -> count of each action type
agg_data = dict()

for (movie_id, user_id, action), count in counts.items():
    # Create the per-movie and per-user entries the first time they are seen
    movie_agg_data = agg_data.setdefault(movie_id, dict())
    user_agg_data = movie_agg_data.setdefault(user_id, {
        'buy': 0,
        'addToCart': 0,
        'details': 0
    })
    user_agg_data[action] += count

Using the predefined weights, calculate the raw rating for each MovieId, UserId pair.

In [None]:
weights = {
    'buy': 100,
    'addToCart': 50,
    'details': 15
}

raw_ratings = []
max_rating = 0

for movie_key in agg_data.keys():
    for user_key in agg_data[movie_key].keys():
        raw_rating = 0
        for weights_key in weights.keys():
            raw_rating += agg_data[movie_key][user_key][weights_key] * weights[weights_key]
        if raw_rating > max_rating:
            max_rating = raw_rating
        raw_ratings.append([movie_key, user_key, raw_rating])

Finally, normalize the ratings to get a scale from 0 to 10.

In [None]:
ratings = pd.DataFrame(raw_ratings, columns=['MovieId', 'UserId', 'RawRating'])
ratings['Rating'] = float(10) * ratings['RawRating'] / max_rating
ratings.head(10)
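
The counting, weighting, and normalization steps above can also be expressed with vectorized pandas operations. The following is a minimal sketch under the assumption that the data has the same ```UserId```, ```MovieId```, and ```Action``` columns; the ```toy``` data frame and the ```toy_ratings``` name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the notebook's dataset
toy = pd.DataFrame({
    'UserId':  [1, 1, 1, 2, 2],
    'MovieId': [10, 10, 10, 10, 20],
    'Action':  ['details', 'details', 'buy', 'addToCart', 'details'],
})

weights = {'buy': 100, 'addToCart': 50, 'details': 15}

# Count actions per (MovieId, UserId) pair, with one column per action type
action_counts = (
    toy.groupby(['MovieId', 'UserId', 'Action'])
       .size()
       .unstack('Action', fill_value=0)
       .reindex(columns=list(weights), fill_value=0)
)

# Weighted sum of the counts gives the raw rating for each pair
toy_ratings = (
    action_counts.mul(pd.Series(weights), axis=1)
                 .sum(axis=1)
                 .rename('RawRating')
                 .reset_index()
)

# Normalize so that the highest raw rating maps to 10
toy_ratings['Rating'] = 10.0 * toy_ratings['RawRating'] / toy_ratings['RawRating'].max()
print(toy_ratings)
```

On a dataset of this size the explicit loops are fast enough; the vectorized form mainly helps readability and scales better to larger inputs.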

## Section 3. Model training

First, make sure the necessary compute resources are available. If you want to reuse an existing Azure Machine Learning compute cluster, change the value of the ```cpu_cluster_name``` variable below.

In [None]:
# Choose a name for your CPU cluster
cpu_cluster_name = "aml-compute-01"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=1)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Create a run configuration for the compute target. Notice the dependency that is specified: ```scikit-surprise```.

In [None]:
# Create a new runconfig object
run_amlcompute = RunConfiguration()

# Use the cpu_cluster you created above. 
run_amlcompute.target = cpu_cluster

# Enable Docker
run_amlcompute.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_amlcompute.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-surprise'])

Create the training script that will be submitted to the remote AML compute cluster. The content of the training script is based on the logic we have shown so far (load data -> calculate implicit ratings -> normalize ratings) completed with the use of the SVD (Singular Value Decomposition) algorithm from ```scikit-surprise```.

The ```scikit-surprise``` library supports many more recommendation algorithms:
- Basic: NormalPredictor, BaselineOnly
- k-NN: KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
- Matrix factorization: SVD, SVDpp, NMF
- Other: SlopeOne, CoClustering

More details about these are available in the [scikit-surprise documentation](https://surprise.readthedocs.io/en/stable/).

The last step performed by the training script is to write the trained model to the ```outputs``` folder. With AML, anything written to that folder during training becomes available as part of the experiment run record. This allows you to download the trained model at any later point in time and use it to make predictions.

In [None]:
%%writefile $project_folder/train.py

import pandas as pd
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise import dump
from surprise.model_selection import train_test_split

from azureml.core import Run

data = pd.read_csv("https://quickstartsws9073123377.blob.core.windows.net/amlnotebooktutorials/movie-customer-actions/movie_customer_actions.csv")
counts = data.groupby(['MovieId', 'UserId', 'Action'])['Action'].count()

# Nested dictionary: agg_data[MovieId][UserId] -> count of each action type
agg_data = dict()

for (movie_id, user_id, action), count in counts.items():
    # Create the per-movie and per-user entries the first time they are seen
    movie_agg_data = agg_data.setdefault(movie_id, dict())
    user_agg_data = movie_agg_data.setdefault(user_id, {
        'buy': 0,
        'addToCart': 0,
        'details': 0
    })
    user_agg_data[action] += count
    
weights = {
    'buy': 100,
    'addToCart': 50,
    'details': 15
}

raw_ratings = []
max_rating = 0

for movie_key in agg_data.keys():
    for user_key in agg_data[movie_key].keys():
        raw_rating = 0
        for weights_key in weights.keys():
            raw_rating += agg_data[movie_key][user_key][weights_key] * weights[weights_key]
        if raw_rating > max_rating:
            max_rating = raw_rating
        raw_ratings.append([movie_key, user_key, raw_rating])
        
ratings = pd.DataFrame(raw_ratings, columns=['MovieId', 'UserId', 'RawRating'])
ratings['Rating'] = float(10) * ratings['RawRating'] / max_rating

reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(ratings[['UserId', 'MovieId', 'Rating']], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
rmse = accuracy.rmse(predictions)

# Get the containing AML run and log the metric
run = Run.get_context()
run.log('rmse', rmse)

# Save the trained model to the outputs folder so that it is automatically uploaded to the experiment run record
import os
os.makedirs('outputs', exist_ok=True)
dump.dump('outputs/movie_recommender.pkl', algo=algo)
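
The ```rmse``` value logged by the script is the root mean squared error between the ratings held out in the test set and the model's estimates for them. Below is a minimal NumPy sketch of that metric; the function name and the numbers are made up for illustration and are not actual model output.

```python
import numpy as np

def root_mean_squared_error(true_ratings, estimated_ratings):
    """Square root of the mean squared difference between two rating arrays."""
    errors = np.asarray(true_ratings, dtype=float) - np.asarray(estimated_ratings, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical held-out ratings and the corresponding estimates
example_rmse = root_mean_squared_error([3.0, 5.0, 8.0], [2.5, 5.5, 7.0])
print(example_rmse)
```

Because the ratings are on a 0 to 10 scale, the RMSE is expressed in the same units, and lower values indicate estimates that are closer to the held-out ratings.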

Finally, submit your experiment run using the ```train.py``` file you've just created. 

Notice the URL displayed right after execution starts - it allows you to view the progress of the run directly in the AML portal.

In [None]:
src = ScriptRunConfig(source_directory = project_folder, script = 'train.py', run_config = run_amlcompute)
run = experiment.submit(src)
run.wait_for_completion(show_output = True)

You can also check the details of the experiment run that has just finished. 

Notice the URLs on the right side - the link to the AML portal and the link to the AML documentation.

In [None]:
run

## Section 4. Load the trained model and make movie recommendations

The trained recommender model is now available in the experiment run record and can be downloaded.

In [None]:
run.download_file('outputs/movie_recommender.pkl')

You can now load the trained model which will be available in the ```algo``` variable.

In [None]:
pred, algo = dump.load('movie_recommender.pkl')

Next, we will find a MovieId, UserId pair that does not exist in the ```ratings``` data frame. To do that, we will choose a UserId (400001) and find all the MovieId values that have been rated by other users but not by this user.

In [None]:
set(ratings.MovieId.unique()).difference(ratings[ratings.UserId == 400001].MovieId.unique())

Verify that there is no rating for MovieId 2404435 by UserId 400001.

In [None]:
ratings[(ratings.UserId == 400001) & (ratings.MovieId == 2404435)]

We have now reached our final step - predicting the rating for a movie that hasn't been rated previously by a given user.

Notice the ```est``` value which is the rating we are looking for.

In [None]:
algo.predict(400001, 2404435)
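
To turn single predictions into recommendations, you can score every movie the user has not rated and keep the highest estimates. The sketch below assumes only that the model exposes a ```predict(user_id, movie_id)``` method whose result has an ```est``` attribute, as the trained ```algo``` object does. The ```top_n_for_user``` helper and the stub model are hypothetical and exist only to keep the example self-contained.

```python
from collections import namedtuple

def top_n_for_user(model, user_id, candidate_movie_ids, n=5):
    """Return the n (movie_id, estimated_rating) pairs with the highest estimates."""
    scored = [(movie_id, model.predict(user_id, movie_id).est)
              for movie_id in candidate_movie_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Stand-in for a trained model, used here only for illustration
StubPrediction = namedtuple('StubPrediction', ['est'])

class StubModel:
    def __init__(self, scores):
        self._scores = scores

    def predict(self, user_id, movie_id):
        return StubPrediction(est=self._scores.get(movie_id, 0.0))

stub = StubModel({101: 7.5, 102: 2.0, 103: 9.1})
top_two = top_n_for_user(stub, user_id=1, candidate_movie_ids=[101, 102, 103], n=2)
print(top_two)
```

With the trained model, the same helper could be called as ```top_n_for_user(algo, 400001, candidate_ids)```, where ```candidate_ids``` is the set of unrated MovieId values computed earlier in this section.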