# COMP9033 - Data Analytics Lab 07c: Recommender systems
## Introduction

In this lab, you will build a simple movie recommender using $k$ nearest neighbours regression. At the end of the lab, you should be able to use `scikit-learn` to:

- Impute missing values in a data set.
- Create a $k$ nearest neighbours regression model.
- Use the model to predict new values.
- Measure the accuracy of the model.

### Getting started

Let's start by importing the packages we'll need. This week, we're going to use the `neighbors` subpackage from `scikit-learn` to build $k$ nearest neighbours models. We'll also use the `dummy` subpackage to build a baseline model from which we can gauge how good our final model is, and the `preprocessing` subpackage to impute missing values in our data.

In [2]:
%matplotlib inline
import pandas as pd

from math import sqrt
from matplotlib import pyplot as plt

from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split # Replaces the old cross_validation and grid_search modules (scikit-learn >= 0.18)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer # Removed in scikit-learn 0.22, so this lab needs an earlier release

Next, let's load the data. Write the path to your `ml-100k.csv` file in the cell below:

In [3]:
path = 'data/ml-100k.csv'

Execute the cell below to load the CSV data into a pandas data frame indexed by the `user_id` field in the CSV file.

In [4]:
df = pd.read_csv(path, index_col='user_id')
df.head()

user_id,Kolya (1996),L.A. Confidential (1997),Heavyweights (1994),Legends of the Fall (1994),Jackie Brown (1997),Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963),"Hunt for Red October, The (1990)","Jungle Book, The (1994)",Grease (1978),"Remains of the Day, The (1993)",...,Sleepover (1995),Everest (1998),Nobody Loves Me (Keiner liebt mich) (1994),Getting Away With Murder (1996),Scream of Stone (Schrei aus Stein) (1991),Mamma Roma (1962),"Eighth Day, The (1996)",Girls Town (1996),"Silence of the Palace, The (Saimt el Qusur) (1994)",Dadetown (1995)
1,5.0,,,4.0,,,4.0,,,5.0,...,,,,,,,,,,
2,5.0,5.0,,,,,,,,,...,,,,,,,,,,
3,,2.0,,,5.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,1.0,,,,,,1.0,,...,,,,,,,,,,


Let's start by computing some summary statistics about the data:

In [None]:
stats = df.describe()
stats

As can be seen, the data consists of film ratings in the range [1, 5] for 1664 films. Some films have been rated by many users, but the vast majority have been rated by only a few (i.e. there are many missing values):

In [None]:
rating_counts = stats.loc['count'] # Number of non-missing ratings per film (.ix is deprecated, so use .loc)
rating_counts.hist(bins=30)
plt.xlabel('Number of ratings')
plt.ylabel('Frequency')
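
To make the idea of rating counts concrete, here is a minimal sketch on a made-up three-user, three-film table (the `toy` frame and its film names are hypothetical and not part of the MovieLens data). It counts the non-missing ratings per film and the overall fraction of missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table in the same layout as df: rows are users, columns are films
toy = pd.DataFrame(
    {
        'Film A': [5.0, 4.0, 3.0],
        'Film B': [np.nan, 2.0, np.nan],
        'Film C': [np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Number of non-missing ratings per film (the same quantity as the 'count' row of describe())
toy_counts = toy.count()

# Fraction of all entries in the table that are missing
missing_fraction = toy.isna().to_numpy().mean()

print(toy_counts)
print('Missing: %.2f' % missing_fraction)
```

A film with a count of zero, like `Film C` here, contributes no information at all about user similarity.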

Let's build a model of the data and use it to build a movie recommender system.

## Data modelling

Let's build a movie recommender using user-based collaborative filtering. For this, we'll need to build a model that can identify the most similar users to a given user and use that relationship to predict ratings for new movies. We can use $k$ nearest neighbours regression for this.

Before we build the model, let's specify ratings for some of the films in the data set. This gives us a target variable to fit our model to. The values below are just examples - feel free to add your own ratings or change the films.

In [None]:
y = pd.Series({
    'L.A. Confidential (1997)': 3.5,
    'Jaws (1975)': 3.5,
    'Evil Dead II (1987)': 4.5,
    'Fargo (1996)': 5.0,
    'Naked Gun 33 1/3: The Final Insult (1994)': 2.5,
    'Wings of Desire (1987)': 5.0,
    'North by Northwest (1959)': 5.0,
    'Monty Python\'s Life of Brian (1979)': 4.5,
    'Raiders of the Lost Ark (1981)': 4.0,
    'Annie Hall (1977)': 5.0,
    'True Lies (1994)': 3.0,
    'GoldenEye (1995)': 2.0,
    'Good, The Bad and The Ugly, The (1966)': 4.0,
    'Empire Strikes Back, The (1980)': 4.0,
    'Godfather, The (1972)': 4.5,
    'Waterworld (1995)': 1.0,
    'Blade Runner (1982)': 4.0,
    'Seven (Se7en) (1995)': 3.5,
    'Alien (1979)': 4.0,
    'Free Willy (1993)': 1.0
})

Next, let's form the matrix of explanatory variables. In user-based collaborative filtering, we need to identify the users that are most similar to us. Consequently, we need to transpose our data matrix (with the `T` attribute of the data frame) so that its columns (i.e. features) represent users and its rows (i.e. samples) represent films. We'll also need to select just the films that we specified above, as our target variable consists of these only.

In [None]:
X = df.loc[:, y.index].T # Select the films we rated, then transpose (.ix is deprecated, so use .loc)

X.head()
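
The effect of selecting and transposing can be seen on a small made-up table. This is only a sketch: `toy`, `toy_y` and `toy_X` are hypothetical names, and the ratings are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table in the same layout as df: rows are users, columns are films
toy = pd.DataFrame(
    {
        'Film A': [5.0, np.nan, 3.0],
        'Film B': [np.nan, 2.0, 4.0],
        'Film C': [1.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Our own ratings for a subset of the films (the target variable)
toy_y = pd.Series({'Film A': 4.0, 'Film C': 1.5})

# Keep only the films we rated, then transpose so that
# rows (samples) are films and columns (features) are users
toy_X = toy.loc[:, toy_y.index].T

print(toy_X.shape) # (2, 3): two films by three users
```

After the transpose, the index of `toy_X` lines up with the index of `toy_y`, which is what allows the two to be passed together to a model.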

As usual, we can split our data into a training set and a test set using the `train_test_split` function:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Data cleaning

Before we can fit the model, we need to deal with the missing values. Normally, one option would be to simply remove the rows or columns that contain missing entries, but in this case there are so many that doing so drops the entire data set:

In [None]:
df.dropna(axis=0) # Drop rows with missing values

In [None]:
df.dropna(axis=1) # Drop columns with missing values
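
The same thing happens on any table where every row and every column has at least one gap. A minimal sketch with made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: every user has skipped one film and every film is missing one rating
toy = pd.DataFrame(
    {
        'Film A': [5.0, np.nan, 3.0],
        'Film B': [np.nan, 2.0, 4.0],
        'Film C': [1.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name='user_id'),
)

rows_left = toy.dropna(axis=0) # Drop rows with missing values
cols_left = toy.dropna(axis=1) # Drop columns with missing values

print(rows_left.shape) # (0, 3): no users survive
print(cols_left.shape) # (3, 0): no films survive
```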

Instead, let's fill in the missing values with suitable replacements. We can use the [`Imputer`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) class for this. The `Imputer` class works by filling in the mean, median or most frequent (i.e. mode) value per row or per column of the data frame. Depending on the data set, it may be better to use one value over another. Similarly, it may be better to compute the value to fill based on row values in some cases and column values in others.

In our case, it's not clear which value is best to fill, or whether it's better to compute the fill value from the row data (e.g. average rating per film) or the column data (e.g. average rating per user). Let's use model selection to choose the best options for us.
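
To illustrate the difference between the two directions, the sketch below performs mean imputation by hand with pandas rather than with `Imputer`. The `toy_X` frame is hypothetical and uses the transposed layout (rows are films, columns are users).

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix in the transposed layout: rows are films, columns are users
toy_X = pd.DataFrame(
    {1: [5.0, np.nan, 1.0], 2: [np.nan, 2.0, 1.0], 3: [3.0, 4.0, np.nan]},
    index=['Film A', 'Film B', 'Film C'],
)

# Column-wise: fill each gap with the mean of that user's ratings
by_user = toy_X.fillna(toy_X.mean(axis=0))

# Row-wise: fill each gap with the mean of that film's ratings
# (transpose, fill by film label, transpose back)
by_film = toy_X.T.fillna(toy_X.mean(axis=1)).T

print(by_user)
print(by_film)
```

The missing rating of user 2 for `Film A` becomes 1.5 (user 2's average) in the first case and 4.0 (the film's average) in the second, so the choice of direction can change the neighbours a film ends up with.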

### Dummy modelling

Let's start by creating a dummy regression model of our data, to give us a baseline from which we can improve:  

In [None]:
pipeline = make_pipeline(
    Imputer(),
    DummyRegressor()
)

parameters = {
    'imputer__axis': [0, 1],
    'imputer__strategy': ['mean', 'median', 'most_frequent']
}

gs = GridSearchCV(pipeline, parameters, cv=10) # Use 10-fold cross validation
gs.fit(X_train, y_train) # Fit using the training set

# Make predictions about the test data
y_pred = gs.predict(X_test)

# Print error measurements
print('MAE: %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % sqrt(mean_squared_error(y_test, y_pred))) # Use sqrt to get the RMSE from the MSE

The dummy model has a mean absolute error of 1.75, which means that its predictions are off by 1.75 rating points on average. On a five-point scale this isn't very good, but it does give us a baseline.
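
For reference, the two error measures can be computed by hand. The sketch below assumes the default behaviour of `DummyRegressor`, which is to predict the mean of the training targets for every sample. The arrays are invented values, not the lab's data.

```python
import numpy as np

# Hypothetical target values, for illustration only
y_train_toy = np.array([4.0, 2.0, 5.0, 3.0])
y_test_toy = np.array([5.0, 1.0])

# Mean baseline: predict the training mean for every test sample
baseline = y_train_toy.mean()
y_pred_toy = np.full_like(y_test_toy, baseline)

errors = y_test_toy - y_pred_toy
mae = np.mean(np.abs(errors))        # Mean absolute error
rmse = np.sqrt(np.mean(errors ** 2)) # Root mean squared error

print('MAE: %.2f' % mae)
print('RMSE: %.2f' % rmse)
```

Because squaring emphasises large errors, RMSE penalises a few bad predictions more heavily than MAE does.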

### $k$ nearest neighbours modelling

Let's build a $k$ nearest neighbours regression model to see what improvement can be made over the dummy model:

In [None]:
pipeline = make_pipeline(
    Imputer(),
    KNeighborsRegressor()
)

parameters = {
    'imputer__axis': [0, 1],
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'kneighborsregressor__n_neighbors': range(1, int(y_train.shape[0] * 0.9)), # Use as large a range as possible
    'kneighborsregressor__weights': ['uniform', 'distance'],
    'kneighborsregressor__metric': ['manhattan', 'euclidean']
}

gs = GridSearchCV(pipeline, parameters, cv=10, n_jobs=-1) # n_jobs=-1 uses all available CPUs for calculation
gs.fit(X_train, y_train) # Fit using the training set

# Make predictions about the test data
y_pred = gs.predict(X_test)

# Print error measurements
print('MAE: %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % sqrt(mean_squared_error(y_test, y_pred))) # Use sqrt to get the RMSE from the MSE

As can be seen, the $k$ nearest neighbours model decreases MAE from 1.75 to 1 and RMSE from 1.86 to 0.87, approximately a unit rating improvement in both cases. The model error is still quite large, but not so large that it won't be useful.
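
To show what the tuned hyperparameters control, here is a minimal NumPy sketch of $k$ nearest neighbours regression. It is an illustration under stated assumptions rather than the `scikit-learn` implementation: `knn_predict` is a hypothetical helper, the data are invented, and zero distances are clamped to a small constant instead of being handled as exact matches.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=2, metric='euclidean', weights='uniform'):
    """Predict a rating for x_new from its k nearest training samples (hypothetical helper)."""
    diff = X_train - x_new
    if metric == 'manhattan':
        dist = np.abs(diff).sum(axis=1)
    else:
        dist = np.sqrt((diff ** 2).sum(axis=1))

    nearest = np.argsort(dist)[:k] # Indices of the k closest samples

    if weights == 'distance':
        # Closer neighbours count for more; clamp to avoid division by zero
        w = 1.0 / np.maximum(dist[nearest], 1e-12)
    else:
        w = np.ones(k)

    return float(np.sum(w * y_train[nearest]) / np.sum(w))

# Hypothetical data: three films described by two users' ratings
X_toy = np.array([[5.0, 1.0], [3.0, 1.0], [1.0, 5.0]])
y_toy = np.array([4.5, 4.0, 1.0])
x_new = np.array([4.5, 1.0])

uniform_pred = knn_predict(X_toy, y_toy, x_new, k=2, weights='uniform')
weighted_pred = knn_predict(X_toy, y_toy, x_new, k=2, weights='distance')

print(uniform_pred, weighted_pred)
```

With uniform weights the prediction is the plain average of the two nearest targets, while distance weighting pulls it towards the closer neighbour.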

### Making predictions

Now that we have a final model, we can make recommendations about films we haven't rated:

In [None]:
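predictions = pd.Series(dtype=float) # Explicit dtype avoids a warning for empty series in newer pandas
for col in df.drop(y.index, axis=1).columns: # Loop over the films we haven't rated
    predictions[col] = gs.predict(df.loc[:, [col]].T)[0] # One row per film, one column per user
predictions = predictions.sort_values(ascending=False) # Highest predicted rating first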

predictions.head(10)

It's worth noting that we have just filled in missing values in an arbitrary way here - there's no guarantee that filling in a missing entry with the mean, median or mode is the right thing to do. In practice, to build a robust recommender system, we would need to consider this more carefully, e.g. whether we should remove some entries and replace others, use different replacements in different cases, etc.

An alternative approach would be to round all ratings to the nearest integer and treat the set of integer ratings [1, 2, 3, 4, 5] as categories rather than numbers (i.e. use $k$ nearest neighbours classification). This way, we could treat missing values (NaNs) as a separate category and not have to replace them at all.
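
One possible way to set up such an encoding is sketched below. Several details are assumptions rather than part of the lab: the code `0` for "not rated", the use of Hamming distance to compare films, and the invented `toy_X` data. Note also that `round` sends exact halves to the nearest even integer, so 2.5 becomes 2 and 3.5 becomes 4.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are films, columns are users
toy_X = pd.DataFrame(
    {1: [5.0, 5.0, 1.0], 2: [np.nan, np.nan, 1.0], 3: [3.0, 4.0, np.nan]},
    index=['Film A', 'Film B', 'Film C'],
)

# Round to integer categories and use 0 (an assumed code) for "not rated"
codes = toy_X.round().fillna(0).astype(int)

def hamming(a, b):
    """Fraction of users whose rating category differs between two films."""
    return float(np.mean(a != b))

d_ab = hamming(codes.loc['Film A'].to_numpy(), codes.loc['Film B'].to_numpy())
d_ac = hamming(codes.loc['Film A'].to_numpy(), codes.loc['Film C'].to_numpy())

print(codes)
print(d_ab, d_ac)
```

Here `Film A` and `Film B` agree on two of three users (including one shared "not rated" entry), so they are closer to each other than either is to `Film C`.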