# Understand Collaborative Filtering From Scratch

## Introduction to different types of Collaborative Filtering

### User-User Approach 
Also known as the "Neighborhood-based approach" or "User-User Collaborative Filtering", it calculates the similarity between users and uses it as the weight when predicting the rating of the target item.

![Moving Rating](./_pic/MovieRatingExample.png)

In this example, we want to know whether we should recommend Fargo to Ken. Using the User-User approach, we first calculate the similarity between Ken and every other user. Typically, similarity is calculated using the Pearson correlation. To make a prediction, we take each user's rating of Fargo as a value (- as 0, + as 1) and compute the weighted sum of these values, using each user's similarity to Ken as the weight. The denominator is not $n$ because this is a weighted average: instead of dividing equally by 4, we divide by the sum of the absolute similarities, so the more similar a user is to Ken, the more weight that user's rating receives.

\begin{align}
\ p(X_i) = \frac{\sum_{i=1}^n Y_i * r(X, Y)}{\sum_{i=1}^n |r(X, Y)|}, \text{where n=4}.
\end{align}

$X$: target user

$Y$: other users

$n$: number of users

$Y_1$: Amy's rating on Fargo, which is -

$r(X, Y)$: the Pearson correlation between the target user and the $i$-th other user. When $i = 1$, it is the correlation between Ken ($X$) and Amy ($Y$). $X$ and $Y$ are both vectors of length 4.

$p(X_i)$: the final predicted value used to decide whether we should recommend Fargo to Ken.
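
To make the formula concrete, here is a minimal sketch of the weighted prediction. The rating values are hypothetical (they are not read from the figure), and the names `ken`, `neighbors`, and `fargo` are chosen only for illustration.

```python
import numpy as np

# Hypothetical like/dislike data (1 = "+", 0 = "-"); not the values in the figure.
# Each vector covers four movies that Ken and the neighbor have both rated.
ken = np.array([1, 1, 0, 1])
neighbors = np.array([
    [1, 0, 0, 1],   # neighbor 1
    [0, 0, 1, 0],   # neighbor 2
    [1, 1, 0, 0],   # neighbor 3
    [0, 1, 1, 1],   # neighbor 4
])
# Each neighbor's rating of Fargo (the Y_i in the formula)
fargo = np.array([1, 0, 1, 0])

def pearson(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    return np.corrcoef(x, y)[0, 1]

# r(X, Y) for every neighbor, used as the weights
weights = np.array([pearson(ken, y) for y in neighbors])

# Weighted sum divided by the sum of absolute weights
prediction = np.sum(fargo * weights) / np.sum(np.abs(weights))
print(weights, prediction)
```

Neighbors with a negative correlation pull the prediction down when they liked the movie, which is why the denominator uses absolute values.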

### Item-Item Approach 
This approach is simply an inversion of the neighborhood-based approach. The intuition is that instead of recommending based on the similarity between users, we recommend based on the similarity between items.

Using the same example as above, to predict whether we should recommend Fargo to Ken, we calculate the similarity between Fargo and each of the other items. The ratings for Fargo and Pulp Fiction are the same, which means that people tend to rate these two items similarly, so we may predict that Ken will also like Fargo. In the form of equations:

\begin{align}
\ p(X_i) = \frac{\sum_{i=1}^n Y_i * r(X, Y)}{\sum_{i=1}^n |r(X, Y)|}, \text{where n=4}.
\end{align}

$X$: target item

$Y$: other items

$n$: number of items

$Y_1$: Ken's rating of The Piano, which is +.

$r(X, Y)$: the Pearson correlation between the target item and the $i$-th other item. When $i = 1$, it is the correlation between Fargo ($X$) and The Piano ($Y$). $X$ and $Y$ are both vectors of length 4.

$p(X_i)$: the final predicted value used to decide whether we should recommend Fargo to Ken.
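
The item-based version can be sketched the same way, with item columns in place of user rows. The ratings below are hypothetical, and "movie C" and "movie D" are placeholder titles; only the fact that Fargo and Pulp Fiction share identical ratings mirrors the text.

```python
import numpy as np

# Hypothetical 0/1 ratings (1 = "+", 0 = "-") from four other users.
# Columns: The Piano, Pulp Fiction, movie C, movie D (C and D are placeholders).
others = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])
# The same four users' ratings of Fargo (identical to the Pulp Fiction column)
fargo_by_others = np.array([1, 0, 1, 1])
# Ken's own ratings of the four other movies (the Y_i in the formula)
ken = np.array([1, 1, 0, 0])

# Pearson correlation between Fargo and each other movie
item_sims = np.array([
    np.corrcoef(fargo_by_others, others[:, j])[0, 1]
    for j in range(others.shape[1])
])

# Ken's ratings weighted by how similar each movie is to Fargo
prediction = np.sum(ken * item_sims) / np.sum(np.abs(item_sims))
print(item_sims, prediction)
```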

### Classification Approach

Collaborative filtering can also be formulated as a classification problem. Once the user ratings are converted into the format shown in the following screenshot, almost any supervised learning method can be used to classify unseen items. Pazzani and Billsus, two recommender system researchers, first reduce the dimensionality of the vectors by applying singular value decomposition and then use a neural network as the learning method.

![Moving Rating Classification](./_pic/MovieRatingExample-Classification.png)
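
A rough sketch of this pipeline is shown below. It assumes that each movie the target user has rated is one training example whose features are the other users' ratings, and it uses synthetic data. It also swaps the neural network for a plain logistic regression trained by gradient descent, purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 30 movies rated by the target user, each described by
# the 0/1 ratings of 8 other users (features). y is the target user's rating.
X = rng.integers(0, 2, size=(30, 8)).astype(float)
y = (X[:, 0] + X[:, 1] >= 1).astype(float)   # made-up labels for illustration

# Step 1: reduce dimensionality with SVD, keeping the top 3 components
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z = X_centered @ Vt[:3].T

# Step 2: fit a classifier on the reduced vectors.
# Logistic regression is used here as a stand-in for the neural network.
w = np.zeros(Z.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    w -= 0.5 * Z.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(Z @ w + b))) >= 0.5).astype(float)
accuracy = np.mean(pred == y)
print(Z.shape, accuracy)
```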

### Note
* Collaborative filtering methods are commonly grouped into three types: memory-based, model-based, and hybrid.

## Using MovieLens Dataset to implement Collaborative Filtering

The data can be downloaded [here](https://grouplens.org/datasets/movielens/100k/).

According to GroupLens, the MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies
* Each user has rated at least 20 movies
* Simple demographic info for the users (age, gender, occupation, zip)
* Genre information of movies

In [1]:
import numpy as np
import pandas as pd

In [2]:
cd ml-100k/

/Users/johnnychiu/Desktop/MyFiles/learning/Machine-Learning/RecommenderSystem/ml-100k


### Read files

In [16]:
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=names)
df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


In [5]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print(str(n_users) + ' users')
print(str(n_items) + ' items')

943 users
1682 items


We want to build the rating matrix, in which each row represents a user, each column represents a movie, and each value is the rating that user gave to that movie.

In [6]:
ratings = np.zeros((n_users, n_items))
ratings.shape

(943, 1682)

In [9]:
for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]
ratings

array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
       [ 4.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 5.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  5.,  0., ...,  0.,  0.,  0.]])
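
The loop above can also be written with NumPy indexing, which makes it easy to check how sparse the matrix is. The sketch below uses a few made-up triples rather than `u.data`, and the names `triples`, `toy`, and `density` are illustrative only. For the real dataset, 100,000 ratings over 943 × 1682 cells means only about 6.3% of the matrix is filled.

```python
import numpy as np

# Toy (user_id, item_id, rating) triples with 1-based ids, mimicking u.data
triples = np.array([
    [1, 1, 5],
    [1, 3, 4],
    [2, 2, 3],
    [3, 1, 1],
])
n_users = triples[:, 0].max()
n_items = triples[:, 1].max()

# Fill the matrix in one step; subtract 1 to convert ids to zero-based indices
toy = np.zeros((n_users, n_items))
toy[triples[:, 0] - 1, triples[:, 1] - 1] = triples[:, 2]

# Percentage of user-item pairs that actually have a rating
density = 100.0 * np.count_nonzero(toy) / toy.size
print('{:.1f}% of entries are rated'.format(density))
```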

In supervised learning, we usually separate the data into training and testing datasets by randomly sampling rows of the dataframe. The reason is that we want to fit a function to the training dataset and use that function to predict the outcome of the testing dataset.

However, for recommender systems with collaborative filtering (no features), this no longer works, because all of the items and users need to be available when the model is first built. We need rating data for each user in order to compute their recommended items, in this case movies. Since we already know that each user has rated at least 20 movies, we randomly take 10 ratings from each user and assign them to the test dataset.

One point that is easy to get stuck on is how to compute the MSE, since the predictions built from the training data cover every user-movie pair while the test dataset holds only 10 ratings per user. We only use the movies rated in the test dataset to calculate the mean squared error (MSE). That is, from the prediction matrix we keep only the entries at the indices where the test dataset is non-zero, and compare those entries with the test ratings.
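
A tiny worked example may help here. The numbers are made up; the point is only that entries where the test matrix is zero are left out of the error.

```python
import numpy as np

# Dense predictions for 2 users x 3 movies (made-up values)
pred = np.array([[3.0, 1.5, 4.0],
                 [2.0, 4.5, 0.5]])
# Test ratings; 0 means "this pair is not in the test set"
test = np.array([[4.0, 0.0, 0.0],
                 [0.0, 5.0, 0.0]])

# Keep only the positions that hold a test rating
mask = test != 0
mse = np.mean((pred[mask] - test[mask]) ** 2)
print(mse)  # ((3 - 4)^2 + (4.5 - 5)^2) / 2 = 0.625
```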

### Generating Train and Test datasets

In [31]:
def train_test_split(ratings):
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        # Randomly pick 10 of this user's rated movies to hold out
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                        size=10, 
                                        replace=False)
        # Remove the held-out ratings from train and copy them into test
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]
        
    # Test and training are truly disjoint
    assert(np.all((train * test) == 0)) 
    return train, test
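
Because the split is random, the MSE values will differ from run to run. One possible variation, sketched here under the assumption that reproducibility matters, is to pass a seed and the number of held-out ratings as parameters. The function name `train_test_split_seeded` and its parameters are hypothetical.

```python
import numpy as np

def train_test_split_seeded(ratings, n_test=10, seed=0):
    """Hold out n_test rated items per user, using a seeded generator."""
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    test = np.zeros(ratings.shape)
    for user in range(ratings.shape[0]):
        rated = ratings[user].nonzero()[0]
        held_out = rng.choice(rated, size=n_test, replace=False)
        train[user, held_out] = 0.
        test[user, held_out] = ratings[user, held_out]
    return train, test

# Toy check: 5 users who each rated all 30 items with a score from 1 to 5
toy = np.random.default_rng(1).integers(1, 6, size=(5, 30)).astype(float)
toy_train, toy_test = train_test_split_seeded(toy, n_test=10, seed=42)
print((toy_test != 0).sum(axis=1))  # 10 held-out ratings per user
```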

In [32]:
train, test = train_test_split(ratings)
print('#Train')
print(train.shape)
print(train)
print('\n')

print('#Test')
print(test.shape)
print(test)

#Train
(943, 1682)
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]


#Test
(943, 1682)
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


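The following function calculates the cosine similarity between each pair of users (or items) in the input dataset. Note that cosine similarity is not the same as the Pearson correlation: the two coincide only when each vector is mean-centered first, which this function does not do.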

### Build a recommender system using collaborative filtering
1. Calculate the similarity
2. Predict the rating scores using the User-User and Item-Item approaches
3. Validate the result using MSE

In [37]:
def fast_similarity(ratings, kind='user', epsilon=1e-9):
    # epsilon -> small number for handling divide-by-zero errors
    if kind == 'user':
        sim = ratings.dot(ratings.T) + epsilon
    elif kind == 'item':
        sim = ratings.T.dot(ratings) + epsilon
    else:
        raise ValueError("kind must be 'user' or 'item'")
    # The diagonal holds squared vector norms; dividing by both norms gives cosine similarity
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return (sim / norms / norms.T)
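
As a quick check on how cosine similarity relates to the Pearson correlation, the sketch below compares the two on a pair of made-up rating vectors. The helper `cosine` is defined only for this illustration.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([5., 3., 4., 1.])
y = np.array([4., 1., 5., 2.])

cos_raw = cosine(x, y)                                # on the raw ratings
pearson = np.corrcoef(x, y)[0, 1]                     # Pearson correlation
cos_centered = cosine(x - x.mean(), y - y.mean())     # cosine after mean-centering

print(cos_raw, pearson, cos_centered)
```

The raw cosine value differs from the Pearson correlation, while the mean-centered cosine matches it.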

In [38]:
user_similarity = fast_similarity(train, kind='user')
item_similarity = fast_similarity(train, kind='item')

In [42]:
print(user_similarity.shape)
user_similarity

(943, 943)


array([[ 1.        ,  0.14723095,  0.04579818, ...,  0.11663262,
         0.18089804,  0.36751597],
       [ 0.14723095,  1.        ,  0.07608248, ...,  0.11787781,
         0.19125926,  0.0819063 ],
       [ 0.04579818,  0.07608248,  1.        , ...,  0.143708  ,
         0.12051638,  0.02956977],
       ..., 
       [ 0.11663262,  0.11787781,  0.143708  , ...,  1.        ,
         0.1062942 ,  0.05239333],
       [ 0.18089804,  0.19125926,  0.12051638, ...,  0.1062942 ,
         1.        ,  0.17675665],
       [ 0.36751597,  0.0819063 ,  0.02956977, ...,  0.05239333,
         0.17675665,  1.        ]])

In [44]:
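def predict_fast_simple(ratings, similarity, kind='user'):
    # Weighted average of ratings, with similarities as the weights.
    # Unrated entries (zeros) take part in the sum as ratings of 0.
    if kind == 'user':
        return similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif kind == 'item':
        return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])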
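
To see that the matrix expression is the same weighted average as in the formula, the sketch below computes one entry both ways on a small made-up example. The matrices are illustrative only.

```python
import numpy as np

# Made-up ratings (3 users x 3 items) and a symmetric user similarity matrix
ratings = np.array([[5., 0., 3.],
                    [4., 2., 0.],
                    [0., 1., 4.]])
similarity = np.array([[1.0, 0.6, 0.2],
                       [0.6, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])

# Matrix form, equivalent to the 'user' branch of predict_fast_simple
matrix_pred = similarity.dot(ratings) / np.abs(similarity).sum(axis=1)[:, np.newaxis]

# Explicit weighted average for user 0 and item 1
u, i = 0, 1
explicit = sum(similarity[u, v] * ratings[v, i] for v in range(3)) / np.abs(similarity[u]).sum()

print(matrix_pred[u, i], explicit)  # both equal (0.6*2 + 0.2*1) / 1.8
```

Note that user 0's own missing rating enters the sum as a 0 with weight 1.0, which pulls the prediction toward zero. This is one likely reason the MSE values of this simple approach are large.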

In [55]:
item_prediction = predict_fast_simple(train, item_similarity, kind='item')
user_prediction = predict_fast_simple(train, user_similarity, kind='user')

`user_prediction` and `item_prediction` hold the predicted rating score of every movie for every user.

In [56]:
print(user_prediction.shape)
user_prediction

(943, 1682)


array([[  2.24055006e+00,   7.74126048e-01,   4.59618296e-01, ...,
          5.47242008e-04,   6.91019414e-03,   7.74567790e-03],
       [  1.74732903e+00,   3.41433616e-01,   3.13708503e-01, ...,
          3.45427384e-03,   3.08290521e-03,   2.01289791e-03],
       [  1.29736821e+00,   3.30136379e-01,   2.61173855e-01, ...,
          9.50596148e-03,   3.16705197e-03,   1.63376661e-03],
       ..., 
       [  1.80659731e+00,   3.93393694e-01,   3.28852773e-01, ...,
          2.41578076e-03,   4.24774455e-03,   1.12324913e-03],
       [  1.88930554e+00,   5.99816782e-01,   3.19051195e-01, ...,
          2.47856144e-03,   5.43287722e-03,   3.59721904e-03],
       [  2.30551510e+00,   9.09929162e-01,   4.93472907e-01, ...,
          7.86337360e-15,   7.94449896e-03,   7.59479840e-03]])

In [58]:
from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
    # Ignore zero terms: keep only the entries that are rated in `actual`.
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return mean_squared_error(pred, actual)

In [59]:
item_prediction = predict_fast_simple(train, item_similarity, kind='item')
user_prediction = predict_fast_simple(train, user_similarity, kind='user')

print('User-based CF MSE: ' + str(get_mse(user_prediction, test)))
print('Item-based CF MSE: ' + str(get_mse(item_prediction, test)))

User-based CF MSE: 8.4931401358
Item-based CF MSE: 11.6132805105


In this case, the User-User approach gives the lower MSE, so it is the better choice for recommendation here.

----

### Build a recommender system using Top $K$ collaborative filtering
We can attempt to improve our prediction MSE by only considering the top $k$ users who are most similar to the input user (or, similarly, the top $k$ items). That is, when we calculate the sums over $Y_i$, we only sum over the top $k$ most similar users.


\begin{align}
\ p(X_i) = \frac{\sum_{i=1}^n Y_i * r(X, Y)}{\sum_{i=1}^n |r(X, Y)|}, \text{where n=k, instead of the total number of other users}.
\end{align}

1. Calculate the similarity, in the same way as before
2. Predict the rating scores using the top $k$ User-User and Item-Item approaches
3. Validate the result using MSE

In [61]:
def predict_topk(ratings, similarity, kind='user', k=40):
    pred = np.zeros(ratings.shape)
    if kind == 'user':
        for i in range(ratings.shape[0]):
            # Indices of the k users most similar to user i (normally includes user i itself)
            top_k_users = np.argsort(similarity[:, i])[:-k-1:-1]
            for j in range(ratings.shape[1]):
                pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users]) 
                pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
    if kind == 'item':
        for j in range(ratings.shape[1]):
            # Indices of the k items most similar to item j (normally includes item j itself)
            top_k_items = np.argsort(similarity[:, j])[:-k-1:-1]
            for i in range(ratings.shape[0]):
                pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T) 
                pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))        
    
    return pred
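
The nested loops make this function slow on the full matrix. One possible speed-up for the user-based case, sketched below, is to replace the inner loop over items with a single matrix product per user. The function name `predict_topk_user_fast` is hypothetical, and the toy data exists only to exercise it.

```python
import numpy as np

def predict_topk_user_fast(ratings, similarity, k=40):
    """Top-k user-based prediction with one matrix product per user."""
    pred = np.zeros(ratings.shape)
    for i in range(ratings.shape[0]):
        # Indices of the k users most similar to user i
        top_k = np.argsort(similarity[:, i])[:-k-1:-1]
        weights = similarity[i, top_k]
        # Weighted average over the selected users, for all items at once
        pred[i, :] = weights.dot(ratings[top_k, :]) / np.abs(weights).sum()
    return pred

# Toy data: 6 users x 8 items with ratings from 0 (unrated) to 5
rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(6, 8)).astype(float)

# Cosine similarity between users, computed the same way as fast_similarity
toy_sim = toy.dot(toy.T) + 1e-9
norms = np.sqrt(np.diagonal(toy_sim))
toy_sim = toy_sim / norms[:, np.newaxis] / norms[np.newaxis, :]

toy_pred = predict_topk_user_fast(toy, toy_sim, k=3)
print(toy_pred.shape)
```

When `k` equals the number of users, every user is selected, so the result should match the plain weighted average over all users.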

In [62]:
pred = predict_topk(train, user_similarity, kind='user', k=40)
print('Top-k User-based CF MSE: ' + str(get_mse(pred, test)))

pred = predict_topk(train, item_similarity, kind='item', k=40)
print('Top-k Item-based CF MSE: ' + str(get_mse(pred, test)))

Top-k User-based CF MSE: 6.51275063533
Top-k Item-based CF MSE: 7.74005106481


----

### Build a recommender system using Bias-subtracted Collaborative Filtering
Some users tend to give lower ratings, while others tend to give generally higher ratings. This affects the predicted score, because a user who rates generously contributes more to the weighted sum than one who generally rates lower.

Let us try subtracting each user's average rating when summing over similar users' ratings and then adding that average back in at the end. We modify our equation to become:

\begin{align}
\ p(X_i) = \bar{Y} + \frac{\sum_{i=1}^n (Y_i- \bar{Y}) * r(X, Y)}{\sum_{i=1}^n |r(X, Y)|}, \text{where n=4}.
\end{align}

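where $\bar{Y}$ denotes a user's average rating. In the implementation below, each user's own average is subtracted from their ratings inside the sum, and the target user's average is added back at the end.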

In [64]:
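def predict_nobias(ratings, similarity, kind='user'):
    if kind == 'user':
        # Mean of each row; unrated entries (zeros) are included in this mean
        user_bias = ratings.mean(axis=1)
        ratings = (ratings - user_bias[:, np.newaxis]).copy()
        pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        # Add each target user's mean back to that user's predictions
        pred += user_bias[:, np.newaxis]
    elif kind == 'item':
        # Mean of each column; unrated entries (zeros) are included in this mean
        item_bias = ratings.mean(axis=0)
        ratings = (ratings - item_bias[np.newaxis, :]).copy()
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
        # Add each target item's mean back to that item's predictions
        pred += item_bias[np.newaxis, :]
        
    return pred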
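
Note that `ratings.mean(axis=1)` averages over every movie, including the unrated zeros, so the "average rating" it produces is much lower than what the user typically gives. A possible alternative, shown here only as a sketch, is to average over the rated entries alone. The helper `user_means_rated_only` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def user_means_rated_only(ratings):
    """Per-user mean over nonzero (rated) entries; 0 for users with no ratings."""
    counts = (ratings != 0).sum(axis=1)
    sums = ratings.sum(axis=1)
    # Divide only where a user has at least one rating
    return np.divide(sums, counts,
                     out=np.zeros_like(sums, dtype=float),
                     where=counts > 0)

toy = np.array([[5., 0., 3., 0.],
                [0., 0., 0., 0.],
                [2., 4., 0., 0.]])

means_all = toy.mean(axis=1)              # zeros included
means_rated = user_means_rated_only(toy)  # zeros excluded
print(means_all, means_rated)
```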

In [65]:
user_pred = predict_nobias(train, user_similarity, kind='user')
print('Bias-subtracted User-based CF MSE: ' + str(get_mse(user_pred, test)))

item_pred = predict_nobias(train, item_similarity, kind='item')
print('Bias-subtracted Item-based CF MSE: ' + str(get_mse(item_pred, test)))

Bias-subtracted User-based CF MSE: 8.7710448431
Bias-subtracted Item-based CF MSE: 9.79353560567


------

### Validating the Recommender System
Let's look at our item similarity matrix and see if similar items "make sense". We input a selected movie and check whether the top 5 recommended movies are "good". There is a website called [themoviedb.org](themoviedb.org) which has a free API. If we have the IMDB "movie id" for a movie, we can use this API to return the movie's poster.

In the u.item data, we have a movie_url for each movie. In principle, we could use the movie_url to get the movie_id for each movie and send it as a parameter to themoviedb.org's API to get the movie posters. However, the movie_url no longer works, so I will skip this part.

In [67]:
!head -5 u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0


In [83]:
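# Load in movie data
import io

idx_to_movie = {}
# u.item is assumed to be Latin-1 encoded; reading it as UTF-8 may fail on accented titles
with io.open('u.item', 'r', encoding='latin-1') as f:
    for line in f:
        info = line.split('|')
        # Map the zero-based movie index to the movie title
        idx_to_movie[int(info[0])-1] = info[1]
        
def top_k_movies(similarity, mapper, movie_idx, k=6):
    # The most similar entry is the query movie itself, so k=6 returns it plus 5 others
    return [mapper[x] for x in np.argsort(similarity[movie_idx,:])[:-k-1:-1]]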
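
Since the query movie always appears in its own list, a small variation is to drop it explicitly and return only the other movies. The sketch below uses a made-up similarity matrix and mapper; the function name `top_k_similar` is hypothetical.

```python
import numpy as np

def top_k_similar(similarity, mapper, movie_idx, k=5):
    """Return the k most similar movies, excluding the query movie itself."""
    order = np.argsort(similarity[movie_idx, :])[::-1]   # most similar first
    order = order[order != movie_idx][:k]                # drop the query movie
    return [mapper[x] for x in order]

# Made-up similarity matrix for three movies
toy_sim = np.array([[1.0, 0.2, 0.9],
                    [0.2, 1.0, 0.5],
                    [0.9, 0.5, 1.0]])
toy_mapper = {0: 'A', 1: 'B', 2: 'C'}

print(top_k_similar(toy_sim, toy_mapper, 0, k=2))  # ['C', 'B']
```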

In [84]:
idx = 0 # Toy Story
movies = top_k_movies(item_similarity, idx_to_movie, idx)
movies

['Toy Story (1995)',
 'Star Wars (1977)',
 'Return of the Jedi (1983)',
 'Independence Day (ID4) (1996)',
 'Rock, The (1996)',
 'Fargo (1996)']

In [86]:
idx = 70 # Lion King
movies = top_k_movies(item_similarity, idx_to_movie, idx)
movies

['Lion King, The (1994)',
 'Aladdin (1992)',
 'Forrest Gump (1994)',
 'Beauty and the Beast (1991)',
 'Jurassic Park (1993)',
 'E.T. the Extra-Terrestrial (1982)']

Reference
* http://recommender-systems.org/collaborative-filtering/
* https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/
* http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/
* http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/recsys/1_ALSWR.ipynb

Mathematical expressions reference
* http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Typesetting%20Equations.html
* http://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/