# Memory Based Collaborative Filtering

Collaborative filtering produces recommendations based on the knowledge of users’ attitude to items, that is it uses the “wisdom of the crowd” to recommend items.

#### Import the libraries

In [38]:
import numpy as np
import pandas as pd
from sklearn import cross_validation as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt

**Read the Dataset** We will use MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. It contains 100k movie ratings from 943 users and a selection of 1682 movies. It can be downlaoded from <a href='http://files.grouplens.org/datasets/movielens/ml-100k.zip'>here</a>

In [4]:
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=header)

#### Sneak peek of the dataset

In [5]:
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)

Number of users = 943 | Number of movies = 1682


#### Split the dataset in Train and Test

In [40]:
train_data, test_data = cv.train_test_split(df, test_size=0.25)

Memory-Based Collaborative Filtering approaches can be divided into two main sections: **user-item filtering** and **item-item filtering**. A *user-item filtering* will take a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked. In contrast, *item-item filtering* will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and outputs other items as recommendations.

1. *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”
2. *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”

In both cases, we create a user-item matrix which we build from the entire dataset. Since we have split the data into testing and training we will need to create two $[943 \times 1682]$ matrices. The training matrix contains 75% of the ratings and the testing matrix contains 25% of the ratings.

Example of User-Item matrix:
<img src="files/user-item-matrix.png">

The similarity values between items in *Item-Item Collaborative Filtering* are measured by observing all the users who have rated both items.
<img src="files/item-item-cf.png">

For *User-Item Collaborative Filtering* the similarity values between users are measured by observing all the items that are rated by both users.
<img src="files/user-item-cf.png">

**Cosine Similarity** is used to measure similarity between 2 vectors ``m`` and `n`. If the value is 1, then the vectors are completely similar and if the value is -1, they are not at all similar. 

#### Create User-Item matrix

In [18]:
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

**Calculate Cosine Similarity**

In [33]:
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

**Make Predictions** 

To make a prediction we can use the following formula for user-based CF $\hat{x_{k,m}}=\bar{x_k}+\frac{\sum_{u_a}sim_i(i_m,i_b)(x_{k,b})}{\sum_{i_b}|sim_i(i_m,i_b)|}$. We can look at the similarity between users $k$ and $a$, as weights that are multiplied by ratings of a similar user $a$ (corrected for the average rating of that user). Also, we normalise it so that ratings between 1 and 5. In the final step, we sum the average ratings for the user that we are trying to predict.

For making a prediction for item-based CF we use the formula
$\hat{x_{k,m}}=\frac{\sum_{i_b}(i_m,i_b)(x_{k,b})}{\sum_{i_b}|sim_i(i_m,i_b)|}$

In [34]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred

item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

**Evaluation**
To evaluate accuracy of predicted ratings we use *Root Mean Squared Error (RMSE)*
$RMSE = \sqrt{\frac{1}{N} \sum_i(x_i - \hat{x_i})^2}$

In [39]:
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))\

print 'User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))
print 'Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix))

User-based CF RMSE: 3.11312149866
Item-based CF RMSE: 3.44216869476


Memory-based algorithms are easy to implement and produce reasonable prediction quality.
The drawback of memory-based CF is that it doesn’t scale to real-world scenarios and doesn’t address the well-known cold-start problem, that is when new user or new item enters the system. Model-based CF methods are scalable and can deal with higher sparsity level than memory-based models, but also suffer when new users or items that don’t have any ratings enter the system.