# Collaborative Filtering Project

***
***Group Member***:
***

## Introduction

Collaborative filtering is a technique used by recommender systems. It predicts a user's taste (rating) for a specific item by collecting preference information from many other users. There are two main families of collaborative filtering algorithms: memory-based and model-based.

In this project, we test both memory-based and model-based methods and discuss how their results differ.

## Data

The data we use is MovieLens 100k, which contains 100,000 ratings from 943 users on 1,682 movies. Each row of the data file holds the user id, the id of the movie rated, the rating, and the timestamp. We split the data into two sets: a training set and a testing set.

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#Load the tab-separated ratings file
column_names = ['user_id','item_id','rating','timestamp']
df = pd.read_csv('data/u.data', sep='\t', names=column_names)

#Count the distinct users and items
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]

#Hold out 25% of the ratings for testing
train_data, test_data = train_test_split(df, test_size=0.25)

We then build a user-item matrix for the training set and another for the testing set.

In [5]:
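#Create two user-item matrices, one for training and another for testing.
#Each tuple from itertuples() is (index, user_id, item_id, rating, timestamp),
#and ids are 1-based, so we subtract 1 to get 0-based row/column positions.
#Entries with no rating stay at 0.
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]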
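To make the matrix construction concrete, here is a minimal sketch on a made-up set of four ratings (hypothetical values, not taken from MovieLens). It fills the matrix in one vectorized step instead of a loop and then measures how much of the matrix is actually observed. The names `ratings`, `toy_matrix`, and `density` are illustrative only.

```python
import numpy as np

# Hypothetical toy ratings: each row is (user_id, item_id, rating), ids are 1-based.
ratings = np.array([
    [1, 1, 5],
    [1, 3, 3],
    [2, 2, 4],
    [3, 3, 1],
])

# Convert the 1-based ids to 0-based row and column positions.
users = ratings[:, 0] - 1
items = ratings[:, 1] - 1

# Unrated cells stay at 0, as in the matrices built above.
toy_matrix = np.zeros((3, 3))
toy_matrix[users, items] = ratings[:, 2]

# Fraction of cells that hold an observed rating.
density = np.count_nonzero(toy_matrix) / toy_matrix.size
print(toy_matrix)
print(density)
```

For the full data set the same ratio is 100,000 / (943 × 1,682), or roughly 6.3%, so most entries of the user-item matrix are zeros that stand for "not rated".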

## Procedure

### Memory Based

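Memory-based collaborative filtering makes predictions directly from the stored ratings. We first compare every pair of users (rows of the training matrix) and every pair of items (columns of the training matrix) using the cosine metric.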

In [14]:
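from sklearn.metrics.pairwise import pairwise_distances
#Note: with metric='cosine', pairwise_distances returns the cosine distance
#(1 - cosine similarity), so larger values mean less similar rows
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')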
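The difference between cosine similarity and cosine distance is easy to overlook, so here is a minimal NumPy-only sketch on a made-up 3×3 matrix. The helper `cosine_similarity_matrix` is a hypothetical name written for this illustration; it is not part of scikit-learn, which offers its own `cosine_similarity` function in `sklearn.metrics.pairwise`.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Hypothetical toy ratings: users 0 and 1 rate identically, user 2 shares no items with them.
toy = np.array([
    [5.0, 0.0, 3.0],
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
])

sim = cosine_similarity_matrix(toy)  # 1 means same direction, 0 means no overlap
dist = 1.0 - sim                     # cosine distance, the quantity a 'cosine' distance metric reports
print(sim)
print(dist)
```

Identical users get similarity 1 and distance 0, while users with no items in common get similarity 0 and distance 1. Which of the two quantities is used as the weight therefore changes which neighbours influence a prediction the most.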

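For user-based prediction, we start from each user's mean rating and add a similarity-weighted average of the other users' deviations from their own means. For item-based prediction, we take a similarity-weighted average of the user's own ratings over all items.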

In [15]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        #Mean of each user's row (unrated entries count as 0)
        mean_user_rating = ratings.mean(axis=1)
        #np.newaxis turns the means into a column so they broadcast across each row of ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        #Similarity-weighted average of the deviations, added back to each user's mean
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        #Similarity-weighted average of each user's ratings over all items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
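Read off the `predict` function above, the two branches correspond to the formulas below. The notation is introduced here for illustration and is not part of the original notebook: $x_{u,i}$ is the entry of the ratings matrix for user $u$ and item $i$, $\bar{x}_u$ is the mean of row $u$ (unrated zeros included, as in the code), and $s_{a,b}$ is the entry of the matrix passed as `similarity`.

User-based:
$$\hat{x}_{u,i} = \bar{x}_u + \frac{\sum_{v} s_{u,v}\,(x_{v,i}-\bar{x}_v)}{\sum_{v} |s_{u,v}|}$$

Item-based:
$$\hat{x}_{u,i} = \frac{\sum_{j} x_{u,j}\, s_{j,i}}{\sum_{j} |s_{i,j}|}$$

In both cases the denominator normalizes the weights so that the prediction stays on a scale comparable to the ratings.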

### Model Based

There are many model-based collaborative filtering methods. Here, we explore one that uses singular value decomposition (SVD).

SVD is a matrix factorization technique that is often used to reduce the number of features of a data set by reducing the dimension of the space from N to K. In collaborative filtering, it helps us learn the latent preferences of users and the latent attributes of items from the known ratings. Latent variables are those that are not directly shown in the data. For example, the dataset gives us the user id, age, location, gender, movie id, director, actor, language, year, and rating. After matrix factorization, the model may learn a user's age group, such as under 10, 10-18, 18-30, etc. For movies, it may learn that decade, director, and actor matter most, even though decade is not an explicit feature in the dataset.

For a given $m\times n$ matrix $X$ of rank $r$, SVD factorizes it as:
$$X = USV^T$$
where
* $U$ is an $m\times r$ matrix with orthonormal columns that represents the feature vectors of the users in the hidden feature space.
* $S$ is an $r\times r$ diagonal matrix of non-negative real numbers, the singular values of $X$.
* $V^T$ is an $r\times n$ matrix with orthonormal rows that represents the feature vectors of the items (movies here) in the hidden feature space.
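As a minimal sketch of the factorization above, the snippet below applies `numpy.linalg.svd` to a small made-up matrix (hypothetical values, not MovieLens data), rebuilds the matrix from its three factors, and then keeps only the `k` largest singular values to get a low-rank approximation. Note that NumPy returns min(m, n) singular values, which matches $r$ only when the matrix has full rank.

```python
import numpy as np

# Hypothetical 3-user x 4-item ratings matrix.
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Compact SVD: U is 3x3, s holds 3 singular values (largest first), Vt is 3x4.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Multiplying the three factors back together recovers X up to rounding error.
X_full = U @ np.diag(s) @ Vt

# Keep only the k largest singular values for a rank-k approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.allclose(X_full, X))
print(np.linalg.norm(X - X_k))  # Frobenius error of the rank-k approximation
```

The approximation error equals the size of the singular values that were dropped, which is why discarding the smallest ones loses the least information.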

In [32]:
from scipy.sparse.linalg import svds

#get SVD components from train matrix. Choose k.
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)

## Evaluation

Many evaluation metrics exist, but one of the most popular is *Root Mean Square Error (RMSE)*:
$$RMSE=\sqrt{\frac{1}{N}\sum (x_i-\hat{x}_i)^2}$$
Since we only want to measure accuracy on the ratings in the test set, we keep only the predictions at positions where the testing matrix holds a rating and ignore all other entries.

In [10]:
from sklearn.metrics import mean_squared_error
def rmse(prediction, ground_truth):
    #Keep only the positions that hold a rating in ground_truth
    rated = ground_truth.nonzero()
    prediction = prediction[rated].flatten()
    ground_truth = ground_truth[rated].flatten()
    return np.sqrt(mean_squared_error(ground_truth, prediction))
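To check the masking step by hand, here is a minimal NumPy-only sketch on made-up 2×2 matrices (hypothetical values). The function `masked_rmse` is an illustrative stand-in for the `rmse` function above, written without scikit-learn so that each step is visible.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the positions where ground_truth holds a rating (nonzero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical test matrix with two observed ratings (5 and 3); zeros mean "not rated".
truth = np.array([
    [5.0, 0.0],
    [0.0, 3.0],
])
# Hypothetical predictions for every cell.
pred = np.array([
    [4.0, 2.0],
    [1.0, 5.0],
])

# Only the two rated cells count: errors are -1 and 2, so RMSE = sqrt((1 + 4) / 2).
score = masked_rmse(pred, truth)
print(score)
```

The predictions for the two unrated cells have no effect on the score, which is the point of filtering with the nonzero positions of the testing matrix.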

In [16]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.1235583441397843
Item-based CF RMSE: 3.4522629345325306


In [33]:
print('Model-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

Model-based CF RMSE: 2.7144649423411744


## Conclusion

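On this train/test split, the model-based SVD approach achieved the lowest RMSE (about 2.71), followed by user-based CF (about 3.12) and item-based CF (about 3.45). All three errors are large relative to the 1 to 5 rating scale. A likely contributor is that unrated entries are stored as zeros in the training matrix, which pulls the predictions toward zero.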