# Collaborative Filtering

Collaborative filtering is a technique by which recommendation systems filter information using the preferences of other people. It rests on the assumption that if person A and person B have similar preferences on items they have both reviewed, then person A is likely to share person B's preference on an item that only person B has reviewed.

Collaborative filtering is used by many recommendation systems in various fields, including music, shopping, financial data, and social networks and by various services (YouTube, Reddit, Last.fm). Any service that uses a recommendation system most likely employs collaborative filtering.

source: https://brilliant.org/wiki/collaborative-filtering/

## Memory-based CF

Memory-based collaborative filtering approaches can be divided into two main groups: user-item (user-based) filtering and item-item (item-based) filtering. User-item filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. In contrast, item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations.

Item-Item Collaborative Filtering: “Users who liked this item also liked …”

User-Item Collaborative Filtering: “Users who are similar to you also liked …”
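
The difference between the two views can be sketched on a tiny example. The ratings below are made up for illustration, and `cosine_similarity` is a hypothetical helper written with plain NumPy (it is not the function used later in this notebook). The same helper gives user-user similarities on the matrix and item-item similarities on its transpose.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity(matrix):
    """Cosine similarity between all pairs of rows (assumes no all-zero row)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms
    return normalized @ normalized.T

user_similarity = cosine_similarity(ratings)     # user x user, shape (3, 3)
item_similarity = cosine_similarity(ratings.T)   # item x item, shape (4, 4)

# Users 0 and 1 rate alike, so they are more similar than users 0 and 2.
print(user_similarity[0, 1] > user_similarity[0, 2])
```

User-based filtering would look up neighbours in `user_similarity`, while item-based filtering would look them up in `item_similarity`.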

source: https://blog.cambridgespark.com/nowadays-recommender-systems-are-used-to-personalize-your-experience-on-the-web-telling-you-what-120f39b89c3c

## Model-based CF

Model-based collaborative filtering is based on matrix factorization (MF), which is widely known as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it copes better with scalability and sparsity than memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings (that is, to learn features that describe the characteristics of the ratings) and then to predict the unknown ratings through the dot product of the latent features of users and items. Given a very sparse matrix with many dimensions, matrix factorization restructures the user-item matrix into a low-rank structure: the matrix is represented as the product of two low-rank matrices whose rows contain the latent vectors. This product is fitted to approximate the original matrix as closely as possible, and multiplying the low-rank matrices together fills in the entries missing from the original matrix.
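
As a minimal sketch of the low-rank idea, the example below builds a small rating matrix from made-up latent vectors and recovers it with a truncated SVD from NumPy. All values and names (`user_factors`, `item_factors`, `P`, `Q`) are assumptions for illustration. The matrix here is fully observed, so the sketch shows only the factorization and the dot-product prediction, not the handling of missing ratings.

```python
import numpy as np

# Hypothetical latent vectors: 4 users x 2 features and 2 features x 3 items.
user_factors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
item_factors = np.array([[4.0, 1.0, 3.0], [1.0, 5.0, 2.0]])

# The resulting 4 x 3 rating matrix has rank 2 by construction.
R = user_factors @ item_factors

# Thin SVD, then keep only the k largest singular values.
U, S, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
P = U[:, :k] * S[:k]   # latent user vectors (one row per user)
Q = Vt[:k, :]          # latent item vectors (one column per item)
R_approx = P @ Q

# A single predicted rating is the dot product of a user and an item vector.
prediction_user3_item1 = P[3] @ Q[:, 1]
print(np.allclose(R_approx, R))
```

Because the toy matrix has exactly rank 2, two latent features reproduce it up to rounding error. On real rating data the low-rank product is only an approximation.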

source: https://blog.cambridgespark.com/nowadays-recommender-systems-are-used-to-personalize-your-experience-on-the-web-telling-you-what-120f39b89c3c

## Task

In this exercise, you shall work with various collaborative filtering approaches. Specifically, you shall compare

* User based filtering
* Item based filtering
* One other model based filtering approach

You can reuse existing libraries and code examples (if you do so, please properly quote the origin, otherwise it has to be considered plagiarism), and you shall compare their performance in two ways

* Effectiveness of the recommendation on a supplied training set.
* Efficiency of the recommendation (i.e. runtime).

The dataset to be used is the MovieLens dataset. You shall first work with the smallest version available, with 100k ratings, at https://grouplens.org/datasets/movielens/100k/

To ensure that we can compare the results across all your peers in the course, you shall proceed as follows

* Split the dataset into 80:20 training:test set, after shuffling the data. 
* For evaluation of effectiveness, we will utilise MSE.
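
The two requirements above can be sketched with plain NumPy. `shuffle_split` and `plain_mse` are hypothetical helper names, and the seed is an arbitrary choice for reproducibility. The notebook itself relies on scikit-learn for both steps.

```python
import numpy as np

def shuffle_split(n_rows, test_fraction=0.2, seed=0):
    """Return shuffled (train, test) row indices for a dataset with n_rows rows."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def plain_mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

train_idx, test_idx = shuffle_split(100_000)
print(len(train_idx), len(test_idx))  # 80000 20000

# Squared errors are 1, 0 and 4, so the MSE is 5/3.
print(plain_mse([4, 3, 5], [3, 3, 3]))
```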

Your solution shall include the code and a report on your findings: which methods worked well with regard to effectiveness and efficiency? Are the results usable in general?

After the first step on the 100k database, obtain the next bigger version (1M), and just test your algorithms for effectiveness - do the methods scale to the increased size?

## Dataset description

For this exercise, two MovieLens datasets were used: 100k (https://grouplens.org/datasets/movielens/100k/) and 1M (https://grouplens.org/datasets/movielens/1M/)

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

The 100k data set consists of:
* 100,000 ratings (1-5) 
* from 943 users 
* on 1682 movies.

The 1M data set consists of:
* 1,000,209 ratings (1-5) 
* from 6,040 users 
* on 3,952 movies.  

The 100k data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set.

sources: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt http://files.grouplens.org/datasets/movielens/ml-1m-README.txt


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from hashlib import md5
from zipfile import ZipFile
import urllib.request
import os

from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise_distances
from scipy.sparse.linalg import svds

from sklearn.metrics import mean_squared_error
from math import sqrt

data_attributes=["user_id","item_id","rating","timestamp"]

## Data acquisition and preparation

First, the dataset is downloaded as a zip archive. To save bandwidth, time, and space, the archive is cached but not decompressed. Only the relevant data file inside the archive is read and stored in a DataFrame.
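
Reading one member straight out of an archive can be sketched without any download. The archive below is built in memory as a stand-in, and the member names (`demo/u.data`, `demo/README`) and rows are made up for illustration. Only the standard library is used here, whereas the notebook passes the opened member to pandas.

```python
import csv
import io
from zipfile import ZipFile

# Build a tiny archive in memory (hypothetical stand-in for the cached download).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/u.data", "1\t10\t4\t881250949\n2\t10\t5\t881250950\n")
    archive.writestr("demo/README", "not needed\n")

# Open only the member of interest; nothing is extracted to disk.
with ZipFile(buffer) as archive:
    with archive.open("demo/u.data") as member:
        text = io.TextIOWrapper(member, encoding="utf-8")
        rows = [row for row in csv.reader(text, delimiter="\t")]

print(rows[0])  # ['1', '10', '4', '881250949']
```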

In [None]:
def get_content_file(url):
    h = md5(url.encode()).hexdigest()
    path = ".tmp/" + h
    file_path = path + "/data.zip"
    if not os.path.exists(file_path):
        if not os.path.exists(path):
            os.makedirs(path)
        urllib.request.urlretrieve(url, file_path)
    return file_path


file_path = get_content_file("http://files.grouplens.org/datasets/movielens/ml-100k.zip")
with ZipFile(file_path).open("ml-100k/u.data") as decompressed_file:
    df_100k = pd.read_csv(
        decompressed_file,
        names=data_attributes,
        sep="\t"
    )

To make it possible to evaluate the predictions, the dataset is split in an 80:20 ratio and converted to a "user x item = rating" pivot table. As shown below, the rating distributions of the training and test sets are almost identical, so both sets can be assumed to be representative of the whole dataset.
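
The reason for reindexing after the pivot can be seen on a toy example. The four ratings below are made up, and the "split" is simply the first three rows. Without the reindex, users and items that appear only in the other part of the split would be missing from the matrix, so the train and test matrices would not share one shape.

```python
import pandas as pd

# Hypothetical ratings in the same column layout as the MovieLens data file.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating": [4, 5, 3, 2],
})

# Pretend the last row ended up in the test set.
train = ratings.iloc[:3]

# Pivot to a user x item matrix; unrated combinations become NaN.
pivot = train.pivot_table(index="user_id", columns="item_id", values="rating")

# Reindex against the full dataset so every user and item keeps a row or column.
full = pivot.reindex(sorted(ratings.user_id.unique()), axis=0)\
    .reindex(sorted(ratings.item_id.unique()), axis=1)

print(pivot.shape)  # (2, 2)
print(full.shape)   # (3, 3)
```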

In [None]:
def split_and_pivot_dataset(df, test_size):
    train_data, test_data = train_test_split(df, test_size=test_size)

    # create user-item matrix as pivot table
    train_data_pivot = train_data.pivot_table(index='user_id', columns='item_id', values='rating')\
        .reindex(sorted(df.user_id.unique()), axis=0)\
        .reindex(sorted(df.item_id.unique()), axis=1)

    # create testset
    test_data_pivot = test_data.pivot_table(index='user_id', columns='item_id', values='rating')\
        .reindex(sorted(df.user_id.unique()), axis=0)\
        .reindex(sorted(df.item_id.unique()), axis=1)
    
    return (train_data_pivot, test_data_pivot)


train_data_pivot_100k, test_data_pivot_100k = split_and_pivot_dataset(df_100k, 0.2)

print("Train Data Histogram")
pd.DataFrame(train_data_pivot_100k.values.flatten()).hist(bins=5, range=(1,5), density=True)
plt.title('Train set distribution')
print("Test Data Histogram")
pd.DataFrame(test_data_pivot_100k.values.flatten()).hist(bins=5, range=(1,5), density=True)
plt.title('Test set distribution')

## User-based prediction

To predict items of interest using a user-based approach, a pairwise-distance matrix over all users (user x user) is computed. Using the cosine metric showed good results. To compensate for differences in the rating behavior of different users, the ratings are normalized by using each user's mean rating as the zero point. Applying the score formula for user-based CF then yields a prediction matrix that contains the predicted ratings for all users and items.
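
For reference, a common textbook form of the user-based score is given below. The notation is an assumption introduced here: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $\bar{r}_u$ is the mean rating of user $u$, and $\mathrm{sim}(u, v)$ is the similarity between users $u$ and $v$.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} \mathrm{sim}(u, v)\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \left|\mathrm{sim}(u, v)\right|}
$$

The mean $\bar{r}_u$ is added back exactly once, so a user whose neighbours all rate an item at their own average is predicted to rate it at $\bar{r}_u$.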


In [None]:
def predict_user(ratings):
    # cosine similarity between users (1 - cosine distance); missing ratings count as 0
    similarity = 1 - pairwise_distances(ratings.fillna(0), metric="cosine")
    # mean rating per user over the items that user actually rated, as a column vector
    mean_user_rating = ratings.mean(axis=1).fillna(0).values[:, np.newaxis]
    # center the ratings on each user's mean; unrated entries contribute 0
    norm_ratings = np.nan_to_num(ratings.values - mean_user_rating)
    # similarity-weighted average of the centered ratings
    weighted = similarity.dot(norm_ratings) / np.abs(similarity).sum(axis=1)[:, np.newaxis]
    # shift back by the user's mean (added exactly once)
    return mean_user_rating + weighted

user_prediction_100k = predict_user(train_data_pivot_100k)

pd.DataFrame(pd.DataFrame(user_prediction_100k).values.flatten()).hist(bins=5, range=(1,5), density=True)
plt.title('User-based prediction distribution')

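## Item-based prediction

The item-based approach works analogously on the transposed matrix: a pairwise matrix over all items (item x item) is computed with the cosine metric, and a user's predicted rating for an item is the similarity-weighted average of that user's ratings.
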
In [None]:
def predict_item(ratings):
    # missing ratings count as 0
    filled = ratings.fillna(0).values
    # cosine similarity between items (1 - cosine distance)
    similarity = 1 - pairwise_distances(filled.T, metric="cosine")
    # similarity-weighted average of each user's ratings, normalized per target item
    pred = filled.dot(similarity) / np.abs(similarity).sum(axis=1)[np.newaxis, :]
    return pred

item_prediction_100k = predict_item(train_data_pivot_100k)

pd.DataFrame(item_prediction_100k.flatten()).hist(density=True)
plt.title('Item-based prediction distribution')

In [None]:
# step-by-step version of the item-based prediction, kept for inspection
data = train_data_pivot_100k.fillna(0)
# cosine similarity between items (1 - cosine distance)
similarity = 1 - pairwise_distances(data.transpose(), metric="cosine")
# item-item score for one user and one item, e.g.:
# (sim(Thor, Avengers) * R(Alex, Avengers) + sim(Thor, Iron Man) * R(Alex, Iron Man)) / (sim(Thor, Avengers) + sim(Thor, Iron Man))
# the result has shape item x user
pred = similarity.dot(data.transpose().values) / np.abs(similarity).sum(axis=1)[:, np.newaxis]

pd.DataFrame(pred.flatten()).hist(density=True)
plt.title('Item-based prediction distribution')
pd.DataFrame(similarity.flatten()).hist(density=True)
plt.title('Item-item similarity distribution')

In [None]:
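def predict_model(ratings):
    # svds cannot handle NaN, so missing ratings are filled with 0
    R = ratings.fillna(0).values.astype(float)
    # truncated SVD with the 20 largest singular values
    U, S, Vt = svds(R, k=20)
    Sd = np.diag(S)
    # multiply the low-rank factors back together to get the predicted ratings
    return np.dot(np.dot(U, Sd), Vt)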

model_prediction_100k = predict_model(train_data_pivot_100k)
model_prediction_100k

In [None]:
print("User-based prediction 100k runtime:")
%timeit predict_user(train_data_pivot_100k)
print("Item-based prediction  100k runtime:")
%timeit predict_item(train_data_pivot_100k)
print("Model-based prediction 100k runtime:")
%timeit predict_model(train_data_pivot_100k)
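
`%timeit` is an IPython magic and only works inside a notebook or IPython shell. As a rough sketch of the same measurement in a plain Python script, the standard `timeit` module can be used. The function `work` below is a made-up placeholder for one of the prediction functions, and the repeat counts are arbitrary.

```python
import timeit

def work():
    # Placeholder workload standing in for a prediction function.
    return sum(i * i for i in range(10_000))

calls_per_run = 10
# Three independent runs of 10 calls each; every entry is the total time of one run.
timings = timeit.repeat(work, number=calls_per_run, repeat=3)

# The minimum is usually the least noisy estimate of the per-call runtime.
best_per_call = min(timings) / calls_per_run
print("best of 3: %.6f s per call" % best_per_call)
```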

In [None]:
def mse(prediction, ground_truth):
    # evaluate only on the entries that carry a rating in the test set
    mask = ~np.isnan(ground_truth)
    # mean_squared_error expects (y_true, y_pred)
    return mean_squared_error(ground_truth[mask], np.asarray(prediction)[mask])

print("User-based prediction  100k MSE: %f" % mse(user_prediction_100k, test_data_pivot_100k.values))
print("Item-based prediction  100k MSE: %f" % mse(item_prediction_100k, test_data_pivot_100k.values))
print("Model-based prediction 100k MSE: %f" % mse(model_prediction_100k, test_data_pivot_100k.values))
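
The masking step matters because most entries of the test matrix are NaN (no rating in the test set) and must not enter the error. A small sketch with made-up numbers and a hypothetical helper `masked_mse`, using NumPy only:

```python
import numpy as np

def masked_mse(prediction, ground_truth):
    """MSE over the positions where ground_truth holds a rating (is not NaN)."""
    mask = ~np.isnan(ground_truth)
    return float(np.mean((ground_truth[mask] - prediction[mask]) ** 2))

# Hypothetical 2 x 2 example: only two of the four entries have a test rating.
ground_truth = np.array([[4.0, np.nan],
                         [np.nan, 2.0]])
prediction = np.array([[3.5, 1.0],
                       [5.0, 3.0]])

# Errors on the rated entries are 0.5 and 1.0, so the MSE is (0.25 + 1.0) / 2.
print(masked_mse(prediction, ground_truth))  # 0.625
```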

In [None]:
file_path = get_content_file("http://files.grouplens.org/datasets/movielens/ml-1m.zip")
with ZipFile(file_path).open("ml-1m/ratings.dat") as decompressed_file:
    df_1m = pd.read_csv(
        decompressed_file,
        names=data_attributes,
        sep="::",
        engine="python"  # multi-character separators need the Python parser
    )

In [None]:
# Split into train and test-set (20% = 80:20) and create "user x item = rating" pivot table
train_data_pivot_1m, test_data_pivot_1m = split_and_pivot_dataset(df_1m, 0.2)
(train_data_pivot_1m.shape, test_data_pivot_1m.shape)

In [None]:
user_prediction_1m = predict_user(train_data_pivot_1m)
user_prediction_1m

In [None]:
item_prediction_1m = predict_item(train_data_pivot_1m)
item_prediction_1m

In [None]:
model_prediction_1m = predict_model(train_data_pivot_1m)
model_prediction_1m.shape

In [None]:
print("User-based prediction 1M runtime:")
%timeit predict_user(train_data_pivot_1m)
print("Item-based prediction  1M runtime:")
%timeit predict_item(train_data_pivot_1m)
print("Model-based prediction 1M runtime:")
%timeit predict_model(train_data_pivot_1m)

In [None]:
print("User-based prediction  1M MSE: %f" % mse(user_prediction_1m, test_data_pivot_1m.values))
print("Item-based prediction  1M MSE: %f" % mse(item_prediction_1m, test_data_pivot_1m.values))
print("Model-based prediction 1M MSE: %f" % mse(model_prediction_1m, test_data_pivot_1m.values))