# Collaborative Filtering


https://grouplens.org/datasets/movielens/100k/

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

source: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Memory-based collaborative filtering approaches can be divided into two main types: user-item filtering and item-item filtering. User-item filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. In contrast, item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations.

* Item-Item Collaborative Filtering: “Users who liked this item also liked …”
* User-Item Collaborative Filtering: “Users who are similar to you also liked …”

source: https://blog.cambridgespark.com/nowadays-recommender-systems-are-used-to-personalize-your-experience-on-the-web-telling-you-what-120f39b89c3c
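
To make the two views concrete, here is a minimal sketch on a hypothetical 3-user, 4-item rating matrix (the values and the helper `cosine_similarity_rows` are invented for illustration and are not part of the MovieLens data). Comparing rows gives user-user similarity; comparing columns gives item-item similarity.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows (assumes no all-zero row)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / norms
    return normalized @ normalized.T

user_sim = cosine_similarity_rows(toy_ratings)    # user-user, shape (3, 3)
item_sim = cosine_similarity_rows(toy_ratings.T)  # item-item, shape (4, 4)

# Find the user most similar to user 0, ignoring user 0 itself.
others = user_sim[0].copy()
others[0] = -np.inf
nearest_user = int(np.argmax(others))
print("user most similar to user 0:", nearest_user)
```

User-item filtering would then recommend to user 0 the items that `nearest_user` rated highly; item-item filtering would instead look up rows of `item_sim` for the items user 0 already liked.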


In [33]:
import os
import urllib.request
from hashlib import md5
from zipfile import ZipFile

import pandas as pd

data_attributes = ["user_id", "item_id", "rating", "timestamp"]

def get_content_file(url):
    """Download url once and cache it under .tmp/<md5 of url>/data.zip."""
    h = md5(url.encode()).hexdigest()
    path = ".tmp/" + h
    file_path = path + "/data.zip"
    if not os.path.exists(file_path):
        os.makedirs(path, exist_ok=True)
        urllib.request.urlretrieve(url, file_path)
    return file_path

# MovieLens 1M: fields are separated by "::", which requires the Python parser engine.
file_path = get_content_file("http://files.grouplens.org/datasets/movielens/ml-1m.zip")
with ZipFile(file_path).open("ml-1m/ratings.dat") as decompressed_file:
    df_1m = pd.read_csv(
        decompressed_file,
        names=data_attributes,
        sep="::",
        engine="python",
    )

# MovieLens 100k: fields are tab-separated.
file_path = get_content_file("http://files.grouplens.org/datasets/movielens/ml-100k.zip")
with ZipFile(file_path).open("ml-100k/u.data") as decompressed_file:
    df_100k = pd.read_csv(
        decompressed_file,
        names=data_attributes,
        sep="\t",
    )

In [34]:
import numpy as np

# Use the 1M ratings for the rest of the notebook.
df = df_1m
df

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [35]:
info = { 
    "n_users": df.user_id.unique().shape[0],
    "n_items": df.item_id.unique().shape[0]
}
info

{'n_users': 6040, 'n_items': 3706}

In [36]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2)
train_data

Unnamed: 0,user_id,item_id,rating,timestamp
155263,1001,2624,5,976298903
465300,2869,2971,5,972439495
188272,1168,3328,5,974932080
320026,1899,1127,3,974777304
816849,4904,1224,5,962684525
...,...,...,...,...
927236,5604,2989,3,960255912
995175,6010,1591,4,956860661
997160,6021,3107,3,956756924
392244,2304,1371,4,974496110


In [38]:
# create user-item matrix as pivot table
train_data_pivot = train_data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)\
    .reindex(sorted(df.user_id.unique()), axis=0, fill_value=0)\
    .reindex(sorted(df.item_id.unique()), axis=1, fill_value=0)

# create testset
test_data_pivot = test_data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)\
    .reindex(sorted(df.user_id.unique()), axis=0, fill_value=0)\
    .reindex(sorted(df.item_id.unique()), axis=1, fill_value=0)

(train_data_pivot.shape, test_data_pivot.shape)

((6040, 3706), (6040, 3706))
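
The two `reindex` calls matter because a random split can leave a user or an item in only one of the two parts, and the matrices must share the same axes to be compared cell by cell. A small sketch of the same pattern on a hypothetical toy frame (the data and the helper name `to_matrix` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy ratings; user 3 and item 30 appear only in the "test" part.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating":  [5, 3, 4, 2],
})
toy_train = toy.iloc[:3]
toy_test = toy.iloc[3:]

all_users = sorted(toy.user_id.unique())
all_items = sorted(toy.item_id.unique())

def to_matrix(part):
    """Pivot one split, then pad it with zeros to the full user and item axes."""
    return (
        part.pivot_table(index="user_id", columns="item_id", values="rating", fill_value=0)
        .reindex(all_users, axis=0, fill_value=0)
        .reindex(all_items, axis=1, fill_value=0)
    )

toy_train_matrix = to_matrix(toy_train)
toy_test_matrix = to_matrix(toy_test)
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Without the padding, `toy_train_matrix` would be 2x2 and `toy_test_matrix` 1x1; with it, both cover all three users and all three items.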

In [39]:
from sklearn.metrics.pairwise import pairwise_distances

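# Note: metric="cosine" returns the cosine *distance* (1 - cosine similarity),
# so larger values here mean less similar rows. The variable names are kept as
# they are because the outputs below were produced with these matrices.
user_similarity = pairwise_distances(train_data_pivot, metric="cosine")
item_similarity = pairwise_distances(train_data_pivot.transpose(), metric="cosine")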

(user_similarity.shape, item_similarity.shape)

((6040, 6040), (3706, 3706))
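
Since `pairwise_distances` with `metric="cosine"` yields distances rather than similarities, it can help to see the two side by side. The following is a small sketch on made-up vectors, using scikit-learn's `cosine_similarity` for comparison; it is an illustration, not a change to the cells above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical rating vectors: rows 0 and 1 are nearly parallel, row 2 is orthogonal to both.
toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

distance = pairwise_distances(toy, metric="cosine")
similarity = cosine_similarity(toy)

# Cosine distance is one minus cosine similarity.
complementary = bool(np.allclose(distance, 1.0 - similarity))
print(complementary)
print(np.round(distance, 3))
print(np.round(similarity, 3))
```

If the distance matrix is used as a weight, the orthogonal row 2 gets the largest weight for row 0 and the nearly identical row 1 gets the smallest, which is the opposite of what a similarity weighting would do.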

In [40]:
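def predict_user(ratings, similarity):
    # Work on a plain array: indexing a pandas Series with [:, np.newaxis]
    # is not supported in recent pandas versions.
    ratings = np.asarray(ratings, dtype=float)
    # Per-user mean over the whole row, zeros for unrated items included.
    mean_user_rating = ratings.mean(axis=1, keepdims=True)
    rating_diff = ratings - mean_user_rating
    # User mean plus the weighted average of all users' deviations from their own means.
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    df = pd.DataFrame(mean_user_rating + similarity.dot(rating_diff) / weights)
    # Assumes user ids form the contiguous range 1..n_users, as they do here.
    df.index = np.arange(1, len(df) + 1)
    df.index.name = "user_id"
    return df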

user_prediction = predict_user(train_data_pivot, user_similarity)
user_prediction

user_id,0,1,2,3,4,5,6,7,8,9,...,3696,3697,3698,3699,3700,3701,3702,3703,3704,3705
1,1.002642,0.206215,0.112557,-0.019123,0.034155,0.404822,0.120747,-0.051881,-0.041597,0.313946,...,-0.041129,-0.077683,-0.070110,-0.048480,-0.049632,0.338199,0.078069,-0.052307,-0.056598,0.112096
2,1.078713,0.262279,0.169107,0.038949,0.091531,0.438871,0.175861,0.008327,0.015032,0.351915,...,0.018238,-0.018119,-0.010026,0.010145,0.008772,0.395114,0.135859,0.006755,0.003172,0.169104
3,1.017694,0.203684,0.110913,-0.019761,0.033410,0.395198,0.121196,-0.052141,-0.043922,0.302484,...,-0.042353,-0.078878,-0.071178,-0.050123,-0.051331,0.338138,0.077748,-0.053591,-0.057256,0.112579
4,1.017603,0.183576,0.091289,-0.042910,0.012017,0.367641,0.102406,-0.076547,-0.068651,0.281234,...,-0.066537,-0.103698,-0.095722,-0.074491,-0.076166,0.320990,0.053779,-0.078000,-0.082059,0.089647
5,1.108642,0.294011,0.197812,0.064475,0.119678,0.463480,0.206202,0.034506,0.042494,0.391615,...,0.042345,0.007560,0.015574,0.035994,0.034396,0.416835,0.154560,0.031665,0.027584,0.190564
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,1.591255,0.794615,0.708229,0.577857,0.630858,0.968604,0.712050,0.549401,0.557306,0.891938,...,0.557676,0.523757,0.531642,0.552034,0.549247,0.929194,0.668114,0.546607,0.543705,0.703961
6037,1.142457,0.334627,0.241238,0.106446,0.162063,0.508848,0.248443,0.074464,0.083501,0.435390,...,0.082795,0.047567,0.055899,0.076823,0.073291,0.464699,0.197405,0.071419,0.068408,0.232607
6038,1.005387,0.180066,0.080673,-0.052523,0.002306,0.374701,0.089211,-0.084444,-0.075364,0.285119,...,-0.074951,-0.111642,-0.103657,-0.082438,-0.084231,0.309387,0.045123,-0.085914,-0.090503,0.079536
6039,1.089589,0.273752,0.178004,0.045536,0.099671,0.471960,0.184310,0.013213,0.023220,0.380776,...,0.022366,-0.013416,-0.005447,0.016101,0.012854,0.407541,0.142030,0.011571,0.007651,0.177031
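
The user-based prediction adds, to each user's mean rating, a weighted average of how far every user's ratings deviate from that user's own mean. Here is a worked sketch on a hypothetical 2x2 example; the function name `predict_user_np` and all numbers are assumptions for illustration, and the weight matrix is chosen by hand rather than computed.

```python
import numpy as np

def predict_user_np(ratings, weights_matrix):
    """User mean plus the weighted average of all users' mean-centered ratings."""
    mean_user_rating = ratings.mean(axis=1, keepdims=True)
    rating_diff = ratings - mean_user_rating
    normalizer = np.abs(weights_matrix).sum(axis=1, keepdims=True)
    return mean_user_rating + (weights_matrix @ rating_diff) / normalizer

# Two users with opposite tastes and a hand-picked weight of 0.5 between them.
toy_ratings = np.array([[4.0, 2.0],
                        [2.0, 4.0]])
toy_weights = np.array([[1.0, 0.5],
                        [0.5, 1.0]])

toy_prediction = predict_user_np(toy_ratings, toy_weights)
print(toy_prediction)
```

Both users have a mean of 3. For user 0 and item 0 the weighted deviation is (1.0 * 1 + 0.5 * (-1)) / 1.5 = 1/3, so the prediction is 3 + 1/3. In the notebook the mean is taken over the zero-filled matrix, so unrated items pull each user's mean towards zero; the toy matrix has no missing ratings and avoids that effect.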


In [41]:
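def predict_item(ratings, similarity):
    # Weighted sum of each user's ratings over all items, normalized per item
    # by the total absolute weight of that item.
    return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])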

item_prediction = predict_item(train_data_pivot, item_similarity)
item_prediction

user_id,0,1,2,3,4,5,6,7,8,9,...,3696,3697,3698,3699,3700,3701,3702,3703,3704,3705
1,0.035411,0.040733,0.043054,0.044247,0.043156,0.041721,0.041928,0.044431,0.045592,0.041080,...,0.045565,0.046251,0.045554,0.046009,0.046658,0.041726,0.044150,0.045633,0.046185,0.043674
2,0.086217,0.093044,0.096726,0.099126,0.097277,0.088262,0.094525,0.101057,0.098590,0.088348,...,0.101470,0.102755,0.103498,0.101662,0.101807,0.094419,0.098569,0.101315,0.103391,0.096988
3,0.033904,0.037885,0.040331,0.042490,0.041082,0.037115,0.040076,0.043197,0.042176,0.036391,...,0.043368,0.044750,0.043847,0.043419,0.043942,0.039259,0.042100,0.043449,0.044668,0.041535
4,0.016184,0.018191,0.019730,0.020731,0.020050,0.016727,0.019738,0.020680,0.020101,0.016979,...,0.020614,0.021194,0.021020,0.020794,0.020658,0.018871,0.019801,0.020690,0.021262,0.019699
5,0.114827,0.123525,0.125861,0.125448,0.126905,0.115263,0.124160,0.128696,0.128069,0.120559,...,0.125751,0.129297,0.129764,0.128158,0.128511,0.119860,0.120420,0.126444,0.127414,0.121369
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.579550,0.607014,0.621907,0.622420,0.624545,0.592716,0.612757,0.631540,0.632545,0.601432,...,0.628650,0.636794,0.639259,0.635822,0.631277,0.608334,0.615461,0.628694,0.635108,0.614014
6037,0.141991,0.156029,0.161227,0.164112,0.162854,0.148085,0.159029,0.165316,0.165286,0.153177,...,0.163075,0.166286,0.167789,0.166054,0.162781,0.155162,0.158311,0.163092,0.166576,0.157185
6038,0.011867,0.013260,0.013526,0.013569,0.013664,0.013083,0.013054,0.013886,0.014062,0.013164,...,0.013723,0.013987,0.014041,0.014008,0.013853,0.013179,0.013401,0.013824,0.013980,0.013308
6039,0.095141,0.103236,0.105370,0.107503,0.106250,0.103960,0.103007,0.107690,0.109712,0.103336,...,0.107032,0.107399,0.108885,0.109314,0.106618,0.103985,0.106206,0.107359,0.109228,0.105600
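
The item-based prediction has no mean-centering step: each predicted rating is a weighted average of the ratings the same user gave to all items. A minimal sketch for one hypothetical user and three items (the name `predict_item_np` and the numbers are invented for illustration; the weight matrix is symmetric, as a pairwise matrix would be):

```python
import numpy as np

def predict_item_np(ratings, weights_matrix):
    """Weighted sum of a user's ratings, normalized per item by total absolute weight."""
    normalizer = np.abs(weights_matrix).sum(axis=1)
    return (ratings @ weights_matrix) / normalizer

# One user who rated items 0 and 2 but not item 1 (0 means "not rated").
toy_ratings = np.array([[5.0, 0.0, 3.0]])
toy_weights = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

toy_prediction = predict_item_np(toy_ratings, toy_weights)
print(toy_prediction)
```

For the unrated item 1 the score is (5 * 0.5 + 0 * 1.0 + 3 * 0.5) / 2.0 = 2.0. The zero for the missing rating still counts in the normalizer, which is one reason the predicted values in the table above are small compared with the 1-5 rating scale.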


In [42]:
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    # Score only the cells that hold a real rating in the ground truth (non-zero).
    ground_truth = ground_truth.values.flatten()
    rated = ground_truth != 0
    ground_truth = ground_truth[rated]
    prediction = prediction.values.flatten()[rated]

    return sqrt(mean_squared_error(ground_truth, prediction))

{'user_prediction_rmse': rmse(user_prediction, test_data_pivot), 'item_prediction_rmse':  rmse(item_prediction, test_data_pivot)}

{'user_prediction_rmse': 3.2186276896192383,
 'item_prediction_rmse': 3.510667832769373}
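
The `rmse` helper scores only the cells where the test matrix holds an actual rating, since a zero there means "no rating" rather than a rating of zero. The same masking can be sketched with NumPy alone on hypothetical values (the name `masked_rmse` and the arrays are assumptions for illustration):

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """Root mean squared error over the cells where ground_truth is non-zero."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Two real ratings (5 and 3); the zeros mark cells without a test rating.
toy_truth = np.array([[5.0, 0.0],
                      [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.5],
                     [1.0, 3.0]])

toy_score = masked_rmse(toy_pred, toy_truth)
print(toy_score)
```

Only the errors -1 and 0 enter the score, giving sqrt(0.5); the predictions 2.5 and 1.0 in the unrated cells are ignored.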