# Collaborative Filtering


https://grouplens.org/datasets/movielens/100k/

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

source: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Memory-based collaborative filtering approaches can be divided into two main types: user-item filtering and item-item filtering. User-item filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. In contrast, item-item filtering takes an item, finds users who liked that item, and finds other items that those users or similar users also liked. It takes items as input and outputs other items as recommendations.

* Item-Item Collaborative Filtering: “Users who liked this item also liked …”
* User-Item Collaborative Filtering: “Users who are similar to you also liked …”

source: https://blog.cambridgespark.com/nowadays-recommender-systems-are-used-to-personalize-your-experience-on-the-web-telling-you-what-120f39b89c3c
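
The two views can be illustrated on a tiny, made-up rating matrix. This is only a sketch: the `ratings` values and the helper `cosine_similarity_rows` are hypothetical and not part of the MovieLens data or of the notebook's own code.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m (assumes no all-zero row)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms
    return unit @ unit.T

user_sim = cosine_similarity_rows(ratings)    # shape (3, 3): user vs. user
item_sim = cosine_similarity_rows(ratings.T)  # shape (4, 4): item vs. item

# User-item view: which other user rates most like user 0?
candidates = user_sim[0].copy()
candidates[0] = -np.inf                       # exclude user 0 themselves
nearest_user = int(np.argmax(candidates))

# Item-item view: which other item is rated most like item 0?
candidates = item_sim[0].copy()
candidates[0] = -np.inf                       # exclude item 0 itself
nearest_item = int(np.argmax(candidates))

print(nearest_user, nearest_item)
```

In the user-item view the neighbour's liked items become the recommendations. In the item-item view the neighbouring items themselves are the recommendations.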


In [9]:
import pandas as pd
import numpy as np

data_attributes=["user_id","item_id","rating","timestamp"]

df = pd.read_csv(
    "http://files.grouplens.org/datasets/movielens/ml-100k/u.data",
    names=data_attributes,
    sep="\t"
)
df

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


In [31]:
info = { 
    "n_users": df.user_id.unique().shape[0],
    "n_items": df.item_id.unique().shape[0]
}
info

{'n_users': 943, 'n_items': 1682}

In [23]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2)
train_data

,user_id,item_id,rating,timestamp
32339,100,349,3,891375629
11744,389,1530,2,880088753
68093,896,880,4,887235664
76468,765,50,2,880346255
89963,689,358,4,876674762
...,...,...,...,...
64121,18,52,5,880130680
42328,514,19,4,875463128
82118,894,1404,3,882404536
6691,189,1098,4,893265506


In [172]:
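# Dense user-item matrix: rows are users, columns are items, 0 means "not rated"
train_data_matrix = np.zeros((info["n_users"], info["n_items"]))
for line in train_data.itertuples():
    # line[0] is the index; IDs are 1-based, so shift them to 0-based positions
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]

# Sanity check: the number of non-zero entries should equal the training-set size
tt = train_data_matrix[train_data_matrix != 0]
tt.shape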

(80000,)
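
The row-by-row loop can also be written with NumPy integer-array indexing. The sketch below compares the two approaches on a hypothetical three-user, three-item table (`toy`, `n_users` and `n_items` are illustrative names). It assumes, as in `u.data`, that IDs are 1-based and that each (user, item) pair occurs at most once.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ratings table (IDs are 1-based, as in u.data).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [1, 3, 2, 3],
    "rating":  [5, 3, 4, 1],
})
n_users, n_items = 3, 3

# Loop version, mirroring the cell above.
loop_matrix = np.zeros((n_users, n_items))
for row in toy.itertuples(index=False):
    loop_matrix[row.user_id - 1, row.item_id - 1] = row.rating

# Vectorized version: index rows and columns with two integer arrays at once.
fast_matrix = np.zeros((n_users, n_items))
rows = toy["user_id"].to_numpy() - 1
cols = toy["item_id"].to_numpy() - 1
fast_matrix[rows, cols] = toy["rating"].to_numpy()

same = np.array_equal(loop_matrix, fast_matrix)
n_rated = int(np.count_nonzero(fast_matrix))
print(same, n_rated)
```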

In [198]:
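# Note: test_data_pivot is created in the pivot-table cell (In [215]) below
x = test_data_pivot.values.flatten()
s = x != 0  # boolean mask of the entries that carry a test rating
x.size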

1584444

In [215]:
# create user-item matrix as pivot table
train_data_pivot = train_data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)\
    .reindex(sorted(df.user_id.unique()), axis=0, fill_value=0)\
    .reindex(sorted(df.item_id.unique()), axis=1, fill_value=0)

# create testset
test_data_pivot = test_data.pivot_table(index='user_id', columns='item_id', values='rating', fill_value=0)\
    .reindex(sorted(df.user_id.unique()), axis=0, fill_value=0)\
    .reindex(sorted(df.item_id.unique()), axis=1, fill_value=0)

(train_data_pivot.shape, test_data_pivot.shape)

((943, 1682), (943, 1682))

In [78]:
from sklearn.metrics.pairwise import pairwise_distances

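# Caution: metric="cosine" returns the cosine *distance* (1 - cosine similarity),
# so despite their names these matrices hold distances.
# Use 1 - pairwise_distances(...) to obtain similarities.
user_similarity = pairwise_distances(train_data_pivot, metric="cosine")
item_similarity = pairwise_distances(train_data_pivot.transpose(), metric="cosine")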

(user_similarity.shape, item_similarity.shape)

((943, 943), (1682, 1682))
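
One detail worth checking in the cell above: with `metric="cosine"`, scikit-learn's `pairwise_distances` returns the cosine distance, which is one minus the cosine similarity. The toy matrix below is hypothetical and only illustrates the relationship between the two quantities.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical toy rating rows.
toy = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 0.0],   # identical to row 0
    [0.0, 0.0, 3.0],   # shares no rated item with row 0
])

dist = pairwise_distances(toy, metric="cosine")  # cosine distance
sim = cosine_similarity(toy)                     # cosine similarity

# Identical rows: distance close to 0, similarity close to 1.
# Rows with no overlap: distance close to 1, similarity close to 0.
print(np.round(dist, 3))
print(np.round(sim, 3))
```

If similarities are what a prediction function expects as weights, `1 - dist` (or `cosine_similarity` directly) is the matrix to pass.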

In [218]:
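def predict_user(ratings, similarity):
    # Work on a plain NumPy array so that broadcasting is well defined
    ratings = np.asarray(ratings, dtype=float)
    # Per-user mean over all items; unrated items (stored as 0) are included
    mean_user_rating = ratings.mean(axis=1, keepdims=True)
    rating_diff = ratings - mean_user_rating
    # Weighted average of the other users' deviations, added to the user's own mean
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    pred = pd.DataFrame(mean_user_rating + similarity.dot(rating_diff) / weights)
    pred.index = np.arange(1, len(pred) + 1)
    pred.index.name = "user_id"
    return pred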

user_prediction = predict_user(train_data_pivot, user_similarity)
user_prediction

user_id,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
1,1.640053,0.598829,0.493651,0.800931,0.493227,0.359747,1.485463,0.903942,1.203040,0.540702,...,0.284774,0.285341,0.284782,0.283776,0.284285,0.283055,0.285640,0.284347,0.281762,0.284004
2,1.403151,0.330732,0.163136,0.573692,0.181128,0.012386,1.257005,0.662912,0.908787,0.213074,...,-0.067401,-0.066187,-0.067415,-0.068499,-0.067682,-0.069612,-0.067501,-0.068556,-0.070667,-0.067304
3,1.407865,0.288126,0.137114,0.537502,0.142462,-0.015819,1.243605,0.631795,0.914709,0.189250,...,-0.102729,-0.101485,-0.103339,-0.104259,-0.102940,-0.105411,-0.104036,-0.104724,-0.106099,-0.102756
4,1.361413,0.256854,0.109691,0.500590,0.116689,-0.038976,1.202659,0.594614,0.878623,0.166354,...,-0.124717,-0.123602,-0.124756,-0.125805,-0.124821,-0.127072,-0.125410,-0.126241,-0.127903,-0.124632
5,1.448761,0.403349,0.298807,0.624226,0.299459,0.170374,1.312081,0.719122,1.048874,0.355544,...,0.086003,0.086609,0.086068,0.084993,0.085906,0.084082,0.086562,0.085322,0.082843,0.085485
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,1.334669,0.302718,0.151884,0.544064,0.162487,0.011750,1.211854,0.634021,0.868963,0.208778,...,-0.071567,-0.070298,-0.071884,-0.072843,-0.071749,-0.073618,-0.071335,-0.072477,-0.074760,-0.071551
940,1.461572,0.375370,0.246547,0.590738,0.250927,0.101972,1.285553,0.674008,0.973114,0.294906,...,0.019349,0.020250,0.019501,0.018441,0.019248,0.017371,0.019474,0.018422,0.016319,0.019231
941,1.258534,0.238670,0.081863,0.483019,0.097587,-0.059191,1.116564,0.574984,0.837876,0.141187,...,-0.143888,-0.142798,-0.144034,-0.145142,-0.144468,-0.146269,-0.144092,-0.145181,-0.147358,-0.144027
942,1.441377,0.356062,0.230834,0.586993,0.232825,0.081489,1.301711,0.665339,0.973322,0.276177,...,-0.001875,-0.001145,-0.001621,-0.002659,-0.001763,-0.003578,-0.001266,-0.002422,-0.004734,-0.001588
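
Written out, `predict_user` as coded above computes the following for user $u$ and item $i$. This is a reading of the code, with symbols introduced here for illustration: $r_{v,i}$ is an entry of `ratings`, $\bar{r}_v$ is the row mean of user $v$, and $s_{u,v}$ is an entry of the matrix passed as `similarity`.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v} |s_{u,v}|}
$$

Because unrated items are stored as 0, $\bar{r}_v$ is the mean over all 1682 columns rather than over the rated items only, which tends to pull it well below the 1-5 rating scale.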


In [219]:
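def predict_item(ratings, similarity):
    # Weighted average of each user's own ratings; the row vector of summed
    # absolute weights (one entry per item) is broadcast across all users
    return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])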

item_prediction = predict_item(train_data_pivot, item_similarity)
item_prediction

user_id,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
1,0.358834,0.377382,0.401504,0.357170,0.395166,0.416901,0.357030,0.372450,0.374311,0.400413,...,0.438733,0.435869,0.440376,0.440376,0.435737,0.445237,0.445237,0.445237,0.441736,0.430870
2,0.092469,0.107400,0.102130,0.101950,0.104833,0.104374,0.092949,0.099913,0.093648,0.101149,...,0.109001,0.109569,0.109131,0.109131,0.108177,0.109292,0.109292,0.109292,0.109988,0.109876
3,0.072092,0.075458,0.073512,0.074870,0.073448,0.075228,0.070134,0.074370,0.072218,0.073801,...,0.074994,0.075540,0.072180,0.072180,0.074357,0.070666,0.070666,0.070666,0.074911,0.075449
4,0.047534,0.050011,0.049081,0.049538,0.049452,0.051132,0.046828,0.049522,0.048530,0.050698,...,0.050912,0.050888,0.050780,0.050780,0.050310,0.047402,0.047402,0.047402,0.051130,0.051348
5,0.196575,0.200997,0.222632,0.197230,0.218662,0.245415,0.199102,0.206164,0.217225,0.229643,...,0.245123,0.243313,0.246491,0.246491,0.245431,0.248351,0.248351,0.248351,0.246136,0.242437
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.080410,0.093542,0.092610,0.092566,0.091428,0.099818,0.084185,0.092976,0.083682,0.097405,...,0.100565,0.101404,0.098489,0.098489,0.099818,0.101623,0.101623,0.101623,0.101070,0.101060
940,0.150460,0.163437,0.171301,0.151123,0.169882,0.180040,0.146964,0.150933,0.153719,0.169427,...,0.183904,0.183193,0.185488,0.185488,0.183795,0.185126,0.185126,0.185126,0.186088,0.183367
941,0.021619,0.028458,0.027221,0.027366,0.028546,0.030756,0.021521,0.027421,0.025811,0.029084,...,0.031583,0.031633,0.031395,0.031395,0.030543,0.031480,0.031480,0.031480,0.031510,0.031498
942,0.139024,0.148967,0.158869,0.143440,0.155918,0.161287,0.142535,0.139401,0.144274,0.152912,...,0.163488,0.161552,0.165183,0.165183,0.163986,0.166601,0.166601,0.166601,0.165874,0.165615
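
In the same illustrative notation ($r_{u,j}$ an entry of `ratings`, $s_{j,i}$ an entry of the item-by-item matrix passed as `similarity`), `predict_item` as coded above computes:

$$
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} |s_{i,j}|}
$$

Unlike the user-based variant, no mean-centering is applied, and unrated items contribute $r_{u,j} = 0$ to the numerator.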


In [222]:
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    # Evaluate only the entries that carry a rating in the ground truth (non-zero)
    ground_truth = ground_truth.values.flatten()
    mask = ground_truth != 0
    prediction = prediction.values.flatten()

    return sqrt(mean_squared_error(ground_truth[mask], prediction[mask]))

{'user_prediction_rmse': rmse(user_prediction, test_data_pivot), 'item_prediction_rmse': rmse(item_prediction, test_data_pivot)}

{'user_prediction_rmse': 3.1043685831632986,
 'item_prediction_rmse': 3.451138462140362}
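
The masking step in `rmse` can be checked by hand on a small example. The sketch below uses hypothetical 2x2 arrays and a NumPy-only helper, `masked_rmse`, which is an illustrative name rather than part of the notebook. Only the positions where the ground truth is non-zero enter the error.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the entries where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical predictions and held-out ratings; 0 means "no test rating".
prediction = np.array([[3.0, 1.0],
                       [2.0, 5.0]])
ground_truth = np.array([[4.0, 0.0],
                         [0.0, 3.0]])

# Two rated positions with errors -1 and 2, so the mean squared error is 2.5.
score = masked_rmse(prediction, ground_truth)
print(score)
```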