# Collaborative Filtering
Source of code: https://github.com/sharmin2697/Movie-Recommender-System

In [1]:
import pandas as pd
import time
import statistics
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise.model_selection import split
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

In [2]:
rating_index = ["userId", "movieId", "rating", "timestamp"]
movie_index = ["movieId","title","release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"]
ratings = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-100k\u.data", sep="\t", names=rating_index)
movies = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-100k\u.item", sep="|", names=movie_index, encoding='latin-1')
movies = movies.drop(["release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"], axis=1)
ratings = ratings.drop(["timestamp"], axis=1)

data = ratings

data.head()

Unnamed: 0,userId,movieId,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [3]:
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(data, reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x227c2febe50>

## Items based collaborative filtering

As a similarity measure I am chosing cosine. I use the K-nn algorithm taking into account the mean ratings of each user to counteract the differences in each users preference for maximum and minimum ratings (e.g. some user never give 5 out of 5).

In [4]:
datasets = []

datasets.append(train_test_split(dataset, test_size=0.2, random_state=547998))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))

item_based = {'name': 'cosine',
               'user_based': False} #defines if user-based filtering or items-based filtering should be used

In [5]:
mse_results = []

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(20, 1, item_based)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)

print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9142
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9091
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9107
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9157
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9108
Execution time: 19.49854016304016 seconds
Mean mse: 0.9120843895897578


The mean value of the results (mean squared errors) is listed above. The values are rather stable and don't vary significantly. They also show a high predict accuracy (high values of mse).

The algorithm performs well. Execution of all 5 splits is listed above in the coude output. It was below 20 seconds for the 100k dataset.

## User based collaborative filtering

I am using the same algorithm and smilarity measure as with items based collaborative filtering.

In [6]:
rmse_results = []

user_based = {'name': 'cosine',
               'user_based': True} #defines if user-based filtering or items-based filtering should be used

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(20, 1, user_based)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9314
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9292
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9272
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9172
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9233
Execution time: 15.89564061164856 seconds
Mean mse: 0.918864230115619


The user based approach has a slightly higher mean mse and also performs faster (execution time is lower than with the items based approach).

## Another model based approach collaborative filtering

Another interesting (at least to me) algorithm seems to be CoClustering provided by surprise (https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering). Users and items are assigned some clusters. The clusters are generated similarly to k-means.

In [8]:
rmse_results = []

user_based = {'name': 'cosine',
               'user_based': True} #defines if user-based filtering or items-based filtering should be used

start_time = time.time()

for (trainset, testset) in datasets:
    algo = CoClustering(random_state=547998)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  algo.fit(trainset)


MSE: 0.9329
MSE: 0.9287
MSE: 0.9519
MSE: 0.9486
MSE: 0.9256
Execution time: 10.747952461242676 seconds
Mean mse: 0.9282104467555625


This algorithm is the most efficient and precise one. The mean mse from 5 executions is higher than measured from the prior algorithms used. The execution time is also considerably faster.

This method/algorithm seems to be the best among the tested.