# Collaborative Filtering
Source of code: https://github.com/sharmin2697/Movie-Recommender-System

In [3]:
import pandas as pd
import time
import statistics
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise.model_selection import split
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import GridSearchCV

In [4]:
rating_index = ["userId", "movieId", "rating", "timestamp"]
movie_index = ["movieId","title","release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"]
ratings = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-100k\u.data", sep="\t", names=rating_index)
movies = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-100k\u.item", sep="|", names=movie_index, encoding='latin-1')
movies = movies.drop(["release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"], axis=1)
ratings = ratings.drop(["timestamp"], axis=1)

data = ratings

data.head()

Unnamed: 0,userId,movieId,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [5]:
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(data, reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x287073f44f0>

## Item-based collaborative filtering

As a similarity measure I am choosing cosine. I use the k-NN algorithm with mean-centered ratings (`KNNWithMeans`) to counteract differences in how the rating scale is used (e.g. some users never give 5 out of 5). Note that with `user_based: False`, Surprise centers the ratings on each item's mean rather than on each user's mean.
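To make the idea concrete, here is a minimal pure-Python sketch of cosine similarity over co-rated users combined with a mean-centered item-based prediction. The toy ratings and the helper names (`cosine_sim`, `item_mean`, `predict`) are assumptions made for illustration. This is not Surprise's implementation, which among other things also clips predictions to the rating scale.

```python
from math import sqrt

# Toy ratings, item -> {user: rating}. Purely illustrative data.
ratings_by_item = {
    "A": {"u1": 5, "u2": 3, "u3": 4},
    "B": {"u1": 4, "u2": 2, "u3": 5},
    "C": {"u1": 1, "u2": 5},
}


def cosine_sim(x, y):
    """Cosine similarity computed only over users who rated both items."""
    common = set(x) & set(y)
    if not common:
        return 0.0
    dot = sum(x[u] * y[u] for u in common)
    norm_x = sqrt(sum(x[u] ** 2 for u in common))
    norm_y = sqrt(sum(y[u] ** 2 for u in common))
    return dot / (norm_x * norm_y)


def item_mean(item):
    """Mean of all ratings an item has received."""
    values = ratings_by_item[item].values()
    return sum(values) / len(values)


def predict(user, target, k=2):
    """Target item's mean plus the similarity-weighted average of the
    user's deviations from the means of the k most similar rated items."""
    neighbors = [
        (cosine_sim(ratings_by_item[target], ratings_by_item[j]), j)
        for j in ratings_by_item
        if j != target and user in ratings_by_item[j]
    ]
    neighbors = [(s, j) for s, j in sorted(neighbors, reverse=True)[:k] if s > 0]
    sim_sum = sum(s for s, _ in neighbors)
    if sim_sum == 0:
        return item_mean(target)  # no usable neighbor: fall back to the mean
    offset = sum(s * (ratings_by_item[j][user] - item_mean(j)) for s, j in neighbors)
    return item_mean(target) + offset / sim_sum


print(round(predict("u3", "C"), 3))
```

Here user `u3` rated item `B` above its mean, so the prediction for item `C` ends up above the mean of `C`.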

In [6]:
datasets = []

datasets.append(train_test_split(dataset, test_size=0.2, random_state=547998))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))

item_based = {'name': 'cosine',
               'user_based': False} #defines if user-based filtering or items-based filtering should be used

In [9]:
mse_results = []

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(k=100, min_k=1, sim_options=item_based)  # up to 100 neighbors, at least 1

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)

print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8910
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8769
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8928
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8894
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8770
Execution time: 26.100154161453247 seconds
Mean mse: 0.8853872252800996


The mean of the results (mean squared errors) is listed above. The per-split values range from 0.877 to 0.893, so they are stable and do not vary significantly. They also indicate reasonably good prediction accuracy (a low MSE).

The algorithm performs well: it is not fast (about 26 seconds for the five splits), but it is accurate overall.
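The stability claim can be quantified with a few lines of standard-library Python. This is a small sketch that reuses the five MSE values printed above; the variable names are my own.

```python
import statistics

# Per-split MSE values copied from the item-based KNNWithMeans output above.
mse_per_split = [0.8910, 0.8769, 0.8928, 0.8894, 0.8770]

mean_mse = statistics.mean(mse_per_split)
stdev_mse = statistics.stdev(mse_per_split)       # sample standard deviation
spread = max(mse_per_split) - min(mse_per_split)  # best-to-worst gap
rel_stdev = stdev_mse / mean_mse                  # coefficient of variation

print(f"mean={mean_mse:.4f} stdev={stdev_mse:.4f} "
      f"spread={spread:.4f} rel_stdev={rel_stdev:.2%}")
```

A standard deviation below 0.01 on a mean of roughly 0.885 corresponds to a relative variation of about one percent, which supports calling the results stable.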

## User-based collaborative filtering

I am using the same algorithm and similarity measure as with item-based collaborative filtering.

In [10]:
mse_results = []

user_based = {'name': 'cosine',
               'user_based': True} #defines if user-based filtering or items-based filtering should be used

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(k=100, min_k=1, sim_options=user_based)  # up to 100 neighbors, at least 1

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9148
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8998
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9161
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9152
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.9049
Execution time: 21.171416521072388 seconds
Mean mse: 0.9101523294562632


The prediction accuracy is slightly lower (mean MSE of about 0.910 versus 0.885 for the item-based variant), but the execution time is shorter (about 21 seconds versus 26 seconds).

## Another approach: model-based collaborative filtering

Another interesting (at least to me) algorithm seems to be CoClustering provided by surprise (https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering). Users and items are each assigned to clusters, and each pair of a user cluster and an item cluster forms a co-cluster. The clusters are fitted with an iterative optimization similar to k-means.
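As a rough illustration of how a co-clustering prediction is formed, the sketch below applies the rule given in the Surprise documentation: the co-cluster average, plus the user's deviation from their cluster average, plus the item's deviation from its cluster average. The toy ratings and the hand-assigned clusters are assumptions. The real algorithm learns the assignments iteratively, which is not shown here.

```python
from statistics import mean

# Toy data as (user, item, rating) triples. Purely illustrative.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4),
    ("u2", "i1", 4), ("u2", "i3", 2),
    ("u3", "i2", 2), ("u3", "i3", 1),
]
# Cluster assignments fixed by hand (assumption); CoClustering would learn them.
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"i1": 0, "i2": 0, "i3": 1}

global_mean = mean(r for _, _, r in ratings)


def avg(values):
    """Mean of the values, or the global mean if there are none."""
    values = list(values)
    return mean(values) if values else global_mean


def predict(u, i):
    """Co-cluster average corrected by user and item deviations."""
    uc, ic = user_cluster[u], item_cluster[i]
    mu_u = avg(r for x, _, r in ratings if x == u)   # user's own mean
    mu_i = avg(r for _, y, r in ratings if y == i)   # item's own mean
    c_ui = avg(r for x, y, r in ratings
               if user_cluster[x] == uc and item_cluster[y] == ic)
    c_u = avg(r for x, _, r in ratings if user_cluster[x] == uc)
    c_i = avg(r for _, y, r in ratings if item_cluster[y] == ic)
    return c_ui + (mu_u - c_u) + (mu_i - c_i)


print(predict("u3", "i1"))
```

For `u3` and `i1` the co-cluster average is 2, the user sits exactly on their cluster average, and the item lies 0.75 above its cluster average, so the prediction is 2.75.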

In [11]:
mse_results = []

start_time = time.time()

for (trainset, testset) in datasets:
    algo = CoClustering(random_state=547998)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  algo.fit(trainset)


MSE: 0.9329
MSE: 0.9178
MSE: 0.9588
MSE: 0.9441
MSE: 0.9146
Execution time: 11.157257080078125 seconds
Mean mse: 0.9336540574458415


This algorithm is the fastest. It takes the least amount of time to fit the model and test it (about 11 seconds for the five splits). However, it is less accurate than both prior methods (mean MSE of about 0.934).

All in all, the prediction accuracy is stable among all 5 splits but not very high (MSE of about 0.9). However, I am not certain where the threshold for an acceptable MSE value lies. Clearly, the closer to 0 the better, but a fixed threshold (e.g. 0.5) below which everything counts as acceptable seems arbitrary.

Compared to processing the small dataset, generating predictions with the large dataset did not yield considerably lower mean squared error values. This is rather surprising, since one would expect a larger dataset to produce better models. Only the processing time increased massively.
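One way to make these numbers easier to judge is to take the square root of each mean MSE, which gives the RMSE in the same units as the ratings (stars on the 1 to 5 scale). The sketch below only re-expresses the three mean values reported above; the dictionary keys are my own labels.

```python
from math import sqrt

# Mean MSE values copied from the three experiments above.
mean_mse = {
    "item-based KNNWithMeans": 0.8853872252800996,
    "user-based KNNWithMeans": 0.9101523294562632,
    "CoClustering": 0.9336540574458415,
}

# RMSE is the square root of MSE and is expressed in rating units.
rmse = {name: sqrt(value) for name, value in mean_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE of about {value:.3f} stars")
```

An RMSE a little under one star suggests that a typical prediction is off by roughly one rating step. Whether that is acceptable still depends on the application, but it is easier to reason about than a bare MSE threshold.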