# Collaborative Filtering
Source of code: https://github.com/sharmin2697/Movie-Recommender-System

In [1]:
import pandas as pd
import time
import statistics
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import KNNWithMeans
from surprise import CoClustering
from surprise import accuracy

In [2]:
rating_index = ["userId", "movieId", "rating", "timestamp"]
movie_index = ["movieId","title","release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"]
ratings = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-1m\ratings.dat", sep="::", names=rating_index, engine="python")
movies = pd.read_csv(r"G:\My Drive\FH_Technikum\MSC\Semester_2_SS2022\DAS\Data\ml-1m\movies.dat", sep="::", names=movie_index, engine="python", encoding='latin-1')
movies = movies.drop(["release_date", "video_release_dat", "IMDb_url","unknown", "action", "adventure", "animation", "children", "comedy", "crime", "documentary", "drama", "fantasy", "film-noir", "horror", "musical", "mystery", "romance", "sci-fi", "thriller", "war", "western"], axis=1)
ratings = ratings.drop(["timestamp"], axis=1)

data = ratings

data.head()

,userId,movieId,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


In [3]:
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(data, reader)
dataset

<surprise.dataset.DatasetAutoFolds at 0x2016e054d90>

## Item-based collaborative filtering

As a similarity measure I am choosing cosine. I use the k-NN algorithm that takes the mean ratings into account (KNNWithMeans) to counteract differences in how users use the rating scale (e.g. some users never give 5 out of 5).
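The idea in the paragraph above can be sketched on toy data. This is a minimal illustration of mean-centred, item-based k-NN with cosine similarity, not Surprise's implementation: the ratings, the helper names (`item_vector`, `cosine`, `item_mean`, `predict`) and the choice `k=2` are all assumptions made for the example, and details such as minimum support or clipping to the rating scale are left out.

```python
from math import sqrt

# Hypothetical toy ratings (not MovieLens data): ratings[user][item] = rating
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 5, "C": 1},  # u4 has not rated item B
}

def item_vector(item):
    """Return {user: rating} for every user who rated the item."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(i, j):
    """Cosine similarity between two items over their common raters."""
    vi, vj = item_vector(i), item_vector(j)
    common = vi.keys() & vj.keys()
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = sqrt(sum(vi[u] ** 2 for u in common))
    norm_j = sqrt(sum(vj[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def item_mean(item):
    """Mean rating of an item over all users who rated it."""
    values = item_vector(item).values()
    return sum(values) / len(values)

def predict(user, item, k=2):
    """Item mean plus similarity-weighted, mean-centred neighbour ratings."""
    # Candidate neighbours are the other items this user has rated.
    scored = sorted(
        ((cosine(item, j), j) for j in ratings[user] if j != item),
        reverse=True,
    )[:k]
    scored = [(s, j) for s, j in scored if s > 0]
    denominator = sum(s for s, _ in scored)
    if denominator == 0:
        return item_mean(item)  # fall back to the item mean
    numerator = sum(s * (ratings[user][j] - item_mean(j)) for s, j in scored)
    return item_mean(item) + numerator / denominator

prediction = predict("u4", "B")
print(f"Predicted rating of u4 for B: {prediction:.2f}")
```

Centring each neighbour's rating on that neighbour's mean is what compensates for systematically high or low ratings; in the item-based variant the means are taken per item, in the user-based variant per user.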

In [4]:
datasets = []

datasets.append(train_test_split(dataset, test_size=0.2, random_state=547998))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))
datasets.append(train_test_split(dataset, test_size=0.2))

item_based = {'name': 'cosine',
               'user_based': False} #defines if user-based filtering or items-based filtering should be used

In [5]:
mse_results = []

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(k=40, min_k=1, sim_options=item_based)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)

print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7988
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7957
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7989
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7955
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.7976
Execution time: 419.0358748435974 seconds
Mean mse: 0.7973270654883968


The MSE values are lower than with the small dataset. However, the execution time is rather long: nearly 7 minutes for the five splits. The values vary little between splits, so the results can be considered stable.
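To put a number on how little the results vary, one option is to compute the standard deviation of the per-split MSE values. The sketch below copies the five values from the output above into a list (the name `item_mse` is introduced here for illustration); it is a quick check, not part of the original notebook.

```python
import statistics

# Per-split MSE values copied from the item-based run above
item_mse = [0.7988, 0.7957, 0.7989, 0.7955, 0.7976]

spread = statistics.stdev(item_mse)            # sample standard deviation
relative = spread / statistics.mean(item_mse)  # spread relative to the mean

print(f"Standard deviation: {spread:.4f}")
print(f"Relative to mean:   {relative:.2%}")
```

A relative spread well below one percent supports reading the results as stable across splits.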

## User-based collaborative filtering

I am using the same algorithm and similarity measure as with item-based collaborative filtering.

In [6]:
mse_results = []

user_based = {'name': 'cosine',
               'user_based': True} #defines if user-based filtering or items-based filtering should be used

start_time = time.time()

for (trainset, testset) in datasets:
    algo = KNNWithMeans(k=40, min_k=1, sim_options=user_based)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8843
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8785
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8799
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8797
Computing the cosine similarity matrix...
Done computing similarity matrix.
MSE: 0.8798
Execution time: 908.3881342411041 seconds
Mean mse: 0.8804642603023926


As expected, the user-based method takes a lot longer (about 15 minutes compared with about 7). It is also less accurate than the item-based approach. Given the much longer runtime and the lower accuracy, this approach is not a good choice here.

## A model-based approach to collaborative filtering

Another interesting algorithm (at least to me) is CoClustering, provided by Surprise (https://surprise.readthedocs.io/en/stable/co_clustering.html#surprise.prediction_algorithms.co_clustering.CoClustering). Users and items are each assigned to clusters, and the assignments are optimised with an iterative procedure similar to k-means.
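According to the Surprise documentation, a CoClustering prediction is the average rating of the co-cluster, adjusted by how far the user's mean deviates from the user cluster's mean and how far the item's mean deviates from the item cluster's mean. The sketch below applies that rule to toy data. The ratings and the fixed cluster assignments are assumptions for illustration only; the real algorithm learns the assignments during `fit`, and the toy data is chosen so that no group is empty.

```python
# Hypothetical toy ratings: (user, item) -> rating
ratings = {
    ("u1", "A"): 5, ("u1", "B"): 4,
    ("u2", "A"): 4, ("u2", "C"): 2,
    ("u3", "B"): 2, ("u3", "C"): 5,
}

# Assumed, hand-picked cluster assignments (CoClustering would learn these)
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
item_cluster = {"A": 0, "B": 0, "C": 1}

def mean(values):
    """Arithmetic mean of an iterable (assumes it is not empty)."""
    values = list(values)
    return sum(values) / len(values)

def predict(user, item):
    """Co-cluster average plus user and item deviations from their cluster means."""
    cu, ci = user_cluster[user], item_cluster[item]
    co_cluster_avg = mean(
        r for (u, i), r in ratings.items()
        if user_cluster[u] == cu and item_cluster[i] == ci
    )
    user_avg = mean(r for (u, _), r in ratings.items() if u == user)
    item_avg = mean(r for (_, i), r in ratings.items() if i == item)
    user_cluster_avg = mean(r for (u, _), r in ratings.items() if user_cluster[u] == cu)
    item_cluster_avg = mean(r for (_, i), r in ratings.items() if item_cluster[i] == ci)
    return co_cluster_avg + (user_avg - user_cluster_avg) + (item_avg - item_cluster_avg)

estimate = predict("u2", "B")
print(f"Estimated rating of u2 for B: {estimate:.2f}")
```

Because a prediction only needs a handful of precomputed averages rather than a full similarity matrix, this kind of model is cheap to fit and to query.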

In [7]:
mse_results = []

start_time = time.time()

for (trainset, testset) in datasets:
    algo = CoClustering(random_state=547998)

    algo.fit(trainset)
    predictions = algo.test(testset)

    mse = accuracy.mse(predictions)
    mse_results.append(mse)
    
print("Execution time: " + str(time.time() - start_time) + " seconds")
print("Mean mse: " + str(statistics.mean(mse_results)))

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  algo.fit(trainset)


MSE: 0.8469
MSE: 0.8352
MSE: 0.8358
MSE: 0.8340
MSE: 0.8350
Execution time: 101.9414484500885 seconds
Mean mse: 0.8373902323341371


The MSE values for this approach are slightly better than with the user-based approach but worse than with the item-based approach. The time to fit and test the model, however, is about a quarter of the time required by the item-based approach and roughly 1/9 of the time required by the user-based approach. Even though the MSE is higher than with the item-based approach, the considerably shorter processing time makes this algorithm very efficient.
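The runtime comparison can be checked with a few lines of arithmetic. The sketch below copies the rounded execution times and mean MSE values from the three runs above into a dictionary (the name `results` and its layout are chosen for illustration) and expresses each runtime as a multiple of the CoClustering runtime.

```python
# Execution times (seconds) and mean MSE, rounded from the outputs above
results = {
    "item-based KNN": {"seconds": 419.04, "mse": 0.7973},
    "user-based KNN": {"seconds": 908.39, "mse": 0.8805},
    "CoClustering": {"seconds": 101.94, "mse": 0.8374},
}

baseline = results["CoClustering"]["seconds"]

# Runtime of each approach as a multiple of the CoClustering runtime
ratios = {name: r["seconds"] / baseline for name, r in results.items()}

for name, r in results.items():
    print(f"{name:15s} MSE {r['mse']:.4f}  time x{ratios[name]:.1f}")
```

These timings come from a single machine and a single run per approach, so the ratios are best read as rough orders of magnitude.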