Comparing a Classification Model with a Movie Recommender System Based on Collaborative Filtering
-----------------

The dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/ranitsarkar01/movies-recommender-system-dataset

In [39]:
# Utilities libraries
import pandas as pd
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, confusion_matrix
import pymc as pm
import aesara.tensor as at

Let's explore the dataset

From the dataset, we know that the files follow this format:
- movies.dat:  MovieID::Title::Genres
- ratings.dat: UserID::MovieID::Rating::Timestamp
- users.dat:   UserID::Gender::Age::Occupation::Zip-code

Let's explore the movies dataset: MovieID::Title::Genres

In [2]:
df_movies = pd.read_csv('./Dataset/movies.dat', sep='::', engine='python', names=['MovieID', 'Title', 'Genres'],  encoding='ISO-8859-1')

In [3]:
df_movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


As can be seen, all genres of a movie are stored in a single column. Let's split the Genres column and apply one-hot encoding:

In [4]:
genres_split = df_movies['Genres'].str.get_dummies(sep='|')

In [5]:
genres_split.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [6]:
df_movies = pd.concat([df_movies, genres_split], axis=1)

In [7]:
df_movies.head()

Unnamed: 0,MovieID,Title,Genres,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Animation|Children's|Comedy,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's modify the movies dataset further, so that the release year is moved from the title to a separate column:

In [8]:
# Extract the year from the Title and create a new column 'YearMovie'
df_movies['YearMovie'] = df_movies['Title'].str.extract(r'\((\d{4})\)')

# Remove the year from the Title column 
df_movies['Title'] = df_movies['Title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

In [9]:
df_movies.head()

Unnamed: 0,MovieID,Title,Genres,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,YearMovie
0,1,Toy Story,Animation|Children's|Comedy,0,0,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,1995
1,2,Jumanji,Adventure|Children's|Fantasy,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1995
2,3,Grumpier Old Men,Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1995
3,4,Waiting to Exhale,Comedy|Drama,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1995
4,5,Father of the Bride Part II,Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1995


In [10]:
print(f'There are {len(df_movies)} movies.')

There are 3883 movies.


Let's explore the ratings file, which has the following format: UserID::MovieID::Rating::Timestamp

In [11]:
df_ratings = pd.read_csv('./Dataset/ratings.dat', sep='::', engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'],  encoding='ISO-8859-1')
df_ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [12]:
print(f'There are {len(df_ratings)} ratings.')

There are 1000209 ratings.


Let's explore the users file:

In [13]:
df_users = pd.read_csv('./Dataset/users.dat', sep='::', engine='python', names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'],  encoding='ISO-8859-1')
df_users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,ZipCode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [14]:
print(f'There are {len(df_users)} users.')

There are 6040 users.


Note:
- Age is chosen from the following ranges:
	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"
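
The age codes listed above can be turned into readable labels with a simple lookup. This is a minimal sketch, assuming the `Age` column holds only the seven codes from the note. The names `AGE_LABELS`, `toy_users` and `AgeGroup` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Lookup table copied from the note above (code -> age range).
AGE_LABELS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}

# Toy frame standing in for df_users; the values mirror the first rows shown earlier.
toy_users = pd.DataFrame({"UserID": [1, 2, 3], "Age": [1, 56, 25]})

# Map each code to its label; a code missing from the table would become NaN.
toy_users["AgeGroup"] = toy_users["Age"].map(AGE_LABELS)
print(toy_users)
```

The same `map` call could be applied to `df_users['Age']` if labelled age groups are preferred over the numeric codes.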

Merging datasets
------------------

In [15]:
# Merge ratings with movies
df_ratings_movies = pd.merge(df_ratings, df_movies, on='MovieID')

# Merge the result with users
df_full = pd.merge(df_ratings_movies, df_users, on='UserID')

df_full.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Action,Adventure,Animation,Children's,...,Romance,Sci-Fi,Thriller,War,Western,YearMovie,Gender,Age,Occupation,ZipCode
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest,Drama,0,0,0,0,...,0,0,0,0,0,1975,F,1,10,48067
1,1,661,3,978302109,James and the Giant Peach,Animation|Children's|Musical,0,0,1,1,...,0,0,0,0,0,1996,F,1,10,48067
2,1,914,3,978301968,My Fair Lady,Musical|Romance,0,0,0,0,...,1,0,0,0,0,1964,F,1,10,48067
3,1,3408,4,978300275,Erin Brockovich,Drama,0,0,0,0,...,0,0,0,0,0,2000,F,1,10,48067
4,1,2355,5,978824291,"Bug's Life, A",Animation|Children's|Comedy,0,0,1,1,...,0,0,0,0,0,1998,F,1,10,48067


Prepare dataset for HPF
-------------------

Ensure that the data is in the correct format, i.e., that UserID and MovieID are mapped to sequential integers starting from 0
(a common requirement for matrix factorization algorithms).
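
As an illustration of what "sequential and starting from 0" means, the sketch below re-indexes a toy set of IDs with `pd.factorize`. This is one possible approach and not necessarily the one used in this notebook. The toy values and the column names `user_idx` and `movie_idx` are hypothetical:

```python
import pandas as pd

# Toy ratings with sparse, non-contiguous IDs (hypothetical values).
toy = pd.DataFrame({"UserID": [7, 7, 42, 3], "MovieID": [1193, 661, 1193, 914]})

# pd.factorize returns integer codes 0..n-1 in order of first appearance,
# plus the unique original values so the mapping can be inverted.
user_codes, user_index = pd.factorize(toy["UserID"])
movie_codes, movie_index = pd.factorize(toy["MovieID"])
toy["user_idx"] = user_codes
toy["movie_idx"] = movie_codes

# Invert the mapping to recover the original UserIDs.
recovered = user_index.take(user_codes).tolist()

print(toy)
print(recovered)
```

Contiguous indices let each user and each movie address one row of a factor matrix directly, which is why matrix factorization code usually expects them.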

For training and test, use only 10 users
--------------

In [16]:
df_full_safe_copy = df_full.copy(deep=True)

In [17]:
len(set(df_full_safe_copy['UserID']))

6040

In [18]:
filtered_by_id = df_full_safe_copy[df_full_safe_copy['UserID'] <=10].copy(deep = True)
df_full = filtered_by_id.copy(deep = True)
print(f"The smaller dataset contains {len(df_full)} entries. The total dataset contains {len(df_full_safe_copy)} entries.")

The smaller dataset contains 1200 entries. The total dataset contains 1000209 entries.


Split the dataset into train and test
-------------------

We can choose a random split, a temporal split, a user-based split, or a combination of them. For the purpose of this project, the data will be split based on time and user presence. Reasons:
- Temporal split: each rating includes a timestamp, so ratings up to a certain date are used for training and ratings after that date for testing.
- User-based split: ensures that every user present in the test set is also present in the training set; the model needs past data about a user to make predictions for them.

This approach is more realistic because it simulates how the model would work in production, predicting future ratings based on past data. Moreover, we can ensure that every movie in the test set has also appeared in the training set.
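
The combined split described above can be sketched on toy data as follows. The helper name `temporal_user_split`, its `train_fraction` parameter and the toy ratings are assumptions made for illustration only:

```python
import pandas as pd

def temporal_user_split(df, train_fraction):
    """Split by time, then keep only test rows whose user and movie were seen in training."""
    cutoff = df["Timestamp"].quantile(train_fraction)
    train = df[df["Timestamp"] <= cutoff].copy()
    test = df[df["Timestamp"] > cutoff]
    seen = test["UserID"].isin(train["UserID"]) & test["MovieID"].isin(train["MovieID"])
    return train, test[seen].copy()

# Toy data (hypothetical): six ratings ordered in time.
toy = pd.DataFrame({
    "UserID":  [1, 2, 1, 2, 3, 1],
    "MovieID": [10, 10, 20, 20, 10, 30],
    "Rating":  [5, 3, 4, 2, 1, 5],
    "Timestamp": pd.to_datetime([1, 2, 3, 4, 5, 6], unit="s"),
})

train, test = temporal_user_split(toy, train_fraction=0.5)
print(len(train), len(test))
print(test)
```

In this toy case the three earliest ratings form the training set. Of the three later ratings, one is dropped because its user never appears in training and another because its movie never appears, so the filter can shrink the test set considerably.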

Temporal split
------------------

In [19]:
# Convert the Timestamp column to a datetime format
df_full['Timestamp'] = pd.to_datetime(df_full['Timestamp'], unit='s')
df_full.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Action,Adventure,Animation,Children's,...,Romance,Sci-Fi,Thriller,War,Western,YearMovie,Gender,Age,Occupation,ZipCode
0,1,1193,5,2000-12-31 22:12:40,One Flew Over the Cuckoo's Nest,Drama,0,0,0,0,...,0,0,0,0,0,1975,F,1,10,48067
1,1,661,3,2000-12-31 22:35:09,James and the Giant Peach,Animation|Children's|Musical,0,0,1,1,...,0,0,0,0,0,1996,F,1,10,48067
2,1,914,3,2000-12-31 22:32:48,My Fair Lady,Musical|Romance,0,0,0,0,...,1,0,0,0,0,1964,F,1,10,48067
3,1,3408,4,2000-12-31 22:04:35,Erin Brockovich,Drama,0,0,0,0,...,0,0,0,0,0,2000,F,1,10,48067
4,1,2355,5,2001-01-06 23:38:11,"Bug's Life, A",Animation|Children's|Comedy,0,0,1,1,...,0,0,0,0,0,1998,F,1,10,48067


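Definition of the cutoff point: in this case, the cutoff is the 0.03 quantile of the timestamps, so roughly the earliest 3% of the ratings are used for training and the remaining 97% for testing. The training share is kept this small because the test set shrinks considerably afterwards, once ratings from users and movies unseen in training are filtered out.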

In [20]:
cutoff_timestamp = df_full['Timestamp'].quantile(0.03)
cutoff_timestamp

Timestamp('2000-12-31 01:15:33')

Split the dataset temporally

In [21]:
train_data_temporal = df_full[df_full['Timestamp'] <= cutoff_timestamp]
test_data_temporal = df_full[df_full['Timestamp'] > cutoff_timestamp]
print(f"The fractions are: {len(train_data_temporal)/len(df_full)} train (temporal data), and {len(test_data_temporal)/len(df_full)} test (temporal data).")
print("These fractions are relative to the smaller dataset.")

The fractions are: 0.03166666666666667 train (temporal data), and 0.9683333333333334 test (temporal data).
These fractions are relative to the smaller dataset.


User-Based Data Split
------------

Ensure that every user and movie in the test set is also present in the training set.

In [22]:
# Identify users and movies in the training set
train_users = set(train_data_temporal['UserID'])
train_movies = set(train_data_temporal['MovieID'])

In [23]:
# Filter the test set to only include users and movies that are also in the training set
test_data = test_data_temporal[test_data_temporal['UserID'].isin(train_users) & test_data_temporal['MovieID'].isin(train_movies)]

In [24]:
train_data = train_data_temporal

In [25]:
print(f"The new fractions are: {len(train_data)/len(df_full)} train (temporal data), and {len(test_data)/len(df_full)} test (temporal data).")
print("These fractions are relative to the dataset containing 10 users.\n")

print(f"Length of the train sample: {len(train_data)}.")
print(f"Length of the test sample: {len(test_data)}")

datasetlen = len(train_data) + len(test_data)
ratio_test = len(test_data)/datasetlen
ratio_train = len(train_data)/datasetlen

# The ratios are fractions in [0, 1], so they are printed without a percent sign.
print(f"The final fractions are: {ratio_train} train; {ratio_test} test.")

The new fractions are: 0.03166666666666667 train (temporal data), and 0.008333333333333333 test (temporal data).
These fractions are relative to the dataset containing 10 users.

Length of the train sample: 38.
Length of the test sample: 10
The final fractions are: 0.7916666666666666 train; 0.20833333333333334 test.


Factorization of the Training Set
-----------

In [26]:
user_id_mapping = {original_id: numerical_id for numerical_id, original_id in enumerate(train_data['UserID'].unique())}
movie_id_mapping = {original_id: numerical_id for numerical_id, original_id in enumerate(train_data['MovieID'].unique())}

In [27]:
# Apply the mapping to the training set. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slice.
train_data = train_data.assign(
    numerical_user_id=train_data['UserID'].map(user_id_mapping),
    numerical_movie_id=train_data['MovieID'].map(movie_id_mapping),
)


Factorization of the Test Set
---------------

Apply the SAME mapping as on the training set

In [28]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# on the filtered slice.
test_data = test_data.assign(
    numerical_user_id=test_data['UserID'].map(user_id_mapping),
    numerical_movie_id=test_data['MovieID'].map(movie_id_mapping),
)


Check for missing values:

In [29]:
if test_data.isna().sum().any() or train_data.isna().sum().any():
    print("NaN values, STOP")
else:
    print("Test and train data OK!")

Test and train data OK!
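
The NaN check matters because a mapping built from the training set has no entry for IDs that appear only in the test set. A minimal sketch with hypothetical IDs shows what `Series.map` does in that case:

```python
import pandas as pd

# Mapping built from a toy training set (hypothetical IDs).
train_ids = pd.Series([10, 20, 30])
id_mapping = {original: idx for idx, original in enumerate(train_ids.unique())}

# A test set containing one ID (99) that never appeared in training.
test_ids = pd.Series([20, 99, 10])
mapped = test_ids.map(id_mapping)

print(mapped.tolist())      # the unseen ID has no index and maps to NaN
print(mapped.isna().any())
```

Filtering the test set down to users and movies seen in training, as done earlier, is what keeps such NaN values from appearing here.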


In [30]:
# Save the data:
df_full.to_csv('./saved/df_full.csv')
train_data.to_csv('./saved/train_data.csv')
test_data.to_csv('./saved/test_data.csv')

Implementation of the model - Run it with the same dataset as HPF
---------------

In [31]:
n_users= train_data['numerical_user_id'].nunique()
n_movies = train_data['numerical_movie_id'].nunique()

In [32]:
# Extract rating and IDs
user_ids = train_data['numerical_user_id'].values
movie_ids = train_data['numerical_movie_id'].values
ratings = train_data['Rating'].values

In [None]:
train_data['numerical_user_id']

In [46]:
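def run_model(model, data_train, data_test, X, Y):
    """Fit `model` on the training data and report accuracy and the confusion matrix on the test data."""
    model.fit(data_train[X], data_train[Y])
    y_pred = model.predict(data_test[X])

    # Evaluate against the `data_test` argument rather than the global `test_data`.
    print(f"accuracy score: {accuracy_score(data_test[Y], y_pred)}.")
    print(f"confusion matrix: {confusion_matrix(data_test[Y], y_pred)}.")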

In [50]:
print("Model that runs with the same dataset as the large HPF model.")
run_model(RandomForestClassifier(random_state = 30), train_data, test_data, ['numerical_user_id', 'numerical_movie_id'], 'Rating')

Model that runs with the same dataset as the large HPF model.
accuracy score: 0.4.
confusion matrix: [[2 1 0]
 [3 2 0]
 [2 0 0]].
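
The accuracy can be read directly off the confusion matrix, as the short sketch below shows. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix printed above (rows: true ratings, columns: predicted ratings).
cm = np.array([[2, 1, 0],
               [3, 2, 0],
               [2, 0, 0]])

# Correct predictions sit on the diagonal; accuracy is their share of all test rows.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.4
```

With only 10 test ratings, the matrix covers just three rating classes and each prediction moves the accuracy by 0.1, so this figure should be interpreted with caution.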


Let's increase the dataset, following the same split. Consider 2000 users.
------------

In [51]:
df_full = df_full_safe_copy.copy(deep = True)

In [58]:
filtered_by_id = df_full_safe_copy[df_full_safe_copy['UserID'] <=2000].copy(deep = True)
df_full = filtered_by_id.copy(deep = True)
print(f"The smaller dataset contains {len(df_full)} entries. The total dataset contains {len(df_full_safe_copy)} entries.")

The smaller dataset contains 339542 entries. The total dataset contains 1000209 entries.


Temporal split

In [63]:
# Convert the Timestamp column to a datetime format
df_full['Timestamp'] = pd.to_datetime(df_full['Timestamp'], unit='s')
cutoff_timestamp = df_full['Timestamp'].quantile(0.03)
train_data_temporal = df_full[df_full['Timestamp'] <= cutoff_timestamp]
test_data_temporal = df_full[df_full['Timestamp'] > cutoff_timestamp]
print(f"The fractions are: {len(train_data_temporal)/len(df_full)} train (temporal data), and {len(test_data_temporal)/len(df_full)} test (temporal data).")
print("These fractions are relative to the smaller dataset.")

The fractions are: 0.03000512455012929 train (temporal data), and 0.9699948754498707 test (temporal data).
These fractions are relative to the smaller dataset.


User-based split

In [64]:
# Identify users and movies in the training set
train_users = set(train_data_temporal['UserID'])
train_movies = set(train_data_temporal['MovieID'])
# Filter the test set to only include users and movies that are also in the training set
test_data = test_data_temporal[test_data_temporal['UserID'].isin(train_users) & test_data_temporal['MovieID'].isin(train_movies)]
train_data = train_data_temporal
print(f"The new fractions are: {len(train_data)/len(df_full)} train (temporal data), and {len(test_data)/len(df_full)} test (temporal data).")
print("These fractions are relative to the dataset containing 2000 users.\n")

print(f"Length of the train sample: {len(train_data)}.")
print(f"Length of the test sample: {len(test_data)}")

datasetlen = len(train_data) + len(test_data)
ratio_test = len(test_data)/datasetlen
ratio_train = len(train_data)/datasetlen

# The ratios are fractions in [0, 1], so they are printed without a percent sign.
print(f"The final fractions are: {ratio_train} train; {ratio_test} test.")

The new fractions are: 0.03000512455012929 train (temporal data), and 0.04460420213110602 test (temporal data).
These fractions are relative to the dataset containing 2000 users.

Length of the train sample: 10188.
Length of the test sample: 15145
The final fractions are: 0.40216318635771525 train; 0.5978368136422848 test.


Factorization of the sets

In [65]:
user_id_mapping = {original_id: numerical_id for numerical_id, original_id in enumerate(train_data['UserID'].unique())}
movie_id_mapping = {original_id: numerical_id for numerical_id, original_id in enumerate(train_data['MovieID'].unique())}
# Apply the same mapping to both sets. assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered slices.
train_data = train_data.assign(
    numerical_user_id=train_data['UserID'].map(user_id_mapping),
    numerical_movie_id=train_data['MovieID'].map(movie_id_mapping),
)
test_data = test_data.assign(
    numerical_user_id=test_data['UserID'].map(user_id_mapping),
    numerical_movie_id=test_data['MovieID'].map(movie_id_mapping),
)
if test_data.isna().sum().any() or train_data.isna().sum().any():
    print("NaN values, STOP")
else:
    print("Test and train data OK!")

Test and train data OK!

In [66]:
print("Model that runs with the 2000 users:")
run_model(RandomForestClassifier(random_state = 30), train_data, test_data, ['numerical_user_id', 'numerical_movie_id'], 'Rating')

Model that runs with the 2000 users:
accuracy score: 0.3200396170353252.
confusion matrix: [[  35   68  331  212   88]
 [  58  133  764  604  199]
 [  90  368 1602 1907  599]
 [  54  427 1797 2371  953]
 [  39  178  575  987  706]].


Let's include more features about the movies and the users
-------------------

In [67]:
df_full = df_full_safe_copy.copy(deep = True)

In [69]:
df_full.columns

Index(['UserID', 'MovieID', 'Rating', 'Timestamp', 'Title', 'Genres', 'Action',
       'Adventure', 'Animation', 'Children's', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
       'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western',
       'YearMovie', 'Gender', 'Age', 'Occupation', 'ZipCode'],
      dtype='object')

In [70]:
df_full.head(2)

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Action,Adventure,Animation,Children's,...,Romance,Sci-Fi,Thriller,War,Western,YearMovie,Gender,Age,Occupation,ZipCode
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest,Drama,0,0,0,0,...,0,0,0,0,0,1975,F,1,10,48067
1,1,661,3,978302109,James and the Giant Peach,Animation|Children's|Musical,0,0,1,1,...,0,0,0,0,0,1996,F,1,10,48067


In [72]:
print("Model that runs with the 2000 users:")
run_model(RandomForestClassifier(random_state = 30), train_data, test_data, [
        'numerical_user_id', 'numerical_movie_id', 'Action',
       'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
       'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western',
       'YearMovie', 'ZipCode', 'Occupation'], 'Rating')

Model that runs with the 2000 users:
accuracy score: 0.36764608781776165.
confusion matrix: [[  64   66  297  232   75]
 [  54   93  721  699  191]
 [  55  188 1625 2138  560]
 [  48  119 1461 2988  986]
 [  18   41  421 1207  798]].
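
To put this accuracy in context, the confusion matrix above also gives the score of a trivial baseline that always predicts the most frequent true rating in the test set. This is a sketch under the assumption that rows are true labels (scikit-learn's convention). The baseline is computed from test-set counts only and is not a model from the original notebook:

```python
import numpy as np

# Confusion matrix printed above (rows: true ratings, columns: predicted ratings).
cm = np.array([[64, 66, 297, 232, 75],
               [54, 93, 721, 699, 191],
               [55, 188, 1625, 2138, 560],
               [48, 119, 1461, 2988, 986],
               [18, 41, 421, 1207, 798]])

# Accuracy: share of test rows on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Row sums count how often each true rating occurs in the test set.
class_counts = cm.sum(axis=1)
majority_baseline = class_counts.max() / cm.sum()

print(f"accuracy: {accuracy:.4f}")
print(f"majority-class baseline: {majority_baseline:.4f}")
```

The two values turn out to be close, which suggests that the extra features bring the classifier roughly to the level of always predicting the most common rating.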


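As can be seen, the classifier performed worse than the HPF model, even with much more data: its accuracy reached only about 0.32 with the ID features alone and about 0.37 with the additional features.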