## Implement a collaborative filtering recommender that predicts a user's rating for an item

- Test different configurations (e.g., different numbers of nearest neighbors, different similarity measures).
- Evaluate them using MAE and RMSE, both implemented by yourself.
- Choose between cross-validation and hold-out validation to perform your evaluation.
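
The difference between the two validation schemes can be sketched without any library. The helper names `holdout_split` and `kfold_splits` are hypothetical and only illustrate the idea on plain indices; they are not part of Surprise, whose own `KFold` splitter is what this notebook relies on.

```python
import random

def holdout_split(n_items, test_fraction=0.2, seed=42):
    """Shuffle the indices once and cut them into a single train/test pair."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

def kfold_splits(n_items, n_splits=5, seed=42):
    """Yield (train, test) index lists; every index lands in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        test = indices[fold::n_splits]  # every n_splits-th shuffled index
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

train, test = holdout_split(10)
folds = list(kfold_splits(10, n_splits=5))
```

Hold-out evaluates on one fixed slice of the ratings, while k-fold averages the error over several slices, so each rating is used for testing exactly once.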

### About the dataset:
This dataset consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip code).

In [15]:
# %pip install --upgrade scikit-surprise numpy
import pandas as pd
from surprise import Dataset, Reader, Prediction
from surprise import KNNBasic, BaselineOnly
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
from surprise import accuracy
import numpy as np

In [8]:
# Load the MovieLens 100k dataset (standard in Surprise)
data = Dataset.load_builtin('ml-100k')

print("Data loaded successfully.")

Data loaded successfully.


In [4]:
data

<surprise.dataset.DatasetAutoFolds at 0x11cc62d50>

In [5]:
# custom performance metrics
def calculate_mae(predictions):
    """Calculates Mean Absolute Error (MAE) for a list of predictions."""
    if not predictions:
        return 0
    # The 'r_ui' is the true rating, and 'est' is the estimated rating.
    errors = [abs(true_rating - est_rating) for (_, _, true_rating, est_rating, _) in predictions]
    return np.mean(errors)

def calculate_rmse(predictions):
    """Calculates Root Mean Squared Error (RMSE) for a list of predictions."""
    if not predictions:
        return 0
    # The 'r_ui' is the true rating, and 'est' is the estimated rating.
    squared_errors = [(true_rating - est_rating)**2 for (_, _, true_rating, est_rating, _) in predictions]
    return np.sqrt(np.mean(squared_errors))
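
A quick way to gain confidence in hand-written metrics is to run the same arithmetic on a tiny example whose answer can be worked out on paper. The sketch below assumes a stand-in tuple, `ToyPrediction`, with the same field order as Surprise's `Prediction` (`uid`, `iid`, `r_ui`, `est`, `details`); the ratings are made up for illustration.

```python
import math
from collections import namedtuple

# Hypothetical stand-in that mirrors the field order of Surprise's Prediction tuple.
ToyPrediction = namedtuple("ToyPrediction", ["uid", "iid", "r_ui", "est", "details"])

toy_predictions = [
    ToyPrediction("u1", "i1", 4.0, 3.0, {}),  # absolute error 1
    ToyPrediction("u1", "i2", 2.0, 2.0, {}),  # absolute error 0
    ToyPrediction("u2", "i1", 5.0, 3.0, {}),  # absolute error 2
]

abs_errors = [abs(p.r_ui - p.est) for p in toy_predictions]

# MAE: mean of the absolute errors, here (1 + 0 + 2) / 3.
toy_mae = sum(abs_errors) / len(abs_errors)

# RMSE: square root of the mean squared error, here sqrt((1 + 0 + 4) / 3).
toy_rmse = math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))
```

Because RMSE squares each error before averaging, the single error of 2 weighs more heavily in `toy_rmse` than in `toy_mae`.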

### Testing and Evaluation
We will use 5-fold cross-validation and test different combinations of collaborative filtering settings:
- Similarity Metrics: cosine, MSD, and pearson
- Number of Neighbors (k): 20, 40, and 60
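
To make the three similarity measures concrete, here is a minimal sketch that computes each one for two hypothetical users over the items both have rated. It assumes the usual definitions: cosine of the rating vectors, MSD turned into a similarity as `1 / (msd + 1)`, and Pearson as the cosine of mean-centered ratings, with means taken over the co-rated items. The function names are illustrative and not Surprise's internals.

```python
import math

# Ratings two hypothetical users gave to the same three items.
u = [5.0, 3.0, 4.0]
v = [4.0, 2.0, 5.0]

def cosine_sim(a, b):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def msd_sim(a, b):
    """Mean squared difference, mapped to (0, 1] so that identical ratings give 1."""
    msd = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return 1.0 / (msd + 1.0)

def pearson_sim(a, b):
    """Cosine similarity after subtracting each user's mean (undefined for constant ratings)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_sim([x - mean_a for x in a], [y - mean_b for y in b])

sims = {"cosine": cosine_sim(u, v), "MSD": msd_sim(u, v), "pearson": pearson_sim(u, v)}
```

On this toy pair the three measures disagree noticeably, which is one reason the choice of similarity is treated as a configuration option in the experiments.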

#### We will use the User-Based Collaborative Filtering approach (user_based=True) with the KNNBasic algorithm.
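
As a rough picture of what a user-based KNN prediction does, the sketch below estimates a rating as the similarity-weighted average of the ratings given by the `k` most similar users who rated the item. The function `predict_user_based` and its inputs are hypothetical; this is a simplified illustration of the idea, not Surprise's implementation.

```python
import heapq

def predict_user_based(neighbors, k=2):
    """neighbors: (similarity, rating) pairs for users who rated the target item."""
    # Keep the k neighbors with the highest similarity to the target user.
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    # Use positive similarities only, so the weighted average stays well defined.
    top_k = [(sim, rating) for sim, rating in top_k if sim > 0]
    if not top_k:
        return None  # a real recommender would fall back to some default estimate
    weighted_sum = sum(sim * rating for sim, rating in top_k)
    return weighted_sum / sum(sim for sim, _ in top_k)

neighbors = [(0.9, 4.0), (0.6, 5.0), (0.1, 1.0)]
estimate = predict_user_based(neighbors, k=2)  # (0.9 * 4 + 0.6 * 5) / (0.9 + 0.6)
```

With `k=2` the weakly similar third neighbor is ignored; raising `k` lets it pull the estimate down, which is the trade-off explored by varying the number of neighbors.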

In [6]:
import itertools

# Define cross-validation splitter
kf = KFold(n_splits=5, random_state=42, shuffle=True)

# Define configurations to test
similarity_metrics = ['cosine', 'MSD', 'pearson']
k_values = [20, 40, 60]

# Generate all combinations (Cartesian product)
combinations = list(itertools.product(similarity_metrics, k_values))

# Build the final list of configuration dictionaries using a list comprehension
configurations = [
    {
        'sim_options': {'name': sim, 'user_based': True},
        'k': k
    }
    for sim, k in combinations
]

results = []

print("Starting cross-validation and evaluation for all configurations...")

for config in configurations:
    sim_name = config['sim_options']['name']
    k_val = config['k']

    # Set the algorithm with the current configuration.
    # KNNBasic has no random_state parameter and is deterministic, so none is passed.
    algo = KNNBasic(k=k_val, sim_options=config['sim_options'])

    # Perform 5-fold cross-validation
    mae_list = []
    rmse_list = []

    for trainset, testset in kf.split(data):
        # Train the algorithm
        algo.fit(trainset)

        # Make predictions on the test set
        predictions = algo.test(testset)

        # Calculate MAE and RMSE using the custom functions
        mae_fold = calculate_mae(predictions)
        rmse_fold = calculate_rmse(predictions)

        mae_list.append(mae_fold)
        rmse_list.append(rmse_fold)

    # Calculate average MAE and RMSE over all folds
    avg_mae = np.mean(mae_list)
    avg_rmse = np.mean(rmse_list)

    # Store results
    results.append({
        'Similarity': sim_name,
        'k (Neighbors)': k_val,
        'Mean MAE': avg_mae,
        'Mean RMSE': avg_rmse
    })

print("Evaluation complete.")

# Convert results to a DataFrame for clean display
results_df = pd.DataFrame(results)

Starting cross-validation and evaluation for all configurations...
Computing the cosine similarity matrix...
Done computing similarity matrix.
...

In [7]:
results_df

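|   | Similarity | k (Neighbors) | Mean MAE | Mean RMSE |
|---|------------|---------------|----------|-----------|
| 0 | cosine     | 20            | 0.809675 | 1.02523   |
| 1 | cosine     | 40            | 0.80439  | 1.017077  |
| 2 | cosine     | 60            | 0.804802 | 1.016486  |
| 3 | MSD        | 20            | 0.770803 | 0.976807  |
| 4 | MSD        | 40            | 0.773483 | 0.978931  |
| 5 | MSD        | 60            | 0.778333 | 0.983958  |
| 6 | pearson    | 20            | 0.808883 | 1.019592  |
| 7 | pearson    | 40            | 0.803166 | 1.011994  |
| 8 | pearson    | 60            | 0.802392 | 1.010806  |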


Best-performing combination among the tested KNN configurations: MSD similarity with k = 20 neighbors (MAE = 0.770803, RMSE = 0.976807).
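
The same selection can be reproduced programmatically rather than by reading the table. The sketch below copies the reported rows into a plain list (the variable names are illustrative) and picks the minimum by each metric.

```python
# Rows copied from the results table: (similarity, k, mean MAE, mean RMSE).
rows = [
    ("cosine", 20, 0.809675, 1.02523),
    ("cosine", 40, 0.80439, 1.017077),
    ("cosine", 60, 0.804802, 1.016486),
    ("MSD", 20, 0.770803, 0.976807),
    ("MSD", 40, 0.773483, 0.978931),
    ("MSD", 60, 0.778333, 0.983958),
    ("pearson", 20, 0.808883, 1.019592),
    ("pearson", 40, 0.803166, 1.011994),
    ("pearson", 60, 0.802392, 1.010806),
]

# Lower is better for both error metrics.
best_by_mae = min(rows, key=lambda row: row[2])
best_by_rmse = min(rows, key=lambda row: row[3])
```

Both metrics agree on the same configuration here, so there is no trade-off to resolve between MAE and RMSE for this run.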

#### Baseline model - Naive predictor
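
Before running the library version, it may help to see how a bias-only estimate of the form global mean + user bias + item bias can be fitted. The sketch below alternates between updating item and user biases with a regularized average of the residuals. The function names, the toy ratings, and the regularization and epoch values are assumptions chosen for illustration; this is a simplified sketch, not Surprise's `BaselineOnly` code.

```python
from collections import defaultdict

def fit_baseline(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """ratings: list of (user, item, rating). Returns (global_mean, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        # Item step: average what is left after removing the mean and user bias.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[i] += r - mu - b_u[u]
            counts[i] += 1
        for i in sums:
            b_i[i] = sums[i] / (reg_i + counts[i])
        # User step: the same idea with the roles of user and item swapped.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[u] += r - mu - b_i[i]
            counts[u] += 1
        for u in sums:
            b_u[u] = sums[u] / (reg_u + counts[u])
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, user, item):
    # Unknown users or items contribute a zero bias.
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

toy_ratings = [("a", "x", 5.0), ("a", "y", 5.0), ("b", "x", 1.0), ("b", "y", 1.0)]
mu, b_u, b_i = fit_baseline(toy_ratings)
```

In the toy data user `a` always rates high and user `b` always rates low, so the fitted user biases have opposite signs while the item biases stay at zero. The regularization terms shrink biases toward zero for users or items with few ratings.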

In [13]:
# We use BaselineOnly, which predicts r_ui = global_mean + user_bias + item_bias.
# The item bias reflects how far the item's ratings sit from the global mean,
# and the user bias does the same for the user's ratings.
baseline = []

algo = BaselineOnly()

mae_list = []
rmse_list = []

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Use the custom-implemented functions
    mae_fold = calculate_mae(predictions)
    rmse_fold = calculate_rmse(predictions)

    mae_list.append(mae_fold)
    rmse_list.append(rmse_fold)

# Calculate average MAE and RMSE over all folds
avg_mae = np.mean(mae_list)
avg_rmse = np.mean(rmse_list)

# Store results
baseline.append({
    'Mean MAE': avg_mae,
    'Mean RMSE': avg_rmse
})

print("Evaluation complete.")

baseline_df = pd.DataFrame(baseline)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluation complete.


In [14]:
baseline_df

|   | Mean MAE | Mean RMSE |
|---|----------|-----------|
| 0 | 0.748089 | 0.943637  |
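
Setting the two result tables side by side, the bias-only baseline reaches a lower MAE and RMSE than the best KNNBasic configuration in this run. A small sketch of that comparison, using only the numbers reported in the tables (the variable names are illustrative):

```python
best_knn = {"MAE": 0.770803, "RMSE": 0.976807}  # KNNBasic, MSD similarity, k = 20
baseline = {"MAE": 0.748089, "RMSE": 0.943637}  # BaselineOnly

# Gap expressed as a percentage of the KNN error; positive means the baseline is better.
relative_gap = {
    metric: 100.0 * (best_knn[metric] - baseline[metric]) / best_knn[metric]
    for metric in ("MAE", "RMSE")
}

for metric, gap in relative_gap.items():
    print(f"{metric}: baseline error is lower by {gap:.2f}% of the KNN error")
```

The gap is a few percent on both metrics, which suggests that the tested KNNBasic settings would need further tuning, or a different neighborhood model, to beat the baseline on this dataset.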
