# Group Project: Movie Recommendations (2487-T2 Machine Learning) [Group 2]
- Nova School of Business and Economics, Portugal
- Instructor: Qiwei Han, Ph.D.
- Program: Masters Program in Business Analytics
- Group Members: 
    - **Luca Silvano Carocci (53942)**
    - **Fridtjov Höyerholt Stokkeland (52922)**
    - **Diego García Rieckhof (53046)**
    - **Matilde Pesce (53258)**
    - **Florian Fritz Preiss (54385)**<br>
---

# Phase 5: Evaluation [06 Model Evaluation]

## 5.1 Evaluation of Content-Based Recommender Systems

In this phase, we evaluate the performance of our content-based recommender models. We compare seven models built on four approaches: TF-IDF Vectorizer, Count Vectorizer, Optimized Count Vectorizer, Word2Vec, Optimized Word2Vec, Doc2Vec, and Optimized Doc2Vec. Evaluation is a vital step in the machine learning pipeline: it offers insights into the models' accuracy, relevance, and efficiency, and it highlights areas that may require further optimization. To provide a comprehensive assessment, we employ a set of established evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Precision.

**Code Explanation:**

To evaluate our content-based recommender models, we first identify users who have rated movies 4.0 or higher. This information helps us gauge whether our models can recommend movies that align with these users' preferences. We then iterate through a random sample of movies (see the sampling rationale below) and generate recommendations with each of the seven content-based recommender models. For each sampled movie, we gather the actual ratings that these users gave to the recommended movies and compare them with the ideal rating (5.0) to compute the evaluation metrics.

This evaluation process will enable us to compare the performance of the different content-based recommender models and identify the most effective approach for generating accurate and relevant recommendations. By exploring various models and their optimizations, we aim to provide a comprehensive understanding of the underlying techniques and their effectiveness in the context of movie recommendation systems.
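
As a minimal sketch of how these metrics behave, the toy example below computes them for four made-up ratings. The function name `toy_metrics`, the sample ratings, and the `like_threshold` argument are illustrative assumptions and are not part of the evaluation code in this notebook.

```python
import numpy as np

def toy_metrics(actual_ratings, like_threshold, ideal_rating=5.0):
    """Compare actual ratings with an ideal rating (hypothetical helper)."""
    ratings = np.asarray(actual_ratings, dtype=float)
    errors = ratings - ideal_rating

    mae = np.mean(np.abs(errors))      # average distance from the ideal rating
    mse = np.mean(errors ** 2)         # penalizes large distances more strongly
    rmse = np.sqrt(mse)                # back on the original rating scale
    precision = np.mean(ratings >= like_threshold)  # share of "liked" ratings

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "Precision": precision}

# Four made-up ratings; three of them are at or above the assumed threshold of 4.0
toy_result = toy_metrics([5.0, 4.5, 3.0, 4.0], like_threshold=4.0)
print(toy_result)
```

Because every rating is compared with 5.0, lower MAE, MSE, and RMSE values mean that the gathered ratings sit closer to the top of the scale.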

**Rationale for choosing the rating higher or equal to 4.0 as a like for a movie:**

In our evaluation process, we have chosen to consider a rating of 4.0 or higher as an indication of the user's liking for a movie. This decision is based on a rationale that takes into account user behavior, the nature of rating scales, and previous research in the field of recommender systems.

Firstly, users tend to exhibit a positivity bias when rating items, which means that they are more likely to rate items they like rather than dislike (Hu, Zhang, & Pavlou, 2009). This phenomenon leads to a skewed distribution of ratings, with a higher concentration of positive ratings. By setting the threshold at 4.0 or higher, we account for this bias and focus on movies that users have genuinely appreciated.

Secondly, a 5-point rating scale with 0.5-step increments gives users ten options to express their opinion about a movie. A rating of 4.0 or higher corresponds to the three highest of these options (4.0, 4.5, and 5.0), indicating that the user has a strong positive preference for the movie. This threshold is commonly used in research and practice as a proxy for users' "likes" or "favorites" (Bobadilla, Ortega, Hernando, & Gutiérrez, 2013).

Moreover, previous research in recommender systems has demonstrated that using a threshold of 4.0 or higher for identifying liked items can lead to better performance and more accurate recommendations (Cremonesi, Koren, & Turrin, 2010). In particular, focusing on highly-rated items helps the algorithms identify strong patterns of user preferences, which can contribute to generating more relevant recommendations.

In conclusion, setting the threshold for a "like" at 4.0 or higher is a well-justified decision that considers user behavior, the nature of the rating scale, and existing research in the field of recommender systems. By focusing on highly-rated items, we aim to create a robust evaluation framework that can effectively measure the performance of our content-based recommender models.

**Rationale for using a random sample of movies to evaluate the recommendation models:**

We evaluate our recommendation models on a random sample of movies, as this is far more practical given the large database of 50,000 movies we work with. Evaluating the models on the entire dataset would be computationally intensive and time-consuming. By selecting a representative sample, we significantly reduce the evaluation time while maintaining the validity of the evaluation results.


The main reasons for sampling are:

- Computational Efficiency: Evaluating models on the entire dataset is computationally expensive, especially when dealing with large datasets. Using a random sample reduces the computational cost, allowing for quicker evaluation and optimization of the models.

- Representative Results: If the random sample is chosen correctly, it should provide a good approximation of the overall dataset, maintaining the distribution of features and ratings. This ensures that the evaluation results are representative of the models' performance on the entire dataset.

- Flexibility: Using a random sample allows for flexibility in choosing the sample size based on the available computational resources and the desired level of confidence in the evaluation results.

- Widely Accepted Practice: Sampling is a widely accepted practice in statistics and machine learning, as it allows researchers to draw inferences about the population without having to analyze the entire dataset (Cochran, 1977).

Consequently, we use the following formula in order to calculate the required sample size:

$$n = \frac{Z^2 \cdot p \cdot (1 - p)}{E^2}$$

Where:

- n is the sample size
- Z is the Z-score (based on the desired confidence level, e.g., 1.96 for a 95% confidence level)
- p is the estimated proportion of ratings that are 4.0 or higher
- E is the desired margin of error
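
As a worked example, assume a 95% confidence level (Z ≈ 1.96), a margin of error of E = 0.05, and the proportion p = 0.4983 that is computed from the ratings later in this notebook. The formula then gives:

$$
n = \frac{1.96^2 \times 0.4983 \times (1 - 0.4983)}{0.05^2} \approx \frac{3.8416 \times 0.2500}{0.0025} \approx 384.2
$$

Rounding up to the next whole number gives a sample size of 385.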


In [30]:
# Standard libraries
import ast
import joblib
import math
import pickle
import random
import time
import warnings

# Third-party libraries
from collections import defaultdict
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
import scipy.stats
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler
from sentence_transformers import SentenceTransformer
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Custom recommender module
from recommender import (ContentRecommenderTFIDFOptimized, ContentRecommenderCountVec,
                         ContentRecommenderCountVecOptimized, ContentRecommenderW2V,
                         ContentRecommenderW2VOptimized, ContentRecommenderD2V,
                         ContentRecommenderD2VOptimized, ContentRecommenderDistilRoBERTa)

# Suppress warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the movies dataset
movies_df = pd.read_csv('../00_Data/engineered/movies_df_engineered.csv', dtype={'movieId': int})
movies_df.head(2)

Unnamed: 0,movieId,title,movie_age,genres,combined_text,vote_average,vote_count,score,sentiment
0,1,Toy Story (1995),28,"['Adventure', 'Animation', 'Children', 'Comedy...",adventure animation children comedy fantasy re...,3.893708,57309.0,3.883305,0.112121
1,2,Jumanji (1995),28,"['Adventure', 'Children', 'Fantasy']",adventure children fantasy adaptationofbook ad...,3.251527,24228.0,3.242912,-0.21875


In [3]:
# Import ratings table
ratings_df = pd.read_csv('../00_Data/pre-processed/prepr_ratings.csv', dtype={'movieId': int})
ratings_df.drop(['Unnamed: 0', 'timestamp'], axis=1, inplace=True)
ratings_df.head(2)

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5


In [4]:
def evaluate_movie(movie_id, ratings_df, users_rated_high, recommender):
    """Evaluates a single movie for a recommender system.

    Args:
        movie_id (int): The ID of the movie to evaluate.
        ratings_df (DataFrame): DataFrame containing movie ratings.
        users_rated_high (Series): Users who rated a movie 4.0 or higher.
        recommender (object): The recommender model to use.

    Returns:
        ratings_array (np.array): Ratings array for a single movie.
    """
    recommendations = recommender.recommend(movie_id, top_n=10)
    recommended_ids = recommendations.index.tolist()

    ratings = ratings_df[(ratings_df['userId'].isin(users_rated_high)) &
                         (ratings_df['movieId'].isin(recommended_ids))]

    ratings_array = np.array(ratings['rating'])
    return ratings_array
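
The function above filters by users who rated any movie 4.0 or higher. As a point of comparison, a narrower filter that keeps only users who liked the specific seed movie is sketched below. This is a hypothetical variant with made-up data; the name `ratings_from_fans` and its arguments are assumptions, and it is not the method used to produce the results in this notebook.

```python
import numpy as np
import pandas as pd

def ratings_from_fans(seed_movie_id, recommended_ids, ratings_df, like_threshold=4.0):
    """Ratings of recommended movies given by users who liked the seed movie."""
    # Users who rated the seed movie at or above the like threshold
    fans = ratings_df.loc[
        (ratings_df["movieId"] == seed_movie_id)
        & (ratings_df["rating"] >= like_threshold),
        "userId",
    ].unique()

    # Ratings those users gave to the recommended movies
    mask = ratings_df["userId"].isin(fans) & ratings_df["movieId"].isin(recommended_ids)
    return ratings_df.loc[mask, "rating"].to_numpy()

# Toy data: users 1 and 3 liked movie 10, user 2 did not
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 20],
    "rating":  [5.0, 4.5, 2.0, 1.0, 4.0, 3.0],
})

fan_ratings = ratings_from_fans(10, [20], toy_ratings)
print(fan_ratings)  # ratings of movie 20 by users 1 and 3
```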

In [5]:
def evaluate_recommender(ratings_df, users_rated_high, recommender, random_movie_ids):
    """Evaluates a recommender system using a random sample of movie IDs.

    Args:
        ratings_df (DataFrame): DataFrame containing movie ratings.
        users_rated_high (Series): Users who rated a movie 4.0 or higher.
        recommender (object): The recommender model to use.
        random_movie_ids (list): Randomly selected movie IDs.

    Returns:
        all_actual_ratings (np.array): Array of actual ratings for all movies.
    """
    start_time = time.time()

    # n_jobs=-1 lets joblib use all available CPU cores
    num_cores = -1

    all_actual_ratings = Parallel(n_jobs=num_cores)(
        delayed(evaluate_movie)(movie_id, ratings_df, users_rated_high, recommender)
        for movie_id in random_movie_ids
    )

    all_actual_ratings = np.concatenate(all_actual_ratings)

    elapsed_time = time.time() - start_time

    print(f"Time taken for evaluation: {elapsed_time} seconds\n")

    return all_actual_ratings

In [6]:
def evaluate_and_print_metrics(recommender, recommender_name, ratings_df, users_rated_high, random_movie_ids):
    """Evaluates and prints metrics for a recommender system.

    Args:
        recommender (object): The recommender model to use.
        recommender_name (str): Name of the recommender model.
        ratings_df (DataFrame): DataFrame containing movie ratings.
        users_rated_high (Series): Users who rated a movie 4.0 or higher.
        random_movie_ids (list): Randomly selected movie IDs.

    Returns:
        metrics (dict): Metrics for the recommender model.
    """

    all_actual_ratings = evaluate_recommender(ratings_df, users_rated_high, recommender, random_movie_ids)

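    # Error metrics: distance of each actual rating from the ideal rating (5.0)
    mae = np.mean(np.abs(all_actual_ratings - 5))
    mse = np.mean((all_actual_ratings - 5) ** 2)
    rmse = np.sqrt(mse)

    # Precision: share of gathered ratings at or above the like threshold.
    # Note: this threshold (4.5) is stricter than the 4.0 used to build users_rated_high.
    like_threshold = 4.5
    recommended_movie_liked = np.where(all_actual_ratings >= like_threshold, 1, 0)
    pred_pos = len(recommended_movie_liked)
    true_pos = np.sum(recommended_movie_liked)
    precision = true_pos / pred_pos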

    metrics = {
        'Predicted Likes': pred_pos,
        'Actual Likes': true_pos,
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse,
        'Precision': precision
    }

    print(f"Metrics for {recommender_name}:")
    print(f"- MAE: {mae:.4f}")
    print(f"- MSE: {mse:.4f}")
    print(f"- RMSE: {rmse:.4f}")
    print(f"- Precision: {precision:.4f}\n")

    return metrics

In [7]:
def calculate_sample_size(confidence_level, margin_of_error, proportion):
    """
    Calculate the required sample size based on the desired confidence level,
    margin of error, and estimated proportion of the population with the
    desired characteristic.
    
    Args:
        confidence_level (float): The desired confidence level (e.g., 0.95 for 95%).
        margin_of_error (float): The desired margin of error (e.g., 0.05 for ±5%).
        proportion (float): The estimated proportion of the population with the
                            desired characteristic (e.g., the proportion of movies
                            with a rating of 4.0 or higher).
                            
    Returns:
        sample_size (int): The required sample size.
    """
    # Calculate the Z-score based on the confidence level
    z_score = abs(scipy.stats.norm.ppf((1 - confidence_level) / 2))
    
    # Calculate the required sample size
    sample_size = math.ceil((z_score ** 2 * proportion * (1 - proportion)) / (margin_of_error ** 2))
    
    return sample_size

In [8]:
# Set the desired rating threshold
rating_threshold = 4.0

# Calculate the proportion of ratings equal to or higher than the threshold
ratings_higher_than_threshold = ratings_df[ratings_df['rating'] >= rating_threshold]
proportion = len(ratings_higher_than_threshold) / len(ratings_df)

print(f"Proportion of movies with a rating of {rating_threshold} or higher: {proportion:.4f}")

Proportion of movies with a rating of 4.0 or higher: 0.4983


In [9]:
# Calculate the required sample size (proportion comes from the previous cell)
confidence_level = 0.95
margin_of_error = 0.05

required_sample_size = calculate_sample_size(confidence_level, margin_of_error, proportion)
print(f"Required sample size: {required_sample_size}")

Required sample size: 385


In [10]:
# User IDs behind every rating of 4.0 or higher (one entry per rating, so IDs repeat)
users_rated_high = ratings_df[ratings_df['rating'] >= rating_threshold]['userId']

# Get all movie IDs
movie_ids = movies_df['movieId'].tolist()

# Randomly select 385 movie IDs (according to calculated sample size)
random_movie_ids = random.sample(movie_ids, required_sample_size)
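
The call above draws a different sample on every run. If reproducible results are needed, a seeded generator is one option. The sketch below uses a hypothetical list of IDs and an arbitrary seed of 42; neither comes from the original evaluation.

```python
import random

# Hypothetical stand-in for movies_df['movieId'].tolist()
example_movie_ids = list(range(1, 1001))

# A dedicated generator with a fixed seed returns the same sample on every run
rng = random.Random(42)
sample_a = rng.sample(example_movie_ids, 385)

rng = random.Random(42)
sample_b = rng.sample(example_movie_ids, 385)

print(sample_a == sample_b)   # True: same seed, same sample
print(len(set(sample_a)))     # 385: sampling is without replacement
```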

In [11]:
# Evaluate the models and store the metrics in a list of dictionaries
metrics = []

### **5.1.1 TF-IDF Vectorizer**

In [12]:
# Load the saved model
with open('../02_Models/content_recommender_tfidf.pkl', 'rb') as file:
    recommenderTFIDF = pickle.load(file)

In [13]:
# Evaluate the TF-IDF model
metrics_tfidf = evaluate_and_print_metrics(recommenderTFIDF, "TF-IDF Vectorizer", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'TF-IDF Vectorizer', 'mae': metrics_tfidf['MAE'], 'mse': metrics_tfidf['MSE'], 'rmse': metrics_tfidf['RMSE'], 'precision': metrics_tfidf['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_tfidf, index=[0])

Time taken for evaluation: 150.19900488853455 seconds

Metrics for TF-IDF Vectorizer:
- MAE: 1.4247
- MSE: 3.1621
- RMSE: 1.7782
- Precision: 0.2478



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,3441873,852930,1.424663,3.162091,1.778227,0.24781


### **5.1.2 Count Vectorizer**

**a. Regular Model**

In [14]:
# Load the saved model
with open('../02_Models/content_recommender_countvec.pkl', 'rb') as file:
    recommenderCountVec = pickle.load(file)

In [15]:
# Evaluate the Count Vectorizer model
metrics_countvec = evaluate_and_print_metrics(recommenderCountVec, "Count Vectorizer", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Count Vectorizer', 'mae': metrics_countvec['MAE'], 'mse': metrics_countvec['MSE'], 'rmse': metrics_countvec['RMSE'], 'precision': metrics_countvec['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_countvec, index=[0])

Time taken for evaluation: 460.995080947876 seconds

Metrics for Count Vectorizer:
- MAE: 1.4983
- MSE: 3.4003
- RMSE: 1.8440
- Precision: 0.2211



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,3057786,676028,1.498283,3.400338,1.844,0.221084


**b. Optimized Model**

In [16]:
# Load the saved model
with open('../02_Models/content_recommender_countvec_opt.pkl', 'rb') as file:
    recommenderCountVecOptimized = pickle.load(file)

In [17]:
# Evaluate the optimized Count Vectorizer model
metrics_countvec_opt = evaluate_and_print_metrics(recommenderCountVecOptimized, "Count Vectorizer (Optimized)", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Count Vectorizer (Optimized)', 'mae': metrics_countvec_opt['MAE'], 'mse': metrics_countvec_opt['MSE'], 'rmse': metrics_countvec_opt['RMSE'], 'precision': metrics_countvec_opt['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_countvec_opt, index=[0])

Time taken for evaluation: 123.34686255455017 seconds

Metrics for Count Vectorizer (Optimized):
- MAE: 1.7996
- MSE: 4.1810
- RMSE: 2.0447
- Precision: 0.0825



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,222345,18335,1.799564,4.180955,2.044738,0.082462


### **5.1.3 Word2Vec**

**a. Regular Model**

In [18]:
# Load the saved model
with open('../02_Models/content_recommender_w2v.pkl', 'rb') as file:
    recommenderW2V = pickle.load(file)

In [19]:
# Evaluate the Word2Vec model
metrics_w2v = evaluate_and_print_metrics(recommenderW2V, "Word2Vec", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Word2Vec', 'mae': metrics_w2v['MAE'], 'mse': metrics_w2v['MSE'], 'rmse': metrics_w2v['RMSE'], 'precision': metrics_w2v['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_w2v, index=[0])

Time taken for evaluation: 749.6605615615845 seconds

Metrics for Word2Vec:
- MAE: 1.4717
- MSE: 3.3029
- RMSE: 1.8174
- Precision: 0.2275



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,4343442,988238,1.471733,3.30288,1.817383,0.227524


**b. Optimized Model**

In [20]:
# Load the saved model
with open('../02_Models/content_recommender_w2v_opt.pkl', 'rb') as file:
    recommenderW2VOptimized = pickle.load(file)

In [21]:
# Evaluate the optimized Word2Vec model
metrics_w2v_opt = evaluate_and_print_metrics(recommenderW2VOptimized, "Word2Vec (Optimized)", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Word2Vec (Optimized)', 'mae': metrics_w2v_opt['MAE'], 'mse': metrics_w2v_opt['MSE'], 'rmse': metrics_w2v_opt['RMSE'], 'precision': metrics_w2v_opt['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_w2v_opt, index=[0])

Time taken for evaluation: 126.89361047744751 seconds

Metrics for Word2Vec (Optimized):
- MAE: 1.4426
- MSE: 3.2046
- RMSE: 1.7901
- Precision: 0.2381



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,4077471,970648,1.442558,3.204612,1.790143,0.238051


### **5.1.4 Doc2Vec**

**a. Regular Model**

In [22]:
# Load the saved model
with open('../02_Models/content_recommender_D2V.pkl', 'rb') as file:
    recommenderD2V = pickle.load(file)

In [23]:
# Evaluate the Doc2Vec model
metrics_d2v = evaluate_and_print_metrics(recommenderD2V, "Doc2Vec", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Doc2Vec', 'mae': metrics_d2v['MAE'], 'mse': metrics_d2v['MSE'], 'rmse': metrics_d2v['RMSE'], 'precision': metrics_d2v['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_d2v, index=[0])

Time taken for evaluation: 780.2526478767395 seconds

Metrics for Doc2Vec:
- MAE: 1.4299
- MSE: 3.1305
- RMSE: 1.7693
- Precision: 0.2342



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,4125879,966463,1.42985,3.130451,1.769308,0.234244


**b. Optimized Model**

In [24]:
# Load the saved model
with open('../02_Models/content_recommender_D2V_opt.pkl', 'rb') as file:
    recommenderD2VOptimized = pickle.load(file)

In [25]:
# Evaluate the optimized Doc2Vec model
metrics_d2v_opt = evaluate_and_print_metrics(recommenderD2VOptimized, "Doc2Vec (Optimized)", ratings_df, users_rated_high, random_movie_ids)
metrics.append({'model': 'Doc2Vec (Optimized)', 'mae': metrics_d2v_opt['MAE'], 'mse': metrics_d2v_opt['MSE'], 'rmse': metrics_d2v_opt['RMSE'], 'precision': metrics_d2v_opt['Precision']})

# Display the results as a DataFrame
pd.DataFrame(metrics_d2v_opt, index=[0])

Time taken for evaluation: 136.86385297775269 seconds

Metrics for Doc2Vec (Optimized):
- MAE: 1.4728
- MSE: 3.3083
- RMSE: 1.8189
- Precision: 0.2286



Unnamed: 0,Predicted Likes,Actual Likes,MAE,MSE,RMSE,Precision
0,3550239,811561,1.47277,3.308277,1.818867,0.228593


---

## 5.2 Evaluation of Collaborative-Based Recommender Systems

In [26]:
# Start tracking time
start_time = time.time()

# Create a reader with a rating scale from 1 to 5
reader = Reader(rating_scale=(1, 5))

# Load the ratings data into a Surprise Dataset format
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Split the dataset into training and testing sets using a 75/25 ratio
trainset, testset = train_test_split(data, test_size=.25)

# Function to load the SVD model
def load_model(model_path):
    return joblib.load(model_path)

# Load the pre-trained SVD model
svd = load_model('../02_Models/svd_model.pkl')

# Use SVD model to make predictions on testing set
predicts = svd.test(testset)

# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 387.98 seconds


In [27]:
# Singular Value Decomposition (SVD)
# The number of components (n_factors) was selected in an experiment that compared three options: 50, 150, and 250.
# 50 components gave the best RMSE. Because the experiment takes a long time to run,
# its results are stored in a separate notebook under 02_code/05_icbf.ipynb
svd = SVD(n_factors=50)
svd.fit(trainset)
predicts = svd.test(testset)

In [28]:
def precision_recall_at_k(predictions, k=10, threshold=4.0):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls
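
To make the precision and recall logic concrete, the sketch below applies the same rules to a single user with four made-up predictions. The helper `precision_recall_single_user` and the toy values are assumptions for illustration; it mirrors the per-user step of the function above rather than replacing it.

```python
def precision_recall_single_user(user_ratings, k=10, threshold=4.0):
    """Precision@k and recall@k for one user's (estimated, true) rating pairs."""
    # Rank the user's items by estimated rating, highest first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    # Relevant items: true rating at or above the threshold (over all items)
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended items: estimated rating at or above the threshold (top k only)
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    # Items in the top k that are both recommended and relevant
    n_rel_and_rec_k = sum(
        est >= threshold and true_r >= threshold for est, true_r in top_k
    )

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
    recall = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precision, recall

# Made-up (estimated, true) pairs for one user
toy_user_ratings = [(4.8, 5.0), (4.2, 3.0), (3.9, 4.5), (3.0, 2.0)]
toy_precision, toy_recall = precision_recall_single_user(toy_user_ratings, k=2)
print(toy_precision, toy_recall)  # 0.5 0.5
```

With k=2, both top items are recommended but only one is relevant, so precision is 1/2. The user has two relevant items overall and one of them appears in the top 2, so recall is also 1/2.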

In [32]:
# Evaluate the SVD model
precisions, recalls = precision_recall_at_k(predicts, k=10, threshold=4)
mse = accuracy.mse(predicts, verbose=False)
rmse = accuracy.rmse(predicts, verbose=False)
mae = accuracy.mae(predicts, verbose=False)

# Display the results
# Note: the value stored under the 'precision' key for SVD is the mean recall@10, not precision@10
mean_recall = np.mean(list(recalls.values()))
metrics_svd = {'Collab Filtering (SVD)': {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'RECALL': mean_recall}}
metrics.append({'model': 'Collab Filtering (SVD)', 'mae': mae, 'mse': mse, 'rmse': rmse, 'precision': mean_recall})
metrics_svd = pd.DataFrame.from_dict(metrics_svd, orient='index')
metrics_svd

Unnamed: 0,MAE,MSE,RMSE,RECALL
Collab Filtering (SVD),0.590705,0.611554,0.782019,0.352268


---

## 5.3 Evaluation of Hybrid Recommender Systems

### **5.3.1 Importing and applying model**

**Load content-based recommender model (Word2Vec)**

In [None]:
# Load the saved model
with open('../02_Models/content_recommender_w2v.pkl', 'rb') as file:
    recommenderW2V = pickle.load(file)

**Load item-based collaborative-filtering recommender model (SVD)**

In [None]:
# Start tracking time
start_time = time.time()

# Create a reader with a rating scale from 1 to 5
reader = Reader(rating_scale=(1, 5))

# Load the ratings data into a Surprise Dataset format
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Build a full trainset for the SVD model
full_trainset = data.build_full_trainset()

# Function to load the SVD model
def load_model(model_path):
    return joblib.load(model_path)

# Load the pre-trained SVD model
svd = load_model('../02_Models/svd_model.pkl')

# Fit the SVD model on the full trainset
svd.fit(full_trainset)

# Calculate and print the elapsed time
elapsed_time = time.time() - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

---

## 5.4 Results

In [41]:
# Convert the list of dictionaries into a DataFrame
metrics_df = pd.DataFrame(metrics)

In [43]:
# Display the DataFrame
metrics_df

Unnamed: 0,model,mae,mse,rmse,precision
0,TF-IDF Vectorizer,1.424663,3.162091,1.778227,0.24781
1,Count Vectorizer,1.498283,3.400338,1.844,0.221084
2,Count Vectorizer (Optimized),1.799564,4.180955,2.044738,0.082462
3,Word2Vec,1.471733,3.30288,1.817383,0.227524
4,Word2Vec (Optimized),1.442558,3.204612,1.790143,0.238051
5,Doc2Vec,1.42985,3.130451,1.769308,0.234244
6,Doc2Vec (Optimized),1.47277,3.308277,1.818867,0.228593
7,Collab Filtering (SVD),0.590705,0.611554,0.782019,0.352268


---

## 5.5 Limitations

One limitation of comparing content-based recommenders with the ratings-based (collaborative-filtering) recommender in terms of metrics such as MAE, MSE, RMSE, and Precision is that the models are constructed to address different prediction tasks. Ratings-based systems predict user-item ratings based on past user behavior, while content-based systems recommend items based on their similarity to items the user has liked in the past. The MAE, for example, focuses on the accuracy of predicted ratings, which is not the primary goal of content-based recommendations.

In our comparison, the ratings-based model leads to ratings that are 24.8% higher than the average rating score across all movies, while the content-based model outperforms the average rating by only 1.2% (both at a 95% confidence level). A hybrid model of the two may therefore lead to the best recommendations, as it combines user behavior with thematic interest.
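
One common way to combine the two signals is a weighted sum of a content-similarity score and a normalized predicted rating. The sketch below illustrates that idea under stated assumptions: the function `hybrid_scores`, the weight `alpha`, and the toy inputs are hypothetical and do not describe the hybrid model built in this project.

```python
def hybrid_scores(content_sim, predicted_ratings, alpha=0.5, rating_scale=(1.0, 5.0)):
    """Rank movies by a weighted mix of content similarity and predicted rating.

    content_sim: dict of movieId -> similarity in [0, 1] (content-based signal)
    predicted_ratings: dict of movieId -> predicted rating (collaborative signal)
    alpha: weight of the content signal; (1 - alpha) goes to the rating signal
    """
    low, high = rating_scale
    scores = {}
    for movie_id, similarity in content_sim.items():
        if movie_id not in predicted_ratings:
            continue  # skip movies without a collaborative prediction
        # Bring the predicted rating onto the same [0, 1] range as the similarity
        normalized_rating = (predicted_ratings[movie_id] - low) / (high - low)
        scores[movie_id] = alpha * similarity + (1 - alpha) * normalized_rating
    # Highest combined score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for three movies
toy_similarity = {1: 0.9, 2: 0.5, 3: 0.7}
toy_predictions = {1: 3.0, 2: 5.0, 3: 4.0}
ranked = hybrid_scores(toy_similarity, toy_predictions, alpha=0.5)
print([movie_id for movie_id, _ in ranked])  # [2, 3, 1]
```

In this toy case, movie 2 ranks first despite the lowest content similarity because its predicted rating is the highest; shifting `alpha` toward 1 would favor thematic similarity instead.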

---

## 5.6 Next Steps

As stated above, the two types of recommender systems fulfill different purposes and are therefore difficult to compare using the above metrics. We therefore suggest that Streamify run an A/B test on its platform to benchmark the following three recommenders:
* TF-IDF Vectorizer (the best-performing content-based recommender)
* Collaborative Filtering using SVD (the best-performing recommender overall)
* The Hybrid model of both recommenders

In doing so, the platform will be able to pick the best recommender for long-term implementation. Implementation examples for the recommenders can be found in the three Showcase Notebooks (07).

---

# **Sources**

    Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109-132.

    Cochran, W.G. (1977). Sampling Techniques (3rd ed.). New York: John Wiley & Sons.

    Cremonesi, P., Koren, Y., & Turrin, R. (2010). Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems (pp. 39-46).

    Hu, N., Zhang, J., & Pavlou, P. A. (2009). Overcoming the J-shaped distribution of product reviews. Communications of the ACM, 52(10), 144-147.