## Initial Setup and Data Loading

In this section, we import necessary libraries and load the Yelp dataset for businesses and reviews. This includes setting up visualization preferences for consistent and appealing plots, and ensuring our data is ready for analysis.


In [1]:
# Structured Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from tqdm.notebook import tqdm 


business = pd.read_csv('business.csv')
review = pd.read_csv('review.csv')



## Dataset Splitting and Preparation

This section covers the shuffling and splitting of our dataset into training and testing sets. We also detail the process of preparing these sets for the recommendation system's model training phase.



In [2]:
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV, KFold
import numpy as np
import pandas as pd

# Deduplicate reviews to ensure each user-business pair is unique
unique_reviews = review.drop_duplicates(['user_id', 'business_id'], keep='first').reset_index(drop=True)

# Set a random seed for reproducible results
np.random.seed(42)

# Shuffle and split the dataset: 90% for training and 10% for testing
shuffled_indices = np.random.permutation(unique_reviews.index)
split_idx = int(0.9 * len(shuffled_indices))
train_indices, test_indices = shuffled_indices[:split_idx], shuffled_indices[split_idx:]

# Function to select relevant columns and convert 'stars' to numeric, removing any nulls
def prepare_dataset(df, indices):
    subset = df.iloc[indices][['user_id', 'business_id', 'stars']]
    subset['stars'] = pd.to_numeric(subset['stars'], errors='coerce').dropna()
    return subset

# Prepare the training and testing datasets
trainset = prepare_dataset(unique_reviews, train_indices)
testset = prepare_dataset(unique_reviews, test_indices)

# Initialize a Reader with the rating scale and load the training dataset
reader = Reader(rating_scale=(1, 5))
data_train = Dataset.load_from_df(trainset, reader)
training = data_train.build_full_trainset()

# Prepare the testing dataset as a list of (user, item, rating) tuples
testing = [(uid, iid, float(r)) for uid, iid, r in testset.itertuples(index=False, name=None)]


## Model Training and Evaluation

We train our model using the Singular Value Decomposition (SVD) algorithm, evaluate its performance through RMSE, and extract insights from the latent feature matrices. This step is crucial for understanding the effectiveness of our recommendation system.



In [3]:
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

# The unique_reviews DataFrame contains deduplicated reviews with columns: 'user_id', 'business_id', and 'stars'
# Initializing the dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(unique_reviews[['user_id', 'business_id', 'stars']], reader)

# Configuring and training the SVD model incorporating user and item biases by default
svd_biased = SVD(n_factors=10, n_epochs=20, biased=True, random_state=42)
svd_biased.fit(training)

# Evaluating the model's performance by predicting ratings on the test set and calculating the RMSE
predictions_biased = svd_biased.test(testing)
rmse_biased = accuracy.rmse(predictions_biased)

print(f"RMSE (with bias): {rmse_biased}")


RMSE: 1.2719
RMSE (with bias): 1.271863541367604


## Optimization and Hyperparameter Tuning

Optimizing the SVD model's parameters through cross-validation and GridSearch to achieve better accuracy. This involves adjusting factors like the number of epochs, regularization terms, and the inclusion of bias terms.



In [7]:
from surprise import Dataset, Reader
from surprise import NMF, SVD
from surprise.model_selection import cross_validate, GridSearchCV, KFold
from surprise import accuracy

# Conducting parameter optimization for the SVD algorithm with bias using scikit-surprise's GridSearchCV.
# This step is crucial for enhancing model accuracy by finding the optimal set of hyperparameters.

# Defining the parameter grid for SVD hyperparameter tuning.
param_grid = {
    'n_factors': [10, 20, 50],  # Number of factors
    'n_epochs': [10, 20, 40],  # Number of iterations of the SGD procedure
    'reg_all': [0.06, 0.08, 0.1],  # Regularization term for all parameters
    'biased': [True]  # Use the baseline estimates in the algorithm
}

# Initializing GridSearchCV with the SVD algorithm, specified parameter grid, and cross-validation settings.
svd_gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=KFold(3, random_state=42), joblib_verbose=2)

# Fitting GridSearchCV to the training data to find the best SVD model parameters.
svd_gs.fit(data_train)

# Displaying the best RMSE score achieved during the optimization process.
print("Best RMSE score from grid search:", svd_gs.best_score['rmse'])

# Showing the combination of parameters that achieved the best RMSE score.
print("Best parameter combination:", svd_gs.best_params['rmse'])

# Retrieving the best SVD model from the grid search and fitting it to the training data.
svd_gs_best = svd_gs.best_estimator['rmse']
svd_gs_best.fit(training)

# Making predictions on the test set with the optimized model and evaluating the RMSE.
pred_svd_gs_best = svd_gs_best.test(testing)
rmse_optimized = accuracy.rmse(pred_svd_gs_best)

print(f"RMSE with optimized parameters: {rmse_optimized}")


[Parallel(n_jobs=1)]: Done  40 tasks      | elapsed:   43.6s


Best RMSE score from grid search: 1.2773447062068526
Best parameter combination: {'n_factors': 10, 'n_epochs': 40, 'reg_all': 0.1, 'biased': True}
RMSE: 1.2678
RMSE with optimized parameters: 1.2677970713264346


## Normalized Discounted Cumulative Gain (NDCG) Evaluation

This section focuses on evaluating our recommendation system using the Normalized Discounted Cumulative Gain (NDCG) metric. NDCG is a standard measure in information retrieval and recommendation systems to quantify the effectiveness of our ranking algorithms. By computing NDCG, we can assess how well our system ranks items in a way that aligns with the user's preferences, with particular emphasis on the importance of the order in which items are presented.


In [14]:
import numpy as np
from collections import defaultdict
from surprise import SVD, Dataset

# Applying the optimized SVD model, `svd_biased`, to generate predictions for the test dataset.
predictions = svd_gs_best.test(testing)

# Organizing the predictions for each user into a dictionary for easier manipulation.
user_pred = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
    user_pred[uid].append((iid, est))

# Defining functions to calculate Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG).
def dcg_at_k(ranked_list, k):
    """Calculate Discounted Cumulative Gain at rank k."""
    ranked_list = np.asfarray(ranked_list)[:k]
    if ranked_list.size:
        return np.sum(ranked_list / np.log2(np.arange(2, ranked_list.size + 2)))
    return 0.0

def ndcg_at_k(ranked_list, k):
    """Calculate Normalized Discounted Cumulative Gain at rank k."""
    dcg_max = dcg_at_k(sorted(ranked_list, reverse=True), k)
    if not dcg_max:
        return 0.
    return dcg_at_k(ranked_list, k) / dcg_max

# Setting the rank threshold to evaluate the top 10 recommended items.
k = 10
ndcg_scores = []

# Computing NDCG for each user based on their predicted ratings.
for uid, user_ratings in user_pred.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)  # Sort ratings in descending order.
    est_rank = [rating for iid, rating in user_ratings]  # Extract estimated ranks.
    ndcg = ndcg_at_k(est_rank, k)  # Compute NDCG for the current user.
    ndcg_scores.append(ndcg)

# Calculating the average NDCG across all users to measure overall system performance.
avg_ndcg = np.mean(ndcg_scores)
print(f"Average NDCG@{k}: {avg_ndcg}")




Average NDCG@10: 1.0
[Prediction(uid='z7Z4yaqrM62_4U2JURPmGA', iid='f5S8fr9DruZNwSev1gyFWQ', r_ui=4.0, est=2.651792640718827, details={'was_impossible': False}), Prediction(uid='FIiCVRKPdo4zZrRC4pMqDA', iid='C-oyGb6TZDP0IVd1NQNnoQ', r_ui=1.0, est=4.3305309239655525, details={'was_impossible': False}), Prediction(uid='9qy1rroAjPU9-rlIjfY6Dw', iid='J6ZSgoKCLBDMMzuhM-aYGg', r_ui=3.0, est=2.8779570895559505, details={'was_impossible': False}), Prediction(uid='fPCojc7hjYk1iKI4Y09dMQ', iid='GRSS2xFmaiX6hw7QReoM_g', r_ui=1.0, est=3.8881197553743094, details={'was_impossible': False}), Prediction(uid='ZGnUQTLwVZTzSLGu6vR4nQ', iid='9wPX6OoUifn9qfKGwrX0xw', r_ui=5.0, est=4.647771343791642, details={'was_impossible': False}), Prediction(uid='qmffS8iufgmzkx9KmHGEHQ', iid='6w96uZv8Q9FeBcwd7h-TAA', r_ui=1.0, est=2.1716151031686715, details={'was_impossible': False}), Prediction(uid='V-yKMPsFwkKzby5jxNXSnw', iid='ClEUgRru8zSwV4gB32bHFQ', r_ui=4.0, est=4.254876163616355, details={'was_impossible': Fal