## Initial Setup and Data Loading

In this section, we import the required libraries and load the Yelp business and review datasets from CSV files so that they are ready for analysis.


In [2]:
# Structured Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from tqdm.notebook import tqdm 


business = pd.read_csv('business.csv')
review = pd.read_csv('review.csv')



## Dataset Splitting and Preparation

This section covers the shuffling and splitting of our dataset into training and testing sets. We also detail the process of preparing these sets for the recommendation system's model training phase.



In [3]:
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, GridSearchCV, KFold
import numpy as np
import pandas as pd

# Deduplicate reviews to ensure each user-business pair is unique
unique_reviews = review.drop_duplicates(['user_id', 'business_id'], keep='first').reset_index(drop=True)

# Set a random seed for reproducible results
np.random.seed(42)

# Shuffle and split the dataset: 90% for training and 10% for testing
shuffled_indices = np.random.permutation(unique_reviews.index)
split_idx = int(0.9 * len(shuffled_indices))
train_indices, test_indices = shuffled_indices[:split_idx], shuffled_indices[split_idx:]

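# Select the relevant columns, convert 'stars' to numeric, and drop rows whose rating is missing or invalid.
# Note: dropna() has to be applied to the DataFrame; assigning a filtered Series back to a column
# realigns on the index and leaves the NaN rows in place.
def prepare_dataset(df, indices):
    subset = df.iloc[indices][['user_id', 'business_id', 'stars']].copy()
    subset['stars'] = pd.to_numeric(subset['stars'], errors='coerce')
    return subset.dropna(subset=['stars'])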

# Prepare the training and testing datasets
trainset = prepare_dataset(unique_reviews, train_indices)
testset = prepare_dataset(unique_reviews, test_indices)

# Initialize a Reader with the rating scale and load the training dataset
reader = Reader(rating_scale=(1, 5))
data_train = Dataset.load_from_df(trainset, reader)
training = data_train.build_full_trainset()

# Prepare the testing dataset as a list of (user, item, rating) tuples
testing = [(uid, iid, float(r)) for uid, iid, r in testset.itertuples(index=False, name=None)]
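
The cleaning step relies on a pandas detail that is easy to miss: assigning a Series back to a column aligns on the index, so calling `.dropna()` on the Series alone does not remove any rows. The toy frame below is purely illustrative (its values are made up) and is meant only as a sketch of that behavior.

```python
import pandas as pd

# Hypothetical toy data: one rating cannot be parsed as a number.
toy = pd.DataFrame({"stars": ["5", "bad", "3"]})

# Assigning the filtered Series back realigns on the index,
# so the unparseable row is kept and simply holds NaN.
kept = toy.copy()
kept["stars"] = pd.to_numeric(kept["stars"], errors="coerce").dropna()

# Dropping on the DataFrame itself actually removes the row.
dropped = toy.copy()
dropped["stars"] = pd.to_numeric(dropped["stars"], errors="coerce")
dropped = dropped.dropna(subset=["stars"])

print(len(kept), int(kept["stars"].isna().sum()))
print(len(dropped), int(dropped["stars"].isna().sum()))
```

This matters here because Surprise expects every rating passed to it to be a valid number.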


## Model Training and Evaluation

We train an item-based KNNWithMeans model on the training set and evaluate it with RMSE on the held-out test set. This gives a baseline for judging the effectiveness of our recommendation system.
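
To make the idea behind item-based KNNWithMeans more concrete, here is a minimal pure-Python sketch of a mean-centered neighborhood prediction: the target item's mean rating plus a similarity-weighted average of how far the user's other ratings deviate from their own item means. The toy ratings and the helper names (`item_mean`, `cosine_sim`, `predict`) are assumptions made for illustration. The sketch leaves out details a real library handles, such as a minimum neighbor count and clipping to the rating scale, so it should not be read as Surprise's implementation.

```python
import math

# Hypothetical toy ratings: {user: {item: rating}}.
ratings = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0},
    "u3": {"a": 1.0, "b": 5.0},
}

def item_mean(item):
    """Mean rating of an item over all users who rated it."""
    values = [r[item] for r in ratings.values() if item in r]
    return sum(values) / len(values)

def cosine_sim(i, j):
    """Cosine similarity between two items over their common users."""
    common = [u for u, r in ratings.items() if i in r and j in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    norm_i = math.sqrt(sum(ratings[u][i] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict(user, item, k=2):
    """Mean-centered prediction from the k most similar items the user rated."""
    neighbors = [
        (cosine_sim(item, j), r - item_mean(j))
        for j, r in ratings[user].items()
        if j != item
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    weight_sum = sum(sim for sim, _ in neighbors)
    if weight_sum == 0:
        return item_mean(item)  # Fall back to the item mean.
    weighted_dev = sum(sim * dev for sim, dev in neighbors)
    return item_mean(item) + weighted_dev / weight_sum

prediction = predict("u3", "c")
print(round(prediction, 3))
```

In this toy case the user rated one similar item well below its mean, and that deviation outweighs the above-mean rating on the other neighbor, so the prediction lands slightly below the mean of item "c".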



In [10]:
from surprise import KNNWithMeans, accuracy

# Configure and train an item-based KNNWithMeans model.
# Predictions are mean-centered: each neighbor's rating is taken relative to that item's mean rating.
knn = KNNWithMeans(
    k=80,
    sim_options={
        "name": "cosine",
        "user_based": False,  # Item-based similarities
    },
)
knn.fit(training)

# Evaluating the model's performance by predicting ratings on the test set and calculating the RMSE
predictions_biased = knn.test(testing)
rmse_biased = accuracy.rmse(predictions_biased)

print(f"RMSE (with bias): {rmse_biased}")


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3908
RMSE (with bias): 1.390791127171579
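
For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on a few made-up `(uid, iid, true_r, est)` tuples; the helper name `rmse_by_hand` and the toy values are assumptions for illustration only.

```python
import math

def rmse_by_hand(predictions):
    """Root mean squared error over (uid, iid, true_r, est) tuples."""
    squared_errors = [(true_r - est) ** 2 for _, _, true_r, est in predictions]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions: errors of 1 and 2 stars.
toy_predictions = [("u1", "a", 4.0, 3.0), ("u2", "b", 2.0, 4.0)]
toy_rmse = rmse_by_hand(toy_predictions)
print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.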


## Optimization and Hyperparameter Tuning

We tune the KNNWithMeans model with a cross-validated grid search (GridSearchCV) to improve accuracy. The grid varies the neighborhood size k and the similarity metric (cosine or Pearson) while keeping the model item-based.
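
To get a rough sense of the search cost, the grid used in this section can be expanded by hand. The sketch below uses only `itertools` and mirrors the grid defined in the next cell; the variable names `sim_combos`, `combos`, and `n_folds` are illustrative.

```python
from itertools import product

param_grid = {
    "k": [10, 20, 40, 80],
    "sim_options": {
        "name": ["cosine", "pearson"],
        "user_based": [False],
    },
}

# Expand the nested sim_options dict into one dict per combination.
sim_keys = list(param_grid["sim_options"])
sim_combos = [
    dict(zip(sim_keys, values))
    for values in product(*param_grid["sim_options"].values())
]

# Cross every k with every sim_options combination.
combos = [
    {"k": k, "sim_options": sim}
    for k, sim in product(param_grid["k"], sim_combos)
]

n_folds = 3
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

Each fit computes a full item-item similarity matrix, which is why the log output of the search is dominated by similarity-matrix messages.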



In [11]:
from surprise import KNNWithMeans, accuracy
from surprise.model_selection import GridSearchCV, KFold

# Tune the item-based KNNWithMeans model with scikit-surprise's GridSearchCV,
# searching over the neighborhood size k and the similarity metric.

# Defining the parameter grid for KNNWithMeans hyperparameter tuning.
param_grid = {
    'k': [10, 20, 40, 80],  # Different values for k
    'sim_options': {
        'name': ['cosine', 'pearson'],  # Different similarity metrics
        'user_based': [False]  # Item-based 
    }
}

knn_gs = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse', 'mae'], cv=KFold(3, random_state=42), joblib_verbose=2)

# Fitting GridSearchCV to the training data to find the best KNNWithMeans model parameters.
knn_gs.fit(data_train)

# Displaying the best RMSE score achieved during the optimization process.
print("Best RMSE score from grid search:", knn_gs.best_score['rmse'])

# Showing the combination of parameters that achieved the best RMSE score.
print("Best parameter combination:", knn_gs.best_params['rmse'])

# Retrieving the best KNNWithMeans model from the grid search and fitting it to the training data.
knn_gs_best = knn_gs.best_estimator['rmse']
knn_gs_best.fit(training)

# Making predictions on the test set with the optimized model and evaluating the RMSE.
pred_knn_gs_best = knn_gs_best.test(testing)
rmse_optimized = accuracy.rmse(pred_knn_gs_best)

print(f"RMSE with optimized parameters: {rmse_optimized}")


Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Comput

In [13]:
print("Best parameter combination:", knn_gs.best_params['rmse'])

Best parameter combination: {'k': 80, 'sim_options': {'name': 'cosine', 'user_based': False}}


## Normalized Discounted Cumulative Gain (NDCG) Evaluation

This section evaluates our recommendation system with Normalized Discounted Cumulative Gain (NDCG), a standard ranking metric in information retrieval and recommendation systems. NDCG measures how well the predicted ordering of items matches the user's actual preferences, giving more weight to items near the top of the list. A score of 1 means the predicted ranking matches the ideal ranking.
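
As a sketch of what the metric computes, the snippet below implements NDCG@k by hand using one common convention: linear gain (the true rating itself) with a log2 position discount. The helper names `dcg` and `ndcg_at_k` and the toy ratings are assumptions for illustration, and library implementations may differ in details such as tie handling.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance divided by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg_at_k(true_ratings, predicted_scores, k):
    """DCG of the predicted top-k ordering divided by DCG of the ideal ordering."""
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked = [true_ratings[i] for i in order[:k]]
    ideal = sorted(true_ratings, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical user with three rated items.
true_ratings = [3.0, 2.0, 1.0]
perfect = ndcg_at_k(true_ratings, [0.9, 0.8, 0.7], k=3)   # same order as the truth
reversed_ = ndcg_at_k(true_ratings, [0.7, 0.8, 0.9], k=3)  # worst possible order
print(perfect, round(reversed_, 2))
```

Even the fully reversed ordering scores well above zero here. With ratings that are all positive and only a handful of items per user, NDCG values tend to be high, which is worth keeping in mind when interpreting an average score.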


In [15]:
import numpy as np
from collections import defaultdict
from surprise import Dataset
from sklearn.metrics import ndcg_score

# Predict ratings on the test set with the tuned model (knn_gs_best, fitted in the previous section).
predictions = knn_gs_best.test(testing)

# Organizing the predictions for each user into a dictionary for easier manipulation.
user_true = defaultdict(list)
user_pred = defaultdict(list)
for uid, _, true_r, est, _ in predictions:
    user_true[uid].append(true_r)
    user_pred[uid].append(est)

k = 10
average_ndcg_scores = []

for uid in user_true.keys():
    # Take the k items with the highest predicted ratings, keeping each item's
    # true rating aligned with its prediction. Note that the ideal ranking used by
    # ndcg_score below is therefore built from these same k items only.
    temp_true = [true_r for _, true_r in sorted(zip(user_pred[uid], user_true[uid]), reverse=True)[:k]]
    temp_pred = sorted(user_pred[uid], reverse=True)[:k]
    
    # Skip users with fewer than k rated items in the test set.
    if len(temp_true) < k or len(temp_pred) < k:
        continue
    
    # Calculate NDCG for the current user and append to the list of scores.
    user_ndcg_score = ndcg_score([temp_true], [temp_pred])
    average_ndcg_scores.append(user_ndcg_score)

# Calculate the average NDCG across all users.
if average_ndcg_scores:
    avg_ndcg = np.mean(average_ndcg_scores)
    print(f"Average NDCG@{k}: {avg_ndcg}")
else:
    print("Not enough data to calculate average NDCG.")

Average NDCG@10: 0.9605759073505854
