# Neural Collaborative Filtering

- Using a neural network to learn the user-item interaction function

This notebook is a TensorFlow implementation of Neural Collaborative Filtering (NCF), as described in [He et al. (2017)](https://arxiv.org/pdf/1708.05031.pdf).

## Summary

NCF uses neural networks to model the interactions between users and items. It replaces the inner product used in standard matrix factorization (MF) with a neural architecture that can learn an arbitrary interaction function from data, which lets NCF express and generalize MF within its framework. Here, a multi-layer perceptron learns the user-item interaction function, and the learned function is then used to predict the corresponding rating.
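
To make the contrast concrete, here is a minimal NumPy sketch (not the model trained in this notebook) that scores one user-item pair twice: once with the fixed inner product of matrix factorization, and once with a small, untrained MLP on the concatenated vectors. The embedding size, the hidden width, and all weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                  # embedding size (illustrative assumption)
user_vec = rng.normal(size=dim)          # latent vector for one user
item_vec = rng.normal(size=dim)          # latent vector for one item

# Matrix factorization: the interaction function is a fixed inner product
mf_score = float(user_vec @ item_vec)

# NCF-style: an MLP on the concatenated vectors replaces the inner product.
# The weights below are random placeholders; in NCF they are learned from data.
W1 = rng.normal(size=(2 * dim, 8))       # hidden-layer weights
b1 = np.zeros(8)                         # hidden-layer bias
w2 = rng.normal(size=8)                  # output-layer weights
hidden = np.maximum(0.0, np.concatenate([user_vec, item_vec]) @ W1 + b1)  # ReLU
mlp_score = float(hidden @ w2)

print(mf_score, mlp_score)
```

With trained weights, the MLP can represent interaction patterns that a plain inner product cannot, which is the motivation given in the paper.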


## Model 1: Ratings Only

The steps are as follows:

1. Read in Original Data
2. Remove some ratings to create the test set
3. With remaining ratings, create training set
4. Preprocess the data (melt the data, create user and item indices, normalize the ratings)
5. Create neural network model (NCF)
6. Train the model
7. Hyperparameter tuning
8. Evaluate the model on the test set
9. Gather all ratings prediction metrics (MAE, MSE, RMSE)
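
As a small illustration of the normalization in step 4, the sketch below maps ratings from the 1-5 scale to [0, 1] and back, matching the `(rating - 1) / 4` and `prediction * 4 + 1` transformations used later in this notebook. The helper names `scale_rating` and `unscale_rating` are hypothetical and only serve the example.

```python
def scale_rating(rating, low=1.0, high=5.0):
    """Map a rating from [low, high] to [0, 1]."""
    return (rating - low) / (high - low)


def unscale_rating(value, low=1.0, high=5.0):
    """Map a value from [0, 1] back to [low, high]."""
    return value * (high - low) + low


# Round trip over every possible star rating
for r in [1, 2, 3, 4, 5]:
    assert unscale_rating(scale_rating(r)) == r

print([scale_rating(r) for r in [1, 2, 3, 4, 5]])
```

Training on the scaled target and reporting metrics on the original scale keeps MAE and RMSE interpretable in rating points.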

In [None]:
# reset space
%reset -f

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import tensorflow as tf
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import warnings
warnings.filterwarnings('ignore')

# tensorflow libraries load
from tensorflow.keras.layers import Embedding, Input, Flatten, Concatenate, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2


In [None]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_modelling.csv')
display(amz_data.head())

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Min number of ratings per user: ', amz_data['reviewerID'].value_counts().min())
print('Max number of ratings per user: ', amz_data['reviewerID'].value_counts().max())
print('Min number of ratings per product: ', amz_data['asin'].value_counts().min())
print('Max number of ratings per product: ', amz_data['asin'].value_counts().max())



# Creating User Item Matrix =====================================================
# create user-item matrix
data = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
print("\n\nUser-Item Matrix")
display(data.head())


### Creating Train and Test Sets
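
The split used here hides a fixed number of ratings per user and keeps the rest for training. The sketch below shows the same idea on a tiny made-up dictionary of ratings, using only the standard library. The function name `hide_n_per_user`, the toy data, and the rule for users with too few ratings are assumptions for illustration, not part of the notebook's pipeline.

```python
import random


def hide_n_per_user(ratings, n_hidden, seed=2207):
    """Split {user: {item: rating}} into train/test by hiding n_hidden ratings per user."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in ratings.items():
        if len(items) <= n_hidden:
            # Assumption: users with too few ratings keep everything in training
            train[user] = dict(items)
            test[user] = {}
            continue
        hidden = set(rng.sample(sorted(items), n_hidden))
        train[user] = {i: r for i, r in items.items() if i not in hidden}
        test[user] = {i: r for i, r in items.items() if i in hidden}
    return train, test


# Toy ratings, invented for this example
toy = {
    "u1": {"a": 5, "b": 3, "c": 4, "d": 1},
    "u2": {"a": 2, "c": 5, "e": 4, "f": 3, "g": 1},
}
train, test = hide_n_per_user(toy, n_hidden=3)
print(train)
print(test)
```

Every user keeps at least one training rating in this sketch, so the model always has some signal for the users it is tested on.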

In [None]:
# DATA PREP ====================================

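# create a copy of the original matrix to store hidden ratings
# (cast to object so the string marker 'Hidden' can be stored next to the numeric ratings)
x_hidden = data.copy().astype(object)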
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    x_hidden.iloc[user_id, hidden_indices] = 'Hidden'

# get indices of hidden ratings
test_data = x_hidden.copy()
test_data = test_data.reset_index()
test_data = test_data.melt(id_vars=test_data.columns[0], var_name='book', value_name='rating')
test_data.columns = ['user', 'product', 'rating']
indices_hidden = test_data[test_data['rating'] == 'Hidden'].index

# Melt the DataFrame into a format where each row is a user-item interaction
data_hidden = x_hidden.reset_index()
data_hidden = data_hidden.melt(id_vars=data_hidden.columns[0], var_name='product', value_name='rating')

# change rows with hidden ratings to NaN
data_hidden.iloc[indices_hidden, 2] = np.nan

# rename columns
data_hidden.columns = ['user', 'product', 'rating']

# Filter out the rows where rating is NaN
data_hidden = data_hidden[data_hidden['rating'].notna()]

# Convert user and item to categorical
data_hidden['user'] = data_hidden['user'].astype('category')
data_hidden['product'] = data_hidden['product'].astype('category')

# see what the data looks like
display(data_hidden.head(4))
print("Data is in format: user, product, rating.\nIt is ready to be partitioned into training and testing sets.")

In [None]:
# # validation data (take 2 more random ratings)
# x_validation = data_hidden.copy()
# indices_tracker_val = []

# # number of products to hide for each user
# N = 2

# # identifies rated items and randomly selects N products to hide ratings for each user
# np.random.seed(2207)  # You can use any integer value as the seed
# for user_id in range(x_validation.shape[0]):
#     rated_products = np.where(x_validation.iloc[user_id, :] > 0)[0]
#     hidden_indices = np.random.choice(rated_products, N, replace=False)
#     indices_tracker_val.append(hidden_indices)
#     x_validation.iloc[user_id, hidden_indices] = 'Hidden'

# # get indices of hidden ratings
# val_data = x_validation.copy()
# val_data = val_data.reset_index()
# val_data = val_data.melt(id_vars=val_data.columns[0], var_name='book', value_name='rating')
# val_data.columns = ['user', 'product', 'rating']
# indices_hidden_val = val_data[val_data['rating'] == 'Hidden'].index

# # Melt the DataFrame into a format where each row is a user-item interaction
# data_hidden_val = x_validation.reset_index()
# data_hidden_val = data_hidden_val.melt(id_vars=data_hidden_val.columns[0], var_name='product', value_name='rating')

In [None]:
# TEST AND TRAIN DATA ====================================

# Prepare the data - training
train_x = data_hidden[['user', 'product']].apply(lambda x: x.cat.codes)
train_y = data_hidden['rating'].astype(np.float64)
train_y = (train_y - 1) / 4

# Prepare the data - testing
copy = data.copy()
copy = copy.reset_index()
copy = copy.melt(id_vars=copy.columns[0], var_name='product', value_name='rating')
copy.columns = ['user', 'product', 'rating']
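test_x = copy.iloc[indices_hidden, 0:2].copy()
# Encode with the training categories so the codes match the indices the embeddings were trained on.
# A user or product that never appears in the training data gets code -1; drop such rows
# (from both test_x and test_y) before predicting.
test_x['user'] = pd.Categorical(test_x['user'], categories=data_hidden['user'].cat.categories)
test_x['product'] = pd.Categorical(test_x['product'], categories=data_hidden['product'].cat.categories)
test_x = test_x.apply(lambda x: x.cat.codes)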
test_y = copy.iloc[indices_hidden, 2].astype(np.float64)
test_y = (test_y - 1) / 4

# show the data
print("Training Data")
display(train_x.head(3))

print("\nTesting Data")
display(test_x.head(3))

### Creating NCF Model

Inputs:

- user_input and product_input: integer inputs holding the user and product indices.
- user_embedding and product_embedding: embedding layers that map each index to a dense 50-dimensional vector.
- user_vecs and product_vecs: the flattened embeddings, used as feature vectors.
- input_vecs: the concatenated feature vectors, which form the input to the neural network.

Neural Network Architecture:

- The network is a feedforward neural network with n_layers dense layers.
- The first layer (i == 0) has n_nodes neurons and is followed by dropout regularization.
- Each subsequent layer halves the number of neurons of the previous layer and is also followed by dropout.
- The final output layer has a single neuron that predicts the rating (a regression task).

Optimizers:

- Three optimizers are supported: Adam, SGD, and RMSprop.
- The choice of optimizer affects how the model updates its weights during training.

Loss Function:

- The mean squared error (MSE) loss is used, as this is a regression task.
- The model aims to minimize the squared difference between predicted and actual ratings.

Training:

- The model is trained using user and product indices as input features.
- The train_x DataFrame contains the user and product indices.
- The train_y array holds the corresponding normalized ratings.
- During fitting, 10% of the training data is held out as a validation set (validation_split=0.1).
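
The halving rule for the layer widths can be sketched on its own, without building a model. The helper `layer_widths` below is hypothetical. It uses integer halving with a floor of one unit, which is an assumption made here so that every width is a valid positive integer.

```python
def layer_widths(n_layers, n_nodes):
    """Widths of the dense layers: the first has n_nodes units, each later one half as many."""
    widths = []
    for i in range(n_layers):
        if i > 0:
            n_nodes = max(1, n_nodes // 2)  # integer halving, never below one unit
        widths.append(n_nodes)
    return widths


print(layer_widths(3, 1024))       # [1024, 512, 256]
print(layer_widths(12, 2048)[-1])  # 1
```

Deep settings shrink quickly: with 12 layers starting at 2048 units, the last hidden layer is a single neuron, which is worth keeping in mind when comparing architectures.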

In [None]:
# Function to train a neural network model for collaborative filtering
def train_model_1(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_reg, train_x, train_y, seed=2207, train_plot=True, callback=True):
    
    # Set random seeds (NumPy and TensorFlow) for reproducibility
    np.random.seed(seed)
    tf.random.set_seed(seed)

    # Create user and product embedding layers
    user_input = Input(shape=(1,), dtype='int32', name='user_input')
    product_input = Input(shape=(1,), dtype='int32', name='product_input')

    user_embedding = Embedding(input_dim=len(data_hidden['user'].cat.categories), output_dim=50, name='user_embedding')(user_input)
    product_embedding = Embedding(input_dim=len(data_hidden['product'].cat.categories), output_dim=50, name='product_embedding')(product_input)

    # Flatten the embedding vectors
    user_vecs = Flatten()(user_embedding)
    product_vecs = Flatten()(product_embedding)

    # Concatenate the embedding vectors
    input_vecs = Concatenate()([user_vecs, product_vecs])

    # Add dense layers: the first has n_nodes units, each later layer half as many
    x = input_vecs
    for i in range(n_layers):
        if i > 0:
            n_nodes = max(1, n_nodes // 2)  # keep the unit count a positive integer
        x = Dense(n_nodes, activation='relu', kernel_regularizer=l2(l2_reg))(x)
        x = Dropout(dropout)(x)
    y = Dense(1)(x)

    # Compile the model
    model = Model(inputs=[user_input, product_input], outputs=y)
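    if optimizer == 'adam':
        opt = Adam(learning_rate=learning_rate)
    elif optimizer == 'sgd':
        opt = SGD(learning_rate=learning_rate)
    elif optimizer == 'rmsprop':
        opt = RMSprop(learning_rate=learning_rate)
    else:
        raise ValueError(f"Unknown optimizer: {optimizer!r} (expected 'adam', 'sgd' or 'rmsprop')")
    model.compile(optimizer=opt, loss='mse')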

    # Define early stopping (only used when callback=True)
    callbacks = []
    if callback:
        callbacks.append(EarlyStopping(monitor='val_loss', patience=10, verbose=1, restore_best_weights=True))

    # Train the model, holding out 10% of the training data for validation
    history = model.fit([train_x['user'], train_x['product']], train_y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks)

    if train_plot:
        # Plot training & validation loss values
        plt.figure(figsize=(15, 8))
        plt.plot(history.history['loss'], label='Training Loss', marker='o')
        plt.plot(history.history['val_loss'], label='Validation Loss', marker='o')
        # plt.title(f'Model loss for Architecture: {optimizer} optimizer, {n_layers} layers, {n_nodes} nodes, {epochs} epochs, {learning_rate} learning rate, {batch_size} batch size')
        plt.ylabel('Loss', fontsize=40)
        plt.xlabel('Epoch', fontsize = 40)
        plt.xticks(fontsize=36)
        plt.yticks(fontsize=36)
        plt.tight_layout()
        plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_training_1.pdf")
        plt.show()
    
    return model, history

### Training NCF Model

- increasing the batch size helps (128 works well)
- increase the number of nodes
- change the optimizer
- change the learning rate

In [None]:
# # model with validation data
# model, history = train_model_1(n_layers=2, n_nodes=64, optimizer='adam', epochs=100, learning_rate=0.001, batch_size=64, dropout=0.2, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=2207, train_plot=True, callback=True, validation_data=True)

In [None]:
# Model 1 
model, history = train_model_1(n_layers=2, n_nodes=512, optimizer='adam', epochs=50, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=False, callback=True)

# Model 2
model2, history2 = train_model_1(n_layers=3, n_nodes=1024, optimizer='sgd', epochs=1000, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=False, callback=True)

# Model 3
model3, history3 = train_model_1(n_layers=3, n_nodes=1024, optimizer='rmsprop', epochs=200, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=False, callback=True)

# Model 4
model4, history4 = train_model_1(n_layers=8, n_nodes=1024, optimizer='adam', epochs=450, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=False, callback=True)

# Model 5
model5, history5 = train_model_1(n_layers=12, n_nodes=2048, optimizer='sgd', epochs=1000, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=False, callback=True)

In [None]:
# Which model had lowest validation loss?
print("Model 1 Validation Loss: ", min(history.history['val_loss']))
print("Model 2 Validation Loss: ", min(history2.history['val_loss']))
print("Model 3 Validation Loss: ", min(history3.history['val_loss']))
print("Model 4 Validation Loss: ", min(history4.history['val_loss']))
print("Model 5 Validation Loss: ", min(history5.history['val_loss']))

In [None]:
# Visualize training and validation loss for all models
plt.figure(figsize=(20, 8))

# Training Loss
plt.subplot(1, 3, 1)
plt.plot(history.history['loss'], label='Model 1', marker='o', color = 'b')
plt.plot(history2.history['loss'], label='Model 2', marker='o', color = 'g')
plt.plot(history3.history['loss'], label='Model 3', marker='o', color = 'r')
plt.plot(history4.history['loss'], label='Model 4', marker='o', color = 'y')
plt.plot(history5.history['loss'], label='Model 5', marker='o', color = 'c')
plt.title('Model Training Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Validation Loss
plt.subplot(1, 3, 2)
plt.plot(history.history['val_loss'], label='Model 1', marker='o', color = 'b')
plt.plot(history2.history['val_loss'], label='Model 2', marker='o', color = 'g')
plt.plot(history3.history['val_loss'], label='Model 3', marker='o', color = 'r')
plt.plot(history4.history['val_loss'], label='Model 4', marker='o', color = 'y')
plt.plot(history5.history['val_loss'], label='Model 5', marker='o', color = 'c')
plt.title('Model Validation Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Plot validation and training loss on same plot
plt.subplot(1, 3, 3)
plt.plot(history.history['loss'], label='Model 1 Train', marker='o', color = 'b')
plt.plot(history.history['val_loss'], label='Model 1 Validation', marker='o', color = 'b')
plt.plot(history2.history['loss'], label='Model 2 Train', marker='o', color = 'g')
plt.plot(history2.history['val_loss'], label='Model 2 Validation', marker='o', color = 'g')
plt.plot(history3.history['loss'], label='Model 3 Train', marker='o', color = 'r')
plt.plot(history3.history['val_loss'], label='Model 3 Validation', marker='o', color = 'r')
plt.plot(history4.history['loss'], label='Model 4 Train', marker='o', color = 'y')
plt.plot(history4.history['val_loss'], label='Model 4 Validation', marker='o', color = 'y')
plt.plot(history5.history['loss'], label='Model 5 Train', marker='o', color = 'c')
plt.plot(history5.history['val_loss'], label='Model 5 Validation', marker='o', color = 'c')
plt.title('Model Training and Validation Loss', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

# Print models details

### Hyperparameter Tuning / Grid Search
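
Before running an exhaustive search it helps to count how many models it implies. The sketch below builds the full grid from the value lists used in this section and then draws a small random subset. Sampling a subset is random search, a different technique from the exhaustive grid search in this notebook, shown only as a cheaper alternative. The dictionary keys and the subset size of 20 are assumptions.

```python
import itertools
import random

# Hyperparameter values as listed in this section; key names are illustrative
grid = {
    "n_layers": [1, 2, 3, 6, 8],
    "n_nodes": [128, 256, 512, 1024],
    "optimizer": ["adam", "sgd"],
    "epochs": [50, 150, 300],
    "learning_rate": [0.001, 0.01, 0.0001],
    "batch_size": [32, 64, 128],
    "dropout": [0, 0.01, 0.05, 0.08],
    "l2_reg": [0.01, 0.001, 0.0001],
}

# Every combination as a dict of named settings
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(combos))  # 12960

# Random search: evaluate only a reproducible random subset of the grid
rng = random.Random(10)
subset = rng.sample(combos, 20)
print(subset[0])
```

At 12,960 combinations, each requiring a full training run, the exhaustive search is expensive, so a subset like this can be a practical first pass.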

In [None]:
import itertools

# Grid Search Parameters
n_layers = [1,2,3,6,8] 
n_nodes = [128,256,512,1024] 
optimizer = ['adam', 'sgd']
epochs = [50,150,300] 
learning_rate = [0.001, 0.01,  0.0001] 
batch_size = [32,64,128] 
dropout = [0, 0.01, 0.05, 0.08]
l2_regs = [0.01, 0.001, 0.0001]  # not named l2, which would overwrite the imported Keras l2 regularizer

print(f"Number of combinations: {len(n_layers) * len(n_nodes) * len(optimizer) * len(epochs) * len(learning_rate) * len(batch_size) * len(dropout) * len(l2_regs)}")

def grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs, train_x, train_y):
    # Initialize best parameters and best model variables
    best_params = None
    best_model = None
    best_score = None

    # Generate all possible combinations of hyperparameters
    param_combinations = itertools.product(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs)

    # Loop through all combinations
    for combination in param_combinations:
        # Unpack the combination (names chosen so they do not shadow the function arguments)
        n_layer, n_node, opt, epoch, lr, bs, drop, l2_reg = combination

        # Train the model
        model, history = train_model_1(n_layer, n_node, opt, epoch, lr, bs, drop, l2_reg, train_x, train_y, seed=10, train_plot=False, callback=True)

        # Evaluate the model by its lowest validation loss
        min_loss = min(history.history['val_loss'])

        # Check if this model is better than the previous best
        if best_score is None or min_loss < best_score:
            best_score = min_loss
            best_params = combination
            best_model = model

    return best_params, best_model


# run grid search
best_params, best_model = grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs, train_x, train_y)
print(f"Best Parameters: {best_params}")
print(f"Best Parameters: {best_params}")

In [None]:
# fit best model 
best_model, history = train_model_1(n_layers=best_params[0], n_nodes=best_params[1], optimizer=best_params[2], epochs=best_params[3], learning_rate=best_params[4], batch_size=best_params[5], dropout=best_params[6], l2_reg=best_params[7], train_x=train_x, train_y=train_y, train_plot=True, callback=True, seed=10)

### Evaluating NCF Model
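
The three rating metrics reported below can be written out by hand, which makes the rescaling step explicit. This is a minimal sketch on made-up values, using only the standard library. The function name `rating_metrics` and the toy numbers are assumptions.

```python
import math


def rating_metrics(actual, predicted):
    """Return MAE, MSE and RMSE for two equal-length lists of ratings."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Toy values on the normalized [0, 1] scale, invented for this example
actual_scaled = [1.0, 0.5, 0.0]
predicted_scaled = [0.75, 0.5, 0.25]

# Rescale back to the 1-5 range before computing the metrics
actual = [v * 4 + 1 for v in actual_scaled]        # [5.0, 3.0, 1.0]
predicted = [v * 4 + 1 for v in predicted_scaled]  # [4.0, 3.0, 2.0]

metrics = rating_metrics(actual, predicted)
print(metrics)
```

Computing the metrics after rescaling means an MAE of 0.5 reads directly as "half a star off on average".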

In [None]:
# MODEL EVALUATION ====================================
# Predict the ratings
# y_pred = best_model.predict([test_x['user'], test_x['product']])
y_pred = model.predict([test_x['user'], test_x['product']])

# Rescale the predictions back to the 1-5 range
y_pred = y_pred * 4 + 1

# set predictions and actual ratings to variables
hidden_ratings_array = (np.array(test_y)*4 + 1)
predicted_ratings_array = np.array(y_pred).flatten()

# Rating predictions
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)
print("\nRating Metrics")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

# save results to csv
# round() works whether the metric is a Python float or a NumPy scalar
results = pd.DataFrame({'MAE': [round(mae, 3)], 'MSE': [round(mse, 3)], 'RMSE': [round(rmse, 3)]})
# results.to_csv(r"/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/NCF_results_1.csv", index=False)
results.index = ['NCF']
results

### Additional Accuracy Insights

1. Is accuracy better for users who have rated more items?

2. Is accuracy better for items that have been rated more times?

3. Is accuracy better for some product categories?
- Not pursued: there are too few reviews per category, so the results would not offer much insight.

4. Is accuracy better for longer reviews?


In short, we want to see whether the model's accuracy varies with these variables.
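
The general pattern behind these questions is: bin the entities by some count, then compare the average error per bin. A minimal pandas sketch on made-up numbers follows. The column names and values are assumptions for illustration and are not results from this notebook.

```python
import pandas as pd

# Toy per-user data, invented for this example
toy = pd.DataFrame({
    "user": range(8),
    "n_rated_items": [4, 5, 6, 8, 10, 15, 20, 40],
    "abs_error": [1.2, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3],
})

# Quartile-based groups, as in the analysis below
toy["group"] = pd.qcut(toy["n_rated_items"], q=4,
                       labels=["Low", "Medium", "High", "Very High"])

# Mean absolute error per group
summary = toy.groupby("group", observed=True)["abs_error"].mean()
print(summary)
```

The same pattern applies to items by swapping the grouping column, which is what the second question below does.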

In [None]:
###### QUESTION 1: PROCESS
# 1. Group Users by the Number of Rated Items: Count the number of rated items for each user in your dataset.

# Count the number of rated items for each user
user_ratings = train_x.groupby('user')['product'].count().reset_index()
user_ratings.columns = ['user', 'n_rated_items']

# 2. Divide Users into Groups: Divide users into groups based on the number of rated items. You can define these groups based on quartiles, for example, or any other criteria that make sense for your dataset.

# Divide users into groups based on the number of rated items
user_ratings['group'] = pd.qcut(user_ratings['n_rated_items'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
display(user_ratings)

# what are the number of users in each group?
print(f'Number of Users in Low Group: {user_ratings[user_ratings["group"] == "Low"].shape[0]}')
print(f'Number of Users in Medium Group: {user_ratings[user_ratings["group"] == "Medium"].shape[0]}')
print(f'Number of Users in High Group: {user_ratings[user_ratings["group"] == "High"].shape[0]}')
print(f'Number of Users in Very High Group: {user_ratings[user_ratings["group"] == "Very High"].shape[0]}')

# 3. Evaluate the Model for Each Group: Evaluate the model for each group of users. You can use the same metrics you used in the previous question.
low_group = user_ratings[user_ratings['group'] == 'Low']
medium_group = user_ratings[user_ratings['group'] == 'Medium']
high_group = user_ratings[user_ratings['group'] == 'High']
very_high_group = user_ratings[user_ratings['group'] == 'Very High']

# get test set items for these groups
low_group_test = test_x[test_x['user'].isin(low_group['user'])]
medium_group_test = test_x[test_x['user'].isin(medium_group['user'])]
high_group_test = test_x[test_x['user'].isin(high_group['user'])]
very_high_group_test = test_x[test_x['user'].isin(very_high_group['user'])]

# get test set ratings for these groups
low_group_ratings = test_y[test_x['user'].isin(low_group['user'])]
medium_group_ratings = test_y[test_x['user'].isin(medium_group['user'])]
high_group_ratings = test_y[test_x['user'].isin(high_group['user'])]
very_high_group_ratings = test_y[test_x['user'].isin(very_high_group['user'])]

# get predictions for these groups
low_group_pred = y_pred[test_x['user'].isin(low_group['user'])]
medium_group_pred = y_pred[test_x['user'].isin(medium_group['user'])]
high_group_pred = y_pred[test_x['user'].isin(high_group['user'])]
very_high_group_pred = y_pred[test_x['user'].isin(very_high_group['user'])]

# set predictions and actual ratings to variables
low_group_ratings_array = (np.array(low_group_ratings)*4 + 1)
low_group_pred_array = np.array(low_group_pred).flatten()

medium_group_ratings_array = (np.array(medium_group_ratings)*4 + 1)
medium_group_pred_array = np.array(medium_group_pred).flatten()

high_group_ratings_array = (np.array(high_group_ratings)*4 + 1)
high_group_pred_array = np.array(high_group_pred).flatten()

very_high_group_ratings_array = (np.array(very_high_group_ratings)*4 + 1)
very_high_group_pred_array = np.array(very_high_group_pred).flatten()

# Rating predictions
low_group_mae = mean_absolute_error(low_group_ratings_array, low_group_pred_array)
low_group_mse = mean_squared_error(low_group_ratings_array, low_group_pred_array)
low_group_rmse = np.sqrt(low_group_mse)

medium_group_mae = mean_absolute_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_mse = mean_squared_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_rmse = np.sqrt(medium_group_mse)

high_group_mae = mean_absolute_error(high_group_ratings_array, high_group_pred_array)
high_group_mse = mean_squared_error(high_group_ratings_array, high_group_pred_array)    
high_group_rmse = np.sqrt(high_group_mse)

very_high_group_mae = mean_absolute_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_mse = mean_squared_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_rmse = np.sqrt(very_high_group_mse)

# display results
print("Checking whether the number of items a user has rated impacts model performance.")
results = pd.DataFrame({'MAE': [round(low_group_mae, 3), round(medium_group_mae, 3), round(high_group_mae, 3), round(very_high_group_mae, 3)], 'MSE': [round(low_group_mse, 3), round(medium_group_mse, 3), round(high_group_mse, 3), round(very_high_group_mse, 3)], 'RMSE': [round(low_group_rmse, 3), round(medium_group_rmse, 3), round(high_group_rmse, 3), round(very_high_group_rmse, 3)]})
results.index = ['Low', 'Medium', 'High', 'Very High']
print(f'Number of Users in Low Group: {low_group.shape[0]}')
print(f'Number of Users in Medium Group: {medium_group.shape[0]}')
print(f'Number of Users in High Group: {high_group.shape[0]}')
print(f'Number of Users in Very High Group: {very_high_group.shape[0]}')
results


In [None]:
# evaluate the performance of the model for each user
for idx, user in user_ratings['user'].items():
    # boolean mask selecting this user's rows in the test set
    user_mask = (test_x['user'] == user).to_numpy()
    # actual ratings and predictions for this user, both on the 1-5 scale
    user_ratings_array = np.array(test_y[user_mask]) * 4 + 1
    user_pred_array = np.array(y_pred[user_mask]).flatten()
    # rating prediction metrics
    user_mae = mean_absolute_error(user_ratings_array, user_pred_array)
    user_mse = mean_squared_error(user_ratings_array, user_pred_array)
    user_rmse = np.sqrt(user_mse)
    # assign results to this user's row in user_ratings
    user_ratings.loc[idx, 'MAE'] = user_mae
    user_ratings.loc[idx, 'MSE'] = user_mse
    user_ratings.loc[idx, 'RMSE'] = user_rmse

In [None]:
user_ratings

In [None]:
# 4. Visualize the Results: Plot the accuracy metrics (RMSE, MSE, MAE) against the number of rated items for each group. This will help you visualize any patterns or trends in the accuracy of your model based on the number of rated items.
## plot number of rated items vs MAE, MSE, RMSE scatter plot
plt.figure(figsize=(20, 8))
plt.subplot(1, 3, 1)
plt.scatter(user_ratings['n_rated_items'], user_ratings['MAE'])
plt.title('Number of Rated Items vs MAE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Mean Absolute Error')

plt.subplot(1, 3, 2)
plt.scatter(user_ratings['n_rated_items'], user_ratings['MSE'])
plt.title('Number of Rated Items vs MSE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Mean Squared Error')

plt.subplot(1, 3, 3)
plt.scatter(user_ratings['n_rated_items'], user_ratings['RMSE'])
plt.title('Number of Rated Items vs RMSE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Root Mean Squared Error')
plt.tight_layout()
plt.show()


In [None]:
# see summary statistics for each group
display(user_ratings.groupby('group').agg({'MAE': ['mean', 'std'], 'MSE': ['mean', 'std'], 'RMSE': ['mean', 'std']}))

# apply anova test
import scipy.stats as stats
f_val, p_val = stats.f_oneway(user_ratings[user_ratings['group'] == 'Low']['RMSE'], user_ratings[user_ratings['group'] == 'Medium']['RMSE'], user_ratings[user_ratings['group'] == 'High']['RMSE'], user_ratings[user_ratings['group'] == 'Very High']['RMSE'])
print(f'F-Value: {f_val}')
print(f'P-Value: {p_val}')

# apply post-hoc test
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(user_ratings['RMSE'], user_ratings['group'])
result = mc.tukeyhsd()
print(result)

In [None]:
# QUESTION 2: Want to see if accuracy is better for items that have been rated more times. (i.e., for items that have been rated more times, is the accuracy of the model better?)

# Count the number of ratings for each item
item_ratings = train_x.groupby('product')['user'].count().reset_index()
item_ratings.columns = ['product', 'n_ratings']

# Divide items into groups based on the number of ratings
item_ratings['group'] = pd.qcut(item_ratings['n_ratings'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

# Evaluate the model for each group of items
low_group = item_ratings[item_ratings['group'] == 'Low']
medium_group = item_ratings[item_ratings['group'] == 'Medium']
high_group = item_ratings[item_ratings['group'] == 'High']
very_high_group = item_ratings[item_ratings['group'] == 'Very High']

# get test set items for these groups
low_group_test = test_x[test_x['product'].isin(low_group['product'])]
medium_group_test = test_x[test_x['product'].isin(medium_group['product'])]
high_group_test = test_x[test_x['product'].isin(high_group['product'])]
very_high_group_test = test_x[test_x['product'].isin(very_high_group['product'])]

# get test set ratings for these groups
low_group_ratings = test_y[test_x['product'].isin(low_group['product'])]
medium_group_ratings = test_y[test_x['product'].isin(medium_group['product'])]
high_group_ratings = test_y[test_x['product'].isin(high_group['product'])]
very_high_group_ratings = test_y[test_x['product'].isin(very_high_group['product'])]

# get predictions for these groups
low_group_pred = y_pred[test_x['product'].isin(low_group['product'])]
medium_group_pred = y_pred[test_x['product'].isin(medium_group['product'])]
high_group_pred = y_pred[test_x['product'].isin(high_group['product'])]
very_high_group_pred = y_pred[test_x['product'].isin(very_high_group['product'])]

# set predictions and actual ratings to variables
low_group_ratings_array = (np.array(low_group_ratings)*4 + 1)
low_group_pred_array = np.array(low_group_pred).flatten()

medium_group_ratings_array = (np.array(medium_group_ratings)*4 + 1)
medium_group_pred_array = np.array(medium_group_pred).flatten()

high_group_ratings_array = (np.array(high_group_ratings)*4 + 1)
high_group_pred_array = np.array(high_group_pred).flatten()

very_high_group_ratings_array = (np.array(very_high_group_ratings)*4 + 1)
very_high_group_pred_array = np.array(very_high_group_pred).flatten()

# Rating predictions
low_group_mae = mean_absolute_error(low_group_ratings_array, low_group_pred_array)
low_group_mse = mean_squared_error(low_group_ratings_array, low_group_pred_array)
low_group_rmse = np.sqrt(low_group_mse)

medium_group_mae = mean_absolute_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_mse = mean_squared_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_rmse = np.sqrt(medium_group_mse)

high_group_mae = mean_absolute_error(high_group_ratings_array, high_group_pred_array)
high_group_mse = mean_squared_error(high_group_ratings_array, high_group_pred_array)
high_group_rmse = np.sqrt(high_group_mse)

very_high_group_mae = mean_absolute_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_mse = mean_squared_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_rmse = np.sqrt(very_high_group_mse)

# display results
print("Checking whether the number of ratings an item has received impacts model performance.")
results = pd.DataFrame({'MAE': [round(low_group_mae, 3), round(medium_group_mae, 3), round(high_group_mae, 3), round(very_high_group_mae, 3)], 'MSE': [round(low_group_mse, 3), round(medium_group_mse, 3), round(high_group_mse, 3), round(very_high_group_mse, 3)], 'RMSE': [round(low_group_rmse, 3), round(medium_group_rmse, 3), round(high_group_rmse, 3), round(very_high_group_rmse, 3)]})
results.index = ['Low', 'Medium', 'High', 'Very High']
print(f'Number of Items in Low Group: {low_group.shape[0]}')
print(f'Number of Items in Medium Group: {medium_group.shape[0]}')
print(f'Number of Items in High Group: {high_group.shape[0]}')
print(f'Number of Items in Very High Group: {very_high_group.shape[0]}')
results

In [None]:
# evaluate the performance of model for each item
for item in item_ratings['product']:
    # Filter test set data for the current item
    item_test = test_x[test_x['product'] == item]
    if not item_test.empty:  # Check if there are samples available
        # Get test set ratings for the current item
        items_ratings = test_y[test_x['product'] == item]
        # Get predictions for the current item
        item_pred = y_pred[test_x['product'] == item]
        # Set predictions and actual ratings to variables
        item_ratings_array = (np.array(items_ratings) * 4 + 1)
        item_pred_array = np.array(item_pred).flatten()
        # Rating predictions
        item_mae = mean_absolute_error(item_ratings_array, item_pred_array)
        item_mse = mean_squared_error(item_ratings_array, item_pred_array)
        item_rmse = np.sqrt(item_mse)
        # Assign results to item_ratings
        item_ratings.loc[item, 'MAE'] = item_mae
        item_ratings.loc[item, 'MSE'] = item_mse
        item_ratings.loc[item, 'RMSE'] = item_rmse
    else:
        # No samples available for this item
        print(f"No test set data available for item {item}. Skipping evaluation.")

In [None]:
# how many nas in item_ratings
item_ratings.isna().sum()

# The test set items were randomly sampled from each user's rated items, so some items may never appear in the test set.
# Those items have no test data, so their per-item metrics (MAE, MSE, RMSE) remain NaN.

In [None]:
# Visualize the results: scatter plots of each accuracy metric (MAE, MSE, RMSE) against the number of ratings per item.
# This helps reveal any pattern or trend in model accuracy as items receive more ratings.
plt.figure(figsize=(20, 8))
plt.subplot(1, 3, 1)
plt.scatter(item_ratings['n_ratings'], item_ratings['MAE'])
plt.title('Number of Ratings vs MAE')
plt.xlabel('Number of Ratings')
plt.ylabel('Mean Absolute Error')

plt.subplot(1, 3, 2)
plt.scatter(item_ratings['n_ratings'], item_ratings['MSE'])
plt.title('Number of Ratings vs MSE')
plt.xlabel('Number of Ratings')
plt.ylabel('Mean Squared Error')

plt.subplot(1, 3, 3)
plt.scatter(item_ratings['n_ratings'], item_ratings['RMSE'])
plt.title('Number of Ratings vs RMSE')
plt.xlabel('Number of Ratings')
plt.ylabel('Root Mean Squared Error')

plt.tight_layout()
plt.show()

In [None]:
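# imports for the statistical tests below (harmless if they were already imported earlier)
from scipy import stats
from statsmodels.stats.multicomp import MultiComparison

# see summary statistics for each group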
display(item_ratings.groupby('group').agg({'MAE': ['mean', 'std'], 'MSE': ['mean', 'std'], 'RMSE': ['mean', 'std']}))

# make sure RMSE is numeric (values that fail to convert become NaN), then drop incomplete rows
item_ratings['RMSE'] = pd.to_numeric(item_ratings['RMSE'], errors='coerce')
item_ratings.dropna(inplace=True)

# apply one-way ANOVA to the per-item RMSE of the four groups
rmse_by_group = [item_ratings.loc[item_ratings['group'] == g, 'RMSE'] for g in ['Low', 'Medium', 'High', 'Very High']]
f_val, p_val = stats.f_oneway(*rmse_by_group)
print(f'F-Value: {f_val}')
print(f'P-Value: {p_val}')

# apply post-hoc test (Tukey HSD) to see which pairs of groups differ
mc = MultiComparison(item_ratings['RMSE'], item_ratings['group'])
result = mc.tukeyhsd()
print(result)


### Top-N Recommendations

#### ***Process***

***TLDR***: Adjust the test set so that it contains only items a user liked (rated above a threshold). Then use the predictions in the completed matrix to build a top-N list of items to recommend to the user. Items the user has already rated in the training set are not recommended. Finally, count how many of the recommended items are in the test set.

- By restricting the test set to items that meet or exceed a rating threshold (for example, ratings of 4 or above), we evaluate the model specifically on its ability to predict items the user likes or interacts with positively.
- An item that receives a low rating from a user is unlikely to interest that user in future, so the model should not recommend it. The evaluation can therefore focus on how well the model surfaces high-quality recommendations, which are more likely to lead to user satisfaction and engagement.

Here's a step-by-step breakdown:

1. Test Set Adjustment: Modify your test set to only include items that the user liked, typically by setting a threshold for ratings (e.g., only items rated 4 or 5).
2. Predictions: Generate predictions for each user-item pair using your model. For pairs where a user has not rated an item yet, your model should predict a rating.
3. Top Recommendations: Rank the predicted ratings for each user's unrated items and select the top recommendations (e.g., top 100) based on these predicted ratings.
4. Avoiding Already Rated Items: Check whether the recommended items were already rated by the user in the training set. If an item was already rated, it should not be recommended.
5. Evaluation: Assess the performance of your recommender system using metrics like Precision and Recall, which evaluate how well your recommendations align with the user's preferences (as captured by the test set).
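
The steps above can be sketched on toy data. This is a minimal, library-free illustration rather than the notebook's implementation: the helper `top_n_for_user`, the item names, and the ratings are all hypothetical, and the "already rated" items are assumed to be the ones seen in training.

```python
def top_n_for_user(predicted, train_items, n):
    """Return the n highest-scored items that the user did not rate in training."""
    # Step 4: drop items the user already rated in the training data
    candidates = {item: score for item, score in predicted.items()
                  if item not in train_items}
    # Step 3: rank the remaining items by predicted rating, best first
    ranked = sorted(candidates, key=lambda item: candidates[item], reverse=True)
    return ranked[:n]


# Step 2: hypothetical predicted ratings for one user (1-5 scale)
predicted = {"A": 4.8, "B": 4.1, "C": 3.9, "D": 2.0, "E": 4.5}
train_items = {"A"}                # items rated in the training set
test_ratings = {"B": 5, "C": 2}    # held-out ratings for this user

# Step 1: keep only the held-out items the user liked
threshold = 4
liked_test_items = {item for item, r in test_ratings.items() if r >= threshold}

recommended = top_n_for_user(predicted, train_items, n=2)
hits = [item for item in recommended if item in liked_test_items]

print(recommended)  # ['E', 'B']
print(hits)         # ['B']
```

Item `A` has the highest predicted rating but is excluded because it was already rated in training; of the two recommended items, only `B` is a liked held-out item.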


#### ***Metrics***

**Precision@K:**
- Precision@K measures the proportion of relevant items among the top-K recommended items.
- It answers the question: “Out of the top-K recommendations, how many are actually relevant?”


**Recall@K:**
- Recall@K measures the proportion of all relevant items that appear among the top-K recommended items. It is closely related to Hit Rate@K; the two coincide when each user has exactly one relevant held-out item.
- It answers the question: “Out of all relevant items, how many were included in the top-K recommendations?”
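
As a small worked example of these definitions, the sketch below computes Precision@K, Recall@K, and their harmonic mean (F1) for one ranked list. The function names and the toy data are hypothetical and are not taken from the notebook; `relevant` is assumed to be the set of held-out items the user liked.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for a ranked list and a set of relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1_at_k(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recommended = ["E", "B", "C", "D"]  # ranked by predicted rating, best first
relevant = {"B", "F"}               # held-out items the user liked

precision, recall = precision_recall_at_k(recommended, relevant, k=2)
print(precision, recall, f1_at_k(precision, recall))  # 0.5 0.5 0.5
```

One of the two recommended items is relevant (precision 0.5), and one of the two relevant items was recommended (recall 0.5).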


In [None]:
# getting a dataframe with interactions and ratings
data_mat = data.copy()
data_mat = data_mat.reset_index()
data_mat = data_mat.melt(id_vars=data_mat.columns[0], var_name='product', value_name='rating')
data_mat.columns = ['user', 'product', 'rating']
data_mat['user'] = data_mat['user'].astype('category')
data_mat['product'] = data_mat['product'].astype('category')
data_mat['user'] = data_mat['user'].cat.codes
data_mat['product'] = data_mat['product'].cat.codes
display(data_mat.head(3))

In [None]:
# Function to fill NaN ratings with predictions in the user-item dataframe
def fill_nan_ratings_with_predictions(model, data):
    
    # Create a copy of the DataFrame to avoid modifying the original
    completed = data.copy()

    # Find rows with NaN ratings
    nan_rows = completed[completed['rating'].isna()]

    # Predict the ratings for these rows
    predictions = model.predict([nan_rows['user'], nan_rows['product']])
    predictions = predictions * 4 + 1

    # Fill in the predictions
    completed.loc[nan_rows.index, 'rating'] = predictions.flatten()

    return completed

In [None]:

# Fill NaN ratings with predictions
completed = fill_nan_ratings_with_predictions(model=best_model, data=data_mat)

In [None]:
# see original data with user item interactions
print("User Item Interactions with Ratings")
display(data_mat.head(3))

# see data with predictions
print("\nUser Item Interactions with Predicted Ratings")
display(completed.head(3))

In [None]:
# details on completed dataframe
print('Number of Rows: ', completed.shape[0])
print('Number of Columns: ', completed.shape[1])
print('Number of Unique Users: ', len(completed['user'].unique()))
print('Number of Unique Products: ', len(completed['product'].unique()))

In [None]:
print("Test Data: X")
display(test_x.head(3))
print('Shape of Test Data: ', test_x.shape)

# Rescale the normalized test ratings back to the original 1-5 scale
test_y_top_n = test_y.copy()
test_y_top_n = pd.DataFrame(test_y_top_n)
test_y_top_n = test_y_top_n * 4 + 1

# test_y_top_n now holds the actual test ratings on the 1-5 scale (the liked/not-liked label is added later)
print("\nTest Data: Y")
display(test_y_top_n.head(3))
print('Shape of Test Data: ', test_y_top_n.shape)

# predicted data
print("\nPredicted Data")
predicted_rats = pd.Series(predicted_ratings_array)
predicted_rats.index = test_y.index
display(predicted_rats.head(3))

#### Execute for One User

The cell below applies the five steps from the Process section to a single user (user code 0, shown as "User 1" in the printed output).


In [None]:
# set N - number of recommendations
N = 1000

# Get interactions for User 1 (including ratings)
user_1 = completed[completed['user'] == 0]
print("Number of Interactions for User 1: ", user_1.shape[0])

# Identify liked items for User 1 (rating above the 3.5 threshold)
liked_items = user_1[user_1['rating'] > 3.5]
print("Number of Liked Items for User 1: ", liked_items.shape[0])

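# get test set for user 1, including actual ratings and predicted ratings
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added to a filtered frame)
user_1_test = test_x[test_x['user'] == 0].copy()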
user_1_test['actual_rating'] = test_y_top_n
user_1_test['predicted_rating'] = predicted_rats[user_1_test.index]
print("Number of Test Interactions for User 1: ", user_1_test.shape[0])


# threshold for positive interaction
threshold = 3.5
print("Number of Test Interactions that the User Liked: ", user_1_test[user_1_test['actual_rating'] > threshold].shape[0])

# adjust test set to include a label column using the threshold
user_1_test['label'] = user_1_test['actual_rating'].apply(lambda x: 1 if x > threshold else 0)
user_1_test

# get predictions for user 
completed_user_1 = completed[completed['user'] == 0].copy()


# add label for used interactions (1 for every interaction that exists in train_x)
train_x_user_1 = train_x[train_x['user'] == 0]

# flag each interaction whose product appears in the user's training data
completed_user_1['used_ind'] = completed_user_1['product'].isin(train_x_user_1['product']).astype(int)


# count how many interactions are in train_x
print("Number of Interactions in Train Set for User 1: ", train_x_user_1.shape[0])

# count how many 1 in completed_user_1
print("Number of Interactions in Completed User 1: ", completed_user_1[completed_user_1['used_ind'] == 1].shape[0])

# add label liked for completed_user_1
completed_user_1['liked'] = completed_user_1['rating'].apply(lambda x: 1 if x > threshold else 0)


# get top N recommendations for user 1 - exclude items where used_ind = 1
user_1_top_n = completed_user_1[completed_user_1['used_ind'] == 0]
user_1_top_n = user_1_top_n.sort_values(by='rating', ascending=False)
user_1_top_n = user_1_top_n.head(N)


# add a label column to user_1_top_n: test_ind (1 if the item is a liked item in the user's test set)
liked_test_products = user_1_test.loc[user_1_test['label'] == 1, 'product']
user_1_top_n['test_ind'] = user_1_top_n['product'].isin(liked_test_products).astype(int)


# count how many 1 in user_1_top_n
print("Number of Items in Top N for User 1 that Were Used and Liked: ", user_1_top_n[user_1_top_n['test_ind'] == 1].shape[0])

# see top N recommendations for user 1
print("\n\nTop N Recommendations for User 1")
display(user_1_top_n)


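# Calculate precision@K (top N recommendations)
precision_at_N = user_1_top_n['test_ind'].sum() / N

# Calculate recall@K
# note: liked_items is drawn from `completed`, so this denominator counts predicted as well as observed ratings above the threshold
recall_at_N = user_1_top_n['test_ind'].sum() / liked_items.shape[0]

# calculate F1 score (set to 0 when precision and recall are both 0, to avoid division by zero)
if precision_at_N + recall_at_N == 0:
    f1_at_N = 0
else:
    f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)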

print(f"Precision@{N}: {precision_at_N:.4f}")
print(f"Recall@{N}: {recall_at_N:.4f}")
print(f"F1@{N}: {f1_at_N:.4f}")

# collect results in a dataframe
results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
results

In [None]:
def evaluate_topN_user(user, threshold, N, data):
    # Get interactions for the specified user (including ratings)
    user_interactions = data[data['user'] == user].copy()  # copy so that new columns can be added safely
    print("\nNumber of Interactions for User {}: {}".format(user, user_interactions.shape[0]))

    # Identify liked items for the specified user (above the specified threshold)
    liked_items = user_interactions[user_interactions['rating'] >= threshold]
    print("Number of Liked Items for User {}: {}".format(user, liked_items.shape[0]))

    # Get the test set for the specified user, including actual ratings and predicted ratings
    user_test_set = test_x[test_x['user'] == user].copy()  # copy so that new columns can be added safely
    user_test_set['actual_rating'] = test_y_top_n
    user_test_set['predicted_rating'] = predicted_rats[user_test_set.index]
    print("Number of Test Interactions for User {}: {}".format(user, user_test_set.shape[0]))

    # Adjust the test set to include a label column using the specified threshold
    user_test_set['label'] = user_test_set['actual_rating'].apply(lambda x: 1 if x > threshold else 0)

    # Count the number of interactions in the train set for the specified user
    train_x_user = train_x[train_x['user'] == user]
    print("Number of Interactions in Train Set for User {}: {}\n\n".format(user, train_x_user.shape[0]))

    # Add a label for used interactions (1 for every interaction that exists in train_x)
    user_interactions['used_ind'] = user_interactions['product'].isin(train_x_user['product']).astype(int)

    # Add a label for liked items
    user_interactions['liked'] = user_interactions['rating'].apply(lambda x: 1 if x > threshold else 0)

    # Get the top N recommendations for the specified user (excluding items where used_ind = 1)
    user_top_n = user_interactions[user_interactions['used_ind'] == 0]
    user_top_n = user_top_n.sort_values(by='rating', ascending=False)
    user_top_n = user_top_n.head(N)

    # Add a label column to user_top_n: test_ind (1 if the item is a liked item in the user's test set)
    liked_test_products = user_test_set.loc[user_test_set['label'] == 1, 'product']
    user_top_n['test_ind'] = user_top_n['product'].isin(liked_test_products).astype(int)

    # Calculate Precision@N, Recall@N, and F1@N
    precision_at_N = user_top_n['test_ind'].sum() / N
    recall_at_N = user_top_n['test_ind'].sum() / liked_items.shape[0]
    if precision_at_N + recall_at_N == 0:
        f1_at_N = 0
    else: f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

    # Display the results
    print(f"Results for User {user} with Threshold {threshold} and Top {N} Recommendations!")
    print(f"Precision@{N}: {precision_at_N:.4f}")
    print(f"Recall@{N}: {recall_at_N:.4f}")
    print(f"F1@{N}: {f1_at_N:.4f}")

    # Save the results to a dataframe and return it
    results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
    return results


In [None]:
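# Variant of the function defined in the previous cell; running this cell overrides it.
# Here any held-out test item that appears in the top N counts as a hit, whether or not
# the user liked it (the 'label' column is still computed but is not used for matching).
def evaluate_topN_user(user, threshold, N, data):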
    # Get interactions for the specified user (including ratings)
    user_interactions = data[data['user'] == user].copy()  # copy so that new columns can be added safely
    print("\nNumber of Interactions for User {}: {}".format(user, user_interactions.shape[0]))

    # Identify liked items for the specified user (above the specified threshold)
    liked_items = user_interactions[user_interactions['rating'] >= threshold]
    print("Number of Liked Items for User {}: {}".format(user, liked_items.shape[0]))

    # Get the test set for the specified user, including actual ratings and predicted ratings
    user_test_set = test_x[test_x['user'] == user].copy()  # copy so that new columns can be added safely
    user_test_set['actual_rating'] = test_y_top_n
    user_test_set['predicted_rating'] = predicted_rats[user_test_set.index]
    print("Number of Test Interactions for User {}: {}".format(user, user_test_set.shape[0]))

    # Adjust the test set to include a label column using the specified threshold
    user_test_set['label'] = user_test_set['actual_rating'].apply(lambda x: 1 if x > threshold else 0)

    # Count the number of interactions in the train set for the specified user
    train_x_user = train_x[train_x['user'] == user]
    print("Number of Interactions in Train Set for User {}: {}\n\n".format(user, train_x_user.shape[0]))

    # Add a label for used interactions (1 for every interaction that exists in train_x)
    user_interactions['used_ind'] = user_interactions['product'].isin(train_x_user['product']).astype(int)

    # Add a label for liked items
    user_interactions['liked'] = user_interactions['rating'].apply(lambda x: 1 if x > threshold else 0)

    # Get the top N recommendations for the specified user (excluding items where used_ind = 1)
    user_top_n = user_interactions[user_interactions['used_ind'] == 0]
    user_top_n = user_top_n.sort_values(by='rating', ascending=False)
    user_top_n = user_top_n.head(N)

    # Add a label column to user_top_n: test_ind (1 if the item is in the user's test set)
    user_top_n['test_ind'] = user_top_n['product'].isin(user_test_set['product']).astype(int)

    # Calculate Precision@N, Recall@N, and F1@N
    precision_at_N = user_top_n['test_ind'].sum() / N
    recall_at_N = user_top_n['test_ind'].sum() / liked_items.shape[0]
    if precision_at_N + recall_at_N == 0:
        f1_at_N = 0
    else: f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

    # Display the results
    print(f"Results for User {user} with Threshold {threshold} and Top {N} Recommendations!")
    print(f"Precision@{N}: {precision_at_N:.4f}")
    print(f"Recall@{N}: {recall_at_N:.4f}")
    print(f"F1@{N}: {f1_at_N:.4f}")

    # Save the results to a dataframe and return it
    results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
    return results


In [None]:
# use function
results = evaluate_topN_user(user=0, threshold=3, N=100, data=completed)

#### Get for All Users

The cells above define a function `evaluate_topN_user` that calculates the metrics for a specified `user`, `threshold`, and `N`. The function returns a dataframe with the results.

The next cell loops over every user in the `completed` dataframe and calls `evaluate_topN_user` to calculate the metrics for that user. The per-user results are concatenated into the `results` dataframe.

Finally, the notebook **calculates the average of the metrics across all users and displays the aggregate metrics.**
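
A minimal sketch of this aggregation step, using plain Python and made-up per-user metrics (the values and the helper `average_metrics` are hypothetical): each user contributes one row, and every user carries equal weight in the average regardless of how many test items they have.

```python
from statistics import mean


def average_metrics(rows, keys):
    """Average each metric over users, giving every user equal weight."""
    return {key: mean(row[key] for row in rows) for key in keys}


# one row of metrics per user (illustrative numbers only)
per_user = [
    {"user": 0, "precision": 0.5, "recall": 1.0},
    {"user": 1, "precision": 0.0, "recall": 0.0},
    {"user": 2, "precision": 1.0, "recall": 0.5},
]

averages = average_metrics(per_user, keys=["precision", "recall"])
print(averages)  # {'precision': 0.5, 'recall': 0.5}
```

Collecting the per-user rows in a list and aggregating once at the end also avoids growing a dataframe inside the loop.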


In [None]:
# loop through users to get results for each user and save to a dataframe
results = pd.DataFrame()
for user in range(len(completed['user'].unique())):
    user_results = evaluate_topN_user(user=user, threshold=3, N=10000, data=completed)
    results = pd.concat([results, user_results])

results

In [None]:
# Get the average results for all users
average_results = results.mean()
average_results

In [None]:
# save the average results across all users (rather than the single-user values computed earlier)
results = average_results.to_frame().T
results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/NCF_results_1_topN.csv', index=False)

***
## Model 2: Ratings + Reviews

In [None]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set3_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_modelling.csv')
text_embeddings = pd.read_csv(r'/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/text_embeddings.csv')
display(amz_data.head())

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))


# Creating User Item Matrix =====================================================
# create user-item matrix
data = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
print("\n\nUser-Item Matrix")
display(data.head())

#### Word Embeddings

- The review text is embedded with Google's Universal Sentence Encoder: https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/tensorFlow2/variations/universal-sentence-encoder/versions/2?tfhub-redirect=true

In [None]:
# import tensorflow as tf
# import tensorflow_hub as hub

# # load the model for sentence embeddings
# module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# sent_model = hub.load(module_url)
# print(f"Module {module_url} loaded")

# # Embedding review text
# print("Applying the Universal Sentence Encoder on the review text...")
# review_text = amz_data['reviewText']  # Replace with your actual column name
# text_embeddings = sent_model(review_text)
# print("Review text embeddings generated!")
# print(f"Shape of Text Embeddings: {text_embeddings.shape}")

# # attach embeddings to dataframe
# text_embeddings = text_embeddings.numpy()
# text_embeddings = pd.DataFrame(text_embeddings)
# text_embeddings['revText'] = amz_data['reviewText']
# text_embeddings['asin'] = amz_data['asin']
# text_embeddings['reviewerID'] = amz_data['reviewerID']
display(text_embeddings.head(4))

#### Training and Test Sets

In [None]:
# DATA PREP ====================================

# create a copy of the original matrix to store hidden ratings
x_hidden = data.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    x_hidden.iloc[user_id, hidden_indices] = 'Hidden'

# get indices of hidden ratings
test_data = x_hidden.copy()
test_data = test_data.reset_index()
test_data = test_data.melt(id_vars=test_data.columns[0], var_name='book', value_name='rating')
test_data.columns = ['user', 'product', 'rating']
indices_hidden = test_data[test_data['rating'] == 'Hidden'].index

# Melt the DataFrame into a format where each row is a user-item interaction
data_hidden = x_hidden.reset_index()
data_hidden = data_hidden.melt(id_vars=data_hidden.columns[0], var_name='product', value_name='rating')

# change rows with hidden ratings to NaN
data_hidden.iloc[indices_hidden, 2] = np.nan

# rename columns
data_hidden.columns = ['user', 'product', 'rating']

# Filter out the rows where rating is NaN
data_hidden = data_hidden[data_hidden['rating'].notna()]

# add text embeddings to the data (match user and product to the embeddings)
data_hidden = pd.merge(data_hidden, text_embeddings, how='outer', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
data_hidden.drop(['revText', 'asin','reviewerID'], axis=1, inplace=True)

# Filter out the rows where rating is NaN
data_hidden = data_hidden[data_hidden['rating'].notna()]

# Convert user and item to categorical
data_hidden['user'] = data_hidden['user'].astype('category')
data_hidden['product'] = data_hidden['product'].astype('category')

# see what the data looks like
display(data_hidden.head(4))
print("Data is in format: user, product, rating, text embeddings.\nIt is ready to be partitioned into training and testing sets.")

In [None]:
# # save data 
# data_hidden.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/data_hidden.csv', index=False)    
# text_embeddings.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/text_embeddings.csv', index=False)

# load data
# data_hidden = pd.read_csv(r'/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/data_hidden.csv')
# text_embeddings = pd.read_csv(r'/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/text_embeddings.csv')

# data_hidden['user'] = data_hidden['user'].astype('category')
# data_hidden['product'] = data_hidden['product'].astype('category')



In [None]:
# TEST AND TRAIN DATA ====================================

# Prepare the data - training
train_x = data_hidden[['user', 'product']].apply(lambda x: x.cat.codes)
train_y = data_hidden['rating'].astype(np.float64)
train_y = (train_y - 1) / 4

# add text embeddings to the training data (merge on index)
train_x = pd.merge(train_x, data_hidden, how='outer', left_index=True, right_index=True)
train_x.drop(['user_y', 'product_y', 'rating'], axis=1, inplace=True)
train_x.rename(columns={'user_x': 'user', 'product_x': 'product'}, inplace=True)

# Prepare the data - testing
copy = data.copy()
copy = copy.reset_index()
copy = copy.melt(id_vars=copy.columns[0], var_name='product', value_name='rating')
copy.columns = ['user', 'product', 'rating']
test_x = copy.iloc[indices_hidden, 0:2]


# add text embeddings to the testing data (merge on user and product)
test_x = pd.merge(test_x, text_embeddings, how='left', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
test_x.drop(['revText', 'asin','reviewerID'], axis=1, inplace=True)
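# encode user and product with the same category-to-code mapping as the training data,
# so that each code refers to the same embedding row at prediction time
# (ids that never appear in the training data get code -1)
test_x['user'] = pd.Categorical(test_x['user'], categories=data_hidden['user'].cat.categories).codes
test_x['product'] = pd.Categorical(test_x['product'], categories=data_hidden['product'].cat.categories).codes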
test_y = copy.iloc[indices_hidden, 2].astype(np.float64)
test_y = (test_y - 1) / 4

# show the data
print("Training Data")
display(train_x.head(3))

print("\nTesting Data")
display(test_x.head(3))

#### NCF Model with Reviews


**Inputs:**
The model has three input layers:
- `user_input`: Represents the user ID (integer).
- `product_input`: Represents the product ID (integer).
- `text_input`: Represents the text embeddings of user reviews (float32).

**Embeddings:**
Users and products are each mapped to a dense vector by an `Embedding` layer. These embeddings capture latent features.
- `user_embedding`: Embedding for user IDs.
- `product_embedding`: Embedding for product IDs.

**Flattening and Concatenation**:
The user and product embeddings are flattened into vectors (`user_vecs` and `product_vecs`). These vectors are then concatenated with the text embeddings (`text_input`) to form the combined input vector (`input_vecs`).

**Hidden Layers:**
A loop creates the hidden layers:
- The first layer (`i == 0`) is a `Dense` layer with `n_nodes` units, ReLU activation, and L2 regularization, followed by dropout.
- Each subsequent layer halves the number of nodes and applies the same structure.

**Output Layer**:
The final output layer (*y*) is a single `Dense` unit that predicts the user-item interaction (*rating*).

**Model Compilation**:
The model is compiled with the specified optimizer (Adam, SGD, or RMSprop) and the mean squared error (MSE) loss.

**Training:**
The model is trained on the user codes, product codes, and text embeddings (`train_x['user']`, `train_x['product']`, and the remaining columns of `train_x`) along with the target variable (`train_y`).
10% of the training data is held out for validation (`validation_split=0.1`), with optional early stopping on the validation loss.

**Return:**
The function returns the trained model and training history.
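
To make the layer sizes concrete, the sketch below works out the width of each hidden layer under the halving rule and the size of the concatenated input vector. It is plain Python with hypothetical helper names, not part of the model code. The embedding size of 50 matches the `output_dim` used for the user and product embeddings in this notebook, while the text dimension of 512 is an assumption based on the Universal Sentence Encoder's output size.

```python
def hidden_layer_widths(n_layers, n_nodes):
    """Width of each hidden layer when every layer after the first is halved."""
    widths = []
    for i in range(n_layers):
        if i > 0:
            n_nodes = n_nodes // 2  # integer division keeps the width a whole number
        widths.append(n_nodes)
    return widths


def concatenated_input_dim(user_dim, product_dim, text_dim):
    """Length of the vector obtained by concatenating the three inputs."""
    return user_dim + product_dim + text_dim


print(hidden_layer_widths(n_layers=3, n_nodes=512))  # [512, 256, 128]
print(concatenated_input_dim(50, 50, 512))           # 612
```

With these assumed sizes, a three-layer network maps a 612-dimensional input through layers of 512, 256, and 128 units before the single output unit.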

In [None]:
# Function to train a neural network model for collaborative filtering with text embeddings
def train_model_2(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_reg, train_x, train_y, text_embedding_dim, seed=2207, train_plot=True, callback=True):
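    # seed NumPy and TensorFlow so that weight initialization and shuffling are reproducible
    np.random.seed(seed)
    tf.random.set_seed(seed)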

    # Inputs
    user_input = Input(shape=(1,), dtype='int32', name='user_input')
    product_input = Input(shape=(1,), dtype='int32', name='product_input')
    text_input = Input(shape=(text_embedding_dim,), dtype='float32', name='text_input') 

    # Embeddings
    user_embedding = Embedding(input_dim=len(data_hidden['user'].cat.categories), output_dim=50, name='user_embedding')(user_input)
    product_embedding = Embedding(input_dim=len(data_hidden['product'].cat.categories), output_dim=50, name='product_embedding')(product_input)

    # Flatten
    user_vecs = Flatten()(user_embedding)
    product_vecs = Flatten()(product_embedding)

    # Concatenate user, product, and text embeddings
    input_vecs = Concatenate()([user_vecs, product_vecs, text_input])

    # Add dense layers
    x = input_vecs
    for i in range(n_layers):
        if i > 0:
            # halve the width for each subsequent layer (integer division keeps the unit count an int)
            n_nodes = n_nodes // 2
        x = Dense(n_nodes, activation='relu', kernel_regularizer=l2(l2_reg))(x)
        x = Dropout(dropout)(x)
    y = Dense(1)(x)

    # Compile and train the model
    model = Model(inputs=[user_input, product_input, text_input], outputs=y)

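    if optimizer == 'adam':
        opt = Adam(learning_rate=learning_rate)
    elif optimizer == 'sgd':
        opt = SGD(learning_rate=learning_rate)
    elif optimizer == 'rmsprop':
        opt = RMSprop(learning_rate=learning_rate)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer!r} (expected 'adam', 'sgd', or 'rmsprop')")
    model.compile(optimizer=opt, loss='mse')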

    # Train the model, with optional early stopping on the validation loss
    callbacks = []
    if callback:
        callbacks.append(EarlyStopping(monitor='val_loss', patience=10, verbose=1, restore_best_weights=True))
    history = model.fit([train_x['user'], train_x['product'], train_x.iloc[:, 2:]], train_y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks)

    # Plot training & validation loss values
    if train_plot:
        plt.figure(figsize=(15, 8))
        plt.plot(history.history['loss'], label='Training Loss', marker='o')
        plt.plot(history.history['val_loss'], label='Validation Loss', marker='o')
        # plt.title(f'Model loss for Architecture: {optimizer} optimizer, {n_layers} layers, {n_nodes} nodes, {epochs} epochs, {learning_rate} learning rate, {batch_size} batch size')
        plt.ylabel('Loss', fontsize=40)
        plt.xlabel('Epoch', fontsize = 40)
        plt.xticks(fontsize=36)
        plt.yticks(fontsize=36)
        plt.tight_layout()
        plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_training_2.pdf")
        plt.show()

    
    return model, history


In [None]:
# Model 1 - 2 layers, 512 nodes, adam, 50 epochs, 0.001 learning rate, 128 batch size
# print("Model 1 - 2 layers, 512 nodes, adam, 50 epochs, 0.001 learning rate, 64 batch size")
# model, history = train_model_2(n_layers=2, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01,  seed=10, train_plot=False, callback = True)

model, history = train_model_2(n_layers=2, n_nodes=512, optimizer='adam', epochs=50, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=2207, train_plot=False, callback=True, text_embedding_dim = text_embeddings.shape[1]-3)

# model, history = train_model_1(n_layers=2, n_nodes=512, optimizer='adam', epochs=50, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=True, callback=True)

# # Model 2 - 3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 2 - 3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model2, history2 = train_model_2(n_layers=3, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01,seed=10, train_plot=False, callback = True)

# # Model 3 - 4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 3 - 4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model3, history3 = train_model_2(n_layers=4, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01,seed=10, train_plot=False, callback = True)

# # Model 4 - 5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 4 - 5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model4, history4 = train_model_2(n_layers=5, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01,seed=10, train_plot=False, callback = True)

#### Training Results

In [None]:
# Which model had lowest validation loss?
print("Model 1 Validation Loss: ", min(history.history['val_loss']))
# print("Model 2 Validation Loss: ", min(history2.history['val_loss']))
# print("Model 3 Validation Loss: ", min(history3.history['val_loss']))
# print("Model 4 Validation Loss: ", min(history4.history['val_loss']))

In [None]:
# Visualize training and validation loss for all trained models.
# Models 2-4 are commented out above, so only histories that exist are plotted.
histories = {'Model 1': history}
for name, var in [('Model 2', 'history2'), ('Model 3', 'history3'), ('Model 4', 'history4')]:
    if var in globals():
        histories[name] = globals()[var]
line_colors = {'Model 1': 'b', 'Model 2': 'g', 'Model 3': 'r', 'Model 4': 'y'}

plt.figure(figsize=(20, 8))

# Training Loss
plt.subplot(1, 3, 1)
for name, h in histories.items():
    plt.plot(h.history['loss'], label=name, marker='o', color=line_colors[name])
plt.title('Model Training Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Validation Loss
plt.subplot(1, 3, 2)
for name, h in histories.items():
    plt.plot(h.history['val_loss'], label=name, marker='o', color=line_colors[name])
plt.title('Model Validation Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Training (solid) and validation (dashed) loss on the same plot
plt.subplot(1, 3, 3)
for name, h in histories.items():
    plt.plot(h.history['loss'], label=f'{name} Train', marker='o', color=line_colors[name])
    plt.plot(h.history['val_loss'], label=f'{name} Validation', marker='o', linestyle='--', color=line_colors[name])
plt.title('Model Training and Validation Loss', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

# Print the configuration of each plotted model (matches the train_model_2 calls above)
model_descriptions = {
    'Model 1': "2 layers, 512 nodes, adam, 50 epochs, 0.001 learning rate, 128 batch size",
    'Model 2': "3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size",
    'Model 3': "4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size",
    'Model 4': "5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size",
}
for name in histories:
    print(f"{name}: {model_descriptions[name]}")


### Hyperparameter Tuning / Grid Search
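
Before launching an exhaustive search it helps to know how many models it will train. The sketch below counts the combinations for a search space with the same values as the grid in this section. The dictionary name `param_grid` is illustrative and not part of the notebook.

```python
import itertools
import math

# Illustrative copy of the search space (same values as the grid in this section).
param_grid = {
    "n_layers": [1, 2, 3, 6, 8],
    "n_nodes": [128, 256, 512, 1024],
    "optimizer": ["adam", "sgd"],
    "epochs": [50, 150, 300],
    "learning_rate": [0.001, 0.01, 0.0001],
    "batch_size": [32, 64, 128],
    "dropout": [0, 0.01, 0.05, 0.08],
    "l2_reg": [0.01, 0.001, 0.0001],
}

# The number of combinations is the product of the list lengths.
n_combinations = math.prod(len(values) for values in param_grid.values())
print(f"Number of combinations: {n_combinations}")

# itertools.product yields one tuple per combination, in the key order above.
first_combination = next(itertools.product(*param_grid.values()))
print(dict(zip(param_grid, first_combination)))
```

Each combination is a full training run, so with a grid of this size it may be worth trimming the lists before running the search.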

In [None]:
import itertools

# Grid Search Parameters
n_layers = [1,2,3,6,8]
n_nodes = [128,256,512,1024]
optimizer = ['adam', 'sgd']
epochs = [50,150,300]
learning_rate = [0.001, 0.01,  0.0001]
batch_size = [32,64,128]
dropout = [0, 0.01, 0.05, 0.08]
# named l2_vals so it does not shadow the `l2` regularizer imported from keras
l2_vals = [0.01, 0.001, 0.0001]

# print(f"Number of combinations: {len(n_layers) * len(n_nodes) * len(optimizer) * len(epochs) * len(learning_rate) * len(batch_size)* len(dropout)* len(l2_vals)}")

def grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout_vals, l2_vals, train_x, train_y):
    # Initialize best parameters and best model variables
    best_params = None
    best_model = None
    best_score = None

    # Generate all possible combinations of hyperparameters
    param_combinations = itertools.product(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout_vals, l2_vals)

    # Loop through all combinations
    for combination in param_combinations:
        # Unpack the combination
        n_layer, n_node, opt, epoch, lr, bs, dropout_val, l2_val = combination
        print(combination)

        # Train the model
        model, history = train_model_2(n_layers=n_layer, n_nodes=n_node, optimizer=opt, epochs=epoch, learning_rate=lr, batch_size=bs, dropout=dropout_val, l2_reg=l2_val, train_x=train_x, train_y=train_y, train_plot=False, seed=10, text_embedding_dim = text_embeddings.shape[1]-3, callback=True)

        # Evaluate the model by its lowest validation loss
        min_loss = min(history.history['val_loss'])

        # Check if this model is better than the previous best
        if best_score is None or min_loss < best_score:
            best_score = min_loss
            best_params = combination
            best_model = model

    return best_params, best_model


# run grid search
best_params, best_model = grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_vals, train_x, train_y)
print(f"Best Parameters: {best_params}")

In [None]:
# fit best model 
best_model, history = train_model_2(n_layers=best_params[0], n_nodes=best_params[1], optimizer=best_params[2], epochs=best_params[3], learning_rate=best_params[4], batch_size=best_params[5], dropout=best_params[6], l2_reg=best_params[7], train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, train_plot=True, callback=True, seed=10)

### Evaluation
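
The evaluation in this section maps values from the normalised [0, 1] scale back to the 1-5 rating scale with `y * 4 + 1` before computing errors. A minimal sketch of that step follows, using made-up numbers and a hypothetical helper `to_rating_scale` that is not part of the notebook.

```python
import numpy as np

def to_rating_scale(y_scaled, low=1.0, high=5.0):
    """Invert min-max scaling: map values in [0, 1] back to [low, high]."""
    return np.asarray(y_scaled, dtype=float) * (high - low) + low

# Toy values for illustration only (not results from this notebook).
y_true_scaled = np.array([0.0, 0.5, 1.0, 0.75])
y_pred_scaled = np.array([0.25, 0.5, 0.75, 0.75])

y_true = to_rating_scale(y_true_scaled)
y_pred = to_rating_scale(y_pred_scaled)

# Error metrics on the original rating scale.
errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```

Because the rescaling is linear, MAE and RMSE on the 1-5 scale are 4 times their values on the [0, 1] scale, and MSE is 16 times larger.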

In [None]:
# MODEL EVALUATION ====================================
# Predict the ratings
# y_pred = best_model.predict([test_x['user'], test_x['product'], test_x.iloc[:, 2:]])
y_pred = model.predict([test_x['user'], test_x['product'], test_x.iloc[:, 2:]])

# Rescale the predictions back to the 1-5 range
y_pred = y_pred * 4 + 1

# set predictions and actual ratings to variables
hidden_ratings_array = (np.array(test_y)*4 + 1)
predicted_ratings_array = np.array(y_pred).flatten()

# Rating predictions
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)
print("\nRating Metrics")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

# save results to csv (built-in round works for both Python floats and NumPy scalars)
results = pd.DataFrame({'MAE': [round(mae, 3)], 'MSE': [round(mse, 3)], 'RMSE': [round(rmse, 3)]})
# results.to_csv(r"/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/NCF_results_2.csv", index=False)
results

In [None]:
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming hidden_ratings_array and predicted_ratings_array are NumPy arrays containing the respective data

# Get differences
differences = hidden_ratings_array - predicted_ratings_array

# Check for normality using Shapiro-Wilk test
normality_test_statistic, p_value = stats.shapiro(differences)
print("Shapiro-Wilk test statistic:", normality_test_statistic)
print("p-value:", p_value)

# Visualize distribution of differences
sns.histplot(differences, kde=True)
plt.title('Distribution of Differences')
plt.xlabel('Differences')
plt.ylabel('Frequency')
plt.show()

In [None]:
from scipy.stats import kruskal

rmse_values = {
    'Item-Based Collaborative Filtering': 921323.861,
    'User-Based Collaborative Filtering': 2.992,
    'Non-Negative Matrix Factorisation': 100.862,
    'Neural Collaborative Filtering (Ratings Only)': 21212.903,
    'Neural Collaborative Filtering (Ratings & Reviews)': 0.01,
    'Neural Collaborative Filtering (Ratings, Reviews, Sentiments)': 33233.779,
    'Baseline Model': 1.311
}

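# Perform Kruskal-Wallis test
# NOTE: the test compares distributions, so each model needs several RMSE observations
# (e.g. one per fold or per seed). With a single RMSE per model, as here, the result is
# not meaningful, and SciPy may reject scalar inputs.
h_statistic, p_value_kruskal = kruskal(*rmse_values.values())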

print("Kruskal-Wallis p-value:", p_value_kruskal)

### Additional Accuracy Insights

1. Is accuracy better for users who have rated more items?

2. Is accuracy better for items that have been rated more times?

3. Is accuracy better for some product categories?
- Not analysed: there are too few reviews per category for the results to offer much insight.

4. Is accuracy better for longer reviews?


In short, we want to see whether accuracy varies with each of these variables.
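
The questions above share one pattern: split the test rows into groups by some variable, then compare the error per group. A minimal sketch of that pattern on toy data follows. The column names `x`, `actual` and `predicted` are assumptions for illustration, not columns from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: one row per held-out rating.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],          # grouping variable, e.g. ratings per user
    "actual": [5, 4, 3, 5, 2, 4, 1, 5],
    "predicted": [4, 4, 5, 5, 2, 3, 1, 3],
})

# Split x into four equal-sized groups (quartiles).
df["group"] = pd.qcut(df["x"], q=4, labels=["Low", "Medium", "High", "Very High"])

# Squared error per row, then RMSE per group = sqrt(mean squared error).
df["sq_err"] = (df["actual"] - df["predicted"]) ** 2
group_rmse = np.sqrt(df.groupby("group", observed=True)["sq_err"].mean())
print(group_rmse)
```

Whether the differences between groups are meaningful is a separate question, which the notebook addresses with ANOVA and a post-hoc test on per-user or per-item errors.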

#### QUESTION 1

In [None]:
###### QUESTION 1: PROCESS
# 1. Group Users by the Number of Rated Items: Count the number of rated items for each user in your dataset.

# Count the number of rated items for each user
user_ratings = train_x.groupby('user')['product'].count().reset_index()
user_ratings.columns = ['user', 'n_rated_items']

# 2. Divide Users into Groups: Divide users into groups based on the number of rated items. You can define these groups based on quartiles, for example, or any other criteria that make sense for your dataset.

# Divide users into groups based on the number of rated items
user_ratings['group'] = pd.qcut(user_ratings['n_rated_items'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
display(user_ratings)

# what are the number of users in each group?
print(f'Number of Users in Low Group: {user_ratings[user_ratings["group"] == "Low"].shape[0]}')
print(f'Number of Users in Medium Group: {user_ratings[user_ratings["group"] == "Medium"].shape[0]}')
print(f'Number of Users in High Group: {user_ratings[user_ratings["group"] == "High"].shape[0]}')
print(f'Number of Users in Very High Group: {user_ratings[user_ratings["group"] == "Very High"].shape[0]}')

# 3. Evaluate the Model for Each Group: Evaluate the model for each group of users. You can use the same metrics you used in the previous question.
low_group = user_ratings[user_ratings['group'] == 'Low']
medium_group = user_ratings[user_ratings['group'] == 'Medium']
high_group = user_ratings[user_ratings['group'] == 'High']
very_high_group = user_ratings[user_ratings['group'] == 'Very High']

# get test set items for these groups
low_group_test = test_x[test_x['user'].isin(low_group['user'])]
medium_group_test = test_x[test_x['user'].isin(medium_group['user'])]
high_group_test = test_x[test_x['user'].isin(high_group['user'])]
very_high_group_test = test_x[test_x['user'].isin(very_high_group['user'])]

In [None]:
# get test set ratings for these groups
test_y_reset = test_y.reset_index(drop=True)
low_group_ratings = test_y_reset[test_x['user'].isin(low_group['user'])]
medium_group_ratings = test_y_reset[test_x['user'].isin(medium_group['user'])]
high_group_ratings = test_y_reset[test_x['user'].isin(high_group['user'])]
very_high_group_ratings = test_y_reset[test_x['user'].isin(very_high_group['user'])]


In [None]:

# get predictions for these groups
low_group_pred = y_pred[test_x['user'].isin(low_group['user'])]
medium_group_pred = y_pred[test_x['user'].isin(medium_group['user'])]
high_group_pred = y_pred[test_x['user'].isin(high_group['user'])]
very_high_group_pred = y_pred[test_x['user'].isin(very_high_group['user'])]

# set predictions and actual ratings to variables
low_group_ratings_array = (np.array(low_group_ratings)*4 + 1)
low_group_pred_array = np.array(low_group_pred).flatten()

medium_group_ratings_array = (np.array(medium_group_ratings)*4 + 1)
medium_group_pred_array = np.array(medium_group_pred).flatten()

high_group_ratings_array = (np.array(high_group_ratings)*4 + 1)
high_group_pred_array = np.array(high_group_pred).flatten()

very_high_group_ratings_array = (np.array(very_high_group_ratings)*4 + 1)
very_high_group_pred_array = np.array(very_high_group_pred).flatten()

# Rating predictions
low_group_mae = mean_absolute_error(low_group_ratings_array, low_group_pred_array)
low_group_mse = mean_squared_error(low_group_ratings_array, low_group_pred_array)
low_group_rmse = np.sqrt(low_group_mse)

medium_group_mae = mean_absolute_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_mse = mean_squared_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_rmse = np.sqrt(medium_group_mse)

high_group_mae = mean_absolute_error(high_group_ratings_array, high_group_pred_array)
high_group_mse = mean_squared_error(high_group_ratings_array, high_group_pred_array)    
high_group_rmse = np.sqrt(high_group_mse)

very_high_group_mae = mean_absolute_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_mse = mean_squared_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_rmse = np.sqrt(very_high_group_mse)

# display results
print("Checking whether the number of reviews impacts the model performance.")
results = pd.DataFrame({
    'MAE': [round(m, 3) for m in (low_group_mae, medium_group_mae, high_group_mae, very_high_group_mae)],
    'MSE': [round(m, 3) for m in (low_group_mse, medium_group_mse, high_group_mse, very_high_group_mse)],
    'RMSE': [round(m, 3) for m in (low_group_rmse, medium_group_rmse, high_group_rmse, very_high_group_rmse)],
})
results.index = ['Low', 'Medium', 'High', 'Very High']
print(f'Number of Users in Low Group: {low_group.shape[0]}')
print(f'Number of Users in Medium Group: {medium_group.shape[0]}')
print(f'Number of Users in High Group: {high_group.shape[0]}')
print(f'Number of Users in Very High Group: {very_high_group.shape[0]}')
results

In [None]:
# evaluate the performance of the model for each user
for idx, user in user_ratings['user'].items():
    # boolean mask selecting this user's rows in the test set
    user_mask = (test_x['user'] == user).to_numpy()
    if not user_mask.any():
        # no test-set ratings for this user; leave the metrics as NaN
        continue
    # actual ratings for the user, rescaled back to the 1-5 range
    user_actual_array = np.array(test_y_reset[user_mask]).flatten() * 4 + 1
    # predictions for the user (y_pred is already on the 1-5 scale)
    user_pred_array = np.array(y_pred[user_mask]).flatten()
    # rating prediction metrics
    user_mae = mean_absolute_error(user_actual_array, user_pred_array)
    user_mse = mean_squared_error(user_actual_array, user_pred_array)
    user_rmse = np.sqrt(user_mse)
    # assign results by row label (idx) rather than by user ID
    user_ratings.loc[idx, 'MAE'] = user_mae
    user_ratings.loc[idx, 'MSE'] = user_mse
    user_ratings.loc[idx, 'RMSE'] = user_rmse

In [None]:
user_ratings

In [None]:
# 4. Visualize the Results: Plot the accuracy metrics (RMSE, MSE, MAE) against the number of rated items for each group. This will help you visualize any patterns or trends in the accuracy of your model based on the number of rated items.
## plot number of rated items vs MAE, MSE, RMSE scatter plot
plt.figure(figsize=(20, 8))
plt.subplot(1, 3, 1)
plt.scatter(user_ratings['n_rated_items'], user_ratings['MAE'])
plt.title('Number of Rated Items vs MAE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Mean Absolute Error')

plt.subplot(1, 3, 2)
plt.scatter(user_ratings['n_rated_items'], user_ratings['MSE'])
plt.title('Number of Rated Items vs MSE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Mean Squared Error')

plt.subplot(1, 3, 3)
plt.scatter(user_ratings['n_rated_items'], user_ratings['RMSE'])
plt.title('Number of Rated Items vs RMSE')
plt.xlabel('Number of Rated Items')
plt.ylabel('Root Mean Squared Error')
plt.tight_layout()
plt.show()


In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Create a colormap for different accuracy metrics
colors = {'RMSE': 'seagreen'}

# 4. Visualize the Results: Plot the accuracy metrics (RMSE, MSE, MAE) against the number of rated items for each group.
# Plot number of rated items vs MAE, MSE, RMSE scatter plot
sns.set(style="whitegrid", palette="pastel")
plt.figure(figsize=(15, 9))

# Iterate over each accuracy metric
for metric in ['RMSE']:
    plt.scatter(user_ratings['n_rated_items'], user_ratings[metric], c=colors[metric], label=metric)

plt.xlabel('Number of Reviews', fontsize=38)
plt.ylabel('Accuracy Metric Scores', fontsize=38)
plt.xticks(fontsize=34)
plt.yticks(fontsize=34)
sns.despine()
plt.tight_layout()
plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_user_ratings_metrics.pdf")
plt.grid(True)
plt.show()


In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Create a colormap for different groups
palette = sns.color_palette("hsv", len(user_ratings['group'].unique()))

# 4. Visualize the Results: Plot the RMSE against the number of rated items for each group.
# Plot number of rated items vs RMSE scatter plot
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))

# Iterate over each group
for i, group in enumerate(user_ratings['group'].unique()):
    group_data = user_ratings[user_ratings['group'] == group]
    plt.scatter(group_data['n_rated_items'], group_data['RMSE'], color=palette[i], label=group)

plt.xlabel('Number of Rated Items', fontsize=14)
plt.ylabel('RMSE', fontsize=14)
plt.legend(title='Group', fontsize=12)
plt.title('', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# get the quartiles 
user_ratings['n_rated_items'].describe()

In [None]:
# see summary statistics for each group
tab = user_ratings.groupby('group').agg({'MAE': ['mean'], 'MSE': ['mean'], 'RMSE': ['mean',]}).round(3)
display(tab)
tab['quartiles'] = ['0-12', '13-15', '16-21', '22-' ]
tab = tab[['quartiles', 'MAE', 'MSE', 'RMSE']]

import tabulate
latex_table = tabulate.tabulate(tab, headers='keys', tablefmt='latex_raw', showindex=False)
print(latex_table)


In [None]:
# apply kruskal wallis test
from scipy.stats import kruskal

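# Perform Kruskal-Wallis test on the per-user RMSE values of each group
# (the test needs a sample per group, not a single aggregate RMSE)
rmse_by_group = [user_ratings.loc[user_ratings['group'] == g, 'RMSE'].dropna() for g in ['Low', 'Medium', 'High', 'Very High']]
h_statistic, p_value_kruskal = kruskal(*rmse_by_group)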
print("Kruskal-Wallis p-value:", p_value_kruskal)


In [None]:

# apply anova test (users without an RMSE value are excluded)
import scipy.stats as stats
user_rmse = user_ratings.dropna(subset=['RMSE'])
f_val, p_val = stats.f_oneway(*[user_rmse.loc[user_rmse['group'] == g, 'RMSE'] for g in ['Low', 'Medium', 'High', 'Very High']])
print(f'F-Value: {f_val}')
print(f'P-Value: {p_val}')

# apply post-hoc test (Tukey HSD)
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(user_rmse['RMSE'], user_rmse['group'])
result = mc.tukeyhsd()
print(result)

In [None]:
# use a boxplot to visualize the results
sns.set(style="whitegrid", palette="pastel")
plt.figure(figsize=(15, 6))
sns.boxplot(x='group', y='RMSE', data=user_ratings, color = 'steelblue')
plt.xlabel('', fontsize=1)
plt.ylabel('RMSE', fontsize=20)
plt.title('', fontsize=16)
plt.xticks(fontsize=20)
plt.yticks(fontsize=18)
plt.tight_layout()
sns.despine()
plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_groupUser_ratings_metrics.pdf")
plt.show()


#### QUESTION 2
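
Per-item metrics can also be computed without an explicit loop. The sketch below is an alternative to the per-item loop used in this section, shown on toy data. The frames `train` and `test` and their columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
train = pd.DataFrame({"product": [0, 0, 0, 1, 1, 2]})   # training interactions
test = pd.DataFrame({
    "product":   [0, 0, 1],                              # item 2 has no test rows
    "actual":    [5.0, 3.0, 4.0],
    "predicted": [4.0, 3.0, 2.0],
})

# Number of training ratings per item.
item_counts = train.groupby("product").size().rename("n_ratings").reset_index()

# Per-item error metrics computed on the test rows.
err = test["actual"] - test["predicted"]
per_item = (
    test.assign(abs_err=err.abs(), sq_err=err ** 2)
        .groupby("product")
        .agg(MAE=("abs_err", "mean"), MSE=("sq_err", "mean"))
        .reset_index()
)
per_item["RMSE"] = np.sqrt(per_item["MSE"])

# Left join keeps every training item; items absent from the test set get NaN metrics.
item_metrics = item_counts.merge(per_item, on="product", how="left")
print(item_metrics)
```

Joining on the `product` column avoids relying on row labels happening to equal product codes.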

In [None]:
# QUESTION 2: Want to see if accuracy is better for items that have been rated more times. (i.e., for items that have been rated more times, is the accuracy of the model better?)

# Count the number of ratings for each item
item_ratings = train_x.groupby('product')['user'].count().reset_index()
item_ratings.columns = ['product', 'n_ratings']

# Divide items into groups based on the number of ratings
item_ratings['group'] = pd.qcut(item_ratings['n_ratings'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

# Evaluate the model for each group of items
low_group = item_ratings[item_ratings['group'] == 'Low']
medium_group = item_ratings[item_ratings['group'] == 'Medium']
high_group = item_ratings[item_ratings['group'] == 'High']
very_high_group = item_ratings[item_ratings['group'] == 'Very High']

# get test set items for these groups
low_group_test = test_x[test_x['product'].isin(low_group['product'])]
medium_group_test = test_x[test_x['product'].isin(medium_group['product'])]
high_group_test = test_x[test_x['product'].isin(high_group['product'])]
very_high_group_test = test_x[test_x['product'].isin(very_high_group['product'])]

# get test set ratings for these groups
low_group_ratings = test_y_reset[test_x['product'].isin(low_group['product'])]
medium_group_ratings = test_y_reset[test_x['product'].isin(medium_group['product'])]
high_group_ratings = test_y_reset[test_x['product'].isin(high_group['product'])]
very_high_group_ratings = test_y_reset[test_x['product'].isin(very_high_group['product'])]

# get predictions for these groups
low_group_pred = y_pred[test_x['product'].isin(low_group['product'])]
medium_group_pred = y_pred[test_x['product'].isin(medium_group['product'])]
high_group_pred = y_pred[test_x['product'].isin(high_group['product'])]
very_high_group_pred = y_pred[test_x['product'].isin(very_high_group['product'])]

# set predictions and actual ratings to variables
low_group_ratings_array = (np.array(low_group_ratings)*4 + 1)
low_group_pred_array = np.array(low_group_pred).flatten()

medium_group_ratings_array = (np.array(medium_group_ratings)*4 + 1)
medium_group_pred_array = np.array(medium_group_pred).flatten()

high_group_ratings_array = (np.array(high_group_ratings)*4 + 1)
high_group_pred_array = np.array(high_group_pred).flatten()

very_high_group_ratings_array = (np.array(very_high_group_ratings)*4 + 1)
very_high_group_pred_array = np.array(very_high_group_pred).flatten()

# Rating predictions
low_group_mae = mean_absolute_error(low_group_ratings_array, low_group_pred_array)
low_group_mse = mean_squared_error(low_group_ratings_array, low_group_pred_array)
low_group_rmse = np.sqrt(low_group_mse)

medium_group_mae = mean_absolute_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_mse = mean_squared_error(medium_group_ratings_array, medium_group_pred_array)
medium_group_rmse = np.sqrt(medium_group_mse)

high_group_mae = mean_absolute_error(high_group_ratings_array, high_group_pred_array)
high_group_mse = mean_squared_error(high_group_ratings_array, high_group_pred_array)
high_group_rmse = np.sqrt(high_group_mse)

very_high_group_mae = mean_absolute_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_mse = mean_squared_error(very_high_group_ratings_array, very_high_group_pred_array)
very_high_group_rmse = np.sqrt(very_high_group_mse)

# display results
print("Checking whether the number of reviews an item has impacts the model performance.")
results = pd.DataFrame({
    'MAE': [round(m, 3) for m in (low_group_mae, medium_group_mae, high_group_mae, very_high_group_mae)],
    'MSE': [round(m, 3) for m in (low_group_mse, medium_group_mse, high_group_mse, very_high_group_mse)],
    'RMSE': [round(m, 3) for m in (low_group_rmse, medium_group_rmse, high_group_rmse, very_high_group_rmse)],
})
results.index = ['Low', 'Medium', 'High', 'Very High']
print(f'Number of Items in Low Group: {low_group.shape[0]}')
print(f'Number of Items in Medium Group: {medium_group.shape[0]}')
print(f'Number of Items in High Group: {high_group.shape[0]}')
print(f'Number of Items in Very High Group: {very_high_group.shape[0]}')
results

In [None]:
# evaluate the performance of the model for each item
for idx, item in item_ratings['product'].items():
    # boolean mask selecting this item's rows in the test set
    item_mask = (test_x['product'] == item).to_numpy()
    if item_mask.any():  # Check if there are samples available
        # Actual test set ratings for the current item, rescaled to the 1-5 range
        item_actual_array = np.array(test_y_reset[item_mask]).flatten() * 4 + 1
        # Predictions for the current item (y_pred is already on the 1-5 scale)
        item_pred_array = np.array(y_pred[item_mask]).flatten()
        # Rating prediction metrics
        item_mae = mean_absolute_error(item_actual_array, item_pred_array)
        item_mse = mean_squared_error(item_actual_array, item_pred_array)
        item_rmse = np.sqrt(item_mse)
        # Assign results by row label (idx) rather than by product ID
        item_ratings.loc[idx, 'MAE'] = item_mae
        item_ratings.loc[idx, 'MSE'] = item_mse
        item_ratings.loc[idx, 'RMSE'] = item_rmse
    else:
        # No samples available for this item
        print(f"No test set data available for item {item}. Skipping evaluation.")

In [None]:
# how many nas in item_ratings
item_ratings.isna().sum()

# Because the test set items were randomly sampled from each user's rated items, some items may not appear in the test set at all. Those items have no test data to evaluate, so their per-item metrics are NaN.

In [None]:
# 4. Visualize the Results: Plot the accuracy metrics (RMSE, MSE, MAE) against the number of reviews for each group. This will help you visualize any patterns or trends in the accuracy of your model based on the number of rated items.
## plot number of rated items vs MAE, MSE, RMSE scatter plot
plt.figure(figsize=(20, 8))
plt.subplot(1, 3, 1)
plt.scatter(item_ratings['n_ratings'], item_ratings['MAE'])
plt.title('Number of Ratings vs MAE')
plt.xlabel('Number of Ratings')
plt.ylabel('Mean Absolute Error')

plt.subplot(1, 3, 2)
plt.scatter(item_ratings['n_ratings'], item_ratings['MSE'])
plt.title('Number of Ratings vs MSE')
plt.xlabel('Number of Ratings')
plt.ylabel('Mean Squared Error')

plt.subplot(1, 3, 3)
plt.scatter(item_ratings['n_ratings'], item_ratings['RMSE'])
plt.title('Number of Ratings vs RMSE')
plt.xlabel('Number of Ratings')
plt.ylabel('Root Mean Squared Error')

plt.tight_layout()
plt.show()

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Create a colormap for different accuracy metrics
colors = {'RMSE': 'seagreen'}

# 4. Visualize the Results: Plot the accuracy metrics (RMSE, MSE, MAE) against the number of rated items for each group.
# Plot number of rated items vs MAE, MSE, RMSE scatter plot
sns.set(style="whitegrid", palette="pastel")
plt.figure(figsize=(15, 9))

# Iterate over each accuracy metric
for metric in ['RMSE']:
    plt.scatter(item_ratings['n_ratings'], item_ratings[metric], c=colors[metric], label=metric)

plt.xlabel('Number of Reviews', fontsize=38)
plt.ylabel('Accuracy Metric Scores', fontsize=38)
plt.legend(fontsize=38)
plt.xticks(fontsize=34)
plt.yticks(fontsize=34)
sns.despine()
plt.tight_layout()
plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_item_ratings_metrics.pdf")
plt.grid(True)
plt.show()


In [None]:
item_ratings['n_ratings'].describe()

In [None]:
# see summary statistics for each group
tab = item_ratings.groupby('group').agg({'MAE': ['mean'], 'MSE': ['mean'], 'RMSE': ['mean',]}).round(3)
display(tab)
tab['quartiles'] = ['0-13', '14-16', '17-25', '26-' ]
tab = tab[['quartiles', 'MAE', 'MSE', 'RMSE']]

import tabulate
latex_table = tabulate.tabulate(tab, headers='keys', tablefmt='latex_raw', showindex=False)
print(latex_table)


In [None]:
# see summary statistics for each group
display(item_ratings.groupby('group').agg({'MAE': ['mean', 'std'], 'MSE': ['mean', 'std'], 'RMSE': ['mean', 'std']}))

# apply anova test
item_ratings.dropna(inplace=True)
f_val, p_val = stats.f_oneway(item_ratings[item_ratings['group'] == 'Low']['RMSE'], item_ratings[item_ratings['group'] == 'Medium']['RMSE'], item_ratings[item_ratings['group'] == 'High']['RMSE'], item_ratings[item_ratings['group'] == 'Very High']['RMSE'])
print(f'F-Value: {f_val}')
print(f'P-Value: {p_val}')

# apply post-hoc test
item_ratings['RMSE'] = pd.to_numeric(item_ratings['RMSE'], errors='coerce')  # coerce errors to NaN if conversion fails
mc = MultiComparison(item_ratings['RMSE'], item_ratings['group'])
result = mc.tukeyhsd()
print(result)


#### QUESTION 4
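
Two details matter for this analysis: how review length is measured, and how RMSE behaves on single rows. The sketch below uses toy data to count words per review, treating missing text as length 0 (an assumption), and to show that a per-row "RMSE" is just the absolute error. The frame `reviews` is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
reviews = pd.DataFrame({
    "revText": ["great", "works well but the strap broke", None, "ok product"],
    "actual": [5.0, 2.0, 4.0, 3.0],
    "predicted": [4.0, 4.0, 4.0, 3.5],
})

# Word count per review; missing text is treated as an empty review.
reviews["review_length"] = reviews["revText"].fillna("").str.split().str.len()

# For a single row, sqrt(squared error) is just the absolute error ...
abs_err = (reviews["actual"] - reviews["predicted"]).abs()
sq_err = (reviews["actual"] - reviews["predicted"]) ** 2
assert np.allclose(np.sqrt(sq_err), abs_err)

# ... so RMSE only differs from MAE once errors are aggregated over several rows.
mae = abs_err.mean()
rmse = np.sqrt(sq_err.mean())
print(reviews["review_length"].tolist(), round(mae, 3), round(rmse, 3))
```

For a group of reviews, the group RMSE is therefore the square root of the group's mean squared error, not the mean of per-row values.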

In [None]:
# 4. Want to see if accuracy is better for reviews that are longer. (i.e., for reviews that are longer, is the accuracy of the model better?)
train_x_extd = pd.merge(data_hidden, text_embeddings[['revText', 'reviewerID', 'asin']], how='outer', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
train_x_extd.drop(['asin','reviewerID'], axis=1, inplace=True)
train_x_extd = train_x_extd[train_x_extd['rating'].notna()]
train_x_extd['user'] = train_x_extd['user'].astype('category')
train_x_extd['product'] = train_x_extd['product'].astype('category')
train_x_extd.set_index(data_hidden.index, inplace=True)
train_x_extd[['user', 'product']] = train_x_extd[['user', 'product']].apply(lambda x: x.cat.codes)
train_x_extd['review_length'] = train_x_extd['revText'].apply(lambda x: len(x.split()))
train_x_extd.drop('revText', axis=1, inplace=True)
train_x_extd.drop('rating', axis=1, inplace=True)
train_x_extd

In [None]:
# create test_x_extd
test_x_extd = copy.iloc[indices_hidden, 0:2]


# add review length to test_x_extd
test_x_extd = pd.merge(test_x_extd, text_embeddings[['revText', 'reviewerID', 'asin']], how='outer', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
test_x_extd.drop(['asin','reviewerID'], axis=1, inplace=True)
test_x_extd = test_x_extd[test_x_extd['user'].notna()]
test_x_extd['user'] = test_x_extd['user'].astype('category')
test_x_extd['product'] = test_x_extd['product'].astype('category')
test_x_extd['user'] = test_x_extd['user'].cat.codes
test_x_extd['product'] = test_x_extd['product'].cat.codes
test_x_extd

In [None]:
# Convert DataFrame to 1-dimensional arrays
actual_ratings = (test_y.values.flatten() * 4) + 1
predicted_ratings = (y_pred.flatten())

# Create the final DataFrame
final_df = pd.DataFrame({'user': test_x['user'], 'product': test_x['product'], 'actual_rating': actual_ratings, 'predicted_rating': predicted_ratings})
print(final_df.shape)

# change user and product back to original values (not cat.code)
final_df

# merge to get review length
merged=final_df.merge(test_x_extd, on=['user', 'product'], how='left')
merged

# get review length for each review
merged['review_length'] = merged['revText'].apply(lambda x: len(x.split()))
merged = merged.drop('revText', axis=1)
merged

In [None]:
# Divide reviews into groups based on the review length
merged['group'] = pd.qcut(merged['review_length'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
merged


# see numbers in each group
print(f'Number of Reviews in Low Group: {merged[merged["group"] == "Low"].shape[0]}')
print(f'Number of Reviews in Medium Group: {merged[merged["group"] == "Medium"].shape[0]}')
print(f'Number of Reviews in High Group: {merged[merged["group"] == "High"].shape[0]}')
print(f'Number of Reviews in Very High Group: {merged[merged["group"] == "Very High"].shape[0]}')

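# per-review errors (these columns are averaged per group below)
# note: for a single row, sqrt(squared error) equals the absolute error,
# so the 'RMSE' column is identical to the 'MAE' column at row level
merged['MAE'] = (merged['actual_rating'] - merged['predicted_rating']).abs()
merged['MSE'] = (merged['actual_rating'] - merged['predicted_rating'])**2
merged['RMSE'] = np.sqrt(merged['MSE'])
display(merged)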

#drop std


In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Colour used for each plotted metric
colors = {'RMSE': 'seagreen'}

# Visualize the results: scatter plot of the per-review error against review length
sns.set(style="whitegrid", palette="pastel")
plt.figure(figsize=(15, 6))

# Iterate over each accuracy metric
for metric in ['RMSE']:
    plt.scatter(merged['review_length'], merged[metric], c=colors[metric], label=metric)

plt.xlabel('Review Length', fontsize=24)
plt.ylabel('Accuracy Metric Scores', fontsize=24)
plt.legend(fontsize=24)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
sns.despine()
plt.tight_layout()
plt.grid(True)  # enable the grid before saving so the saved figure matches the displayed one
plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_length_ratings_metrics.pdf")
plt.show()

In [None]:
merged['review_length'].describe()

In [None]:
# see summary statistics for each group
tab = merged.groupby('group').agg({'MAE': ['mean'], 'MSE': ['mean'], 'RMSE': ['mean',]}).round(3)
display(tab)
tab['quartiles'] = ['0-6', '7-31', '32-118', '118-' ]
tab = tab[['quartiles', 'MAE', 'MSE', 'RMSE']]

import tabulate
latex_table = tabulate.tabulate(tab, headers='keys', tablefmt='latex_raw', showindex=False)
print(latex_table)


In [None]:
# get mean rmse for each group
display(merged.groupby('group').agg({'MAE': ['mean'], 'MSE': ['mean'], 'RMSE': ['mean']}))

# apply anova test
import scipy.stats as stats
f_val, p_val = stats.f_oneway(merged[merged['group'] == 'Low']['RMSE'], merged[merged['group'] == 'Medium']['RMSE'], merged[merged['group'] == 'High']['RMSE'], merged[merged['group'] == 'Very High']['RMSE'])
print(f'F-Value: {f_val}')
print(f'P-Value: {p_val}')

# apply post-hoc test (Tukey HSD)
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(merged['RMSE'], merged['group'])
result = mc.tukeyhsd()
print(result)

## Model 3: Ratings + Reviews + Sentiments 

In [None]:
# load data
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_modelling.csv')
text_embeddings = pd.read_csv(r'/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/NCF Data/text_embeddings.csv')
display(amz_data.head())

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))


# Creating User Item Matrix =====================================================
# create user-item matrix
data = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
print("\n\nUser-Item Matrix")
display(data.head())

#### Word Embeddings

In [None]:
# import tensorflow as tf
# import tensorflow_hub as hub

# # load the model for sentence embeddings
# module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# sent_model = hub.load(module_url)
# print(f"Module {module_url} loaded")

# # Embedding review text
# print("Applying the Universal Sentence Encoder on the review text...")
# review_text = amz_data['reviewText']  # Replace with your actual column name
# text_embeddings = sent_model(review_text)
# print("Review text embeddings generated!")
# print(f"Shape of Text Embeddings: {text_embeddings.shape}")

# # attach embeddings to dataframe
# text_embeddings = text_embeddings.numpy()
# text_embeddings = pd.DataFrame(text_embeddings)
# text_embeddings['revText'] = amz_data['reviewText']
# text_embeddings['asin'] = amz_data['asin']
# text_embeddings['reviewerID'] = amz_data['reviewerID']
display(text_embeddings.head(4))

#### Sentiment Scores

In [None]:
# keep only the keys and the VADER score; rename on a new frame to avoid modifying a slice of amz_data
sentiments = amz_data[['reviewerID', 'asin', 'sentiments_vader']].rename(columns={'sentiments_vader': 'sentiments'})
display(sentiments.head(3))

In [None]:
# DATA PREP ====================================

# create a copy of the original matrix to store hidden ratings
x_hidden = data.copy().astype(object)  # object dtype so the 'Hidden' marker can sit alongside numeric ratings
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    x_hidden.iloc[user_id, hidden_indices] = 'Hidden'

# get indices of hidden ratings
test_data = x_hidden.copy()
test_data = test_data.reset_index()
test_data = test_data.melt(id_vars=test_data.columns[0], var_name='book', value_name='rating')
test_data.columns = ['user', 'product', 'rating']
indices_hidden = test_data[test_data['rating'] == 'Hidden'].index

In [None]:
# Melt the DataFrame into a format where each row is a user-item interaction
data_hidden = x_hidden.reset_index()
data_hidden = data_hidden.melt(id_vars=data_hidden.columns[0], var_name='product', value_name='rating')


# change rows with hidden ratings to NaN
data_hidden.iloc[indices_hidden, 2] = np.nan

# rename columns
data_hidden.columns = ['user', 'product', 'rating']

# Filter out the rows where rating is NaN
data_hidden = data_hidden[data_hidden['rating'].notna()]

# add sentiments
data_hidden = pd.merge(data_hidden, sentiments, how='outer', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
data_hidden.drop(['asin','reviewerID'], axis=1, inplace=True)

# add text embeddings to the data (match user and product to the embeddings)
data_hidden = pd.merge(data_hidden, text_embeddings, how='outer', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
data_hidden.drop(['revText', 'asin','reviewerID'], axis=1, inplace=True)

# The outer merges re-introduce pairs without a rating (e.g. the hidden ones); filter them out again
data_hidden = data_hidden[data_hidden['rating'].notna()]

# Convert user and item to categorical
data_hidden['user'] = data_hidden['user'].astype('category')
data_hidden['product'] = data_hidden['product'].astype('category')

# see what the data looks like
display(data_hidden.head(4))
print("Data is in format: user, product, rating, sentiments, text embeddings.\nIt is ready to be partitioned into training and testing sets.")

#### Train and Test Splits
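
The split in the next cell rescales ratings from the 1-5 scale to [0, 1] with `(r - 1) / 4` before training, and the evaluation later inverts this with `p * 4 + 1`. A minimal sketch of that round trip, assuming ratings are bounded by 1 and 5; the helper names `scale_rating` and `unscale_rating` are illustrative and are not defined anywhere in this notebook:

```python
import numpy as np

RATING_MIN, RATING_MAX = 1.0, 5.0  # assumed bounds of the rating scale

def scale_rating(r):
    """Map ratings from [1, 5] to [0, 1]."""
    return (np.asarray(r, dtype=np.float64) - RATING_MIN) / (RATING_MAX - RATING_MIN)

def unscale_rating(p):
    """Map model outputs from [0, 1] back to [1, 5]."""
    return np.asarray(p, dtype=np.float64) * (RATING_MAX - RATING_MIN) + RATING_MIN

ratings = np.array([1, 2, 3, 4, 5])
scaled = scale_rating(ratings)      # values in [0, 1]
restored = unscale_rating(scaled)   # back on the 1-5 scale
```

Because the model's output layer is linear, predictions can fall slightly outside [0, 1]; clipping the unscaled values with `np.clip(pred, 1, 5)` is one optional way to keep them on the rating scale.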

In [None]:
# TEST AND TRAIN DATA ====================================

# Prepare the data - training
train_x = data_hidden[['user', 'product']].apply(lambda x: x.cat.codes)
train_y = data_hidden['rating'].astype(np.float64)
train_y = (train_y - 1) / 4

# add text embeddings to the training data (merge on index)
train_x = pd.merge(train_x, data_hidden, how='outer', left_index=True, right_index=True)
train_x.drop(['user_y', 'product_y', 'rating'], axis=1, inplace=True)
train_x.rename(columns={'user_x': 'user', 'product_x': 'product'}, inplace=True)
train_x


# Prepare the data - testing
copy = data.copy()
copy = copy.reset_index()
copy = copy.melt(id_vars=copy.columns[0], var_name='product', value_name='rating')
copy.columns = ['user', 'product', 'rating']
test_x = copy.iloc[indices_hidden, 0:2]

# add sentiments to the testing data
test_x = pd.merge(test_x, sentiments, how='left', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
test_x.drop(['asin','reviewerID'], axis=1, inplace=True)

# add text embeddings to the testing data (merge on user and product)
test_x = pd.merge(test_x, text_embeddings, how='left', left_on=['user', 'product'], right_on=['reviewerID', 'asin'])
test_x.drop(['revText', 'asin','reviewerID'], axis=1, inplace=True)
test_x['user'] = test_x['user'].astype('category')
test_x['product'] = test_x['product'].astype('category')

# use cat codes to convert to numerical (for user and product)
test_x['user'] = test_x['user'].cat.codes
test_x['product'] = test_x['product'].cat.codes
test_y = copy.iloc[indices_hidden, 2].astype(np.float64)
test_y = (test_y - 1) / 4

# show the data
print("Training Data")
display(train_x.head(3))

print("\nTesting Data")
display(test_x.head(3))


#### NCF Model with Reviews + Sentiments
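
The function in the next cell halves the width of each successive dense layer and concatenates two 50-dimensional ID embeddings, one sentiment score, and the review-text embedding. The sketch below only works out the resulting layer sizes. `dense_widths` and `concat_width` are hypothetical helpers, and the 512-dimensional text embedding is an assumption based on the Universal Sentence Encoder output size.

```python
def dense_widths(n_layers, n_nodes):
    """Widths of the dense stack when each layer has half the units of the previous one."""
    widths = []
    for _ in range(n_layers):
        widths.append(n_nodes)
        n_nodes = max(1, n_nodes // 2)  # integer division keeps unit counts valid
    return widths

def concat_width(id_embedding_dim, text_embedding_dim):
    """Size of the concatenated vector: user + product + 1 sentiment score + text."""
    return 2 * id_embedding_dim + 1 + text_embedding_dim

widths = dense_widths(n_layers=3, n_nodes=512)
input_size = concat_width(id_embedding_dim=50, text_embedding_dim=512)
print(widths, input_size)
```

With three layers and 512 starting units this gives a 512, 256, 128 stack on top of a 613-dimensional input, which is a quick sanity check before launching a long training run.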

In [None]:
# Function to train a neural network model for collaborative filtering with text embeddings
def train_model_3(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_reg, train_x, train_y, text_embedding_dim, seed=2207, train_plot=True, callback=True):
    np.random.seed(seed)
    tf.random.set_seed(seed)  # also seed TensorFlow so weight initialisation is repeatable

    # Inputs
    user_input = Input(shape=(1,), dtype='int32', name='user_input')
    product_input = Input(shape=(1,), dtype='int32', name='product_input')
    text_input = Input(shape=(text_embedding_dim,), dtype='float32', name='text_input') 
    sentiment_input = Input(shape=(1,), dtype='float32', name='sentiment_input')  # Sentiment scores

    # Embeddings
    user_embedding = Embedding(input_dim=len(data_hidden['user'].cat.categories), output_dim=50, name='user_embedding')(user_input)
    product_embedding = Embedding(input_dim=len(data_hidden['product'].cat.categories), output_dim=50, name='product_embedding')(product_input)

    # Flatten
    user_vecs = Flatten()(user_embedding)
    product_vecs = Flatten()(product_embedding)

    # Concatenate user and product embeddings with the sentiment score and text embedding
    input_vecs = Concatenate()([user_vecs, product_vecs, sentiment_input, text_input])

    # Add dense layers
    x = input_vecs
    for i in range(n_layers):
        if i > 0:
            # halve the width of each successive layer; Dense expects an integer unit count
            n_nodes = max(1, n_nodes // 2)
        x = Dense(n_nodes, activation='relu', kernel_regularizer=l2(l2_reg))(x)
        x = Dropout(dropout)(x)
    y = Dense(1)(x)

    # Build the model; the input order must match the lists passed to fit/predict (user, product, text, sentiment)
    model = Model(inputs=[user_input, product_input, text_input, sentiment_input], outputs=y)

    if optimizer == 'adam':
        opt = Adam(learning_rate)
    elif optimizer == 'sgd':
        opt = SGD(learning_rate)
    elif optimizer == 'rmsprop':
        opt = RMSprop(learning_rate)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer}")
    model.compile(optimizer=opt, loss='mse')

    # Define early stopping
    if callback:
        early_stopping = EarlyStopping(monitor='val_loss', patience=10, verbose=1, restore_best_weights=True)


    # Train the model
    if callback:
        history = model.fit([train_x['user'], train_x['product'],  train_x.iloc[:, 3:], train_x['sentiments']], train_y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=[early_stopping])
    else:
        history = model.fit([train_x['user'], train_x['product'],  train_x.iloc[:, 3:], train_x['sentiments']], train_y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

    # Plot training & validation loss values
    if train_plot:
        plt.figure(figsize=(15, 8))
        plt.plot(history.history['loss'], label='Training Loss', marker='o')
        plt.plot(history.history['val_loss'], label='Validation Loss', marker='o')
        # plt.title(f'Model loss for Architecture: {optimizer} optimizer, {n_layers} layers, {n_nodes} nodes, {epochs} epochs, {learning_rate} learning rate, {batch_size} batch size')
        plt.ylabel('Loss', fontsize=40)
        plt.xlabel('Epoch', fontsize = 40)
        plt.xticks(fontsize=36)
        plt.yticks(fontsize=36)
        plt.legend(loc='upper right', fontsize=40)
        plt.tight_layout()
        plt.savefig("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Final Writing/Figures/ncf_training_3.pdf")
        plt.show()
    
    return model, history


In [None]:
# Model 1 - 2 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 1 - 2 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model, history = train_model_3(n_layers=2, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01,  seed=10, train_plot=False, callback=True)

# Active run - 3 layers, 512 nodes, adam, 50 epochs, 0.001 learning rate, 128 batch size, no early stopping
model, history = train_model_3(n_layers=3, n_nodes=512, optimizer='adam', epochs=50, learning_rate=0.001, batch_size=128, dropout=0.5, l2_reg=0.01, train_x=train_x, train_y=train_y, seed=10, train_plot=True, callback=False, text_embedding_dim = text_embeddings.shape[1]-3)

# # Model 2 - 3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 2 - 3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model2, history2 = train_model_3(n_layers=3, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01, seed=10, train_plot=False, callback=True)

# # Model 3 - 4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 3 - 4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model3, history3 = train_model_3(n_layers=4, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01, seed=10, train_plot=False, callback=True)

# # Model 4 - 5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size
# print("Model 4 - 5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
# model4, history4 = train_model_3(n_layers=5, n_nodes=512, optimizer='adam', epochs=200, learning_rate=0.001, batch_size=128, train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, dropout=0.5, l2_reg=0.01, seed=10, train_plot=False, callback=True)

#### Training Results

In [None]:
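# Which model had the lowest validation loss?
# (history2, history3 and history4 exist only if the commented-out runs in the previous cell are executed)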
print("Model 1 Validation Loss: ", min(history.history['val_loss']))
print("Model 2 Validation Loss: ", min(history2.history['val_loss']))
print("Model 3 Validation Loss: ", min(history3.history['val_loss']))
print("Model 4 Validation Loss: ", min(history4.history['val_loss']))

In [None]:
# Visualize training and validation loss for all models
plt.figure(figsize=(20, 8))

# Training Loss
plt.subplot(1, 3, 1)
plt.plot(history.history['loss'], label='Model 1', marker='o', color = 'b')
plt.plot(history2.history['loss'], label='Model 2', marker='o', color = 'g')
plt.plot(history3.history['loss'], label='Model 3', marker='o', color = 'r')
plt.plot(history4.history['loss'], label='Model 4', marker='o', color = 'y')
plt.title('Model Training Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Validation Loss
plt.subplot(1, 3, 2)
plt.plot(history.history['val_loss'], label='Model 1', marker='o', color = 'b')
plt.plot(history2.history['val_loss'], label='Model 2', marker='o', color = 'g')
plt.plot(history3.history['val_loss'], label='Model 3', marker='o', color = 'r')
plt.plot(history4.history['val_loss'], label='Model 4', marker='o', color = 'y')
plt.title('Model Validation Loss for different architectures', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')

# Plot validation and training loss on same plot
plt.subplot(1, 3, 3)
plt.plot(history.history['loss'], label='Model 1 Train', marker='o', color = 'b')
plt.plot(history.history['val_loss'], label='Model 1 Validation', marker='o', color = 'b')
plt.plot(history2.history['loss'], label='Model 2 Train', marker='o', color = 'g')
plt.plot(history2.history['val_loss'], label='Model 2 Validation', marker='o', color = 'g')
plt.plot(history3.history['loss'], label='Model 3 Train', marker='o', color = 'r')
plt.plot(history3.history['val_loss'], label='Model 3 Validation', marker='o', color = 'r')
plt.plot(history4.history['loss'], label='Model 4 Train', marker='o', color = 'y')
plt.plot(history4.history['val_loss'], label='Model 4 Validation', marker='o', color = 'y')
plt.title('Model Training and Validation Loss', weight='bold', size=12)
plt.ylabel('Loss', weight='bold', size=12)
plt.xlabel('Epoch', weight='bold', size=12)
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

# Print models
print("Model 1: 2 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
print("Model 2: 3 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
print("Model 3: 4 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")
print("Model 4: 5 layers, 512 nodes, adam, 200 epochs, 0.001 learning rate, 128 batch size")

#### Hyperparameter Tuning
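
The grid in the next cell is large: 5 × 4 × 2 × 3 × 3 × 3 × 4 × 3 = 12,960 combinations, each of which needs a full training run. If that is too expensive, one common alternative is to evaluate a random subset of the grid. A minimal sketch, assuming the same value lists as the grid search cell; `sample_grid` is a hypothetical helper and not something this notebook defines:

```python
import itertools
import random

grid = {
    "n_layers": [1, 2, 3, 6, 8],
    "n_nodes": [128, 256, 512, 1024],
    "optimizer": ["adam", "sgd"],
    "epochs": [50, 150, 300],
    "learning_rate": [0.001, 0.01, 0.0001],
    "batch_size": [32, 64, 128],
    "dropout": [0, 0.01, 0.05, 0.08],
    "l2_reg": [0.01, 0.001, 0.0001],
}

def sample_grid(grid, n_samples, seed=0):
    """Draw n_samples distinct hyperparameter settings from the full grid."""
    combos = list(itertools.product(*grid.values()))
    rng = random.Random(seed)
    chosen = rng.sample(combos, k=min(n_samples, len(combos)))
    return [dict(zip(grid.keys(), combo)) for combo in chosen]

# Size of the exhaustive grid
total = 1
for values in grid.values():
    total *= len(values)

candidates = sample_grid(grid, n_samples=20)
print(total, len(candidates))
```

Each sampled dictionary could then be unpacked into a training call in place of the exhaustive loop; how many samples are enough is a judgment call that depends on the available compute.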

In [None]:
import itertools

# Grid Search Parameters
n_layers = [1,2,3,6,8] 
n_nodes = [128,256,512,1024] 
optimizer = ['adam', 'sgd']
epochs = [50,150,300] 
learning_rate = [0.001, 0.01,  0.0001] 
batch_size = [32,64,128] 
dropout = [0, 0.01, 0.05, 0.08]
l2_regs = [0.01, 0.001, 0.0001]  # not named `l2`, which would shadow the Keras l2 regularizer used in train_model_3
print(f"Number of combinations: {len(n_layers) * len(n_nodes) * len(optimizer) * len(epochs) * len(learning_rate) * len(batch_size)* len(dropout)* len(l2_regs)}")

def grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs, train_x, train_y):
    # Initialize best parameters and best model variables
    best_params = None
    best_model = None
    best_score = None

    # Generate all possible combinations of hyperparameters
    param_combinations = itertools.product(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs)

    # Loop through all combinations
    for combination in param_combinations:
        # Unpack the combination (names differ from the function arguments to avoid shadowing them)
        n_layer, n_node, opt, epoch, lr, bs, drop, l2_reg = combination

        # Train the model (train_model_3 is the Ratings + Reviews + Sentiments model defined above)
        model, history = train_model_3(n_layer, n_node, opt, epoch, lr, bs, drop, l2_reg, train_x, train_y, train_plot=False, seed=10, text_embedding_dim = text_embeddings.shape[1]-3, callback=True)

        # Evaluate the model - min val loss
        min_loss = min(history.history['val_loss'])

        # Check if this model is better than the previous best
        if best_score is None or min_loss < best_score:
            best_score = min_loss
            best_params = combination
            best_model = model

    return best_params, best_model


# run grid search
best_params, best_model = grid_search(n_layers, n_nodes, optimizer, epochs, learning_rate, batch_size, dropout, l2_regs, train_x, train_y)
print(f"Best Parameters: {best_params}")

In [None]:
# fit best model 
best_model, history = train_model_3(n_layers=best_params[0], n_nodes=best_params[1], optimizer=best_params[2], epochs=best_params[3], learning_rate=best_params[4], batch_size=best_params[5], dropout=best_params[6], l2_reg=best_params[7], train_x=train_x, train_y=train_y, text_embedding_dim = text_embeddings.shape[1]-3, train_plot=True, callback=True, seed=10)

#### Evaluation
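
The evaluation cell reports MAE, MSE, and RMSE on the 1-5 scale. As a reference for how the three relate, the sketch below computes them directly with NumPy on made-up ratings (the arrays are illustrative and are not results from this notebook). RMSE is the square root of the *mean* squared error; taking the square root of each squared error row by row just returns the absolute error.

```python
import numpy as np

actual = np.array([5.0, 4.0, 3.0, 5.0])     # made-up ratings
predicted = np.array([4.5, 4.0, 4.0, 3.0])  # made-up predictions

errors = actual - predicted
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root of the mean squared error

# Row-wise sqrt of the squared error is identical to the absolute error.
rowwise = np.sqrt(errors ** 2)
print(mae, mse, rmse)
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off, which is why both are worth reporting.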

In [None]:
# MODEL EVALUATION ====================================
# Predict the ratings
y_pred = best_model.predict([test_x['user'], test_x['product'], test_x.iloc[:, 3:], test_x['sentiments']])

# Rescale the predictions back to the 1-5 range
y_pred = y_pred * 4 + 1

# set predictions and actual ratings to variables
hidden_ratings_array = (np.array(test_y)*4 + 1)
predicted_ratings_array = np.array(y_pred).flatten()

# Rating predictions
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)
print("\nRating Metrics")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

# save results to csv
# built-in round() works whether the metrics are NumPy scalars or plain Python floats
results = pd.DataFrame({'MAE': [round(mae, 3)], 'MSE': [round(mse, 3)], 'RMSE': [round(rmse, 3)]})
results.to_csv("Data/Results/NCF_results_3.csv", index=False)
results

## Model 4: Naive Model

We want a **naive benchmark model to compare with our NCF model**. The benchmark predicts the same rating for every user-item pair: the most common rating in the training set. For example, if the most common training rating were 4, the benchmark would predict 4 for all user-item pairs. In this notebook the constant prediction is 5.

***TLDR***: The benchmark model, which predicts a constant value (5) for all ratings, outperforms the NCF model on the rating metrics and on some classification metrics. The NCF model gives reasonable results but likely needs further tuning to improve its rating predictions.
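
As a sketch of how such a baseline could be derived from data rather than hard-coded, the snippet below picks the most frequent training rating and scores it against held-out ratings. The arrays are toy values and `majority_rating` is a hypothetical helper; the notebook itself simply fills the prediction array with 5.

```python
import numpy as np

def majority_rating(train_ratings):
    """Return the most frequent rating in the training data."""
    values, counts = np.unique(train_ratings, return_counts=True)
    return values[np.argmax(counts)]

train_ratings = np.array([5, 4, 5, 3, 5, 4, 1])  # toy training ratings
test_ratings = np.array([5.0, 4.0, 2.0, 5.0])    # toy held-out ratings

constant = majority_rating(train_ratings)
baseline_pred = np.full_like(test_ratings, constant)  # same prediction for every pair

mae = np.mean(np.abs(test_ratings - baseline_pred))
rmse = np.sqrt(np.mean((test_ratings - baseline_pred) ** 2))
print(constant, mae, rmse)
```

A constant predictor like this can be hard to beat when ratings are heavily skewed toward one value, so it is a useful floor for any learned model.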


In [None]:
# Benchmark Model 1 (make it all 5s) ====================================
benchmark_results_1 = predicted_ratings_array.copy()
benchmark_results_1.fill(5)

In [None]:
# evaluate benchmark model 1 ====================================
predicted_ratings_array = benchmark_results_1

# Rating predictions
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)
print("\nRating Metrics")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

# save results to csv
# built-in round() works whether the metrics are NumPy scalars or plain Python floats
results = pd.DataFrame({'MAE': [round(mae, 3)], 'MSE': [round(mse, 3)], 'RMSE': [round(rmse, 3)]})
results.to_csv("Data/Results/NCF_results_benchmark.csv", index=False)
results