# User Based Collaborative Filtering
## Algorithm Summary

User-based collaborative filtering is a memory-based (neighbourhood) algorithm for making recommendations. It predicts a user's rating of an item from the ratings that similar users gave to that item, where the similarity between users is calculated from their rating histories. It is also known as user-user collaborative filtering.

1. **Load the data**
- data is provided in a dataframe where each row is a review

2. **Create a user-item matrix**
- convert dataframe into user-item matrix where each row is a user and each column is an item

3. **Create test and train set**
- hide $N$ ratings per user from the training set and use the hidden ratings as the test set

4. **Calculate user similarity**
- using training set, calculate the similarity between users using cosine similarity

5. **Make predictions**
- for each user and each item in the test set, calculate the similarity-weighted average of the ratings that the most similar users gave to that item

6. **Evaluate the model**
- calculate the predictive accuracy of the model using RMSE, MSE and MAE
- calculate the Top-N metrics of the model using NDCG and Hit Rate
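
Steps 4 and 5 can be written compactly. The notation here is an assumption introduced for illustration: $r_{u,i}$ is user $u$'s rating of item $i$ (0 if unrated), $s(u,v)$ is the cosine similarity between users $u$ and $v$, and $N_k(u;i)$ is the set of the $k$ users most similar to $u$ among those who rated item $i$. This is a common textbook formulation and not a transcription of the code in the cells that follow.

$$
s(u,v) = \frac{\sum_i r_{u,i}\, r_{v,i}}{\sqrt{\sum_i r_{u,i}^2}\,\sqrt{\sum_i r_{v,i}^2}},
\qquad
\hat{r}_{u,i} = \frac{\sum_{v \in N_k(u;i)} s(u,v)\, r_{v,i}}{\sum_{v \in N_k(u;i)} \lvert s(u,v) \rvert}
$$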

## (1) Manual / From Fundamentals

In [1]:
# %reset -f
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

In [2]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_modelling.csv')
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("Shape: ", x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
1,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
2,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189
Shape:  (3668, 3249)
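
To make the pivot step concrete, here is a minimal sketch on a toy review table. The column names follow the notebook's data, but the users, products and ratings are hypothetical values chosen for illustration.

```python
import pandas as pd

# toy reviews (hypothetical values); one row per review
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u3", "u3"],
    "asin":       ["p1", "p2", "p2", "p1", "p3"],
    "overall":    [5.0, 3.0, 4.0, 2.0, 5.0],
})

# rows = users, columns = items; cells without a review become 0
toy_matrix = reviews.pivot_table(index="reviewerID", columns="asin", values="overall").fillna(0)
print(toy_matrix)
```

Two things are worth keeping in mind with this layout: `pivot_table` averages duplicate (user, item) pairs by default, and filling with 0 only works as an "unrated" marker because real ratings are strictly positive.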


### Train and Test Split

In [3]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0
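
The same hold-out idea can be wrapped in a small function that works on a plain NumPy array. This is a sketch, not the notebook's code: `hide_ratings` is a hypothetical helper, and it uses `np.random.default_rng` instead of the global `np.random.seed`, so it would not reproduce the exact indices shown in this notebook.

```python
import numpy as np

def hide_ratings(matrix, n_hide, seed=0):
    """Return (train, hidden_idx): a copy of `matrix` with n_hide rated cells per row set to 0."""
    rng = np.random.default_rng(seed)
    train = matrix.copy()
    hidden_idx = np.empty((matrix.shape[0], n_hide), dtype=int)
    for u in range(matrix.shape[0]):
        rated = np.flatnonzero(matrix[u] > 0)  # column positions this user has rated
        if rated.size < n_hide:
            raise ValueError(f"row {u} has only {rated.size} ratings")
        hidden_idx[u] = rng.choice(rated, size=n_hide, replace=False)
        train[u, hidden_idx[u]] = 0            # hide the selected ratings
    return train, hidden_idx

# toy matrix (hypothetical values): 3 users x 4 items, 0 = unrated
ratings = np.array([[5, 0, 3, 4],
                    [0, 2, 4, 1],
                    [3, 5, 0, 2]], dtype=float)
train, hidden_idx = hide_ratings(ratings, n_hide=2, seed=2207)
print(hidden_idx)
```

Raising an error when a user has fewer than `n_hide` ratings makes the assumption explicit; in this dataset every user has at least 13 reviews, so hiding 3 is always possible.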

In [4]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)

# see updated matrix with hidden ratings
print("Updated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)

Indices of Ratings per user 
 [[2807 2258 2647]
 [2111 1398 1498]
 [ 200 1102 1089]
 ...
 [2353 1482  185]
 [ 639 2206 3123]
 [ 193  533  406]]
Indices of Ratings per User joined [2807 2258 2647 ...  193  533  406]
Updated Matrix with Hidden Ratings


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Original Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Similarity Matrix

In [5]:
# compute the user-user cosine similarity matrix from the training matrix (hidden ratings set to 0)
sim_mat_cos = cosine_similarity(x_hidden).round(5)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.     , 0.0299 , 0.     , ..., 0.     , 0.03882, 0.     ],
       [0.0299 , 1.     , 0.     , ..., 0.     , 0.     , 0.     ],
       [0.     , 0.     , 1.     , ..., 0.06884, 0.     , 0.     ],
       ...,
       [0.     , 0.     , 0.06884, ..., 1.     , 0.     , 0.     ],
       [0.03882, 0.     , 0.     , ..., 0.     , 1.     , 0.     ],
       [0.     , 0.     , 0.     , ..., 0.     , 0.     , 1.     ]])
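
For reference, cosine similarity between rows can be computed directly from its definition. The sketch below uses a hypothetical helper `cosine_sim_rows` and toy values; it is meant to show what the similarity matrix contains, not to replace `sklearn`'s implementation.

```python
import numpy as np

def cosine_sim_rows(m):
    """Cosine similarity between every pair of rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # all-zero rows keep similarity 0 instead of dividing by zero
    unit = m / norms          # each row scaled to unit length
    return unit @ unit.T      # dot products of unit vectors = cosines

# toy user-item matrix (hypothetical values), 0 = unrated
toy = np.array([[5, 0, 3],
                [4, 0, 0],
                [0, 2, 0]], dtype=float)
sim = cosine_sim_rows(toy)
print(sim.round(5))
```

Because unrated cells are stored as 0, only co-rated items contribute to the numerator, while every rated item contributes to the norms. Users with no items in common therefore get a similarity of exactly 0, which is why the matrix above is mostly zeros.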

### Grid Search

In [None]:
# candidate neighbourhood sizes
k_values = [5, 10, 20, 40]
rmse_list = []

# get predicted ratings for all users for each k, calculate RMSE and store it in a list
for k in k_values:
    print("K = ", k)
    # get top k similar users (position 0 of the sorted order is assumed to be the user themselves)
    top_k_users = np.argsort(-sim_mat_cos, axis=1)[:, 1:k+1]
    print("Top K Similar Users")
    print(top_k_users)
    
    # get predicted ratings for all users
    predicted_ratings = np.zeros(x_hidden.shape)
    for i in range(x_hidden.shape[0]):
        for j in range(x_hidden.shape[1]):
            if x_hidden.iloc[i, j] == 0:
                # unrated cell: plain mean of the neighbours' entries (zeros included)
                predicted_ratings[i, j] = np.mean(x_hidden.iloc[top_k_users[i], j])
            else:
                predicted_ratings[i, j] = x_hidden.iloc[i, j]
    print("Predicted Ratings")
    print(predicted_ratings)
    
    # calculate RMSE over every cell of the matrix (to_numpy so the sum is a scalar, not a per-column Series)
    rmse = np.sqrt(np.sum((predicted_ratings - x.to_numpy())**2) / (x.shape[0] * x.shape[1]))
    print("RMSE: ", rmse)
    
    # store RMSE in a list
    rmse_list.append(rmse)
    print("\n")

# see best k
best_k = k_values[int(np.argmin(rmse_list))]
print("Best K: ", best_k, " RMSE: ", min(rmse_list))
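
The search over $k$ can also be sketched in vectorised form, scoring each candidate only on the hidden ratings and not on the whole matrix. The helper names (`predict_mean_of_neighbours`, `rmse_at_hidden`) and the toy data are assumptions for illustration, and the predictor is the same plain mean of neighbours used in the cell above.

```python
import numpy as np

def predict_mean_of_neighbours(train, sim, k):
    """Predict every cell as the plain mean of the k most similar users' entries (zeros included)."""
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)            # never pick a user as their own neighbour
    top_k = np.argsort(-s, axis=1)[:, :k]   # (n_users, k) neighbour indices
    return train[top_k].mean(axis=1)        # (n_users, n_items)

def rmse_at_hidden(pred, truth, hidden_idx):
    """RMSE restricted to the hidden (row, column) positions."""
    rows = np.arange(truth.shape[0])[:, None]
    diff = pred[rows, hidden_idx] - truth[rows, hidden_idx]
    return float(np.sqrt(np.mean(diff ** 2)))

# toy data (hypothetical values): 4 users x 4 items, one hidden rating per user
truth = np.array([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)
hidden_idx = np.array([[0], [1], [2], [3]])
train = truth.copy()
train[np.arange(4)[:, None], hidden_idx] = 0

# cosine similarity between users (assumes no all-zero rows in this toy example)
unit = train / np.linalg.norm(train, axis=1, keepdims=True)
sim = unit @ unit.T

scores = {k: rmse_at_hidden(predict_mean_of_neighbours(train, sim, k), truth, hidden_idx)
          for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Masking the diagonal avoids relying on each user appearing first in their own sorted neighbour list, which can fail when two users have identical rating vectors.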

### Prediction Matrix

In [6]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# set k to 40
k = 40

# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_users_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
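        # rated_users_indices holds the column positions of the items this user has rated,
        # so the similarities below are looked up by item position and do not depend on product_id
        similarity_i_j = sim_mat_cos[user_id, rated_users_indices]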
        ratings = user_ratings[0, rated_users_indices]
        
        # sort by similarity and select top k
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        ratings = ratings[sorted_indices]

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # make predicted rating mean of user's ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,...,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856,4.719856
A100WO06OQR8BQ,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,...,4.456676,4.456676,4.456676,4.456676,4.456676,4.456676,5.000000,4.456676,4.456676,4.456676
A1027EV8A9PV1O,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,...,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667,4.666667
A103KKI1Y4TFNQ,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,...,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954,4.539954
A1047P9FLHTDZJ,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,...,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351,4.580351
AZW0HVDKOXGN9,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,...,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582,4.797582
AZX2RDN9YXZAE,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,...,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667,3.416667
AZY157FF14CSL,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
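
For comparison, here is a minimal sketch of the textbook user-based prediction, in which the neighbours are the other users who actually rated the target item. `predict_user_based` is a hypothetical helper and the toy matrices are made-up values; the fallback to the user's own mean mirrors the fallback used in the cell above.

```python
import numpy as np

def predict_user_based(train, sim, user, item, k):
    """Similarity-weighted mean of the ratings the k nearest raters gave to `item`."""
    raters = np.flatnonzero(train[:, item] > 0)   # users who rated this item
    raters = raters[raters != user]
    if raters.size:
        top = raters[np.argsort(sim[user, raters])[::-1][:k]]  # k most similar raters
        weights = sim[user, top]
        if np.abs(weights).sum() > 0:
            return float(weights @ train[top, item] / np.abs(weights).sum())
    # fallback: mean of the user's own observed ratings (0.0 if there are none)
    own = train[user][train[user] > 0]
    return float(own.mean()) if own.size else 0.0

# toy data (hypothetical values): 3 users x 4 items, last item rated by nobody
train = np.array([[5, 3, 0, 0],
                  [4, 0, 4, 0],
                  [1, 1, 2, 0]], dtype=float)
sim = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.1],
                [0.2, 0.1, 1.0]])

p_two_neighbours = predict_user_based(train, sim, user=0, item=2, k=2)  # (0.8*4 + 0.2*2) / 1.0
p_one_neighbour = predict_user_based(train, sim, user=0, item=2, k=1)   # only user 1 is used
p_fallback = predict_user_based(train, sim, user=0, item=3, k=2)        # no raters: own mean
print(p_two_neighbours, p_one_neighbour, p_fallback)
```

With this formulation the prediction varies by item, because both the neighbour set and the ratings being averaged depend on who rated that item.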


### Evaluation (Predictive Accuracy)

Now evaluate how good the predictions are vs the hidden ratings
- ***step 1***: identify the hidden ratings indices
- ***step 2***: extract hidden ratings indices and corresponding predicted ratings indices
- ***step 3***: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values)

In [7]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users to append hidden ratings
hidden_ratings_arrays = []

# Loop through users to append hidden ratings arrays
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(predic_matrix.shape[0]):
    user_predicted_ratings = predic_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Hidden Ratings: [3. 5. 5. ... 4. 4. 1.]
Corresponding Predicted Ratings: [4.71985586 4.71985586 4.71985586 ... 4.18181818 4.18181818 4.18181818]
Using sklearn
Mean Absolute Error (MAE): 0.5911754030595364
Mean Squared Error (MSE): 0.9733244526178612
Root Mean Squared Error (RMSE): 0.9865720716794395


Manually
Mean Absolute Error (MAE): 0.5911754030595364
Mean Squared Error (MSE): 0.9733244526178612
Root Mean Squared Error (RMSE): 0.9865720716794395
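
Steps 1 to 3 can also be done without explicit loops. The sketch below uses `np.take_along_axis` to pull the hidden positions out of both matrices; all values are toy numbers chosen so the metrics are easy to check by hand.

```python
import numpy as np

# toy data (hypothetical values): 2 users x 4 items, 2 hidden ratings per user
truth = np.array([[5.0, 0.0, 3.0, 4.0],
                  [0.0, 2.0, 4.0, 1.0]])
pred = np.array([[4.5, 3.0, 3.5, 4.0],
                 [2.0, 2.5, 3.0, 1.0]])
hidden_idx = np.array([[0, 2],
                       [1, 3]])

# pick, for each row, the columns listed in hidden_idx
y_true = np.take_along_axis(truth, hidden_idx, axis=1).ravel()
y_pred = np.take_along_axis(pred, hidden_idx, axis=1).ravel()

err = y_true - y_pred
mae = np.mean(np.abs(err))   # mean absolute error
mse = np.mean(err ** 2)      # mean squared error
rmse = np.sqrt(mse)          # root mean squared error
print(mae, mse, rmse)
```

Here the errors are 0.5, -0.5, -0.5 and 0, so MAE is 0.375 and MSE is 0.1875. RMSE is never smaller than MAE, and the gap between them grows when a few errors are much larger than the rest.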


In [8]:
# round to 2 decimal places
mae = round(mae, 2)
mse = round(mse, 2)
rmse = round(rmse, 2)

# save results to csv
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
results.to_csv("Data/Results/UBCF_results_1.csv", index=False)

### Evaluation (Top-N Metrics)

In [9]:
# turn matrix into a dataframe with user and product, rating columns
preds_series = predic_matrix.stack().reset_index().rename(columns={0: 'rating'}).sort_values(by=['asin', 'reviewerID'])
preds_series = preds_series['rating'].reset_index(drop=True)
preds_series


# getting a dataframe with interactions and ratings
data = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
data_mat = data.copy()
data_mat = data_mat.reset_index()
data_mat = data_mat.melt(id_vars=data_mat.columns[0], var_name='product', value_name='rating')
data_mat.columns = ['user', 'product', 'rating']
data_mat['user'] = data_mat['user'].astype('category')
data_mat['product'] = data_mat['product'].astype('category')

# data_mat['user'] = data_mat['user'].cat.codes
# data_mat['product'] = data_mat['product'].cat.codes
display(data_mat.head(3))

# create a completed dataframe
completed = data_mat.copy()
nan_rows = completed[completed['rating'].isnull()]

# for nan_rows, replace the rating with the predicted rating
completed.loc[nan_rows.index, 'rating'] = preds_series[nan_rows.index]

# see original data with user item interactions
print("User Item Interactions with Ratings")
display(data_mat.head(3))

# see data with predictions
print("\nUser Item Interactions with Predicted Ratings")
display(completed.head(3))

# details on completed dataframe
print('\n\nNumber of Rows: ', completed.shape[0])
print('Number of Columns: ', completed.shape[1])
print('Number of Unique Users: ', len(completed['user'].unique()))
print('Number of Unique Products: ', len(completed['product'].unique()))

Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,
1,A100WO06OQR8BQ,767834739,
2,A1027EV8A9PV1O,767834739,


User Item Interactions with Ratings


Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,
1,A100WO06OQR8BQ,767834739,
2,A1027EV8A9PV1O,767834739,



User Item Interactions with Predicted Ratings


Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,4.719856
1,A100WO06OQR8BQ,767834739,4.456676
2,A1027EV8A9PV1O,767834739,4.666667




Number of Rows:  11917332
Number of Columns:  3
Number of Unique Users:  3668
Number of Unique Products:  3249


#### Execute for One User

In [10]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

In [11]:
# get training data for user (i.e., remove the hidden ratings and keep only the observed ratings)
train_data = x_hidden.copy()
train_data = train_data.stack().reset_index().rename(columns={0: 'rating'}).sort_values(by=['reviewerID', 'asin'])

# keep only observed ratings (drop the zero-filled cells, i.e. unrated or hidden items)
train_data = train_data[train_data['rating'] != 0]


# convert the user and product columns to categorical
train_data['reviewerID'] = train_data['reviewerID'].astype('category')
train_data['asin'] = train_data['asin'].astype('category')

# train_data['reviewerID'] = train_data['reviewerID'].cat.codes
# train_data['asin'] = train_data['asin'].cat.codes
train_data.rename(columns={'reviewerID': 'user', 'asin':'product' }, inplace=True)
train_data


Unnamed: 0,user,product,rating
993,A100RH4M1W1DF0,B001NJJOCW,5.0
1189,A100RH4M1W1DF0,B003SIOXTA,5.0
1203,A100RH4M1W1DF0,B003ZXCAAC,5.0
1504,A100RH4M1W1DF0,B00AA8WPGY,5.0
1816,A100RH4M1W1DF0,B00HZ6X8QU,5.0
...,...,...,...
11914207,AZYU8M791SIFC,B000066TS5,4.0
11914303,AZYU8M791SIFC,B0000C6EDL,5.0
11914401,AZYU8M791SIFC,B0009A4EV2,3.0
11914662,AZYU8M791SIFC,B000QW9D14,3.0


In [12]:
# set N - number of recommendations
N = 10000

# get interactions for user 1 used for training
train_x_user_1 = train_data[train_data['user'] == 'A100RH4M1W1DF0']
train_x_user_1

# Get interactions for User 1 (including ratings); copy so that columns can be added safely
user_1 = completed[completed['user'] == 'A100RH4M1W1DF0'].copy()
print("Number of Interactions for User 1: ", user_1.shape[0])

# Identify liked items for User 1 (rating above 3.5)
liked_items = user_1[user_1['rating'] > 3.5]
print("Number of Liked Items for User 1: ", liked_items.shape[0])

# get items that were hidden for user 1 (get product names)
product_ids_hidden = x.iloc[0, indices_tracker[0]].index
product_ids_hidden = product_ids_hidden.tolist()
print("Number of Hidden Items for User 1: ", len(product_ids_hidden))

# get ratings for hidden items and predicted ratings - for user 1 (put in a dataframe)
hidden_ratings = x.iloc[0, indices_tracker[0]].values
predicted_ratings = predic_matrix.iloc[0, indices_tracker[0]].values
hidden_ratings_df = pd.DataFrame({'product': product_ids_hidden, 'hidden_rating': hidden_ratings, 'predicted_rating': predicted_ratings})
hidden_ratings_df

# set  threshold for recommendations
threshold = 3.
# create a label column for hidden ratings (1 = liked, 0 = not liked)
hidden_ratings_df['label'] = hidden_ratings_df['hidden_rating'].apply(lambda x: 1 if x > threshold else 0)
hidden_ratings_df

# add label for used interactions (add 1 to all interactions that exist in training data)
user_1['used_ind'] = 0
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 1] in list(train_x_user_1['product']):
        user_1.iloc[i, 3] = 1

# count how many interactions are in train_x
print("Number of Interactions in Train Set for User 1: ", train_x_user_1.shape[0])

# count how many 1 in completed_user_1
print("Number of Interactions in Completed User 1: ", user_1[user_1['used_ind'] == 1].shape[0])

# add label liked for completed_user_1
user_1['liked'] = user_1['rating'].apply(lambda x: 1 if x > threshold else 0)


# add a label column to user_1_top_n: test_ind (if the product is in hidden_ratings_df, then 1, else 0)
user_1['test_ind'] = 0
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 1] in list(hidden_ratings_df['product']):
        user_1.iloc[i, 5] = 1

# for all records where test_ind = 1, replace the hidden_rating with predicted_rating
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 5] == 1:
        user_1.iloc[i, 2] = hidden_ratings_df[hidden_ratings_df['product'] == user_1.iloc[i, 1]]['predicted_rating'].values[0] 

# get top N recommendations for user 1 - exclude items where used_ind = 1
user_1_top_n = user_1[user_1['used_ind'] == 0]
user_1_top_n = user_1_top_n.sort_values(by='rating', ascending=False)
user_1_top_n = user_1_top_n.head(N)

# count how many 1 in user_1_top_n
print("Number of Items in Top N for User 1 that Were Used and Liked: ", user_1_top_n[user_1_top_n['test_ind'] == 1].shape[0])

# see top N recommendations for user 1
print("\n\nTop N Recommendations for User 1")
display(user_1_top_n)

# Calculate precision@K (top N recommendations)
precision_at_N = user_1_top_n['test_ind'].sum() / N

# Calculate recall@K
recall_at_N = user_1_top_n['test_ind'].sum() / liked_items.shape[0]

# calculate F1 score
f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

print(f"Precision@{N}: {precision_at_N:.4f}")
print(f"Recall@{N}: {recall_at_N:.4f}")
print(f"F1@{N}: {f1_at_N:.4f}")

# collect results in a dataframe
results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
results


Number of Interactions for User 1:  3249
Number of Liked Items for User 1:  3247
Number of Hidden Items for User 1:  3
Number of Interactions in Train Set for User 1:  37
Number of Interactions in Completed User 1:  37
Number of Items in Top N for User 1 that Were Used and Liked:  3


Top N Recommendations for User 1


Unnamed: 0,user,product,rating,used_ind,liked,test_ind
0,A100RH4M1W1DF0,0767834739,4.719856,0,1,0
7930216,A100RH4M1W1DF0,B00N9551WA,4.719856,0,1,0
7893536,A100RH4M1W1DF0,B00N23Z8Q8,4.719856,0,1,0
7897204,A100RH4M1W1DF0,B00N2RRWQ8,4.719856,0,1,0
7900872,A100RH4M1W1DF0,B00N4ABT1C,4.719856,0,1,0
...,...,...,...,...,...,...
3939432,A100RH4M1W1DF0,B002BRZ7JE,4.719856,0,1,0
3943100,A100RH4M1W1DF0,B002BRZ852,4.719856,0,1,0
3946768,A100RH4M1W1DF0,B002BRZ8BQ,4.719856,0,1,0
3950436,A100RH4M1W1DF0,B002BRZ8FW,4.719856,0,1,0


Precision@10000: 0.0003
Recall@10000: 0.0009
F1@10000: 0.0005


Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.0003,0.000924,0.000453
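
The three Top-N metrics can be isolated in a small helper that takes a ranked list and a set of relevant items. This is a sketch under one stated assumption: "relevant" here means the held-out items the user liked, which is a common convention but differs from the cell above, where recall is divided by the number of liked items in the completed matrix. The function name and item IDs are hypothetical.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """Precision, recall and F1 of the first k ranked items against a set of relevant items."""
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0   # guard against users with no relevant items
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy ranking (hypothetical item IDs), best recommendation first
ranked = ["p4", "p1", "p7", "p2", "p9"]
relevant = ["p1", "p2", "p3"]
p, r, f1 = precision_recall_f1_at_k(ranked, relevant, k=4)
print(f"Precision@4: {p:.4f}  Recall@4: {r:.4f}  F1@4: {f1:.4f}")
```

In this example two of the top four items are relevant, so precision is 2/4, recall is 2/3 and F1 is 4/7. The explicit guards return 0 instead of dividing by zero when a user has no relevant items.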


In [13]:
# convert to dataframe with columns: user, products
hid = pd.DataFrame(hidden_ratings_ind)
hid['user'] = x.index
hid = hid[['user', 0, 1, 2]]

# convert 0,1,2 to list
hid['products'] = hid.iloc[:, 1:].values.tolist()
hid = hid[['user', 'products']]
hid

Unnamed: 0,user,products
0,A100RH4M1W1DF0,"[2807, 2258, 2647]"
1,A100WO06OQR8BQ,"[2111, 1398, 1498]"
2,A1027EV8A9PV1O,"[200, 1102, 1089]"
3,A103KKI1Y4TFNQ,"[1709, 1650, 1304]"
4,A1047P9FLHTDZJ,"[529, 2277, 2175]"
...,...,...
3663,AZVIQ5SU7XPD5,"[2968, 1354, 147]"
3664,AZW0HVDKOXGN9,"[2964, 3173, 2901]"
3665,AZX2RDN9YXZAE,"[2353, 1482, 185]"
3666,AZY157FF14CSL,"[639, 2206, 3123]"


In [38]:
def evaluate_topN_user(user_id, threshold, N):
    print(f"Evaluating User {user_id}")
    
    # training interactions and the completed (observed + predicted) ratings for this user
    train_x_user = train_data[train_data['user'] == user_id]
    user_data = completed[completed['user'] == user_id].copy()
    
    # items counted as liked: rating above the threshold in the completed data
    liked_items = user_data[user_data['rating'] > threshold]

    # product names of the ratings hidden for this user
    all_ints = x.loc[user_id, :]
    product_names = all_ints.index[hid[hid['user'] == user_id]['products'].values[0]]
    product_ids_hidden = product_names.tolist()

    hidden_ratings = x.loc[user_id, product_names].values
    predicted_ratings = predic_matrix.loc[user_id, product_names].values
    
    hidden_ratings_df = pd.DataFrame({
        'product': product_ids_hidden,
        'hidden_rating': hidden_ratings,
        'predicted_rating': predicted_ratings
    })

    hidden_ratings_df['label'] = hidden_ratings_df['hidden_rating'].apply(lambda x: 1 if x > threshold else 0)

    # flag items used for training so they are excluded from the recommendations
    user_data['used_ind'] = user_data['product'].isin(train_x_user['product'].tolist()).astype(int)
    user_data['liked'] = user_data['rating'].apply(lambda x: 1 if x > threshold else 0)

    # flag the hidden (test) items
    user_data['test_ind'] = user_data['product'].isin(product_ids_hidden).astype(int)

    # for the hidden items, rank by the predicted rating and not the true rating
    for i in range(user_data.shape[0]):
        if user_data.iloc[i, 5] == 1:
            user_data.iloc[i, 2] = hidden_ratings_df[hidden_ratings_df['product'] == user_data.iloc[i, 1]]['predicted_rating'].values[0]

    user_top_n = user_data[user_data['used_ind'] == 0].sort_values(by='rating', ascending=False).head(N)
    display(user_top_n)

    precision_at_N = user_top_n['test_ind'].sum() / N
    recall_at_N = user_top_n['test_ind'].sum() / liked_items.shape[0]

    if precision_at_N + recall_at_N == 0:
        f1_at_N = 0
    else:
        f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

    results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
    return results


In [39]:
# Now you can call evaluate_topN_user with different user_id, threshold, and N
results = evaluate_topN_user(user_id='A214JN9AJNSHCJ', threshold=3.5, N=1000)
results

Evaluating User A214JN9AJNSHCJ


Unnamed: 0,user,product,rating,used_ind,liked,test_ind
999,A214JN9AJNSHCJ,0767834739,5.0,0,1,0
7960559,A214JN9AJNSHCJ,B00NASF4MS,5.0,0,1,0
7923879,A214JN9AJNSHCJ,B00N9543MO,5.0,0,1,0
7927547,A214JN9AJNSHCJ,B00N954O4Q,5.0,0,1,0
7931215,A214JN9AJNSHCJ,B00N9551WA,5.0,0,1,0
...,...,...,...,...,...,...
10634531,A214JN9AJNSHCJ,B015IO2D5M,5.0,0,1,0
10638199,A214JN9AJNSHCJ,B015IO2NO8,5.0,0,1,0
10641867,A214JN9AJNSHCJ,B015IO31U8,5.0,0,1,0
10645535,A214JN9AJNSHCJ,B015IP43MW,5.0,0,1,0


Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.001,0.000308,0.000471


#### Execute for All Users

In [28]:
# get count of users
user_count = len(completed['user'].unique())
counter = 0

# loop through users to get results for each user and save to a dataframe
results = pd.DataFrame()
for user in completed['user'].unique():
    counter += 1
    print(f"User {counter} of {user_count}")
    user_results = evaluate_topN_user(user_id=user, threshold=3, N=10000)
    print(user_results)
    results = pd.concat([results, user_results])
    

results

User 1 of 3668
Evaluating User A100RH4M1W1DF0
   Precision@N  Recall@N      F1@N
0       0.0003  0.000924  0.000453
User 2 of 3668
Evaluating User A100WO06OQR8BQ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000928  0.000453
User 3 of 3668
Evaluating User A1027EV8A9PV1O
   Precision@N  Recall@N      F1@N
0       0.0003  0.000924  0.000453
User 4 of 3668
Evaluating User A103KKI1Y4TFNQ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000925  0.000453
User 5 of 3668
Evaluating User A1047P9FLHTDZJ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000923  0.000453
User 6 of 3668
Evaluating User A105F4XQ9S1NU5
   Precision@N  Recall@N      F1@N
0       0.0003      0.25  0.000599
User 7 of 3668
Evaluating User A105S56ODHGJEK
   Precision@N  Recall@N      F1@N
0       0.0003  0.000924  0.000453
User 8 of 3668
Evaluating User A105XKMQB69VHF
   Precision@N  Recall@N      F1@N
0       0.0003  0.000923  0.000453
User 9 of 3668
Evaluating User A107652KJ8BTTN
   Precision@N  Recall@N  

Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.0003,0.000924,0.000453
0,0.0003,0.000928,0.000453
0,0.0003,0.000924,0.000453
0,0.0003,0.000925,0.000453
0,0.0003,0.000923,0.000453
...,...,...,...
0,0.0003,0.000923,0.000453
0,0.0003,0.000925,0.000453
0,0.0003,0.000925,0.000453
0,0.0003,0.000923,0.000453


In [32]:
# Get the average results for all users
average_results = results.mean()
average_results

Precision@N    0.000300
Recall@N            inf
F1@N           0.000463
dtype: float64

In [33]:
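# Recall@N averages to inf because at least one user has no liked items (division by zero).
# Replace it with a finite value combined from precision and F1:
# 2 * (precision * f1) / (precision + f1), i.e. the harmonic mean of the two.
# This is not the algebraic inverse of the F1 formula, which is recall = f1 * precision / (2 * precision - f1).
average_results['Recall@N'] = 2 * (average_results['Precision@N'] * average_results['F1@N']) / (average_results['Precision@N'] + average_results['F1@N'])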
average_results

Precision@N    0.000300
Recall@N       0.000364
F1@N           0.000463
dtype: float64
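
An alternative way to handle the infinite recall is to exclude non-finite per-user values before averaging. The sketch below shows that on a made-up results table, alongside the algebraic inverse of the F1 formula for a single user. Both `finite_mean` and `recall_from_precision_f1` are hypothetical names, and the numbers are illustrative, not results from this notebook.

```python
import numpy as np
import pandas as pd

# toy per-user results (hypothetical values); the second user has an infinite recall
per_user = pd.DataFrame({
    "Precision@N": [0.2, 0.4, 0.0],
    "Recall@N":    [0.5, np.inf, 0.0],
    "F1@N":        [0.2857, 0.5, 0.0],
})

# treat +/-inf as missing, then average; mean() skips NaN by default
finite_mean = per_user.replace([np.inf, -np.inf], np.nan).mean()
print(finite_mean)

def recall_from_precision_f1(precision, f1):
    """Invert F1 = 2PR / (P + R) for R; only defined when 2P > F1."""
    return f1 * precision / (2 * precision - f1)

# single-user check: P = 1/2 and F1 = 4/7 correspond to R = 2/3
recovered = recall_from_precision_f1(0.5, 4 / 7)
print(recovered)
```

The inverse holds for one user's precision, recall and F1. It does not carry over exactly to averages, because the mean of per-user F1 scores is generally not the F1 of the mean precision and mean recall.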

In [34]:
precision_at_N = average_results['Precision@N']
recall_at_N = average_results['Recall@N']
f1_at_N = average_results['F1@N']

In [36]:
average_results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
average_results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/UBCF_results_1_top10000.csv', index=False)

## (2) Using Packages

In [9]:
# Using packages for UBCF (user-based collaborative filtering)
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split


In [10]:
# load data and change it to User-Item-Rating format
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv', index_col=0)
display(amz_data.head())

x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
A14638TGYH7GD9,2010-10-28,321719816,5.0,even though i use dreamweaver a great deal and...,even though use dreamweav great deal sever boo...,even though use dreamweaver great deal several...,even though use dreamweaver great deal several...,20,11,0.99
A2JMJVNTBL7K7E,2011-04-07,321719816,5.0,i spent several hours on the lesson and i love...,spent sever hour lesson love detail clear inst...,spent several hour lesson love detailed clear ...,spent several hours lesson love detailed clear...,19,8,0.9766
A2BVNVJOFXGZUB,2010-09-26,321719816,5.0,the video is wellpaced and delivered in an und...,video wellpac deliv understand manner allow wo...,video wellpaced delivered understandable manne...,video wellpaced delivered understandable manne...,3,3,0.4939
A14JBDSWKPKTZA,2011-01-08,321719816,5.0,i have had dreamweaver mx2004 since it came ou...,dreamweav mx2004 sinc came back spent year fee...,dreamweaver mx2004 since came back spent year ...,dreamweaver mx2004 since came back spent years...,12,13,0.989
ACJT8MUC0LRF0,2010-10-16,321719816,5.0,if youve been wanting to learn how to create y...,youv want learn creat websit either lack confi...,youve wanting learn create website either lack...,youve wanting learn create website either lack...,39,18,0.9995





User-Item Matrix


reviewerID,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A0380485C177Q6QQNJIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A0685888WB02Q69S553P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1004703RC79J9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100JCBNALJFAW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (11675, 10487)


In [11]:
# Import necessary libraries
from surprise import Dataset, Reader, KNNBasic, accuracy
from surprise.model_selection import train_test_split

# x is the user-item matrix built in the previous cell
# Convert it back to a long DataFrame with one (user, item, rating) row per cell
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']

# Remove rows where rating is 0
ratings = ratings[ratings['rating'] != 0]

# Define a Reader object
# The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader(rating_scale=(1, 5))

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

# Split the dataset into train and test
# Test set is made of 25% of the ratings
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)

In [12]:
# Configure the algorithm - User Based Collaborative Filtering
# Use cosine similarity
sim_options = {
    'name': 'cosine',
    'user_based': True  # this will compute similarity between users
}

# set k
k = 40

# Create an instance of KNNBasic (deterministic, so no random seed is passed)
algo = KNNBasic(k=k, sim_options=sim_options, verbose=True)

# Train the algorithm on the trainset
algo.fit(trainset)

# Predict ratings for the testset
predictions = algo.test(testset)

# Then compute RMSE, MSE and MAE
print("\nUser-based Model Test Set Results:")
mae_pack = accuracy.mae(predictions).round(2)
mse_pack = accuracy.mse(predictions).round(2)
rmse_pack = accuracy.rmse(predictions).round(2)

print(f"Mean Absolute Error (MAE): {mae_pack}")
print(f"Mean Squared Error (MSE): {mse_pack}")
print(f"Root Mean Squared Error (RMSE): {rmse_pack}")

Computing the cosine similarity matrix...
Done computing similarity matrix.

User-based Model Test Set Results:
MAE:  0.7159
MSE: 1.0552
RMSE: 1.0272
Mean Absolute Error (MAE): 0.72
Mean Squared Error (MSE): 1.06
Root Mean Squared Error (RMSE): 1.03


In [13]:
# save results to csv
results = pd.DataFrame({'MAE': [mae_pack.round(3)], 'MSE': [mse_pack.round(3)], 'RMSE': [rmse_pack.round(3)]})
results.to_csv("Data/Results/UBCF_results_2.csv", index=False)

***
## (3) Manual Process with Same Data Splits

In [14]:
%reset -f

# load libraries
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [15]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv')
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("Shape: ", x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,A14638TGYH7GD9,2010-10-28,321719816,5.0,even though i use dreamweaver a great deal and...,even though use dreamweav great deal sever boo...,even though use dreamweaver great deal several...,even though use dreamweaver great deal several...,20,11,0.99
1,A2JMJVNTBL7K7E,2011-04-07,321719816,5.0,i spent several hours on the lesson and i love...,spent sever hour lesson love detail clear inst...,spent several hour lesson love detailed clear ...,spent several hours lesson love detailed clear...,19,8,0.9766
2,A2BVNVJOFXGZUB,2010-09-26,321719816,5.0,the video is wellpaced and delivered in an und...,video wellpac deliv understand manner allow wo...,video wellpaced delivered understandable manne...,video wellpaced delivered understandable manne...,3,3,0.4939


Number of Rows:  256725
Number of Columns:  11
Number of Unique Users:  11675
Number of Unique Products:  10487
Fewest reviews by a reviewer: 12
Most reviews by a reviewer: 365
Fewest reviews per product: 12
Most reviews per product: 266
Shape:  (11675, 10487)


In [16]:
# recreate the train/test split from the packages section (same test_size and random_state)
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']
ratings = ratings[ratings['rating'] != 0]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)
testset_df = pd.DataFrame(testset)  # columns: 0 = user, 1 = item, 2 = rating

# convert each row of the testset to a (user, item) tuple
testset_tuples = [tuple(row) for row in testset_df[[0, 1]].to_numpy()]

# find indices of the testset in the original matrix
testset_indices = []
for i in range(len(testset_tuples)):
    user = testset_tuples[i][0]
    item = testset_tuples[i][1]
    user_index = x.index.get_loc(user)
    item_index = x.columns.get_loc(item)
    testset_indices.append((user_index, item_index))

# preview the first five (user_index, item_index) pairs
print("Testset Indices: ")
testset_indices[0:5]

Testset Indices: 


[(3049, 6368), (4055, 5745), (4978, 2565), (8152, 3097), (5904, 10376)]

In [17]:
# # create a copy of the original matrix to store hidden ratings
# x_hidden = x.copy()
# indices_tracker = []

# # loop through the testset indices to hide the rating (make 0) - update x_hidden
# for user_id in range(x_hidden.shape[0]):
#     for item_id in range(x_hidden.shape[1]):
#         if (user_id, item_id) in testset_indices:
#             x_hidden.iloc[user_id, item_id] = 0

# # save x_hidden to csv
# x_hidden.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv')
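
The commented-out loop above visits every cell of the matrix and checks it against the full list of test indices. A minimal alternative sketch, assuming the test set is already available as `(row, column)` position pairs, zeroes those cells directly with NumPy integer indexing. `hide_ratings` is a hypothetical name, and the toy matrix is illustrative only.

```python
import numpy as np
import pandas as pd


def hide_ratings(matrix, index_pairs):
    """Return a copy of matrix with the given (row, col) cells set to 0."""
    pairs = np.asarray(index_pairs, dtype=int).reshape(-1, 2)
    values = matrix.to_numpy(copy=True)
    values[pairs[:, 0], pairs[:, 1]] = 0
    return pd.DataFrame(values, index=matrix.index, columns=matrix.columns)


# Toy data for illustration only
demo_x = pd.DataFrame(
    [[5.0, 3.0], [4.0, 2.0]], index=["u1", "u2"], columns=["i1", "i2"]
)
demo_hidden = hide_ratings(demo_x, [(0, 1), (1, 0)])
```

The original matrix is left untouched, so it can still supply the true ratings at evaluation time.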

In [18]:
# load hidden ratings matrix
x_hidden = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv', index_col=0)

In [19]:
# get cosine sim matrix and change to pd dataframe
sim_mat_cos = cosine_similarity(x_hidden)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.        , 0.09245003, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.09245003, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.13994571,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.13994571, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
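
One possible source of differences between this matrix and the package results: `cosine_similarity` on a zero-filled matrix treats every missing rating as a 0, whereas Surprise's cosine measure only takes commonly rated items into account. The sketch below contrasts the two on a pair of toy rating vectors; both function names are hypothetical and the code is a simplified illustration, not Surprise's implementation.

```python
import numpy as np


def cosine_full(a, b):
    """Cosine similarity over all entries (missing ratings count as 0)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def cosine_corated(a, b):
    """Cosine similarity restricted to items both users rated (rating > 0)."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    return cosine_full(a[both], b[both])


# Toy rating vectors; the two users share only the first item
u = np.array([5.0, 0.0, 3.0])
v = np.array([4.0, 2.0, 0.0])
sim_full = cosine_full(u, v)
sim_corated = cosine_corated(u, v)
```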

In [20]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# set k to 40
k = 40

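# now get predicted ratings for all users
# NOTE: rated_users_indices holds the column (item) positions this user has rated,
# yet it is used below to index the user-user matrix sim_mat_cos. The weights are
# therefore not neighbour-user similarities, and the result does not depend on
# product_id, so every unrated cell in a row receives the same value.
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_users_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
        similarity_i_j = sim_mat_cos[user_id, rated_users_indices]  # weights taken from this user's row of sim_mat_cos
        ratings = user_ratings[0, rated_users_indices]  # this user's own observed ratings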
        
        # sort by similarity and select top k
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        ratings = ratings[sorted_indices]

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # make predicted rating mean of user's ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


reviewerID,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A0380485C177Q6QQNJIX,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
A0685888WB02Q69S553P,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,...,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714
A1004703RC79J9,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,...,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000
A100JCBNALJFAW,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,...,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500
A100RH4M1W1DF0,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZYJE40XW6MFG,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
AZYOVGJLQ03ML,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,...,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889,3.888889
AZYU8M791SIFC,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
AZZ1KF8RAO1BR,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
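
Every row in the table above is constant, which suggests the prediction does not vary by item. For comparison, here is a minimal sketch of a user-based k-nearest-neighbour prediction for a single (user, item) pair. It assumes a dense NumPy ratings array in which 0 marks a missing rating and a precomputed user-user similarity array. The function name, the restriction to positively similar neighbours, and the fallback to the user's own mean are illustrative choices, not taken from the notebook or from Surprise.

```python
import numpy as np


def predict_user_based(ratings, sim, user_idx, item_idx, k=40):
    """Similarity-weighted mean rating of the k most similar users who rated the item."""
    # users other than the target who have rated this item
    raters = np.flatnonzero(ratings[:, item_idx] > 0)
    raters = raters[raters != user_idx]
    sims = sim[user_idx, raters]

    # keep only positively similar neighbours
    keep = sims > 0
    raters, sims = raters[keep], sims[keep]
    if raters.size == 0:
        # fallback: mean of the user's own observed ratings
        own = ratings[user_idx][ratings[user_idx] > 0]
        return float(own.mean()) if own.size else float("nan")

    top = np.argsort(sims)[::-1][:k]  # positions of the k largest similarities
    neighbour_ratings = ratings[raters[top], item_idx]
    return float(np.dot(sims[top], neighbour_ratings) / sims[top].sum())


# Toy data for illustration only
demo_ratings = np.array([[5.0, 0.0, 3.0], [4.0, 2.0, 0.0], [1.0, 5.0, 4.0]])
demo_sim = np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]])
demo_pred = predict_user_based(demo_ratings, demo_sim, user_idx=0, item_idx=1)
```

Here the neighbours are chosen per item, so two unrated items in the same row can receive different predictions.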


In [21]:
#  get predicted ratings for the testset
predicted_ratings = []
for i in range(len(testset_indices)):
    user_id = testset_indices[i][0]
    item_id = testset_indices[i][1]
    predicted_ratings.append(predic_matrix.iloc[user_id, item_id])

print("Predicted Ratings:")
print(predicted_ratings)


# get actual ratings for the testset
print("\nActual Ratings:")
actual_ratings = testset_df[2].to_list()
print(actual_ratings)

Predicted Ratings:
[4.916666666666667, 4.6, 4.5, 4.4, 4.3037958848188165, 4.3735453699999125, 4.884236372639689, 5.0, 3.467736641700542, 4.2727272727272725, 5.0, 5.0, 3.6666666666666665, 4.3, 4.222222222222222, 4.181818181818182, 3.5, 2.888888888888889, 5.0, 4.9375, 5.0, 4.621487246754424, 5.0, 4.666666666666667, 4.291944449146344, 3.9865176918647838, 5.0, 4.535012826027247, 4.2727272727272725, 4.999999999999999, 4.3478260869565215, 4.521787253216976, 3.5, 5.0, 3.2961261582292436, 5.0, 3.0, 5.000000000000001, 5.0, 4.5, 4.999999999999999, 5.0, 5.0, 4.304347826086956, 4.0, 3.90625, 4.0, 3.0000000000000004, 2.0, 3.772727272727273, 4.666666666666667, 3.4404329480654066, 4.866666666666666, 4.545454545454546, 5.0, 4.222222222222222, 4.521787253216976, 4.64144761070827, 4.569334261202772, 4.545300471395902, 5.0, 4.0393271779090245, 4.733333333333333, 4.145258011002764, 4.666666666666667, 5.0, 5.0, 3.467736641700542, 4.428571428571429, 4.5, 5.0, 4.381245329770313, 5.0, 5.0, 4.230769230769231, 

In [22]:
# calculate MAE, MSE and RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Using sklearn")
mae = mean_absolute_error(actual_ratings, predicted_ratings)
mse = mean_squared_error(actual_ratings, predicted_ratings)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")


# Manually
print("\n\nManually")

# calculate MAE, MSE and RMSE using actual and predicted ratings
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings))) # Calculate Mean Absolute Error (MAE)
mse = np.mean((np.array(actual_ratings) - np.array(predicted_ratings)) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")

Using sklearn
Mean Absolute Error (MAE): 0.64
Mean Squared Error (MSE): 1.01
Root Mean Squared Error (RMSE): 1.0


Manually
Mean Absolute Error (MAE): 0.64
Mean Squared Error (MSE): 1.01
Root Mean Squared Error (RMSE): 1.0


In [23]:
# save results to csv
results = pd.DataFrame({'MAE': [mae.round(3)], 'MSE': [mse.round(3)], 'RMSE': [rmse.round(3)]})
results.to_csv("Data/Results/UBCF_results_3.csv", index=False)
