# Item Based Collaborative Filtering

## Algorithm Summary

Item-based collaborative filtering is a model-based algorithm for making recommendations. It is based on the similarity between items calculated using people's ratings of those items. It is also known as item-item collaborative filtering.

1. **Load the data**
   - data is provided in a dataframe where each row is a review

2. **Create a user-item matrix**
   - convert dataframe into user-item matrix where each row is a user and each column is an item

3. **Create test and train set**
   - hide $N$ ratings for each user in the training set and use them to test the performance of the model

4. **Calculate item similarity**
   - using training set, calculate the similarity between items using cosine similarity

5. **Make predictions**
   - for each user, for each item in the test set, calculate the weighted sum of the ratings of the items that are similar to the item in question

6. **Evaluate the model**
   - calculate the predictive accuracy of the model using RMSE, MSE and MAE
   - calculate the Top-N metrics of the model using NDCG and Hit Rate
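
The steps above can be sketched end to end on a tiny example. This is a minimal illustration under stated assumptions, not the notebook's implementation: the 3×3 `ratings` array and the `predict` helper are hypothetical, and a value of 0 is assumed to mean "not rated".

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])

# Item-item cosine similarity is computed on the columns, hence the transpose.
item_sim = cosine_similarity(ratings.T)

def predict(user, item):
    """Similarity-weighted average of the ratings the user has already given."""
    rated = np.flatnonzero(ratings[user] > 0)
    weights = item_sim[item, rated]
    if not np.any(weights):
        # No similar rated item: fall back to the user's mean rating.
        return float(ratings[user, rated].mean())
    return float(np.dot(weights, ratings[user, rated]) / np.abs(weights).sum())

# User 0 has not rated item 2, so predict it from items 0 and 1.
toy_prediction = predict(user=0, item=2)
print(round(toy_prediction, 3))
```

Because the prediction is a weighted average with non-negative weights, it always lies between the lowest and highest rating the user has given to the neighbouring items.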

## (1) Manual / From Fundamentals

In [1]:
%reset -f
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

### Reading in and Converting Data

In [2]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set3_data_modelling.csv')
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
1,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
2,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)
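
With 83139 reviews spread over 3668 × 3249 = 11,917,332 cells, at most about 0.7% of the matrix holds an actual rating. The sketch below shows the same pivot on a hypothetical toy dataframe (the values are invented; the column names mirror the notebook's) and how the share of filled cells could be measured.

```python
import pandas as pd

# Hypothetical toy reviews; column names mirror the notebook's dataframe.
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u3"],
    "asin": ["p1", "p2", "p2", "p3"],
    "overall": [5.0, 3.0, 4.0, 2.0],
})

# One row per user, one column per product; unrated cells become 0.
toy_matrix = reviews.pivot_table(
    index="reviewerID", columns="asin", values="overall"
).fillna(0)

# Share of cells that hold an actual rating.
density = (toy_matrix > 0).to_numpy().mean()
print(toy_matrix)
print(f"Density: {density:.3f}")
```

Note that `pivot_table` averages duplicate (user, product) pairs by default, so the number of filled cells can be smaller than the number of review rows.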


### Train and Test Split

In [3]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0
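
The same leave-$N$-out split can be wrapped in a reusable function. This is a sketch under assumptions: `hide_n_ratings` is a hypothetical helper, it uses NumPy's `default_rng` rather than the global seed above (so it will not reproduce the same hidden indices), and it assumes every user has at least `n` rated items, which holds here because the fewest reviews by a reviewer is 13.

```python
import numpy as np
import pandas as pd

def hide_n_ratings(matrix, n, seed=0):
    """Return (train, hidden_idx): a copy of `matrix` with n rated items
    zeroed per user, and the column positions hidden for each user."""
    rng = np.random.default_rng(seed)
    values = matrix.to_numpy(copy=True)
    hidden_idx = np.empty((values.shape[0], n), dtype=int)
    for row in range(values.shape[0]):
        rated = np.flatnonzero(values[row] > 0)
        hidden_idx[row] = rng.choice(rated, size=n, replace=False)
        values[row, hidden_idx[row]] = 0
    train = pd.DataFrame(values, index=matrix.index, columns=matrix.columns)
    return train, hidden_idx

# Toy matrix: two users, four products, 0 = not rated.
toy = pd.DataFrame(
    [[5, 4, 0, 3], [0, 2, 5, 1]],
    index=["u1", "u2"], columns=list("abcd"), dtype=float,
)
toy_train, toy_hidden = hide_n_ratings(toy, n=1, seed=2207)
print(toy_hidden)
print(toy_train)
```

Returning the hidden positions alongside the training matrix keeps the two in sync, which is what the evaluation step relies on.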

In [4]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)

# see updated matrix with hidden ratings
print("Updated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)

Indices of Ratings per user 
 [[2807 2258 2647]
 [2111 1398 1498]
 [ 200 1102 1089]
 ...
 [2353 1482  185]
 [ 639 2206 3123]
 [ 193  533  406]]
Indices of Ratings per User joined [2807 2258 2647 ...  193  533  406]
Updated Matrix with Hidden Ratings


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Original Matrix


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Similarity Matrix

In [5]:
# get cosine sim matrix and change to pd dataframe
sim_mat_cos = cosine_similarity(x_hidden.T)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.        , 0.        , 0.01623839, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.14731312, ..., 0.        , 0.        ,
        0.        ],
       [0.01623839, 0.14731312, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.06160701,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.06160701, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
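
Cosine similarity between items $i$ and $j$ is $\text{sim}(i, j) = \frac{r_i \cdot r_j}{\lVert r_i \rVert \, \lVert r_j \rVert}$, where $r_i$ is the column of ratings for item $i$ (unrated cells count as 0). The sketch below computes this by hand on a hypothetical toy array and compares it with `cosine_similarity`; the guard for all-zero items is an assumption made for the manual version.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: rows = users, columns = items; the third item has no ratings.
toy = np.array([
    [5.0, 4.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0],
])

items = toy.T                                    # one row per item
norms = np.linalg.norm(items, axis=1, keepdims=True)
norms[norms == 0] = 1.0                          # avoid 0/0 for items with no ratings
unit_items = items / norms
manual_sim = unit_items @ unit_items.T           # dot products of unit vectors

sklearn_sim = cosine_similarity(items)
print(np.round(manual_sim, 4))
print(np.allclose(manual_sim, sklearn_sim))
```

An item with no ratings in the training matrix ends up with zero similarity to every item, so it cannot contribute to any prediction.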

### Prediction Matrix

In [6]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# set k to 40
k = 40

# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_products_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
        similarity_i_j = sim_mat_cos[product_id, rated_products_indices]
        ratings = user_ratings[0, rated_products_indices]
        
        # sort by similarity and select top k
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        ratings = ratings[sorted_indices]

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # no similar rated item: fall back to the mean of the selected (up to k) ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


reviewerID,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
A100RH4M1W1DF0,4.837838,4.580604,4.837838,4.837838,3.000000,4.837838,4.837838,4.837838,4.837838,3.000000,...,5.000000,5.000000,5.000000,4.828764,4.875795,5.000000,4.743291,5.000000,4.744630,5.000000
A100WO06OQR8BQ,5.000000,4.414663,4.250000,5.000000,5.000000,5.000000,5.000000,4.250000,5.000000,4.250000,...,3.877914,4.218412,3.882188,5.000000,3.000000,3.906688,5.000000,4.711882,3.000000,5.000000
A1027EV8A9PV1O,5.000000,3.420068,3.726925,5.000000,5.000000,4.666667,4.356995,5.000000,4.154642,4.589795,...,5.000000,5.000000,5.000000,4.666667,4.666667,4.666667,5.000000,4.666667,4.666667,5.000000
A103KKI1Y4TFNQ,3.495778,1.000000,4.026734,4.234364,3.418285,4.475389,4.400302,4.956550,3.939630,2.397028,...,4.378378,4.378378,4.340243,4.378378,4.378378,4.378378,5.000000,4.598120,4.378378,4.038359
A1047P9FLHTDZJ,4.928571,4.928571,4.928571,5.000000,4.928571,4.928571,4.928571,4.928571,4.928571,4.928571,...,4.928571,5.000000,5.000000,4.651346,5.000000,5.000000,5.000000,5.000000,4.578490,4.928571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,4.000000,5.000000,4.741935,4.847458,4.820221,4.820221,4.000000,4.741935,4.000000,5.000000,...,5.000000,4.627549,4.848066,4.835809,4.840308,4.741935,5.000000,4.702223,4.789793,4.647335
AZW0HVDKOXGN9,3.454545,4.000000,3.454545,3.454545,3.454545,3.454545,3.454545,3.454545,3.454545,3.454545,...,3.454545,4.000000,3.454545,3.611079,4.000000,3.454545,3.454545,3.605696,4.534150,3.454545
AZX2RDN9YXZAE,4.000000,4.000000,3.416667,3.416667,4.000000,4.000000,4.000000,3.416667,4.000000,3.416667,...,3.227632,3.229893,3.117583,3.416667,3.507854,3.416667,3.000000,3.416667,3.416667,4.000000
AZY157FF14CSL,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
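
Each predicted rating is a similarity-weighted average over the $k$ most similar items the user has rated: $\hat{r}_{u,i} = \frac{\sum_{j \in N_k(i)} s_{ij} \, r_{u,j}}{\sum_{j \in N_k(i)} |s_{ij}|}$. The sketch below isolates that single step; `predict_rating` and its inputs are hypothetical, and its fallback uses the mean of all the user's ratings, which is an assumption of this sketch.

```python
import numpy as np

def predict_rating(user_ratings, item_sims, k):
    """Predict one rating from the user's k most similar rated items.

    user_ratings: 1-D array of the user's ratings (0 = unrated)
    item_sims:    1-D array of similarities between the target item and every item
    """
    rated = np.flatnonzero(user_ratings > 0)
    # Positions of the k rated items most similar to the target item.
    top = rated[np.argsort(item_sims[rated])[::-1][:k]]
    weights = item_sims[top]
    if not np.any(weights):
        return float(user_ratings[rated].mean())
    return float(np.sum(weights * user_ratings[top]) / np.sum(np.abs(weights)))

example = predict_rating(
    user_ratings=np.array([5.0, 3.0, 0.0, 1.0]),
    item_sims=np.array([0.9, 0.1, 1.0, 0.5]),
    k=2,
)
print(example)  # (0.9 * 5 + 0.5 * 1) / (0.9 + 0.5)
```

With a small $k$, items that are only weakly similar to the target are ignored, which is why many rows in the matrix above contain repeated values: users with few informative neighbours fall back to an average.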


### Evaluation (Predictive Accuracy)

Now evaluate how well the predictions match the hidden ratings:
- ***step 1***: identify the indices of the hidden ratings
- ***step 2***: extract the hidden ratings and the corresponding predicted ratings at those indices
- ***step 3***: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values)

In [7]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# list to collect each user's hidden (true) ratings
hidden_ratings_arrays = []

# loop through users and append their hidden ratings
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(predic_matrix.shape[0]):
    user_predicted_ratings = predic_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


Hidden Ratings: [3. 5. 5. ... 4. 4. 1.]
Corresponding Predicted Ratings: [4.92917958 4.89545038 4.97604447 ... 4.23250389 3.77252476 4.22142989]
Using sklearn
Mean Absolute Error (MAE): 0.5386180659341968
Mean Squared Error (MSE): 0.7431346270126726
Root Mean Squared Error (RMSE): 0.8620525662699884


Manually
Mean Absolute Error (MAE): 0.5386180659341968
Mean Squared Error (MSE): 0.7431346270126726
Root Mean Squared Error (RMSE): 0.8620525662699884
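
The per-user loops used to gather hidden and predicted ratings could also be expressed with `np.take_along_axis`, which picks, for each row, the entries at that row's column positions. The arrays below are hypothetical toy stand-ins for the original matrix, the prediction matrix and the hidden indices.

```python
import numpy as np

# Toy stand-ins: true ratings, predictions, and hidden column positions per user.
truth = np.array([[5.0, 3.0, 4.0], [2.0, 4.0, 1.0]])
preds = np.array([[4.5, 3.5, 4.0], [3.0, 4.0, 2.0]])
hidden = np.array([[0, 2], [1, 2]])

# For every user (row), take the entries at that user's hidden columns.
y_true = np.take_along_axis(truth, hidden, axis=1).ravel()
y_pred = np.take_along_axis(preds, hidden, axis=1).ravel()

errors = y_true - y_pred
toy_mae = np.mean(np.abs(errors))
toy_mse = np.mean(errors ** 2)
toy_rmse = np.sqrt(toy_mse)
print(toy_mae, toy_mse, toy_rmse)
```

On the real data this would be applied to `x.to_numpy()`, `predic_matrix.to_numpy()` and `indices_tracker`, assuming all three share the same row order.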


In [8]:
# round to 2 decimal places
mae = round(mae, 2)
mse = round(mse, 2)
rmse = round(rmse, 2)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
results.to_csv("Data/Results/IBCF_results_1.csv", index=False)

### Evaluation (Top-N Metrics)

In [9]:
# turn matrix into a dataframe with user and product, rating columns
preds_series = predic_matrix.stack().reset_index().rename(columns={0: 'rating'}).sort_values(by=['asin', 'reviewerID'])
preds_series = preds_series['rating'].reset_index(drop=True)
preds_series


# getting a dataframe with interactions and ratings
data = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
data_mat = data.copy()
data_mat = data_mat.reset_index()
data_mat = data_mat.melt(id_vars=data_mat.columns[0], var_name='product', value_name='rating')
data_mat.columns = ['user', 'product', 'rating']
data_mat['user'] = data_mat['user'].astype('category')
data_mat['product'] = data_mat['product'].astype('category')

# data_mat['user'] = data_mat['user'].cat.codes
# data_mat['product'] = data_mat['product'].cat.codes
display(data_mat.head(3))

# create a completed dataframe
completed = data_mat.copy()
nan_rows = completed[completed['rating'].isnull()]

# for nan_rows, replace the rating with the predicted rating
completed.loc[nan_rows.index, 'rating'] = preds_series[nan_rows.index]

# see original data with user item interactions
print("User Item Interactions with Ratings")
display(data_mat.head(3))

# see data with predictions
print("\nUser Item Interactions with Predicted Ratings")
display(completed.head(3))

# details on completed dataframe
print('\n\nNumber of Rows: ', completed.shape[0])
print('Number of Columns: ', completed.shape[1])
print('Number of Unique Users: ', len(completed['user'].unique()))
print('Number of Unique Products: ', len(completed['product'].unique()))

Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,
1,A100WO06OQR8BQ,767834739,
2,A1027EV8A9PV1O,767834739,


User Item Interactions with Ratings


Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,
1,A100WO06OQR8BQ,767834739,
2,A1027EV8A9PV1O,767834739,



User Item Interactions with Predicted Ratings


Unnamed: 0,user,product,rating
0,A100RH4M1W1DF0,767834739,4.837838
1,A100WO06OQR8BQ,767834739,5.0
2,A1027EV8A9PV1O,767834739,5.0




Number of Rows:  11917332
Number of Columns:  3
Number of Unique Users:  3668
Number of Unique Products:  3249
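
The cell above fills missing ratings by row position, which works only while the sorted prediction series and the melted dataframe share exactly the same ordering. A label-based alternative is sketched below on hypothetical toy data: `fillna` with a dataframe aligns on the (user, product) labels, so the ordering no longer matters.

```python
import numpy as np
import pandas as pd

# Toy observed ratings (NaN = unrated) and a toy prediction matrix with the same labels.
observed = pd.DataFrame(
    [[5.0, np.nan], [np.nan, 2.0]],
    index=pd.Index(["u1", "u2"], name="user"),
    columns=pd.Index(["p1", "p2"], name="product"),
)
predicted = pd.DataFrame(
    [[4.8, 3.1], [4.2, 2.5]], index=observed.index, columns=observed.columns
)

# Keep observed ratings where they exist, otherwise take the prediction,
# then reshape to one row per (user, product) pair.
toy_completed = (
    observed.fillna(predicted)
    .reset_index()
    .melt(id_vars="user", var_name="product", value_name="rating")
)
print(toy_completed)
```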


#### Execute for One User

In [10]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

In [11]:
# get training data for user (i.e., remove the hidden ratings and keep only the observed ratings)
train_data = x_hidden.copy()
train_data = train_data.stack().reset_index().rename(columns={0: 'rating'}).sort_values(by=['reviewerID', 'asin'])

# keep only observed ratings (drop the zero placeholders for unrated and hidden items)
train_data = train_data[train_data['rating'] != 0]

# remove any remaining nan values
train_data = train_data.dropna()


# convert the user and product columns to categorical
train_data['reviewerID'] = train_data['reviewerID'].astype('category')
train_data['asin'] = train_data['asin'].astype('category')

# train_data['reviewerID'] = train_data['reviewerID'].cat.codes
# train_data['asin'] = train_data['asin'].cat.codes
train_data.rename(columns={'reviewerID': 'user', 'asin':'product' }, inplace=True)
train_data


Unnamed: 0,user,product,rating
993,A100RH4M1W1DF0,B001NJJOCW,5.0
1189,A100RH4M1W1DF0,B003SIOXTA,5.0
1203,A100RH4M1W1DF0,B003ZXCAAC,5.0
1504,A100RH4M1W1DF0,B00AA8WPGY,5.0
1816,A100RH4M1W1DF0,B00HZ6X8QU,5.0
...,...,...,...
11914207,AZYU8M791SIFC,B000066TS5,4.0
11914303,AZYU8M791SIFC,B0000C6EDL,5.0
11914401,AZYU8M791SIFC,B0009A4EV2,3.0
11914662,AZYU8M791SIFC,B000QW9D14,3.0


In [12]:
# set N - number of recommendations
N = 10000

# get interactions for user 1 used for training
train_x_user_1 = train_data[train_data['user'] == 'A100RH4M1W1DF0']
train_x_user_1

# Get interactions for User 1 (including ratings); copy so new columns can be added safely
user_1 = completed[completed['user'] == 'A100RH4M1W1DF0'].copy()
print("Number of Interactions for User 1: ", user_1.shape[0])

# Identify liked items for User 1 (rating above 3.5)
liked_items = user_1[user_1['rating'] > 3.5]
print("Number of Liked Items for User 1: ", liked_items.shape[0])

# get items that were hidden for user 1 (get product names)
product_ids_hidden = x.iloc[0, indices_tracker[0]].index
product_ids_hidden = product_ids_hidden.tolist()
print("Number of Hidden Items for User 1: ", len(product_ids_hidden))

# get ratings for hidden items and predicted ratings - for user 1 (put in a dataframe)
hidden_ratings = x.iloc[0, indices_tracker[0]].values
predicted_ratings = predic_matrix.iloc[0, indices_tracker[0]].values
hidden_ratings_df = pd.DataFrame({'product': product_ids_hidden, 'hidden_rating': hidden_ratings, 'predicted_rating': predicted_ratings})
hidden_ratings_df

# set  threshold for recommendations
threshold = 3.
# create a label column for hidden ratings (1 = liked, 0 = not liked)
hidden_ratings_df['label'] = hidden_ratings_df['hidden_rating'].apply(lambda x: 1 if x > threshold else 0)
hidden_ratings_df

# add label for used interactions (add 1 to all interactions that exist in training data)
user_1['used_ind'] = 0
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 1] in list(train_x_user_1['product']):
        user_1.iloc[i, 3] = 1

# count how many interactions are in train_x
print("Number of Interactions in Train Set for User 1: ", train_x_user_1.shape[0])

# count how many 1 in completed_user_1
print("Number of Interactions in Completed User 1: ", user_1[user_1['used_ind'] == 1].shape[0])

# add label liked for completed_user_1
user_1['liked'] = user_1['rating'].apply(lambda x: 1 if x > threshold else 0)


# add a label column to user_1_top_n: test_ind (if the product is in hidden_ratings_df, then 1, else 0)
user_1['test_ind'] = 0
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 1] in list(hidden_ratings_df['product']):
        user_1.iloc[i, 5] = 1

# for all records where test_ind = 1, replace the hidden_rating with predicted_rating
for i in range(user_1.shape[0]):
    if user_1.iloc[i, 5] == 1:
        user_1.iloc[i, 2] = hidden_ratings_df[hidden_ratings_df['product'] == user_1.iloc[i, 1]]['predicted_rating'].values[0] 

# get top N recommendations for user 1 - exclude items where used_ind = 1
user_1_top_n = user_1[user_1['used_ind'] == 0]
user_1_top_n = user_1_top_n.sort_values(by='rating', ascending=False)
user_1_top_n = user_1_top_n.head(N)

# count how many 1 in user_1_top_n
print("Number of Items in Top N for User 1 that Were Used and Liked: ", user_1_top_n[user_1_top_n['test_ind'] == 1].shape[0])

# see top N recommendations for user 1
print("\n\nTop N Recommendations for User 1")
display(user_1_top_n)

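# Calculate precision@N: hidden (test) items in the top-N list divided by N
precision_at_N = user_1_top_n['test_ind'].sum() / N

# Calculate recall@N: hidden (test) items in the top-N list divided by the number of liked items
recall_at_N = user_1_top_n['test_ind'].sum() / liked_items.shape[0]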

# calculate F1 score
f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

print(f"Precision@{N}: {precision_at_N:.4f}")
print(f"Recall@{N}: {recall_at_N:.4f}")
print(f"F1@{N}: {f1_at_N:.4f}")

# collect results in a dataframe
results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
results


Number of Interactions for User 1:  3249
Number of Liked Items for User 1:  3233
Number of Hidden Items for User 1:  3


Number of Interactions in Train Set for User 1:  37
Number of Interactions in Completed User 1:  37


Number of Items in Top N for User 1 that Were Used and Liked:  3


Top N Recommendations for User 1


Unnamed: 0,user,product,rating,used_ind,liked,test_ind
2747332,A100RH4M1W1DF0,B000ZHB2HS,5.0,0,1,0
5171880,A100RH4M1W1DF0,B006989VBG,5.0,0,1,0
2149448,A100RH4M1W1DF0,B000R0URCE,5.0,0,1,0
5003152,A100RH4M1W1DF0,B005CX2DTQ,5.0,0,1,0
9870588,A100RH4M1W1DF0,B00Z7SIJPI,5.0,0,1,0
...,...,...,...,...,...,...
392476,A100RH4M1W1DF0,B00005Q8M0,3.0,0,0,0
894992,A100RH4M1W1DF0,B0001VGFK2,3.0,0,0,0
3187492,A100RH4M1W1DF0,B00166N6SA,3.0,0,0,0
121044,A100RH4M1W1DF0,B00002SVFR,3.0,0,0,0


Precision@10000: 0.0003
Recall@10000: 0.0009
F1@10000: 0.0005


Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.0003,0.000928,0.000453
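
Since the catalogue has only 3249 products, a list of N = 10000 contains every candidate item, so precision is capped at 3 / 10000 = 0.0003 whatever the ranking. For comparison, the sketch below shows a common formulation of precision@k and recall@k in which the relevant set is the user's held-out liked items and k is much smaller than the catalogue. The helper names and the toy ranking are hypothetical.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = ranked_items[:k]
    relevant = set(relevant_items)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical ranking: two of the three held-out liked items appear in the top 5.
p, r = precision_recall_at_k(["a", "b", "c", "d", "e", "f"], ["b", "e", "z"], k=5)
f1 = f1_score(p, r)
print(p, r, f1)
```

Returning 0 for recall when a user has no relevant items is one possible convention; excluding such users from the average is another.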


In [13]:
# convert to dataframe with columns: user, products
hid = pd.DataFrame(hidden_ratings_ind)
hid['user'] = x.index
hid = hid[['user', 0, 1, 2]]

# convert 0,1,2 to list
hid['products'] = hid.iloc[:, 1:].values.tolist()
hid = hid[['user', 'products']]
hid

Unnamed: 0,user,products
0,A100RH4M1W1DF0,"[2807, 2258, 2647]"
1,A100WO06OQR8BQ,"[2111, 1398, 1498]"
2,A1027EV8A9PV1O,"[200, 1102, 1089]"
3,A103KKI1Y4TFNQ,"[1709, 1650, 1304]"
4,A1047P9FLHTDZJ,"[529, 2277, 2175]"
...,...,...
3663,AZVIQ5SU7XPD5,"[2968, 1354, 147]"
3664,AZW0HVDKOXGN9,"[2964, 3173, 2901]"
3665,AZX2RDN9YXZAE,"[2353, 1482, 185]"
3666,AZY157FF14CSL,"[639, 2206, 3123]"


In [14]:
def evaluate_topN_user(user_id, threshold, N):
    print(f"Evaluating User {user_id}")
    
    train_x_user = train_data[train_data['user'] == user_id]
    # copy so that new columns can be added without a SettingWithCopyWarning
    user_data = completed[completed['user'] == user_id].copy()
    
    liked_items = user_data[user_data['rating'] > threshold]

    # names of the products that were hidden for this user
    all_ints = x.loc[user_id, :]
    product_names = all_ints.index[hid[hid['user'] == user_id]['products'].values[0]]
    product_ids_hidden = product_names.tolist()

    hidden_ratings = x.loc[user_id, product_names].values
    predicted_ratings = predic_matrix.loc[user_id, product_names].values
    
    hidden_ratings_df = pd.DataFrame({
        'product': product_ids_hidden,
        'hidden_rating': hidden_ratings,
        'predicted_rating': predicted_ratings
    })

    hidden_ratings_df['label'] = hidden_ratings_df['hidden_rating'].apply(lambda x: 1 if x > threshold else 0)

    # flag items in this user's training data so they are excluded from the recommendations
    user_data['used_ind'] = user_data['product'].isin(train_x_user['product'].tolist()).astype(int)
    user_data['liked'] = user_data['rating'].apply(lambda x: 1 if x > threshold else 0)

    # flag the items that were hidden for this user
    user_data['test_ind'] = user_data['product'].isin(product_ids_hidden).astype(int)

    for i in range(user_data.shape[0]):
        if user_data.iloc[i, 5] == 1:
            user_data.iloc[i, 2] = hidden_ratings_df[hidden_ratings_df['product'] == user_data.iloc[i, 1]]['predicted_rating'].values[0]

    user_top_n = user_data[user_data['used_ind'] == 0].sort_values(by='rating', ascending=False).head(N)
    display(user_top_n)
    
    precision_at_N = user_top_n['test_ind'].sum() / N
    recall_at_N = user_top_n['test_ind'].sum() / liked_items.shape[0]

    if precision_at_N + recall_at_N == 0:
        f1_at_N = 0
    else:
        f1_at_N = 2 * (precision_at_N * recall_at_N) / (precision_at_N + recall_at_N)

    results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
    return results


In [16]:
# Now you can call evaluate_topN_user with a different user_id, threshold and N
results = evaluate_topN_user(user_id='A214JN9AJNSHCJ', threshold=3.5, N=10000)
results

Evaluating User A214JN9AJNSHCJ


Unnamed: 0,user,product,rating,used_ind,liked,test_ind
6698767,A214JN9AJNSHCJ,B00HZYCCKU,5.0,0,1,0
6816143,A214JN9AJNSHCJ,B00I8G5UZ8,5.0,0,1,0
10773915,A214JN9AJNSHCJ,B016WK6XMA,5.0,0,1,0
8316355,A214JN9AJNSHCJ,B00PQNJ9W6,5.0,0,1,0
6878499,A214JN9AJNSHCJ,B00I9J3Q8W,5.0,0,1,0
...,...,...,...,...,...,...
7938551,A214JN9AJNSHCJ,B00N956AD4,5.0,0,1,0
8657479,A214JN9AJNSHCJ,B00RGNFO4G,5.0,0,1,0
9504787,A214JN9AJNSHCJ,B00W4H850E,5.0,0,1,0
1636927,A214JN9AJNSHCJ,B000GW0U9I,5.0,0,1,0


Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.0003,0.000923,0.000453


#### Execute for All Users

In [42]:
# get count of users
user_count = len(completed['user'].unique())
counter = 0

# loop through users to get results for each user and save to a dataframe
results = pd.DataFrame()
for user in completed['user'].unique():
    counter += 1
    print(f"User {counter} of {user_count}")
    user_results = evaluate_topN_user(user_id=user, threshold=3, N=10000)
    print(user_results)
    results = pd.concat([results, user_results])
    

results

User 1 of 3668
Evaluating User A100RH4M1W1DF0
   Precision@N  Recall@N      F1@N
0       0.0003  0.000927  0.000453
User 2 of 3668
Evaluating User A100WO06OQR8BQ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000955  0.000457
User 3 of 3668
Evaluating User A1027EV8A9PV1O
   Precision@N  Recall@N      F1@N
0       0.0003  0.000924  0.000453
User 4 of 3668
Evaluating User A103KKI1Y4TFNQ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000946  0.000456
User 5 of 3668
Evaluating User A1047P9FLHTDZJ
   Precision@N  Recall@N      F1@N
0       0.0003  0.000923  0.000453
User 6 of 3668
Evaluating User A105F4XQ9S1NU5
   Precision@N  Recall@N      F1@N
0       0.0003  0.000933  0.000454
User 7 of 3668
Evaluating User A105S56ODHGJEK
   Precision@N  Recall@N      F1@N
0       0.0003  0.000931  0.000454
User 8 of 3668
Evaluating User A105XKMQB69VHF
   Precision@N  Recall@N      F1@N
0       0.0003  0.000923  0.000453
User 9 of 3668
Evaluating User A107652KJ8BTTN
   Precision@N  Recall@N  

Unnamed: 0,Precision@N,Recall@N,F1@N
0,0.0003,0.000927,0.000453
0,0.0003,0.000955,0.000457
0,0.0003,0.000924,0.000453
0,0.0003,0.000946,0.000456
0,0.0003,0.000923,0.000453
...,...,...,...
0,0.0003,0.000923,0.000453
0,0.0003,0.000977,0.000459
0,0.0003,0.001123,0.000473
0,0.0003,0.000923,0.000453


In [43]:
# Get the average results for all users
average_results = results.mean()
average_results

Precision@N    0.000300
Recall@N            inf
F1@N           0.000459
dtype: float64

In [44]:
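# The mean Recall@N is inf because at least one per-user recall is inf,
# so Recall@N is re-estimated here from the mean precision and mean F1.
# NOTE: the expression below is the harmonic mean of precision and F1. It is not the
# inverse of F1 = 2PR / (P + R), which would be R = (P * F1) / (2P - F1).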
average_results['Recall@N'] = 2 * (average_results['Precision@N'] * average_results['F1@N']) / (average_results['Precision@N'] + average_results['F1@N'])
average_results

Precision@N    0.000300
Recall@N       0.000363
F1@N           0.000459
dtype: float64
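
The re-estimation above can be checked against the definition of F1. Since $F_1 = 2PR / (P + R)$, solving for $R$ gives $R = P \cdot F_1 / (2P - F_1)$. The sketch below is a minimal illustration and not part of the original notebook: `recall_from_f1` and `finite_mean` are hypothetical helpers and all numbers are made up. It also shows an alternative way to deal with the `inf`, namely averaging only the finite per-user values.

```python
import math

def recall_from_f1(precision, f1):
    """Invert F1 = 2PR / (P + R) to recover R (hypothetical helper)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * precision")
    return precision * f1 / denominator

def finite_mean(values):
    """Mean of the finite entries only; inf and nan are skipped."""
    finite = [v for v in values if math.isfinite(v)]
    return sum(finite) / len(finite) if finite else float("nan")

# Made-up single-user example: P = 0.5 and R = 0.25 give F1 = 1/3.
p, r = 0.5, 0.25
f1 = 2 * p * r / (p + r)
recovered = recall_from_f1(p, f1)  # recovers the recall we started from

# Made-up per-user recalls where one user produced inf.
per_user_recall = [0.2, 0.4, float("inf")]
mean_recall = finite_mean(per_user_recall)
```

The inversion is exact for a single user. Applied to averaged precision and averaged F1 it is only an approximation, because the mean of per-user F1 values is generally not the F1 of the mean precision and mean recall.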

In [45]:
precision_at_N = average_results['Precision@N']
recall_at_N = average_results['Recall@N']
f1_at_N = average_results['F1@N']

In [46]:
average_results = pd.DataFrame({'Precision@N': [precision_at_N], 'Recall@N': [recall_at_N], 'F1@N': [f1_at_N]})
average_results.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/Results/IBCF_results_1_top10000.csv', index=False)

## (2) Using Packages

In [36]:
## Using Packages for IBCF
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split


The code below first converts the user-item matrix into a long DataFrame of ratings, then removes any rows where the rating is 0 (meaning the user has not rated the item). Calling the `test` method on the algorithm with the testset returned by `train_test_split` **predicts a rating only for the held-out user-item pairs in that testset, each of which has a known true rating to compare against.** To score items that a user has not rated at all, build the pairs with `trainset.build_anti_testset()` instead.
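
To make the two kinds of prediction targets concrete, here is a small library-free sketch. It is an illustration under stated assumptions, not the package's implementation: the users, items and ratings are made up, and `test_pairs` stands in for whatever the split happens to hold out.

```python
# Toy data: every (user, item) pair with a known rating (made-up values).
known = {("u1", "a"): 5.0, ("u1", "b"): 3.0, ("u2", "a"): 4.0, ("u2", "c"): 2.0}
users = ["u1", "u2"]
items = ["a", "b", "c"]

# Suppose the split held out one known rating for testing.
test_pairs = {("u1", "b")}
train_pairs = set(known) - test_pairs

# Test set: held-out pairs that have a true rating to score against.
testset = [(u, i, known[(u, i)]) for (u, i) in sorted(test_pairs)]

# Every pair absent from the training data, whether or not a true rating exists.
all_pairs = {(u, i) for u in users for i in items}
unseen_pairs = sorted(all_pairs - train_pairs)
```

Only the pairs in `testset` can contribute to RMSE, MSE and MAE, because only they have a true rating. The larger `unseen_pairs` list is what a Top-N recommender would rank.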



In [37]:
# load data and change it to User-Item-Rating format
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set4_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv')

display(amz_data.head())


x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,A14638TGYH7GD9,2010-10-28,321719816,5.0,even though i use dreamweaver a great deal and...,even though use dreamweav great deal sever boo...,even though use dreamweaver great deal several...,even though use dreamweaver great deal several...,20,11,0.99
1,A2JMJVNTBL7K7E,2011-04-07,321719816,5.0,i spent several hours on the lesson and i love...,spent sever hour lesson love detail clear inst...,spent several hour lesson love detailed clear ...,spent several hours lesson love detailed clear...,19,8,0.9766
2,A2BVNVJOFXGZUB,2010-09-26,321719816,5.0,the video is wellpaced and delivered in an und...,video wellpac deliv understand manner allow wo...,video wellpaced delivered understandable manne...,video wellpaced delivered understandable manne...,3,3,0.4939
3,A14JBDSWKPKTZA,2011-01-08,321719816,5.0,i have had dreamweaver mx2004 since it came ou...,dreamweav mx2004 sinc came back spent year fee...,dreamweaver mx2004 since came back spent year ...,dreamweaver mx2004 since came back spent years...,12,13,0.989
4,ACJT8MUC0LRF0,2010-10-16,321719816,5.0,if youve been wanting to learn how to create y...,youv want learn creat websit either lack confi...,youve wanting learn create website either lack...,youve wanting learn create website either lack...,39,18,0.9995





User-Item Matrix


asin,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A0380485C177Q6QQNJIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A0685888WB02Q69S553P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1004703RC79J9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100JCBNALJFAW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (11675, 10487)


In [38]:
# `x` is the user-item matrix built in the previous cell
# Convert the user-item matrix back to a DataFrame of ratings
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']

# Remove rows where rating is 0
ratings = ratings[ratings['rating'] != 0]

# Define a Reader object
# The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader(rating_scale=(1, 5))

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

# Split the dataset into train (75%) and test (25%) - seed 2207
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)

In [39]:

# Configure the algorithm - Item Based Collaborative Filtering
# Use cosine similarity
sim_options = {
    'name': 'cosine',
    'user_based': False  # this will compute similarity between items
}

# decide on k
k = 40

# Create the algorithm object; k is the maximum number of neighbouring items used per prediction
algo = KNNBasic(sim_options=sim_options, k=k, verbose=True, random_state=2207)

# Train the algorithm on the trainset
algo.fit(trainset)

# Predict ratings for the testset
predictions = algo.test(testset)

# Then compute RMSE, MSE and MAE
print("\nItem-based Model Test Set Results:")
mae_pack = accuracy.mae(predictions).round(2)
mse_pack = accuracy.mse(predictions).round(2)
rmse_pack = accuracy.rmse(predictions).round(2)

print(f"Mean Absolute Error (MAE): {mae_pack}")
print(f"Mean Squared Error (MSE): {mse_pack}")
print(f"Root Mean Squared Error (RMSE): {rmse_pack}")

Computing the cosine similarity matrix...
Done computing similarity matrix.

Item-based Model Test Set Results:
MAE:  0.6195
MSE: 0.9313
RMSE: 0.9650
Mean Absolute Error (MAE): 0.62
Mean Squared Error (MSE): 0.93
Root Mean Squared Error (RMSE): 0.97


In [40]:
# save results to csv (the metrics were already rounded to 2 decimals in the previous cell)
results = pd.DataFrame({'MAE': [mae_pack.round(3)], 'MSE': [mse_pack.round(3)], 'RMSE': [rmse_pack.round(3)]})
results.to_csv("Data/Results/IBCF_results_2.csv", index=False)

***
## (3) Manual Process with Same Data Splits

This variant changes two things:
1. It uses the same training and test sets as the package run, so the results of the manual process can be compared directly with the package results.
2. Prediction now uses only the nearest neighbours, specifically k=40, which is the package default.

In [41]:
%reset -f

# load libraries
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [42]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv')
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("Shape: ", x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,A14638TGYH7GD9,2010-10-28,321719816,5.0,even though i use dreamweaver a great deal and...,even though use dreamweav great deal sever boo...,even though use dreamweaver great deal several...,even though use dreamweaver great deal several...,20,11,0.99
1,A2JMJVNTBL7K7E,2011-04-07,321719816,5.0,i spent several hours on the lesson and i love...,spent sever hour lesson love detail clear inst...,spent several hour lesson love detailed clear ...,spent several hours lesson love detailed clear...,19,8,0.9766
2,A2BVNVJOFXGZUB,2010-09-26,321719816,5.0,the video is wellpaced and delivered in an und...,video wellpac deliv understand manner allow wo...,video wellpaced delivered understandable manne...,video wellpaced delivered understandable manne...,3,3,0.4939


Number of Rows:  256725
Number of Columns:  11
Number of Unique Users:  11675
Number of Unique Products:  10487
Fewest reviews by a reviewer: 12
Most reviews by a reviewer: 365
Fewest reviews per product: 12
Most reviews per product: 266
Shape:  (11675, 10487)


### Generate Train and Test Split

In [43]:
# rebuild the same test set as in the packages section (same seed and test size)
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']
ratings = ratings[ratings['rating'] != 0]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)
testset_df = pd.DataFrame(testset)  # columns: 0 = user, 1 = item, 2 = rating

# convert each row of the testset to a (user, item) tuple
testset_tuples = [tuple(row) for row in testset_df[[0, 1]].to_numpy()]

# find the positional indices of the testset entries in the original matrix
testset_indices = []
for user, item in testset_tuples:
    user_index = x.index.get_loc(user)
    item_index = x.columns.get_loc(item)
    testset_indices.append((user_index, item_index))

print("Testset Indices: ")


Testset Indices: 


In [44]:
# # create a copy of the original matrix to store hidden ratings
# x_hidden = x.copy()
# indices_tracker = []

# # loop through the testset indices to hide the rating (make 0) - update x_hidden
# for user_id in range(x_hidden.shape[0]):
#     for item_id in range(x_hidden.shape[1]):
#         if (user_id, item_id) in testset_indices:
#             x_hidden.iloc[user_id, item_id] = 0

# # save x_hidden to csv
# x_hidden.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv')

# # save testset_indices to csv
# testset_indices_df = pd.DataFrame(testset_indices)
# testset_indices_df.to_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_testset_indices.csv')
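
The commented-out cell above visits every cell of the matrix and tests membership in the `testset_indices` list, which costs roughly users × items × test-set-size comparisons. A cheaper route is to iterate over the held-out pairs directly. The sketch below shows the idea on a plain list of rows. `hide_ratings` is a hypothetical helper, and the matrix and indices are made up.

```python
import copy

def hide_ratings(matrix, held_out_indices):
    """Return a copy of `matrix` with the held-out cells set to 0.

    `matrix` is a list of rows (users x items) and `held_out_indices`
    is an iterable of (user_index, item_index) pairs.
    """
    hidden = copy.deepcopy(matrix)
    for user_index, item_index in held_out_indices:
        hidden[user_index][item_index] = 0
    return hidden

# Made-up 3 x 3 ratings matrix; 0 means "not rated".
ratings_matrix = [
    [5, 0, 3],
    [0, 4, 0],
    [1, 0, 2],
]
held_out = [(0, 2), (2, 0)]
hidden_matrix = hide_ratings(ratings_matrix, held_out)
```

With a pandas DataFrame the same loop can write through `.iloc[user_index, item_index]`, so the work grows with the number of held-out ratings rather than with the size of the matrix.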

### Load Train Test Split

In [45]:
# load hidden ratings matrix
x_hidden = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/suprise_hidden_ratings_matrix.csv', index_col=0)

### Similarity Matrix

In [46]:
# compute the item-item cosine similarity matrix (returned as a NumPy array)
sim_mat_cos = cosine_similarity(x_hidden.T)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.06164333,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.06164333, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
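
One caveat when comparing this matrix with the package run: `cosine_similarity` on a zero-filled matrix treats an unrated cell as a rating of 0, whereas the Surprise documentation describes its cosine similarity as taking only common users into account. The two definitions can give different values, which may account for part of any gap between the manual and package metrics. The sketch below contrasts them on made-up data. Both function names are hypothetical and neither is the library's code.

```python
import math

def cosine_zero_filled(a, b):
    """Cosine similarity over the full vectors, treating 0 as a value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cosine_co_rated(a, b):
    """Cosine similarity over users who rated BOTH items (0 = unrated)."""
    pairs = [(x, y) for x, y in zip(a, b) if x > 0 and y > 0]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

# Made-up item columns: ratings from four users, 0 = not rated.
item_a = [5, 0, 4, 0]
item_b = [5, 3, 0, 0]

sim_full = cosine_zero_filled(item_a, item_b)  # lowered by unshared ratings
sim_common = cosine_co_rated(item_a, item_b)   # only the first user rated both
```

Here only one user rated both items, so the co-rated version sees two identical one-element vectors while the zero-filled version is pulled down by the ratings the items do not share.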

### Prediction Matrix

In [47]:
# start from the training matrix; unrated cells (0) will be filled with predictions
predic_matrix = x_hidden.copy()

# number of nearest neighbours to use
k = 40

# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_products_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
        # similarities between the target item and the items this user has rated
        similarity_i_j = sim_mat_cos[product_id, rated_products_indices]
        neighbour_ratings = user_ratings[0, rated_products_indices]
        
        # sort by similarity (descending) and keep the top k neighbours
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        neighbour_ratings = neighbour_ratings[sorted_indices]

        if np.any(similarity_i_j):
            # similarity-weighted average of the neighbours' ratings
            predicted_rating = np.sum(neighbour_ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # all similarities are zero: fall back to the mean of the selected ratings
            # (this equals the user's mean rating only when they have at most k rated items)
            predicted_rating = np.mean(neighbour_ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


Unnamed: 0_level_0,0321719816,0763855553,076780192X,0767824571,0767827759,0767834739,0768881714,0782010792,0783239408,0788857746,...,B01HD8OXO0,B01HD8OYSK,B01HDW58I6,B01HE0W2WC,B01HGBAFNC,B01HGD8OYM,B01HGSJPMW,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A0380485C177Q6QQNJIX,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000
A0685888WB02Q69S553P,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,5.000000,5.000000,...,4.785714,4.785714,4.785714,4.785714,4.785714,4.785714,5.000000,4.785714,4.785714,5.000000
A1004703RC79J9,3.010548,4.000000,3.928571,3.928571,3.928571,3.928571,3.928571,3.928571,4.000000,3.928571,...,4.000000,4.000000,3.928571,2.000000,4.000000,2.000000,3.000000,3.928571,3.928571,4.000000
A100JCBNALJFAW,3.562500,3.562500,5.000000,5.000000,3.562500,1.000000,3.562500,3.562500,4.000000,4.000000,...,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500,3.562500
A100RH4M1W1DF0,4.909091,4.909091,4.909091,4.909091,4.909091,4.909091,4.909091,4.909091,4.909091,4.909091,...,5.000000,4.909091,4.909091,4.909091,5.000000,5.000000,5.000000,5.000000,5.000000,4.909091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZYJE40XW6MFG,4.372446,1.788474,1.000000,3.400000,3.400000,3.400000,1.000000,4.000000,3.400000,3.000000,...,3.400000,4.579543,3.400000,4.035676,3.272842,3.666403,2.469890,3.400000,3.400000,3.400000
AZYOVGJLQ03ML,3.888889,3.888889,3.888889,3.888889,3.260559,4.000000,3.888889,3.888889,3.000000,3.888889,...,3.358492,4.683089,3.888889,3.888889,3.888889,5.000000,3.888889,3.000000,3.206064,3.403539
AZYU8M791SIFC,4.000000,4.000000,4.000000,5.000000,4.613187,4.089350,4.000000,4.000000,3.822351,1.624972,...,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000,4.000000
AZZ1KF8RAO1BR,2.666667,3.000000,2.666667,2.666667,2.666667,5.000000,2.666667,5.000000,2.666667,5.000000,...,2.666667,2.666667,1.000000,1.000000,2.666667,2.666667,2.666667,2.666667,2.666667,2.666667
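
The prediction rule used in the loop above can be isolated for a single user and a single target item. The following is a minimal sketch with made-up numbers. `predict_rating` is a hypothetical helper, and it assumes the user has rated at least one item.

```python
def predict_rating(similarities, ratings, k):
    """Similarity-weighted average of the k most similar rated items.

    `similarities[j]` is the similarity between the target item and the
    user's j-th rated item; `ratings[j]` is the user's rating of that item.
    """
    # Keep the k neighbours with the highest similarity.
    neighbours = sorted(zip(similarities, ratings), reverse=True)[:k]
    weight_sum = sum(abs(sim) for sim, _ in neighbours)
    if weight_sum == 0:
        # No informative neighbours: fall back to the plain mean.
        return sum(r for _, r in neighbours) / len(neighbours)
    return sum(sim * r for sim, r in neighbours) / weight_sum

# Made-up example: three rated items, keep the two most similar.
sims = [0.5, 0.0, 0.25]
user_ratings = [4.0, 1.0, 2.0]
prediction = predict_rating(sims, user_ratings, k=2)
```

With these numbers the two retained neighbours have similarities 0.5 and 0.25, so the prediction is (0.5 × 4 + 0.25 × 2) / 0.75 = 10/3. The item with zero similarity is dropped by the top-k cut and does not affect the result.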


### Evaluation (Predictive Accuracy)

Now evaluate how closely the predictions match the hidden ratings:
- ***step 1***: identify the indices of the hidden ratings
- ***step 2***: extract the hidden ratings and the predicted ratings at those indices
- ***step 3***: calculate MAE, MSE and RMSE, taking the hidden ratings as the true values and the predicted ratings as the predicted values

In [48]:
# get predicted ratings for the testset, in the same order as testset_indices
predicted_ratings = [
    predic_matrix.iloc[user_id, item_id] for user_id, item_id in testset_indices
]

print("Predicted Ratings:")
print(predicted_ratings)

# get actual ratings for the testset
print("\nActual Ratings:")
actual_ratings = testset_df[2].to_list()
print(actual_ratings)

Predicted Ratings:
[5.0, 4.719316396716408, 4.5, 4.52232740430198, 4.0537776768897755, 4.221518731152049, 4.366990226007765, 5.0, 3.500626541028375, 3.8465629147553555, 2.9999999999999996, 5.0, 3.6666666666666665, 4.8392167498975835, 4.222222222222222, 3.0, 3.9041185733122936, 3.539046535578858, 5.000000000000001, 4.947863701804915, 5.0, 4.487504514087804, 4.761302360980933, 4.013129392853776, 4.1035373810170634, 4.005726263888764, 5.000000000000001, 4.685778028203096, 4.999999999999999, 4.999999999999999, 4.136548766730677, 4.954579127222277, 1.0, 4.720736183138814, 3.17367283583003, 4.6657657471054215, 3.164231592996815, 3.732256065655117, 5.0, 4.5, 5.0, 4.999999999999999, 5.0, 2.0, 3.9841046266635893, 3.959468803455633, 5.0, 5.0, 4.569256174676665, 4.684838825629955, 4.666666666666667, 4.0566317881604865, 4.802819398663669, 4.545454545454546, 4.198023428870018, 4.443655382256421, 5.0, 5.0, 4.655983130210817, 4.271851277419749, 5.0, 4.467564373303975, 4.642922776988164, 3.64468900967

In [49]:
# calculate MAE, MSE and RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Using sklearn")
mae = mean_absolute_error(actual_ratings, predicted_ratings)
mse = mean_squared_error(actual_ratings, predicted_ratings)
rmse = np.sqrt(mse)

# format specifiers work for both NumPy floats and built-in floats
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")


# Manually
print("\n\nManually")

# calculate MAE, MSE and RMSE using actual and predicted ratings
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings))) # Calculate Mean Absolute Error (MAE)
mse = np.mean((np.array(actual_ratings) - np.array(predicted_ratings)) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")


Using sklearn
Mean Absolute Error (MAE): 0.59
Mean Squared Error (MSE): 0.92
Root Mean Squared Error (RMSE): 0.96


Manually
Mean Absolute Error (MAE): 0.59
Mean Squared Error (MSE): 0.92
Root Mean Squared Error (RMSE): 0.96


In [50]:
# save results to csv
results = pd.DataFrame({'MAE': [mae.round(3)], 'MSE': [mse.round(3)], 'RMSE': [rmse.round(3)]})
results.to_csv("Data/Results/IBCF_results_3.csv", index=False)
