# Item Based Collaborative Filtering

## Algorithm Summary

Item-based collaborative filtering is a neighbourhood-based algorithm for making recommendations. It relies on the similarity between items, which is calculated from users' ratings of those items. It is also known as item-item collaborative filtering.
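
As a hedged summary in symbols: the formulation below is a common one and matches what the code in this notebook appears to compute. The notation is introduced here for illustration only: $r_{u,i}$ is user $u$'s rating of item $i$ (0 when unrated), and $N_k(i;u)$ is the set of at most $k$ items rated by $u$ that are most similar to $i$.

$$
\operatorname{sim}(i, j) = \frac{\sum_{u} r_{u,i}\, r_{u,j}}{\sqrt{\sum_{u} r_{u,i}^{2}}\;\sqrt{\sum_{u} r_{u,j}^{2}}},
\qquad
\hat{r}_{u,i} = \frac{\sum_{j \in N_k(i;u)} \operatorname{sim}(i, j)\, r_{u,j}}{\sum_{j \in N_k(i;u)} \lvert \operatorname{sim}(i, j) \rvert}
$$

Because unrated cells are stored as 0, they contribute nothing to the numerator of the similarity, and an item's norm grows only with the ratings it actually received.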

1. **Load the data**
- data is provided in a dataframe where each row is a review

2. **Create a user-item matrix**
- convert dataframe into user-item matrix where each row is a user and each column is an item

3. **Create test and train set**
- hide $N$ ratings per user from the training set and use the hidden ratings to test the performance of the model

4. **Calculate item similarity**
- using training set, calculate the similarity between items using cosine similarity

5. **Make predictions**
- for each user, for each item in the test set, calculate the weighted sum of the ratings of the items that are similar to the item in question

6. **Evaluate the model**
- calculate the predictive accuracy of the model using RMSE, MSE and MAE
- calculate the Top-N metrics of the model using NDCG and Hit Rate
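
The steps above can be sketched end to end on a tiny, made-up dataset. This is a minimal illustration under stated assumptions, not the notebook's implementation: the toy reviews are hypothetical (only the column names match the dataset), a single rating is hidden instead of $N$ per user, and the cosine similarity is computed directly with NumPy.

```python
import numpy as np
import pandas as pd

# Toy reviews (hypothetical data; column names follow the notebook's dataset)
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "asin":       ["A",  "B",  "C",  "A",  "B",  "A",  "B",  "C"],
    "overall":    [5.0,  4.0,  1.0,  4.0,  5.0,  2.0,  1.0,  5.0],
})

# Step 2: user-item matrix, with 0 marking "not rated"
ui = reviews.pivot_table(index="reviewerID", columns="asin", values="overall").fillna(0)

# Step 3: hide one rating (u1's rating of item B) to act as the test set
train = ui.copy()
train.loc["u1", "B"] = 0

# Step 4: item-item cosine similarity computed from the training matrix
R = train.to_numpy()
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Step 5: predict the hidden rating as a similarity-weighted average
# of the ratings u1 gave to the other items
u, i = train.index.get_loc("u1"), train.columns.get_loc("B")
rated = np.where(R[u] > 0)[0]
weights = sim[i, rated]
prediction = np.sum(weights * R[u, rated]) / np.sum(np.abs(weights))

# Step 6: absolute error against the hidden (true) rating
error = abs(ui.loc["u1", "B"] - prediction)
print(round(prediction, 2), round(error, 2))
```

With only two rated neighbours the top-$k$ selection is a no-op here; on real data the weights would first be truncated to the $k$ largest similarities.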

## Manual / From Fundamentals

In [116]:
%reset -f
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

### Reading in and Converting Data

In [117]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_modelling.csv')
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0.1,Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
0,76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
1,78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
2,81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806


Number of Rows:  83139
Number of Columns:  12
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


### Train and Test Split

In [118]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

In [119]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)

# see updated matrix with hidden ratings
print("Updated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)

Indices of Ratings per user 
 [[2807 2258 2647]
 [2111 1398 1498]
 [ 200 1102 1089]
 ...
 [2353 1482  185]
 [ 639 2206 3123]
 [ 193  533  406]]
Indices of Ratings per User joined [2807 2258 2647 ...  193  533  406]
Updated Matrix with Hidden Ratings


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Original Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZW0HVDKOXGN9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZX2RDN9YXZAE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AZY157FF14CSL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Similarity Matrix

In [120]:
# compute the item-item cosine similarity matrix (returned as a NumPy array)
sim_mat_cos = cosine_similarity(x_hidden.T).round(5)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.     , 0.     , 0.01624, ..., 0.     , 0.     , 0.     ],
       [0.     , 1.     , 0.14731, ..., 0.     , 0.     , 0.     ],
       [0.01624, 0.14731, 1.     , ..., 0.     , 0.     , 0.     ],
       ...,
       [0.     , 0.     , 0.     , ..., 1.     , 0.06161, 0.     ],
       [0.     , 0.     , 0.     , ..., 0.06161, 1.     , 0.     ],
       [0.     , 0.     , 0.     , ..., 0.     , 0.     , 1.     ]])

### Prediction Matrix

In [121]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# set k to 40
k = 40

# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_products_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
        similarity_i_j = sim_mat_cos[product_id, rated_products_indices]
        ratings = user_ratings[0, rated_products_indices]
        
        # sort by similarity and select top k
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        ratings = ratings[sorted_indices]

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # no non-zero similarities: fall back to the mean of the (up to k) selected ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,4.84,4.58,4.84,4.84,3.00,4.84,4.84,4.84,4.84,3.00,...,5.00,5.00,5.00,4.83,4.88,5.00,4.74,5.00,4.74,5.00
A100WO06OQR8BQ,5.00,4.41,4.16,5.00,5.00,5.00,5.00,4.16,5.00,4.16,...,3.88,4.22,3.88,5.00,3.00,3.91,5.00,4.71,3.00,5.00
A1027EV8A9PV1O,5.00,3.42,3.73,5.00,5.00,4.67,4.36,5.00,4.15,4.59,...,5.00,5.00,5.00,4.67,4.67,4.67,5.00,4.67,4.67,5.00
A103KKI1Y4TFNQ,3.50,1.00,4.03,4.23,3.42,4.48,4.40,4.96,3.94,2.40,...,4.38,4.38,4.34,4.38,4.38,4.38,5.00,4.60,4.38,4.04
A1047P9FLHTDZJ,4.93,4.93,4.93,5.00,4.93,4.93,4.93,4.93,4.93,4.93,...,4.93,5.00,5.00,4.65,5.00,5.00,5.00,5.00,4.58,4.93
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,4.00,5.00,4.74,4.85,4.82,4.82,4.00,4.74,4.00,5.00,...,5.00,4.63,4.85,4.84,4.84,4.74,5.00,4.70,4.79,4.65
AZW0HVDKOXGN9,3.45,4.00,3.45,3.45,3.45,3.45,3.45,3.45,3.45,3.45,...,3.45,4.00,3.45,3.61,4.00,3.45,3.45,3.61,4.53,3.45
AZX2RDN9YXZAE,4.00,4.00,3.42,3.42,4.00,4.00,4.00,3.42,4.00,3.42,...,3.23,3.23,3.12,3.42,3.51,3.42,3.00,3.42,3.42,4.00
AZY157FF14CSL,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,...,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00


### Evaluation (Predictive Accuracy)

Now evaluate how close the predictions are to the hidden ratings:
- ***step 1***: identify the indices of the hidden ratings
- ***step 2***: extract the hidden ratings and the corresponding predicted ratings
- ***step 3***: calculate MAE, MSE and RMSE (taking the hidden ratings as the true values and the predicted ratings as the predicted values)
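
For reference, the three accuracy metrics can be written as follows. The notation is an assumption made for illustration: $T$ is the set of hidden (user, item) pairs, $r_{u,i}$ is the hidden rating and $\hat{r}_{u,i}$ the predicted rating.

$$
\text{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \lvert r_{u,i} - \hat{r}_{u,i} \rvert,
\qquad
\text{MSE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{u,i} - \hat{r}_{u,i} \right)^{2},
\qquad
\text{RMSE} = \sqrt{\text{MSE}}
$$

RMSE is on the same scale as the ratings but penalises large errors more heavily than MAE, which is why it is never smaller than MAE on the same predictions.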

In [122]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# list to collect each user's hidden (true) ratings
hidden_ratings_arrays = []

# loop through users and read the hidden ratings from the original matrix
for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(predic_matrix.shape[0]):
    user_predicted_ratings = predic_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


Hidden Ratings: [3. 5. 5. ... 4. 4. 1.]
Corresponding Predicted Ratings: [4.93 4.9  4.98 ... 4.23 3.77 4.22]
Using sklearn
Mean Absolute Error (MAE): 0.5385923300617957
Mean Squared Error (MSE): 0.7430272537259178
Root Mean Squared Error (RMSE): 0.8619902863292126


Manually
Mean Absolute Error (MAE): 0.5385923300617957
Mean Squared Error (MSE): 0.7430272537259178
Root Mean Squared Error (RMSE): 0.8619902863292126


In [16]:
# round to 2 decimal places
mae = round(mae, 2)
mse = round(mse, 2)
rmse = round(rmse, 2)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
results.to_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\results_IBCF.csv', index=False)

### Evaluation (Top-N Metrics)
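
The summary lists Hit Rate and NDCG as Top-N metrics. One possible way to compute them is sketched below on toy data. Everything here is an assumption for illustration: the helper `hit_rate_and_ndcg` and its arguments are hypothetical names, a "hit" is counted when at least one hidden item appears in a user's top $n$, and relevance is treated as binary. In the notebook, `candidate_mask` would plausibly correspond to `x_hidden == 0`.

```python
import numpy as np

def hit_rate_and_ndcg(scores, hidden_items, candidate_mask, n=10):
    """Average Hit Rate@n and NDCG@n over users (illustrative helper).

    scores:         (n_users, n_items) array of predicted ratings
    hidden_items:   list of integer arrays, the held-out item indices per user
    candidate_mask: (n_users, n_items) boolean array, True where an item may be
                    recommended (the item is not in the user's training ratings)
    """
    hits, ndcgs = [], []
    for u, hidden in enumerate(hidden_items):
        # rank candidate items by predicted score, highest first
        masked = np.where(candidate_mask[u], scores[u], -np.inf)
        top_n = np.argsort(-masked, kind="stable")[:n]
        relevant = np.isin(top_n, hidden)
        hits.append(float(relevant.any()))
        # binary-relevance DCG, normalised by the best achievable DCG
        discounts = 1.0 / np.log2(np.arange(2, len(top_n) + 2))
        dcg = np.sum(relevant * discounts)
        ideal = np.sum(discounts[:min(len(hidden), len(top_n))])
        ndcgs.append(dcg / ideal if ideal > 0 else 0.0)
    return float(np.mean(hits)), float(np.mean(ndcgs))

# toy check: 2 users, 4 items, one hidden item each
scores = np.array([[4.5, 3.0, 5.0, 2.0],
                   [1.0, 4.0, 2.0, 3.0]])
candidate_mask = np.array([[True, True, True, False],
                           [True, False, True, True]])
hidden_items = [np.array([2]), np.array([0])]
hr, ndcg = hit_rate_and_ndcg(scores, hidden_items, candidate_mask, n=2)
print(hr, ndcg)
```

Masking out the training items matters: otherwise items the user has already rated would occupy the top of the ranking and push the hidden items out.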

## Using Packages

In [214]:
## Using Packages for IBCF
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split


This code first converts the user-item matrix into a long DataFrame of ratings, then removes every row where the rating is 0 (a 0 indicates that the user has not rated the item). When the `test` method is called on the algorithm with the testset returned by `train_test_split`, **it predicts ratings only for the user-item pairs held out in that testset, i.e. known ratings that were excluded from the training set.** Items a user has never rated are not part of this testset; predicting for those pairs requires a separate set, such as the one returned by `trainset.build_anti_testset()`.
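
The wide-to-long conversion can be sketched on a toy matrix. The values and the names `toy` and `long_ratings` are hypothetical; only the `stack` / `reset_index` / filter pattern mirrors the notebook.

```python
import pandas as pd

# toy user-item matrix (hypothetical values); 0 means "no rating"
toy = pd.DataFrame(
    [[5.0, 0.0, 3.0],
     [0.0, 4.0, 0.0]],
    index=pd.Index(["u1", "u2"], name="reviewerID"),
    columns=pd.Index(["A", "B", "C"], name="asin"),
)

# wide -> long: one row per (user, item) cell, including the zero cells
long_ratings = toy.stack().reset_index()
long_ratings.columns = ["user", "item", "rating"]

# keep only the observed ratings
long_ratings = long_ratings[long_ratings["rating"] != 0].reset_index(drop=True)
print(long_ratings)
```

The column order (user, item, rating) is the order `Dataset.load_from_df` expects.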



In [215]:
# load data and change it to User-Item-Rating format
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set4_data_modelling.csv')
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_modelling.csv', index_col=0)

display(amz_data.head())


x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806
82,A119Q9NFGVOEJZ,2016-02-13,767834739,5.0,every single video game based movie from the s...,everi singl video game base movi super mario b...,every single video game based movie super mari...,every single video game based movie super mari...,18,6,0.9846
83,A1RP6YCOS5VJ5I,2006-09-26,767834739,5.0,i think that i like this movie more than the o...,think like movi origin origin still great real...,think like movie original original still great...,think like movie original original still great...,29,10,0.9951





User-Item Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


In [216]:
# Start from the user-item matrix 'x' created above
# Convert the user-item matrix back to a long DataFrame of ratings
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']

# Remove rows where rating is 0
ratings = ratings[ratings['rating'] != 0]

# Define a Reader object
# The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader(rating_scale=(1, 5))

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

# Split the dataset into train and test (25%) - seed 2207
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)

In [220]:
x.shape[0]

3668

In [231]:

# Configure the algorithm - Item Based Collaborative Filtering
# Use cosine similarity
sim_options = {
    'name': 'cosine',
    'user_based': False  # this will compute similarity between items
}

# decide on k (maximum number of neighbouring items used per prediction)
k = 40

# Create the algorithm object, passing k as the maximum number of neighbours
algo = KNNBasic(sim_options=sim_options, k=k)

# Train the algorithm on the trainset
algo.fit(trainset)

# Predict ratings for the testset
predictions = algo.test(testset)

# Then compute RMSE, MSE and MAE
print("\nItem-based Model Test Set Results:")
mae_pack = accuracy.mae(predictions).round(2)
mse_pack = accuracy.mse(predictions).round(2)
rmse_pack = accuracy.rmse(predictions).round(2)

print(f"Mean Absolute Error (MAE): {mae_pack}")
print(f"Mean Squared Error (MSE): {mse_pack}")
print(f"Root Mean Squared Error (RMSE): {rmse_pack}")

Computing the cosine similarity matrix...
Done computing similarity matrix.

Item-based Model Test Set Results:
MAE:  0.5691
MSE: 0.7796
RMSE: 0.8830
Mean Absolute Error (MAE): 0.57
Mean Squared Error (MSE): 0.78
Root Mean Squared Error (RMSE): 0.88


***
# Manual Process with Same Data Splits

This version changes two things:
1. It uses the same training and test sets as the package, so the results of the manual process can be compared directly with the package results.
2. Predictions use the nearest neighbours, specifically k=40, which is the package default.
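
Reusing the package's test set means locating each (user, item) pair of that test set in the user-item matrix. A vectorised way to do this lookup is sketched below on toy labels. The labels, ratings and variable names are hypothetical; `users` and `items` stand in for `x.index` and `x.columns`.

```python
import pandas as pd

# toy labels standing in for x.index (users) and x.columns (items)
users = pd.Index(["u1", "u2", "u3"], name="reviewerID")
items = pd.Index(["A", "B", "C", "D"], name="asin")

# toy test set in the (user, item, rating) tuple layout returned by
# surprise's train_test_split
toy_testset = [("u3", "B", 4.0), ("u1", "D", 5.0)]
test_df = pd.DataFrame(toy_testset, columns=["user", "item", "rating"])

# vectorised label -> position lookup; get_indexer returns -1 for unknown labels
row_pos = users.get_indexer(test_df["user"])
col_pos = items.get_indexer(test_df["item"])
positions = list(zip(row_pos.tolist(), col_pos.tolist()))
print(positions)
```

A `-1` in either position array would signal a test-set label that is missing from the matrix, which is worth checking before indexing with `iloc`.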

In [203]:
%reset -f
# load libraries
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic
from surprise import accuracy
from surprise.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [204]:
# load data - WINDOWS
# amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
# display(amz_data.head())

# load data - MAC OS
amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set4_data_modelling.csv', index_col=0)
display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("Shape: ", x.shape)

Unnamed: 0,reviewerID,reviewTime,asin,overall,reviewText,stemmed_words_revText,lemmatised_reviewText,filtered_tokens_revText,sentiments_afinn,sentiments_bing,sentiments_vader
76,AQ8OO59DJFJNZ,2018-01-05,767834739,5.0,wonderful movie,wonder movi,wonderful movie,wonderful movie,4,1,0.5719
78,A244CRJ2QSVLZ4,2008-01-29,767834739,5.0,resident evil is a great science fictionhorror...,resid evil great scienc fictionhorror hybrid p...,resident evil great science fictionhorror hybr...,resident evil great science fictionhorror hybr...,-12,-5,-0.9455
81,A1VCLTAGM5RLND,2005-07-23,767834739,5.0,i this movie has people living and working und...,movi peopl live work underground place call hi...,movie people living working underground place ...,movie people living working underground place ...,-1,0,-0.1806


Number of Rows:  83139
Number of Columns:  11
Number of Unique Users:  3668
Number of Unique Products:  3249
Fewest reviews by a reviewer: 13
Most reviews by a reviewer: 193
Fewest reviews per product: 13
Most reviews per product: 189



User-Item Matrix


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
A1027EV8A9PV1O,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A103KKI1Y4TFNQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1047P9FLHTDZJ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Shape:  (3668, 3249)


### Train and Test Split

In [208]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()

# using created testset from packages chapter
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']
ratings = ratings[ratings['rating'] != 0]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader)
trainset, testset = train_test_split(data, test_size=.25, random_state=2207)
testset_df = pd.DataFrame(testset)


# convert each row of the testset to a (user, item) tuple
testset_tuples = [tuple(row) for row in testset_df[[0, 1]].to_numpy()]

# find indices of the testset in the original matrix
testset_indices = []
for i in range(len(testset_tuples)):
    user = testset_tuples[i][0]
    item = testset_tuples[i][1]
    user_index = x.index.get_loc(user)
    item_index = x.columns.get_loc(item)
    testset_indices.append((user_index, item_index))

# preview the first five (user_index, item_index) pairs
print("Testset Indices: ")
testset_indices[0:5]

Testset Indices: 


[(2098, 2152), (3101, 2450), (1465, 2895), (2951, 2239), (24, 950)]

In [209]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()

# hide each test-set rating (set it to 0) by iterating over the test-set indices directly,
# which avoids scanning every cell of the matrix
for user_id, item_id in testset_indices:
    x_hidden.iloc[user_id, item_id] = 0

### Similarity Matrix

In [210]:
# compute the item-item cosine similarity matrix (returned as a NumPy array)
sim_mat_cos = cosine_similarity(x_hidden.T).round(5)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.     , 0.07049, 0.     , ..., 0.     , 0.     , 0.     ],
       [0.07049, 1.     , 0.19717, ..., 0.     , 0.     , 0.     ],
       [0.     , 0.19717, 1.     , ..., 0.     , 0.     , 0.     ],
       ...,
       [0.     , 0.     , 0.     , ..., 1.     , 0.     , 0.     ],
       [0.     , 0.     , 0.     , ..., 0.     , 1.     , 0.     ],
       [0.     , 0.     , 0.     , ..., 0.     , 0.     , 1.     ]])

### Prediction Matrix

In [211]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# set k to 40
k = 40

# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unrated_products_indices = np.where(user_ratings == 0)[1]
    rated_products_indices = np.where(user_ratings > 0)[1]
    for product_id in unrated_products_indices:
        similarity_i_j = sim_mat_cos[product_id, rated_products_indices]
        ratings = user_ratings[0, rated_products_indices]
        
        # sort by similarity and select top k
        sorted_indices = np.argsort(similarity_i_j)[::-1][:k]
        similarity_i_j = similarity_i_j[sorted_indices]
        ratings = ratings[sorted_indices]

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # no non-zero similarities: fall back to the mean of the (up to k) selected ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, product_id] = predicted_rating

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


asin,0767834739,7799146915,B00000DMAT,B00000DMAX,B00000DMB3,B00000F1GM,B00000I1BJ,B00000I1BY,B00000ID61,B00000INR2,...,B01H353FLA,B01H353HUY,B01H3VFR6U,B01H5GB8ZW,B01H6OXQFS,B01H9SH2LU,B01HGBAFNC,B01HHVVLGQ,B01HHVWWMI,B01HIZF7XE
reviewerID,,,,,,,,,,,,,,,,,,,,,
A100RH4M1W1DF0,4.862069,5.000000,4.862069,4.862069,4.862069,4.862069,4.862069,4.862069,4.862069,4.862069,...,5.000000,5.000000,5.000000,4.792529,5.000000,5.000000,4.637296,5.000000,4.840139,4.862069
A100WO06OQR8BQ,5.000000,3.915148,4.058824,5.000000,5.000000,5.000000,5.000000,4.058824,5.000000,4.058824,...,4.042363,4.093441,4.016397,4.372909,3.000000,3.698180,5.000000,5.000000,3.000000,5.000000
A1027EV8A9PV1O,3.359316,4.642857,4.000000,4.642857,4.273629,5.000000,4.139704,5.000000,3.929362,4.585132,...,5.000000,5.000000,5.000000,4.642857,4.642857,4.642857,5.000000,5.000000,4.642857,5.000000
A103KKI1Y4TFNQ,4.925415,4.387097,5.000000,5.000000,4.387097,5.000000,4.524627,5.000000,5.000000,2.300757,...,4.387097,4.387097,4.416199,4.387097,4.387097,4.387097,5.000000,5.000000,4.387097,4.756786
A1047P9FLHTDZJ,4.923077,4.923077,4.923077,5.000000,4.923077,4.923077,4.923077,4.923077,4.923077,4.923077,...,4.923077,4.923077,5.000000,4.000000,5.000000,5.000000,5.000000,5.000000,4.404479,4.923077
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AZVIQ5SU7XPD5,4.000000,5.000000,5.000000,5.000000,4.000000,4.000000,4.000000,4.884615,4.000000,4.884615,...,5.000000,4.884615,4.843231,5.000000,4.831249,4.884615,5.000000,4.922505,5.000000,5.000000
AZW0HVDKOXGN9,3.538462,4.000000,3.538462,3.538462,3.538462,3.538462,3.538462,3.538462,3.538462,3.538462,...,3.538462,3.538462,3.538462,3.000000,3.437963,3.538462,3.538462,3.268758,4.535473,3.538462
AZX2RDN9YXZAE,3.303382,4.000000,3.400000,3.400000,4.000000,4.000000,4.000000,3.400000,4.000000,3.400000,...,3.553873,3.000000,3.139969,3.400000,3.286709,3.400000,3.000000,3.400000,3.400000,4.000000
AZY157FF14CSL,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,...,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000,5.000000


### Evaluation (Predictive Accuracy)

Now evaluate how close the predictions are to the hidden ratings:
- ***step 1***: identify the indices of the hidden ratings
- ***step 2***: extract the hidden ratings and the corresponding predicted ratings
- ***step 3***: calculate MAE, MSE and RMSE (taking the hidden ratings as the true values and the predicted ratings as the predicted values)

In [212]:
#  get predicted ratings for the testset
predicted_ratings = []
for i in range(len(testset_indices)):
    user_id = testset_indices[i][0]
    item_id = testset_indices[i][1]
    predicted_ratings.append(predic_matrix.iloc[user_id, item_id])

print("Predicted Ratings:")
print(predicted_ratings)

# get actual ratings for the testset
print("\nActual Ratings:")
actual_ratings = testset_df[2].to_list()
print(actual_ratings)

Predicted Ratings:
[2.3030236150960053, 4.142857142857143, 4.217856371101346, 5.0, 5.0, 4.670016378320375, 4.432599027282894, 4.999999999999999, 4.706374210607744, 4.970299167624422, 5.0, 3.3906525935356693, 3.9063581461627193, 4.780315173458548, 4.589158253488521, 4.028825918856751, 4.999999999999999, 5.0, 5.0, 4.568657741160316, 3.369339014963829, 5.0, 3.814978642735188, 4.375, 4.578974890513675, 4.304479096975624, 3.648973115382313, 5.0, 4.040639878877224, 4.999999999999999, 4.50212170894011, 5.0, 4.655189954907444, 4.137546297361338, 4.522769169827994, 5.0, 4.212735462735464, 4.471698471470617, 5.0, 5.0, 4.652047027432669, 5.0, 5.0, 4.0, 4.999999999999999, 3.869417894628235, 4.137138729281403, 5.000000000000001, 3.0814171223129354, 5.0, 4.5382761165185, 4.39417878326597, 3.9015834128672022, 5.0, 4.631578947368421, 3.8333333333333335, 5.000000000000001, 4.899838075410595, 4.113961718200912, 4.255387735883226, 4.16240099302518, 5.0, 5.000000000000001, 4.580301540478532, 5.00000000000

In [223]:
# calculate MAE, MSE and RMSE
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Using sklearn")
mae = mean_absolute_error(actual_ratings, predicted_ratings)
mse = mean_squared_error(actual_ratings, predicted_ratings)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")


# Manually
print("\n\nManually")

# calculate MAE, MSE and RMSE using actual and predicted ratings
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings))) # Calculate Mean Absolute Error (MAE)
mse = np.mean((np.array(actual_ratings) - np.array(predicted_ratings)) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)

print(f"Mean Absolute Error (MAE): {mae.round(2)}")
print(f"Mean Squared Error (MSE): {mse.round(2)}")
print(f"Root Mean Squared Error (RMSE): {rmse.round(2)}")


Using sklearn
Mean Absolute Error (MAE): 0.56
Mean Squared Error (MSE): 0.77
Root Mean Squared Error (RMSE): 0.88


Manually
Mean Absolute Error (MAE): 0.56
Mean Squared Error (MSE): 0.77
Root Mean Squared Error (RMSE): 0.88
