# User Based Collaborative Filtering
## Algorithm Summary

User-based collaborative filtering is a memory-based algorithm for making recommendations. It is based on the similarity between users, calculated from their ratings of items. It is also known as user-user collaborative filtering.

1. **Load the data**
- data is provided in a dataframe where each row is a review

2. **Create a user-item matrix**
- convert dataframe into user-item matrix where each row is a user and each column is an item

3. **Create test and train set**
- hide $N$ ratings for each user in the training set and use them to test the performance of the model

4. **Calculate user similarity**
- using training set, calculate the similarity between users using cosine similarity

5. **Make predictions**
- for each user and each item in the test set, calculate the similarity-weighted average of the ratings that similar users gave to that item

6. **Evaluate the model**
- calculate the predictive accuracy of the model using RMSE, MSE and MAE
- calculate the Top-N metrics of the model using NDCG and Hit Rate
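
As a minimal sketch of the Top-N metrics named in step 6, the block below computes Hit Rate and NDCG at $N$ for a single user. The function names (`hit_rate_at_n`, `ndcg_at_n`), the binary-relevance assumption (a held-out item is relevant, everything else is not) and the toy item IDs are illustrative assumptions, not part of the notebook.

```python
import math

def hit_rate_at_n(ranked_items, held_out_items, n):
    """Return 1.0 if any held-out item appears in the top-n recommendations, else 0.0."""
    top_n = ranked_items[:n]
    return 1.0 if any(item in held_out_items for item in top_n) else 0.0

def ndcg_at_n(ranked_items, held_out_items, n):
    """NDCG@n with binary relevance: each held-out item in the top n has gain 1."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:n])
        if item in held_out_items
    )
    # ideal ranking places all held-out items (up to n) at the top
    ideal_hits = min(n, len(held_out_items))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# hypothetical ranking for one user, best predicted item first
ranked = ["book7", "book2", "book5", "book9"]
held_out = {"book5", "book1"}

hit = hit_rate_at_n(ranked, held_out, n=3)
ndcg = ndcg_at_n(ranked, held_out, n=3)
print(hit, round(ndcg, 4))
```

Averaging both values over all users would give the model-level Hit Rate and NDCG.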

## Manual / From Fundamentals

In [None]:
%reset -f
# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# load data - WINDOWS
amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
display(amz_data.head())

# load data - MAC OS
# amz_data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/set2_data_modelling.csv')
# display(amz_data.head(3))

# print details
print('Number of Rows: ', amz_data.shape[0])
print('Number of Columns: ', amz_data.shape[1])
print('Number of Unique Users: ', len(amz_data['reviewerID'].unique()))
print('Number of Unique Products: ', len(amz_data['asin'].unique()))
print('Fewest reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().min())
print('Most reviews by a reviewer:', amz_data.groupby('reviewerID')['asin'].count().max())
print("Fewest reviews per product:", amz_data.groupby('asin')['reviewerID'].count().min())
print("Most reviews per product:", amz_data.groupby('asin')['reviewerID'].count().max())


# Creating User Item Matrix =====================================================
# create user-item matrix
x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

Unnamed: 0,reviewerID,reviewerName,reviewTime,asin,reviewText,category,overall,normalized_rating,stemmed_words_revText,lemmatized_words_revText,filtered_tokens_revText,sentiments_vader_revText,sentiments_textblob_revText,subjectivities_textblob_revText,sentiment_score_afinn_revText,sentiment_score_bing_revText,sentiment_score_nrc_revText
0,A29NAG6NZOBAJ8,kingpin16,2014-11-24,B001IH8ERA,tuna yum,grocery_and_gourmet_food,5.0,1.0,"['tuna', 'yum']","['tuna', 'yum']","['tuna', 'yum']","{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,0.0,0,0,trust
1,A1WVA7V02PQOY6,Dad of Divas,2015-02-10,B000ZGY4PG,as someone that has always liked eating oatmea...,grocery_and_gourmet_food,5.0,1.0,"['someon', 'alway', 'like', 'eat', 'oatmeal', ...","['someone', 'always', 'liked', 'eating', 'oatm...","['someone', 'always', 'liked', 'eating', 'oatm...","{'neg': 0.0, 'neu': 0.82, 'pos': 0.18, 'compou...",0.397564,0.705641,4,4,positive
2,A1KQJLBDF2OEMD,Sherelle Ellis,2015-07-28,B00YLLHNHW,humans are stupid they love and they make mist...,kindle_store,4.0,0.75,"['human', 'stupid', 'love', 'make', 'mistak', ...","['human', 'stupid', 'love', 'make', 'mistake',...","['humans', 'stupid', 'love', 'make', 'mistakes...","{'neg': 0.109, 'neu': 0.703, 'pos': 0.188, 'co...",0.097186,0.688095,4,1,sadness
3,A1MUHTKSOY7WVO,Upgrade Taos Computers,2015-05-20,B00GJU4DD0,this thing rocks very lightslim great fit does...,electronics,5.0,1.0,"['thing', 'rock', 'lightslim', 'great', 'fit',...","['thing', 'rock', 'lightslim', 'great', 'fit',...","['thing', 'rocks', 'lightslim', 'great', 'fit'...","{'neg': 0.0, 'neu': 0.512, 'pos': 0.488, 'comp...",0.466667,0.483333,3,1,trust
4,APZSWNPMVSZ84,Ronald Davis,2002-08-30,B000001FUB,man this ish is on fire the mothership crashed...,cds_and_vinyl,5.0,1.0,"['man', 'ish', 'fire', 'mothership', 'crash', ...","['man', 'ish', 'fire', 'mothership', 'crashed'...","['man', 'ish', 'fire', 'mothership', 'crashed'...","{'neg': 0.036, 'neu': 0.964, 'pos': 0.0, 'comp...",0.6,1.0,-2,-3,fear


Number of Rows:  403550
Number of Columns:  17
Number of Unique Users:  161210
Number of Unique Products:  227151


Subset of the Data
Number of Unique Users:  401
Number of Unique Products:  11615
Number of Rows:  12507


User-Item Matrix


reviewerID,0740782282,0767802799,0767805712,0767809254,0767819462,0767826728,0767827759,0780626699,0782010040,0782010792,...,B01HG1LA6S,B01HG36N0Y,B01HH79XRE,B01HHGAIHE,B01HHVWDG8,B01HHVZRRA,B01HHW0LSY,B01HI776Y0,B01HIPMSAY,B01HIWLIBM
A100WO06OQR8BQ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A10ZBR6O8S8OCY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
A119Q9NFGVOEJZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A11OTLEDSW8ZXD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A129YBX5BVNW2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Train and Test Split

In [None]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# number of products to hide for each user
N = 3

# identifies rated items and randomly selects N products to hide ratings for each user
np.random.seed(2207)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_products = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    # print("User:", user_id)
    # print("Indices of Rated Products:", rated_products)
    hidden_indices = np.random.choice(rated_products, N, replace=False)
    indices_tracker.append(hidden_indices)
    # print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0

User: 0
Indices of Rated Products: [   30  1870  2431  2919  3167  3227  3232  3234  3312  4270  4561  4578
  5045  5343  5408  5672  5704  5782  6057  6215  6591  6704  6731  7177
  7403  7522  7603  8023  8070  8507  8838  8912  9061  9190  9273  9727
  9784 10267 10909 11598]
Indices to Hide: [2431 8023] 

User: 1
Indices of Rated Products: [ 2121  2147  4122  5183  5704  5841  5930  6293  6843  6923  7650  8369
  8575  8843  8939  9070  9626 10356 10416 10426 11099 11218 11287 11435
 11568 11605 11608]
Indices to Hide: [11218  4122] 

User: 2
Indices of Rated Products: [ 1263  1446  1470  1471  1534  1549  1569  1577  1580  1696  1838  1839
  1880  1949  1988  1989  2158  2319  2350  2492  2634  2938  2940  2981
  3053  3181  3319  3377  3428  3433  3777  3873  4312  5179  5501  5937
  5938  6569  6891  8186 10278 10992]
Indices to Hide: [5937 3319] 

User: 3
Indices of Rated Products: [ 1834  1979  2723  3298  3796  5367  5556  6066  6553  6838  7141  7649
  7760  7998  8473  9161

In [None]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)

# see updated matrix with hidden ratings
print("Updated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)
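
`np.random.choice(..., N, replace=False)` raises a `ValueError` when a user has fewer than `N` rated products. The sketch below wraps the hide-N split in a function that guards against this. The function name `hide_n_ratings`, the use of a plain NumPy array instead of the dataframe, and the choice to always leave at least one rating in training are assumptions made for illustration.

```python
import numpy as np

def hide_n_ratings(matrix, n, seed=0):
    """Return (train, hidden): a copy with ratings hidden, and the hidden column indices per row."""
    rng = np.random.default_rng(seed)
    train = matrix.copy()
    hidden = {}
    for user_id in range(matrix.shape[0]):
        rated = np.flatnonzero(matrix[user_id] > 0)
        # hide at most n ratings, and always keep at least one rating for training
        k = min(n, max(len(rated) - 1, 0))
        if k > 0:
            chosen = rng.choice(rated, size=k, replace=False)
        else:
            chosen = np.array([], dtype=int)
        train[user_id, chosen] = 0
        hidden[user_id] = chosen
    return train, hidden

# toy matrix: 3 users x 4 items, 0 means "not rated"
toy = np.array([
    [5, 3, 0, 4],
    [0, 2, 0, 0],
    [4, 0, 5, 1],
])
train, hidden = hide_n_ratings(toy, n=2, seed=2207)
print({user: list(cols) for user, cols in hidden.items()})
```

The second user has a single rating, so nothing is hidden for that row and the split does not fail.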

### Similarity Matrix

In [None]:
# compute the user-user cosine similarity matrix from the training matrix (rows of x_hidden are users)
sim_mat_cos = cosine_similarity(x_hidden).round(5)
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
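
To make the similarity step concrete, here is a small sketch of cosine similarity computed by hand for two rating vectors. The helper name `cosine_sim` and the toy vectors are assumptions for illustration. Because unrated items are stored as 0, they contribute nothing to the dot product but the rated items still count towards each vector's norm.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

user_a = [5.0, 0.0, 3.0, 0.0]
user_b = [4.0, 0.0, 0.0, 2.0]
user_c = [0.0, 0.0, 0.0, 0.0]  # a user with no ratings left in the training matrix

sim_ab = cosine_sim(user_a, user_b)  # only the first item is shared
sim_ac = cosine_sim(user_a, user_c)
print(round(sim_ab, 5), sim_ac)
```

An all-zero row has zero similarity with every row, including itself, which is one possible reason for a 0 on the diagonal of a similarity matrix.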

### Prediction Matrix

In [None]:
# get a predictions matrix (float, so predicted ratings keep their decimals)
predic_matrix = x_hidden.astype(float)
train_ratings = x_hidden.to_numpy()

# now get predicted ratings for all users
for user_id in range(train_ratings.shape[0]):
    user_ratings = train_ratings[user_id, :]
    unrated_products_indices = np.where(user_ratings == 0)[0]
    user_mean = user_ratings[user_ratings > 0].mean()  # fallback when no similar user rated the product
    for product_id in unrated_products_indices:
        rated_users_indices = np.where(train_ratings[:, product_id] > 0)[0]  # other users who rated this product
        similarity_i_j = sim_mat_cos[user_id, rated_users_indices]  # similarity between this user and those users
        ratings = train_ratings[rated_users_indices, product_id]  # their ratings of this product

        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # make predicted rating mean of user's ratings
            predicted_rating = user_mean

        predic_matrix.iloc[user_id, product_id] = round(predicted_rating, 2)

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


reviewerID,0740782282,0767802799,0767805712,0767809254,0767819462,0767826728,0767827759,0780626699,0782010040,0782010792,...,B01HG1LA6S,B01HG36N0Y,B01HH79XRE,B01HHGAIHE,B01HHVWDG8,B01HHVZRRA,B01HHW0LSY,B01HI776Y0,B01HIPMSAY,B01HIWLIBM
A100WO06OQR8BQ,4.03,4.03,4.03,4.03,4.03,4.03,4.03,4.03,4.03,4.03,...,5.00,4.03,4.03,5.00,4.03,4.03,4.03,4.03,4.03,4.03
A10ZBR6O8S8OCY,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,...,4.00,4.40,4.40,5.00,4.40,4.40,4.40,4.40,4.40,4.40
A119Q9NFGVOEJZ,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,...,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00,5.00
A11OTLEDSW8ZXD,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35,...,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35,4.35
A129YBX5BVNW2,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59,...,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59,4.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AXUJFOFQZNTN,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,...,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40
AY1EF0GOH80EK,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09,...,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09,4.09
AZSN1TO0JI87B,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,...,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40,4.40
AZV26LP92E6WU,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55,...,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55,4.55
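
A user-based prediction (a similarity-weighted average of the ratings that other users gave to an item) can also be sketched with matrix products instead of nested loops, which may be considerably faster on a matrix of this size. The function name `predict_user_based`, the toy ratings `R` and the hand-written similarity matrix `S` are assumptions for illustration. The fallback to the user's mean rating mirrors the one used in the loop.

```python
import numpy as np

def predict_user_based(R, S):
    """Fill the zeros of R with similarity-weighted averages of other users' ratings."""
    rated = (R > 0).astype(float)
    numer = S @ R                # sum over users v of S[u, v] * R[v, i]
    denom = np.abs(S) @ rated    # sum of |S[u, v]| over users v who rated item i
    counts = rated.sum(axis=1)
    # mean of each user's own ratings, used when no similar user rated the item
    user_mean = np.divide(R.sum(axis=1), counts, out=np.zeros(R.shape[0]), where=counts > 0)
    fallback = np.broadcast_to(user_mean[:, None], R.shape).copy()
    weighted = np.divide(numer, denom, out=fallback, where=denom > 0)
    # keep known ratings, predict only the missing ones
    return np.where(R > 0, R, weighted)

# toy data: 3 users x 3 items, 0 means "not rated"
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 0.0, 4.0],
])
# hypothetical user-user similarities (symmetric, ones on the diagonal)
S = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])
pred = predict_user_based(R, S)
print(pred.round(2))
```

For an item the user has not rated, the user's own row contributes nothing to either product, so the diagonal of `S` does not leak into the prediction.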


### Evaluation (Predictive Accuracy)

Now evaluate how well the predictions match the hidden ratings:
- ***step 1***: identify the hidden ratings indices
- ***step 2***: extract the hidden ratings and the corresponding predicted ratings at those indices
- ***step 3***: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values)
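
For reference, the three error metrics in step 3 can be written as follows, where $r_k$ is the $k$-th hidden rating, $\hat{r}_k$ is the corresponding predicted rating and $n$ is the total number of hidden ratings (the symbols are notation introduced here, not taken from the notebook):

$$
\text{MAE} = \frac{1}{n}\sum_{k=1}^{n}\left|r_k - \hat{r}_k\right|, \qquad
\text{MSE} = \frac{1}{n}\sum_{k=1}^{n}\left(r_k - \hat{r}_k\right)^2, \qquad
\text{RMSE} = \sqrt{\text{MSE}}
$$

MSE and RMSE penalise large errors more heavily than MAE, so a gap between RMSE and MAE suggests that a few predictions are far from the true rating.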

In [None]:
# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users and collect each user's hidden (true) ratings from the original matrix
hidden_ratings_arrays = []

for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)

# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(predic_matrix.shape[0]):
    user_predicted_ratings = predic_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)

# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

In [None]:
# round to 2 decimal places
mae = round(mae, 2)
mse = round(mse, 2)
rmse = round(rmse, 2)

# Save the results to a csv file
results = pd.DataFrame({'MAE': [mae], 'MSE': [mse], 'RMSE': [rmse]})
results.to_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\results_IBCF.csv', index=False)

## Using Packages

In [None]:
# Using packages for user-based collaborative filtering
import surprise
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise import KNNBasic

In [None]:
# load the data and create the user-item matrix
amz_data = pd.read_csv(r'C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\Data\set2_data_modelling.csv')
display(amz_data.head())


x = amz_data.pivot_table(index='reviewerID', columns='asin', values='overall')
x = x.fillna(0)
print("\n\n\nUser-Item Matrix")
display(x.head())
print('Shape: ', x.shape)

In [None]:
# Import necessary libraries
from surprise import Dataset, Reader, KNNBasic, accuracy
from surprise.model_selection import train_test_split

# The user-item matrix 'x' was created in the previous cell
# Convert the user-item matrix back to a long DataFrame of ratings
ratings = x.stack().reset_index()
ratings.columns = ['user', 'item', 'rating']

# Remove rows where rating is 0
ratings = ratings[ratings['rating'] != 0]

# Define a Reader object
# The Reader object helps in parsing the file or dataframe containing ratings
reader = Reader(rating_scale=(1, 5))

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

# Split the dataset into train and test
# Test set is made of 25% of the ratings
trainset, testset = train_test_split(data, test_size=.25)

# Configure the algorithm - User Based Collaborative Filtering
# Use cosine similarity
sim_options = {
    'name': 'cosine',
    'user_based': True  # this will compute similarity between users
}
algo = KNNBasic(sim_options=sim_options)

# Train the algorithm on the trainset
algo.fit(trainset)

# Predict ratings for the testset
predictions = algo.test(testset)

# Then compute RMSE, MSE and MAE
print("User-based Model : Test Set")
accuracy.rmse(predictions, verbose=True)
accuracy.mse(predictions, verbose=True)
accuracy.mae(predictions, verbose=True)


***
# Sandbox

Here we test the workings of item-based collaborative filtering on a small example matrix. The steps are as follows:

1. Have User Item matrix
2. Hide some ratings to simulate a test set
3. Calculate similarity (cosine similarity)
4. Calculate weighted average of ratings
5. Fill in missing values with predicted ratings
6. Take the predicted ratings and compare them to the hidden ratings
7. Calculate MAE, RMSE, MSE
8. Binarise the ratings 
9. Calculate classification metrics


In [None]:
%reset -f

# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
x = pd.read_csv(r"C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\temp_data.csv", index_col=0)
x

Unnamed: 0,book1,book2,book3,book4,book5,book6,book7,book8,book9,book10
user1,0,0,2,5,4,3,4,4,4,4
user2,4,0,3,5,0,0,0,0,0,4
user3,0,3,4,4,0,2,0,0,0,0
user4,0,0,3,5,4,0,0,0,0,0
user5,3,4,0,4,4,0,5,5,5,5
user6,4,5,0,0,0,0,4,2,2,0
user7,2,2,0,0,0,0,5,3,3,3
user8,0,5,4,0,4,3,0,0,0,0
user9,0,5,4,0,5,2,0,2,2,0
user10,0,0,0,0,5,0,4,4,4,4


In [None]:
# create a copy of the original matrix to store hidden ratings
x_hidden = x.copy()
indices_tracker = []

# identifies rated books and randomly selects 2 books to hide ratings for each user
np.random.seed(10)  # You can use any integer value as the seed
for user_id in range(x_hidden.shape[0]):
    rated_books = np.where(x_hidden.iloc[user_id, :] > 0)[0]
    print("User:", user_id)
    print("Indices of Rated Books:", rated_books)
    hidden_indices = np.random.choice(rated_books, min(2, len(rated_books)), replace=False)
    indices_tracker.append(hidden_indices)
    print("Indices to Hide:", hidden_indices, "\n")
    x_hidden.iloc[user_id, hidden_indices] = 0


User: 0
Indices of Rated Books: [2 3 4 5 6 7 8 9]
Indices to Hide: [4 5] 

User: 1
Indices of Rated Books: [0 2 3 9]
Indices to Hide: [3 9] 

User: 2
Indices of Rated Books: [1 2 3 5]
Indices to Hide: [5 1] 

User: 3
Indices of Rated Books: [2 3 4]
Indices to Hide: [4 2] 

User: 4
Indices of Rated Books: [0 1 3 4 6 7 8 9]
Indices to Hide: [9 1] 

User: 5
Indices of Rated Books: [0 1 6 7 8]
Indices to Hide: [8 1] 

User: 6
Indices of Rated Books: [0 1 6 7 8 9]
Indices to Hide: [8 9] 

User: 7
Indices of Rated Books: [1 2 4 5]
Indices to Hide: [1 5] 

User: 8
Indices of Rated Books: [1 2 4 5 7 8]
Indices to Hide: [1 7] 

User: 9
Indices of Rated Books: [4 6 7 8 9]
Indices to Hide: [6 4] 

User: 10
Indices of Rated Books: [0 1 2 4 6 7 8]
Indices to Hide: [2 0] 

User: 11
Indices of Rated Books: [0 1 2 4 5 6 7 8]
Indices to Hide: [1 7] 



In [None]:
# check tracker - all hidden ratings 
indices_tracker = pd.DataFrame(indices_tracker).to_numpy()
print("Indices of Ratings per user \n", indices_tracker)

# flattened
indices_tracker_flat = indices_tracker.flatten()
print("Indices of Ratings per User joined", indices_tracker_flat)


Indices of Ratings per user 
 [[4 5]
 [3 9]
 [5 1]
 [4 2]
 [9 1]
 [8 1]
 [8 9]
 [1 5]
 [1 7]
 [6 4]
 [2 0]
 [1 7]]
Indices of Ratings per User joined [4 5 3 9 5 1 4 2 9 1 8 1 8 9 1 5 1 7 6 4 2 0 1 7]


In [None]:
# see updated matrix with hidden ratings
print("Updated Matrix with Hidden Ratings")
display(x_hidden)

# see original matrix
print("Original Matrix")
display(x)

Updated Matrix with Hidden Ratings


Unnamed: 0,book1,book2,book3,book4,book5,book6,book7,book8,book9,book10
user1,0,0,2,5,0,0,4,4,4,4
user2,4,0,3,0,0,0,0,0,0,0
user3,0,0,4,4,0,0,0,0,0,0
user4,0,0,0,5,0,0,0,0,0,0
user5,3,0,0,4,4,0,5,5,5,0
user6,4,0,0,0,0,0,4,2,0,0
user7,2,2,0,0,0,0,5,3,0,0
user8,0,0,4,0,4,0,0,0,0,0
user9,0,0,4,0,5,2,0,0,2,0
user10,0,0,0,0,0,0,0,4,4,4


Original Matrix


Unnamed: 0,book1,book2,book3,book4,book5,book6,book7,book8,book9,book10
user1,0,0,2,5,4,3,4,4,4,4
user2,4,0,3,5,0,0,0,0,0,4
user3,0,3,4,4,0,2,0,0,0,0
user4,0,0,3,5,4,0,0,0,0,0
user5,3,4,0,4,4,0,5,5,5,5
user6,4,5,0,0,0,0,4,2,2,0
user7,2,2,0,0,0,0,5,3,3,3
user8,0,5,4,0,4,3,0,0,0,0
user9,0,5,4,0,5,2,0,2,2,0
user10,0,0,0,0,5,0,4,4,4,4


In [None]:
# get the item-item cosine similarity matrix, then save a labelled copy to csv
sim_mat_cos = cosine_similarity(x_hidden.T).round(2)
pd.DataFrame(sim_mat_cos, index=x.columns, columns=x.columns).to_csv(r"C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\temp_data_sim_mat_cosine.csv")
print("Cosine Similarity Matrix") 
sim_mat_cos

Cosine Similarity Matrix


array([[1.  , 0.17, 0.39, 0.16, 0.28, 0.5 , 0.62, 0.36, 0.37, 0.  ],
       [0.17, 1.  , 0.  , 0.  , 0.38, 0.  , 0.66, 0.58, 0.36, 0.  ],
       [0.39, 0.  , 1.  , 0.34, 0.54, 0.56, 0.19, 0.1 , 0.31, 0.17],
       [0.16, 0.  , 0.34, 1.  , 0.19, 0.  , 0.41, 0.45, 0.45, 0.39],
       [0.28, 0.38, 0.54, 0.19, 1.  , 0.48, 0.51, 0.5 , 0.67, 0.  ],
       [0.5 , 0.  , 0.56, 0.  , 0.48, 1.  , 0.23, 0.  , 0.37, 0.  ],
       [0.62, 0.66, 0.19, 0.41, 0.51, 0.23, 1.  , 0.85, 0.71, 0.26],
       [0.36, 0.58, 0.1 , 0.45, 0.5 , 0.  , 0.85, 1.  , 0.86, 0.58],
       [0.37, 0.36, 0.31, 0.45, 0.67, 0.37, 0.71, 0.86, 1.  , 0.58],
       [0.  , 0.  , 0.17, 0.39, 0.  , 0.  , 0.26, 0.58, 0.58, 1.  ]])

In [None]:
# get a predictions matrix
predic_matrix = x_hidden.copy()

# get predicted ratings for unread books for user 1 using cosine similarity
user_ratings = predic_matrix.iloc[0, :].values.reshape(1, -1)
unread_books_indices = np.where(user_ratings == 0)[1]
rated_books_indices = np.where(user_ratings > 0)[1]

for book_id in unread_books_indices:
    similarity_i_j = sim_mat_cos[book_id, rated_books_indices]
    ratings = user_ratings[0, rated_books_indices]
    predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
    predic_matrix.iloc[0, book_id] = predicted_rating.round(2)

# see updated matrix with predicted ratings
print("Predicted Ratings for User 1")
display(predic_matrix)

# save to csv
predic_matrix.to_csv(r"C:\Users\e1002902\Documents\GitHub Repository\Masters-Dissertation\Code\temp_data_predic_matrix_cosine.csv")


Predicted Ratings for User 1


Unnamed: 0,book1,book2,book3,book4,book5,book6,book7,book8,book9,book10
user1,3.67,4,2,5,3.63,3.03,4,4,4,4
user2,4.0,0,3,0,0.0,0.0,0,0,0,0
user3,0.0,0,4,4,0.0,0.0,0,0,0,0
user4,0.0,0,0,5,0.0,0.0,0,0,0,0
user5,3.0,0,0,4,4.0,0.0,5,5,5,0
user6,4.0,0,0,0,0.0,0.0,4,2,0,0
user7,2.0,2,0,0,0.0,0.0,5,3,0,0
user8,0.0,0,4,0,4.0,0.0,0,0,0,0
user9,0.0,0,4,0,5.0,2.0,0,0,2,0
user10,0.0,0,0,0,0.0,0.0,0,4,4,4
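
The prediction of 3.67 for user1 and book1 can be reproduced by hand from the matrices above. The sketch below copies the relevant similarities from the first row of the cosine similarity matrix and user1's remaining ratings from the matrix with hidden ratings. The variable names are chosen for illustration.

```python
# similarities between book1 and the books user1 has rated in the training matrix
# (row 0 of the cosine similarity matrix, columns book3, book4, book7, book8, book9, book10)
similarities = [0.39, 0.16, 0.62, 0.36, 0.37, 0.0]
# user1's ratings for those same books
ratings = [2, 5, 4, 4, 4, 4]

# weighted average: sum(rating * similarity) / sum(|similarity|)
numerator = sum(r * s for r, s in zip(ratings, similarities))
denominator = sum(abs(s) for s in similarities)
predicted = round(numerator / denominator, 2)
print(predicted)  # 3.67
```

The numerator is 6.98 and the denominator is 1.90, so the books most similar to book1 (book7 in particular) pull the prediction towards user1's rating of 4 for them.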


In [None]:
# now get predicted ratings for all users
for user_id in range(predic_matrix.shape[0]):
    user_ratings = predic_matrix.iloc[user_id, :].values.reshape(1, -1)
    unread_books_indices = np.where(user_ratings == 0)[1]
    rated_books_indices = np.where(user_ratings > 0)[1]
    for book_id in unread_books_indices:
        similarity_i_j = sim_mat_cos[book_id, rated_books_indices]
        ratings = user_ratings[0, rated_books_indices]
        
        if np.any(similarity_i_j):
            predicted_rating = np.sum(ratings * similarity_i_j) / np.sum(np.abs(similarity_i_j))
        else:
            # make predicted rating mean of user's ratings
            predicted_rating = np.mean(ratings)
        
        predic_matrix.iloc[user_id, book_id] = predicted_rating.round(2)

# see updated matrix with predicted ratings
print("Predicted Ratings for All Users")
display(predic_matrix)

Predicted Ratings for All Users


Unnamed: 0,book1,book2,book3,book4,book5,book6,book7,book8,book9,book10
user1,3.67,4.0,2.0,5.0,3.63,3.03,4.0,4.0,4.0,4.0
user2,4.0,4.0,3.0,3.32,3.34,3.47,3.77,3.78,3.54,3.0
user3,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
user4,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
user5,3.0,4.67,4.11,4.0,4.0,4.06,5.0,5.0,5.0,4.78
user6,4.0,3.18,3.71,3.12,3.22,4.0,4.0,2.0,3.11,2.62
user7,2.0,2.0,2.99,3.65,3.22,2.95,5.0,3.0,3.3,3.62
user8,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
user9,3.05,3.54,4.0,3.28,5.0,2.0,3.16,3.16,2.0,2.45
user10,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


In [None]:
# now evaluate how good the predictions are vs the hidden ratings
# step 1: identify the hidden ratings indices
# step 2: extract hidden ratings indices and corresponding predicted ratings indices
# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values)
# step 4:  binarise to get classification metrics

# step 1: identify the hidden ratings indices = indices_tracker and get the hidden ratings ==========================================================================
hidden_ratings_ind = indices_tracker.copy()

# Loop through users and collect each user's hidden (true) ratings from the original matrix
hidden_ratings_arrays = []

for user in range(x.shape[0]):
    user_hidden_ratings = x.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    hidden_ratings_arrays.append(user_hidden_ratings)


hidden_ratings_array = pd.DataFrame(hidden_ratings_arrays).to_numpy().flatten()
print("Hidden Ratings:", hidden_ratings_array)


Hidden Ratings: [4 3 5 4 2 3 4 3 5 4 2 5 3 3 5 3 5 2 4 5 2 4 5 3]


In [None]:
# step 2: extract corresponding predicted ratings indices ==========================================================================

# Create an empty list to store predicted ratings arrays
predicted_ratings_arrays = []

# Loop through users to append predicted ratings arrays
for user in range(predic_matrix.shape[0]):
    user_predicted_ratings = predic_matrix.iloc[user, hidden_ratings_ind[user, :]].reset_index(drop=True).values
    predicted_ratings_arrays.append(user_predicted_ratings)

predicted_ratings_array = pd.DataFrame(predicted_ratings_arrays).to_numpy().flatten()
print("Corresponding Predicted Ratings:", predicted_ratings_array)


Corresponding Predicted Ratings: [3.63 3.03 3.32 3.   4.   4.   5.   5.   4.78 4.67 3.11 3.18 3.3  3.62
 4.   4.   3.54 3.16 4.   4.   5.   4.72 2.97 3.08]


In [None]:
# step 3: calculate MAE, MSE and RMSE (take the hidden ratings as the true values and the predicted ratings as the predicted values) ==========================================================================

from sklearn.metrics import mean_absolute_error, mean_squared_error

# calculate MAE, MSE and RMSE
print("Using sklearn")
mae = mean_absolute_error(hidden_ratings_array, predicted_ratings_array)
mse = mean_squared_error(hidden_ratings_array, predicted_ratings_array)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")


# Manually
print("\n\nManually")
mae = np.mean(np.abs(hidden_ratings_array - predicted_ratings_array)) # Calculate Mean Absolute Error (MAE)
mse = np.mean((hidden_ratings_array - predicted_ratings_array) ** 2) # Calculate Mean Squared Error (MSE)
rmse = np.sqrt(mse) # Calculate Root Mean Squared Error (RMSE)


print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")

Using sklearn
Mean Absolute Error (MAE): 1.0529166666666667
Mean Squared Error (MSE): 1.6499708333333334
Root Mean Squared Error (RMSE): 1.2845119047067386


Manually
Mean Absolute Error (MAE): 1.0529166666666667
Mean Squared Error (MSE): 1.6499708333333334
Root Mean Squared Error (RMSE): 1.2845119047067386


In [None]:
# step 4: calculate Classification Metrics (take the hidden ratings and the predicted ratings and binarise them) ==========================================================================

# Binarise the hidden ratings and predicted ratings
threshold = 3.5
binary_prediction_ratings = (predicted_ratings_array >= threshold).astype(int) 
print(f"If predicted rating is greater than or equal to {threshold}, then 1, else 0\n")
print("Predicted Ratings:", predicted_ratings_array)
print("Binary Predictions:", binary_prediction_ratings)
binary_hidden_ratings = (hidden_ratings_array >= threshold).astype(int)
print("\n")

print("Hidden Ratings:", hidden_ratings_array)
print("Binary Hidden Ratings:", binary_hidden_ratings)

If predicted rating is greater than or equal to 3.5, then 1, else 0

Predicted Ratings: [3.63 3.03 3.32 3.   4.   4.   5.   5.   4.78 4.67 3.11 3.18 3.3  3.62
 4.   4.   3.54 3.16 4.   4.   5.   4.72 2.97 3.08]
Binary Predictions: [1 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 0 0]


Hidden Ratings: [4 3 5 4 2 3 4 3 5 4 2 5 3 3 5 3 5 2 4 5 2 4 5 3]
Binary Hidden Ratings: [1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 1 0]


In [None]:
# import the classification metrics from sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# calculate accuracy using sklearn
print("Using sklearn")
accuracy = accuracy_score(binary_hidden_ratings, binary_prediction_ratings)
precision = precision_score(binary_hidden_ratings, binary_prediction_ratings)
recall = recall_score(binary_hidden_ratings, binary_prediction_ratings)
f1 = f1_score(binary_hidden_ratings, binary_prediction_ratings)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

# calculate accuracy manually
print("\n\nManually")
true_positives = np.sum((binary_hidden_ratings == 1) & (binary_prediction_ratings == 1))
true_negatives = np.sum((binary_hidden_ratings == 0) & (binary_prediction_ratings == 0))
false_positives = np.sum((binary_hidden_ratings == 0) & (binary_prediction_ratings == 1))
false_negatives = np.sum((binary_hidden_ratings == 1) & (binary_prediction_ratings == 0))

accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Using sklearn
Accuracy: 0.5833333333333334
Precision: 0.6
Recall: 0.6923076923076923
F1 Score: 0.6428571428571429


Manually
Accuracy: 0.5833333333333334
Precision: 0.6
Recall: 0.6923076923076923
F1 Score: 0.6428571428571429
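
As a cross-check on the scores above, the confusion-matrix counts can be recomputed directly from the two binary arrays printed in step 4. This sketch uses plain Python lists copied from that output. The variable names are chosen for illustration.

```python
# binary arrays copied from the step 4 output (threshold 3.5)
binary_predictions = [1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
binary_hidden = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]

pairs = list(zip(binary_hidden, binary_predictions))
tp = sum(1 for h, p in pairs if h == 1 and p == 1)
tn = sum(1 for h, p in pairs if h == 0 and p == 0)
fp = sum(1 for h, p in pairs if h == 0 and p == 1)
fn = sum(1 for h, p in pairs if h == 1 and p == 0)
print(tp, tn, fp, fn)  # 9 5 6 4

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
```

With 6 false positives against 4 false negatives, the model over-predicts "liked" slightly at this threshold, which is why recall is higher than precision.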
