This system recommends books from the Goosebumps series to readers.  The data was sourced from Kaggle via https://www.kaggle.com/code/saurabhbagchi/recommender-system-for-books/input; specifically, the Ratings.csv and Books.csv files were used.  The data was reduced to a smaller subset, first by isolating ratings for Goosebumps titles only, and then by selecting the top 5 users by number of reviews provided.  From there, any Goosebumps novel with more than 2 reviews was included in this dataset.  The input data is therefore composed of 5 users and their ratings across 9 books.

I import the data and split it into a training set and a test set using an 80%/20% split.  This yields 21 records in the training set and 6 records in the test set.  The goal is to use the training set to develop a recommender system based on the overall mean rating across all books, and then to add bias terms for each user and each book.  The test set is used to validate the results and the recommender system's ability to predict users' book ratings beyond the data used in training.

In [104]:
import pandas as pd
import numpy as np

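# Load the Goosebumps ratings (columns: userid, title, rating)
df = pd.read_csv("https://raw.githubusercontent.com/koonkimb/Data612/refs/heads/main/Goosebumps_ratings.csv", header=0)

# Drop rows with missing values
df_dropped = df.dropna()

# Shuffle with a fixed seed, then use the first 80% for training and the rest for testing
df_random = df_dropped.sample(frac=1, random_state=63)
split_size = int(0.8 * len(df_random))
train = df_random[:split_size]
test = df_random[split_size:]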

print(train)
print(test)

    userid                        title  rating
13  185233                Monster Blood     5.0
6    13518        My Hairiest Adventure     7.0
24  201353        My Hairiest Adventure     6.0
21  201353       Egg Monsters from Mars     6.0
2    13518         Attack of the Mutant     4.0
10  185233    A Shocker on Shock Street     5.0
3    13518       Egg Monsters from Mars     2.0
31  215024                Monster Blood     9.0
19  201353    A Shocker on Shock Street     6.0
41   67782            Monster Blood III     8.0
42   67782        My Hairiest Adventure     9.0
12  185233       Egg Monsters from Mars     5.0
15  185233        My Hairiest Adventure     5.0
7    13518           The Headless Ghost     9.0
36   67782      A Night in Terror Tower     8.0
5    13518            Monster Blood III     5.0
44   67782  The Horror at Camp Jellyjam    10.0
8    13518  The Horror at Camp Jellyjam     8.0
20  201353         Attack of the Mutant     6.0
35  215024  The Horror at Camp Jellyjam 

Because the data is originally in long format, I pivot both the training and test sets to wide format to create the user-item matrices.  For the test set, I also need to insert four additional columns, because these books have no ratings in the test set.  These columns are inserted to mimic the structure of the training set, although all of their values are NaN.

In [80]:
pivot_train = train.pivot(index='userid', columns='title', values='rating')
print(pivot_train)

pivot_test = test.pivot(index='userid', columns='title', values='rating')
pivot_test.insert(3,'Egg Monsters from Mars', np.nan)
pivot_test.insert(4,'Monster Blood', np.nan)
pivot_test.insert(6,'My Hairiest Adventure', np.nan)
pivot_test.insert(8,'The Horror at Camp Jellyjam', np.nan)
print(pivot_test)

title   A Night in Terror Tower  A Shocker on Shock Street  \
userid                                                       
13518                       NaN                        NaN   
67782                       8.0                        NaN   
185233                      NaN                        5.0   
201353                      NaN                        6.0   
215024                      NaN                        NaN   

title   Attack of the Mutant  Egg Monsters from Mars  Monster Blood  \
userid                                                                
13518                    4.0                     2.0            NaN   
67782                    NaN                     NaN            NaN   
185233                   NaN                     5.0            5.0   
201353                   6.0                     6.0            NaN   
215024                   NaN                     NaN            9.0   

title   Monster Blood III  My Hairiest Adventure  The Headless Ghos
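
The fixed positions passed to `insert` depend on which books happen to fall in the test set. As a hedged alternative, the test matrix can be aligned to the training matrix's labels with `DataFrame.reindex`. The sketch below uses small hypothetical data (`toy_train`, `toy_test`, and the "Book A/B/C" titles are illustrative names, not the Goosebumps ratings), and it assumes every test user and book also appears in the training set.

```python
import pandas as pd

# Hypothetical long-format ratings, for illustration only
toy_train = pd.DataFrame({
    "userid": [1, 1, 2, 3],
    "title": ["Book A", "Book B", "Book B", "Book C"],
    "rating": [7.0, 5.0, 9.0, 6.0],
})
toy_test = pd.DataFrame({
    "userid": [2, 3],
    "title": ["Book A", "Book B"],
    "rating": [8.0, 4.0],
})

toy_pivot_train = toy_train.pivot(index="userid", columns="title", values="rating")
toy_pivot_test = toy_test.pivot(index="userid", columns="title", values="rating")

# Align the test matrix to the training matrix's users and books.
# Users or books that are missing from the test set become all-NaN rows or columns.
# Note: labels that appear only in the test set would be dropped by this call.
toy_pivot_test = toy_pivot_test.reindex(
    index=toy_pivot_train.index, columns=toy_pivot_train.columns
)
print(toy_pivot_test)
```

With this approach the column order always matches the training matrix, whichever books the random split sends to the test set.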

First, to provide a baseline for comparison, I compute the mean across all training set ratings.  This mean serves as a baseline "recommendation".  It is not expected to be a good predictor because it does not account for user and book biases (i.e. some users rate higher or lower than average, and some books are generally considered better or worse than average).  As such, incorporating these biases later on should yield better results.  To assess this numerically, we can calculate the RMSE (root mean squared error), which is the square root of the average of the squared errors, using just the mean value.  Later, we can calculate the RMSE for the recommender system that incorporates the biases.  The expectation is that incorporating them will decrease the RMSE relative to the mean-only baseline.  RMSE was chosen as the assessment metric because squaring the errors penalizes large errors more heavily than small ones.
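
To illustrate how RMSE weights large errors, here is a minimal sketch. The helper `rmse` and the ratings are hypothetical and are not part of the notebook's pipeline. Both sets of predictions have the same mean absolute error (2.0), but the one with a single large miss has a higher RMSE.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, ignoring entries where either value is NaN."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.nanmean((actual - predicted) ** 2)))

# Hypothetical ratings on a 10-point scale
actual = [2.0, 2.0, 2.0, 2.0]
even_errors = [4.0, 4.0, 4.0, 4.0]     # every prediction is off by 2
one_big_error = [2.0, 2.0, 2.0, 10.0]  # a single prediction is off by 8

print(rmse(actual, even_errors))    # 2.0
print(rmse(actual, one_big_error))  # 4.0
```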

In [29]:
train_means = train['rating'].mean()
print(train_means)

6.333333333333333


To compute the RMSE of the training set, I first create a matrix that flags the NaN values in the training set.  I then create a matrix of ones and multiply it by the mean across all training set ratings (from above, 6.33).  Afterwards, I use the is_na matrix to set the entries that correspond to missing ratings to 0, and then replace those zeros with NaN.  For example, if user 13518 has an NaN value for the book "A Night in Terror Tower", then the matrix of means should also have an NaN value for user 13518, book "A Night in Terror Tower".  However, if user 13518 has a rating for the book "Attack of the Mutant", then the matrix of means should have the value 6.33 (the mean across all training set ratings) for user 13518, book "Attack of the Mutant".  This keeps the matrix of means aligned with the observed ratings, so the NaN cells are skipped by np.nanmean when the squared errors are averaged.  As shown below, the RMSE using the mean across all training set ratings is approximately 1.98.

In [88]:
is_na = pivot_train.isna().astype(int)  # 1 where a rating is missing, 0 where it is present

# create matrix of ones, multiply by the mean, then multiply by (1 - is_na) so that missing entries become 0
train_means_df = np.ones(pivot_train.shape) * train_means * (1 - is_na)

# ensure the result is a DataFrame, then replace 0 with NaN
train_means_df = pd.DataFrame(train_means_df)
train_means_df = train_means_df.replace(0, np.nan)

rmse_raw_train = (((pivot_train - train_means_df) ** 2).to_numpy())
rmse_raw_train = np.sqrt(np.nanmean(rmse_raw_train))
print(rmse_raw_train)

1.9840634910475863
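
As a cross-check on the masked-matrix approach, the same baseline RMSE can be computed directly from long-format ratings, since only observed ratings contribute in either case. The sketch below uses hypothetical data (`toy` and its values are illustrative) and suggests that the two routes agree.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings, for illustration only
toy = pd.DataFrame({
    "userid": [1, 1, 2, 3],
    "title": ["Book A", "Book B", "Book B", "Book C"],
    "rating": [7.0, 5.0, 9.0, 3.0],
})
global_mean = toy["rating"].mean()

# Long format: compare every observed rating with the global mean
rmse_long = np.sqrt(((toy["rating"] - global_mean) ** 2).mean())

# Wide format: NaN cells stay NaN after subtraction and are skipped by nanmean
wide = toy.pivot(index="userid", columns="title", values="rating")
rmse_wide = np.sqrt(np.nanmean(((wide - global_mean) ** 2).to_numpy()))

print(rmse_long, rmse_wide)
```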


We can then perform the same operations for the test set, still using the training set mean as the prediction.  The resulting RMSE is 2.62.

In [89]:
# Do the same as above but for the test set
is_na_test = pivot_test.isna().astype(int)  # 1 where a rating is missing, 0 where it is present

# create matrix of ones, multiply by the training mean, then multiply by (1 - is_na_test) so that missing entries become 0
test_means_df = np.ones(pivot_test.shape) * train_means * (1 - is_na_test)

# ensure the result is a DataFrame, then replace 0 with NaN
test_means_df = pd.DataFrame(test_means_df)
test_means_df = test_means_df.replace(0, np.nan)

rmse_raw_test = (((pivot_test - test_means_df) ** 2).to_numpy())
rmse_raw_test = np.sqrt(np.nanmean(rmse_raw_test))
print(rmse_raw_test)

2.6246692913372702


As previously stated, user and book biases can be calculated to see whether specific users and books skew higher or lower than average.  For user bias, we take each user's mean rating in the training set and subtract the mean across all training set ratings.

In [99]:
user_bias = pivot_train.mean(axis=1) - train_means
print(user_bias)

userid
13518    -0.500000
67782     2.416667
185233   -1.333333
201353   -0.333333
215024    0.166667
dtype: float64


For book bias, we do the same except by column (i.e. we take the mean of training set ratings by book and then subtract the mean across all training set ratings).  

In [100]:
book_bias = pivot_train.mean() - train_means
print(book_bias)

title
A Night in Terror Tower        1.666667
A Shocker on Shock Street     -0.833333
Attack of the Mutant          -1.333333
Egg Monsters from Mars        -2.000000
Monster Blood                  0.666667
Monster Blood III              0.166667
My Hairiest Adventure          0.416667
The Headless Ghost             1.166667
The Horror at Camp Jellyjam    1.000000
dtype: float64
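
The same biases can also be computed from long-format data with `groupby`, which may be convenient when no pivot table is at hand. This is a sketch on hypothetical data (`toy` and its values are illustrative, not the Goosebumps ratings); it checks that the grouped means match the row and column means of the pivoted matrix.

```python
import pandas as pd

# Hypothetical long-format ratings, for illustration only
toy = pd.DataFrame({
    "userid": [1, 1, 2, 3],
    "title": ["Book A", "Book B", "Book B", "Book C"],
    "rating": [7.0, 5.0, 9.0, 3.0],
})
global_mean = toy["rating"].mean()

# Bias = group mean minus global mean
user_bias_long = toy.groupby("userid")["rating"].mean() - global_mean
book_bias_long = toy.groupby("title")["rating"].mean() - global_mean

# Same quantities from the wide matrix; DataFrame.mean skips NaN by default
wide = toy.pivot(index="userid", columns="title", values="rating")
user_bias_wide = wide.mean(axis=1) - global_mean
book_bias_wide = wide.mean(axis=0) - global_mean

print(user_bias_long)
print(book_bias_long)
```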


The new recommendation system adds these biases to the matrix of means, so each prediction is the overall mean plus the user's bias plus the book's bias.  Any prediction above 10.0 is replaced with 10.0, the upper limit of the rating scale.

In [101]:
train_means_user_bias = train_means_df.add(user_bias, axis = 0)
train_means_bias = train_means_user_bias.add(book_bias, axis = 1)
train_means_bias[train_means_bias > 10] = 10
print(train_means_bias)

title   A Night in Terror Tower  A Shocker on Shock Street  \
userid                                                       
13518                       NaN                        NaN   
67782                      10.0                        NaN   
185233                      NaN                   4.166667   
201353                      NaN                   5.166667   
215024                      NaN                        NaN   

title   Attack of the Mutant  Egg Monsters from Mars  Monster Blood  \
userid                                                                
13518               4.500000                3.833333            NaN   
67782                    NaN                     NaN            NaN   
185233                   NaN                3.000000       5.666667   
201353              4.666667                4.000000            NaN   
215024                   NaN                     NaN       7.166667   

title   Monster Blood III  My Hairiest Adventure  The Headless Ghos
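
The cells above only fill in predictions where a rating already exists, which is what the RMSE comparison needs. To actually recommend a book, a prediction is needed for the unrated user-book pairs as well. The following is a minimal sketch of that idea with hypothetical biases (all values and the "Book A/B/C" labels are illustrative). The lower bound of 1 used for clipping is an assumption about the rating scale; only the upper limit of 10 is stated in this notebook.

```python
import pandas as pd

# Hypothetical global mean and biases, for illustration only
global_mean = 6.0
user_bias = pd.Series({1: 0.0, 2: 4.0, 3: -3.0})
book_bias = pd.Series({"Book A": 1.0, "Book B": 1.0, "Book C": -3.0})

# Broadcast to a full users x books grid: mean + user bias + book bias
raw = global_mean + user_bias.to_numpy()[:, None] + book_bias.to_numpy()[None, :]
predictions = pd.DataFrame(raw, index=user_bias.index, columns=book_bias.index)

# Clamp to the rating scale (the lower bound of 1.0 is an assumption)
clipped = predictions.clip(lower=1.0, upper=10.0)
print(clipped)
```

Here user 2's raw prediction for "Book A" is 11.0 and is capped at 10.0, while user 3's raw prediction for "Book C" is 0.0 and is raised to the assumed floor of 1.0.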

We repeat this process of adding the biases to the matrix of means for the test set, using the biases computed from the training set.

In [95]:
test_means_user_bias = test_means_df.add(user_bias, axis = 0)
test_means_bias = test_means_user_bias.add(book_bias, axis = 1)
test_means_bias[test_means_bias > 10] = 10
print(test_means_bias)

title   A Night in Terror Tower  A Shocker on Shock Street  \
userid                                                       
13518                 10.000000                        NaN   
67782                       NaN                       10.0   
185233                      NaN                        NaN   
201353                 7.166667                        NaN   
215024                      NaN                        NaN   

title   Attack of the Mutant  Egg Monsters from Mars  Monster Blood  \
userid                                                                
13518                    NaN                     NaN            NaN   
67782                    NaN                     NaN            NaN   
185233                   NaN                     NaN            NaN   
201353                   NaN                     NaN            NaN   
215024              4.666667                     NaN            NaN   

title   Monster Blood III  My Hairiest Adventure  The Headless Ghos

Now we can recompute the RMSE for both the training and test sets.  The RMSE for the training set decreased from 1.98 to 1.45, an improvement of approximately 27%.  For the test set, the RMSE decreased from 2.62 to 0.87, an improvement of approximately 67%.

In [102]:
rmse_bias_train = (((pivot_train - train_means_bias) ** 2).to_numpy())
rmse_bias_train = np.sqrt(np.nanmean(rmse_bias_train))
print(rmse_bias_train)

1.4539901311863508


In [103]:
rmse_bias_test = (((pivot_test - test_means_bias) ** 2).to_numpy())
rmse_bias_test = np.sqrt(np.nanmean(rmse_bias_test))
print(rmse_bias_test)

0.8740073734751264
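
The percentage improvements quoted above follow from the four RMSE values printed in this notebook. The short check below hard-codes those printed values; the helper name `pct_improvement` is introduced here for illustration.

```python
# RMSE values copied from the outputs above
rmse_raw_train, rmse_bias_train = 1.9840634910475863, 1.4539901311863508
rmse_raw_test, rmse_bias_test = 2.6246692913372702, 0.8740073734751264

def pct_improvement(before, after):
    """Relative reduction in RMSE, expressed as a percentage."""
    return 100.0 * (before - after) / before

print(round(pct_improvement(rmse_raw_train, rmse_bias_train)))  # 27
print(round(pct_improvement(rmse_raw_test, rmse_bias_test)))    # 67
```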
