# Book Recommendation System

# Part III: Collaborative Filtering - Matrix Factorization

### Importing Libraries

In [1]:
import pandas as pd               # pandas is used for data manipulation and analysis, providing data structures like DataFrames.
import numpy as np                # numpy is used for numerical operations on large, multi-dimensional arrays and matrices.

from scipy.sparse import csr_matrix                             # csr_matrix is used for creating compressed sparse row matrices, which are efficient for arithmetic and matrix operations on sparse data.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras

# IPython's display module is used to display images within Jupyter Notebooks.
from IPython.display import Markdown, display, Image  
from IPython.display import clear_output

import dask.dataframe as dd
import sys



### Loading the Data

In [2]:
books = pd.read_csv("data/Books_cleaned.csv").drop('Unnamed: 0', axis = 1)

ratings_files = [f'data/Ratings_cleaned_part_{i}.csv' for i in range(1,6+1)]
ratings_dfs = [pd.read_csv(file) for file in ratings_files]
ratings = pd.concat(ratings_dfs, ignore_index=True).drop('Unnamed: 0', axis = 1)
del ratings_files, ratings_dfs

books_genres = pd.read_csv("data/Books_genres_cleaned.csv").drop('Unnamed: 0', axis = 1)
books_genres_list = pd.read_csv("data/Books_genres_list_cleaned.csv").drop('Unnamed: 0', axis = 1)

## Modelling

### Step 1. Preparing the datasets

The goal of a collaborative filtering recommender system is to learn two sets of vectors: for each user, a 'parameter vector' that captures that user's book tastes, and for each item, a feature vector of the same size that captures some description of the item. The dot product of the two vectors plus a bias term should produce an estimate of the rating the user might give to that item. This approach is known as matrix factorization.

Matrix factorization is a powerful approach to collaborative filtering, designed to work with sparse datasets like user ratings for books. Unlike user-based or item-based collaborative filtering, which rely on explicit similarities, matrix factorization discovers hidden patterns that connect users and items.

Imagine you’re trying to choose a new book to read. Instead of looking for users with similar reading habits or books that share obvious traits, matrix factorization uncovers abstract dimensions, like a user’s preference for complex plots or a book’s appeal in a particular genre. These hidden factors help make personalized and accurate recommendations.

Here’s how it works:

1. Representing Users and Books as Latent Features:

    - Matrix factorization maps both users and books into a latent feature space. Each user is represented by a vector of preferences, and each book is represented by a vector of attributes in this space.

    - For instance, a user who consistently rates certain books highly might have a strong preference for hidden factors captured by the model, such as a common theme, tone, or writing style that those books share. The model does not know these characteristics explicitly but infers them from the user’s ratings.

2. Predicting Ratings:

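    - The model predicts how much a user would like a book by computing the dot product of the user's vector and the book's feature vector, plus a bias term. In this notebook the bias is learned per user, and each book's mean rating, which is subtracted during normalization, is added back at prediction time.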

3. Training the Model:

    - The system adjusts the user and book vectors by minimizing the error between predicted and actual ratings in the training data. This optimization process ensures that the model learns meaningful latent factors that best explain the observed ratings.
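
The prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up numbers: the names `x_book`, `w_user`, and `b_user` are hypothetical, and the bias is assumed to be a single scalar per user.

```python
import numpy as np

# Hypothetical latent vectors with 3 features (values invented for illustration)
x_book = np.array([1.0, 0.5, 0.0])   # book feature vector
w_user = np.array([2.0, 1.0, 3.0])   # user parameter vector
b_user = 0.5                         # user bias term (assumed scalar)

# Predicted rating = dot product of the two vectors plus the bias
predicted_rating = float(np.dot(w_user, x_book) + b_user)
print(predicted_rating)  # 3.0

# All user-book pairs at once:
# X is (num_books, k), W is (num_users, k), b is (1, num_users)
X = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 1.0]])
W = np.array([[2.0, 1.0, 3.0]])
b = np.array([[0.5]])
all_predictions = X @ W.T + b        # shape (num_books, num_users)
print(all_predictions.shape)         # (2, 1)
```

Stacking the vectors into matrices turns every prediction into a single matrix product, which is the form a vectorized cost function can work with.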

The existing ratings are provided in the form of a matrix, $Y$. This is an $n_i \times n_u$ matrix, where $n_i$ is the number of items and $n_u$ is the number of users, so $Y[i, j]$ is the rating that user $j$ has given to item $i$.

But first, if I want the algorithm to make recommendations for me, I have to enter my own ratings.

In [3]:
# To look for the BookIDs
books[books['Title'].str.contains('metamorphosis', case=False)][['BookID','Title','Authors']]

| | BookID | Title | Authors |
|---|---|---|---|
| 211 | 213 | The Metamorphosis | Franz Kafka, Stanley Corngold |
| 2979 | 3020 | The Metamorphosis and Other Stories | Franz Kafka, Jason Baker, Donna Freed |
| 6555 | 6664 | The Metamorphosis, In the Penal Colony, and Ot... | Franz Kafka |


In [4]:
# In order to have recommendations for the target user, me in this case, 
# I add my ratings and user information in the datasets
my_ratings = {
    'UserID': [19960808]*33,
    'BookID': [213,   # The Metamorphosis
               859,   # The Way of Shadows
               1412,  # Shadow's Edge
               1429,  # Beyond the Shadows
               7,     # The Hobbit	
               19,    # The Fellowship of the Ring
               155,   # The Two Towers
               161,   # The Return of the King
               389,   # The Final Empire
               565,   # The Well of Ascension
               603,   # The Hero of Ages 
               1200,  # The Alloy of Law
               2792,  # Shadows of Self
               3341,  # The Bands of Mourning
               1665,  # Warbreaker
               562,   # The Way of Kings
               8,     # The Catcher in the Rye
               192,   # The Name of the Wind
               429,   # The Color of Magic
               1343,  # The Light Fantastic
               1089,  # Equal Rites
               2109,  # Sourcery
               842,   # A Court of Thorns and Roses
               1308,  # A Court of Mist and Fury
               7373,  # A Court of Wings and Ruin
               1239,  # Chronicle of a Death Foretold
               2676,  # The Three-Body Problem
               7120,  # The Dark Forest
               276,   # Foundation
               789,   # Foundation and Empire
               890,   # Second Foundation 
               54,    # The Hitchhiker's Guide to the Galaxy
               2931,  # Flatland: A Romance of Many Dimensions
              ],
    'Rating': [4, # The Metamorphosis
               4, # The Way of Shadows
               4, # Shadow's Edge
               4, # Beyond the Shadows
               4, # The Hobbit	
               5, # The Fellowship of the Ring
               5, # The Two Towers
               5, # The Return of the King
               5, # The Final Empire
               5, # The Well of Ascension
               5, # The Hero of Ages 
               4, # The Alloy of Law
               4, # Shadows of Self
               4, # The Bands of Mourning
               5, # Warbreaker
               5, # The Way of Kings
               5, # The Catcher in the Rye
               5, # The Name of the Wind
               4, # The Color of Magic
               4, # The Light Fantastic
               3, # Equal Rites
               4, # Sourcery
               3, # A Court of Thorns and Roses
               4, # A Court of Mist and Fury
               4, # A Court of Wings and Ruin
               4, # Chronicle of a Death Foretold
               5, # The Three-Body Problem
               5, # The Dark Forest
               5, # Foundation
               5, # Foundation and Empire
               5, # Second Foundation 
               4, # The Hitchhiker's Guide to the Galaxy
               3, # Flatland: A Romance of Many Dimensions              
              ]
}

my_ratings_df = pd.DataFrame(my_ratings)
ratings = pd.concat([ratings, my_ratings_df], ignore_index=True)

for index, row in my_ratings_df.iterrows():
    userid = row['UserID']
    bookid = row['BookID']
    book_title = books[books['BookID'] == bookid]['Title'].values[0]
    rating = row['Rating']
    print(f'Rated {rating} for {book_title}')

Rated 4 for The Metamorphosis
Rated 4 for The Way of Shadows (Night Angel, #1)
Rated 4 for Shadow's Edge (Night Angel, #2)
Rated 4 for Beyond the Shadows (Night Angel, #3)
Rated 4 for The Hobbit
Rated 5 for The Fellowship of the Ring (The Lord of the Rings, #1)
Rated 5 for The Two Towers (The Lord of the Rings, #2)
Rated 5 for The Return of the King (The Lord of the Rings, #3)
Rated 5 for The Final Empire (Mistborn, #1)
Rated 5 for The Well of Ascension (Mistborn, #2)
Rated 5 for The Hero of Ages (Mistborn, #3)
Rated 4 for The Alloy of Law (Mistborn, #4)
Rated 4 for Shadows of Self (Mistborn, #5)
Rated 4 for The Bands of Mourning (Mistborn, #6)
Rated 5 for Warbreaker (Warbreaker, #1)
Rated 5 for The Way of Kings (The Stormlight Archive, #1)
Rated 5 for The Catcher in the Rye
Rated 5 for The Name of the Wind (The Kingkiller Chronicle, #1)
Rated 4 for The Color of Magic (Discworld, #1; Rincewind #1)
Rated 4 for The Light Fantastic (Discworld, #2; Rincewind #2)
Rated 3 for Equal Rites (Di

Now that I have the target user's ratings, I select the users who have rated at least one of the books the target user has rated, and then keep only those who share at least 10 rated books with the target user.

In [5]:
target_UserID = 19960808

# Original number of users
print('Original number of users: ', len(ratings['UserID'].unique()))

# Books rated by the target user
target_books = ratings[ratings['UserID'] == target_UserID].BookID.values

# Users who have rated at least 1 of the items rated by the current user
selected_users_1 = ratings[ratings['BookID'].isin(target_books)]
selected_users_1 = pd.DataFrame(selected_users_1.groupby('UserID').size(), columns=['Coincidences']).sort_values(by='Coincidences', ascending=False).reset_index()

# There are 34590 users with at least one coincidence
number_of_users_1 = selected_users_1.shape[0]
print('Users with, at least, 1 coincidence: ', number_of_users_1)

# In this case with so many users available for recommendations, we can keep just those with at least 10 coincidences
selected_users_10 = selected_users_1[selected_users_1['Coincidences'] >= 10]

# Now, we have 1276 available users
number_of_users_10 = selected_users_10.shape[0]
print('Users with, at least, 10 coincidence: ', number_of_users_10)

# Ratings of the selected users
selected_ratings = ratings[ratings['UserID'].isin(selected_users_10.UserID.values)].reset_index().drop(['index'], axis=1)

Original number of users:  53346
Users with, at least, 1 coincidence:  34590
Users with, at least, 10 coincidence:  1276


I can now construct the $Y$ matrix with the ratings. To do so, I will use a sparse matrix.

In [6]:
# Get the values of UserIDs and BookIDs to assign indices
unique_users = selected_ratings['UserID'].unique()
unique_books = selected_ratings['BookID'].unique()

# Create a dictionary to map unique IDs to indices
user_to_index = {user_id: index for index, user_id in enumerate(unique_users)}
book_to_index = {book_id: index for index, book_id in enumerate(unique_books)}

# Map the UserIDs and BookIDs to their respective indices
user_indices = selected_ratings['UserID'].map(user_to_index)
book_indices = selected_ratings['BookID'].map(book_to_index)

# Create the CSR matrix
ratings_csr_matrix = csr_matrix(
    (selected_ratings['Rating'], (book_indices, user_indices)),
    shape=(len(unique_books), len(unique_users))
)

print('Total size of the csr_matrix:', ratings_csr_matrix.shape[0] * ratings_csr_matrix.shape[1])
print('Number of non-zero elements in the csr_matrix:', ratings_csr_matrix.count_nonzero())

Total size of the csr_matrix: 8837576
Number of non-zero elements in the csr_matrix: 163709


This reflects the importance of using a sparse matrix here. Notice how much larger the total size of the array is compared to the number of non-zero elements, which is just the number of ratings of the selected users. Sparse arrays/matrices allow us to represent these objects without explicitly storing all the 0-valued elements. This means that if the transactional data can be loaded into memory, the sparse array will fit in memory as well.
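
To put a number on this, the two figures printed above can be turned into a density and a rough dense-storage estimate. This is a back-of-the-envelope sketch; the 8 bytes per cell assumes a dense `float64` array, which is an assumption made here for comparison only.

```python
# Figures copied from the output of the previous cell
total_cells = 8837576
nonzero_ratings = 163709

# Fraction of user-book pairs that actually hold a rating
density = nonzero_ratings / total_cells
print(f"Fraction of cells holding a rating: {density:.2%}")

# Rough memory for a dense float64 array of the same shape (8 bytes per cell)
dense_megabytes = total_cells * 8 / 1e6
print(f"Dense float64 storage: about {dense_megabytes:.0f} MB")
```

Fewer than 2% of the cells carry information, so a dense representation would mostly store zeros.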

In [7]:
# Example to understand how the csr matrix is created
i = 130845 # row of the ratings dataframe

print(f'In the row {i} of the ratings dataframe we have:')
print(f'\t UserID = {selected_ratings.iloc[i].UserID}')
print(f'\t BookID = {selected_ratings.iloc[i].BookID}')
print(f'\t Rating = {selected_ratings.iloc[i].Rating}\n')

print(f'Using book_indices and user_indices one can map the IDs to the indices in the sparse matrix:')
print(f'\t Book index = {book_indices[i]} (row index)')
print(f'\t User index = {user_indices[i]} (column index)\n')

print(f'Then, using these indices:')
print(f'\t CSR_Matrix[{book_indices[i]}, {user_indices[i]}] = {ratings_csr_matrix[book_indices[i], user_indices[i]]} (rating)\n')

print('One can also go from the indices in the CSR Matrix to the UserID and BookID values:')
print(f'\t UserID = unique_users[{user_indices[i]}] = {unique_users[user_indices[i]]}')
print(f'\t BookID = unique_books[{book_indices[i]}] = {unique_books[book_indices[i]]}')

In the row 130845 of the ratings dataframe we have:
	 UserID = 52183
	 BookID = 3111
	 Rating = 4

Using book_indices and user_indices one can map the IDs to the indices in the sparse matrix:
	 Book index = 2682 (row index)
	 User index = 1221 (column index)

Then, using these indices:
	 CSR_Matrix[2682, 1221] = 4 (rating)

One can also go from the indices in the CSR Matrix to the UserID and BookID values:
	 UserID = unique_users[1221] = 52183
	 BookID = unique_books[2682] = 3111


Now, let's define the matrices $Y$ (ratings) and $R$ (mask of existing ratings), and normalize $Y$.

In [8]:
################################################################################
#                                                                              #
#                              normalizeRatings                                #
#                                                                              #
#  Preprocess data by subtracting the mean rating for every book (every row)   #
#  so that each book has a rating of 0 on average. Unrated books then have a   #
#  mean rating (0).                                                            #
#  Only include real ratings R(i,j) = 1.                                       #
#                                                                              #
################################################################################

def normalizeRatings(Y, R):
    Ymean = np.sum(Y, axis=1) / np.sum(R, axis=1)  # No division by zero: every book has at least one rating
    Ynorm = Y - R.multiply(Ymean[:, 0])
    return Ynorm, Ymean

# # Check Ymean is well calculated
# Ymean = np.sum(Y ,axis=1) / np.sum(R, axis=1)
# equal = True
# for i in range(len(Ymean)):
#     mean = selected_ratings[selected_ratings['BookID'] == unique_books[i]]['Rating'].mean()
#     if Ymean[i, 0] != mean:
#         equal = False
#         break
# equal

# Check the normalization is well calculated
# Ynorm[i, j] (if != 0) should be equal rating_ij - book_i_mean, where
# book_id = unique_books[i]
# user_id = unique_users[j]
# book_i_mean = selected_ratings[selected_ratings['BookID'] == book_id]['Rating'].mean()
# rating_ij = ratings[(ratings['BookID'] == book_id) & (ratings['UserID'] == user_id)]['Rating'].values[0]

In [9]:
Y = ratings_csr_matrix.copy()

R = ratings_csr_matrix.copy()
R.data = (R.data != 0).astype(int) # R[i, j] is 1 if user j has rated book i and 0 otherwise

# Normalize Y
Ynorm, Ymean = normalizeRatings(Y, R)

del ratings_csr_matrix
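
As a toy illustration of what the per-book mean normalization does, here is a small dense example. It is only a sketch: the 3x3 matrix `Y_toy` is invented, and dense NumPy arrays are used instead of the sparse matrices above to keep the arithmetic easy to follow.

```python
import numpy as np

# Toy ratings: rows are books, columns are users, 0 means "not rated"
Y_toy = np.array([[5.0, 0.0, 3.0],
                  [0.0, 4.0, 0.0],
                  [2.0, 2.0, 0.0]])
R_toy = (Y_toy != 0).astype(float)   # 1 where a rating exists

# Mean of the existing ratings of each book (one value per row)
book_means = Y_toy.sum(axis=1, keepdims=True) / R_toy.sum(axis=1, keepdims=True)
print(book_means.ravel())            # [4. 4. 2.]

# Subtract the mean only where a rating exists; unrated cells stay at 0
Y_toy_norm = Y_toy - R_toy * book_means
print(Y_toy_norm[0])                 # [ 1.  0. -1.]
```

After this step an unrated cell and a rating equal to the book's mean both hold the value 0, which is why the mask `R` is still needed to tell them apart.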

Let's now prepare to train the model. Initialize the parameters and select the Adam optimizer.

In [10]:
# Useful Values
num_books, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_books,  num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

Let's now train the collaborative filtering model. This will learn the parameters $\mathbf{X}$, $\mathbf{W}$, and $\mathbf{b}$. 

The operations involved in learning $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{X}$ simultaneously do not fall into the typical 'layers' offered in the TensorFlow neural network package. Consequently, the usual Keras workflow of defining a model and calling `compile()`, `fit()`, and `predict()` is not directly applicable. Instead, we can use a custom training loop.

Recall the steps of gradient descent:
- repeat until convergence:
    - compute forward pass
    - compute the derivatives of the loss relative to parameters
    - update the parameters using the learning rate and the computed derivatives 
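
The three steps listed above can be sketched with plain NumPy on a toy problem, with the derivatives written out by hand. This is an illustration under assumptions: the toy matrix, the learning rate, the number of iterations, and the absence of regularization are all choices made for this sketch, not values used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 books x 2 users; 0 means "not rated"
Y_toy = np.array([[5.0, 0.0],
                  [4.0, 1.0],
                  [0.0, 2.0]])
R_toy = (Y_toy != 0).astype(float)

num_books, num_users = Y_toy.shape
num_features = 2
X_toy = rng.normal(scale=0.1, size=(num_books, num_features))  # book features
W_toy = rng.normal(scale=0.1, size=(num_users, num_features))  # user parameters
b_toy = np.zeros((1, num_users))                               # user biases

def toy_cost(X, W, b):
    # Squared error over the rated cells only (no regularization in this sketch)
    error = (X @ W.T + b - Y_toy) * R_toy
    return 0.5 * np.sum(error ** 2)

learning_rate = 0.02
initial_cost = toy_cost(X_toy, W_toy, b_toy)

for _ in range(1000):
    # 1. Forward pass: masked prediction error
    error = (X_toy @ W_toy.T + b_toy - Y_toy) * R_toy
    # 2. Derivatives of the cost with respect to each parameter
    grad_X = error @ W_toy
    grad_W = error.T @ X_toy
    grad_b = error.sum(axis=0, keepdims=True)
    # 3. Parameter update
    X_toy -= learning_rate * grad_X
    W_toy -= learning_rate * grad_W
    b_toy -= learning_rate * grad_b

final_cost = toy_cost(X_toy, W_toy, b_toy)
print(f"cost before: {initial_cost:.3f}, cost after: {final_cost:.3f}")
```

Writing the gradients by hand is manageable here, but it has to be redone whenever the cost function changes.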
    
TensorFlow can calculate the derivatives for you, as shown below. Within the `tf.GradientTape()` block, operations on TensorFlow Variables are tracked. When `tape.gradient()` is later called, it returns the gradient of the loss with respect to the tracked variables. The gradients can then be applied to the parameters using an optimizer. 
This is a very brief introduction to a useful feature of TensorFlow and other machine learning frameworks. Further information can be found by investigating "custom training loops" within the framework of interest.
    
But first, we have to construct the cost function for the model.
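
Written out, the regularized squared-error cost takes the form below. The notation is an assumption introduced here for readability: $\mathbf{x}^{(i)}$ is row $i$ of $\mathbf{X}$, $\mathbf{w}^{(j)}$ is row $j$ of $\mathbf{W}$, $b^{(j)}$ is the bias of user $j$, $Y[i, j]$ is the (normalized) rating, and $R[i, j] = 1$ marks an existing rating.

$$
J = \frac{1}{2} \sum_{(i,j):\, R[i,j]=1} \left( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - Y[i,j] \right)^2 + \frac{\lambda}{2} \left( \sum_{i} \lVert \mathbf{x}^{(i)} \rVert^2 + \sum_{j} \lVert \mathbf{w}^{(j)} \rVert^2 \right)
$$

The first term only sums over rated pairs, and the second term penalizes large entries in $\mathbf{X}$ and $\mathbf{W}$; the bias $\mathbf{b}$ is not regularized.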

In [11]:
################################################################################
#                                                                              #
#                               cofi_cost_func                                 #
#                                                                              #
#  Returns the cost for collaborative filtering.                               #
#  Vectorized for speed. Uses TensorFlow operations to be compatible with      #
#  custom training loop.                                                       #
#  Arguments:                                                                  #
#   X (ndarray (num_books,num_features)) : matrix of item features             #
#   W (ndarray (num_users,num_features)) : matrix of user parameters           #
#   b (ndarray (1, num_users))           : vector of user biases               #
#   Y (matrix (num_books,num_users))     : matrix of user ratings of books     #
#   R (matrix (num_books,num_users))     : matrix, where R(i, j) = 1 if the    #
#                                        i-th book was rated by the j-th user  #
#   lambda_ (float): regularization parameter                                  #
#                                                                              #
################################################################################

def cofi_cost_func(X, W, b, Y, R, lambda_):
    # Convert the sparse matrices Y and R into TensorFlow tensors
    Y_dense = tf.convert_to_tensor(Y.toarray(), dtype=tf.float64) 
    R_dense = tf.convert_to_tensor(R.toarray(), dtype=tf.float64) 

    # Compute the difference between prediction and target
    prediction = tf.linalg.matmul(X, tf.transpose(W)) + b  # X * W^T + b
    error = (prediction - Y_dense) * R_dense  # Apply the R mask for the non-zero positions

    # Compute the cost (loss function)
    J = 0.5 * tf.reduce_sum(tf.square(error))  # Sum the squared errors
    J += (lambda_ / 2) * (tf.reduce_sum(tf.square(X)) + tf.reduce_sum(tf.square(W)))  # Regularization

    return J

In [12]:
iterations = 1000
lambda_ = 1
for iter in range(iterations):
    # Use TensorFlow’s GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient( cost_value, [X,W,b] )

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )

    # Log periodically.
    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 8743673.9
Training loss at iteration 20: 304267.2
Training loss at iteration 40: 111410.0
Training loss at iteration 60: 58196.5
Training loss at iteration 80: 36952.9
Training loss at iteration 100: 26438.9
Training loss at iteration 120: 20449.1
Training loss at iteration 140: 16729.9
Training loss at iteration 160: 14290.1
Training loss at iteration 180: 12623.7
Training loss at iteration 200: 11447.2
Training loss at iteration 220: 10592.9
Training loss at iteration 240: 9957.3
Training loss at iteration 260: 9474.0
Training loss at iteration 280: 9099.2
Training loss at iteration 300: 8803.3
Training loss at iteration 320: 8565.8
Training loss at iteration 340: 8372.2
Training loss at iteration 360: 8212.0
Training loss at iteration 380: 8077.8
Training loss at iteration 400: 7963.8
Training loss at iteration 420: 7865.9
Training loss at iteration 440: 7781.0
Training loss at iteration 460: 7706.8
Training loss at iteration 480: 7641.5
Training loss a

### Step X. Recommendations

In [118]:
# Make the predictions for my user
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()

# Restore the mean
pm = p + Ymean

# My user is located at the last index (my ratings were appended last, so my UserID is the last entry of unique_users)
my_predictions = pm[:,-1].flatten()
my_predictions = np.array(my_predictions)[0]

# Sort predictions
ix = np.argsort(my_predictions)[::-1]
print("Ordered indices:", ix[:10])

Ordered indices: [6127 6268 6269 1196 4984 2651 2271 2275 4191 2272]
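
Taking the last column works because `unique()` keeps IDs in order of first appearance and the target user's ratings were appended at the end. A slightly more defensive alternative is to look the column up by ID. The sketch below shows the idea on a toy table; the `toy_` names and values are hypothetical and only mirror the structure of `unique_users` and `user_to_index`.

```python
import pandas as pd

# Toy ratings table; user 30 is appended last, like the target user here
toy_ratings = pd.DataFrame({
    'UserID': [10, 20, 10, 30],
    'BookID': [1, 1, 2, 2],
    'Rating': [5, 3, 4, 4],
})

# unique() preserves order of first appearance, so the appended user comes last
toy_unique_users = toy_ratings['UserID'].unique()
toy_user_to_index = {user_id: index for index, user_id in enumerate(toy_unique_users)}

# Look up the column by ID instead of assuming it is the last one
target_column = toy_user_to_index[30]
print(target_column, len(toy_unique_users) - 1)  # 2 2
```

The lookup keeps working even if the rows are later reordered or another user is appended after the target user.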


In [135]:
for i in range(10):
    book_id = unique_books[ix[i]]
    if book_id not in my_ratings_df['BookID'].values:
        book_title = books[books['BookID'] == book_id]['Title'].values[0]
        print(f'Predicting rating {my_predictions[ix[i]]:0.2f} for book: {book_title}')

Predicting rating 5.06 for book: The Choice
Predicting rating 5.05 for book: Cry to Heaven
Predicting rating 5.05 for book: Belinda
Predicting rating 5.05 for book: Selected Poems
Predicting rating 5.04 for book: Chronicles, Vol. 1
Predicting rating 5.04 for book: The 48 Laws of Power
Predicting rating 5.03 for book: Shopaholic Takes Manhattan (Shopaholic, #2)
Predicting rating 5.03 for book: Shopaholic & Baby (Shopaholic, #5)
Predicting rating 5.03 for book: When the Wind Blows (When the Wind Blows, #1)
Predicting rating 5.03 for book: Shopaholic Ties the Knot (Shopaholic, #3)
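
The loop above inspects the ten highest predictions and skips any book the target user has already rated, so it can print fewer than ten titles. One possible alternative, sketched here with invented numbers, is to mask the rated books before sorting so that exactly N unrated books come back. The arrays `toy_predictions` and `already_rated` are hypothetical stand-ins for `my_predictions` and a boolean mask of rated books.

```python
import numpy as np

# Hypothetical predicted ratings for five books
toy_predictions = np.array([4.2, 5.1, 3.0, 4.9, 4.7])
# True where the target user has already rated the book
already_rated = np.array([False, True, False, False, False])

# Push rated books to the bottom of the ranking
masked = np.where(already_rated, -np.inf, toy_predictions)

# Indices of the top 3 unrated books, best first
top_n = np.argsort(masked)[::-1][:3]
print(top_n)  # [3 4 0]
```

Book 1 has the highest raw prediction but is excluded because it was already rated.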


In [146]:
print('\n\nOriginal vs Predicted ratings:\n')
my_book_ids = my_ratings_df['BookID'].values
for i in range(len(my_book_ids)):
    book_id = my_book_ids[i]
    index_p = unique_books.tolist().index(book_id)
    book_title = books[books['BookID'] == book_id]['Title'].values[0]

    my_rating = my_ratings_df[my_ratings_df['BookID'] == book_id]['Rating'].values[0]
    print(f'Original {my_rating}, Predicted {my_predictions[index_p]:0.2f} for {book_title}')



Original vs Predicted ratings:

Original 4, Predicted 3.95 for The Metamorphosis
Original 4, Predicted 3.98 for The Way of Shadows (Night Angel, #1)
Original 4, Predicted 4.02 for Shadow's Edge (Night Angel, #2)
Original 4, Predicted 4.00 for Beyond the Shadows (Night Angel, #3)
Original 4, Predicted 4.01 for The Hobbit
Original 5, Predicted 4.97 for The Fellowship of the Ring (The Lord of the Rings, #1)
Original 5, Predicted 5.00 for The Two Towers (The Lord of the Rings, #2)
Original 5, Predicted 4.98 for The Return of the King (The Lord of the Rings, #3)
Original 5, Predicted 4.98 for The Final Empire (Mistborn, #1)
Original 5, Predicted 4.98 for The Well of Ascension (Mistborn, #2)
Original 5, Predicted 4.99 for The Hero of Ages (Mistborn, #3)
Original 4, Predicted 4.00 for The Alloy of Law (Mistborn, #4)
Original 4, Predicted 4.06 for Shadows of Self (Mistborn, #5)
Original 4, Predicted 4.07 for The Bands of Mourning (Mistborn, #6)
Original 5, Predicted 4.91 for Warbreaker (Warb