# Book Recommendation System

# Part III: Collaborative Filtering

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import ast

import matplotlib.pyplot as plt

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.neighbors import NearestNeighbors

# To have input from a dropdown
import tkinter as tk
from tkinter import simpledialog

from IPython.display import Markdown, display, Image
from IPython.display import clear_output

import ipywidgets as widgets
from ipywidgets import interact

### Loading the Data

In [2]:
books = pd.read_csv("data/Books_cleaned.csv").drop('Unnamed: 0', axis = 1)
#ratings = pd.read_csv("data_cleaned/Ratings_cleaned.csv").drop('Unnamed: 0', axis = 1)

ratings_files = [f'data/Ratings_cleaned_part_{i}.csv' for i in range(1,6+1)]
ratings_dfs = [pd.read_csv(file) for file in ratings_files]
ratings = pd.concat(ratings_dfs, ignore_index=True).drop('Unnamed: 0', axis = 1)

books_genres = pd.read_csv("data/Books_genres_cleaned.csv").drop('Unnamed: 0', axis = 1)
books_genres_list = pd.read_csv("data/Books_genres_list_cleaned.csv").drop('Unnamed: 0', axis = 1)

## Modelling

### Step 1. Preparing the datasets

Collaborative filtering is a well-known recommendation method, and user-based collaborative filtering is a prime example of it. Imagine you are looking for a new book to read but are unsure which one to choose. If you have friends or relatives whose taste in books aligns well with yours, asking them for recommendations makes sense. This idea is the building block of user-based collaborative filtering.

This is how it works: 

1. First, you identify other users whose tastes are similar to the target user's, judged by their ratings of the same set of books. For example, if you enjoyed all of Brandon Sanderson's books, you look for users who also liked those books.

2. Once you have found these similar users, you take their average ratings for books that the target user has not read yet. For instance, you check how these Brandon Sanderson fans rated other books.

3. Finally, you recommend the books with the highest average ratings to the target user.

These three steps form the core of the user-based collaborative filtering algorithm.
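
As a rough illustration of these three steps, here is a minimal sketch on a made-up table of four users and four books. The helper `toy_recommend`, the toy data, and the similarity measure (negative mean absolute difference on the co-rated books) are assumptions chosen for brevity; they are not the method used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings (rows = users, columns = books); NaN means "not rated".
toy = pd.DataFrame(
    {
        "book_a": [5, 5, 1, 2],
        "book_b": [4, 5, 2, 4],
        "book_c": [np.nan, 5, 1, 2],
        "book_d": [np.nan, 4, 5, np.nan],
    },
    index=["target", "u1", "u2", "u3"],
)


def toy_recommend(frame, target, k=2):
    """Hypothetical helper: the three steps on a tiny ratings table."""
    target_row = frame.loc[target]
    others = frame.drop(index=target)
    rated = target_row.notna()

    # Step 1: similarity on the books the target has rated
    # (here: negative mean absolute difference, higher is more similar).
    similarity = -(others.loc[:, rated] - target_row[rated]).abs().mean(axis=1)
    neighbours = similarity.nlargest(k).index

    # Step 2: average the neighbours' ratings for books the target has not rated.
    unread = target_row[~rated].index
    averages = frame.loc[neighbours, unread].mean()

    # Step 3: rank the unread books by that average.
    return averages.sort_values(ascending=False)


toy_result = toy_recommend(toy, "target", k=2)
print(toy_result)
```

With this toy data the two nearest users are `u1` and `u3`, so `book_d` is ranked ahead of `book_c`.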

However, before implementing this algorithm, we need to restructure our data. For this method, data is typically structured such that each row corresponds to a user and each column corresponds to a product (a book in our scenario).
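
A small sketch of this restructuring, using a made-up long-format table (`long_ratings` is hypothetical; the column names mirror the ones used in this notebook):

```python
import pandas as pd

# Long format: one row per (user, book, rating) triple.
long_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3],
    "BookID": [10, 20, 10, 20, 30],
    "Rating": [5, 3, 4, 2, 5],
})

# Wide format: one row per user, one column per book, NaN where no rating exists.
wide = long_ratings.pivot(index="UserID", columns="BookID", values="Rating")
print(wide)
```

Books a user has not rated show up as `NaN`; the notebook later replaces those with 0 before building the sparse matrix.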

In [3]:
# To look for the BookIDs
books[books['Title'].str.contains('flatland', case=False)][['BookID','Title','Authors']]

Unnamed: 0,BookID,Title,Authors
2930,2931,Flatland: A Romance of Many Dimensions,"Edwin A. Abbott, Banesh Hoffmann"


In [4]:
# In order to have recommendations for the user, me in this case, 
# I add my ratings and user information in the datasets
my_ratings = {
    'UserID': [19960808]*28,
    'BookID': [213,   # The Metamorphosis
               859,   # The Way of Shadows
               1412,  # Shadow's Edge
               1429,  # Beyond the Shadows
               7,     # The Hobbit	
               19,    # The Fellowship of the Ring
               155,   # The Two Towers
               161,   # The Return of the King
               389,   # The Final Empire
               565,   # The Well of Ascension
               603,   # The Hero of Ages 
               1200,  # The Alloy of Law
               8,     # The Catcher in the Rye
               192,   # The Name of the Wind
               429,   # The Color of Magic
               1343,  # The Light Fantastic
               1089,  # Equal Rites
               842,   # A Court of Thorns and Roses
               1308,  # A Court of Mist and Fury
               7373,  # A Court of Wings and Ruin
               1239,  # Chronicle of a Death Foretold
               2676,  # The Three-Body Problem
               7120,  # The Dark Forest
               276,   # Foundation
               789,   # Foundation and Empire
               890,   # Second Foundation 
               54,    # The Hitchhiker's Guide to the Galaxy
               2931,  # Flatland: A Romance of Many Dimensions
              ],
    'Rating': [4, # The Metamorphosis
               4, # The Way of Shadows
               4, # Shadow's Edge
               4, # Beyond the Shadows
               4, # The Hobbit	
               5, # The Fellowship of the Ring
               5, # The Two Towers
               5, # The Return of the King
               5, # The Final Empire
               5, # The Well of Ascension
               5, # The Hero of Ages 
               4, # The Alloy of Law
               5, # The Catcher in the Rye
               5, # The Name of the Wind
               4, # The Color of Magic
               4, # The Light Fantastic
               3, # Equal Rites
               3, # A Court of Thorns and Roses
               4, # A Court of Mist and Fury
               4, # A Court of Wings and Ruin
               4, # Chronicle of a Death Foretold
               5, # The Three-Body Problem
               5, # The Dark Forest
               5, # Foundation
               5, # Foundation and Empire
               5, # Second Foundation 
               4, # The Hitchhiker's Guide to the Galaxy
               3, # Flatland: A Romance of Many Dimensions              
              ]
}

my_ratings_df = pd.DataFrame(my_ratings)
ratings = pd.concat([ratings, my_ratings_df], ignore_index=True)

Now that the target user's ratings are in the dataset, we select the users who have rated at least one of the books the target user has rated.

In [5]:
target_UserID = 19960808

# Books rated by the target user
target_books = ratings[ratings['UserID'] == target_UserID].BookID.values

# Users who have rated at least 1 of the items rated by the current user
selected_users = ratings[ratings['BookID'].isin(target_books)]
selected_users = pd.DataFrame(selected_users.groupby('UserID').size(), columns=['Coincidences']).sort_values(by='Coincidences', ascending=False).reset_index()

# There are 34019 users with at least one coincidence
number_of_users = selected_users.shape[0]

# With so many users available for recommendations, we can keep just those with at least 10 coincidences
selected_users = selected_users[selected_users['Coincidences'] >= 10]

# Now, we have 521 available users
number_of_users = selected_users.shape[0]

# Ratings of the selected users
selected_ratings = ratings[ratings['UserID'].isin(selected_users.UserID.values)]

In [6]:
# Creating the matrix with users and books
ratings_matrix = selected_ratings[['UserID', 'BookID', 'Rating']].pivot(index = 'UserID', columns = 'BookID', values = 'Rating').fillna(0)
users_order = list(ratings_matrix.index)
books_order = list(ratings_matrix.columns)
ratings_csr_matrix = csr_matrix(ratings_matrix.values)

print('Total size of the csr_matrix:', ratings_matrix.shape[0] * ratings_matrix.shape[1])
print('Number of non-zero elements in the csr_matrix:', ratings_csr_matrix.count_nonzero())

#del ratings_matrix

Total size of the csr_matrix: 2675856
Number of non-zero elements in the csr_matrix: 66950


In [7]:
ratings_matrix

UserID / BookID,1,2,3,4,5,6,7,8,9,10,...,9976,9978,9979,9982,9992,9994,9996,9997,9998,9999
105,5.0,0.0,1.0,0.0,0.0,0.0,5.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
589,4.0,0.0,0.0,0.0,3.0,0.0,2.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
741,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
772,0.0,2.0,1.0,5.0,5.0,0.0,3.0,4.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
830,0.0,4.0,0.0,5.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52954,4.0,4.0,0.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
53097,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
53157,2.0,3.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
53171,0.0,0.0,0.0,4.0,3.0,0.0,0.0,3.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This reflects the importance of using a sparse matrix here. Notice how much larger the total size of the array is than the number of non-zero elements (66,950 out of 2,675,856 entries, about 2.5%), which is just the number of ratings of the selected users. Sparse arrays/matrices allow us to represent these objects without explicitly storing all the 0-valued elements. This means that if the transactional data can be loaded into memory, the sparse array will fit in memory as well.
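
The cell above builds the dense pivot table first and only then converts it. As a hedged alternative, the sketch below builds the sparse matrix directly from the long-format ratings, so the dense table is never materialized. The toy data and the variable names are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up long-format ratings.
toy_long = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3],
    "BookID": [10, 20, 10, 20, 30],
    "Rating": [5, 3, 4, 2, 5],
})

# Map each user and book ID to a consecutive row/column position.
user_codes, user_ids = pd.factorize(toy_long["UserID"], sort=True)
book_codes, book_ids = pd.factorize(toy_long["BookID"], sort=True)

# Only the (row, column, value) triples are stored; missing ratings are implicit zeros.
toy_sparse = csr_matrix(
    (toy_long["Rating"].to_numpy(dtype=float), (user_codes, book_codes)),
    shape=(len(user_ids), len(book_ids)),
)

density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_sparse.shape, toy_sparse.nnz, round(density, 3))
```

Here `user_ids` and `book_ids` play the same role as `users_order` and `books_order` in the notebook: they translate matrix positions back to IDs.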

### Step 2.1. Find similar users

For all the users with at least 10 coincidences, we can calculate the similarity between their ratings and the target user's ratings. Among the possible similarity measures, cosine similarity and Pearson's correlation coefficient are the most popular.
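
To make the relationship between the two measures concrete, here is a minimal sketch on two made-up rating vectors. The helper names `cosine` and `pearson` are illustrative only. It relies on the fact that Pearson's coefficient equals the cosine similarity of the mean-centered vectors.

```python
import numpy as np

# Two made-up users over the same four books (0 = not rated, as in the matrix above).
a = np.array([5.0, 4.0, 0.0, 3.0])
b = np.array([4.0, 5.0, 1.0, 0.0])


def cosine(u, v):
    # Cosine of the angle between the two rating vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson(u, v):
    # Pearson's coefficient: cosine similarity after subtracting each vector's mean.
    return cosine(u - u.mean(), v - v.mean())


print("cosine: ", cosine(a, b))
print("pearson:", pearson(a, b))
```

Note that, as in the notebook, unrated books enter both measures as zeros, so they influence the result.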

In [8]:
def calculate_similarity(user1_id, user2_id, ratingmatrix, method='pearson'):
    if method == 'pearson':
        correlation = calculate_pearson_similarity(user1_id, user2_id, ratingmatrix)
    elif method == 'cosine': 
        correlation = calculate_cosine_similarity(user1_id, user2_id, ratingmatrix)
    else:
        print("ERROR: The specified method is not valid. Please, write 'cosine' or 'pearson' instead.")
        return
    
    return correlation # In both methods, the closer the correlation to 1, the better

def calculate_pearson_similarity(user1_id, user2_id, ratingmatrix):
    # Indices in the ratingmatrix of the users
    index1 = list(users_order).index(user1_id) 
    index2 = list(users_order).index(user2_id)

    # DataFrames with the BookIDs and the ratings of the users
    user1 = pd.DataFrame({'BookID': books_order, 'Rating': ratingmatrix[index1].toarray()[0]})
    user2 = pd.DataFrame({'BookID': books_order, 'Rating': ratingmatrix[index2].toarray()[0]})
    # An alternative (although ratings_matrix spends way more memory):
    # user1 = pd.DataFrame({'Rating': ratings_matrix.loc[user1_id]}).dropna().reset_index()
    # user2 = pd.DataFrame({'Rating': ratings_matrix.loc[user2_id]}).dropna().reset_index()

    # Merging the two dataframes on BookID
    merged_ratings = pd.merge(user1, user2, on='BookID', suffixes=('_user1', '_user2'))

    # Calculating Pearson correlation
    if len(merged_ratings) > 0:
        correlation = merged_ratings['Rating_user1'].corr(merged_ratings['Rating_user2'], method='pearson')
    else:
        correlation = np.nan
        
    return correlation


def calculate_cosine_similarity(user1_id, user2_id, ratingmatrix):
    # Indices in the ratingmatrix of the users
    index1 = list(users_order).index(user1_id) 
    index2 = list(users_order).index(user2_id)
    
    # Arrays with the ratings of the users
    array1 = ratingmatrix[index1].toarray()
    array2 = ratingmatrix[index2].toarray()

    if len(array1[0]) == len(array2[0]) and (len(array1[0]) > 0):
        correlation = cosine_similarity(array1, array2)[0][0]
    else:
        correlation = np.nan

    return correlation

In [9]:
# Check
user1_id = 19960808
user2_id = users_order[0]

calculate_similarity(user1_id, user2_id, ratings_csr_matrix, method='pearson')

0.19622755525387714

In [10]:
# This is just to check how the csr matrix works
index = list(users_order).index(19960808) 
indices = [i for i, x in enumerate(list(ratings_csr_matrix[index].toarray()[0])) if x == 3]
books_ids = [books_order[i] for i in indices]
#books[books['BookID'].isin(books_ids)]

As explained in the EDA section, we can reduce the effect of users' different rating tendencies by normalizing each user's ratings.
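
As a quick illustration, the three normalization options implemented in the next cell can be applied to a single made-up user as follows (the variable names are assumptions for this sketch):

```python
import numpy as np

# Ratings given by one hypothetical user (only the rated books).
user_ratings = np.array([3.0, 4.0, 5.0, 4.0])

# Mean centering: subtract the user's average rating.
mean_centered = user_ratings - user_ratings.mean()

# Z-score: also divide by the user's (population) standard deviation.
z_scored = (user_ratings - user_ratings.mean()) / user_ratings.std()

# Min-max: rescale the user's ratings to the [0, 1] interval.
min_max = (user_ratings - user_ratings.min()) / (user_ratings.max() - user_ratings.min())

print(mean_centered)
print(z_scored)
print(min_max)
```

One side effect worth keeping in mind: with min-max scaling a user's lowest rating becomes 0, which in a sparse matrix is indistinguishable from a book that was never rated.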

In [11]:
def get_csr_matrix_norm(csr_matrix, method='mean_centering'):
    csr_matrix_norm = csr_matrix.copy()

    for i in range(csr_matrix.shape[0]):
        row_start = csr_matrix.indptr[i]
        row_end = csr_matrix.indptr[i + 1]

        if row_start < row_end: # If the row is not empty
            row_data = csr_matrix_norm.data[row_start:row_end]
    
            if method == 'mean_centering':
                # Normalize each row subtracting by the mean value of the row
                mean = row_data.mean()
                row_data = row_data - mean 
            elif method == 'z_score':
                # Normalize each row subtracting the mean and dividing by the std of the row
                mean = row_data.mean()
                std = row_data.std()
                if std != 0: 
                    row_data = (row_data - mean) / std
                else:
                    row_data = row_data - mean
            elif method == 'min_max': 
                # Normalize each row subtracting the min of the row and dividing by the difference between max and min of each row
                max_val = row_data.max()
                min_val = row_data.min()
                if max_val != min_val:
                    row_data = (row_data - min_val) / (max_val - min_val) 
                else:
                    row_data = row_data - min_val   
            else:
                raise ValueError("Invalid method. Please specify 'mean_centering', 'z_score', or 'min_max'.")

            csr_matrix_norm.data[row_start:row_end] = row_data
            
    return csr_matrix_norm

In [12]:
ratings_csr_matrix_norm = get_csr_matrix_norm(ratings_csr_matrix, method='min_max')

user1_id = 19960808
user2_id = users_order[10]

print('Pearson method: ', calculate_similarity(user1_id, user2_id, ratings_csr_matrix_norm, method='pearson'))
print('Cosine method: ', calculate_similarity(user1_id, user2_id, ratings_csr_matrix_norm, method='cosine'))

Pearson method:  0.22145401457242478
Cosine method:  0.2280322027005205


It seems that the results do not depend much on the methods used for normalization and similarity. However, the similarity coefficients differ more (in the second significant figure) when the original csr matrix is used rather than the normalized one.

Now we can calculate the similarity of all users with the target user and sort them to find the users with the highest similarities.
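
For the cosine case, the user-by-user loop can probably be replaced by a single call, since `cosine_similarity` accepts sparse input and compares one row against all rows at once. This is a sketch on made-up data; `toy_matrix`, `toy_users`, and `toy_similarities` are hypothetical names.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings: 3 users x 4 books.
toy_matrix = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 5.0],
]))
toy_users = [101, 102, 103]
toy_target = 101

# One call: similarity of the target row against every row of the matrix.
target_position = toy_users.index(toy_target)
scores = cosine_similarity(toy_matrix[target_position], toy_matrix).ravel()

# Same dictionary shape as in the loop-based version, excluding the target itself.
toy_similarities = {
    user: float(score)
    for user, score in zip(toy_users, scores)
    if user != toy_target
}
print(toy_similarities)
```
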

In [13]:
def calculate_all_similarities(target_userID, csr_matrix, method='pearson'):
    user1_id = target_userID

    similarities = {}

    for i in range(0, len(users_order)):
        user2_id = users_order[i]
        if user2_id != user1_id:
            correlation = calculate_similarity(user1_id, user2_id, csr_matrix, method=method)
            similarities[user2_id] = correlation

    return similarities

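ratings_csr_matrix_norm = get_csr_matrix_norm(ratings_csr_matrix, method='min_max')
# Note: the similarities are computed here on the original (non-normalized) matrix;
# pass ratings_csr_matrix_norm instead to use the normalized ratings
similarities = calculate_all_similarities(19960808, ratings_csr_matrix, method='cosine')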

In [14]:
# Most similar users
sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
most_similar_users = [user for user, similarity in sorted_similarities[:100]]
most_similar_users[:5]

[22811, 35630, 19465, 30260, 34192]

In [15]:
# To check the similarity
target_df = pd.merge(ratings[(ratings['UserID'] == 19960808) & (ratings['BookID'].isin(target_books))], books[['BookID', 'Title']], on='BookID', how='left')
similar_df = pd.merge(ratings[(ratings['UserID'] == most_similar_users[0]) & (ratings['BookID'].isin(target_books))], books[['BookID', 'Title']], on='BookID', how='left')

target_df.columns = ['UserID', 'BookID', 'Rating_target', 'Title']
similar_df.columns = ['UserID', 'BookID', 'Rating_similar_user', 'Title']
target_df = target_df[['Rating_target', 'Title']]
similar_df = similar_df[['Rating_similar_user', 'Title']]

pd.merge(target_df, similar_df, on='Title', how='outer')

Unnamed: 0,Rating_target,Title,Rating_similar_user
0,4,A Court of Mist and Fury (A Court of Thorns an...,
1,3,A Court of Thorns and Roses (A Court of Thorns...,
2,4,A Court of Wings and Ruin (A Court of Thorns a...,
3,4,"Beyond the Shadows (Night Angel, #3)",
4,4,Chronicle of a Death Foretold,
5,3,"Equal Rites (Discworld, #3; Witches #1)",
6,3,Flatland: A Romance of Many Dimensions,
7,5,Foundation (Foundation #1),4.0
8,5,Foundation and Empire (Foundation #2),4.0
9,5,Second Foundation (Foundation #3),4.0


### Step 2.2. Find similar users - KNN algorithm

There is an alternative to the code in Step 2.1. Instead of coding everything step by step, we can use a predefined model to find the users most similar to the target user. In particular, we can use a k-nearest neighbors (KNN) algorithm.
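
Before applying it to the real matrix, here is a minimal sketch of how `NearestNeighbors` behaves on a made-up 4-user matrix. With `metric='cosine'` the returned distances are 1 minus the cosine similarity, and the query row itself comes back as the first neighbor (distance 0) when it is part of the fitted data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up ratings: 4 users x 4 books.
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 5.0],
    [1.0, 0.0, 5.0, 4.0],
])

knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)

# Ask for 3 neighbors of user 0; the first one is user 0 itself.
toy_distances, toy_indices = knn.kneighbors(toy[[0]], n_neighbors=3)
print(toy_indices[0], toy_distances[0])
```

This is why the notebook asks for 101 neighbors and then drops the first one.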

In [16]:
# Normalize the csr matrix using the get_csr_matrix_norm function defined
# in the Step 2.1
ratings_csr_matrix_norm = get_csr_matrix_norm(ratings_csr_matrix, method='min_max')

In [17]:
# I build the KNN model here
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(ratings_csr_matrix_norm)
model_knn

In [18]:
target_user = 19960808
query_index = ratings_matrix.index.get_loc(target_user) # The row of the target user
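# Note: the query row comes from the original ratings_matrix, whereas the model was fitted on the min-max normalized matrix
distances, indices = model_knn.kneighbors(ratings_matrix.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 101)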
#distances.flatten()[1:-1].min()

In [19]:
# Most similar users
most_similar_users = [ratings_matrix.iloc[index].name for index in indices.flatten()[1:]]
most_similar_users[:5]

[15908, 22811, 19465, 35630, 35010]

In [20]:
# To check the similarity
target_df = pd.merge(ratings[(ratings['UserID'] == 19960808) & (ratings['BookID'].isin(target_books))], books[['BookID', 'Title']], on='BookID', how='left')
similar_df = pd.merge(ratings[(ratings['UserID'] == most_similar_users[0]) & (ratings['BookID'].isin(target_books))], books[['BookID', 'Title']], on='BookID', how='left')

target_df.columns = ['UserID', 'BookID', 'Rating_target', 'Title']
similar_df.columns = ['UserID', 'BookID', 'Rating_similar_user', 'Title']
target_df = target_df[['Rating_target', 'Title']]
similar_df = similar_df[['Rating_similar_user', 'Title']]

pd.merge(target_df, similar_df, on='Title', how='outer')

Unnamed: 0,Rating_target,Title,Rating_similar_user
0,4,A Court of Mist and Fury (A Court of Thorns an...,
1,3,A Court of Thorns and Roses (A Court of Thorns...,
2,4,A Court of Wings and Ruin (A Court of Thorns a...,
3,4,"Beyond the Shadows (Night Angel, #3)",
4,4,Chronicle of a Death Foretold,
5,3,"Equal Rites (Discworld, #3; Witches #1)",
6,3,Flatland: A Romance of Many Dimensions,
7,5,Foundation (Foundation #1),5.0
8,5,Foundation and Empire (Foundation #2),5.0
9,5,Second Foundation (Foundation #3),5.0


### Step 3. Get recommendations

For the recommendations, we consider the books that the similar users have read and the target user has not yet read. Then, we compute the average rating for each of these books based on the ratings from these similar users. In addition, to make the average ratings more reliable, we only include books that have been rated by a minimum number of these users (at least 5 of the 10 most similar users in the code below). This process helps identify books that align with the target user's taste.
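
The aggregation described above can be sketched compactly with a single `groupby(...).agg(...)`. The data and the threshold `min_ratings` below are made up for illustration.

```python
import pandas as pd

# Made-up ratings from three similar users.
neighbour_ratings = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 3],
    "BookID": [10, 10, 10, 20, 20, 30],
    "Rating": [5, 4, 5, 5, 5, 5],
})
min_ratings = 2  # Hypothetical threshold: keep books rated by at least 2 neighbors.

# Average rating and number of ratings per book in one pass.
summary = neighbour_ratings.groupby("BookID")["Rating"].agg(["mean", "count"])

# Drop books with too few ratings, then rank by average rating.
ranked = summary[summary["count"] >= min_ratings].sort_values("mean", ascending=False)
print(ranked)
```

Book 30 has a perfect average but only one rating, so the threshold removes it; this is the reliability trade-off the paragraph refers to.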

In [21]:
# Ratings of the most similar users
number_of_similar_users = 10
similar_users_ratings = ratings[(ratings['UserID'].isin(most_similar_users[:number_of_similar_users])) & (~ratings['BookID'].isin(target_books))]

# Books that have been rated by at least 5 of the similar users.
count_ratings = similar_users_ratings.groupby('BookID').size()
multirated_books = count_ratings[count_ratings >= 5].index

# Average rating for these books and sorted dataframe
similar_users_ratings = similar_users_ratings[similar_users_ratings['BookID'].isin(multirated_books)]
multirated_books_rating = pd.DataFrame(similar_users_ratings.groupby('BookID')['Rating'].mean(), columns=['Rating']).reset_index()
multirated_books_rating.columns = ['BookID', 'Average_Rating']
multirated_books_rating = multirated_books_rating.sort_values(by='Average_Rating', ascending=False).reset_index()[['BookID', 'Average_Rating']]
multirated_books_rating

Unnamed: 0,BookID,Average_Rating
0,126,5.0
1,862,4.857143
2,39,4.777778
3,562,4.714286
4,70,4.714286
5,2,4.6
6,611,4.6
7,1546,4.6
8,110,4.571429
9,283,4.4


In [22]:
def contains_all_genres(row, genres):
    # The function checks if all the specified genres in a given list are 
    # present in any of the genre columns of the DataFrame row.

    # The any function returns True if the current genre is found in any of the 7 genre columns.
    # The all function ensures that this condition (any returning True) holds for every genre in the genres list.
    return all(any(row[f'Genre_{i}'] == genre for i in range(1, 8)) for genre in genres)

def contains_any_genre(row, genres):
    # The function checks if at least one of the specified genres in a given list is
    # present in any of the genre columns of the DataFrame row.

    # The any function returns True if at least one genre from the genres list is found in any of the 7 genre columns of the row.
    return any(row[f'Genre_{i}'] in genres for i in range(1, 8))

def ordinal_number(number):
    # Function that returns the ordinal equivalent of a number
    if 10 <= number % 100 <= 20: # number % 100 returns the last 2 digits
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(number % 10, 'th')
    return str(number) + suffix
    

def books_satisfying_genres(data, genres, exclude=[], combine=False): 
    
    ###########################################################################################
    #                                                                                         #
    # This function returns the books (with their genres) satisfying some genres restrictions #
    #                                                                                         #
    # data = dataset with the prediction                                                      #
    # genres = list of the genres the user is interested in                                   #
    # exclude = list of genres the user wants to exclude                                      #
    # combine = if True, look for books that have all the genres                              #
    #                                                                                         #
    ###########################################################################################
    
    # I convert the list into a dictionary and back into a list to drop duplicates
    genres = list(dict.fromkeys(genres))

    # If the list of genres specified by the user contains genres not present in the
    # set of unique genres of the books, the function ends
    unique_genres = books_genres_list['Genre'].unique()
    if not set(genres).issubset(unique_genres):
        print('ERROR: There are genres that do not exist.')
        return
    if not set(exclude).issubset(unique_genres):
        print('ERROR: There are genres that do not exist.')
        return

    # If the list of genres specified by the user contains genres present in the
    # list of genres the user wants to exclude, stop the function
    for genre in genres:
        if genre in exclude:
            print('ERROR: There are coincident genres in both lists.')
            return

    data = pd.merge(data, books_genres[['BookID','Genre_1', 'Genre_2', 'Genre_3', 'Genre_4', 'Genre_5', 'Genre_6', 'Genre_7']], on='BookID', how='left')

    if len(genres) == 0: # Case with no genres specified
        data = data
    else:
        if combine: # If the user wants the books to include all the genres specified
            if len(genres) <= 7:
                data = data[data.apply(lambda row: contains_all_genres(row, genres), axis=1)]
            else:
                print('Books can have, at most, 7 different genres. If you want book recommendations including all the selected genres simultaneously, please, choose a maximum of 7 options.')
                return
        else: # If the user wants the books to include at least one of the genres specified
            data = data[data.apply(lambda row: contains_any_genre(row, genres), axis=1)]

    # To drop the books with at least one of their genres in the exclude list
    data = data[~data.apply(lambda row: contains_any_genre(row, exclude), axis=1)]

    return data # This dataframe contains the genres of the books

def books_recommender(data, genres, n, exclude=[], combine=False): 

    #############################################################################################
    #                                                                                           #
    # This function returns the final recommendations and prints the information of these books #
    #                                                                                           #
    # data = dataset with the prediction                                                        #
    # genres = list of the genres the user is interested in                                     #
    # n = maximum number of books the user wants in the recommendation                          #
    # exclude = list of genres the user wants to exclude                                        #
    # combine = if True, look for books that have all the genres                                #
    #                                                                                           #
    #############################################################################################

    # DataFrame with the books satisfying the genres request
    data_n = books_satisfying_genres(data, genres, exclude, combine)

    # books_satisfying_genres returns None when the genre lists are invalid
    if data_n is None:
        return

    # To sort the remaining books according to their average ratings
    data_n = data_n.sort_values(by='Average_Rating', ascending=False).head(n)

    if data_n.shape[0] == 0:
        print("I am sorry, there is no personalized recommendation with the features you asked for. You could try asking for other genres.")
        print(f'However, here you have the {n} most popular books you still have not read.')
        # TO DO: call here the function for the recommendations based on popularity
        return

    # This is just to include the image of the selected books in the dataframe
    data_n = pd.merge(books[['BookID', 'Goodreads_BookID', 'Title', 'Authors', 'Image_url']], data_n, on='BookID', how='right')
    #data_n = data_n.sort_values(by='Average_Rating', ascending=False).reset_index()

    texto = "Recommended books:" # A header to print before giving the recommendations
    tamaño_fuente = 24
    html = f"<h1 style='font-size:{tamaño_fuente}px'>{texto}</h1>"
    display(Markdown(html))
    print('')

    # To print the recommendations
    for i, row in data_n.iterrows():
        url = row["Image_url"]
        img = Image(url=url, width=100)
        book_i_genres = [row[f'Genre_{i}'] for i in range(1,8)]
        book_i_genres = [genre for genre in book_i_genres if genre != 'Empty']
        book_i_genres_string = book_i_genres[0]
        if len(book_i_genres) > 1:
            for ind in range(1, len(book_i_genres)):
                book_i_genres_string += ', ' + book_i_genres[ind]
    

        print(ordinal_number(i+1) + ' recommendation:', '\n')
        print('  Title: {}'.format(row['Title']))
        print('  Authors: {}'.format(row['Authors']))
        print('  Average rating: {}'.format(round(row['Average_Rating'],2)))
        #print('  Predicted rating: {}'.format(row['Rating_predicted']))
        print('  Genres: ', book_i_genres_string)
        print('  Goodreads Book ID: {}'.format(row['Goodreads_BookID']), '\n')
        display(img)
        print('')

    return data_n # The function returns the dataframe with the recommendations

In [23]:
list_genres = []
exclude_genres = []
recommendations_books = books_recommender(multirated_books_rating, list_genres, 5, exclude=exclude_genres, combine=True)

<h1 style='font-size:24px'>Recommended books:</h1>


1st recommendation: 

  Title: Dune (Dune Chronicles #1)
  Authors: Frank Herbert
  Average rating: 5.0
  Genres:  Science Fiction, Fiction, Fantasy, Science Fiction Fantasy, Classics, Audiobook, Novels
  Goodreads Book ID: 234225 

2nd recommendation: 

  Title: Words of Radiance (The Stormlight Archive, #2)
  Authors: Brandon Sanderson
  Average rating: 4.86
  Genres:  Fantasy, Fiction, Epic Fantasy, High Fantasy, Audiobook, Adult, Magic
  Goodreads Book ID: 17332218 

3rd recommendation: 

  Title: A Game of Thrones (A Song of Ice and Fire, #1)
  Authors: George R.R. Martin
  Average rating: 4.78
  Genres:  Fantasy, Fiction, Epic Fantasy, High Fantasy, Adult, Science Fiction Fantasy, Adventure
  Goodreads Book ID: 13496 

4th recommendation: 

  Title: The Way of Kings (The Stormlight Archive, #1)
  Authors: Brandon Sanderson
  Average rating: 4.71
  Genres:  Fantasy, Fiction, Epic Fantasy, High Fantasy, Audiobook, Adult, Science Fiction Fantasy
  Goodreads Book ID: 7235533 

5th recommendation: 

  Title: Ender's Game (Ender's Saga, #1)
  Authors: Orson Scott Card
  Average rating: 4.71
  Genres:  Science Fiction, Fiction, Young Adult, Fantasy, Classics, Science Fiction Fantasy, Dystopia
  Goodreads Book ID: 375802 


## TODO

### Using recommenderlab

Recommenderlab is an R package that provides the infrastructure to evaluate and compare several collaborative-filtering algorithms. 
Many algorithms are already implemented in the package, and we can use the available ones to save some coding effort, or add custom algorithms and use the infrastructure (e.g. cross-validation).

There is an important aspect concerning the representation of our rating matrix. 
As we could already see above, most of the values in the rating matrix are missing, because every user rated just a few of the 10000 books. This allows us to represent this matrix in sparse format in order to save memory.


TODO: continue the (https://github.com/zygmuntz/goodbooks-10k/blob/master/contrib/Philipp%20Spachtholz%20-%20Book%20Recommender%20-%20Collaborative%20Filtering%2C%20Shiny.Rmd) notebook from this point.
