# Book Recommender

After finishing a captivating book, it is common to experience a sense of loss, as if saying goodbye to a friend. The search for a good book to fill that void can be intimidating, with the worry that nothing else will live up to its predecessor. This is where an application designed to recommend books can offer a beacon of hope in the form of carefully selected recommendations.

Imagine a program that understands your reading preferences, knows your favorite genres and authors, and can recommend titles with captivating storytelling. Such an app could become a trusted companion for book lovers, making the transition from one book to the next simpler and more satisfying. This is precisely what this recommender aims to achieve.

While there is still plenty of work to be done, this Python program is already capable of providing book recommendations based on user feedback.

The project is split into four parts:

- Part I - Data Wrangling
- [Part II - Exploratory Data Analysis](2-EDA.ipynb)
- [Part III - Collaborative Filtering](3-Model.ipynb)
- [Part IV - Dash Application](4-Dash.ipynb)

# Part I - Data Wrangling

### Importing the libraries

In [1]:
import pandas as pd               # pandas is used for data manipulation and analysis, providing data structures like DataFrames.
import numpy as np                # numpy is used for numerical operations on large, multi-dimensional arrays and matrices.
from IPython.display import Image # IPython's display module is used to display images within Jupyter Notebooks.
import ast                        # ast provides literal_eval, used below to safely parse list-like strings into Python lists.
import re                         # re provides regular expression matching operations in strings.

## Step 1: Loading the Data

### The goodbooks-10k repository

The first step is to collect the data used in this project. The datasets are downloaded from the [goodbooks-10k](https://github.com/zygmuntz/goodbooks-10k) GitHub repository, which contains approximately six million user ratings for ten thousand popular books from Goodreads. The original datasets are placed in the [data_preprocessed/](data_preprocessed/) directory. There are four datasets:

- [books](data_preprocessed/books.csv): Includes the metadata for each of the 10K books.
- [ratings](data_preprocessed/ratings_part_1.csv): Each row represents the rating given to a book with a given bookID by a user with a given userID. The dataset has been divided into four smaller files, with the help of the program [split_csv.py](split_csv.py), to avoid GitHub warnings.
- [book_tags](data_preprocessed/book_tags.csv): Contains tags, shelves, and genres assigned by Goodreads users to the books.
- [tags](data_preprocessed/tags.csv): Translates the tag IDs to names.

However, in this project, we decided not to use the book_tags and tags files, as they contain a lot of information that is not relevant for our purposes. Additionally, many tags for the same genres have different names, making the process of cleaning the dataset difficult and tedious.

### Web scraping

As mentioned, we did not obtain the genres of the books from the book_tags and tags files, despite their importance. Instead, we performed web scraping. Additionally, many books were missing cover image URLs, ISBNs, or publication years. Using the notebooks in the [web_scraping](web_scraping/) folder, we successfully obtained all these features.

A final remark regarding the genres of the books: while the original files contain more information, having up to seven genres per book should be sufficient for our purposes. These genres are easy to scrape, and we avoid many tags that are either duplicated or irrelevant.

In [2]:
# Read the data files
books = pd.read_csv("data_preprocessed/books.csv")

ratings_files = [f'data_preprocessed/ratings_part_{i}.csv' for i in range(1,4+1)]
ratings_dfs = [pd.read_csv(file) for file in ratings_files]
ratings = pd.concat(ratings_dfs, ignore_index=True)

books_tags = pd.read_csv("data_preprocessed/book_tags.csv")
tags = pd.read_csv("data_preprocessed/tags.csv")

books_missing_data = pd.read_csv("data/books_data_missing.txt", sep="\t")
books_missing_image = pd.read_csv("data/books_image_missing.txt", sep="\t")
books_genres = pd.read_csv("data/books_genres.txt", sep="\t")

Have a look at the datasets:

In [246]:
# Books dataset
print("Books Shape: ", books.shape)
books.head(3)

Books Shape:  (10000, 23)


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...


In [247]:
print("Ratings Shape: ", ratings.shape)
ratings.head(3)

Ratings Shape:  (5976479, 3)


Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5


In [248]:
print("Books_tags Shape: ", books_tags.shape)
books_tags.head(3)

Books_tags Shape:  (999912, 3)


Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173


In [249]:
print("Tags Shape: ", tags.shape)
tags.head(3)

Tags Shape:  (34252, 2)


Unnamed: 0,tag_id,tag_name
0,0,-
1,1,--1-
2,2,--10-


Rename the columns of the datasets.

In [250]:
books.columns = ['BookID', 'Goodreads_BookID', 'Best_BookID', 'WorkID',
       'Books_Count', 'ISBN', 'ISBN13', 'Authors', 'Year',
       'Original_Title', 'Title', 'Language_Code', 'Average_Rating',
       'Ratings_Count', 'Work_Ratings_Count', 'Work_Text_Reviews_Count',
       'Ratings_1', 'Ratings_2', 'Ratings_3', 'Ratings_4', 'Ratings_5',
       'Image_url', 'Small_Image_url']

ratings.columns = ['UserID', 'BookID', 'Rating']

books_tags.columns = ['Goodreads_BookID', 'TagID', 'Count']

tags.columns = ['TagID', 'Tag_Name']

Drop irrelevant columns.

In [251]:
books.drop(['Language_Code','Small_Image_url','ISBN13'], axis=1, inplace=True)

Some of the Goodreads_BookIDs are wrong, and we correct them in the cells below. For more details, take a look at the end of the file web_scrapping_goodreads_genres.ipynb.

In [252]:
Goodreads_BookIDs = [31426, 852460, 2855034, 89959, 6120349, 61942, 18906484]
books[books['Goodreads_BookID'].isin(Goodreads_BookIDs)][['Goodreads_BookID']]

Unnamed: 0,Goodreads_BookID
3380,31426
4168,852460
4924,2855034
5112,6120349
5375,89959
5414,61942
9011,18906484


In [253]:
# Goodreads_BookID = 31426
books.loc[3380,'Goodreads_BookID'] = 439286
# Goodreads_BookID = 852460
books.loc[4168,'Goodreads_BookID'] = 20742529
# Goodreads_BookID = 2855034
books.loc[4924,'Goodreads_BookID'] = 2424593
# Goodreads_BookID = 6120349
books.loc[5112,'Goodreads_BookID'] = 18652490
# Goodreads_BookID = 89959
books.loc[5375,'Goodreads_BookID'] = 355316
# Goodreads_BookID = 61942
books.loc[5414,'Goodreads_BookID'] = 8356426
# Goodreads_BookID = 18906484
books.loc[9011,'Goodreads_BookID'] = 18906484
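As a possible alternative to addressing rows by index, the corrections can be keyed on the wrong ID itself. The sketch below is only an illustration on a toy dataframe (`toy` and `corrections` are hypothetical names); it uses two of the old/new ID pairs from the cell above and relies on `Series.replace` with a dictionary.

```python
import pandas as pd

# Mapping from wrong Goodreads_BookID to corrected one (two pairs taken from the cell above)
corrections = {31426: 439286, 852460: 20742529}

# Toy dataframe standing in for `books`
toy = pd.DataFrame({"Goodreads_BookID": [31426, 3, 852460]})

# Replace only the IDs listed in the mapping; every other ID is left untouched
toy["Goodreads_BookID"] = toy["Goodreads_BookID"].replace(corrections)

print(toy)
```

This form does not depend on the row order of the dataframe, which may make it easier to rerun after the index changes.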

## Step 2: Data Cleaning

### Unmatched IDs

Some BookIDs in books may not be present in ratings, or vice versa.

In [254]:
# Function to check that both books and ratings datasets include the same BookIDs
# If this is not the case, the books with unmatched BookID are dropped
def same_BoobIDs(Books, Ratings):
    print('Unique BookIDs in books: ', Books['BookID'].nunique())
    print('Unique BookIDs in ratings: ', Ratings['BookID'].nunique())

    # Create sets with the different values for the BookIDs in both datasets
    # to keep just the unique values
    set_books = set(Books['BookID'])
    set_ratings = set(Ratings['BookID'])

    # Store the difference of both sets in a new list of unique BookIDs 
    different_BookIDs = list(set_books.symmetric_difference(set_ratings))

    # Finally, we only keep the rows in the datasets whose BookIDs are in both dataframes
    Books = Books[~Books['BookID'].isin(different_BookIDs)]
    Ratings = Ratings[~Ratings['BookID'].isin(different_BookIDs)]

    return Books, Ratings

In [255]:
books, ratings = same_BoobIDs(books, ratings)

Unique BookIDs in books:  10000
Unique BookIDs in ratings:  10000
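Since both datasets already share the same 10,000 BookIDs, the function drops nothing here. To see what it would do on mismatched data, here is a small sketch on made-up toy tables (`toy_books` and `toy_ratings` are hypothetical), using the same symmetric-difference idea.

```python
import pandas as pd

# Toy data: book 3 has no ratings, and book 4 has ratings but no metadata
toy_books = pd.DataFrame({"BookID": [1, 2, 3]})
toy_ratings = pd.DataFrame({"BookID": [1, 1, 2, 4], "Rating": [5, 4, 3, 2]})

# IDs that appear in exactly one of the two tables
unmatched = set(toy_books["BookID"]).symmetric_difference(toy_ratings["BookID"])

# Keep only the rows whose BookID appears in both tables
toy_books = toy_books[~toy_books["BookID"].isin(unmatched)]
toy_ratings = toy_ratings[~toy_ratings["BookID"].isin(unmatched)]

print(sorted(unmatched))
print(toy_books["BookID"].tolist())
print(toy_ratings["BookID"].tolist())
```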


### Missing Values

#### 1. Books

In [256]:
books.isnull().sum()

BookID                       0
Goodreads_BookID             0
Best_BookID                  0
WorkID                       0
Books_Count                  0
ISBN                       700
Authors                      0
Year                        21
Original_Title             585
Title                        0
Average_Rating               0
Ratings_Count                0
Work_Ratings_Count           0
Work_Text_Reviews_Count      0
Ratings_1                    0
Ratings_2                    0
Ratings_3                    0
Ratings_4                    0
Ratings_5                    0
Image_url                    0
dtype: int64

In [257]:
# Original_Title
missing_original_titles = books[books['Original_Title'].isnull()]['Title'].apply(lambda x: x.split('(')[0])
books.loc[books['Original_Title'].isnull(), 'Original_Title'] = missing_original_titles

In [258]:
# Year
indices_missing_years = books[books['Year'].isnull()].index # indices of the books in books which have a missing year

books_missing_data.set_index('WorkID', inplace=True) # this is to locate easily the years in the loop

for i in indices_missing_years:
    workID = books.loc[i, 'WorkID']
    books.loc[i, 'Year'] = books_missing_data.loc[workID, 'Year']

In [259]:
# ISBN
indices_missing_isbn = books[books['ISBN'].isnull()].index # indices of the books in books which have a missing ISBN

for i in indices_missing_isbn:
    workID = books.loc[i, 'WorkID']
    books.loc[i, 'ISBN'] = books_missing_data.loc[workID, 'ISBN']
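The two loops above look up each missing value by WorkID. The same lookup can be sketched without an explicit loop using `Series.map` and `fillna`. The example below runs on toy data (`toy_books` and `toy_lookup` are hypothetical stand-ins for `books` and `books_missing_data`) and is meant as an illustration, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `books`: two rows are missing the year
toy_books = pd.DataFrame({
    "WorkID": [10, 20, 30],
    "Year": [1999.0, np.nan, np.nan],
})

# Toy stand-in for the scraped data, indexed by WorkID like books_missing_data
toy_lookup = pd.DataFrame(
    {"Year": [2005.0, 2010.0]},
    index=pd.Index([20, 30], name="WorkID"),
)

# Map every WorkID to its scraped year (NaN when the WorkID is not in the lookup),
# then use the scraped value only where the original year is missing
scraped_year = toy_books["WorkID"].map(toy_lookup["Year"])
toy_books["Year"] = toy_books["Year"].fillna(scraped_year)

print(toy_books)
```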

In [260]:
books.isnull().sum()

BookID                     0
Goodreads_BookID           0
Best_BookID                0
WorkID                     0
Books_Count                0
ISBN                       0
Authors                    0
Year                       0
Original_Title             0
Title                      0
Average_Rating             0
Ratings_Count              0
Work_Ratings_Count         0
Work_Text_Reviews_Count    0
Ratings_1                  0
Ratings_2                  0
Ratings_3                  0
Ratings_4                  0
Ratings_5                  0
Image_url                  0
dtype: int64

Some ISBNs are missing their leading zeros, so we pad them back to 10 characters.

In [261]:
for index, row in books.iterrows():
    isbn = row['ISBN']

    if isinstance(isbn, str):
        if len(isbn) == 9:
            books.loc[index, 'ISBN'] = '0' + isbn
        elif len(isbn) == 8:
            books.loc[index, 'ISBN'] = '00' + isbn
        elif len(isbn) == 7:
            books.loc[index, 'ISBN'] = '000' + isbn
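A vectorized sketch of the same padding is shown below on a toy Series (`toy_isbn` is hypothetical). It assumes all values are strings and, like the loop above, only pads values of 7 to 9 characters, so that non-numeric entries such as 'online' are left alone.

```python
import pandas as pd

# Toy ISBN values: 9 characters, already 10 characters, 8 characters, and a non-ISBN marker
toy_isbn = pd.Series(["439023483", "0439554934", "12345678", "online"])

# Select only the values that are 7 to 9 characters long
needs_padding = toy_isbn.str.len().between(7, 9)

# Left-pad the selected values with zeros up to 10 characters; keep the rest unchanged
padded = toy_isbn.where(~needs_padding, toy_isbn.str.zfill(10))

print(padded.tolist())
```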

#### 2. Ratings

In [262]:
ratings.isnull().sum()

UserID    0
BookID    0
Rating    0
dtype: int64

#### 3. Books_tags

In [263]:
books_tags.isnull().sum()

Goodreads_BookID    0
TagID               0
Count               0
dtype: int64

#### 4. Tags

In [264]:
tags.isnull().sum()

TagID       0
Tag_Name    0
dtype: int64

### Duplicates

Remove duplicates in the datasets.

#### 1. Books

In [265]:
columns = ['BookID', 'Goodreads_BookID', 'Best_BookID', 'WorkID',
       'ISBN', 'Original_Title', 'Title', 'Image_url']
for column in columns:
    num_duplicates = books[books[column].duplicated()].shape[0]
    print(f'Number of duplicates in books[{column}]: {num_duplicates}')

Number of duplicates in books[BookID]: 0
Number of duplicates in books[Goodreads_BookID]: 0
Number of duplicates in books[Best_BookID]: 0
Number of duplicates in books[WorkID]: 0
Number of duplicates in books[ISBN]: 3
Number of duplicates in books[Original_Title]: 148
Number of duplicates in books[Title]: 36
Number of duplicates in books[Image_url]: 3331


In [266]:
# Check ISBNs
books[books['ISBN'].duplicated() & (books['ISBN'] != 'online')].shape[0]

0

In [267]:
# Check Original_Title
# books[books['Original_Title'].duplicated()][['Original_Title', 'Title']]

# This is not a relevant column

In [268]:
# Check Title
duplicated_titles = books[books['Title'].duplicated()]['Title'].values

pd.set_option('display.max_rows', 100)
books[books['Title'].isin(duplicated_titles)].sort_values(by='Title')[['Authors', 'Original_Title', 'Title']]

Unnamed: 0,Authors,Original_Title,Title
1291,"Stephen King, Jerry N. Uelsmann",'Salem's Lot The Illustrated Edition,'Salem's Lot
348,Stephen King,Salem's Lot,'Salem's Lot
5267,Sarah Simblet,Anatomy for the Artist,Anatomy for the Artist
4185,Jenő Barcsay,Művészeti Anatómia,Anatomy for the Artist
6480,Tom Stoppard,Arcadia,Arcadia
6104,Lauren Groff,Arcadia,Arcadia
5786,Alison Bechdel,Are You My Mother?: A Comic Drama,Are You My Mother?
578,P.D. Eastman,Are You My Mother?,Are You My Mother?
9111,Bob Grant,Bambi,Bambi
3401,"Felix Salten, Barbara Cooney",Bambi - Eine Lebensgeschichte aus dem Walde,Bambi


In [269]:
pd.reset_option('display.max_rows')

In [270]:
# Although there are duplicated titles, it is because there are different books 
# from different authors sharing a name.

In [271]:
# Check Image_url
duplicated_images = books[books['Image_url'].duplicated()]['Image_url'].values
print(set(duplicated_images), '\n')
# All the duplicated image URLs point to the same placeholder picture:
url = duplicated_images[0]
Image(url=url, width=100)

{'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'} 



In [272]:
indices_missing_image = books[books['Image_url'] == url].index # indices of the books in books which have a missing image

books_missing_image.set_index('WorkID', inplace=True) # this makes it easy to look up the image URLs in the loop

for i in indices_missing_image:
    workID = books.loc[i, 'WorkID']
    books.loc[i, 'Image_url'] = books_missing_image.loc[workID, 'Image_url']

In [273]:
columns = ['BookID', 'Goodreads_BookID', 'Best_BookID', 'WorkID',
       'ISBN', 'Original_Title', 'Title', 'Image_url']
for column in columns:
    num_duplicates = books[books[column].duplicated()].shape[0]
    print(f'Number of duplicates in books[{column}]: {num_duplicates}')

Number of duplicates in books[BookID]: 0
Number of duplicates in books[Goodreads_BookID]: 0
Number of duplicates in books[Best_BookID]: 0
Number of duplicates in books[WorkID]: 0
Number of duplicates in books[ISBN]: 3
Number of duplicates in books[Original_Title]: 148
Number of duplicates in books[Title]: 36
Number of duplicates in books[Image_url]: 0


#### 2. Ratings

In [274]:
ratings[ratings[['UserID','BookID','Rating']].duplicated()]

Unnamed: 0,UserID,BookID,Rating


In [275]:
ratings[ratings[['UserID','BookID']].duplicated()]

Unnamed: 0,UserID,BookID,Rating


#### 3. Books_Tags

In [276]:
books_tags[books_tags.duplicated()]

Unnamed: 0,Goodreads_BookID,TagID,Count
159371,22369,25148,4
265128,52629,10094,1
265140,52629,2928,1
265155,52629,13272,1
265187,52629,13322,1
308771,77449,25148,7


In [277]:
books_tags = books_tags.drop_duplicates()

In [278]:
books_tags[books_tags.duplicated()]

Unnamed: 0,Goodreads_BookID,TagID,Count


#### 4. Tags

In [279]:
tags[tags.duplicated()]

Unnamed: 0,TagID,Tag_Name


### Outliers

Identify and address outliers, such as extreme values or data that doesn't make sense.

#### 1. Books

In [280]:
books[['BookID', 'Year']].describe()

Unnamed: 0,BookID,Year
count,10000.0,10000.0
mean,5000.5,1982.04
std,2886.89568,152.420773
min,1.0,-1750.0
25%,2500.75,1990.0
50%,5000.5,2004.0
75%,7500.25,2011.0
max,10000.0,2019.0


In [281]:
books[books['Year'] > 2017]

Unnamed: 0,BookID,Goodreads_BookID,Best_BookID,WorkID,Books_Count,ISBN,Authors,Year,Original_Title,Title,Average_Rating,Ratings_Count,Work_Ratings_Count,Work_Text_Reviews_Count,Ratings_1,Ratings_2,Ratings_3,Ratings_4,Ratings_5,Image_url
9196,9197,11318,11318,2024704,56,394757645,Raymond Chandler,2019.0,Trouble Is My Business,Trouble Is My Business,4.05,9020,9466,232,286,308,1629,3666,3577,https://images.gr-assets.com/books/1388189902m...


In [282]:
books.loc[9196,'Year'] = 1939

In [283]:
books[['BookID', 'Year']].describe()

Unnamed: 0,BookID,Year
count,10000.0,10000.0
mean,5000.5,1982.032
std,2886.89568,152.420932
min,1.0,-1750.0
25%,2500.75,1990.0
50%,5000.5,2004.0
75%,7500.25,2011.0
max,10000.0,2017.0


The maximum publication year is now 2017, which is the year in which the data was collected.

#### 2. Ratings

In [284]:
ratings.describe()

Unnamed: 0,UserID,BookID,Rating
count,5976479.0,5976479.0,5976479.0
mean,26224.46,2006.477,3.919866
std,15413.23,2468.499,0.9910868
min,1.0,1.0,1.0
25%,12813.0,198.0,3.0
50%,25938.0,885.0,4.0
75%,39509.0,2973.0,5.0
max,53424.0,10000.0,5.0


There seem to be no outliers in the ratings: all values lie between 1 and 5.

### Data Types

Fix, if necessary, the data types of the datasets.

In [285]:
books.dtypes

BookID                       int64
Goodreads_BookID             int64
Best_BookID                  int64
WorkID                       int64
Books_Count                  int64
ISBN                        object
Authors                     object
Year                       float64
Original_Title              object
Title                       object
Average_Rating             float64
Ratings_Count                int64
Work_Ratings_Count           int64
Work_Text_Reviews_Count      int64
Ratings_1                    int64
Ratings_2                    int64
Ratings_3                    int64
Ratings_4                    int64
Ratings_5                    int64
Image_url                   object
dtype: object

In [286]:
ratings.dtypes

UserID    int64
BookID    int64
Rating    int64
dtype: object

In [287]:
books_tags.dtypes

Goodreads_BookID    int64
TagID               int64
Count               int64
dtype: object

In [288]:
tags.dtypes

TagID        int64
Tag_Name    object
dtype: object

### Books with Arabic characters

In [289]:
# Function to find books whose title contains Arabic characters
def contains_arabic(text):
    arabic_pattern = re.compile(r'[\u0600-\u06FF]')
    return bool(arabic_pattern.search(text))

books_arabic = books[books['Title'].apply(contains_arabic)]
books_arabic_indices = books_arabic['BookID'].values

# Keep only the books whose title contains no Arabic characters
books = books[~books['BookID'].isin(books_arabic_indices)]
ratings = ratings[~ratings['BookID'].isin(books_arabic_indices)]
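To make the effect of the filter concrete, here is a minimal sketch on a toy dataframe (`toy` and `ARABIC_CLASS` are hypothetical names). It uses the same Unicode range, U+0600 to U+06FF, but applies it through `Series.str.contains` instead of a helper function.

```python
import pandas as pd

# Character class covering the Unicode Arabic block (U+0600 to U+06FF)
ARABIC_CLASS = "[\u0600-\u06FF]"

# Toy stand-in for `books`
toy = pd.DataFrame({
    "BookID": [1, 2, 3],
    "Title": ["Twilight", "ألف ليلة وليلة", "Bambi"],
})

# True for titles containing at least one character from the Arabic block
has_arabic = toy["Title"].str.contains(ARABIC_CLASS, regex=True)

# Keep only the titles without Arabic characters
kept = toy[~has_arabic]

print(kept)
```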

### Users with few reviews

Exclude users with fewer than 10 ratings, so that every remaining user has enough ratings to base recommendations on.

In [290]:
counts_users = ratings['UserID'].value_counts()

users_to_remove = counts_users[counts_users < 10].index

ratings = ratings[~ratings['UserID'].isin(users_to_remove)].reset_index(drop=True)

ratings.shape

(5964966, 3)

In [291]:
counts_users

UserID
12874    200
30944    200
12381    199
52036    199
28158    199
        ... 
47143      2
49880      2
28715      2
32925      2
25856      1
Name: count, Length: 53424, dtype: int64
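The same filter can also be written with `groupby` and `transform`, which attaches each user's rating count to every one of their rows. The sketch below uses toy data and a smaller threshold (`toy_ratings` and `MIN_RATINGS` are hypothetical; the notebook itself uses 10).

```python
import pandas as pd

# Toy stand-in for `ratings`
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 3, 3],
    "BookID": [10, 11, 12, 10, 11, 12],
    "Rating": [5, 4, 3, 2, 5, 4],
})

MIN_RATINGS = 3  # the notebook uses 10; a smaller value suits the toy data

# Number of ratings given by the user of each row, aligned with the rows of toy_ratings
ratings_per_user = toy_ratings.groupby("UserID")["BookID"].transform("count")

# Keep only the rows of users who reach the threshold
filtered = toy_ratings[ratings_per_user >= MIN_RATINGS].reset_index(drop=True)

print(filtered)
```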

Now, drop all the books that no longer appear in the filtered `ratings` dataframe.

In [292]:
books.shape

(9930, 20)

In [293]:
books_in_ratings = ratings['BookID'].unique()
books = books[books['BookID'].isin(books_in_ratings)]
books.reset_index(drop=True, inplace=True)
books.shape[0]

9930

### Wrong Titles

In [294]:
books.loc[6972, ['Title', 'ISBN']]
books.loc[8645, ['Title', 'ISBN']]

Title    Sin City: Una Dura Despedida, #1 de 3
ISBN                                9509051373
Name: 8645, dtype: object

In [295]:
books.loc[6972, 'Title'] = 'Private Number 1 Suspect (Private, #2)'
books.loc[8645, 'Title'] = 'Sin City: Una Dura Despedida (Sin City Editorial Gargola, #1)'

### Books Genres

In [296]:
books_genres[books_genres['Genres'].isnull()]

Unnamed: 0,Goodreads_BookID,Genres


In [297]:
# The lists in the `Genres` column are treated as strings, so we convert
# them to lists
books_genres['Genres'] = books_genres['Genres'].apply(ast.literal_eval)

In [298]:
print('Minimum number of genres: ', books_genres['Genres'].apply(lambda x: len(x)).min())
print('Maximum number of genres: ', books_genres['Genres'].apply(lambda x: len(x)).max())

Minimum number of genres:  2
Maximum number of genres:  7


In [299]:
def fill_genres(genres_list):
    needed_Emptys = 7 - len(genres_list) # number of 'Empty' values that have to be appended to the list of genres
    genres_list.extend(['Empty'] * needed_Emptys)
    return genres_list

In [300]:
books_genres['Genres'] = books_genres['Genres'].apply(fill_genres)

In [301]:
print('Minimum number of genres: ', books_genres['Genres'].apply(lambda x: len(x)).min())
print('Maximum number of genres: ', books_genres['Genres'].apply(lambda x: len(x)).max())

Minimum number of genres:  7
Maximum number of genres:  7


In [302]:
# Then, we expand the lists into new columns
books_genres_expanded = books_genres['Genres'].apply(pd.Series)

# and concatenate the new columns to the original dataframe
books_genres = pd.concat([books_genres, books_genres_expanded], axis=1)

books_genres.columns = ['Goodreads_BookID', 'Genres', 'Genre_1', 'Genre_2', 'Genre_3', 'Genre_4', 'Genre_5', 'Genre_6', 'Genre_7']
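For reference, the padding and expansion steps can be sketched more compactly on toy data. The version below is an illustration under assumptions (`toy`, `MAX_GENRES`, and `genre_columns` are hypothetical names): it pads without modifying the original lists and builds the genre columns directly with the `pd.DataFrame` constructor rather than `apply(pd.Series)`.

```python
import pandas as pd

MAX_GENRES = 7

# Toy stand-in for `books_genres`, with fewer than seven genres per book
toy = pd.DataFrame({
    "Goodreads_BookID": [3885, 22931009],
    "Genres": [["Cookbooks", "Cooking", "Nonfiction"], ["Reference", "Nonfiction"]],
})

# Build new padded lists instead of extending the original ones in place
padded = toy["Genres"].apply(lambda g: g + ["Empty"] * (MAX_GENRES - len(g)))

# One column per genre position
genre_columns = [f"Genre_{i}" for i in range(1, MAX_GENRES + 1)]
expanded = pd.DataFrame(padded.tolist(), columns=genre_columns, index=toy.index)

result = pd.concat([toy[["Goodreads_BookID"]], expanded], axis=1)
print(result)
```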

In [303]:
books_genres[(books_genres['Genre_1'] == 'Empty') | (books_genres['Genre_2'] == 'Empty') | (books_genres['Genre_3'] == 'Empty') |
             (books_genres['Genre_4'] == 'Empty') | (books_genres['Genre_5'] == 'Empty') | (books_genres['Genre_6'] == 'Empty') |
             (books_genres['Genre_7'] == 'Empty')].drop('Genres', axis=1)

Unnamed: 0,Goodreads_BookID,Genre_1,Genre_2,Genre_3,Genre_4,Genre_5,Genre_6,Genre_7
1933,18607805,Nonfiction,Reference,Technology,Self Help,Amazon,Empty,Empty
5674,18363068,Novels,Romance,Fiction,Literature,Romantic,Empty,Empty
6633,3885,Cookbooks,Cooking,Nonfiction,Food,Reference,Culinary,Empty
7766,22931009,Reference,Nonfiction,Empty,Empty,Empty,Empty,Empty
7797,7140220,Nonfiction,Vampires,History,Romance,Essays,Paranormal,Empty
8225,18387597,Novels,Fiction,Thriller,Literature,Empty,Empty,Empty
8435,7824768,Poetry,Romance,Literature,Nonfiction,Romantic,Literary Fiction,Empty
8578,21394850,Nonfiction,Religion,Islam,Philosophy,History,Literature,Empty
8961,402017,Cookbooks,Cooking,Nonfiction,Food,Reference,Empty,Empty
9530,16170625,Poetry,Literature,Politics,Nonfiction,History,Egypt,Empty


It may also be useful to have the genres in a long-format dataframe, with one row per book and genre, similar to books_tags.

In [304]:
books_genres_list = books_genres.melt(id_vars=['Goodreads_BookID'], value_vars=['Genre_1', 'Genre_2', 'Genre_3', 'Genre_4', 'Genre_5', 'Genre_6', 'Genre_7'],
                    var_name='Genre_num', value_name='Genre')

books_genres_list = books_genres_list[books_genres_list['Genre'] != 'Empty'][['Goodreads_BookID', 'Genre']].reset_index(drop=True)

books_genres_list

Unnamed: 0,Goodreads_BookID,Genre
0,2657,Classics
1,11870085,Young Adult
2,3,Fantasy
3,2767052,Young Adult
4,960,Fiction
...,...,...
69974,8356426,Adventure
69975,20742529,How To
69976,439286,American
69977,355316,Reference
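If the unpadded genre lists are still available, a long-format table like the one above can also be obtained with `DataFrame.explode`, which avoids the 'Empty' placeholders altogether. The following is a sketch on toy data (`toy` and `long_genres` are hypothetical names); note that the row order differs from the `melt` result, since `explode` keeps the genres of each book together.

```python
import pandas as pd

# Toy stand-in for `books_genres` before padding
toy = pd.DataFrame({
    "Goodreads_BookID": [3, 960],
    "Genres": [["Fantasy", "Fiction"], ["Fiction"]],
})

# One row per (book, genre) pair
long_genres = (
    toy.explode("Genres")
       .rename(columns={"Genres": "Genre"})
       .reset_index(drop=True)
)

print(long_genres)
```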


In [305]:
books_genres = pd.merge(books_genres, books[['BookID', 'Goodreads_BookID']], on='Goodreads_BookID', how='left')
books_genres_list = pd.merge(books_genres_list, books[['BookID', 'Goodreads_BookID']], on='Goodreads_BookID', how='left')

## Collection Books 

Some of the books in the books dataset are collections of other books in the dataset. To avoid recommending a collection to someone who has already read its individual volumes, we assign each collection's ratings to its individual volumes and then drop the collections from the books and ratings dataframes.

In [306]:
# Books which are collections of books
mask = np.logical_or.reduce([books['Title'].str.contains(f'#{i}-', case=False) for i in range(1, 10)])
collection_books = books[mask]
print(len(collection_books)) # Output: 99
# Collection books including a parenthesis
print(len(collection_books[collection_books['Title'].str.contains(r'\(', case=False)])) # Output: 97

99
97
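The detection above builds nine separate patterns, `#1-` to `#9-`. As a sketch, the same test can be written as a single regular expression, and the parenthesis check can avoid escaping by disabling regex matching. The titles below are made-up examples in the dataset's naming style, and `is_collection` and `has_parenthesis` are hypothetical names.

```python
import pandas as pd

# Toy titles: a single volume, a collection, and a collection without a parenthesis
titles = pd.Series([
    "The Fellowship of the Ring (The Lord of the Rings, #1)",
    "The Lord of the Rings (The Lord of the Rings, #1-3)",
    "Nancy Drew: #1-64",
])

# '#' followed by one digit from 1 to 9 and a hyphen, as in the nine patterns above
is_collection = titles.str.contains(r"#[1-9]-")

# Literal search for '(' with regex matching switched off, so no escaping is needed
has_parenthesis = titles.str.contains("(", regex=False)

print(is_collection.tolist())
print(has_parenthesis.tolist())
```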


Some book collections do not have a parenthesis to specify the saga and the volumes they include.

In [307]:
# Collection books not including a parenthesis
collection_books[~collection_books['Title'].str.contains(r'\(', case=False)]

Unnamed: 0,BookID,Goodreads_BookID,Best_BookID,WorkID,Books_Count,ISBN,Authors,Year,Original_Title,Title,Average_Rating,Ratings_Count,Work_Ratings_Count,Work_Text_Reviews_Count,Ratings_1,Ratings_2,Ratings_3,Ratings_4,Ratings_5,Image_url
6396,6429,48811,48811,47752,4,448445867,Carolyn Keene,2006.0,"Nancy Drew Complete Series Set, Books 1-64",Nancy Drew: #1-64,4.19,16743,16810,360,273,541,2897,5116,7983,https://images-na.ssl-images-amazon.com/images...
6957,6991,7278837,7278837,9914053,6,1615579184,Jeff Kinney,2009.0,Diary of a Wimpy Kid: #1-4,Diary of a Wimpy Kid: #1-4,4.45,10752,10973,434,249,296,1053,2022,7353,https://images-na.ssl-images-amazon.com/images...


In [308]:
# Change these book names to include the parenthesis.
# The rows are selected by BookID, since the index was reset earlier and
# no longer matches the BookIDs.
# 'Nancy Drew: #1-64' is deliberately left unchanged: the dataset includes just
# volumes 1 and 2 plus the complete collection of 64 books, so this collection
# is excluded from the modifications below.
books.loc[books['BookID'] == 6429, 'Title'] = 'Nancy Drew: #1-64'
books.loc[books['BookID'] == 6991, 'Title'] = 'Diary of a Wimpy Kid: #1-4 (Diary of a Wimpy Kid, #1-4)'

# Also, the 'The Walking Dead' books have inconvenient titles
books.loc[501, 'Title'] = 'The Walking Dead: Days Gone Bye (The Walking Dead, #1)'
books.loc[1450, 'Title'] = 'The Walking Dead: Compendium 1 (The Walking Dead, #1-8)'
books.loc[3543, 'Title'] = 'The Walking Dead: Book One (The Walking Dead, #1-2)'
books.loc[3808, 'Title'] = 'The Walking Dead: Miles Behind Us (The Walking Dead, #2)'
books.loc[4324, 'Title'] = 'The Walking Dead: Safety Behind Bars (The Walking Dead, #3)'
books.loc[4431, 'Title'] = 'The Walking Dead: Made to Suffer (The Walking Dead, #8)'
books.loc[5203, 'Title'] = 'The Walking Dead: The Hearts Desire (The Walking Dead, #4)'
books.loc[5463, 'Title'] = 'The Walking Dead: The Best Defense (The Walking Dead, #5)'
books.loc[5772, 'Title'] = 'The Walking Dead: This Sorrowful Life (The Walking Dead, #6)'
books.loc[5975, 'Title'] = 'The Walking Dead: Life Among Them (The Walking Dead, #12)'
books.loc[6378, 'Title'] = 'Rise of the Governor (The Walking Dead: Novels, #1)'
books.loc[7009, 'Title'] = 'The Walking Dead: The Calm Before (The Walking Dead, #7)'
books.loc[7780, 'Title'] = 'The Walking Dead: Here We Remain (The Walking Dead, #9)'
books.loc[7812, 'Title'] = 'The Walking Dead: Fear the Hunters (The Walking Dead, #11)'
books.loc[7864, 'Title'] = 'The Walking Dead: Compendium 2 (The Walking Dead, #9-16)'
books.loc[7930, 'Title'] = 'The Walking Dead: What We Become (The Walking Dead, #10)'
books.loc[8076, 'Title'] = 'The Walking Dead: No Way Out (The Walking Dead, #14)'
books.loc[8077, 'Title'] = 'The Walking Dead: Book Two (The Walking Dead, #3-4)'
books.loc[8111, 'Title'] = 'The Walking Dead: Book Three (The Walking Dead, #5-6)'
books.loc[9740, 'Title'] = 'The Walking Dead: Too Far Gone (The Walking Dead, #13)'

# Other problematic names:
books.loc[3062, 'Title'] = 'The Dark Is Rising (The Dark Is Rising, #2)'
books.loc[3635, 'Title'] = 'Eragon, Eldest & Brisingr (The Inheritance Cycle, #1-3)'
books.loc[4602, 'Title'] = 'Eragon & Eldest (The Inheritance Cycle, #1-2)'
books.loc[5184, 'Title'] = 'Sand (The Sand Chronicles, #1)'
books.loc[5866, 'Title'] = 'The Sword of Shannara Trilogy (The Original Shannara Trilogy, #1-3)'
books.loc[6217, 'Title'] = 'From the Two Rivers: The Eye of the World, Part 1 (Wheel of time, #1)'
books.loc[6406, 'Title'] = 'Dragonlance Chronicles (Dragonlance: Chronicles #1-3)'
books.loc[7055, 'Title'] = 'The Icewind Dale Trilogy Collectors Edition (The Icewind Dale Trilogy, #1-3)'
books.loc[7616, 'Title'] = "Dragon's Oath (House of Night: Novellas, #1)"
books.loc[7696, 'Title'] = 'The Captive Part II / The Power (The Secret Circle, #3)'
books.loc[9092, 'Title'] = 'The Lightning Thief: The Graphic Novel (Percy Jackson and the Olympians: Graphic Novel, #1)'
books.loc[9324, 'Title'] = "Lenobia's Vow (House of Night: Novellas, #2)"
books.loc[2807, 'Title'] = 'The Crystal Shard (Legend of Drizzt, #4)'
books.loc[3392, 'Title'] = "The Halfling's Gem (Legend of Drizzt, #6)"
books.loc[3499, 'Title'] = 'Streams of Silver (Legend of Drizzt, #5)'

# Collection books without the Nancy Drew collection
mask = np.logical_or.reduce([books['Title'].str.contains(f'#{i}-', case=False) for i in range(1, 10)])
collection_books = books[mask]
collection_books = collection_books[collection_books['Title'].str.contains(r'\(', case=False)]

In [309]:
# To store the name of the saga and the volumes the collection has
collection_books_info = []
for index, book in collection_books.iterrows():  
    # Name of the saga
    saga = re.findall(r'\((.*?),', book['Title'])
    if saga == []:
        saga = re.findall(r'\((.*?) #', book['Title'])
    # Numbers of the volumes in the collection 
    match = re.search(r'#(\d+)-(\d+)', book['Title'])
    first_volume = int(match.group(1))
    last_volume = int(match.group(2))
    volumes = [i for i in range(first_volume, last_volume + 1)]
    
    collection_books_info.append([index, saga[0], volumes, books.loc[index, 'BookID']])
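To illustrate what the two regular expressions extract, the parsing step can be isolated in a small function and tried on individual titles. This is a sketch: `parse_collection_title` is a hypothetical helper, not part of the notebook, and it returns `None` instead of raising when a title has no volume range.

```python
import re

def parse_collection_title(title):
    """Return (saga, volumes) for titles like 'X (Saga, #1-3)', or None if there is no volume range."""
    volume_range = re.search(r"#(\d+)-(\d+)", title)
    if volume_range is None:
        return None

    # Saga name: text after '(' up to the first comma, or up to ' #' if there is no comma
    saga = re.search(r"\((.*?),", title) or re.search(r"\((.*?) #", title)

    first_volume = int(volume_range.group(1))
    last_volume = int(volume_range.group(2))
    volumes = list(range(first_volume, last_volume + 1))

    return (saga.group(1) if saga else None, volumes)

print(parse_collection_title("Eragon & Eldest (The Inheritance Cycle, #1-2)"))
print(parse_collection_title("Dragonlance Chronicles (Dragonlance: Chronicles #1-3)"))
print(parse_collection_title("Twilight (Twilight, #1)"))
```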

In [310]:
for i in range(0, len(collection_books_info)):
    print(collection_books_info[i])
    saga = collection_books_info[i][1]
    volumes = collection_books_info[i][2]

    aux = books[(books['Title'].str.contains(r'\(' + saga + ' ')) | (books['Title'].str.contains(r'\(' + saga + ','))]
#    if len(aux) == 0:
#        aux = books[books['Title'].str.contains(saga)]
    for volume in volumes:
        print(aux[aux['Title'].str.contains(rf'#{volume}\)')]['Title'].values)

    print('')

[188, 'The Lord of the Rings', [1, 2, 3], 189]
['The Fellowship of the Ring (The Lord of the Rings, #1)']
['The Two Towers (The Lord of the Rings, #2)']
['The Return of the King (The Lord of the Rings, #3)']

[218, 'Chronicles of Narnia', [1, 2, 3, 4, 5, 6, 7], 219]
['The Lion, the Witch, and the Wardrobe (Chronicles of Narnia, #1)']
['Prince Caspian (Chronicles of Narnia, #2)']
['The Voyage of the Dawn Treader (Chronicles of Narnia, #3)']
['The Silver Chair (Chronicles of Narnia, #4)']
['The Horse and His Boy (Chronicles of Narnia, #5)']
["The Magician's Nephew (Chronicles of Narnia, #6)"]
['The Last Battle (Chronicles of Narnia, #7)']

[421, 'Harry Potter', [1, 2, 3, 4, 5, 6, 7], 422]
["Harry Potter and the Sorcerer's Stone (Harry Potter, #1)"]
['Harry Potter and the Chamber of Secrets (Harry Potter, #2)']
['Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)']
['Harry Potter and the Goblet of Fire (Harry Potter, #4)']
['Harry Potter and the Order of the Phoenix (Harry Potter

In [311]:
# Collections for which only one, some, or none of the individual volumes are present in the dataset
indices_not_to_include = [
    993, 1366, 1804, 2049, 2181, 2271, 2870, 3472, 3724, 3797, 3988, 4148, 
    4858, 5246, 5414, 5464, 5723, 6295, 6386, 6530, 6908, 7055, 7262, 7864, 
    8164, 8529, 9307, 9858
]

# Drop the books with the previous indices
for i in range(len(collection_books_info)-1, -1, -1):
    if collection_books_info[i][0] in indices_not_to_include:
        collection_books_info.remove(collection_books_info[i])

In [312]:
# Table that maps every volume to the BookID of the collection that contains it
volumes_IDs = pd.DataFrame(columns=['BookID', 'Title', 'Collection_BookID'])

for i in range(0, len(collection_books_info)):
    saga = collection_books_info[i][1]
    volumes = collection_books_info[i][2]
    collection_id = collection_books_info[i][3]
    collection_index = collection_books_info[i][0]

    # Books whose title carries the saga name in parentheses, e.g. '(Harry Potter, #1)'
    aux = books[(books['Title'].str.contains(r'\(' + saga + ' ')) | (books['Title'].str.contains(r'\(' + saga + ','))]
#    if len(aux) == 0:
#        aux = books[books['Title'].str.contains(saga)]
    for volume in volumes:
        # Raw f-string so that '\)' reaches the regex engine as an escaped parenthesis
        volume_book = aux[aux['Title'].str.contains(rf'#{volume}\)')]

        row_to_add = volume_book[['BookID', 'Title']].copy()
        row_to_add['Collection_BookID'] = collection_id

        volumes_IDs = pd.concat([volumes_IDs, row_to_add], ignore_index=True)

In [313]:
# BookID of the collection books
bookID_collections = [elem[3] for elem in collection_books_info]

# Ratings of the collection books
book_collection_ratings = ratings[ratings['BookID'].isin(bookID_collections)].reset_index(drop=True)

#book_collection_ratings = book_collection_ratings.head(3)

# Dictionary to store the new ratings
new_df = {'UserID': [], 'BookID': [], 'Rating': []}

def new_ratings(row, df):
    """Copy a user's rating of a collection to every volume of that collection."""
    collection_id = row['BookID']
    user_id = row['UserID']
    rating = row['Rating']

    # Dataframe with the volumes of the collection
    aux_df = volumes_IDs[volumes_IDs['Collection_BookID'] == collection_id]
    aux_df.apply(lambda volume_row: new_ratings_2(volume_row, df, user_id, rating), axis=1)

def new_ratings_2(row, df, userID, rating):
    """Append one (user, volume, rating) entry to the dictionary df."""
    df['UserID'].append(userID)
    df['BookID'].append(row['BookID'])
    df['Rating'].append(rating)
    return row['BookID']

book_collection_ratings.apply(lambda row: new_ratings(row, new_df), axis=1)

# Convert the dictionary into a dataframe
new_df = pd.DataFrame(new_df)
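
The row-by-row propagation of collection ratings to their volumes can also be expressed as a single merge. The following is only a minimal sketch on toy data, not the notebook's actual code: the frames `ratings_toy` and `volumes_toy` and the intermediate column `VolumeID` are made up for illustration and merely mirror the column names used above.

```python
import pandas as pd

# Toy data (hypothetical values): BookID 100 is a collection made of volumes 1 and 2
ratings_toy = pd.DataFrame({'UserID': [10, 11], 'BookID': [100, 100], 'Rating': [5, 3]})
volumes_toy = pd.DataFrame({'BookID': [1, 2], 'Collection_BookID': [100, 100]})

# Rename the volume identifier so it does not clash with the collection's BookID
volume_map = volumes_toy.rename(columns={'BookID': 'VolumeID'})

# Pair every collection rating with each volume of that collection
merged = ratings_toy.merge(volume_map, left_on='BookID', right_on='Collection_BookID')

# One row per (user, volume), carrying the rating given to the collection
new_ratings_toy = (
    merged[['UserID', 'VolumeID', 'Rating']]
    .rename(columns={'VolumeID': 'BookID'})
    .reset_index(drop=True)
)
print(new_ratings_toy)
```

On a ratings table with millions of rows, a merge of this kind would normally be much faster than nested `apply` calls, although that has not been measured here.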

In [314]:
# Ratings of the books which are not collections
ratings_no_collections = ratings[~ratings['BookID'].isin(bookID_collections)].reset_index(drop=True)

# Append the new volume ratings to the original ratings of the books which are not collections
ratings_no_collections = pd.concat([ratings_no_collections, new_df], ignore_index=True)

# If a user rated both a collection and one of its volumes, keep the user's original
# rating for the volume: drop_duplicates keeps the first occurrence, and the original
# ratings come first in the concatenation
ratings_no_collections = ratings_no_collections.drop_duplicates(subset=['UserID', 'BookID'])

In [315]:
# Now ratings_no_collections becomes the main ratings dataframe
ratings = ratings_no_collections.copy()
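
To make the deduplication rule concrete, here is a small sketch on made-up data. The frames `original` and `propagated` are hypothetical. It only illustrates that `drop_duplicates` keeps the first occurrence by default, so a rating a user gave directly to a volume survives when it is placed before the propagated one.

```python
import pandas as pd

# Hypothetical toy data: user 1 rated volume 7 directly with 4 stars
original = pd.DataFrame({'UserID': [1], 'BookID': [7], 'Rating': [4]})
# Ratings propagated from a collection that contains volume 7
propagated = pd.DataFrame({'UserID': [1, 2], 'BookID': [7, 7], 'Rating': [5, 3]})

# Original ratings first, propagated ratings after them
combined = pd.concat([original, propagated], ignore_index=True)

# keep='first' is the default, so the earlier (original) row wins for user 1
deduplicated = combined.drop_duplicates(subset=['UserID', 'BookID'])
print(deduplicated)
```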

In [316]:
# The new books dataframe
new_books = ratings['BookID'].unique()
books = books[books['BookID'].isin(new_books)]

In [317]:
# The Tolkien boxed set is still rated as a single book, so its ratings
# have to be copied to the four volumes it contains
print(books.loc[963, 'Title'])
hobbit_bookID = 7
tlotr1_bookID = 19
tlotr2_bookID = 155
tlotr3_bookID = 161
#books[books['Title'].str.contains('the lord of the rings, #', case=False)][['BookID','Title']]

J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings


In [318]:
ratings_hobbit = ratings[ratings['BookID'] == 964].copy()
ratings_hobbit['BookID'] = hobbit_bookID

ratings_tlotr1 = ratings[ratings['BookID'] == 964].copy()
ratings_tlotr1['BookID'] = tlotr1_bookID

ratings_tlotr2 = ratings[ratings['BookID'] == 964].copy()
ratings_tlotr2['BookID'] = tlotr2_bookID

ratings_tlotr3 = ratings[ratings['BookID'] == 964].copy()
ratings_tlotr3['BookID'] = tlotr3_bookID

In [319]:
# Replace the ratings of the boxed set with the ratings copied to its four volumes
ratings = ratings[ratings['BookID'] != 964]
ratings = pd.concat(
    [ratings, ratings_hobbit, ratings_tlotr1, ratings_tlotr2, ratings_tlotr3],
    ignore_index=True
)

# If a user rated both the boxed set and one of its volumes, keep the user's original
# rating for the volume: drop_duplicates keeps the first occurrence, and the original
# ratings come first in the concatenation
ratings = ratings.drop_duplicates(subset=['UserID', 'BookID'])

In [320]:
# The new books dataframe
new_books = ratings['BookID'].unique()
books = books[books['BookID'].isin(new_books)]

### Add Saga_Name and Saga_Volume

In [321]:
def is_number(s):
    """Return True if s can be converted to a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

# Regular expressions tried in order to extract the saga name from a title
SAGA_PATTERNS = [r'\((.*?),', r'\((.*?) #', r'\((.*?)#', r'(.*?)\s+#', r'(.*?)#']

# Regular expressions tried in order to extract the volume number from a title
VOLUME_PATTERNS = [r'#(.*?):', r'#(.*?);', r'#(.*?)–', r'#(.*?),', r'#(.*?)/',
                   r'#(.*?)\)', r'#(.*)', r'#(.*)\s\(']

def get_saga_and_volume(df):
    # Keep the titles that contain a volume marker (#1 to #19)
    mask = np.logical_or.reduce([df['Title'].str.contains(f'#{i}', case=False) for i in range(1, 20)])
    volumes = df[mask][['BookID', 'Title']].reset_index(drop=True)
    # Exclude the Discworld books
    mask = volumes['Title'].str.contains('Discworld', case=False)
    volumes = volumes[~mask].reset_index(drop=True)
    # Exclude titles in which the volume number is followed by a hyphen (e.g. '#1-')
    mask = np.logical_or.reduce([volumes['Title'].str.contains(f'#{i}-', case=False) for i in range(1, 30)])
    volumes = volumes[~mask].reset_index(drop=True)
    volumes['Saga_Name'] = ''
    volumes['Saga_Volume'] = ''

    for index, book in volumes.iterrows():
        title = book['Title']

        # Name of the saga: the first pattern that matches wins
        saga = [' ']
        for pattern in SAGA_PATTERNS:
            matches = re.findall(pattern, title)
            if matches:
                saga = matches
                break

        # Volume number: stop at the first pattern whose match is numeric
        volume = []
        for pattern in VOLUME_PATTERNS:
            volume = re.findall(pattern, title)
            if volume and is_number(volume[0]):
                break

        # Fall back to the first word of the title if no number was found
        try:
            float(volume[0])
        except (IndexError, ValueError):
            volume = [title.split()[0]]

        volumes.loc[index, 'Saga_Name'] = saga[0]
        volumes.loc[index, 'Saga_Volume'] = float(volume[0])

    return volumes
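
As a quick illustration of the extraction, the snippet below applies two of the regular expressions from `get_saga_and_volume` to one of the titles printed earlier. It is a standalone sketch for a single title, not a replacement for the function.

```python
import re

title = 'The Two Towers (The Lord of the Rings, #2)'

# Saga name: text between the opening parenthesis and the first comma
saga = re.findall(r'\((.*?),', title)

# Volume: text between '#' and the closing parenthesis
volume = re.findall(r'#(.*?)\)', title)

print(saga[0], float(volume[0]))  # The Lord of the Rings 2.0
```

For this title the earlier volume patterns (those ending in `:`, `;`, `–`, `,` and `/`) find nothing, so the pattern ending in `\)` is the one that yields the number.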

In [326]:
saga_and_volume = get_saga_and_volume(books)

books = pd.merge(books, saga_and_volume[['BookID', 'Saga_Name', 'Saga_Volume']], on='BookID', how='left')

## Step 3: Exporting the DataFrames

In [328]:
books.to_csv("data/Books_cleaned.csv")
ratings.to_csv("data/Ratings_cleaned.csv")
#books_tags.to_csv("data/Books_tags_cleaned.csv")
#tags.to_csv("data/Tags_cleaned.csv")

books_genres.to_csv("data/Books_genres_cleaned.csv")
books_genres_list.to_csv("data/Books_genres_list_cleaned.csv")

Note: GitHub wants files to weigh less than 25 MB. Therefore, after exporting the dataframes, use the Python file `split_csv.py` to split the .csv files into lighter files.
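
The contents of `split_csv.py` are not shown in this notebook. As a rough idea of what such a split could look like, the sketch below writes a CSV file in fixed-size pieces. The function name `split_csv_sketch`, the `rows_per_file` parameter and the `_partN` file naming are assumptions made for illustration, not the script's actual interface.

```python
import os
import tempfile

import pandas as pd

def split_csv_sketch(path, rows_per_file):
    """Split the CSV at `path` into files of at most `rows_per_file` rows (hypothetical helper)."""
    base, ext = os.path.splitext(path)
    parts = []
    # With chunksize, read_csv yields DataFrames of at most rows_per_file rows
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_file)):
        part_path = f'{base}_part{i}{ext}'
        chunk.to_csv(part_path, index=False)
        parts.append(part_path)
    return parts

# Usage on a small temporary file with made-up data
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'example.csv')
pd.DataFrame({'BookID': range(5), 'Rating': [5, 4, 3, 2, 1]}).to_csv(csv_path, index=False)

part_paths = split_csv_sketch(csv_path, rows_per_file=2)

# Reading the parts back in order and concatenating them recovers the original rows
rebuilt = pd.concat([pd.read_csv(p) for p in part_paths], ignore_index=True)
print(len(part_paths), len(rebuilt))
```

In practice `rows_per_file` would be chosen so that each part stays below the size limit.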

In [460]:
# Reload the exported files to check that they were written correctly
books_aux = pd.read_csv("data/Books_cleaned.csv").drop('Unnamed: 0', axis = 1)
ratings_aux = pd.read_csv("data/Ratings_cleaned.csv").drop('Unnamed: 0', axis = 1)
#books_tags_aux = pd.read_csv("data_cleaned/Books_tags_cleaned.csv").drop('Unnamed: 0', axis = 1)
#tags_aux = pd.read_csv("data_cleaned/Tags_cleaned.csv").drop('Unnamed: 0', axis = 1)

books_genres_aux = pd.read_csv("data/Books_genres_cleaned.csv").drop('Unnamed: 0', axis = 1)
books_genres_list_aux = pd.read_csv("data/Books_genres_list_cleaned.csv").drop('Unnamed: 0', axis = 1)

## Useful code

Code to change the number of rows and columns of a dataframe that will be displayed:

In [None]:
max_rows_original = pd.get_option('display.max_rows')
max_cols_original = pd.get_option('display.max_columns')
max_colwidth = pd.get_option('display.max_colwidth')

pd.set_option('display.max_rows', None)  # Show all the rows
pd.set_option('display.max_columns', None)  # Show all the columns
pd.set_option('display.max_colwidth', None) # Show the whole width of the columns

print(books[books['Title'].str.contains('1-', case=False)][['Title']])

# Restore the original configuration
pd.set_option('display.max_rows', max_rows_original)
pd.set_option('display.max_columns', max_cols_original)
pd.set_option('display.max_colwidth', max_colwidth)
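
As an alternative to saving and restoring each option by hand, pandas provides the `pd.option_context` context manager, which resets the options automatically when the block ends. A minimal sketch with a made-up dataframe `df`:

```python
import pandas as pd

# Toy dataframe with long strings (hypothetical data)
df = pd.DataFrame({'Title': ['A' * 80, 'B' * 80]})

before = pd.get_option('display.max_colwidth')

# The options are changed only inside the with-block
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.max_colwidth', None):
    inside = pd.get_option('display.max_colwidth')
    print(df)

# Outside the block the previous values are back in place
after = pd.get_option('display.max_colwidth')
```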

Code to search for a string in a column of the dataframes:

In [None]:
books[books['Title'].str.contains('Sea Glass', case=False)][['Original_Title','Title']]