# Book Search Engine
This search engine searches a large book dataset that was scraped from the Goodreads website through its developer API (dataset: https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks) and displays the full record for each matching book. Each record contains the book ID, title, authors, average rating, language code, number of pages, number of ratings, number of text reviews, publication date, and publisher.

In [38]:
#import packages
import pandas as pd
import numpy as np
import seaborn as sns
import re
import sklearn

# Importing Data

In [21]:
#open csv file
book_data = pd.read_csv("/Users/intisarmuhammad/Downloads/books.csv")

In [23]:
#view data of first 5 rows
book_data.head()

Unnamed: 0,bookID,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,eng,652,2095690,27591,9/16/06,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,eng,870,2153167,29221,9/1/04,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,eng,352,6333,244,11/1/03,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,eng,435,2339585,36325,5/1/04,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,eng,2690,41428,164,9/13/04,Scholastic


In [25]:
book_data.shape

(11123, 10)

# Data Cleaning

In [27]:
# create new title column for modified title names (remove non alphanumeric characters)
# this will make search queries easier for our search engine
book_data["mod_title"] = book_data["title"].str.replace("[^a-zA-Z0-9 ]", "", regex = True)

In [28]:
book_data

Unnamed: 0,bookID,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,mod_title
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,eng,652,2095690,27591,9/16/06,Scholastic Inc.,Harry Potter and the HalfBlood Prince Harry Po...
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,eng,870,2153167,29221,9/1/04,Scholastic Inc.,Harry Potter and the Order of the Phoenix Harr...
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,eng,352,6333,244,11/1/03,Scholastic,Harry Potter and the Chamber of Secrets Harry ...
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,eng,435,2339585,36325,5/1/04,Scholastic Inc.,Harry Potter and the Prisoner of Azkaban Harry...
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,eng,2690,41428,164,9/13/04,Scholastic,Harry Potter Boxed Set Books 15 Harry Potter 15
...,...,...,...,...,...,...,...,...,...,...,...
11118,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,eng,512,156,20,12/21/04,Da Capo Press,Expelled from Eden A William T Vollmann Reader
11119,45633,You Bright and Risen Angels,William T. Vollmann,4.08,eng,635,783,56,12/1/88,Penguin Books,You Bright and Risen Angels
11120,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,eng,415,820,95,8/1/93,Penguin Books,The IceShirt Seven Dreams 1
11121,45639,Poor People,William T. Vollmann,3.72,eng,434,769,139,2/27/07,Ecco,Poor People


In [None]:
# transform modified titles to lower case
book_data["mod_title"] = book_data["mod_title"].str.lower()

In [30]:
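# collapse runs of whitespace into a single space
# (raw string so that \s is passed to the regex engine and not treated as a string escape)
book_data["mod_title"] = book_data["mod_title"].str.replace(r"\s+", " ", regex = True)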

In [101]:
#view the data
book_data

Unnamed: 0,bookID,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,mod_title
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,eng,652,2095690,27591,9/16/06,Scholastic Inc.,harry potter and the halfblood prince harry po...
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,eng,870,2153167,29221,9/1/04,Scholastic Inc.,harry potter and the order of the phoenix harr...
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,eng,352,6333,244,11/1/03,Scholastic,harry potter and the chamber of secrets harry ...
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,eng,435,2339585,36325,5/1/04,Scholastic Inc.,harry potter and the prisoner of azkaban harry...
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,eng,2690,41428,164,9/13/04,Scholastic,harry potter boxed set books 15 harry potter 15
...,...,...,...,...,...,...,...,...,...,...,...
11118,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,eng,512,156,20,12/21/04,Da Capo Press,expelled from eden a william t vollmann reader
11119,45633,You Bright and Risen Angels,William T. Vollmann,4.08,eng,635,783,56,12/1/88,Penguin Books,you bright and risen angels
11120,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,eng,415,820,95,8/1/93,Penguin Books,the iceshirt seven dreams 1
11121,45639,Poor People,William T. Vollmann,3.72,eng,434,769,139,2/27/07,Ecco,poor people


In [32]:
# remove rows whose cleaned title is empty (titles with no alphanumeric characters or spaces)
book_data = book_data[book_data["mod_title"].str.len() > 0]

In [34]:
book_data.shape
# the above process removed 3 books

(11120, 11)
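
The cleaning steps above can also be sketched as one helper. This is a minimal illustration and not part of the original notebook: `clean_titles` is a hypothetical name, and the sample Series is made up for the example (only the first title appears in the data shown above).

```python
import pandas as pd

def clean_titles(titles: pd.Series) -> pd.Series:
    """Apply the same three cleaning steps used above to a Series of titles."""
    cleaned = titles.str.replace(r"[^a-zA-Z0-9 ]", "", regex=True)  # drop non-alphanumeric characters
    cleaned = cleaned.str.lower()                                    # lower-case everything
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True)           # collapse repeated spaces
    return cleaned

# hypothetical sample; the last entry becomes an empty string and is filtered out
sample = pd.Series(["The Ice-Shirt (Seven Dreams #1)", "Poor   People", "???"])
cleaned = clean_titles(sample)
kept = cleaned[cleaned.str.len() > 0]
print(kept.tolist())  # ['the iceshirt seven dreams 1', 'poor people']
```

Keeping the steps in one function makes it easier to apply identical cleaning to search queries later.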

# Building the search engine
The search engine takes a book title as input and returns the books whose titles are most similar to it. Titles are compared through TF-IDF vectors of the cleaned titles and cosine similarity.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(book_data["mod_title"])
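
To see what the TF-IDF matrix is used for, here is a small sketch on three titles. The names `toy_titles`, `toy_vectorizer`, and `toy_tfidf` are made up for this example so that they do not overwrite the notebook's own `vectorizer` and `tfidf`. It assumes the default `TfidfVectorizer` settings, which produce L2-normalised rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# three cleaned titles standing in for the full mod_title column
toy_titles = ["the alchemist", "fullmetal alchemist vol 1", "animal farm"]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)  # one row per title, one column per word

# transform the query with the SAME fitted vectorizer so the columns line up
query_vec = toy_vectorizer.transform(["the alchemist"])
scores = cosine_similarity(query_vec, toy_tfidf).flatten()

for title, score in zip(toy_titles, scores):
    print(f"{score:.2f}  {title}")
```

The identical title scores highest, the title that shares only the word "alchemist" scores lower, and the title with no words in common scores zero.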

In [93]:
# vectorize the search query, compare it against the tfidf matrix, and return the best matches
from sklearn.metrics.pairwise import cosine_similarity

def search(query, vectorizer):
    # clean the query the same way as the titles (lower case, alphanumeric characters and spaces only)
    processed = re.sub(r"[^a-zA-Z0-9 ]", "", query.lower())
    query_vec = vectorizer.transform([processed])
    # cosine similarity between the query and every title (uses the global tfidf matrix)
    similarity = cosine_similarity(query_vec, tfidf).flatten()

    # positions of the 10 largest similarities (not sorted among themselves)
    indices = np.argpartition(similarity, -10)[-10:]

    # look up the matching rows by position
    results = book_data.iloc[indices]

    # rank the 10 candidates by number of ratings and keep the top 5
    results = results.sort_values("ratings_count", ascending = False)
    return results.head(5)
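
The top-10 selection in `search` relies on `np.argpartition`, which finds the positions of the largest values without fully sorting the array. A small sketch with made-up similarity scores and `k = 3` (both are assumptions for illustration only):

```python
import numpy as np

# made-up similarity scores for six titles
similarity = np.array([0.10, 0.90, 0.00, 0.75, 0.30, 0.60])
k = 3

# positions of the k largest scores; their order within top_k is not guaranteed
top_k = np.argpartition(similarity, -k)[-k:]

# optional extra step: order those k positions from most to least similar
ranked = top_k[np.argsort(similarity[top_k])[::-1]]
print(ranked)  # [1 3 5]
```

Note that `search` does not rank by similarity after this step. It re-orders the 10 candidates by `ratings_count`, so the most popular of the similar titles come first.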

In [102]:
# run the function with a book title
search("the alchemist", vectorizer)

Unnamed: 0,bookID,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,mod_title
284,865,The Alchemist,Paulo Coelho/Alan R. Clarke/Özdemir İnce,3.86,eng,197,1631221,55843,5/1/93,HarperCollins,the alchemist
288,870,Fullmetal Alchemist Vol. 1 (Fullmetal Alchemi...,Hiromu Arakawa/Akira Watanabe,4.5,eng,192,111091,1427,5/3/05,VIZ Media LLC,fullmetal alchemist vol 1 fullmetal alchemist 1
287,869,Fullmetal Alchemist Vol. 8 (Fullmetal Alchemi...,Hiromu Arakawa/Akira Watanabe,4.57,eng,192,11451,161,7/18/06,VIZ Media LLC,fullmetal alchemist vol 8 fullmetal alchemist 8
285,866,Fullmetal Alchemist Vol. 9 (Fullmetal Alchemi...,Hiromu Arakawa/Akira Watanabe,4.57,eng,192,9013,153,9/19/06,VIZ Media LLC,fullmetal alchemist vol 9 fullmetal alchemist 9
6978,26425,Fullmetal Alchemist: The Abducted Alchemist (F...,Makoto Inoue/Hiromu Arakawa/Alexander O. Smith...,4.57,eng,240,2779,19,1/10/06,VIZ Media LLC,fullmetal alchemist the abducted alchemist ful...


# Creating a list of liked books
In this section, I use the search engine above to look up books I have read (titles taken from my personal Goodreads account: https://www.goodreads.com/user/show/50112085-star) and record the book ID of each title that exists in the data.

In [103]:
# book ids of 7 of the books I like
liked_books = ["37415", "10210", "27451", "7613", "3636", "22188","865"]
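
One way such a list could be assembled programmatically is sketched below. This is an assumption about the workflow, not the method used in this notebook: `find_book_ids` and `toy_books` are hypothetical names, the toy frame holds only three rows copied from the outputs above, and `bookID` is stored as strings to match the `liked_books` list.

```python
import re
import pandas as pd

# toy stand-in for book_data (hypothetical; three rows copied from the notebook output)
toy_books = pd.DataFrame({
    "bookID": ["865", "7613", "10210"],
    "mod_title": ["the alchemist", "animal farm", "jane eyre"],
})

def find_book_ids(titles, books):
    """Return the bookIDs whose cleaned title exactly matches a cleaned query title."""
    ids = []
    for title in titles:
        # same cleaning as mod_title: strip punctuation, lower-case, collapse spaces
        cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title.lower())
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        matches = books.loc[books["mod_title"] == cleaned, "bookID"]
        ids.extend(matches.tolist())  # nothing is added when the title is not in the data
    return ids

my_titles = ["The Alchemist", "Jane Eyre", "A Title Not In The Data"]
found = find_book_ids(my_titles, toy_books)
print(found)  # ['865', '10210']
```

Exact matching is stricter than the similarity search above, so it skips titles that only partially match a query.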

In [104]:
# display the rows for the books in my liked books list, selected by book ID
book_data.loc[book_data["bookID"].isin(liked_books)]

Unnamed: 0,bookID,title,authors,average_rating,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher,mod_title
284,865,The Alchemist,Paulo Coelho/Alan R. Clarke/Özdemir İnce,3.86,eng,197,1631221,55843,5/1/93,HarperCollins,the alchemist
1069,3636,The Giver (The Giver #1),Lois Lowry,4.13,eng,208,1585589,56604,1/24/06,Ember,the giver the giver 1
2114,7613,Animal Farm,George Orwell/Boris Grabnar/Peter Škerl,3.93,eng,122,2111750,29677,5/6/03,NAL,animal farm
2764,10210,Jane Eyre,Charlotte Brontë/Michael Mason,4.12,eng,532,1409369,27884,2/4/03,Penguin,jane eyre
5887,22188,Gossip Girl (Gossip Girl #1),Cecily von Ziegesar,3.52,eng,224,54400,2271,4/1/02,Little Brown and Company,gossip girl gossip girl 1
7161,27451,The Great Gatsby,F. Scott Fitzgerald/Matthew J. Bruccoli,3.91,eng,216,9844,1050,6/1/95,Scribner,the great gatsby
9427,37415,Their Eyes Were Watching God,Zora Neale Hurston,3.91,eng,219,220309,9536,5/30/06,Amistad,their eyes were watching god
