<a href="https://colab.research.google.com/github/natasaivic/ml/blob/main/book_search_engine_and_recommender_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Book search engine and recommender system** 

## Overview
- This project recommends the 10 books most similar to a user's search text.
- I will use a TF-IDF vectorizer and cosine similarity, which measures the angle between vectors in a multi-dimensional space, to find the most relevant books.
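
As a minimal sketch of the idea, the snippet below fits a TF-IDF vectorizer on a tiny made-up corpus and scores a search text against it with cosine similarity. The three sentences and the query are invented for illustration and are not part of the Goodreads data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made up for illustration, not taken from the Goodreads dataset)
docs = [
    "a young wizard attends a school of magic",
    "a wizard and a dragon fight with magic",
    "a detective solves a murder in london",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)  # one row per document

# Vectorize the search text with the same vocabulary, then compare it to every document
query_vector = vectorizer.transform(["wizard magic"])
scores = cosine_similarity(query_vector, doc_matrix)[0]

# Print documents from most to least similar
ranking = scores.argsort()[::-1]
for i in ranking:
    print(f"{scores[i]:.2f}  {docs[i]}")
```

The detective sentence shares no word with the query, so its score is 0 and it ends up last.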

![Good reads](https://lucidbookspublishing.com/wp-content/uploads/2018/05/goodreads.jpg)

# Install and import required packages


In [None]:
import pandas as pd
import ast
import numpy as np
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import IPython
from google.colab import output

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# About the dataset


- Goodreads is an American social cataloging website that lets users search its database of books.
- The dataset I will use is https://www.kaggle.com/meetnaren/goodreads-best-books, a collection of the most popular books.
- The dataset contains:
  - 52,478 book entries
  - 25 columns

In [None]:
# Loading and previewing the dataset
url = 'https://raw.githubusercontent.com/scostap/goodreads_bbe_dataset/main/Best_Books_Ever_dataset/books_1.Best_Books_Ever.csv'
data = pd.read_csv(url)
data.head(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,bookFormat,edition,pages,publisher,publishDate,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",Hardcover,First Edition,374,Scholastic Press,09/14/08,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",Paperback,US Edition,870,Scholastic Inc.,09/28/04,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",Paperback,,324,Harper Perennial Modern Classics,05/23/06,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",Paperback,"Modern Library Classics, USA / CAN",279,Modern Library,10/10/00,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...,1983116,20452,
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",Paperback,,501,"Little, Brown and Company",09/06/06,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",https://i.gr-assets.com/images/S/compressed.ph...,1459448,14874,2.1


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards            52478 non-null  object

# Dataset transformations


- I will create a new dataset from the one I just loaded, with a few changes:
  - Keep only the books in English
  - Select the columns that will serve as features for the search algorithm:
    - author, title, description, genres, characters, price, rating, bbeScore, likedPercent, bookId
- Then I will transform some columns with helper functions to make them more useful
- Drop rows with NaN values
- Reduce the dataset to the top 20k books ordered by bbeScore
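
The steps above can be sketched on a toy DataFrame. The four rows below are invented for illustration. The sketch also shows that `DataFrame.where` keeps every row and fills the non-matching ones with NaN, while a boolean mask removes them outright.

```python
import pandas as pd

# Toy stand-in for the Goodreads data (values invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", "French", "English", "English"],
    "description": ["text a", "text b", None, "text d"],
    "bbeScore": [10, 40, 30, 20],
})

# where() keeps all 4 rows; the French row becomes all-NaN
kept_with_where = toy.where(toy["language"] == "English")

# 1. A boolean mask drops the non-English rows entirely
english = toy[toy["language"] == "English"]

# 2. Select the columns used as features
english = english[["title", "description", "bbeScore"]]

# 3. Drop rows with missing values in the required columns
english = english.dropna(subset=["title", "description", "bbeScore"])

# 4. Keep the top N books by bbeScore (N = 2 in this toy example)
top_n = english.sort_values(by="bbeScore", ascending=False).head(2)
print(top_n)
```

Both filtering styles give the same rows once `dropna` has run, because the all-NaN rows left behind by `where` are dropped at that step.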


In [None]:
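# Keep English only. where() keeps every row and sets the non-English ones to NaN;
# those all-NaN rows are removed by the dropna() call further down.
search_df = data.where(data['language'] == 'English')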

# Select interesting columns
columns = ['author', 'title', 'description', 'genres', 'characters', 'price', 'rating', 'bbeScore', 'likedPercent', 'bookId']
search_df = search_df[columns]
search_df.head(5)

Unnamed: 0,author,title,description,genres,characters,price,rating,bbeScore,likedPercent,bookId
0,Suzanne Collins,The Hunger Games,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",5.09,4.33,2993816.0,96.0,2767052-the-hunger-games
1,"J.K. Rowling, Mary GrandPré (Illustrator)",Harry Potter and the Order of the Phoenix,There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",7.38,4.5,2632233.0,98.0,2.Harry_Potter_and_the_Order_of_the_Phoenix
2,Harper Lee,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",,4.28,2269402.0,95.0,2657.To_Kill_a_Mockingbird
3,"Jane Austen, Anna Quindlen (Introduction)",Pride and Prejudice,Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",,4.26,1983116.0,94.0,1885.Pride_and_Prejudice
4,Stephenie Meyer,Twilight,About three things I was absolutely positive.\...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",2.1,3.6,1459448.0,78.0,41865.Twilight


In [None]:
# Here I create derived columns

# Transform array columns using a helper function
# Some columns hold multiple values encoded as the string repr of a list
# e.g. "['Holden Caulfield', 'Robert Ackley', 'Stradla..."
def transform_to_string(value):
  # startswith() also handles empty strings, where value[0] would raise IndexError
  if isinstance(value, str) and value.startswith("["):
    items = []
    for item in ast.literal_eval(value):
      items.append(item.replace(" ", ""))
    return " ".join(items).lower()
  return ""

# Characters (vectorize by character names)
# ['Scout Finch', 'Atticus Finch', 'Jem Finch'] => "scoutfinch atticusfinch jemfinch"
search_df['characters_feature'] = search_df['characters'].apply(transform_to_string)

# Genres (vectorize by genres)
# ['Classics', 'Fiction', 'Historical Fiction'] => "classics fiction historicalfiction"
search_df['genres_feature'] = search_df['genres'].apply(transform_to_string)

search_df.head(5)

Unnamed: 0,author,title,description,genres,characters,price,rating,bbeScore,likedPercent,bookId,characters_feature,genres_feature
0,Suzanne Collins,The Hunger Games,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",5.09,4.33,2993816.0,96.0,2767052-the-hunger-games,katnisseverdeen peetamellark cato(hungergames)...,youngadult fiction dystopia fantasy sciencefic...
1,"J.K. Rowling, Mary GrandPré (Illustrator)",Harry Potter and the Order of the Phoenix,There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",7.38,4.5,2632233.0,98.0,2.Harry_Potter_and_the_Order_of_the_Phoenix,siriusblack dracomalfoy ronweasley petuniadurs...,fantasy youngadult fiction magic childrens adv...
2,Harper Lee,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",,4.28,2269402.0,95.0,2657.To_Kill_a_Mockingbird,scoutfinch atticusfinch jemfinch arthurradley ...,classics fiction historicalfiction school lite...
3,"Jane Austen, Anna Quindlen (Introduction)",Pride and Prejudice,Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",,4.26,1983116.0,94.0,1885.Pride_and_Prejudice,mr.bennet mrs.bennet janebennet elizabethbenne...,classics fiction romance historicalfiction lit...
4,Stephenie Meyer,Twilight,About three things I was absolutely positive.\...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",2.1,3.6,1459448.0,78.0,41865.Twilight,edwardcullen jacobblack laurent renee bellaswa...,youngadult fantasy romance vampires fiction pa...


In [None]:
# Drop rows with NaN values in the required columns
search_df.dropna(subset=['title', 'description', 'bbeScore'], inplace=True)

# Use top 20k books
search_df_20k = search_df.sort_values(by=['bbeScore'], ascending=False).iloc[:20000, :]
search_df_20k.head(5)

Unnamed: 0,author,title,description,genres,characters,price,rating,bbeScore,likedPercent,bookId,characters_feature,genres_feature
0,Suzanne Collins,The Hunger Games,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",5.09,4.33,2993816.0,96.0,2767052-the-hunger-games,katnisseverdeen peetamellark cato(hungergames)...,youngadult fiction dystopia fantasy sciencefic...
1,"J.K. Rowling, Mary GrandPré (Illustrator)",Harry Potter and the Order of the Phoenix,There is a door at the end of a silent corrido...,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",7.38,4.5,2632233.0,98.0,2.Harry_Potter_and_the_Order_of_the_Phoenix,siriusblack dracomalfoy ronweasley petuniadurs...,fantasy youngadult fiction magic childrens adv...
2,Harper Lee,To Kill a Mockingbird,The unforgettable novel of a childhood in a sl...,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",,4.28,2269402.0,95.0,2657.To_Kill_a_Mockingbird,scoutfinch atticusfinch jemfinch arthurradley ...,classics fiction historicalfiction school lite...
3,"Jane Austen, Anna Quindlen (Introduction)",Pride and Prejudice,Alternate cover edition of ISBN 9780679783268S...,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",,4.26,1983116.0,94.0,1885.Pride_and_Prejudice,mr.bennet mrs.bennet janebennet elizabethbenne...,classics fiction romance historicalfiction lit...
4,Stephenie Meyer,Twilight,About three things I was absolutely positive.\...,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",2.1,3.6,1459448.0,78.0,41865.Twilight,edwardcullen jacobblack laurent renee bellaswa...,youngadult fantasy romance vampires fiction pa...


# Features and techniques

- I will use the text from the description, genres and characters columns
  - I will try to reduce the dictionary size of the descriptions by using only the first 5 letters of each word
- I will fit a TF-IDF vectorizer to each column to get the term matrices from which the similarity matrices are computed
- 1st matrix is based on the description words, shape (20000, 47833)
- 2nd matrix is based on the genres words, shape (20000, 821)
- 3rd matrix is based on the characters words, shape (20000, 33234)
- Why do I need 3 matrices?
  - Each one feeds a separate pass of the search algorithm, described in the Search Algorithm section.
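
To illustrate why keeping only the first 5 letters shrinks the dictionary, here is a small sketch on two invented sentences. The `truncating_tokenizer` helper is hypothetical and splits on whitespace for simplicity. It is not the NLTK-based tokenizer used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences with several words sharing the same 5-letter prefix
docs = [
    "the adventurer went adventuring",
    "adventures of an adventurous adventurer",
]

def truncating_tokenizer(doc, max_len=5):
    # Hypothetical helper: split on whitespace and keep the first max_len letters of each word
    return [token[:max_len] for token in doc.split()]

# Default tokenization keeps every distinct word
full = TfidfVectorizer().fit(docs)

# Truncated tokens collapse word variants into a single dictionary entry
short = TfidfVectorizer(tokenizer=truncating_tokenizer).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(short.vocabulary_), sorted(short.vocabulary_))
```

Here the four "adventur..." variants collapse into the single token `adven`, so the dictionary shrinks from 8 entries to 5. The trade-off is that unrelated words sharing a prefix also get merged.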

In [None]:
# Use set() because of fast lookup
stop_words = set(stopwords.words('english'))

# This is the description tokenizer, which limits each word to its first 5 letters
# so that the dictionary stays small. This has proven to work well enough.
class DescriptionTokenizer:
    def __call__(self, doc):
      tokens = []
      for token in word_tokenize(doc):
        if token not in stop_words and len(token) > 2:
          # truncating is a workaround for the memory limit
          tokens.append(token.lower()[:5])
      return tokens

# Create the description tfidf based on the DescriptionTokenizer class
description_tfidf = TfidfVectorizer(stop_words='english', tokenizer=DescriptionTokenizer())
description_matrix = description_tfidf.fit_transform(search_df_20k['description'])

# For genres I use the default tokenizer as there are not too many genres. The dictionary size is small.
genres_tfidf = TfidfVectorizer(stop_words='english')
genres_matrix = genres_tfidf.fit_transform(search_df_20k['genres_feature'])

# The characters dictionary gets bigger than the genres dictionary but it's not an issue
characters_tfidf = TfidfVectorizer(stop_words='english')
characters_matrix = characters_tfidf.fit_transform(search_df_20k['characters_feature'])



In [None]:
description_matrix.shape, genres_matrix.shape, characters_matrix.shape

((20000, 47833), (20000, 821), (20000, 33234))

In [None]:
# Cosine similarity for each term matrix
description_cosine_sim = cosine_similarity(description_matrix, description_matrix)
genres_cosine_sim = cosine_similarity(genres_matrix, genres_matrix)
characters_cosine_sim = cosine_similarity(characters_matrix, characters_matrix)

In [None]:
# Create new dataframe for each similarity matrix
# to be used later in the search algorithm
description_df = pd.DataFrame(description_cosine_sim, columns=search_df_20k.title.values, index=search_df_20k.title.values)
genres_df = pd.DataFrame(genres_cosine_sim, columns=search_df_20k.title.values, index=search_df_20k.title.values)
characters_df = pd.DataFrame(characters_cosine_sim, columns=search_df_20k.title.values, index=search_df_20k.title.values)

Testing the data
--
- These matrices are the data behind the search algorithm.
- I need to make sure that they work as expected.
- I will run one test for each matrix to make sure finding similar books works.
- I will pick a book title and find the most similar books based on the description, genres or characters.
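
The lookup these tests rely on can be sketched on a toy similarity matrix. The titles and values below are invented, and `most_similar` is a hypothetical helper. The key assumption is that row `i` of the matrix refers to the book at position `i`, so titles are mapped to positions rather than to DataFrame index labels.

```python
import numpy as np
import pandas as pd

titles = ["Book A", "Book B", "Book C", "Book D"]

# Toy symmetric similarity matrix (values invented); row i and column i refer to titles[i]
sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
])

# Map each title to its row position in the matrix
position_of = pd.Series(np.arange(len(titles)), index=titles)

def most_similar(title, k):
    # Hypothetical helper: return the k titles most similar to the given one
    own_position = position_of[title]
    order = np.argsort(sim[own_position])[::-1]  # positions, most similar first
    order = order[order != own_position]         # drop the book itself
    return [titles[i] for i in order[:k]]

print(most_similar("Book A", 2))
```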

In [None]:
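# Book indices to be used in my tests: map each title to its row position
# in search_df_20k, which is also its row in the similarity matrices.
# (search_df_20k.index holds the original labels, which stop matching positions after filtering.)
book_indices = pd.Series(np.arange(len(search_df_20k)), index=search_df_20k['title'])
# Keep the first (highest bbeScore) entry when a title appears more than once
book_indices = book_indices[~book_indices.index.duplicated(keep='first')]

# Helper test function
def test_similarity(matrix, title, book_indices, dataframe, num_results):
  idx = book_indices[title]
  sim_scores = list(enumerate(matrix[idx]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # Skip the first entry, which is the book itself
  similar_positions = []
  for position, similarity in sim_scores[1:num_results + 1]:
    similar_positions.append(position)

  return dataframe.iloc[similar_positions]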

In [None]:
book_indices.head(20)

title
The Hunger Games                                                          0
Harry Potter and the Order of the Phoenix                                 1
To Kill a Mockingbird                                                     2
Pride and Prejudice                                                       3
Twilight                                                                  4
The Book Thief                                                            5
Animal Farm                                                               6
The Chronicles of Narnia                                                  7
J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings     8
Gone with the Wind                                                        9
The Fault in Our Stars                                                   10
The Hitchhiker's Guide to the Galaxy                                     11
The Giving Tree                                                          12
Wuther

In [None]:
# Test similarity using genre similarity matrix
book_title = 'The Book Thief'
top10_book_titles = test_similarity(genres_cosine_sim, book_title, book_indices, search_df_20k, 10)

# These are the results I get; they look relevant
top10_book_titles[['title', 'genres', 'description']]

Unnamed: 0,title,genres,description
3712,Summer of My German Soldier,"['Historical Fiction', 'Young Adult', 'Fiction...",Minutes before the train pulled into the stati...
9878,Escape from Warsaw,"['Historical Fiction', 'Fiction', 'Childrens',...","WARSAW 1942On a cold, dark night in Warsaw in ..."
22607,Touching the Wire,"['Holocaust', 'Historical Fiction', 'Historica...",Librarian note: alternate cover edition ASIN -...
23417,Girl in the Blue Coat,"['Historical Fiction', 'Young Adult', 'Mystery...","Amsterdam, 1943. Hanneke spends her days procu..."
5735,When Hitler Stole Pink Rabbit,"['Historical Fiction', 'Fiction', 'Childrens',...","Partly autobiographical, this is first of the ..."
11858,Resistance,"['Historical Fiction', 'Young Adult', 'Middle ...",Chaya Lindner is a teenager living in Nazi-occ...
208,The Boy in the Striped Pajamas,"['Historical Fiction', 'Fiction', 'Young Adult...","If you start to read this book, you will go on..."
1733,The Tattooist of Auschwitz,"['Historical Fiction', 'Fiction', 'Historical'...","In April 1942, Lale Sokolov, a Slovakian Jew, ..."
8823,The Lost Wife,"['Historical Fiction', 'Fiction', 'Romance', '...","There on her forearm, next to a small brown bi..."
6958,My Enemy's Cradle,"['Historical Fiction', 'World War II', 'Fictio...",Cyrla's neighbors have begun to whisper. Her c...


In [None]:
# Test similarity using description similarity matrix
book_title = 'The Adventures of Huckleberry Finn'
top10_book_titles = test_similarity(description_cosine_sim, book_title, book_indices, search_df_20k, 10)

# Based on the description of 'The Adventures of Huckleberry Finn'
# I get all these results. The descriptions have some words in common.
top10_book_titles[['title', 'genres', 'description']]

Unnamed: 0,title,genres,description
112,The Adventures of Tom Sawyer,"['Classics', 'Fiction', 'Adventure', 'Young Ad...",The Adventures of Tom Sawyer revolves around t...
4128,The Adventures of Tom Sawyer and Adventures of...,"['Classics', 'Fiction', 'Adventure', 'Literatu...",THE ADVENTURES OF TOM SAWYERTake a lighthearte...
10848,The Day the Falls Stood Still,"['Historical Fiction', 'Fiction', 'Romance', '...","Tom Cole, the grandson of a legendary local he..."
6803,All My Friends are Superheroes,"['Fiction', 'Fantasy', 'Humor', 'Magical Reali...",All Tom's friends really are superheroes.There...
14525,Alphabet Weekends,"['Chick Lit', 'Fiction', 'Romance', 'Adult', '...",Natalie and Tom have been best friends forever...
9332,How to Stop Time,"['Fiction', 'Fantasy', 'Historical Fiction', '...","""She smiled a soft, troubled smile and I felt ..."
18095,Deadly Currents,"['Mystery', 'Fiction', 'Cozy Mystery', 'Advent...",The Arkansas River is the heart and soul of Sa...
15403,The Fabulous Riverboat,"['Science Fiction', 'Fantasy', 'Fiction', 'Sci...","In To Your Scattered Bodies Go, Philip José Fa..."
10945,Haunting Rachel,"['Mystery', 'Romance', 'Romantic Suspense', 'F...",Tom Sheridan disappears just three weeks befor...
5530,Night of the Soul Stealer,"['Fantasy', 'Young Adult', 'Horror', 'Fiction'...","It's going to be a long, hard, cruel winter. A..."


In [None]:
# Test similarity using character similarity matrix
book_title = 'A Game of Thrones'
top10_book_titles = test_similarity(characters_cosine_sim, book_title, book_indices, search_df_20k, 10)

# The Game of Thrones books have quite unique character names,
# and the results I get are as expected.
top10_book_titles[['title', 'characters', 'description']]

Unnamed: 0,title,characters,description
194,A Storm of Swords,"['Brandon Stark', 'Catelyn Stark', 'Tyrion Lan...",An alternate cover for this isbn can be found ...
1679,A Song of Ice and Fire,"['Tyrion Lannister', 'Arya Stark', 'Khal Drogo...","For the first time, all five novels in the epi..."
429,A Feast for Crows,"['Arya Stark', 'Jaime Lannister', 'Petyr Baeli...","Crows will fight over a dead man's flesh, and ..."
471,A Dance with Dragons,"['Brandon Stark', 'Tyrion Lannister', 'Daenery...",Alternate cover edition of ASIN B004XISI4AIn t...
4374,A Dance with Dragons: Dreams and Dust,"['Tyrion Lannister', 'Daenerys Targaryen', 'Th...","In the aftermath of a colossal battle, new thr..."
13324,The Winds of Winter,"['Theon Greyjoy', 'Arya Stark', 'Stannis Barat...",The Winds of Winter is the forthcoming sixth n...
248,A Clash of Kings,"['Brandon Stark', 'Catelyn Stark', 'Tyrion Lan...",A comet the color of blood and flame cuts acro...
4635,Runaway,"['Emerson Watts', 'Nikki Howard', 'Lulu Collin...",EM WATTS IS ON THE RUNShe's on the run from sc...
3731,Being Nikki,"['Emerson Watts', 'Nikki Howard', 'Gabriel Lun...",THINGS AREN'T PRETTY FOR EMERSON WATTS.Em was ...
2649,Airhead,"['Emerson Watts', 'Nikki Howard', 'Lulu Collin...",ÖNEMLİ OLAN PAKET...İÇİNDEKİ KİMİN UMRUNDA?Miz...


# Search Algorithm

## Searching
- Take the user input (search text) and vectorize it
- First pass
  - Using the search text, find the 10 most similar books based on book descriptions
- Second pass
  - Take the top half (5) of the first-pass results and extract their genres
  - Based on the extracted genres, find 10 more similar books
  - Add the books found to the search results
- Third pass
  - Take the top half (5) of the first-pass results and extract their characters
  - Based on the extracted characters, find 10 more similar books
  - Add the books found to the search results
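
The passes above can be sketched on three invented books. `binary_query_vector` and `top_matches` are hypothetical helpers written for this illustration. They mirror the idea of marking every known query word with a 1 and ranking by cosine similarity, under the assumption that a plain whitespace split is enough for the toy text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (invented for illustration)
descriptions = [
    "a boy wizard at a school of magic",
    "a hobbit carries a ring across the land",
    "a detective hunts a killer in the city",
]
genres = ["fantasy youngadult", "fantasy adventure", "mystery crime"]

desc_tfidf = TfidfVectorizer()
desc_matrix = desc_tfidf.fit_transform(descriptions)
genre_tfidf = TfidfVectorizer()
genre_matrix = genre_tfidf.fit_transform(genres)

def binary_query_vector(text, tfidf):
    # Put a 1 in every vocabulary slot whose word occurs in the text
    vector = np.zeros(len(tfidf.vocabulary_))
    for word in text.lower().split():
        if word in tfidf.vocabulary_:
            vector[tfidf.vocabulary_[word]] = 1
    return vector

def top_matches(vector, matrix, k):
    # Positions of the k best rows, best first, ignoring rows with zero similarity
    scores = cosine_similarity([vector], matrix)[0]
    order = np.argsort(scores)[::-1][:k]
    return [int(i) for i in order if scores[i] > 0]

# First pass: match the search text against the descriptions
first_pass = top_matches(binary_query_vector("wizard school", desc_tfidf), desc_matrix, k=1)

# Second pass: reuse the genres of the first-pass hits as a new query
genre_text = " ".join(genres[i] for i in first_pass)
second_pass = top_matches(binary_query_vector(genre_text, genre_tfidf), genre_matrix, k=2)

print(first_pass, second_pass)
```

The second pass finds the hobbit book even though its description shares no word with the search text, because it shares the `fantasy` genre with the first-pass hit. A third pass over characters would follow the same pattern.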

## Ranking
- Put all results in one list and rank them by a score derived from the bbeScore column
  - First-pass results keep their full bbeScore (weight 1.0)
  - Second-pass results are weighted at 0.25
  - Third-pass results are weighted at 0.25
- Pick the top 10 results
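
A minimal sketch of this ranking step, using three invented result sets. `with_rank_score` is a hypothetical helper, and the toy example keeps the top 3 instead of the top 10.

```python
import pandas as pd

# Toy results of the three passes (titles and scores invented for illustration)
first_pass = pd.DataFrame({"title": ["A", "B"], "bbeScore": [100.0, 40.0]})
second_pass = pd.DataFrame({"title": ["B", "C"], "bbeScore": [40.0, 800.0]})
third_pass = pd.DataFrame({"title": ["D"], "bbeScore": [120.0]})

def with_rank_score(frame, weight):
    # rank_score = bbeScore scaled by the weight of the pass that found the book
    return frame.assign(rank_score=frame["bbeScore"] * weight)

combined = pd.concat(
    [
        with_rank_score(first_pass, 1.0),
        with_rank_score(second_pass, 0.25),
        with_rank_score(third_pass, 0.25),
    ],
    ignore_index=True,
)

# Sort best first, then keep only the best-scoring row for each title
ranked = (
    combined.sort_values(by="rank_score", ascending=False)
    .drop_duplicates(subset="title")
    .head(3)
)
print(ranked[["title", "rank_score"]])
```

Sorting before `drop_duplicates` matters: book B appears in two passes, and only its higher-scoring first-pass row survives. A very popular second-pass book such as C can still outrank first-pass results despite the 0.25 weight.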

## Displaying
- Print the results in rank order

In [None]:
# Search algorithm helper functions
def find_sim_scores(term_vector, tfidf_matrix):
  term_matrix = [term_vector]
  search_cosine_sim = cosine_similarity(term_matrix, tfidf_matrix)
  sim_scores = list(enumerate(search_cosine_sim[0]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
  return sim_scores

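def find_book_indices(sim_scores):
  book_indices = []
  # Take the 10 best matches. Unlike the book-to-book tests, the query is not
  # itself a row of the matrix, so the top match must not be skipped.
  for position, similarity in sim_scores[:10]:
    if similarity > 0:
      book_indices.append(position)
  return book_indices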

def search_by_description(user_input):
  term_vector = np.zeros(len(description_tfidf.vocabulary_))

  for word in user_input.split(" "):
    word = word.lower().strip()[:5]
    if word in description_tfidf.vocabulary_:
      value_index = description_tfidf.vocabulary_[word]
      term_vector[value_index] = 1

  sim_scores = find_sim_scores(term_vector, description_matrix)
  book_indices = find_book_indices(sim_scores)

  columns = ['title', 'author', 'rating', 'price', 'bookId', 'bbeScore', 'genres_feature', 'characters_feature']
  # copy() avoids pandas' SettingWithCopyWarning when rank_score is added below
  result = search_df_20k[columns].iloc[book_indices].copy()
  result['rank_score'] = result['bbeScore']
  return result

def search_by_feature(feature_name, feature_tfidf, feature_matrix, first_pass_results, rank_multiplier):
  term_vector = np.zeros(len(feature_tfidf.vocabulary_))

  for index, row in first_pass_results.head(5).iterrows():
    for word in row[feature_name].split(" "):
      word = word.strip()
      if word in feature_tfidf.vocabulary_:
        value_index = feature_tfidf.vocabulary_[word]
        term_vector[value_index] = 1

  sim_scores = find_sim_scores(term_vector, feature_matrix)
  book_indices = find_book_indices(sim_scores)

  columns = ['title', 'author', 'rating', 'price', 'bookId', 'bbeScore']
  # copy() avoids pandas' SettingWithCopyWarning when rank_score is added below
  result = search_df_20k[columns].iloc[book_indices].copy()
  result['rank_score'] = result['bbeScore'] * rank_multiplier
  return result

def rank_search_results(results):
  all_results = pd.concat(results, ignore_index=True)
  ranked_results = all_results.sort_values(by=['rank_score'], ascending=False).drop_duplicates(subset=['title'])
  return ranked_results

In [None]:
# Search algorithm
def search_books(user_input):
  results = []

  # First pass
  # From user input find similar book descriptions
  similar_by_description = search_by_description(user_input)
  results.append(similar_by_description)

  # Second pass
  # Use the results so I have so far to find more similar books 
  # based on the genres
  results.append(search_by_feature('genres_feature', genres_tfidf, genres_matrix, similar_by_description, 0.25))

  # Third pass
  # And find more books that have similar characters
  # and I am hoping to refine my results a little better
  results.append(search_by_feature('characters_feature', characters_tfidf, characters_matrix, similar_by_description, 0.25))

  # Ranking
  ranked_results = rank_search_results(results)
  return ranked_results.head(10)

# This is the extent of my HTML knowledge
# luckily StackOverflow has an answer to most of the problems :)
def display_results(results, user_input):
  display(IPython.display.HTML(f'<h2>Search results for <i>"{user_input}":</i></h2><br>'))
  if len(results.index) == 0:
    display(IPython.display.HTML('No results found'))
    return
  for i, n in enumerate(results.index):
    display(IPython.display.HTML(f'{i + 1}. <a href="https://www.goodreads.com/book/show/{results["bookId"][n]}" target="new">{results["title"][n]}</a>'))
    display(IPython.display.HTML(f'by <i>{results["author"][n]}</i>, price: ${results["price"][n] }, rating: {results["rating"][n]}<br><br>'))

# Demo 
- Let's try the search engine and see what results we get. 
- Some test queries:
  - tom sawyer
  - isaac asimov
  - kite runner
  - holden
  - gandalf

In [None]:
user_input = input("Search query: ")

results = search_books(user_input)
display_results(results, user_input)