# **🏗️ Building a Book Recommender System 📚**

### **Business Understanding**
Goodreads is a social platform for book lovers that connects readers to books and to each other.
With a Goodreads account, you can keep track of the books you've read, the books you're reading, and the books you want to read. You can also write reviews, find book recommendations, see what friends are reading, participate in challenges, and more.

Goodreads is operated by Goodreads Inc., a subsidiary of Amazon.
The data used to build the recommendation engine is a subset of the entire Goodreads book catalogue.

Recommender systems are software tools and techniques that suggest relevant items to users. They are particularly useful when someone has to choose from a potentially overwhelming number of items, such as the many books available to read. You can only read one book at a time, and even a lifetime devoted to reading would not get you through half of the books ever written.

With so many books and so little time in our busy lives, wouldn't it be ideal if every book we picked up actually appealed to us, making each reading experience a good one?

We can achieve this with a book recommendation system. In this notebook, we build a **content-based recommendation** system that recommends books based on the similarity between the authors and tags (for example, genres) of books we've previously read.

### **Dependencies**

In [2]:
import json
import random
import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Requires the NLTK stopword corpus: run nltk.download("stopwords") once if it is missing.
en_stop = stopwords.words("english")

### **Data Understanding**

In [3]:
df = pd.read_csv(r"data\goodbooks dataset.csv",index_col=0)
print(f"The dataset has {df.shape[0]} entries and {df.shape[-1]} columns.")
df.head()

The dataset has 10000 entries and 25 columns.


Unnamed: 0,id,book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,tag_name,auth_tags
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,to-read fantasy favorites currently-reading yo...,Suzanne Collins to-read fantasy favorites curr...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,to-read fantasy favorites currently-reading yo...,"J.K. Rowling, Mary GrandPré to-read fantasy f..."
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,to-read fantasy favorites currently-reading yo...,Stephenie Meyer to-read fantasy favorites curr...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,to-read favorites currently-reading young-adul...,Harper Lee to-read favorites currently-reading...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,to-read favorites currently-reading young-adul...,F. Scott Fitzgerald to-read favorites currentl...


The dataset has vast amounts of information. For our book recommendation system, we keep it lean and drop any columns that do not help us build our recommendation engine.

In [4]:
drop_cols = ['id', 'original_title','best_book_id', 'work_id', 'books_count','isbn13','language_code','ratings_count','work_ratings_count','ratings_1','ratings_2', 'ratings_3', 'ratings_4', 'ratings_5','auth_tags']
df.drop(columns=drop_cols,inplace=True)
df.head()

Unnamed: 0,book_id,isbn,authors,original_publication_year,title,average_rating,work_text_reviews_count,image_url,small_image_url,tag_name
0,2767052,439023483,Suzanne Collins,2008.0,"The Hunger Games (The Hunger Games, #1)",4.34,155254,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,to-read fantasy favorites currently-reading yo...
1,3,439554934,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Sorcerer's Stone (Harry P...,4.44,75867,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,to-read fantasy favorites currently-reading yo...
2,41865,316015849,Stephenie Meyer,2005.0,"Twilight (Twilight, #1)",3.57,95009,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,to-read fantasy favorites currently-reading yo...
3,2657,61120081,Harper Lee,1960.0,To Kill a Mockingbird,4.25,72586,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,to-read favorites currently-reading young-adul...
4,4671,743273567,F. Scott Fitzgerald,1925.0,The Great Gatsby,3.89,51992,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,to-read favorites currently-reading young-adul...


We will need the cover images for our interactive user interface, so we subset the image links and store them in a separate `.csv` file.

In [5]:
image_df = df[['book_id','image_url','small_image_url']]
image_df.to_csv("images.csv")
df.drop(columns=['image_url','small_image_url','isbn'],inplace=True)
image_df.head()

Unnamed: 0,book_id,image_url,small_image_url
0,2767052,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,3,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,41865,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,2657,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,4671,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [6]:
df.isna().sum()

book_id                       0
authors                       0
original_publication_year    21
title                         0
average_rating                0
work_text_reviews_count       0
tag_name                      0
dtype: int64

Only the publication year column has missing values: 21 entries, or about **0.21%** of the 10,000 rows, so we can safely drop them.

In [5]:
df.dropna(inplace=True)
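
As a quick check on the step above, the sketch below uses a hypothetical four-row toy frame (not the Goodreads data) to show one way of computing the missing share. It also shows that `dropna` keeps the original index labels, so the index has gaps afterwards, which matters whenever values are later assigned by index label rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame(
    {
        "book_id": [1, 2, 3, 4],
        "original_publication_year": [2008.0, np.nan, 1997.0, 1960.0],
    }
)

# Share of rows with a missing publication year
missing_share = toy["original_publication_year"].isna().mean()
print(f"{missing_share:.2%} of rows are missing a publication year")

# Dropping rows keeps the surviving index labels, so label 1 is now absent
cleaned = toy.dropna()
print(list(cleaned.index))
```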

### **👷🏽‍♂️Model Building**

#### **Feature Engineering**

Text data needs to be cleaned before we can generate a numerical representation of it. We also need to consolidate all the text-related data on each book into one feature.

In [6]:
df['pub_year'] = df['original_publication_year'].astype(int)
df.drop(columns=['original_publication_year'],inplace=True)

We need an approach for combining book metadata to have one descriptive column.

In [7]:
new_titles = []
for title in df['title']:
    new_titles.append(re.sub(r"[^a-zA-Z0-9() ]","",title))

In [8]:
def clean_title(s):
    match = re.search(r'\((.*?)\)', s)
    if match:
        main_title = s.split('(')[0].strip()
        bracket_content = match.group(1).strip()

        # Extract number if present
        num_match = re.search(r'#?(\d+)', bracket_content)
        number = num_match.group(1) if num_match else ''

        # Words in both parts
        words_in_bracket = re.findall(r'\b\w+\b', bracket_content.lower())
        words_in_main = re.findall(r'\b\w+\b', main_title.lower())

        # If any word matches
        if any(word in words_in_main for word in words_in_bracket):
            return re.sub(r"[^a-zA-Z0-9 ]","",f"{main_title} {number}".strip())
        else:
            return re.sub(r"[^a-zA-Z0-9 ]","",f"{main_title} {bracket_content.replace('#'+number, '').strip()} {number}".strip())
    else:
        return re.sub(r"[^a-zA-Z0-9 ]","",s)

In [9]:
df['tag_name'] = df['tag_name'].str.lower().str.replace(r"to-read|-"," ",regex=True).str.strip().str.replace(r"[^a-z ]","",regex=True)
df['title'] = df['title'].apply(clean_title)
df['authors'] = [re.sub(r'[^a-zA-Z ]',"",author).strip() for author in df['authors']]
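
To make the tag cleaning easier to follow, here is a small sketch that applies the same chain of string operations to two made-up tag strings (the tags are illustrative assumptions, not rows from the dataset). It removes the `to-read` shelf name, turns hyphens into spaces, and strips anything that is not a lowercase letter or a space.

```python
import pandas as pd

# Hypothetical tag strings in the same style as the tag_name column
tags = pd.Series(
    [
        "to-read fantasy young-adult sci-fi",
        "To-Read favorites currently-reading 2016-reads",
    ]
)

cleaned = (
    tags.str.lower()
    .str.replace(r"to-read|-", " ", regex=True)  # drop "to-read", split hyphenated tags
    .str.strip()
    .str.replace(r"[^a-z ]", "", regex=True)  # drop digits and punctuation
)

for original, result in zip(tags, cleaned):
    print(f"{original!r} -> {result.split()}")
```

Removing digits can leave double spaces behind, which is harmless here because the vectorizer splits on whitespace anyway.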

In [10]:
df['description'] = df['authors'].str.lower() + " " + df['tag_name']

In [11]:
df.head()

Unnamed: 0,book_id,authors,title,average_rating,work_text_reviews_count,tag_name,pub_year,description
0,2767052,Suzanne Collins,The Hunger Games 1,4.34,155254,fantasy favorites currently reading young adul...,2008,suzanne collins fantasy favorites currently re...
1,3,JK Rowling Mary GrandPr,Harry Potter and the Sorcerers Stone 1,4.44,75867,fantasy favorites currently reading young adul...,1997,jk rowling mary grandpr fantasy favorites curr...
2,41865,Stephenie Meyer,Twilight 1,3.57,95009,fantasy favorites currently reading young adul...,2005,stephenie meyer fantasy favorites currently re...
3,2657,Harper Lee,To Kill a Mockingbird,4.25,72586,favorites currently reading young adult fictio...,1960,harper lee favorites currently reading young a...
4,4671,F Scott Fitzgerald,The Great Gatsby,3.89,51992,favorites currently reading young adult fictio...,1925,f scott fitzgerald favorites currently reading...


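We subset the original dataset to only `book_id` and `description`, which we will use to build a **TF-IDF matrix**. TF-IDF is an NLP encoding technique that converts text into numerical vectors, giving more weight to terms that are frequent in one description but rare across the others.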

In [12]:
df_tf = df[['book_id','description']]
df_tf.head()

Unnamed: 0,book_id,description
0,2767052,suzanne collins fantasy favorites currently re...
1,3,jk rowling mary grandpr fantasy favorites curr...
2,41865,stephenie meyer fantasy favorites currently re...
3,2657,harper lee favorites currently reading young a...
4,4671,f scott fitzgerald favorites currently reading...


#### **Building the text representation model**

##### **TF-IDF**

In [13]:
vectorizer = TfidfVectorizer(stop_words=en_stop,lowercase=True)
X = vectorizer.fit_transform(df_tf['description'])
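
The effect of TF-IDF weighting is easier to see on a tiny corpus. The sketch below fits a `TfidfVectorizer` with default settings on three hypothetical descriptions (invented for illustration) and compares the inverse document frequency of a term shared by two descriptions with that of a term that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions in the same "author + tags" style
toy_descriptions = [
    "suzanne collins fantasy young adult dystopia",
    "veronica roth fantasy young adult dystopia",
    "harper lee classics historical fiction",
]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_descriptions)

# One row per description, one column per distinct term
print(X_toy.shape)

# Terms shared by several descriptions get a lower IDF than rare terms
vocab = toy_vectorizer.vocabulary_
idf_fantasy = toy_vectorizer.idf_[vocab["fantasy"]]
idf_collins = toy_vectorizer.idf_[vocab["collins"]]
print(f"idf(fantasy)={idf_fantasy:.3f}, idf(collins)={idf_collins:.3f}")
```

In this toy setting the author name is more distinctive than the shared genre tag, so it contributes more to the vector of the book it belongs to.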

##### **Cosine similarity**

In [14]:
cos_sim = cosine_similarity(X,X)
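
Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The sketch below checks a hand-written version (the helper `manual_cosine` is a hypothetical name introduced for illustration) against scikit-learn's `cosine_similarity` on three small vectors.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero dimension with a


def manual_cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_ab = manual_cosine(a, b)
sim_ac = manual_cosine(a, c)
sklearn_ab = cosine_similarity(a, b)[0, 0]
print(sim_ab, sim_ac, sklearn_ab)
```

Vector length does not affect the score, only direction does. Since `TfidfVectorizer` L2-normalises each row by default, the cosine similarity between two TF-IDF rows is simply their dot product.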

In [15]:
sim_df = pd.DataFrame(cos_sim, columns=df_tf['book_id'].values)
# Assign ids by position with .values: df_tf's index has gaps after dropna(),
# so assigning the Series directly would misalign ids with the rows of cos_sim.
sim_df.insert(0, 'id', df_tf['book_id'].astype(str).values)

In [16]:
sim_df.head()

Unnamed: 0,id,2767052,3,41865,2657,4671,11870085,5907,5107,960,...,101094,13616278,4936457,4769651,15613,7130616,208324,77431,8565083,8914
0,2767052,1.0,0.325924,0.355967,0.197583,0.19248,0.352707,0.304033,0.201138,0.176833,...,0.053052,0.260444,0.082329,0.175164,0.082136,0.216814,0.036395,0.076485,0.09402,0.018187
1,3,0.325924,1.0,0.309493,0.240624,0.230233,0.296911,0.426032,0.223823,0.1478,...,0.054087,0.289519,0.10107,0.333461,0.109204,0.251046,0.043329,0.069068,0.12294,0.023708
2,41865,0.355967,0.309493,1.0,0.168208,0.194354,0.275644,0.287195,0.175251,0.13031,...,0.043887,0.29219,0.064116,0.172325,0.09011,0.296079,0.04392,0.050091,0.072892,0.018929
3,2657,0.197583,0.240624,0.168208,1.0,0.712389,0.275458,0.27529,0.711296,0.201557,...,0.122932,0.095351,0.149404,0.14619,0.446807,0.094294,0.099733,0.103475,0.136794,0.078937
4,4671,0.19248,0.230233,0.194354,0.712389,1.0,0.285147,0.291151,0.763649,0.175954,...,0.090534,0.112609,0.122926,0.138187,0.480164,0.109246,0.084921,0.100229,0.157451,0.043318


In [18]:
id_map = {i:id for i,id in zip(range(len(df_tf)),df_tf['book_id'])}

In [19]:
new_cos_sim = {}
for i in range(cos_sim.shape[0]):
    new_cos_sim[id_map[i]] = [id_map[j] for j in np.argsort(cos_sim[i])[::-1][1:101]]
with open("sim_matrix.json",'w') as file:
    json.dump(new_cos_sim,file)
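
The ranking step can be illustrated on a small matrix. The sketch below uses a hypothetical 3x3 similarity matrix and made-up ids. Unlike the cell above, which skips the first position on the assumption that each book ranks first against itself, this variant filters out the book's own index explicitly, which stays correct even when two books have identical descriptions and tie at 1.0.

```python
import numpy as np

# Hypothetical similarity matrix and ids, for illustration only
toy_sim = np.array(
    [
        [1.0, 0.2, 0.9],
        [0.2, 1.0, 0.4],
        [0.9, 0.4, 1.0],
    ]
)
toy_ids = [101, 202, 303]
top_k = 2

neighbours = {}
for i, book_id in enumerate(toy_ids):
    # Indices sorted from most to least similar
    order = np.argsort(toy_sim[i])[::-1]
    # Drop the book itself, then keep the top_k remaining indices
    order = [j for j in order if j != i][:top_k]
    neighbours[book_id] = [toy_ids[j] for j in order]

print(neighbours)
```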

#### **Building recommendation logic**

In [20]:
book_find_title = {}
book_retreive_id = {}
for id, book in zip(df['book_id'],df['title']):
    book_find_title[id] = book.strip()
    book_retreive_id[book.strip()] = id
with open('find_title.json','w') as file:
    json.dump(book_find_title,file)
with open('retreive_id.json','w') as file:
    json.dump(book_retreive_id,file)
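
One thing to keep in mind when these files are read back (for example by the user interface): JSON object keys are always strings, so integer book ids used as dictionary keys come back as strings. The sketch below shows the round trip on a two-entry sample and one possible way to restore the integer keys.

```python
import json

# Two-entry sample in the shape of the id-to-title map
sample_find_title = {
    2767052: "The Hunger Games 1",
    3: "Harry Potter and the Sorcerers Stone 1",
}

# Simulate writing to disk and reading back
reloaded = json.loads(json.dumps(sample_find_title))
print(list(reloaded))  # keys are now strings

# Convert the keys back to integers after loading
restored = {int(key): title for key, title in reloaded.items()}
print(restored == sample_find_title)
```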

In [21]:
most_read_books = list(df.sort_values(by=['work_text_reviews_count','average_rating'],ascending=False).loc[:,'title'][:20])
with open("most_read.json",'w') as file:
    json.dump(most_read_books,file)
all_books = list(df['title'].str.strip())
with open("all_books.json",'w') as file:
    json.dump(all_books, file)

In [22]:
def recommend_n_titles(book_read,find_map=book_find_title,retreive_map=book_retreive_id,matrix=new_cos_sim,n=10,pre_seed=most_read_books,archive=all_books):
    """
    Retrieves the titles of the books most similar to a title previously read.
    Params:
        - book_read(str): a book that the user has already read
        - find_map(dict): maps book_id (key) to book title (value)
        - retreive_map(dict): maps book title (key) to book_id (value)
        - matrix(dict): maps each book_id to its most similar book_ids, ordered by cosine similarity
        - n(int): the total number of recommendations
        - pre_seed(list): the most reviewed and highly rated titles, used as a fallback
        - archive(list): all book titles
    Returns:
        - content(list): titles of the recommended books
        - sim_titles_n_id(list): book_ids of the recommended books
    """
    content = []
    book_clean = re.sub(r"[^a-z0-9A-Z ]","",book_read.strip())
    print(book_clean)
    if book_clean in archive:
        book_id = retreive_map[book_clean]
    else:
        book_read = random.choice(pre_seed)
        print(f"😞Looks like we don't have that. Here's one of our best rated books: {book_read}")
        book_id = retreive_map[book_read.strip()]
    sim_titles_n_id = matrix[book_id][:n]
    for id in sim_titles_n_id:
        content.append(find_map[id])
    print(f"Here are some books similar to {book_read} we think you'd really enjoy!\n")
    return content,sim_titles_n_id

In [23]:
df = pd.merge(left=df,right=image_df,how='left',on='book_id')

In [24]:
most_read_images = []
for title in most_read_books:
    most_read_images.append(df.loc[df['title']==title,'image_url'].item())
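
Note that `Series.item()` only works when exactly one row matches, and raises a `ValueError` otherwise. If cleaned titles are ever duplicated or missing, a more forgiving lookup may be useful. The sketch below shows one option on a hypothetical toy frame; the helper name `first_image` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy frame with a duplicated title
toy = pd.DataFrame(
    {
        "title": ["Dune", "Emma", "Emma"],
        "image_url": ["u1", "u2", "u3"],
    }
)


def first_image(frame, title):
    """Return the first matching image URL, or None if the title is absent."""
    matches = frame.loc[frame["title"] == title, "image_url"]
    return matches.iloc[0] if not matches.empty else None


# .item() fails when more than one row matches
item_failed = False
try:
    toy.loc[toy["title"] == "Emma", "image_url"].item()
except ValueError:
    item_failed = True

print(first_image(toy, "Dune"), first_image(toy, "Emma"), first_image(toy, "Missing"))
```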

In [25]:
images = {book_id: img for book_id, img in zip(df['book_id'], df['image_url'])}  # keyed by book_id, not title
with open('images.json','w') as file:
    json.dump(images,file)

In [26]:
gallery_images = [
    {'image':img, 'label': title}
    for img, title in zip(most_read_images,most_read_books)
]
with open('gallery.json','w') as file:
    json.dump(gallery_images,file)