# Overview

**Introduction**

Reading is a great way to gain knowledge, improve vocabulary, and relax the mind. However, not everyone is passionate about reading, especially beginner adult readers. Therefore, building a book recommendation tool can help them discover books that match their interests, preferences, and reading levels. In this paper, I will explore the problem of building a book recommendation tool for beginner adult readers, the value of the solution, the data source, the techniques used, and the challenges faced.








The data comes from Goodreads, a website where readers can keep track of what they read and find recommendations.

The Goodreads book data was scraped by researchers at UCSD and can be downloaded from https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

The UCSD data is used instead of an API because Goodreads ended access to its API.

# **Data** **Explanation**




1.   **goodreads_interactions.csv** : For every user, it lists which books they read and how they rated them

*   **user_id :** A unique ID for each user on Goodreads
*   **book_id :** ID of the book in this file (mapped to the Goodreads book ID through book_id_map.csv)
*   **rating** : Each row records how the user rated a particular book, from 0 to 5


2.   **goodreads_books**.**json**.**gz** : A gzipped JSON file with millions of lines, where each line holds the metadata for one book

*   **title** : Title of the book on Goodreads
*   **book_id** : Unique ID for each book on Goodreads
*   **ratings_count**: How many times the book has been rated by users

3.   **book_id_map**.**csv** : Maps the book ID used in one data set to the book ID used in the other, to make sure that we are referring to the same book

*   **book_id_csv** : book_id used in goodreads_interactions.csv
*   **book_id** : book_id used in goodreads_books.json.gz


4.   **books_titles**.**json**: The cleaned table of book titles built in this notebook. It will be used for collaborative filtering, which is based on personal book preferences.



*   **book_id** : Unique ID for each book on Goodreads
*   **title**: Title of the book on Goodreads
*   **ratings**: How many times the book has been rated (taken from ratings_count)
*   **url** : URL of the book's Goodreads page
*   **cover_image**: URL of the book's cover image
*   **mod_title**: Normalized title (lowercase, punctuation removed) used for searching












# **Load** **Data**

**goodreads_books**.**json**.**gz** contains millions of records, so loading it all at once would use up the available memory. Instead, the JSON file will be read line by line.

In [None]:
import gzip # read the compressed file line by line instead of loading it all into memory at once

with gzip.open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_books.json.gz", 'r') as f:
    line = f.readline() # stream the file without unzipping it to disk

In [None]:
line # a single line from the file; it holds the data for one book, such as the title, the number of ratings, and the book_id

b'{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin\'s Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "t

From the output above, each line contains information about one book, here "title": "W.C. Fields: A Life on Film", and lists the ratings_count, book_id, and other information about the book.

# **Data** **Cleaning**

In [None]:
import json # json.loads turns the single line into a dictionary so that each property can be accessed by name

json.loads(line)

{'isbn': '0312853122',
 'text_reviews_count': '1',
 'series': [],
 'country_code': 'US',
 'language_code': '',
 'popular_shelves': [{'count': '3', 'name': 'to-read'},
  {'count': '1', 'name': 'p'},
  {'count': '1', 'name': 'collection'},
  {'count': '1', 'name': 'w-c-fields'},
  {'count': '1', 'name': 'biography'}],
 'asin': '',
 'is_ebook': 'false',
 'average_rating': '4.00',
 'kindle_asin': '',
 'similar_books': [],
 'description': '',
 'format': 'Paperback',
 'link': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'authors': [{'author_id': '604031', 'role': ''}],
 'publisher': "St. Martin's Press",
 'num_pages': '256',
 'publication_day': '1',
 'isbn13': '9780312853129',
 'publication_month': '9',
 'edition_information': '',
 'publication_year': '1984',
 'url': 'https://www.goodreads.com/book/show/5333265-w-c-fields',
 'image_url': 'https://images.gr-assets.com/books/1310220028m/5333265.jpg',
 'book_id': '5333265',
 'ratings_count': '3',
 'work_id': '5400751',
 'title': '

In [None]:
def parse_fields(line): # take a single line and return only the fields needed
    data = json.loads(line)
    return {
        "book_id": data["book_id"],
        "title": data["title_without_series"],
        "ratings": data["ratings_count"],
        "url": data["url"],
        "cover_image": data["image_url"]
    }


In [None]:
books_titles = [] # go through the file line by line and parse each line
with gzip.open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_books.json.gz", 'r') as f:
    while True:
        line = f.readline()
        if not line: # an empty read means the end of the file was reached
            break
        fields = parse_fields(line)

        try:
            ratings = int(fields["ratings"]) # skip books whose ratings count is not a number
        except ValueError:
            continue
        if ratings > 15: # keep only books with more than 15 ratings; rarely rated books are unlikely to be useful recommendations, and dropping them cuts down the data
            books_titles.append(fields)
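
The streaming pattern used in the loop above can be tried without the full Goodreads file. The following is a minimal sketch, assuming a tiny gzip file built in memory from made-up records; `iter_books` and `min_ratings` are hypothetical names that do not appear in the notebook.

```python
import gzip
import io
import json

def iter_books(fileobj, min_ratings=15):
    """Yield selected fields for books with more than min_ratings ratings."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        for line in f:  # one JSON object per line, read one line at a time
            data = json.loads(line)
            try:
                ratings = int(data["ratings_count"])
            except ValueError:  # guard against non-numeric counts, as the loop above does
                continue
            if ratings > min_ratings:
                yield {"book_id": data["book_id"], "ratings": ratings}

# Made-up records, only to exercise the generator.
records = [
    {"book_id": "1", "ratings_count": "3"},
    {"book_id": "2", "ratings_count": "140"},
    {"book_id": "3", "ratings_count": ""},
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
buffer = io.BytesIO(gzip.compress(raw))

print(list(iter_books(buffer)))
```

Iterating over the file object with `for line in f` behaves like the `readline` loop, and only the record with 140 ratings passes the filter.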

In [None]:
import pandas as pd

titles = pd.DataFrame.from_dict(books_titles) # books_titles is a list of dictionaries; from_dict turns each dictionary into a row of the dataframe

In [None]:
titles["ratings"] = pd.to_numeric(titles["ratings"]) # convert the ratings column to a numeric type

## **Search** **Engine**

The search space should be kept small. To do this, the number of distinct characters in the titles is reduced, so that a title matches regardless of capitalization or punctuation. For example, "Books" and "books" are the same thing, so a search for either should return the same title.

In [None]:
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True) # mod_title is the title with everything except letters, digits, and spaces removed, which shrinks the search space

In [None]:
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,The Unschooled Wizard Sun Wolf and Starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,Best Friends Forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,The Aeneid for Boys and Girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,Alls Fairy in Love and War Avalon Web of Magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,The Devils Notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,Ondine Ondine Quartet 05
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,Jacqueline Kennedy Onassis Friend of the Arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Spaniards Blackmailed Bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,The Childrens Classic Poetry Collection


Looking at mod_title, the titles are still capitalized and may contain runs of several spaces.

In [None]:
titles["mod_title"] = titles ["mod_title"].str.lower() # lower case the titles

In [None]:
titles["mod_title"] = titles["mod_title"].str.replace(r"\s+", " ", regex=True) # collapse runs of whitespace, e.g. 3 spaces in a row become a single space

The goal is to make the search engine a little more efficient.
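
The three cleaning steps can also be checked on individual strings. The following is a minimal sketch, assuming a hypothetical helper named `normalize_title` that is not part of the notebook; the final `strip()` is an extra step the notebook does not apply.

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical helper: strip punctuation, lowercase, collapse whitespace."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title)  # keep letters, digits, spaces
    cleaned = cleaned.lower()
    cleaned = re.sub(r"\s+", " ", cleaned)  # runs of whitespace become one space
    return cleaned.strip()  # extra step, not in the notebook

print(normalize_title("The Devil's  Notebook"))
print(normalize_title("Ondine (Ondine Quartet, #0.5)"))
```

Applying the same function to the search query later keeps the query and the stored titles in the same reduced character set.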

In [None]:
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,"the unschooled wizard (sun wolf and starhawk, ..."
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,all's fairy in love and war (avalon: web of ma...
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devil's notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,"ondine (ondine quartet, #0.5)"
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,jacqueline kennedy onassis: friend of the arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,the spaniard's blackmailed bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the children's classic poetry collection


## **Cleaning** **Missing** **Values**

To make sure accurate results are displayed, books with an empty title will be removed from the search.

In [None]:
titles = titles[titles["mod_title"].str.len() > 0] # drop the empty titles by keeping only rows whose mod_title is longer than 0 characters

In [None]:
titles.to_json("books_titles.json") # saved for use in later sessions

In [None]:
titles

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook
...,...,...,...,...,...,...
1308952,17805813,"Ondine (Ondine Quartet, #0.5)",327,https://www.goodreads.com/book/show/17805813-o...,https://images.gr-assets.com/books/1379766592m...,ondine ondine quartet 05
1308953,331839,Jacqueline Kennedy Onassis: Friend of the Arts,18,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,jacqueline kennedy onassis friend of the arts
1308954,2685097,The Spaniard's Blackmailed Bride,112,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,the spaniards blackmailed bride
1308955,2342551,The Children's Classic Poetry Collection,36,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,the childrens classic poetry collection


### **Term** **Frequency** - **Inverse** **Document** **Frequency**

For the search engine, Term Frequency - Inverse Document Frequency (TF-IDF) will be used. It is a widely used statistical method in natural language processing and information retrieval that measures how important a term is within a document relative to a collection of documents. Here each title is a document: a word scores higher the more often it appears in a title and lower the more titles it appears in.
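
To make the weighting concrete, here is a minimal pure-Python sketch of the textbook formula tf * log(N / df) on three titles. It is an illustration only: `tfidf_vectors` and `titles_demo` are hypothetical names, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

titles_demo = ["the great gatsby", "the kite runner", "the great alone"]
for title, vec in zip(titles_demo, tfidf_vectors(titles_demo)):
    print(title, vec)
```

In this sketch "the" appears in every title and gets a weight of zero, while "gatsby" appears in only one title and gets the largest weight in its row.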

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # this class builds the TF-IDF matrix
vectorizer = TfidfVectorizer() # the vectorizer takes a list of strings and turns it into a TF-IDF matrix

tfidf = vectorizer.fit_transform(titles["mod_title"])

To search, turn the query into a vector and compare it against every row of the matrix.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re
# to check that the right book was picked, the url column is rendered as a clickable HTML link
# and the cover_image column is rendered as an image
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val) # wrap the url in an HTML link so it can be clicked

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val) # render the cover image at a width of 50 pixels


def search(query, vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower()) # apply the same cleaning as mod_title: lowercase the query and strip punctuation
    query_vec = vectorizer.transform([processed]) # turn the query into a vector using the fitted vectorizer
    similarity = cosine_similarity(query_vec, tfidf).flatten() # similarity between the query and every title in the matrix
    indices = np.argpartition(similarity, -10)[-10:] # indices of the 10 most similar titles (in no particular order)
    results = titles.iloc[indices] # use the indices to select the matching rows of titles
    results = results.sort_values("ratings", ascending=False) # put the most-rated editions first
    return results.head(5).style.format({'url': make_clickable, 'cover_image': show_image}) # show the top 5 with clickable links and cover images
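
The ranking step inside `search` can be illustrated on its own. The following is a minimal sketch using only the standard library: it stores vectors as dictionaries and uses `heapq.nlargest` in place of `np.argpartition`, so it is a stand-in for the idea rather than the notebook's implementation. The names `cosine` and `top_k` and the toy weights are assumptions.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Indices of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Toy vectors with made-up weights.
docs = [
    {"pachinko": 1.0},
    {"dreaming": 0.7, "pachinko": 0.7},
    {"great": 0.6, "gatsby": 0.8},
]
print(top_k({"pachinko": 1.0}, docs, k=2))
```

Unlike `heapq.nlargest`, `np.argpartition` returns the top entries in no guaranteed order, which is one reason the notebook sorts the ten candidates again (there by ratings) before showing five.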

In [None]:
search ("Pachinko", vectorizer)

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
1062354,29983711,Pachinko,8161,Goodreads,,pachinko
436303,32737635,The Most Dangerous Place on Earth,4063,Goodreads,,the most dangerous place on earth
1086215,32619967,Pachinko,1361,Goodreads,,pachinko
368254,684819,Dreaming Pachinko,283,Goodreads,,dreaming pachinko
178483,34051011,Pachinko,254,Goodreads,,pachinko


Make a list of liked books using their book_id values.

In [None]:
liked_books = ["4408", "31147619", "29983711", "9401317", "9317691", "8153988", "20494944"]

Use the liked books to build recommendations. The Goodreads interactions file records each user and how they rated each book, so it can be used to create recommendations for us.
First find the users who like the same books as us, and then find all the books they liked, because we will probably like the same books as them.
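
The two passes described here can be sketched on a few made-up rows before running them on the full file. This is a minimal sketch under simplifying assumptions: the sample uses the same column layout as goodreads_interactions.csv but skips the csv_id to book_id mapping, and `find_overlap_users` and `collect_rows` are hypothetical names.

```python
import io

# Made-up interactions in the same column layout as goodreads_interactions.csv.
SAMPLE = """user_id,book_id,is_read,rating,is_reviewed
u1,10,1,5,0
u1,11,1,4,0
u2,10,1,2,0
u2,12,1,5,0
u3,13,1,5,1
u3,10,1,4,0
"""

def find_overlap_users(lines, liked, min_rating=4):
    """Pass 1: users who rated at least one liked book min_rating or higher."""
    users = set()
    for line in lines:
        user_id, book_id, _, rating, _ = line.strip().split(",")
        if not rating.isdigit():  # skips the header row
            continue
        if book_id in liked and int(rating) >= min_rating:
            users.add(user_id)
    return users

def collect_rows(lines, users):
    """Pass 2: every (user, book, rating) row belonging to those users."""
    rows = []
    for line in lines:
        user_id, book_id, _, rating, _ = line.strip().split(",")
        if user_id in users:
            rows.append((user_id, book_id, rating))
    return rows

liked = {"10"}
users = find_overlap_users(io.StringIO(SAMPLE), liked)
rows = collect_rows(io.StringIO(SAMPLE), users)
print(sorted(users))
print(rows)
```

User u2 rated the liked book only 2, so only u1 and u3 are kept, and the second pass then collects every book those two users interacted with.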

**book_id_map**.**csv** : Let's look at the book ID map first.

In [None]:
!head "/content/drive/MyDrive/Data Magic Uploads/Data/book_id_map.csv" # head prints the first 10 lines of the file

book_id_csv,book_id
0,34684622
1,34536488
2,34017076
3,71730
4,30422361
5,33503613
6,33517540
7,34467031
8,6383669


In [None]:
csv_book_mapping = {}
with open ("/content/drive/MyDrive/Data Magic Uploads/Data/book_id_map.csv", "r") as f:
    while True:
        line = f.readline()
        if not line:
            break
        csv_id, book_id = line.strip().split(",") # split each line on the comma: the part before the comma is the csv_id and the part after it is the book_id
        csv_book_mapping[csv_id] = book_id
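
An alternative way to build the same lookup is the standard-library `csv` module. The following is a minimal sketch on an in-memory sample that reuses the first two rows shown above; `sample` and `mapping` are hypothetical names.

```python
import csv
import io

# In-memory sample with the same two-column layout as book_id_map.csv.
sample = io.StringIO("book_id_csv,book_id\n0,34684622\n1,34536488\n")

reader = csv.reader(sample)
next(reader)  # skip the header row so it does not become a dictionary entry
mapping = {csv_id: book_id for csv_id, book_id in reader}

print(mapping)
```

The manual loop above appears to store the header row as an extra entry; that is harmless for lookups but adds one to the key count.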

In [None]:
len(csv_book_mapping) # check the number of keys, i.e. the books that can be matched between the two data sets

2360651

In [None]:
!wc -l "/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv" # wc -l counts the number of lines in the file

228648343 /content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv


**goodreads_interactions**.**csv** : the file that records how each user rated each book

In [None]:
!head "/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv"
# each row is one user's rating for one book
# for each row, check whether the book is in our set of liked books; if it is, the user might have similar taste in books
# then check the rating; if the user rated the book highly, the user does have similar taste to ours
# add such users to a set of users with similar taste in books

user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,1,5,1
0,946,1,5,0
0,945,1,5,0
0,944,1,5,0
0,943,1,5,0
0,942,1,5,0
0,941,1,5,0
0,940,1,5,0


# **Finding** **Users** **With** **the** **Same** **Book** **Taste** **as** **Us**

In [None]:
overlap_users = set() # a set is a Python data structure in which every element is unique

with open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",") # underscores mark fields that are not of interest; this keeps user_id, csv_id, and rating

        if user_id in overlap_users: # the user is already in the set, so there is no need to process them again
            continue

        try:
            rating = int(rating) # parse the rating as an integer; the header row fails here and is skipped
        except ValueError:
            continue
        book_id = csv_book_mapping[csv_id] # turn the csv_id into a book_id, because liked_books is a list of book_ids

        if book_id in liked_books and rating >= 4: # if this row is for a book we like and the rating is 4 or higher, add the user to overlap_users
            overlap_users.add(user_id)


# **Finding** **What** **Books** **Users** **Liked**

In [None]:
rec_lines = [] # every book read by users who liked the same books as us, i.e. all the books we might want to read

with open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",") # split the row; underscores mark fields that are not of interest
        if user_id in overlap_users:
            book_id = csv_book_mapping[csv_id]
            rec_lines.append([user_id, book_id, rating])

In [None]:
len(overlap_users) # the number of users who liked the books we liked

2029

In [None]:
len(rec_lines) # the total number of book interactions recorded for those users (not filtered by rating)

1530257

# **Rank** **Users** **Recommendations**

In [None]:
import pandas as pd

# turn the rec (recommendation) lines into a dataframe to make them easier to work with
recs = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])
recs["book_id"] = recs["book_id"].astype(str) # make sure book_id is a string



In [None]:
# finding the top recommendations
top_recs = recs["book_id"].value_counts().head(10) # count how often each book_id occurs and keep the 10 most common
top_recs = top_recs.index.values # the index holds the book_ids

The top recommendations are now known only by book_id and count. The next step is to turn each book_id into a title.

In [None]:
books_titles = pd.read_json("books_titles.json") # read the book titles back in so the recommendations can be combined with them to find the recommended titles
books_titles["book_id"] = books_titles["book_id"].astype(str) # make sure book_id has the same data type as in recs

In [None]:
books_titles.head()

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
1,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
2,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls
3,6066812,All's Fairy in Love and War (Avalon: Web of Ma...,98,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,alls fairy in love and war avalon web of magic 8
4,287149,The Devil's Notebook,986,https://www.goodreads.com/book/show/287149.The...,https://images.gr-assets.com/books/1328768789m...,the devils notebook


Creating Initial Book recommendations

In [None]:
books_titles[books_titles["book_id"].isin(top_recs)] #find all the book titles where the book_id is in the top 10 recommendations

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
53027,77203,The Kite Runner,1848782,https://www.goodreads.com/book/show/77203.The_...,https://images.gr-assets.com/books/1484565687m...,the kite runner
284473,2767052,"The Hunger Games (The Hunger Games, #1)",4899965,https://www.goodreads.com/book/show/2767052-th...,https://images.gr-assets.com/books/1447303603m...,the hunger games the hunger games 1
401395,5107,The Catcher in the Rye,2086945,https://www.goodreads.com/book/show/5107.The_C...,https://images.gr-assets.com/books/1398034300m...,the catcher in the rye
463463,4671,The Great Gatsby,2758812,https://www.goodreads.com/book/show/4671.The_G...,https://images.gr-assets.com/books/1490528560m...,the great gatsby
615314,5470,1984,2023937,https://www.goodreads.com/book/show/5470.1984,https://images.gr-assets.com/books/1348990566m...,1984
757376,38447,The Handmaid's Tale,648783,https://www.goodreads.com/book/show/38447.The_...,https://images.gr-assets.com/books/1498057733m...,the handmaids tale
790927,2657,To Kill a Mockingbird,3255518,https://www.goodreads.com/book/show/2657.To_Ki...,https://images.gr-assets.com/books/1361975680m...,to kill a mockingbird
878151,18143977,All the Light We Cannot See,498685,https://www.goodreads.com/book/show/18143977-a...,https://images.gr-assets.com/books/1451445646m...,all the light we cannot see
878545,3,Harry Potter and the Sorcerer's Stone (Harry P...,4765497,https://www.goodreads.com/book/show/3.Harry_Po...,https://images.gr-assets.com/books/1474154022m...,harry potter and the sorcerers stone harry pot...
1062354,29983711,Pachinko,8161,https://www.goodreads.com/book/show/29983711-p...,https://images.gr-assets.com/books/1462393298m...,pachinko


# **Improving** **the** **book** **recommendations**:

The books shown above are mostly the most popular books on Goodreads overall, so they are not necessarily books that we would be interested in. Let's find a way to rank books by our own preferences rather than by general popularity.

In [None]:
all_recs = recs["book_id"].value_counts() # a Series of how many times each book appears in the set

In [None]:
all_recs

2767052     1092
29983711    1089
2657        1074
3           1048
4671        1028
            ... 
21843400       1
18595019       1
22514204       1
22733082       1
18781576       1
Name: book_id, Length: 364169, dtype: int64

In [None]:
all_recs = all_recs.to_frame().reset_index() # convert the Series into a dataframe and move the index, which holds the book_id, into a column

In [None]:
all_recs

Unnamed: 0,index,book_id
0,2767052,1092
1,29983711,1089
2,2657,1074
3,3,1048
4,4671,1028
...,...,...
364164,21843400,1
364165,18595019,1
364166,22514204,1
364167,22733082,1


Looking at the all_recs output, the columns are named incorrectly: book_id actually holds how many times each book appears, and index holds the actual book_id. To fix this, we'll rename the columns.

In [None]:
all_recs.columns = ["book_id", "book_count"]

In [None]:
all_recs

Unnamed: 0,book_id,book_count
0,2767052,1092
1,29983711,1089
2,2657,1074
3,3,1048
4,4671,1028
...,...,...
364164,21843400,1
364165,18595019,1
364166,22514204,1
364167,22733082,1


In [None]:
all_recs = all_recs.merge(books_titles, how="inner", on="book_id") # merge the counts with the book titles; an inner merge drops any book that does not exist in both

In [None]:
all_recs

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title
0,2767052,1092,"The Hunger Games (The Hunger Games, #1)",4899965,https://www.goodreads.com/book/show/2767052-th...,https://images.gr-assets.com/books/1447303603m...,the hunger games the hunger games 1
1,29983711,1089,Pachinko,8161,https://www.goodreads.com/book/show/29983711-p...,https://images.gr-assets.com/books/1462393298m...,pachinko
2,2657,1074,To Kill a Mockingbird,3255518,https://www.goodreads.com/book/show/2657.To_Ki...,https://images.gr-assets.com/books/1361975680m...,to kill a mockingbird
3,3,1048,Harry Potter and the Sorcerer's Stone (Harry P...,4765497,https://www.goodreads.com/book/show/3.Harry_Po...,https://images.gr-assets.com/books/1474154022m...,harry potter and the sorcerers stone harry pot...
4,4671,1028,The Great Gatsby,2758812,https://www.goodreads.com/book/show/4671.The_G...,https://images.gr-assets.com/books/1490528560m...,the great gatsby
...,...,...,...,...,...,...,...
328338,22707746,1,Names Can Never Hurt Me,297,https://www.goodreads.com/book/show/22707746-n...,https://images.gr-assets.com/books/1405051347m...,names can never hurt me
328339,21843400,1,Blackbird Knitting in a Bunny's Lair (Granby K...,604,https://www.goodreads.com/book/show/21843400-b...,https://images.gr-assets.com/books/1396575651m...,blackbird knitting in a bunnys lair granby kni...
328340,18595019,1,Bar None,25,https://www.goodreads.com/book/show/18595019-b...,https://images.gr-assets.com/books/1380480671m...,bar none
328341,22514204,1,Unexpected Trust (Unexpected #2),121,https://www.goodreads.com/book/show/22514204-u...,https://images.gr-assets.com/books/1403721300m...,unexpected trust unexpected 2


In [None]:
all_recs["score"] = all_recs["book_count"] * (all_recs["book_count"] / all_recs["ratings"]) # book_count is how many times the book appears among users with similar taste; dividing by the total number of ratings discounts overall popularity so the score reflects our shared interests
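
A short worked example shows how this score separates niche matches from generally popular books. The sketch below applies the same formula to two rows copied from the all_recs table above; the function name `score` is an assumption.

```python
def score(book_count: int, ratings: int) -> float:
    """book_count squared divided by the book's total number of Goodreads ratings."""
    return book_count * (book_count / ratings)

# Values copied from the all_recs table above.
pachinko = score(1089, 8161)         # Pachinko
hunger_games = score(1092, 4899965)  # The Hunger Games (The Hunger Games, #1)

print(round(pachinko, 6), round(hunger_games, 6))
```

Both books appear about as often among similar users, but Pachinko scores about 145.3 while The Hunger Games scores well below 1, because its count is small relative to its millions of ratings.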

# **Top** **10** **Books** **Recommended** **To** **Us**

In [None]:
all_recs.sort_values("score",ascending=False).head(10) # show the top 10 recommendations based on the new score

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
1,29983711,1089,Pachinko,8161,https://www.goodreads.com/book/show/29983711-p...,https://images.gr-assets.com/books/1462393298m...,pachinko,145.315648
238,4408,327,East of Eden,3447,https://www.goodreads.com/book/show/4408.East_...,https://images.gr-assets.com/books/1323882457m...,east of eden,31.020888
724,9317691,175,The Name of the Wind (The Kingkiller Chronicle...,1043,https://www.goodreads.com/book/show/9317691-th...,https://images.gr-assets.com/books/1360558233m...,the name of the wind the kingkiller chronicle 1,29.362416
236,32920226,328,"Sing, Unburied, Sing",4592,https://www.goodreads.com/book/show/32920226-s...,https://images.gr-assets.com/books/1499340866m...,sing unburied sing,23.428571
216,30753987,342,The Leavers,5602,https://www.goodreads.com/book/show/30753987-t...,https://images.gr-assets.com/books/1489158974m...,the leavers,20.878972
7617,26856502,27,"Vengeful (Villains, #2)",35,https://www.goodreads.com/book/show/26856502-v...,https://s.gr-assets.com/assets/nophoto/book/11...,vengeful villains 2,20.828571
1287,31147619,118,Homegoing,697,https://www.goodreads.com/book/show/31147619-h...,https://images.gr-assets.com/books/1491119004m...,homegoing,19.977044
5517,34927828,37,The Great Alone,70,https://www.goodreads.com/book/show/34927828-t...,https://images.gr-assets.com/books/1501852384m...,the great alone,19.557143
249,8153988,322,"The Eye of the World (Wheel of Time, #1)",5740,https://www.goodreads.com/book/show/8153988-th...,https://images.gr-assets.com/books/1465920672m...,the eye of the world wheel of time 1,18.063415
6011,35099035,34,Red Clocks,67,https://www.goodreads.com/book/show/35099035-r...,https://images.gr-assets.com/books/1494345016m...,red clocks,17.253731


Looking at the table above, the top three results are books from our own liked list. The other books have fewer ratings overall; they rank highly not because most Goodreads users like them, but because users who share our interests read them.

In [None]:
popular_recs = all_recs[all_recs["book_count"] > 75].sort_values("score", ascending=False) # only keep recommendations where book_count is greater than 75

In [None]:
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)


popular_recs[~popular_recs["book_id"].isin(liked_books)].head(10).style.format({'url': make_clickable, 'cover_image': show_image})

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
236,32920226,328,"Sing, Unburied, Sing",4592,Goodreads,,sing unburied sing,23.428571
216,30753987,342,The Leavers,5602,Goodreads,,the leavers,20.878972
441,33253215,236,The Heart's Invisible Furies,3629,Goodreads,,the hearts invisible furies,15.347479
692,33280160,181,What We Lose,2250,Goodreads,,what we lose,14.560444
671,33621427,184,Home Fire,2390,Goodreads,,home fire,14.16569
763,21032488,169,"Doors of Stone (The Kingkiller Chronicle, #3)",2059,Goodreads,,doors of stone the kingkiller chronicle 3,13.871297
990,30971664,142,Salt Houses,1474,Goodreads,,salt houses,13.679783
71,30688435,533,Exit West,21378,Goodreads,,exit west,13.288848
251,32283423,321,American War,7776,Goodreads,,american war,13.251157
228,26025588,335,Behold the Dreamers,8793,Goodreads,,behold the dreamers,12.762993


# **Collaborative** **Filtering**

Collaborative filtering is another way to generate the book recommendations. It is an extension of the previous analysis.

It uses the following files:

*   Goodreads interactions (goodreads_interactions.csv)
*   books_titles.json, with the fields book_id, title, ratings, url, and cover_image
*   book_id map (book_id_map.csv)

In [None]:
import pandas as pd

my_books = pd.read_csv("/content/drive/MyDrive/Data Magic Uploads/Data/liked_books.csv", index_col=0) # read the file of liked books, created from our own interests

In [None]:
my_books

Unnamed: 0,user_id,book_id,rating,title
0,-1,2517439,5,"The Forever War (The Forever War, #1)"
1,-1,113576,5,The Smartest Guys in the Room: The Amazing Ris...
2,-1,35100,5,Battle Cry of Freedom
3,-1,228221,5,The Mask of Command
5,-1,17662739,5,"2001: A Space Odyssey (Space Odyssey, #1)"
6,-1,356824,5,India After Gandhi: The History of the World's...
7,-1,12125412,5,The Lady or the Tiger?: and Other Logic Puzzles
8,-1,139069,5,Endurance: Shackleton's Incredible Voyage
10,-1,76680,5,"Foundation (Foundation, #1)"
11,-1,1898,5,Into Thin Air: A Personal Account of the Mount...


In [None]:
my_books["book_id"] = my_books["book_id"].astype(str) # make sure book_id is a string to match it with other files

Finding Similar Users based on book interest

In [None]:
!head "/content/drive/MyDrive/Data Magic Uploads/Data/book_id_map.csv"

book_id_csv,book_id
0,34684622
1,34536488
2,34017076
3,71730
4,30422361
5,33503613
6,33517540
7,34467031
8,6383669


In [None]:
csv_book_mapping = {} # read the file line by line instead of loading it all at once, because the file is huge

with open("/content/drive/MyDrive/Data Magic Uploads/Data/book_id_map.csv", "r") as f:
    for line in f:
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id

In [None]:
book_set = set(my_books["book_id"]) # a set holds each book_id only once, so this contains all the unique books we read

In [None]:
overlap_users = {} # dictionary mapping each user_id to the number of our liked books that user has also read

with open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.strip().split(",") # an underscore marks a field we do not need; this keeps user_id, csv_id, and rating

        book_id = csv_book_mapping.get(csv_id) # .get returns None if the key is not found

        if book_id in book_set: # if it is one of the books we read, add that user to the overlap_users dictionary
            if user_id not in overlap_users:
                overlap_users[user_id] = 1
            else:
                overlap_users[user_id] += 1 # count how many of our liked books this user has also read

In [None]:
len(overlap_users)

316341

In [None]:
filtered_overlap_users = set([k for k in overlap_users if overlap_users[k] > my_books.shape[0]/5]) # keep users who read more than one fifth (20%) of the books we liked
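To make the overlap count and the 20% threshold concrete, here is a minimal sketch on made-up data. The names `toy_interactions`, `toy_liked`, `toy_overlap`, and `toy_filtered` are hypothetical and only stand in for the real files; `collections.Counter` is used here as a shorter alternative to the manual dictionary above.

```python
from collections import Counter

# Hypothetical (user_id, book_id) pairs standing in for the interactions file
toy_interactions = [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b9"),
    ("u2", "b1"),
    ("u3", "b1"), ("u3", "b2"), ("u3", "b3"),
]
toy_liked = {"b1", "b2", "b3", "b4", "b5"}  # assumed set of books we liked

# For each user, count how many of our liked books they also read
toy_overlap = Counter(user for user, book in toy_interactions if book in toy_liked)

# Keep users who share more than one fifth of our liked books (here: more than 1 book)
threshold = len(toy_liked) / 5
toy_filtered = {user for user, n in toy_overlap.items() if n > threshold}

print(dict(toy_overlap))     # {'u1': 2, 'u2': 1, 'u3': 3}
print(sorted(toy_filtered))  # ['u1', 'u3']
```

User `u2` shares only one book with us, which is not more than the threshold of 1, so that user is dropped.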

Finding Similar User Book Ratings

In [None]:
interactions_list = []

with open("/content/drive/MyDrive/Data Magic Uploads/Data/goodreads_interactions.csv") as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.strip().split(",")

        if user_id in filtered_overlap_users: # if this user is one of the users who overlap with our books, add their reading history to the interactions list
            book_id = csv_book_mapping[csv_id]
            interactions_list.append([user_id, book_id, rating])

# **Build** **A** **collaborative** **filtering** **matrix**

We build a user-book matrix: every row of the matrix is a different user, every column is a different book, and each cell contains the rating that user gave to that book.
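As a small illustration of that layout, the sketch below builds a dense user-book table from made-up ratings with `pivot_table`. The data and the names `toy` and `toy_matrix` are hypothetical. A dense table like this is only practical for tiny examples, because with hundreds of thousands of books almost every cell would be empty.

```python
import pandas as pd

# Hypothetical ratings, not taken from the Goodreads data
toy = pd.DataFrame({
    "user_id": ["-1", "-1", "282", "282", "874"],
    "book_id": ["b1", "b2", "b1", "b3", "b2"],
    "rating":  [5, 5, 4, 3, 2],
})

# Rows are users, columns are books; a missing rating is filled with 0
toy_matrix = toy.pivot_table(index="user_id", columns="book_id",
                             values="rating", fill_value=0)
print(toy_matrix)
```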

In [None]:
len(interactions_list) # how many interactions are in the list

5638701

In [None]:
interactions_list[0] # first item in the list: [user_id, book_id, rating]

['282', '627206', '4']

In [None]:
interactions = pd.DataFrame(interactions_list, columns=["user_id", "book_id", "rating"]) # turn into a dataframe

In [None]:
interactions = pd.concat([my_books[["user_id", "book_id", "rating"]], interactions]) # add our own ratings to everyone else's ratings

In [None]:
interactions

Unnamed: 0,user_id,book_id,rating
0,-1,2517439,5
1,-1,113576,5
2,-1,35100,5
3,-1,228221,5
5,-1,17662739,5
...,...,...,...
5638696,804100,475178,0
5638697,804100,186074,0
5638698,804100,153008,0
5638699,804100,45107,0


Looking at the table, our own ratings appear under user_id -1, followed by the ratings of the other users.

In [None]:
interactions["book_id"] = interactions["book_id"].astype(str) # book_id and user_id are strings, as in the JSON file
interactions["user_id"] = interactions["user_id"].astype(str)
interactions["rating"] = pd.to_numeric(interactions["rating"]) # ratings are numbers

In [None]:
interactions["user_id"].unique()

array(['-1', '282', '874', ..., '442043', '712588', '804100'],
      dtype=object)

Looking at the user_id column, the IDs are shown as long numbers. The goal is to make each user_id correspond to a single row in a matrix.

In [None]:
interactions["user_index"] = interactions["user_id"].astype("category").cat.codes # every occurrence of the same user_id is converted to the same integer code

In [None]:
interactions["user_index"].unique()

array([   0,  555, 1216, ..., 1054, 1143, 1183], dtype=int16)
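A short sketch of what `cat.codes` does may help here. The IDs below are made up, and `ids` and `codes` are hypothetical names. Because the IDs are strings, the categories are sorted as text, which would explain why our user_id `-1` ends up with index 0.

```python
import pandas as pd

# Hypothetical user IDs stored as strings
ids = pd.Series(["-1", "282", "874", "282", "-1"])

# Each distinct ID gets one integer code; repeated IDs share the same code
codes = ids.astype("category").cat.codes
print(codes.tolist())  # [0, 1, 2, 1, 0]
```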

In [None]:
interactions["book_index"] = interactions["book_id"].astype("category").cat.codes

In [None]:
from scipy.sparse import coo_matrix # COO sparse matrix, built from (values, (row positions, column positions))

ratings_mat_coo = coo_matrix((interactions["rating"], (interactions["user_index"], interactions["book_index"])))

In [None]:
ratings_mat_coo.shape

(1259, 802870)
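The same constructor can be tried on a tiny example to see how the three arrays map to cells. The values and the names `ratings`, `rows`, `cols`, and `toy_coo` are hypothetical; only the non-zero cells are stored, which is what makes a 1259 by 802870 matrix manageable.

```python
import numpy as np
from scipy.sparse import coo_matrix

ratings = np.array([5, 4, 3])  # hypothetical rating values
rows = np.array([0, 1, 1])     # user_index of each rating
cols = np.array([0, 0, 2])     # book_index of each rating

# The shape is inferred from the largest row and column index
toy_coo = coo_matrix((ratings, (rows, cols)))
print(toy_coo.shape)      # (2, 3)
print(toy_coo.toarray())  # dense view, only sensible for small matrices
```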

In [None]:
ratings_mat = ratings_mat_coo.tocsr() # convert coo to csr format

Finding Users Similar to Us

In [None]:
interactions[interactions["user_id"] == "-1"] # find row for specific user

Unnamed: 0,user_id,book_id,rating,user_index,book_index
0,-1,2517439,5,0,414880
1,-1,113576,5,0,38971
2,-1,35100,5,0,575858
3,-1,228221,5,0,356004
5,-1,17662739,5,0,214285
6,-1,356824,5,0,581743
7,-1,12125412,5,0,59763
8,-1,139069,5,0,124430
10,-1,76680,5,0,722098
11,-1,1898,5,0,276178


In [None]:
my_index = 0 # our user_id (-1) is row zero in the book rating matrix

In [None]:
from sklearn.metrics.pairwise import cosine_similarity # cosine similarity measures how similar two rows of the matrix are

similarity = cosine_similarity(ratings_mat[my_index,:], ratings_mat).flatten()

In [None]:
similarity[2] # similarity of one user to us; the lower the value, the less that user's taste in books matches ours

0.06143442518998915
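To give a feel for the scale of these values, here is a minimal sketch with made-up rating rows. The array `toy_ratings` and the result `toy_sim` are hypothetical; the call itself is the same `cosine_similarity` used above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_ratings = np.array([
    [5, 5, 0],  # row 0: us
    [5, 5, 0],  # same ratings as us
    [0, 0, 4],  # no books in common with us
    [5, 0, 0],  # one book in common with us
])

# Compare our row against every row, including our own
toy_sim = cosine_similarity(toy_ratings[0:1], toy_ratings).flatten()
print(toy_sim.round(3))
```

A user with identical ratings scores 1, a user with no books in common scores 0, and partial overlap falls in between, so a value such as 0.06 indicates only a small overlap.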

In [None]:
import numpy as np

indices = np.argpartition(similarity, -15)[-15:] # find the indices of the 15 users most similar to us in terms of book taste

In [None]:
indices

array([1188,  942,  218,  129,  496,  435, 1208,  795, 1213, 1210, 1143,
        321,  294,  862,    0])
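Note that `np.argpartition` returns the top entries in no guaranteed order, and that index 0 (ourselves) is among them because our similarity to ourselves is 1. The sketch below shows this on made-up scores; `toy_similarity`, `top3`, and `top3_sorted` are hypothetical names, and the extra sorting step is an optional addition, not something the cells above do.

```python
import numpy as np

toy_similarity = np.array([1.0, 0.1, 0.7, 0.3, 0.9])  # hypothetical scores; index 0 is us

# Indices of the 3 largest values, in no guaranteed order
top3 = np.argpartition(toy_similarity, -3)[-3:]

# Optional: order those indices from most to least similar
top3_sorted = top3[np.argsort(toy_similarity[top3])[::-1]]
print(top3_sorted)  # [0 4 2]
```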

In [None]:
similar_users = interactions[interactions["user_index"].isin(indices)].copy() # keep the rows whose user_index is one of the most similar users

In [None]:
similar_users = similar_users[similar_users["user_id"]!="-1"] # remove user_id -1, which is us, so that we do not get our own books as recommendations

In [None]:
similar_users

Unnamed: 0,user_id,book_id,rating,user_index,book_index
45312,4133,5359,3,942,632143
45313,4133,10464963,4,942,13492
45314,4133,3858,3,942,593622
45315,4133,11827808,4,942,51904
45316,4133,7913305,4,942,732465
...,...,...,...,...,...
5638521,712588,32388712,3,1143,543119
5638522,712588,16322,5,1143,183365
5638523,712588,860543,0,1143,759827
5638524,712588,853510,5,1143,756768


Looking at the table, we have 4302 rows of potential books to read, taken from the users who are most similar to us.

Creating book recommendations

In [None]:
book_recs = similar_users.groupby("book_id").rating.agg(['count', 'mean']) # for each book, how many similar users have it and the mean rating they gave it

In [None]:
book_recs

Unnamed: 0_level_0,count,mean
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,6,3.833333
100322,1,0.000000
100365,1,0.000000
10046142,1,0.000000
1005,3,0.000000
...,...,...
99561,2,2.500000
99610,1,3.000000
99664,1,4.000000
9969571,3,2.333333


Looking at the table, each row is a book. The count is how many of the similar users have that book in their history, and the mean is the average rating those users gave it.
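The aggregation can be checked by hand on a small made-up sample. The frame `toy_similar` and the result `toy_recs` are hypothetical. The sample also shows that a rating of 0 counts toward the mean and pulls it down, which is why some books in the table above have a mean of 0.

```python
import pandas as pd

# Hypothetical rows from similar users
toy_similar = pd.DataFrame({
    "book_id": ["b1", "b1", "b2", "b3", "b3", "b3"],
    "rating":  [5, 4, 0, 3, 4, 5],
})

# One row per book: number of similar users who have it, and their mean rating
toy_recs = toy_similar.groupby("book_id")["rating"].agg(["count", "mean"])
print(toy_recs)
```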

In [None]:
books_titles = pd.read_json("books_titles.json") # get the book title
books_titles["book_id"] = books_titles["book_id"].astype(str) # ensure the book id is a string

In [None]:
book_recs = book_recs.merge(books_titles, how="inner", on="book_id") # merge the two data sets to get the title and metadata for each book_id

In [None]:
book_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...
1,100322,1,0.000000,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography
2,100365,1,0.000000,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye
3,10046142,1,0.000000,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...
4,1005,3,0.000000,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich
...,...,...,...,...,...,...,...,...
2843,99561,2,2.500000,Looking for Alaska,804587,https://www.goodreads.com/book/show/99561.Look...,https://images.gr-assets.com/books/1394798630m...,looking for alaska
2844,99610,1,3.000000,The Best Laid Plans,17434,https://www.goodreads.com/book/show/99610.The_...,https://images.gr-assets.com/books/1353374848m...,the best laid plans
2845,99664,1,4.000000,The Painted Veil,24606,https://www.goodreads.com/book/show/99664.The_...,https://images.gr-assets.com/books/1320421719m...,the painted veil
2846,9969571,3,2.333333,Ready Player One,376328,https://www.goodreads.com/book/show/9969571-re...,https://images.gr-assets.com/books/1500930947m...,ready player one


Ranking the book recommendations

In [None]:
book_recs["adjusted_count"] = book_recs["count"] * (book_recs["count"] / book_recs["ratings"]) # adjusted count: how often the book appears among users with our interests, scaled down by how many ratings the book has overall

In [None]:
book_recs["score"] = book_recs["mean"] * book_recs["adjusted_count"] # score: how much we might like the book, combining the average rating from users like us with the adjusted count
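As a worked check of the two formulas, the sketch below recomputes the score for two rows of the recommendation table shown later in this section (A Storm of Swords, and Kane and Abel). The helper name `rec_score` is hypothetical; the arithmetic is the same as in the two cells above.

```python
def rec_score(count, ratings, mean):
    """Return mean * count * (count / ratings), as in the cells above."""
    adjusted_count = count * (count / ratings)
    return mean * adjusted_count

# A Storm of Swords: count=5, ratings=477834, mean=4.8
print(round(rec_score(5, 477834, 4.8), 6))   # 0.000251

# Kane and Abel: count=4, ratings=75215, mean=4.25
print(round(rec_score(4, 75215, 4.25), 6))   # 0.000904
```

Kane and Abel gets the higher score even though its mean rating is lower, because it has far fewer ratings overall.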

In [None]:
book_recs = book_recs[~book_recs["book_id"].isin(my_books["book_id"])]

In [None]:
my_books["mod_title"] = my_books["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True).str.lower() # take the titles of the books we liked, remove any character that is not a letter, digit, or space, and convert to lowercase

In [None]:
my_books["mod_title"] = my_books["mod_title"].str.replace(r"\s+", " ", regex=True) # replace multiple spaces in a row with one single space
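The two cleaning steps can be tried on single titles with the standard `re` module. The helper name `normalize_title` is hypothetical; the patterns are the same ones used above, and the sample titles come from the recommendation tables in this section.

```python
import re

def normalize_title(title):
    """Drop everything except letters, digits, and spaces, lowercase, then collapse repeated spaces."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title).lower()
    return re.sub(r"\s+", " ", cleaned)

print(normalize_title("Sing, Unburied, Sing"))         # sing unburied sing
print(normalize_title("The Heart's Invisible Furies"))  # the hearts invisible furies
```

Matching on the cleaned title helps catch a book we already read when it appears under a different book_id, for example another edition.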

In [None]:
book_recs = book_recs[~book_recs["mod_title"].isin(my_books["mod_title"])] # remove the books we already read and liked from the recommendations

In [None]:
book_recs = book_recs[book_recs["count"]>2] # keep only books that appear more than twice

In [None]:
book_recs = book_recs[book_recs["mean"] >=4] # keep only books where the mean rating is at least 4

In [None]:
top_recs = book_recs.sort_values("mean", ascending=False) # sort the data by mean rating, highest first

In [None]:
top_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,score
2260,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,https://www.goodreads.com/book/show/62291.A_St...,https://images.gr-assets.com/books/1497931121m...,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
600,157993,3,4.333333,The Little Prince,763309,https://www.goodreads.com/book/show/157993.The...,https://images.gr-assets.com/books/1367545443m...,the little prince,1.2e-05,5.1e-05
1100,22034,3,4.333333,The Godfather,259150,https://www.goodreads.com/book/show/22034.The_...,https://images.gr-assets.com/books/1394988109m...,the godfather,3.5e-05,0.00015
1173,2318271,3,4.333333,The Last Lecture,245804,https://www.goodreads.com/book/show/2318271.Th...,https://images.gr-assets.com/books/1388075896m...,the last lecture,3.7e-05,0.000159
1906,4381,3,4.333333,Fahrenheit 451,591506,https://www.goodreads.com/book/show/4381.Fahre...,https://images.gr-assets.com/books/1351643740m...,fahrenheit 451,1.5e-05,6.6e-05
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,https://www.goodreads.com/book/show/119322.The...,https://images.gr-assets.com/books/1505766203m...,the golden compass his dark materials 1,1.6e-05,7e-05
1441,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,https://www.goodreads.com/book/show/2767793-th...,https://images.gr-assets.com/books/1480717763m...,the hero of ages mistborn 3,0.000107,0.000456
2558,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,https://www.goodreads.com/book/show/78983.Kane...,https://s.gr-assets.com/assets/nophoto/book/11...,kane and abel kane and abel 1,0.000213,0.000904
244,119324,3,4.0,"The Subtle Knife (His Dark Materials, #2)",246697,https://www.goodreads.com/book/show/119324.The...,https://images.gr-assets.com/books/1505766360m...,the subtle knife his dark materials 2,3.6e-05,0.000146
398,13497,4,4.0,"A Feast for Crows (A Song of Ice and Fire, #4)",437398,https://www.goodreads.com/book/show/13497.A_Fe...,https://images.gr-assets.com/books/1429538615m...,a feast for crows a song of ice and fire 4,3.7e-05,0.000146


In [None]:
def make_clickable(val): # style the data frame so that we can click on the link
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val): # show the cover image as a small thumbnail
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

top_recs.style.format({'url': make_clickable, 'cover_image': show_image})

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,score
2260,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,Goodreads,,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
600,157993,3,4.333333,The Little Prince,763309,Goodreads,,the little prince,1.2e-05,5.1e-05
1100,22034,3,4.333333,The Godfather,259150,Goodreads,,the godfather,3.5e-05,0.00015
1173,2318271,3,4.333333,The Last Lecture,245804,Goodreads,,the last lecture,3.7e-05,0.000159
1906,4381,3,4.333333,Fahrenheit 451,591506,Goodreads,,fahrenheit 451,1.5e-05,6.6e-05
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,Goodreads,,the golden compass his dark materials 1,1.6e-05,7e-05
1441,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,Goodreads,,the hero of ages mistborn 3,0.000107,0.000456
2558,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,Goodreads,,kane and abel kane and abel 1,0.000213,0.000904
244,119324,3,4.0,"The Subtle Knife (His Dark Materials, #2)",246697,Goodreads,,the subtle knife his dark materials 2,3.6e-05,0.000146
398,13497,4,4.0,"A Feast for Crows (A Song of Ice and Fire, #4)",437398,Goodreads,,a feast for crows a song of ice and fire 4,3.7e-05,0.000146


# **Conclusion**

Building a book recommendation tool for beginner adult readers is a valuable way to help them discover books they will enjoy and increase their motivation to read. The tool uses machine learning techniques and publicly available data to generate personalized recommendations based on users' interests, preferences, and reading levels.
The data comes from the Goodreads dataset collected by researchers at UCSD, which provides book information and user ratings; it was used because the Goodreads API is no longer available. Collaborative filtering was used to analyze the similarity between users and recommend books that are popular among similar users. The tool was built to recommend the kinds of books we like, based on the interests of other users whose taste is similar to ours. The search engine can be updated to find a different genre of books once we have finished reading the books from the top 10 recommendations.