# Book Recommendation Engine

> This project seeks to create a book recommendation engine using K-Means Nearest Neighbors. It uses a dataset from the Goodreads API before it closed. The notebook collects the data from the internet, installs it automatically, combines the dataframe into a single dataframe, employs a search engine for users to find the books they like, and uses a K-Means-based recommendation engine that creates 5 randomly picked books that the user may potentially enjoy (similar to what you may see on goodreads, which gives random recommendations based on enjoyed books.)

### NOTE: A few things require using a specific directory after you download some files, so you will need to change your directory from mine.

# Downloading the Files

In [1]:
conda install -c conda-forge gdown

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import gdown
import os

In [2]:
#The data is stored in a number of google drive links as json.gzs. This is a way for me to get all the different
#google drive links as a list, so that I can import what I need specifically.

files = pd.read_csv('https://raw.githubusercontent.com/MengtingWan/goodreads/master/gdrive_id.csv')

In [4]:
#Code found on the github page for the data, allows me to pull down the specific json.gzs that I need. Basically,
#it translates the above ids into their names.

file_id_map = dict(zip(files['name'].values, files['id'].values))

def download_by_name(fname, output=None, quiet=False):
    if fname in file_id_map:
        url = 'https://drive.google.com/uc?id='+file_id_map[fname]
        gdown.download(url, output=output, quiet=quiet)
    else:
        print('The file', fname, 'can not be found!')

In [21]:
#Downloading necessary files: these are pretty big, so you'll need a bit of storage (about 10 GB should be plenty). The
#first file is for information on the books itself, the second for ratings of books, and the final one for matching
#the ids in the two datasets.

download_by_name('goodreads_books.json.gz')
download_by_name('goodreads_interactions.csv')
download_by_name('book_id_map.csv')

Downloading...
From: https://drive.google.com/uc?id=1CHTAaNwyzvbi1TR08MJrJ03BxA266Yxr
To: C:\Users\isebby\Downloads\book_id_map.csv
100%|█████████████████████████████████████████████████████████████████████████████| 37.8M/37.8M [00:01<00:00, 36.6MB/s]


# Importing the files into the environment and merging them

In [5]:
#You will probably need 16 GB of RAM to do this in a reasonable amount of time. This imports the interactions data in chunks
#and then concatenates them to create a dataframe. You will need to change your directory here.

import time

s_time_chunk = time.time()
chunk = pd.read_csv("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\goodreads_interactions.csv", usecols = ['user_id','book_id', 'rating'], chunksize = 50000)
e_time_chunk = time.time()
  
print("With chunks: ", (e_time_chunk-s_time_chunk), "sec")
InteractionsDF = pd.concat(chunk)

With chunks:  0.07816886901855469 sec


In [6]:
InteractionsDF

Unnamed: 0,user_id,book_id,rating
0,0,948,5
1,0,947,5
2,0,946,5
3,0,945,5
4,0,944,5
...,...,...,...
228648337,876144,24772,0
228648338,876144,23847,4
228648339,876144,23950,3
228648340,876144,374106,5


In [None]:
#Different method needed for json to do a similar thing as above.

In [8]:
import gzip
import json
def parse_fields(line):
    data = json.loads(line)
    return {
        "title": data["title_without_series"],
        "book_id": data["book_id"],
        "ratings_count": data["ratings_count"]
    }

In [9]:
#This uses the function above. This will take quite a while.
booktitles = []
with gzip.open("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\goodreads_books.json.gz", 'r') as f: #Opens the file
    while True:                                      #Reads in every line, essentially
        line = f.readline()                          #Reads in the line
        if not line:                                 #Keeps this from being an infinite loop
            break
        fields = parse_fields(line)                  #With all the lines imported in, we get the specific values we need
        try:
            ratings=int(fields["ratings_count"])     #Make ratings_count an integer.
        except ValueError:
            continue
        if ratings > 50000:                          #Sets rating threshold
            booktitles.append(fields)                #Appends book information to list

In [10]:
books = pd.DataFrame.from_dict(booktitles) #Takes the booktitles list above and makes it into a pandas dataframe
books["book_id"]=books["book_id"].astype(int) #Sets the book ID type as an integer
books["ratings_count"]=books["ratings_count"].astype(int) #Sets the ratings count as an integer
books

Unnamed: 0,title,book_id,ratings_count
0,Best Friends Forever,6066819,51184
1,90 Minutes in Heaven: A True Story of Death an...,89375,68157
2,I Am the Messenger,19057,94968
3,All the Light We Cannot See,19398490,53342
4,Born a Crime: Stories From a South African Chi...,29780253,57318
...,...,...,...
2094,"The Scorch Trials (Maze Runner, #2)",7631105,312407
2095,Breakfast at Tiffany's,251688,134187
2096,The Thief Lord,113304,61489
2097,"The Wonderful Wizard of Oz (Oz, #1)",236093,251691


In [11]:
#This cell reads in the book id map csv, which matches the books data with the interactions data, which have two different types of book ids.
bookmap = pd.read_csv("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\book_id_map.csv") 
bookmap

Unnamed: 0,book_id_csv,book_id
0,0,34684622
1,1,34536488
2,2,34017076
3,3,71730
4,4,30422361
...,...,...
2360645,2360645,19517100
2360646,2360646,18597299
2360647,2360647,18584882
2360648,2360648,18518801


In [12]:
#Merges the book id map with the books dataframe
testdf = books.merge(bookmap, how = "inner", left_on = "book_id", right_on = "book_id")
testdf

Unnamed: 0,title,book_id,ratings_count,book_id_csv
0,Best Friends Forever,6066819,51184,14854
1,90 Minutes in Heaven: A True Story of Death an...,89375,68157,28677
2,I Am the Messenger,19057,94968,14799
3,All the Light We Cannot See,19398490,53342,153
4,Born a Crime: Stories From a South African Chi...,29780253,57318,14502
...,...,...,...,...
2094,"The Scorch Trials (Maze Runner, #2)",7631105,312407,1591
2095,Breakfast at Tiffany's,251688,134187,7298
2096,The Thief Lord,113304,61489,14951
2097,"The Wonderful Wizard of Oz (Oz, #1)",236093,251691,12977


In [13]:
#Merges the interactions dataframe with the books dataframe to create a single combined dataframe. This may take a little while.
CompleteDF = testdf.merge(InteractionsDF, how = "inner", left_on = "book_id_csv", right_on = "book_id")
CompleteDF = CompleteDF[CompleteDF.rating != 0] #Ratings set to 0 means they left a written review for the book, but not a rating, so we drop these.
CompleteDF
#The completed dataframe.

Unnamed: 0,title,book_id_x,ratings_count,book_id_csv,user_id,book_id_y,rating
0,Best Friends Forever,6066819,51184,14854,22,14854,3
1,Best Friends Forever,6066819,51184,14854,90,14854,3
3,Best Friends Forever,6066819,51184,14854,120,14854,2
4,Best Friends Forever,6066819,51184,14854,301,14854,4
5,Best Friends Forever,6066819,51184,14854,308,14854,4
...,...,...,...,...,...,...,...
53695962,Coraline,17061,325562,7284,875737,7284,4
53695963,Coraline,17061,325562,7284,875822,7284,4
53695964,Coraline,17061,325562,7284,875832,7284,5
53695965,Coraline,17061,325562,7284,875881,7284,4


In [14]:
#Checks to see if any values are empty.
CompleteDF.isna().sum()

title            0
book_id_x        0
ratings_count    0
book_id_csv      0
user_id          0
book_id_y        0
rating           0
dtype: int64

In [51]:
#Downloads top 1000 rows of the dataset.
#print(CompleteDF.head(1000).to_csv("ProjectDataSet.csv"))

None


# Search Engine

#### In order to create a recommendation engine based on what the user likes, they need to be able to search through the dataframe to find the IDs of the books they like so that these IDs can be put through the recommendation engine to find what the user enjoys. This next section creates a search engine that helps the user find the IDs of the books they enjoy.

In [15]:
#These next few cells, up until titles is printed, create a modified title that will allow the search engine to more easily
#search for the requested title.

import pandas as pd

titles = testdf.copy()

In [16]:
titles["ratings_count"] = pd.to_numeric(titles["ratings_count"])
titles["mod_title"] = titles["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True)
titles["mod_title"] = titles["mod_title"].str.lower()
titles["mod_title"] = titles["mod_title"].str.replace("\s+", " ", regex=True)
titles = titles[titles["mod_title"].str.len() > 0]

In [17]:
titles

Unnamed: 0,title,book_id,ratings_count,book_id_csv,mod_title
0,Best Friends Forever,6066819,51184,14854,best friends forever
1,90 Minutes in Heaven: A True Story of Death an...,89375,68157,28677,90 minutes in heaven a true story of death and...
2,I Am the Messenger,19057,94968,14799,i am the messenger
3,All the Light We Cannot See,19398490,53342,153,all the light we cannot see
4,Born a Crime: Stories From a South African Chi...,29780253,57318,14502,born a crime stories from a south african chil...
...,...,...,...,...,...
2094,"The Scorch Trials (Maze Runner, #2)",7631105,312407,1591,the scorch trials maze runner 2
2095,Breakfast at Tiffany's,251688,134187,7298,breakfast at tiffanys
2096,The Thief Lord,113304,61489,14951,the thief lord
2097,"The Wonderful Wizard of Oz (Oz, #1)",236093,251691,12977,the wonderful wizard of oz oz 1


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

tfidf = vectorizer.fit_transform(titles["mod_title"])

#The vectorizer allows us to find titles that are similar to the ones we search based on the shared letters/string of letters.
#It turns our book titles into a vector based on their similarity.

In [19]:
#Now that we have our modified book titles stored as vectors, we need to turn our search into a vector as well.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re

def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val, val)

def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

def search(query,vectorizer):
    processed = re.sub("[^a-zA-Z0-9 ]", "", query.lower()) #Takes our processed modified title.
    query_vec = vectorizer.transform([query]) #Turns the modified title into a vector.
    similarity = cosine_similarity(query_vec, tfidf).flatten() #Takes our vectorized title and compares it to the vectors from the tfidf above.
    indices = np.argpartition(similarity, -10)[-10:]
    results = titles.iloc[indices]
    results = results.sort_values("ratings_count", ascending=False) #Sorts the results by ratings count.
    
    return results.head(5)

In [18]:
#Search the title of the book you enjoy.
search(input(""), vectorizer)

lord of the rings


Unnamed: 0,title,book_id,ratings_count,book_id_csv,mod_title
1950,The Fellowship of the Ring (The Lord of the Ri...,34,1813229,670,the fellowship of the ring the lord of the rin...
641,Lord of the Flies,7624,1638289,839,lord of the flies
339,"The Two Towers (The Lord of the Rings, #2)",15241,490005,669,the two towers the lord of the rings 2
1768,"The Return of the King (The Lord of the Rings,...",18512,473101,668,the return of the king the lord of the rings 3
1948,"The Lord of the Rings (The Lord of the Rings, ...",33,396933,459,the lord of the rings the lord of the rings 13


>As you can see, and as expected, multiple listings will likely show up for a book when it is searched, especially if it's part of a series. This creates a bit of a complication for the user, because they will have to directly copy the book ID and put it into a list themselves. If only one book showed up for each search, I could likely automatically copy the book ID into a list, but that doesn't seem possible with multiple listings. So, for each book you search, you will need to copy the book ID and put it into the list below.

In [20]:
#This is the list of book IDs for books I enjoy. You can change it by searching the title above and copying the book ID into the list.
#You will need to copy the book_id_csv for reasons listed above.
#You can also include multiple books in the list if you'd like, although you may run into memory issues or it may take longer to run.

likedbooks = ["459"] #Insert book_id_csv here.

# Recommendation Engine using K-Means Nearest Neighbors

In [21]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import sklearn
from sklearn.decomposition import TruncatedSVD

In [22]:
#Reimporting the book maps due to a few formatting/variable type issues.

csv_book_mapping = {}

with open("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\book_id_map.csv", "r") as f:
    while True:
        line = f.readline()
        if not line:
            break
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id

In [23]:
#This cell creates a list of overlap users who like the same books that we do. The rating threshold is set at 4, so
#those who rated the book we like at 4 stars or higher are included in the list of overlap users.


overlap_users = set()

with open("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",")
        
        if user_id in overlap_users:
            continue

        try:
            rating = int(rating)
        except ValueError:
            continue
        
        book_id = csv_book_mapping[csv_id]
        
        if book_id in likedbooks and rating >= 4:
                overlap_users.add(user_id)

In [24]:
#This creates a list of the books that overlap users also like, along with their ratings for those books.

rec_lines = []

with open("C:\\Users\\ianse\\Documents\\ECON 300 Project\\Data\\goodreads_interactions.csv", 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        user_id, csv_id, _, rating, _ = line.split(",")
        
        if user_id in overlap_users:
            book_id = csv_book_mapping[csv_id]
            rec_lines.append([user_id, book_id, rating])

In [25]:
#This takes the list of the books, along with their ratings, and puts them into a dataframe.
recs = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])

In [26]:
recs['user_id'] = recs['user_id'].astype(str).astype(int)
recs['book_id'] = recs['book_id'].astype(str).astype(int)
recs['rating'] = recs['rating'].astype(str).astype(int)

In [27]:
#Creates a new dataframe with our recommendations and the book_id map key.
KMeansDF = pd.merge(bookmap, recs, how = "inner", on = "book_id")
KMeansDF

Unnamed: 0,book_id_csv,book_id,user_id,rating
0,0,34684622,0,0
1,0,34684622,2001,0
2,0,34684622,29274,0
3,0,34684622,40903,0
4,0,34684622,42263,0
...,...,...,...,...
18685275,2360318,13742247,871734,5
18685276,2360494,1623457,874644,0
18685277,2360511,28814415,874863,5
18685278,2360512,2998512,874863,3


In [28]:
#Creates a dataframe of titles and book ids.
TitleDF1 = pd.DataFrame(books["book_id"])
TitleDF2 = pd.DataFrame(books["title"])
TitleDF = pd.concat([TitleDF1, TitleDF2], axis=1)
TitleDF

Unnamed: 0,book_id,title
0,6066819,Best Friends Forever
1,89375,90 Minutes in Heaven: A True Story of Death an...
2,19057,I Am the Messenger
3,19398490,All the Light We Cannot See
4,29780253,Born a Crime: Stories From a South African Chi...
...,...,...
2094,7631105,"The Scorch Trials (Maze Runner, #2)"
2095,251688,Breakfast at Tiffany's
2096,113304,The Thief Lord
2097,236093,"The Wonderful Wizard of Oz (Oz, #1)"


In [29]:
#A bit of variable type change.
TitleDF["book_id"] = TitleDF["book_id"].astype(str).astype(object)
KMeansDF["book_id"] = KMeansDF["book_id"].astype(str).astype(object)
KMeansDF["book_id_csv"] = KMeansDF["book_id_csv"].astype(str).astype(object)

In [30]:
#Creates a new dataframe that merges the titles and the recommendations + key. This is a dataframe that is a large list of
#highly rated books among overlap users. We will essentially be "sampling" from this list to produce some book recommendations.
recs = pd.merge(KMeansDF, TitleDF, on = "book_id", how = "inner")

In [31]:
#Removing some duplicates.
if not recs[recs.duplicated(['user_id', 'title'])].empty:
    initial_rows = recs.shape[0]

    print('Initial dataframe shape {0}'.format(recs.shape))
    recs = recs.drop_duplicates(['user_id', 'title'])
    current_rows = recs.shape[0]
    print('New dataframe shape {0}'.format(recs.shape))
    print('Removed {0} rows'.format(initial_rows - current_rows))

Initial dataframe shape (4885220, 5)
New dataframe shape (4881644, 5)
Removed 3576 rows


In [32]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import sklearn
from sklearn.decomposition import TruncatedSVD

In [33]:
#Creates a pivot table with titles and user ids, along with a sparse matrix that fills in the values with 0.
ratings_pivot = recs.pivot(index = 'title', columns = 'user_id', values = 'rating').fillna(0)
ratings_matrix = csr_matrix(ratings_pivot.values)

In [34]:
#Creating our K-Means Nearest Neighbors Model.
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(ratings_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

### Completed list of recommendations.

In [94]:
#Runs the model, creating a list of 5 random books that are highly rated among overlap users.
query_index = np.random.choice(ratings_pivot.shape[0])
distances, indices = model_knn.kneighbors(ratings_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for your liked book')
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, ratings_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for your liked book
1: The Omnivore's Dilemma: A Natural History of Four Meals, with distance of 0.8363237767853718:
2: What I Talk About When I Talk About Running, with distance of 0.8474978654585859:
3: Into Thin Air: A Personal Account of the Mount Everest Disaster, with distance of 0.8488919634388353:
4: A Walk in the Woods, with distance of 0.8577654306330256:
5: Outliers: The Story of Success, with distance of 0.8597737630976837:


> Note about the choice of randomization: I think having 5 random books is likely better than having 5 of the most similar books. For example, if your favorite book is Harry Potter: Prisoner of Azkaban, then when you put that through the recommendation engine and get the 5 most similar books, they will likely all be Harry Potter books. This would likely be the same for all series. By having randomization for the books, you will get books you likely have not read/heard of.

>There is room for improvement here. I was not able to find a way to specify a certain range of distances (i.e., only include a book with distance of 0.25 or less). This would find books even more similar to what you are looking for. For now, though, I believe that this process works.

# Areas for Improvement

> This project created a book recommendation engine that takes the input of the book that the user enjoyed, and outputs some random recommendations from a list of books that similar users enjoyed (which is a similar concept to how the Goodreads recommendation engine functions).

> There are some areas that could be improved, such as employing different machine learning algorithms, allowing the user to create a longer list of books, and including other factors in the engine (such as liked authors.) Another area of improvement would be a better interface. For example, allowing the user to see the cover of the book and click on a link to the book. However, this seems to accomplish the purpose of providing some recommendations of popular books that the user may enjoy based on a book that they already enjoy.