# Project Overview:
Goal: Further personalizing book recommendations!
What we need to achieve this:
- Users with similar taste in books
- A way to predict what we'll like based on what they like. 

# Sources:
Goodreads Book Data
- Scraped by UCSD researchers!
Available at: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home

## Some new files:

books_titles.json:
- For every book, this file has the title, the Goodreads URL, the cover image URL, and the number of ratings received on Goodreads.

# Project Steps
1. Find similar users
2. Create matrix
3. Recommend Books

# Reading in Books we like:

In [1]:
import pandas as pd

my_books = pd.read_csv("liked_books.csv", index_col=0)

In [2]:
my_books

Unnamed: 0,user_id,book_id,rating,title
0,-1,2517439,5,"The Forever War (The Forever War, #1)"
1,-1,113576,5,The Smartest Guys in the Room: The Amazing Ris...
2,-1,35100,5,Battle Cry of Freedom
3,-1,228221,5,The Mask of Command
5,-1,17662739,5,"2001: A Space Odyssey (Space Odyssey, #1)"
6,-1,356824,5,India After Gandhi: The History of the World's...
7,-1,12125412,5,The Lady or the Tiger?: and Other Logic Puzzles
8,-1,139069,5,Endurance: Shackleton's Incredible Voyage
10,-1,76680,5,"Foundation (Foundation, #1)"
11,-1,1898,5,Into Thin Air: A Personal Account of the Mount...


In [3]:
my_books["book_id"] = my_books["book_id"].astype(str)

# Finding Similar Users

We need to load the mapping file, which translates the book IDs used in goodreads_interactions.csv into Goodreads book IDs.

In [4]:
!head book_id_map.csv

book_id_csv,book_id
0,34684622
1,34536488
2,34017076
3,71730
4,30422361
5,33503613
6,33517540
7,34467031
8,6383669


In [5]:
csv_book_mapping = {}

with open("book_id_map.csv", "r") as f:
    next(f)  # skip the header row (book_id_csv,book_id)
    for line in f:
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id
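
As a minimal sketch of the same mapping step, the standard library's csv module can parse the file and handle the header row for us. The sample rows are copied from the head output above; the in-memory io.StringIO object and the name csv_book_mapping_demo are stand-ins for illustration only, not part of the notebook.

```python
import csv
import io

# Stand-in for book_id_map.csv (rows copied from the head output above).
sample = io.StringIO("book_id_csv,book_id\n0,34684622\n1,34536488\n3,71730\n")

# DictReader consumes the header row, so it never ends up in the mapping.
csv_book_mapping_demo = {
    row["book_id_csv"]: row["book_id"] for row in csv.DictReader(sample)
}

print(csv_book_mapping_demo)
```

Both keys and values stay strings here, which matches how the notebook later compares them against book_set.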

In [6]:
book_set = set(my_books["book_id"])

In [7]:
!head goodreads_interactions.csv

user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,1,5,1
0,946,1,5,0
0,945,1,5,0
0,944,1,5,0
0,943,1,5,0
0,942,1,5,0
0,941,1,5,0
0,940,1,5,0


In [8]:
# Pandas copies data a lot and this file is too big for that, so reading it line by line keeps memory use low.
!wc -l goodreads_interactions.csv

 228648343 goodreads_interactions.csv


In [9]:
# Dictionary: the keys are user_ids, the values count how many of our liked books each user has interacted with.
overlap_users = {}

with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")  # _ marks fields we don't care about

        book_id = csv_book_mapping.get(csv_id)  # We have csv_id but we want the book_id!

        # If this book is in our liked_books DF, count it for this user.
        # The count tracks how many of a given user's books overlap with the books we like.
        if book_id in book_set:
            overlap_users[user_id] = overlap_users.get(user_id, 0) + 1
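
The counting step above can also be sketched with collections.Counter on a tiny in-memory sample. The user names and the interactions_demo pairs are hypothetical; only the two liked book_ids come from our list.

```python
from collections import Counter

# Hypothetical (user_id, book_id) pairs standing in for the interactions file.
interactions_demo = [
    ("u1", "76680"), ("u1", "1898"), ("u1", "999"),
    ("u2", "76680"), ("u3", "555"),
]
liked_demo = {"76680", "1898"}  # two book_ids from our liked list

# Count, per user, how many interactions involve a book we liked.
overlap_demo = Counter(user for user, book in interactions_demo if book in liked_demo)

print(overlap_demo)
```

Users with no overlapping books (u3 here) never enter the counter, just as they never enter overlap_users.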

In [10]:
len(overlap_users)

316341

In [11]:
# We only want users who have read some of the same books as us. Users with no books in common
# won't be useful for our model. The set comprehension below keeps users whose overlap is more than 20% of our book count.
filtered_overlap_users = {k for k, n in overlap_users.items() if n > my_books.shape[0] / 5}
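
To make the threshold concrete, here is a small sketch with hypothetical overlap counts. It assumes a liked list of 10 books, which matches the 10 rows displayed in my_books above.

```python
# Hypothetical overlap counts per user.
counts_demo = {"u1": 3, "u2": 2, "u3": 7, "u4": 1}
n_books = 10  # assumed size of our liked list

threshold = n_books / 5  # 20% of our list
kept = {user for user, n in counts_demo.items() if n > threshold}

print(threshold, sorted(kept))
```

Because the comparison is strict, a user with exactly 2 overlapping books is dropped; with 10 liked books a user needs at least 3.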

# Finding Similar User Book Ratings

In [12]:
# Now that we have a small set of users:
# Creating a list of all the books they've read, because we might want to read those same books.
interactions_list = []

with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")

        if user_id in filtered_overlap_users:
            book_id = csv_book_mapping[csv_id]
            interactions_list.append([user_id, book_id, rating])

# Creating a User/Book Matrix

We are creating a user/book matrix in which every row is a different user and every column is a different book. Each cell contains the rating that the user gave that book. With a ton of users and books this matrix can easily become huge, hence our filtered set of users!
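
A tiny dense version of that layout, built from hypothetical (user_id, book_id, rating) triples, might look like the sketch below. The row_of and col_of lookups are illustrative names, not part of the notebook.

```python
# Hypothetical (user_id, book_id, rating) triples.
triples = [("-1", "76680", 5), ("282", "76680", 4), ("282", "1898", 3), ("874", "1898", 5)]

users = sorted({u for u, _, _ in triples})
books = sorted({b for _, b, _ in triples})
row_of = {u: i for i, u in enumerate(users)}
col_of = {b: j for j, b in enumerate(books)}

# Dense matrix: one row per user, one column per book, 0 where there is no rating.
matrix = [[0] * len(books) for _ in users]
for u, b, r in triples:
    matrix[row_of[u]][col_of[b]] = r

for u, row in zip(users, matrix):
    print(u, row)
```

Even in this toy case some cells are empty; at full scale almost all of them are.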

In [13]:
# how long is our interaction list?
len(interactions_list)

5638701

In [14]:
# Looking at the first item, user_id, book_id, rating!
interactions_list[0]

['282', '627206', '4']

In [15]:
# Turning it into a DataFrame!
interactions = pd.DataFrame(interactions_list, columns=["user_id", "book_id", "rating"])

In [16]:
# Add our own ratings to this matrix. 
interactions = pd.concat([my_books[["user_id", "book_id", "rating"]], interactions])

In [17]:
interactions

Unnamed: 0,user_id,book_id,rating
0,-1,2517439,5
1,-1,113576,5
2,-1,35100,5
3,-1,228221,5
5,-1,17662739,5
...,...,...,...
5638696,804100,475178,0
5638697,804100,186074,0
5638698,804100,153008,0
5638699,804100,45107,0


In [18]:
interactions["book_id"] = interactions["book_id"].astype(str)
interactions["user_id"] = interactions["user_id"].astype(str)
interactions["rating"] = pd.to_numeric(interactions["rating"])

In [19]:
interactions["user_id"].unique()

array(['-1', '282', '874', ..., '442043', '712588', '804100'],
      dtype=object)

In [20]:
interactions["user_index"] = interactions["user_id"].astype("category").cat.codes

In [21]:
len(interactions["user_index"].unique())

1259

In [22]:
interactions["book_index"] = interactions["book_id"].astype("category").cat.codes

In [23]:
len(interactions["book_index"].unique())

802870

In [24]:
1259 * 802870

1010813330

# The Structure of the matrix:
Each user_id converts to a row position (user_index), and each book_id converts to a column position (book_index).
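
The conversion from IDs to positions can be sketched in plain Python. This assumes, as pandas does for string values, that the category codes follow the sorted order of the unique values; the variable names are illustrative.

```python
# A few user_id strings taken from the unique() output above.
user_ids = ["282", "-1", "874", "-1", "442043"]

# Sorted unique values play the role of the categories;
# each value's position in that order is its integer code.
categories = sorted(set(user_ids))
code_of = {value: code for code, value in enumerate(categories)}
codes = [code_of[u] for u in user_ids]

print(categories)
print(codes)
```

The IDs are strings, so they sort character by character, and "-1" sorts ahead of every ID that starts with a digit. That is consistent with our own rows getting user_index 0 in this notebook.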

- We will be using a sparse matrix.

When we multiply our unique users by our unique books we get 1010813330. That's 1010813330 matrix cells, and storing a value in every one of them as a dense matrix would take a huge amount of memory. Instead, we use a sparse matrix, which is a little harder to work with but only stores the cells that actually hold a value, so the empty cells take up no memory. Pandas and NumPy tend to work with dense matrices, so we will be creating a SciPy sparse matrix to save memory!
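
A rough back-of-the-envelope estimate shows the gap. The stored-element count is the one reported for the COO matrix in this notebook; the byte sizes are assumptions (8-byte int64 values, 4-byte int32 row and column indices), so treat the result as an order of magnitude rather than a measurement.

```python
n_users, n_books = 1259, 802870
stored = 5638728  # stored elements reported for the COO matrix in this notebook

# Assumption: 8-byte int64 values; 4-byte int32 row and column indices.
dense_bytes = n_users * n_books * 8
sparse_bytes = stored * (8 + 4 + 4)

print(f"dense:  {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
print(f"filled cells: {stored / (n_users * n_books):.2%}")
```

Under these assumptions the dense layout needs roughly 8 GB while the sparse one needs roughly 90 MB, because well under 1% of the cells hold a value.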

In [25]:
from scipy.sparse import coo_matrix

ratings_mat_coo = coo_matrix((interactions["rating"], (interactions["user_index"], interactions["book_index"])))

In [26]:
ratings_mat_coo

<1259x802870 sparse matrix of type '<class 'numpy.int64'>'
	with 5638728 stored elements in COOrdinate format>

In [27]:
ratings_mat = ratings_mat_coo.tocsr()

# Finding Users Similar to Us

In [28]:
# -1 is us! Our rows all have user_index 0, so we'll store that as my_index to know that's us.
interactions[interactions["user_id"] == "-1"]

Unnamed: 0,user_id,book_id,rating,user_index,book_index
0,-1,2517439,5,0,414880
1,-1,113576,5,0,38971
2,-1,35100,5,0,575858
3,-1,228221,5,0,356004
5,-1,17662739,5,0,214285
6,-1,356824,5,0,581743
7,-1,12125412,5,0,59763
8,-1,139069,5,0,124430
10,-1,76680,5,0,722098
11,-1,1898,5,0,276178


In [29]:
my_index = 0

In [30]:
# Using cosine_similarity to quantify which users are similar to us. We are collaborating with other people in our data!
# It measures the similarity between two rows of our matrix. How similar is each user to us?
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(ratings_mat[my_index,:], ratings_mat).flatten()
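
For intuition, cosine similarity is the dot product of two rating rows divided by the product of their lengths. The sketch below computes it by hand in plain Python on hypothetical rating rows; it illustrates the formula and is not how scikit-learn implements it.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical rating rows over the same four books (0 = no rating).
me = [5, 5, 0, 0]
same_taste = [4, 5, 0, 3]
no_overlap = [0, 0, 4, 5]

print(cosine(me, me))
print(cosine(me, same_taste))
print(cosine(me, no_overlap))
```

A user with no books in common scores 0, and comparing a row with itself gives 1 up to floating-point rounding, which is why a self-similarity can print as a value a hair below 1.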

In [31]:
# how similar we are to ourselves:
similarity[0]

0.9999999999999999

In [32]:
# Comparing to the next user:
similarity[1]

0.04579825781910479

In [33]:
# Find the indices of the 15 users most similar to us in terms of book taste (this includes ourselves, at index 0).
import numpy as np

indices = np.argpartition(similarity, -15)[-15:]
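
np.argpartition returns the positions of the largest values without ordering them among themselves. As a standard-library alternative (a swapped-in technique, not what the notebook uses), heapq.nlargest picks the same top positions and returns them from most to least similar. The scores below are hypothetical.

```python
import heapq

# Hypothetical similarity scores; position 0 is ourselves.
similarity_demo = [1.0, 0.05, 0.31, 0.0, 0.27, 0.12]
k = 3

# Positions of the k largest scores, highest first.
top_k = heapq.nlargest(k, range(len(similarity_demo)), key=similarity_demo.__getitem__)

print(top_k)
```

Either way, our own position shows up in the result, so it still has to be filtered out afterwards.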

In [34]:
indices

array([1188,  942,  218,  129,  496,  435, 1208,  795, 1213, 1210, 1143,
        321,  294,  862,    0])

In [35]:
# We need to find their user_ids and the books they rated:
similar_users = interactions[interactions["user_index"].isin(indices)].copy()

In [36]:
# Taking ourselves out to not get self-recommendations:
similar_users = similar_users[similar_users["user_id"]!="-1"]

In [37]:
similar_users

Unnamed: 0,user_id,book_id,rating,user_index,book_index
45312,4133,5359,3,942,632143
45313,4133,10464963,4,942,13492
45314,4133,3858,3,942,593622
45315,4133,11827808,4,942,51904
45316,4133,7913305,4,942,732465
...,...,...,...,...,...
5638521,712588,32388712,3,1143,543119
5638522,712588,16322,5,1143,183365
5638523,712588,860543,0,1143,759827
5638524,712588,853510,5,1143,756768


# Creating Book Recommendations

In [38]:
# Let's figure out how many times each book appeared in these recommendations. Aggregating the rating count and the average rating.
book_recs = similar_users.groupby("book_id").rating.agg(['count', 'mean'])

In [39]:
book_recs

book_id,count,mean
1,6,3.833333
100322,1,0.000000
100365,1,0.000000
10046142,1,0.000000
1005,3,0.000000
...,...,...
99561,2,2.500000
99610,1,3.000000
99664,1,4.000000
9969571,3,2.333333


In [40]:
# Hello old friend...we need the book titles :) We must ensure the book_id is a string. 
books_titles = pd.read_json("books_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str)

In [41]:
# Merging our two data sets! similar to SQL joins
book_recs = book_recs.merge(books_titles, how="inner", on="book_id")

In [42]:
book_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title
0,1,6,3.833333,Harry Potter and the Half-Blood Prince (Harry ...,1713866,https://www.goodreads.com/book/show/1.Harry_Po...,https://images.gr-assets.com/books/1361039191m...,harry potter and the halfblood prince harry po...
1,100322,1,0.000000,Assata: An Autobiography,11057,https://www.goodreads.com/book/show/100322.Assata,https://images.gr-assets.com/books/1328857268m...,assata an autobiography
2,100365,1,0.000000,The Mote in God's Eye,48736,https://www.goodreads.com/book/show/100365.The...,https://images.gr-assets.com/books/1399490037m...,the mote in gods eye
3,10046142,1,0.000000,Dancing in the Glory of Monsters: The Collapse...,2391,https://www.goodreads.com/book/show/10046142-d...,https://images.gr-assets.com/books/1328757755m...,dancing in the glory of monsters the collapse ...
4,1005,3,0.000000,Think and Grow Rich,87634,https://www.goodreads.com/book/show/1005.Think...,https://s.gr-assets.com/assets/nophoto/book/11...,think and grow rich
...,...,...,...,...,...,...,...,...
2849,99561,2,2.500000,Looking for Alaska,804587,https://www.goodreads.com/book/show/99561.Look...,https://images.gr-assets.com/books/1394798630m...,looking for alaska
2850,99610,1,3.000000,The Best Laid Plans,17434,https://www.goodreads.com/book/show/99610.The_...,https://images.gr-assets.com/books/1353374848m...,the best laid plans
2851,99664,1,4.000000,The Painted Veil,24606,https://www.goodreads.com/book/show/99664.The_...,https://images.gr-assets.com/books/1320421719m...,the painted veil
2852,9969571,3,2.333333,Ready Player One,376328,https://www.goodreads.com/book/show/9969571-re...,https://images.gr-assets.com/books/1500930947m...,ready player one


# Ranking Our Book Recommendations

In [43]:
# Normalizing the book count by the book's total number of Goodreads ratings. We want to find the books specific to us and users like us, not books that are super popular with everyone, so the recs are adjusted to our taste.
book_recs["adjusted_count"] = book_recs["count"] * (book_recs["count"] / book_recs["ratings"])

In [44]:
# Score indicating how much we might like the book: the mean rating from the users like us, weighted by the adjusted count.
book_recs["score"] = book_recs["mean"] * book_recs["adjusted_count"]
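
As a worked example of the two formulas, the sketch below recomputes adjusted_count and score for two rows of the top_recs table in this notebook. The helper name score_book is hypothetical.

```python
def score_book(count, mean, ratings):
    """Mirror the two notebook formulas: adjusted_count, then score."""
    adjusted_count = count * (count / ratings)
    return adjusted_count, mean * adjusted_count

# Values taken from the top_recs table in this notebook.
storm = score_book(count=5, mean=4.8, ratings=477834)   # A Storm of Swords
kane = score_book(count=4, mean=4.25, ratings=75215)    # Kane and Abel

print(storm)
print(kane)
```

Kane and Abel ends up with the higher score despite its lower mean, because it has far fewer ratings overall, so four similar readers are a stronger signal for it.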

In [45]:
# Taking out any books we already read. 
book_recs = book_recs[~book_recs["book_id"].isin(my_books["book_id"])]

In [46]:
# Take out any book titles that match a book we already read. A challenge with this data is that it's not entirely clean: the same title can appear under more than one book_id.
my_books["mod_title"] = my_books["title"].str.replace("[^a-zA-Z0-9 ]", "", regex=True).str.lower()

In [47]:
# Replace sequences of whitespace with a single space (raw string so \s is not treated as a string escape)
my_books["mod_title"] = my_books["mod_title"].str.replace(r"\s+", " ", regex=True)
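
The two cleaning steps can be tried on single titles with the re module. This is a minimal sketch that uses the same two patterns; normalize_title is a hypothetical helper, and the doubled space in the second title is added on purpose to show the whitespace collapse.

```python
import re

def normalize_title(title):
    """Drop punctuation, lowercase, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", title).lower()
    return re.sub(r"\s+", " ", cleaned)

print(normalize_title("The Forever War (The Forever War, #1)"))
print(normalize_title("Foundation  (Foundation, #1)"))
```

Two editions whose titles differ only in punctuation or spacing normalize to the same string, which is what lets the mod_title filter catch them.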

In [48]:
# Take out anything in our recs whose mod_title matches a book we already read or liked.
book_recs = book_recs[~book_recs["mod_title"].isin(my_books["mod_title"])]

In [49]:
# At least 3 similar users must have interacted with a book for it to stay in our recommendations.
book_recs = book_recs[book_recs["count"]>2]

In [50]:
# Mean rating must be at least 4.
book_recs = book_recs[book_recs["mean"]>=4]

In [51]:
# Sort by mean rating, highest first (pass "score" instead to rank by the adjusted score).
top_recs = book_recs.sort_values("mean", ascending=False)

In [52]:
top_recs

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,score
2265,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,https://www.goodreads.com/book/show/62291.A_St...,https://images.gr-assets.com/books/1497931121m...,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
600,157993,3,4.333333,The Little Prince,763309,https://www.goodreads.com/book/show/157993.The...,https://images.gr-assets.com/books/1367545443m...,the little prince,1.2e-05,5.1e-05
1103,22034,3,4.333333,The Godfather,259150,https://www.goodreads.com/book/show/22034.The_...,https://images.gr-assets.com/books/1394988109m...,the godfather,3.5e-05,0.00015
1176,2318271,3,4.333333,The Last Lecture,245804,https://www.goodreads.com/book/show/2318271.Th...,https://images.gr-assets.com/books/1388075896m...,the last lecture,3.7e-05,0.000159
1909,4381,3,4.333333,Fahrenheit 451,591506,https://www.goodreads.com/book/show/4381.Fahre...,https://images.gr-assets.com/books/1351643740m...,fahrenheit 451,1.5e-05,6.6e-05
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,https://www.goodreads.com/book/show/119322.The...,https://images.gr-assets.com/books/1505766203m...,the golden compass his dark materials 1,1.6e-05,7e-05
1444,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,https://www.goodreads.com/book/show/2767793-th...,https://images.gr-assets.com/books/1480717763m...,the hero of ages mistborn 3,0.000107,0.000456
2563,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,https://www.goodreads.com/book/show/78983.Kane...,https://s.gr-assets.com/assets/nophoto/book/11...,kane and abel kane and abel 1,0.000213,0.000904
244,119324,3,4.0,"The Subtle Knife (His Dark Materials, #2)",246697,https://www.goodreads.com/book/show/119324.The...,https://images.gr-assets.com/books/1505766360m...,the subtle knife his dark materials 2,3.6e-05,0.000146
398,13497,4,4.0,"A Feast for Crows (A Song of Ice and Fire, #4)",437398,https://www.goodreads.com/book/show/13497.A_Fe...,https://images.gr-assets.com/books/1429538615m...,a feast for crows a song of ice and fire 4,3.7e-05,0.000146


# Improve the Display of the Books

In [53]:
# Like in the search notebook! :D 
def make_clickable(val):
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    return '<img src="{}" width=50></img>'.format(val)

In [54]:
top_recs.style.format({'url': make_clickable, 'cover_image': show_image})

Unnamed: 0,book_id,count,mean,title,ratings,url,cover_image,mod_title,adjusted_count,score
2265,62291,5,4.8,"A Storm of Swords (A Song of Ice and Fire, #3)",477834,Goodreads,,a storm of swords a song of ice and fire 3,5.2e-05,0.000251
600,157993,3,4.333333,The Little Prince,763309,Goodreads,,the little prince,1.2e-05,5.1e-05
1103,22034,3,4.333333,The Godfather,259150,Goodreads,,the godfather,3.5e-05,0.00015
1176,2318271,3,4.333333,The Last Lecture,245804,Goodreads,,the last lecture,3.7e-05,0.000159
1909,4381,3,4.333333,Fahrenheit 451,591506,Goodreads,,fahrenheit 451,1.5e-05,6.6e-05
243,119322,4,4.25,"The Golden Compass (His Dark Materials, #1)",973154,Goodreads,,the golden compass his dark materials 1,1.6e-05,7e-05
1444,2767793,4,4.25,"The Hero of Ages (Mistborn, #3)",149260,Goodreads,,the hero of ages mistborn 3,0.000107,0.000456
2563,78983,4,4.25,"Kane and Abel (Kane and Abel, #1)",75215,Goodreads,,kane and abel kane and abel 1,0.000213,0.000904
244,119324,3,4.0,"The Subtle Knife (His Dark Materials, #2)",246697,Goodreads,,the subtle knife his dark materials 2,3.6e-05,0.000146
398,13497,4,4.0,"A Feast for Crows (A Song of Ice and Fire, #4)",437398,Goodreads,,a feast for crows a song of ice and fire 4,3.7e-05,0.000146


The adjusted_count and score columns show how each book ranks as a recommendation for what we might want to read next!

# Potential Next Steps:

1. Play around with some of the filters, such as filtered_overlap_users.
2. Other ways to scrape Goodreads are available!
3. In the similarity step we could change the 15-user limit.
4. We can also tweak the score used for the book recommendations and which books we remove from the set (e.g. does it have to be a 4-star mean rating? Could it be fewer than 3 people?)