# Explore Book Rating Data

In [1]:
liked_books = ["53732", "117902", "472331", "6066095", "3850639", "526270", "11250317"]

First Steps:
1. Find all the users who like these same books (using the Goodreads interactions file, because it has user IDs).
2. Then find all the books those users like, on the assumption that they have taste similar to ours.
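
The two steps above can be sketched on a handful of made-up rows. Everything here is illustrative: the ratings are invented, and the names `toy_interactions`, `our_books`, `similar_users`, and `candidate_books` are hypothetical. The real notebook streams the full interactions file instead of holding rows in a list.

```python
# Toy (user_id, book_id, rating) rows, invented purely for illustration.
toy_interactions = [
    ("u1", "472331", 5),
    ("u1", "5907", 4),
    ("u2", "472331", 2),
    ("u2", "13496", 5),
    ("u3", "53732", 4),
    ("u3", "2657", 3),
]
our_books = {"53732", "472331"}

# Step 1: users who rated one of our books 4 or higher.
similar_users = {u for u, b, r in toy_interactions if b in our_books and r >= 4}

# Step 2: every book those users have in their interactions.
candidate_books = sorted({b for u, b, r in toy_interactions if u in similar_users})

print(similar_users)
print(candidate_books)
```

User `u2` is excluded because the rating for the shared book is below 4, so `13496` never becomes a candidate.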

In [2]:
# We have to make sure the IDs match up across files (as mentioned in search)

!head book_id_map.csv

book_id_csv,book_id
0,34684622
1,34536488
2,34017076
3,71730
4,30422361
5,33503613
6,33517540
7,34467031
8,6383669


In [3]:
csv_book_mapping = {}

with open("book_id_map.csv", "r") as f:
    for line in f:
        # strip() matters: it removes the trailing newline, which would
        # otherwise end up inside book_id.
        # The header row is stored too (key "book_id_csv"); it is never looked up.
        csv_id, book_id = line.strip().split(",")
        csv_book_mapping[csv_id] = book_id

In [4]:
len(csv_book_mapping)

2360651

In [5]:
!wc -l goodreads_interactions.csv

 228648343 goodreads_interactions.csv


In [6]:
!ls -lh | grep goodreads_interactions

-rw-r--r--  1 mellanyandrea  staff   4.0G Feb 24 21:01 goodreads_interactions.csv


# Finding Users who Like the same books as Us

In [7]:
# This file records how each user rated each book. If a user rated one of our liked books highly,
# they have "similar" taste to ours and go into overlap_users. We have to iterate through the whole file!
!head goodreads_interactions.csv

user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,1,5,1
0,946,1,5,0
0,945,1,5,0
0,944,1,5,0
0,943,1,5,0
0,942,1,5,0
0,941,1,5,0
0,940,1,5,0


In [8]:
# Loading this 4 GB file with pandas would use a lot of memory, so we stream it line by line instead.

overlap_users = set()

with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")  # _ marks fields we don't need
        
        if user_id in overlap_users:
            continue
            
        try:
            rating = int(rating)
        except ValueError:
            continue
            
        book_id = csv_book_mapping[csv_id]
        
        if book_id in liked_books and rating >= 4:
            overlap_users.add(user_id)
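
The streaming pass above can also be written as a small function, which makes it easy to try on an in-memory sample before running it on the full file. This is only a sketch: `find_overlap_users`, `sample`, and `sample_map` are hypothetical names, and the sample rows and ID pairs are made up. It uses the standard library `csv` module and skips the header row explicitly instead of relying on the `ValueError` branch.

```python
import csv
import io

def find_overlap_users(lines, book_map, liked, min_rating=4):
    """Return user ids that rated any book in `liked` at least `min_rating`."""
    liked = set(liked)  # set membership is O(1) per lookup
    users = set()
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for user_id, csv_id, _is_read, rating, _is_reviewed in reader:
        if user_id in users:
            continue  # already known to overlap, no need to check again
        if int(rating) >= min_rating and book_map.get(csv_id) in liked:
            users.add(user_id)
    return users

# In-memory stand-in for the real file. With a real file, the csv docs
# recommend opening it with newline="".
sample = io.StringIO(
    "user_id,book_id,is_read,rating,is_reviewed\n"
    "0,948,1,5,0\n"
    "0,947,1,5,1\n"
    "1,948,1,3,0\n"
    "2,947,1,4,0\n"
)
sample_map = {"948": "472331", "947": "999"}  # made-up csv id -> book id pairs
print(find_overlap_users(sample, sample_map, ["472331"]))
```

Only user `0` qualifies here: user `1` rated the shared book below 4, and user `2` rated a book that is not in the liked list.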

In [9]:
# Read all the potential recommendations: every row (any rating) from the users who share
# our likes. No rating filter is applied in this pass.
rec_lines = []

with open("goodreads_interactions.csv", 'r') as f:
    for line in f:
        user_id, csv_id, _, rating, _ = line.split(",")
        
        if user_id in overlap_users:
            book_id = csv_book_mapping[csv_id]
            rec_lines.append([user_id, book_id, rating])

In [10]:
# How many users overlap with us:
len(overlap_users)

31788

In [11]:
# How many rows those users contributed in total (rec_lines is not filtered by rating)
len(rec_lines)

18553177

In [12]:
# Figuring out how to rank these books. Dataframe is easier to work with. We will have to do some filtering!
import pandas as pd

recs = pd.DataFrame(rec_lines, columns=["user_id", "book_id", "rating"])
recs["book_id"] = recs["book_id"].astype(str) 

In [13]:
# Count how often each book_id occurs in our DataFrame and keep the 10 most frequent.
# The result is a Series indexed by book_id whose values are the counts.
top_recs = recs["book_id"].value_counts().head(10)

In [14]:
top_recs = top_recs.index.values

In [15]:
# Read in the book titles so we can combine them with our recommendations and
# see which titles are most recommended!
books_titles = pd.read_json("books_titles.json")
books_titles["book_id"] = books_titles["book_id"].astype(str)

In [16]:
books_titles.head()

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
0,1333909,Good Harbor,10,https://www.goodreads.com/book/show/1333909.Go...,https://s.gr-assets.com/assets/nophoto/book/11...,good harbor
1,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",140,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,the unschooled wizard sun wolf and starhawk 12
2,6066819,Best Friends Forever,51184,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,best friends forever
3,287140,Runic Astrology: Starcraft and Timekeeping in ...,15,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,runic astrology starcraft and timekeeping in t...
4,287141,The Aeneid for Boys and Girls,46,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,the aeneid for boys and girls


# Creating Initial Book Recommendations

The first line of code below finds all the book titles whose book_id is in our top recommendations. Hopefully these are books that we would enjoy reading. However, there is one issue: most of them are simply very popular, so the list barely differs from a list of the most popular books overall. We need to find out which books are popular in OUR set. In other words, among users who read the books we liked, which books did THEY like? Those are not necessarily the books that are popular among ALL users.

In [17]:
books_titles[books_titles["book_id"].isin(top_recs)]

Unnamed: 0,book_id,title,ratings,url,cover_image,mod_title
55379,472331,Watchmen,406669,https://www.goodreads.com/book/show/472331.Wat...,https://images.gr-assets.com/books/1442239711m...,watchmen
386663,2767052,"The Hunger Games (The Hunger Games, #1)",4899965,https://www.goodreads.com/book/show/2767052-th...,https://images.gr-assets.com/books/1447303603m...,the hunger games the hunger games 1
546297,5107,The Catcher in the Rye,2086945,https://www.goodreads.com/book/show/5107.The_C...,https://images.gr-assets.com/books/1398034300m...,the catcher in the rye
608482,5907,The Hobbit,2099680,https://www.goodreads.com/book/show/5907.The_H...,https://images.gr-assets.com/books/1372847500m...,the hobbit
630937,4671,The Great Gatsby,2758812,https://www.goodreads.com/book/show/4671.The_G...,https://images.gr-assets.com/books/1490528560m...,the great gatsby
838525,5470,1984,2023937,https://www.goodreads.com/book/show/5470.1984,https://images.gr-assets.com/books/1348990566m...,1984
1048745,7613,Animal Farm,1928931,https://www.goodreads.com/book/show/7613.Anima...,https://images.gr-assets.com/books/1424037542m...,animal farm
1077226,2657,To Kill a Mockingbird,3255518,https://www.goodreads.com/book/show/2657.To_Ki...,https://images.gr-assets.com/books/1361975680m...,to kill a mockingbird
1196415,3,Harry Potter and the Sorcerer's Stone (Harry P...,4765497,https://www.goodreads.com/book/show/3.Harry_Po...,https://images.gr-assets.com/books/1474154022m...,harry potter and the sorcerers stone harry pot...
1316662,13496,"A Game of Thrones (A Song of Ice and Fire, #1)",1359501,https://www.goodreads.com/book/show/13496.A_Ga...,https://images.gr-assets.com/books/1436732693m...,a game of thrones a song of ice and fire 1


In [18]:
# Checking the original error, where the previous cell did not execute correctly. The problem was that top_recs
# held the value counts with book_id as the index; we need the values of the index instead.
top_recs

array(['472331', '3', '2767052', '5470', '5907', '4671', '2657', '5107',
       '7613', '13496'], dtype=object)

In [19]:
# Look at all recs to give us a dataframe of how many times a book appeared in our set.
all_recs = recs["book_id"].value_counts()

In [20]:
all_recs

472331      25852
3           18970
2767052     17731
5470        17625
5907        16520
            ...  
28810953        1
25417028        1
24174154        1
25212031        1
18901392        1
Name: book_id, Length: 1162326, dtype: int64

In [21]:
# Converting from series to dataframe
all_recs = all_recs.to_frame().reset_index()

In [22]:
all_recs

Unnamed: 0,index,book_id
0,472331,25852
1,3,18970
2,2767052,17731
3,5470,17625
4,5907,16520
...,...,...
1162321,28810953,1
1162322,25417028,1
1162323,24174154,1
1162324,25212031,1


In [23]:
# Number of times this book appeared in our recs
all_recs.columns = ["book_id", "book_count"]
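
The intermediate column names shown above (`index`, `book_id`) come from an older pandas release; newer releases (2.0 and later) label the output of `value_counts()` differently, so the frame looks different before the rename. A sketch that names both columns explicitly, and so should not depend on the pandas version, is shown below on made-up data. `toy` and `counts` are hypothetical names.

```python
import pandas as pd

# Made-up book ids, for illustration only.
toy = pd.DataFrame({"book_id": ["472331", "3", "472331", "5907", "3", "472331"]})

counts = (
    toy["book_id"]
    .value_counts()                  # Series: index = book_id, values = counts
    .rename_axis("book_id")          # name the index explicitly
    .reset_index(name="book_count")  # index becomes a column, values get a name
)
print(counts)
```

The result has the same two columns, `book_id` and `book_count`, that the cell above produces by assigning `all_recs.columns`.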

In [24]:
all_recs

Unnamed: 0,book_id,book_count
0,472331,25852
1,3,18970
2,2767052,17731
3,5470,17625
4,5907,16520
...,...,...
1162321,28810953,1
1162322,25417028,1
1162323,24174154,1
1162324,25212031,1


In [25]:
all_recs.head(5)

Unnamed: 0,book_id,book_count
0,472331,25852
1,3,18970
2,2767052,17731
3,5470,17625
4,5907,16520


In [26]:
# Inner merge: a row is kept only if its book_id exists in both DataFrames. Similar to a SQL INNER JOIN!
all_recs = all_recs.merge(books_titles, how="inner", on="book_id")
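
A minimal sketch of what the inner merge does, using two tiny made-up frames (`toy_counts` and `toy_titles` are hypothetical names, and the id `999999` is invented to represent a book with no title record):

```python
import pandas as pd

toy_counts = pd.DataFrame(
    {"book_id": ["472331", "3", "999999"], "book_count": [3, 2, 1]}
)
toy_titles = pd.DataFrame(
    {"book_id": ["472331", "3", "5907"],
     "title": ["Watchmen", "Harry Potter and the Sorcerer's Stone", "The Hobbit"]}
)

# Only book_ids present in BOTH frames survive an inner merge.
merged = toy_counts.merge(toy_titles, how="inner", on="book_id")
print(merged)
```

`999999` is dropped because it has no title, and `5907` is dropped because it has no count. The same effect explains why the merged frame in this notebook has fewer rows than `all_recs` had before the merge.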

In [27]:
all_recs

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title
0,472331,25852,Watchmen,406669,https://www.goodreads.com/book/show/472331.Wat...,https://images.gr-assets.com/books/1442239711m...,watchmen
1,3,18970,Harry Potter and the Sorcerer's Stone (Harry P...,4765497,https://www.goodreads.com/book/show/3.Harry_Po...,https://images.gr-assets.com/books/1474154022m...,harry potter and the sorcerers stone harry pot...
2,2767052,17731,"The Hunger Games (The Hunger Games, #1)",4899965,https://www.goodreads.com/book/show/2767052-th...,https://images.gr-assets.com/books/1447303603m...,the hunger games the hunger games 1
3,5470,17625,1984,2023937,https://www.goodreads.com/book/show/5470.1984,https://images.gr-assets.com/books/1348990566m...,1984
4,5907,16520,The Hobbit,2099680,https://www.goodreads.com/book/show/5907.The_H...,https://images.gr-assets.com/books/1372847500m...,the hobbit
...,...,...,...,...,...,...,...
1066778,28810953,1,The Tale of Kitty-in-Boots,28,https://www.goodreads.com/book/show/28810953-t...,https://images.gr-assets.com/books/1469408544m...,the tale of kittyinboots
1066779,25417028,1,Something Like This (Secrets Book 1),166,https://www.goodreads.com/book/show/25417028-s...,https://s.gr-assets.com/assets/nophoto/book/11...,something like this secrets book 1
1066780,24174154,1,unDefeated (Wayward Fighters #3),178,https://www.goodreads.com/book/show/24174154-u...,https://images.gr-assets.com/books/1430131797m...,undefeated wayward fighters 3
1066781,25212031,1,"Blood and Steel (Throne of the Caesars, #2)",73,https://www.goodreads.com/book/show/25212031-b...,https://images.gr-assets.com/books/1430413409m...,blood and steel throne of the caesars 2


In [28]:
# book_count answers: of all the users who liked books we liked, how many also have this book in their rows?
# We penalize it by how popular the book is in the general set (ratings). So we are looking for books that are
# popular among USERS with similar taste, not the generic popular list.
all_recs["score"] = all_recs["book_count"] * (all_recs["book_count"] / all_recs["ratings"])

In [29]:
all_recs.sort_values("score", ascending=False).head(10)

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
11013,35430702,228,The Lady's Guide to Petticoats and Piracy,9,https://www.goodreads.com/book/show/35430702-t...,https://s.gr-assets.com/assets/nophoto/book/11...,the ladys guide to petticoats and piracy,5776.0
6540,26856502,360,"Vengeful (Villains, #2)",35,https://www.goodreads.com/book/show/26856502-v...,https://s.gr-assets.com/assets/nophoto/book/11...,vengeful villains 2,3702.857143
13643,34019109,187,"Ninth House (Alex Stern, #1)",11,https://www.goodreads.com/book/show/34019109-n...,https://s.gr-assets.com/assets/nophoto/book/11...,ninth house alex stern 1,3179.0
17772,29749094,147,"Superman (DC Icons, #4)",7,https://www.goodreads.com/book/show/29749094-s...,https://s.gr-assets.com/assets/nophoto/book/11...,superman dc icons 4,3087.0
14210,32802595,180,"Record of a Spaceborn Few (Wayfarers, #3)",12,https://www.goodreads.com/book/show/32802595-r...,https://images.gr-assets.com/books/1498469008m...,record of a spaceborn few wayfarers 3,2700.0
13962,26857046,183,The Invisible Life of Addie La Rue,13,https://www.goodreads.com/book/show/26857046-t...,https://s.gr-assets.com/assets/nophoto/book/11...,the invisible life of addie la rue,2576.076923
4064,28601452,541,There Will Be Other Summers (Aristotle and Dan...,150,https://www.goodreads.com/book/show/28601452-t...,https://s.gr-assets.com/assets/nophoto/book/11...,there will be other summers aristotle and dant...,1951.206667
5920,24909347,391,"Obsidio (The Illuminae Files, #3)",82,https://www.goodreads.com/book/show/24909347-o...,https://images.gr-assets.com/books/1501704611m...,obsidio the illuminae files 3,1864.402439
12898,36307629,197,"King of Scars (King of Scars, #1)",22,https://www.goodreads.com/book/show/36307629-k...,https://images.gr-assets.com/books/1506962795m...,king of scars king of scars 1,1764.045455
16717,36300633,155,The Iron Season,14,https://www.goodreads.com/book/show/36300633-t...,https://s.gr-assets.com/assets/nophoto/book/11...,the iron season,1716.071429
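
To see how the score behaves, it can be recomputed by hand for two rows that appear in the tables above. This is a plain-Python sketch of the same formula; the function name `score` is hypothetical.

```python
def score(book_count, ratings):
    """book_count squared, divided by the book's overall number of ratings."""
    return book_count * (book_count / ratings)

# The Lady's Guide to Petticoats and Piracy: book_count=228, ratings=9
lady_guide = score(228, 9)

# The Hunger Games: book_count=17731, ratings=4899965
hunger_games = score(17731, 4899965)

print(lady_guide, hunger_games)
```

The first value matches the 5776.0 shown in the table, while The Hunger Games lands far lower despite a much larger `book_count`, because its overall `ratings` value is in the millions. The flip side is that a very small `ratings` value inflates the score, which may be why the top rows all have low `ratings` counts.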


In [30]:
popular_recs = all_recs[all_recs["book_count"] > 75].sort_values("score", ascending=False)

In [31]:
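def make_clickable(val):
    # Render the URL as a link that opens in a new tab.
    return '<a target="_blank" href="{}">Goodreads</a>'.format(val)

def show_image(val):
    # Render the cover as a 50px-wide thumbnail that links to the full image.
    return '<a href="{}"><img src="{}" width=50></a>'.format(val, val)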


# Prevents us from seeing books we've already read or liked. 
popular_recs[~popular_recs["book_id"].isin(liked_books)].head(10).style.format({'url': make_clickable, 
                                                                                'cover_image': show_image})

Unnamed: 0,book_id,book_count,title,ratings,url,cover_image,mod_title,score
11013,35430702,228,The Lady's Guide to Petticoats and Piracy,9,Goodreads,,the ladys guide to petticoats and piracy,5776.0
6540,26856502,360,"Vengeful (Villains, #2)",35,Goodreads,,vengeful villains 2,3702.857143
13643,34019109,187,"Ninth House (Alex Stern, #1)",11,Goodreads,,ninth house alex stern 1,3179.0
17772,29749094,147,"Superman (DC Icons, #4)",7,Goodreads,,superman dc icons 4,3087.0
14210,32802595,180,"Record of a Spaceborn Few (Wayfarers, #3)",12,Goodreads,,record of a spaceborn few wayfarers 3,2700.0
13962,26857046,183,The Invisible Life of Addie La Rue,13,Goodreads,,the invisible life of addie la rue,2576.076923
4064,28601452,541,"There Will Be Other Summers (Aristotle and Dante Discover the Secrets of the Universe, #2)",150,Goodreads,,there will be other summers aristotle and dante discover the secrets of the universe 2,1951.206667
5920,24909347,391,"Obsidio (The Illuminae Files, #3)",82,Goodreads,,obsidio the illuminae files 3,1864.402439
12898,36307629,197,"King of Scars (King of Scars, #1)",22,Goodreads,,king of scars king of scars 1,1764.045455
16717,36300633,155,The Iron Season,14,Goodreads,,the iron season,1716.071429


# Wrap Up and Next Steps:
Next, we will improve the quality of our recommendations using collaborative filtering!