# GoodReads to BetterReads

by: Sara Mendoza

Data Analytics - Ironhack Amsterdam / cohort Jan - June 2020

Project 6 - June 2020

## 1 Introduction

In this notebook I clean and inspect the data that was downloaded from the GoodReads website with the notebook "1_GoodReads_API". This data will be used to build a program that recommends books based on similar users.

In [None]:
# importing the necessary libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# reading the downloaded data
df1 = pd.read_csv('../data/goodreads_batch1.csv')
df2 = pd.read_csv('../data/goodreads_batch2.csv')
print(df1.shape, df2.shape)

# and concatenating into one DataFrame (ignore_index avoids duplicate row labels)
df = pd.concat([df1, df2], ignore_index=True)
print(df.shape)

## 2 Inspecting the data
We want to make sure that all the data is clean and ready to use.

In [None]:
# types are correct
print(df.dtypes)

# we have a total of 66,055 rows of data
print(df.shape)

# 2,383 different users, out of more than 6,000 users tried
print(len(df['userid'].unique()))

# with a mean of 28 books per user
print(df['userid'].value_counts().mean())

# 33,932 different books
print(len(df['book'].unique()))

Checking the rating distribution for our data, we see that quite a lot of books have a rating of zero, meaning no rating at all.
Most rated books, however, have 4 or 5 stars, probably because people tend to add books they liked to their lists rather than books they disliked.

In [None]:
df['rating'].value_counts(sort=False).plot.bar()
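To put numbers on the observation above, the zeros can be counted separately from the real ratings. This is only a small sketch on made-up ratings (the `ratings` Series below is a stand-in for `df['rating']`, not the real data), assuming that a zero always means "not rated".

```python
import pandas as pd

# Toy ratings standing in for df['rating']; 0 is assumed to mean "no rating given"
ratings = pd.Series([0, 5, 4, 0, 3, 5, 4, 4, 0, 2], name="rating")

# Share of rows that carry no rating at all
unrated_share = (ratings == 0).mean()

# Distribution of the actual ratings, with the zeros left out
rated = ratings[ratings > 0]
distribution = rated.value_counts(normalize=True).sort_index()

print(f"unrated share: {unrated_share:.0%}")
print(distribution)
```

Leaving the zeros out before looking at the distribution keeps "not rated" from being read as "disliked".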

Let's have a look at our highest-rated books overall.

In [None]:
# quick check to see which books have the highest mean rating
mean_rating = df.pivot_table(index=['book'],values=['rating'],aggfunc=(len,np.mean)).reset_index()
mean_rating.columns = ['_'.join(col).strip() for col in mean_rating.columns.values]

# most books have been rated only a few times; to see the most popular books we keep only those with more than 10 ratings
mean_rating['rating_len'].value_counts()
mean_rating = mean_rating.loc[mean_rating.rating_len > 10]
mean_rating['rating_len'].value_counts()

# below are the 5 highest-rated books in our data set
mean_rating.sort_values(by='rating_mean', ascending=False)[0:5]

The list above shows the mean rating, but these books probably also have fewer ratings than other books, so it is not an accurate depiction of the most-liked books.
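One common way to see this effect is a damped (Bayesian-style) average, which pulls the mean of rarely rated books toward the overall mean. This is not what the notebook does; it is only a sketch on invented data, and the prior weight `m` is a hypothetical value chosen for illustration.

```python
import pandas as pd

# Toy data: book A has many high ratings, book B a single 5-star rating
toy = pd.DataFrame({
    "book":   ["A", "A", "A", "A", "A", "B", "C", "C", "C"],
    "rating": [5, 5, 5, 5, 4, 5, 3, 4, 3],
})

# Number of ratings and plain mean per book
stats = toy.groupby("book")["rating"].agg(rating_len="count", rating_mean="mean")

global_mean = toy["rating"].mean()
m = 2  # hypothetical prior weight: acts like m extra ratings at the global mean

# Damped mean: books with few ratings are pulled toward the global mean
stats["damped_mean"] = (
    stats["rating_len"] * stats["rating_mean"] + m * global_mean
) / (stats["rating_len"] + m)

ranked = stats.sort_values("damped_mean", ascending=False)
print(ranked)
```

In this toy example book B tops the plain mean thanks to a single rating, while book A comes first once the number of ratings is taken into account.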

Since we want information about the books people enjoyed reading, we will remove all rows with fewer than 4 stars.

In [None]:
# only keeping books with scores of 4 or 5 stars, as we want the books that are recommended
high_rated = df.loc[df.rating > 3]

# now we have a total of 40,163 rows of data
print(high_rated.shape)

# 1,674 different users
print(len(high_rated['userid'].unique()))

# with a mean of about 24 books per user (40,163 / 1,674)
print(high_rated['userid'].value_counts().mean())

# and a total of 21,369 different books
print(len(high_rated['book'].unique()))

Now we can check which books have received the most 4- or 5-star ratings.

In [None]:
#high_rated.groupby('book')['rating'].count()

most_rated = high_rated.pivot_table(index=['book'],values=['rating'],aggfunc=(len,np.mean)).reset_index()
most_rated.columns = ['_'.join(col).strip() for col in most_rated.columns.values]
most_rated.sort_values(by='rating_len', ascending=False)[0:20]

... and this top 20 list full of bestsellers is exactly why I want to build a recommender system.
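The idea of recommending books based on similar users can be sketched on a tiny invented data set. The column names match the ones used in this notebook, but the users, books and the "keep only positively correlated users" rule are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings: user 1 is the target, user 2 has the same taste, user 3 the opposite
toy = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "book":   ["a", "b", "c", "a", "b", "c", "d", "a", "b", "c", "e"],
    "rating": [5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5],
})
target = 1

# Books as rows, users as columns; unread books are NaN
pivot = toy.pivot_table(index="book", columns="userid", values="rating")

# Pearson correlation between users, computed on the books they share
corr = pivot.corr()
similar = corr[target].drop(target).dropna()

# Assumption: only users with a positive correlation count as neighbours
neighbours = similar[similar > 0].index

# Books loved by the neighbours that the target user has not read yet
already_read = set(toy.loc[toy["userid"] == target, "book"])
candidates = toy[toy["userid"].isin(neighbours) & ~toy["book"].isin(already_read)]
recommendation = candidates["book"].value_counts()
print(recommendation)
```

Here only user 2 qualifies as a neighbour, so the single recommendation is book "d". On real data the correlation is only meaningful when two users share enough rated books, so a minimum overlap would probably be needed.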

In [None]:
# # first we create a matrix of all books vs all users, if they have not read / rated it, the rating will be nan
# high_rated_pivot = high_rated.pivot_table(index='book', columns='userid').rating.reset_index()

# #searching my own book to check that the correct rating is reflecting
# high_rated_pivot.loc[high_rated_pivot.book == 'Normal People'][42889636]
# # I indeed rated Normal People with 4 stars

# high_rated_pivot


In [None]:
# # matrix of total books recommended per user
# books_loved = pd.DataFrame(high_rated.groupby('userid')['rating'].count())
# books_loved.rename(columns={'rating': 'total_loved_books'},inplace=True)
# books_loved.sort_values('total_loved_books', ascending=False).head()

In [None]:
# # Now I want to check which users are highly correlated, to find book recommendations
# corr = high_rated_pivot.corr()

In [None]:
# #selecting only one user, to find similar users
# my_user = 42889636
# similar_to_mine = corr[my_user]
# similar_to_minedf = pd.DataFrame(similar_to_mine)
# similar_to_minedf.rename(columns={my_user: 'pearson_corr'},inplace=True)
# similar_to_minedf.dropna(inplace=True)

# #adding information of how many books other users have loved
# similar_to_minedf = similar_to_minedf.join(books_loved['total_loved_books'])

# #other users need to at least have loved half the books I have loved to be able to recommend
# parameter = books_loved.reset_index()
# parameter = parameter.loc[parameter['userid'] == my_user]['total_loved_books'] / 2
# parameter

# top = similar_to_minedf[similar_to_minedf['total_loved_books'] >= int(parameter)].drop(my_user).sort_values('pearson_corr', ascending=False).head(20)
# top.reset_index(inplace=True)
# top


In [None]:
# top_id = list(top['userid'])

# #adding my user to identify the books I've already read
# top_id.append(my_user)
# top_id

In [None]:
# #high_rated_pivot#[top_ten_id].dropna(how='all')

# # now we create a new matrix with all the books that have been read by the highest corr users
# top_books = high_rated.pivot_table(index='userid', columns='book').rating.reset_index()
# top_books = top_books[top_books['userid'].isin(top_id)].dropna(how='all',axis=1)

# # but we drop all books that the user has already read
# read_books = top_books[(top_books['userid'] == my_user)].dropna(axis=1).columns[1:]
# read_books = list(df[(df['userid'] == my_user)]['book'])
# for i in read_books:
#     if i in top_books.columns:
#         top_books.drop(i, axis=1, inplace=True)

# # top_books

In [None]:
# # recommendation given
# recommendation = top_books.fillna(0).astype(bool).sum(axis=0).sort_values(ascending=False).head(20)
# recommendation.reset_index()[1:].rename(columns={0: 'times recommended'})



In [None]:
# # remove top 20 liked books in general ?
# # remove harry potter, hunger games, A Game of Thrones
# read_books

In [None]:
# high_rated['userid'].unique()

In [None]:
# from goodreads import client
# gc = client.GoodreadsClient('<API_KEY>', '<API_SECRET>')  # credentials redacted; keep keys out of the notebook


# #             list_users.append((i,user.name))

In [None]:
df1 = pd.read_csv('../data/goodreads_batch1.csv')
df2 = pd.read_csv('../data/goodreads_batch2.csv')
df = pd.concat([df1,df2])

# Only keeping books with scores of 4 or 5 stars, as we want the books that are recommended
high_rated = df.loc[df.rating > 3]
high_rated

In [None]:
mean_rating = high_rated.pivot_table(index=['book'],values=['rating'],aggfunc=(len,np.mean)).reset_index()
mean_rating.columns = ['_'.join(col).strip() for col in mean_rating.columns.values]
best_sellers = mean_rating.sort_values(by='rating_len', ascending=False)[0:20]
best_sellers

