# GoodReads API

by: Sara Mendoza

Data Analytics - Ironhack Amsterdam / cohort Jan - June 2020

DATA for Project 6 - June 2020

## 1 Introduction

This notebook was used to download data from the GoodReads website.
On GoodReads, users can create accounts, and by default every account has a "shelf" (list) of books called READ.
All the books a user has read are stored on this shelf.

This notebook loops through many userIDs and creates requests to GoodReads to get the list of books on their read shelf.

The GoodReads API documentation can be accessed here: https://www.goodreads.com/api/index#shelves.list

In [None]:
# Importing the needed libraries
from goodreads import client
import pandas as pd
import numpy as np
import requests
import json 
from bs4 import BeautifulSoup
from time import sleep
from random import randint

A developer key is needed; it can be requested here: https://www.goodreads.com/api/keys

In [None]:
# API key and password from GoodReads
gc = client.GoodreadsClient('wxwrc6aLfRoMX3Ivr784A','rFT6Ytzh5TRBNcWnAYTdWY1wU5U27fQ6tEegWiSM5M')
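The key and secret could also be kept out of the notebook itself. The following is only a minimal sketch of that idea, assuming the credentials were exported as environment variables beforehand; the variable names `GOODREADS_KEY` and `GOODREADS_SECRET` and the helper `load_goodreads_credentials` are hypothetical and not part of the original notebook.

```python
import os


def load_goodreads_credentials(env=os.environ):
    """Return (key, secret) read from a mapping of environment variables.

    GOODREADS_KEY and GOODREADS_SECRET are hypothetical names chosen
    for this sketch.
    """
    key = env.get("GOODREADS_KEY")
    secret = env.get("GOODREADS_SECRET")
    if not key or not secret:
        raise RuntimeError("Set GOODREADS_KEY and GOODREADS_SECRET first")
    return key, secret


# Possible usage in the cell above (assumes both variables are set):
# key, secret = load_goodreads_credentials()
# gc = client.GoodreadsClient(key, secret)
```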

## 2 Getting the User IDs
It is not possible to get a list of users from GoodReads, but all users are indexed consecutively, starting from ID 1,
for example: https://www.goodreads.com/user/show/1 

up to IDs with 9 digits, for example:
https://www.goodreads.com/user/show/111111111

To access the data, I created a function that returns random numbers between 1 and 99999999 (the upper bound used in the code below, so only IDs of up to 8 digits are sampled).

In [None]:
# function to return random user IDs

def createuserIDs(num_users):
    df = pd.read_csv('../data/goodreads_batch1.csv')
    # IDs already downloaded in batch 1, used to avoid duplicates
    batch1 = set(df['userid'].tolist())
    mylist = []
    # loop until we have exactly num_users new, unique IDs
    while len(mylist) < num_users:
        x = randint(1,99999999)
        if x not in mylist and x not in batch1:
            mylist.append(x)
    return mylist

Initially I had also created a function to check if the users were active and public (if inactive or private we will not be able to access their shelf).

But GoodReads has a limit of 1 API request per second, and checking over 3,000 users was taking too much time, so I did not use this function.

In [None]:
# removing this user check, as it's only creating more GET requests and slowing everything down
# def createuserlist(list_users,userIDs):
#     for i in userIDs:
#         try:
#             user = gc.user(i)
#             list_users.append((i,user.name))
#             #sleep 1, as they accept max 1 request per second
#             sleep(1)
#         # adding except: pass, as some users are no longer active
#         except:
#             pass
#     return list_users

## 2 Creating the GET request
This function takes a userID and, through the API, accesses the user's read shelf.
Unfortunately there is no specific method to download only the book titles + ratings, so I had to download all the information on their page, parse it and select what I needed.

Also, as the function to check whether userIDs are active and public was removed, we will also be looping through inactive or private users, which will return 3 empty lists. The empty lists will be removed later, and this way was faster than creating 2 GET requests per user.

In [None]:
def createshelve_read(user):
    userID = []
    books = []
    ratings = []
    try:
        url = 'https://www.goodreads.com/review/list/' + str(user) +'.xml?key=wxwrc6aLfRoMX3Ivr784A&v=2&per_page=200&shelf=read'
        html = requests.get(url).content
        soup = BeautifulSoup(html, "lxml")
        for element in soup.find_all('title_without_series'):
            books.append(element.text)
        for element in soup.find_all('rating'):
            ratings.append(element.text)
        # one userID entry per book, so the three lists have matching lengths
        userID = [user] * len(books)
    # some users are private or inactive and their info cannot be downloaded
    except Exception:
        pass
    finally:
        # sleep 1 second after every request (also after failed ones),
        # as they accept max 1 request per second
        sleep(1)
    return userID, books, ratings
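To make the parsing step easier to follow without calling the API, here is a small offline sketch. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, and the `SAMPLE_XML` string is a hypothetical, heavily simplified stand-in for a real response (only the tag names `title_without_series` and `rating` come from the function above; the nesting is an assumption). The helper name `parse_shelf` is also made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified response; a real response contains many more fields.
SAMPLE_XML = """
<GoodreadsResponse>
  <reviews>
    <review>
      <book><title_without_series>Dune</title_without_series></book>
      <rating>5</rating>
    </review>
    <review>
      <book><title_without_series>Emma</title_without_series></book>
      <rating>0</rating>
    </review>
  </reviews>
</GoodreadsResponse>
"""


def parse_shelf(xml_text, user):
    """Return one (user, title, rating) tuple per <review> element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for review in root.iter("review"):
        # Read title and rating from the same review, so they stay paired.
        title = review.findtext("book/title_without_series")
        rating = review.findtext("rating")
        rows.append((user, title, rating))
    return rows


rows = parse_shelf(SAMPLE_XML, user=1)
print(rows)
```

Reading the title and the rating from the same `<review>` element keeps each pair together, whereas collecting them in two separate passes relies on both lists having the same length and order.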

## 3 Getting the Data

Since we are not checking whether users are active or public, I wanted to download info for at least 3,000 users, assuming that in the worst case 1/3 will be inactive or private and another 1/3 will be active but have little to no info on their account. This would leave us with info for about 1,000 users.

The function for the GET request has a 1 second sleep, since GoodReads limits API usage to 1 request per second and blocks users who don't comply. That's at least 3,000 seconds, or 50 minutes, for 3,000 users. In reality it took close to 2 hours to run the entire file.
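The runtime estimate can be checked with a few lines of arithmetic. This is only a rough sketch: the 2 hours is the approximate figure quoted above, and attributing the extra time to the request and parsing is an assumption, not something that was measured.

```python
num_users = 3000
sleep_per_request = 1  # seconds, the pause after every request

# Lower bound: only the sleep time, no network or parsing time
minimum_minutes = num_users * sleep_per_request / 60

# Observed: roughly 2 hours for the whole run
observed_seconds_per_user = 2 * 60 * 60 / num_users

# Assumed to be network + parsing time per user (not measured)
overhead_per_user = observed_seconds_per_user - sleep_per_request

print(minimum_minutes)               # 50.0
print(observed_seconds_per_user)     # 2.4
print(round(overhead_per_user, 1))   # 1.4
```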

In [None]:
# define some random userIDs
userIDs = createuserIDs(3000)

#adding myself :-) in batch1
#userIDs.append(42889636)

#adding other friends for batch2
# Zuzanna: 29153227
# Mom: 45188186
# Melissa: 48794835
# Paolo: 6940448

# more_users = [29153227,45188186,48794835,6940448]
# for i in more_users:
#     userIDs.append(i)

## 4 Running in Batches
Because the file kept freezing or not running to completion, I decided to split the userIDs into batches of 500, save the info after each run and merge everything at the end.

In [None]:
# list where all the info will be stored
alldata = []

# RUN 1

# selecting only 500 users at a time
for i in userIDs[:500]:
    result = createshelve_read(i)
    alldata.append(result)

# printing to make sure we are actually saving data in each round (learned from past experiences of letting it run for 2 hours and having no data...)
print(len(alldata))


In [None]:
# RUN 2

for i in userIDs[500:1000]:
    result = createshelve_read(i)
    alldata.append(result)
len(alldata)


In [None]:
# RUN 3

for i in userIDs[1000:1500]:
    result = createshelve_read(i)
    alldata.append(result)    
len(alldata)


In [None]:
# RUN 4

for i in userIDs[1500:2000]:
    result = createshelve_read(i)
    alldata.append(result)
len(alldata)


In [None]:
# RUN 5

for i in userIDs[2000:2500]:
    result = createshelve_read(i)
    alldata.append(result)
len(alldata)


In [None]:
# RUN 6

for i in userIDs[2500:]:
    result = createshelve_read(i)
    alldata.append(result)
len(alldata)
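The six runs above all follow the same pattern, so they could be expressed as one loop. The following is a minimal sketch, not the code that was actually run: `run_in_batches`, its parameters and the JSON checkpoint files are hypothetical names introduced here, and `fetch` stands for any function with the same return shape as `createshelve_read`.

```python
import json


def run_in_batches(user_ids, fetch, batch_size=500, checkpoint_prefix=None):
    """Call fetch(user_id) for every ID, one batch at a time.

    If checkpoint_prefix is given, all results collected so far are
    written to '<prefix>_<batch number>.json' after each batch.
    """
    results = []
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        for user_id in batch:
            results.append(fetch(user_id))
        batch_number = start // batch_size + 1
        # Progress check after every batch, like the len(alldata) calls above
        print(f"batch {batch_number}: {len(results)} results so far")
        if checkpoint_prefix is not None:
            with open(f"{checkpoint_prefix}_{batch_number}.json", "w") as f:
                json.dump(results, f)
    return results


# Possible usage in this notebook:
# alldata = run_in_batches(userIDs, createshelve_read)
```

Writing a checkpoint after each batch means that a frozen run loses at most one batch of requests.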


In [None]:
# putting it all in one data frame

userID = [i[0] for i in alldata]
books = [i[1] for i in alldata]
ratings = [i[2] for i in alldata]

flat_userID = [item for sublist in userID for item in sublist]
flat_books = [item for sublist in books for item in sublist]
flat_ratings = [item for sublist in ratings for item in sublist]

df = pd.DataFrame({
    'userid' : flat_userID,
    'book' : flat_books,
    'rating' : flat_ratings,
})

df.head()

In [None]:
# saving the data in a CSV

# first batch downloaded on 23/06
# df.to_csv('../data/goodreads_batch1.csv',index=False)

# second batch downloaded on 25/06
# df.to_csv('../data/goodreads_batch2.csv',index=False)