# Cleaning and EDA of Goodreads

Goodreads is a social cataloging website that allows individuals to freely search its database of books, annotations, and reviews. Users can sign up and register books to generate library catalogs and reading lists. They can also create their own groups of book suggestions, surveys, polls, blogs, and discussions. The website's offices are located in San Francisco. The company is owned by the online retailer Amazon.

Goodreads was founded in December 2006 and launched in January 2007 by Otis Chandler, a software engineer and entrepreneur, and Elizabeth Khuri (now Elizabeth Khuri Chandler), a journalist at the Los Angeles Times. The website grew rapidly in popularity after being launched. In December 2007, the site had over 650,000 members and over 10,000,000 books had been added. By July 2012, the site reported 10 million members, 20 million monthly visits, and 30 employees. On July 23, 2013, it was announced on their website that the user base had grown to 20 million members, having doubled in close to 11 months. On March 28, 2013, Amazon announced its acquisition of Goodreads.

<img src="goodreads.png">

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 
import requests
from bs4 import BeautifulSoup
import time
%matplotlib inline

We are going to scrape the `Best Books Ever` page of the Goodreads website.

Here is a description of the data we are scraping from the page.

```
rating: the average rating on a 1-5 scale achieved by the book
review_count: the number of Goodreads users who reviewed this book
isbn: the ISBN code for the book
booktype: an internal Goodreads identifier for the book
author_url: the Goodreads (relative) URL for the author of the book
year: the year the book was published
genre_urls: a string of '|'-separated relative URLs of Goodreads genre pages
dir: a directory identifier internal to the scraping code
rating_count: the number of ratings for this book (this is different from the number of reviews)
name: the name of the book
```

### Exploring the web pages and downloading them

In [2]:
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
plt.rcParams["figure.figsize"] = (8,8)

Let's first try scraping the first page; after that, it is easy to automate the process for all the other pages.

In [3]:
base_url = "https://www.goodreads.com"
best_book = "/list/show/1.Best_Books_Ever?page="
page_no = str(1)
url = base_url+best_book+page_no
print(url)

https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1


In [4]:
p_request = requests.get(url)
print(p_request.status_code)
p_text = p_request.text

200


In [5]:
for i in range(1,60):
    page_no = str(i)
    stuff = requests.get(base_url+best_book+page_no)
    filetowrite = "page"+ '%02d'%i + '.html'
    print("FTW", filetowrite)
    # Save the raw HTML so later parsing does not need to hit the site again
    with open(filetowrite,"w",encoding="utf-8") as fd:
        fd.write(stuff.text)
    time.sleep(2)  # pause between requests

FTW page01.html
FTW page02.html
FTW page03.html
FTW page04.html
FTW page05.html
FTW page06.html
FTW page07.html
FTW page08.html
FTW page09.html
FTW page10.html
FTW page11.html
FTW page12.html
FTW page13.html
FTW page14.html
FTW page15.html
FTW page16.html
FTW page17.html
FTW page18.html
FTW page19.html
FTW page20.html
FTW page21.html
FTW page22.html
FTW page23.html
FTW page24.html
FTW page25.html
FTW page26.html
FTW page27.html
FTW page28.html
FTW page29.html
FTW page30.html
FTW page31.html
FTW page32.html
FTW page33.html
FTW page34.html
FTW page35.html
FTW page36.html
FTW page37.html
FTW page38.html
FTW page39.html
FTW page40.html
FTW page41.html
FTW page42.html
FTW page43.html
FTW page44.html
FTW page45.html
FTW page46.html
FTW page47.html
FTW page48.html
FTW page49.html
FTW page50.html
FTW page51.html
FTW page52.html
FTW page53.html
FTW page54.html
FTW page55.html
FTW page56.html
FTW page57.html
FTW page58.html
FTW page59.html
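
The download loop above assumes every request succeeds. A minimal sketch of a more defensive variant follows. The names `save_pages`, `fetch`, and `fake_fetch` are hypothetical and not part of the original notebook; the fake fetcher stands in for a real HTTP call so the example runs offline.

```python
import tempfile
import time
from pathlib import Path

BASE_URL = "https://www.goodreads.com"
BEST_BOOK = "/list/show/1.Best_Books_Ever?page="

def save_pages(fetch, pages, out_dir, delay=0.0):
    """Fetch each list page, save it as pageNN.html, and return the pages that failed."""
    out_dir = Path(out_dir)
    failed = []
    for i in pages:
        url = f"{BASE_URL}{BEST_BOOK}{i}"
        try:
            html = fetch(url)
        except Exception as exc:  # record the failure instead of stopping the run
            print("Problem with page", i, "->", exc)
            failed.append(i)
            continue
        (out_dir / f"page{i:02d}.html").write_text(html, encoding="utf-8")
        time.sleep(delay)  # pause between requests
    return failed

def fake_fetch(url):
    """Offline stand-in for a real HTTP call; fails on page 2 for illustration."""
    if url.endswith("page=2"):
        raise RuntimeError("simulated network error")
    return f"<html><!-- {url} --></html>"

with tempfile.TemporaryDirectory() as tmp:
    failed_pages = save_pages(fake_fetch, range(1, 4), tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())

print(failed_pages)
print(saved_files)
```

In the notebook, `fetch` could wrap `requests.get`, call `raise_for_status()` on the response, and return its `text`, so that error pages are not silently written to disk.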


### Parse the page, extract book urls

In [6]:
bookdict={}
for i in range(1,60):
    books=[]
    stri = '%02d' % i
    filetoread="page"+ stri + '.html'
    print("FTW", filetoread)
    with open(filetoread,encoding="utf8") as fdr:
        data = fdr.read()
    soup = BeautifulSoup(data, 'html.parser')
    # Each book title link carries the relative URL of the book page
    for e in soup.select('.bookTitle'):
        books.append(e['href'])
    bookdict[stri]=books
    # Keep a plain-text copy of the links for each list page
    with open("list"+stri+".txt","w",encoding="utf8") as fd:
        fd.write("\n".join(books))

FTW page01.html
FTW page02.html
FTW page03.html
FTW page04.html
FTW page05.html
FTW page06.html
FTW page07.html
FTW page08.html
FTW page09.html
FTW page10.html
FTW page11.html
FTW page12.html
FTW page13.html
FTW page14.html
FTW page15.html
FTW page16.html
FTW page17.html
FTW page18.html
FTW page19.html
FTW page20.html
FTW page21.html
FTW page22.html
FTW page23.html
FTW page24.html
FTW page25.html
FTW page26.html
FTW page27.html
FTW page28.html
FTW page29.html
FTW page30.html
FTW page31.html
FTW page32.html
FTW page33.html
FTW page34.html
FTW page35.html
FTW page36.html
FTW page37.html
FTW page38.html
FTW page39.html
FTW page40.html
FTW page41.html
FTW page42.html
FTW page43.html
FTW page44.html
FTW page45.html
FTW page46.html
FTW page47.html
FTW page48.html
FTW page49.html
FTW page50.html
FTW page51.html
FTW page52.html
FTW page53.html
FTW page54.html
FTW page55.html
FTW page56.html
FTW page57.html
FTW page58.html
FTW page59.html


In [5]:
# Checking if we got all the links right
## bookdict

### Parse a book page, extract book properties

In [11]:
furl=base_url+bookdict['02'][0]
furl

'https://www.goodreads.com/book/show/43763.Interview_with_the_Vampire'

In [12]:
fstuff=requests.get(furl)
print(fstuff.status_code)

200


In [13]:
d=BeautifulSoup(fstuff.text, 'html.parser')

In [14]:
d.select("meta[property='og:title']")[0]['content']

'Interview with the Vampire (The Vampire Chronicles, #1)'

In [15]:
print(
"title", d.select_one("meta[property='og:title']")['content'],"\n",
"isbn", d.select("meta[property='books:isbn']")[0]['content'],"\n",
"type", d.select("meta[property='og:type']")[0]['content'],"\n",
"author", d.select("meta[property='books:author']")[0]['content'],"\n",
"average rating", d.select_one("span[itemprop='ratingValue']").text.strip(),"\n",
"ratingCount", d.select("meta[itemprop='ratingCount']")[0]["content"],"\n",
"reviewCount", d.select_one("meta[itemprop='reviewCount']")['content']
)

title Interview with the Vampire (The Vampire Chronicles, #1) 
 isbn 9780345476876 
 type books.book 
 author https://www.goodreads.com/author/show/7577.Anne_Rice 
 average rating 3.99 
 ratingCount 443625 
 reviewCount 9223
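
Indexing a selector result directly raises an error when a tag is absent, for example on a book page with no ISBN. The sketch below shows one way to return `None` instead. The helper name `meta_content` and the sample markup are assumptions made for illustration; the markup only imitates the tags queried above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the tags queried in the notebook
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Interview with the Vampire (The Vampire Chronicles, #1)">
<meta property="og:type" content="books.book">
</head><body>
<span itemprop="ratingValue"> 3.99 </span>
</body></html>
"""

def meta_content(soup, selector):
    """Return the content attribute of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get("content") if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = meta_content(soup, "meta[property='og:title']")
isbn = meta_content(soup, "meta[property='books:isbn']")  # absent in the sample
rating = soup.select_one("span[itemprop='ratingValue']").text.strip()
print(title, isbn, rating)
```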


In [16]:
genres=d.select("div.elementList div.left a")

In [17]:
glist = []
for g in genres:
    glist.append(g['href'].split('/')[-1])

In [18]:
"|".join(glist)

'horror|fantasy|fiction|paranormal|vampires|fantasy|paranormal'

In [8]:
import re
yearre = r'\d{4}'  # matches any run of four digits
def get_year(d):
    # Prefer the greyed-out note: take its last token and drop the trailing character
    note = d.select_one("nobr.greyText")
    if note:
        return note.text.strip().split()[-1][:-1]
    # Otherwise fall back to the first four-digit number in the second details row
    thetext=d.select("div#details div.row")[1].text.strip()
    rowmatch=re.findall(yearre, thetext)
    return rowmatch[0] if rowmatch else "NA"
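
To see what the two branches of `get_year` do, the string handling can be tried on plain text without any HTML. This is only a sketch: the function names and the sample strings below are hypothetical and merely imitate the kind of text the page is assumed to contain.

```python
import re

YEAR_RE = r'\d{4}'

def year_from_note(text):
    """Mirror the first branch: last token, minus its final character."""
    return text.strip().split()[-1][:-1]

def year_from_details_row(text):
    """Mirror the fallback branch: first four-digit run, or 'NA'."""
    matches = re.findall(YEAR_RE, text)
    return matches[0] if matches else "NA"

note_year = year_from_note("(first published 1976)")
row_year = year_from_details_row("Published August 31st 2004 by Ballantine Books")
missing_year = year_from_details_row("Published by an unknown press")
print(note_year, row_year, missing_year)
```

Note that `\d{4}` also matches the first four digits of any longer number, so a row that starts with a long numeric code would yield a spurious year. Wrapping the pattern in word boundaries (`\b\d{4}\b`) is one possible tightening.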

In [9]:
def get_genres(d):
    genres = d.select("div.elementList div.left a")
    glist = []
    for g in genres:
        glist.append(g['href'].split('/')[-1])
    return "|".join(glist)

In [33]:
good_reads = {
    'title':[],
    'isbn':[],
    'booktype':[],
    'author':[],
    'rating':[],
    'ratingCount':[],
    'reviewCount':[],
    'year':[],
    'genres':[]
}


In [34]:
for i in range(1,60):
    stri = '%02d' % i
    for j in range(100):
        try:
            furl=base_url+bookdict[stri][j]
            fstuff=requests.get(furl)
            d=BeautifulSoup(fstuff.text, 'html.parser')
            good_reads['title'].append(d.select_one("meta[property='og:title']")['content'])
            good_reads["isbn"].append(d.select("meta[property='books:isbn']")[0]['content'])
            good_reads["booktype"].append(d.select("meta[property='og:type']")[0]['content'])
            good_reads["author"].append(d.select("meta[property='books:author']")[0]['content'])
            good_reads["rating"].append(d.select_one("span[itemprop='ratingValue']").text.strip())
            good_reads["ratingCount"].append(d.select("meta[itemprop='ratingCount']")[0]["content"])
            good_reads["reviewCount"].append(d.select_one("meta[itemprop='reviewCount']")['content'])
            good_reads['year'].append(get_year(d))
            good_reads['genres'].append(get_genres(d))
        except Exception:
            # A failure partway through leaves the later lists one entry short for this book
            print("Problem with :",i,j)
    print("Done:",i)

Done: 1
Done: 2
Done: 3
Done: 4
Done: 5
Done: 6
Done: 7
Done: 8
Done: 9
Done: 10
Done: 11
Done: 12
Done: 13
Done: 14
Done: 15
Done: 16
Done: 17
Done: 18
Done: 19
Done: 20
Done: 21
Done: 22
Done: 23
Done: 24
Done: 25
Done: 26
Done: 27
Problem with : 28 4
Done: 28
Done: 29
Done: 30
Done: 31
Done: 32
Done: 33
Done: 34
Done: 35
Done: 36
Done: 37
Done: 38
Done: 39
Problem with : 40 38
Done: 40
Done: 41
Done: 42
Done: 43
Done: 44
Done: 45
Done: 46
Done: 47
Done: 48
Done: 49
Done: 50
Done: 51
Done: 52
Done: 53
Done: 54
Done: 55
Done: 56
Done: 57
Done: 58
Done: 59
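
Appending to nine separate lists inside one `try` block means a failure partway through leaves some lists longer than others. An alternative, sketched here under simplified assumptions, is to build each book's record completely before storing it. The `extract` function and the dictionary "pages" are hypothetical stand-ins for the real parsing code.

```python
FIELDS = ["title", "year", "genres"]

def extract(page):
    """Hypothetical extractor: raises KeyError if any field is missing."""
    return {field: page[field] for field in FIELDS}

def collect(pages):
    rows, problems = [], []
    for position, page in enumerate(pages):
        try:
            row = extract(page)          # all fields are read before anything is stored
        except KeyError as exc:
            problems.append((position, str(exc)))
            row = dict.fromkeys(FIELDS)  # an all-None row keeps positions aligned
        rows.append(row)
    return rows, problems

pages = [
    {"title": "Book A", "year": "2008", "genres": "fiction"},
    {"title": "Book B", "genres": "fantasy"},            # no year: simulated failure
    {"title": "Book C", "year": "1960", "genres": "classics"},
]
rows, problems = collect(pages)
print(len(rows), problems)
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame` directly. The trade-off in this sketch is that a failed book loses all of its fields, whereas the loop above keeps whatever was appended before the error.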


In [10]:
## Trouble Shooting Function
def trys(n):
    stri = '%02d' % n
    for j in range(1,100):
        if j!=4:
            furl=base_url+bookdict[stri][j]
            fstuff=requests.get(furl)
            d=BeautifulSoup(fstuff.text, 'html.parser')
            print(d.select_one("meta[property='og:title']")['content'])
            print(d.select("meta[property='books:isbn']")[0]['content'])
            print(d.select("meta[property='og:type']")[0]['content'])
            print(d.select("meta[property='books:author']")[0]['content'])
            print(d.select_one("span[itemprop='ratingValue']").text.strip())
            print(d.select("meta[itemprop='ratingCount']")[0]["content"])
            print(d.select_one("meta[itemprop='reviewCount']")['content'])
            print(get_genres(d))
            print(get_year(d))
            print("Done:",j)
            print("-------------------")

## trys(28)

In [42]:
good_reads.keys()

dict_keys(['title', 'isbn', 'booktype', 'author', 'rating', 'ratingCount', 'reviewCount', 'year', 'genres'])

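Since the scrape failed partway through two books (page 28, item 4 and page 40, item 38), the `year` and `genres` lists are each two entries short. We insert nulls at those positions so that every list lines up.

In [48]:
# Row index = (page - 1) * 100 + item: page 28, item 4 -> 2704; page 40, item 38 -> 3938
good_reads['year'].insert(2704,None)
good_reads['year'].insert(3938,None)

In [52]:
good_reads['genres'].insert(2704,None)
good_reads['genres'].insert(3938,None)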

In [53]:
print(len(good_reads['title']))
print(len(good_reads['isbn']))
print(len(good_reads['booktype']))
print(len(good_reads['author']))
print(len(good_reads['rating']))
print(len(good_reads['ratingCount']))
print(len(good_reads['reviewCount']))
print(len(good_reads['year']))
print(len(good_reads['genres']))

5900
5900
5900
5900
5900
5900
5900
5900
5900
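
The nine `print` calls can be condensed into a single check that also fails loudly if any column is out of step. The small dictionary `good_reads_demo` is a made-up stand-in for `good_reads`.

```python
good_reads_demo = {
    "title": ["A", "B", "C"],
    "year": ["2008", None, "1960"],
    "genres": ["fiction", None, "classics"],
}

# Map each column name to the number of values collected for it
lengths = {column: len(values) for column, values in good_reads_demo.items()}
print(lengths)

# A single distinct length means every column lines up
assert len(set(lengths.values())) == 1, f"column lengths differ: {lengths}"
```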


In [55]:
df = pd.DataFrame(good_reads)

In [56]:
df.to_csv("goodreads_scrapping.csv")

In [57]:
df1 = pd.read_csv("goodreads_scrapping.csv")

In [58]:
df1.head()

Unnamed: 0.1,Unnamed: 0,title,isbn,booktype,author,rating,ratingCount,reviewCount,year,genres
0,0,"The Hunger Games (The Hunger Games, #1)",9780439023481.0,books.book,https://www.goodreads.com/author/show/153394.S...,4.33,5759514,163427,2008.0,young-adult|fiction|science-fiction|dystopia|f...
1,1,Harry Potter and the Order of the Phoenix (Har...,9780439358071.0,books.book,https://www.goodreads.com/author/show/1077326....,4.49,2171104,36278,2003.0,fantasy|young-adult|fiction
2,2,"To Kill a Mockingbird (To Kill a Mockingbird, #1)",,books.book,https://www.goodreads.com/author/show/1825.Har...,4.27,3971942,83914,1960.0,classics|fiction|historical|historical-fiction...
3,3,Pride and Prejudice,,books.book,https://www.goodreads.com/author/show/1265.Jan...,4.25,2604077,57874,1813.0,classics|fiction|romance
4,4,"Twilight (Twilight, #1)",9780316015844.0,books.book,https://www.goodreads.com/author/show/941441.S...,3.59,4460110,99364,2005.0,young-adult|fantasy|romance|paranormal|vampire...
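
The `Unnamed: 0` columns in the output above are row indexes that were written to the CSV and read back as data, and the ISBNs appear as floats because the column contains missing values. A sketch of a round trip that avoids both follows, using a small made-up frame and an in-memory buffer instead of a file.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "title": ["The Hunger Games (The Hunger Games, #1)", "Pride and Prejudice"],
    "isbn": ["9780439023481", None],
    "rating": [4.33, 4.25],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)      # do not write the row index as a column
buffer.seek(0)

# Read ISBNs as text so they are not converted to floats
restored = pd.read_csv(buffer, dtype={"isbn": str})
print(list(restored.columns))
print(restored["isbn"].tolist())
```

Keeping ISBNs as strings also preserves any leading zeros, which a numeric column would drop.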
