## Lab 1: Goodreads Parsing and Scraping
**Univ.AI** <br>
**DS-1 Cohort 1**


## Table of Contents
* [Lab1: Goodreads Scraping and Parsing](#Lab1:-Goodreads-Scraping-and-Parsing)
  * [Learning Goals](##Learning-Goals)
  * [Q1: Scrape the "Best Books ever" web page](##Q1:-Scrape-the-"Best-Books-ever"-web-page)
  * [Q2: Parse the page, extract book urls](##Q2:-Parse-the-page,-extract-book-urls)
  * [Q3: Scrape the web page of each book](##Q3:-Scrape-the-web-page-of-each-book)
  * [Q4: Parse each books page, extract information](##Q4:-Parse-each-books-page,-extract-information)
    * [4.1 Extract genres](###4.1-Extract-Genres)
    * [4.2 Extract Published Year](###4.2-Extract-Published-Year)
    * [4.3 Extract Rating, ISBN, Title of the book, Author and Rating Count](###4.3-Extract-Rating,-ISBN,-Title-of-the-book,-Author-and-Rating-Count)
    * [4.4 Creating a DataFrame](###4.4-Creating-a-DataFrame)


## Learning Goals 
Goodreads has put out a list of the "Best Books ever", as voted on by around 200,000 people from the general Goodreads community. 

In this lab, we will be scraping and extracting information from Goodreads' "[Best Books ever](https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1)" list.

This lab consists of four main parts:
1. Scraping the "Best Books ever" web page
2. Parsing the page and extracting book URLs
3. Scraping the web page of each book
4. Parsing each book page and extracting book properties

This lab will develop your skills in:
* Exploring web pages through developer tools 
* Scraping and Parsing using Beautiful Soup and requests

In [7]:
#Import libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import time, requests

## Q1: Scrape the  "Best Books ever" web page
We're going to look at the structure of Goodreads' best books list. 

To get this page, we use Python's [requests module](https://requests.readthedocs.io/en/master/). 

In [8]:
#Getting the url using requests module
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="
url = URLSTART+BESTBOOKS+'1'
page = requests.get(url)

Check the status of the page - 200 is OK and 404 is not good!

In [9]:
page.status_code
page.text

'<!DOCTYPE html>\n<html class="desktop\n">\n<head>\n  <title>Best Books Ever (54128 books)</title>\n\n<meta content=\'52,635 books based on 206232 votes: The Hunger Games by Suzanne Collins, Harry Potter and the Order of the Phoenix by J.K. Rowling, To Kill a Mockingbird...\' name=\'description\'>\n<meta content=\'telephone=no\' name=\'format-detection\'>\n<link href=\'https://www.goodreads.com/list/show/1.Best_Books_Ever\' rel=\'canonical\'>\n\n\n\n    <script type="text/javascript"> var ue_t0=window.ue_t0||+new Date();\n </script>\n  <script type="text/javascript">\n    var ue_mid = "A1PQBFHBHS6YH1";\n    var ue_sn = "www.goodreads.com";\n    var ue_furl = "fls-na.amazon.com";\n    var ue_sid = "615-6896289-6064227";\n    var ue_id = "HJTRFPKTDYDRZNZSRHCR";\n\n    (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}functio
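
The status check described above can also be done in code rather than by eye. The following is a minimal sketch that uses only the standard library's `http.HTTPStatus` to turn a numeric code into a readable phrase; `describe_status` is a hypothetical helper name introduced here for illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK (success)' for an HTTP status code."""
    phrase = HTTPStatus(code).phrase          # e.g. 'OK' or 'Not Found'
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With a real response, the call would be `describe_status(page.status_code)`. Note that `HTTPStatus(code)` raises a `ValueError` for codes it does not know.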

Now that we are familiar with Python's `requests` module, let us access the first two pages of the list on Goodreads. 

This means you need to scrape **two** URLs: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1 and https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2. 

<br>

**Hint:** To do this, you can put your `requests.get()` call in a `for` loop.<br>
You will also need to use the `time.sleep()` function to wait for 1 second between the two GET requests so that Goodreads doesn't think you are attempting to mount a denial-of-service attack!<br>
In addition, store the HTML text in a dictionary `page_dict`. The key should be the page number (1 or 2) and the value should be the HTML text corresponding to the page.

In [None]:
#Loop to fetch 2 pages of "best-books" from goodreads. 
URLSTART="https://www.goodreads.com"
BESTBOOKS="/list/show/1.Best_Books_Ever?page="

page_dict={}

def getPageDataByPageID(page_id):
  # Build the list-page URL and return its HTML, or None if the request did not succeed
  url = URLSTART+BESTBOOKS+str(page_id)
  page = requests.get(url)
  if page.status_code == 200:
    return page.text
  return None

for i in range (1, 3):
  page_dict[i] = getPageDataByPageID(i)
  time.sleep(1)


In [None]:
print(len(page_dict))

2
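
The loop, pause, and dictionary pattern from the hint can be separated from the network call so that it is easy to try offline. This is only a sketch: `fetch_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download.

```python
import time

def fetch_pages(page_numbers, fetch, delay=1.0):
    """Call `fetch` for each page number, pausing `delay` seconds between calls."""
    pages = {}
    for position, number in enumerate(page_numbers):
        if position > 0:
            time.sleep(delay)      # pause only between requests, not before the first
        pages[number] = fetch(number)
    return pages

# Stand-in for a real download so the sketch runs without network access.
def fake_fetch(number):
    return f"<html>page {number}</html>"

demo = fetch_pages([1, 2], fake_fetch, delay=0.01)
print(sorted(demo))
```

In the lab itself, `fetch` could be the `getPageDataByPageID` function, with `delay` left at one second.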


## Q2: Parse the page, extract book urls

Our next step should be to parse the HTML text we have saved from the previous section and extract the information we need. 

To do this, we will be using BeautifulSoup to transform HTML content into Python data structures. You can also use other libraries like PyQuery if you are comfortable with jQuery, but we will be using BeautifulSoup.


In [10]:
#Import BeautifulSoup
from bs4 import BeautifulSoup

Our aim is to extract the **book URLs** on the page in order to use them in later sections. 

To do this, we look for the elements with class bookTitle, extract the urls, and write them into a dictionary `urldict` where the keys are the page numbers and the values are a list of the book URLs extracted from the page.


**Hint:** While parsing the HTML, look for the HTML `a` elements, but only those that have a CSS class of `bookTitle`. If you look at the page source, you'll see a construct like **`class=bookTitle`** in the table, as seen below:
<br><br>
![goodreadsexample](https://drive.google.com/uc?export=view&id=1PUvIe7VkXFSkvrj4pZm0pRKWr6W8Rw0F)
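
The hint can be tried on a small hand-written snippet before running it on a full page. The HTML below is made up and only imitates the structure described above; the book and author paths in it are placeholders.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that only imitates the structure described in the hint.
sample_html = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1.First_Example"><span>First Example</span></a></td></tr>
  <tr><td><a class="authorName" href="/author/show/1.Someone">Someone</a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2.Second_Example"><span>Second Example</span></a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ restricts the match to <a> elements whose CSS class is bookTitle,
# so the authorName link is skipped.
links = [a["href"] for a in soup.find_all("a", class_="bookTitle")]
print(links)
```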


In [11]:
urldict={}

for i in range (1,3):
  soup = BeautifulSoup(page_dict[i], 'html.parser')
  bookTitles = soup.find_all("a", "bookTitle")
  titlelist = []
  for bookTitle in bookTitles:
    titlelist.append(bookTitle["href"])
    # print( bookTitle["href"],"::", bookTitle.span.contents[0])
  urldict[i] = titlelist

In [16]:
print(urldict)

{1: ['/book/show/2767052-the-hunger-games', '/book/show/2.Harry_Potter_and_the_Order_of_the_Phoenix', '/book/show/2657.To_Kill_a_Mockingbird', '/book/show/1885.Pride_and_Prejudice', '/book/show/41865.Twilight', '/book/show/19063.The_Book_Thief', '/book/show/170448.Animal_Farm', '/book/show/11127.The_Chronicles_of_Narnia', '/book/show/30.J_R_R_Tolkien_4_Book_Boxed_Set', '/book/show/11870085-the-fault-in-our-stars', '/book/show/18405.Gone_with_the_Wind', '/book/show/386162.The_Hitchhiker_s_Guide_to_the_Galaxy', '/book/show/370493.The_Giving_Tree', '/book/show/6185.Wuthering_Heights', '/book/show/968.The_Da_Vinci_Code', '/book/show/5297.The_Picture_of_Dorian_Gray', '/book/show/929.Memoirs_of_a_Geisha', '/book/show/10210.Jane_Eyre', '/book/show/24213.Alice_s_Adventures_in_Wonderland_Through_the_Looking_Glass', '/book/show/24280.Les_Mis_rables', '/book/show/13079982-fahrenheit-451', '/book/show/13335037-divergent', '/book/show/7624.Lord_of_the_Flies', '/book/show/18144590-the-alchemist', '/
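
The extracted values are site-relative paths, so they need the site's base URL in front before they can be requested. String concatenation works, and the standard library's `urllib.parse.urljoin` is one alternative, sketched here; `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.goodreads.com"

def absolute_url(relative_path, base=BASE_URL):
    """Combine the site's base URL with a path such as '/book/show/...'."""
    return urljoin(base, relative_path)

print(absolute_url("/book/show/2767052-the-hunger-games"))
```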

## Q3: Scrape the web page of each book

Now that we have the book URLs in a dictionary `urldict`, we can scrape the web pages of the books themselves to extract some information. <br>

Before we extract information, we will need to get the HTML text for each book's page. 
Scrape the books' web pages and store the HTML text for each book in a dictionary named `bookdict`, using a `for` loop in a similar fashion as before.

In [None]:
#This is just an **example** to understand how to scrape these files
#Scraping one of the files
URLSTART="https://www.goodreads.com"

book_url=URLSTART+urldict[2][0]
stuff=requests.get(book_url)

#Check the status of the page
print(stuff.status_code)

#All OK!

200


In [42]:
#Fetching the actual 200 book pages
#In the interest of time, we are taking just the first 10 of each page. Running this for 200 books takes 25 min!

bookdict={}
URLSTART="https://www.goodreads.com"

def getDetailPageByURL(url):
  book_url=URLSTART+url
  stuff=requests.get(book_url)
  return stuff.text

for i in urldict:
  for j in range(0,10):
    url = urldict[i][j]
    bookdict[url] = getDetailPageByURL(url)
    time.sleep(1)
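
Because fetching many book pages is slow, it can help to save each page to disk the first time it is downloaded and reuse the saved copy afterwards. The sketch below shows one way to do that with the standard library; `cached_fetch`, `fake_fetch`, and the cache layout are assumptions made for illustration and are not part of the lab's required solution.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url, fetch, cache_dir):
    """Return the HTML for `url`, calling `fetch` only if no cached copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a file name that is safe on any file system.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html

# Offline demonstration with a stand-in fetch function that records its calls.
calls = []
def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("/book/show/1.Example", fake_fetch, tmp)
    second = cached_fetch("/book/show/1.Example", fake_fetch, tmp)

print(len(calls))
```

In the lab, `fetch` could be `getDetailPageByURL`, and the one-second pause would then only be needed when a page is actually downloaded.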

## Q4: Parse each books page, extract information

Now that we have the HTML text for the books, we can extract information from these web pages. 
We intend to extract the following data:

- Published year
- Rating
- ISBN 
- Title of the book
- Author
- Genres this book fits in. Since there are several genres associated with each book, you will need to extract the URL of each genre, separated by a pipe '|' like so:
```
/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|
```
- Rating count, the number of people who have rated this book
<br>
<br> 
All this information can be seen on the web page.
You will need to go to the developer tools and extract the necessary information. 
<br>
<br>

Since Published year and Genre require some extra processing to be extracted, we will start by writing 2 functions - `get_genre` and `get_year`.

### 4.1 Extract Genres

Write a function to get the genres which takes as input the HTML text and outputs a list of the genre URLs.

In [43]:
#Extracting genre
def get_genre(d):
  # d is a parsed BeautifulSoup object for one book page, not the raw HTML text
  genresList = d.select('.mainContentContainer .mainContent .mainContentFloat .rightContainer .bigBoxBody .containerWithHeaderContent .elementList .left a')
  # Build a pipe-separated string of genre URLs (note that it ends with a trailing '|')
  genres =''
  for listItem in genresList:
    genres += listItem["href"] + "|"
  return genres


# print(get_genre(BeautifulSoup(bookdict['/book/show/2767052-the-hunger-games'], 'html.parser')))
#your code here

### 4.2 Extract Published Year

Write a function to get the published year which takes as input the HTML text.

You might have to use regular expressions to extract only the year from the published date shown on the web pages.

**Regular expressions** <br>
Regular expressions are a pattern-matching mechanism used throughout computer science and programming (they are not specific to Python). A tutorial on regular expressions (aka regex) is beyond the scope of this lab, but below are some great resources that we recommend if you are interested (they could be very useful for a homework problem):<br>
https://docs.python.org/3.3/library/re.html <br>
https://regexone.com <br>
https://docs.python.org/3/howto/regex.html <br>
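
As a small taste of what a regular expression can do for this task, the sketch below pulls four-digit numbers out of a sentence. The sample text is made up to resemble a publication line and is not copied from Goodreads.

```python
import re

# Four digits standing alone as a word, so longer numbers such as ISBNs are skipped.
YEAR_PATTERN = re.compile(r"\b\d{4}\b")

# Made-up text in the general shape of a publication line.
sample = "Published May 1st 2006 by Example Press (first published 1960)"

years = YEAR_PATTERN.findall(sample)
print(years)
```

The word boundaries (`\b`) are a design choice: a plain `\d{4}` would also match the first four digits of any longer number in the text.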


In [44]:
#Extracting published year
yearre = r'\d{4}'
def get_year(d):
  # d is a parsed BeautifulSoup object for one book page
  years=d.find("div", attrs={"class": "uitext darkGreyText"})
  years=years.findChildren("div")[1].text
  yearmatch=re.findall(yearre,years)
  # A <nobr class="greyText"> element signals that an original publication date is also listed
  years_original=d.find_all("nobr", attrs={"class": "greyText"})
  if years_original and len(yearmatch) > 1:
    # The second four-digit match is taken as the original publication year
    return yearmatch[1]
  if yearmatch:
    return yearmatch[0]
  return "NA"
# print(get_year(BeautifulSoup(bookdict['/book/show/2767052-the-hunger-games'], 'html.parser')))

### 4.3 Extract Rating, ISBN, Title of the book, Author and Rating Count

Now that you have created functions to extract genres and published years, you can extract the rest of the fields in a line or two.

Extract the other fields and incorporate your functions to get a **list of dictionaries**. Each element in the list is a dictionary with the information you have extracted (Published year, Rating, ISBN, Title of the book, Author, Genres, Rating Count).
<br>
<br>
So **each element in the list** should look something like this:
```
{'author': 'https://www.goodreads.com/author/show/153394.Suzanne_Collins',
 'booktype': 'books.book',
 'rating': 4.33,
 'genres': '/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|/genres/young-adult|/genres/teen|/genres/apocalyptic|/genres/post-apocalyptic|/genres/action',
 'isbn': '9780439023481',
 'ratingCount': '6554254',
 'title': 'The Hunger Games (The Hunger Games, #1)',
 'year': '2008'}
```

**Note**: Remember to convert your list of genres to a string separated by the pipe character '|'.
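
One way to do the conversion mentioned in the note is `str.join`, sketched here on a short hand-picked list of genre URLs.

```python
genre_urls = ["/genres/young-adult", "/genres/fiction", "/genres/dystopia"]

# join puts the separator only between items, so there is no trailing pipe.
genres_string = "|".join(genre_urls)
print(genres_string)

# split reverses the operation when the list is needed again.
recovered = genres_string.split("|")
```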


In [45]:
listofdicts=[]

for url in bookdict:
  soup = BeautifulSoup(bookdict[url], 'html.parser')

  # Each lookup below assumes the element exists on the page
  authorName = soup.find("div", {"class": "authorName__container"}).find("span", {"itemprop":"name"}).contents[0]
  # Not stripped, so the rating keeps its surrounding whitespace
  rating     = soup.find("span", {"itemprop": "ratingValue"}).contents[0]
  # Positional lookup: the second infoBoxRowItem is not the ISBN on every page
  isbn       = soup.find_all("div", {"class": "infoBoxRowItem"})[1].contents[0].strip()
  title      = soup.find("h1", {"id": "bookTitle"}).contents[0].strip()

  listofdicts.append({
    'author': authorName,
    'booktype': 'books.book',
    'rating': rating,
    'genres': get_genre(soup),
    'isbn': isbn,
    # Placeholder: the rating count is hard-coded here, not extracted from the page
    'ratingCount': '6554254',
    'title': title,
    'year': get_year(soup)
  })

#your code here

In [46]:
listofdicts[0]

{'author': 'Suzanne Collins',
 'booktype': 'books.book',
 'genres': '/genres/young-adult|/genres/fiction|/genres/science-fiction|/genres/dystopia|/genres/fantasy|/genres/science-fiction|/genres/romance|/genres/adventure|/genres/young-adult|/genres/teen|/genres/apocalyptic|/genres/post-apocalyptic|/genres/action|',
 'isbn': '0439023483',
 'rating': '\n  4.32\n',
 'ratingCount': '6554254',
 'title': 'The Hunger Games',
 'year': '2008'}
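
In the output above, the rating still carries newline characters and the genres string ends with a pipe. A small clean-up step before building the DataFrame could look like the sketch below; `clean_record` is a hypothetical helper, and `raw` is an abridged copy of the dictionary shown above.

```python
def clean_record(record):
    """Return a copy of one book dictionary with tidied-up values."""
    cleaned = dict(record)
    cleaned["rating"] = float(record["rating"].strip())     # remove whitespace, then convert
    cleaned["ratingCount"] = int(record["ratingCount"])     # numeric string to integer
    cleaned["genres"] = record["genres"].rstrip("|")        # drop the trailing pipe
    return cleaned

raw = {
    "author": "Suzanne Collins",
    "rating": "\n  4.32\n",
    "ratingCount": "6554254",
    "genres": "/genres/young-adult|/genres/fiction|",
    "title": "The Hunger Games",
    "year": "2008",
}

tidy = clean_record(raw)
print(tidy["rating"], tidy["ratingCount"], tidy["genres"])
```

Working on a copy leaves the scraped dictionary untouched, which makes it easy to compare before and after.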

### 4.4 Creating a DataFrame

Convert the list of dictionaries created above to a pandas DataFrame.

In [49]:
df = pd.DataFrame.from_records(listofdicts)
df.head()


| | author | booktype | rating | genres | isbn | ratingCount | title | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Suzanne Collins | books.book | \n 4.32\n | /genres/young-adult\|/genres/fiction\|/genres/sc... | 0439023483 | 6554254 | The Hunger Games | 2008 |
| 1 | J.K. Rowling | books.book | \n 4.50\n | /genres/fantasy\|/genres/young-adult\|/genres/fi... | 0439358078 | 6554254 | Harry Potter and the Order of the Phoenix | 2003 |
| 2 | Harper Lee | books.book | \n 4.28\n | /genres/classics\|/genres/fiction\|/genres/histo... | English | 6554254 | To Kill a Mockingbird | 1960 |
| 3 | Jane Austen | books.book | \n 4.27\n | /genres/classics\|/genres/fiction\|/genres/roman... | English | 6554254 | Pride and Prejudice | 1813 |
| 4 | Stephenie Meyer | books.book | \n 3.61\n | /genres/young-adult\|/genres/fantasy\|/genres/ro... | 0316015849 | 6554254 | Twilight | 2005 |
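
Two rows above show `English` in the `isbn` column, which suggests that the value picked by position was a different field on those pages. One possible safeguard, sketched below, is to accept a value only if it has the shape of an ISBN. The helper name `valid_isbn_or_none` is hypothetical, and the check looks only at length and characters, not at the ISBN check digit.

```python
import re

# Ten characters (nine digits plus a digit or X) or thirteen digits.
ISBN_PATTERN = re.compile(r"\d{9}[\dX]|\d{13}")

def valid_isbn_or_none(value):
    """Return the value if it is shaped like an ISBN-10 or ISBN-13, else None."""
    candidate = value.strip().replace("-", "")
    return candidate if ISBN_PATTERN.fullmatch(candidate) else None

print(valid_isbn_or_none("0439023483"))
print(valid_isbn_or_none("English"))
```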


Convert this dataframe to a csv file and store it using `to_csv`.

In [50]:
df.to_csv("Goodreads.csv", index=False, header=True)
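
One thing to watch when the CSV is read back: pandas infers column types, so an ISBN such as `0439023483` is parsed as a number and loses its leading zero unless the column is read as text. The sketch below demonstrates this with an in-memory buffer instead of a file; `df_demo` is a one-row stand-in for the real DataFrame.

```python
import io
import pandas as pd

df_demo = pd.DataFrame([{"title": "The Hunger Games", "isbn": "0439023483"}])

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default parsing infers an integer column, which drops the leading zero.
buffer.seek(0)
as_number = pd.read_csv(buffer)

# Forcing the column to str keeps the ISBN exactly as it was written.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"isbn": str})

print(as_number["isbn"].iloc[0], as_text["isbn"].iloc[0])
```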