<h1><center>HW 2: Scrape Book Reviews</center></h1>

Choose one of your favorite books at goodreads.com (e.g., for the book "The Midnight Library", the URL is: https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true)

- Q1. Write a function to scrape all **reviews on the first page**, including:
    - **reviewer's name** (see (1) in Figure)
    - **rating** (see (2) in Figure)
    - **date** (see (3) in Figure)
    - **review content** (see (4) in Figure). For each review, get the **complete text**, i.e., the text shown after expanding the `more` button. Only text is needed; pictures are not. (Hint: take a close look at the content of the HTML file. Do you need Selenium?)
    - **likes** (see (5) in Figure). 
    - If a field, e.g. rating, is missing, use `None` to indicate it. 
- `Function Input`: book page URL
- `Function Output`: save all reviews as a DataFrame of columns (`reviewer, rating, date, review, like`). E.g., for the given URL, you can get 30 reviews.
- 10 points: 1 point for each element, 5 points for overall logic.
- **Note**: GoodReads occasionally blocks requests. You may get an error that is not caused by your code. If so, try running it a couple of times.
    
    
![alt text](GoodReads.png "GoodReads")
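
The hint in Q1 can be checked on a small scale before writing the scraper. The following is a minimal sketch, assuming the page keeps a truncated span and a hidden full-text span side by side; `SAMPLE_HTML` and `full_review_text` are hypothetical names, and the markup is a simplified stand-in rather than the real GoodReads page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a visible truncated span plus a hidden
# span that holds the complete text. The second review is short, so it has
# only one span.
SAMPLE_HTML = """
<div class="review">
  <span id="freeTextContainer1">I liked this book until</span>
  <span id="freeText1" style="display:none">I liked this book until it changed course.</span>
  <a href="#">...more</a>
</div>
<div class="review">
  <span id="freeTextContainer2">Short review.</span>
</div>
"""


def full_review_text(review_tag):
    """Return the complete review text, or None when no text span exists."""
    spans = review_tag.find_all(
        "span", id=lambda value: value and value.startswith("freeText")
    )
    if not spans:
        return None
    # If a hidden full-text span exists it comes last; otherwise the only
    # span already holds the whole review.
    return spans[-1].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
texts = [full_review_text(tag) for tag in soup.find_all("div", class_="review")]
print(texts)
```

If the hidden span is present in the downloaded HTML, as assumed here, a plain `requests` call is enough and no browser automation is needed for Q1.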

In [144]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd
import re
 

def getReviews(page_url):
    
    # enter your code here
    # list of column headers for the scraped data
    colHeaders = ['Reviewer Name','Rating(out of 5)','Date','Review','Likes']
    # class names corresponding to the column headers
    classNames = ["user","staticStar p10","reviewDate createdAt right","reviewText stacked","likesCount"]
    reviews = pd.DataFrame(columns = colHeaders)
    row = []

    soup = BeautifulSoup(requests.get(page_url).content, 'html.parser')
    scrapedData = soup.find_all(class_="friendReviews elementListBrown")

    for scrapedReview in scrapedData:
        for scrapedCol in range(len(classNames)):
            try:
                cell = scrapedReview.find_all(class_= classNames[scrapedCol])
                #check if data is not empty
                if cell:
                    if scrapedCol==1:  # processing star count
                        # count the stars by counting how many times "staticStar p10" appears in the scraped column
                        row.append(str(len(cell)) + " stars")
                    elif scrapedCol==3:  # getting all review text, including hidden text
                        # a truncated review has 2 spans: one containing only the displayed text and one containing the entire review
                        # we select the last span, which holds the entire review, to get all the text including hidden text
                        # since all the data is already present in the page source, we can get it directly via bs4;
                        # we don't need Selenium to click (...more) to expand the review text, as it would only slow down the code with unnecessary steps
                        # if the text were not in the source and were loaded via an Ajax request or a JS script,
                        # we would have to use Selenium to click the link and get the rest of the text from the page source
                        spans = scrapedReview.find_all('span',attrs = {'id' : re.compile(r'freeText')})
                        # fall back to the review container itself if no freeText span is found
                        fullText = spans[-1] if spans else cell[0]
                        row.append(fullText.text.strip())
                    else:
                        row.append(cell[0].text.strip())
                else:
                    #append none if no data is found for the cell
                    row.append(None)
            except AttributeError:
                if len(row) < 5:
                    row.append(None)
                continue
        reviews.loc[len(reviews)] = row
        row = []
    # return all reviews scraped from the first page
    return reviews

# note: if we still wanted to use Selenium, we could simply initialize a webdriver, call
# webDriver.find_element_by_link_text("...more").click()
# and select the span whose style attribute is "display:inline"
# the result would be the same, but the extra steps would slow down execution


In [145]:
# enter your url
page_url = 'https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true'
reviews=getReviews(page_url)
reviews

Unnamed: 0,Reviewer Name,Rating(out of 5),Date,Review,Likes
0,Nataliya,2 stars,"Jan 04, 2021",I liked this book until it suddenly decided to...,2625 likes
1,Nicole,2 stars,"Nov 20, 2020",Can’t help but agree with you. I picked up the...,831 likes
2,Nilufer Ozmekik,5 stars,"Aug 21, 2020",Okay! No more words! This is one of the best s...,2913 likes
3,Paromjit,5 stars,"May 28, 2020",It is no secret that Matt Haig has mental heal...,1632 likes
4,emma,3 stars,"Oct 12, 2020",Okay. Picture this: you are about to bite into...,1685 likes
5,Jayme,3 stars,"Nov 21, 2020",Unpopular opinion! In between life and death i...,857 likes
6,Emily (Books with Emily Fox),5 stars,"Feb 28, 2020","(4.5?) After loving The Humans, I was very exc...",1558 likes
7,Ruby Granger,5 stars,"Feb 25, 2021",okay WOW. This was amazing.I must say that I w...,644 likes
8,Cindy,2 stars,"Aug 18, 2021",Corny like a Hallmark movie and probably the l...,2013 likes
9,Emily B,3 stars,"Sep 20, 2020",This was cute and the concept was great but un...,1098 likes
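
The `Rating(out of 5)` and `Likes` columns above hold strings such as `2 stars` and `2625 likes`. If numeric values are preferred for sorting or averaging, a possible post-processing step is sketched below. `leading_int` is a hypothetical helper, not part of the solution above, and it keeps the `None` convention for missing fields.

```python
import re


def leading_int(text):
    """Return the first integer found in text, or None if there is none."""
    if text is None:
        return None
    # Drop thousands separators so "1,234 likes" is read as 1234.
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


print(leading_int("2 stars"))     # 2
print(leading_int("2625 likes"))  # 2625
print(leading_int(None))          # None
```

With pandas, the helper could be applied column-wise, e.g. `reviews['Likes'].map(leading_int)`.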


- Q2 (Bonus). Modify the function you defined in Q1 to scrape **reviews on all the pages** for your url. Since a book may have multiple pages, use the **next** button at the end of each page (shown in the picture) to navigate to the next page. Continue scraping all the pages until the last page. `Please don't hardcode the pages in the URL because the number of pages varies by book`. (3 points. If URL is hardcoded, -2).

![alt text](GoodReads_bonus.png "GoodReads_bonus")
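
The "follow the **next** button until the last page" requirement can be sketched independently of any browser. The example below is a minimal illustration under stated assumptions: `collect_all_pages` and `FakeSite` are hypothetical names, and `FakeSite` merely stands in for a real browser session so the loop logic can be run on its own.

```python
def collect_all_pages(get_rows, has_next, go_next, max_pages=10):
    """Gather rows page by page until there is no next page or max_pages is reached."""
    rows = list(get_rows())
    pages_seen = 1
    # Continue only while BOTH hold: a next page exists and the cap is not reached.
    while has_next() and pages_seen < max_pages:
        go_next()
        rows.extend(get_rows())
        pages_seen += 1
    return rows


class FakeSite:
    """Stand-in for a browser session: three pages of two reviews each."""

    def __init__(self):
        self.pages = [["r1", "r2"], ["r3", "r4"], ["r5", "r6"]]
        self.index = 0

    def rows(self):
        return self.pages[self.index]

    def has_next(self):
        return self.index < len(self.pages) - 1

    def go_next(self):
        self.index += 1


site = FakeSite()
all_rows = collect_all_pages(site.rows, site.has_next, site.go_next)
print(all_rows)  # ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
```

Because the loop asks the page itself whether a next page exists, no page count or page number appears in the URL, which is what the question requires.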

In [148]:
from selenium import webdriver
import re  # used by getReviews_modified to match the freeText span ids
import time
import requests
from bs4 import BeautifulSoup  
import pandas as pd
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.webdriver.common.keys import Keys

# the website uses an Ajax request to load the next page's data into the current page, so we need Selenium here
# to click the next button and get the next page's data.


#this function gets reviews on the current page
def getReviews_modified(driver,reviews):
    sourc = driver.page_source
    soup = BeautifulSoup(sourc, 'html.parser')
#---rest of the function is same as question 1----------------------------------------------#
    scrapedData = soup.find_all(class_="friendReviews elementListBrown")
    classNames = ["user","staticStar p10","reviewDate createdAt right","reviewText stacked","likesCount"]
    row = []
    for scrapedReview in scrapedData:
        for scrapedCol in range(len(classNames)):
            try:
                cell = scrapedReview.find_all(class_= classNames[scrapedCol])
                if cell:
                    if scrapedCol==1:
                        row.append(str(len(cell)) + " stars")
                    elif scrapedCol==3:
                        spans = scrapedReview.find_all('span',attrs = {'id' : re.compile(r'freeText')})
                        # fall back to the review container itself if no freeText span is found
                        fullText = spans[-1] if spans else cell[0]
                        row.append(fullText.text.strip())
                    else:
                        row.append(cell[0].text.strip())
                else:
                    row.append(None)
            except AttributeError:
                if len(row) < 5:
                    row.append(None)
        reviews.loc[len(reviews)] = row
        row = []
    return reviews



#this function gets reviews from all pages
def getReviews_2(page_url):
    colHeaders = ['Reviewer Name','Rating(out of 5)','Date','Review','Likes']
    reviews = pd.DataFrame(columns = colHeaders)
    #initializing selenium webdriver
    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get(page_url)
    time.sleep(5)
    source = driver.page_source
    s = BeautifulSoup(source, 'html.parser')
    reviews = getReviews_modified(driver,reviews)
    print('scraping')
    pg_count = 2
    tryCount = 0
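    # keep going only while the next button is still enabled AND we are within the first 10 pages;
    # Goodreads only shows the first 10 pages of reviews, hence the second condition, in case "next_page disabled" is never found
    while (not s.find(class_="next_page disabled")) and (pg_count<=10):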
        try:
            element = driver.find_element_by_class_name("next_page")
            if element:
                # scroll to the top of the page so that no other element obscures the next button
                driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME)
                time.sleep(3)
                element.click()
                time.sleep(4)
                pg_count = pg_count + 1
                source = driver.page_source
                s = BeautifulSoup(source, 'html.parser')
                reviews = getReviews_modified(driver,reviews)
                tryCount = 0
                print('.')
            else:
                return reviews
        except ElementClickInterceptedException:
            if tryCount==1:
                print('page refresh failed, might be a network error')
                return reviews
            print('next button obscured, refreshing page')
            try:
                driver.get(page_url)
                time.sleep(5)
                if pg_count>2:
                    driver.find_element_by_link_text(str(pg_count-1)).click()
                time.sleep(3)
                tryCount = 1
                source = driver.page_source
                s = BeautifulSoup(source, 'html.parser')
                continue
            except:
                print('could not refresh page, try running again')
                return reviews
        except NoSuchElementException:
            if pg_count <= 10:
                try:
                    if pg_count<2:
                        driver.find_element_by_link_text(str(pg_count-1)).click()
                    else:
                        driver.find_element_by_class_name("next_page").click()
                    time.sleep(3)
                    source = driver.page_source
                    s = BeautifulSoup(source, 'html.parser')
                    continue
                except:
                    print('page load error, check network and try running again')
                    return reviews
    driver.close()
    return reviews
        

In [149]:
# enter your url
page_url = 'https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true'

reviews=getReviews_2(page_url)
reviews


scraping
.
.
.
.
.
.
.
.
.


Unnamed: 0,Reviewer Name,Rating(out of 5),Date,Review,Likes
0,Nataliya,2 stars,"Jan 04, 2021",I liked this book until it suddenly decided to...,2625 likes
1,Nicole,2 stars,"Nov 20, 2020",Can’t help but agree with you. I picked up the...,831 likes
2,Nilufer Ozmekik,5 stars,"Aug 21, 2020",Okay! No more words! This is one of the best s...,2913 likes
3,Paromjit,5 stars,"May 28, 2020",It is no secret that Matt Haig has mental heal...,1632 likes
4,emma,3 stars,"Oct 12, 2020",Okay. Picture this: you are about to bite into...,1685 likes
...,...,...,...,...,...
295,Laura,4 stars,"Dec 16, 2020","Six stars for the premise. For me, this one st...",22 likes
296,Israt Zaman Disha,4 stars,"Jan 12, 2021","""The only way to learn is to live.""A Beautiful...",22 likes
297,Tanja Berg,5 stars,"Oct 19, 2020","Nora Seeds decides to die. However, before she...",22 likes
298,Sofia,5 stars,"Dec 21, 2020","If you do not like a story, what's to do, well...",22 likes
