# Goodreads Web Scraping Project

## Context

To get this data, I decided to try web scraping. I needed each book’s title, author, genres, and average rating, but my Goodreads TBR (to-be-read) page only displays the title, author, and average rating.

So how do I find the genres? I can’t get them from the TBR page alone, but when I looked at the HTML, I saw that I could scrape the URL of each title, which leads to the book’s individual page, where the genres are located. Knowing this, I could outline my plan:

1. Create a function that scrapes each URL from my TBR and stores it in a list called book_urls
2. Create another function that processes each URL, grabbing the title, author, genres, and average rating from Goodreads and storing them in a dataframe (I will test this to check that each field was scraped correctly)

### Initial Scraping Function:

This function is responsible for scraping the initial list of book URLs.
It creates a WebDriver instance, scrapes the URLs, and stores them in the book_urls list.
After scraping, it closes the WebDriver.

### Data Processing Function:

This function takes the book_urls list and processes the book data for each URL.
It handles any exceptions and appends the data to the appropriate lists (e.g., titles, authors, genres_list, avg_ratings, and book_lengths).
If a URL encounters an issue, it stores that URL in the failed_urls list for later retry.

### Retry Function:

This function takes the failed_urls list and retries scraping the data for the URLs that previously encountered issues.
Similar to the data processing function, it handles exceptions, appends the data to the lists, and can further troubleshoot issues.
It creates a new WebDriver instance and closes it after retrying.
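
A minimal sketch of how such a retry step could look, assuming a hypothetical `scrape_one(url)` callable that returns a dict of book fields or raises on failure. The callable and the `max_attempts` and `delay_seconds` parameters are illustrative and not taken from the notebook:

```python
import time


def retry_failed_urls(failed_urls, scrape_one, max_attempts=2, delay_seconds=0):
    """Retry each failed URL up to max_attempts times.

    scrape_one is a hypothetical callable: it takes a URL and returns a
    dict of book fields, or raises an exception if the page cannot be scraped.
    Returns (recovered_rows, still_failed_urls).
    """
    recovered = []
    still_failed = []
    for url in failed_urls:
        for attempt in range(1, max_attempts + 1):
            try:
                recovered.append(scrape_one(url))
                break  # success, stop retrying this URL
            except Exception as error:
                print(f"Attempt {attempt} failed for {url}: {error}")
                time.sleep(delay_seconds)  # pause before the next attempt
        else:
            # the inner loop never hit `break`, so every attempt failed
            still_failed.append(url)
    return recovered, still_failed
```

In a setup like this one, `scrape_one` would wrap the per-book Selenium logic, and a delay of a few seconds between attempts would match the pauses used in the other functions.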

### Part 1: Initial Scraping Function

This code grabbed all the URLs by creating a WebDriver instance, scraping the URLs and storing them in the book_urls list. After scraping, it closed the WebDriver.

In [3]:
# importing necessary packages
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# creating a single webdriver instance
driver = webdriver.Chrome()

# defining the function to scrape book urls
def scrape_book_urls(base_url, num_pages_to_scrape):
    book_urls = []
    for page in range(1, num_pages_to_scrape + 1):
        page_url = base_url + str(page)
        driver.get(page_url)
        time.sleep(3)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        # this is the html with the table
        target_div = soup.find('div', id='rightCol', class_='last col')
        title_cells = target_div.find_all('td', class_='field title')
        for title_cell in title_cells:
            url_element = title_cell.find('a', href=True)
            if url_element:
                book_url = url_element.get("href")
                if book_url:
                    book_urls.append("https://www.goodreads.com" + book_url)
    return book_urls

# defining the initial list of urls (book_urls) here
base_url = "https://www.goodreads.com/review/list/127846814-lauren-mcmaster?utf8=%E2%9C%93&utf8=%E2%9C%93&ref=nav_mybooks&shelf=to-read&title=lauren-mcmaster&per_page=100&page="
# 302 books, 30 books per page, so we need to scrape 11 pages
num_pages_to_scrape = 11
book_urls = scrape_book_urls(base_url, num_pages_to_scrape)

# viewing the list containing book urls from multiple pages
for url in book_urls:
    print(url)

# closing the webdriver
driver.quit()

https://www.goodreads.com/book/show/40864790-pumpkinheads
https://www.goodreads.com/book/show/53152636-mexican-gothic
https://www.goodreads.com/book/show/61165369-a-portrait-in-shadow
https://www.goodreads.com/book/show/29589074-truly-devious
https://www.goodreads.com/book/show/43263520-the-grace-year
https://www.goodreads.com/book/show/50706646-the-bone-shard-daughter
https://www.goodreads.com/book/show/40944965-binding-13
https://www.goodreads.com/book/show/22299763-crooked-kingdom
https://www.goodreads.com/book/show/50485649-in-my-dreams-i-hold-a-knife
https://www.goodreads.com/book/show/43575115-the-starless-sea
https://www.goodreads.com/book/show/62583508-talking-at-night
https://www.goodreads.com/book/show/57516722-off-to-the-races
https://www.goodreads.com/book/show/174712272-fair-catch
https://www.goodreads.com/book/show/55926057-indigo-ridge
https://www.goodreads.com/book/show/60784729-biography-of-x
https://www.goodreads.com/book/show/50548197-a-deadly-education
https://www.g

In [4]:
# checking the number of books
print(len(book_urls))

301
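
The number of pages passed to the scraper follows from a ceiling division. A small sketch of that arithmetic, taking the 302 books and 30 books per page from the code comment above as given (the helper name `pages_needed` is hypothetical):

```python
import math


def pages_needed(total_books, books_per_page):
    """Return how many list pages are needed to cover every book."""
    return math.ceil(total_books / books_per_page)


print(pages_needed(302, 30))  # 11
```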


### Part 2: Data Processing Function

Next, I created the data processing function, which takes the book_urls list and processes the book data for each URL. It handles any exceptions and appends the data to the appropriate lists (e.g., titles, authors, genres_list, avg_ratings, and book_lengths). If a URL encounters an issue, it stores that URL in the failed_urls list so it can be retried later.

In [10]:
# importing more packages for explicit wait times
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# creating a single webdriver instance
driver = webdriver.Chrome()

# initializing a list to store urls that encountered issues
failed_urls = []

# defining the function to process a list of urls
def process_urls(url_list, failed_urls):
    titles = []
    authors = []
    genres_list = []
    avg_ratings = []
    book_lengths = []

    for url in url_list:
        try:
            # navigating to the book page
            driver.get(url)

            # explicitly wait for the title element to be present (so fewer urls return no data)
            title_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//h1[@data-testid="bookTitle"]'))
            )

            # explicitly wait for the FeaturedDetails element to be present
            featured_details_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//div[@class="FeaturedDetails"]'))
            )
            
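            # finding the 'p' element with data-testid="pagesFormat" inside FeaturedDetails
            # (the leading '.' keeps the XPath search relative to featured_details_element)
            pages_format_element = featured_details_element.find_element(By.XPATH, './/p[@data-testid="pagesFormat"]')

            # extracting the text from the 'p' element
            # (find_element raises an exception if the element is missing, so no None check is needed)
            pages_format = pages_format_element.text.strip()

            # extracting the number of pages (the first word of the text), or flagging it as missing
            num_pages = pages_format.split(" ")[0] if pages_format else 'Not Found'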

            # extracting the html content from the current page
            html = driver.page_source

            # using Beautiful Soup to process the html
            soup = BeautifulSoup(html, 'html.parser')

            # extracting book details from the page with error handling
            author_element = soup.find('div', class_='BookPageMetadataSection__contributor')
            genres_element = soup.find('div', {'data-testid': 'genresList'})
            avg_rating_element = soup.find('div', class_='RatingStatistics__rating', attrs={'aria-hidden': True})

            # continue processing (finding missing data)
            title = title_element.text.strip() if title_element else 'Not Found'
            author = author_element.text.strip() if author_element else 'Not Found'

            if genres_element:
                genres = [genre.text.strip() for genre in genres_element.find_all('a')]
            else:
                genres = ['Not Found']

            avg_rating = avg_rating_element.text.strip() if avg_rating_element else 'Not Found'

            # checking if any of the values are missing
            if "Not Found" in (title, author, ', '.join(genres), avg_rating, num_pages):
                # marking the url as incorrect and add it to the failed_urls list
                failed_urls.append(url)
            else:
                # appending the data to the lists
                titles.append(title)
                authors.append(author)
                genres_list.append(", ".join(genres))
                avg_ratings.append(avg_rating)
                book_lengths.append(num_pages)

            # pausing for a few seconds before the next request to not overload the server
            time.sleep(3)

        except Exception as e:
            # logging error message
            print(f'Error: {str(e)}')
            print(f'URL: {url}')
            # storing the url that encountered an issue for later retry
            failed_urls.append(url)

    return titles, authors, genres_list, avg_ratings, book_lengths

# processing all urls in the list
titles, authors, genres_list, avg_ratings, book_lengths = process_urls(book_urls, failed_urls)

# creating a dataframe from the collected data
book_data = pd.DataFrame({
    "Title": titles,
    "Author": authors,
    "Genres": genres_list,
    "Average Rating": avg_ratings,
    "Book Length": book_lengths
})

# printing the dataframe
print(book_data.head())

# closing the webdriver
driver.quit()

                  Title                                             Author  \
0          Pumpkinheads  Rainbow Rowell, Faith Erin Hicks (Illustrator)...   
1        Mexican Gothic                               Silvia Moreno-Garcia   
2  A Portrait in Shadow                                      Nicole Jarvis   
3        Truly, Devious                                    Maureen Johnson   
4        The Grace Year                                        Kim Liggett   

                                              Genres Average Rating  \
0  Graphic Novels, Young Adult, Romance, Contempo...           4.03   
1  Horror, Fiction, Historical Fiction, Gothic, M...           3.68   
2  Fantasy, Historical Fiction, Historical, Adult...           3.86   
3  Mystery, Young Adult, Mystery Thriller, Contem...           3.95   
4  Young Adult, Dystopia, Fantasy, Fiction, Scien...           4.15   

  Book Length  
0         209  
1         320  
2         432  
3         416  
4         416  


In [12]:
print(len(failed_urls))

0


## Cleaning the dataframe

Now that I had my complete dataframe, I wanted to make sure I hadn’t added any books to my TBR twice. First, I searched for duplicates.

In [15]:
# finding duplicates
duplicate_titles = book_data[book_data.duplicated(subset='Title', keep=False)]
duplicate_titles

| | Title | Author | Genres | Average Rating | Book Length |
|---|---|---|---|---|---|
| 101 | I Fell in Love with Hope | Lancali. | Romance, Fiction, Contemporary, Young Adult, L... | 4.06 | 416 |
| 167 | I Fell in Love with Hope | Lancali. | Romance, Fiction, Contemporary, Young Adult, L... | 4.06 | 402 |


It looks like “I Fell in Love with Hope” was added to my TBR twice (possibly in different formats, such as hardcover and paperback, since the page counts differ). To fix this, I removed the duplicate, keeping the first occurrence.

In [16]:
# removing duplicates
book_data = book_data.drop_duplicates(subset='Title')
book_data

| | Title | Author | Genres | Average Rating | Book Length |
|---|---|---|---|---|---|
| 0 | Pumpkinheads | Rainbow Rowell, Faith Erin Hicks (Illustrator)... | Graphic Novels, Young Adult, Romance, Contempo... | 4.03 | 209 |
| 1 | Mexican Gothic | Silvia Moreno-Garcia | Horror, Fiction, Historical Fiction, Gothic, M... | 3.68 | 320 |
| 2 | A Portrait in Shadow | Nicole Jarvis | Fantasy, Historical Fiction, Historical, Adult... | 3.86 | 432 |
| 3 | Truly, Devious | Maureen Johnson | Mystery, Young Adult, Mystery Thriller, Contem... | 3.95 | 416 |
| 4 | The Grace Year | Kim Liggett | Young Adult, Dystopia, Fantasy, Fiction, Scien... | 4.15 | 416 |
| ... | ... | ... | ... | ... | ... |
| 296 | Aristotle and Dante Discover the Secrets of th... | Benjamin Alire Sáenz | Young Adult, LGBT, Romance, Contemporary, Fict... | 4.32 | 390 |
| 297 | Autoboyography | Christina Lauren | Romance, LGBT, Young Adult, Contemporary, Quee... | 4.15 | 407 |
| 298 | Emma | Jane Austen, Fiona Stafford (Editor) | Classics, Fiction, Romance, Historical Fiction... | 4.04 | 474 |
| 299 | The Awakening | Kate Chopin | Classics, Fiction, Feminism, School, Literatur... | 3.68 | 195 |


After removing duplicates, my data looked very clean. Next, I exported it to a CSV file, which I reviewed in Excel. I checked over it quickly to make sure everything was clean, but I found that in place of some of the page numbers there were values like “Audible”, “Audiobook”, and “Kindle” (the scraper keeps the first word of the format text, which is not a number for those editions). So I used Excel’s Find & Replace function to replace each of these with blanks.

In [17]:
# exporting the data to a csv
book_data.to_csv('goodreads_data.csv', index=False)
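
The same cleanup could also be done in Python before exporting, instead of with Find & Replace. A minimal sketch, assuming every valid book length is a string of digits and anything else (such as “Audible” or “Kindle”) should be treated as missing; the helper name `clean_book_length` is hypothetical:

```python
def clean_book_length(value):
    """Return the page count as an int, or None if the value is not numeric."""
    text = str(value).strip()
    # str.isdecimal() is False for words like "Audible" and for empty strings
    return int(text) if text.isdecimal() else None


print([clean_book_length(v) for v in ["416", "Audible", "Kindle", " 209 "]])
# [416, None, None, 209]
```

A helper like this could be applied to the Book Length column with pandas’ `Series.apply` before calling `to_csv`.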