# Web Scraping a dataset of Top Banned Books

By [Crystal Shearer](https://grrlofhighart.github.io/)

This notebook will outline the steps I took to compile a dataset of top banned books from [Goodreads](https://www.Goodreads.com). My original intention was to utilize a readily available dataset from a website such as [Kaggle](https://www.kaggle.com). However, after some searching it seemed like all the available datasets were lacking many of the data points I was hoping to utilize. My only solution was to attempt compiling my own dataset.

This project is best run from the command line or Terminal. Before getting started, navigate into the `/book_dataset` directory.

In [None]:
# Use %cd rather than !cd so the directory change persists for later cells
%cd book_dataset

#### Install required packages

For the project to run, you will need to install the required Python packages. If you have not already done so, now is the time. I typically prefer to use a virtual environment. If you are using [VS Code](https://code.visualstudio.com/docs/setup/setup-overview), you can create the virtual environment and install the required packages all at once. Open the Command Palette with [(Ctrl + Shift + P)](https://code.visualstudio.com/docs/python/environments#_creating-environments), then search for and select the command `Python: Create Environment`. Select your choice from the list of environment types. (`Venv` is our default if there are no other options.) Finally, select a current version of Python from the list of interpreters and check the box for `requirements.txt`. Click `OK`, and your virtual environment should be created and activated, assuming there are no issues.

If you aren't down with VS Code, you can [create a virtual environment](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#create-and-use-virtual-environments) and [install the required packages](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#install-packages-using-pip) right from the command line. You can find a guide on virtual environments on [python.org](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/). 

In [None]:
!py -m venv .venv

In [None]:
!.venv\Scripts\activate

In [None]:
!pip install -r requirements.txt

### Step 1: Importing all the necessary Libraries

In [35]:
# numpy for working with arrays
import numpy as np

# pandas for data manipulation
import pandas as pd

# BeautifulSoup for navigating webdata 
from bs4 import BeautifulSoup

# requests for fetching data
import requests

# re for working with regular expressions (strings)
import re

# sqlite3 for communicating with SQLite
import sqlite3
from contextlib import closing

# time to aid in webscraping
import time

# selenium for fetching dynamic web content
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

### Step 2: Setting up the Database

I plan on collecting multiple datasets throughout this project. To keep everything uniform, I decided to store my data in a SQLite database.

In [4]:
# Setup SQLite Database
database = 'Book_DB.db'
conn = sqlite3.connect(database)
c = conn.cursor()
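
As a quick check that a connection works, the tables in a database can be listed from SQLite's built-in `sqlite_master` table. This is a minimal sketch and not part of the original workflow: it uses an in-memory database and a hypothetical table named `example` so it can run without touching `Book_DB.db`, and it shows one possible use of the `closing` helper imported in Step 1.

```python
import sqlite3
from contextlib import closing

def list_tables(connection):
    """Return the names of all tables in the connected SQLite database."""
    # closing() makes sure the cursor is closed once the query has run
    with closing(connection.cursor()) as cursor:
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [row[0] for row in cursor.fetchall()]

# In-memory database used only for this illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE example (title TEXT, bklink TEXT)")
tables = list_tables(demo_conn)
print(tables)
demo_conn.close()
```

Calling `list_tables(conn)` on the project connection should work the same way once tables have been exported to it.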

### Step 3: Setting scrape parameters

I found a list on Goodreads Listopia called `Best Banned, Censored, and Challenged Books`. According to its description, the list is composed of "books that have at one point either been banned, censored, or requested for removal from libraries". The list contains over 700 books, which I am sure include many, if not all, of the books from the ALA Top Banned Books lists. Lists on Goodreads display only 100 books per page, so there are at least 8 pages that need to be scraped.

In [19]:
# Set base url and number of pages to scrape
# Add header to avoid '403' errors
base_url = 'https://www.goodreads.com/list/show/1360.Best_Banned_Censored_and_Challenged_Books?page='
pages = 8
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
    }
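
The page count follows from the list size: at 100 books per page, the number of pages is the book count divided by 100, rounded up. Below is a small sketch of that arithmetic and of the URLs the scrape will request. The count of 719 and the helper name `pages_needed` are illustrative assumptions, not values read from Goodreads.

```python
import math

BOOKS_PER_PAGE = 100  # Goodreads lists display 100 books per page

def pages_needed(book_count, per_page=BOOKS_PER_PAGE):
    """Number of list pages required to cover book_count books."""
    return math.ceil(book_count / per_page)

base_url = 'https://www.goodreads.com/list/show/1360.Best_Banned_Censored_and_Challenged_Books?page='
n_pages = pages_needed(719)  # 719 is an assumed example count
page_urls = [base_url + str(n) for n in range(1, n_pages + 1)]
print(n_pages, page_urls[0], page_urls[-1], sep='\n')
```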

### Step 4: Scraping the list

In [20]:
# Setup empty list to store data
TopBannedBooks = []

In [21]:
# Use a for loop to scrape all books from the List pages
for page_num in range(1, pages + 1):
    # Construct the URL for the current page
    URL = base_url + str(page_num)
    
    # Send a GET request to the URL
    response = requests.get(URL, headers=header)

    BannedBooks = response.content
    
    ## Grabbing all tags in webpage of 'a' type and class 'bookTitle'
    soup = BeautifulSoup(BannedBooks, "lxml")
    block = soup.select('a.bookTitle')

    ## Iterating through and creating list for all titles (bookT) and links (bookLink)
    bookT = [x.text.strip() for x in block]
    bookLink = ['https://www.goodreads.com' + x.get('href') for x in block]

    ## Combining the two lists into a two-column array and storing it
    col_stack = np.column_stack((bookT, bookLink))
    TopBannedBooks.append(col_stack)

In [23]:
# Create DataFrame
df = pd.DataFrame(np.concatenate(TopBannedBooks), columns = ['title', 'bklink'])
BookID = [re.search(r'\d+', i)[0] for i in df['bklink']]
df['bookID'] = BookID
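
`re.search(r'\d+', ...)` picks up the first run of digits in each link, which works here because the ID is the first number in each book URL. A slightly stricter variant, sketched below on the assumption that every book link contains `/book/show/` followed by the ID, anchors the match to that path segment. The helper name `extract_book_id` is hypothetical.

```python
import re

BOOK_ID_PATTERN = re.compile(r'/book/show/(\d+)')

def extract_book_id(link):
    """Return the numeric ID from a Goodreads book link, or None if absent."""
    match = BOOK_ID_PATTERN.search(link)
    return int(match.group(1)) if match else None

links = [
    'https://www.goodreads.com/book/show/61439040-1984',
    'https://www.goodreads.com/book/show/871294.Z',
    'https://www.goodreads.com/list/show/1360',  # not a book link
]
ids = [extract_book_id(link) for link in links]
print(ids)
```

Returning `None` for a non-matching link makes unexpected URLs easy to spot, instead of raising an error partway through the list.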

In [24]:
# Review DataFrame
df.head()

Unnamed: 0,title,bklink,bookID
0,To Kill a Mockingbird,https://www.goodreads.com/book/show/2657.To_Ki...,2657
1,Harry Potter and the Sorcerer's Stone (Harry P...,https://www.goodreads.com/book/show/3.Harry_Po...,3
2,1984,https://www.goodreads.com/book/show/61439040-1984,61439040
3,Animal Farm,https://www.goodreads.com/book/show/7613.Anima...,7613
4,Fahrenheit 451,https://www.goodreads.com/book/show/13079982-f...,13079982


In [None]:
# Export copy of DataFrame to database
# df.to_sql(name='goodreads_list', con=conn, if_exists='append', index=False)
df.to_sql(name='goodreads_original', con=conn, if_exists='append', index=False)
# If not using sqlite you can also export a copy to csv
# df.to_csv('goodreads_list_original.csv', index=False)

In [25]:
# Remove parentheses and information within from title column
df['title'] = df['title'].str.replace(r'\([^()]*\)', '', regex=True)

# Remove special characters from title column
df['title'] = df['title'].str.replace(r'[\'\.+\`\|\#\:\’\*+]', '', regex=True)

# Change bookID to dtype int
df['bookID'] = df['bookID'].astype(int)

# Sort dataframe alphabetically by title
df = df.sort_values('title')

# Remove leading and trailing whitespaces from dataframe
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
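
To see what the two `str.replace` patterns do, the same regular expressions can be applied to a single string with the standard `re` module. This is only a sketch: the sample title is an illustrative assumption of the "Title (Series, #N)" form, and `clean_title` is a hypothetical helper name.

```python
import re

PARENS = re.compile(r'\([^()]*\)')    # parentheses and their contents
SPECIAL = re.compile(r"['.+`|#:’*]")  # same characters as the pattern above

def clean_title(title):
    """Drop parenthetical series info and special characters, then trim."""
    title = PARENS.sub('', title)
    title = SPECIAL.sub('', title)
    return title.strip()

sample = "Harry Potter and the Sorcerer's Stone (Harry Potter, #1)"
cleaned = clean_title(sample)
print(cleaned)
```

Note that the apostrophe is removed as well, so "Sorcerer's" becomes "Sorcerers".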

In [26]:
# Keep only unique values 
df = df.drop_duplicates(subset='title')
df

Unnamed: 0,title,bklink,bookID
486,1001 Comics You Must Read Before You Die The U...,https://www.goodreads.com/book/show/10469840-1...,10469840
2,1984,https://www.goodreads.com/book/show/61439040-1984,61439040
369,22s Diary,https://www.goodreads.com/book/show/34828777-2...,34828777
718,A Bad Boy Can Be Good for a Girl,https://www.goodreads.com/book/show/584937.A_B...,584937
477,A Butler Christmas,https://www.goodreads.com/book/show/35295478-a...,35295478
...,...,...,...
379,Z,https://www.goodreads.com/book/show/871294.Z,871294
389,Zero,https://www.goodreads.com/book/show/608787.Zero,608787
713,Zhuan Falun,https://www.goodreads.com/book/show/19869794-z...,19869794
717,ttyl,https://www.goodreads.com/book/show/301023.ttyl,301023


In [27]:
# Isolate the bookID column in the dataframe and create a list
bookIDs = df['bookID'].unique().astype(int)
bookIDs = np.sort(bookIDs).tolist()
bookIDs

[1,
 2,
 3,
 5,
 6,
 24,
 30,
 33,
 34,
 249,
 264,
 295,
 330,
 343,
 662,
 890,
 929,
 960,
 968,
 1303,
 1420,
 1519,
 1554,
 1591,
 1617,
 1618,
 1622,
 1625,
 1848,
 1852,
 1869,
 1934,
 1953,
 2122,
 2165,
 2175,
 2187,
 2316,
 2657,
 2696,
 2767,
 2839,
 2865,
 2956,
 2997,
 3103,
 3636,
 3685,
 3835,
 3863,
 3876,
 4406,
 4473,
 4671,
 4708,
 4900,
 4909,
 4953,
 4981,
 5043,
 5107,
 5129,
 5148,
 5209,
 5220,
 5297,
 5308,
 5326,
 5368,
 5527,
 5693,
 5805,
 5854,
 6149,
 6295,
 6310,
 6327,
 6328,
 6333,
 6440,
 6514,
 6689,
 7437,
 7445,
 7588,
 7604,
 7613,
 7624,
 7777,
 8732,
 8737,
 8909,
 9328,
 9516,
 9646,
 9777,
 10210,
 10592,
 10603,
 10614,
 10629,
 10799,
 10917,
 10964,
 11012,
 11127,
 11149,
 11337,
 11378,
 11573,
 11588,
 11868,
 12067,
 12321,
 12467,
 12649,
 12722,
 12781,
 12898,
 13023,
 13214,
 13615,
 13651,
 14050,
 14743,
 15195,
 15196,
 15197,
 15622,
 15881,
 16640,
 16735,
 16900,
 16981,
 17125,
 17162,
 17250,
 18116,
 18122,
 18254,
 18266,
 

In [None]:
# Export the list of IDs to a text file to use for scraping the remaining book data
with open('book_IDs.txt', 'w') as f:
    for item in bookIDs:
        f.write(f"{item}\n")

### Step 5: Collect the Book Data

Before scraping, make a new folder to store all the output files. Make sure you are in the correct directory, then type the code below into the command line. If using VS Code, select the code below, then go to Terminal > Run Selected Text.

In [None]:
!mkdir banned_book_data

To start scraping, run the `get_book_data.py` script in the command line (or Terminal). Direct it to place output files in the folder `/banned_book_data` and set the file format of the compiled book data to CSV. To scrape a small sample of the books, use `book_IDs_sample.txt`.

In [None]:
!python get_book_data.py --book_ids_path book_IDs.txt --output_directory_path banned_book_data --format csv

In [51]:
!python get_book_data.py --book_ids_path book_IDs_sample.txt --output_directory_path banned_book_data --format csv

2024-08-08 15:34:47.912842 get_book_data.py: Scraping 42837514...
2024-08-08 15:34:47.912842 get_book_data.py: #1 out of 5 books
2024-08-08 15:35:12.866216 get_book_data.py: Scraping 44280883...
2024-08-08 15:35:12.866216 get_book_data.py: #2 out of 5 books
2024-08-08 15:35:39.039447 get_book_data.py: Scraping 22074335...
2024-08-08 15:35:39.039447 get_book_data.py: #3 out of 5 books
2024-08-08 15:35:47.708095 get_book_data.py: Scraping 214335039...
2024-08-08 15:35:47.708095 get_book_data.py: #4 out of 5 books
2024-08-08 15:35:52.521699 get_book_data.py: Scraping 292327...
2024-08-08 15:35:52.521699 get_book_data.py: #5 out of 5 books
2024-08-08 15:36:02.254511 get_book_data.py:

🙌 Success! All book data scraped. 🙌

Data files output to /banned_book_data
Total scraping run time = ⏳ 0:01:14.341669 ⌛


Once the script has completed, the compiled data can be viewed by reading it into a pandas DataFrame.

In [53]:
bk_data = pd.read_csv("banned_book_data/all_books.csv")
bk_data

Unnamed: 0,book_id_title,book_id,cover_image_uri,book_title,book_series,book_series_uri,top_5_other_editions,isbn,isbn13,year_first_published,...,author,num_pages,genres,shelves,lists,num_ratings,num_reviews,average_rating,rating_distribution,reviews_page
0,10138607,10138607,https://images-na.ssl-images-amazon.com/images...,Habibi,,,https://www.goodreads.com/work/editions/15036678,9780375424,9780375424144,"September 1, 2011",...,Craig Thompson,672,"['Graphic Novels', 'Comics', 'Fiction', 'Graph...","{'to-read': 48146, 'graphic-novels': 2455, 'gr...","{'Best Graphic Novels': [38, 3359], 'Required ...",42085,3976,4.03,"{'5 Stars': 1544, '4 Stars': 2490, '3 Stars': ...",https://www.goodreads.com/book/show/10138607/r...
1,10818853,10818853,https://images-na.ssl-images-amazon.com/images...,Fifty Shades of Grey,Fifty Shades,https://www.goodreads.com/series/63134-fifty-s...,https://www.goodreads.com/work/editions/15732562,9781612130,9781612130293,"May 25, 2011",...,E.L. James,356,"['Romance', 'Fiction', 'Erotica', 'BDSM', 'Adu...","{'to-read': 715974, 'currently-reading': 51995...","{'Best Book Boyfriends': [2, 10180], 'Best M/F...",2659011,84832,3.66,"{'5 Stars': 295769, '4 Stars': 276111, '3 Star...",https://www.goodreads.com/book/show/10818853/r...
2,10917,10917,https://images-na.ssl-images-amazon.com/images...,My Sister’s Keeper,,,https://www.goodreads.com/work/editions/1639903,9780743454,9780743454537,"April 6, 2004",...,Jodi Picoult,423,"['Fiction', 'Chick Lit', 'Young Adult', 'Drama...","{'to-read': 335130, 'currently-reading': 8931,...","{'Best Books Ever': [83, 122775], 'Best Books ...",1231031,38093,4.10,"{'5 Stars': 21917, '4 Stars': 53698, '3 Stars'...",https://www.goodreads.com/book/show/10917/revi...
3,11330361,11330361,https://images-na.ssl-images-amazon.com/images...,A Stolen Life,Jaycee Dugard,https://www.goodreads.com/series/368303-jaycee...,https://www.goodreads.com/work/editions/16258764,9781451629,9781451629187,"July 11, 2011",...,Jaycee Dugard,273,"['Nonfiction', 'Memoir', 'True Crime', 'Biogra...","{'to-read': 111068, 'currently-reading': 4090,...","{'Kidnapped!': [21, 773], 'Books That Everyone...",126111,10174,3.95,"{'5 Stars': 2011, '4 Stars': 6657, '3 Stars': ...",https://www.goodreads.com/book/show/11330361/r...
4,11330361,11330361,https://images-na.ssl-images-amazon.com/images...,A Stolen Life,Jaycee Dugard,https://www.goodreads.com/series/368303-jaycee...,https://www.goodreads.com/work/editions/16258764,9781451629,9781451629187,"July 11, 2011",...,Jaycee Dugard,273,"['Nonfiction', 'Memoir', 'True Crime', 'Biogra...","{'to-read': 111062, 'currently-reading': 4094,...","{'Kidnapped!': [21, 773], 'Books That Everyone...",126102,10174,3.95,"{'5 Stars': 2011, '4 Stars': 6657, '3 Stars': ...",https://www.goodreads.com/book/show/11330361/r...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,95144,95144,https://images-na.ssl-images-amazon.com/images...,In the Night Kitchen,,,https://www.goodreads.com/work/editions/2223682,9780099417,9780099417477,"January 1, 1970",...,Maurice Sendak,40,"['Picture Books', 'Childrens', 'Fiction', 'Fan...","{'to-read': 6526, 'picture-books': 710, 'child...","{""Best Children's Books"": [172, 5117], 'Best B...",18404,997,4.00,"{'5 Stars': 672, '4 Stars': 1267, '3 Stars': 3...",https://www.goodreads.com/book/show/95144/revi...
139,9516,9516,https://images-na.ssl-images-amazon.com/images...,Persepolis: The Story of a Childhood,Persepolis,https://www.goodreads.com/series/45795-persepolis,https://www.goodreads.com/work/editions/3303888,9780375714,9780375714573,"April 29, 2003",...,Marjane Satrapi,153,"['Graphic Novels', 'Nonfiction', 'Memoir', 'Co...","{'to-read': 141765, 'graphic-novels': 5800, 'g...","{'Best Books of the Decade: 2000s': [68, 7129]...",212336,12222,4.26,"{'5 Stars': 2252, '4 Stars': 5499, '3 Stars': ...",https://www.goodreads.com/book/show/9516/revie...
140,958289,958289,https://images-na.ssl-images-amazon.com/images...,Skippyjon Jones,Skippyjon Jones,https://www.goodreads.com/series/60536-skippyj...,https://www.goodreads.com/work/editions/649854,9780142404,9780142404034,"September 15, 2003",...,Judy Schachner,32,"['Picture Books', 'Childrens', 'Animals', 'Fic...","{'to-read': 5912, 'picture-books': 740, 'child...","{'Boy Friendly Picture Books': [35, 190], 'Bes...",34633,1431,4.22,"{'5 Stars': 1051, '4 Stars': 1685, '3 Stars': ...",https://www.goodreads.com/book/show/958289/rev...
141,968,968,https://images-na.ssl-images-amazon.com/images...,The da Vinci Code,Robert Langdon,https://www.goodreads.com/series/92467-robert-...,https://www.goodreads.com/work/editions/2982101,isbn not found,isbn13 not found,"January 1, 2003",...,Dan Brown,489,"['Fiction', 'Mystery', 'Thriller', 'Mystery Th...","{'to-read': 540941, 'currently-reading': 23689...","{'Best Books Ever': [20, 122774], 'The BOOK wa...",2380081,55573,3.92,"{'5 Stars': 87018, '4 Stars': 158677, '3 Stars...",https://www.goodreads.com/book/show/968/review...
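
Columns such as `genres`, `shelves`, `lists`, and `rating_distribution` are stored in the CSV as text, so after `pd.read_csv` each cell is a string that only looks like a Python list or dictionary. One way to turn such a string back into a real object is `ast.literal_eval` from the standard library. The sketch below uses shortened sample strings modeled on the Habibi row; the helper name `parse_cell` is hypothetical.

```python
import ast

def parse_cell(text):
    """Convert a stringified list or dict back into a Python object."""
    # literal_eval only evaluates Python literals; it does not run arbitrary code
    return ast.literal_eval(text)

genres_text = "['Graphic Novels', 'Comics', 'Fiction']"
shelves_text = "{'to-read': 48146, 'graphic-novels': 2455}"

genres = parse_cell(genres_text)
shelves = parse_cell(shelves_text)
print(genres[0], shelves['to-read'])
```

On the full DataFrame, something like `bk_data['genres'].apply(ast.literal_eval)` should do the same for a whole column, provided every cell holds a valid literal.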


In [None]:
# Export copy of DataFrame to database
bk_data.to_sql(name='all_books', con=conn, if_exists='append', index=False)
# If not using sqlite you can also export a copy to csv
# bk_data.to_csv('all_books.csv', index=False)

### Step 6: Collect Review Data (Optional)

Collecting the review data for a list of books can be a lengthy process. The average processing time is 1 to 3 minutes for each book. If you still want to compile some book reviews, proceed. Otherwise, I have included samples of some book review output as `BookTitle.csv` and the final joined scrape of all reviews from the 'Best Banned, Censored, and Challenged Books' list as `book_reviews.csv`.

In [56]:
# Function to collect all reviews for each book in the List
def get_Book_Reviews(page_source):

    soup = BeautifulSoup(page_source, 'lxml')

    # Locate book title
    title = soup.find('h1', attrs = {'class' : 'Text H1Title'}).text

    # Locate all user account hrefs
    cont = soup.select('div.ReviewerProfile__name')
    hrefsUsers = [x.find('a')['href'] for x in cont]

    # Collect user review text
    contReview = soup.select("section.ReviewText")
    Reviews = [x.text.strip() for x in contReview]

    # Collect individual user ratings 
    contRatingCont = soup.select("div.ShelfStatus")
    userRatings = [x.find('span')['aria-label'] if x.findChildren('span', recursive=False) else 'No Rating' for x in contRatingCont]

    # Collect the date the review was written
    dateCont = soup.select('section.ReviewCard__row')
    datesOfReviews = [x.find('span', attrs = {'class': 'Text Text__body3'}).text for x in dateCont]

    # Collect number of likes and comments for review
    commentLikeCont = soup.select('footer.SocialFooter')
    likes = ['0' if x.find('div', attrs={'class': 'SocialFooter__statsContainer'}) is None else x.find('span', attrs={'class': 'Button__labelItem'}).text for x in commentLikeCont]
    comments = ['0' if x.find('div', attrs={'class': 'Button__container'}).next_sibling is None else x.find('div', attrs={'class': 'Button__container'}).next_sibling.text for x in commentLikeCont]

    # Create DataFrame of all review data collected

    reviewData = pd.DataFrame({ 'User Href' : hrefsUsers,
                                'Title' : title,
                                'Rating' : userRatings,
                                'Date' : datesOfReviews,
                                'Likes' : likes,
                                'Comments' : comments,
                                'Review' : Reviews})
    
    return(reviewData)
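
The `Rating`, `Likes`, and `Comments` columns are collected as text. Assuming the labels look something like `Rating 4 out of 5` and `1,204 likes` (these formats are an assumption and are not shown in this notebook), the first number can be pulled out with a regular expression. The helper `first_int` is hypothetical.

```python
import re

def first_int(text, default=0):
    """Return the first integer found in text, ignoring thousands separators."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group().replace(',', '')) if match else default

# Assumed example label formats, for illustration only
examples = ['Rating 4 out of 5', '1,204 likes', '0', 'No Rating']
parsed = [first_int(e) for e in examples]
print(parsed)
```

Text without any digits, such as `No Rating`, falls back to the default value, which may or may not be the right choice depending on how missing ratings should be treated.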

In [36]:
# Function to check if element exists on page
def check_element(e):
    if e:
        return (e)
    else:
        return (0)

# Function to automate loading all reviews    
def get_more_reviews(url):
    clicks = 0

    # Initializing driver and webpage and allowing time for reviews to load
    # Use Edge's own options class to match the Edge driver created below
    options = webdriver.EdgeOptions()
    options.add_argument('--headless')
    # headless option allows the web browser to run in the background
    # Remove this argument if you want a visible browser window while the script is running
    driver = webdriver.Edge(options=options)
    driver.get(url)
    time.sleep(3)

    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    try:
        count_text = soup.find('span', attrs = {'class' : 'Text Text__body3 Text__subdued'}).text
        nreviews = check_element(int(re.sub('\\D', '', count_text)))
    except (AttributeError, ValueError):
        # Review count element is missing or contains no digits
        nreviews = 0

    # Estimate the number of clicks needed, assuming 30 reviews per load, capped at 36 clicks
    cap = 36
    iters = np.round(nreviews/30)-1
    max_clicks = min(iters, cap)

    while clicks < max_clicks:

        # scrolling down page to ensure click will work on "show more results" button
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Clicking "show more results" button; stop early if it is no longer on the page
        buttons = driver.find_elements(By.XPATH, "//div[@class = 'Divider Divider--contents Divider--largeMargin']/div[@class = 'Button__container']/button")
        if not buttons:
            break
        driver.execute_script("arguments[0].click();", buttons[0])
        time.sleep(1)

        clicks += 1
    

    # grabbing reference for final state of page after n number of "show more results" button clicks
    page_source = driver.page_source

    reviews = get_Book_Reviews(page_source)

    driver.quit()

    return(reviews)
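
The number of button clicks above comes from the review count: the code divides by 30, rounds, subtracts one for the reviews already on the page, and caps the result at 36. Here is the same arithmetic as a standalone function, using Python's built-in `round` in place of `np.round` (both round halves to the nearest even number). `clicks_needed` is a hypothetical name, and treating 30 as the number of reviews per load is an inference from the code.

```python
def clicks_needed(nreviews, per_load=30, cap=36):
    """Clicks on the 'show more results' button for a given review count."""
    iters = round(nreviews / per_load) - 1
    # Never click a negative number of times, and never more than the cap
    return max(0, min(iters, cap))

# 997 and 3976 are review counts taken from the all_books table above
counts = [0, 90, 997, 3976]
clicks = [clicks_needed(n) for n in counts]
print(clicks)
```

With the cap at 36, at most about 37 loads of reviews are collected per book under this assumption, no matter how many reviews the book has.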

In [47]:
# Open table containing book links
BannedBooksReviews = pd.read_sql_query("SELECT * FROM all_books", conn)
# df = pd.read_csv('all_books.csv')
# BannedBooksReviews = df['reviews_page']

In [48]:
# Processing in smaller batches to double check loading into csv was successful
for i in range(0, 5):
    bookdata = get_more_reviews(BannedBooksReviews['reviews_page'][i])
    book_title = BannedBooksReviews['book_title'][i]
    # Save reviews to csv
    # bookdata.to_csv('Book{num}.csv'.format(num=i+1), index=False)
    bookdata.to_csv(f'{book_title}.csv', index=False)
    # Save reviews to database
    bookdata.to_sql(name='Book{num}'.format(num=i+1), con=conn, if_exists='append', index=False)
    # bookdata.to_sql(name=book_title, con=conn, if_exists='append', index=False)

Once the script has completed, the review data can be viewed in the database or in the project folder, depending on which save option you chose.
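
One caveat when saving files by book title: titles can contain characters that are not allowed in file names on some systems. The colon in `Persepolis: The Story of a Childhood`, for example, is rejected by Windows. Below is a minimal sketch of a sanitizing helper that could be applied to a title before building the CSV name. `safe_filename` is a hypothetical name, and the character set is an assumption based on the characters Windows reserves.

```python
import re

# Characters Windows does not allow in file names
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')

def safe_filename(title, extension='.csv'):
    """Replace invalid characters with underscores and append the extension."""
    cleaned = INVALID_CHARS.sub('_', title).strip().rstrip('.')
    return cleaned + extension

name = safe_filename('Persepolis: The Story of a Childhood')
print(name)
```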