*  DSC 540-T302 Data Preparation
*  Week 9 & 10 Exercise
*  Peter Lozano

# Activity 7.01 Extracting the Top 100 e-books from Gutenberg

In this activity, I extract the Top 100 e-books listed on [Project Gutenberg](https://www.gutenberg.org/browse/scores/top) using web scraping. I use Python with the `requests` and `BeautifulSoup` libraries.

## Import Libraries

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

## Read HTML from the URL

In [2]:
# Read HTML from the URL
gutenberg_url = "https://www.gutenberg.org/browse/scores/top"
# Get the page content
response = requests.get(gutenberg_url)

## Write a small function to check the status of the web request.

It's always good practice to check the status of the request to ensure that the page was retrieved successfully.

In [None]:
# Function for checking the status of the web request
def check_request_status(response):
    # 200 status code indicates that the request was successful.
    if response.status_code == 200:
        print("Request was successful.")
    else:
        print(f"Request failed with status code: {response.status_code}")

In [6]:
check_request_status(response)

Request was successful.


A **200** status code indicates that the request was successful. The function above treats every other status code as a failure, which is good enough for this exercise, where a plain 200 response is expected.
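As a rough illustration of how status codes fall into classes, the sketch below uses Python's standard `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and not part of the exercise.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not defined in the HTTPStatus enum

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or non-standard"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```

A check such as `response.status_code == 200` is stricter than "any 2xx code", which is usually fine for a simple page fetch.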

## Decode the response and pass this on to **BeautifulSoup** for HTML parsing.

The `response` variable holds a `Response` object, not the HTML itself, so passing it directly to BeautifulSoup will not work as expected. The page body is available as raw bytes through the object's `.content` attribute, and that is what I pass to BeautifulSoup for parsing.

I will show you what the response variable looks like before decoding.

In [11]:
response

<Response [200]>

As you can see, the response variable appears to only contain the response status but not the actual HTML content. This is because the response variable is an object that contains various attributes, including the status code, headers, and content.

To access the actual HTML content, we need to use the `.content` attribute of the response object. This will give us the raw bytes of the HTML content, which we can then decode to a string format if needed.
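To make the bytes-versus-string distinction concrete, here is a small sketch that does not touch the network. The sample markup and the UTF-8 encoding are assumptions for illustration; a real page declares its own encoding.

```python
# A tiny stand-in for the raw bytes a server would send (assumed UTF-8)
raw_bytes = "<title>Wuthering Heights by Emily Brontë</title>".encode("utf-8")

print(type(raw_bytes))               # bytes, the same kind of object as response.content
decoded = raw_bytes.decode("utf-8")  # turn the bytes into a regular string
print(type(decoded))                 # str, ready for string operations
print(len(raw_bytes), len(decoded))  # byte count vs. character count
```

The two lengths differ because `ë` occupies two bytes in UTF-8 but counts as one character once decoded.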

In [None]:
# Decode the response and pass this on to BeautifulSoup for HTML parsing.
soup = BeautifulSoup(response.content, 'html.parser')

Passing `'html.parser'` as the second argument tells **BeautifulSoup** to use Python's built-in HTML parser. The resulting `soup` object lets me search the page for information about the top 100 e-books, such as titles and authors.

## Find all the link tags with an **href** attribute and store the links in a list. Check what the list looks like -- print the first 30 elements.

Now I have to find the tags that contain the information about the top 100 e-books. After inspecting the HTML structure of the page, I found that the top 100 e-books are listed under an `<ol>` tag preceded by an `<h2>` tag with the text "Top 100 EBooks yesterday". As a first step, I collect the `href` value of every `<a>` tag on the page; the book links are separated from the rest afterwards.

In [None]:
# Initialize a list to store links
lst_links = []
# Find all the href tags and store them in the list
for link in soup.find_all('a', href=True):
    lst_links.append(link['href'])

# Print the first 30 elements of the list
print(lst_links[:30])

['/', '/donate/', '/about/', '/about/contact_information.html', '/about/background/', '/help/mobile.html', '/help/', '/ebooks/offline_catalogs.html', '/donate/', '/browse/scores/top', '/ebooks/categories', '/ebooks/bookshelf/', '/ebooks/', '/browse/scores/top', '/ebooks/categories', '/about/pretty-pictures.html', '#books-last1', '#books-last7', '#books-last30', '#authors-last1', '#authors-last7', '#authors-last30', '/ebooks/84', '/ebooks/2701', '/ebooks/1342', '/ebooks/1513', '/ebooks/43', '/ebooks/16328', '/ebooks/2641', '/ebooks/100']
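For intuition about what the loop above does, the sketch below collects `href` values with the standard library's `html.parser.HTMLParser` instead of BeautifulSoup. This is a swapped-in technique for illustration only; the class name `HrefCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


sample_html = (
    '<p><a href="/ebooks/84">Frankenstein</a> '
    '<a name="top">no href here</a> '
    '<a href="/donate/">Donate</a></p>'
)
collector = HrefCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

As with `find_all('a', href=True)`, the `<a>` tag without an `href` attribute is skipped.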


## Use a regular expression to find the numeric digits in these links. Loop over the appropriate range and apply the regex to each link (href) string. Use the `findall()` method.

Every book link has the form `/ebooks/` followed by digits, and those digits are the eBook's file number. I will use a regular expression to extract the digits from the links; links without digits after `ebooks/` produce no match and are skipped.

In [None]:
# Initialize a list to store file numbers
file_numbers = []

# Loop over the list of links to extract numeric digits using regex
for link in lst_links:
    # Matching 'ebooks/' followed by digits using findall()
    match = re.findall(r'ebooks/(\d+)', link)
    if len(match) == 1:
        # Appending the numeric digits (file number) to the list
        file_numbers.append(match[0])
print(file_numbers[:30])  # Print the first 30 file numbers


['84', '2701', '1342', '1513', '43', '16328', '2641', '100', '11', '145', '37106', '2554', '64317', '67979', '768', '1260', '16389', '5197', '1080', '394', '2160', '1259', '6761', '6593', '2542', '4085', '844', '3207', '76', '174']
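To see why the pattern keeps only book links, the sketch below runs the same `findall()` call on a few of the links printed earlier. Converting the matches to `int` is an optional extra step, not something the exercise requires.

```python
import re

# A handful of links taken from the printed list above
sample_links = ['/', '/ebooks/offline_catalogs.html', '#books-last1',
                '/ebooks/84', '/ebooks/2701']

book_ids = []
for link in sample_links:
    found = re.findall(r'ebooks/(\d+)', link)  # list of captured digit groups
    print(f"{link!r:35} -> {found}")
    if len(found) == 1:
        book_ids.append(int(found[0]))  # optional: store as integers

print(book_ids)
```

`/ebooks/offline_catalogs.html` contains `ebooks/` but no digits right after it, so `findall()` returns an empty list and the link is ignored.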


## What does the **soup** object's text look like? Use the `.text` attribute and print only the first 2,000 characters (do not print the whole thing, as it is too long).

I can use the **soup** object's `.text` attribute to get the text content of the HTML document. This will give me a plain text representation of the HTML.

In [22]:
soup.text[:2000]  # Print the first 2000 characters of the soup object's text

"\n\n\n\nTop 100 | Project Gutenberg\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nX\n\nGo!\n\n\n\n\n\n\n\n Donate \n\n\n\n\n\nAbout▼\n\nAbout Project Gutenberg \nContact Us\nHistory & Philosophy\nKindle & eReaders\nHelp Pages\nOffline Catalogs\nDonate\n\n\n\nFrequently Downloaded\nMain Categories\nReading Lists\nSearch Options\n\n\n\nFrequently Downloaded\nMain Categories\n\n\n\nFrequently Viewed or Downloaded\nCalculated from the number of times each eBook gets\ndownloaded. (Multiple downloads from the same Internet\naddress on the same day count as one download. Addresses\nthat download more than 100 eBooks in a day are considered\nrobots and are not counted.)\n\nDownloaded Books\n2026-01-211441835\nlast 7 days7651346\nlast 30 days36511933\n\nVisualizations and graphs are available as\npretty pictures.\n\n\nTop 100 EBooks: Yesterday - 7\xa0days - 30\xa0days\nTop 100 Authors: Yesterday - 7\xa0days - 30\xa0days\n\nTop 100 EBooks yesterday\n\nFrankenstein; Or, Th

## Search the extracted text (using regex) from the **soup** object to find the names of the top 100 eBooks (yesterday's ranking).

I will start by creating a temporary list to hold the strings of the top 100 eBook names.

In [64]:
list_titles_temp = []

## Create a starting index. It should point at the text "Top 100 EBooks yesterday". Use the **splitlines** method of **soup.text**, which splits the text of the **soup** object into lines.

Rather than hard-coding the position, I look up the line "Top 100 EBooks yesterday" in the list returned by `soup.text.splitlines()`. For the page as retrieved here, the lookup returns line index 98, and that becomes the starting index.

In [65]:
# Locate the header line; .index() raises ValueError if the line is not found
start_index = soup.text.splitlines().index('Top 100 EBooks yesterday')  # 98 for this retrieval of the page
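The lookup can be tried on a toy string. In this sketch the sample text and the helper `find_header_index` are hypothetical; the helper returns `None` instead of raising when the header is absent, which is one possible way to handle a changed page layout.

```python
def find_header_index(text, header):
    """Return the line index of `header` in `text`, or None if it is absent."""
    lines = text.splitlines()
    try:
        return lines.index(header)  # exact, whole-line, case-sensitive match
    except ValueError:
        return None


sample_text = (
    "Top 100 | Project Gutenberg\n"
    "\n"
    "Top 100 EBooks yesterday\n"
    "\n"
    "Frankenstein\n"
    "Moby Dick\n"
)
print(find_header_index(sample_text, "Top 100 EBooks yesterday"))
print(find_header_index(sample_text, "Top 100 Ebooks yesterday"))  # different capitalization
```

Because `list.index()` compares whole strings exactly, a header with different capitalization or extra whitespace is not found.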


## Run the **for** loop from **1-100** to add the strings of the next **100** lines to this temporary list. **Hint**: use the **splitlines** method.

Starting from `start_index`, the for loop collects the next 100 lines of text, which contain the eBook titles. The result is a list of strings, each holding an eBook title along with its author.

In [None]:
# Split the page text into lines once instead of on every iteration
lines = soup.text.splitlines()
for i in range(100):
    # +2 skips the header line and the blank line that follows it
    list_titles_temp.append(lines[start_index + 2 + i])
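The same extraction can also be written as a single slice. The sketch below uses a toy list and a hypothetical `count` of 2 in place of 100; the `+ 2` offset assumes, as in the loop, one header line followed by one blank line.

```python
sample_lines = [
    "Top 100 EBooks yesterday",
    "",
    "Frankenstein",
    "Moby Dick",
    "Pride and Prejudice",
    "",
]

start = sample_lines.index("Top 100 EBooks yesterday")
count = 2          # stand-in for 100 in the exercise
first = start + 2  # skip the header line and the blank line after it
titles = sample_lines[first:first + count]
print(titles)
```

A slice never raises `IndexError`; if the list is shorter than expected, it simply returns fewer items, so checking `len(titles)` afterwards is a reasonable safeguard.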

## Use regex to extract only text from the name strings and append them to an empty list. Use **match** and **span** to find the indices and use them.

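Now that I have my temporary list of eBook titles with authors, I will use regular expressions to extract only the text portion of each entry. The pattern `^[a-zA-Z ]*` matches the run of letters and spaces at the start of each string, and `span()` returns the start and end indices of that match.

This method is simple but lossy. The match stops at the first character that is not an ASCII letter or a space, so everything from that character onward is dropped. Entries that contain punctuation, digits, or accented letters are cut short, as the printed list shows with `The Strange Case of Dr`, `A Doll`, and `Wuthering Heights by Emily Bront`.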

In [None]:
list_titles = []
for i in range(100):
    # Use regex to extract only the alphabetic characters from the beginning of each title
    id1, id2 = re.match(r'^[a-zA-Z ]*', list_titles_temp[i]).span()  # span() returns the start and end index of the match
    list_titles.append(list_titles_temp[i][id1:id2])
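The truncation is easy to reproduce on a single string, and one possible alternative is to remove only the trailing download count. The sketch below assumes each line ends with a count in parentheses, such as `(12345)`; the sample lines, their counts, and the helper name `strip_download_count` are made up for illustration.

```python
import re


def strip_download_count(line):
    """Remove a trailing '(digits)' group and surrounding whitespace (hypothetical helper)."""
    return re.sub(r'\s*\(\d+\)\s*$', '', line)


# Made-up sample lines in the assumed "Title by Author (count)" shape
sample = [
    "Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (12345)",
    "A Doll's House : a play by Henrik Ibsen (678)",
]

for line in sample:
    letters_only = re.match(r'^[a-zA-Z ]*', line).group()  # approach used in the exercise
    print(repr(letters_only), '|', repr(strip_download_count(line)))
```

The letters-only pattern stops at the semicolon and the apostrophe, while the alternative keeps punctuation and accented characters in the title and author.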

## Print the list of titles.

In [68]:
for title in list_titles:
    print(title)

Frankenstein
Moby Dick
Pride and Prejudice by Jane Austen 
Romeo and Juliet by William Shakespeare 
The Strange Case of Dr
Beowulf
A Room with a View by E
The Complete Works of William Shakespeare by William Shakespeare 
Alice
Middlemarch by George Eliot 
Little Women
Crime and Punishment by Fyodor Dostoyevsky 
The Great Gatsby by F
The Blue Castle
Wuthering Heights by Emily Bront
Jane Eyre
The Enchanted April by Elizabeth Von Arnim 
My Life 
A Modest Proposal by Jonathan Swift 
Cranford by Elizabeth Cleghorn Gaskell 
The Expedition of Humphry Clinker by T
Twenty years after by Alexandre Dumas and Auguste Maquet 
The Adventures of Ferdinand Count Fathom 
History of Tom Jones
A Doll
The Adventures of Roderick Random by T
The Importance of Being Earnest
Leviathan by Thomas Hobbes 
Adventures of Huckleberry Finn by Mark Twain 
The Picture of Dorian Gray by Oscar Wilde 
The Count of Monte Cristo by Alexandre Dumas and Auguste Maquet 
The King in Yellow by Robert W
The Adventures of Tom Saw

# Activity 7.02 Building Your Own Movie Database by Reading an API