# Lesson 2_3: Scraping Gutenberg: Batch Download



## 1 Introduction

Up to this point, we have only downloaded and modified a data table we found on Gutenberg. This table is important because it contains the `ID` number of every Gutenberg text. With these numbers we can scrape and access virtually every text, because the web address of each text follows a consistent naming convention.

For example, the `txt` file for *Huckleberry Finn* by Mark Twain is stored here:

https://www.gutenberg.org/cache/epub/76/pg76.txt

Meanwhile the `txt` file for *Tom Sawyer* by Mark Twain is stored here:

https://www.gutenberg.org/cache/epub/74/pg74.txt

The only difference is that one is stored at `76` and the other is stored at `74`. Consequently, if we want to download a batch of specific files, we only really need their ID numbers, since this is the only thing that changes in the web address.

Indeed, we could simply make a list of the IDs we want to download and loop through it. We can write this out as pseudo-code, something like this:

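```python
df_texts = []
list1 = [74, 75, 76]
for x in list1:
    # download_text() is a placeholder, not a real function
    new_text = download_text(f'https://www.gutenberg.org/cache/epub/{x}/pg{x}.txt')
    df_texts.append(new_text)
```
The above code won't run as written because `download_text()` does not exist, but it shows the core principle.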

## 2 Problems

There are a number of technical issues with simply looping through a whole bunch of Gutenberg sites and downloading these texts. Explaining these technical issues in full is not very interesting and beyond the scope of this course. Briefly they include:
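- Gutenberg discourages robots on its main site, so we have to use the servers it sets aside for them
- Not all texts are on the same servers, so we have to switch servers
- The web address is not always exactly the same (some file names carry a suffix such as `-8` or `-0`)
- The downloaded text still contains Project Gutenberg metadata at its start and end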

Fortunately, a kind [coder](https://skeptric.com/gutenberg/) has already solved many of these problems. All I have done is modify some of their code for our purposes. We won't get bogged down with the details.

## 3 Solutions

We will be using the `tqdm` package to check our progress when we download the files.

In [154]:
try:
    import tqdm  # Replace with the package you want to check
    print("tqdm is already installed.")
except ImportError:
    print("tqdm is not installed. Installing...")
    # Use magic command to install the package
    %pip install tqdm

tqdm is already installed.


In [155]:
# Importing the necessary libraries
import pandas as pd  # For working with data in a DataFrame
import requests      # To make HTTP requests to download data from URLs
import re            # For regular expressions
from io import BytesIO  # For handling byte streams (used for unzipping files)
import zipfile       # For unzipping files downloaded from Project Gutenberg
from tqdm import tqdm  # For showing a progress bar when downloading multiple files
import logging       # For logging warnings or errors
import chardet       # To detect the character encoding of downloaded files



### 3.1 Get Web scraping server info

Get the page that has a list of all of the robot servers

In [156]:
gutenberg_robot_url = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt"
r = requests.get(gutenberg_robot_url)

Extract the base address of a mirror from the first `.zip` link on that page

In [157]:
gutenberg_mirror = re.search(r'(https?://[^/]+)[^"]*\.zip', r.text).group(1)
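
To see what the regular expression captures, here is a minimal sketch run against a made-up snippet of HTML. The host name `mirror.example.org` and the snippet itself are assumptions for illustration; the real harvest page may be laid out differently.

```python
import re

# Hypothetical HTML, standing in for the text of the harvest page
sample_html = '<a href="http://mirror.example.org/1/2/3/1234/1234.zip">1234.zip</a>'

# Group 1 captures the scheme and host; the rest of the pattern only
# checks that the link ends in ".zip"
match = re.search(r'(https?://[^/]+)[^"]*\.zip', sample_html)

# re.search returns None when nothing matches, so check before using it
mirror = match.group(1) if match else None
print(mirror)
```

Checking `match` before calling `.group(1)` avoids an `AttributeError` if the page ever contains no `.zip` link.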

### 3.2 Loop through mirrors

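Build the list of candidate URLs for a book on the mirror. The zip file name may carry one of several suffixes, so we generate one URL per suffix and try each in turn.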

In [158]:
def gutenberg_text_urls(id: str, mirror=gutenberg_mirror, suffixes=("", "-8", "-0")) -> list[str]:
    """
    Generate URLs to download the Gutenberg book by ID.
    
    Args:
        id (str): The book ID from Project Gutenberg.
        mirror (str): The mirror URL for Project Gutenberg.
        suffixes (tuple): Possible suffixes for the book file.
        
    Returns:
        list[str]: A list of possible URLs for downloading the book.
    """
    # Convert id to a string to ensure slicing works
    id = str(id)  
    
    # Build the directory path from every digit of the ID except the last (e.g. '2674' -> '2/6/7'), or '0' for a single-digit ID
    path = "/".join(id[:-1]) or "0"
    
    # Generate URLs using the mirror, path, and suffixes
    return [f"{mirror}/{path}/{id}/{id}{suffix}.zip" for suffix in suffixes]
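
The path rule is easier to follow with a few concrete IDs. The sketch below isolates just that step; the helper name `gutenberg_path` is made up for this illustration and is not part of the lesson's code.

```python
def gutenberg_path(book_id) -> str:
    """Return the directory path for a book ID (hypothetical helper)."""
    book_id = str(book_id)
    # Join every digit except the last with "/"; an empty result
    # (single-digit ID) falls back to "0"
    return "/".join(book_id[:-1]) or "0"

for book_id in (5, 76, 2674):
    path = gutenberg_path(book_id)
    print(book_id, "->", f"{path}/{book_id}/{book_id}.zip")
```

For the ID `2674`, for instance, the path part is `2/6/7`, so the file would be looked up under `2/6/7/2674/2674.zip` on the mirror.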


### 3.3 Create Download Function

Create a download function to get and extract each individual book

In [159]:
def download_gutenberg(id: str) -> str:
    """
    Download the book from Project Gutenberg by its ID,
    and unzip the content if necessary.
    
    Args:
        id (str): Gutenberg book ID.
    
    Returns:
        str: The content of the book as a text string or an error message.
    """
    for url in gutenberg_text_urls(id):
        try:
            r = requests.get(url)
            if r.status_code == 404:
                logging.warning(f"404 for {url} - moving to next possible URL.")
                continue
            r.raise_for_status()
            break
        except requests.exceptions.RequestException as e:
            logging.error(f"Error fetching {url}: {e}")
            continue
    else:
        fallback_url = f"https://www.gutenberg.org/ebooks/{id}.txt.utf-8"
        logging.info(f"Attempting fallback URL: {fallback_url}")
        try:
            r = requests.get(fallback_url)
            r.raise_for_status()
        except requests.exceptions.RequestException as e:
            logging.error(f"Error fetching fallback URL {fallback_url}: {e}")
            return "Unable to download file"

    if 'application/zip' in r.headers.get('Content-Type', ''):
        z = zipfile.ZipFile(BytesIO(r.content))
        if len(z.namelist()) != 1:
            return "Unable to download file"  # Return error if file count is unexpected
        
        # Read the file and detect its encoding; fall back to UTF-8 if detection fails
        file_content = z.read(z.namelist()[0])
        encoding = chardet.detect(file_content)['encoding'] or 'utf-8'
        return file_content.decode(encoding, errors='replace')  # Decode, replacing undecodable bytes
    else:
        return r.text  # Return the text content directly if it’s not a zip file
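
The download function relies on Python's `for ... else` construct, which can be surprising at first: the `else` branch runs only when the loop ends without hitting `break`. The sketch below shows the same control flow with no network access; the function `first_success` and the dictionary of "pages" are stand-ins invented for this illustration.

```python
def first_success(candidates, fetch, fallback):
    """Return the first non-None result of fetch(), else fetch(fallback)."""
    for url in candidates:
        result = fetch(url)
        if result is None:   # stands in for a 404 or a request error
            continue
        break                # success: the else branch below is skipped
    else:
        # Reached only when the loop finished without hitting break
        result = fetch(fallback)
    return result

# Hypothetical "server": a dict mapping URLs to content
pages = {"b.zip": "zip text", "plain.txt": "plain text"}

found = first_success(["a.zip", "b.zip"], pages.get, "plain.txt")
fell_back = first_success(["a.zip", "c.zip"], pages.get, "plain.txt")
print(found)
print(fell_back)
```

In the first call the second candidate succeeds, so the fallback is never requested; in the second call every candidate fails and the fallback is used.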


### 3.4 Strip Metadata

Use a small helper function to strip all of the metadata from the text

In [160]:
def strip_headers(text):
    gutenberg_text = "PROJECT GUTENBERG EBOOK"
    
    in_text = False
    output = []
    
    for line in text.splitlines():        
        if gutenberg_text in line:
            if not in_text:
                in_text = True
            else:
                break
        else:
            if in_text:
                output.append(line)

    return "\n".join(output).strip()
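
To check how the marker logic behaves, the sketch below restates it under a different name (`strip_between_markers`, chosen only for this example) and runs it on a short made-up text. The sample lines are invented; real files have much longer headers and footers.

```python
def strip_between_markers(text, marker="PROJECT GUTENBERG EBOOK"):
    """Keep only the lines between the first and second marker lines."""
    in_text = False
    output = []
    for line in text.splitlines():
        if marker in line:
            if in_text:
                break        # second marker: end of the book
            in_text = True   # first marker: the book starts on the next line
        elif in_text:
            output.append(line)
    return "\n".join(output).strip()

# Made-up sample standing in for a downloaded file
sample = "\n".join([
    "Title: A Made-Up Book",
    "*** START OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***",
    "",
    "Chapter 1",
    "It was a dark and stormy night.",
    "",
    "*** END OF THE PROJECT GUTENBERG EBOOK A MADE-UP BOOK ***",
    "License text that should be dropped.",
])

body = strip_between_markers(sample)
print(body)
```

Because the check is line by line, a start marker that wraps onto a second line would leave its tail in the output. If no marker is found at all, the function returns an empty string.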

### 3.5 Download and Strip

Create a main function that runs the two functions above to download an individual book, clean it, and return it.

In [161]:
def book_text(book_id):
    """
    Fetches and returns the text content of a book from Project Gutenberg using the book ID.
    
    Args:
        book_id (str): The Gutenberg book ID.
    
    Returns:
        str: The cleaned book text.
    """
    # download_gutenberg already returns the text content as a string
    text = download_gutenberg(book_id)
    
    # Clean the text by stripping the Project Gutenberg header and footer
    clean_text = strip_headers(text)
    
    return clean_text


### 3.6 Create DataFrame column `text_data`

This function calls `book_text()` for each `text_id` in the supplied dataframe.

In [162]:
def fetch_text_data(df):
    """
    Fetches text data for each book in the DataFrame and inserts it into the 'text_data' column.
    If 'text_data' already exists, prompts the user for confirmation before overwriting.
    
    Args:
        df (pd.DataFrame): A DataFrame that contains a 'text_id' column with book IDs.
    
    Returns:
        pd.DataFrame: The DataFrame with an additional or updated 'text_data' column.
    """
    # Check if 'text_data' column exists
    if 'text_data' in df.columns:
        overwrite = input("'text_data' column already exists. Do you want to overwrite it? (y/n): ").strip().lower()
        if overwrite != 'y':
            print("Operation aborted. No changes were made.")
            return df  # Return the DataFrame unchanged

    else:
        # Initialize 'text_data' with string dtype if not present
        df['text_data'] = pd.Series(dtype=pd.StringDtype())

    # Iterate over rows with tqdm progress bar
    for index, row in tqdm(df.iterrows(), total=len(df), desc="Fetching text data"):
        text_id = row['text_id']  # Assuming 'text_id' column has the book IDs
        text = book_text(text_id)  # Fetch the text data using book_text() function
        
        # Assign the fetched text directly to the 'text_data' column
        df.loc[index, 'text_data'] = text

    return df

## 4 All you really need to know!

Since most of the scraping has been set up in this file, you can simply import your `.pickle` file and then run the function `fetch_text_data()` with the name of the dataframe inside the parentheses.

In [163]:
df_virginia = pd.read_pickle('virginia_history.pickle')

In [164]:
df_virginia.head(50)


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]"
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
4721,4762,Text,2003-12-01,Civil Government of Virginia,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909
11066,11137,Text,2004-02-01,"Twenty-Two Years a Slave, and Forty Years a Fr...",en,"Steward, Austin, 1794-1860; Fugitive slaves --...",E300,Slavery; African American Writers; Browsing: B...,,Steward,Austin,1794,1860
11660,11731,Text,2004-03-01,Virginia: the Old Dominion,en,Historic buildings -- Virginia; Houseboats; Ja...,F206,Browsing: History - American,"Hutchins, Cortelle",Hutchins,Frank W.; Hutchins,Cortelle,
11787,11858,Text,2004-03-01,George Washington: Farmer,en,"Washington, George, 1732-1799 -- Homes and hau...",E300,Browsing: History - American; Browsing: Travel...,,Haworth,Paul Leland,1876,1936
12448,12519,Text,2004-06-01,"The Virginia Housewife; Or, Methodical Cook",en,"Cooking, American; Cooking -- Virginia",TX,Cookbooks and Cooking; Browsing: Cooking & Dri...,,Randolph,Mary,1762,1828
21996,22067,Text,2007-07-13,The Story of a Cannoneer Under Stonewall Jackson,en,"United States -- History -- Civil War, 1861-18...",E456,US Civil War; Browsing: History - American,,Moore,Edward Alexander,1842,


In [165]:
fetch_text_data(df_virginia)

ERROR:root:Error fetching fallback URL https://www.gutenberg.org/ebooks/26305.txt.utf-8: 404 Client Error: Not Found for url: https://www.gutenberg.org/cache/epub/26305/pg26305.txt
Fetching text data: 100%|██████████| 97/97 [00:52<00:00,  1.85it/s]


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,text_data
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900,
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]","Produced by Dianne Bean, Justin Philips, The J..."
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900,Produced by David Widger ON HORSEBACK By...
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621,A BRIEFE AND TRUE REPORT OF THE NEW FOUND LAND...
4721,4762,Text,2003-12-01,Civil Government of Virginia,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909,"Robert Rowe, Charles Franks and the Online Dis..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,1816?,"1897; Stearns, Charles (Abolitionist) [Contrib...",[Illustration]  ...
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,1771?,"1839?; Hurnard, Robert [Author of introduction...",INCIDENTS IN THE LIFE OF SOLOMON BAYLEY *** A...
65081,65160,Text,2021-04-25,The Discoveries of John Lederer,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,1644,"; Talbot, William, Sir, -1691 [Translator]","LEDERER *** Licensed, Nov. 1. 1671. ROG..."
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,"Jr. [Editor]; Pitkin, Thomas M., 1901",1988 [Editor],REVOLUTION *** Transcriber’s Notes: Text en...


We can see in the result that a new column, `text_data`, has been created and that the full text of each downloaded book has been written into its cell.

In [166]:
df_virginia


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,text_data
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900,
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]","Produced by Dianne Bean, Justin Philips, The J..."
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900,Produced by David Widger ON HORSEBACK By...
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621,A BRIEFE AND TRUE REPORT OF THE NEW FOUND LAND...
4721,4762,Text,2003-12-01,Civil Government of Virginia,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909,"Robert Rowe, Charles Franks and the Online Dis..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,1816?,"1897; Stearns, Charles (Abolitionist) [Contrib...",[Illustration]  ...
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,1771?,"1839?; Hurnard, Robert [Author of introduction...",INCIDENTS IN THE LIFE OF SOLOMON BAYLEY *** A...
65081,65160,Text,2021-04-25,The Discoveries of John Lederer,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,1644,"; Talbot, William, Sir, -1691 [Translator]","LEDERER *** Licensed, Nov. 1. 1671. ROG..."
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,"Jr. [Editor]; Pitkin, Thomas M., 1901",1988 [Editor],REVOLUTION *** Transcriber’s Notes: Text en...


Since it is hard to read the cells in the dataframe, we can check an individual cell instead. Since our column titles are lower case and do not have special characters or spaces, we can use dot notation to access the column: `df_virginia.text_data`. We can then get the row of our choosing by using `.iloc[row_index]`. Finally, we can slice the string down to its first 1000 characters by using string slicing `[:1000]`.

In [167]:
df_virginia.text_data.iloc[3][:1000]

'A BRIEFE AND TRUE REPORT OF THE NEW FOUND LAND OF VIRGINIA\n\n1590\n\nby Thomas Hariot\n\nThe 1590 edition of de Brys in the Library of Congress\n\n\nA briefe and true report\nof the new found land of Virginia,\n_of the commodities and of the nature and man\nners of the naturall inhabitants: Discouered by\nthe English Colony there seated by_ Sir Richard\nGreinuile Knight _In the yeere 1585. Which remained\nvnder the gouernment of twelue monethes,\nAt the speciall charge and direction of the Honourable_\nSIR WALTER RALEIGH _Knight, lord Warden\nof the stanneries Who therein hath beene fauoured\nand authorised by her_ MAIESTIE\n_and her letters patents:\nThis fore booke Is made in English\nBy Thomas Hariot; seruant to the abouenamed\nSir_ WALTER, _a member of the Colony, and there\nimployed in discouering._\n\nCVM GRATIA ET PRIVILEGIO CÆS. MATIS SPECIALD\n\nFRANCOFORTI AD MOENVM\nTYPIS IOANNIS WECHELI, SVMTIBVS VERO THEODORI\nDE BRY ANNO CD D XC.\nVENALES REPERIVNTVR IN OFFICINA SIGISMV

### 4.1 Overwrite protection

The function has overwrite protection built in. If the `text_data` column already exists, it asks for confirmation before overwriting; any answer other than `y` leaves the dataframe unchanged.

In [168]:
fetch_text_data(df_virginia)

'text_data' column already exists. Do you want to overwrite it? (y/n):  n


Operation aborted. No changes were made.


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death,text_data
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900,
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]","Produced by Dianne Bean, Justin Philips, The J..."
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900,Produced by David Widger ON HORSEBACK By...
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621,A BRIEFE AND TRUE REPORT OF THE NEW FOUND LAND...
4721,4762,Text,2003-12-01,Civil Government of Virginia,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909,"Robert Rowe, Charles Franks and the Online Dis..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,1816?,"1897; Stearns, Charles (Abolitionist) [Contrib...",[Illustration]  ...
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,1771?,"1839?; Hurnard, Robert [Author of introduction...",INCIDENTS IN THE LIFE OF SOLOMON BAYLEY *** A...
65081,65160,Text,2021-04-25,The Discoveries of John Lederer,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,1644,"; Talbot, William, Sir, -1691 [Translator]","LEDERER *** Licensed, Nov. 1. 1671. ROG..."
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,"Jr. [Editor]; Pitkin, Thomas M., 1901",1988 [Editor],REVOLUTION *** Transcriber’s Notes: Text en...


### 4.2 Save result

Save the result as a pickle file for later use.

In [170]:
df_virginia.to_pickle('df_virginia_text.pickle')