Date: 24th October 2025

---

Author: Jedidah Wavinya


---

Project:
* In this project, I scrape textual data from novels that are freely available on the web and plot statistics such as the word frequency distribution, which shows which words an author uses most often.
* For this project, I use Project Gutenberg, a website that offers free ebooks of many novels.

1. Import the necessary libraries

In [13]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import os
import re

2. URL reference for the website

In [24]:
# Gutenberg book URL
url = "https://www.gutenberg.org/files/1342/1342-0.txt"

# Create a folder to store files
os.makedirs("gutenberg_books", exist_ok=True)


3. Download the Text

In [27]:
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page
text = response.text

# Save raw text
with open("gutenberg_books/pride_and_prejudice_raw.txt", "w", encoding="utf-8") as f:
    f.write(text)

print("✅ Downloaded raw text successfully!")

✅ Downloaded raw text successfully!
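
Re-running the download cell fetches the file from Project Gutenberg again every time. The following is a minimal sketch of a cached download, not part of the original notebook: the helper name `fetch_book` and the 30-second timeout are assumptions chosen for illustration. It reads the local copy when one exists and otherwise requests the URL, raising on HTTP errors.

```python
import os
import requests


def fetch_book(url, path, timeout=30):
    """Return the text at `url`, using `path` as a local cache (hypothetical helper)."""
    # Reuse the saved copy so repeated runs do not hit the server again.
    # newline="" keeps the original line endings untouched.
    if os.path.exists(path):
        with open(path, encoding="utf-8", newline="") as f:
            return f.read()

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    text = response.text

    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text
```

It could be called as `fetch_book(url, "gutenberg_books/pride_and_prejudice_raw.txt")`, assuming the `gutenberg_books` folder from step 2 already exists.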


4. Clean the Text

In [16]:
def clean_gutenberg_text(raw_text):
    # Use regex to find the start and end markers
    start_match = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*\*\*\*", raw_text)
    end_match = re.search(r"\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*\*\*\*", raw_text)

    if start_match and end_match:
        main_text = raw_text[start_match.end():end_match.start()]
    else:
        main_text = raw_text  # fallback if markers not found

    # Clean up whitespace and newlines
    main_text = re.sub(r'\r\n', '\n', main_text)
    main_text = re.sub(r'\n{3,}', '\n\n', main_text)  # normalize excessive blank lines
    main_text = main_text.strip()

    return main_text

clean_text = clean_gutenberg_text(text)

# Save cleaned text
with open("gutenberg_books/pride_and_prejudice_clean.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)

print("✅ Cleaned and saved main body of the novel!")


✅ Cleaned and saved main body of the novel!


5. Preview a Sample

In [17]:
print(clean_text[:1000])  # preview the first 1000 characters


*** START OF THE PROJECT GUTENBERG EBOOK 1342 ***

                            [Illustration:

                             GEORGE ALLEN
                               PUBLISHER

                        156 CHARING CROSS ROAD
                                LONDON

                             RUSKIN HOUSE
                                   ]

                            [Illustration:

               _Reading Jane’s Letters._      _Chap 34._
                                   ]

                                PRIDE.
                                  and
                               PREJUDICE

                                  by
                             Jane Austen,

                           with a Preface by
                           George Saintsbury
                                  and
                           Illustrations by
                             Hugh Thomson

                         [Illustration: 1894]

                       Ruskin       156. Charing
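
The preview still contains the `*** START OF THE PROJECT GUTENBERG EBOOK 1342 ***` marker line, while the patterns in step 4 look for `START OF THIS PROJECT GUTENBERG EBOOK`. One possible cause is this difference in wording. The sketch below is a hypothetical variant (the name `strip_gutenberg_markers` is not from the notebook) that accepts either wording and handles a missing start or end marker independently. It assumes the marker lines follow one of these two forms.

```python
import re

# Assumed marker wording: "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***"
START_RE = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\*\*\*",
    re.IGNORECASE,
)
END_RE = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\*\*\*",
    re.IGNORECASE,
)


def strip_gutenberg_markers(raw_text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)

    # Fall back to the beginning or end of the file when a marker is missing
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    if begin > finish:  # markers in an unexpected order: keep everything
        begin, finish = 0, len(raw_text)

    return raw_text[begin:finish].strip()
```

The whitespace normalisation from `clean_gutenberg_text` could then be applied to the returned string.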
     

6. Download Multiple Books

In [18]:
book_ids = {
    "Pride_and_Prejudice": 1342,
    "Frankenstein": 84,
    "Dracula": 345,
}

for title, gid in book_ids.items():
    url = f"https://www.gutenberg.org/files/{gid}/{gid}-0.txt"
    response = requests.get(url, headers={"User-Agent": "WavinyaScraper/1.0"}, timeout=30)
    response.raise_for_status()  # raise on HTTP errors rather than saving an error page
    text = response.text

    cleaned = clean_gutenberg_text(text)

    with open(f"gutenberg_books/{title}.txt", "w", encoding="utf-8") as f:
        f.write(cleaned)
    print(f"✅ Saved: {title}")


✅ Saved: Pride_and_Prejudice
✅ Saved: Frankenstein
✅ Saved: Dracula


7. Verify Saved Files

In [19]:
import os
print("Files saved in 'gutenberg_books':")
print(os.listdir("gutenberg_books"))


Files saved in 'gutenberg_books':
['Dracula.txt', 'pride_and_prejudice_raw.txt', 'pride_and_prejudice_clean.txt', 'Frankenstein.txt', 'Pride_and_Prejudice.txt']
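
Listing file names confirms that the files exist but not that they contain anything. A small sketch, assuming a hypothetical helper `summarize_text_files`, that also reports each file's size so an empty download stands out:

```python
from pathlib import Path


def summarize_text_files(folder):
    """Return (name, size_in_bytes) pairs for the .txt files in `folder`, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(folder).glob("*.txt"))


for name, size in summarize_text_files("gutenberg_books"):
    flag = "" if size > 0 else "  <-- empty file"
    print(f"{name}: {size:,} bytes{flag}")
```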


8. Prepare Multiple Books and Collect Data

In [20]:
book_info = [
    {"title": "Pride and Prejudice", "author": "Jane Austen", "id": 1342},
    {"title": "Frankenstein", "author": "Mary Shelley", "id": 84},
    {"title": "Dracula", "author": "Bram Stoker", "id": 345},
]

dataset = []

for book in book_info:
    gid = book["id"]
    title = book["title"]
    author = book["author"]
    url = f"https://www.gutenberg.org/files/{gid}/{gid}-0.txt"

    print(f"📘 Fetching: {title} by {author}")

    try:
        response = requests.get(url, headers={"User-Agent": "WavinyaScraper/1.0"}, timeout=30)
        response.raise_for_status()  # HTTP errors are reported by the except block below
        text = response.text
        cleaned = clean_gutenberg_text(text)

        # Save cleaned text locally
        file_path = f"gutenberg_books/{title.replace(' ', '_')}.txt"
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(cleaned)

        # Append metadata + text
        dataset.append({
            "gutenberg_id": gid,
            "title": title,
            "author": author,
            "download_url": url,
            "text_length": len(cleaned),
            "text": cleaned
        })

    except Exception as e:
        print(f"❌ Error fetching {title}: {e}")


📘 Fetching: Pride and Prejudice by Jane Austen
📘 Fetching: Frankenstein by Mary Shelley
📘 Fetching: Dracula by Bram Stoker


9. Convert to DataFrame and Inspect

In [21]:
df = pd.DataFrame(dataset)
df.head()


   gutenberg_id  title                author        download_url                                     text_length  text
0  1342          Pride and Prejudice  Jane Austen   https://www.gutenberg.org/files/1342/1342-0.txt  728392       *** START OF THE PROJECT GUTENBERG EBOOK 1342 ...
1  84            Frankenstein         Mary Shelley  https://www.gutenberg.org/files/84/84-0.txt      419290       *** START OF THE PROJECT GUTENBERG EBOOK 84 **...
2  345           Dracula              Bram Stoker   https://www.gutenberg.org/files/345/345-0.txt    845805       *** START OF THE PROJECT GUTENBERG EBOOK 345 *...


10. Save Dataset to CSV

In [22]:
csv_path = "gutenberg_books/gutenberg_novels_dataset.csv"
df.to_csv(csv_path, index=False, encoding="utf-8")

print(f"✅ Dataset saved successfully to {csv_path}")



✅ Dataset saved successfully to gutenberg_books/gutenberg_novels_dataset.csv


11. Preview Saved CSV

In [23]:
pd.read_csv(csv_path).head()


   gutenberg_id  title                author        download_url                                     text_length  text
0  1342          Pride and Prejudice  Jane Austen   https://www.gutenberg.org/files/1342/1342-0.txt  728392       *** START OF THE PROJECT GUTENBERG EBOOK 1342 ...
1  84            Frankenstein         Mary Shelley  https://www.gutenberg.org/files/84/84-0.txt      419290       *** START OF THE PROJECT GUTENBERG EBOOK 84 **...
2  345           Dracula              Bram Stoker   https://www.gutenberg.org/files/345/345-0.txt    845805       *** START OF THE PROJECT GUTENBERG EBOOK 345 *...
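
The project description names the word frequency distribution as the statistic of interest. A minimal sketch of that step follows, using only the standard library. The function name `word_frequencies` and the tokenisation rule (lowercased runs of letters, allowing an internal apostrophe) are assumptions for illustration, not the notebook's own method.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10):
    """Return the `top_n` most common lowercase words in `text` as (word, count) pairs."""
    # Assumed tokenisation: runs of letters, optionally joined by a straight or curly apostrophe
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(words).most_common(top_n)


sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
print(word_frequencies(sample, top_n=3))
```

With the dataset built above, the same function could be applied to one book at a time, for example `word_frequencies(df.loc[0, "text"])`, and the resulting pairs passed to a bar chart to plot the distribution.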
