# Preparing the Texts
### Downloading and Cleaning F. Scott Fitzgerald’s Novels for NLP Analysis

This notebook downloads three of F. Scott Fitzgerald’s novels from Project Gutenberg and cleans them. It removes headers, footers, and appendices, normalizes formatting and punctuation, and prepares the texts for NLP tasks such as sentiment analysis and topic modeling.

The cleaned output is saved as sentence-per-line `.txt` files, with one file for each book.

## Import Libraries

In [None]:
%run ../notebooks/setup_path.py
from config import *

# Utilities
import re
import requests
import time

# Text processing
import nltk
from nltk.tokenize import sent_tokenize

# Download NLTK resources
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to C:\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Data Collection

This project explores themes and emotions in F. Scott Fitzgerald’s works using natural language processing techniques. The texts were obtained from Project Gutenberg, which provides free access to three of his novels: *This Side of Paradise*, *The Beautiful and Damned*, and *The Great Gatsby*.

*Tender Is the Night* is not yet in the public domain and is therefore excluded from this analysis. Likewise, Fitzgerald’s unfinished final novel, *The Last Tycoon*, published posthumously, is not included.
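
The cells in this notebook rely on names imported from `config` (`BOOK_URLS`, `BOOKS`, `RAW_DIR`, `PROCESSED_DIR`), which is not shown here. The following is only a sketch of what such a module might contain. The `data/raw` folder, the ebook IDs, and the URL pattern are all assumptions and should be checked against Project Gutenberg before use; only the title keys and the `data/processed` folder appear elsewhere in this notebook.

```python
from pathlib import Path

# Hypothetical layout: the real config.py is not shown in this notebook.
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw"              # assumed location for raw downloads
PROCESSED_DIR = DATA_DIR / "processed"  # matches the directory named in this notebook

# Assumed Project Gutenberg ebook IDs; verify each one on gutenberg.org.
GUTENBERG_IDS = {
    "this-side-of-paradise": 805,
    "the-beautiful-and-damned": 9830,
    "the-great-gatsby": 64317,
}

# Assumed plain-text URL pattern; adjust if the site serves the files elsewhere.
BOOK_URLS = {
    title: f"https://www.gutenberg.org/ebooks/{book_id}.txt.utf-8"
    for title, book_id in GUTENBERG_IDS.items()
}

# Title keys in download order, reused by the cleaning loop.
BOOKS = list(BOOK_URLS)
```

Keeping the title keys in one dictionary means the download loop and the cleaning loop iterate over the same set of books.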

In [2]:
# Download and save each book
for title, url in BOOK_URLS.items():
    print(f"Downloading {title}...")
    response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
    if response.status_code == 200:
        filepath = RAW_DIR / f"{title}.txt"
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(response.text)
        print("File successfully saved!\n")
    else:
        print(f"Failed to download {title} from {url}\n")
    time.sleep(2)  # pause between requests to be polite to the server

print("All downloads completed!")

Downloading this-side-of-paradise...
File successfully saved!

Downloading the-beautiful-and-damned...
File successfully saved!

Downloading the-great-gatsby...
File successfully saved!

All downloads completed!
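
A 200 status code only says that the server answered. It does not confirm that the body is the expected book. As an optional sanity check, the sketch below looks for the Project Gutenberg start and end markers in a downloaded text. The helper name `looks_like_gutenberg_text` is hypothetical, and the start-marker pattern is assumed by analogy with the `*** END OF ... ***` pattern used later in the cleaning step.

```python
import re

# The end pattern mirrors the one used in the cleaning step;
# the start pattern is an assumption made by analogy.
START_MARKER = re.compile(r"\*\*\* ?START OF.*?\*\*\*", re.IGNORECASE)
END_MARKER = re.compile(r"\*\*\* ?END OF.*?\*\*\*", re.IGNORECASE)


def looks_like_gutenberg_text(text):
    """Return True if a start marker appears before an end marker."""
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    return bool(start and end and start.end() <= end.start())


sample = (
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "In my younger and more vulnerable years...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
)

print(looks_like_gutenberg_text(sample))         # True
print(looks_like_gutenberg_text("<html>error"))  # False
```

A check like this could be run on `response.text` before writing the file, so that an error page is not saved as if it were a novel.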


## Data Cleaning

The raw texts from Project Gutenberg contain metadata, headers, footers, and other content that introduce noise into the analysis. To prepare the data for processing, a set of helper functions is used to clean and structure the text:

- `detect_narrative_start`: Finds where the actual story begins by matching characteristic first lines, allowing the script to skip over prefaces and metadata.
- `remove_appendix`: Removes the appendix section from *This Side of Paradise* based on a known heading.
- `normalize_chapter_headings`: Standardizes chapter formatting across all books by converting chapter numbers to Roman numerals where needed.
- `move_chapter_titles_to_new_line`: Ensures chapter headings appear on their own lines, even when embedded mid-paragraph.
- `clean_gutenberg_text_for_sentiment`: Cleans and normalizes punctuation, removes Gutenberg footer and appendix, and strips unwanted characters.
- `get_sentences`: Tokenizes the cleaned text into individual sentences using NLTK, preparing the data for sentiment and thematic analysis.

Using these modular functions helps keep the cleaning pipeline organized and easier to maintain.

In [3]:
# Helper: find the actual start of narrative
def detect_narrative_start(text):
    """
    Detects the start of the main narrative in the text by matching known 
    opening lines from each book. This helps remove introductory material 
    like title pages or prefaces.

    Parameters:
    ----------
        text (str): The full raw text of the book.

    Returns:
    ----------
        str: The text starting from the detected beginning of the narrative.
    """
    patterns = [
        r"^\s*Amory Blaine inherited",                             # First line of This Side of Paradise
        r"^\s*In 1913, when Anthony Patch was twenty[- ]?five",    # First line of The Beautiful and Damned
        r"^\s*In my younger and more vulnerable years"             # First line of The Great Gatsby        
    ]

    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        if match:
            return text[match.start():]
        
    return text
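
To see what the matching does, here is a small sketch on a toy string. It repeats one of the patterns so the snippet runs on its own, and the toy front matter is invented for illustration.

```python
import re

# One of the opening-line patterns, repeated here so the snippet is standalone.
pattern = r"^\s*In my younger and more vulnerable years"

# Invented front matter followed by an indented first line.
toy = (
    "The Great Gatsby\n"
    "by F. Scott Fitzgerald\n"
    "\n"
    "  In my younger and more vulnerable years my father gave me some advice.\n"
)

match = re.search(pattern, toy, re.IGNORECASE | re.MULTILINE)
trimmed = toy[match.start():] if match else toy

# Because \s* also consumes newlines, the match can begin at a preceding
# blank line, so the slice may start with leading whitespace.
print(repr(trimmed[:30]))
print(trimmed.lstrip()[:30])
```

The leading whitespace is harmless in this pipeline because whitespace is collapsed and stripped later. Note also that the function returns the text unchanged when no pattern matches, so a changed opening line in a new edition would pass through silently.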

In [4]:
# Helper: remove appendix from "This Side of Paradise"
def remove_appendix(text):
    """
    Removes the appendix from 'This Side of Paradise' starting from the 
    known appendix heading to the end of the text. This prevents non-narrative 
    metadata from polluting downstream analysis.

    Parameters:
    ----------
        text (str): The cleaned book text that may include an appendix.

    Returns:
    ----------
        str: The text with the appendix removed.
    """
    pattern = re.compile(r"\n\s*Appendix: Production notes for eBook edition 11.*", re.DOTALL | re.IGNORECASE)
    cleaned_text = re.sub(pattern, "", text)
    return cleaned_text.strip()

In [5]:
def normalize_chapter_headings(text, title):
    """
    Standardizes chapter headings across different books by:
    
    - Adding "CHAPTER I" to the beginning of the text.
    - Converting Arabic chapter numbers (e.g., "CHAPTER 1") to Roman numerals 
      for "This Side of Paradise", assuming chapters 1–5 only.
    - Replacing inline Roman numerals with "CHAPTER <numeral>" in "The Great Gatsby"
      when they appear before a capitalized word (e.g., chapter titles).
    
    Parameters:
    ----------
        text (str): The raw or cleaned book text.
        title (str): The title key of the book used to apply title-specific rules.
    
    Returns:
    ----------
        str: The text with normalized chapter headings.
    """
    text = "CHAPTER I\n" + text

    if title == "this-side-of-paradise":
        # Map Arabic numbers 1-5 to Roman numerals
        number_map = {
            '1': 'I',
            '2': 'II',
            '3': 'III',
            '4': 'IV',
            '5': 'V'
        }

        def repl(match):
            num = match.group(1)
            roman = number_map.get(num, num)  # fallback just in case
            return f"CHAPTER {roman}"

        # Replace "CHAPTER <number>" with "CHAPTER <Roman>"
        text = re.sub(r"CHAPTER (\d+)", repl, text)

    if title == "the-great-gatsby":
        roman_numerals = ["X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"]
        for numeral in roman_numerals:
            pattern = rf"(?<= )({numeral})(?= [A-Z])"
            text = re.sub(pattern, f"CHAPTER {numeral}", text)

    return text
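
The Gatsby branch can be tried on toy strings. The sketch below is a standalone copy of that loop under the hypothetical name `mark_inline_numerals`. It also shows a possible side effect: the numeral `I` looks the same as the pronoun, and since the character filter runs before this step and removes characters such as semicolons, a pronoun may end up directly before a capitalized word. Whether this happens in the actual text has not been verified here.

```python
import re

ROMAN_NUMERALS = ["X", "IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"]


def mark_inline_numerals(text):
    """Standalone copy of the Gatsby branch of normalize_chapter_headings."""
    for numeral in ROMAN_NUMERALS:
        # A numeral preceded by a space and followed by a space plus a capital letter
        pattern = rf"(?<= )({numeral})(?= [A-Z])"
        text = re.sub(pattern, f"CHAPTER {numeral}", text)
    return text


# Intended case: an inline numeral before a capitalized chapter opening.
heading = mark_inline_numerals("the end. II About half way between")

# Possible false positive: the pronoun "I" before a capitalized name (invented example).
pronoun = mark_inline_numerals("It was I Gatsby was looking for")

print(heading)
print(pronoun)
```

Processing the longer numerals first keeps `II` from being rewritten as two separate `I` matches. Searching the cleaned file for `CHAPTER I` followed by unexpected text is one way to spot false positives.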

In [6]:
def move_chapter_titles_to_new_line(text):
    """
    Ensures that chapter titles (e.g., "CHAPTER I", "CHAPTER 1") start on a new line,
    even if they appear in the middle of a paragraph or sentence.

    Matches both Roman numerals and Arabic digits following the word "CHAPTER".

    Parameters:
    ----------
        text (str): The input text containing chapter headings.

    Returns:
    -------
        str: Modified text where all chapter titles begin on their own line.
    """
    pattern = r"(?<!\n)(?<!^)(\bCHAPTER\s+(?:[IVXLCM]+|\d+)\b)"
    return re.sub(pattern, r"\n\1", text, flags = re.IGNORECASE)
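
A quick sketch of the substitution on toy strings (both invented) shows the effect. It uses the same pattern and flag as the function above.

```python
import re

pattern = r"(?<!\n)(?<!^)(\bCHAPTER\s+(?:[IVXLCM]+|\d+)\b)"

# After whitespace normalization the whole book sits on one line.
flat = "CHAPTER I In my younger years. CHAPTER II About half way between"
result = re.sub(pattern, r"\n\1", flat, flags=re.IGNORECASE)
print(result)

# With re.IGNORECASE, a lowercase "chapter" followed by a numeral-like
# word is also moved to a new line.
lowercase = re.sub(pattern, r"\n\1", "the last chapter I read", flags=re.IGNORECASE)
print(repr(lowercase))
```

The heading at the very start of the text is left alone because of the `(?<!^)` lookbehind. The second call illustrates that the case-insensitive flag can also catch ordinary prose; dropping the flag would be one option if all headings are known to be uppercase.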

In [7]:
# Helper: clean and prepare text
def clean_gutenberg_text_for_sentiment(text, title=""):
    """
    Cleans and prepares a Project Gutenberg text for NLP by:
    - Normalizing punctuation
    - Removing Gutenberg footer and appendix (if applicable)
    - Stripping headers and unwanted symbols
    - Detecting and starting from the narrative beginning

    Parameters:
    ----------
        text (str): The raw text of the book.
        title (str): The title of the book (used to apply title-specific rules).

    Returns:
    ----------
        str: Cleaned and normalized text ready for tokenization.
    """
    # Normalize punctuation
    text = text.replace("“", '"').replace("”", '"') \
               .replace("’", "'").replace("‘", "'").replace("—", "-")
    
    # Find narrative start
    text = detect_narrative_start(text)

    # Remove footer
    end_pattern = r"\*\*\* END OF.*?\*\*\*"
    parts = re.split(end_pattern, text, flags=re.IGNORECASE)
    if len(parts) > 1:
        text = parts[0]

    # Remove appendix if it's This Side of Paradise
    if "this-side-of-paradise" in title.lower():
        text = remove_appendix(text)

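    # Collapse whitespace and drop every character outside the allowed set.
    # Note: hyphens are dropped too, so hyphenated words are fused
    # (e.g. "halfobliterated" in the printed output).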
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^A-Za-z0-9.,!?\'"\s]', '', text)

    # Normalize chapter headers and fix formatting
    text = normalize_chapter_headings(text, title)
    text = move_chapter_titles_to_new_line(text)

    return text.strip()
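
The printed output further down contains the fused word "halfobliterated". This follows from two steps in the function: em dashes are first turned into hyphens, and the character filter then removes all hyphens. The sketch below reproduces that on an invented phrase and compares it with a hypothetical variant that turns em dashes into spaces and keeps hyphens. The variant is an illustration of one alternative, not the notebook's method, and it would change the tokens seen by later analysis.

```python
import re

sample = "a well-known face\u2014half hidden"  # \u2014 is an em dash


def current_filter(text):
    # Mirrors the relevant steps of clean_gutenberg_text_for_sentiment.
    text = text.replace("\u2014", "-")
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[^A-Za-z0-9.,!?\'"\s]', '', text)


def hyphen_preserving_filter(text):
    # Hypothetical variant: em dash becomes a space, hyphens are allowed.
    text = text.replace("\u2014", " ")
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[^A-Za-z0-9.,!?\'"\s\-]', '', text)


print(current_filter(sample))            # a wellknown facehalf hidden
print(hyphen_preserving_filter(sample))  # a well-known face half hidden
```

Which behavior is preferable depends on the downstream task: fused words may be missing from a sentiment lexicon, while kept hyphens produce compound tokens.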

In [8]:
def get_sentences(text):
    """
    Splits the input text into individual sentences using NLTK's sentence tokenizer.

    Parameters:
    ----------
        text (str): A cleaned string of text.

    Returns:
    ----------
        list: A list of sentence strings.
    """
    return sent_tokenize(text)

Once the helper functions are defined, the preprocessing pipeline runs these steps for each book:

- Load the raw text file.
- Clean the content using the helper functions.
- Tokenize the cleaned text into individual sentences.
- Save the resulting sentences to a new file in the `data/processed` directory.
- Print the first and last five sentences to visually verify the results.

This process ensures that the texts are properly structured and ready for downstream NLP tasks such as sentiment analysis and topic modeling.
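
One detail to keep in mind when reading the saved files back: because chapter headings are moved onto their own lines, a tokenized sentence can contain an embedded newline, as in the first printed sentence where `CHAPTER I` and the opening sentence appear together. The sketch below uses toy data and a temporary file (both assumptions for illustration) to show that the number of lines in the file can then differ from `len(sentences)`.

```python
import tempfile
from pathlib import Path

# Toy sentences: the first one carries a heading and an embedded newline.
sentences = [
    "CHAPTER I\nAmory Blaine inherited from his mother every trait.",
    "His father grew wealthy at thirty.",
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "toy-cleaned.txt"

    # Same writing approach as the pipeline: one newline after each sentence.
    with open(out, "w", encoding="utf-8") as f:
        f.writelines(sentence + "\n" for sentence in sentences)

    saved = out.read_text(encoding="utf-8").splitlines()

print(len(sentences), len(saved))  # 2 3
```

A later notebook that reads the file line by line will therefore see each heading as a separate line, which is convenient for splitting by chapter but means line counts and sentence counts are not interchangeable.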

In [9]:
print("== Cleaned Book Summary ==")
for title in BOOKS: 

    raw_path = RAW_DIR / f"{title}.txt"
    clean_path = PROCESSED_DIR / f"{title}-cleaned.txt"

    # Load raw text
    with open(raw_path, "r", encoding="utf-8") as f:
        raw_text = f.read()

    # Clean and tokenize
    cleaned_text = clean_gutenberg_text_for_sentiment(raw_text, title=title)
    sentences = get_sentences(cleaned_text)

    # Save cleaned sentences
    with open(clean_path, "w", encoding="utf-8") as f:
        f.writelines(sentence + "\n" for sentence in sentences)

    # Output summary
    print(f"\nTitle: '{title.replace('-', ' ').title()}'")
    print(f"Saved: {len(sentences):,} sentences")
    print(f"File path: {clean_path}")

    print("\nFirst 5 sentences:\n")
    for i, sent in enumerate(sentences[:5], 1):
        print(f"  {i}. {sent}")

    print("\nLast 5 sentences:\n")
    for i, sent in enumerate(sentences[-5:], len(sentences) - 4):
        print(f"  {i}. {sent}")

    print("-" * 60)

== Cleaned Book Summary ==

Title: 'This Side Of Paradise'
Saved: 5,411 sentences
File path: C:\Users\Virginia\Python\Fitzgerald-sentiment-topic-analysis\data\processed\this-side-of-paradise-cleaned.txt

First 5 sentences:

  1. CHAPTER I
Amory Blaine inherited from his mother every trait, except the stray inexpressible few, that made him worth while.
  2. His father, an ineffectual, inarticulate man with a taste for Byron and a habit of drowsing over the Encyclopedia Britannica, grew wealthy at thirty through the death of two elder brothers, successful Chicago brokers, and in the first flush of feeling that the world was his, went to Bar Harbor and met Beatrice O'Hara.
  3. In consequence, Stephen Blaine handed down to posterity his height of just under six feet and his tendency to waver at crucial moments, these two abstractions appearing in his son Amory.
  4. For many years he hovered in the background of his family's life, an unassertive figure with a face halfobliterated by lifel