# Discovering Themes
### Uncovering Latent Topics in Fitzgerald’s Novels with NLP

This notebook explores recurring themes in F. Scott Fitzgerald’s novels using topic modeling. By applying unsupervised learning methods like Latent Dirichlet Allocation (LDA) and BERTopic, it looks for underlying patterns and semantic structures in the text. The aim is to highlight dominant themes and track how they change across chapters or different works.

## Import libraries

In [1]:
%run ../notebooks/setup_path.py
from config import *

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from collections import defaultdict
import re

# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# Topic Modeling
from gensim.models.phrases import Phrases, Phraser
import gensim.corpora as corpora
from gensim.models import TfidfModel, LdaModel, LdaMulticore, CoherenceModel

# Utilities
from pprint import pprint
from collections import Counter
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Text Processing

This section prepares the cleaned, sentence-tokenized texts for topic modeling. Because the texts are already split into sentences, each sentence is word-tokenized, tagged for part of speech, and lemmatized. Tagging restricts the vocabulary to content words (nouns, verbs, adjectives, and adverbs), and lemmatization reduces words to a base form, which makes the resulting topics easier to interpret. The lemmatized tokens are then grouped by chapter, using the chapter boundaries identified earlier. Treating each chapter as one document lets the topic model capture how themes shift and develop across each novel.

- **Stopwords**

Standard English stopwords are loaded and expanded with custom stopwords, including frequent character names and locations from the novels. This helps prevent these terms from dominating the topic modeling results.

In [2]:
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [3]:
custom_stopwords = [
    # Character Names
    "amory", "blaine", "isabelle", "borgé", "rosalind", "connage", "eleanor", "savage", "beatrice", "monsignor",
    "darcy", "thayer", "burne", "anthony", "patch", "gloria", "gilbert", "caramel", "dick", "richard",
    "maury", "noble", "joseph", "bloeckman", "muriel", "nick", "carraway", "gatsby", "jay", "james",
    "gatz", "daisy", "tom", "buchanan", "jordan", "baker", "myrtle", "george", "wilson", "owl",
    "eyes", "klipspringer", "wolfsheim", "meyer", "dan", "cody", "henry", "catherine", "pammy", "mckee",
    "michaelis", "alec", "old", "sport", "sally", "marietta", "wolfshiem", "adam", "barbara", "rachael",
    "kane",

    # Places
    "princeton", "minneapolis", "atlantic", "new", "york", "harvard", "broadway", "west", "egg", "east",
    "valley", "ashes", "fifth", "avenue", "long", "island", "midwest", "chicago", "garage", "pennsylvania",

    # Social Titles and Misc. Proper Nouns
    "baedeker", "captain", "charles", "clara", "ella", "eckleburg", "eyebrow", "granny", "janitor", "john",
    "kerry", "mary", "miss", "mister", "mr", "mrs", "robert", "sloane", "thomas", "william",
    "regis", "geraldine", "paramore", "grandfather", "officer", "rose"
]

stop_words = stop_words.union(custom_stopwords)
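
Because process_sentence (defined later in this notebook) lowercases tokens and keeps only alphabetic ones, a stopword entry that contains uppercase letters, spaces, or punctuation could never match. The following is a minimal sketch of a sanity check for a custom list; `check_stopword_entries` and the sample list are hypothetical and not part of the original notebook.

```python
from collections import Counter


def check_stopword_entries(entries):
    """Return (unmatchable, duplicated) entries from a custom stopword list."""
    # Tokens are lowercased and filtered with str.isalpha() before the
    # stopword lookup, so only lowercase alphabetic entries can ever match.
    unmatchable = [w for w in entries if not (w.isalpha() and w == w.lower())]
    # Duplicates are harmless in a set but usually signal a copy-paste slip.
    counts = Counter(entries)
    duplicated = sorted(w for w, n in counts.items() if n > 1)
    return unmatchable, duplicated


# Hypothetical sample list, for illustration only
sample = ["gatsby", "New York", "mr", "mr.", "daisy", "gatsby"]
unmatchable, duplicated = check_stopword_entries(sample)
print("Unmatchable:", unmatchable)
print("Duplicated:", duplicated)
```

Accented entries such as "borgé" pass the check, since `str.isalpha()` accepts non-ASCII letters.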

After an initial model run, more words are added to the stopword list. They were selected by reviewing the most frequent words and the top words of each topic: terms that carry the narration (generic verbs, common nouns, time expressions) without describing a theme are removed so that the model can focus on thematic vocabulary.

In [4]:
additional_stopwords = [
    # Generic Verbs
    "asked", "became", "called", "came", "come", "cried", "decided", "dozen", "feel", "find",
    "found", "get", "give", "go", "going", "got", "gone", "heard", "knew", "let",
    "look", "looked", "made", "make", "met", "moved", "put", "read", "said", "sat",
    "saw", "say", "see", "set", "started", "stood", "suppose", "take", "talk", "tell",
    "think", "told", "took", "tried", "trying", "turned", "used", "walked", "want", "went",
    "seemed", "believe", "remembered", "demanded", "suggested", "hear", "call", "answered", "write", "getting",
    "commenced", "cutting", "chewing", "waving", "humming", "know", "looking", "began", "seen", "passed",

    # Adjectives & Modifiers
    "almost", "alone", "big", "blue", "better", "cold", "dark", "enough", "ever", "full",
    "good", "gray", "great", "high", "hot", "late", "little", "many", "much", "next",
    "quite", "really", "rather", "short", "small", "still", "sure", "white", "young", "whole",
    "certain", "far", "often", "always", "never", "several", "few", "fewer", "even", "suddenly",
    "right", "left", "perhaps", "else", "well", "something", 

    # Common Nouns
    "air", "arm", "back", "bed", "book", "car", "corner", "door", "start", "end",
    "face", "foot", "front", "girl", "hand", "head", "home", "house", "light", "life",
    "name", "part", "people", "place", "room", "side", "sound", "street", "table", "thing",
    "time", "window", "woman", "word", "world", "hair", "apartment", "chair", "glass", "floor",
    "school", "college", "hall", "road", "train", "town", "picture", "way", "class", "man",
    "men",

    # Weak Concepts
    "anything", "clad", "considered", "continued", "done", "everything", "fact", "felt", "followed", "matter",
    "mean", "nothing", "question", "sense", "story", "thought", "wanted", "yet", "course", "chapter",
    "outline", "pass", "voice", "sort",

    # Setting Descriptors
    "morning", "afternoon", "evening", "night", "open", "closed", "day", "week", "month", "year",
    "minute", "minutes", "hour", "hours", "moment", "later", "second", "seconds", "first", "last",

    # Narration Fillers
    "idea", "oh", "yes", "no",

    # Miscellaneous
    "butler", "dot", "someone", "wreath", "pillar", "bench", "pump", "cake", "motor", "incredulously",
    "telegram", "tapped", "violence", "impassioned", "desolate", "soggy", "groaning", "cellar", "invented", "lethargic",
    "tray", "brick", "gloomy", "reappeared", "closing", "visible", "supercilious", "deeper", "pointing", "freshman",
    "basket", "celebrated", "raising", "borrowed", "perceptible", "touching", "rejected", "carelessly", "gust", "enchanted",
    "practice", "nodding", "squeezed"
]

stop_words = stop_words.union(additional_stopwords)
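
When the stopword list grows after a model run, token lists that have already been processed can be filtered again without repeating tokenization and tagging. The sketch below assumes the chapter token structure built later in this notebook (a mapping from book title to a list of chapter dictionaries with a "tokens" key); `refilter_tokens` and the toy data are hypothetical.

```python
def refilter_tokens(book_chapter_tokens, new_stopwords):
    """Return a copy of the chapter tokens with extra stopwords removed."""
    new_stopwords = set(new_stopwords)
    return {
        book: [
            {
                "chapter_title": chapter["chapter_title"],
                "tokens": [t for t in chapter["tokens"] if t not in new_stopwords],
            }
            for chapter in chapters
        ]
        for book, chapters in book_chapter_tokens.items()
    }


# Toy data, for illustration only
toy = {
    "toy-book": [
        {"chapter_title": "chapter 1", "tokens": ["dream", "said", "money", "looked"]},
    ]
}
filtered = refilter_tokens(toy, ["said", "looked"])
print(filtered["toy-book"][0]["tokens"])
```

This removes exact matches only, and the input structure is left unchanged. A word that has already been merged into a bigram token would not be affected.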

- **Helper Functions**

Two functions support the text processing workflow: one splits the full book into chapters, and the other processes individual sentences by tokenizing, lemmatizing, removing stopwords, and filtering by part of speech. The split_into_chapters function was originally created in the Jupyter notebook 02-sentiment-analysis-fitzgerald.ipynb.

In [5]:
def split_into_chapters(text):
    """
    Splits a full text into chapters based on common chapter headings.

    Recognizes patterns like "Chapter 1", "Chapter I" etc.

    Parameters:
    ----------
        text (str): The full raw text of a book or document.

    Returns:
    ----------
        List[Tuple[str, str]]: A list of tuples, each containing:
            - chapter title
            - chapter text
    """
    chapter_pattern = re.compile(r"(?:^|\n)(chapter [ivxlc\d]+|chapter \w+)", re.IGNORECASE)
    splits = chapter_pattern.split(text)
    chapters = []
    for i in range(1, len(splits), 2):
        chapter_title = splits[i].strip()
        chapter_text = splits[i+1] if i+1 < len(splits) else ""
        chapters.append((chapter_title, chapter_text.strip()))
    return chapters


def process_sentence(sentence):
    """
    Processes a sentence by tokenizing, lemmatizing, removing stopwords, and filtering by part of speech.

    Keeps only nouns, verbs, adjectives, and adverbs. Returns a list of cleaned, lowercased lemmas.

    Parameters:
    ----------
        sentence (str): A raw sentence or string of text to be processed.

    Returns:
    ----------
        List[str]: A list of lemmatized, lowercased words (lemmas) that:
            - are alphabetic
            - are not stopwords
            - are tagged as noun, verb, adjective, or adverb
    """
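    # Tokenize, keep alphabetic tokens only, and lowercase them
    tokens = word_tokenize(sentence)
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Stopwords are removed before tagging, so the tagger sees a reduced sentence
    tokens = [t for t in tokens if t not in stop_words]
    tagged = pos_tag(tokens)

    lemmas = []
    for token, tag in tagged:
        # Keep nouns, adjectives, verbs, and adverbs (Penn Treebank tag prefixes)
        if tag.startswith(("NN", "JJ", "VB", "RB")):
            # lemmatize() is called without a pos argument, so WordNet's
            # default (noun) applies to every token
            lemma = lemmatizer.lemmatize(token)
            # A lemma can itself be a stopword (e.g. a plural reduced to its singular)
            if lemma not in stop_words:
                lemmas.append(lemma)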
    return lemmas
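
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, which is why inflected verb forms such as "considers" or "laughed" survive in the outputs further down. One common way to make use of the tags from `pos_tag` is to map Penn Treebank tags to WordNet's single-letter codes. The sketch below is an assumption about how this could be done, not the notebook's method; `penn_to_wordnet` is a hypothetical helper, and adopting it would change the token counts and frequencies reported in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    # 'n', 'v', 'a', 'r' are the values of wordnet.NOUN, wordnet.VERB,
    # wordnet.ADJ and wordnet.ADV in nltk.corpus.wordnet.
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("RB"):
        return "r"
    return "n"  # nouns, which is also the lemmatizer's default


# Hand-written (token, tag) pairs standing in for pos_tag output
tagged = [("considers", "VBZ"), ("gargoyles", "NNS"), ("slowly", "RB"), ("faint", "JJ")]
for token, tag in tagged:
    print(token, tag, penn_to_wordnet(tag))

# Inside process_sentence, the call would then become:
#     lemma = lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
```

The mapping itself needs no NLTK data, so it can be checked in isolation before the lemmatizer call is changed.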


- **Book Processing**

Each book file is loaded, split into chapters, and processed one chapter at a time. The chapter text is split into lines (one sentence per line), each sentence is passed to process_sentence, and the resulting lemmas are collected into a single token list per chapter.

In [6]:
print("== Topic Modeling Processing ==")

book_chapter_tokens = defaultdict(list)

for title in BOOKS:
    clean_path = PROCESSED_DIR / f"{title}-cleaned.txt"
    if not clean_path.exists():
        print(f"File not found: {clean_path}")
        continue

    with open(clean_path, "r", encoding = "utf-8") as f:
        text = f.read()

    chapters = split_into_chapters(text)

    print(f"\nProcessing '{title.replace('-', ' ').title()}': {len(chapters)} chapters found")

    for chapter_title, chapter_text in chapters:
        sentences = chapter_text.split("\n")
        chapter_tokens = []
        for sentence in sentences:
            lemmas = process_sentence(sentence)
            chapter_tokens.extend(lemmas)

        book_chapter_tokens[title].append({
            "chapter_title": chapter_title,
            "tokens": chapter_tokens
        })

    print(f"Processing completed for '{title.replace('-', ' ').title()}'.")

== Topic Modeling Processing ==

Processing 'This Side Of Paradise': 9 chapters found
Processing completed for 'This Side Of Paradise'.

Processing 'The Beautiful And Damned': 9 chapters found
Processing completed for 'The Beautiful And Damned'.

Processing 'The Great Gatsby': 9 chapters found
Processing completed for 'The Great Gatsby'.
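
The effect of the chapter split is easiest to see on a toy text. The sketch below reuses the same regular expression as split_into_chapters; the sample text is invented for illustration.

```python
import re

# Same pattern as in split_into_chapters
chapter_pattern = re.compile(r"(?:^|\n)(chapter [ivxlc\d]+|chapter \w+)", re.IGNORECASE)

# Invented sample: front matter, then two chapters with one sentence per line
sample_text = (
    "Front matter that precedes the first heading.\n"
    "CHAPTER I\n"
    "First sentence.\n"
    "Second sentence.\n"
    "Chapter 2\n"
    "Third sentence."
)

# With one capture group, re.split returns
# [text_before, title_1, body_1, title_2, body_2, ...]
parts = chapter_pattern.split(sample_text)
chapters = [
    (parts[i].strip(), parts[i + 1].strip())
    for i in range(1, len(parts) - 1, 2)
]

for title, body in chapters:
    print(title, "->", body.split("\n"))
```

Anything that precedes the first heading ends up in `parts[0]` and is not assigned to any chapter, so front matter is dropped by this approach.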


In [7]:
print("== Processed Book Summary ==")
print(f"\nTotal processed books: {len(book_chapter_tokens)}")
print("-" * 40)
for book, chapters in book_chapter_tokens.items():
    print(f"\nTitle: '{book.replace('-', ' ').title()}'\n")
    for i, chapter in enumerate(chapters):
        print(f" Chapter {i + 1}: {len(chapter['tokens'])} tokens")
        print(f" First 10 tokens: {chapter['tokens'][:10]}\n")
    print("-" * 40)

== Processed Book Summary ==

Total processed books: 3
----------------------------------------

Title: 'This Side Of Paradise'

 Chapter 1: 3430 tokens
 First 10 tokens: ['inherited', 'mother', 'trait', 'stray', 'inexpressible', 'worth', 'father', 'ineffectual', 'inarticulate', 'taste']

 Chapter 2: 5297 tokens
 First 10 tokens: ['spire', 'gargoyle', 'noticed', 'wealth', 'sunshine', 'creeping', 'green', 'sward', 'dancing', 'leaded']

 Chapter 3: 2603 tokens
 First 10 tokens: ['egotist', 'considers', 'ouch', 'dropped', 'shirt', 'studit', 'hurt', 'melook', 'neck', 'spot']

 Chapter 4: 3721 tokens
 First 10 tokens: ['narcissus', 'duty', 'transition', 'period', 'change', 'broaden', 'live', 'gothic', 'beauty', 'parade']

 Chapter 5: 2210 tokens
 First 10 tokens: ['debutante', 'february', 'large', 'dainty', 'bedroom', 'sixtyeighth', 'pink', 'wall', 'curtain', 'pink']

 Chapter 6: 2192 tokens
 First 10 tokens: ['experiment', 'convalescence', 'knickerbocker', 'bar', 'beamed', 'maxfield', 'par

- **Phrase Detection**

All chapters from the books are combined into a single list of documents, each represented as a list of tokens. Metadata for each chapter is stored separately for reference. The Gensim library is then used to identify common bigrams (two-word phrases) in the tokenized documents. Strict thresholds are applied (min_count = 10, threshold = 20) so that only frequently co-occurring pairs are merged. Detected bigrams replace the original word pairs with a single token joined by an underscore; for instance, a pair such as "ice cream" would become "ice_cream". Because stopwords are removed before this step, pairs made of stopwords (such as "new" and "york") cannot form a bigram.

In [8]:
documents = []
chapter_metadata = []

for book, chapters in book_chapter_tokens.items():
    for chapter in chapters:
        documents.append(chapter["tokens"])
        chapter_metadata.append({
            "book": book,
            "chapter_title": chapter["chapter_title"]
        })

# Apply thresholds to avoid noisy bigrams
bigram = Phrases(documents, min_count = 10, threshold = 20)
bigram_mod = Phraser(bigram)

documents = [bigram_mod[doc] for doc in documents]
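
To see how the two thresholds interact, it helps to look at the score behind them. Gensim documents its default scoring as (count of the pair minus min_count) times the vocabulary size, divided by the product of the two word counts; a pair is merged when this score exceeds the threshold. The sketch below re-implements that formula in plain Python on invented counts. It is an illustration, not Gensim's code, and `bigram_score`, `VOCAB_SIZE`, and the candidate pairs are assumptions made for the example.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style phrase score: (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


MIN_COUNT = 10   # same values as in the Phrases call
THRESHOLD = 20

# Invented counts, for illustration only
VOCAB_SIZE = 8000
candidates = {
    # (word_a, word_b): (count_a, count_b, count_ab)
    ("rare", "pair"): (30, 25, 18),      # words that mostly occur together
    ("common", "pair"): (400, 350, 18),  # same pair count, but frequent words
    ("seldom", "pair"): (12, 11, 9),     # pair count below min_count
}

for (a, b), (count_a, count_b, count_ab) in candidates.items():
    score = bigram_score(count_a, count_b, count_ab, VOCAB_SIZE, MIN_COUNT)
    verdict = "merge" if score > THRESHOLD else "keep separate"
    print(f"{a}_{b}: score={score:.2f} -> {verdict}")
```

The second candidate shows why a high pair count alone is not enough: when both words are frequent on their own, the score stays low and the pair is left unmerged.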

- **Reviewing frequent words**

The most frequent tokens in the corpus are listed here. Those that did not help describe a theme were added to the additional stopword list, so the output reflects the corpus after that filtering.

In [9]:
# Get most frequent words in the corpus
def analyze_frequencies(texts, top_n):
    """
    Analyze and print the top N most common tokens in the corpus.

    Parameters
    ----------
    texts : list of list of str
        Tokenized documents.
    top_n : int
        Number of top tokens to display.

    Returns
    -------
    freq_dist : collections.Counter
        Frequency distribution of all tokens in the corpus.
    """
    # Flatten tokens from all documents
    all_tokens = [token.lower() for doc in texts for token in doc]

    # Count frequencies
    freq_dist = Counter(all_tokens)

    print(f"Top {top_n} most common tokens:\n")
    for token, count in freq_dist.most_common(top_n):
        print(f"{token}: {count}")

    return freq_dist

freq_dist = analyze_frequencies(documents, top_n = 100)

Top 100 most common tokens:

love: 191
mind: 190
half: 190
away: 183
boy: 126
work: 126
money: 125
gave: 123
god: 118
friend: 112
mother: 108
party: 108
heart: 106
summer: 100
kiss: 97
beauty: 96
child: 94
together: 90
city: 88
best: 87
laughed: 85
war: 84
hundred: 84
beautiful: 83
business: 83
dream: 82
letter: 82
wife: 81
lay: 81
coming: 81
brought: 80
club: 80
conversation: 79
lot: 78
drink: 78
care: 77
play: 77
fell: 77
married: 76
dollar: 76
step: 73
lip: 72
pretty: 72
laughter: 71
poor: 71
yellow: 70
sitting: 70
crowd: 70
broke: 70
line: 69
silence: 69
reached: 69
spring: 68
live: 68
smile: 68
sent: 68
dinner: 68
sit: 68
black: 68
soul: 67
lost: 67
mouth: 66
dance: 66
body: 65
slowly: 65
present: 65
shoulder: 65
deep: 65
dress: 64
hard: 64
finally: 64
tree: 64
making: 63
keep: 62
taken: 61
coat: 61
faint: 61
shook: 61
stand: 61
sometimes: 61
moon: 60
talked: 60
talking: 60
desire: 60
tired: 60
known: 59
effort: 59
wait: 59
quickly: 59
broken: 59
rest: 59
rain: 59
ten: 59
try: 58
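
Raw counts favor long chapters, so a complementary view is document frequency: the number of chapters in which a token appears. Tokens present in nearly every chapter tend to contribute little to separating topics and are natural candidates for review. This is a minimal sketch, assuming documents in the same list-of-token-lists form used in this notebook; `document_frequency`, `MIN_SHARE`, and the toy data are hypothetical.

```python
from collections import Counter


def document_frequency(texts):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in texts:
        df.update(set(doc))  # set(): count each token once per document
    return df


# Toy documents, for illustration only
toy_documents = [
    ["love", "money", "party", "love"],
    ["love", "war", "money"],
    ["love", "dream"],
]
df = document_frequency(toy_documents)
n_docs = len(toy_documents)

# Tokens found in at least 60% of the documents
MIN_SHARE = 0.6
widespread = sorted(t for t, n in df.items() if n / n_docs >= MIN_SHARE)
print(widespread)
```

On the real corpus the same function could be called on the chapter documents, with the share threshold chosen by inspection rather than fixed in advance.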
