Cell 1: Loading the Text Files
In this first part, I load in all of the .txt files from my Family folder. I use Python’s pathlib to find all text files in that folder and read their contents. I store the text from each file in a dictionary, using the file name as the key. This lets me keep track of which file each text came from, which could be helpful later if I want to go back and look at how certain topics show up in specific documents.

In [1]:
from pathlib import Path

# Set folder to "Family"
family_folder = Path("Family")

# Get all .txt files in the "Family" folder
family_files = list(family_folder.glob("*.txt"))

# Load contents of all text files into a dictionary
family_texts = {}
for file_path in family_files:
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        family_texts[file_path.name] = f.read()

# Summary output
print(f"Loaded {len(family_texts)} files from 'Family' folder:")
for name in family_texts:
    print("-", name)


Loaded 39 files from 'Family' folder:
- aladdin.txt
- princessbridethe.txt
- findingnemo.txt
- kungfupanda.txt
- e.t..txt
- shrek.txt
- walle.txt
- itsawonderfullife.txt
- labyrinth.txt
- speedracer.txt
- marypoppins.txt
- riseoftheguardians.txt
- big.txt
- happyfeet.txt
- mulan.txt
- pacifierthe.txt
- mightymorphinpowerrangersthemovie.txt
- fantasticmrfox.txt
- wizardofozthe.txt
- up.txt
- fieldofdreams.txt
- transformersthemovie.txt
- flintstonesthe.txt
- megamind.txt
- newsies.txt
- anastasia.txt
- littlemermaidthe.txt
- coraline.txt
- threemenandababy.txt
- ducksoup.txt
- marleyme.txt
- rescuersdownunderthe.txt
- newyorkminute.txt
- chroniclesofnarniathelionthewitchandthewardrobe.txt
- addamsfamilythe.txt
- mygirl.txt
- sandlotkidsthe.txt
- pokemonmewtworeturns.txt
- toystory.txt


Cell 2: Tokenizing and Removing Proper Nouns
Here I start cleaning the texts. First, I tokenize each text (break it into individual words and punctuation marks) using NLTK. Then I tag each token with its part of speech so I can filter out proper nouns (like names and places), which carry the tags NNP and NNPS. These tend to be very specific and can distract from the broader themes I'm trying to find. After removing them, I join the remaining tokens back together so they’re ready for analysis. This reduces how much the topic model is skewed by character names or locations that show up a lot, although the tagger is not perfect and some names can still slip through.

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download NLTK models (only once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Step 1: Tokenize each document
family_tokenized = {
    filename: word_tokenize(text)
    for filename, text in family_texts.items()
}

# Step 2: Remove proper nouns (NNP, NNPS)
family_cleaned = {
    filename: [
        word for word, tag in pos_tag(tokens)
        if tag not in ('NNP', 'NNPS')
    ]
    for filename, tokens in family_tokenized.items()
}

# Step 3: Rejoin cleaned tokens into strings
family_joined_cleaned = {
    filename: ' '.join(words)
    for filename, words in family_cleaned.items()
}

print("Text preprocessing complete. Ready for vectorization.")


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/beauxcreel/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Text preprocessing complete. Ready for vectorization.
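
As a small illustration of the proper-noun filter in Step 2, the sketch below applies the same tag check to a hand-written list of (word, tag) pairs. The sentence and its tags are made up for this example and stand in for real pos_tag output, so nothing needs to be downloaded; `drop_proper_nouns` is a hypothetical helper name, not part of NLTK.

```python
# Hand-written (word, tag) pairs standing in for pos_tag output.
# The sentence and tags are invented for illustration only.
tagged_example = [
    ("Alice", "NNP"),
    ("finds", "VBZ"),
    ("a", "DT"),
    ("toy", "NN"),
    ("in", "IN"),
    ("Texas", "NNP"),
    (".", "."),
]


def drop_proper_nouns(tagged_tokens):
    """Keep every word whose tag is not a proper-noun tag (NNP or NNPS)."""
    return [word for word, tag in tagged_tokens if tag not in ("NNP", "NNPS")]


kept_words = drop_proper_nouns(tagged_example)
print(" ".join(kept_words))
```

Punctuation tokens survive this step, which is fine here because the vectorizer in the next cell ignores them.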


Cell 3: TF-IDF Vectorization
This cell uses TF-IDF to turn my cleaned texts into numbers the computer can work with. TF-IDF scores how important each word is in a document by weighing how often it appears there against how many documents it shows up in overall. I also set two filters:

min_df=5: this skips words that show up in fewer than 5 texts (too rare).

max_df=0.9: this skips words that show up in more than 90% of the texts (too common).

These settings help me focus on the words that are most useful for finding topics. The vectorizer also drops common English stop words (stop_words='english').
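
To make the two filters concrete, here is a minimal pure-Python sketch of document-frequency filtering on a made-up four-document corpus. It is not scikit-learn's implementation: `filter_vocabulary` is a hypothetical helper, the tokenizer is a plain whitespace split, and the thresholds are scaled down (min_df=2, max_df=0.75) so the effect is visible on such a small corpus.

```python
from collections import Counter


def filter_vocabulary(docs, min_df, max_df):
    """Return words whose document frequency lies within the given bounds.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * n_docs
    return sorted(
        word for word, df in doc_freq.items() if min_df <= df <= max_count
    )


# Invented toy corpus, for illustration only.
toy_docs = [
    "the dragon guards the castle",
    "the fish swims in the sea",
    "the dragon fights the knight",
    "the fish hides in the reef",
]

kept_words = filter_vocabulary(toy_docs, min_df=2, max_df=0.75)
print(kept_words)
```

Here "the" is dropped because it appears in all four documents, and words such as "castle" are dropped because they appear in only one.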

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set TF-IDF parameters (can experiment with min_df and max_df later)
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9,
    min_df=5,
    stop_words='english'
)

# Apply TF-IDF vectorization
family_dtm = tfidf_vectorizer.fit_transform(family_joined_cleaned.values())

print("TF-IDF matrix shape:", family_dtm.shape)


TF-IDF matrix shape: (39, 4810)
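
For intuition about the weights inside that matrix, the sketch below computes TF-IDF by hand on a made-up three-document corpus. It assumes scikit-learn's default settings as documented (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row), but it uses a plain whitespace tokenizer and no stop-word list, so it is a simplification rather than a reproduction of TfidfVectorizer. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {word: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Scale the row so its squared weights sum to 1.
        norm = math.sqrt(sum(value * value for value in weights.values()))
        rows.append({word: value / norm for word, value in weights.items()})
    return rows


# Invented toy corpus, for illustration only.
toy_docs = ["dragon castle dragon", "fish sea", "dragon sea"]
toy_rows = tfidf_rows(toy_docs)

for word, weight in sorted(toy_rows[1].items()):
    print(f"{word}: {weight:.3f}")
```

In the second toy document, "fish" gets a higher weight than "sea" because "sea" also appears in another document, which is the behavior the paragraph above describes.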


Cell 4: Topic Modeling with NMF
Now I build the topic model using NMF (Non-negative Matrix Factorization). I start by asking it to find 10 topics. The model factors the TF-IDF matrix into two smaller matrices, one linking documents to topics and one linking topics to words, which in practice groups words that tend to appear together across documents. Then, for each topic, I look at the 10 highest-weighted words to get a sense of what that topic is about. This gives me a list of possible themes across the texts, like family relationships, childhood, home life, etc. I chose NMF because it often gives cleaner, more interpretable topics than LDA when working with TF-IDF.

In [4]:
from sklearn.decomposition import NMF

# Create and fit NMF model (10 topics to start)
nmf_model = NMF(n_components=10, random_state=42)
family_nmf_topics = nmf_model.fit_transform(family_dtm)

# Get feature (word) names from vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Display top 10 words per topic.
# argsort() sorts ascending, so each list ends with the highest-weighted word.
for topic_idx, topic in enumerate(nmf_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"\nNMF TOPIC #{topic_idx + 1}:\n", top_words)



NMF TOPIC #1:
 ['finally', 'north', 'pitch', 'staff', 'scroll', '12', 'power', 'master', 'kids', 'suddenly']

NMF TOPIC #2:
 ['transforms', 'reacts', 'interview', 'ice', 'forward', 'dog', 'speaks', 'speed', 'car', 'continued']

NMF TOPIC #3:
 ['human', 'kiss', 'ocean', 'ta', 'gasps', 'swim', 'swimming', 'sea', 'ha', 'fish']

NMF TOPIC #4:
 ['table', 'window', 'living', 'scene', 'office', 'money', 'baby', 'house', 'phone', 'car']

NMF TOPIC #5:
 ['play', 'scene', 'gang', 'field', 'fence', 'glove', 'squints', 'ball', 'baseball', 'yeah']

NMF TOPIC #6:
 ['window', 'boys', 'papers', 'ai', 'continued', '11', 'family', '165', '90', '91']

NMF TOPIC #7:
 ['console', 'audience', 'wild', 'passengers', 'suddenly', 'ship', 'ext', 'plant', 'int', 'wall']

NMF TOPIC #8:
 ['ground', 'arrow', 'palace', 'castle', 'troops', 'walks', 'tent', 'princess', 'dragon', 'sword']

NMF TOPIC #9:
 ['father', 'sugar', 'step', 'bank', 'boom', 'banks', 'medicine', 'sir', 'children', 'mary']

NMF TOPIC #10:
 ['audie
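
One way to connect topics back to individual files is to pick, for each document, the topic with the largest weight in its row of the document-topic matrix. The sketch below shows the idea in plain Python; `dominant_topics` is a hypothetical helper, and the file names and weights are invented toy values, not results from this notebook.

```python
def dominant_topics(filenames, doc_topic_weights):
    """Map each file name to the 1-based number of its highest-weighted topic."""
    result = {}
    for name, weights in zip(filenames, doc_topic_weights):
        best_index = max(range(len(weights)), key=lambda i: weights[i])
        # Add 1 so the number matches the "TOPIC #n" labels printed above.
        result[name] = best_index + 1
    return result


# Invented toy values: three documents, three topics.
toy_files = ["film_a.txt", "film_b.txt", "film_c.txt"]
toy_weights = [
    [0.02, 0.71, 0.00],
    [0.55, 0.10, 0.05],
    [0.00, 0.03, 0.40],
]

toy_result = dominant_topics(toy_files, toy_weights)
for name, topic_number in toy_result.items():
    print(f"{name}: topic #{topic_number}")
```

In this notebook the same function could presumably be called as dominant_topics(list(family_texts), family_nmf_topics), since the rows of family_nmf_topics follow the order in which the files were loaded.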



Cell 5: Optional — Try LDA for Comparison
In this extra step, I use LDA (Latent Dirichlet Allocation), another popular topic modeling method. It works a bit differently from NMF and is based on probabilities. I use the same number of topics (10) and the same TF-IDF matrix to see if the results are clearer or more interesting. This gives me a way to compare and decide which model works best for this particular text collection. One caveat: LDA is designed for raw word counts rather than TF-IDF weights, which may be why most of the topics in the output below come out identical.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation as LDA

# Fit an LDA model using the same DTM
lda_model = LDA(n_components=10, random_state=42)
family_lda_topics = lda_model.fit_transform(family_dtm)

# Display top 10 words per LDA topic.
# argsort() sorts ascending, so each list ends with the highest-weighted word.
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"\nLDA TOPIC #{topic_idx + 1}:\n", top_words)



LDA TOPIC #1:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #2:
 ['sits', 'begins', 'wall', 'window', 'walks', 'house', 'yeah', 'car', 'suddenly', 'continued']

LDA TOPIC #3:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #4:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #5:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #6:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #7:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #8:
 ['west', 'exaggerate', 'wobbly', 'delicate', 'east', 'penny', 'responsible', 'riddance', 'loops', 'replace']

LDA TOPIC #9:

Final Thoughts
Overall, this notebook sets up a full pipeline from raw text to topic analysis. The results depend a lot on the settings I use, like how many topics I choose and which words I include or leave out. In future steps, I plan to test different numbers of topics and try changing the min_df and max_df values to see how they affect the topics I get.