Cell 1: Import Libraries and Set Up NLTK
This cell imports the libraries we need and sets up NLTK. We use word_tokenize to split text into tokens and pos_tag to label each token with its part of speech (POS). Counter counts word frequencies later on. The nltk.download calls fetch the tokenizer and tagger data that these functions rely on.



In [None]:
# Import necessary libraries
import os
import re
import nltk
from nltk import pos_tag, word_tokenize
from collections import Counter

# Setup: add ~/nltk_data to the paths NLTK searches for downloaded data
nltk.data.path.append(os.path.expanduser('~/nltk_data'))

# Download required resources (only needs to be run once)
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: recent NLTK releases load this for word_tokenize
nltk.download('averaged_perceptron_tagger_eng')


Cell 2: Load and Preview the Text
In this cell, we load the text from the file and preview part of it (characters 15000 up to, but not including, 16000). This shows how the text is structured and confirms that the file loaded correctly.

In [None]:
# Load the text file
with open("/Users/beauxcreel/code/ENGL370-2025/Creel/The_Science_of_Getting_Rich.txt", mode="r", encoding="utf-8") as f:
    tsogr = f.read()

# Preview a portion of the raw text
print(tsogr[15000:16000])
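Slicing past the end of a string does not raise an error in Python; it returns a shorter or empty string. A small guard like the one below makes a failed or truncated load easier to spot. This is only a sketch: the helper name preview and the stand-in string are assumptions made for illustration and are not part of the original notebook.

```python
def preview(text, start, end):
    """Return text[start:end], raising if the start lies beyond the text."""
    if start >= len(text):
        raise ValueError(
            f"Text has only {len(text)} characters; cannot preview from {start}."
        )
    return text[start:end]


# Stand-in string so the sketch runs without the original file
sample_text = "abcdefghij" * 3  # 30 characters
print(preview(sample_text, 10, 15))
```

In the notebook itself, the same check could be applied as preview(tsogr, 15000, 16000).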


Cell 3: Normalize and Tokenize the Text
This cell defines the tknize function, which prepares the text for analysis. It replaces every character that is not a letter, a space, or a period with a space (so digits, line breaks, and all other punctuation are dropped), converts the text to lowercase, and then tokenizes it. Tokenizing means breaking the text into smaller pieces (tokens), which we need before we can label and analyze parts of speech (POS).



In [11]:
# Custom tokenizer: remove punctuation (except periods) and lowercase all words
def tknize(a_string):
    clean = re.sub(r'[^a-zA-Z \.]', ' ', a_string).lower()
    return word_tokenize(clean)

# Tokenize the text
tsogr_tokens = tknize(tsogr)
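To see what the normalization step does to a sentence, the sketch below applies the same substitution to a made-up sample. It uses str.split() as a stand-in for word_tokenize so that it runs without any NLTK data; word_tokenize would additionally split the final period into its own token. The function name normalize and the sample sentence are assumptions for illustration.

```python
import re


def normalize(a_string):
    # Same substitution as tknize: keep letters, spaces, and periods
    return re.sub(r'[^a-zA-Z .]', ' ', a_string).lower()


sample = "The man's plan, in 1910, didn't fail."
# str.split() is a stand-in for word_tokenize so no NLTK data is needed
tokens = normalize(sample).split()
print(tokens)
```

Because apostrophes become spaces, possessives and contractions break apart ("man's" becomes "man" and "s"), which is one likely source of the stray "s" tokens in the results further down.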


Cell 4: POS Tag the Entire Text
Now that the text is tokenized, we use pos_tag to assign a part-of-speech (POS) tag to each token. Every token is labeled with a tag such as "NN" (noun), "VB" (verb), or "JJ" (adjective). This step lets us pick out specific parts of speech, such as adjectives and adverbs, in the cells that follow.

In [10]:
# Tag each token with its part of speech (POS)
tsogr_tagged = pos_tag(tsogr_tokens, lang='eng')
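pos_tag returns a list of (token, tag) tuples. The sketch below shows that shape and how to tally the tags with Counter. The pairs are hand-written for illustration and are not actual tagger output.

```python
from collections import Counter

# Hand-written (token, tag) pairs in the shape pos_tag returns;
# illustrative only, not actual tagger output
sample_tagged = [
    ("the", "DT"), ("certain", "JJ"), ("way", "NN"),
    ("works", "VBZ"), ("very", "RB"), ("well", "RB"), (".", "."),
]

# Unpack each pair and count how often each tag occurs
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common(1))
```

Running the same tally on tsogr_tagged would give a quick overview of which tags dominate the text before filtering for any one of them.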


Cell 5: Slice a Section and Extract Adjectives/Adverbs
In this cell, we slice the tagged token list from index 15000 up to 16000, giving us a sample of up to 1,000 tokens. These are token positions, not the character positions used for the preview in Cell 2. We then extract adverbs and adjectives from the slice using list comprehensions. For each (word, tag) pair, we check whether the tag starts with "RB" (adverbs, including RBR and RBS) or "JJ" (adjectives, including JJR and JJS). Finally, we print the adverbs and adjectives found in this section, their total counts, and the slice rejoined as text.

In [8]:
# Slice the tagged tokens from index 15000 up to 16000
tagged_slice = tsogr_tagged[15000:16000]

# Find adverbs and adjectives in that slice
adverbs = [word for word, tag in tagged_slice if tag.startswith("RB")]
adjectives = [word for word, tag in tagged_slice if tag.startswith("JJ")]

# Display results
print("Adverbs in slice:", adverbs)
print("Adjectives in slice:", adjectives)
print(f"\nTotal adverbs: {len(adverbs)}, Total adjectives: {len(adjectives)}")

# Rejoin the slice's tokens to show the normalized text they came from
reconstructed = " ".join(word for word, tag in tagged_slice)
print("\nReconstructed text slice:")
print(reconstructed)


Adverbs in slice: ['so', 'all', 'so', 'away', 'there', 'so', 'not', 'yet', 'up', 'so', 'so', 'so', 'together', 'there', 'together', 'about', 'as', 'so', 'then', 'long', 'so', 'up', 'so', 'alone', 'very', 'still', 'not', 'either', 'therefore', 'quite', 'there', 'up', 'so', 'so', 'so', 'dully', 'up', 'again', 'together', 'so', 'heartily', 'so', 'then', 'so', 'nd', 'so', 'up', 'never', 'so', 'only', 'also', 'well', 'not', 'not', 'i', 'so', 'almost', 'afeard', 'more', 'up', 'here', 'up', 'again']
Adjectives in slice: ['unlikely', 'cold', 'great', 'mr.', 'leigh', 'i', 'sir', 'petty', 's', 'little', 'displeased', 'poor', 'troubled', 'lonely', 'great', 'full', 'mr.', 's', 'great', 'friendly', 'white', 'tangier', 'many', 'little', 'good', 'matted', 'late', 'first', 'worth', 'next', 'much', 'worth', 'i', 'good', 'monday', 'next', 'lord', 's', 'pleasant', 'i', 'fit', 'white', 'queen', 's', 'mr.', 'great', 'good', 'simple', 'lord', 'weary', 'last', 'cold', 'spent', 'past', 'o', 'good', 'osborne',
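The filters above use tag.startswith rather than an exact comparison. In the Penn Treebank tagset that NLTK's English tagger uses, "JJR" and "JJS" mark comparative and superlative adjectives, and "RBR" and "RBS" do the same for adverbs, so a prefix match collects all three forms. The sketch below contrasts the two approaches on hand-written pairs (illustrative only, not actual tagger output).

```python
# Hand-written (token, tag) pairs; illustrative only, not actual tagger output
sample_tagged = [
    ("great", "JJ"), ("greater", "JJR"), ("greatest", "JJS"),
    ("soon", "RB"), ("sooner", "RBR"), ("most", "RBS"), ("man", "NN"),
]

# Prefix match: picks up base, comparative, and superlative forms
adjectives_prefix = [w for w, t in sample_tagged if t.startswith("JJ")]
# Exact match: base form only
adjectives_exact = [w for w, t in sample_tagged if t == "JJ"]

print(adjectives_prefix)
print(adjectives_exact)
```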

Cell 6: Count Top 10 Adverbs and Adjectives in the Entire Text (end of assignment)
This final cell processes the entire tagged text. It collects every word whose tag starts with "RB" (adverbs) or "JJ" (adjectives), then uses Counter to count how often each word appears. Finally, we print the 10 most frequent adverbs and adjectives, which shows which words dominate each category.



In [9]:
# Frequency Counts for Entire Text
# Extract all adverbs and adjectives from full tagged text
all_adverbs = [word for word, tag in tsogr_tagged if tag.startswith("RB")]
all_adjectives = [word for word, tag in tsogr_tagged if tag.startswith("JJ")]

# Count most common using Counter
adverb_counts = Counter(all_adverbs)
adjective_counts = Counter(all_adjectives)

# Display top 10 of each
print("\nTop 10 Adverbs in Entire Text:")
print(adverb_counts.most_common(10))

print("\nTop 10 Adjectives in Entire Text:")
print(adjective_counts.most_common(10))



Top 10 Adverbs in Entire Text:
[('so', 482), ('not', 298), ('very', 190), ('then', 148), ('up', 108), ('well', 100), ('now', 90), ('again', 80), ('there', 76), ('here', 52)]

Top 10 Adjectives in Entire Text:
[('i', 216), ('great', 198), ('good', 154), ('other', 134), ('s', 118), ('mr.', 94), ('sir', 80), ('little', 66), ('much', 58), ('electronic', 54)]
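Several entries in the adjective list ("i", "s", "mr.", "sir") are not adjectives; they are tokens the tagger mislabeled, which can happen more often once the text has been lowercased and stripped of apostrophes. One possible cleanup, sketched below, is to drop one-letter tokens and a small exclusion set before counting. The names NOT_ADJECTIVES and keep and the sample list are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical exclusion set; adjust to the text being analyzed
NOT_ADJECTIVES = {"mr.", "sir"}


def keep(word):
    # Drop one-letter tokens such as 'i' and 's', and known non-adjectives
    return len(word) > 1 and word not in NOT_ADJECTIVES


# Hand-written sample standing in for all_adjectives
sample_adjectives = ["great", "i", "good", "s", "great", "mr.", "good", "great", "sir"]

filtered_counts = Counter(w for w in sample_adjectives if keep(w))
print(filtered_counts.most_common(2))
```

In the notebook, the same filter could be applied as Counter(w for w in all_adjectives if keep(w)).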
