# Simple Text Processing

In [1]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [2]:
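# Download the NLTK data used below (tokenizer models, stop word list, WordNet)
# Note: newer NLTK releases may also need nltk.download('punkt_tab') for the tokenizers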
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ivantravisany/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ivantravisany/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ivantravisany/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Custom input paragraph
paragraph = """Space, a boundless expanse stretching beyond Earth, intertwines with time in the fabric of the universe, as described by Einstein's theory of relativity. It is not just a void but a dynamic stage where galaxies, stars, and planets form, evolve, and sometimes perish. The cosmos is a tapestry of spacetime, shaped by gravity and punctuated by extraordinary phenomena like black holes, which warp time and space to extremes, and cosmic inflation, which expanded the universe in its infancy faster than the speed of light. Time itself behaves differently in the vastness of space; near massive objects like neutron stars, time slows, a phenomenon measurable by precise clocks and essential to understanding the universe’s nature. The study of cosmology seeks to uncover the origins and ultimate fate of this universe, from the Big Bang, which birthed time and space some 13.8 billion years ago, to its potential end in a Big Freeze, Big Rip, or Big Crunch. Observations of cosmic microwave background radiation, relic light from the early universe, provide glimpses into these ancient epochs, while dark matter and dark energy—comprising most of the universe’s mass and energy—remain enigmatic forces driving cosmic expansion and structure. Space and time are inseparable, and as we explore deeper into this interconnected realm, through telescopes and theoretical physics, we edge closer to unraveling the mysteries of our existence within this vast, four-dimensional continuum."""

In [None]:
# Pretty-print a list of token lists, one numbered line per sentence
def pretty_print(title, data):
    """Print the title, an underline of matching length, then each token list joined by spaces."""
    print(f"\n{title}")
    print("-" * len(title))
    for i, item in enumerate(data, start=1):
        print(f"{i}: {' '.join(item)}")

In [None]:
# Step 1: Parse the paragraph into sentences
sentences = sent_tokenize(paragraph)

# Pretty print the output
print("\nOriginal Sentences:")
print("-" * 20)
for i, sentence in enumerate(sentences, start=1):
    print(f"{i}: {sentence}")


Original Sentences:
--------------------
1: Space, a boundless expanse stretching beyond Earth, intertwines with time in the fabric of the universe, as described by Einstein's theory of relativity.
2: It is not just a void but a dynamic stage where galaxies, stars, and planets form, evolve, and sometimes perish.
3: The cosmos is a tapestry of spacetime, shaped by gravity and punctuated by extraordinary phenomena like black holes, which warp time and space to extremes, and cosmic inflation, which expanded the universe in its infancy faster than the speed of light.
4: Time itself behaves differently in the vastness of space; near massive objects like neutron stars, time slows, a phenomenon measurable by precise clocks and essential to understanding the universe’s nature.
5: The study of cosmology seeks to uncover the origins and ultimate fate of this universe, from the Big Bang, which birthed time and space some 13.8 billion years ago, to its potential end in a Big Freeze, Big Rip, or B
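To illustrate why Step 1 uses a dedicated sentence tokenizer rather than splitting on periods, the sketch below compares two naive approaches on a short fragment adapted from the paragraph. The helper names `split_on_periods` and `split_on_terminal_punctuation` are hypothetical and introduced only for this example. Neither reflects how `sent_tokenize` works internally, and the regex version would still mis-split abbreviations such as "Dr. Smith".

```python
import re

# Shortened fragment adapted from the paragraph; note the decimal "13.8".
sample = (
    "The Big Bang birthed time and space some 13.8 billion years ago. "
    "Space and time are inseparable."
)

def split_on_periods(text):
    """Naive: split on every period, which also breaks decimals like 13.8."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]

def split_on_terminal_punctuation(text):
    """Split only where ., ! or ? is followed by whitespace."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text.strip()) if piece]

naive = split_on_periods(sample)
regex_based = split_on_terminal_punctuation(sample)

print(len(naive), naive)              # the decimal is cut in two
print(len(regex_based), regex_based)  # the decimal survives
```

Splitting on every period yields three fragments because "13.8" is cut apart, while requiring whitespace after the terminator yields the expected two sentences.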

In [None]:
# Step 2: Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Print the raw nested list for now so the individual tokens are visible
print("Tokenized Words")
print("-" * 20)
print(tokenized_sentences)

Tokenized Words
--------------------
[['Space', ',', 'a', 'boundless', 'expanse', 'stretching', 'beyond', 'Earth', ',', 'intertwines', 'with', 'time', 'in', 'the', 'fabric', 'of', 'the', 'universe', ',', 'as', 'described', 'by', 'Einstein', "'s", 'theory', 'of', 'relativity', '.'], ['It', 'is', 'not', 'just', 'a', 'void', 'but', 'a', 'dynamic', 'stage', 'where', 'galaxies', ',', 'stars', ',', 'and', 'planets', 'form', ',', 'evolve', ',', 'and', 'sometimes', 'perish', '.'], ['The', 'cosmos', 'is', 'a', 'tapestry', 'of', 'spacetime', ',', 'shaped', 'by', 'gravity', 'and', 'punctuated', 'by', 'extraordinary', 'phenomena', 'like', 'black', 'holes', ',', 'which', 'warp', 'time', 'and', 'space', 'to', 'extremes', ',', 'and', 'cosmic', 'inflation', ',', 'which', 'expanded', 'the', 'universe', 'in', 'its', 'infancy', 'faster', 'than', 'the', 'speed', 'of', 'light', '.'], ['Time', 'itself', 'behaves', 'differently', 'in', 'the', 'vastness', 'of', 'space', ';', 'near', 'massive', 'objects', 'lik

In [None]:
import string

# Step 3: Remove stop words and punctuation
# Build the stop word set and add ASCII punctuation to it
stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)  # ASCII only: , . ! ? etc. (no curly quotes or long dashes)
stop_words.update(punctuation)  # Treat punctuation tokens as stop words

# Remove both stop words and punctuation
filtered_sentences = [
    [word for word in sentence if word.lower() not in stop_words]
    for sentence in tokenized_sentences
]

# Then pretty print it
pretty_print("Filtered Sentences (No Stop Words or Punctuation):", filtered_sentences)


Filtered Sentences (No Stop Words or Punctuation):
--------------------------------------------------
1: Space boundless expanse stretching beyond Earth intertwines time fabric universe described Einstein 's theory relativity
2: void dynamic stage galaxies stars planets form evolve sometimes perish
3: cosmos tapestry spacetime shaped gravity punctuated extraordinary phenomena like black holes warp time space extremes cosmic inflation expanded universe infancy faster speed light
4: Time behaves differently vastness space near massive objects like neutron stars time slows phenomenon measurable precise clocks essential understanding universe ’ nature
5: study cosmology seeks uncover origins ultimate fate universe Big Bang birthed time space 13.8 billion years ago potential end Big Freeze Big Rip Big Crunch
6: Observations cosmic microwave background radiation relic light early universe provide glimpses ancient epochs dark matter dark energy—comprising universe ’ mass energy—remain enigma
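The filtered output above still contains a standalone `’`, the clitic `'s`, and tokens such as `energy—comprising`, because `string.punctuation` covers only ASCII characters. One possible way to catch these, sketched below with the standard library only, is to test Unicode character categories instead. The helpers `is_punctuation_token` and `split_on_long_dash` are hypothetical names, not part of NLTK or of the original notebook.

```python
import unicodedata

def is_punctuation_token(token):
    """Return True if every character in the token is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def split_on_long_dash(token):
    """Split a token on the long dash (U+2014), dropping empty pieces."""
    return [part for part in token.split("\u2014") if part]

# Tokens of the kind seen in the output above; "\u2019" is the curly
# apostrophe and "\u2014" is the long dash used in the paragraph.
tokens = ["universe", "\u2019", "nature", "Einstein", "'s",
          "energy\u2014comprising", ","]

# Drop tokens made up entirely of punctuation, then split on long dashes.
kept = [tok for tok in tokens if not is_punctuation_token(tok)]
expanded = [part for tok in kept for part in split_on_long_dash(tok)]

print(expanded)
```

The clitic `'s` survives this check because it contains a letter; if it should be dropped too, it could be added to the stop word set explicitly.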

In [8]:
# Step 4: Perform stemming
stemmer = PorterStemmer()
stemmed_sentences = [
    [stemmer.stem(word) for word in sentence] for sentence in filtered_sentences
]

pretty_print("Stemmed Words", stemmed_sentences)


Stemmed Words
-------------
1: space boundless expans stretch beyond earth intertwin time fabric univers describ einstein 's theori rel
2: void dynam stage galaxi star planet form evolv sometim perish
3: cosmo tapestri spacetim shape graviti punctuat extraordinari phenomena like black hole warp time space extrem cosmic inflat expand univers infanc faster speed light
4: time behav differ vast space near massiv object like neutron star time slow phenomenon measur precis clock essenti understand univers ’ natur
5: studi cosmolog seek uncov origin ultim fate univers big bang birth time space 13.8 billion year ago potenti end big freez big rip big crunch
6: observ cosmic microwav background radiat relic light earli univers provid glimps ancient epoch dark matter dark energy—compris univers ’ mass energy—remain enigmat forc drive cosmic expans structur
7: space time insepar explor deeper interconnect realm telescop theoret physic edg closer unravel mysteri exist within vast four-dimension c
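A common follow-up to stemming is counting how often each stem occurs. The sketch below is not part of the original notebook. It uses `collections.Counter` on the stems copied from sentences 1 to 5 of the output above (the later lines are cut off in this export, so they are left out).

```python
from collections import Counter

# Stems copied verbatim from lines 1-5 of the "Stemmed Words" output.
stemmed_lines = [
    "space boundless expans stretch beyond earth intertwin time fabric univers describ einstein 's theori rel",
    "void dynam stage galaxi star planet form evolv sometim perish",
    "cosmo tapestri spacetim shape graviti punctuat extraordinari phenomena like black hole warp time space extrem cosmic inflat expand univers infanc faster speed light",
    "time behav differ vast space near massiv object like neutron star time slow phenomenon measur precis clock essenti understand univers ’ natur",
    "studi cosmolog seek uncov origin ultim fate univers big bang birth time space 13.8 billion year ago potenti end big freez big rip big crunch",
]

# pretty_print joined tokens with single spaces, so split() recovers them.
counts = Counter(stem for line in stemmed_lines for stem in line.split())

# Show the most frequent stems across the five sentences.
for stem, n in counts.most_common(4):
    print(f"{stem}: {n}")
```

Because the Porter stemmer lowercases its output, "Space" and "space" are counted together here. The same count on the filtered tokens would keep them apart.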

In [9]:
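# Step 5: Perform lemmatization
# lemmatize() defaults to pos="n", so only noun forms are reduced; verbs such as
# "stretching" are left unchanged, and the original capitalization is kept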
lemmatizer = WordNetLemmatizer()
lemmatized_sentences = [
    [lemmatizer.lemmatize(word) for word in sentence] for sentence in filtered_sentences
]

pretty_print("Lemmatized Words", lemmatized_sentences)


Lemmatized Words
----------------
1: Space boundless expanse stretching beyond Earth intertwines time fabric universe described Einstein 's theory relativity
2: void dynamic stage galaxy star planet form evolve sometimes perish
3: cosmos tapestry spacetime shaped gravity punctuated extraordinary phenomenon like black hole warp time space extreme cosmic inflation expanded universe infancy faster speed light
4: Time behaves differently vastness space near massive object like neutron star time slows phenomenon measurable precise clock essential understanding universe ’ nature
5: study cosmology seek uncover origin ultimate fate universe Big Bang birthed time space 13.8 billion year ago potential end Big Freeze Big Rip Big Crunch
6: Observations cosmic microwave background radiation relic light early universe provide glimpse ancient epoch dark matter dark energy—comprising universe ’ mass energy—remain enigmatic force driving cosmic expansion structure
7: Space time inseparable explore de
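To make the difference between the two normalization steps easier to see, the sketch below lines up sentence 2 from the filtered, stemmed, and lemmatized outputs above. It only re-uses tokens already printed by the notebook; the variable names are illustrative.

```python
# Sentence 2, copied from the three outputs above.
filtered = "void dynamic stage galaxies stars planets form evolve sometimes perish".split()
stemmed = "void dynam stage galaxi star planet form evolv sometim perish".split()
lemmatized = "void dynamic stage galaxy star planet form evolve sometimes perish".split()

rows = list(zip(filtered, stemmed, lemmatized))

# Print a small aligned table: original token, stem, lemma.
print(f"{'token':<12}{'stem':<12}{'lemma':<12}")
for token, stem, lemma in rows:
    print(f"{token:<12}{stem:<12}{lemma:<12}")

# Which tokens did each step actually change?
changed_by_stemmer = [tok for tok, stem, _ in rows if stem != tok]
changed_by_lemmatizer = [tok for tok, _, lemma in rows if lemma != tok]
print(changed_by_stemmer)
print(changed_by_lemmatizer)
```

In this sentence the stemmer alters six tokens, several of them into non-words such as "dynam" and "evolv", while the lemmatizer changes only the three plural nouns and always returns dictionary forms.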