In [1]:
paragraph = "Today was an enchanting day in the countryside, 🌳 where the gentle breeze whispered secrets through the fields 🌾 and the golden sun cast a warm glow upon the landscape ☀️.\n The rolling hills stretched for miles, adorned with patches of colorful wildflowers 🌼, creating a breathtaking panorama. As I roamed through the meadows, I was serenaded by the melodious chirping of crickets 🦗 and the soft rustle of leaves in the wind.\n The earthy scent of freshly cut grass filled the air, mingling with the sweet fragrance of blooming flowers. With each breath, I felt a profound sense of peace wash over me, a reminder of the beauty and tranquility that nature offers.\n In this idyllic setting, time seemed to stand still, allowing me to fully appreciate the simple joys of life. It's moments like these that remind us to slow down and savor the beauty that surrounds us, even in the midst of life's chaos."

### a. Word Tokenization
##### NLTK's word_tokenize function splits the paragraph into a list of word and punctuation tokens, using spaces and punctuation as boundaries. Tokenization is a basic step in text analysis and usually comes before tasks such as sentiment analysis and POS tagging.

In [3]:
import nltk
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk_tokens = nltk.word_tokenize(paragraph)
print(nltk_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\noelm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀️', '.', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', ',', 'creating', 'a', 'breathtaking', 'panorama', '.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', '.', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 'beauty', 'and
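As a point of comparison, here is a minimal sketch of what plain whitespace splitting does to a shortened sample sentence (the sample string is made up for illustration). It uses only Python's built-in str.split, not NLTK, and shows why a tokenizer that separates punctuation is usually preferable.

```python
# Whitespace splitting leaves punctuation attached to the neighbouring word.
sample = "Today was an enchanting day in the countryside, where time stood still."

whitespace_tokens = sample.split()
print(whitespace_tokens)
# 'countryside,' and 'still.' keep their punctuation, so they would not
# match the plain words 'countryside' and 'still' in later processing.
```

In the word_tokenize output above, the comma and period appear as separate tokens instead.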

### b. Sentence Tokenization
##### NLTK's sent_tokenize function divides the paragraph into a list of sentences. Sentence boundaries help in machine translation and sentiment analysis, and they make it easier to interpret each sentence in context.




In [4]:
sent_tokens = nltk.sent_tokenize(paragraph)
print(sent_tokens)

['Today was an enchanting day in the countryside, 🌳 where the gentle breeze whispered secrets through the fields 🌾 and the golden sun cast a warm glow upon the landscape ☀️.', 'The rolling hills stretched for miles, adorned with patches of colorful wildflowers 🌼, creating a breathtaking panorama.', 'As I roamed through the meadows, I was serenaded by the melodious chirping of crickets 🦗 and the soft rustle of leaves in the wind.', 'The earthy scent of freshly cut grass filled the air, mingling with the sweet fragrance of blooming flowers.', 'With each breath, I felt a profound sense of peace wash over me, a reminder of the beauty and tranquility that nature offers.', 'In this idyllic setting, time seemed to stand still, allowing me to fully appreciate the simple joys of life.', "It's moments like these that remind us to slow down and savor the beauty that surrounds us, even in the midst of life's chaos."]
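To see why a trained sentence tokenizer is useful, the following sketch splits on sentence-final punctuation with a single regular expression. This is a naive stand-in written for illustration, not how sent_tokenize works internally; the function name naive_sent_split and the sample text are hypothetical.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = "The hills stretched for miles.\n As I roamed, I listened. Dr. Smith waved!"
naive_sentences = naive_sent_split(sample)
for sentence in naive_sentences:
    print(repr(sentence))
# The rule wrongly breaks after the abbreviation 'Dr.', the kind of case
# a trained sentence tokenizer is meant to handle.
```

The naive rule copes with the newline between sentences but produces four pieces instead of three.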


### c. Punctuation-based Tokenizer
##### This regular expression matches either a run of word characters or one of the listed punctuation marks, so words and punctuation become separate tokens. Everything else is dropped, including emojis and apostrophes (so "It's" becomes 'It' and 's'). Useful in text cleaning, where you want to separate punctuation from words, or for tasks that focus on specific patterns around punctuation.

In [5]:
import re
punct_tokens = re.findall(r'\b\w+\b|[.,;!?]', paragraph)
print(punct_tokens)

['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '.', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', ',', 'creating', 'a', 'breathtaking', 'panorama', '.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', '.', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 'beauty', 'and', 'tranquility', 'that', 
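The emojis are missing from the output above because they match neither branch of the pattern. The sketch below reproduces that on a short made-up sample and then tries a variant pattern that keeps contractions and symbols. The variant is only one possible choice, offered as an assumption rather than a recommended standard.

```python
import re

sample = "It's a warm day 🌾, isn't it?"

# Pattern from the cell above: apostrophes and emojis match neither branch.
basic = re.findall(r'\b\w+\b|[.,;!?]', sample)
print(basic)

# Hypothetical variant: allow internal apostrophes inside a word and keep
# any other non-space symbol as its own token.
extended = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sample)
print(extended)
```

With the variant, "It's" and "isn't" stay whole and the emoji is kept as a token.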

### d. Treebank Word Tokenizer
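##### NLTK's TreebankWordTokenizer follows the Penn Treebank conventions: it splits standard contractions (for example, "don't" into 'do' and "n't") and separates most punctuation, while keeping hyphenated words together. It assumes the text has already been split into sentences, which is why periods inside the paragraph stay attached to the preceding word in the output below. Suitable for tasks where consistent handling of contractions is important, such as linguistic analysis.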

In [6]:
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(paragraph)
print(treebank_tokens)

['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀️.', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', ',', 'creating', 'a', 'breathtaking', 'panorama.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind.', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 'beauty', 'and', 'tranquility'

### e. Tweet Tokenizer
##### NLTK's TweetTokenizer is designed to handle tweets, preserving hashtags and mentions. Ideal for sentiment analysis, topic modeling, and other NLP tasks involving Twitter data.

In [7]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(paragraph)
print(tweet_tokens)


['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀', '️', '.', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', ',', 'creating', 'a', 'breathtaking', 'panorama', '.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', '.', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 'beauty', 
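The paragraph contains no hashtags or mentions, so the output above looks much like the other tokenizers. The sketch below uses a made-up tweet and two simplified regular expressions to show the difference a social-media-aware pattern makes. These patterns are illustrations only and are not the expressions NLTK uses; the handle @nature_fan is hypothetical.

```python
import re

tweet = "Loving the #countryside with @nature_fan today! 🌾"

# Generic word/punctuation pattern: '#' and '@' become separate tokens.
generic = re.findall(r"\w+|[^\w\s]", tweet)
print(generic)

# Simplified tweet-aware pattern: try hashtags and mentions first.
tweet_like = re.findall(r"[#@]\w+|\w+|[^\w\s]", tweet)
print(tweet_like)
```

Keeping '#countryside' and '@nature_fan' intact matters when hashtags or user handles are features in the analysis.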

### f. Multi-Word Expression Tokenizer
##### NLTK's MWETokenizer allows tokenization of specific multi-word expressions. Useful in tasks where understanding multi-word phrases is essential, like in specialized domain language processing.


In [8]:
from nltk.tokenize import MWETokenizer
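# Neither expression occurs in this paragraph, so the output equals the word_tokenize output.
# word_tokenize also splits "water's" into 'water' and "'s", so the second entry cannot match as written.
mwetokenizer = MWETokenizer([('rhythmic', 'symphony'), ("water's", 'edge')])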
mwe_tokens = mwetokenizer.tokenize(nltk.word_tokenize(paragraph))
print(mwe_tokens)


['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀️', '.', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', ',', 'creating', 'a', 'breathtaking', 'panorama', '.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', '.', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 'beauty', 'and
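To make the idea of multi-word expression merging concrete, here is a small pure-Python sketch using expressions that do appear in the paragraph ('golden sun', 'rolling hills'). The helper merge_mwes is hypothetical and much simpler than NLTK's implementation; it only borrows the convention of joining matched words with an underscore, which is MWETokenizer's default separator.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Greedily merge the listed multi-word expressions into single tokens."""
    merged, i = [], 0
    # Skip empty entries and try longer expressions first.
    ordered = sorted((e for e in expressions if e), key=len, reverse=True)
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow',
          'over', 'the', 'rolling', 'hills']
mwe_demo = merge_mwes(tokens, [('golden', 'sun'), ('rolling', 'hills')])
print(mwe_demo)
```

Each listed expression is collapsed into one token, so later steps treat it as a single unit.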

### g. TextBlob Word Tokenize
##### TextBlob's words attribute provides a convenient way to access the words in the paragraph; punctuation marks are dropped from the result. Suitable for quick and simple NLP tasks, especially in educational or prototyping contexts.

In [10]:
from textblob import TextBlob
blob = TextBlob(paragraph)
textblob_tokens = blob.words
print(textblob_tokens)


['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀️', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', 'creating', 'a', 'breathtaking', 'panorama', 'As', 'I', 'roamed', 'through', 'the', 'meadows', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', 'With', 'each', 'breath', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', 'a', 'reminder', 'of', 'the', 'beauty', 'and', 'tranquility', 'that', 'nature', 'offers', 'In', 'th

### h. spaCy Tokenizer
##### spaCy tokenizes the paragraph with language-specific rules and, through the loaded pipeline, attaches detailed information to each token. Whitespace such as the newline between sentences is kept as its own token. Valuable in various NLP tasks, including named entity recognition, dependency parsing, and other advanced applications.


In [17]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(paragraph)
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)

['Today', 'was', 'an', 'enchanting', 'day', 'in', 'the', 'countryside', ',', '🌳', 'where', 'the', 'gentle', 'breeze', 'whispered', 'secrets', 'through', 'the', 'fields', '🌾', 'and', 'the', 'golden', 'sun', 'cast', 'a', 'warm', 'glow', 'upon', 'the', 'landscape', '☀', '️.', '\n ', 'The', 'rolling', 'hills', 'stretched', 'for', 'miles', ',', 'adorned', 'with', 'patches', 'of', 'colorful', 'wildflowers', '🌼', ',', 'creating', 'a', 'breathtaking', 'panorama', '.', 'As', 'I', 'roamed', 'through', 'the', 'meadows', ',', 'I', 'was', 'serenaded', 'by', 'the', 'melodious', 'chirping', 'of', 'crickets', '🦗', 'and', 'the', 'soft', 'rustle', 'of', 'leaves', 'in', 'the', 'wind', '.', '\n ', 'The', 'earthy', 'scent', 'of', 'freshly', 'cut', 'grass', 'filled', 'the', 'air', ',', 'mingling', 'with', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'With', 'each', 'breath', ',', 'I', 'felt', 'a', 'profound', 'sense', 'of', 'peace', 'wash', 'over', 'me', ',', 'a', 'reminder', 'of', 'the', 

### i. Gensim Word Tokenizer
##### Gensim's tokenize utility yields tokens made of alphabetic characters, dropping punctuation and digits. It is part of the Gensim library, known for topic modeling and document similarity analysis, and is used in topic modeling, document clustering, and other tasks related to semantic analysis.

In [14]:
from gensim.utils import tokenize  # requires the gensim package (e.g. pip install gensim)
gensim_tokens = list(tokenize(paragraph))
print(gensim_tokens)


ModuleNotFoundError: No module named 'gensim'
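Since the cell above could not import gensim, the following standard-library sketch approximates the behaviour described: it keeps maximal runs of alphabetic characters and treats everything else as a separator. It is a rough stand-in under that assumption, not gensim's code; the function name alphabetic_tokens, its lowercase parameter, and the sample sentence are hypothetical.

```python
import re

# Runs of word characters that are not digits; punctuation, digits and
# emojis act as separators.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def alphabetic_tokens(text, lowercase=False):
    """Return alphabetic tokens from text, optionally lowercased."""
    if lowercase:
        text = text.lower()
    return [match.group() for match in ALPHABETIC.finditer(text)]

sample = "The rolling hills stretched for 12 miles 🌼, creating a panorama."
alpha_demo = alphabetic_tokens(sample, lowercase=True)
print(alpha_demo)
```

The number, the emoji, and the punctuation are all absent from the result.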

### j. Tokenization with Keras
##### Keras' text_to_word_sequence function splits the paragraph into words, converting everything to lowercase and filtering out punctuation (apostrophes inside words are kept). Used in text classification, language modeling, and sequence-to-sequence tasks.

In [None]:
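# This import path works in older Keras / TensorFlow 2.x releases and may be unavailable in recent Keras versions.
from keras.preprocessing.text import text_to_word_sequence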
keras_tokens = text_to_word_sequence(paragraph)
print(keras_tokens)

['today', 'was', 'a', 'phenomenal', 'day', 'at', 'the', 'beach', '🏖️', 'filled', 'with', 'the', 'rhythmic', 'symphony', 'of', 'crashing', 'waves', '🌊', 'and', 'the', 'gentle', 'caress', 'of', 'the', "sun's", 'warm', 'embrace', '☀️', 'the', 'sandy', 'shore', 'stretched', 'for', 'miles', 'and', 'seagulls', 'soared', 'above', 'creating', 'a', 'picturesque', 'scene', 'as', 'i', 'strolled', 'along', 'the', "water's", 'edge', 'i', "couldn't", 'help', 'but', 'marvel', 'at', 'the', 'vibrant', 'hues', 'of', 'the', 'sunset', '☀️', '🌅', 'painting', 'the', 'sky', 'with', 'breathtaking', 'colors', 'the', 'salty', 'breeze', 'carried', 'away', 'the', 'worries', 'of', 'the', 'day', 'creating', 'a', 'sense', 'of', 'tranquility', 'that', 'i', "don't", 'often', 'experience', 'the', 'sand', 'between', 'my', 'toes', 'was', 'a', 'reminder', 'to', 'appreciate', "life's", 'simple', 'pleasures', 'i', "don't", 'think', "i've", 'ever', 'felt', 'more', 'at', 'peace', 'than', 'i', 'did', 'in', 'that', 'moment', "i
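The lowercase-and-filter behaviour can be sketched with the standard library alone. The filter string below is assumed to mirror the Keras default, and the helper simple_word_sequence is a hypothetical stand-in rather than the Keras implementation.

```python
# Assumed to match the default Keras filter characters; note that the
# apostrophe is not in the list, so contractions survive.
KERAS_LIKE_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=KERAS_LIKE_FILTERS, lower=True, split=" "):
    """Lowercase the text, blank out filtered characters, then split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

sample = "It's moments like these, even in life's chaos 🌾."
keras_like = simple_word_sequence(sample)
print(keras_like)
```

As in the output above, contractions and emojis are kept while commas and periods disappear.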