
*Using NLTK for tokenization, stemming, and lemmatization.*

**Tokenization:** Whitespace, Punctuation-based, Treebank, Tweet, and Multi-Word Expression (MWE).

**Stemming:** Porter Stemmer and Snowball Stemmer.

**Lemmatization:** WordNet Lemmatizer.
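
Before bringing in NLTK, it may help to see roughly what the two simplest tokenizers in the list do. The sketch below uses only the standard library. The helper names `whitespace_split` and `word_punct_split` are hypothetical, and the regular expression is an assumption about how punctuation-based splitting behaves (runs of word characters, or runs of non-space punctuation). It is an approximation for illustration, not NLTK's own code.

```python
import re

def whitespace_split(s):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return s.split()

def word_punct_split(s):
    # Keep runs of word characters and runs of punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]+", s)

sample = "Mr. Fox jumps over the lazy dog! #AmazingFox"
print(whitespace_split(sample))
print(word_punct_split(sample))
```

With whitespace splitting, `dog!` and `#AmazingFox` each remain one token. The punctuation-based pattern separates `!` and `#`, and also breaks `Mr.` into `Mr` and `.`.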

In [1]:
!pip install nltk



In [9]:
import nltk
from nltk.tokenize import (
    WhitespaceTokenizer,
    WordPunctTokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
)
from nltk.tokenize.mwe import MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # Download WordNet data for lemmatization

# Sample Text
text = "The quick brown fox, named Mr. Fox, jumps over the lazy dog! #AmazingFox"

# Tokenization Techniques
whitespace_tokenizer = WhitespaceTokenizer()
word_punct_tokenizer = WordPunctTokenizer()
treebank_tokenizer = TreebankWordTokenizer()
tweet_tokenizer = TweetTokenizer()

# Multi-word expression tokenizer
mwe_tokenizer = MWETokenizer([("Mr.", "Fox"), ("lazy", "dog")])

# Apply Tokenization
whitespace_tokens = whitespace_tokenizer.tokenize(text)
word_punct_tokens = word_punct_tokenizer.tokenize(text)
treebank_tokens = treebank_tokenizer.tokenize(text)
tweet_tokens = tweet_tokenizer.tokenize(text)
mwe_tokens = mwe_tokenizer.tokenize(treebank_tokens)  # Apply MWE on Treebank tokens

# Stemming
porter = PorterStemmer()
snowball = SnowballStemmer("english")

porter_stems = [porter.stem(word) for word in treebank_tokens]
snowball_stems = [snowball.stem(word) for word in treebank_tokens]

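# Lemmatization
# Note: lemmatize() defaults to pos="n", so every token is treated as a noun.
# Pass pos="v" to lemmatize verbs (e.g. "named" -> "name").
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in treebank_tokens]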

# Display Results
print("\nTokenization")
print("Whitespace Tokenizer:", whitespace_tokens)
print("Punctuation-based Tokenizer:", word_punct_tokens)
print("Treebank Tokenizer:", treebank_tokens)
print("Tweet Tokenizer:", tweet_tokens)
print("Multi-word Expression (MWE) Tokenizer:", mwe_tokens)

print("\nStemming")
print("Porter Stemmer:", porter_stems)
print("Snowball Stemmer:", snowball_stems)

print("\nLemmatization")
print("WordNet Lemmatizer:", lemmas)



Tokenization
Whitespace Tokenizer: ['The', 'quick', 'brown', 'fox,', 'named', 'Mr.', 'Fox,', 'jumps', 'over', 'the', 'lazy', 'dog!', '#AmazingFox']
Punctuation-based Tokenizer: ['The', 'quick', 'brown', 'fox', ',', 'named', 'Mr', '.', 'Fox', ',', 'jumps', 'over', 'the', 'lazy', 'dog', '!', '#', 'AmazingFox']
Treebank Tokenizer: ['The', 'quick', 'brown', 'fox', ',', 'named', 'Mr.', 'Fox', ',', 'jumps', 'over', 'the', 'lazy', 'dog', '!', '#', 'AmazingFox']
Tweet Tokenizer: ['The', 'quick', 'brown', 'fox', ',', 'named', 'Mr', '.', 'Fox', ',', 'jumps', 'over', 'the', 'lazy', 'dog', '!', '#AmazingFox']
Multi-word Expression (MWE) Tokenizer: ['The', 'quick', 'brown', 'fox', ',', 'named', 'Mr._Fox', ',', 'jumps', 'over', 'the', 'lazy_dog', '!', '#', 'AmazingFox']

Stemming
Porter Stemmer: ['the', 'quick', 'brown', 'fox', ',', 'name', 'mr.', 'fox', ',', 'jump', 'over', 'the', 'lazi', 'dog', '!', '#', 'amazingfox']
Snowball Stemmer: ['the', 'quick', 'brown', 'fox', ',', 'name', 'mr.', 'fox', '

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
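
The MWE step above merges adjacent tokens that match a registered expression and joins them with `_`. A minimal pure-Python sketch of that idea follows. The function `merge_mwes` is hypothetical and is not how NLTK implements `MWETokenizer`. It is a simplified greedy matcher that only aims to reproduce the merged output shown above for this input.

```python
def merge_mwes(tokens, expressions, separator="_"):
    # Try longer expressions first so the longest match wins at each position.
    ordered = sorted(
        (tuple(e) for e in expressions if e), key=len, reverse=True
    )
    merged = []
    i = 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['The', 'quick', 'brown', 'fox', ',', 'named', 'Mr.', 'Fox', ',',
          'jumps', 'over', 'the', 'lazy', 'dog', '!', '#', 'AmazingFox']
merged_tokens = merge_mwes(tokens, [("Mr.", "Fox"), ("lazy", "dog")])
print(merged_tokens)
```

Matching is exact and case-sensitive, which is why the notebook feeds Treebank tokens into the MWE step. With whitespace tokens, `Fox,` and `dog!` still carry their punctuation, so neither expression would match.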
