#### **1. Tokenizers**

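Each tokenizer splits text differently, reflecting its own rules for words, punctuation, and sentence boundaries. Which one to choose depends on the requirements of your NLP task and the nature of your text data. The expected tokens listed below are for the example sentence `Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!`; the code cell further down runs the same tokenizers on a different sample text.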

### White Space Tokenization:
- **Splits text based on spaces.**
- **Expected Tokens:** `["Let's", "see", "different", "tokenizers.", "They", "all", "work", "differently:", "white-space,", "regex,", "NLTK,", "Spacy!"]`
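
As a minimal sketch of the behaviour described above, plain `str.split()` is enough to reproduce the expected tokens. The two-space string at the end is an illustrative assumption, added only to show why calling `split()` without an argument is usually the safer choice.

```python
# The example sentence the expected tokens above are based on
text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"

# With no argument, split() breaks on any run of whitespace
tokens = text.split()
print(tokens)

# Punctuation stays glued to the neighbouring word
glued = [tok for tok in tokens if not tok.isalnum()]
print(glued)

# split(" ") keeps empty strings when spaces repeat, split() does not
print("a  b".split(" "))
print("a  b".split())
```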

### NLTK Word Tokenization:
- **Separates punctuation into its own tokens and splits contractions (`Let's` becomes `Let` and `'s`).**
- **Expected Tokens:** `["Let", "'s", "see", "different", "tokenizers", ".", "They", "all", "work", "differently", ":", "white-space", ",", "regex", ",", "NLTK", ",", "Spacy", "!"]`

### Regular Expression Tokenizer:
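- **Uses a regex pattern (here `\w+`, which matches runs of word characters: letters, digits, and underscores), so punctuation is dropped and `Let's` and `white-space` are each split in two.**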
- **Expected Tokens:** `["Let", "s", "see", "different", "tokenizers", "They", "all", "work", "differently", "white", "space", "regex", "NLTK", "Spacy"]`
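
The same idea can be tried with the standard `re` module alone. For the `\w+` pattern, `re.findall` should give the same tokens as `RegexpTokenizer(r'\w+')`. The second pattern, `\w+|[^\w\s]`, is not from the original and is shown only as one possible way to keep punctuation as separate tokens.

```python
import re

text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"

# Runs of word characters only: punctuation disappears
words_only = re.findall(r"\w+", text)
print(words_only)

# Assumed alternative: a word run OR any single non-space, non-word character
with_punct = re.findall(r"\w+|[^\w\s]", text)
print(with_punct)
```

Note that the second pattern still splits `Let's` into three pieces, which is one reason to prefer a tokenizer with explicit contraction rules for English text.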

### spaCy Tokenization:
- **Rule-based tokenizer that separates punctuation and special characters and, by default, also splits hyphenated words.**
- **Expected Tokens:** `["Let", "'s", "see", "different", "tokenizers", ".", "They", "all", "work", "differently", ":", "white", "-", "space", ",", "regex", ",", "NLTK", ",", "Spacy", "!"]`

### Sentence Tokenization:
- **Splits text into sentences.**
- **Expected Sentences:** `["Let's see different tokenizers.", "They all work differently: white-space, regex, NLTK, Spacy!"]`
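
To see why a trained sentence tokenizer is useful, here is a naive splitter built on `re.split`. The function name `naive_sent_split` and the `tricky` sentence are illustrative assumptions. The sketch splits after every `.`, `!`, or `?` followed by whitespace, which works on the example sentence but breaks on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"
sentences = naive_sent_split(text)
print(sentences)

# An abbreviation is enough to fool the naive rule
tricky = "AbbVie Inc. is an American pharmaceutical company."
broken = naive_sent_split(tricky)
print(broken)
```

NLTK's `sent_tokenize` relies on the Punkt model, which is designed to handle many abbreviation cases of this kind.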

In [3]:
import nltk
import spacy
from nltk.tokenize import word_tokenize, RegexpTokenizer, sent_tokenize

# Sample text
text = "The quick, brown fox (an exemplary species) jumps over the lazy dog - a classic example of a pangram"

# White Space Tokenization
white_space_tokens = text.split()

# NLTK Word Tokenization
nltk.download('punkt')  # tokenizer data; recent NLTK releases look for 'punkt_tab' instead
nltk_tokens = word_tokenize(text)

# Regular Expression Tokenizer
regexp_tokenizer = RegexpTokenizer(r'\w+')
regex_tokens = regexp_tokenizer.tokenize(text)

# Spacy Tokenization
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

# Sentence Tokenization
sentences = sent_tokenize(text)

# Collecting all tokenization results
tokenization_results = {
    "White Space Tokenization": white_space_tokens,
    "NLTK Tokenization": nltk_tokens,
    "Regex Tokenization": regex_tokens,
    "Spacy Tokenization": spacy_tokens,
    "Sentence Tokenization": sentences
}

for name, tokens in tokenization_results.items():
    print(name)
    print(tokens)

White Space Tokenization
['The', 'quick,', 'brown', 'fox', '(an', 'exemplary', 'species)', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
NLTK Tokenization
['The', 'quick', ',', 'brown', 'fox', '(', 'an', 'exemplary', 'species', ')', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Regex Tokenization
['The', 'quick', 'brown', 'fox', 'an', 'exemplary', 'species', 'jumps', 'over', 'the', 'lazy', 'dog', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Spacy Tokenization
['The', 'quick', ',', 'brown', 'fox', '(', 'an', 'exemplary', 'species', ')', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Sentence Tokenization
['The quick, brown fox (an exemplary species) jumps over the lazy dog - a classic example of a pangram']


#### **2. Stemming**

In [4]:
import nltk
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize

sentence = """The last meeting was exhausting. I am meeting my mom"""

# Initialize the PorterStemmer
stemmer = nltk.stem.PorterStemmer()

# Tokenize the sentence into words
words = nltk.tokenize.word_tokenize(sentence)

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in words]

print("Original Sentence:\n", sentence)
print("\nStemmed Words:\n", " ".join(stemmed_words))

Original Sentence:
 The last meeting was exhausting. I am meeting my mom

Stemmed Words:
 the last meet wa exhaust . i am meet my mom
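
To make the idea of stemming concrete, here is a toy suffix stripper. It is not the Porter algorithm, only a sketch with three assumed rules, and the name `toy_stem` and the `min_stem` threshold are hypothetical. It lowercases a word and removes one of `ing`, `ed`, or `s` when at least three characters remain.

```python
def toy_stem(word, min_stem=3):
    """Strip one common suffix; far cruder than PorterStemmer (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word is left to act as a stem
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Pre-tokenized version of the sentence used in the cell above
sentence = "The last meeting was exhausting . I am meeting my mom"
stems = [toy_stem(w) for w in sentence.split()]
print(" ".join(stems))

# Blind suffix stripping also damages words that merely end in 'ing'
print(toy_stem("string"))
```

The toy version keeps `was` intact because of the length check, whereas the Porter output above shows `wa`. In both cases the result is a truncated stem rather than a dictionary word, which is the main difference from lemmatization.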



#### **3. Lemmatization**

In [5]:
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional for extended WordNet data

lemmatizer = nltk.stem.WordNetLemmatizer() # Initialize the WordNetLemmatizer

sentence = """The last meeting was exhausting. I am meeting my mom"""

# Tokenize the sentence into words
words = nltk.tokenize.word_tokenize(sentence)

# Apply lemmatization to each word (no POS is passed, so every token is treated as a noun)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Original Sentence:\n", sentence)
print("\nLemmatized Words:\n", " ".join(lemmatized_words))

Original Sentence:
 The last meeting was exhausting. I am meeting my mom

Lemmatized Words:
 The last meeting wa exhausting . I am meeting my mom


#### **4. POS tagging**

In [6]:
import nltk

# Download necessary data for POS tagging
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

sentence = """The last meeting was exhausting. I am meeting my mom"""

# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)

# Perform POS tagging
pos_tags = nltk.pos_tag(words)

print("Word and POS Tags:\n", pos_tags)

Word and POS Tags:
 [('The', 'DT'), ('last', 'JJ'), ('meeting', 'NN'), ('was', 'VBD'), ('exhausting', 'VBG'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('meeting', 'VBG'), ('my', 'PRP$'), ('mom', 'NN')]
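
These Penn Treebank tags can also feed the lemmatizer from section 3, which treats every word as a noun unless told otherwise. The sketch below maps each tag to one of the single-letter WordNet part-of-speech codes (`n`, `v`, `a`, `r`). The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

# Tags copied from the output above
pos_tags = [('The', 'DT'), ('last', 'JJ'), ('meeting', 'NN'), ('was', 'VBD'),
            ('exhausting', 'VBG'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'),
            ('meeting', 'VBG'), ('my', 'PRP$'), ('mom', 'NN')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in pos_tags]
print(wordnet_pos)
```

Each code could then be passed as the `pos` argument of `WordNetLemmatizer.lemmatize(word, pos=...)`, so that the two occurrences of `meeting` (noun and verb) are lemmatized differently.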



#### **6. NER**

In [7]:
import nltk

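nltk.download('punkt')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
# pos_tag below also relies on 'averaged_perceptron_tagger_eng', downloaded in the POS tagging cell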

sentence = "AbbVie Inc. is an American pharmaceutical company headquartered in North Chicago, Illinois."

# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)

# Tag the words with part-of-speech
pos_tags = nltk.pos_tag(words)

# Perform Named Entity Recognition
named_entities = nltk.ne_chunk(pos_tags)

print("Named Entities:\n", named_entities)

Named Entities:
 (S
  (ORGANIZATION AbbVie/NNP Inc./NNP)
  is/VBZ
  an/DT
  (GPE American/JJ)
  pharmaceutical/JJ
  company/NN
  headquartered/VBD
  in/IN
  (GPE North/NNP Chicago/NNP)
  ,/,
  (GPE Illinois/NNP)
  ./.)
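
The tree printed above is convenient to read but awkward to use downstream. The sketch below flattens it into `(text, label)` pairs. To stay runnable without NLTK data, it uses a small stand-in class, `Chunk`, which is hypothetical and mimics only the two features the function relies on: iteration over `(word, tag)` leaves and a `label()` method. The function name `extract_entities` is likewise an assumption. Since `nltk.Tree` offers both features, the same function should also work on the real `named_entities` object.

```python
class Chunk(list):
    """Hypothetical stand-in for nltk.Tree: a labelled list of (word, tag) leaves."""
    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label

def extract_entities(tree):
    """Collect (entity text, entity label) for every labelled chunk at the top level."""
    entities = []
    for node in tree:
        # Plain (word, tag) tuples have no label(); chunks do
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((text, node.label()))
    return entities

# Hand-copied from the printed tree above
tree = [
    Chunk("ORGANIZATION", [("AbbVie", "NNP"), ("Inc.", "NNP")]),
    ("is", "VBZ"), ("an", "DT"),
    Chunk("GPE", [("American", "JJ")]),
    ("pharmaceutical", "JJ"), ("company", "NN"),
    ("headquartered", "VBD"), ("in", "IN"),
    Chunk("GPE", [("North", "NNP"), ("Chicago", "NNP")]),
    (",", ","),
    Chunk("GPE", [("Illinois", "NNP")]),
    (".", "."),
]

entities = extract_entities(tree)
print(entities)
```

The flattened list makes it easy to spot that `American` was labelled `GPE`, a reminder that the chunker's output should be checked before it is used.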


