# 📘 What is Natural Language Processing (NLP)?

What we are doing right now is **natural language processing** — you're listening to the words and sentences I'm forming, and you're forming some kind of understanding from them.

When we ask a **computer** to do the same, it’s called **Natural Language Processing (NLP).**

---

### 📝 Example:

**Input (Unstructured):**  
`Add eggs and milk to my shopping list.`

This is *unstructured data* for machines.

Computers understand information in **structured formats**, such as lists or other data structures.
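
As a rough illustration of going from unstructured text to a structure, here is a minimal sketch that pulls the items out of the sentence above into a Python list. The function name `extract_items` and the fixed sentence pattern are assumptions made for this example; real NLP systems do not rely on one hard-coded pattern.

```python
import re

def extract_items(command: str) -> list[str]:
    """Pull shopping items out of a command like 'Add X and Y to my shopping list.'"""
    # Assumed sentence pattern; real user input varies far more than this.
    match = re.search(r"add (.+?) to my shopping list", command, flags=re.IGNORECASE)
    if match is None:
        return []
    # Split the captured phrase on commas and on the word "and".
    parts = re.split(r",|\band\b", match.group(1))
    return [part.strip() for part in parts if part.strip()]

items = extract_items("Add eggs and milk to my shopping list.")
print(items)  # ['eggs', 'milk']
```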

**Equivalent Structured Data (XML format):**
```xml
<shopping_list>
    <item>Eggs</item>
    <item>Milk</item>
</shopping_list>
```

# 🚀 Applications of NLP
#### 1. 🔄 Machine Translation
Translating text or speech from one language to another.

#### 2. 🤖 Chatbots or Virtual Assistants
Understanding and responding to user queries in natural language.

#### 3. 💬 Sentiment Analysis
Analyzing customer reviews, emails, or feedback to determine emotional tone.

#### 4. 🚫 Spam Detection
Identifying unwanted or harmful messages by analyzing text for:

- False promises
- Unnecessary urgency
- Malicious links
- Requests for personal information
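
To make these warning signs concrete, here is a toy rule-based check. The patterns in `SPAM_SIGNALS` and the function name `spam_signals` are hypothetical and chosen only for illustration; production spam filters are normally trained classifiers, not keyword lists. The sketch also only detects that a link is present, not whether it is malicious.

```python
import re

# Hypothetical patterns, one per warning sign.
SPAM_SIGNALS = {
    "false promise": re.compile(r"\b(guaranteed|free money|you have won)\b", re.IGNORECASE),
    "urgency": re.compile(r"\b(act now|urgent|immediately)\b", re.IGNORECASE),
    "link present": re.compile(r"https?://\S+", re.IGNORECASE),
    "personal info request": re.compile(r"\b(password|bank account)\b", re.IGNORECASE),
}

def spam_signals(message: str) -> list[str]:
    """Return the names of all warning signs found in the message."""
    return [name for name, pattern in SPAM_SIGNALS.items() if pattern.search(message)]

message = "URGENT: you have won! Confirm your password at http://example.test/claim"
print(spam_signals(message))
print(spam_signals("See you at lunch tomorrow."))  # []
```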

# ⚙️ Steps in NLP
Natural Language Processing typically involves the following key steps:

### 1️⃣ Tokenization
Breaking a sentence into smaller units called tokens (usually words or subwords).

Example: "Add eggs and milk" → ["Add", "eggs", "and", "milk"]
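
A short sketch of why a tokenizer is more than splitting on spaces. It compares Python's built-in `str.split()` with NLTK's `NLTKWordTokenizer`, which needs no extra data download; the sample sentence is the one from the shopping-list example.

```python
from nltk.tokenize import NLTKWordTokenizer

sentence = "Add eggs and milk to my shopping list."

naive_tokens = sentence.split()                       # whitespace only
nltk_tokens = NLTKWordTokenizer().tokenize(sentence)  # rule-based tokenizer

print(naive_tokens)  # the last token is 'list.' with the period attached
print(nltk_tokens)   # the period becomes its own token
```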

### 2️⃣ Stemming
Reducing words to their root form by trimming suffixes/prefixes.

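Example: "running", "runs" → "run"

❗ Note: Stemming can be crude or inaccurate:

- Irregular forms are missed: "ran" is not reduced to "run".
- Different words can collapse to the same non-word stem: "university" and "universal" both become "univers" (Porter stemmer), not "universe".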

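The behaviour described above can be checked with NLTK's `PorterStemmer`, which is one common stemmer and needs no data download. This is a minimal sketch; other stemmers (Lancaster, Snowball) can give different stems.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "university", "universal"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Note how "ran" is left alone and how "university" and "universal" end up with the same stem.
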
### 3️⃣ Lemmatization
Identifies the dictionary root (lemma) of a word based on its context and meaning.

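Example: "better" → "good" (correct lemma when "better" is used as an adjective)

Stemming version: a suffix-stripping stemmer may leave "better" unchanged or cut it to "bet"; it cannot produce "good".

✅ Lemmatization is generally more accurate and meaningful than stemming, but it needs a dictionary and usually the word's part of speech.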
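
The idea of looking up a dictionary form can be sketched with a toy table. `LEMMA_TABLE` and `lemmatize` are hypothetical names for this illustration, and the handful of entries stands in for a real lexical database.

```python
# Toy lemma table standing in for a real lexical database.
# Keys are (word, part_of_speech) because the lemma depends on how the word is used.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("ran", "verb"): "run",
    ("running", "verb"): "run",
    ("mice", "noun"): "mouse",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("better", "adj"))  # good
print(lemmatize("Ran", "verb"))    # run
print(lemmatize("table", "noun"))  # table (not in the toy table, returned as-is)
```

In NLTK, `nltk.stem.WordNetLemmatizer` plays this role; it needs the WordNet corpus to be downloaded first and also takes a part-of-speech argument.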

### 4️⃣ Part of Speech (POS) Tagging
Assigns a grammatical role to each token in context.

Example:

- "book" as a noun → I read a book.
- "book" as a verb → Please book a table.

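To show what "in context" means, here is a deliberately tiny rule-based sketch that decides between noun and verb for the word "book" by looking at the preceding token. The rule, the `DETERMINERS` set and the function name `tag_book` are assumptions for illustration only; real taggers such as `nltk.pos_tag` use trained statistical models and need their model data downloaded.

```python
# Words that typically come right before a noun (a small, assumed list).
DETERMINERS = {"a", "an", "the", "this", "that", "my", "your"}

def tag_book(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag each occurrence of 'book' as NOUN or VERB based on the previous token."""
    tagged = []
    for i, token in enumerate(tokens):
        if token.lower() != "book":
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        tag = "NOUN" if previous in DETERMINERS else "VERB"
        tagged.append((token, tag))
    return tagged

print(tag_book(["I", "read", "a", "book", "."]))         # [('book', 'NOUN')]
print(tag_book(["Please", "book", "a", "table", "."]))   # [('book', 'VERB')]
```
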
### 5️⃣ Named Entity Recognition (NER)
Detects and classifies named entities (mostly proper nouns, plus expressions such as dates) in text into categories like:

👩 Person — e.g., Maria

🌍 Location — e.g., London

🏢 Organization — e.g., Google

📅 Date — e.g., July 5, 2025

NER is useful in extracting structured data from unstructured text.
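
A minimal sketch of the "unstructured text in, structured entities out" idea, using a tiny lookup table and one date pattern. `GAZETTEER`, `DATE_PATTERN` and `find_entities` are hypothetical names for this example; real NER systems use trained models and do not rely on a fixed list of names.

```python
import re

# Hypothetical gazetteer: a tiny lookup of known names per category.
GAZETTEER = {
    "Maria": "PERSON",
    "London": "LOCATION",
    "Google": "ORGANIZATION",
}

# Matches dates written like "July 5, 2025".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

def find_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity text, label) pairs found in the text."""
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    for name, label in GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", text):
            entities.append((name, label))
    return entities

text = "Maria joined Google in London on July 5, 2025."
for entity, label in find_entities(text):
    print(f"{label:13} {entity}")
```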


In [3]:
! python --version

Python 3.12.7


In [4]:
! nltk --version

nltk, version 3.9.1


In [5]:
import nltk

## Tokenization
#### Breaking a sentence into smaller units called tokens (usually words or subwords)

- `word_tokenize`: breaks a sentence or paragraph into tokens (words, punctuation, symbols, numbers, etc.).
- `sent_tokenize`: breaks a paragraph or string into sentences. It relies on NLTK's pretrained Punkt model rather than splitting on every full stop (.), so abbreviations such as "Mr." normally do not end a sentence.

In [7]:
s1 = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."
result = nltk.word_tokenize(s1)
print(result)


['On', 'a', '$', '50,000', 'mortgage', 'of', '30', 'years', 'at', '8', 'percent', ',', 'the', 'monthly', 'payment', 'would', 'be', '$', '366.88', '.']


In [8]:

# Note: the escaped double quotes (\") are converted to Treebank-style quote tokens: `` for opening and '' for closing
s2 = "\"We beat some pretty good teams to get here,\" Slocum said."
result = nltk.word_tokenize(s2)
print(result)

['``', 'We', 'beat', 'some', 'pretty', 'good', 'teams', 'to', 'get', 'here', ',', "''", 'Slocum', 'said', '.']


In [9]:

# Note: "couldn't" is split into two tokens: 'could' and "n't"
# Note: hyphen-joined words stay as a single token, e.g. cliche-ridden, wanna-be
s3 = "Well, we couldn't have this predictable, cliche-ridden, \"Touched by an Angel\" (a show creator John Masius worked on) wanna-be if she didn't."
result = nltk.word_tokenize(s3)
print(result)


['Well', ',', 'we', 'could', "n't", 'have', 'this', 'predictable', ',', 'cliche-ridden', ',', '``', 'Touched', 'by', 'an', 'Angel', "''", '(', 'a', 'show', 'creator', 'John', 'Masius', 'worked', 'on', ')', 'wanna-be', 'if', 'she', 'did', "n't", '.']


In [10]:

# Note: repeated words are all kept; "cannot" is split into two tokens: 'can' and 'not'
s4 = "I cannot cannot work under these conditions!"
result = nltk.word_tokenize(s4)
print(result)

['I', 'can', 'not', 'can', 'not', 'work', 'under', 'these', 'conditions', '!']


In [11]:

# Note: the decimal number '40.75' stays one token, while the percent sign becomes its own token
s5 = "The company spent 40.75% of its income last year."
result = nltk.word_tokenize(s5)
print(result)

['The', 'company', 'spent', '40.75', '%', 'of', 'its', 'income', 'last', 'year', '.']


In [12]:

# Note: the time '3:00' stays one token; the sentence-final period is split off from 'pm'
s6 = "He arrived at 3:00 pm."
result = nltk.word_tokenize(s6)
print(result)

['He', 'arrived', 'at', '3:00', 'pm', '.']


In [13]:

# Note: a colon that is not followed by a digit becomes its own token
s7 = "I bought these items: books, pencils, and pens."
result = nltk.word_tokenize(s7)
print(result)

['I', 'bought', 'these', 'items', ':', 'books', ',', 'pencils', ',', 'and', 'pens', '.']


In [14]:

# Note: numbers separated by a comma and a space are treated as two separate numbers
s8 = "Though there were 150, 100 of them were old."
result = nltk.word_tokenize(s8)
print(result)

# Note: a comma inside a number (no space) is treated as number formatting, so '300,000' stays one token
s9 = "There were 300,000, but that wasn't enough."
result = nltk.word_tokenize(s9)
print(result)

['Though', 'there', 'were', '150', ',', '100', 'of', 'them', 'were', 'old', '.']
['There', 'were', '300,000', ',', 'but', 'that', 'was', "n't", 'enough', '.']


In [15]:

# Note: "more'n" is split into two tokens: 'more' and "'n"
s10 = "It's more'n enough."
result = nltk.word_tokenize(s10)
print(result)

['It', "'s", 'more', "'n", 'enough', '.']


# 🧠 Gathering Spans of Tokenized Strings in NLTK (Python)

## ✨ What are Token Spans?

When you tokenize a sentence using `nltk.tokenize.word_tokenize()`, you get the **individual tokens** (words, punctuation marks, etc.).

But what if you want to know the **start and end position (span)** of each token in the original string?

👉 This is useful for tasks like:
- Highlighting tokens in the original text
- Mapping tokens to character offsets
- Building annotated datasets

---

## 🔧 Tool: `nltk.tokenize.TreebankWordTokenizer().span_tokenize()`

The `span_tokenize()` method from `TreebankWordTokenizer` provides **(start, end)** character index spans for each token **relative to the original sentence**.

---

### ✅ Example Code:

```python
from nltk.tokenize import TreebankWordTokenizer

# Input text
text = "On a $50,000 mortgage of 30 years at 8 percent, the monthly payment would be $366.88."

# Initialize tokenizer
tokenizer = TreebankWordTokenizer()

# Get tokens and spans
tokens = tokenizer.tokenize(text)
spans = list(tokenizer.span_tokenize(text))

# Display
for token, span in zip(tokens, spans):
    print(f"Token: '{token}'\t\tSpan: {span}\t\t\tOriginal Text: '{text[span[0]:span[1]]}'")
```

```
Token: 'On'		Span: (0, 2)			Original Text: 'On'
Token: 'a'		Span: (3, 4)			Original Text: 'a'
Token: '$'		Span: (5, 6)			Original Text: '$'
Token: '50,000'		Span: (6, 12)			Original Text: '50,000'
Token: 'mortgage'		Span: (13, 21)			Original Text: 'mortgage'
Token: 'of'		Span: (22, 24)			Original Text: 'of'
Token: '30'		Span: (25, 27)			Original Text: '30'
Token: 'years'		Span: (28, 33)			Original Text: 'years'
Token: 'at'		Span: (34, 36)			Original Text: 'at'
Token: '8'		Span: (37, 38)			Original Text: '8'
Token: 'percent'		Span: (39, 46)			Original Text: 'percent'
Token: ','		Span: (46, 47)			Original Text: ','
Token: 'the'		Span: (48, 51)			Original Text: 'the'
Token: 'monthly'		Span: (52, 59)			Original Text: 'monthly'
Token: 'payment'		Span: (60, 67)			Original Text: 'payment'
Token: 'would'		Span: (68, 73)			Original Text: 'would'
Token: 'be'		Span: (74, 76)			Original Text: 'be'
Token: '$'		Span: (77, 78)			Original Text: '$'
Token: '366.88'		Span: (78, 84)			Original Text: '366.88'
Token: '.'		Span: (84, 85)			Original Text: '.'
```

---

## 🔍 Understanding the Output
- Each span is a tuple (start_index, end_index).
- These indexes refer to positions in the original string.
- You can use them like text[start:end] to extract the token from the raw input.

For example:

    text[0:2] → 'On'
    text[5:6] → '$'
    text[78:84] → '366.88'
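
One detail worth knowing is that a span always points at the original characters, even when the tokenizer rewrites a token. The sketch below uses a made-up sentence with double quotes: `tokenize()` returns the Treebank-style quote tokens, while slicing the text with the spans gives back the original `"` characters.

```python
from nltk.tokenize import TreebankWordTokenizer

text = 'She said "hi" to me.'
tokenizer = TreebankWordTokenizer()

tokens = tokenizer.tokenize(text)
# Slice the original string with each (start, end) span.
slices = [text[start:end] for start, end in tokenizer.span_tokenize(text)]

for token, piece in zip(tokens, slices):
    print(f"token={token!r:8} original={piece!r}")
```

So if you need exact character offsets (for highlighting or annotation), rely on the spans and not on the token strings.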


In [27]:
s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''

expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
            (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
            (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
            (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]

In [28]:
list(nltk.NLTKWordTokenizer().span_tokenize(s)) == expected

True

In [29]:

expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']

# Slice s with each token span and compare the resulting tokens with the expected list above
l1 = [s[start:end] for start, end in nltk.NLTKWordTokenizer().span_tokenize(s)]
print(l1)


['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')', 'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']


In [30]:

print(l1 == expected)

True


# ❓ Difference Between `NLTKWordTokenizer().tokenize()` and `TreebankWordTokenizer().tokenize()` in NLTK

## 🧠 Short Answer:
The two are closely related but **not identical**.  
👉 `NLTKWordTokenizer` is a separate class that follows the Penn Treebank rules of `TreebankWordTokenizer` and extends them (for example, with handling of Unicode quotation marks). It is also the tokenizer that `nltk.word_tokenize()` uses internally. On plain ASCII text the two usually produce the same tokens.

---

## 📦 Background

NLTK provides various tokenizers. Among them:

- `TreebankWordTokenizer`  
  ➤ A tokenizer that uses the Penn Treebank tokenization rules (splits contractions, punctuation, etc.).

- `NLTKWordTokenizer`  
  ➤ NLTK's improved variant of the Treebank tokenizer, defined in `nltk/tokenize/destructive.py`. It is the word tokenizer behind `nltk.word_tokenize()`.

---

## 🧪 Let’s See with Code

```python
from nltk.tokenize import TreebankWordTokenizer, NLTKWordTokenizer

text = "Mr. O'Neill can't attend the 3:00 p.m. meeting."

# Both tokenizers
treebank_tokenizer = TreebankWordTokenizer()
nltk_tokenizer = NLTKWordTokenizer()

# Tokenizing
tokens_treebank = treebank_tokenizer.tokenize(text)
tokens_nltk = nltk_tokenizer.tokenize(text)

# Check equality
print("Are both outputs equal?", tokens_treebank == tokens_nltk)

# Print tokens
print("\nTreebank Tokens:\n", tokens_treebank)
print("\nNLTKWordTokenizer Tokens:\n", tokens_nltk)
```
---

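## Are both outputs equal? -> True (for this sentence)
- Treebank Tokens:
<code>['Mr.', "O'Neill", 'ca', "n't", 'attend', 'the', '3:00', 'p.m.', 'meeting', '.']</code>

- NLTKWordTokenizer Tokens:
<code>['Mr.', "O'Neill", 'ca', "n't", 'attend', 'the', '3:00', 'p.m.', 'meeting', '.']</code>

Both keep `O'Neill` as one token and split `can't` into `ca` and `n't`. Agreement on one sentence only shows that the two tokenizers agree on this input, not that they are the same class.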


## 🛠 Behind the Scenes
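### 🔍 NLTKWordTokenizer is defined roughly as:
```python
# Simplified outline of nltk/tokenize/destructive.py in NLTK 3.9.x (not the full source)
from nltk.tokenize.api import TokenizerI

class NLTKWordTokenizer(TokenizerI):
    """The NLTK tokenizer that has improved upon the TreebankWordTokenizer."""
    # The real class defines its own regex rule lists (quotes, punctuation,
    # parentheses, contractions) plus tokenize() and span_tokenize().
    ...

```

### So, NLTKWordTokenizer does not inherit from TreebankWordTokenizer. It re-implements the Treebank-style rules and adds a few of its own.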
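
One place where the two classes can be told apart is Unicode (curly) quotation marks. The sketch below assumes NLTK 3.9.x, and the sample sentence is made up for illustration. On that version, `TreebankWordTokenizer` should leave the curly quotes attached to the word while `NLTKWordTokenizer` should split them off as separate tokens, but rule sets can change between releases, so it is worth running against your installed version.

```python
from nltk.tokenize import NLTKWordTokenizer, TreebankWordTokenizer

text = "He said “hello” to me."  # curly (Unicode) double quotes

treebank_tokens = TreebankWordTokenizer().tokenize(text)
nltk_tokens = NLTKWordTokenizer().tokenize(text)

print("Treebank:         ", treebank_tokens)
print("NLTKWordTokenizer:", nltk_tokens)
print("Same output?", treebank_tokens == nltk_tokens)
```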

## ✅ Final Conclusion
For everyday ASCII English text, either class usually gives the same tokens.

If you need the original Penn Treebank rules, use:
TreebankWordTokenizer()

If you want the same word-level behaviour as `nltk.word_tokenize()`, including Unicode quote handling, use:
NLTKWordTokenizer()

# ➡️ Similar on simple text, but not guaranteed to be interchangeable.