# 🧠 Tokenization in NLP

Natural Language Processing (NLP) deals with how computers understand and process human language.  
Before any text can be analyzed by a model, it must be **tokenized** — i.e., broken into smaller pieces called **tokens**.

---

## 📘 Topics Overview

| 🔢 | **Concept** | **Represents** |
|:--:|:-------------|:----------------|
| 1️⃣ | **Corpus** | A collection of *paragraphs* or the entire text dataset |
| 2️⃣ | **Documents** | A *single paragraph* or *sentence* within the corpus |
| 3️⃣ | **Vocabulary** | The set of *unique words* present in the corpus |
| 4️⃣ | **Words** | The *individual tokens* or *terms* in each sentence |

---

## 🧩 Detailed Explanation

### 1️⃣ Corpus — *The Entire Text Collection*
📌 A **corpus** is a large and structured set of texts used for training or evaluating NLP models.  
It may contain multiple documents, paragraphs, or sentences.

$$
\text{Corpus} = \{\text{Document}_1, \text{Document}_2, \dots, \text{Document}_n\}
$$

🧠 **Example:**
> “Natural Language Processing is amazing. NLP enables machines to understand humans.”

Here, the entire text above is our **corpus**.

---

### 2️⃣ Documents — *Smaller Text Units*
📌 A **document** is a single unit of text within the corpus.  
It could be a **paragraph**, **article**, or **sentence** depending on context.

$$
\text{Corpus} = \{D_1, D_2, D_3, \dots\}
$$

🧠 **Example:**
1. “Natural Language Processing is amazing.”  
2. “NLP enables machines to understand humans.”

Both are separate **documents** within the same corpus.

---

### 3️⃣ Vocabulary — *Unique Words Collection*
📌 The **vocabulary** of a corpus is the set of all **unique tokens** (distinct words) appearing across documents.

$$
\text{Vocabulary} = \{w_1, w_2, \dots, w_n\}
$$

✅ **Example:**
From the corpus above:
> Vocabulary = {Natural, Language, Processing, is, amazing, NLP, enables, machines, to, understand, humans}
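
A minimal sketch of how the corpus, documents, and vocabulary above relate in plain Python. The variable names and the whitespace-plus-`strip` tokenization are illustrative assumptions, not a standard API; the NLTK tokenizers used later in this notebook treat punctuation more carefully.

```python
import string

# The two example documents that make up the corpus.
corpus = [
    "Natural Language Processing is amazing.",
    "NLP enables machines to understand humans.",
]

# Naive tokenization: split on whitespace, then strip surrounding punctuation.
tokenized = [
    [word.strip(string.punctuation) for word in document.split()]
    for document in corpus
]

# The vocabulary is the set of distinct tokens across all documents.
vocabulary = set().union(*tokenized)

print(len(corpus), "documents,", len(vocabulary), "unique words")
print(sorted(vocabulary))
```

Because the vocabulary is a set, a word that appears in several documents is counted only once.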

---

### 4️⃣ Words — *Individual Tokens*
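📌 **Words** (or **tokens**) are the **smallest textual units** a tokenizer produces from a sentence, usually by splitting on spaces and punctuation. Most tokenizers also emit punctuation marks as tokens of their own; the example below leaves the final period out for simplicity.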

$$
\text{Sentence} = [w_1, w_2, w_3, \dots]
$$

🧠 **Example:**
> “NLP enables machines to understand humans.”

➡ Tokens = [‘NLP’, ‘enables’, ‘machines’, ‘to’, ‘understand’, ‘humans’]
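
To see why "split by spaces" alone is not enough, here is a small stdlib-only sketch comparing two naive strategies on the same sentence. Both are simplifications chosen for illustration, not what NLTK does internally.

```python
import re

sentence = "NLP enables machines to understand humans."

# Splitting on whitespace alone leaves the final period attached to "humans".
whitespace_tokens = sentence.split()

# Matching runs of word characters drops punctuation entirely.
word_only_tokens = re.findall(r"\w+", sentence)

print(whitespace_tokens)
# ['NLP', 'enables', 'machines', 'to', 'understand', 'humans.']
print(word_only_tokens)
# ['NLP', 'enables', 'machines', 'to', 'understand', 'humans']
```

The second list matches the tokens shown above; the first would add `humans.` to the vocabulary as a different word from `humans`.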

---

## ⚙️ Why Tokenization Matters

✅ Converts raw text into a form that algorithms can understand  
✅ Helps in building vocabulary, frequency counts, embeddings, and context models  
✅ Serves as the **foundation** for further NLP tasks such as:
- Text Classification  
- Sentiment Analysis  
- Named Entity Recognition  
- Language Modeling
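
As a rough illustration of the first two points, the sketch below turns tokens into frequency counts and integer IDs, the kind of input an embedding layer or a classifier typically expects. The third document and the lowercasing step are assumptions added so that some words repeat; the names `counts` and `word_to_id` are hypothetical.

```python
import re
from collections import Counter

documents = [
    "Natural Language Processing is amazing.",
    "NLP enables machines to understand humans.",
    "NLP is amazing.",  # hypothetical extra document, added so words repeat
]

# Lowercase and tokenize every document into word runs.
tokens = [token for doc in documents for token in re.findall(r"\w+", doc.lower())]

# Frequency counts per token.
counts = Counter(tokens)

# Map each vocabulary entry to an integer ID (sorted for reproducibility).
word_to_id = {word: idx for idx, word in enumerate(sorted(counts))}

print(counts.most_common(3))
print([word_to_id[t] for t in re.findall(r"\w+", documents[2].lower())])
```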

---

## 🔍 Summary Table

| Concept | Description | Example |
|:--------|:-------------|:---------|
| **Corpus** | Entire text dataset | All articles in Wikipedia |
| **Document** | One paragraph/sentence | “Natural Language Processing is amazing.” |
| **Vocabulary** | Unique words in corpus | {‘Natural’, ‘Language’, …} |
| **Words** | Individual tokens | [‘NLP’, ‘enables’, …] |

---

In [19]:
!pip install nltk




In [20]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# 🧠 Tokenization in NLP — Practical Implementation with NLTK

Natural Language Processing (NLP) involves preparing raw text so that machines can process it.  
One of the very first steps in any NLP pipeline is **Tokenization** — the process of splitting text into **sentences**, **words**, or **subwords**.

---

## 📘 What You’ll Learn

🔹 Understanding **Corpus**, **Documents**, **Vocabulary**, and **Words**  
🔹 Performing **Sentence Tokenization** using `sent_tokenize()`  
🔹 Performing **Word Tokenization** using `word_tokenize()`, `wordpunct_tokenize()`, and `TreebankWordTokenizer()`  
🔹 Seeing the difference between each tokenizer  
🔹 Observing how punctuation and contractions are handled  

---

📌 **Mathematical Representation**

- Sentence Tokenization:  
  $$ P \rightarrow [S_1, S_2, S_3, \dots, S_m] $$
  where each $S_i$ is a sentence.

- Word Tokenization:  
  $$ S_i \rightarrow [w_{i1}, w_{i2}, \dots, w_{ik_i}] $$
  where $k_i$ is the number of tokens in sentence $S_i$.

- Vocabulary Extraction:  
  $$ V = \bigcup_{i=1}^{m} \{ w_{ij} \mid j = 1, 2, \dots, k_i \} $$


In [33]:
# 📦 Define a sample corpus for tokenization demonstration

corpus = """Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
Please go through the entire content! to become expert in NLP.
"""

# Display the raw corpus
print("🧾 Corpus Text:\n")
print(corpus)


🧾 Corpus Text:

Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
Please go through the entire content! to become expert in NLP.



## ✂️ Sentence Tokenization (Paragraph → Sentences)

Sentence Tokenization splits a paragraph or document into **individual sentences**.  
This helps models process text meaningfully one sentence at a time.

---

📌 **Concept**
$$
\text{sent\_tokenize}(\text{Paragraph}) \rightarrow [\text{Sentence}_1, \text{Sentence}_2, \dots, \text{Sentence}_n]
$$

🔹 We’ll use **`nltk.sent_tokenize`**, which relies on the pre-trained **Punkt** sentence tokenizer (English by default).  
🔹 It treats the following as sentence boundaries, while trying not to split after common abbreviations such as `Dr.` or `Mr.`:
- Question marks (`?`)
- Exclamation marks (`!`)
- Periods (`.`) that end a sentence
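
For contrast, here is a deliberately naive sentence splitter written with the standard library only. The rule it uses (split after `.`, `!`, or `?` followed by whitespace) is an assumption for illustration; it shows the abbreviation problem that Punkt is designed to reduce.

```python
import re

text = "Dr. Smith arrived late. Was the meeting over? Yes!"

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
# ['Dr.', 'Smith arrived late.', 'Was the meeting over?', 'Yes!']
```

The naive rule cuts `Dr.` off as its own "sentence". A trained sentence tokenizer uses statistics about abbreviations to avoid most splits of this kind, though it can still make mistakes on unusual text.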


In [34]:
# ✅ Sentence Tokenization Example
from nltk.tokenize import sent_tokenize

# Perform sentence tokenization on the corpus
documents = sent_tokenize(corpus)

# Display type and output
print("📂 Type:", type(documents))
print("\n🧩 Tokenized Sentences:\n")
for idx, sentence in enumerate(documents, start=1):
    print(f"{idx}. {sentence}")


📂 Type: <class 'list'>

🧩 Tokenized Sentences:

1. Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
2. Please go through the entire content!
3. to become expert in NLP.


## 🔤 Word Tokenization (Sentence → Words)

Now that we have sentences, we can further break them into **individual words (tokens)**.  
Word tokenization is critical for tasks like:
- Building a **vocabulary**
- Counting word **frequency**
- Creating **embeddings**

---

📘 **Concept**
$$
\text{word\_tokenize}(\text{Sentence}) \rightarrow [w_1, w_2, \dots, w_n]
$$

We’ll experiment with **three tokenizers** from NLTK:
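1. `word_tokenize()`: NLTK's general-purpose tokenizer; it splits the text into sentences first, then applies Treebank-style rules to each sentence  
2. `wordpunct_tokenize()`: splits the text into runs of alphanumeric characters and runs of punctuation  
3. `TreebankWordTokenizer()`: follows Penn Treebank rules and expects one sentence at a time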


In [35]:
# ✅ Word Tokenization using word_tokenize
from nltk.tokenize import word_tokenize

print("🧩 Word Tokenization using word_tokenize():\n")

# Tokenize the entire corpus
word_tokens = word_tokenize(corpus)
print("All Tokens (Corpus Level):", word_tokens, "\n")

# Tokenize each sentence separately
for idx, sentence in enumerate(documents, start=1):
    tokens = word_tokenize(sentence)
    print(f"Sentence {idx} Tokens: {tokens}")


🧩 Word Tokenization using word_tokenize():

All Tokens (Corpus Level): ['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials', '.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.'] 

Sentence 1 Tokens: ['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials', '.']
Sentence 2 Tokens: ['Please', 'go', 'through', 'the', 'entire', 'content', '!']
Sentence 3 Tokens: ['to', 'become', 'expert', 'in', 'NLP', '.']


In [37]:
# 🔹 Word Tokenization using wordpunct_tokenize
from nltk.tokenize import wordpunct_tokenize

print("\n🧩 Word Tokenization using wordpunct_tokenize():\n")

# Splits the text into runs of alphanumeric characters and runs of punctuation,
# so an apostrophe always becomes its own token.
# Example: "Sundaram's" -> ['Sundaram', "'", 's']
tokens_punct = wordpunct_tokenize(corpus)
print(tokens_punct)



🧩 Word Tokenization using wordpunct_tokenize():

['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'", 's', 'NLP', 'Tutorials', '.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']
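
The behaviour above can be approximated with a single regular expression from the standard library. The helper name `regex_wordpunct` is hypothetical; the pattern (runs of word characters, or runs of non-space punctuation) is a sketch of the same idea, not a replacement for the NLTK function.

```python
import re

def regex_wordpunct(text):
    """Stdlib approximation: runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(regex_wordpunct("Sundaram's NLP Tutorials."))
# ['Sundaram', "'", 's', 'NLP', 'Tutorials', '.']

# Consecutive punctuation marks stay together as one token.
print(regex_wordpunct("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```

This makes the trade-off visible: the rule is simple and predictable, but it has no notion of contractions or possessives.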


In [39]:
# 🔹 Word Tokenization using TreebankWordTokenizer
# 📝 TreebankWordTokenizer follows Penn Treebank conventions:
# - Splits contractions: "can't" → ['ca', "n't"]
# - Splits off most punctuation (commas, "!", "?") as separate tokens
# - Splits a period only at the very end of the input, because it expects one sentence at a time
# In the output below, "Tutorials." keeps its period, while the final "." is a separate token.
# To split every sentence-final period, tokenize each sentence from sent_tokenize() separately.
from nltk.tokenize import TreebankWordTokenizer

print("\n🧩 Word Tokenization using TreebankWordTokenizer():\n")

tokenizer = TreebankWordTokenizer()
treebank_tokens = tokenizer.tokenize(corpus)
print(treebank_tokens)



🧩 Word Tokenization using TreebankWordTokenizer():

['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']


## 🧩 Tokenizer Comparison Summary

| Tokenizer | Splitting Behavior | Handles Contractions | Handles Punctuation | Notes |
|:-----------|:------------------:|:--------------------:|:------------------:|:------|
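| **`word_tokenize`** | Moderate | ✅ | ✅ | Good default choice; splits into sentences first, so every sentence-final period becomes a token |
| **`wordpunct_tokenize`** | Aggressive | ❌ (splits `'s` → `'`, `s`) | ✅ (every punctuation run is a token) | Simple and predictable, but breaks contractions and possessives apart |
| **`TreebankWordTokenizer`** | Penn Treebank rules | ✅ | Partial (only the final period of the input is split off) | Expects one sentence at a time; run it on the output of `sent_tokenize` |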

---

### 📌 Key Takeaways
✅ Always inspect tokens before using them in downstream tasks.  
✅ Maintain the same tokenizer during both **training** and **inference**.  
✅ Combine tokenization with **lowercasing**, **stopword removal**, and **lemmatization** as preprocessing steps.
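
A minimal preprocessing sketch along these lines, using only the standard library. The tiny `STOPWORDS` set and the `preprocess` helper are assumptions for illustration; lemmatization is left out because it needs a language resource such as a lemmatizer from a library.

```python
import re

# Tiny hand-written stopword list, for illustration only (an assumption);
# real projects usually take a fuller list from a library.
STOPWORDS = {"to", "the", "in", "is"}

def preprocess(text):
    """Lowercase, tokenize into word runs, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Please go through the entire content! to become expert in NLP."))
# ['please', 'go', 'through', 'entire', 'content', 'become', 'expert', 'nlp']
```

Wrapping the steps in one function is a simple way to apply exactly the same tokenization at training and at inference time.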

---

### 🧮 Mathematical Summary

- **Sentence Tokenization**
  $$
  P \mapsto [S_1, S_2, \dots, S_m]
  $$
- **Word Tokenization**
  $$
  S_i \mapsto [w_{i1}, w_{i2}, \dots, w_{ik_i}]
  $$
- **Vocabulary Extraction**
  $$
  V = \bigcup_{i=1}^{m} \{ w_{ij} \mid j = 1, 2, \dots, k_i \}
  $$
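
The formulas can be checked against the per-sentence `word_tokenize()` output shown earlier. In this sketch the token lists are copied by hand from that output, so no tokenizer needs to run; the variable names mirror the symbols $m$, $k_i$, and $V$.

```python
# Per-sentence tokens, copied from the word_tokenize() output shown earlier.
sentences = [
    ['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials', '.'],
    ['Please', 'go', 'through', 'the', 'entire', 'content', '!'],
    ['to', 'become', 'expert', 'in', 'NLP', '.'],
]

m = len(sentences)                 # number of sentences
k = [len(s) for s in sentences]    # k_i: number of tokens in sentence i
V = set().union(*sentences)        # vocabulary: union of the tokens of all sentences

print(m, k, len(V))
# 3 [10, 7, 6] 20
```

The corpus has 23 tokens in total but only 20 vocabulary entries, because `to`, `NLP`, and `.` each occur twice. Punctuation tokens are part of $V$ here since `word_tokenize()` emits them.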
