# 🧠 Tokenization in NLP

Natural Language Processing (NLP) deals with how computers understand and process human language.  
Before any text can be analyzed by a model, it must be **tokenized**, i.e., broken into smaller pieces called **tokens**.

---

## 📘 Topics Overview

| 🔢 | **Concept** | **Represents** |
|:--:|:-------------|:----------------|
| 1️⃣ | **Corpus** | A collection of *paragraphs* or the entire text dataset |
| 2️⃣ | **Documents** | A *single paragraph* or *sentence* within the corpus |
| 3️⃣ | **Vocabulary** | The set of *unique words* present in the corpus |
| 4️⃣ | **Words** | The *individual tokens* or *terms* in each sentence |

---

## 🧩 Detailed Explanation

### 1️⃣ Corpus: *The Entire Text Collection*
📌 A **corpus** is a large and structured set of texts used for training or evaluating NLP models.  
It may contain multiple documents, paragraphs, or sentences.

$$
\text{Corpus} = \{\text{Document}_1, \text{Document}_2, \dots, \text{Document}_n\}
$$

🧠 **Example:**
> “Natural Language Processing is amazing. NLP enables machines to understand humans.”

Here, the entire text above is our **corpus**.

---

### 2️⃣ Documents: *Smaller Text Units*
📌 A **document** is a single unit of text within the corpus.  
It could be a **paragraph**, **article**, or **sentence** depending on context.

$$
\text{Corpus} = \{D_1, D_2, D_3, \dots\}
$$

🧠 **Example:**
1. “Natural Language Processing is amazing.”  
2. “NLP enables machines to understand humans.”

Both are separate **documents** within the same corpus.

---

### 3️⃣ Vocabulary: *Unique Words Collection*
📌 The **vocabulary** of a corpus is the set of all **unique tokens** (distinct words) appearing across documents.

$$
\text{Vocabulary} = \{w_1, w_2, \dots, w_n\}
$$

✅ **Example:**
From the corpus above:
> Vocabulary = {Natural, Language, Processing, is, amazing, NLP, enables, machines, to, understand, humans}
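
The vocabulary above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming that whitespace splitting plus stripping of surrounding punctuation is good enough for this tiny corpus; the variable names are illustrative, and the NLTK tokenizers covered later handle many more cases.

```python
import string

# The two example documents from the corpus above
documents = [
    "Natural Language Processing is amazing.",
    "NLP enables machines to understand humans.",
]

# Collect every distinct word; a set discards duplicates automatically
vocabulary = set()
for doc in documents:
    for word in doc.split():
        # Remove leading/trailing punctuation such as the final period
        vocabulary.add(word.strip(string.punctuation))

print(sorted(vocabulary))
print("Vocabulary size:", len(vocabulary))
```

Note that this sketch is case-sensitive, so `NLP` and `nlp` would count as two different entries.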

---

### 4️⃣ Words: *Individual Tokens*
📌 The **words** (or **tokens**) are the **smallest textual units** in a sentence, usually split by spaces or punctuation.

$$
\text{Sentence} = [w_1, w_2, w_3, \dots]
$$

🧠 **Example:**
> “NLP enables machines to understand humans.”

➡ Tokens = [‘NLP’, ‘enables’, ‘machines’, ‘to’, ‘understand’, ‘humans’]
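
A quick way to see why "split by spaces or punctuation" matters is to compare two naive approaches. The sketch below uses only the standard library; the regex `\w+` is an assumption chosen for illustration, not what NLTK does internally.

```python
import re

sentence = "NLP enables machines to understand humans."

# Whitespace-only split: the period stays glued to the last word
whitespace_tokens = sentence.split()

# Regex split: keep only runs of word characters, dropping punctuation
word_tokens = re.findall(r"\w+", sentence)

print(whitespace_tokens)
print(word_tokens)
```

The second list matches the tokens shown above, while the first one ends with `'humans.'`.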

---

## ⚙️ Why Tokenization Matters

✅ Converts raw text into a form that algorithms can understand  
✅ Helps in building vocabulary, frequency counts, embeddings, and context models  
✅ Serves as the **foundation** for further NLP tasks such as:
- Text Classification  
- Sentiment Analysis  
- Named Entity Recognition  
- Language Modeling
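
To make "a form that algorithms can understand" concrete, here is a small sketch that maps tokens to integer ids. Everything in it is an assumption made for illustration: the regex tokenizer, the `encode` helper, and the test phrase `"NLP is amazing"` are not part of any library.

```python
import re

corpus = "Natural Language Processing is amazing. NLP enables machines to understand humans."

# Tokenize with a simple regex (any tokenizer could be used here)
tokens = re.findall(r"\w+", corpus)

# Assign each distinct token an integer id in order of first appearance
token_to_id = {}
for token in tokens:
    token_to_id.setdefault(token, len(token_to_id))

def encode(text):
    """Map a text to a list of ids, skipping tokens not seen in the corpus."""
    return [token_to_id[t] for t in re.findall(r"\w+", text) if t in token_to_id]

print(encode("NLP is amazing"))  # [5, 3, 4]
```

Models work with numbers like these rather than with the raw strings, which is why a stable token-to-id mapping is usually built right after tokenization.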

---

## 🔍 Summary Table

| Concept | Description | Example |
|:--------|:-------------|:---------|
| **Corpus** | Entire text dataset | All articles in Wikipedia |
| **Document** | One paragraph/sentence | “Natural Language Processing is amazing.” |
| **Vocabulary** | Unique words in corpus | {‘Natural’, ‘Language’, …} |
| **Words** | Individual tokens | [‘NLP’, ‘enables’, …] |

---

In [19]:
!pip install nltk




In [20]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# 🧠 Tokenization in NLP: Practical Implementation with NLTK

Natural Language Processing (NLP) involves preparing raw text so that machines can process it.  
One of the very first steps in any NLP pipeline is **tokenization**: the process of splitting text into **sentences**, **words**, or **subwords**.

---

## 📘 What You’ll Learn

🔹 Understanding **Corpus**, **Documents**, **Vocabulary**, and **Words**  
🔹 Performing **Sentence Tokenization** using `sent_tokenize()`  
🔹 Performing **Word Tokenization** using `word_tokenize()`, `wordpunct_tokenize()`, and `TreebankWordTokenizer()`  
🔹 Seeing the differences between the tokenizers  
🔹 Observing how punctuation and contractions are handled  

---

📌 **Mathematical Representation**

- Sentence Tokenization:  
  $$ P \rightarrow [S_1, S_2, S_3, \dots, S_m] $$
  where $P$ is a paragraph and each $S_i$ is a sentence.

- Word Tokenization:  
  $$ S_i \rightarrow [w_{i1}, w_{i2}, \dots, w_{ik_i}] $$
  where $k_i$ is the number of tokens in sentence $S_i$.

- Vocabulary Extraction:  
  $$ V = \bigcup_{i=1}^{m} \{ w_{ij} \mid j = 1, 2, \dots, k_i \} $$
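
The three mappings can be traced step by step in plain Python. This is only a sketch under simplifying assumptions: the paragraph is made up, sentences are split with a naive regex, and words are runs of `\w` characters. The NLTK functions used in the following cells are more robust.

```python
import re

paragraph = "NLP is fun. NLP is useful!"

# P -> [S_1, ..., S_m]: naive split after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())

# S_i -> [w_i1, ..., w_ik_i]: one token list per sentence
tokens_per_sentence = [re.findall(r"\w+", s) for s in sentences]

# V = union over all sentences of the tokens in each sentence
vocabulary = set().union(*tokens_per_sentence)

print(sentences)
print(tokens_per_sentence)
print(sorted(vocabulary))
```

Here the two sentences hold six tokens in total, but the union keeps only four distinct entries because `NLP` and `is` appear in both.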


In [33]:
# 📦 Define a sample corpus for tokenization demonstration

corpus = """Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
Please go through the entire content! to become expert in NLP.
"""

# Display the raw corpus
print("🧾 Corpus Text:\n")
print(corpus)


🧾 Corpus Text:

Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
Please go through the entire content! to become expert in NLP.



## ✂️ Sentence Tokenization (Paragraph → Sentences)

Sentence Tokenization splits a paragraph or document into **individual sentences**.  
This lets later steps process the text one sentence at a time.

---

📌 **Concept**
$$
\text{sent\_tokenize}(\text{Paragraph}) \rightarrow [\text{Sentence}_1, \text{Sentence}_2, \dots, \text{Sentence}_n]
$$

🔹 We’ll use **`nltk.sent_tokenize`**, which internally relies on the **Punkt** model trained on English data.  
🔹 It is designed to handle:
- Abbreviations (`Dr.` or `Mr.`)
- Question marks (`?`)
- Exclamations (`!`)
- Periods (`.`) in normal sentences.
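
To see why a trained model is useful here, it helps to look at what a naive rule does with abbreviations. The sketch below is not how Punkt works; it is a deliberately simple regex splitter, assumed for illustration only and applied to a made-up text.

```python
import re

text = "Dr. Smith arrived late. Was the meeting over? Yes!"

# Naive rule: a sentence ends at ., ! or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for idx, sentence in enumerate(naive_sentences, start=1):
    print(f"{idx}. {sentence}")
```

The naive rule cuts right after `Dr.`, producing four pieces for a three-sentence text. Avoiding this kind of false split is what Punkt's abbreviation handling is meant for.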


In [34]:
# ✅ Sentence Tokenization Example
from nltk.tokenize import sent_tokenize

# Perform sentence tokenization on the corpus
documents = sent_tokenize(corpus)

# Display type and output
print("📂 Type:", type(documents))
print("\n🧩 Tokenized Sentences:\n")
for idx, sentence in enumerate(documents, start=1):
    print(f"{idx}. {sentence}")


📂 Type: <class 'list'>

🧩 Tokenized Sentences:

1. Hello Welcome, to Prasanna Sundaram's NLP Tutorials.
2. Please go through the entire content!
3. to become expert in NLP.


## 🔤 Word Tokenization (Sentence → Words)

Now that we have sentences, we can further break them into **individual words (tokens)**.  
Word tokenization is critical for tasks like:
- Building a **vocabulary**
- Counting word **frequency**
- Creating **embeddings**

---

📘 **Concept**
$$
\text{word\_tokenize}(\text{Sentence}) \rightarrow [w_1, w_2, \dots, w_n]
$$

We’ll experiment with **three tokenizers** from NLTK:
1. `word_tokenize()`: general-purpose tokenizer  
2. `wordpunct_tokenize()`: separates runs of alphanumeric characters from runs of punctuation  
3. `TreebankWordTokenizer()`: follows Penn Treebank rules


In [35]:
# ✅ Word Tokenization using word_tokenize
from nltk.tokenize import word_tokenize

print("🧩 Word Tokenization using word_tokenize():\n")

# Tokenize the entire corpus
word_tokens = word_tokenize(corpus)
print("All Tokens (Corpus Level):", word_tokens, "\n")

# Tokenize each sentence separately
for idx, sentence in enumerate(documents, start=1):
    tokens = word_tokenize(sentence)
    print(f"Sentence {idx} Tokens: {tokens}")


🧩 Word Tokenization using word_tokenize():

All Tokens (Corpus Level): ['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials', '.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.'] 

Sentence 1 Tokens: ['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials', '.']
Sentence 2 Tokens: ['Please', 'go', 'through', 'the', 'entire', 'content', '!']
Sentence 3 Tokens: ['to', 'become', 'expert', 'in', 'NLP', '.']


In [37]:
# 🔹 Word Tokenization using wordpunct_tokenize
from nltk.tokenize import wordpunct_tokenize

print("\n🧩 Word Tokenization using wordpunct_tokenize():\n")

# This splits text into runs of alphanumeric characters and runs of punctuation
# Example: "Sundaram's" -> ['Sundaram', "'", 's']
tokens_punct = wordpunct_tokenize(corpus)
print(tokens_punct)



🧩 Word Tokenization using wordpunct_tokenize():

['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'", 's', 'NLP', 'Tutorials', '.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']
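
The output above is consistent with a simple regular expression: match either a run of word characters or a run of non-space punctuation. We believe NLTK's `WordPunctTokenizer` is built on the pattern `\w+|[^\w\s]+`, but treat that as an assumption. The sketch below uses only Python's `re` module, and `simple_wordpunct` is a hypothetical helper name.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_wordpunct("Sundaram's NLP Tutorials."))
# ['Sundaram', "'", 's', 'NLP', 'Tutorials', '.']

print(simple_wordpunct("can't stop!!"))
# ['can', "'", 't', 'stop', '!!']
```

Two things stand out: contractions and possessives are broken at the apostrophe, and consecutive punctuation marks such as `!!` stay together as one token.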


In [39]:
# 🔹 Word Tokenization using TreebankWordTokenizer
# 📝 TreebankWordTokenizer follows Penn Treebank conventions:
# - Splits contractions: "can't" → ['ca', "n't"]
# - Splits commas, exclamation marks, and the possessive 's into separate tokens
# - Keeps a period attached to its word unless it ends the input text
# Note: in the output below, 'Tutorials.' keeps its period because more text follows;
# only the final period of the corpus becomes a separate token.
from nltk.tokenize import TreebankWordTokenizer

print("\n🧩 Word Tokenization using TreebankWordTokenizer():\n")

tokenizer = TreebankWordTokenizer()
treebank_tokens = tokenizer.tokenize(corpus)
print(treebank_tokens)



🧩 Word Tokenization using TreebankWordTokenizer():

['Hello', 'Welcome', ',', 'to', 'Prasanna', 'Sundaram', "'s", 'NLP', 'Tutorials.', 'Please', 'go', 'through', 'the', 'entire', 'content', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']


## 🧩 Tokenizer Comparison Summary

| Tokenizer | Splitting Behavior | Handles Contractions | Handles Punctuation | Notes |
|:-----------|:------------------:|:--------------------:|:------------------:|:------|
| **`word_tokenize`** | Moderate | ✅ | ✅ | Good default choice |
| **`wordpunct_tokenize`** | Aggressive | ❌ (splits `'s` → `'`, `s`) | ✅✅ | Best for punctuation-heavy text |
| **`TreebankWordTokenizer`** | Linguistically consistent | ✅ | ✅ (mid-text periods stay attached) | Used in research & corpora preprocessing |

---

### 📌 Key Takeaways
✅ Always inspect tokens before using them in downstream tasks.  
✅ Maintain the same tokenizer during both **training** and **inference**.  
✅ Combine tokenization with **lowercasing**, **stopword removal**, and **lemmatization** as preprocessing steps.
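
As a rough illustration of the last takeaway, the sketch below applies lowercasing and stopword removal to the tokens of sentence 2 from the `word_tokenize()` output above. The stopword set and the `preprocess` helper are hypothetical and intentionally tiny; lemmatization is left out because it needs an extra language resource.

```python
# Tokens for sentence 2, copied from the word_tokenize() output above
tokens = ['Please', 'go', 'through', 'the', 'entire', 'content', '!']

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"the", "to", "in", "through"}

def preprocess(tokens, stopwords=STOPWORDS):
    """Lowercase tokens, drop punctuation-only tokens, then drop stopwords."""
    lowered = [t.lower() for t in tokens]
    words = [t for t in lowered if t.isalpha()]
    return [t for t in words if t not in stopwords]

print(preprocess(tokens))  # ['please', 'go', 'entire', 'content']
```

In practice a curated stopword list would replace the hand-written set, and the same steps must be applied at both training and inference time.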

---

### 🧮 Mathematical Summary

- **Sentence Tokenization**
  $$
  P \mapsto [S_1, S_2, \dots, S_m]
  $$
- **Word Tokenization**
  $$
  S_i \mapsto [w_{i1}, w_{i2}, \dots, w_{ik_i}]
  $$
  where $k_i$ is the number of tokens in sentence $S_i$.
- **Vocabulary Extraction**
  $$
  V = \bigcup_{i=1}^{m} \{ w_{ij} \mid j = 1, 2, \dots, k_i \}
  $$
