# Understanding Tokenization with NLTK

This notebook explains the concept of **tokenization** — the process of splitting text into individual units such as words or sentences — using the **NLTK (Natural Language Toolkit)** library in Python.

Tokenization is a fundamental step in Natural Language Processing (NLP), used in tasks like text preprocessing, sentiment analysis, and language modeling.


## Step 1: Importing and Downloading NLTK Tokenizers

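Before we can tokenize text, we need to import the tokenization functions and download the `punkt` tokenizer models. `sent_tokenize()` uses these models directly, and `word_tokenize()` uses them indirectly because it splits text into sentences first. Recent NLTK releases load the `punkt_tab` resource instead; if tokenizing raises a `LookupError`, run `nltk.download('punkt_tab')` as well.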


In [1]:
# Importing the necessary modules from NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Downloading the necessary NLTK data packages (only need to run once)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jashika/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Install NLTK if it is not already available. Run this cell before the import cell above.
!pip install nltk



## Step 2: Define Sample Text

We will use a simple English paragraph as our input text for tokenization.


In [3]:
corpus = """Hello Welcome,to this Data Science Course.
It's a wonderful resource for beginners. It will help you upskill and learn new skills. Keep Visiting Jashika's GitHub Repo. You can also follow Jashika's LinkedIn.
"""

In [4]:
print(corpus)

Hello Welcome,to this Data Science Course.
It's a wonderful resource for beginners. It will help you upskill and learn new skills. Keep Visiting Jashika's GitHub Repo. You can also follow Jashika's LinkedIn.



## Step 3: Sentence Tokenization

The `sent_tokenize()` function splits text into sentences using the pretrained Punkt model, which learns where sentences end instead of breaking on every period. It returns a list of strings, one per sentence.


In [5]:
# Sentence tokenization: split a paragraph into sentences
from nltk.tokenize import sent_tokenize

In [6]:
documents = sent_tokenize(corpus)

In [7]:
type(documents)

list

In [8]:
for sentence in documents:
    print(sentence)

Hello Welcome,to this Data Science Course.
It's a wonderful resource for beginners.
It will help you upskill and learn new skills.
Keep Visiting Jashika's GitHub Repo.
You can also follow Jashika's LinkedIn.
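
To see why a trained model helps, compare it with a naive splitter. The sketch below is not what NLTK does; it is a hypothetical baseline (the name `naive_sent_split` is made up for this illustration) that breaks after every `.`, `!`, or `?` followed by whitespace, and it trips over the abbreviation `Dr.`.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative baseline only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith teaches NLP. She uses NLTK."
for piece in naive_sent_split(sample):
    print(piece)
# Dr.
# Smith teaches NLP.
# She uses NLTK.
```

The first "sentence" is just `Dr.`, which is wrong. Punkt is designed to learn abbreviations like this from data, so it avoids many of these false breaks.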


## Step 4: Word Tokenization

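The `word_tokenize()` function splits the input text into words and punctuation marks. It first splits the text into sentences and then tokenizes each sentence with Treebank-style rules, so `It's` becomes `It` + `'s`. Token lists like these are the usual input for tagging, parsing, and classification.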


In [9]:
# Word tokenization: paragraph --> words, or sentence --> words
from nltk.tokenize import word_tokenize

In [10]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'this',
 'Data',
 'Science',
 'Course',
 '.',
 'It',
 "'s",
 'a',
 'wonderful',
 'resource',
 'for',
 'beginners',
 '.',
 'It',
 'will',
 'help',
 'you',
 'upskill',
 'and',
 'learn',
 'new',
 'skills',
 '.',
 'Keep',
 'Visiting',
 'Jashika',
 "'s",
 'GitHub',
 'Repo',
 '.',
 'You',
 'can',
 'also',
 'follow',
 'Jashika',
 "'s",
 'LinkedIn',
 '.']

In [11]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'this', 'Data', 'Science', 'Course', '.']
['It', "'s", 'a', 'wonderful', 'resource', 'for', 'beginners', '.']
['It', 'will', 'help', 'you', 'upskill', 'and', 'learn', 'new', 'skills', '.']
['Keep', 'Visiting', 'Jashika', "'s", 'GitHub', 'Repo', '.']
['You', 'can', 'also', 'follow', 'Jashika', "'s", 'LinkedIn', '.']
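
Once each sentence is a list of tokens, a common next step is to normalize the tokens and count them. Here is a minimal sketch using only the standard library, with the token lists copied from the output above. The `isalpha()` filter is an illustrative choice, not an NLTK feature: it drops punctuation and also clitics such as `'s`.

```python
from collections import Counter

# Token lists copied from the per-sentence word_tokenize output above.
tokenized_sentences = [
    ['Hello', 'Welcome', ',', 'to', 'this', 'Data', 'Science', 'Course', '.'],
    ['It', "'s", 'a', 'wonderful', 'resource', 'for', 'beginners', '.'],
    ['It', 'will', 'help', 'you', 'upskill', 'and', 'learn', 'new', 'skills', '.'],
    ['Keep', 'Visiting', 'Jashika', "'s", 'GitHub', 'Repo', '.'],
    ['You', 'can', 'also', 'follow', 'Jashika', "'s", 'LinkedIn', '.'],
]

# Flatten the nested lists, lowercase each token, and keep alphabetic tokens only.
words = [tok.lower() for sent in tokenized_sentences for tok in sent if tok.isalpha()]

counts = Counter(words)
print(len(words), "word tokens,", len(counts), "distinct")
print(counts.most_common(3))
```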


## Comparing `wordpunct_tokenize` vs `TreebankWordTokenizer`

In this section, we compare two popular word tokenizers from NLTK:

### 1. `wordpunct_tokenize`
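- Splits text with the regular expression `\w+|[^\w\s]+`: each run of letters, digits, and underscores becomes one token, and each run of other non-whitespace characters becomes another.
- Separates every run of punctuation from adjacent words; consecutive punctuation marks such as `...` stay together as one token.
- For example: `"Don't"` becomes `["Don", "'", "t"]`.
- Useful for basic splitting, but it over-segments contractions and possessives.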

In [12]:
from nltk.tokenize import wordpunct_tokenize

In [13]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'this',
 'Data',
 'Science',
 'Course',
 '.',
 'It',
 "'",
 's',
 'a',
 'wonderful',
 'resource',
 'for',
 'beginners',
 '.',
 'It',
 'will',
 'help',
 'you',
 'upskill',
 'and',
 'learn',
 'new',
 'skills',
 '.',
 'Keep',
 'Visiting',
 'Jashika',
 "'",
 's',
 'GitHub',
 'Repo',
 '.',
 'You',
 'can',
 'also',
 'follow',
 'Jashika',
 "'",
 's',
 'LinkedIn',
 '.']
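
The three-way split of `It's` in the output above follows directly from the pattern `wordpunct_tokenize` is built on. As a rough sketch, the same behavior can be reproduced with the standard `re` module; the helper name `regex_wordpunct` is made up for this illustration.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Approximate wordpunct_tokenize using only the standard library."""
    return WORDPUNCT_PATTERN.findall(text)

print(regex_wordpunct("It's a wonderful resource."))
# ['It', "'", 's', 'a', 'wonderful', 'resource', '.']
print(regex_wordpunct("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```

The second call shows that adjacent punctuation marks are grouped into a single token instead of being split one character at a time.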

### 2. `TreebankWordTokenizer`
- Based on the Penn Treebank tokenization conventions.
- Keeps contraction and possessive endings such as `'s` and `n't` together as single tokens.
- For example: `"Don't"` becomes `["Do", "n't"]`, following the Treebank convention of separating the negation from the verb.
- Expects one sentence per call (for example, the output of `sent_tokenize()`); on multi-sentence text only the final period is split off.



In [14]:
from nltk.tokenize import TreebankWordTokenizer

In [15]:
tokenizer=TreebankWordTokenizer()

In [16]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'this',
 'Data',
 'Science',
 'Course.',
 'It',
 "'s",
 'a',
 'wonderful',
 'resource',
 'for',
 'beginners.',
 'It',
 'will',
 'help',
 'you',
 'upskill',
 'and',
 'learn',
 'new',
 'skills.',
 'Keep',
 'Visiting',
 'Jashika',
 "'s",
 'GitHub',
 'Repo.',
 'You',
 'can',
 'also',
 'follow',
 'Jashika',
 "'s",
 'LinkedIn',
 '.']
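
Notice that `'Course.'`, `'beginners.'`, `'skills.'`, and `'Repo.'` keep their periods in the output above: the tokenizer treats the whole paragraph as one sentence and only detaches the last period. A small sketch of the usual remedy is to tokenize one sentence at a time. The sentences are hardcoded here (copied from the Step 3 output) so the snippet runs without the `punkt` data.

```python
from nltk.tokenize import TreebankWordTokenizer

# Two sentences as produced by sent_tokenize(corpus) in Step 3.
sentences = [
    "Hello Welcome,to this Data Science Course.",
    "It's a wonderful resource for beginners.",
]

tokenizer = TreebankWordTokenizer()
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
# ['Hello', 'Welcome', ',', 'to', 'this', 'Data', 'Science', 'Course', '.']
# ['It', "'s", 'a', 'wonderful', 'resource', 'for', 'beginners', '.']
```

With one sentence per call, each sentence-final period becomes its own token, matching the `word_tokenize()` output from Step 4.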


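👉 Use `TreebankWordTokenizer` for linguistically aware tokenization, but pass it one sentence at a time: it only splits off a period at the very end of its input, which is why `'Course.'`, `'beginners.'`, `'skills.'`, and `'Repo.'` keep their periods in the output above.  
👉 Use `wordpunct_tokenize` if you need every run of punctuation separated from the words around it.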

| Feature                  | `wordpunct_tokenize`                        | `TreebankWordTokenizer`                                    |
|--------------------------|---------------------------------------------|------------------------------------------------------------|
| Punctuation handling     | Splits off every run of punctuation         | Splits most punctuation; a period only at the end of input |
| Contraction: "It's"      | ['It', "'", 's']                            | ['It', "'s"]                                               |
| Possessive: "Jashika's"  | ['Jashika', "'", 's']                       | ['Jashika', "'s"]                                          |
| Use case                 | Simple regex-based preprocessing            | Pipelines such as tagging and parsing, one sentence at a time |


## 📘 Summary

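- **Tokenization** is the process of splitting text into smaller units (tokens), such as sentences or words.
- NLTK provides easy-to-use functions: `sent_tokenize()` for sentence splitting and `word_tokenize()` for word splitting.
- `wordpunct_tokenize` and `TreebankWordTokenizer` differ mainly in how they handle apostrophes and periods.
- Tokenization prepares raw text for later NLP steps such as tagging, parsing, and classification.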