# **Implementing Tokenization**

Tokenizers break text into smaller units called tokens. A token can be a word, a character, or a subword, depending on the tokenizer. Models operate on these units rather than on raw strings, so tokenization is the first step in language applications such as translation, sentiment analysis, and chatbots. The sections below run a sample sentence through several tokenizers and compare the results.
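
To make the granularities concrete, here is a minimal sketch that uses only built-in string operations. The sample string and variable names are illustrative assumptions. Subword splitting needs a learned vocabulary, so it is left to the Hugging Face section.

```python
sample = "Tokenizers split text"

# Word-level: split on whitespace
word_level = sample.split()

# Character-level: every character, spaces included, becomes a token
char_level = list(sample)

print(word_level)       # ['Tokenizers', 'split', 'text']
print(len(char_level))  # 21
```

Word-level tokens keep the vocabulary readable but grow it quickly; character-level tokens keep the vocabulary tiny at the cost of much longer sequences.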

### Install necessary libraries

In [7]:
pip install nltk spacy transformers gensim tensorflow keras torchtext

Collecting torchtext
  Downloading torchtext-0.18.0-cp311-cp311-manylinux1_x86_64.whl.metadata (7.9 kB)
Downloading torchtext-0.18.0-cp311-cp311-manylinux1_x86_64.whl (2.0 MB)
Installing collected packages: torchtext
Successfully installed torchtext-0.18.0


In [10]:
# Import necessary libraries
import nltk
import spacy
from transformers import AutoTokenizer
from gensim.utils import tokenize as gensim_tokenize
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import re

### 1. Tokenization with NLTK

In [14]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the necessary resources
nltk.download('punkt_tab')

text = "Natural Language Processing is amazing! let's learn tokenization."

# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens: ", word_tokens)

# Sentence tokenization
sentence_token = sent_tokenize(text)
print("Sentence Tokens: ", sentence_token)

Word tokens:  ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'let', "'s", 'learn', 'tokenization', '.']
Sentence Tokens:  ['Natural Language Processing is amazing!', "let's learn tokenization."]


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### 2. Tokenization with SpaCy


In [15]:
import spacy
# Load english model
nlp=spacy.load("en_core_web_sm")

text = "Natural Language Processing is amazing! Let's learn tokenization."

# Process the text
doc = nlp(text)
# Word tokens
word_tokens = [token.text for token in doc]
print("Word tokens: ", word_tokens)

# Sentence tokens
sentence_tokens = [sent.text for sent in doc.sents]
print("Sentence tokens: ", sentence_tokens)

Word tokens:  ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'tokenization', '.']
Sentence tokens:  ['Natural Language Processing is amazing!', "Let's learn tokenization."]


### 3. Tokenization with Hugging Face Transformers


In [17]:
from transformers import AutoTokenizer

# Load the tokenizer that ships with the pretrained model
tokenizer=AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Natural Language Processing is amazing! Let's learn tokenization."

# Subword (WordPiece) tokenization: rare words are split into pieces marked with ##
word_tokens=tokenizer.tokenize(text)
print("Word Tokens: ",word_tokens)

# Encode to vocabulary IDs, adding the special [CLS] (101) and [SEP] (102) tokens
words_ids=tokenizer.encode(text,add_special_tokens=True)
print("Tokens IDs : ",words_ids)

Word Tokens:  ['natural', 'language', 'processing', 'is', 'amazing', '!', 'let', "'", 's', 'learn', 'token', '##ization', '.']
Tokens IDs :  [101, 3019, 2653, 6364, 2003, 6429, 999, 2292, 1005, 1055, 4553, 19204, 3989, 1012, 102]
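
The `'token', '##ization'` split above comes from WordPiece, which matches the longest vocabulary entry it can at each position. The following is a minimal sketch of that greedy longest-match-first idea, not the library's implementation. The function name `wordpiece_split` and the tiny `toy_vocab` are hypothetical; the real tokenizer also lowercases, splits punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ization", "learn", "##ing"}

print(wordpiece_split("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_split("learning", toy_vocab))      # ['learn', '##ing']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```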


### 4. Tokenization with Python's str.split() (Basic Approach)

In [18]:
text = "Natural Language Processing is amazing! Let's learn tokenization."

word_tokens1=text.split()
print("Words Tokens: ",word_tokens1)

Words Tokens:  ['Natural', 'Language', 'Processing', 'is', 'amazing!', "Let's", 'learn', 'tokenization.']
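
Note that `str.split()` leaves punctuation attached (`'amazing!'`, `'tokenization.'`). One possible workaround, sketched here with the standard `string.punctuation` constant, is to strip punctuation from the edges of each chunk. The variable name `cleaned_tokens` is an illustrative assumption.

```python
import string

text = "Natural Language Processing is amazing! Let's learn tokenization."

# Strip leading and trailing punctuation from each whitespace-separated chunk
cleaned_tokens = [chunk.strip(string.punctuation) for chunk in text.split()]
# Drop chunks that consisted of punctuation only
cleaned_tokens = [tok for tok in cleaned_tokens if tok]

print(cleaned_tokens)
# ['Natural', 'Language', 'Processing', 'is', 'amazing', "Let's", 'learn', 'tokenization']
```

Interior punctuation such as the apostrophe in `Let's` is preserved, because `str.strip` only touches the ends of a string.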


### 5. Tokenization with re (Regular Expressions)


In [19]:
import re

text = "Natural Language Processing is amazing! Let's learn tokenization."

# custom regex
word_tokens = re.findall(r'\b\w+\b', text)
print("Word Tokens: ",word_tokens)

# Regex for sentence tokenization: split on whitespace that follows ., ?, or !
sentence_tokens = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?!])\s', text)
print("Sentence Tokens:", sentence_tokens)

Word Tokens:  ['Natural', 'Language', 'Processing', 'is', 'amazing', 'Let', 's', 'learn', 'tokenization']
Sentence Tokens: ['Natural Language Processing is amazing!', "Let's learn tokenization."]
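
The `\b\w+\b` pattern drops punctuation and breaks `Let's` into `'Let'` and `'s'`. As one possible alternative, the sketch below keeps contractions together and emits each punctuation mark as its own token. The pattern is an illustrative assumption, not a standard.

```python
import re

text = "Natural Language Processing is amazing! Let's learn tokenization."

# \w+(?:'\w+)*  : a word, optionally followed by apostrophe-joined parts (Let's)
# [^\w\s]       : any single character that is neither a word character nor whitespace
pattern = r"\w+(?:'\w+)*|[^\w\s]"
tokens = re.findall(pattern, text)

print(tokens)
# ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', "Let's", 'learn', 'tokenization', '.']
```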


### 6. Tokenization with Gensim


In [20]:
from gensim.utils import tokenize

text = "Natural Language Processing is amazing! Let's learn tokenization."

# Word tokenization (tokenize returns a generator, so wrap it in list)
word_tokens=list(tokenize(text))
print("Word Tokens: ",word_tokens)

Word Tokens:  ['Natural', 'Language', 'Processing', 'is', 'amazing', 'Let', 's', 'learn', 'tokenization']
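
Gensim documents its tokens as maximal runs of alphabetic characters, which explains why `Let's` becomes `'Let'` and `'s'`. The sketch below approximates that behavior with a plain regex to show the effect on digits. It is not Gensim's source code; the pattern and the `alphabetic_tokens` helper are assumptions for illustration.

```python
import re

# Assumption: [^\W\d]+ approximates "letters only" (it also admits underscores)
ALPHABETIC = re.compile(r"[^\W\d]+")


def alphabetic_tokens(text, lowercase=False):
    """Return maximal runs of non-digit word characters."""
    if lowercase:
        text = text.lower()
    return [match.group() for match in ALPHABETIC.finditer(text)]


print(alphabetic_tokens("Let's learn tokenization in 2024."))
# ['Let', 's', 'learn', 'tokenization', 'in']
```

Numbers disappear entirely under this scheme, which suits bag-of-words topic models but not tasks where digits carry meaning.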


### 7. Tokenization with TensorFlow/Keras


In [21]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "Natural Language Processing is amazing! Let's learn tokenization."

# word tokenization
word_tokens=text_to_word_sequence(text)
print("Word Tokens: ",word_tokens)

Word Tokens:  ['natural', 'language', 'processing', 'is', 'amazing', "let's", 'learn', 'tokenization']
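
The output above is lowercased and free of punctuation, yet the apostrophe in `let's` survives. The following pure-Python sketch reproduces that behavior, assuming the filter string below mirrors the documented defaults of `text_to_word_sequence`. The helper name `simple_word_sequence` is hypothetical.

```python
# Assumed to mirror the default `filters` argument; note that the apostrophe is absent
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]


text = "Natural Language Processing is amazing! Let's learn tokenization."
print(simple_word_sequence(text))
# ['natural', 'language', 'processing', 'is', 'amazing', "let's", 'learn', 'tokenization']
```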


## Comparison of Tokenization Methods

| Library/Method        | Features                                                                                         |
|-----------------------|--------------------------------------------------------------------------------------------------|
| NLTK                  | Simple and beginner-friendly; word and sentence tokenizers that split off punctuation and clitics such as `'s`. |
| SpaCy                 | Fast, efficient, language-specific models; sentence boundaries via `doc.sents`; great for production. |
| Hugging Face          | Model-specific tokenization (subword-level pieces, special tokens, token IDs).                   |
| `str.split()`         | Whitespace splitting only; punctuation stays attached to words.                                  |
| `re` (Regex)          | Fully customizable; behavior depends entirely on the pattern.                                    |
| Gensim                | Lightweight; yields alphabetic tokens only; useful for topic modeling.                           |
| TensorFlow/Keras      | `text_to_word_sequence` lowercases and strips punctuation; fits deep learning workflows.         |
| TorchText             | PyTorch integration, efficient preprocessing pipelines (installed above but not demonstrated here). |
