# Different Tokenizers

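Each tokenizer processes text differently, reflecting its own approach to words, punctuation, and sentence boundaries. The right choice depends on the requirements of your NLP task and the nature of your text data. The expected tokens listed below are for the sentence `Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!`.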

### White Space Tokenization:
- **Splits text on whitespace (spaces, tabs, newlines); punctuation stays attached to the adjacent word.**
- **Expected Tokens:** `["Let's", "see", "different", "tokenizers.", "They", "all", "work", "differently:", "white-space,", "regex,", "NLTK,", "Spacy!"]`
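
As a minimal sketch of this approach, Python's built-in `str.split()` with no arguments splits on any run of whitespace. The `text` variable holds the example sentence assumed for the expected tokens in this section, and the second print is only an illustration of which tokens still carry punctuation.

```python
# Example sentence assumed for the expected tokens in this section.
text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"

# str.split() with no arguments splits on runs of whitespace (spaces, tabs, newlines).
white_space_tokens = text.split()
print(white_space_tokens)

# Tokens that contain at least one non-alphanumeric character,
# e.g. "tokenizers." and "Spacy!" keep their trailing punctuation.
tokens_with_punct = [tok for tok in white_space_tokens if not tok.isalnum()]
print(tokens_with_punct)
```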

### NLTK Word Tokenization:
- **Separates punctuation into its own tokens and splits contractions (`Let's` becomes `Let` and `'s`).**
- **Expected Tokens:** `["Let", "'s", "see", "different", "tokenizers", ".", "They", "all", "work", "differently", ":", "white-space", ",", "regex", ",", "NLTK", ",", "Spacy", "!"]`

### Regular Expression Tokenizer:
- **Uses a regex pattern (here `\w+`, which matches runs of word characters: letters, digits, and underscores), so punctuation is dropped.**
- **Expected Tokens:** `["Let", "s", "see", "different", "tokenizers", "They", "all", "work", "differently", "white", "space", "regex", "NLTK", "Spacy"]`
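
The same pattern can be tried with the standard library alone. The sketch below assumes that `re.findall` behaves like `RegexpTokenizer.tokenize` for a pattern without capture groups. The second pattern, `\w+|[^\w\s]`, is an illustrative alternative (not part of the original example) that keeps each punctuation mark as its own token instead of dropping it.

```python
import re

text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"

# \w+ keeps runs of letters, digits, and underscores; everything else is dropped.
word_only = re.findall(r"\w+", text)
print(word_only)

# Illustrative alternative: a word run, or any single character that is
# neither a word character nor whitespace (i.e. a punctuation mark).
with_punct = re.findall(r"\w+|[^\w\s]", text)
print(with_punct)
```

Note that both patterns break `Let's` and `white-space` apart, which is the main trade-off of a purely character-class-based tokenizer.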

### Spacy Tokenization:
- **Rule-based tokenizer that separates punctuation, splits contractions, and also splits hyphenated words at the hyphen.**
- **Expected Tokens:** `["Let", "'s", "see", "different", "tokenizers", ".", "They", "all", "work", "differently", ":", "white", "-", "space", ",", "regex", ",", "NLTK", ",", "Spacy", "!"]`

### Sentence Tokenization:
- **Splits text into sentences.**
- **Expected Sentences:** `["Let's see different tokenizers.", "They all work differently: white-space, regex, NLTK, Spacy!"]`
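
To make the idea concrete, here is a deliberately naive sketch that splits after `.`, `!`, or `?` when whitespace follows. It is not how NLTK's `sent_tokenize` works (that relies on the pretrained Punkt model, which also handles abbreviations); the helper name `naive_sent_split` is hypothetical.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Let's see different tokenizers. They all work differently: white-space, regex, NLTK, Spacy!"
sentences = naive_sent_split(text)
print(sentences)

# The rule is too simple for abbreviations: this input is split after "Dr.".
print(naive_sent_split("Dr. Smith arrived."))
```

The abbreviation case is the reason a trained sentence tokenizer is usually preferred over a hand-written rule.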

The code below applies the same five tokenizers to a different sample sentence, one that includes parentheses and a free-standing hyphen.

```python
import nltk
import spacy
from nltk.tokenize import word_tokenize, RegexpTokenizer, sent_tokenize

# Sample text
text = "The quick, brown fox (an exemplary species) jumps over the lazy dog - a classic example of a pangram"

# White Space Tokenization: split on runs of whitespace
white_space_tokens = text.split()

# NLTK Word Tokenization (needs the Punkt tokenizer data;
# recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk_tokens = word_tokenize(text)

# Regular Expression Tokenizer: keep only runs of word characters
regexp_tokenizer = RegexpTokenizer(r'\w+')
regex_tokens = regexp_tokenizer.tokenize(text)

# Spacy Tokenization (requires the model: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

# Sentence Tokenization
sentences = sent_tokenize(text)

# Collecting all tokenization results
tokenization_results = {
    "White Space Tokenization": white_space_tokens,
    "NLTK Tokenization": nltk_tokens,
    "Regex Tokenization": regex_tokens,
    "Spacy Tokenization": spacy_tokens,
    "Sentence Tokenization": sentences
}

# Print each tokenizer's name followed by its tokens
for name, tokens in tokenization_results.items():
    print(name)
    print(tokens)
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

White Space Tokenization
['The', 'quick,', 'brown', 'fox', '(an', 'exemplary', 'species)', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
NLTK Tokenization
['The', 'quick', ',', 'brown', 'fox', '(', 'an', 'exemplary', 'species', ')', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Regex Tokenization
['The', 'quick', 'brown', 'fox', 'an', 'exemplary', 'species', 'jumps', 'over', 'the', 'lazy', 'dog', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Spacy Tokenization
['The', 'quick', ',', 'brown', 'fox', '(', 'an', 'exemplary', 'species', ')', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
Sentence Tokenization
['The quick, brown fox (an exemplary species) jumps over the lazy dog - a classic example of a pangram']
```
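
One way to read these results is to compare the token lists directly. The sketch below copies three of the printed lists and reports which tokens appear in one output but not in another; the helper name `only_in` is an assumption made for illustration.

```python
# Token lists copied from the printed results for the pangram sentence.
white_space_tokens = ['The', 'quick,', 'brown', 'fox', '(an', 'exemplary', 'species)', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
nltk_tokens = ['The', 'quick', ',', 'brown', 'fox', '(', 'an', 'exemplary', 'species', ')', 'jumps', 'over', 'the', 'lazy', 'dog', '-', 'a', 'classic', 'example', 'of', 'a', 'pangram']
regex_tokens = ['The', 'quick', 'brown', 'fox', 'an', 'exemplary', 'species', 'jumps', 'over', 'the', 'lazy', 'dog', 'a', 'classic', 'example', 'of', 'a', 'pangram']

def only_in(a, b):
    """Return tokens of `a` that never occur in `b`, in order, without repeats."""
    seen = set(b)
    return list(dict.fromkeys(tok for tok in a if tok not in seen))

# Tokens where white-space splitting left punctuation attached.
ws_vs_nltk = only_in(white_space_tokens, nltk_tokens)
print(ws_vs_nltk)

# Punctuation tokens that NLTK keeps and the \w+ pattern drops.
nltk_vs_regex = only_in(nltk_tokens, regex_tokens)
print(nltk_vs_regex)

# Token counts per tokenizer.
print(len(white_space_tokens), len(nltk_tokens), len(regex_tokens))
```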
