### Tokenization

Tokenization is the process of breaking text into smaller units called tokens, typically words or subwords.
Tokens are the basic building blocks for later stages of natural language processing (NLP).
NLTK (Natural Language Toolkit) provides several tokenizers to suit different needs and scenarios.

Structure of a token: prefix + root morpheme + suffix (for example, "un" + "break" + "able")

Text hierarchy, from largest to smallest unit: Corpus > Document > Paragraph > Sentence > Token

#### How Tokenization Works

Sentence Tokenization: This step involves splitting the text corpus into individual sentences. This can typically be done using punctuation marks like periods, exclamation marks, and question marks as delimiters.

Example: "Hello! How are you today?" -> ["Hello!", "How are you today?"]
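
As a rough illustration of the punctuation-based approach described above, the following sketch splits on whitespace that follows `.`, `!`, or `?`. The helper name `naive_sent_split` is hypothetical and is not part of NLTK; real sentence tokenizers handle abbreviations and other edge cases that this regular expression does not.

```python
import re

def naive_sent_split(text):
    """Split text after ., !, or ? when followed by whitespace (illustrative only)."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(naive_sent_split("Hello! How are you today?"))
# ['Hello!', 'How are you today?']
```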

Word Tokenization: Once sentences are separated, the next step is to break down each sentence into individual words or tokens. This is usually done by splitting the sentences based on whitespace or punctuation.

Example (splitting on whitespace only): "How are you today?" -> ["How", "are", "you", "today?"]
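
The difference between splitting on whitespace and splitting on punctuation can be seen in a small standard-library sketch. The regular expression here is an assumption chosen for illustration, not the rule any particular NLTK tokenizer uses.

```python
import re

sentence = "How are you today?"

# Splitting on whitespace keeps punctuation attached to the neighbouring word.
whitespace_tokens = sentence.split()

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)   # ['How', 'are', 'you', 'today?']
print(punct_aware_tokens)  # ['How', 'are', 'you', 'today', '?']
```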

Further Tokenization: Depending on the task and requirements, tokenization can be extended to include additional levels of granularity such as splitting hyphenated words, handling contractions, or identifying special characters.
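
A minimal sketch of this finer-grained step, assuming simple English input: the hypothetical helper `split_further` breaks a token at hyphens and separates common contraction endings. The list of endings is an assumption made for this example and is far from complete.

```python
import re

def split_further(token):
    """Hypothetical helper: split hyphenated words and simple English contractions."""
    pieces = []
    for part in token.split("-"):  # "state-of-the-art" -> "state", "of", "the", "art"
        # Separate a trailing ending such as n't, 's, 're, 've, 'll, 'd, or 'm.
        match = re.fullmatch(r"(\w+?)(n't|'(?:s|re|ve|ll|d|m))", part)
        if match:
            pieces.extend(match.groups())
        elif part:
            pieces.append(part)
    return pieces

print(split_further("state-of-the-art"))  # ['state', 'of', 'the', 'art']
print(split_further("don't"))             # ['do', "n't"]
print(split_further("there's"))           # ['there', "'s"]
```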


    Examples of NLTK tokenization functions and classes:
    sent_tokenize,
    word_tokenize,
    TreebankWordTokenizer,
    wordpunct_tokenize,
    WhitespaceTokenizer,
    TweetTokenizer,
    casual_tokenize,
    PunktSentenceTokenizer,
    ReppTokenizer
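
Not every tokenizer in this list is demonstrated later in the section. As one example, the sketch below uses `TweetTokenizer`, which is designed for casual, social-media-style text. It assumes NLTK is installed; the sample string is made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions; reduce_len shortens runs of more than
# three repeated characters down to three.
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

tweet = "@nlp_fan This is sooooo cool :-) #NLP"  # made-up sample text
tweet_tokens = tweet_tokenizer.tokenize(tweet)
print(tweet_tokens)
```

Compared with a general-purpose word tokenizer, this one aims to keep hashtags and emoticons together as single tokens.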


#### Advantages

Text Preprocessing: Tokenization is an essential step in text preprocessing for various NLP tasks, including sentiment analysis, machine translation, and information retrieval.

Normalization: Breaking text into consistent units standardizes the data and simplifies further analysis and processing.

Feature Extraction: Tokens serve as the basis for feature extraction in NLP models, enabling the extraction of meaningful information from text data.
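
One simple form of feature extraction is a bag of words, where each token's frequency becomes a feature. The sketch below uses a short made-up token list and the standard-library `Counter`; it is one possible illustration, not the only way to derive features from tokens.

```python
from collections import Counter

# A made-up token list standing in for the output of a tokenizer.
tokens = ["a", "common", "corpus", "is", "a", "useful", "corpus"]

# Count how often each token occurs; the counts can serve as simple features.
bag_of_words = Counter(tokens)

print(bag_of_words["corpus"])  # 2
print(bag_of_words["useful"])  # 1
```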

#### Disadvantages

Ambiguity: Tokenization may encounter ambiguity in certain cases, such as tokenizing compound words or handling abbreviations, leading to potential errors in downstream tasks.
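
The abbreviation problem is easy to reproduce. In the sketch below, a naive rule that ends a sentence at every period followed by whitespace cuts a made-up two-sentence text into four pieces. The rule is an assumption for illustration and is not how NLTK's `sent_tokenize` works.

```python
import re

text = "Dr. Smith arrived at 5 p.m. yesterday. He was late."  # made-up sample

# Naive rule: a sentence ends at ., !, or ? followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for piece in naive_sentences:
    print(repr(piece))
# 'Dr.'
# 'Smith arrived at 5 p.m.'
# 'yesterday.'
# 'He was late.'
```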

Language-specific Challenges: Tokenization may face challenges in languages with complex word structures, morphological variations, or non-standard orthographies.

Tokenization Errors: Errors in tokenization, such as splitting words incorrectly or treating punctuation inconsistently, can impact the quality of subsequent analyses and results.


In [10]:
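import nltk
# Punkt models back sent_tokenize and word_tokenize; recent NLTK releases may need 'punkt_tab' instead.
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize, TreebankWordTokenizer, wordpunct_tokenize, WhitespaceTokenizer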

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
corpus = """
In the domain of natural language processing (NLP), statistical NLP in particular, there's a need to train the model or algorithm with lots of data.
For this purpose, researchers have assembled many text corpora.
A common corpus is also useful for benchmarking models.
"""

In [None]:
print(corpus)


In the domain of natural language processing (NLP), statistical NLP in particular, there's a need to train the model or algorithm with lots of data.
For this purpose, researchers have assembled many text corpora.
A common corpus is also useful for benchmarking models.



sent_tokenize

In [None]:
sentence = sent_tokenize(corpus)
sentence

["\nIn the domain of natural language processing (NLP), statistical NLP in particular, there's a need to train the model or algorithm with lots of data.",
 'For this purpose, researchers have assembled many text corpora.',
 'A common corpus is also useful for benchmarking models.']

word_tokenize

In [None]:
words = word_tokenize(corpus)
words

['In',
 'the',
 'domain',
 'of',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 ',',
 'statistical',
 'NLP',
 'in',
 'particular',
 ',',
 'there',
 "'s",
 'a',
 'need',
 'to',
 'train',
 'the',
 'model',
 'or',
 'algorithm',
 'with',
 'lots',
 'of',
 'data',
 '.',
 'For',
 'this',
 'purpose',
 ',',
 'researchers',
 'have',
 'assembled',
 'many',
 'text',
 'corpora',
 '.',
 'A',
 'common',
 'corpus',
 'is',
 'also',
 'useful',
 'for',
 'benchmarking',
 'models',
 '.']

In [None]:
# Tokenize one sentence at a time, without overwriting the `words` list defined earlier.
for sent in sentence:
    print(word_tokenize(sent))

['In', 'the', 'domain', 'of', 'natural', 'language', 'processing', '(', 'NLP', ')', ',', 'statistical', 'NLP', 'in', 'particular', ',', 'there', "'s", 'a', 'need', 'to', 'train', 'the', 'model', 'or', 'algorithm', 'with', 'lots', 'of', 'data', '.']
['For', 'this', 'purpose', ',', 'researchers', 'have', 'assembled', 'many', 'text', 'corpora', '.']
['A', 'common', 'corpus', 'is', 'also', 'useful', 'for', 'benchmarking', 'models', '.']


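wordpunct_tokenize

`wordpunct_tokenize` splits text into runs of alphanumeric characters and runs of punctuation, so `there's` becomes three tokens and adjacent punctuation such as `),` stays together as one token.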

In [None]:
word_punct_tokens = wordpunct_tokenize(corpus)
word_punct_tokens

['In',
 'the',
 'domain',
 'of',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 '),',
 'statistical',
 'NLP',
 'in',
 'particular',
 ',',
 'there',
 "'",
 's',
 'a',
 'need',
 'to',
 'train',
 'the',
 'model',
 'or',
 'algorithm',
 'with',
 'lots',
 'of',
 'data',
 '.',
 'For',
 'this',
 'purpose',
 ',',
 'researchers',
 'have',
 'assembled',
 'many',
 'text',
 'corpora',
 '.',
 'A',
 'common',
 'corpus',
 'is',
 'also',
 'useful',
 'for',
 'benchmarking',
 'models',
 '.']

In [None]:
# Tokenize one sentence at a time, without overwriting the `words` list defined earlier.
for sent in sentence:
    print(wordpunct_tokenize(sent))

['In', 'the', 'domain', 'of', 'natural', 'language', 'processing', '(', 'NLP', '),', 'statistical', 'NLP', 'in', 'particular', ',', 'there', "'", 's', 'a', 'need', 'to', 'train', 'the', 'model', 'or', 'algorithm', 'with', 'lots', 'of', 'data', '.']
['For', 'this', 'purpose', ',', 'researchers', 'have', 'assembled', 'many', 'text', 'corpora', '.']
['A', 'common', 'corpus', 'is', 'also', 'useful', 'for', 'benchmarking', 'models', '.']


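TreebankWordTokenizer

`TreebankWordTokenizer` expects a single sentence as input: it separates a period only at the very end of the string, so in the output below `data.` and `corpora.` keep their periods while the final `.` becomes its own token.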

In [None]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['In',
 'the',
 'domain',
 'of',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 ',',
 'statistical',
 'NLP',
 'in',
 'particular',
 ',',
 'there',
 "'s",
 'a',
 'need',
 'to',
 'train',
 'the',
 'model',
 'or',
 'algorithm',
 'with',
 'lots',
 'of',
 'data.',
 'For',
 'this',
 'purpose',
 ',',
 'researchers',
 'have',
 'assembled',
 'many',
 'text',
 'corpora.',
 'A',
 'common',
 'corpus',
 'is',
 'also',
 'useful',
 'for',
 'benchmarking',
 'models',
 '.']

WhitespaceTokenizer

In [None]:
list(WhitespaceTokenizer().span_tokenize(corpus))

[(1, 3),
 (4, 7),
 (8, 14),
 (15, 17),
 (18, 25),
 (26, 34),
 (35, 45),
 (46, 52),
 (53, 64),
 (65, 68),
 (69, 71),
 (72, 83),
 (84, 91),
 (92, 93),
 (94, 98),
 (99, 101),
 (102, 107),
 (108, 111),
 (112, 117),
 (118, 120),
 (121, 130),
 (131, 135),
 (136, 140),
 (141, 143),
 (144, 149),
 (150, 153),
 (154, 158),
 (159, 167),
 (168, 179),
 (180, 184),
 (185, 194),
 (195, 199),
 (200, 204),
 (205, 213),
 (214, 215),
 (216, 222),
 (223, 229),
 (230, 232),
 (233, 237),
 (238, 244),
 (245, 248),
 (249, 261),
 (262, 269)]
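
Each pair above is a `(start, end)` character offset into `corpus`, so slicing the original string with a span recovers the token. The sketch below reproduces whitespace-delimited spans with the standard-library `re` module rather than NLTK, using only the opening words of the corpus (`sample` is a name chosen for this example).

```python
import re

# The opening words of the corpus, including its leading newline.
sample = "\nIn the domain of natural language processing"

# Each run of non-whitespace characters is one token; record its offsets.
spans = [match.span() for match in re.finditer(r"\S+", sample)]
print(spans[:3])  # [(1, 3), (4, 7), (8, 14)]

# Slicing the original string with a span recovers the token text.
print([sample[start:end] for start, end in spans[:3]])  # ['In', 'the', 'domain']
```

Offsets like these are useful when annotations need to point back to exact positions in the original text.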

In [None]:
whitespace = WhitespaceTokenizer()
whitespace.tokenize(corpus)

['In',
 'the',
 'domain',
 'of',
 'natural',
 'language',
 'processing',
 '(NLP),',
 'statistical',
 'NLP',
 'in',
 'particular,',
 "there's",
 'a',
 'need',
 'to',
 'train',
 'the',
 'model',
 'or',
 'algorithm',
 'with',
 'lots',
 'of',
 'data.',
 'For',
 'this',
 'purpose,',
 'researchers',
 'have',
 'assembled',
 'many',
 'text',
 'corpora.',
 'A',
 'common',
 'corpus',
 'is',
 'also',
 'useful',
 'for',
 'benchmarking',
 'models.']

Using spaCy

In [1]:
import spacy
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [4]:
# Tokenize the text using spaCy
doc = nlp(corpus)

In [14]:
# Extract the text of each token
tokens = [token.text for token in doc]

In [15]:
print(tokens)

['\n', 'In', 'the', 'domain', 'of', 'natural', 'language', 'processing', '(', 'NLP', ')', ',', 'statistical', 'NLP', 'in', 'particular', ',', 'there', "'s", 'a', 'need', 'to', 'train', 'the', 'model', 'or', 'algorithm', 'with', 'lots', 'of', 'data', '.', '\n', 'For', 'this', 'purpose', ',', 'researchers', 'have', 'assembled', 'many', 'text', 'corpora', '.', '\n', 'A', 'common', 'corpus', 'is', 'also', 'useful', 'for', 'benchmarking', 'models', '.', '\n']
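
Unlike the NLTK tokenizers above, spaCy keeps the newline characters as separate tokens. If they are not needed, they can be filtered out. The sketch below works on a short, made-up list of strings so that it runs without spaCy; on a real `Doc`, checking `token.is_space` serves the same purpose.

```python
# A shortened, made-up stand-in for a spaCy token list that contains newlines.
tokens = ["\n", "In", "the", "domain", ".", "\n", "For", "this", "purpose", ".", "\n"]

# str.strip() returns an empty (falsy) string for whitespace-only tokens.
content_tokens = [tok for tok in tokens if tok.strip()]

print(content_tokens)
# ['In', 'the', 'domain', '.', 'For', 'this', 'purpose', '.']
```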
