## Implementation for Tokenization

Tokenization is the process of dividing text into smaller units called tokens. In natural language processing, tokens are typically words or subwords. Tokenization is a critical step in many NLP tasks, including text processing, language modelling, and machine translation. In practice, it means splitting a string into a list of tokens. What counts as a token depends on the level of granularity: a word is a token in a sentence, and a sentence is a token in a paragraph.
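To make the idea of "splitting a string into a list of tokens" concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive (for example, they would split after an abbreviation such as "Dr."); they are not how NLTK's tokenizers work internally.

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

paragraph = "Hello everyone. Welcome to Natural Language Processing."
sentences = simple_sent_tokenize(paragraph)   # sentences as tokens of a paragraph
words = simple_word_tokenize(sentences[0])    # words as tokens of a sentence
print(sentences)
print(words)
```

The same paragraph yields different token lists depending on the chosen granularity, which is the distinction the NLTK examples in this section rely on.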

In [2]:
# Sentence Tokenization using sent_tokenize

from nltk.tokenize import sent_tokenize
 
text = "Hello everyone. Welcome to Natural Language Processing. You are studying NLP article."
sent_tokenize(text)

['Hello everyone.',
 'Welcome to Natural Language Processing.',
 'You are studying NLP article.']

In [3]:
# Sentence Tokenization using PunktSentenceTokenizer
# The Punkt tokenizer is a data-driven sentence tokenizer that comes with NLTK.
# It is trained on a large corpus of text to identify sentence boundaries.

import nltk.data

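# Load the pre-trained English Punkt model from its pickle file.
# Assumes the 'punkt' resource has been downloaded, e.g. via nltk.download('punkt');
# recent NLTK releases distribute this model as 'punkt_tab', so the path may differ there.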
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

['Hello everyone.',
 'Welcome to Natural Language Processing.',
 'You are studying NLP article.']

In [4]:
# Word Tokenization using word_tokenize

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to Natural Language Processing (NLP)."
word_tokenize(text)

['Hello',
 'everyone',
 '.',
 'Welcome',
 'to',
 'Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 '.']

In [5]:
# Word Tokenization using TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)


['Hello',
 'everyone.',
 'Welcome',
 'to',
 'Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 '.']
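The two word-level outputs above differ at `'everyone.'`. `TreebankWordTokenizer` assumes its input is already a single sentence and detaches a period only at the very end of the string, whereas `word_tokenize` first splits the text into sentences and then tokenizes each one. The sketch below imitates that difference with the standard library only. The functions `toy_treebank_tokenize` and `toy_word_tokenize` are hypothetical stand-ins that cover just the rules needed for this example; they are not NLTK's actual implementation.

```python
import re

def toy_treebank_tokenize(sentence):
    # Put spaces around brackets so they become separate tokens.
    spaced = re.sub(r"([()\[\]{}])", r" \1 ", sentence)
    # Detach a period only when it ends the whole string.
    spaced = re.sub(r"\.\s*$", " .", spaced)
    return spaced.split()

def toy_word_tokenize(text):
    # Naive sentence split first, then tokenize each sentence separately.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [tok for sent in sentences for tok in toy_treebank_tokenize(sent)]

text = "Hello everyone. Welcome to Natural Language Processing (NLP)."
whole = toy_treebank_tokenize(text)      # keeps 'everyone.' as one token
per_sentence = toy_word_tokenize(text)   # splits 'everyone' and '.'
print(whole)
print(per_sentence)
```

When the input contains more than one sentence, splitting it into sentences before applying a Treebank-style tokenizer avoids periods staying attached to the words in the middle of the text.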