# Tokenization - Text Preprocessing

**Goal**: This notebook introduces the fundamental concept of tokenization in natural language processing (NLP). Tokenization is the process of splitting text into smaller units, such as words or sentences, which can be further processed by machine learning models.

**Context**: Tokenization is one of the first steps in most NLP pipelines. It converts raw text into tokens, the units that downstream models analyze. Knowing how text is tokenized matters in information retrieval, language modeling, and text classification, because every later step sees only the tokens. In this notebook, we compare several NLTK tokenizers and show how their outputs differ on the same text.


In [4]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\bleew\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [2]:
corpus = """
    Hello.
    Welcome to this example of Tokenization! I'm on my way to become an expert in NLP.
    Please do look at the entire file.
"""
print(corpus)


    Hello.
    Welcome to this example of Tokenization! I'm on my way to become an expert in NLP.
    Please do look at the entire file.



In [6]:
# Tokenization
# Paragraph -> sentences
from nltk.tokenize import sent_tokenize
documents = sent_tokenize(corpus)

In [7]:
type(documents)

list

In [9]:
for sentence in documents:
    print(sentence)


    Hello.
Welcome to this example of Tokenization!
I'm on my way to become an expert in NLP.
Please do look at the entire file.
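
The printed output above suggests that the first sentence keeps the leading newline and indentation of the triple-quoted `corpus`. A minimal sketch of one way to clean that up follows. It assumes the list has the shape shown in the output; `documents_sample` and `clean_sentences` are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for the `documents` list printed above; the first
# entry is assumed to keep the leading newline and indentation of the corpus.
documents_sample = [
    "\n    Hello.",
    "Welcome to this example of Tokenization!",
    "I'm on my way to become an expert in NLP.",
    "Please do look at the entire file.",
]

# str.strip() removes leading and trailing whitespace, including newlines.
clean_sentences = [sentence.strip() for sentence in documents_sample]

for sentence in clean_sentences:
    print(sentence)
```

Stripping after sentence splitting, rather than before, leaves the corpus itself untouched.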


In [10]:
# Tokenization
# Paragraph -> words (this cell tokenizes the whole corpus in one call)
# Sentence -> words (the next cell tokenizes each sentence separately)
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['Hello',
 '.',
 'Welcome',
 'to',
 'this',
 'example',
 'of',
 'Tokenization',
 '!',
 'I',
 "'m",
 'on',
 'my',
 'way',
 'to',
 'become',
 'an',
 'expert',
 'in',
 'NLP',
 '.',
 'Please',
 'do',
 'look',
 'at',
 'the',
 'entire',
 'file',
 '.']

In [12]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', '.']
['Welcome', 'to', 'this', 'example', 'of', 'Tokenization', '!']
['I', "'m", 'on', 'my', 'way', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
['Please', 'do', 'look', 'at', 'the', 'entire', 'file', '.']
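
For comparison, here is a rough baseline that uses only Python's built-in `str.split()`, which splits on whitespace. It is a sketch to show why a dedicated tokenizer is useful, not a recommended approach; `whitespace_tokens` is a name introduced here for illustration.

```python
sentence = "I'm on my way to become an expert in NLP."

# Splitting on whitespace leaves punctuation attached to the neighbouring
# word and keeps the contraction "I'm" as a single token.
whitespace_tokens = sentence.split()
print(whitespace_tokens)
```

Unlike the `word_tokenize` output above, the final token here is `'NLP.'` with the period still attached, and `"I'm"` is not separated into `'I'` and `"'m"`.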


In [13]:
# Splits on every punctuation character, so "I'm" becomes 'I', "'", 'm'
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Hello',
 '.',
 'Welcome',
 'to',
 'this',
 'example',
 'of',
 'Tokenization',
 '!',
 'I',
 "'",
 'm',
 'on',
 'my',
 'way',
 'to',
 'become',
 'an',
 'expert',
 'in',
 'NLP',
 '.',
 'Please',
 'do',
 'look',
 'at',
 'the',
 'entire',
 'file',
 '.']
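
The behaviour above can be approximated with a single regular expression. The sketch below uses the pattern that, to my knowledge, NLTK's `WordPunctTokenizer` is built on; treat that as an assumption and check it against your NLTK version. `regex_wordpunct` is a hypothetical helper name, not part of NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace (that is, punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_wordpunct(text):
    """Return word and punctuation runs in the order they appear."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = regex_wordpunct("I'm on my way to become an expert in NLP.")
print(tokens)
```

Because the apostrophe is not a word character, `"I'm"` is split into three tokens, which matches the `'I'`, `"'"`, `'m'` sequence in the output above.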

In [15]:
# Splits a period off only at the end of the input, so 'Hello.' and 'NLP.' keep theirs
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello.',
 'Welcome',
 'to',
 'this',
 'example',
 'of',
 'Tokenization',
 '!',
 'I',
 "'m",
 'on',
 'my',
 'way',
 'to',
 'become',
 'an',
 'expert',
 'in',
 'NLP.',
 'Please',
 'do',
 'look',
 'at',
 'the',
 'entire',
 'file',
 '.']
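
To see exactly where the Treebank output differs from the earlier `word_tokenize` output, the two token lists can be compared directly. The sketch below copies both lists from the outputs shown in this notebook instead of calling NLTK again; `word_tokens`, `treebank_tokens`, and `attached` are names introduced here for illustration.

```python
# Token lists copied from the word_tokenize and TreebankWordTokenizer
# outputs shown in this notebook.
word_tokens = [
    'Hello', '.', 'Welcome', 'to', 'this', 'example', 'of', 'Tokenization',
    '!', 'I', "'m", 'on', 'my', 'way', 'to', 'become', 'an', 'expert', 'in',
    'NLP', '.', 'Please', 'do', 'look', 'at', 'the', 'entire', 'file', '.',
]
treebank_tokens = [
    'Hello.', 'Welcome', 'to', 'this', 'example', 'of', 'Tokenization',
    '!', 'I', "'m", 'on', 'my', 'way', 'to', 'become', 'an', 'expert', 'in',
    'NLP.', 'Please', 'do', 'look', 'at', 'the', 'entire', 'file', '.',
]

# Tokens that appear only in the Treebank output.
seen = set(word_tokens)
attached = [token for token in treebank_tokens if token not in seen]
print(attached)
```

The only differences are the sentence-internal periods, which stay attached. `TreebankWordTokenizer` expects text that has already been split into sentences, so passing each item of `documents` to `tokenizer.tokenize` separately, as the `word_tokenize` loop above does, should avoid this.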