# Tokenization

In this notebook we will learn tokenization. There are four terms we use in tokenization.
 1. **Corpus** - The full body of text being processed; in this notebook it is a single paragraph.
 2. **Documents** - The individual sentences that make up the *corpus*.
 3. **Vocabulary** - The set of unique words in the *corpus*.
 4. **Words** - All the words in the *corpus*, including repeats.

In this notebook we will cover tokenization at each level, from documents down to words. We will be using the **NLTK** library to perform tokenization.

In [2]:
!pip install nltk



In [3]:
import nltk
# Download the Punkt models that sent_tokenize and word_tokenize rely on
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rahul\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
corpus = """Natural Language Processing is a fascinating field of AI.
Tokenization is the first step in NLP preprocessing.
Python provides various libraries for NLP, such as NLTK and spaCy.
Text data needs to be cleaned before applying machine learning models.
"""

In [5]:
print(corpus)

Natural Language Processing is a fascinating field of AI.
Tokenization is the first step in NLP preprocessing.
Python provides various libraries for NLP, such as NLTK and spaCy.
Text data needs to be cleaned before applying machine learning models.



## Paragraph --> sentences

In [7]:
from nltk.tokenize import sent_tokenize

In [8]:
documents = sent_tokenize(corpus)
type(documents)

list

In [9]:
for document in documents:
    print(document)

Natural Language Processing is a fascinating field of AI.
Tokenization is the first step in NLP preprocessing.
Python provides various libraries for NLP, such as NLTK and spaCy.
Text data needs to be cleaned before applying machine learning models.
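
The corpus above is easy to split because every period ends a sentence. As a rough illustration of why a trained sentence tokenizer is useful, the sketch below uses a naive regular-expression split (not NLTK) on a hypothetical sentence containing an abbreviation. The sample text and the `naive_sent_split` helper are assumptions made for this example only.

```python
import re

def naive_sent_split(text):
    """Naive stand-in: split after '.', '!' or '?' followed by whitespace."""
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text.strip()) if piece]

# Hypothetical sample text, not part of the notebook's corpus.
sample = "Dr. Smith teaches NLP. It is fun."
naive_sentences = naive_sent_split(sample)

# The abbreviation "Dr." is wrongly treated as a sentence boundary.
for sentence in naive_sentences:
    print(sentence)
```

The naive rule produces three pieces instead of two. Punkt, the model behind *sent_tokenize*, is designed to cope with abbreviations like this one.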


## Paragraph --> words
## Sentence --> words

In [11]:
from nltk.tokenize import word_tokenize

In [12]:
words = word_tokenize(corpus)
print(words)

['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', 'preprocessing', '.', 'Python', 'provides', 'various', 'libraries', 'for', 'NLP', ',', 'such', 'as', 'NLTK', 'and', 'spaCy', '.', 'Text', 'data', 'needs', 'to', 'be', 'cleaned', 'before', 'applying', 'machine', 'learning', 'models', '.']


In [13]:
for document in documents:
    print(word_tokenize(document))

['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.']
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', 'preprocessing', '.']
['Python', 'provides', 'various', 'libraries', 'for', 'NLP', ',', 'such', 'as', 'NLTK', 'and', 'spaCy', '.']
['Text', 'data', 'needs', 'to', 'be', 'cleaned', 'before', 'applying', 'machine', 'learning', 'models', '.']
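
With per-sentence tokens in hand, the *Words* and *Vocabulary* terms from the introduction can be made concrete. The sketch below copies the tokens of the first two sentences from the output above (so it runs without NLTK) and treats punctuation-only tokens as non-words, which is an assumption of this example rather than something NLTK does for you.

```python
from collections import Counter
import string

# Tokens copied from the per-sentence output above (first two sentences only).
sentence_tokens = [
    ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.'],
    ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', 'preprocessing', '.'],
]

# Flatten the per-sentence lists into one token list.
all_tokens = [token for sentence in sentence_tokens for token in sentence]

# Assumption: a token made up only of punctuation characters is not a word.
word_tokens = [
    token for token in all_tokens
    if not all(ch in string.punctuation for ch in token)
]

vocabulary = set(word_tokens)        # unique words
token_counts = Counter(word_tokens)  # how often each word occurs

print(len(all_tokens), len(word_tokens), len(vocabulary))
print(token_counts['is'])
```

The vocabulary is smaller than the word list because "is" appears in both sentences.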


---
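Below we use a different tokenizer from NLTK -> *wordpunct_tokenize*.

This function splits the text into runs of word characters and runs of punctuation, so punctuation marks always become separate tokens. For this corpus the result is the same as with *word_tokenize*, as observed in the output below.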

In [15]:
from nltk.tokenize import wordpunct_tokenize

In [16]:
# Use wordpunct_tokenize here (the imported function), not word_tokenize
words_punct = wordpunct_tokenize(corpus)
print(words_punct)

['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', 'preprocessing', '.', 'Python', 'provides', 'various', 'libraries', 'for', 'NLP', ',', 'such', 'as', 'NLTK', 'and', 'spaCy', '.', 'Text', 'data', 'needs', 'to', 'be', 'cleaned', 'before', 'applying', 'machine', 'learning', 'models', '.']
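
The corpus contains no apostrophes or decimal numbers, so it does not show what is distinctive about *wordpunct_tokenize*. The NLTK documentation describes it as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module instead of calling NLTK, so treat it as an approximation. The sample sentence and the `wordpunct_like` helper are hypothetical.

```python
import re

# Pattern documented for NLTK's wordpunct_tokenize:
# runs of word characters, or runs of non-word, non-space characters.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_like(text):
    """Approximate wordpunct_tokenize using only the standard library."""
    return re.findall(WORDPUNCT_PATTERN, text)

# Hypothetical sample with a contraction and a decimal amount.
punct_sample = "Don't pay $3.50."
punct_tokens = wordpunct_like(punct_sample)
print(punct_tokens)
```

Every punctuation mark is cut out as its own token, so the contraction and the decimal number are both broken apart.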


---

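We can also use *TreebankWordTokenizer*. It assumes the text has already been split into sentences, so when it is given the whole corpus it leaves each sentence-internal "." attached to the preceding word (for example 'AI.') and treats only the last "." (full stop) as a separate token.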

In [18]:
from nltk.tokenize import TreebankWordTokenizer

In [19]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Natural',
 'Language',
 'Processing',
 'is',
 'a',
 'fascinating',
 'field',
 'of',
 'AI.',
 'Tokenization',
 'is',
 'the',
 'first',
 'step',
 'in',
 'NLP',
 'preprocessing.',
 'Python',
 'provides',
 'various',
 'libraries',
 'for',
 'NLP',
 ',',
 'such',
 'as',
 'NLTK',
 'and',
 'spaCy.',
 'Text',
 'data',
 'needs',
 'to',
 'be',
 'cleaned',
 'before',
 'applying',
 'machine',
 'learning',
 'models',
 '.']
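
The behaviour above follows from handling the period only at the very end of the input. The sketch below mimics that single rule with a simplified regular expression. It is not NLTK's implementation, and the `split_final_period` helper and sample sentences are hypothetical. It also suggests why tokenizing sentence by sentence gives every full stop its own token.

```python
import re

def split_final_period(text):
    """Simplified stand-in: detach only a period at the very end of the text."""
    spaced = re.sub(r"([^.])(\.)\s*$", r"\1 \2", text)
    return spaced.split()

# Whole text at once: only the last period becomes a separate token.
whole_text_tokens = split_final_period("AI is fun. NLP is too.\n")
print(whole_text_tokens)

# One sentence at a time: each sentence's period is detached.
per_sentence_tokens = [split_final_period(s) for s in ["AI is fun.", "NLP is too."]]
print(per_sentence_tokens)
```

In the notebook, the equivalent step would be to call `tokenizer.tokenize(document)` for each entry of `documents` rather than passing `corpus` directly.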