# Tokenization example

## Using NLTK


In [1]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp310-cp310-win_amd64.whl.metadata (41 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.9.11-cp310-cp310-win_amd64.whl (274 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.9.1 regex-2024.9.11


In [8]:
## this step prevents the LookupError raised when the tokenizer data is missing
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [2]:
corpus="""Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [3]:
corpus

"Hello Welcome,to Krish Naik's NLP Tutorials.\nPlease do watch the entire course! to become expert in NLP.\n"

Observe the __\n__ in the output above. With `print()`, each __\n__ is rendered as a line break instead of being shown literally.

In [4]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [5]:
##  Tokenization
## Paragraph-->sentences
from nltk.tokenize import sent_tokenize

In [9]:
sent_tokenize(corpus)

["Hello Welcome,to Krish Naik's NLP Tutorials.",
 'Please do watch the entire course!',
 'to become expert in NLP.']

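We get a list of sentences. The tokenizer splits the paragraph at sentence-ending punctuation, here the full stop and the exclamation mark. The \n after the first full stop is whitespace between sentences and does not appear in the result.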
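
To make the splitting rule concrete, here is a rough sketch using only the standard library. It is not how `sent_tokenize` works internally (NLTK uses the pretrained Punkt model); the helper name `naive_sent_split` is hypothetical, and the regex is only an approximation that happens to give the same result on this corpus.

```python
import re

corpus = """Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

def naive_sent_split(text):
    # Hypothetical helper, not part of NLTK.
    # Split on whitespace that directly follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sentences = naive_sent_split(corpus)
for sentence in sentences:
    print(sentence)
```

A rule this simple would also split after abbreviations such as "Dr.", which is the kind of case Punkt is designed to handle, so `sent_tokenize` remains the better choice for real text.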

In [10]:
documents=sent_tokenize(corpus)

In [9]:
type(documents)

list

In [10]:
for sentence in documents:
    print(sentence)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.


In [11]:
## Tokenization 
## Paragraph-->words
## sentence--->words
from nltk.tokenize import word_tokenize

In [12]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

Punctuation marks such as the full stop, comma and exclamation mark are each treated as a separate token. The possessive in _Naik's_ is split into two tokens: `Naik` and `'s`.

We can also apply `word_tokenize()` on sentences.

In [13]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


Tokenizing each sentence separately lets us process every word while still knowing which sentence it came from.

We can also look at another function, `wordpunct_tokenize`.

In [14]:
from nltk.tokenize import wordpunct_tokenize

In [15]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

Observe here that the apostrophe has also been split off as a separate token, so _Naik's_ becomes `Naik`, `'` and `s`.
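
The behaviour above follows from a simple pattern. As far as the NLTK documentation describes it, `wordpunct_tokenize` splits text with the regular expression `\w+|[^\w\s]+`. The sketch below applies that pattern directly with the standard `re` module; the helper name `regex_wordpunct` is hypothetical and used only for illustration.

```python
import re

corpus = """Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

def regex_wordpunct(text):
    # Hypothetical helper, not part of NLTK.
    # \w+      : a run of letters, digits or underscores
    # [^\w\s]+ : a run of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = regex_wordpunct(corpus)
print(tokens)
```

Because the apostrophe is not a word character, it always ends up in its own token, which is why the possessive is split into three pieces here.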

Another option is the `TreebankWordTokenizer` class.

In [16]:
from nltk.tokenize import TreebankWordTokenizer

In [17]:
tokenizer=TreebankWordTokenizer()

In [18]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

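Observe above that a full stop inside the paragraph is not split off as a separate token (e.g. 'Tutorials.'); only the final full stop is separated (e.g. 'NLP', '.'). This is because `TreebankWordTokenizer` assumes its input is a single sentence, so it only treats the full stop at the very end as sentence-ending punctuation.

Also, observe that the apostrophe and the character following it are kept together as one token (e.g. "'s").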

Generally we use `word_tokenize` or `sent_tokenize`.

## Tokenization using spaCy

```
!pip install spacy
```

After this, we need to download the English language model for spaCy.

```
python -m spacy download en_core_web_sm
```

Once downloaded, the model can be loaded via `spacy.load('en_core_web_sm')`.

In [5]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
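
With the model installed, tokenization might look like the sketch below. It assumes `spacy` and `en_core_web_sm` are available in the current environment; the helper name `spacy_tokenize` is hypothetical, and the import sits inside the function only so that the definition itself does not require spaCy.

```python
def spacy_tokenize(text, model_name="en_core_web_sm"):
    # Hypothetical helper: returns (sentences, words) for the given text.
    import spacy  # imported here so spaCy is needed only when the function is called

    nlp = spacy.load(model_name)   # load the downloaded pipeline
    doc = nlp(text)                # run the pipeline; doc is a sequence of tokens

    sentences = [sent.text for sent in doc.sents]  # sentence segmentation
    words = [token.text for token in doc]          # word-level tokens
    return sentences, words
```

Calling `spacy_tokenize(corpus)` returns the sentences and the word tokens as two lists, which can be compared against the NLTK results above.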


Additional Reading:
1. [Datacamp blog - What is Tokenization?](https://www.datacamp.com/blog/what-is-tokenization)