<a href="https://colab.research.google.com/github/niksom406/Learning_NLP/blob/main/Tokenization_Example_Using_NLTK_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates different tokenization techniques using the NLTK library.

**Tokenization** is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, sentences, or subword units.
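
As a rough illustration of these granularities, the sketch below splits a short string using plain Python and the standard `re` module. It is a naive approximation for intuition only, not how NLTK tokenizes; the sample text and the regular expressions are assumptions made for this example.

```python
import re

text = "Tokenization is fun. It splits text into units."

# Sentence-level tokens: split after ., ! or ? when followed by whitespace (naive).
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word-level tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens, the smallest possible subword unit.
characters = list("fun")

print(sentences)
print(words)
print(characters)
```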

### Installing the NLTK library

First, we install the Natural Language Toolkit (NLTK) library using pip.

In [None]:
!pip install nltk



### Defining the Corpus

Here we define a sample text, which we will refer to as the "corpus," for demonstrating the tokenization process.

In [None]:
corpus = """Hello Welcome, This is a demo to help me learn tokenization topic.
Will be working various such projects. This is my first lesson in NLP.
Stay Tuned in Nikita's NLP Learning Experience.
"""

### Printing the Corpus

We print the corpus to see the original text.

In [None]:
print(corpus)

Hello Welcome, This is a demo to help me learn tokenization topic. 
Will be working various such projects. This is my first lesson in NLP.
Stay Tuned in Nikita's NLP Learning Experience.



### Sentence Tokenization

Sentence tokenization is the process of splitting a text into individual sentences. We use the `sent_tokenize` function from NLTK for this purpose.

In [None]:
## Tokenization
## Paragraph --> sentences
from nltk.tokenize import sent_tokenize

### Downloading NLTK data

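Some NLTK functions, like `sent_tokenize`, require specific data packages. We download the `punkt` tokenizer models here. NLTK 3.9 and later load the `punkt_tab` resource instead, so if `sent_tokenize` raises a `LookupError`, also run `nltk.download('punkt_tab')`.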

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
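
In a notebook that is re-run often, it can be convenient to download a resource only when it is missing. The following is a minimal sketch under that assumption; `ensure_nltk_resource` is a hypothetical helper name, not part of NLTK, while `nltk.data.find` and `nltk.download` are the library calls it wraps.

```python
import nltk


def ensure_nltk_resource(resource_path, package):
    """Return True if the resource is available locally after this call.

    Hypothetical helper: checks the local NLTK data directories first and
    downloads the package only when the lookup fails.
    """
    try:
        # Raises LookupError when the resource is not installed locally.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return bool(nltk.download(package, quiet=True))
```

For example, `ensure_nltk_resource("tokenizers/punkt", "punkt")` would skip the download when the models are already in place.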

### Performing Sentence Tokenization

We apply `sent_tokenize` to our corpus and store the resulting list of sentences in the `documents` variable.

In [None]:
documents = sent_tokenize(corpus)

### Printing the Tokenized Sentences

We print the `documents` list to see the individual sentences.

In [None]:
print(documents)

['Hello Welcome, This is a demo to help me learn tokenization topic.', 'Will be working various such projects.', 'This is my first lesson in NLP.', "Stay Tuned in Nikita's NLP Learning Experience."]


In [None]:
type(documents)

list

In [None]:
for sentence in documents:
  print(sentence)

Hello Welcome, This is a demo to help me learn tokenization topic.
Will be working various such projects.
This is my first lesson in NLP.
Stay Tuned in Nikita's NLP Learning Experience.
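
To see why a trained sentence tokenizer is useful, compare it with a naive rule that splits after every `.`, `!`, or `?` followed by whitespace. The sketch below uses only the standard `re` module; `naive_sent_split` and the sample text are hypothetical, introduced for this illustration. The naive rule breaks on the abbreviation, which is the kind of case the Punkt models are designed to handle.

```python
import re


def naive_sent_split(text):
    """Split after ., ! or ? followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Dr. Rao teaches NLP. Her first lesson covers tokenization."

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
for sentence in naive_sent_split(sample):
    print(sentence)
```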


### Word Tokenization

Word tokenization is the process of splitting a text into individual words. NLTK provides several word tokenizers.

In [None]:
## Tokenization
## Paragraph --> words
## Sentence --> words
from nltk.tokenize import word_tokenize

### Using `word_tokenize` on the Corpus

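We use the `word_tokenize` function on the entire corpus to see how it tokenizes the text into words and punctuation. Punctuation marks become separate tokens, and the possessive `Nikita's` is split into `Nikita` and `'s`.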

In [None]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'This',
 'is',
 'a',
 'demo',
 'to',
 'help',
 'me',
 'learn',
 'tokenization',
 'topic',
 '.',
 'Will',
 'be',
 'working',
 'various',
 'such',
 'projects',
 '.',
 'This',
 'is',
 'my',
 'first',
 'lesson',
 'in',
 'NLP',
 '.',
 'Stay',
 'Tuned',
 'in',
 'Nikita',
 "'s",
 'NLP',
 'Learning',
 'Experience',
 '.']

### Using `word_tokenize` on Each Sentence

We iterate through the previously tokenized sentences and apply `word_tokenize` to each sentence to see word tokenization at the sentence level.

In [None]:
for sentence in documents:
  print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'This', 'is', 'a', 'demo', 'to', 'help', 'me', 'learn', 'tokenization', 'topic', '.']
['Will', 'be', 'working', 'various', 'such', 'projects', '.']
['This', 'is', 'my', 'first', 'lesson', 'in', 'NLP', '.']
['Stay', 'Tuned', 'in', 'Nikita', "'s", 'NLP', 'Learning', 'Experience', '.']
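
Because `word_tokenize` returns punctuation marks as separate tokens, a common follow-up step is to drop them. The sketch below does this with plain Python on the token list printed above for the last sentence; keeping any token that contains a letter or digit is an assumption made for this example, and other filtering rules may suit other tasks.

```python
# Tokens copied from the word_tokenize output for the last sentence.
tokens = ['Stay', 'Tuned', 'in', 'Nikita', "'s", 'NLP', 'Learning', 'Experience', '.']

# Keep a token if at least one of its characters is a letter or digit.
word_tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(word_tokens)
```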


### Treebank Word Tokenization

The `TreebankWordTokenizer` is another word tokenizer provided by NLTK. It is rule-based, follows the tokenization conventions of the Penn Treebank corpus, and needs no downloaded data.

In [None]:
from nltk.tokenize import TreebankWordTokenizer

### Creating a TreebankWordTokenizer Instance

We create an instance of the `TreebankWordTokenizer`.

In [None]:
tokenizer = TreebankWordTokenizer()

### Using TreebankWordTokenizer on the Corpus

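We use the `tokenize` method of the `TreebankWordTokenizer` instance on the corpus. Note how it handles periods differently from `word_tokenize`: the tokenizer treats its input as a single sentence, so only the final period is split off, while the periods that end earlier sentences stay attached to the preceding word (`topic.`, `projects.`, `NLP.`). `word_tokenize` avoids this by splitting the text into sentences first.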

In [None]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'This',
 'is',
 'a',
 'demo',
 'to',
 'help',
 'me',
 'learn',
 'tokenization',
 'topic.',
 'Will',
 'be',
 'working',
 'various',
 'such',
 'projects.',
 'This',
 'is',
 'my',
 'first',
 'lesson',
 'in',
 'NLP.',
 'Stay',
 'Tuned',
 'in',
 'Nikita',
 "'s",
 'NLP',
 'Learning',
 'Experience',
 '.']
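
One way to split off every sentence-final period with `TreebankWordTokenizer` is to tokenize one sentence at a time. The sketch below assumes the sentences have already been split; the `sentences` list is copied from the `sent_tokenize` output shown earlier so that the example does not depend on the `punkt` download.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences copied from the earlier sent_tokenize output.
sentences = [
    "Hello Welcome, This is a demo to help me learn tokenization topic.",
    "Will be working various such projects.",
    "This is my first lesson in NLP.",
    "Stay Tuned in Nikita's NLP Learning Experience.",
]

tokenizer = TreebankWordTokenizer()

# Each sentence is its own input, so each final period becomes a separate token.
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```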