### Tokenization

Tokenization is usually the first step in a Natural Language Processing (NLP) pipeline: it breaks a text corpus into smaller units called tokens.

Two types of tokenization are covered here:

1.  Sentence Tokenization: the process of breaking a paragraph into sentences.

    Example input: "I love Data Science. It is amazing!"

    Example output: ["I love Data Science.", "It is amazing!"]

2.  Word Tokenization: the process of breaking a sentence into words. With NLTK's word_tokenize, punctuation marks become separate tokens.

    Example input: "I love Data Science."

    Example output: ["I", "love", "Data", "Science", "."]
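To make the two definitions concrete, here is a rough sketch that uses only Python's built-in re module. It is not the algorithm NLTK uses: the helper names naive_sent_tokenize and naive_word_tokenize are made up for illustration, and the sentence splitter assumes the text contains no abbreviations such as "Dr." or "e.g.".

```python
import re

def naive_sent_tokenize(text):
    # Split at whitespace that follows '.', '!' or '?'.
    # Assumption: the text has no abbreviations like "Dr." or "e.g.".
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(sentence):
    # A run of word characters is one token; every other
    # non-space character (punctuation) is its own token.
    return re.findall(r'\w+|[^\w\s]', sentence)

paragraph = "I love Data Science. It is amazing!"

demo_sentences = naive_sent_tokenize(paragraph)
print(demo_sentences)
# ['I love Data Science.', 'It is amazing!']

demo_words = naive_word_tokenize(demo_sentences[0])
print(demo_words)
# ['I', 'love', 'Data', 'Science', '.']
```

A rule this simple breaks on abbreviations and decimal numbers, which is the reason for using a trained sentence tokenizer in the steps that follow.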


### Steps in this Walkthrough

1.  Import all the necessary libraries

2.  Download the resource required by nltk

3.  Create a Sample Text Dataset

4.  Perform the Sentence Tokenization

5.  Perform the Word Tokenization

6.  Tokenize with RegexpTokenizer


### Step 1:  Import all the necessary libraries

In [80]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk

from nltk.tokenize import sent_tokenize, word_tokenize

### OBSERVATIONS:

1.  numpy: numerical array computation

2.  pandas: data creation and manipulation

3.  matplotlib: data visualization

4.  seaborn: statistical data visualization built on top of matplotlib

5.  nltk: the Natural Language Toolkit, which plays an important role in text preprocessing

6.  tokenize: the nltk submodule that contains the tokenizers

7.  sent_tokenize: breaks paragraphs into sentences

8.  word_tokenize: breaks sentences into words and punctuation tokens

Only nltk and the names imported from nltk.tokenize are used in the steps shown in this section.

### Step 2: Download the resource required by nltk

In [81]:
nltk.download('punkt_tab')    

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### OBSERVATIONS:

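1.  punkt_tab is an NLTK data package, not a code module. It holds the pre-trained parameters of the Punkt sentence tokenizer, which sent_tokenize and word_tokenize load to decide where sentences end. The output above shows that the package was already present on this machine, so nothing was downloaded again.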

### Step 3: Create a Sample Text Dataset

In [82]:
text = """Tokenization is the first step in text processing. 
It breaks the text into smaller parts called tokens — words, phrases, or symbols.
This helps machines understand human language more effectively!
"""

In [83]:
print(text)

Tokenization is the first step in text processing. 
It breaks the text into smaller parts called tokens — words, phrases, or symbols.
This helps machines understand human language more effectively!



### OBSERVATIONS:

1. A sample text is defined as a triple-quoted string. It contains three sentences, each on its own line.

### Step 4: Perform the Sentence Tokenization

In [84]:
print("-----------------Printing the Original text----------------------------------------------")

print(text)

print("-----------------Performing Sentence tokenization on the paragraphs texts-----------------")

sentences = sent_tokenize(text)

print(sentences)

-----------------Printing the Original text----------------------------------------------
Tokenization is the first step in text processing. 
It breaks the text into smaller parts called tokens — words, phrases, or symbols.
This helps machines understand human language more effectively!

-----------------Performing Sentence tokenization on the paragraphs texts-----------------
['Tokenization is the first step in text processing.', 'It breaks the text into smaller parts called tokens — words, phrases, or symbols.', 'This helps machines understand human language more effectively!']


### OBSERVATIONS:

1. The original text is printed first.

2. Sentence tokenization is applied to the whole sample text, splitting it at sentence boundaries.

3. The output is a Python list of three strings, one per sentence.

### Step 5: Perform the Word Tokenization

In [85]:
# Already imported in Step 1; repeated so that this cell can run on its own
from nltk.tokenize import word_tokenize

# Join the list of sentences back into a single string

sent = ' '.join(sentences)


print("-----------------------------Original Sentences in texts---------------------------------------------")

print(sent)

print("-----------------------------Word Tokenization-------------------------------------------------------")

ans = word_tokenize(sent)

print(ans)

-----------------------------Original Sentences in texts---------------------------------------------
Tokenization is the first step in text processing. It breaks the text into smaller parts called tokens — words, phrases, or symbols. This helps machines understand human language more effectively!
-----------------------------Word Tokenization-------------------------------------------------------
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'processing', '.', 'It', 'breaks', 'the', 'text', 'into', 'smaller', 'parts', 'called', 'tokens', '—', 'words', ',', 'phrases', ',', 'or', 'symbols', '.', 'This', 'helps', 'machines', 'understand', 'human', 'language', 'more', 'effectively', '!']


### OBSERVATIONS:

1. The list of sentences is joined with single spaces into one string, which restores the original text without its line breaks.

2. Word tokenization is applied to that string, splitting it into individual tokens.

3. The output is a list of tokens in which punctuation marks ('.', ',', '!' and the dash) appear as separate tokens.
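Since the word tokens still include punctuation, one possible follow-up is to filter those tokens out with plain Python. The sketch below is only an illustration: the names tokens and words_only are made up, and the token list is copied by hand from the second sentence of the output above. It checks for letters or digits rather than comparing against string.punctuation, because string.punctuation covers ASCII characters only and would miss the long dash in the sample text.

```python
# Tokens copied from the word_tokenize output for the second sample sentence.
# '\u2014' is the long dash that appears in the sample text.
tokens = ['It', 'breaks', 'the', 'text', 'into', 'smaller', 'parts', 'called',
          'tokens', '\u2014', 'words', ',', 'phrases', ',', 'or', 'symbols', '.']

# Keep a token only if it contains at least one letter or digit.
words_only = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(words_only)
print(len(tokens), "tokens ->", len(words_only), "words")
# 17 tokens -> 13 words
```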

### Step 6: Tokenize with RegexpTokenizer

In [86]:
from nltk.tokenize import RegexpTokenizer

# Create a regular-expression tokenizer that matches runs of word characters

reg = RegexpTokenizer(r'\w+')


print("-----------------Printing the Original text----------------------------------------------")

print(text)

print("-----------------Printing the Cleaned text after the tokenization of the regular expression text---------------------------------------------")
# Tokenize the text with the regular-expression tokenizer

print(reg.tokenize(text))

-----------------Printing the Original text----------------------------------------------
Tokenization is the first step in text processing. 
It breaks the text into smaller parts called tokens — words, phrases, or symbols.
This helps machines understand human language more effectively!

-----------------Printing the Cleaned text after the tokenization of the regular expression text---------------------------------------------
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'processing', 'It', 'breaks', 'the', 'text', 'into', 'smaller', 'parts', 'called', 'tokens', 'words', 'phrases', 'or', 'symbols', 'This', 'helps', 'machines', 'understand', 'human', 'language', 'more', 'effectively']


### OBSERVATIONS:

1. A RegexpTokenizer object is created with the pattern \w+.

   (i)  \w: matches a single letter, digit, or underscore

   (ii) +: matches one or more occurrences of the preceding element


2. The same sample text is used as input.

3. The tokenizer keeps only the runs of characters that match the pattern, so punctuation such as ',', '.', '!' and the dash is dropped.

4. The output is a list of words.
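The behaviour of the \w+ pattern can be checked without NLTK. With its default settings, RegexpTokenizer returns the matches of its pattern, so re.findall from the standard library should give the same tokens for this pattern. The sample string and the second pattern below are made up for illustration; they show that \w also matches underscores and digits, and that apostrophes and decimal points split a token.

```python
import re

sample = "user_id 42 isn't $5.00!"

# \w+ : runs of letters, digits, or underscores; everything else separates tokens.
plain_tokens = re.findall(r"\w+", sample)
print(plain_tokens)
# ['user_id', '42', 'isn', 't', '5', '00']

# Hypothetical alternative pattern that keeps one internal apostrophe.
apostrophe_tokens = re.findall(r"\w+(?:'\w+)?", sample)
print(apostrophe_tokens)
# ['user_id', '42', "isn't", '5', '00']
```

The right pattern depends on the data: \w+ is enough for the sample text in this section, while text with contractions, prices, or identifiers may need a more specific expression.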