### The Natural Language Toolkit (NLTK) is a suite of Python libraries and programs for symbolic and statistical natural language processing (NLP), primarily for English.
### It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [17]:
# Importing Required Library
import nltk

In [18]:
# Downloading all NLTK data packages
# The tokenizers used below only need the Punkt sentence models ('punkt', or 'punkt_tab' in newer NLTK releases); 'all' fetches every corpus and model so nothing has to be downloaded one by one later
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True

### Tokenization is the process of breaking text into individual units called tokens, such as words, subwords, or sentences, so that later NLP steps can analyze them.

In [None]:
# Declaring text
text1 = "My Name is Xyz"
print("Original Text is: ",text1)
# Using Word tokenize
# This will tokenize the sentence into words and these words will be called tokens
tokens1 = nltk.word_tokenize(text1)
print("After Using Word Tokenize: ",tokens1)

Original Text is:  My Name is Xyz
After Using Word Tokenize:  ['My', 'Name', 'is', 'Xyz']


In [None]:
# Declaring text
text2 = "My Name is Xyz.I am learning NLP. We are doing first experiment"
print("Original Text is: ",text2)
# Using sent_tokenize (sentence tokenizer)
# A full stop must be followed by a space to be treated as a sentence boundary; "Xyz.I" has no space, so the first two sentences stay joined
# This splits the text into sentences, and each sentence becomes one token
tokens2 = nltk.sent_tokenize(text2)
print("After Using Sent Tokenize: ",tokens2)

Original Text is:  My Name is Xyz.I am learning NLP. We are doing first experiment
After Using Sent Tokenize:  ['My Name is Xyz.I am learning NLP.', 'We are doing first experiment']
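
The output above shows why the missing space matters. As a rough illustration only (NLTK's `sent_tokenize` relies on the trained Punkt model, not on this rule), a hypothetical helper `naive_sent_split` that breaks only where sentence-ending punctuation is followed by whitespace produces the same grouping for this text:

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split where '.', '!' or '?' is followed by whitespace.

    This is a simplified stand-in, not the Punkt algorithm used by NLTK.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Same text as in the cell above: no space after "Xyz."
text2 = "My Name is Xyz.I am learning NLP. We are doing first experiment"
print(naive_sent_split(text2))

# With the space restored, the first boundary is found as well
text2_fixed = "My Name is Xyz. I am learning NLP. We are doing first experiment"
print(naive_sent_split(text2_fixed))
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason a trained model is normally preferred.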


In [None]:
# The tokenizer functions can also be imported directly from nltk.tokenize, so they can be called without the nltk. prefix
# The Punkt data downloaded above is still required
from nltk.tokenize import word_tokenize,sent_tokenize

### The directly imported functions work the same way

In [None]:
# Declaring text
text1 = "My Name is Xyz"
print("Original Text is: ",text1)
# Using the directly imported word_tokenize
# This splits the sentence into words, and these words are called tokens
tokens1 = word_tokenize(text1)
print("After Using Word Tokenize: ",tokens1)

Original Text is:  My Name is Xyz
After Using Word Tokenize:  ['My', 'Name', 'is', 'Xyz']


In [8]:
# Declaring text
text2 = "My Name is Xyz. I am learning NLP. We are doing first experiment"
print("Original Text is: ",text2)
# Using the directly imported sent_tokenize
# sent_tokenize stands for sentence tokenize
# A full stop must be followed by a space to be treated as a sentence boundary; otherwise the sentences stay joined
# This splits the text into sentences, and each sentence becomes one token
tokens2 = sent_tokenize(text2)
print("After Using Sent Tokenize: ",tokens2)

Original Text is:  My Name is Xyz. I am learning NLP. We are doing first experiment
After Using Sent Tokenize:  ['My Name is Xyz.', 'I am learning NLP.', 'We are doing first experiment']


### Trying more tokenizers available in NLP
For these we have to import the required libraries as well.

1. **Whitespace Tokenizer**
   - **Library**: Python's `split()` or NLTK's `WhitespaceTokenizer`
   - Splits text on whitespace to create tokens.

2. **Word Tokenizer**
   - **Library**: NLTK's `word_tokenize` or spaCy
   - Splits text into words based on language-specific rules.

3. **Sentence Tokenizer**
   - **Library**: NLTK's `sent_tokenize` or spaCy
   - Splits text into sentences based on punctuation and grammar.

4. **Subword Tokenizer**
   - **Library**: Hugging Face's `tokenizers` or SentencePiece
   - Segments text into subword units, useful for languages with complex morphology.

5. **BERT Tokenizer**
   - **Library**: Hugging Face's `transformers`
   - Tokenizes text for BERT-based models, preserving subword information.

6. **Byte-Pair Encoding (BPE) Tokenizer**
   - **Library**: Hugging Face's `tokenizers` or GPT-2's `BPE`
   - Splits text into variable-length subword units, often used in neural machine translation.

7. **Regular Expression Tokenizer**
   - **Library**: Python's `re`
   - Tokenizes text based on custom-defined regular expressions.

8. **Penn Treebank Tokenizer**
   - **Library**: NLTK's `TreebankWordTokenizer`
   - Tokenizes text according to the Penn Treebank tokenization rules, often used in parsing tasks.

9. **Moses Tokenizer**
   - **Library**: `sacremoses`'s `MosesTokenizer` (a Python port, formerly shipped with NLTK) or the Moses SMT toolkit
   - Tokenizes text with a focus on statistical machine translation, handling complex tokenization tasks.

10. **Custom Tokenizer**
    - **Library**: Depends on the implementation
    - Allows for user-defined tokenization rules tailored to specific tasks or languages.
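
Two of the entries above need nothing beyond the Python standard library. The following is a minimal sketch comparing `str.split()` (whitespace tokenizer) with a regular-expression tokenizer built on `re`; the pattern shown is just one illustrative choice, not a standard:

```python
import re

text = "My Name is Xyz. I am learning NLP."

# Whitespace tokenizer: punctuation stays attached to the preceding word
whitespace_tokens = text.split()
print("Whitespace:", whitespace_tokens)

# Regular expression tokenizer: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print("Regex:     ", regex_tokens)
```

The regex version separates each full stop into its own token, while the whitespace version leaves `Xyz.` and `NLP.` intact.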

In [9]:
# We have already implemented word and sentence tokenizer now we can use the rest
# Declaring text
text = "My Name is Xyz. I am learning NLP. We are doing first experiment"
print("Original Text is: ",text)

# Using Whitespace tokenizer
# First we create an instance of the WhitespaceTokenizer class and then we can use it
wh_tk_instance = nltk.WhitespaceTokenizer()

# Now using whitespace tokenizer
wh_tk = wh_tk_instance.tokenize(text)
print("After Using Whitespace Tokenize: ",wh_tk)

Original Text is:  My Name is Xyz. I am learning NLP. We are doing first experiment
After Using Whitespace Tokenize:  ['My', 'Name', 'is', 'Xyz.', 'I', 'am', 'learning', 'NLP.', 'We', 'are', 'doing', 'first', 'experiment']


In [31]:
# We have already implemented word and sentence tokenizer now we can use the rest
# Declaring text
text = "My Name is Xyz. I am learning NLP. We are doing first experiment"
print("Original Text is: ",text)

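# Using Penn Treebank Tokenizer
# First we create an instance of the TreebankWordTokenizer class and then we can use it
penn_tk_instance = nltk.TreebankWordTokenizer()

# Now using the Penn Treebank tokenizer
# It expects one sentence at a time: only a period at the very end of the input is split off, so 'Xyz.' and 'NLP.' keep theirs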
penn_tk = penn_tk_instance.tokenize(text)
print("After Using Penn Treebank Tokenize: ",penn_tk)

Original Text is:  My Name is Xyz. I am learning NLP. We are doing first experiment
After Using Penn Treebank Tokenize:  ['My', 'Name', 'is', 'Xyz.', 'I', 'am', 'learning', 'NLP.', 'We', 'are', 'doing', 'first', 'experiment']
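
In the output above, `Xyz.` and `NLP.` keep their periods because the Treebank tokenizer assumes its input is a single sentence and detaches a period only at the very end of the string. The sketch below imitates just that one rule with the standard library; `split_final_period` is a hypothetical helper for illustration and does not reproduce the full Treebank rule set:

```python
import re

def split_final_period(text):
    """Hypothetical helper: detach a period only if it ends the whole string,
    then split on whitespace. A simplified imitation of one Treebank rule."""
    spaced = re.sub(r"([^.])(\.)\s*$", r"\1 \2", text)
    return spaced.split()

# Interior periods stay attached; only the last one becomes its own token
multi = split_final_period("My Name is Xyz. I am learning NLP.")
print(multi)

# One sentence at a time gives the intended result
single = split_final_period("I am learning NLP.")
print(single)
```

This suggests the usual workflow: split the text into sentences first, then pass each sentence to the word-level tokenizer.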
