## Tokenization in Natural Language Processing (NLP)
### Tokenization is a fundamental step in Natural Language Processing (NLP) that breaks a piece of text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the needs of the task.
#### NLTK (Natural Language Toolkit) is a Python library for working with human language data. It provides modules for classification, tokenization, stemming, tagging, parsing, and more, and is widely used for tasks such as sentiment analysis, text classification, and language modeling.
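
To make the idea of token granularity concrete, here is a minimal sketch that uses only built-in string operations (no NLTK). The variable names are illustrative, and subword tokenization is left out because it needs a learned vocabulary.

```python
text = "Taj Mahal is marble"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character, including spaces, becomes a token.
char_tokens = list(text)

print(word_tokens)       # ['Taj', 'Mahal', 'is', 'marble']
print(len(char_tokens))  # 19
```

The same text yields 4 tokens at word level and 19 at character level, which is why the choice of unit depends on the task.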

In [1]:
!pip install nltk





#### A corpus is a large, structured collection of texts used for linguistic analysis and natural language processing (NLP). It can consist of anything from news articles and books to tweets and transcripts of spoken language. Here, a single paragraph about the Taj Mahal serves as a small corpus.

In [2]:
corpus = "The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in Agra, Uttar Pradesh, India. It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan (1628-1658) to house the tomb of his beloved wife, Mumtaz Mahal; it also houses the tomb of Shah Jahan himself. The tomb is the centrepiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall. Construction of the mausoleum was completed in 1648, but work continued on other phases of the project for another five years. The first ceremony held at the mausoleum was an observance by Shah Jahan, on 6 February 1643, of the 12th anniversary of the death of Mumtaz Mahal. The Taj Mahal complex is believed to have been completed in its entirety in 1653 at a cost estimated at the time to be around ₹5 million, which in 2023 would be approximately ₹35 billion (US$77.8 million). The building complex incorporates the design traditions of Indo-Islamic and Mughal architecture. It employs symmetrical constructions with the usage of various shapes and symbols. While the mausoleum is constructed of white marble inlaid with semi-precious stones, red sandstone was used for other buildings in the complex similar to the Mughal era buildings of the time. The construction project employed more than 20,000 workers and artisans under the guidance of a board of architects led by Ustad Ahmad Lahori, the emperor's court architect. The Taj Mahal was designated as a UNESCO World Heritage Site in 1983 for being 'the jewel of Islamic art in India and one of the universally admired masterpieces of the world's heritage'. It is regarded as one of the best examples of Mughal architecture and a symbol of Indian history. The Taj Mahal is a major tourist attraction and attracts more than five million visitors a year. In 2007, it was declared a winner of the New 7 Wonders of the World initiative."

In [3]:
print(corpus)

The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in Agra, Uttar Pradesh, India. It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan (1628-1658) to house the tomb of his beloved wife, Mumtaz Mahal; it also houses the tomb of Shah Jahan himself. The tomb is the centrepiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall. Construction of the mausoleum was completed in 1648, but work continued on other phases of the project for another five years. The first ceremony held at the mausoleum was an observance by Shah Jahan, on 6 February 1643, of the 12th anniversary of the death of Mumtaz Mahal. The Taj Mahal complex is believed to have been completed in its entirety in 1653 at a cost estimated at the time to be around ₹5 million, which in 2023 would be approximately ₹35 billion (US$77.8 million). The building complex incorporates the de

In [4]:
from nltk.tokenize import sent_tokenize

In [5]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\itzsh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [6]:
documents = sent_tokenize(corpus)

In [7]:
type(documents)

list

### Sentence Tokenization
#### Converting Paragraphs to Sentences

In [8]:
for sentence in documents:
    print(sentence)

The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in Agra, Uttar Pradesh, India.
It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan (1628-1658) to house the tomb of his beloved wife, Mumtaz Mahal; it also houses the tomb of Shah Jahan himself.
The tomb is the centrepiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.
Construction of the mausoleum was completed in 1648, but work continued on other phases of the project for another five years.
The first ceremony held at the mausoleum was an observance by Shah Jahan, on 6 February 1643, of the 12th anniversary of the death of Mumtaz Mahal.
The Taj Mahal complex is believed to have been completed in its entirety in 1653 at a cost estimated at the time to be around ₹5 million, which in 2023 would be approximately ₹35 billion (US$77.8 million).
The building complex incorporates the de
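
`sent_tokenize` relies on the pretrained Punkt data downloaded above rather than on a single hand-written rule. The sketch below is a deliberately naive regex splitter (the helper name `naive_sent_split` is hypothetical, not part of NLTK) that shows why a simple rule is not enough: it works on plain sentences but breaks on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

simple = naive_sent_split("The tomb is in Agra. It was completed in 1648.")
print(simple)
# ['The tomb is in Agra.', 'It was completed in 1648.']

tricky = naive_sent_split("Dr. Smith visited Agra. He liked it.")
print(tricky)
# ['Dr.', 'Smith visited Agra.', 'He liked it.']
```

The second call wrongly treats the period in `Dr.` as a sentence boundary, which is the kind of case a trained sentence tokenizer is designed to handle.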

### Word Tokenization
#### Converting Paragraphs to Words

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
word_tokenize(corpus)

['The',
 'Taj',
 'Mahal',
 'is',
 'an',
 'ivory-white',
 'marble',
 'mausoleum',
 'on',
 'the',
 'right',
 'bank',
 'of',
 'the',
 'river',
 'Yamuna',
 'in',
 'Agra',
 ',',
 'Uttar',
 'Pradesh',
 ',',
 'India',
 '.',
 'It',
 'was',
 'commissioned',
 'in',
 '1631',
 'by',
 'the',
 'fifth',
 'Mughal',
 'emperor',
 ',',
 'Shah',
 'Jahan',
 '(',
 '1628-1658',
 ')',
 'to',
 'house',
 'the',
 'tomb',
 'of',
 'his',
 'beloved',
 'wife',
 ',',
 'Mumtaz',
 'Mahal',
 ';',
 'it',
 'also',
 'houses',
 'the',
 'tomb',
 'of',
 'Shah',
 'Jahan',
 'himself',
 '.',
 'The',
 'tomb',
 'is',
 'the',
 'centrepiece',
 'of',
 'a',
 '17-hectare',
 '(',
 '42-acre',
 ')',
 'complex',
 ',',
 'which',
 'includes',
 'a',
 'mosque',
 'and',
 'a',
 'guest',
 'house',
 ',',
 'and',
 'is',
 'set',
 'in',
 'formal',
 'gardens',
 'bounded',
 'on',
 'three',
 'sides',
 'by',
 'a',
 'crenellated',
 'wall',
 '.',
 'Construction',
 'of',
 'the',
 'mausoleum',
 'was',
 'completed',
 'in',
 '1648',
 ',',
 'but',
 'work',
 'co
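
For comparison, the following sketch uses only Python's built-in `str.split()` on a shortened sentence. Unlike the `word_tokenize` output above, whitespace splitting leaves punctuation glued to the neighbouring word.

```python
sentence = "The Taj Mahal is in Agra, Uttar Pradesh, India."

# Whitespace splitting does not separate punctuation from words.
tokens = sentence.split()
print(tokens)
# ['The', 'Taj', 'Mahal', 'is', 'in', 'Agra,', 'Uttar', 'Pradesh,', 'India.']

# Tokens that still carry punctuation.
attached = [t for t in tokens if not t.isalnum()]
print(attached)
# ['Agra,', 'Pradesh,', 'India.']
```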

#### Converting Sentences to Words

In [11]:
for sentence in documents:
    print(word_tokenize(sentence))

['The', 'Taj', 'Mahal', 'is', 'an', 'ivory-white', 'marble', 'mausoleum', 'on', 'the', 'right', 'bank', 'of', 'the', 'river', 'Yamuna', 'in', 'Agra', ',', 'Uttar', 'Pradesh', ',', 'India', '.']
['It', 'was', 'commissioned', 'in', '1631', 'by', 'the', 'fifth', 'Mughal', 'emperor', ',', 'Shah', 'Jahan', '(', '1628-1658', ')', 'to', 'house', 'the', 'tomb', 'of', 'his', 'beloved', 'wife', ',', 'Mumtaz', 'Mahal', ';', 'it', 'also', 'houses', 'the', 'tomb', 'of', 'Shah', 'Jahan', 'himself', '.']
['The', 'tomb', 'is', 'the', 'centrepiece', 'of', 'a', '17-hectare', '(', '42-acre', ')', 'complex', ',', 'which', 'includes', 'a', 'mosque', 'and', 'a', 'guest', 'house', ',', 'and', 'is', 'set', 'in', 'formal', 'gardens', 'bounded', 'on', 'three', 'sides', 'by', 'a', 'crenellated', 'wall', '.']
['Construction', 'of', 'the', 'mausoleum', 'was', 'completed', 'in', '1648', ',', 'but', 'work', 'continued', 'on', 'other', 'phases', 'of', 'the', 'project', 'for', 'another', 'five', 'years', '.']
['The', 

### Word Tokenization using WordPunctTokenizer
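#### The WordPunctTokenizer is one of the NLTK tokenizers. It splits text into runs of alphanumeric characters and runs of punctuation, so hyphenated words such as `ivory-white` become three tokens (`ivory`, `-`, `white`). Consecutive punctuation characters with no space between them, such as `).`, stay together as a single token.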

In [12]:
from nltk.tokenize import wordpunct_tokenize

In [13]:
wordpunct_tokenize(corpus)

['The',
 'Taj',
 'Mahal',
 'is',
 'an',
 'ivory',
 '-',
 'white',
 'marble',
 'mausoleum',
 'on',
 'the',
 'right',
 'bank',
 'of',
 'the',
 'river',
 'Yamuna',
 'in',
 'Agra',
 ',',
 'Uttar',
 'Pradesh',
 ',',
 'India',
 '.',
 'It',
 'was',
 'commissioned',
 'in',
 '1631',
 'by',
 'the',
 'fifth',
 'Mughal',
 'emperor',
 ',',
 'Shah',
 'Jahan',
 '(',
 '1628',
 '-',
 '1658',
 ')',
 'to',
 'house',
 'the',
 'tomb',
 'of',
 'his',
 'beloved',
 'wife',
 ',',
 'Mumtaz',
 'Mahal',
 ';',
 'it',
 'also',
 'houses',
 'the',
 'tomb',
 'of',
 'Shah',
 'Jahan',
 'himself',
 '.',
 'The',
 'tomb',
 'is',
 'the',
 'centrepiece',
 'of',
 'a',
 '17',
 '-',
 'hectare',
 '(',
 '42',
 '-',
 'acre',
 ')',
 'complex',
 ',',
 'which',
 'includes',
 'a',
 'mosque',
 'and',
 'a',
 'guest',
 'house',
 ',',
 'and',
 'is',
 'set',
 'in',
 'formal',
 'gardens',
 'bounded',
 'on',
 'three',
 'sides',
 'by',
 'a',
 'crenellated',
 'wall',
 '.',
 'Construction',
 'of',
 'the',
 'mausoleum',
 'was',
 'completed',
 'i
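
The behaviour above can be approximated with a single regular expression. This is a sketch under the assumption that the pattern `\w+|[^\w\s]+` matches the one NLTK documents for `WordPunctTokenizer`; the helper name `regex_wordpunct` is hypothetical.

```python
import re

# One token is either a run of word characters or a run of
# characters that are neither word characters nor whitespace.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Return alternating word and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text)

result = regex_wordpunct("an ivory-white tomb (1628-1658).")
print(result)
# ['an', 'ivory', '-', 'white', 'tomb', '(', '1628', '-', '1658', ').']
```

Note that the closing `).` comes out as one token, because adjacent punctuation characters form a single run.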

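### Word Tokenization Using TreebankWordTokenizer
#### The TreebankWordTokenizer from the Natural Language Toolkit (NLTK) splits text into words following Penn Treebank conventions. It expects one sentence at a time, so when it is applied to a whole paragraph, periods inside the text stay attached to the preceding word (for example `'India.'` and `'himself.'` in the output below).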

In [14]:
from nltk.tokenize import TreebankWordTokenizer

In [15]:
tokenizer = TreebankWordTokenizer()

In [16]:
tokenizer.tokenize(corpus)

['The',
 'Taj',
 'Mahal',
 'is',
 'an',
 'ivory-white',
 'marble',
 'mausoleum',
 'on',
 'the',
 'right',
 'bank',
 'of',
 'the',
 'river',
 'Yamuna',
 'in',
 'Agra',
 ',',
 'Uttar',
 'Pradesh',
 ',',
 'India.',
 'It',
 'was',
 'commissioned',
 'in',
 '1631',
 'by',
 'the',
 'fifth',
 'Mughal',
 'emperor',
 ',',
 'Shah',
 'Jahan',
 '(',
 '1628-1658',
 ')',
 'to',
 'house',
 'the',
 'tomb',
 'of',
 'his',
 'beloved',
 'wife',
 ',',
 'Mumtaz',
 'Mahal',
 ';',
 'it',
 'also',
 'houses',
 'the',
 'tomb',
 'of',
 'Shah',
 'Jahan',
 'himself.',
 'The',
 'tomb',
 'is',
 'the',
 'centrepiece',
 'of',
 'a',
 '17-hectare',
 '(',
 '42-acre',
 ')',
 'complex',
 ',',
 'which',
 'includes',
 'a',
 'mosque',
 'and',
 'a',
 'guest',
 'house',
 ',',
 'and',
 'is',
 'set',
 'in',
 'formal',
 'gardens',
 'bounded',
 'on',
 'three',
 'sides',
 'by',
 'a',
 'crenellated',
 'wall.',
 'Construction',
 'of',
 'the',
 'mausoleum',
 'was',
 'completed',
 'in',
 '1648',
 ',',
 'but',
 'work',
 'continued',
 'on'
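
Tokens such as `'India.'` and `'wall.'` in the output above keep their period because only a period at the very end of the input is split off. The toy function below (hypothetical, and not NLTK's implementation) imitates just that one rule to show why tokenizing sentence by sentence gives cleaner results than tokenizing the whole paragraph at once.

```python
import re

def split_final_period(text):
    """Toy stand-in: detach a period only when it ends the input string."""
    text = re.sub(r"\.\s*$", " .", text)
    return text.split()

# Whole paragraph at once: the inner period stays attached to 'India.'.
whole = split_final_period("It is in India. It has a wall.")
print(whole)
# ['It', 'is', 'in', 'India.', 'It', 'has', 'a', 'wall', '.']

# One sentence at a time: every sentence-final period is separated.
per_sentence = [split_final_period(s) for s in ["It is in India.", "It has a wall."]]
print(per_sentence)
# [['It', 'is', 'in', 'India', '.'], ['It', 'has', 'a', 'wall', '.']]
```

With NLTK itself, the equivalent approach would be to call `tokenizer.tokenize(sentence)` for each sentence in `documents` instead of passing `corpus` directly.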