
## 1. Tokenization
- **Tokenization** is the process of splitting text into smaller units called **tokens**.
- Tokens can be words, subwords, or characters depending on the application.

### Examples:
- **Word-level tokenization**  
Sentence: "The cat sat on the mat."
Tokens: ["The", "cat", "sat", "on", "the", "mat"]



- **Character-level tokenization**  
Sentence: "cat"
Tokens: ["c", "a", "t"]



- **Subword tokenization** (common in modern NLP models such as BERT and GPT; the exact split depends on the model's learned vocabulary)  
Word: "unhappiness"
Tokens: ["un", "happi", "ness"]
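
The word-level and character-level examples above can be reproduced with plain Python. This is a minimal sketch that assumes a simple regular expression (`\w+`) is an acceptable stand-in for a real word tokenizer; subword tokenization needs a trained vocabulary, so it is not shown here.

```python
import re

sentence = "The cat sat on the mat."

# Word-level: keep runs of letters, digits and underscores, which drops the final period.
word_tokens = re.findall(r"\w+", sentence)
print(word_tokens)

# Character-level: every character becomes its own token.
char_tokens = list("cat")
print(char_tokens)
```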



---

## 2. Corpus
- A **corpus** is a large collection of texts used for NLP tasks.
- It serves as the dataset for training or analysis.
- Example in our case:

Corpus = [
"The cat sat on the mat.",
"The dog barked loudly."
]



---

## 3. Documents
- A **document** is an individual text (sentence, paragraph, or article) inside the corpus.
- Example:

Document 1: "The cat sat on the mat."

Document 2: "The dog barked loudly."



---

## 4. Words / Tokens in Documents
- **Words (tokens)** are the individual terms obtained by applying tokenization to documents.
- Tokens are usually preprocessed first, for example lowercased and stripped of punctuation.
- Example:

Document 1 tokens: ["the", "cat", "sat", "on", "the", "mat"]

Document 2 tokens: ["the", "dog", "barked", "loudly"]
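
One way to get from the raw documents to these tokens is sketched below. The helper name `preprocess` is hypothetical, and the sketch assumes that removing ASCII punctuation and splitting on whitespace is good enough for this toy corpus.

```python
import string

corpus = ["The cat sat on the mat.", "The dog barked loudly."]

def preprocess(document):
    # Lowercase, delete ASCII punctuation, then split on whitespace.
    cleaned = document.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

doc_tokens = [preprocess(doc) for doc in corpus]
for i, tokens in enumerate(doc_tokens, start=1):
    print(f"Document {i} tokens: {tokens}")
```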



---

## 5. Vocabulary
- The **vocabulary** is the set of unique words across the entire corpus.
- Example:

Vocabulary = {"the", "cat", "sat", "on", "mat", "dog", "barked", "loudly"}


- Vocabulary size = 8
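
The vocabulary can be built by collecting every token into a Python `set`, which keeps one copy of each word. This is a small sketch that starts from the already tokenized documents; the variable names are illustrative.

```python
doc_tokens = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "loudly"],
]

# A set stores each distinct token once, no matter how often it occurs.
vocabulary = set()
for tokens in doc_tokens:
    vocabulary.update(tokens)

print(sorted(vocabulary))
print("Vocabulary size =", len(vocabulary))
```

Note that "the" occurs three times in the corpus but counts only once toward the vocabulary size.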

---

## 🔑 Hierarchy Summary

- **Corpus (2 documents)**  
  - **Document 1** → tokens: ["the", "cat", "sat", "on", "the", "mat"]  
  - **Document 2** → tokens: ["the", "dog", "barked", "loudly"]  

- **Vocabulary** (unique words across documents):  
  {"the", "cat", "sat", "on", "mat", "dog", "barked", "loudly"}

In [1]:
!pip install nltk



In [35]:
corpus = """
Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.
It is concerned with the interaction between computers and human (natural) languages!
Specifically, it is concerned with programming computers to process and analyze large amounts of natural language data.
Challenges in NLP include speech recognition, natural language understanding, and natural language generation.
"""

In [36]:
print(corpus)


Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.
It is concerned with the interaction between computers and human (natural) languages!
Specifically, it is concerned with programming computers to process and analyze large amounts of natural language data.
Challenges in NLP include speech recognition, natural language understanding, and natural language generation.



#### Tokenization
##### Paragraph --> Sentences

In [37]:
from nltk.tokenize import sent_tokenize

If `sent_tokenize` raises a `LookupError`, download the Punkt tokenizer data first:
import nltk
nltk.download('punkt_tab')

In [38]:
sentences = sent_tokenize(corpus)

In [39]:
for sentence in sentences:
    print(sentence)


Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.
It is concerned with the interaction between computers and human (natural) languages!
Specifically, it is concerned with programming computers to process and analyze large amounts of natural language data.
Challenges in NLP include speech recognition, natural language understanding, and natural language generation.


#### Tokenization
##### Paragraph --> Words
##### Sentence --> Words

In [40]:
from nltk.tokenize import word_tokenize

In [41]:
word_tokenize(corpus)

['Natural',
 'language',
 'processing',
 'is',
 'a',
 'subfield',
 'of',
 'linguistics',
 ',',
 'computer',
 'science',
 ',',
 'and',
 'artificial',
 'intelligence',
 '.',
 'It',
 'is',
 'concerned',
 'with',
 'the',
 'interaction',
 'between',
 'computers',
 'and',
 'human',
 '(',
 'natural',
 ')',
 'languages',
 '!',
 'Specifically',
 ',',
 'it',
 'is',
 'concerned',
 'with',
 'programming',
 'computers',
 'to',
 'process',
 'and',
 'analyze',
 'large',
 'amounts',
 'of',
 'natural',
 'language',
 'data',
 '.',
 'Challenges',
 'in',
 'NLP',
 'include',
 'speech',
 'recognition',
 ',',
 'natural',
 'language',
 'understanding',
 ',',
 'and',
 'natural',
 'language',
 'generation',
 '.']

In [42]:
for sentence in sentences:
    print(word_tokenize(sentence))

['Natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.']
['It', 'is', 'concerned', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '!']
['Specifically', ',', 'it', 'is', 'concerned', 'with', 'programming', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
['Challenges', 'in', 'NLP', 'include', 'speech', 'recognition', ',', 'natural', 'language', 'understanding', ',', 'and', 'natural', 'language', 'generation', '.']


In [43]:
from nltk.tokenize import wordpunct_tokenize

In [44]:
for sentence in sentences:
    print(wordpunct_tokenize(sentence))

['Natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.']
['It', 'is', 'concerned', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '!']
['Specifically', ',', 'it', 'is', 'concerned', 'with', 'programming', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
['Challenges', 'in', 'NLP', 'include', 'speech', 'recognition', ',', 'natural', 'language', 'understanding', ',', 'and', 'natural', 'language', 'generation', '.']


In [45]:
from nltk.tokenize import TreebankWordTokenizer

In [46]:
tokenizer = TreebankWordTokenizer()

In [47]:
for sentence in sentences:
    print(tokenizer.tokenize(sentence))

['Natural', 'language', 'processing', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.']
['It', 'is', 'concerned', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', '!']
['Specifically', ',', 'it', 'is', 'concerned', 'with', 'programming', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
['Challenges', 'in', 'NLP', 'include', 'speech', 'recognition', ',', 'natural', 'language', 'understanding', ',', 'and', 'natural', 'language', 'generation', '.']
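
On this paragraph the tokenizers above return the same tokens, so the difference between them is easier to see on text that contains a contraction. The sketch below uses a made-up sentence that is not part of the corpus. It assumes NLTK is installed; `wordpunct_tokenize` and `TreebankWordTokenizer` are rule-based and should not need the `punkt_tab` download.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

# Hypothetical example sentence, chosen because it contains a contraction.
text = "I don't know."

# wordpunct_tokenize splits off every run of punctuation, so the apostrophe stands alone.
wordpunct_tokens = wordpunct_tokenize(text)
print(wordpunct_tokens)   # ['I', 'don', "'", 't', 'know', '.']

# The Treebank tokenizer follows Penn Treebank conventions and splits "don't" into "do" + "n't".
treebank_tokens = TreebankWordTokenizer().tokenize(text)
print(treebank_tokens)    # ['I', 'do', "n't", 'know', '.']
```

Which behavior is preferable depends on the downstream task: keeping `n't` as its own token preserves the negation, while splitting on every punctuation mark gives simpler, more uniform pieces.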
