# Text - Preprocessing

## **1. Tokenization**
Tokenization is the process of breaking a **corpus (large text)** into **smaller meaningful units** like **sentences or words**.  

**Types of Tokenization:**
- **Sentence Tokenization** → Splitting text into sentences.  
- **Word Tokenization** → Splitting sentences into words.  
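
To make the two levels concrete, here is a minimal regex-based sketch that uses only the standard library. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the rules are deliberately naive (an abbreviation such as "Dr." would wrongly end a sentence); the NLTK tokenizers used in the cells that follow handle such cases far better.

```python
import re

def simple_sent_tokenize(text):
    # Split after ., ! or ? when followed by whitespace (naive rule).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def simple_word_tokenize(sentence):
    # A token is either a run of letters/digits/apostrophes or one punctuation mark.
    return re.findall(r"[A-Za-z0-9']+|[^\w\s]", sentence)

text = "She had nothing to do. Once or twice, she peeped into the book."
sentences = simple_sent_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

Sentence tokenization runs first and word tokenization is applied per sentence, which mirrors the order used with `sent_tokenize` and `word_tokenize` in this notebook.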

In [51]:
# A corpus (plural: corpora) is a large collection of text data used for training, testing, and analyzing Natural Language Processing (NLP) models.
corpus = """Alice was beginning to get very tired of sitting by her sister on the bank.
She had nothing to do. Once or twice, she peeped into the book her sister was reading.
But it had no pictures or conversations in it."""

In [52]:
#Tokenization
import nltk
nltk.download('punkt_tab')
from nltk import sent_tokenize

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [53]:
# split a corpus into sentences. using sentence tokenizer
documents = sent_tokenize(corpus)

In [54]:
type(documents)

list

In [55]:
documents

['Alice was beginning to get very tired of sitting by her sister on the bank.',
 'She had nothing to do.',
 'Once or twice, she peeped into the book her sister was reading.',
 'But it had no pictures or conversations in it.']

In [56]:
for sentence in documents:
    print(sentence)

Alice was beginning to get very tired of sitting by her sister on the bank.
She had nothing to do.
Once or twice, she peeped into the book her sister was reading.
But it had no pictures or conversations in it.


In [57]:
from nltk.tokenize import word_tokenize

In [58]:
# Tokenize words
word_token = word_tokenize(corpus)
word_token

['Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank',
 '.',
 'She',
 'had',
 'nothing',
 'to',
 'do',
 '.',
 'Once',
 'or',
 'twice',
 ',',
 'she',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading',
 '.',
 'But',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 '.']

In [59]:
for sentence in documents:
    word_token = word_tokenize(sentence)
    print(word_token)

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', '.']
['She', 'had', 'nothing', 'to', 'do', '.']
['Once', 'or', 'twice', ',', 'she', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', '.']
['But', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', '.']


In [60]:
from nltk.tokenize import TreebankWordTokenizer
tree_tokenizer = TreebankWordTokenizer()

In [61]:
# TreebankWordTokenizer expects one sentence at a time: on multi-sentence text it splits off only the final full stop,
# so words inside the text keep theirs ('bank.', 'do.'), whereas word_tokenize separates 'bank' and '.'
tree_tokenizer.tokenize(corpus)

['Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on',
 'the',
 'bank.',
 'She',
 'had',
 'nothing',
 'to',
 'do.',
 'Once',
 'or',
 'twice',
 ',',
 'she',
 'peeped',
 'into',
 'the',
 'book',
 'her',
 'sister',
 'was',
 'reading.',
 'But',
 'it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 '.']

In [62]:
for sentence in documents:
    print(tree_tokenizer.tokenize(sentence))

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', '.']
['She', 'had', 'nothing', 'to', 'do', '.']
['Once', 'or', 'twice', ',', 'she', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', '.']
['But', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', '.']
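
The difference between the two Treebank outputs above comes down to when a full stop is split off. As a rough illustration (an imitation of a single rule under simplifying assumptions, not NLTK's actual implementation), the sketch below detaches a period only when it is the last character of the input. `split_final_period` is a hypothetical helper.

```python
import re

def split_final_period(text):
    # Detach a period only if it ends the whole string, then split on whitespace.
    text = re.sub(r"([^.])\.\s*$", r"\1 .", text)
    return text.split()

whole_text = split_final_period("on the bank. She had nothing to do.")
one_sentence = split_final_period("on the bank.")

print(whole_text)    # 'bank.' stays glued together, only the last '.' is split
print(one_sentence)  # 'bank' and '.' are separate tokens
```

This is why tokenizing sentence by sentence gives the cleaner result with this tokenizer.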


## **2. Stemming (Reducing words to their root form by chopping endings)**
**Definition:**  
Stemming removes **prefixes or suffixes** from words to obtain their **root form** (also called the "stem").  

🔹 **But it's a rule-based approach!** It simply cuts off word endings **without considering meaning**, which sometimes leads to incorrect words.  

**Example:**  
📌 Given words and their stems:  
| Word | Stem (Incorrect sometimes) |
|---|---|
| Running | Run |
| Studies | Studi ❌ |
| Happily | Happili ❌ |
| Better | Better ❌ (doesn't reduce properly) |

**Downside:**  
🚫 Since stemming only **chops off letters**, it may **not always return a real word**.  

**Types of Stemming Algorithms:**  
1. **Porter Stemmer** – Simple, widely used.  
2. **Lancaster Stemmer** – More aggressive.  
3. **Snowball Stemmer** – More refined.  
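
To show what "rule-based" means in practice, the sketch below applies four plural-suffix rules in order, with the first match winning. The rules are modelled on the first step of the Porter algorithm; the real stemmers listed above apply many more steps, so treat this as an illustration only. `strip_plural_suffix` is a hypothetical name.

```python
# (suffix, replacement) pairs, checked from top to bottom.
PLURAL_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural_suffix(word):
    word = word.lower()
    for suffix, replacement in PLURAL_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched

for w in ["studies", "flies", "cars", "caresses", "caress", "history"]:
    print(f"{w} -> {strip_plural_suffix(w)}")
```

The rules never consult a dictionary, which is why "studies" ends up as the non-word "studi".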


In [63]:
# example words
words = ["running", "flies", "happily", "studies", "cars", "better", "history", "fairly"]

In [64]:
from nltk.stem import PorterStemmer

In [65]:
stemming = PorterStemmer()

In [66]:
for word in words:
    print(f"{word}--------->{stemming.stem(word)}")

running--------->run
flies--------->fli
happily--------->happili
studies--------->studi
cars--------->car
better--------->better
history--------->histori
fairly--------->fairli


##### A major disadvantage of stemming is that the result is often not a real word (history--------->histori, studies--------->studi), so the meaning can be lost. Lemmatization overcomes this.
##### Other stemmers, such as the RegexpStemmer class, give more control over which suffixes are removed.

#### RegexpStemmer
The RegexpStemmer (Regular Expression Stemmer) in NLTK allows custom rule-based stemming using regex patterns. Instead of a predefined algorithm (like Porter or Lancaster), you define suffix removal rules.
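
Conceptually, this kind of stemmer is little more than a regex substitution guarded by a minimum word length. The sketch below is a standard-library approximation of that idea, not NLTK's source; `regexp_stem` and its parameter names are hypothetical. Note that an alternative without a `$` anchor (here `ing`) is removed wherever it occurs in the word.

```python
import re

def regexp_stem(word, pattern=r"ing|s$|e$|able$", min_length=4):
    # Words shorter than min_length are returned unchanged.
    if len(word) < min_length:
        return word
    # Delete every match of the pattern.
    return re.sub(pattern, "", word)

for w in ["running", "eatable", "singing", "is"]:
    print(f"{w} -> {regexp_stem(w)}")
```

The "singing" case shows the risk of unanchored rules: both occurrences of `ing` are removed, leaving only "s".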

In [67]:
from nltk.stem import RegexpStemmer

In [68]:
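# The pattern removes 'ing' anywhere in the word (it has no $ anchor) and 's', 'e' or 'able' only at the end of the word,
# so "running" becomes "runn". min=4 leaves words shorter than 4 characters unchanged.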
regexp_stemming = RegexpStemmer('ing|s$|e$|able$', min=4)

In [69]:
regexp_stemming.stem("running")

'runn'

In [70]:
regexp_stemming.stem("eatable")

'eat'

#### SnowballStemmer
The SnowballStemmer (also called "Porter2 Stemmer") is an improved version of the Porter Stemmer. It is more accurate, multilingual, and less aggressive than the Lancaster Stemmer.

In [71]:
from nltk.stem import SnowballStemmer

In [72]:
snowball_stemming = SnowballStemmer("english")

In [73]:
snowball_stemming.stem("running")

'run'

In [74]:
snowball_stemming.stem("history")

'histori'

In [75]:
# Here "fairly" is reduced to "fair", whereas PorterStemmer gives "fairli"
snowball_stemming.stem("fairly")

'fair'

#### LancasterStemmer
The Lancaster Stemmer is a very aggressive stemming algorithm that reduces words to their root forms but often over-stems, leading to loss of meaning. It is more aggressive than Porter and Snowball Stemmers.

In [76]:
from nltk.stem import LancasterStemmer

In [77]:
lancaster_stemming = LancasterStemmer()

In [78]:
lancaster_stemming.stem("running"), lancaster_stemming.stem("history"), lancaster_stemming.stem("fairly")
# Lancaster over-stems "history" to "hist", so the meaning of the original word is lost

('run', 'hist', 'fair')

## **3. Lemmatization (Getting the dictionary base form of words)**
**Definition:**  
Lemmatization is an advanced version of stemming that reduces a word to its **meaningful base form (lemma)**, ensuring that it is a valid dictionary word.  

**Difference from Stemming:**  
✔ **Lemmatization considers meaning and grammar**, while stemming blindly removes endings.  

**Example:**  
📌 Given words and their correct lemmas:  
| Word | Lemma |
|---|---|
| Running | Run |
| Studies | Study |
| Happily | Happy |
| Better | Good ✅ (stemming fails here) |

🔹 **Why is Lemmatization better?**  
✔ It ensures that the word remains meaningful.  
✔ Uses a vocabulary-based approach (WordNet, spaCy, etc.).  
✔ Helps in machine learning models where meaningful text is needed.
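
A vocabulary-based approach can be pictured as a lookup keyed by the word and its part of speech. The sketch below uses a tiny hand-written table (`TOY_LEXICON`, hypothetical and far smaller than WordNet) to show why the part of speech matters: the same surface form can map to different lemmas.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
TOY_LEXICON = {
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("better", "r"): "well",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return TOY_LEXICON.get((word.lower(), pos), word)

for word, pos in [("running", "v"), ("running", "n"), ("better", "a"), ("better", "n")]:
    print(f"{word} ({pos}) --> {toy_lemmatize(word, pos)}")
```

With the wrong part of speech the lookup misses and the word comes back unchanged, which is the same pattern the `pos='n'` and `pos='v'` runs in this section show.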

#### WordNet Lemmatizer
The WordNet Lemmatizer is a more advanced technique than stemming.
NLTK provides the `WordNetLemmatizer` class, which is a thin wrapper around the WordNet corpus. It uses the `morphy()` function of the `WordNetCorpusReader` class to find a lemma.

In [79]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [80]:
lemmatizer = WordNetLemmatizer()

In [81]:
# POS tag argument of the lemmatizer:
# the Part Of Speech tag. Valid options are `"n"` for nouns,
#     `"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"`
#     for satellite adjectives.
lemmatizer.lemmatize("eating", pos='v')

'eat'

In [82]:
words = ["running", "flies", "happily", "studies", "cars", "better", "history"]

In [83]:
# With pos='n' only noun forms are reduced (flies -> fly, studies -> study, cars -> car); the other words are unchanged
for word in words:
    print(f"{word}-------->{lemmatizer.lemmatize(word, pos='n')}")

running-------->running
flies-------->fly
happily-------->happily
studies-------->study
cars-------->car
better-------->better
history-------->history


In [84]:
# With pos='v' verb forms are lemmatized (running -> run), but the plural noun "cars" is left unchanged
for word in words:
    print(f"{word}-------->{lemmatizer.lemmatize(word, pos='v')}")

running-------->run
flies-------->fly
happily-------->happily
studies-------->study
cars-------->cars
better-------->better
history-------->history


## **4. Stop-word Removal (Filtering out unimportant words)**
**Definition:**  
Stop-words are common words that **don’t add much meaning** to a sentence, like **“is”, “the”, “and”, “a”, “in”**. Removing them helps **reduce noise** in NLP tasks.  

**Example:**  
📌 Given sentence:  
> *"The cat is sitting on the mat."*  

✅ After stop-word removal:  
> *"cat sitting mat"*  

🔹 **Why remove stop-words?**  
✔ Reduces text size for faster processing.  
✔ Focuses only on meaningful words.  
✔ Improves search engine results by removing unnecessary words. 
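
The example above can be reproduced with a few lines of standard-library Python. The stop-word set here (`TOY_STOP_WORDS`) is a small hand-picked assumption for illustration; the NLTK list loaded in the next cells is much longer.

```python
import re

# A tiny hand-picked stop-word set; real lists contain far more entries.
TOY_STOP_WORDS = {"the", "is", "on", "a", "an", "and", "in"}

def remove_stop_words(sentence):
    # Lowercase, keep alphabetic tokens, then drop anything in the stop-word set.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

result = remove_stop_words("The cat is sitting on the mat.")
print(result)  # cat sitting mat
```

Using a `set` keeps each membership test fast, which matters once the same filter runs over thousands of documents.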

In [85]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [86]:
# Common words like these (lists exist for English, German, Arabic, etc.) are removed as stop-words in NLP
print(f"Total Number of Stopwords in English = {len(stopwords.words('english'))}")
print(stopwords.words("english"))

Total Number of Stopwords in English = 198
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own'

In [87]:
# First 10 German stop-words, as an example
stopwords.words("german")[:10]

['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an']

In [88]:
# First 10 Arabic stop-words, as an example
stopwords.words("arabic")[:10]

['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر', 'ألا', 'إلا', 'التي']

In [89]:
# A new, longer corpus for the stop-word removal examples
corpus = """My Vision for India – Dr. A.P.J. Abdul Kalam. I have three visions for India.
In 3000 years of history, India has been invaded, conquered, ruled by others. Yet, India has always stood strong, preserving its culture, knowledge, and traditions. The first vision is Freedom. We must protect our nation’s independence with our knowledge, innovation, and courage.  
The second vision is Development. We must not be a developing nation forever. We have the potential to become a global leader in science, technology, and economy. We must believe in ourselves!  
The third vision is India must stand up to the world. We must develop the mindset of a developed nation and not be looked upon as a third-world country."""

In [90]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

In [91]:
sentecnces = nltk.sent_tokenize(corpus)

In [92]:
## Remove stop-words, then apply stemming
stop_words = set(stopwords.words("english"))  # build the set once instead of once per word
for i in range(len(sentecnces)):
    words = nltk.word_tokenize(sentecnces[i].lower())
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentecnces[i] = ' '.join(words)
print(sentecnces)

['vision india – dr. a.p.j .', 'abdul kalam .', 'three vision india .', '3000 year histori , india invad , conquer , rule other .', 'yet , india alway stood strong , preserv cultur , knowledg , tradit .', 'first vision freedom .', 'must protect nation ’ independ knowledg , innov , courag .', 'second vision develop .', 'must develop nation forev .', 'potenti becom global leader scienc , technolog , economi .', 'must believ !', 'third vision india must stand world .', 'must develop mindset develop nation look upon third-world countri .']


In [93]:
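## Remove stop-words, then apply lemmatization (usually more accurate than stemming, but slower).
## Note: `sentecnces` was overwritten with stemmed text by the previous cell, so the lemmatizer
## only sees stems here and the output barely changes. To lemmatize the original sentences,
## re-run sentecnces = nltk.sent_tokenize(corpus) first.
from nltk.stem import WordNetLemmatizer
word_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # build the set once instead of once per word
for i in range(len(sentecnces)):
    words = nltk.word_tokenize(sentecnces[i].lower())
    words = [word_lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentecnces[i] = ' '.join(words)
print(sentecnces)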

['vision india – dr. a.p.j .', 'abdul kalam .', 'three vision india .', '3000 year histori , india invad , conquer , rule .', 'yet , india alway stood strong , preserv cultur , knowledg , tradit .', 'first vision freedom .', 'must protect nation ’ independ knowledg , innov , courag .', 'second vision develop .', 'must develop nation forev .', 'potenti becom global leader scienc , technolog , economi .', 'must believ !', 'third vision india must stand world .', 'must develop mindset develop nation look upon third-world countri .']


## **Part-of-Speech (POS) Tagging**
**POS tagging (Part-of-Speech tagging)** is an **NLP process** that labels each word in a sentence with its **grammatical role**, such as **noun, verb, adjective, pronoun, etc.**

##### **🔹 Why is POS Tagging Important?**
- **Sentence Structure Analysis** → Helps understand the meaning of words in context.
- **Word Sense Disambiguation** → Differentiates between words with multiple meanings (e.g., *"book a ticket"* vs. *"read a book"*).
- **NER (Named Entity Recognition)** → Helps in recognizing entities based on their word type.
- **Grammar Correction & Speech Recognition** → Used in spell-checking and voice assistants.
- **Lemmatization & Stemming** → Helps identify the base form of words correctly.

---

##### **🔹 Common POS Tags in NLP**
Here are some commonly used **POS tags** in NLP (based on the **Penn Treebank POS Tagset**):  

| POS Tag | Full Form | Example |
|---|---|---|
| **NN** | Noun (Singular) | "dog", "apple" |
| **NNS** | Noun (Plural) | "dogs", "apples" |
| **NNP** | Proper Noun (Singular) | "India", "Google" |
| **NNPS** | Proper Noun (Plural) | "Indians", "Americans" |
| **VB** | Verb (Base Form) | "run", "eat" |
| **VBD** | Verb (Past Tense) | "ran", "ate" |
| **VBG** | Verb (Gerund/Present Participle) | "running", "eating" |
| **VBN** | Verb (Past Participle) | "eaten", "driven" |
| **VBP** | Verb (Non-3rd Person Singular Present) | "run", "eat" |
| **VBZ** | Verb (3rd Person Singular Present) | "runs", "eats" |
| **JJ** | Adjective | "quick", "happy" |
| **RB** | Adverb | "quickly", "happily" |
| **PRP** | Pronoun | "he", "she", "it" |
| **IN** | Preposition | "on", "in", "over" |
| **DT** | Determiner | "the", "a", "an" |
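
These tags can also drive lemmatization: WordNet-style lemmatizers take a coarse part of speech (`"n"`, `"v"`, `"a"`, `"r"`), which can be derived from the first letter of a Penn Treebank tag. The helper below, `penn_to_wordnet` (a hypothetical name), is one common way to sketch that mapping; defaulting to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a coarse WordNet-style part of speech.
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else (assumed default)

tagged = [("dogs", "NNS"), ("running", "VBG"), ("quick", "JJ"), ("quickly", "RB"), ("the", "DT")]
coarse = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(coarse)
```

The resulting letter can then be passed as the `pos` argument of a lemmatizer, so each word is lemmatized according to its role in the sentence.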

In [94]:
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [95]:
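## Find the POS tag of each word (noun, verb, adverb, adjective, etc.), e.g. ('india', 'NN') tags "india" as a noun.
## Note: `sentecnces` already holds lowercased, stemmed text, so many of the tags below are unreliable; tagging the original sentences gives better results.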
for i in range(len(sentecnces)):
    words = nltk.word_tokenize(sentecnces[i].lower())
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words("english"))]
    tag_pos = pos_tag(words)
    print(tag_pos)

[('vision', 'NN'), ('india', 'NN'), ('–', 'NNP'), ('dr.', 'NN'), ('a.p.j', 'NN'), ('.', '.')]
[('abdul', 'JJ'), ('kalam', 'NN'), ('.', '.')]
[('three', 'CD'), ('vision', 'NN'), ('india', 'NN'), ('.', '.')]
[('3000', 'CD'), ('year', 'NN'), ('histori', 'NN'), (',', ','), ('india', 'JJ'), ('invad', 'NN'), (',', ','), ('conquer', 'NN'), (',', ','), ('rule', 'NN'), ('.', '.')]
[('yet', 'RB'), (',', ','), ('india', 'VB'), ('alway', 'RB'), ('stood', 'JJ'), ('strong', 'JJ'), (',', ','), ('preserv', 'JJ'), ('cultur', 'NN'), (',', ','), ('knowledg', 'NN'), (',', ','), ('tradit', 'NN'), ('.', '.')]
[('first', 'JJ'), ('vision', 'NN'), ('freedom', 'NN'), ('.', '.')]
[('must', 'MD'), ('protect', 'VB'), ('nation', 'NN'), ('’', 'NNP'), ('independ', 'NN'), ('knowledg', 'NN'), (',', ','), ('innov', 'NN'), (',', ','), ('courag', 'NN'), ('.', '.')]
[('second', 'JJ'), ('vision', 'NN'), ('develop', 'NN'), ('.', '.')]
[('must', 'MD'), ('develop', 'VB'), ('nation', 'NN'), ('forev', 'NN'), ('.', '.')]
[('poten

## **Named Entity Recognition (NER) in NLP**  

**Named Entity Recognition (NER)** is a Natural Language Processing (NLP) technique used to identify and classify **named entities** in text into predefined categories like:  

- **Person** – ("A.P.J. Abdul Kalam", "Elon Musk")
- **Organization** – ("NASA", "Google", "ISRO")
- **Location** – ("India", "New York", "Himalayas")
- **Date & Time** – ("January 26, 1950", "5 PM")
- **Monetary Values** – ("₹1000", "$1 billion")
- **Percentages** – ("50%", "95% accuracy")  
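
Some of these categories have a regular surface form and can be approximated with patterns, which helps build intuition for what an NER system outputs. The sketch below is a rule-based illustration only (statistical NER models, such as the NLTK chunker used in this section, work differently), and both patterns are simplified assumptions.

```python
import re

# Simplified patterns: a number followed by %, and a currency sign followed by an amount.
PERCENT_PATTERN = re.compile(r"\d+(?:\.\d+)?%")
MONEY_PATTERN = re.compile(r"[$₹]\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")

def find_entities(text):
    # Return (matched text, label) pairs, percentages first.
    entities = [(m.group(), "PERCENT") for m in PERCENT_PATTERN.finditer(text)]
    entities += [(m.group(), "MONEY") for m in MONEY_PATTERN.finditer(text)]
    return entities

found = find_entities("Sales rose 50% to $1 billion, and the ticket cost ₹1000.")
print(found)
```

Categories such as Person or Organization have no such regular shape, which is why they need a trained model rather than patterns.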

In [96]:
# The same raw corpus as above, repeated here so this section runs on its own
corpus = """My Vision for India – Dr. A.P.J. Abdul Kalam. I have three visions for India.
In 3000 years of history, India has been invaded, conquered, ruled by others. Yet, India has always stood strong, preserving its culture, knowledge, and traditions. The first vision is Freedom. We must protect our nation’s independence with our knowledge, innovation, and courage.  
The second vision is Development. We must not be a developing nation forever. We have the potential to become a global leader in science, technology, and economy. We must believe in ourselves!  
The third vision is India must stand up to the world. We must develop the mindset of a developed nation and not be looked upon as a third-world country."""

In [97]:
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\lalra\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [98]:
words = nltk.word_tokenize(corpus)
pos_tag_elements = pos_tag(words)

In [99]:
# The command nltk.ne_chunk(pos_tag_elements).draw() is used for Named Entity Recognition (NER) visualization in NLTK. It creates a tree structure showing named entities in the text.

# uncomment this line to execute
# nltk.ne_chunk(pos_tag_elements).draw()

## BOW (Bag of Words)
---
#### Spam Classification

In [101]:
import pandas as pd
df = pd.read_csv('spam.csv',encoding="ISO-8859-1")
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [None]:
df.rename(columns={"v1": "label", "v2": "message"}, inplace=True)
df.head()

In [None]:
df.shape

In [None]:
# The re module in Python stands for Regular Expressions and is used for pattern matching and text processing.
import re
import nltk
nltk.download('stopwords')

In [102]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
ss = SnowballStemmer('english')

In [None]:
len(df)

The preprocessing loop below performs these steps on each message:

1. **Removing special characters & numbers** using `re.sub(r"[^a-zA-Z\s]", "", df['message'][i])`.
2. **Converting text to lowercase** with `.lower()`.
3. **Splitting text into words** using `.split()`.
4. **Removing stopwords** using `stopwords.words('english')`.
5. **Applying stemming** (`ss.stem(word)`) to reduce the remaining words to their root form.
6. **Joining words back into a sentence** using `' '.join(review)`.

In [None]:
corpus = []
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(0, len(df)):
    review = re.sub(r"[^a-zA-Z\s]", "", df['message'][i])
    review = review.lower()
    review = review.split()
    review = [ss.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
print(corpus[0:50])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Initialize CountVectorizer, keeping only the 100 most frequent terms as features
# binary=True records each word's presence as 0 or 1 (instead of actual word counts)
cv = CountVectorizer(max_features=100, binary=True)

# With binary=False (the default), the values are actual word counts:
# cv = CountVectorizer(max_features=100)

In [None]:
# Fit and transform the 'corpus' into a Bag of Words (BoW) representation
X = cv.fit_transform(corpus)

In [None]:
# Convert sparse matrix representation of BoW to a dense array
X.toarray()

In [None]:
# Get the shape of the transformed dataset (rows = number of documents, columns = 100 words)
X.shape

In [None]:
# Mapping of the 100 selected words to their column indices in X (not their frequencies)
cv.vocabulary_
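
To see what the vectorizer produces, the sketch below builds a Bag of Words by hand for three made-up, already-preprocessed messages. It is a simplified stand-in: unlike `CountVectorizer` it does no tokenization beyond `.split()` and has no `max_features` cap, and the names `docs` and `bow_vector` are hypothetical.

```python
from collections import Counter

docs = ["free entry win prize", "call me later", "win free prize free"]
tokenized = [d.split() for d in docs]

# Vocabulary: each distinct term gets a column index (alphabetical order).
vocabulary = {term: idx for idx, term in enumerate(sorted({t for doc in tokenized for t in doc}))}

def bow_vector(tokens, binary=False):
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        # binary=True stores presence (1) instead of the raw count.
        vector[vocabulary[term]] = 1 if binary else count
    return vector

count_matrix = [bow_vector(doc) for doc in tokenized]
binary_matrix = [bow_vector(doc, binary=True) for doc in tokenized]

print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

The third message contains "free" twice, so its count vector holds a 2 in that column while the binary vector holds a 1.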

## N-Grams

In [None]:
# ngram_range=(2,3)  -> Extracts both bigrams (2-word sequences) and trigrams (3-word sequences).
# (1,2) - Combination of Unigram and bigram
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(2,3))

In [None]:
X = cv.fit_transform(corpus)

In [None]:
# Get the vocabulary dictionary where keys are n-grams and values are their index positions
cv.vocabulary_
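
For intuition about which features `ngram_range=(2,3)` generates, the sketch below builds the bigrams and trigrams of a single made-up token list by hand. The `ngrams` helper is a hypothetical name; the vectorizer additionally counts these n-grams across all documents and keeps only the most frequent ones.

```python
def ngrams(tokens, n_min, n_max):
    # Collect every contiguous run of n tokens for n_min <= n <= n_max.
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

tokens = "free entri win prize".split()
grams = ngrams(tokens, 2, 3)
print(grams)
```

Because n-grams keep word order, "win prize" and "prize win" become different features, which a plain Bag of Words cannot distinguish.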

## TF-IDF (Term Frequency - Inverse Document Frequency)

In [None]:
# Import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TfidfVectorizer with a max of 100 features (words)
tfidf = TfidfVectorizer(max_features=100)

In [None]:
# Fit and transform the corpus into TF-IDF representation
X = tfidf.fit_transform(corpus)

In [None]:
X.toarray()

In [None]:
# Get the vocabulary dictionary, where:
# - Keys → Words (features) selected by TF-IDF
# - Values → Their index positions in the feature matrix
tfidf.vocabulary_
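
As a rough guide to what the numbers in `X` mean, the sketch below computes TF-IDF by hand for three tiny documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which should correspond to `TfidfVectorizer`'s default settings; the documents and the `tfidf_row` helper are made up for illustration.

```python
import math
from collections import Counter

docs = [["free", "prize", "free"], ["free", "call"], ["call", "now"]]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = {t: sum(t in doc for doc in docs) for t in vocab}
# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]  # term frequency * idf
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

In the last document, "now" appears in only one document and therefore gets a higher weight than "call", which appears in two.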

#### TF-IDF with N-Gram

In [None]:
tfidf = TfidfVectorizer(max_features=100, ngram_range=(2,3))

In [None]:
X = tfidf.fit_transform(corpus)

In [None]:
tfidf.vocabulary_

## Word2Vec Implementation
https://colab.research.google.com/drive/1DMK0Z3MM8D5st0-DdBmVdQFxTj8u-P0e