# 🧠 TF-IDF (Term Frequency – Inverse Document Frequency)

**TF-IDF** is a numerical statistic used in Natural Language Processing (NLP) and Information Retrieval to measure how important a word is in a document relative to a collection of documents (called a *corpus*).

---

## ⚙️ Step 1: Preprocessing Example

Let’s start with three simple sentences:

| Sentence | Original | After Lowercasing + Stopword Removal |
|-----------|-----------|--------------------------------------|
| S1 | The boy is good | **good boy** |
| S2 | The girl is good | **good girl** |
| S3 | The boy and girl is good | **boy girl good** |

After removing stopwords (like *the*, *is*, *and*), we get the cleaned sentences.
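
The cleaning step above can be sketched in a few lines of Python. The stopword set here is a hypothetical three-word list chosen only to match this example, not a full English stopword list. The tokens keep their original order, so S1 comes out as `boy good` rather than `good boy`; for TF-IDF the order makes no difference.

```python
# Hypothetical, deliberately tiny stopword list (assumption for this example only).
STOPWORDS = {"the", "is", "and"}

def clean(sentence: str) -> str:
    """Lowercase the sentence and drop stopwords, keeping word order."""
    tokens = sentence.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

sentences = ["The boy is good", "The girl is good", "The boy and girl is good"]
cleaned = [clean(s) for s in sentences]
print(cleaned)  # ['boy good', 'girl good', 'boy girl good']
```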

---

## 🧩 Step 2: Vocabulary

From all sentences combined, our **vocabulary** (unique words) is:
```
['boy', 'girl', 'good']
```
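
One way to build that vocabulary, as a minimal sketch: collect every distinct token and sort the result so the column order is stable. The variable names are illustrative only.

```python
cleaned = ["good boy", "good girl", "boy girl good"]

# Collect every distinct token, then sort for a stable column order.
vocabulary = sorted({word for sentence in cleaned for word in sentence.split()})
print(vocabulary)  # ['boy', 'girl', 'good']
```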

---

## 🧮 Step 3: Term Frequency (TF)

**Term Frequency (TF)** measures how often a word appears in a sentence, relative to the sentence's length.

---

### 🔹 Formula:

\[
TF = \frac{\text{Number of times a word appears in the sentence}}{\text{Total number of words in the sentence}}
\]

---

### 🧩 Example:

Let’s take the preprocessed sentences:

| Sentence | After Cleaning |
|-----------|----------------|
| S1 | good boy |
| S2 | good girl |
| S3 | boy girl good |

Now, let’s calculate the **TF** for each word.

| Word | S1: "good boy" | S2: "good girl" | S3: "boy girl good" |
|------|----------------|----------------|----------------------|
| boy  | 1/2 = 0.5      | 0/2 = 0.0      | 1/3 ≈ 0.33          |
| girl | 0/2 = 0.0      | 1/2 = 0.5      | 1/3 ≈ 0.33          |
| good | 1/2 = 0.5      | 1/2 = 0.5      | 1/3 ≈ 0.33          |
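
The TF table above can be reproduced with a small helper. This is a sketch that assumes whitespace tokenization of the already-cleaned sentences; `term_frequency` is an illustrative name, not a library function. Words that do not occur in a sentence simply have no entry, which corresponds to TF = 0.

```python
from collections import Counter

def term_frequency(sentence: str) -> dict[str, float]:
    """Return count(word) / total words for each word in the sentence."""
    tokens = sentence.split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

docs = ["good boy", "good girl", "boy girl good"]
tf = [term_frequency(d) for d in docs]
for name, scores in zip(["S1", "S2", "S3"], tf):
    print(name, {w: round(v, 2) for w, v in scores.items()})
```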

---

## 🧠 Step 4: Inverse Document Frequency (IDF)

**IDF** measures how *rare* or *important* a word is across all documents.

---

### 🔹 Formula:

\[
IDF = \log_e\left(\frac{N}{n_t}\right)
\]

Where:  
- **N** = Total number of documents (or sentences)  
- **nₜ** = Number of documents that contain the term *t*  
- **logₑ** = Natural logarithm (base *e*)  
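
As a minimal sketch of this formula with the natural logarithm (`math.log`), applied to the three cleaned sentences; the function name is illustrative:

```python
import math

def inverse_document_frequency(docs: list[str]) -> dict[str, float]:
    """IDF = ln(N / n_t), where n_t is the number of docs containing term t."""
    n_docs = len(docs)
    doc_tokens = [set(d.split()) for d in docs]
    vocabulary = set().union(*doc_tokens)
    return {
        term: math.log(n_docs / sum(term in tokens for tokens in doc_tokens))
        for term in sorted(vocabulary)
    }

idf = inverse_document_frequency(["good boy", "good girl", "boy girl good"])
print({term: round(value, 3) for term, value in idf.items()})
# {'boy': 0.405, 'girl': 0.405, 'good': 0.0}
```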

---

| Word | No. of Sentences Containing Word | IDF = ln(3 / count) |
|------|----------------------------------|----------------------|
| boy  | 2 (S1, S3) | ln(3/2) ≈ 0.405 |
| girl | 2 (S2, S3) | ln(3/2) ≈ 0.405 |
| good | 3 (S1, S2, S3) | ln(3/3) = 0.0 |

---

## 🧩 Step 5: TF × IDF = TF-IDF Score

| Word | S1 | S2 | S3 |
|------|----|----|----|
| boy  | 0.5 × 0.405 ≈ **0.203** | 0 | 1/3 × 0.405 ≈ **0.135** |
| girl | 0 | 0.5 × 0.405 ≈ **0.203** | 1/3 × 0.405 ≈ **0.135** |
| good | 0.5 × 0.0 = **0.0** | 0.5 × 0.0 = **0.0** | 1/3 × 0.0 = **0.0** |
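
The multiplication can be checked end to end with the standard library. This sketch uses the TF definition from Step 3 and the natural-log IDF from Step 4; all names are illustrative.

```python
import math
from collections import Counter

docs = ["good boy", "good girl", "boy girl good"]
tokenized = [d.split() for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})

# Document frequency: in how many documents does each term occur?
df = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
idf = {t: math.log(len(docs) / df[t]) for t in vocabulary}

tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    # TF (relative frequency) times IDF, for every vocabulary term
    tfidf.append({t: (counts[t] / len(tokens)) * idf[t] for t in vocabulary})

for name, row in zip(["S1", "S2", "S3"], tfidf):
    print(name, {t: round(score, 3) for t, score in row.items()})
```

With the natural logarithm, "boy" scores about 0.203 in S1 and about 0.135 in S3, while "good" scores 0 everywhere because it occurs in every sentence.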

---

## 🧾 Step 6: Interpretation

- **High TF-IDF** → the word is frequent in a specific document but rare across others → more *informative*.
- **Low TF-IDF (close to 0)** → common words that appear in most documents (like “good”) → less useful for distinguishing.

---

## 📘 Final Summary

| Concept | Formula | Intuition |
|----------|----------|------------|
| **TF** | Count(word in doc) / Total words in doc | How frequent the word is in that sentence |
| **IDF** | log(Total docs / Docs containing word) | How rare the word is across all docs |
| **TF-IDF** | TF × IDF | Combined measure of frequency and importance |

---

## 💻 Implementation Example (Python)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "good boy",
    "good girl",
    "boy girl good"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

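**Output (rounded to four decimals):**

```
['boy' 'girl' 'good']
[[0.7898 0.     0.6134]
 [0.     0.7898 0.6134]
 [0.6198 0.6198 0.4813]]
```

“good” no longer scores 0 here, because `TfidfVectorizer` smooths the IDF and L2-normalizes each row by default.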
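
To see where scikit-learn's numbers come from, the default weighting can be re-derived with the standard library alone. This sketch assumes the documented defaults of `TfidfVectorizer`: raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + nₜ)) + 1, and L2 normalization of each row. It does not call scikit-learn itself.

```python
import math
from collections import Counter

docs = ["good boy", "good girl", "boy girl good"]
tokenized = [d.split() for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + n_t)) + 1, so no term ever gets weight 0.
df = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]   # raw count × IDF
    norm = math.sqrt(sum(x * x for x in raw))        # L2 norm of the row
    rows.append([x / norm for x in raw])

for row in rows:
    print([round(x, 4) for x in row])
```

Because of the `+ 1` in the smoothed IDF, a word that occurs in every document still gets a non-zero weight, unlike in the hand calculation.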

---

### 🏁 Key Takeaways

* TF-IDF balances **frequency (TF)** with **uniqueness (IDF)**.
* It helps reduce the weight of **common words** and highlights **rare, meaningful words**.
* It’s widely used in text mining, document classification, search engines, and NLP models.


## ⚖️ Advantages and Disadvantages of TF-IDF

---

### ✅ Advantages

1. **Simple and Easy to Implement**
   - TF-IDF is mathematically simple and easy to apply using libraries like `sklearn`.
   - **Example:** You can convert any text corpus into numerical features with just a few lines of code.

2. **Captures Word Importance**
   - It assigns higher weight to *important* words (that appear frequently in one document but rarely across others).
   - **Example:**  
     In product reviews — the word *“defective”* may appear rarely overall but often in negative reviews → TF-IDF highlights it as important.

3. **Fixed-Size Representation**
   - Each document is represented as a fixed-length vector (based on vocabulary size), making it easy to feed into machine learning models.
   - **Example:**  
     If the vocabulary size is 1000, every document → 1000-length vector.

4. **Reduces the Impact of Common Words**
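   - Stopwords or frequent terms (like *“is”, “the”, “and”*) appear in almost every document, so their IDF, and with it their score, is low.
   - **Example:**  
     Even without manually removing “the” or “is”, TF-IDF gives them the lowest weights (exactly zero under the plain ln(N / nₜ) formula when a word occurs in every document).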

5. **Efficient for Text Classification and Search**
   - Works well for document ranking, spam detection, and keyword extraction where context is less important.
   - **Example:**  
     Search engines use TF-IDF to match queries with documents that contain rare but relevant terms.
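
As a toy illustration of the search use case, the sketch below ranks three made-up documents against a query by summing TF × IDF over the query terms. The corpus, the query, and the helper names are all hypothetical, and real search engines use considerably more elaborate scoring.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for this illustration.
docs = {
    "d1": "cheap flights to paris",
    "d2": "paris hotel deals",
    "d3": "cheap hotel deals",
}
tokenized = {name: text.split() for name, text in docs.items()}

def idf(term: str) -> float:
    """ln(N / n_t); terms that occur in no document get weight 0."""
    n_t = sum(term in tokens for tokens in tokenized.values())
    return math.log(len(tokenized) / n_t) if n_t else 0.0

def score(query: str, tokens: list[str]) -> float:
    """Sum of TF × IDF over the query terms."""
    counts = Counter(tokens)
    return sum(counts[t] / len(tokens) * idf(t) for t in query.split())

query = "paris flights"
ranking = sorted(tokenized, key=lambda name: score(query, tokenized[name]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

`d1` wins because it contains “flights”, the rarest query term; `d2` matches only the more common “paris”, and `d3` matches nothing.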

---

### ❌ Disadvantages

1. **No Semantic Meaning is Captured**
   - TF-IDF only counts words — it doesn’t understand meaning, context, or relationships between words.
   - **Example:**  
     “not good” and “good” look similar to TF-IDF, even though they have opposite meanings.

2. **Sparsity Problem**
   - TF-IDF vectors are mostly zeros because most words don’t appear in every document.
   - **Example:**  
     If your vocabulary has 10,000 words and each document uses only 200 → 98% of the vector will be zeros.

3. **Out-of-Vocabulary (OOV) Issue**
   - New or unseen words in test data cannot be represented.
   - **Example:**  
     If “iPhone” never appeared in training but shows up in testing → model can’t handle it directly.

4. **Fixed Vocabulary Size**
   - Once the vocabulary is defined, adding new words or documents requires rebuilding the entire matrix.
   - **Example:**  
     Adding 10,000 new tweets means recomputing IDF for all words again.

5. **Ignores Word Order and Context**
   - TF-IDF treats each word independently and ignores sentence structure.
   - **Example:**  
     “dog bites man” and “man bites dog” will have identical vectors, though meanings differ.

6. **Not Suitable for Large-Scale or Streaming Data**
   - TF-IDF computation becomes expensive when the corpus is huge or continuously growing.
   - **Example:**  
     For millions of online articles updated daily, maintaining an up-to-date TF-IDF model is difficult.
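
Two of these limitations, lost word order and out-of-vocabulary words, already show up in the raw count vectors that TF-IDF is built on. A minimal sketch with a hypothetical three-word vocabulary:

```python
def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs; other words are dropped."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

vocabulary = ["bites", "dog", "man"]  # fixed when the model was "trained"

a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)
print(a == b)  # True: word order is lost

# "iphone" is not in the vocabulary, so it leaves no trace in the vector.
c = bag_of_words("man bites iphone", vocabulary)
print(c)  # [1, 0, 1]
```

Since TF-IDF only rescales these counts term by term, identical count vectors always produce identical TF-IDF vectors.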

---

### 💡 Summary Table

| Aspect | Advantage | Example | Disadvantage | Example |
|--------|------------|----------|--------------|----------|
| **Simplicity** | Easy to implement | Using `TfidfVectorizer` in sklearn | - | - |
| **Importance** | Captures rare important words | “defective”, “refund” in reviews | Common words still appear | “good”, “nice” get low info value |
| **Fixed Size** | Same vector length for all docs | 1000 features per doc | High dimensional & sparse | 98% zeros in large vocab |
| **Meaning** | Weights show importance | Highlights keywords | No semantic or order info | “good” and “not good” look alike |
| **Vocabulary** | Defined once | Fixed features | Must rebuild for new words | Adding new tweets changes IDF |

---

### 🏁 Conclusion

- ✅ **Best used for:** Keyword extraction, search ranking, spam filtering, or topic-based clustering.  
- ❌ **Avoid for:** Deep understanding tasks (like sentiment analysis, chatbots, or translation) — use **Word2Vec**, **GloVe**, or **BERT** instead.


# 🧠 TF-IDF Practical Implementation (Using Python)

We’ll use `sklearn`’s `TfidfVectorizer` to transform text into numerical features based on **Term Frequency–Inverse Document Frequency (TF-IDF)**.

---

## 🧩 Step 1: Import Libraries

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
```

---

## 🧩 Step 2: Define a Sample Corpus

```python
corpus = [
    "The boy is good",
    "The girl is good",
    "The boy and girl is good"
]
```

After **lowercasing + stopword removal**, this becomes:

```
["good boy", "good girl", "boy girl good"]
```

---

## 🧩 Step 3: Apply TF-IDF Vectorizer

```python
vectorizer = TfidfVectorizer(stop_words='english')  # removes common English stopwords
X = vectorizer.fit_transform(corpus)
```

---

## 🧩 Step 4: Display TF-IDF Matrix

```python
# Convert to DataFrame for better readability
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
```

**Output:**

```
        boy      girl      good
0  0.789807  0.000000  0.613356
1  0.000000  0.789807  0.613356
2  0.619805  0.619805  0.481334
```

---

## 🧠 Step 5: Interpretation

| Sentence                       | TF-IDF Vector                       | Meaning                                   |
| ------------------------------ | ----------------------------------- | ----------------------------------------- |
| S1: “The boy is good”          | (boy=0.790, good=0.613)             | “boy” outweighs the more common “good”    |
| S2: “The girl is good”         | (girl=0.790, good=0.613)            | “girl” outweighs the more common “good”   |
| S3: “The boy and girl is good” | (boy=0.620, girl=0.620, good=0.481) | “boy” and “girl” tie; “good” is lowest    |

* TF-IDF gives **lower weight** to words that appear in *many documents* (like “good”).
* Words that appear in *fewer documents* (like “boy” or “girl”) get slightly **higher importance**.

---

## 🧮 Step 6: Check Vocabulary and IDF Values

```python
print(vectorizer.vocabulary_)   # word to index mapping
print(vectorizer.idf_)          # IDF scores for each word
```

**Example Output:**

```
{'boy': 0, 'good': 2, 'girl': 1}
[1.28768207 1.28768207 1.        ]
```

* “good” appears in every document → IDF = 1, the minimum under scikit-learn's default smoothed formula ln((1 + N) / (1 + nₜ)) + 1.
* “boy” and “girl” each appear in 2 of the 3 documents → IDF = ln(4/3) + 1 ≈ 1.288.

---

## 🏁 Step 7: Summary

| Feature                              | Meaning                                                                 |
| ------------------------------------ | ----------------------------------------------------------------------- |
| **TF (Term Frequency)**              | Measures how often a word appears in a document                         |
| **IDF (Inverse Document Frequency)** | Reduces weight for common words appearing in many documents             |
| **TF × IDF**                         | Gives importance to words that are frequent in one doc but rare overall |

---

## ✅ Use Cases

* Document similarity (e.g., news clustering)
* Search ranking (e.g., scoring documents against a query)
* Spam filtering (e.g., identifying common spam words)
* Keyword extraction for summarization

---

### 📘 Example Summary

| Word     | S1    | S2    | S3    | IDF   | Comment                                        |
| -------- | ----- | ----- | ----- | ----- | ---------------------------------------------- |
| **boy**  | 0.790 | 0.000 | 0.620 | 1.288 | Appears in fewer sentences → higher importance |
| **girl** | 0.000 | 0.790 | 0.620 | 1.288 | Also relatively unique                         |
| **good** | 0.613 | 0.613 | 0.481 | 1.000 | Common word → lower importance                 |

---

### 🧩 Key Takeaway

TF-IDF is **simple, powerful, and effective** for understanding which words matter most in a set of documents — a fundamental step before applying ML or NLP algorithms.



## **My own TF-IDF Practical Implementation**

```python
import pandas as pd

# Tab-separated file without a header row: column 0 is the label, column 1 the message.
messages = pd.read_csv('SpamClassifier-master/smsspamcollection/SMSSpamCollection',
                      sep='\t', names=["label", "message"])
```

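```python
messages
```

**Output:**

|   | label | message |
|---|-------|---------|
| 0 | ham Go until jurong point, crazy.. Available o... | NaN |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | WINNER!! As a valued network customer you have... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 8 | spam | XXXMobileMovieClub: To use your credit, click ... |
| 9 | ham | Oh k...i'm watching here:) |

Row 0 was not split into `label` and `message`: the whole line landed in `label` and `message` is `NaN` (most likely a missing tab on that line of the file), so the first cleaned entry in the corpus is the string `'nan'`.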


```python
## Data Cleaning and Processing
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
```

**Output:**

```
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True
```

```python
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet data; download it once if it is not installed yet.
nltk.download('wordnet')

Wordlemmatizer = WordNetLemmatizer()
```

```python
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast

corpus = []
for i in range(len(messages)):
    msg = str(messages['message'][i])
    review = re.sub('[^a-zA-Z]', ' ', msg)   # keep letters only
    review = review.lower().split()
    review = [Wordlemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

print(corpus[:6])  # preview the first six cleaned messages
```

**Output:**

```
['nan', 'ok lar joking wif u oni', 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply', 'u dun say early hor u c already say', 'nah think go usf life around though', 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour']
```

#### **Create TF-IDF And NGrams** 

```python
from sklearn.feature_extraction.text import TfidfVectorizer

## Keep only the 100 most frequent terms in the corpus
tfidf = TfidfVectorizer(max_features=100)
X = tfidf.fit_transform(corpus).toarray()
```

```python
import numpy as np

# Print wide rows on one line and show floats with 3 significant digits.
np.set_printoptions(edgeitems=30, linewidth=100000,
                    formatter=dict(float=lambda x: "%.3g" % x))

X
```

**Output (truncated):**

```
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0.172, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.245, 0, 0, 0, 0, 0, 0.245, 0, 0, 0, 0, 0, 0, 0, ..., 0.205, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.205, 0.221, 0, 0, 0, 0, 0, 0, 0.221, 0.221, 0.157, 0, 0, 0, 0, 0, 0.172],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.378, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.756, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
```

#### **NGrams**

```python
# ngram_range=(2, 2) builds the vocabulary from bigrams (two-word sequences) only
tfidf = TfidfVectorizer(max_features=100, ngram_range=(2,2))
X = tfidf.fit_transform(corpus).toarray()
```
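
To make the idea of a bigram concrete, here is a minimal standard-library sketch of n-gram extraction. The helper `ngrams` is illustrative and not part of scikit-learn; it also skips one detail of the vectorizer, whose default token pattern drops single-character tokens such as `u` or `c` before the n-grams are formed.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return all contiguous n-token windows, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "free entry wkly comp win".split()
print(ngrams(tokens, 2))
# ['free entry', 'entry wkly', 'wkly comp', 'comp win']

# Unlike single words, bigrams keep some word order:
print(ngrams("dog bites man".split(), 2) == ngrams("man bites dog".split(), 2))  # False
```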

```python
tfidf.vocabulary_
```

**Output (truncated):**

```
{'lar joking': np.int64(89),
 'joking wif': np.int64(87),
 'free entry': np.int64(66),
 'entry wkly': np.int64(55),
 'comp win': np.int64(37),
 'fa cup': np.int64(59),
 'cup final': np.int64(44),
 'final tkts': np.int64(63),
 'fa receive': np.int64(60),
 'entry question': np.int64(54),
 'rate apply': np.int64(99),
 'dun say': np.int64(50),
 'early hor': np.int64(51),
 'hor already': np.int64(80),
 'already say': np.int64(2),
 'go usf': np.int64(73),
 'life around': np.int64(91),
 'around though': np.int64(4),
 'customer selected': np.int64(45),
 'call claim': np.int64(17),
 'claim code': np.int64(31),
 'code kl': np.int64(34),
 'kl valid': np.int64(88),
 'even brother': np.int64(56),
 'brother like': np.int64(14),
 'like speak': np.int64(94),
 'like aid': np.int64(93),
 'aid patent': np.int64(1),
 'date sunday': np.int64(48),
 'credit click': np.int64(43),
 'click link': np.int64(33),
 'link message': np.int64(97),
 'congratulation walmart': np.int64(40),
 'gift card': np.int64(71),
 '
```

In [18]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.707, 0, 0.707, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.304],
       [0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0.577, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0.577, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.577, 0, 0, 0, 0, 0, 0, 0, 0],
  