In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:

# Query and top-k documents
query = "machine learning techniques"

In [None]:
documents = [
    "Deep learning and its applications",
    "Introduction to machine learning",
    "Statistical methods in AI",
]

Let's break down the TF-IDF calculation for the sentence **"Deep learning and its applications"** step by step. We'll calculate the **TF-IDF score** for each term in the sentence based on a small, hypothetical corpus for simplicity. This corpus is separate from the `documents` list defined in the code above; only the first sentence is shared.

---

### Corpus (Documents)
We'll use the following corpus:
1. Document 1: "Deep learning and its applications"
2. Document 2: "Deep learning is a subset of machine learning"
3. Document 3: "Applications of machine learning include image recognition"

---

### Step 1: Tokenization
First, split each document into individual terms (words). Ignore case and remove punctuation.

**Document 1**: ["deep", "learning", "and", "its", "applications"]  
**Document 2**: ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]  
**Document 3**: ["applications", "of", "machine", "learning", "include", "image", "recognition"]
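
The tokenization step can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple regular-expression tokenizer; the names `corpus` and `tokenize` are introduced here for the example and are not part of the notebook's code.

```python
import re

# The three-document walkthrough corpus from the section above.
corpus = [
    "Deep learning and its applications",
    "Deep learning is a subset of machine learning",
    "Applications of machine learning include image recognition",
]

def tokenize(text):
    # Lowercase the text, then keep runs of letters or digits.
    # Punctuation and whitespace are dropped as a side effect.
    return re.findall(r"[a-z0-9]+", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
for tokens in tokenized:
    print(tokens)
```

Repeated words are kept at this stage (Document 2 contains "learning" twice), because the term-frequency step needs the raw counts.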

---

### Step 2: Calculate Term Frequency (TF)
The **term frequency (TF)** measures how often a word appears in a document relative to the total number of words in that document.

$$
\text{TF}(t, d) = \frac{\text{Frequency of } t \text{ in } d}{\text{Total terms in } d}
$$

For Document 1 (**"Deep learning and its applications"**):

| Term           | Frequency | Total Terms | TF (Term Frequency)        |
|----------------|-----------|-------------|-----------------------------|
| deep           | 1         | 5           | $ \frac{1}{5} = 0.2 $    |
| learning       | 1         | 5           | $ \frac{1}{5} = 0.2 $    |
| and            | 1         | 5           | $ \frac{1}{5} = 0.2 $    |
| its            | 1         | 5           | $ \frac{1}{5} = 0.2 $    |
| applications   | 1         | 5           | $ \frac{1}{5} = 0.2 $    |
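
The same table can be produced programmatically. The sketch below assumes the token list from Step 1; `term_frequency` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

# Tokens of Document 1 from Step 1.
doc1_tokens = ["deep", "learning", "and", "its", "applications"]

def term_frequency(tokens):
    # Count each term, then divide by the total number of tokens.
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf_doc1 = term_frequency(doc1_tokens)
for term, value in tf_doc1.items():
    print(f"{term:<14}{value}")
```

Every term in Document 1 occurs once among five tokens, so each TF is 0.2. In Document 2, "learning" occurs twice among eight tokens, which would give it a TF of 0.25.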

---

### Step 3: Calculate Inverse Document Frequency (IDF)
The **inverse document frequency (IDF)** measures how rare a term is across the entire corpus, which serves as a proxy for how informative it is. Rare terms have higher IDF values, while common terms have lower values.

$$
\text{IDF}(t) = \log \left( \frac{N}{\text{DF}(t)} \right)
$$

Where:
- $ N $ is the total number of documents.
- $ \text{DF}(t) $ is the number of documents containing the term $ t $.
- $ \log $ is the base-10 logarithm, which is what the values in the table below use.

| Term           | DF (Number of Documents with Term) | Total Documents (N) | IDF                                      |
|----------------|------------------------------------|----------------------|------------------------------------------|
| deep           | 2                                  | 3                    | $ \log \left( \frac{3}{2} \right) = 0.176 $ |
| learning       | 3                                  | 3                    | $ \log \left( \frac{3}{3} \right) = 0.0 $     |
| and            | 1                                  | 3                    | $ \log \left( \frac{3}{1} \right) = 0.477 $ |
| its            | 1                                  | 3                    | $ \log \left( \frac{3}{1} \right) = 0.477 $ |
| applications   | 2                                  | 3                    | $ \log \left( \frac{3}{2} \right) = 0.176 $ |
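
As a cross-check of the table, the document frequencies and base-10 IDF values can be computed directly. This is a sketch under the same assumptions as the walkthrough (unsmoothed IDF, base-10 logarithm); `inverse_document_frequency` is a name chosen for this example.

```python
import math

# Tokenized walkthrough corpus from Step 1.
tokenized = [
    ["deep", "learning", "and", "its", "applications"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
    ["applications", "of", "machine", "learning", "include", "image", "recognition"],
]

def inverse_document_frequency(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocabulary = {term for tokens in tokenized_docs for term in tokens}
    idf = {}
    for term in vocabulary:
        # DF counts documents containing the term, not total occurrences.
        df = sum(1 for tokens in tokenized_docs if term in tokens)
        idf[term] = math.log10(n_docs / df)
    return idf

idf = inverse_document_frequency(tokenized)
for term in ["deep", "learning", "and", "its", "applications"]:
    print(f"{term:<14}{idf[term]:.3f}")
```

Note that "learning" appears twice in Document 2 but still contributes only 1 to its document frequency.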

---

### Step 4: Calculate TF-IDF
Finally, compute the **TF-IDF** score for each term by multiplying the term frequency (TF) by the inverse document frequency (IDF):

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

For Document 1 (**"Deep learning and its applications"**):

| Term           | TF     | IDF       | TF-IDF                |
|----------------|--------|-----------|-----------------------|
| deep           | 0.2    | 0.176     | $ 0.2 \times 0.176 = 0.0352 $  |
| learning       | 0.2    | 0.0       | $ 0.2 \times 0.0 = 0.0 $       |
| and            | 0.2    | 0.477     | $ 0.2 \times 0.477 = 0.0954 $  |
| its            | 0.2    | 0.477     | $ 0.2 \times 0.477 = 0.0954 $  |
| applications   | 0.2    | 0.176     | $ 0.2 \times 0.176 = 0.0352 $  |

---

### Final TF-IDF Vector for Document 1
The vector representation of the document "Deep learning and its applications" based on TF-IDF, with components in the order [deep, learning, and, its, applications], is:

$$
\mathbf{v} = [0.0352, 0.0, 0.0954, 0.0954, 0.0352]
$$

This vector can now be used to compare the document with others (or a query) using similarity measures like **cosine similarity**.
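
Putting Steps 2 to 4 together, the vector for Document 1 can be reproduced with a short function. This is an illustrative sketch that follows the formulas above (relative term frequency, unsmoothed base-10 IDF); `tf_idf` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

# Tokenized walkthrough corpus from Step 1.
tokenized = [
    ["deep", "learning", "and", "its", "applications"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
    ["applications", "of", "machine", "learning", "include", "image", "recognition"],
]

def tf_idf(tokens, tokenized_docs):
    n_docs = len(tokenized_docs)
    scores = {}
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        df = sum(1 for doc in tokenized_docs if term in doc)
        scores[term] = tf * math.log10(n_docs / df)
    return scores

doc1_scores = tf_idf(tokenized[0], tokenized)
term_order = ["deep", "learning", "and", "its", "applications"]
vector = [round(doc1_scores[term], 4) for term in term_order]
print(vector)
```

The vector here only has one component per term of Document 1. To compare it with other documents, each vector would need one component per term in the shared vocabulary, with zeros for absent terms.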

We use the **logarithmic measure** in the **Inverse Document Frequency (IDF)** calculation to control the impact of document frequency on a term's weight. The logarithm ensures that the IDF values grow more moderately as a term becomes rarer and prevents the measure from disproportionately favoring extremely rare terms. Here’s a detailed explanation:

---

### 1. **Handling Growth in Document Frequency**
The formula for IDF without a logarithm is:

$$
\text{IDF}(t) = \frac{N}{\text{DF}(t)}
$$

Here:
- $ N $ is the total number of documents.
- $ \text{DF}(t) $ is the number of documents containing the term $ t $.

Without the logarithm, the IDF value grows in inverse proportion to $ \text{DF}(t) $: halving the document frequency doubles the IDF. For very rare terms (e.g., terms that appear in only 1 out of 1,000,000 documents), the IDF value can become extremely large, which can dominate the overall TF-IDF calculation and skew the importance of terms disproportionately.

By taking the logarithm, we **dampen this growth**, ensuring that IDF increases at a slower rate for rare terms. This makes the measure more stable and interpretable.

$$
\text{IDF}(t) = \log \left( \frac{N}{\text{DF}(t)} \right)
$$

---

### 2. **Emphasizing Rare Terms (But Not Too Much)**
The purpose of IDF is to assign higher importance to rare terms, as they are more likely to be meaningful or unique to specific documents. For instance:
- A term appearing in only 1 document out of 1,000,000 is clearly rare and important.
- A term appearing in 900,000 documents is common and less informative.

Using a logarithm ensures that **rare terms get higher scores**, but it prevents the scores from becoming excessively large. For example:
- Without the logarithm: $ \text{IDF}(t) = \frac{10{,}000}{1} = 10{,}000 $ for a term that appears in 1 of 10,000 documents.
- With the (base-10) logarithm: $ \text{IDF}(t) = \log_{10}(10{,}000) = 4 $.

This moderated score keeps the TF-IDF values reasonable.

---

### 3. **Smoothing the Scale**
Logarithms compress large ranges into smaller, manageable scales, which is particularly helpful in text data where document frequencies can vary wildly. For example:
- A term appearing in 1 document has an IDF of $ \log(\frac{1000}{1}) = 3 $.
- A term appearing in 10 documents has an IDF of $ \log(\frac{1000}{10}) = 2 $.
- A term appearing in 100 documents has an IDF of $ \log(\frac{1000}{100}) = 1 $.

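This smooth progression means that each tenfold drop in document frequency adds a constant 1 to the IDF (with a base-10 logarithm) instead of multiplying it by 10, so the measure is not overly biased toward extreme rarity.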
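
The compression effect is easy to see side by side. The snippet below is a small sketch using the same corpus size of 1,000 documents as the examples above; the variable names are illustrative.

```python
import math

n_docs = 1000  # corpus size used in the examples above

# For each document frequency, compare the raw ratio with its base-10 log.
rows = [(df, n_docs / df, math.log10(n_docs / df)) for df in (1, 10, 100, 1000)]

for df, raw_ratio, log_idf in rows:
    print(f"DF={df:>4}  N/DF={raw_ratio:>6.0f}  log10(N/DF)={log_idf:.1f}")
```

The raw ratio spans three orders of magnitude (1 to 1000), while the logarithmic IDF only spans 0 to 3. A term that appears in every document gets an IDF of 0 and therefore drops out of the TF-IDF score entirely.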

---

### 4. **Prevents Overweighting Noise**
Extremely rare terms could sometimes be noise or typographical errors. Without logarithmic scaling, their IDF values would dominate the TF-IDF calculation. The logarithm dampens their influence, making the calculation more robust to noisy data.

---

### 5. **Aligns with Information Theory**
The logarithm in IDF aligns with the principles of **information theory**, where the informativeness of an event is inversely related to its probability. The rarer a term is (lower $ \text{DF}(t) $), the more "informative" it is, and the logarithmic function models this relationship effectively.
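
One way to make this link explicit, assuming we treat $ P(t) = \text{DF}(t) / N $ as the empirical probability that a randomly chosen document contains $ t $ (a modelling assumption introduced here for illustration):

$$
\text{IDF}(t) = \log \left( \frac{N}{\text{DF}(t)} \right) = -\log \left( \frac{\text{DF}(t)}{N} \right) = -\log P(t)
$$

Under that assumption, IDF has the same form as the self-information of the event "a document contains $ t $": a term found in every document has $ P(t) = 1 $ and carries zero information, while rarer terms carry more.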

---

### Summary
Using a logarithmic measure in IDF ensures:
1. Rare terms are emphasized but not overly dominant.
2. The scale of IDF values remains interpretable and manageable.
3. The calculation is more robust to noise and extreme values.
4. The measure aligns with the principles of information theory, reflecting term importance in a meaningful way.

In [None]:
# Step 1: TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([query] + documents)

The goal is to represent the query and documents in the same vector space, allowing us to compute similarities between them. Note that the vectorizer is fitted on the query together with the documents, so the query's terms also count toward the document frequencies. Using TF-IDF ensures that:

- Terms that appear in many texts (e.g., "the", "and" in a typical English corpus) are downweighted because they have a low IDF.
- Terms that appear in few texts (e.g., "techniques" here) are given a higher weight when they occur in the query or a document.

This representation is then used to calculate cosine similarity between the query and the documents for reranking or relevance scoring.

In [4]:
# Convert the sparse matrix to a dense format
dense_matrix = tfidf_matrix.toarray()

# Get feature names (terms in the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix with terms
print("TF-IDF Matrix:")
print(f"Terms: {feature_names}")
print(dense_matrix)

TF-IDF Matrix:
Terms: ['ai' 'and' 'applications' 'deep' 'in' 'introduction' 'its' 'learning'
 'machine' 'methods' 'statistical' 'techniques' 'to']
[[0.         0.         0.         0.         0.         0.
  0.         0.44809973 0.55349232 0.         0.         0.70203482
  0.        ]
 [0.         0.47633035 0.47633035 0.47633035 0.         0.
  0.47633035 0.30403549 0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.57457953
  0.         0.36674667 0.4530051  0.         0.         0.
  0.57457953]
 [0.5        0.         0.         0.         0.5        0.
  0.         0.         0.         0.5        0.5        0.
  0.        ]]
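
These numbers differ from the hand calculation earlier because `TfidfVectorizer` does not use the textbook formula by default. With its default settings it uses raw counts for TF, a smoothed natural-log IDF of the form $ \ln \left( \frac{1 + N}{1 + \text{DF}(t)} \right) + 1 $, and then scales each row to unit Euclidean length. The sketch below reproduces the query row (the first row of the matrix) under those defaults; the document frequencies are counted by hand across the four fitted texts, and the variable names are illustrative.

```python
import math

n_texts = 4  # the query plus the three documents

# Number of the four fitted texts that contain each query term.
document_frequency = {"learning": 3, "machine": 2, "techniques": 1}

def smoothed_idf(df, n):
    # Default smoothing (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Each query term occurs once, so the unnormalized weight equals the IDF.
raw = {term: smoothed_idf(df, n_texts) for term, df in document_frequency.items()}

# Default norm="l2": divide by the Euclidean length of the row.
length = math.sqrt(sum(w * w for w in raw.values()))
query_row = {term: w / length for term, w in raw.items()}

for term, weight in query_row.items():
    print(f"{term:<12}{weight:.4f}")
```

Because of the `+ 1` in the smoothed IDF, a term that appears in every text still receives a nonzero weight, unlike in the hand calculation where "learning" scored 0.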


In [None]:

# Step 2: Compute cosine similarity
similarity_scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()

In this step, the idea is to calculate the similarity between the query and the documents. The query and a document do need at least one term in common to get a nonzero score, but they don't need to share all the same words, and not every shared word counts equally. The goal here is to use **TF-IDF and cosine similarity** to find how relevant each document is to the query, which goes beyond simply counting overlapping words.

Yes, if the query and the document share the same words, it definitely helps. For example, if the query is "machine learning techniques" and the document says "Introduction to machine learning," the shared words like "machine" and "learning" contribute to a higher cosine similarity. This happens because both vectors have nonzero **TF-IDF weights** in the "machine" and "learning" dimensions, so those dimensions add to the dot product and raise the similarity score.

But the key here is that TF-IDF doesn’t treat all words equally. It focuses on **important terms**, meaning words that are rare across the entire corpus but frequent in a specific document or query. For instance, in a typical English corpus a term like "techniques" will carry more weight than a common word like "and" or "is." (In a tiny corpus such as this one, a word like "and" that occurs in only one document can still receive a high weight.) So, even if the query and document share a lot of common but unimportant words, their similarity might still be low if they miss the key terms.

One limitation, though, is that this approach doesn’t inherently handle synonyms or related words. For example, if my query is "methods for machine learning" and the document says "techniques in machine learning," the cosine similarity might not be as high as expected because the words "methods" and "techniques" don’t match exactly. This is where TF-IDF falls short compared to models like embeddings, which capture deeper semantic relationships.

So, the purpose of this code is to measure how well the terms in the query align with those in the documents, weighted by their importance. It works great for tasks where exact or near-exact word matching is sufficient. But for cases where I need to account for synonyms or broader contextual relevance, I might need to combine this with more advanced approaches like dense embeddings.
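
To see where the scores come from, cosine similarity can be computed by hand from the nonzero entries of the TF-IDF matrix printed earlier. This is a sketch only: the weights are copied from that output and rounded to four places, and `cosine`, `query_vec`, and `doc_vecs` are names introduced for the example.

```python
import math

# Nonzero TF-IDF weights from the printed matrix, rounded to 4 places.
query_vec = {"learning": 0.4481, "machine": 0.5535, "techniques": 0.7020}
doc_vecs = {
    "Deep learning and its applications": {
        "and": 0.4763, "applications": 0.4763, "deep": 0.4763,
        "its": 0.4763, "learning": 0.3040,
    },
    "Introduction to machine learning": {
        "introduction": 0.5746, "learning": 0.3667,
        "machine": 0.4530, "to": 0.5746,
    },
    "Statistical methods in AI": {
        "ai": 0.5, "in": 0.5, "methods": 0.5, "statistical": 0.5,
    },
}

def cosine(u, v):
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

scores = {text: cosine(query_vec, vec) for text, vec in doc_vecs.items()}
for text, score in scores.items():
    print(f"{score:.3f}  {text}")
```

"Statistical methods in AI" shares no term with the query, so its dot product and its score are exactly zero. This is the synonym limitation in action: "methods" and "techniques" occupy different dimensions and contribute nothing to each other.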

In [None]:

# Step 3: Rerank documents based on similarity scores
reranked_indices = similarity_scores.argsort()[::-1]
reranked_documents = [documents[i] for i in reranked_indices]

In [5]:
print("Reranked Documents:")
for i, doc in enumerate(reranked_documents):
    print(f"Rank {i+1}: {doc}")


Reranked Documents:
Rank 1: Introduction to machine learning
Rank 2: Deep learning and its applications
Rank 3: Statistical methods in AI
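
It can also help to show the score next to each ranked document. The sketch below uses plain Python and approximate scores derived from the TF-IDF matrix printed earlier (about 0.136, 0.415, and 0.0); these rounded values are assumptions for illustration, not output captured from the notebook.

```python
documents = [
    "Deep learning and its applications",
    "Introduction to machine learning",
    "Statistical methods in AI",
]

# Approximate cosine scores, in the same order as `documents`.
similarity_scores = [0.136, 0.415, 0.0]

# sorted() is stable, so documents with equal scores keep their input order.
order = sorted(
    range(len(documents)),
    key=lambda i: similarity_scores[i],
    reverse=True,
)

for rank, i in enumerate(order, start=1):
    print(f"Rank {rank}: {documents[i]} (score={similarity_scores[i]:.3f})")
```

Using a stable sort is a design choice: with `argsort()[::-1]`, the relative order of documents with identical scores is not guaranteed to follow their input order, which may matter when many documents score 0.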
