In [1]:
import numpy as np

# TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection or corpus of documents. It is widely used in information retrieval and text mining, particularly in tasks like document classification and keyword extraction.

## TF-IDF Formula

The TF-IDF score is computed as the product of two components:

### 1. **TF (Term Frequency)**:
This measures how frequently a word appears in a document. The idea is that words that appear more frequently in a document are likely more important.

\[
\text{TF}(t, d) = \frac{\text{Count of term t in document d}}{\text{Total number of terms in document d}}
\]

- **t**: The term (word) in the document.
- **d**: The specific document.
- The numerator is the number of times the term **t** appears in document **d**.
- The denominator is the total number of words in document **d**.

#### Example:
If the word "apple" appears 3 times in a document that contains 100 words, then the **TF** of "apple" in that document is:

\[
\text{TF}(\text{apple}, d) = \frac{3}{100} = 0.03
\]
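
The TF calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the `TfIdf` class below: the helper name `term_frequency` and the synthetic 100-token document are assumptions made for the example.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Relative frequency of `term` in a tokenized document (hypothetical helper)."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return Counter(tokens)[term] / len(tokens)

# Synthetic document: 100 tokens, 3 of which are "apple".
doc = ["apple"] * 3 + ["filler"] * 97
tf_apple = term_frequency("apple", doc)
print(tf_apple)  # 3 / 100
```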

### 2. **IDF (Inverse Document Frequency)**:
This measures how rare a word is across the documents in the corpus. The idea is that words that appear in many documents are less informative (common words like "the", "is", etc.), while words that appear in only a few documents are more informative (they tend to be specific to a document or topic).

\[
\text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term t}} \right)
\]

- **Total number of documents**: The total number of documents in the corpus.
- **Number of documents containing term t**: The number of documents where the term **t** appears.

#### Example:
If there are 100 documents in the corpus, and the term "apple" appears in 10 of those documents, the **IDF** for "apple" (using a base-10 logarithm) is:

\[
\text{IDF}(\text{apple}) = \log_{10} \left( \frac{100}{10} \right) = \log_{10}(10) = 1
\]
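
The value 1 in this example corresponds to a base-10 logarithm. The choice of base only rescales the scores, but it matters when comparing numbers: with the natural logarithm, the same ratio gives roughly 2.30. A small sketch, assuming the document counts from the example (the variable names are illustrative):

```python
import math

n_docs = 100            # total number of documents in the corpus
n_docs_with_term = 10   # documents that contain "apple"

ratio = n_docs / n_docs_with_term

idf_base10 = math.log10(ratio)  # base-10 logarithm, as in the worked example
idf_natural = math.log(ratio)   # natural logarithm

print(idf_base10, idf_natural)
```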

## TF-IDF Calculation:

Once we have **TF** and **IDF**, we multiply them together to get the **TF-IDF** score for a term in a document.

\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]

#### Example:
If the term "apple" has a **TF** of 0.03 in a document and an **IDF** of 1, the **TF-IDF** for "apple" in that document would be:

\[\text{TF-IDF}(\text{apple}, d) = 0.03 \times 1 = 0.03\]
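
To tie the two pieces together, the sketch below reproduces this number on a synthetic corpus. The corpus is fabricated purely so that the counts match the example (100 documents, 10 of them containing "apple"), the function name `tf_idf` is hypothetical, and a base-10 logarithm is assumed.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, using a base-10 IDF without smoothing."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log10(len(corpus) / df)

# Target document: 100 tokens, 3 of them "apple".
target = ["apple"] * 3 + ["filler"] * 97
# 100 documents in total; "apple" occurs in 10 of them (the target plus 9 others).
corpus = [target] + [["apple", "pie"]] * 9 + [["other", "words"]] * 90

score = tf_idf("apple", target, corpus)
print(score)  # 0.03 * 1
```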

## Key Insights:
- **TF** captures the frequency of a term in a document.
- **IDF** adjusts the importance of the term based on how many documents in the corpus contain it, emphasizing rare terms.
- **TF-IDF** combines these two, giving more weight to terms that are frequent in a document but rare across the entire corpus.

## Why Use TF-IDF?

- **TF** by itself can be misleading because common words (like "the" or "and") will have high frequencies in many documents.
- **IDF** helps to reduce the weight of terms that appear in many documents, making rare terms more significant.
- **TF-IDF** helps identify terms that are both significant in a document and relatively rare across the corpus, which is useful in many text analysis tasks like search engines, document clustering, and keyword extraction.
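
The effect described in these points can be seen on a toy corpus. In the sketch below (corpus and function name invented for illustration, natural logarithm, no smoothing), "the" is the most frequent term in the first document, yet its TF-IDF score is zero because it occurs in every document.

```python
import math
from collections import Counter

# Toy corpus: "the" appears in every document.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ran"],
    ["the", "python", "code", "ran"],
]

def tf_idf_scores(doc, corpus):
    """Map each term of `doc` to its TF-IDF score (unsmoothed, natural log)."""
    scores = {}
    for term, count in Counter(doc).items():
        df = sum(1 for d in corpus if term in d)
        scores[term] = (count / len(doc)) * math.log(len(corpus) / df)
    return scores

scores = tf_idf_scores(corpus[0], corpus)
top_term_by_tf = Counter(corpus[0]).most_common(1)[0][0]

print("Most frequent term:", top_term_by_tf)
print("TF-IDF scores:", scores)
```

Ranking by raw frequency would pick "the", whereas ranking by TF-IDF favors the document-specific terms such as "cat" and "mat".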


In [2]:
import numpy as np
from collections import defaultdict, Counter

class TfIdf:
    def __init__(self, ducs: list) -> None:
        self.ducs = ducs
        self.size_ducs = len(ducs)
        self.words = self._collect_words()
        self.word_index = {word: idx for idx, word in enumerate(self.words)}
        self.word_count = self._compute_word_count()
        self.idf_values = self._compute_idf()

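    def transform(self) -> np.ndarray:
        # Convenience alias: returns the TF-IDF matrix for the stored documents.
        return self.tf_idf()

    def tf_idf(self) -> np.ndarray:
        # One row per document, one column per vocabulary term.
        vectors = np.zeros((self.size_ducs, len(self.words)))
        for i, duc in enumerate(self.ducs):
            vectors[i] = self._tf_idf(duc)
        return vectors

    def _tf_idf(self, duc: list) -> np.ndarray: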
        vector = np.zeros(len(self.words))
        tf = Counter(duc)
        for word in tf:
            if word in self.word_index:
                vector[self.word_index[word]] = (tf[word] / len(duc)) * self.idf_values[word]
        return vector

    def _compute_idf(self) -> dict:
        idf_values = {}
        for word in self.word_index:
            df = self.word_count[word]
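            # np.log is the natural logarithm. The +1 in the denominator departs from the
            # formula above: a term found in every document gets a slightly negative IDF.
            idf_values[word] = np.log(self.size_ducs / (df + 1))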
        return idf_values

    def _collect_words(self) -> list:
        words = set()
        for duc in self.ducs:
            words.update(duc)
        return sorted(words)

    def _compute_word_count(self) -> dict:
        word_count = defaultdict(int)
        for duc in self.ducs:
            unique_words = set(duc)
            for word in unique_words:
                word_count[word] += 1
        return word_count
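
A note on the `+ 1` used in `_compute_idf`: it changes the scores relative to the plain formula given earlier. The sketch below compares the two variants for a 4-document corpus (the same size as the sample that follows); the variable names are illustrative only. With the shifted denominator, a term found in 3 of 4 documents gets an IDF of exactly 0, and a term found in all 4 gets a negative value.

```python
import math

n_docs = 4
plain = {}    # log(N / df), the formula from the text
shifted = {}  # log(N / (df + 1)), the variant used in the class above

for df in range(1, n_docs + 1):
    plain[df] = math.log(n_docs / df)
    shifted[df] = math.log(n_docs / (df + 1))
    print(f"df={df}: plain={plain[df]:.4f}  shifted={shifted[df]:.4f}")
```

Other smoothing schemes exist that keep the IDF non-negative; which one to use is a design choice and is not prescribed by the text above.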

In [3]:
# Sample documents (ducs)
ducs = [
    ["apple", "banana", "fruit", "green", "fruit", "sweet"],
    ["computer", "programming", "python", "data", "science", "learning"],
    ["football", "soccer", "sports", "game", "team", "players"],
    ["apple", "orange", "banana", "fruit", "sweet", "tasty"]
]

# Initialize the TfIdf transformer
tfidf = TfIdf(ducs)

# Transform the documents into TF-IDF vectors
tfidf_matrix = tfidf.transform()

# Display the vocabulary (terms) and the corresponding TF-IDF matrix
print("Vocabulary:", tfidf.words)
print("TF-IDF Matrix:\n", tfidf_matrix)

# Example: Get TF-IDF vector for the first document (document at index 0)
print("\nTF-IDF Vector for Document 0:", tfidf_matrix[0])

Vocabulary: ['apple', 'banana', 'computer', 'data', 'football', 'fruit', 'game', 'green', 'learning', 'orange', 'players', 'programming', 'python', 'science', 'soccer', 'sports', 'sweet', 'tasty', 'team']
TF-IDF Matrix:
 [[0.04794701 0.04794701 0.         0.         0.         0.09589402
  0.         0.11552453 0.         0.         0.         0.
  0.         0.         0.         0.         0.04794701 0.
  0.        ]
 [0.         0.         0.11552453 0.11552453 0.         0.
  0.         0.         0.11552453 0.         0.         0.11552453
  0.11552453 0.11552453 0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.11552453 0.
  0.11552453 0.         0.         0.         0.11552453 0.
  0.         0.         0.11552453 0.11552453 0.         0.
  0.11552453]
 [0.04794701 0.04794701 0.         0.         0.         0.04794701
  0.         0.         0.         0.11552453 0.         0.
  0.         0.         0.         0.         0.04794