#### TF-IDF Implementation Steps
1) Preprocess Text: Tokenize, lowercase, and optionally remove stopwords (similar to BoW).
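2) Calculate Term Frequency (TF): The number of times a term occurs in a document, divided by the total number of tokens in that document.
3) Calculate Inverse Document Frequency (IDF): A measure of how rare a term is across the corpus; terms that appear in fewer documents get higher values.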
4) Combine TF and IDF: Multiply to get the TF-IDF score.
5) Transform Corpus: Convert all documents into their TF-IDF vector representations.
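
The scores from steps 2 to 4 can be written compactly. The formulas below are one common smoothed variant, chosen because they match the `compute_tf` and `build_vocab_and_idf` methods in the implementation that follows; other TF-IDF definitions exist. Here $N$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $|d|$ is the number of tokens in document $d$ after preprocessing.

$$
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln\!\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$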


In [2]:
import math
import re
from collections import Counter, defaultdict


class TFIDF:
    def __init__(self, stopwords=None):
        """
        Initializes the TF-IDF model.

        Parameters:
        - stopwords (set): A set of words to ignore. Default is None.
        """
        self.stopwords = stopwords if stopwords else set()
        self.vocab = set()
        self.doc_freq = defaultdict(int)  # Stores document frequency for each term
        self.idf = {}  # Stores inverse document frequency for each term

    def preprocess(self, text):
        """
        Preprocesses the input text (tokenization, lowercasing, removing stopwords).

        Parameters:
        - text (str): The raw text string.

        Returns:
        - tokens (list): A list of preprocessed tokens.
        """
        text = text.lower()
        text = re.sub(r"\W+", " ", text)
        tokens = text.split()
        tokens = [word for word in tokens if word not in self.stopwords]
        return tokens

    def build_vocab_and_idf(self, corpus):
        """
        Builds the vocabulary and computes the IDF values from a corpus.

        Parameters:
        - corpus (list of str): A list of documents (each document is a string).
        """
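        # Reset state so repeated calls do not mix counts from earlier corpora
        self.vocab = set()
        self.doc_freq = defaultdict(int)
        num_docs = len(corpus)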
        for document in corpus:
            tokens = set(
                self.preprocess(document)
            )  # Use set to count unique tokens in a document
            for token in tokens:
                self.vocab.add(token)
                self.doc_freq[token] += 1

        # Compute IDF for each term
        self.idf = {
            term: math.log((1 + num_docs) / (1 + self.doc_freq[term])) + 1
            for term in self.vocab
        }

    def compute_tf(self, document):
        """
        Computes the term frequency (TF) for a single document.

        Parameters:
        - document (str): A single document (string).

        Returns:
        - tf (dict): A dictionary of term frequencies.
        """
        tokens = self.preprocess(document)
        token_counts = Counter(tokens)
        total_tokens = len(tokens)
        tf = {term: count / total_tokens for term, count in token_counts.items()}
        return tf

    def vectorize(self, document):
        """
        Converts a document into its TF-IDF vector representation.

        Parameters:
        - document (str): A single document (string).

        Returns:
        - vector (dict): A dictionary representing the TF-IDF vector for the
        document.
        """
        tf = self.compute_tf(document)
        tfidf_vector = {
            term: tf.get(term, 0) * self.idf.get(term, 0) for term in self.vocab
        }
        return tfidf_vector

    def transform(self, corpus):
        """
        Transforms a corpus into TF-IDF vectors.

        Parameters:
        - corpus (list of str): A list of documents.

        Returns:
        - vectors (list of dict): A list of TF-IDF vectors for each document.
        """
        return [self.vectorize(doc) for doc in corpus]

    def vector_to_dense(self, vector):
        """
        Converts a sparse TF-IDF vector to a dense list of scores.

        Parameters:
        - vector (dict): A sparse dictionary representing the TF-IDF vector.

        Returns:
        - dense_vector (list): A dense list of TF-IDF scores.
        """
        # Missing terms default to 0.0 so truly sparse dictionaries also work
        return [vector.get(term, 0.0) for term in sorted(self.vocab)]


In [3]:
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A quick movement of the enemy will jeopardize six gunboats"
]

# Initialize TF-IDF model with stopwords
stopwords = {"the", "over", "a", "will"}
tfidf_model = TFIDF(stopwords=stopwords)

# Build vocabulary and compute IDF values
tfidf_model.build_vocab_and_idf(corpus)

# Transform the corpus into TF-IDF vectors
vectors = tfidf_model.transform(corpus)

# Display dense TF-IDF vectors
for i, vector in enumerate(vectors):
    dense_vector = tfidf_model.vector_to_dense(vector)
    print(f"Document {i+1} TF-IDF Vector: {dense_vector}")

Document 1 TF-IDF Vector: [0.2821911967599909, 0.21461367874196347, 0.0, 0.2821911967599909, 0.0, 0.0, 0.0, 0.2821911967599909, 0.21461367874196347, 0.0, 0.0, 0.0, 0.21461367874196347, 0.0, 0.0]
Document 2 TF-IDF Vector: [0.0, 0.2575364144903562, 0.0, 0.0, 0.0, 0.0, 0.3386294361119891, 0.0, 0.2575364144903562, 0.0, 0.3386294361119891, 0.0, 0.0, 0.3386294361119891, 0.0]
Document 3 TF-IDF Vector: [0.0, 0.0, 0.24187816865142076, 0.0, 0.24187816865142076, 0.24187816865142076, 0.0, 0.0, 0.0, 0.24187816865142076, 0.0, 0.24187816865142076, 0.18395458177882582, 0.0, 0.24187816865142076]
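
As a sanity check, the first entry of the Document 1 vector can be reproduced by hand. After stopword removal, document 1 has 6 tokens, and "brown" (first in the sorted vocabulary) occurs once and appears in only one of the 3 documents. This is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the class above.

```python
import math

num_docs = 3          # documents in the corpus
doc_freq_brown = 1    # "brown" appears only in document 1
tokens_in_doc1 = 6    # quick, brown, fox, jumps, lazy, dog

tf_brown = 1 / tokens_in_doc1
idf_brown = math.log((1 + num_docs) / (1 + doc_freq_brown)) + 1
tfidf_brown = tf_brown * idf_brown

print(round(tfidf_brown, 4))  # 0.2822
```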


### How TF-IDF Improves on Bag of Words (BoW)
Both Bag of Words (BoW) and TF-IDF represent text numerically, but TF-IDF addresses several limitations of the basic BoW approach:

#### 1. Weighting Term Importance
BoW:
- Counts the frequency of each word in a document.
- Treats all words equally, regardless of their importance or relevance.

Example: Common words like "the", "is", and "and" are treated with the same importance as more meaningful words like "data", "model", or "accuracy".

TF-IDF:
- Weights each word by how often it occurs in the document and how rare it is across the corpus.
  - TF (Term Frequency): Measures how often a word appears in a document.
  - IDF (Inverse Document Frequency): Reduces the weight of common words that appear in many documents (e.g., "the", "is") and increases the weight of rarer, more informative words.

Benefit:
- TF-IDF emphasizes important terms that are specific to a document, reducing the influence of frequently occurring but less informative terms.
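
To make the contrast concrete, here is a minimal sketch on a hypothetical toy corpus (the documents and helper names are made up for illustration). It uses the same smoothed IDF form as the `TFIDF` class above: "the" and "model" have identical raw counts in the first document, but "model" ends up with the higher TF-IDF weight because it appears in fewer documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus; tokens are already lowercased and split.
docs = [
    "the model is the best model".split(),
    "the data is clean".split(),
    "the accuracy is high".split(),
]

num_docs = len(docs)
# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF, same form as in build_vocab_and_idf.
    return math.log((1 + num_docs) / (1 + doc_freq[term])) + 1

bow = Counter(docs[0])  # raw BoW counts for the first document
tfidf = {t: (c / len(docs[0])) * idf(t) for t, c in bow.items()}

print(bow["the"], bow["model"])       # 2 2
print(tfidf["the"] < tfidf["model"])  # True
```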

#### 2. Reducing the Impact of Common Words
BoW:
- Unless they are removed during preprocessing, highly frequent words (stopwords) dominate the representation.
- In many documents, words like "the" or "and" have the highest counts, even though they provide little information about the document's content.

TF-IDF:
- Because of the IDF factor, common words that appear in most documents receive lower weights.
- Rare or unique terms get higher weights, improving the model's ability to distinguish between documents based on specific content.

Example:
- In a BoW vector, "the" might have a high frequency in most documents.
- In TF-IDF, the IDF component reduces the weight of "the", making unique words like "machine" or "learning" more influential.
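
A short sketch of how the IDF factor behaves as document frequency changes. The corpus size and document frequencies below are hypothetical numbers chosen for illustration. Note that with the smoothed formula used in this notebook, a term that appears in every document is not zeroed out; it gets the minimum IDF of exactly 1.0, while rarer terms get progressively larger values.

```python
import math

def smoothed_idf(doc_freq, num_docs):
    """Smoothed IDF, same form as in build_vocab_and_idf."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1

num_docs = 1000  # hypothetical corpus size
# Hypothetical document frequencies for three terms.
for term, df in [("the", 1000), ("learning", 50), ("gunboats", 1)]:
    print(f"{term:10s} df={df:5d} idf={smoothed_idf(df, num_docs):.3f}")
```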

#### 3. Handling Document Length Bias
BoW:
- Longer documents tend to have higher word counts, so their BoW vectors have larger magnitudes (the dimensionality stays the same). This can skew comparisons between documents; dot-product similarity, for example, favors longer documents.

TF-IDF:
- Normalizes the term frequency by the document length (in the implementation above; some variants instead use raw counts and normalize the final vector).
  - TF divides each term count by the total number of terms in the document.
  - This keeps the scale of the vector from growing with document length, allowing fairer comparisons between short and long documents.
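
The effect of length normalization can be seen with a small sketch. The helper `term_frequency` below is illustrative and mirrors what `compute_tf` does in the class above: repeating a document ten times multiplies its raw counts by ten but leaves the relative term frequencies unchanged.

```python
from collections import Counter

def term_frequency(tokens):
    """Relative term frequency: count divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

short_doc = "fox jumps dog".split()
long_doc = short_doc * 10  # same content repeated ten times

# Raw counts grow with length...
print(Counter(short_doc)["fox"], Counter(long_doc)["fox"])  # 1 10
# ...but relative frequencies do not.
print(term_frequency(short_doc)["fox"] == term_frequency(long_doc)["fox"])  # True
```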

#### 4. Improving Document Similarity Calculations
BoW:
- Cosine similarity and other similarity or distance measures on BoW vectors may be dominated by common terms.
- Two documents about unrelated topics may appear similar if they both contain common words like "the" or "is".

TF-IDF:
- Reduces the impact of common terms in similarity calculations.
- Documents are compared based on the weight of informative, discriminative words, leading to more meaningful similarity scores.
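
The following sketch compares cosine similarity under both representations for two unrelated sentences that share only function words. The corpus and helper names are hypothetical, and the IDF uses the same smoothed form as the class above. On such a tiny corpus the effect is modest, but the TF-IDF similarity is still clearly lower than the BoW similarity.

```python
import math
from collections import Counter

# Hypothetical corpus; the first two documents share only common words.
docs = [
    "the cat is on the mat".split(),
    "the stock is on the rise".split(),
    "the dog is in the park".split(),
]
vocab = sorted({t for d in docs for t in d})
num_docs = len(docs)
doc_freq = Counter(t for d in docs for t in set(d))
idf = {t: math.log((1 + num_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def bow_vector(doc):
    counts = Counter(doc)
    return [counts[t] for t in vocab]

def tfidf_vector(doc):
    counts = Counter(doc)
    return [counts[t] / len(doc) * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

bow_sim = cosine(bow_vector(docs[0]), bow_vector(docs[1]))
tfidf_sim = cosine(tfidf_vector(docs[0]), tfidf_vector(docs[1]))
print(f"BoW: {bow_sim:.2f}, TF-IDF: {tfidf_sim:.2f}")  # BoW: 0.75, TF-IDF: 0.54
```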

#### 5. Applications in Search and Ranking
BoW:
- Treats all terms equally, so search systems might return irrelevant results if query terms are common.

TF-IDF:
- Lets a search system rank a document higher when the query terms have high TF-IDF scores in that document.

Example: A search for "machine learning" will prioritize documents where these terms are significant, not just frequent.
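
A minimal ranking sketch under stated assumptions: the corpus is hypothetical, and a document's score is taken to be the sum of the TF-IDF weights of the query terms, which is one simple scoring choice among many. The document that contains both query terms ranks first, ahead of documents that contain only one of them.

```python
import math
from collections import Counter

# Hypothetical corpus of pre-tokenized documents.
docs = [
    "machine learning models learn from data".split(),
    "the washing machine is in the basement".split(),
    "learning to cook is fun".split(),
]
num_docs = len(docs)
doc_freq = Counter(t for d in docs for t in set(d))

def tfidf(term, doc):
    """TF-IDF of a term in a document; 0.0 for terms unseen in the corpus."""
    if term not in doc_freq:
        return 0.0
    tf = doc.count(term) / len(doc)
    idf = math.log((1 + num_docs) / (1 + doc_freq[term])) + 1
    return tf * idf

def rank(query):
    """Return document indices sorted by summed TF-IDF of the query terms."""
    scores = [sum(tfidf(t, d) for t in query) for d in docs]
    return sorted(range(num_docs), key=lambda i: scores[i], reverse=True)

print(rank(["machine", "learning"]))  # [0, 2, 1]
```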

