### TF-IDF (Term Frequency-Inverse Document Frequency)

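TF-IDF is a numerical statistic that reflects how important a word is to a document, relative to a collection of documents (a corpus). A term scores highly when it occurs often in one document but in few documents overall. It is commonly used in information retrieval and text mining.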

### Mathematical Formulation

TF-IDF is the product of two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF).

### Term Frequency (TF):

$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$

### Inverse Document Frequency (IDF):

$
IDF(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
$

### TF-IDF:

$
TFIDF(t, d) = TF(t, d) \times IDF(t)
$
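
As a worked instance of these formulas, consider the term "cat" in a six-word document $d_1$ = "the cat sat on the mat", within a corpus of 3 documents of which 2 contain "cat". The symbol $d_1$ and the choice of the natural logarithm are assumptions made for illustration; the formulas above do not fix the base of the logarithm.

$
TF(\text{cat}, d_1) = \frac{1}{6} \approx 0.16667
$

$
IDF(\text{cat}) = \ln \left( \frac{3}{2} \right) \approx 0.40547
$

$
TFIDF(\text{cat}, d_1) = \frac{1}{6} \times 0.40547 \approx 0.06758
$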

Step 1: Compute Term Frequency (TF)

- Term Frequency (TF) measures how frequently a term occurs in a document.

Step 2: Compute Document Frequency (DF)

- Document Frequency (DF) measures the number of documents in which a term appears.

Step 3: Compute Inverse Document Frequency (IDF)

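- Inverse Document Frequency (IDF) measures how rare, and therefore how informative, a term is across the corpus. It is computed as the logarithm of the ratio of the total number of documents to the document frequency of the term, so a term that appears in every document gets an IDF of 0.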

Step 4: Compute TF-IDF

- TF-IDF is the product of TF and IDF.
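
The four steps can be sketched for a single term as follows. This is a minimal illustration, not a reference implementation: the toy `docs` list, the natural logarithm, and the choice to return an IDF of 0 for a term that appears in no document are all assumptions.

```python
import math

# Toy corpus (an assumption for illustration): each document is a list of words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]
term = "cat"

# Step 1: term frequency of `term` in each document.
tf = [doc.count(term) / len(doc) for doc in docs]

# Step 2: number of documents that contain `term`.
df = sum(1 for doc in docs if term in doc)

# Step 3: plain (unsmoothed) IDF, log(N / df).
# Assumption: a term that appears nowhere gets IDF 0 instead of a division error.
idf = math.log(len(docs) / df) if df > 0 else 0.0

# Step 4: TF-IDF per document, rounded to five decimal places.
tfidf = [round(t * idf, 5) for t in tf]
print(tfidf)  # [0.06758, 0.08109, 0.0]
```

The third document scores 0 because it does not contain "cat", so its TF is 0.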

Write a function compute_tf_idf(corpus, query) that takes the following inputs:

- corpus: A list of documents, where each document is a list of words.
- query: A list of words for which you want to compute the TF-IDF scores.

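The function should return a list of lists with one inner list per document and one TF-IDF score per query word, rounded to five decimal places. Note that the expected output shown for this task is produced with a smoothed IDF, log((N + 1) / (df + 1)) + 1, where N is the number of documents and df the document frequency, rather than the plain log(N / df) from the formulation above.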

In [4]:
import math

# Corpus and query
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

# Step 1: Compute Term Frequency (TF)
def compute_tf(doc, term):
    return doc.count(term) / len(doc)

# Step 2: Compute Document Frequency (DF)
def compute_df(corpus, term):
    df = 0
    for doc in corpus:
        if term in doc:
            df += 1
    return df

# Step 3: Compute Inverse Document Frequency (IDF)
def compute_idf(corpus, term):
    df = compute_df(corpus, term)
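    # Adding 1 to df avoids division by zero, but departs from IDF(t) = log(N / df):
    # a term found in N - 1 documents gets log(1) = 0 (this is why every "cat" score
    # printed by this cell is 0.0), and a term found in all N documents gets a negative IDF.
    return math.log(len(corpus) / (1 + df))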

# Step 4: Compute TF-IDF
def compute_tfidf(corpus, query):
    tfidf_scores = {}
    for term in query:
        tfidf_scores[term] = {}
        for doc_index, doc in enumerate(corpus):
            tf = compute_tf(doc, term)
            idf = compute_idf(corpus, term)
            tfidf_scores[term][doc_index] = tf * idf
    return tfidf_scores

# Compute TF-IDF for the query
tfidf_scores = compute_tfidf(corpus, query)

# Print the results
for term in query:
    print(f"TF-IDF scores for term '{term}':")
    for doc_index, score in tfidf_scores[term].items():
        print(f"Document {doc_index}: {score}")

TF-IDF scores for term 'cat':
Document 0: 0.0
Document 1: 0.0
Document 2: 0.0
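
The all-zero scores above follow from the `1 + df` adjustment: "cat" appears in 2 of 3 documents, so its IDF is log(3 / 3) = 0. The sketch below compares three IDF variants side by side; the helper name `idf_variants` and its dictionary keys are hypothetical, introduced only for this comparison, and it assumes `df >= 1`.

```python
import math

def idf_variants(num_docs, df):
    """Return three IDF variants for a term that occurs in `df` documents (df >= 1)."""
    return {
        # Plain textbook form: log(N / df)
        "plain": math.log(num_docs / df),
        # Form used in the cell above: log(N / (1 + df))
        "plus_one_df": math.log(num_docs / (1 + df)),
        # Smoothed form: log((N + 1) / (df + 1)) + 1
        "smoothed": math.log((num_docs + 1) / (df + 1)) + 1,
    }

# N = 3 documents; "cat" occurs in 2 of them and "the" in all 3.
for name, df in [("cat", 2), ("the", 3)]:
    rounded = {key: round(value, 5) for key, value in idf_variants(3, df).items()}
    print(name, rounded)
```

For "cat" the three variants give roughly 0.40547, 0.0, and 1.28768; for "the" they give 0.0, about -0.28768, and 1.0. Only the smoothed form stays strictly positive, which is why it never zeroes out a term that actually occurs in a document.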


In [5]:
import numpy as np

def compute_tf_idf(corpus, query):
    """
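    Compute TF-IDF scores for a query against a corpus of documents using only NumPy.
    IDF is smoothed as log((N + 1) / (df + 1)) + 1, so it is always positive.
    Returns one list per document with one score per query word,
    rounded to five decimal places.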
    """
    vocab = sorted(set(word for document in corpus for word in document).union(query))
    word_to_index = {word: idx for idx, word in enumerate(vocab)}

    tf = np.zeros((len(corpus), len(vocab)))

    for doc_idx, document in enumerate(corpus):
        for word in document:
            word_idx = word_to_index[word]
            tf[doc_idx, word_idx] += 1
        tf[doc_idx, :] /= len(document)

    df = np.count_nonzero(tf > 0, axis=0)

    num_docs = len(corpus)
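    # Smoothed IDF: acts as if one extra document contained every term, then adds 1
    # so that terms present in all documents still get a non-zero weight.
    idf = np.log((num_docs + 1) / (df + 1)) + 1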

    tf_idf = tf * idf

    query_indices = [word_to_index[word] for word in query]
    tf_idf_scores = tf_idf[:, query_indices]

    tf_idf_scores = np.round(tf_idf_scores, 5)

    return tf_idf_scores.tolist()

In [6]:
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

print(compute_tf_idf(corpus, query))

# Expected Output:
# [[0.21461], [0.25754], [0.0]]

[[0.21461], [0.25754], [0.0]]
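
As a dependency-free cross-check, the same smoothed scores can be reproduced in plain Python, this time with a multi-word query that includes a word absent from the corpus. The function name `smoothed_tf_idf` is hypothetical, and the guard for empty documents is an assumption that the NumPy version above does not make.

```python
import math

def smoothed_tf_idf(corpus, query):
    """Pure-Python TF-IDF using the smoothed IDF log((N + 1) / (df + 1)) + 1."""
    num_docs = len(corpus)
    scores = []
    for doc in corpus:
        row = []
        for term in query:
            # Assumption: an empty document contributes a TF of 0.
            tf = doc.count(term) / len(doc) if doc else 0.0
            df = sum(1 for other in corpus if term in other)
            idf = math.log((num_docs + 1) / (df + 1)) + 1
            row.append(round(tf * idf, 5))
        scores.append(row)
    return scores

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]

# One row per document, one column per query word ("zebra" is not in the corpus).
print(smoothed_tf_idf(corpus, ["cat", "mat", "zebra"]))
# [[0.21461, 0.21461, 0.0], [0.25754, 0.0, 0.0], [0.0, 0.21461, 0.0]]
```

The first column matches the single-word result above, and the unseen word scores 0 in every document because its TF is always 0.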
