Task: Implement TF-IDF (Term Frequency-Inverse Document Frequency)
Your task is to implement a function that computes the TF-IDF scores for a query against a given corpus of documents.

Function Signature
Write a function compute_tf_idf(corpus, query) that takes the following inputs:

corpus: A list of documents, where each document is a list of words.
query: A list of words for which you want to compute the TF-IDF scores.
Output
The function should return a list of lists containing the TF-IDF scores for the query words in each document, rounded to five decimal places.

Important Considerations
Handling Division by Zero:
When implementing the Inverse Document Frequency (IDF) calculation, you must account for cases where a term does not appear in any document (df = 0). This can lead to division by zero in the standard IDF formula. Add smoothing (e.g., adding 1 to both numerator and denominator) to avoid such errors.
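
As a minimal sketch of the smoothing described above, the helper below applies add-one smoothing to both numerator and denominator. The function name `smoothed_idf` and its parameter names are illustrative assumptions, not part of the task's required interface.

```python
import math

def smoothed_idf(num_docs, doc_freq):
    """IDF with add-one smoothing: log((N + 1) / (df + 1)) + 1 (natural log)."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1

# A term absent from every document (df = 0) still gets a finite weight,
# because the denominator is df + 1 = 1 rather than 0.
absent = smoothed_idf(3, 0)   # log(4 / 1) + 1

# A term present in every document gets the minimum weight of 1.0.
common = smoothed_idf(3, 3)   # log(4 / 4) + 1
```

Without the smoothing, the `df = 0` case would divide by zero; with it, rarer terms simply receive larger weights.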

Empty Corpus:
Ensure your implementation gracefully handles the case of an empty corpus. If no documents are provided, your function should either raise an appropriate error or return an empty result. This will ensure the program remains robust and predictable.
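
Both options for the empty-corpus case can be sketched with a small guard. The helper `corpus_is_usable` and its `strict` flag are hypothetical names introduced only for this illustration; a real implementation might simply inline one of the two branches.

```python
def corpus_is_usable(corpus, strict=False):
    """Return True if the corpus has at least one document.

    With strict=True an empty corpus raises ValueError; otherwise the
    caller is expected to return an empty result when this returns False.
    """
    if corpus:
        return True
    if strict:
        raise ValueError("corpus must contain at least one document")
    return False

# Lenient mode: the caller can return [] immediately.
lenient_result = corpus_is_usable([])

# Strict mode: the same input raises instead.
try:
    corpus_is_usable([], strict=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Whichever behavior is chosen, it helps to document it so that callers know whether to expect an exception or an empty list.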

Edge Cases:

Query terms not present in the corpus.
Documents with no words.
Extremely large or small values for term frequencies or document frequencies.
By addressing these considerations, your implementation will be robust and handle real-world scenarios effectively.
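
The first two edge cases can be exercised with a compact sketch. It assumes TF is the term count divided by the document length and uses the smoothed IDF from this task; the name `tf_idf_sketch` and the sample corpus are made up for illustration and are not the reference solution.

```python
import math
from collections import Counter

def tf_idf_sketch(corpus, query):
    """Compact TF-IDF scorer used only to illustrate edge-case behavior."""
    if not corpus:
        return []
    n = len(corpus)
    idf = {}
    for term in set(query):
        df = sum(1 for doc in corpus if term in doc)
        idf[term] = math.log((n + 1) / (df + 1)) + 1
    rows = []
    for doc in corpus:
        counts = Counter(doc)
        if doc:
            rows.append([round(counts[t] / len(doc) * idf[t], 5) for t in query])
        else:
            # An empty document contains no terms, so every score is 0.0
            # and the division by len(doc) is never attempted.
            rows.append([0.0 for _ in query])
    return rows

edge_corpus = [["the", "cat"], [], ["the", "dog"]]

# Query term that appears in no document: all scores are 0.0.
absent_scores = tf_idf_sketch(edge_corpus, ["bird"])

# Corpus containing an empty document: its row is 0.0, the others are scored.
empty_doc_scores = tf_idf_sketch(edge_corpus, ["the"])
```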

Example:
Input:
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

print(compute_tf_idf(corpus, query))
Output:
[[0.21461], [0.25754], [0.0]]
Reasoning:
Each score is the term frequency of "cat" in the document (its count divided by the document length) multiplied by the IDF of "cat", rounded to five decimal places.

	•	Document Frequency (DF) counts how many documents contain a given term.
	•	Example: for query = ["cat"], the term "cat" appears in two documents, so df["cat"] = 2.

	•	Inverse Document Frequency (IDF) reduces the weight of words that appear in many documents.
	•	The formula used (with the natural logarithm) is:

$$IDF(t) = \log\left(\frac{N+1}{DF(t) + 1}\right) + 1$$

	•	N = total number of documents
	•	DF(t) = document frequency of term t
	•	Smoothing: adding 1 to the numerator and the denominator prevents division by zero when a term is absent from every document.
	•	Example: if "cat" appears in 2 out of 3 documents,

$$IDF(\text{cat}) = \log\left(\frac{3+1}{2+1}\right) + 1 = \log\left(\frac{4}{3}\right) + 1 \approx 1.2877$$
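
To connect this IDF value to the example output, the arithmetic can be worked through as follows. This assumes TF is the term's count divided by the document length, and the labels d_1, d_2, d_3 for the three documents are introduced here only for illustration:

$$
\begin{aligned}
TF\text{-}IDF(\text{cat}, d_1) &= \frac{1}{6} \times 1.28768 \approx 0.21461 \\
TF\text{-}IDF(\text{cat}, d_2) &= \frac{1}{5} \times 1.28768 \approx 0.25754 \\
TF\text{-}IDF(\text{cat}, d_3) &= \frac{0}{6} \times 1.28768 = 0
\end{aligned}
$$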


✨ Summary
	1.	TF measures how frequently a term appears in a document.
	2.	IDF reduces the weight of common words across multiple documents.
	3.	TF-IDF multiplies the two, so a term scores highest when it is frequent in a document but rare across the corpus.
	4.	Implementation:
	•	Count document frequency (DF)
	•	Compute IDF with smoothing
	•	Calculate TF-IDF for each document


TF is a function of both t and d, while IDF is a function of t alone: tf(t, d); idf(t)
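
The difference in arguments between tf(t, d) and idf(t) can be made concrete with two small functions. This is a sketch only: the names `tf` and `idf` mirror the notation above, and passing the corpus to `idf` as a second parameter is an assumption made so the function is self-contained.

```python
import math

def tf(term, doc):
    """Term frequency: depends on the term and on one specific document."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(term, corpus):
    """Smoothed IDF: depends on the term only, for a fixed corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (df + 1)) + 1

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]

# idf is computed once per term and reused for every document,
# while tf is recomputed for each (term, document) pair.
idf_cat = idf("cat", docs)
scores = [round(tf("cat", d) * idf_cat, 5) for d in docs]
```

Because `idf` does not vary with the document, it is worth computing it outside the per-document loop.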

In [3]:
import math
from collections import Counter

def compute_tf_idf(corpus, query):
    if not corpus:
        return []

    # Compute document frequency for each term in query
    doc_count = len(corpus)
    df = {}
    # Iterate over each term in the query
    for term in query:
        count = 0  # Initialize count of documents containing the term

        # Iterate over each document in the corpus
        for doc in corpus:
            if term in doc:  # Check if the term is in the document
                count += 1  # Increment count if the term is found

        df[term] = count  # Store the count in the df dictionary

    print(df)
    # Compute IDF with smoothing
    idf = {term: math.log((doc_count + 1) / (df[term] + 1)) + 1 for term in query}
    print(idf)

    tf_idf_scores = []

    for doc in corpus:
        term_counts = Counter(doc)
        doc_length = len(doc)
        scores = []  # Initialize an empty list to store TF-IDF scores

        # Iterate over each term in the query
        for term in query:
            term_count = term_counts[term]  # Look up occurrences in the precomputed Counter
            if term_count > 0:  # If term appears in the document
                tf = term_count / doc_length  # Compute Term Frequency (TF)
                tf_idf = tf * idf[term]  # Multiply by IDF
                scores.append(round(tf_idf, 5))  # Round to 5 decimal places and store
            else:
                scores.append(0.0)  # If term is not in document, append 0.0

        tf_idf_scores.append(scores)

    return tf_idf_scores

# Example usage
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

print(compute_tf_idf(corpus, query))  # Output: [[0.21461], [0.25754], [0.0]]


{'cat': 2}
{'cat': 1.2876820724517808}
[[0.21461], [0.25754], [0.0]]


In [None]:
import numpy as np

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents using only NumPy.
    The output TF-IDF scores retain five decimal places.
    """
    vocab = sorted(set(word for document in corpus for word in document).union(query))
    word_to_index = {word: idx for idx, word in enumerate(vocab)}

    tf = np.zeros((len(corpus), len(vocab)))

    for doc_idx, document in enumerate(corpus):
        for word in document:
            word_idx = word_to_index[word]
            tf[doc_idx, word_idx] += 1
        if document:  # Skip empty documents to avoid dividing by zero (NaN scores)
            tf[doc_idx, :] /= len(document)

    df = np.count_nonzero(tf > 0, axis=0)

    num_docs = len(corpus)
    idf = np.log((num_docs + 1) / (df + 1)) + 1

    tf_idf = tf * idf

    query_indices = [word_to_index[word] for word in query]
    tf_idf_scores = tf_idf[:, query_indices]

    tf_idf_scores = np.round(tf_idf_scores, 5)

    return tf_idf_scores.tolist()
