# Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** (**Term Frequency-Inverse Document Frequency**) is a widely used weighting scheme in natural language processing for turning text into numerical features. It scores how important a word is to one document relative to a collection of documents (a corpus).

- **Term Frequency (TF):** Measures how frequently a word appears in a document. The more times a word appears, the higher its TF value.
- **Inverse Document Frequency (IDF):** Measures how unique or rare a word is across all documents. Words that appear in many documents have a lower IDF, while rare words have a higher IDF.

The TF-IDF score is calculated as:

    TF-IDF(word, document) = TF(word, document) * IDF(word)
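
To make the formula concrete, here is a minimal sketch in plain Python. It assumes one common textbook formulation: raw counts for TF and `log(N / df)` for IDF, where `N` is the number of documents and `df` is the number of documents containing the word. Libraries often use smoothed or normalized variants, so their numbers will not necessarily match. The toy corpus and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are illustrative, not part of any library.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split into words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "house"],
]


def term_frequency(term, doc):
    """Raw count of `term` in `doc` (one of several common TF definitions)."""
    return Counter(doc)[term]


def inverse_document_frequency(term, corpus):
    """log(N / df); returns 0.0 for a term that appears in no document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, doc, corpus):
    """TF-IDF(word, document) = TF(word, document) * IDF(word)."""
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)


for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, docs[0], docs), 3))
# the 0.0
# cat 0.405
# mat 1.099
```

Under this definition, "the" appears in all three documents, so its IDF is `log(3 / 3) = 0` and its score vanishes even though it occurs twice. "mat" appears in only one document and gets the highest score.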

**Why use TF-IDF?**
- It helps reduce the impact of common words (like "the", "is", "and") that appear in many documents and are less informative.
- It highlights words that are unique and important to a specific document.
- TF-IDF is widely used for text mining, information retrieval, and as input features for machine learning models.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data: each string is treated as one document
text_data = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The bird flew over the house.",
    "The fish swam in the pond.",
    "The rabbit hopped through the garden."
]

# Initialize the TF-IDF vectorizer with its default settings
# (lowercases text, ignores single-character tokens, L2-normalizes each row)
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the document-term matrix
tfidf_matrix = vectorizer.fit_transform(text_data)

# Convert the sparse TF-IDF matrix to a DataFrame for easier inspection
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
)

# Display the TF-IDF DataFrame
print(tfidf_df)

# Each row corresponds to a sentence and each column to a word in the vocabulary.
# A value is the word's TF-IDF weight in that sentence; 0 means the word is absent.
# Note that "the" occurs in every sentence, yet its weight is not zero: scikit-learn's
# default smoothed IDF has a minimum of 1, so such words are down-weighted, not removed.
# Passing stop_words="english" to TfidfVectorizer drops them entirely.
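
The scores printed above are not the plain TF × IDF product. As documented for scikit-learn's `TfidfVectorizer`, the defaults use a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scale each row to unit Euclidean length. The following sketch reimplements that weighting in plain Python for the first sentence, as a way to see where the numbers come from. The helper names (`tokenize`, `smoothed_idf`, `tfidf_row`) are hypothetical, and the regular expression is an approximation of the vectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

text_data = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The bird flew over the house.",
    "The fish swam in the pond.",
    "The rabbit hopped through the garden.",
]


def tokenize(text):
    """Assumption: lowercase, keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(text) for text in text_data]
n_docs = len(tokenized)

# Number of documents in which each term appears at least once.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, never below 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(doc):
    """Raw count * smoothed IDF per term, then L2-normalize the row."""
    counts = Counter(doc)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


row0 = tfidf_row(tokenized[0])
for term, score in sorted(row0.items(), key=lambda item: -item[1]):
    print(f"{term:>4} {score:.4f}")
```

In the first sentence, the words found in only one document ("sat", "on", "mat") score highest. "the" still scores above "cat": its IDF is the minimum of 1, but it occurs twice, whereas "cat" occurs once and is shared with a second sentence.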