### **Tutorial 11: TF-IDF**

This tutorial explains the concept of **Term Frequency-Inverse Document Frequency (TF-IDF)**, a statistical measure used to evaluate the importance of a word in a document relative to a collection or corpus of documents. It is widely used in information retrieval and text mining tasks, such as document classification and clustering, and as a feature for natural language processing (NLP) models.

**TF-IDF** is a product of two components:

1. **Term Frequency (TF)**: This measures how frequently a term appears in a document. It can be calculated as:

   $$
   \text{TF}(t, d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
   $$

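2. **Inverse Document Frequency (IDF)**: This measures how informative the term is across the entire corpus. It gives higher weight to terms that appear in fewer documents. With |D| the total number of documents in the corpus and the denominator the number of documents that contain the term t, it can be calculated as: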

   $$
   \text{IDF}(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|}
   $$



The **TF-IDF** score is computed by multiplying the **TF** and **IDF** values:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$
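
As a quick worked example with illustrative numbers (not taken from the corpus used later), suppose a term occurs 3 times in a document of 100 terms and appears in 10 of the 1,000 documents in a corpus. Assuming the natural logarithm (the formula above leaves the base open), the score would be:

$$
\text{TF} = \frac{3}{100} = 0.03, \qquad \text{IDF} = \ln \frac{1000}{10} = \ln 100 \approx 4.605, \qquad \text{TF-IDF} \approx 0.03 \times 4.605 \approx 0.138
$$

A term that appears in every document gets an IDF of ln(1) = 0, so its TF-IDF is 0 no matter how often it occurs.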

---
In this tutorial, we will build a TF-IDF model from scratch using Python. We will:
- Create a custom implementation of TF-IDF calculation.
- Implement the vectorization process for a small corpus of text.
- Use this model to calculate the TF-IDF values for terms in each document.
- Verify the results using scikit-learn's `TfidfVectorizer`.

### **Steps:**
 
1. **Calculate Term Frequency (TF)**:
   - For each document, count how often each term occurs and divide by the total number of terms in that document.

2. **Calculate Inverse Document Frequency (IDF)**:
   - For each term, count how many documents it appears in, then take the logarithm of the total number of documents divided by that count.

3. **Compute TF-IDF**:
   - Multiply TF and IDF values for each term in each document.
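
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the `Tfidf_Vectorizer` class imported from `utils` below: the function names `compute_tf`, `compute_idf`, and `compute_tfidf` and the toy corpus are hypothetical, and the natural logarithm is an assumption.

```python
import math
from collections import Counter


def compute_tf(doc):
    """Step 1: relative frequency of each term in one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    return {term: count / total for term, count in counts.items()}


def compute_idf(corpus):
    """Step 2: log of (number of documents / documents containing the term)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def compute_tfidf(corpus):
    """Step 3: multiply TF by IDF for every term in every document."""
    idf = compute_idf(corpus)
    return [
        {term: tf * idf[term] for term, tf in compute_tf(doc).items()}
        for doc in corpus
    ]


# Hypothetical toy corpus (already tokenized), used only for illustration
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "fast"],
]
scores = compute_tfidf(toy_corpus)
print(scores[0])
```

Each document becomes a dictionary that maps its own terms to their scores. Because "the" occurs in all three toy documents, its IDF, and therefore its TF-IDF, is 0.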



In [1]:
from utils import tfidf_vectorizer as tv
import pandas as pd

tfidf = tv.Tfidf_Vectorizer()

In [None]:
# Note: this implementation expects a pre-tokenized corpus (a list of token lists),
# whereas scikit-learn's TfidfVectorizer expects raw strings by default. Which one is better?
corpus = [
    ["yangon", "is", "a", "city", "with", "a", "rich", "history"],
    ["bagan", "is", "famous", "for", "its", "ancient", "temples", "and", "pagodas"],
    ["the", "irrawaddy", "river", "is", "the", "lifeline", "of", "myanmar"],
    ["traditional", "foods", "like", "mohinga", "are", "popular", "in", "myanmar"],
    ["the", "shwedagon", "pagoda", "is", "a", "sacred", "site", "in", "yangon"],
    ["bamboo", "houses", "are", "common", "in", "rural", "areas", "of", "myanmar"],
    ["mandalay", "is", "known", "for", "its", "cultural", "heritage"],
    ["inle", "lake", "is", "famous", "for", "floating", "gardens", "and", "leg", "rowers"],
    ["kuthodaw", "pagoda", "is", "home", "to", "the", "world's", "largest", "book"],
    ["the", "thanaka", "paste", "is", "used", "as", "a", "traditional", "cosmetic", "in", "myanmar"]
]


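# Fit on the tokenized corpus and compute the TF-IDF weights for every document
tfidf_matrix = tfidf.fit_transform(corpus)
df_tfidf = pd.DataFrame(tfidf_matrix)
# Replace any missing entries (NaN) with 0
df_tfidf = df_tfidf.fillna(0)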
df_tfidf
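
The `fillna(0)` call above suggests that a term missing from a document has no entry in the raw result. As a hedged sketch of the same idea without pandas, and assuming the scores arrive as one dictionary per document (the actual return type of `Tfidf_Vectorizer.fit_transform` is not shown here), a dense matrix could be built like this. The helper name `to_dense_rows` and the score values are placeholders, not computed results.

```python
def to_dense_rows(doc_scores):
    """Align per-document score dicts to one shared, sorted vocabulary."""
    vocabulary = sorted({term for scores in doc_scores for term in scores})
    # A term absent from a document gets weight 0.0 (the role fillna(0) plays above)
    rows = [[scores.get(term, 0.0) for term in vocabulary] for scores in doc_scores]
    return vocabulary, rows


# Placeholder scores for illustration only (not real TF-IDF values)
doc_scores = [
    {"yangon": 0.2, "city": 0.3},
    {"bagan": 0.25, "temples": 0.25},
]
vocabulary, rows = to_dense_rows(doc_scores)
print(vocabulary)
print(rows)
```

Every row then has the same length as the vocabulary, which is the shape needed to compare it against a matrix from another vectorizer.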

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "yangon is a city with a rich history",
    "bagan is famous for its ancient temples and pagodas",
    "the irrawaddy river is the lifeline of myanmar",
    "traditional foods like mohinga are popular in myanmar",
    "the shwedagon pagoda is a sacred site in yangon",
    "bamboo houses are common in rural areas of myanmar",
    "mandalay is known for its cultural heritage",
    "inle lake is famous for floating gardens and leg rowers",
    "kuthodaw pagoda is home to the world's largest book",
    "the thanaka paste is used as a traditional cosmetic in myanmar"
]

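# Note: with default settings, TfidfVectorizer uses raw counts for TF, a smoothed IDF
# (ln((1 + n) / (1 + df)) + 1), L2 row normalization, and a token pattern that drops
# single-character tokens such as "a", so its values differ from the plain formulas above.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
dense = X.toarray()  # dense NumPy array (todense() would return the legacy np.matrix type)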

df = pd.DataFrame(dense, columns=vectorizer.get_feature_names_out())
df
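
When verifying, the scikit-learn numbers will generally not equal the plain TF × IDF values, because of the library's documented defaults: raw term counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The sketch below mimics that weighting in pure Python for a pre-tokenized corpus. The function name `sklearn_style_tfidf` and the toy corpus are hypothetical. It does not reproduce scikit-learn's tokenization, which by default lowercases the text and drops single-character tokens.

```python
import math
from collections import Counter


def sklearn_style_tfidf(corpus):
    """Raw-count TF x smoothed IDF, then L2-normalize each document row."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # document frequency of each term
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in corpus:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Divide by the Euclidean norm; leave empty documents unchanged
        rows.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return rows


# Hypothetical toy corpus (already tokenized), used only for illustration
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "fast"],
]
rows = sklearn_style_tfidf(toy_corpus)
print(rows[0])
```

With this weighting, a term that occurs in every document still receives a nonzero score, since the smoothed IDF never drops below 1. To compare the from-scratch results with scikit-learn entry by entry, one option is to apply the same smoothing and normalization to both sides.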
