<img src="../images/marman-1.jpg" width="500">
<center>Chocolate-cheese marman, so tasty.. 🤤</center>

# TF-IDF (Term Frequency-Inverse Document Frequency)

- A statistical method for measuring how important a token is to a particular document within a collection of documents (a corpus).

## Term Frequency

- The number of times a given term (token) occurs in a document, divided by the total number of tokens in that document.

    $tf(t,\ d)\ =\ \frac{f_{t,\ d}}{\sum_{t'\ \in\ d}{f_{t',\ d}}}$, where

- $f_{t,\ d}\ =$ the number of occurrences of token $t$ in document $d$.

- $\sum_{t'\ \in\ d}{f_{t',\ d}}\ =$ the total number of tokens in document $d$.
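As a small illustration of the formula above, the following sketch computes the term frequency of every token in a single document. The function name `term_frequency` and the pre-tokenized sample document are assumptions made for this example only.

```python
from collections import Counter


def term_frequency(tokens):
    """Return tf(t, d) for every distinct token t in one tokenized document d."""
    counts = Counter(tokens)        # f_{t,d}: occurrences of each token
    total = sum(counts.values())    # sum of f_{t',d}: all tokens in the document
    return {token: count / total for token, count in counts.items()}


# Hypothetical pre-tokenized document (lowercased, punctuation removed)
doc_tokens = ["saya", "suka", "makan", "gado", "gado"]
tf = term_frequency(doc_tokens)

print(tf["gado"])  # 2 of the 5 tokens
print(tf["saya"])  # 1 of the 5 tokens
```

Because every count is divided by the same total, the term frequencies of a document always sum to 1.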

## Inverse Document Frequency

- Without smoothing (scikit-learn's `smooth_idf=False`): $idf(t) = \log \frac{n}{df(t)} + 1$
- With smoothing (scikit-learn's `smooth_idf=True`, the default): $idf(t) = \log \frac{1\ +\ n}{1\ +\ df(t)} + 1$
- Where:
    - $n\ =$ the total number of documents in the corpus.
    
    - $df(t)\ =$ the number of documents in the corpus that contain the term $t$.
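The smoothed variant can be sketched in a few lines of Python. This assumes the natural logarithm (as scikit-learn uses); the function name `idf_smooth` and the sample document frequencies are illustrative only.

```python
import math


def idf_smooth(n_docs, doc_freq):
    """Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n = 5  # assumed corpus size for this example

rare = idf_smooth(n, 1)        # term found in only 1 of 5 documents
common = idf_smooth(n, 4)      # term found in 4 of 5 documents
everywhere = idf_smooth(n, 5)  # term found in every document

print(rare, common, everywhere)
```

The rarer a term is across the corpus, the larger its weight. A term that appears in every document still gets a weight of 1 rather than 0, thanks to the trailing `+ 1`.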

## TF-IDF


- The product of the term frequency and the inverse document frequency.

- $tf\text{-}idf(t,\ d)\ =\ tf(t, d)\ \times \ idf(t)$, where:
    - $tf(t, d)\ =$ term frequency.
    
    - $idf(t)\ =$ inverse document frequency.
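Putting the two pieces together, a minimal sketch of the product might look like this. It assumes relative term frequency, the smoothed idf with the natural logarithm, and a tiny hand-tokenized corpus invented for the example. Libraries may apply further steps (for instance vector normalization), so their numbers can differ.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized mini corpus
docs = [
    ["saya", "suka", "makan", "bakso"],
    ["saya", "suka", "makan", "gado", "gado"],
    ["tapi", "kamu", "gak", "suka", "sama", "aku"],
]

n = len(docs)
# df(t): number of documents that contain token t at least once
df = Counter(token for doc in docs for token in set(doc))


def tf_idf(doc):
    """Return tf(t, d) * idf(t) for every distinct token in one document."""
    counts = Counter(doc)
    total = len(doc)
    return {
        token: (count / total) * (math.log((1 + n) / (1 + df[token])) + 1)
        for token, count in counts.items()
    }


scores = tf_idf(docs[0])
print(scores)
```

In the first document, "bakso" scores higher than "suka": both occur once, but "suka" appears in every document and is therefore less distinctive.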

# Dataset

In [1]:
corpus = [
    "Saya suka makan bakso.",
    "Saya suka makan gado-gado.",
    "Saya suka makan pisang ijo.",
    "Saya sukanya sama kamu.",
    "Tapi, kamu gak suka sama aku."
]

corpus

['Saya suka makan bakso.',
 'Saya suka makan gado-gado.',
 'Saya suka makan pisang ijo.',
 'Saya sukanya sama kamu.',
 'Tapi, kamu gak suka sama aku.']

# TF-IDF Calculation

In [2]:
corpus

['Saya suka makan bakso.',
 'Saya suka makan gado-gado.',
 'Saya suka makan pisang ijo.',
 'Saya sukanya sama kamu.',
 'Tapi, kamu gak suka sama aku.']

## Index and Document

In [8]:
import pandas as pd

document = pd.DataFrame(
    data=[[idx, doc] for idx, doc in enumerate(corpus)],
    columns=["idx_doc", "doc"]
)

document

|   | idx_doc | doc |
|---|---|---|
| 0 | 0 | Saya suka makan bakso. |
| 1 | 1 | Saya suka makan gado-gado. |
| 2 | 2 | Saya suka makan pisang ijo. |
| 3 | 3 | Saya sukanya sama kamu. |
| 4 | 4 | Tapi, kamu gak suka sama aku. |


## Index and Token

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform(corpus)

tokens = pd.DataFrame(
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    data=[[idx, token] for idx, token in enumerate(vectorizer.get_feature_names_out())],
    columns=["idx_token", "token"]
)

tokens

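|   | idx_token | token |
|---|---|---|
| 0 | 0 | aku |
| 1 | 1 | bakso |
| 2 | 2 | gado |
| 3 | 3 | gak |
| 4 | 4 | ijo |
| 5 | 5 | kamu |
| 6 | 6 | makan |
| 7 | 7 | pisang |
| 8 | 8 | sama |
| 9 | 9 | saya |

(Only the first 10 rows of the output are shown.)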


## TF-IDF Table

In [5]:
# Read the (row, column, value) triplets directly from the sparse matrix
# instead of parsing its string representation, which varies across SciPy versions.
coo = response.tocoo()

pd.DataFrame(
    data={"idx_doc": coo.row, "idx_token": coo.col, "tf_idf": coo.data}
)

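|   | idx_doc | idx_token | tf_idf |
|---|---|---|---|
| 0 | 0 | 1 | 0.6928236190934441 |
| 1 | 0 | 6 | 0.4639920522561333 |
| 2 | 0 | 10 | 0.3903247418941077 |
| 3 | 0 | 9 | 0.3903247418941077 |
| 4 | 1 | 2 | 0.8870672547201419 |
| 5 | 1 | 6 | 0.2970396394289926 |
| 6 | 1 | 10 | 0.2498791089818884 |
| 7 | 1 | 9 | 0.2498791089818884 |
| 8 | 2 | 4 | 0.5694966280887995 |
| 9 | 2 | 7 | 0.5694966280887995 |

(Only the first 10 rows of the output are shown.)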


In [6]:
tf_idf_df = pd.DataFrame(
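    data=response.toarray().T,  # dense ndarray, transposed so that tokens become rows
    index=vectorizer.get_feature_names_out(),  # replaces get_feature_names(), removed in scikit-learn 1.2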
    columns=[f"D{i+1}" for i in range(len(corpus))]
)

tf_idf_df

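|   | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| aku | 0.0 | 0.0 | 0.0 | 0.0 | 0.465281 |
| bakso | 0.692824 | 0.0 | 0.0 | 0.0 | 0.0 |
| gado | 0.0 | 0.887067 | 0.0 | 0.0 | 0.0 |
| gak | 0.0 | 0.0 | 0.0 | 0.0 | 0.465281 |
| ijo | 0.0 | 0.0 | 0.569497 | 0.0 | 0.0 |
| kamu | 0.0 | 0.0 | 0.0 | 0.498512 | 0.375386 |
| makan | 0.463992 | 0.29704 | 0.381399 | 0.0 | 0.0 |
| pisang | 0.0 | 0.0 | 0.569497 | 0.0 | 0.0 |
| sama | 0.0 | 0.0 | 0.0 | 0.498512 | 0.375386 |
| saya | 0.390325 | 0.249879 | 0.320844 | 0.34811 | 0.0 |

(Only the first 10 rows of the output are shown.)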
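To see where these numbers come from, the sketch below recomputes a few of them by hand using only the standard library. It assumes the documented defaults of scikit-learn's `TfidfVectorizer`: raw term counts as tf (not divided by document length), smoothed idf with the natural logarithm, and L2 normalization of each document vector. The regular expression only approximates the default tokenizer, and the helper name `tfidf_l2` is made up for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "Saya suka makan bakso.",
    "Saya suka makan gado-gado.",
    "Saya suka makan pisang ijo.",
    "Saya sukanya sama kamu.",
    "Tapi, kamu gak suka sama aku.",
]

# Approximation of the default tokenizer: lowercase, words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

n = len(tokenized)
# df(t): number of documents that contain token t at least once
df = Counter(token for doc in tokenized for token in set(doc))


def tfidf_l2(tokens):
    """Raw count times smoothed idf, then scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {
        t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


d1 = tfidf_l2(tokenized[0])
d2 = tfidf_l2(tokenized[1])

print(round(d1["bakso"], 6))  # 0.692824, the D1 value for "bakso"
print(round(d2["gado"], 6))   # 0.887067, the D2 value for "gado"
```

Because of the L2 normalization, the values in each column depend on all tokens of that document, which is why "saya" receives a different score in D1 than in D2 even though it occurs once in both.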
