# Understanding TF-IDF

---

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer

### What we did and why

- Created a list of text documents (**corpus**).
- Initialized a **TfidfVectorizer**, which converts text into numerical vectors.
- Used **fit_transform()** to:
  - extract all unique words,
  - compute their TF-IDF weights,
  - produce a document × word matrix.
- Used **get_feature_names_out()** to retrieve the unique words (features) in the same order as the matrix columns.

We use TF-IDF to represent text numerically so it can be processed by machine learning models or used for similarity analysis.


In [19]:
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document?',
    'This is the next test document',
    'This is the next test people',
]
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'next', 'one', 'people',
       'second', 'test', 'the', 'third', 'this'], dtype=object)
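
To make the weights less of a black box, here is a minimal sketch that recomputes them by hand using only the standard library. It assumes scikit-learn's default settings for `TfidfVectorizer` (lowercasing, tokens of two or more word characters, raw term counts, smoothed IDF, L2 normalization). The helper names `tokenize`, `idf`, and `tfidf_row` are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document?',
    'This is the next test document',
    'This is the next test people',
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror the vectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: how many documents contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    # Raw term count times IDF, then scaled to unit L2 norm.
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

first_doc = tfidf_row(docs[0])
for term in sorted(first_doc):
    print(f"{term:>10}  {first_doc[term]:.8f}")
```

Under these assumptions, the printed weights should agree with the first row of the TF-IDF matrix: words that occur in every document (`is`, `the`, `this`) get the lowest weight, while `first`, which occurs in only two documents, gets the highest.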

### What `X.toarray()` does

- `X` is a sparse TF-IDF matrix.
- Calling **`X.toarray()`** converts this sparse matrix into a full **NumPy array**.
- The result is a standard 2D matrix where:
  - rows = documents,
  - columns = features (words),
  - values = TF-IDF scores.

We use `toarray()` when we want to view or inspect the full matrix in a readable numeric form.


In [20]:
X.toarray()

array([[0.        , 0.46675428, 0.64515682, 0.34924353, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.34924353,
        0.        , 0.34924353],
       [0.        , 0.68515479, 0.        , 0.2563296 , 0.        ,
        0.        , 0.        , 0.57744984, 0.        , 0.2563296 ,
        0.        , 0.2563296 ],
       [0.52769604, 0.        , 0.        , 0.23424393, 0.        ,
        0.52769604, 0.        , 0.        , 0.        , 0.23424393,
        0.52769604, 0.23424393],
       [0.        , 0.46675428, 0.64515682, 0.34924353, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.34924353,
        0.        , 0.34924353],
       [0.        , 0.39221285, 0.        , 0.29346876, 0.54212422,
        0.        , 0.        , 0.        , 0.54212422, 0.29346876,
        0.        , 0.29346876],
       [0.        , 0.        , 0.        , 0.25906423, 0.4785688 ,
        0.        , 0.5836103 , 0.        , 0.4785688 , 0.25906423,
        0.        , 0.25906423]])

### What this step does

- Converts the TF-IDF NumPy array into a **pandas DataFrame**.
- Sets the column names to the actual words returned by `get_feature_names_out()`.
- Makes the TF-IDF matrix easier to read, inspect, and analyze.

Each row = document  
Each column = word (feature)  
Each cell = TF-IDF weight


In [21]:
df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out()
)

df

|   | and | document | first | is | next | one | people | second | test | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.466754 | 0.645157 | 0.349244 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.349244 | 0.0 | 0.349244 |
| 1 | 0.0 | 0.685155 | 0.0 | 0.25633 | 0.0 | 0.0 | 0.0 | 0.57745 | 0.0 | 0.25633 | 0.0 | 0.25633 |
| 2 | 0.527696 | 0.0 | 0.0 | 0.234244 | 0.0 | 0.527696 | 0.0 | 0.0 | 0.0 | 0.234244 | 0.527696 | 0.234244 |
| 3 | 0.0 | 0.466754 | 0.645157 | 0.349244 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.349244 | 0.0 | 0.349244 |
| 4 | 0.0 | 0.392213 | 0.0 | 0.293469 | 0.542124 | 0.0 | 0.0 | 0.0 | 0.542124 | 0.293469 | 0.0 | 0.293469 |
| 5 | 0.0 | 0.0 | 0.0 | 0.259064 | 0.478569 | 0.0 | 0.58361 | 0.0 | 0.478569 | 0.259064 | 0.0 | 0.259064 |
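
TF-IDF vectors can also be used for similarity analysis, as mentioned earlier. The sketch below illustrates this with three rows copied from the `X.toarray()` output (documents 0, 1, and 3). Because each row already has unit L2 norm, the dot product of two rows equals their cosine similarity. The names `rows` and `similarity` are illustrative. With the notebook's sparse `X`, the equivalent expression would presumably be `(X @ X.T).toarray()`.

```python
import numpy as np

# Rows 0, 1 and 3 of the X.toarray() output above.
# Column order: and, document, first, is, next, one,
#               people, second, test, the, third, this
rows = np.array([
    [0, 0.46675428, 0.64515682, 0.34924353, 0, 0, 0, 0, 0, 0.34924353, 0, 0.34924353],
    [0, 0.68515479, 0, 0.2563296, 0, 0, 0, 0.57744984, 0, 0.2563296, 0, 0.2563296],
    [0, 0.46675428, 0.64515682, 0.34924353, 0, 0, 0, 0, 0, 0.34924353, 0, 0.34924353],
])

# Unit-length rows: the matrix product gives pairwise cosine similarities.
similarity = rows @ rows.T
print(np.round(similarity, 3))
```

Documents 0 and 3 contain exactly the same words, so their similarity is 1, while documents 0 and 1 share only some words and score about 0.59.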
