# Introduction

This demo works with two documents:

* document1 = 'petrol cars are cheaper than diesel cars'
* document2 = 'diesel is cheaper than petrol'

First, we write a Python program that calculates TF-IDF values with scikit-learn (Section A). Then, to understand **how the sklearn workflow operates**, we calculate the same TF-IDF values by hand in Section B and compare them with the values from Section A.

In [1]:
# Install some libraries
!pip3 install --quiet pandas
!pip3 install --quiet scikit-learn


In [2]:
# Check version of scikit-learn
!pip3 show scikit-learn

Name: scikit-learn
Version: 1.3.1
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /usr/local/lib/python3.11/site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: 


# Section A. Python program to generate tf-idf values

## Step 1: Import the library

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Step 2: Set up the document corpus


In [4]:
document1 = 'petrol cars are cheaper than diesel cars'
document2 = 'diesel is cheaper than petrol'
document_id =  ["d1", "d2"]
corpus = [document1, document2]


## Step 3: Initialize TfidfVectorizer 

In [9]:
# 3.1. Without stop_words='english'. To filter English stop words instead, comment out this code and use the code in 3.2
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(X.toarray(), index=document_id, columns=vectorizer.get_feature_names_out())
tfidf_df.loc['document_frequency'] = (tfidf_df > 0).sum()
tfidf_df

Unnamed: 0,are,cars,cheaper,diesel,is,petrol,than
d1,0.377292,0.754584,0.268446,0.268446,0.0,0.268446,0.268446
d2,0.0,0.0,0.40909,0.40909,0.574962,0.40909,0.40909
document_frequency,1.0,1.0,2.0,2.0,1.0,2.0,2.0


In [10]:
# 3.2. With stop_words='english'. To keep every word instead, comment out this code and use the code in 3.1
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
tfidf_df = pd.DataFrame(X.toarray(), index=document_id, columns=vectorizer.get_feature_names_out())
tfidf_df.loc['document_frequency'] = (tfidf_df > 0).sum()

In [7]:
# print(X.toarray())
tfidf_df

Unnamed: 0,cars,cheaper,diesel,petrol
d1,0.851354,0.302873,0.302873,0.302873
d2,0.0,0.57735,0.57735,0.57735
document_frequency,1.0,2.0,2.0,2.0


# Section B. Calculate TF-IDF values by hand

## 1. Calculate IDF values for each term by hand

- The IDF value of a term is the same across all documents. The calculation below uses the four-term vocabulary from step 3.2 (with stop_words='english') and the default setting smooth_idf=True, so idf(t) is given by

$$\mathrm{idf}(t) = \ln\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1$$

Number of documents in the corpus: $n = 2$

$\mathrm{idf}(\text{cars}) = \ln(3/2) + 1 \approx 1.405465083$

$\mathrm{idf}(\text{cheaper}) = \ln(3/3) + 1 = 1$

$\mathrm{idf}(\text{diesel}) = \ln(3/3) + 1 = 1$

$\mathrm{idf}(\text{petrol}) = \ln(3/3) + 1 = 1$

- So the IDF values form a vector of shape 1 × 4:

|    **cars** | **cheaper** | **diesel** | **petrol** |
|------------:|------------:|-----------:|-----------:|
| 1.405465083 | 1           | 1          | 1          |
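
The smoothed IDF formula above can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper `smooth_idf` and the hand-counted `doc_freq` dictionary are illustrative names, not scikit-learn API.

```python
import math

def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2
# Number of documents containing each term, counted by hand from d1 and d2.
doc_freq = {'cars': 1, 'cheaper': 2, 'diesel': 2, 'petrol': 2}

idf = {term: smooth_idf(n_docs, df) for term, df in doc_freq.items()}
for term, value in idf.items():
    print(f"idf({term}) = {value:.6f}")
```

The printed values should agree with the table above, up to rounding in the last digits.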

## 2. Calculate tf-idf of the terms in each document d1 and d2.

- **For d1:**
  
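$\text{tf-idf}(\text{cars}) = \mathrm{tf}(\text{cars}) \times \mathrm{idf}(\text{cars}) = 2 \times 1.405465083 \approx 2.810930165$

$\text{tf-idf}(\text{cheaper}) = \mathrm{tf}(\text{cheaper}) \times \mathrm{idf}(\text{cheaper}) = 1 \times 1 = 1$

$\text{tf-idf}(\text{diesel}) = \mathrm{tf}(\text{diesel}) \times \mathrm{idf}(\text{diesel}) = 1 \times 1 = 1$

$\text{tf-idf}(\text{petrol}) = \mathrm{tf}(\text{petrol}) \times \mathrm{idf}(\text{petrol}) = 1 \times 1 = 1$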

- **For d2:**

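$\text{tf-idf}(\text{cars}) = \mathrm{tf}(\text{cars}) \times \mathrm{idf}(\text{cars}) = 0 \times 1.405465083 = 0$

$\text{tf-idf}(\text{cheaper}) = \mathrm{tf}(\text{cheaper}) \times \mathrm{idf}(\text{cheaper}) = 1 \times 1 = 1$

$\text{tf-idf}(\text{diesel}) = \mathrm{tf}(\text{diesel}) \times \mathrm{idf}(\text{diesel}) = 1 \times 1 = 1$

$\text{tf-idf}(\text{petrol}) = \mathrm{tf}(\text{petrol}) \times \mathrm{idf}(\text{petrol}) = 1 \times 1 = 1$

- So we have the unnormalized tf-idf matrix with shape 2 × 4: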

|    |    **cars** | **cheaper** | **diesel** | **petrol** |
|----|------------:|------------:|-----------:|-----------:|
| d1 | 2.810930165 | 1           | 1          | 1          |
| d2 | 0           | 1           | 1          | 1          |
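
The tf × idf products in this step can be reproduced with a short standard-library sketch. It assumes that splitting on whitespace is an adequate tokenizer for these two sentences (scikit-learn uses a regular expression instead), and it copies the IDF values from the table in step 1. The names `vocabulary`, `documents`, and `raw_tfidf` are illustrative.

```python
from collections import Counter

vocabulary = ['cars', 'cheaper', 'diesel', 'petrol']
# IDF values copied from the table in step 1.
idf = {'cars': 1.405465083, 'cheaper': 1.0, 'diesel': 1.0, 'petrol': 1.0}

documents = {
    'd1': 'petrol cars are cheaper than diesel cars',
    'd2': 'diesel is cheaper than petrol',
}

raw_tfidf = {}
for doc_id, text in documents.items():
    # Term frequency: raw count of each token in the document.
    counts = Counter(text.split())
    raw_tfidf[doc_id] = [counts[term] * idf[term] for term in vocabulary]

for doc_id, row in raw_tfidf.items():
    print(doc_id, [round(value, 6) for value in row])
```

Terms outside the vocabulary (the stop words) are simply never looked up, which mirrors their removal in step 3.2.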

## 3. Normalize tf-idf values

We have one final step. To prevent long documents in the corpus from dominating shorter ones, each row of the matrix is normalized. There are several ways to do this; here we use the Euclidean (L2) norm, which is also the default (norm='l2') in TfidfVectorizer:

- **First document d1**

$\dfrac{2.810930165}{\sqrt{2.810930165^2 + 1^2 + 1^2 + 1^2}} \approx 0.851354321$

$\dfrac{1}{\sqrt{2.810930165^2 + 1^2 + 1^2 + 1^2}} \approx 0.302872811$

$\dfrac{1}{\sqrt{2.810930165^2 + 1^2 + 1^2 + 1^2}} \approx 0.302872811$

$\dfrac{1}{\sqrt{2.810930165^2 + 1^2 + 1^2 + 1^2}} \approx 0.302872811$

- **Second document d2**

$\dfrac{0}{\sqrt{0^2 + 1^2 + 1^2 + 1^2}} = 0$

$\dfrac{1}{\sqrt{0^2 + 1^2 + 1^2 + 1^2}} \approx 0.577350269$

$\dfrac{1}{\sqrt{0^2 + 1^2 + 1^2 + 1^2}} \approx 0.577350269$

$\dfrac{1}{\sqrt{0^2 + 1^2 + 1^2 + 1^2}} \approx 0.577350269$
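
The row-wise Euclidean normalization can be sketched in plain Python as follows. The `raw_tfidf` values are copied from the unnormalized matrix in step 2, and `l2_normalize` is an illustrative helper, not a library function.

```python
import math

# Unnormalized tf-idf rows copied from step 2 (columns: cars, cheaper, diesel, petrol).
raw_tfidf = {
    'd1': [2.810930165, 1.0, 1.0, 1.0],
    'd2': [0.0, 1.0, 1.0, 1.0],
}

def l2_normalize(row):
    """Divide every entry by the Euclidean length of the row."""
    norm = math.sqrt(sum(value * value for value in row))
    # An all-zero row has no direction, so it is returned unchanged.
    return [value / norm for value in row] if norm > 0 else list(row)

normalized = {doc_id: l2_normalize(row) for doc_id, row in raw_tfidf.items()}
for doc_id, row in normalized.items():
    print(doc_id, [round(value, 6) for value in row])
```

After this step each row has unit length, so a document's length no longer affects the scale of its vector.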

- **As you can see, the results computed by hand match the output of the Python program in step 3.2**

|    |    **cars** | **cheaper** |  **diesel** |  **petrol** |
|----|------------:|------------:|------------:|------------:|
| d1 | 0.851354321 | 0.302872811 | 0.302872811 | 0.302872811 |
| d2 | 0           | 0.577350269 | 0.577350269 | 0.577350269 |
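
The agreement can also be checked programmatically rather than by eye. The sketch below assumes NumPy is available alongside scikit-learn, copies the hand-computed table into an array named `by_hand` (an illustrative name), and uses a tolerance of 1e-6 as an arbitrary choice that absorbs rounding in the hand-computed digits.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'petrol cars are cheaper than diesel cars',
    'diesel is cheaper than petrol',
]

# Hand-computed values from the table above (columns: cars, cheaper, diesel, petrol).
by_hand = np.array([
    [0.851354321, 0.302872811, 0.302872811, 0.302872811],
    [0.0, 0.577350269, 0.577350269, 0.577350269],
])

# Same settings as step 3.2: English stop words, default smoothing and L2 norm.
from_sklearn = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()

matches = np.allclose(from_sklearn, by_hand, atol=1e-6)
print("hand calculation matches sklearn:", matches)
```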