# <center> TF-IDF

# <a id= 'b0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Introduction](#b1)<br>
[2. scikit-learn TfidfVectorizer](#b2)<br>

## <a id = 'b1'>
    
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">    
<font size = 4>

**TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents.**

**1. Term Frequency (TF):**
    
>- Measures how often a term (word) appears in a document.
>- Calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document.
>- It aims to highlight words that are more frequent within a specific document.

<font size = 5>

$$\text{TF(t,d)}=\frac{\text{number of times t appears in d }}{\text {total number of terms in d}}$$

<div class="alert alert-block alert-success">    
<font size = 4>  
    
**2. Inverse Document Frequency (IDF):**
>- Measures how rare, and therefore how informative, a term is across a collection of documents.
>- Calculated as the logarithm of the total number of documents ($N$) divided by the number of documents that contain the term ($df$). Adding 1 to the denominator avoids division by zero for terms that appear in no document.

<font size = 5>
$$\text{IDF}(t) = \log \frac{N}{1 + \text{df}(t)}$$

<div class="alert alert-block alert-success">    
<font size = 4>  

**3. TF-IDF**
> - The TF-IDF score for a term in a document is the product of its TF and IDF values.

<font size = 5>
    
$$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$$
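
The three formulas above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn computes its scores: the toy corpus and the helper names `tokenize`, `tf`, `idf` and `tf_idf` are assumptions made for this example only.

```python
import math
import re

# Hypothetical toy corpus, used only to illustrate the formulas.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf(term, tokens):
    """Term frequency: occurrences of the term / total terms in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_docs):
    """Inverse document frequency: log(N / (1 + df))."""
    n_docs = len(tokenized_docs)
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(n_docs / (1 + df))

def tf_idf(term, tokens, tokenized_docs):
    """TF-IDF score of a term in one document."""
    return tf(term, tokens) * idf(term, tokenized_docs)

tokenized = [tokenize(d) for d in docs]
score_cat = tf_idf("cat", tokenized[0], tokenized)  # term found in one document
score_the = tf_idf("the", tokenized[0], tokenized)  # term found in two documents
print(f"cat: {score_cat:.4f}  the: {score_the:.4f}")
```

In this corpus "the" occurs in two of the three documents, so its IDF is $\log(3/3) = 0$ and its score vanishes, while the rarer "cat" keeps a positive score. With this variant of the formula, a term that occurs in every document gets a negative IDF.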

In [1]:
import re
import pandas as pd
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer

<font size = 5 color = seagreen> <b>Create a dataset of documents

In [2]:
dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding."
]

## <a id = 'b2'>
    
<font size = 10 color = 'midnightblue'> <b>TF-IDF vectorizer

<font size = 5 color = seagreen><b>  Define a TF-IDF vectorizer

In [3]:
vectorizer = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word')

<font size = 5 color = seagreen><b>  Fit the tf-idf model

In [4]:
tfidf = vectorizer.fit(dataset)

<div class="alert alert-block alert-success">    
<font size = 4>

**After fitting, the vectorizer exposes the learned vocabulary through `get_feature_names_out()`. Each feature name corresponds to one column of the transformed matrix.**

In [5]:
print(list(tfidf.get_feature_names_out()))

['and', 'attention', 'be', 'breeze', 'but', 'can', 'challenging', 'change', 'clear', 'climate', 'crucial', 'different', 'escape', 'exercise', 'fantastic', 'for', 'gentle', 'global', 'good', 'great', 'health', 'immediate', 'immerse', 'in', 'incredibly', 'is', 'issue', 'language', 'learning', 'maintaining', 'mental', 'new', 'oneself', 'physical', 'pressing', 'reading', 'reality', 'requires', 'rewarding', 'skies', 'that', 'the', 'to', 'today', 'way', 'weather', 'with', 'worlds']


<div class="alert alert-block alert-success">    
<font size = 4>

**`transform` returns a sparse matrix in which each row represents one sentence of the dataset and each column corresponds to one word of the vocabulary. `toarray()` converts it to a dense array.**


In [6]:
vector = tfidf.transform(dataset).toarray()
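
As a small sketch of the sparse-versus-dense distinction, the block below repeats the fit on the same five sentences and inspects the result before and after `toarray()`. The variable names `X` and `dense` are illustrative choices, not part of the notebook above.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding.",
]

vectorizer = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word')

# fit_transform learns the vocabulary and returns a SciPy sparse matrix
X = vectorizer.fit_transform(dataset)
print(type(X).__name__, X.shape)      # rows = sentences, columns = vocabulary words
print("stored non-zero entries:", X.nnz)

# toarray() materialises every zero; fine for a tiny corpus, costly for a large one
dense = X.toarray()
print(type(dense).__name__, dense.shape)
```

For a corpus this small the dense array is harmless. For large corpora, keeping the sparse form is usually preferable, since most scikit-learn estimators accept it directly.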

In [8]:
pd.DataFrame(vector,
             columns= list(tfidf.get_feature_names_out()),
             index = [f'sent_{i}' for i in range(1,len(dataset)+1)])

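| | and | attention | be | breeze | but | can | challenging | change | clear | climate | ... | rewarding | skies | that | the | to | today | way | weather | with | worlds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent_1 | 0.214305 | 0.0 | 0.0 | 0.319995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.319995 | 0.0 | ... | 0.0 | 0.319995 | 0.0 | 0.319995 | 0.0 | 0.319995 | 0.0 | 0.319995 | 0.319995 | 0.0 |
| sent_2 | 0.195243 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.291533 | 0.0 | 0.291533 | 0.0 | 0.0 | 0.291533 |
| sent_3 | 0.0 | 0.327607 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.327607 | 0.0 | 0.327607 | ... | 0.0 | 0.0 | 0.327607 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sent_4 | 0.226198 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sent_5 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.0 | ... | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |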
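
The scores in this table do not follow the plain TF × IDF formula from the introduction. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of $\ln\frac{1+N}{1+\text{df}(t)} + 1$, and then scales each row to unit Euclidean length. The sketch below checks both properties on the same dataset. The names `doc_freq`, `expected_idf` and `row_norms` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding.",
]

vectorizer = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word')
X = vectorizer.fit_transform(dataset)

n_docs = len(dataset)

# Document frequency: number of sentences in which each vocabulary word occurs
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()

# Smoothed IDF used by scikit-learn when smooth_idf=True (the default)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(expected_idf, vectorizer.idf_))

# With norm='l2' (the default) every row has unit Euclidean length
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))
```

The normalisation explains the `sent_5` row: its nine words each occur once and only in that sentence, so every non-zero entry equals $1/\sqrt{9} \approx 0.333333$.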


<div class="alert alert-block alert-success">    
<font size = 4>

**This vectorised data can be used as input features (predictors) to any ML model.**
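
As one possible illustration, the vectorizer can be chained with a classifier in a scikit-learn pipeline. This is a sketch under stated assumptions: the `labels` below are hypothetical topic tags invented for the example, the choice of `LogisticRegression` is arbitrary, and five sentences are far too few to train a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding.",
]

# Hypothetical labels, invented purely for illustration
labels = ["environment", "lifestyle", "environment", "lifestyle", "lifestyle"]

# The pipeline vectorises the text and feeds the TF-IDF features to the classifier
model = make_pipeline(
    TfidfVectorizer(lowercase=True, analyzer='word'),
    LogisticRegression(),
)
model.fit(dataset, labels)

# New text is transformed with the vocabulary learned during fit
prediction = model.predict(["A gentle breeze and clear skies are expected today."])
print(prediction)
```

Wrapping both steps in one pipeline means the vocabulary and IDF weights are learned only from the training data, which matters once the data is split into training and test sets.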