## 🔢 Term Frequency (TF)

**Definition:**
Term Frequency measures how frequently a term appears in a single document.

**Formula:**
`TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)`

**Example:**
If the term **"data"** appears 3 times in a document with 100 words:
`TF("data") = 3 / 100 = 0.03`
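The formula translates directly into a few lines of Python. This is a minimal sketch that assumes the document is already split into tokens; the `term_frequency` helper and the synthetic 100-word document are illustrative names, not part of any library.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Return count(term) / len(tokens), or 0.0 for an empty document."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

# Synthetic 100-word document: "data" appears 3 times, plus 97 filler words.
tokens = ["data"] * 3 + ["filler"] * 97

print(term_frequency("data", tokens))  # 0.03
```

Dividing by the document length keeps long documents from scoring higher simply because they contain more words.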


---

## 📄 Document Frequency (DF)

**Definition:**
Document Frequency is the number of documents that contain the term at least once.

**Formula:**
`DF(t) = Number of documents containing term t`

**Example:**
If the term **"machine"** appears in 5 out of 10 documents:
`DF("machine") = 5`
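Document Frequency can be sketched the same way. The example below assumes lowercase whitespace tokenization and uses a small hypothetical corpus; `document_frequency` is an illustrative helper, not a library function.

```python
def document_frequency(term, documents):
    """Count the documents that contain the term at least once."""
    term = term.lower()
    return sum(1 for doc in documents if term in doc.lower().split())

# Hypothetical 4-document corpus.
docs = [
    "machine learning is fun",
    "the machine beat the other machine",
    "text data needs cleaning",
    "deep learning needs data",
]

print(document_frequency("machine", docs))  # 2
print(document_frequency("fun", docs))      # 1
```

Note that "machine" occurs twice in the second document but still counts only once: DF counts documents, not occurrences.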



In [38]:
data1 = [
    "I love natural language processing",
    "Bag of words is a simple technique",
    "Machine learning is fun and powerful",
    "Text classification uses word counts",
    "I enjoy learning about data science",
    "This model converts text to vectors",
    "NLP tasks include sentiment analysis",
    "Text data needs preprocessing",
    "Feature extraction is important in NLP",
    "We use vectorization to represent text"
]

In [39]:
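# TF-IDF: Term Frequency (TF) × Inverse Document Frequency (IDF)
# Formula: W(x, y) = tf(x, y) * log(N / df(x))
#
# Where:
# W(x, y)  → TF-IDF weight of term x in document y
# tf(x, y) → Term Frequency: number of times term x appears in document y (raw count)
# df(x)    → Document Frequency: number of documents that contain term x
# N        → Total number of documents (here, each row/sentence of data1 is one document)
#
# Note: TfidfVectorizer's defaults use a smoothed IDF, ln((1 + N) / (1 + df(x))) + 1,
# and L2-normalize each row, so its values differ from this textbook formula.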
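As a quick numeric check of the textbook formula, the sketch below evaluates it for a single term. The formula does not state the base of the logarithm, so the natural logarithm is an assumption here, and `tfidf_weight` is an illustrative helper name.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Textbook TF-IDF: tf * log(N / df), using the natural logarithm."""
    return tf * math.log(n_docs / df)

# A term that appears once in a document and in 4 of 10 documents overall.
w = tfidf_weight(tf=1, df=4, n_docs=10)
print(round(w, 4))  # 0.9163

# A term present in every document gets weight 0, however often it occurs.
print(tfidf_weight(tf=3, df=10, n_docs=10))  # 0.0
```

The second call shows why the IDF factor matters: words found in every document do not help tell documents apart, so their weight collapses to zero.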



In [40]:
# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [41]:
# Fit on the corpus (learns the vocabulary and IDF weights) and return the TF-IDF matrix
tfidf = TfidfVectorizer()
t = tfidf.fit_transform(data1)
t

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 53 stored elements and shape (10, 44)>
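The shape `(10, 44)` and the 53 stored elements can be reproduced without scikit-learn. The sketch below approximates the vectorizer's default tokenization (lowercase, then runs of two or more word characters, so "I" and "a" are dropped); the `tokenize` helper is an illustrative stand-in, not the library's own code.

```python
import re

data1 = [
    "I love natural language processing",
    "Bag of words is a simple technique",
    "Machine learning is fun and powerful",
    "Text classification uses word counts",
    "I enjoy learning about data science",
    "This model converts text to vectors",
    "NLP tasks include sentiment analysis",
    "Text data needs preprocessing",
    "Feature extraction is important in NLP",
    "We use vectorization to represent text",
]

def tokenize(doc):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Distinct terms per document, and the sorted vocabulary across all documents.
token_sets = [set(tokenize(doc)) for doc in data1]
vocabulary = sorted(set().union(*token_sets))

n_docs = len(data1)                         # rows
n_terms = len(vocabulary)                   # columns
n_stored = sum(len(s) for s in token_sets)  # one non-zero cell per distinct term per document

print((n_docs, n_terms), n_stored)  # (10, 44) 53
```

Only 53 of the 440 cells are non-zero, which is why the result is stored as a sparse matrix.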

In [42]:
t.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.5       , 0.        , 0.5       , 0.        ,
        0.        , 0.5       , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.5       , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.42435658, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.3156065 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.42435658,
        0.        , 0.        , 0.        , 0.        , 0. 

In [43]:
# Wrap the dense matrix in a DataFrame with one column per vocabulary term
t1 = pd.DataFrame(t.toarray(), columns=tfidf.get_feature_names_out())
t1

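| | about | analysis | and | bag | classification | converts | counts | data | enjoy | extraction | ... | text | this | to | use | uses | vectorization | vectors | we | word | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.424357 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.424357 |
| 2 | 0.0 | 0.0 | 0.435368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.474727 | 0.0 | 0.474727 | 0.0 | 0.0 | 0.0 | ... | 0.313903 | 0.0 | 0.0 | 0.0 | 0.474727 | 0.0 | 0.0 | 0.0 | 0.474727 | 0.0 |
| 4 | 0.474295 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.403194 | 0.474295 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.440231 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.291093 | 0.440231 | 0.374236 | 0.0 | 0.0 | 0.0 | 0.440231 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.460158 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.478223 | 0.0 | 0.0 | ... | 0.371977 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.435368 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.291093 | 0.0 | 0.374236 | 0.440231 | 0.0 | 0.440231 | 0.0 | 0.440231 | 0.0 | 0.0 |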
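The weights in row 3 ("Text classification uses word counts") can be recomputed by hand. This sketch assumes the vectorizer's default settings, namely a smoothed IDF of `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization of each row; the document frequencies are read off `data1` and hard-coded for brevity.

```python
import math

N = 10  # number of documents in data1

# Each term occurs once in row 3 (tf = 1). Document frequencies from data1:
# "text" appears in 4 documents, the other four terms in 1 document each.
df = {"text": 4, "classification": 1, "uses": 1, "word": 1, "counts": 1}

# Unnormalized weight: tf * smoothed IDF, with tf = 1 for every term here.
raw = {term: 1 * (math.log((1 + N) / (1 + d)) + 1) for term, d in df.items()}

# L2-normalize the row so that its squared weights sum to 1.
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {term: v / norm for term, v in raw.items()}

print(round(weights["text"], 6))            # 0.313903
print(round(weights["classification"], 6))  # 0.474727
```

Both values match the table: "text" is shared by four documents and so receives a lower weight than the terms unique to this sentence.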
