# **TF-IDF (Term Frequency-Inverse Document Frequency)**

## **What is TF-IDF?**
- A numerical statistic that scores how **important a word** is to one document relative to a collection of documents (a corpus).
- Used in **text mining, information retrieval, and NLP tasks** such as search ranking and spam detection.

## **Formula**
\[
\text{TF-IDF} = TF \times IDF
\]

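- **Term Frequency (TF)**: How often a term occurs in a document, usually divided by the document's total word count.
- **Inverse Document Frequency (IDF)**: Down-weights terms that occur in many documents, so common words such as "the" contribute little.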
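
One common way to write the two factors more precisely is sketched below. The notation is an assumption introduced here for illustration: \(f_{t,d}\) is the number of times term \(t\) occurs in document \(d\), \(N\) is the number of documents in the corpus, and \(n_t\) is the number of documents that contain \(t\). Other variants exist (different logarithm bases, smoothing terms), so treat this as one convention rather than the only definition.

\[
TF(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad IDF(t) = \log\left(\frac{N}{n_t}\right)
\]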

## **Example: TF-IDF Calculation**
If the word "NLP" appears 3 times in a document with 100 words:
\[
TF = \frac{3}{100} = 0.03
\]
If "NLP" appears in 10 out of 1000 documents, then with a base-10 logarithm:
\[
IDF = \log_{10}\left(\frac{1000}{10}\right) = 2
\]
Thus, **TF-IDF = 0.03 × 2 = 0.06**.
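
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that assumes the unsmoothed, base-10 form of IDF used in this example; the function names are illustrative and not part of any library.

```python
import math

def term_frequency(term_count: int, total_terms: int) -> float:
    """Share of the document's words that are the given term."""
    return term_count / total_terms

def inverse_document_frequency(num_docs: int, docs_with_term: int) -> float:
    """Base-10 log of (corpus size / number of documents containing the term)."""
    return math.log10(num_docs / docs_with_term)

tf = term_frequency(3, 100)                  # "NLP" occurs 3 times in 100 words
idf = inverse_document_frequency(1000, 10)   # "NLP" occurs in 10 of 1000 documents
score = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {score:.2f}")
# prints: TF = 0.03, IDF = 2.00, TF-IDF = 0.06
```

Libraries often use a different variant (for example a natural logarithm with smoothing), so their numbers will generally not match this hand calculation.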

## **Python Implementation**
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP and deep learning", "NLP is amazing for text processing"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # Unique words
print(tfidf_matrix.toarray())  # TF-IDF values
```

The notebook session below applies the same steps to a dataset of spam messages.

In [1]:
import pandas as pd 
message = pd.read_csv("csv_files/spam_messages.csv")

In [2]:
message.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup... |
| 3 | ham | U dun say so early hor... U c already then... |
| 4 | spam | This is the 2nd time we have tried 2 con... |


In [3]:
message.columns

Index(['label', 'message'], dtype='object')

In [6]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')

lemma = WordNetLemmatizer()
lemma

<WordNetLemmatizer>

In [27]:
corpus = []
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

for text in message['message']:
    review = re.sub('[^a-zA-Z]', ' ', str(text))  # replace everything except letters with spaces
    review = review.lower().split()
    review = [lemma.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))


In [28]:
corpus 

['go jurong point crazy available',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup',
 'u dun say early hor u c already',
 'nd time tried con',
 'b going esplanade fr home',
 'pity mood ot',
 'guy bitching acted like',
 'rofl true name']
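
To make the cleaning steps easier to follow, here is a simplified sketch of the same idea that runs without NLTK. It assumes a tiny hand-written stop-word set (hypothetical, far shorter than NLTK's English list) and skips lemmatization, so it illustrates the regex, lowercase, split, and filter steps rather than reproducing the cell above exactly.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"in", "a", "to", "the", "is", "and", "so", "then", "until"}

def clean_message(text: str) -> str:
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

# Truncated preview text of row 2 from message.head()
print(clean_message("Free entry in 2 a wkly comp to win FA Cup..."))
# prints: free entry wkly comp win fa cup
```

Digits and punctuation disappear because the regex replaces every non-letter with a space, which is why "2nd" shows up as "nd" in the corpus.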

### Create TF-IDF
Import `TfidfVectorizer` from `sklearn.feature_extraction.text`:


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer



In [30]:
tfidf = TfidfVectorizer(max_features = 100)
tfidf

In [43]:
x = tfidf.fit_transform(corpus).toarray()  # dense matrix: one row per message, one column per term

In [38]:
import numpy as np 

np.set_printoptions(
    edgeitems=30, 
    linewidth=10000, 
    formatter=dict(float=lambda x: f"{x:.2f}")  # print floats with two decimals
)


In [40]:
x

array([[0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.45, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.38, 0.00, 0.00, 0.38, 0.00, 0.00, 0.38, 0.00, 0.38, 0.00, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.38],
       [0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.45, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
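
The repeated values 0.45 and 0.38 in the matrix have a simple explanation. The sketch below re-implements, in plain Python, what I understand to be `TfidfVectorizer`'s default weighting: raw counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, tokens of at least two characters, and L2-normalized rows. The helper `tfidf_rows` is a hypothetical name written for this illustration, not a scikit-learn function, and only the first three corpus entries are used.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    # Tokens of two or more word characters, so single letters such as "u" are dropped.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [
    "go jurong point crazy available",
    "ok lar joking wif u oni",
    "free entry wkly comp win fa cup",
]
vocab, rows = tfidf_rows(docs)
print([round(v, 2) for v in rows[0] if v > 0])  # five equal weights of 1/sqrt(5)
print([round(v, 2) for v in rows[2] if v > 0])  # seven equal weights of 1/sqrt(7)
```

Because every term in this small corpus occurs in exactly one message, all terms share the same IDF, and after L2 normalization a message with k distinct terms gets 1/sqrt(k) in each non-zero column: about 0.45 for five terms and about 0.38 for seven.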
    

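## N-Grams
Setting `ngram_range=(2, 3)` makes the vectorizer build features from bigrams and trigrams (runs of two and three consecutive words) instead of single words.
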

In [49]:
tfidf = TfidfVectorizer(max_features = 100 , ngram_range=(2,3))
x = tfidf.fit_transform(corpus).toarray()



In [50]:
tfidf.vocabulary_

{'go jurong': 18,
 'jurong point': 27,
 'point crazy': 38,
 'crazy available': 5,
 'go jurong point': 19,
 'jurong point crazy': 28,
 'point crazy available': 39,
 'ok lar': 34,
 'lar joking': 29,
 'joking wif': 25,
 'wif oni': 48,
 'ok lar joking': 35,
 'lar joking wif': 30,
 'joking wif oni': 26,
 'free entry': 16,
 'entry wkly': 10,
 'wkly comp': 51,
 'comp win': 3,
 'win fa': 49,
 'fa cup': 14,
 'free entry wkly': 17,
 'entry wkly comp': 11,
 'wkly comp win': 52,
 'comp win fa': 4,
 'win fa cup': 50,
 'dun say': 6,
 'say early': 42,
 'early hor': 8,
 'hor already': 24,
 'dun say early': 7,
 'say early hor': 43,
 'early hor already': 9,
 'nd time': 32,
 'time tried': 44,
 'tried con': 46,
 'nd time tried': 33,
 'time tried con': 45,
 'going esplanade': 20,
 'esplanade fr': 12,
 'fr home': 15,
 'going esplanade fr': 21,
 'esplanade fr home': 13,
 'pity mood': 36,
 'mood ot': 31,
 'pity mood ot': 37,
 'guy bitching': 22,
 'bitching acted': 1,
 'acted like': 0,
 'guy bitching acted': 23,
 'bitching acted like': 2,
 'rofl true': 40,
 'true name': 47,
 'rofl true name': 41}
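
The keys in this vocabulary are word n-grams. As a rough illustration of how they are formed, the sketch below slides a window of 2 and then 3 words over an already-cleaned message. `word_ngrams` is a hypothetical helper written for this note, and it splits on whitespace only, whereas the vectorizer also drops one-letter tokens first (which is why 'wif oni' appears above without the "u").

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of text for n from min_n to max_n, shorter n first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("go jurong point crazy available", 2, 3))
# prints: ['go jurong', 'jurong point', 'point crazy', 'crazy available',
#          'go jurong point', 'jurong point crazy', 'point crazy available']
```

A message with k tokens yields k - 1 bigrams and k - 2 trigrams, so the five-word message above contributes seven features.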

In [51]:
x

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.00, 0.38, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.38, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.30, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.30, 0.00, 0.00, 0.30, 0.00, 0.30, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.30, 0.30, 0.30],
       [0.00, 0.00,