<a href="https://colab.research.google.com/github/omarhamed888/NLP_Text_Preprocessing/blob/main/tf_idf_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF in NLP

This notebook demonstrates how to calculate Term Frequency-Inverse Document Frequency (TF-IDF) using Python's `scikit-learn` library.

## Import Libraries

Let's start by importing the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

## Sample Data

We'll use a small sample dataset to illustrate the TF-IDF calculation.

In [2]:
documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun."
]

## Initialize TF-IDF Vectorizer

We'll use `TfidfVectorizer` from `scikit-learn` to compute the TF-IDF scores for the documents.

In [3]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

## Display TF-IDF Scores

Let's display the TF-IDF scores for each term in each document.

In [4]:
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df

| | blue | bright | can | in | is | see | shining | sky | sun | the | we |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.659191 | 0.0 | 0.0 | 0.0 | 0.420753 | 0.0 | 0.0 | 0.519714 | 0.0 | 0.343993 | 0.0 |
| 1 | 0.0 | 0.522109 | 0.0 | 0.0 | 0.522109 | 0.0 | 0.0 | 0.0 | 0.522109 | 0.426858 | 0.0 |
| 2 | 0.0 | 0.321846 | 0.0 | 0.504235 | 0.321846 | 0.0 | 0.0 | 0.397544 | 0.321846 | 0.526261 | 0.0 |
| 3 | 0.0 | 0.239102 | 0.374599 | 0.0 | 0.0 | 0.374599 | 0.374599 | 0.0 | 0.478204 | 0.390963 | 0.374599 |
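
To see where these numbers come from, the sketch below recomputes the matrix by hand. It assumes the default `TfidfVectorizer` settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The variable names (`counts`, `doc_freq`, `manual_tfidf`) are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The sky is blue.",
    "The sun is bright.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

# Raw term counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(documents).toarray()
n_docs = counts.shape[0]

# Document frequency: how many documents contain each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (default smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's result.
sklearn_tfidf = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

For example, "blue" appears in only one of the four documents, so it gets the highest IDF, which is why it has the largest score in document 0.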


In [5]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=df)



In [9]:
# Use TF-IDF features to classify emails as spam or ham

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample email data (replace with your actual data)
emails = [
    ("Free Viagra now!!!", "spam"),
    ("Meeting at 3 PM", "ham"),
    ("Urgent! Your account has been compromised", "spam"),
    ("Check out this amazing offer!", "spam"),
    ("Project update", "ham"),
    ("Congratulations, you've won a prize!", "spam"),
    ("Dinner tonight?", "ham"),
    ("Get rich quick!", "spam"),
    ("Your order has been shipped", "ham"),
    ("Limited-time offer, click here", "spam")
]


df = pd.DataFrame(emails, columns=["text", "label"])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test_tfidf)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Example prediction
new_email = ["You have won a free trip"]
new_email_tfidf = vectorizer.transform(new_email)
prediction = model.predict(new_email_tfidf)
print(f"Prediction for new email: {prediction[0]}")


Accuracy: 0.0
Prediction for new email: spam
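
With ten emails and `test_size=0.2`, the test set holds only two emails, so the accuracy can only be 0.0, 0.5, or 1.0 and depends heavily on which two emails are held out. One way to get a slightly steadier estimate on such a small sample is stratified cross-validation. The sketch below is an assumption-laden illustration rather than part of the original notebook: it wraps the vectorizer and classifier in a pipeline (so TF-IDF is refit on each training fold) and uses 4 folds because there are only four "ham" examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Same sample emails as in the cell above.
texts = [
    "Free Viagra now!!!", "Meeting at 3 PM",
    "Urgent! Your account has been compromised",
    "Check out this amazing offer!", "Project update",
    "Congratulations, you've won a prize!", "Dinner tonight?",
    "Get rich quick!", "Your order has been shipped",
    "Limited-time offer, click here",
]
labels = ["spam", "ham", "spam", "spam", "ham",
          "spam", "ham", "spam", "ham", "spam"]

# The pipeline refits TF-IDF inside every fold, avoiding test-set leakage.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Stratified folds keep the spam/ham ratio similar in each split.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Even with cross-validation, ten short emails are far too few to judge the classifier; the sketch only shows the evaluation pattern to apply once real data is available.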


## Conclusion

In this notebook, we calculated TF-IDF scores for a small set of documents with `scikit-learn`'s `TfidfVectorizer`, then used the same kind of features to train a logistic regression classifier on a toy email dataset. TF-IDF is a common NLP technique for turning text into numerical features that can be used for tasks such as text classification, clustering, and information retrieval.