
## Introduction to TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus). It gives more weight to terms that are characteristic of a particular document and less weight to terms that occur in most documents.

1. **Term Frequency (TF)**: Measures how often a term appears in a document. It is often normalized by the total number of terms in the document so that long documents are not favored.
2. **Inverse Document Frequency (IDF)**: Measures how rare a term is across the corpus. Terms that appear in fewer documents receive higher IDF scores, while terms that appear in nearly every document receive the lowest scores.
3. **TF-IDF Score**: The product of TF and IDF. It is highest for terms that occur often in one document but in few other documents of the corpus.
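
The three definitions above can be sketched in plain Python. This is a minimal illustration that assumes the textbook variant (TF as count divided by document length, IDF as `log(N / df)`) and whitespace tokenization; the function name `tf_idf` and the toy corpus are hypothetical. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers differ from these.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF (normalized count) times IDF (log of inverse document share)
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, not taken from the notebook
corpus = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(corpus)
print(scores[0])
```

In this toy corpus, "the" appears in every document, so its IDF is `log(3 / 3) = 0` and its score vanishes, while "sat" appears in only one document and scores highest in the first document.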

## Example: Using Scikit-Learn for TF-IDF

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Sample documents
documents = [
    "I love programming in Python",
    "Python programming is fun",
    "Machine learning is fascinating"
]

In [3]:
# Create the TfidfVectorizer object
vectorizer = TfidfVectorizer()

In [4]:
# Fit and transform the documents
X = vectorizer.fit_transform(documents)
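
To see what `fit_transform` produces, the result can be inspected before converting it to a dense array. The sketch below rebuilds the same three documents under different names (`docs`, `demo_vectorizer`, and `demo_X` are illustrative, not part of the notebook) and prints the matrix shape, the number of stored non-zero entries, and the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same three sample documents as in the notebook, under illustrative names
docs = [
    "I love programming in Python",
    "Python programming is fun",
    "Machine learning is fascinating",
]

demo_vectorizer = TfidfVectorizer()
demo_X = demo_vectorizer.fit_transform(docs)  # SciPy sparse matrix

# Rows are documents, columns are vocabulary terms
print(demo_X.shape)
# Only non-zero scores are stored in the sparse matrix
print(demo_X.nnz)
# vocabulary_ maps each term to its column index
print(sorted(demo_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
```

The sparse format matters for real corpora, where the vocabulary is large and most entries are zero; calling `toarray()` is convenient only for small examples like this one.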

In [5]:
# Convert to array and get feature names
X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

In [6]:
import pandas as pd

# Display the TF-IDF representation
df = pd.DataFrame(X_array, columns=feature_names)
print(df)

   fascinating       fun        in        is  learning      love   machine  \
0     0.000000  0.000000  0.562829  0.000000  0.000000  0.562829  0.000000   
1     0.000000  0.604652  0.000000  0.459854  0.000000  0.000000  0.000000   
2     0.528635  0.000000  0.000000  0.402040  0.528635  0.000000  0.528635   

   programming    python  
0     0.428046  0.428046  
1     0.459854  0.459854  
2     0.000000  0.000000  
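
The values in this table are not plain `tf * log(N / df)` products. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and then scales each row to unit Euclidean length. The sketch below reproduces row 0 by hand under those assumptions; the helper name `sklearn_style_row` is hypothetical, and the document frequencies are counted manually from the three sample documents.

```python
import math


def sklearn_style_row(term_counts, doc_freq, n_docs):
    """Smoothed TF-IDF followed by L2 normalization for one document."""
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


# Document 0 is "I love programming in Python"; the single-character
# token "I" is dropped by the default tokenizer, leaving four terms.
term_counts = {"love": 1, "in": 1, "programming": 1, "python": 1}
# Number of sample documents that contain each term (counted by hand)
doc_freq = {"love": 1, "in": 1, "programming": 2, "python": 2}

row0 = sklearn_style_row(term_counts, doc_freq, n_docs=3)
for term, score in row0.items():
    print(f"{term:12s} {score:.6f}")
```

"love" and "in" occur in one document only, so they end up with the larger score (about 0.5628), while "programming" and "python" are shared with document 1 and score lower (about 0.4280), matching row 0 of the printed DataFrame.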


In [8]:
document = [
    "Data science is an interdisciplinary field",
    "It uses scientific methods, processes, algorithms",
    "Data science is used for data analysis"
]
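
Before any scores are computed, each document is lowercased and split into tokens. By default `TfidfVectorizer` uses the token pattern `(?u)\b\w\w+\b`, which keeps runs of two or more word characters, so punctuation such as the commas above is discarded and single-character words are dropped. The sketch below approximates that step with the standard `re` module; the helper name `default_like_tokens` is hypothetical and is not how scikit-learn structures its internals.

```python
import re

# Default token_pattern documented for TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_like_tokens(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


# Commas are not word characters, so "methods," becomes "methods"
print(default_like_tokens("It uses scientific methods, processes, algorithms"))
# The one-letter word "I" does not match the pattern and is dropped
print(default_like_tokens("I love programming in Python"))
```

This is why the feature names contain `methods` and `processes` without punctuation, and why no column exists for the word "I".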



In [9]:
# Create the TfidfVectorizer object
vec = TfidfVectorizer()

In [10]:
# Fit and transform the documents
y = vec.fit_transform(document)

In [11]:
# Convert to array and get feature names
y_array = y.toarray()
f_names = vec.get_feature_names_out()

In [12]:
# Display the TF-IDF representation
pr = pd.DataFrame(y_array, columns=f_names)
print(pr)

   algorithms        an  analysis      data     field       for  \
0    0.000000  0.459548  0.000000  0.349498  0.459548  0.000000   
1    0.408248  0.000000  0.000000  0.000000  0.000000  0.000000   
2    0.000000  0.000000  0.393129  0.597969  0.000000  0.393129   

   interdisciplinary        is        it   methods  processes   science  \
0           0.459548  0.349498  0.000000  0.000000   0.000000  0.349498   
1           0.000000  0.000000  0.408248  0.408248   0.408248  0.000000   
2           0.000000  0.298984  0.000000  0.000000   0.000000  0.298984   

   scientific      used      uses  
0    0.000000  0.000000  0.000000  
1    0.408248  0.000000  0.408248  
2    0.000000  0.393129  0.000000  
