# NLP Practical: Feature Extraction (TF, TF-IDF)
**Name:** Prexit Joshi  
**Roll No.:** 118  

This notebook demonstrates how to perform **feature extraction** on a small text corpus using:
- Term Frequency (TF)
- Term Frequency–Inverse Document Frequency (TF-IDF)

We will use scikit-learn (imported as `sklearn`) to implement these techniques and pandas to display the results.


In [6]:
# Install required libraries (uncomment if running first time in Google Colab)
# !pip install scikit-learn


In [7]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd


In [8]:
# Example corpus (collection of documents)
corpus = [
    "Natural Language Processing is a part of Artificial Intelligence",
    "TF and TF-IDF are important techniques for feature extraction",
    "Feature extraction helps in text mining and information retrieval",
    "TF-IDF reduces the weight of common words in the corpus"
]


In [9]:
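# Term Frequency (TF): raw count of each term in each document.
# With default settings, CountVectorizer lowercases the text and keeps
# only tokens of two or more word characters (so "TF-IDF" becomes "tf", "idf").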
cv = CountVectorizer()
tf_matrix = cv.fit_transform(corpus)

# Convert to DataFrame for better readability
tf_df = pd.DataFrame(tf_matrix.toarray(), columns=cv.get_feature_names_out())
print("Term Frequency (TF) Matrix:")
tf_df


Term Frequency (TF) Matrix:


doc,and,are,artificial,common,corpus,extraction,feature,for,helps,idf,...,part,processing,reduces,retrieval,techniques,text,tf,the,weight,words
0,0,0,1,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,1,...,0,0,0,0,1,0,2,0,0,0
2,1,0,0,0,0,1,1,0,1,0,...,0,0,0,1,0,1,0,0,0,0
3,0,0,0,1,1,0,0,0,0,1,...,0,0,1,0,0,0,1,2,1,1
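
The `tf` count of 2 in document 1 comes from how the text is tokenized rather than from the word "TF" appearing twice on its own. A minimal sketch, assuming the default `CountVectorizer` settings used in this notebook, that inspects the tokens directly with `build_analyzer()`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that CountVectorizer applies to every document.
analyzer = CountVectorizer().build_analyzer()

# Document 1 of the corpus: the hyphen splits "TF-IDF" into two tokens.
tokens = analyzer("TF and TF-IDF are important techniques for feature extraction")
print(tokens)

# Document 0 of the corpus: the single-character word "a" is dropped
# by the default token pattern, which requires at least two characters.
short_tokens = analyzer("Natural Language Processing is a part of Artificial Intelligence")
print("a" in short_tokens)
```

Because "TF-IDF" is split at the hyphen, document 1 contributes two `tf` tokens and one `idf` token, which matches the matrix above.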


In [10]:
# Term Frequency-Inverse Document Frequency (TF-IDF)
# With default settings, TfidfVectorizer uses smoothed IDF and L2-normalizes each row.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)

# Convert to DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
print("TF-IDF Matrix:")
tfidf_df


TF-IDF Matrix:


doc,and,are,artificial,common,corpus,extraction,feature,for,helps,idf,...,part,processing,reduces,retrieval,techniques,text,tf,the,weight,words
0,0.0,0.0,0.362224,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.362224,0.362224,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.263203,0.333839,0.0,0.0,0.0,0.263203,0.263203,0.333839,0.0,0.263203,...,0.0,0.0,0.0,0.0,0.333839,0.0,0.526405,0.0,0.0,0.0
2,0.288149,0.0,0.0,0.0,0.0,0.288149,0.288149,0.0,0.365481,0.0,...,0.0,0.0,0.0,0.365481,0.0,0.365481,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.295059,0.295059,0.0,0.0,0.0,0.0,0.232628,...,0.0,0.0,0.295059,0.0,0.0,0.0,0.232628,0.590118,0.295059,0.295059
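
To see where these numbers come from, the weights can be recomputed by hand from the raw counts. The sketch below assumes the scikit-learn defaults (`smooth_idf=True`, `norm="l2"`, no sublinear TF scaling), under which the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Natural Language Processing is a part of Artificial Intelligence",
    "TF and TF-IDF are important techniques for feature extraction",
    "Feature extraction helps in text mining and information retrieval",
    "TF-IDF reduces the weight of common words in the corpus",
]

# Raw term counts, one row per document.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result.
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual_tfidf, reference))
```

The row normalization is why a term with the same count and the same IDF can receive different weights in different documents: the value depends on the other terms in that row.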


### Conclusion:
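- **TF**, as computed by `CountVectorizer`, is the raw count of each term in each document.  
- **TF-IDF** multiplies each count by an inverse document frequency factor, so terms that occur in many documents are weighted down relative to terms concentrated in a few documents.  
This often makes TF-IDF a stronger baseline than raw counts for tasks such as information retrieval and text classification, although the better choice depends on the task and the corpus.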
