NLP Tutorial - Text Representation: TF-IDF

What is TF-IDF?

TF stands for Term Frequency: the number of times a particular word appears in a document divided by the total number of words in that document.


   Term Frequency(TF) = [number of times the word appears in a document / total number of words in that document]


Term Frequency values range between 0 and 1. The more often a word occurs relative to the document's length, the closer its value is to 1.
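As a quick check of the definition above, here is a minimal sketch that computes term frequencies for one sentence in plain Python. The helper name `term_frequency` and the simple lowercase, letters-and-digits tokenizer are assumptions made for illustration; they are not part of scikit-learn.

```python
import re
from collections import Counter

def term_frequency(document):
    """Return {word: count / total number of words} for one document."""
    # Hypothetical tokenizer: lowercase, keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("Thor eating pizza, Loki is eating pizza, Ironman ate pizza already")
print(tf["pizza"])   # 3 of the 11 words are "pizza"
print(tf["eating"])  # 2 of the 11 words are "eating"
```

Because every count is divided by the same total, the values for one document always sum to 1.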


IDF stands for Inverse Document Frequency: the logarithm of the ratio of the total number of documents (data points) in the dataset to the number of documents that contain the particular word.


   Inverse Document Frequency(IDF) = [log(total number of documents / number of documents that contain the word)]


If a word occurs in many documents and is common across the whole corpus, the ratio approaches 1 and its IDF value approaches 0.
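The plain IDF formula can be sketched in a few lines. The function name `inverse_document_frequency` and the toy documents are assumptions for illustration, and each document is represented as a set of lowercase words to keep the example short.

```python
import math

def inverse_document_frequency(word, documents):
    """Textbook IDF: log(total documents / documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

# Hypothetical mini-corpus; each document is a set of words.
docs = [
    {"apple", "is", "announcing", "new", "iphone", "tomorrow"},
    {"tesla", "is", "announcing", "new", "model", "tomorrow"},
    {"i", "am", "eating", "biryani"},
]

idf_is = inverse_document_frequency("is", docs)          # in 2 of 3 documents
idf_eating = inverse_document_frequency("eating", docs)  # in 1 of 3 documents
print(idf_is, idf_eating)
```

The rarer word gets the larger score. A word that appears in no document would cause a division by zero in this plain form, which is one reason libraries typically use a smoothed variant.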


Finally:


   TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]
corpus

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow',
 'Tesla is announcing new model-3 tomorrow',
 'Google is announcing new pixel-6 tomorrow',
 'Microsoft is announcing new surface tomorrow',
 'Amazon is announcing new eco-dot tomorrow',
 'I am eating biryani and you are eating grapes']

In [2]:
print(type(corpus))

<class 'list'>


In [3]:
# create the vectorizer, fit it on the corpus, then transform the corpus into TF-IDF vectors
v = TfidfVectorizer()
v.fit(corpus)
transform_output = v.transform(corpus)


In [4]:
#let's print the vocabulary

print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [5]:
v_feature = v.get_feature_names_out()
v_len = len(v_feature)
print(v_len)

28


In [6]:
# let's print the idf of each word

all_feature_names = v.get_feature_names_out()

for word in all_feature_names:
    # look up the word's column index in the vocabulary
    indx = v.vocabulary_.get(word)
    # get the IDF score stored at that index
    idf_score = v.idf_[indx]

    print(f"{word} : {idf_score}")


already : 2.386294361119891
am : 2.386294361119891
amazon : 2.386294361119891
and : 2.386294361119891
announcing : 1.2876820724517808
apple : 2.386294361119891
are : 2.386294361119891
ate : 2.386294361119891
biryani : 2.386294361119891
dot : 2.386294361119891
eating : 1.9808292530117262
eco : 2.386294361119891
google : 2.386294361119891
grapes : 2.386294361119891
iphone : 2.386294361119891
ironman : 2.386294361119891
is : 1.1335313926245225
loki : 2.386294361119891
microsoft : 2.386294361119891
model : 2.386294361119891
new : 1.2876820724517808
pixel : 2.386294361119891
pizza : 2.386294361119891
surface : 2.386294361119891
tesla : 2.386294361119891
thor : 2.386294361119891
tomorrow : 1.2876820724517808
you : 2.386294361119891
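These scores do not match the plain log(N / df) formula given earlier. With its default `smooth_idf=True`, scikit-learn's `TfidfVectorizer` computes ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the word. The sketch below checks a few of the printed values; the helper name `smoothed_idf` is an assumption, and the document frequencies are counted by hand from the 7-sentence corpus.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 7
# Number of corpus sentences that contain each word (counted by hand).
doc_freq = {"thor": 1, "eating": 2, "announcing": 5, "is": 6}

idf = {word: smoothed_idf(n_docs, df) for word, df in doc_freq.items()}
for word, score in idf.items():
    print(f"{word} : {score}")
```

The added 1 means that even a word present in every document keeps a non-zero weight instead of being dropped entirely.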


In [7]:
corpus[:2]

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already',
 'Apple is announcing new iphone tomorrow']

In [8]:
transform_output.toarray()[:2]

array([[0.24266547, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24266547, 0.        , 0.        ,
        0.40286636, 0.        , 0.        , 0.        , 0.        ,
        0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
        0.        , 0.        , 0.72799642, 0.        , 0.        ,
        0.24266547, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.30652086,
        0.5680354 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.5680354 ,
        0.        , 0.26982671, 0.        , 0.        , 0.        ,
        0.30652086, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.30652086, 0.        ]])
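The first row above can be reproduced by hand. In scikit-learn's default configuration the term frequency is the raw count (not the ratio), each count is multiplied by the smoothed IDF ln((1 + N) / (1 + df)) + 1, and the resulting row is scaled to unit Euclidean (L2) length. The following is a plain-Python sketch of those steps for the first sentence; the variable names are illustrative and the document frequencies are counted by hand from the corpus.

```python
import math
from collections import Counter

n_docs = 7
# Document frequencies for the words of the first sentence (counted by hand).
doc_freq = {"thor": 1, "eating": 2, "pizza": 1, "loki": 1,
            "is": 6, "ironman": 1, "ate": 1, "already": 1}

tokens = "thor eating pizza loki is eating pizza ironman ate pizza already".split()
counts = Counter(tokens)

# Raw count times smoothed IDF for every word in the sentence.
raw = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
       for w, c in counts.items()}

# Scale the row so that its squared values sum to 1 (L2 normalization).
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {w: value / norm for w, value in raw.items()}

for word in sorted(tfidf):
    print(f"{word} : {tfidf[word]:.8f}")
```

Because of the final normalization, using counts or count ratios as the term frequency yields the same row: dividing every entry by the document length is undone by the rescaling.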

Problem Statement: Given the description of a product sold on an e-commerce website, classify it into one of 4 categories.

Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification


This data consists of two columns.

Text	Label

Indira Designer Women's Art Mysore Silk Saree With Blouse Piece (Star-Red) This Saree Is Of Art Mysore Silk & Comes With Blouse Piece.	Clothing & Accessories
IO Crest SY-PCI40010 PCI RAID Host Controller Card Brings new life to any old desktop PC. Connects up to 4 SATA II high speed SATA hard disk drives. Supports Windows 8 and Server 2012	Electronics
Operating Systems in Depth About the Author Professor Doeppner is an associate professor of computer science at Brown University. His research interests include mobile computing in education, mobile and ubiquitous computing, operating systems and distribution systems, parallel computing, and security.	Books




*Text*: Description of an item sold on an e-commerce website


*Label*: Category of that item. There are 4 categories in total: "Electronics", "Household", "Books" and "Clothing & Accessories", which cover almost 80% of any e-commerce website.

In [9]:
import pandas as pd

# read the data into a pandas dataframe

df = pd.read_csv("ecommerceDataset.csv")

In [10]:
print(df.shape)
# df.head(5)

(50425, 2)


In [11]:
df.head(3)

Unnamed: 0,Text,label
0,Paper Plane Design Framed Wall Hanging Motivat...,Household
1,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",Household
2,SAF 'UV Textured Modern Art Print Framed' Pain...,Household


In [12]:
df['label'].value_counts()


label
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

In [13]:
df['Text'].value_counts()


Text
Think & Grow Rich About the Author NAPOLEON HILL, born in Pound, Southwest Virginia in 1883, was a very successful American author in the area of the new thought movement—one of the earliest producers of the modern genre of personal-success literature. He is widely considered to be one of the great writers on success. The turning point in Hill’s life occurred in the year 1908 when he interviewed the industrialist Andrew Carnegie—one of the most powerful men in the world at that time, as part of an assignment—an interview which ultimately led to the publication of Think and Grow Rich, one of his best-selling books of all time. the book examines the power of personal beliefs and the role they play in personal success. Hill, who had even served as the advisor to President Franklin D. Roosevelt from 1933-36, passed away at the age of 87.                                                                                                                                                      

# Handle Class Imbalance 

In [14]:
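# Undersample every class to the same size so that the classes are balanced.
# 8500 is just below the size of the smallest class (Clothing & Accessories: 8671).
min_samples = 8500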

df_Household = df[df.label=="Household"].sample(min_samples, random_state=102)
df_Books = df[df.label=="Books"].sample(min_samples, random_state=102)
df_Electronics = df[df.label=="Electronics"].sample(min_samples, random_state=102)
df_Clothing_and_Accessories = df[df.label=="Clothing & Accessories"].sample(min_samples, random_state=102)

In [15]:
df_balanced = pd.concat([df_Household,df_Books,df_Electronics,df_Clothing_and_Accessories],axis=0)
df_balanced.label.value_counts()


label
Household                 8500
Books                     8500
Electronics               8500
Clothing & Accessories    8500
Name: count, dtype: int64
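The per-class sampling above can also be written as a single grouped call. This is a sketch on a toy DataFrame with hypothetical rows (only the column names match the real data), and it assumes pandas 1.1 or later, where `GroupBy.sample` is available.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical data, same column names).
toy = pd.DataFrame({
    "Text": [f"item {i}" for i in range(9)],
    "label": ["Household"] * 4 + ["Books"] * 3 + ["Electronics"] * 2,
})

# Use the size of the smallest class as the per-class sample size.
n_per_class = int(toy["label"].value_counts().min())

# Draw the same number of rows from every class in one call.
toy_balanced = toy.groupby("label").sample(n=n_per_class, random_state=102)

print(toy_balanced["label"].value_counts())
```

The rows drawn this way may differ from those picked by separate per-class `sample` calls, even with the same `random_state`, so results are not guaranteed to match the cells above exactly.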

In [16]:
# add a new column that maps each label to a unique number

df_balanced['label_num'] = df_balanced['label'].map({
    'Household' : 0, 
    'Books': 1, 
    'Electronics': 2, 
    'Clothing & Accessories': 3
})

# check the result

df_balanced.head(5)

Unnamed: 0,Text,label,label_num
828,NOVICZ Nylon Hammock Swing Hanging Rope Bed fo...,Household,0
2502,Casa Decor Ceramic Decorative Filigree Wall Ho...,Household,0
1529,Mollismoons Without Beans Luxury Fur and Leath...,Household,0
19163,Shraddha Collections 2 Compartments Steel Cash...,Household,0
9231,"Skittles Gems Fruits Pouch, 174g Fruit flavore...",Household,0


In [17]:
any_null_value = df_balanced.isna().sum()
print(any_null_value)



Text         1
label        0
label_num    0
dtype: int64


In [18]:
# drop the row with the missing Text value
df_balanced.dropna(inplace=True)

# confirm that no null values remain
any_null_value = df_balanced.isna().sum()
print(any_null_value)

Text         0
label        0
label_num    0
dtype: int64


In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.Text,
    df_balanced.label_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df_balanced.label_num
)

In [20]:

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)

Shape of X_train:  (27199,)
Shape of X_test:  (6800,)


In [21]:
y_train.value_counts()


label_num
1    6800
0    6800
2    6800
3    6799
Name: count, dtype: int64

In [22]:
y_test.value_counts()


label_num
0    1700
1    1700
2    1700
3    1700
Name: count, dtype: int64

In [23]:
# Attempt 1 :

# Using the sklearn pipeline module, create a classification pipeline to classify the e-commerce data.
# Note:

# use TF-IDF for pre-processing the text.

# use KNN as the classifier

# print the classification report.

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('KNN', KNeighborsClassifier())         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.94      1700
           1       0.97      0.95      0.96      1700
           2       0.96      0.96      0.96      1700
           3       0.98      0.98      0.98      1700

    accuracy                           0.96      6800
   macro avg       0.96      0.96      0.96      6800
weighted avg       0.96      0.96      0.96      6800



In [25]:
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),    
     ('Multi NB', MultinomialNB())         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.95      0.93      1700
           1       0.97      0.92      0.95      1700
           2       0.95      0.95      0.95      1700
           3       0.97      0.98      0.98      1700

    accuracy                           0.95      6800
   macro avg       0.95      0.95      0.95      6800
weighted avg       0.95      0.95      0.95      6800



In [26]:
from sklearn.ensemble import RandomForestClassifier

#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf',TfidfVectorizer()),        # default settings; ngram_range is not set here
     ('Random Forest', RandomForestClassifier())         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1700
           1       0.97      0.96      0.97      1700
           2       0.98      0.94      0.96      1700
           3       0.98      0.99      0.98      1700

    accuracy                           0.96      6800
   macro avg       0.96      0.96      0.96      6800
weighted avg       0.96      0.96      0.96      6800



# Use text pre-processing to remove stop words and punctuation and apply lemmatization


In [27]:
## utility function for pre-processing the text

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return " ".join(filtered_tokens)

In [28]:
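# Running this on the raw `df` raises
#   ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'float'>
# most likely because `df` still contains a missing Text value (NaN is a float).
# `df_balanced` had its null row dropped in cell 18, so apply the function there.
df_balanced['preprocessed_text'] = df_balanced['Text'].apply(preprocess)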