# 1. Creating Text Features - TF-IDF


TF-IDF (Term Frequency-Inverse Document Frequency) is another widely used technique in Natural Language Processing (NLP) for creating numerical features from text data. It builds on the Bag of Words (BOW) concept but additionally weights each word by how rare it is across the entire corpus of documents.

The intuition behind TF-IDF is:

Term Frequency (TF): A word that appears frequently in a document is likely to be important to the content of that document.

Inverse Document Frequency (IDF): A word that appears frequently across all documents in the corpus is likely to be a common word (like "the", "a", "is") and thus less informative for distinguishing between documents. Words that appear rarely across the corpus are considered more unique and potentially more important for identifying the topic of a document.

How TF-IDF is Calculated:

The TF-IDF weight of a term (word) in a document is calculated as the product of its Term Frequency (TF) and its Inverse Document Frequency (IDF):

TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

Where:

t is the term (word).

d is the document.

D is the collection (corpus) of documents.

Calculating Term Frequency (TF):

There are several ways to calculate TF. A common method is the raw count of the term in the document:

TF(t, d) = Number of times term t appears in document d

Other variations include:

Boolean Frequency: 1 if the term is present, 0 otherwise.

Term Frequency adjusted for document length: (Number of times term t appears in document d) / (Total number of terms in document d)

Logarithmically scaled frequency: 1 + log(Number of times term t appears in document d) (if count > 0, otherwise 0). This helps to dampen the effect of very frequent words within a document.
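
The four TF variants above can be compared side by side. The following is a minimal sketch in plain Python; the token list and the helper names (`tf_raw`, `tf_boolean`, `tf_length_normalized`, `tf_log`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenized document, used only for illustration.
tokens = "good quality product good fit".split()
counts = Counter(tokens)

def tf_raw(term):
    # Raw count of the term in the document.
    return counts[term]

def tf_boolean(term):
    # 1 if the term is present, 0 otherwise.
    return 1 if counts[term] > 0 else 0

def tf_length_normalized(term):
    # Raw count divided by the total number of terms in the document.
    return counts[term] / len(tokens)

def tf_log(term):
    # 1 + log(count) when the term occurs, otherwise 0.
    count = counts[term]
    return 1 + math.log(count) if count > 0 else 0.0

for term in ("good", "fit", "missing"):
    print(term, tf_raw(term), tf_boolean(term),
          tf_length_normalized(term), round(tf_log(term), 3))
```

Note how the log-scaled variant grows much more slowly than the raw count as a term repeats.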

Calculating Inverse Document Frequency (IDF):

The IDF measures how rare a term is across the entire corpus. It is typically calculated as:

IDF(t, D) = log(Total number of documents in D / Number of documents in D that contain term t)

The logarithm compresses the ratio, so that very rare terms do not receive disproportionately large weights.

If a term appears in all documents, the denominator equals the numerator and IDF becomes log(1) = 0, giving the term zero weight under this formula.

If a term appears in very few documents, the denominator is small, and IDF becomes a larger positive value.
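
To make the two cases above concrete, here is a small sketch of the formula as written, using the natural logarithm. The function name `idf` and the corpus size of 10 are assumptions for illustration.

```python
import math

def idf(n_docs, doc_freq):
    # Textbook IDF: log(total documents / documents containing the term).
    return math.log(n_docs / doc_freq)

n_docs = 10  # assumed corpus size
for doc_freq in (10, 5, 1):
    print(f"term in {doc_freq} of {n_docs} documents -> IDF = {idf(n_docs, doc_freq):.4f}")
```

A term present in every document gets an IDF of exactly 0, and the value rises as the term becomes rarer.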

Creating Feature Vectors with TF-IDF:

Similar to BOW, after calculating the TF-IDF weight for each term in each document, you can represent each document as a numerical vector. The vocabulary of the corpus forms the dimensions of the vector, and the value of each element in the vector is the TF-IDF weight of the corresponding term in that document.

Steps in TF-IDF Feature Creation:

1. Tokenization: Break down each document into terms (words).
2. Create Vocabulary: Identify all unique terms across the corpus.
3. Calculate TF: For each term in each document, calculate its Term Frequency.
4. Calculate IDF: For each term in the corpus, calculate its Inverse Document Frequency.
5. Calculate TF-IDF: For each term in each document, multiply its TF by its IDF.
6. Create Feature Vectors: Represent each document as a vector of TF-IDF scores for each term in the vocabulary.
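
The six steps can be sketched from scratch in a few lines of plain Python. This is an illustrative sketch, not how any particular library implements it: it assumes a hypothetical three-sentence corpus, whitespace tokenization, raw-count TF, and the unsmoothed IDF formula given earlier.

```python
import math
from collections import Counter

# Hypothetical toy corpus, chosen only to illustrate the six steps.
corpus = [
    "the shirt fits well",
    "the fabric is thin",
    "the fit is great",
]

# 1. Tokenization: naive lowercase whitespace split.
tokenized = [doc.lower().split() for doc in corpus]

# 2. Vocabulary: sorted unique terms; list position is the vector index.
vocab = sorted({term for doc in tokenized for term in doc})

# 3. TF: raw counts per document.
tf = [Counter(doc) for doc in tokenized]

# 4. IDF: log(N / number of documents containing the term).
n_docs = len(tokenized)
idf = {
    term: math.log(n_docs / sum(term in doc for doc in tokenized))
    for term in vocab
}

# 5 and 6. TF-IDF weights arranged as one vector per document.
vectors = [[tf_doc[term] * idf[term] for term in vocab] for tf_doc in tf]

print(vocab)
for doc, vec in zip(corpus, vectors):
    print(doc, [round(weight, 3) for weight in vec])
```

Because "the" occurs in all three documents, its weight is 0 in every vector, while words unique to one document receive the largest weights.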

Advantages of TF-IDF over BOW:

1. Weights Term Importance: TF-IDF takes into account not only how often a word appears in a document but also how common it is across the entire corpus. This helps to give more weight to words that are more specific to a particular document.
2. Better Discrimination: By downweighting common words, TF-IDF can often lead to better discrimination between documents compared to raw word counts in BOW.

Limitations of TF-IDF:

1. Still Ignores Word Order and Semantics: Like BOW, TF-IDF doesn't consider the order of words or the underlying meaning of the text.
2. Can Suffer from Data Sparsity: With a large vocabulary, the resulting vectors can still be high-dimensional and sparse.

In summary, TF-IDF is a text vectorization technique that assigns a weight to each word in a document based on its frequency in that document and its inverse frequency across the entire set of documents. It's a valuable method for highlighting words that are important to a specific document within a larger collection.

# Import necessary dependencies

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Create sample dataset

In [25]:
# Sample DataFrame of customer reviews for Men's Sports Apparel

reviews_data = pd.DataFrame({
    'ReviewText': [
        "Good quality product, comfortable for running.",
        "The material is not as expected, a bit thin.",
        "Excellent fit and very breathable for Kolkata weather.",
        "Average product, stitching could be better.",
        "Loved the design and the fabric is great for workouts.",
        "Not worth the price, expected better quality.",
        "Very happy with the purchase, fits perfectly.",
        "Disappointed with the color, it's different from the picture.",
        "Amazing for gym sessions, highly recommended.",
        "Okay product, nothing special."
    ]
})

print("Original Reviews Data:")
reviews_data

Original Reviews Data:


,ReviewText
0,"Good quality product, comfortable for running."
1,"The material is not as expected, a bit thin."
2,Excellent fit and very breathable for Kolkata ...
3,"Average product, stitching could be better."
4,Loved the design and the fabric is great for w...
5,"Not worth the price, expected better quality."
6,"Very happy with the purchase, fits perfectly."
7,"Disappointed with the color, it's different fr..."
8,"Amazing for gym sessions, highly recommended."
9,"Okay product, nothing special."


# TF-IDF (Term Frequency-Inverse Document Frequency) implementation

In [26]:
# 1. Initialize the TfidfVectorizer
# This will handle tokenization, vocabulary creation, IDF calculation, and TF-IDF weighting.
# Lowercasing is enabled by default; other options (e.g., stop word removal via stop_words) can be set here.

tfidf_vectorizer = TfidfVectorizer()

In [27]:
# 2. Fit the vectorizer to the review text
# This step learns the vocabulary and calculates the IDF weights from all the reviews.

tfidf_vectorizer.fit(reviews_data['ReviewText'])

In [28]:
# 3. Transform the review text into a TF-IDF matrix
# This step converts each review into a vector of TF-IDF scores based on the learned vocabulary and IDF.

tfidf_matrix = tfidf_vectorizer.transform(reviews_data['ReviewText'])

The result is a sparse matrix where rows represent reviews and columns represent words in the vocabulary. Each cell contains the TF-IDF weight of the word in that review.
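
The separate `fit` and `transform` calls can also be combined into a single `fit_transform` call. The sketch below uses a hypothetical three-document corpus rather than the review data, and checks that both routes give the same matrix, that the result is sparse, and that each row has unit length under the vectorizer's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration.
docs = ["Good quality product", "Average product", "Excellent fit"]

# Route 1: fit and transform in one call.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Route 2: fit first, then transform, as in the cells above.
separate = TfidfVectorizer().fit(docs).transform(docs)

dense = matrix.toarray()
print("Shape (documents, vocabulary size):", matrix.shape)
print("Stored non-zero entries:", matrix.nnz, "of", dense.size)
print("Same result either way:", np.allclose(dense, separate.toarray()))
print("Row lengths (L2 norms):", np.linalg.norm(dense, axis=1))
```

`fit_transform` is convenient for training data; keeping `transform` separate matters when the same fitted vocabulary and IDF weights must later be applied to new text.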

In [29]:
# 4. Convert the TF-IDF matrix to a DataFrame for better readability (optional)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Concatenate the TF-IDF features with the original reviews data
reviews_tfidf_df = pd.concat([reviews_data, tfidf_df], axis=1)

print("\nReviews Data with TF-IDF Features:")
reviews_tfidf_df


Reviews Data with TF-IDF Features:


,ReviewText,amazing,and,as,average,be,better,bit,breathable,color,...,sessions,special,stitching,the,thin,very,weather,with,workouts,worth
0,"Good quality product, comfortable for running.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"The material is not as expected, a bit thin.",0.0,0.0,0.391613,0.0,0.0,0.0,0.391613,0.0,0.0,...,0.0,0.0,0.0,0.232548,0.391613,0.0,0.0,0.0,0.0,0.0
2,Excellent fit and very breathable for Kolkata ...,0.0,0.324035,0.0,0.0,0.0,0.0,0.0,0.381176,0.0,...,0.0,0.0,0.0,0.0,0.0,0.324035,0.381176,0.0,0.0,0.0
3,"Average product, stitching could be better.",0.0,0.0,0.0,0.435368,0.435368,0.370102,0.0,0.0,0.0,...,0.0,0.0,0.435368,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Loved the design and the fabric is great for w...,0.0,0.295195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.41241,0.0,0.0,0.0,0.0,0.347251,0.0
5,"Not worth the price, expected better quality.",0.0,0.0,0.0,0.0,0.0,0.371249,0.0,0.0,0.0,...,0.0,0.0,0.0,0.259332,0.0,0.0,0.0,0.0,0.0,0.436717
6,"Very happy with the purchase, fits perfectly.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.246615,0.0,0.353044,0.0,0.353044,0.0,0.0
7,"Disappointed with the color, it's different fr...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.350647,...,0.0,0.0,0.0,0.416444,0.0,0.0,0.0,0.298082,0.0,0.0
8,"Amazing for gym sessions, highly recommended.",0.428856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.428856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Okay product, nothing special.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.530511,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


reviews_tfidf_df shows the original reviews alongside the TF-IDF score of every vocabulary word in each review. These values differ from the raw word counts in the BOW example: words that occur in many reviews (such as "the") receive lower scores than words specific to a single review, and because TfidfVectorizer by default scales each row to unit L2 norm, all scores lie between 0 and 1.
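
As a sanity check, a few entries of the table can be reproduced by hand. The sketch below is a plain-Python approximation of the vectorizer's default behaviour, under these assumptions: lowercase tokens of two or more word characters, raw-count TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are hypothetical.

```python
import math
import re

reviews = [
    "Good quality product, comfortable for running.",
    "The material is not as expected, a bit thin.",
    "Excellent fit and very breathable for Kolkata weather.",
    "Average product, stitching could be better.",
    "Loved the design and the fabric is great for workouts.",
    "Not worth the price, expected better quality.",
    "Very happy with the purchase, fits perfectly.",
    "Disappointed with the color, it's different from the picture.",
    "Amazing for gym sessions, highly recommended.",
    "Okay product, nothing special.",
]

def tokenize(text):
    # Approximation of the default tokenizer: lowercase, 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(review) for review in reviews]
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    # Raw count times IDF, then scale the row to unit L2 norm.
    raw = {term: doc.count(term) * smooth_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row1 = tfidf_row(docs[1])
row9 = tfidf_row(docs[9])
print("review 1, 'the':    ", round(row1["the"], 6))
print("review 1, 'as':     ", round(row1["as"], 6))
print("review 9, 'special':", round(row9["special"], 6))
```

The printed values agree with the corresponding cells in the table (0.232548, 0.391613 and 0.530511), which shows where the numbers come from: a rare word such as "special" dominates a short review, while "the" is pulled down by appearing in half the corpus.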

In [30]:
# inspect the vocabulary and IDF weights learned by the vectorizer

vocabulary = tfidf_vectorizer.vocabulary_
print("\nVocabulary:")
print(vocabulary)


Vocabulary:
{'good': 21, 'quality': 39, 'product': 37, 'comfortable': 9, 'for': 19, 'running': 41, 'the': 45, 'material': 30, 'is': 26, 'not': 31, 'as': 2, 'expected': 15, 'bit': 6, 'thin': 46, 'excellent': 14, 'fit': 17, 'and': 1, 'very': 47, 'breathable': 7, 'kolkata': 28, 'weather': 48, 'average': 3, 'stitching': 44, 'could': 10, 'be': 4, 'better': 5, 'loved': 29, 'design': 11, 'fabric': 16, 'great': 22, 'workouts': 50, 'worth': 51, 'price': 36, 'happy': 24, 'with': 49, 'purchase': 38, 'fits': 18, 'perfectly': 34, 'disappointed': 13, 'color': 8, 'it': 27, 'different': 12, 'from': 20, 'picture': 35, 'amazing': 0, 'gym': 23, 'sessions': 42, 'highly': 25, 'recommended': 40, 'okay': 33, 'nothing': 32, 'special': 43}


The vocabulary shows the mapping of each word to its column index in the TF-IDF matrix.
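
Since the mapping runs from term to column index, it is often handy to invert it. A minimal sketch, using a hand-copied excerpt of the dictionary printed above rather than the full `vocabulary_` attribute:

```python
# Excerpt of the mapping printed above (term -> column index).
vocabulary_excerpt = {"good": 21, "the": 45, "amazing": 0, "and": 1, "worth": 51}

# Invert the mapping to look up which term a column belongs to.
index_to_term = {index: term for term, index in vocabulary_excerpt.items()}
print(index_to_term[45])

# Sorting terms by index recovers the column order of the matrix.
terms_in_column_order = sorted(vocabulary_excerpt, key=vocabulary_excerpt.get)
print(terms_in_column_order)
```

The column order is alphabetical, which is why "amazing" sits at index 0 and "worth" at index 51.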

This TF-IDF representation can then be used as input features for various machine learning tasks such as sentiment analysis, text classification, or topic modeling on the customer reviews. You can further refine the process by using techniques like stop word removal, stemming/lemmatization, and adjusting the parameters of TfidfVectorizer.

In [32]:
idf_values = dict(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))
print("\nIDF Values:")
print(idf_values)


IDF Values:
{'amazing': np.float64(2.7047480922384253), 'and': np.float64(2.2992829841302607), 'as': np.float64(2.7047480922384253), 'average': np.float64(2.7047480922384253), 'be': np.float64(2.7047480922384253), 'better': np.float64(2.2992829841302607), 'bit': np.float64(2.7047480922384253), 'breathable': np.float64(2.7047480922384253), 'color': np.float64(2.7047480922384253), 'comfortable': np.float64(2.7047480922384253), 'could': np.float64(2.7047480922384253), 'design': np.float64(2.7047480922384253), 'different': np.float64(2.7047480922384253), 'disappointed': np.float64(2.7047480922384253), 'excellent': np.float64(2.7047480922384253), 'expected': np.float64(2.2992829841302607), 'fabric': np.float64(2.7047480922384253), 'fit': np.float64(2.7047480922384253), 'fits': np.float64(2.7047480922384253), 'for': np.float64(1.7884573603642702), 'from': np.float64(2.7047480922384253), 'good': np.float64(2.7047480922384253), 'great': np.float64(2.7047480922384253), 'gym': np.float64(2.7047

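The idf_values dictionary shows the IDF weight assigned to each word in the vocabulary. Words that appear in more documents have lower IDF values, and words that appear in fewer documents have higher ones. With scikit-learn's default smoothing the minimum IDF is 1 rather than 0, which is why even "for", found in 4 of the 10 reviews, has an IDF of about 1.79.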
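
The printed IDF values do not match the plain `log(N / df)` formula from the theory section. They are consistent with the smoothed variant `ln((1 + N) / (1 + df)) + 1`, which scikit-learn uses when `smooth_idf=True` (the default). The sketch below compares the two; the function names are hypothetical.

```python
import math

def textbook_idf(n_docs, doc_freq):
    # Unsmoothed IDF from the theory section.
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF: add 1 to numerator and denominator, then add 1 to the log.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 10  # number of reviews in the sample dataset
for doc_freq in (1, 2, 4, 10):
    print(doc_freq,
          round(textbook_idf(n_docs, doc_freq), 4),
          round(smoothed_idf(n_docs, doc_freq), 4))
```

For document frequencies of 1, 2 and 4, the smoothed column gives the values seen in the output above for "amazing", "and" and "for" respectively. A term present in all 10 reviews would get 0 under the textbook formula but 1 under the smoothed one, so it is downweighted rather than discarded.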

TF-IDF representation is often a more effective way to represent text data for machine learning models compared to simple BOW, as it takes into account the importance of words in the context of the entire document collection.