# 1. Creating Text Features - BOW

Bag of Words (BOW) in Creating Text Features

Bag of Words (BOW) is a fundamental and widely used technique in natural language processing (NLP) for converting text data into a numerical format that machine learning models can understand. It's a way of representing text data by considering the frequency of words within a document (like a sentence, paragraph, or an entire text).

Here's how the Bag of Words model works:

Tokenization: The first step is to break down each document into individual words, called tokens. This usually involves splitting the text by spaces and punctuation marks. For example, the sentence "The quick brown fox jumps over the lazy fox." would be tokenized into: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "fox"].

Creating a Vocabulary: Next, a vocabulary of all unique words across the entire corpus (collection of documents) is created. Taking the raw tokens from our example as they are, the vocabulary would be: {"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"}. Here "The" and "the" are separate entries because they differ in case. In practice, text is usually lowercased and punctuation removed before the vocabulary is built, so the two collapse into a single entry, "the".

Counting Word Frequencies: For each document in the corpus, the BOW model counts the number of times each word from the vocabulary appears in that document.

Creating Feature Vectors: Finally, each document is represented by a numerical vector. The length of this vector is equal to the size of the vocabulary. Each element in the vector corresponds to a word in the vocabulary, and the value of the element is the frequency (count) of that word in the document.
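
The four steps can be sketched in a few lines of plain Python. This is only an illustrative sketch using the standard library: the regular expression used for tokenization and the choice to sort the vocabulary alphabetically are assumptions made here for simplicity, not part of any fixed definition of BOW.

```python
import re
from collections import Counter

sentence = "The quick brown fox jumps over the lazy fox."

# Step 1: tokenize. Lowercase first, then keep runs of letters so the
# trailing period is dropped (a deliberately simple rule for this sketch).
tokens = re.findall(r"[a-z]+", sentence.lower())

# Step 2: build the vocabulary. Sorting gives every word a stable position.
vocabulary = sorted(set(tokens))

# Step 3: count how often each token occurs in the document.
counts = Counter(tokens)

# Step 4: lay the counts out in vocabulary order to get the feature vector.
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['brown', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
print(vector)      # [1, 2, 1, 1, 1, 1, 2]
```

After lowercasing, "fox" and "the" each occur twice, which is why their positions in the vector hold the value 2.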

Example:

Consider two sentences:

Sentence 1: "The quick brown fox."

Sentence 2: "The lazy brown dog."

Steps:

Tokenization (and Lowercasing, Punctuation Removal):

Sentence 1: ["the", "quick", "brown", "fox"]

Sentence 2: ["the", "lazy", "brown", "dog"]

Creating a Vocabulary:

{"the", "quick", "brown", "fox", "lazy", "dog"}

Counting Word Frequencies:

Sentence 1: {"the": 1, "quick": 1, "brown": 1, "fox": 1, "lazy": 0, "dog": 0}

Sentence 2: {"the": 1, "quick": 0, "brown": 1, "fox": 0, "lazy": 1, "dog": 1}

Creating Feature Vectors:

Sentence 1: [1, 1, 1, 1, 0, 0] (order based on vocabulary)

Sentence 2: [1, 0, 1, 0, 1, 1]

Each sentence is now represented by a numerical vector that a machine learning model can process.
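
The worked example can be reproduced with a small from-scratch implementation. The helper names below (tokenize, build_vocabulary, to_vector) are hypothetical and exist only for this sketch. It assumes the vocabulary is ordered by first appearance, which matches the column order used in the example above.

```python
def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (word.strip(".,!?;:\"'") for word in text.lower().split())
    return [word for word in stripped if word]


def build_vocabulary(documents):
    """Assign each unique token a column index in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in tokenize(document):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_vector(document, vocabulary):
    """Count vocabulary words in one document; unknown words are skipped."""
    vector = [0] * len(vocabulary)
    for token in tokenize(document):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


sentences = ["The quick brown fox.", "The lazy brown dog."]
vocabulary = build_vocabulary(sentences)
vectors = [to_vector(sentence, vocabulary) for sentence in sentences]

print(list(vocabulary))  # ['the', 'quick', 'brown', 'fox', 'lazy', 'dog']
for vector in vectors:
    print(vector)        # [1, 1, 1, 1, 0, 0] then [1, 0, 1, 0, 1, 1]
```

Because both sentences are encoded against the same shared vocabulary, their vectors have the same length and can be compared position by position.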

Key Characteristics of BOW:

1. Order of Words is Ignored: The "bag" in "Bag of Words" implies that the model treats all words in a document as a collection, disregarding their order and grammatical structure. Only the presence and frequency of words matter.
2. Simple and Efficient: BOW is a relatively simple and computationally efficient way to represent text.
3. Forms the Basis for More Advanced Techniques: While basic, BOW is a foundational concept that underlies more sophisticated text representation methods.

Limitations of BOW:

1. Ignores Semantics: BOW doesn't capture the meaning or context of words. For example, "good" and "amazing" are treated as distinct words, even though they have similar positive connotations.
2. Doesn't Account for Word Order: The loss of word order can be a significant drawback, as the meaning of a sentence can change drastically based on the arrangement of words.
3. High Dimensionality: With a large vocabulary, the resulting feature vectors can be very high-dimensional and sparse (containing many zeros), which can pose challenges for some machine learning algorithms.
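
The first two limitations are easy to see in a short sketch. The helper bag below is a hypothetical name introduced for illustration, and the example phrases are made up rather than taken from any dataset.

```python
from collections import Counter


def bag(text):
    """Return the word counts of a text, ignoring case and word order."""
    return Counter(text.lower().split())


# Same words, opposite meaning: the two bags are identical.
order_lost = bag("dog bites man") == bag("man bites dog")
print(order_lost)  # True

# Similar meaning, different words: the bags overlap only on "product".
shared = bag("good product") & bag("amazing product")
print(dict(shared))  # {'product': 1}
```

A model that sees only these counts cannot tell the first pair apart, and it has no way of knowing that "good" and "amazing" are related.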

In summary, Bag of Words is a technique for converting text into numerical vectors by counting the frequency of each word in the document, ignoring grammar and word order. It's a simple but important first step in many NLP tasks.

# Import necessary dependencies

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create sample dataset

In [17]:
# Sample DataFrame of customer reviews for Men's Sports Apparel (considering Kolkata context)
reviews_data = pd.DataFrame({
    'ReviewText': [
        "Good quality product, comfortable for running.",
        "The material is not as expected, a bit thin.",
        "Excellent fit and very breathable for Kolkata weather.",
        "Average product, stitching could be better.",
        "Loved the design and the fabric is great for workouts.",
        "Not worth the price, expected better quality.",
        "Very happy with the purchase, fits perfectly.",
        "Disappointed with the color, it's different from the picture.",
        "Amazing for gym sessions, highly recommended.",
        "Okay product, nothing special."
    ]
})

print("Original Reviews Data:")
reviews_data

Original Reviews Data:


Unnamed: 0,ReviewText
0,"Good quality product, comfortable for running."
1,"The material is not as expected, a bit thin."
2,Excellent fit and very breathable for Kolkata ...
3,"Average product, stitching could be better."
4,Loved the design and the fabric is great for w...
5,"Not worth the price, expected better quality."
6,"Very happy with the purchase, fits perfectly."
7,"Disappointed with the color, it's different fr..."
8,"Amazing for gym sessions, highly recommended."
9,"Okay product, nothing special."


# Bag of Words (BOW) implementation

In [18]:
# 1. Initialize the CountVectorizer
# This will handle tokenization, vocabulary creation, and counting word frequencies.
# We can customize it further (e.g., stop word removal, lowercasing).

vectorizer = CountVectorizer()

In [19]:
# 2. Fit the vectorizer to the review text
# This step learns the vocabulary from all the reviews.

vectorizer.fit(reviews_data['ReviewText'])

In [20]:
# 3. Transform the review text into a BOW matrix
# This step converts each review into a vector of word counts based on the learned vocabulary.

bow_matrix = vectorizer.transform(reviews_data['ReviewText'])

The result is a sparse matrix in which rows represent reviews and columns represent words in the vocabulary. Each cell contains the number of times that word occurs in that review.
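
As a side note, the separate fit and transform calls can be combined with fit_transform, and the sparse result can be inspected without converting it to a dense array. The sketch below uses a three-review subset (sample_reviews, a name introduced here) so that it runs on its own. The variable names are illustrative and are not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample_reviews = [
    "Good quality product, comfortable for running.",
    "Not worth the price, expected better quality.",
    "Okay product, nothing special.",
]

# Route 1: learn the vocabulary, then encode, as two separate calls.
two_step = CountVectorizer()
two_step.fit(sample_reviews)
matrix_two_step = two_step.transform(sample_reviews)

# Route 2: do both in a single call.
matrix_one_step = CountVectorizer().fit_transform(sample_reviews)

# Both routes should yield the same counts.
same = (matrix_two_step.toarray() == matrix_one_step.toarray()).all()

# Shape is (number of reviews, vocabulary size); nnz is the number of
# explicitly stored non-zero cells in the sparse matrix.
n_rows, n_cols = matrix_one_step.shape
density = matrix_one_step.nnz / (n_rows * n_cols)
print(same, matrix_one_step.shape, round(density, 2))
```

Keeping fit and transform separate still matters when new, unseen text must be encoded with a vocabulary learned earlier, for example when applying a trained model to a test set.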

In [21]:
# 4. Convert the BOW matrix to a DataFrame for better readability

bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Concatenate the BOW features with the original reviews data
reviews_bow_df = pd.concat([reviews_data, bow_df], axis=1)

print("\nReviews Data with BOW Features:")
reviews_bow_df


Reviews Data with BOW Features:


Unnamed: 0,ReviewText,amazing,and,as,average,be,better,bit,breathable,color,...,sessions,special,stitching,the,thin,very,weather,with,workouts,worth
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,1,0,0,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,1,0,0,0
3,"Average product, stitching could be better.",0,0,0,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,1,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,1,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,0,0,1,...,0,0,0,2,0,0,0,1,0,0
8,"Amazing for gym sessions, highly recommended.",1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


The reviews_bow_df DataFrame shows the original review text alongside one new column for each unique word found across all the reviews. The value in each of these columns is the number of times that word occurs in the corresponding review.

In [23]:
# inspect the vocabulary learned by the vectorizer

vocabulary = vectorizer.vocabulary_
print("\nVocabulary:")
print(vocabulary)


Vocabulary:
{'good': 21, 'quality': 39, 'product': 37, 'comfortable': 9, 'for': 19, 'running': 41, 'the': 45, 'material': 30, 'is': 26, 'not': 31, 'as': 2, 'expected': 15, 'bit': 6, 'thin': 46, 'excellent': 14, 'fit': 17, 'and': 1, 'very': 47, 'breathable': 7, 'kolkata': 28, 'weather': 48, 'average': 3, 'stitching': 44, 'could': 10, 'be': 4, 'better': 5, 'loved': 29, 'design': 11, 'fabric': 16, 'great': 22, 'workouts': 50, 'worth': 51, 'price': 36, 'happy': 24, 'with': 49, 'purchase': 38, 'fits': 18, 'perfectly': 34, 'disappointed': 13, 'color': 8, 'it': 27, 'different': 12, 'from': 20, 'picture': 35, 'amazing': 0, 'gym': 23, 'sessions': 42, 'highly': 25, 'recommended': 40, 'okay': 33, 'nothing': 32, 'special': 43}


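The vocabulary maps each word to its column index in the BOW matrix. The indices follow the alphabetical order of the words, which is why 'amazing' has index 0 and 'worth' has index 51. Note that the "a" in the second review and the "s" in "it's" are missing: by default, CountVectorizer lowercases the text and keeps only tokens of two or more alphanumeric characters.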

This BOW representation can then be used as input features for various machine learning tasks such as sentiment analysis, text classification, or topic modeling on the customer reviews. You can further refine the BOW process by using techniques like stop word removal, stemming/lemmatization, and adjusting the parameters of CountVectorizer.