In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Creating Bag of Words

The **Bag of Words (BoW)** technique is a fundamental method in natural language processing for converting text into numerical features. In BoW, each document is represented as a vector that counts the occurrence of each word in the vocabulary, ignoring grammar and word order but keeping multiplicity.

**Key points:**
- Each unique word in the entire text corpus becomes a feature (column) in the vector.
- The value for each feature is the number of times the word appears in the document.
- BoW is simple, interpretable, and works well for many text classification tasks, but it does not capture context or semantics.
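
To make the counting concrete, here is a minimal standard-library sketch of the same idea. The helper names `tokenize` and `bag_of_words` are hypothetical, chosen for illustration, and the regular expression is an assumption intended to roughly mirror the default tokenization of scikit-learn's `CountVectorizer` (lowercasing, tokens of two or more word characters). It is not the library's implementation.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of two or more
    # word characters (an assumption meant to approximate CountVectorizer).
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents):
    # Hypothetical helper: build a sorted vocabulary over the whole corpus,
    # then count how often each vocabulary word occurs in each document.
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["The cat sat on the mat.", "The dog barked at the cat."]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each inner list has one entry per vocabulary word, so word order within a sentence is lost while repeated words (such as "the") are still counted.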

BoW counts are often the starting point for weighting schemes such as TF-IDF, and a common baseline before moving on to word embeddings.

In [2]:
# sample text data: five short sentences
# used to illustrate the BoW model
text_data = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
    "The bird flew over the house.",
    "The fish swam in the pond.",
    "The rabbit hopped through the garden."
]


In [3]:
countvec = CountVectorizer()
# learn the vocabulary and build the document-term count matrix in one step
bow_model = countvec.fit_transform(text_data)
# convert the sparse matrix to a dense NumPy array (fine for a small example)
bow_model_dense = bow_model.toarray()
# wrap the array in a pandas DataFrame, one column per vocabulary word
bow_df = pd.DataFrame(bow_model_dense, columns=countvec.get_feature_names_out())
# display the DataFrame
bow_df

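| | at | barked | bird | cat | dog | fish | flew | garden | hopped | house | in | mat | on | over | pond | rabbit | sat | swam | the | through |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 |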
