The Bag of Words (BoW) model is a popular text representation technique in natural language processing (NLP) and machine learning. It converts text into numerical data by recording which words occur in each document and how often, while ignoring grammar and word order.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
texts = ["today i am feeling great.i was so sick for past days. today is a sunday. now sickness is gone."]
# Each item in this list is one document. Here the list holds a single document made of four sentences, and the frequency of each unique word in it will be counted.

BoW focuses solely on word frequency, not meaning derived from syntax or sentence structure.
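To see what ignoring word order means in practice, here is a minimal sketch with two made-up sentences (not part of the example above) that contain the same words in a different order. The names `order_docs`, `order_vectorizer` and `order_matrix` are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical sentences: same words, different order, different meaning.
order_docs = ["the dog chased the cat", "the cat chased the dog"]

order_vectorizer = CountVectorizer()
order_matrix = order_vectorizer.fit_transform(order_docs).toarray()

# Only the counts are stored, so both rows come out identical.
same_vectors = bool((order_matrix[0] == order_matrix[1]).all())

print(order_matrix)
print("Identical vectors:", same_vectors)
```

Both sentences map to the same vector, which is exactly the information BoW discards.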

In [4]:
vectorizer = CountVectorizer()    # creates the vectorizer with default settings; no words are counted until fit_transform() is called

CountVectorizer handles the following steps by default:

- Tokenization (splitting text into individual words),
- Lowercasing (making all words lowercase),
- Filtering (treating punctuation as a separator and dropping single-character tokens such as "i" and "a").
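These preprocessing steps can be inspected on their own. The sketch below uses `build_analyzer()`, which returns the function CountVectorizer applies to each document, on a made-up sample sentence. The variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenization function
# used internally; it works without fitting the vectorizer first.
analyzer = CountVectorizer().build_analyzer()

# Hypothetical sample sentence with mixed case and punctuation.
tokens = analyzer("Today I am feeling GREAT.I was sick!")

print(tokens)
# ['today', 'am', 'feeling', 'great', 'was', 'sick']
```

The text is lowercased, the punctuation disappears, and the single-character word "I" is dropped by the default token pattern.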

In [5]:
bow_matrix = vectorizer.fit_transform(texts)

fit_transform() does two things:

- Fit: Learns the vocabulary of all unique words across the texts.
- Transform: Converts each document into a numerical vector representing the frequency of each word.
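The two steps can also be run separately, which makes it easier to see what each one does. This is a small sketch on two made-up documents; `docs`, `two_step` and the unseen sentence are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["today is a sunday", "today i am feeling great"]

# Step 1 (fit): learn the vocabulary, a mapping from word to column index.
two_step = CountVectorizer()
two_step.fit(docs)
print(two_step.vocabulary_)

# Step 2 (transform): turn documents into count vectors using that vocabulary.
two_step_matrix = two_step.transform(docs).toarray()

# fit_transform() gives the same result in a single call.
one_step_matrix = CountVectorizer().fit_transform(docs).toarray()
same_result = bool((two_step_matrix == one_step_matrix).all())
print("Same result:", same_result)

# Words that were not seen during fit (here "monday") are ignored by transform.
unseen_vector = two_step.transform(["today is monday"]).toarray()
print(unseen_vector)
```

Keeping the steps apart matters when new text arrives later: it is transformed with the vocabulary learned earlier rather than refitted.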

In [6]:
print("The BoW Matrix is:", bow_matrix.toarray())

The BoW Matrix is: [[1 1 1 1 1 1 2 1 1 1 1 1 1 2 1]]


- The code above builds a BoW matrix with one row per document and one column per word in a fixed vocabulary; each entry is the number of times that word occurs in that document.
- BoW is commonly used for text classification and sentiment analysis, but it doesn’t consider the order of words and relies purely on how often each word occurs.