The **Bag of Words (BoW)** model is a natural language processing (NLP) technique that represents each text document as a numerical vector of word counts. It is a simple, widely used method for extracting features from text. Here's a step-by-step explanation of how it works:

---

### **Algorithm Steps**

1. **Corpus Creation**:
   - Collect all the text documents (the corpus) that you want to process.

2. **Text Preprocessing**:
   - Standardize the text data by:
     - Converting all text to lowercase to maintain uniformity.
     - Removing punctuation, special characters, and numbers (if not needed).
     - Tokenizing the text (splitting it into individual words).
     - Removing stopwords (e.g., "and", "the", "is") that carry little meaning on their own.
     - (Optional) Applying stemming or lemmatization to reduce words to their root forms.

3. **Vocabulary Construction**:
   - Collect the unique words that remain after preprocessing into a vocabulary. Each word becomes one feature and is assigned a fixed position (index) in the vocabulary.

4. **Vectorization**:
   - For each document in the corpus, create a vector of length equal to the size of the vocabulary.
   - Fill each position with the number of times the corresponding vocabulary word occurs in the document (or with 1/0 if only presence matters). A word's position in the vector matches its position in the vocabulary.

5. **Matrix Representation**:
   - Combine all vectors into a matrix where:
     - Each row represents a document.
     - Each column represents a word in the vocabulary.
     - Each cell contains the count (or frequency) of the word in the corresponding document.

---

### **Example**

#### Corpus:
- Document 1: "The cat sat on the mat."
- Document 2: "The dog barked at the cat."

#### 1. **Preprocessing**:
- Document 1: ["cat", "sat", "mat"]
- Document 2: ["dog", "barked", "cat"]
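The preprocessing step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production tokenizer: the `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for this example, and stemming or lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list chosen for this example only
STOPWORDS = {"the", "on", "at", "and", "is"}

def preprocess(text):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation and digits
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

corpus = ["The cat sat on the mat.", "The dog barked at the cat."]
tokenized = [preprocess(doc) for doc in corpus]
print(tokenized)  # [['cat', 'sat', 'mat'], ['dog', 'barked', 'cat']]
```

A real project would more likely use a curated stopword list from a library, but the order of operations stays the same.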

#### 2. **Vocabulary**:
- Vocabulary (in order of first appearance): ["cat", "sat", "mat", "dog", "barked"]

#### 3. **Vectorization**:
- Document 1: [1, 1, 1, 0, 0]  (cat: 1, sat: 1, mat: 1, dog: 0, barked: 0)
- Document 2: [1, 0, 0, 1, 1]  (cat: 1, sat: 0, mat: 0, dog: 1, barked: 1)

#### 4. **Matrix Representation**:
|           | cat | sat | mat | dog | barked |
|-----------|-----|-----|-----|-----|--------|
| Document 1|  1  |  1  |  1  |  0  |   0    |
| Document 2|  1  |  0  |  0  |  1  |   1    |
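The table above can be reproduced with a short sketch. It assumes the vocabulary keeps the order in which words first appear, which is what the worked example uses; the variable names are illustrative.

```python
from collections import Counter

tokenized = [["cat", "sat", "mat"], ["dog", "barked", "cat"]]

# dict.fromkeys drops duplicates while keeping first-appearance order
vocab = list(dict.fromkeys(word for doc in tokenized for word in doc))

matrix = []
for doc in tokenized:
    counts = Counter(doc)  # word -> number of occurrences in this document
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['cat', 'sat', 'mat', 'dog', 'barked']
print(matrix)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```

Any fixed ordering works, as long as every document is vectorized against the same one. Sorting the vocabulary alphabetically is a common alternative.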

---

### **Key Properties**
- **Sparse Representation**: The matrix often has many zero values because not all words appear in every document.
- **Order-Insensitive**: The BoW representation does not consider the order of words in the document.
- **Dimensionality**: The matrix has one row per document and one column per vocabulary word, so its width grows with every new distinct word in the corpus.
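To make the sparsity point concrete, here is a small sketch that measures the share of zeros in the example matrix and stores only the non-zero entries. The `to_sparse` helper is a hypothetical name used for illustration; numerical libraries provide dedicated sparse matrix types for the same purpose.

```python
matrix = [
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
]

def to_sparse(row):
    """Keep only non-zero counts as {column_index: count}."""
    return {i: count for i, count in enumerate(row) if count != 0}

sparse_rows = [to_sparse(row) for row in matrix]

total_cells = sum(len(row) for row in matrix)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(sparse_rows)  # [{0: 1, 1: 1, 2: 1}, {0: 1, 3: 1, 4: 1}]
print(sparsity)     # 0.4
```

With only two short documents the saving is small, but with a vocabulary of tens of thousands of words most rows would contain almost nothing but zeros.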

---

### **Limitations**
1. **Context Ignorance**: BoW discards word order, so sentences built from the same words in a different order receive identical vectors.
2. **High Dimensionality**: Large vocabularies produce very wide, mostly empty matrices.
3. **No Semantic Understanding**: Similar words (e.g., "run" and "jog") are treated as completely different features.
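The first and third limitations are easy to see in a quick sketch. The sentences and the `bow_vector` helper below are made up for illustration.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Count vector of `tokens` in the column order given by `vocab`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Word order is lost: opposite meanings, identical vectors
vocab = ["bites", "dog", "man"]
a = bow_vector("dog bites man".split(), vocab)
b = bow_vector("man bites dog".split(), vocab)
print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True

# Synonyms share no feature: the vectors have no overlap at all
vocab2 = ["jog", "run"]
run_vec = bow_vector(["run"], vocab2)
jog_vec = bow_vector(["jog"], vocab2)
overlap = sum(x * y for x, y in zip(run_vec, jog_vec))
print(run_vec, jog_vec, overlap)  # [0, 1] [1, 0] 0
```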

---

### **Applications**
- Text classification (e.g., spam detection).
- Sentiment analysis.
- Document similarity calculations.
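As one illustration of document similarity, cosine similarity is a common choice for comparing BoW vectors. The sketch below applies it to the two example vectors from earlier; the function name is an assumption for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 0, 0]  # cat, sat, mat
doc2 = [1, 0, 0, 1, 1]  # cat, dog, barked
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 3))  # 0.333
```

The two documents share only "cat", so the score is low. Identical vectors score 1.0 and vectors with no shared words score 0.0.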

---

By transforming text into a fixed-size numerical representation, BoW enables text data to be used in machine learning models.

In [1]:
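from collections import Counter

class BOW:
    """Minimal Bag of Words vectorizer for an already tokenized corpus."""

    def __init__(self, tokenized_corpus):
        # Each document is a list of tokens (strings)
        self.tokenized_corpus = tokenized_corpus
        self.vocab = None  # sorted list of unique tokens, set by fit()
        self.bow = None    # list of count vectors, set by transform()

    def build_vocabulary(self):
        # Sorting gives every word a stable column index
        self.vocab = sorted(set(word for tokens in self.tokenized_corpus for word in tokens))

    def vectorize_document(self, tokens):
        # Count each token, then read the counts off in vocabulary order
        token_counts = Counter(tokens)
        vector = [token_counts.get(word, 0) for word in self.vocab]
        return vector

    def fit(self):
        # Learn the vocabulary from the corpus
        self.build_vocabulary()

    def transform(self):
        # Convert every document in the corpus into a count vector
        if self.vocab is None:
            raise ValueError("Vocabulary has not been built. Please call fit() before transform().")

        self.bow = [self.vectorize_document(tokens) for tokens in self.tokenized_corpus]
        return self.bow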


In [2]:
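# Example input: tokenized documents (lowercased, punctuation removed).
# Stopwords are kept here, so the vocabulary is larger than in the worked example above.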
tokenized_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"]
]

# Initialize the BOW class
bow = BOW(tokenized_corpus)

# Fit the model to build the vocabulary
bow.fit()
print("Vocabulary:", bow.vocab)

# Transform the corpus into a BoW matrix
bow_matrix = bow.transform()

# Display the Bag of Words matrix
print("\nBag of Words Matrix:")
for i, vector in enumerate(bow_matrix):
    print(f"Document {i+1}: {vector}")


Vocabulary: ['at', 'barked', 'cat', 'dog', 'mat', 'on', 'sat', 'the']

Bag of Words Matrix:
Document 1: [0, 0, 1, 0, 1, 1, 1, 2]
Document 2: [1, 1, 1, 1, 0, 0, 0, 2]
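One point worth illustrating is what happens to a document that was not part of the corpus. The sketch below reuses the vocabulary printed above and follows the same counting logic as `vectorize_document`; the `vectorize` function and the new sentence are assumptions made for this example.

```python
from collections import Counter

# Vocabulary learned from the two-document corpus above
vocab = ["at", "barked", "cat", "dog", "mat", "on", "sat", "the"]

def vectorize(tokens, vocab):
    """Count vector for `tokens`; words outside `vocab` have no column."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

new_doc = ["the", "cat", "chased", "the", "dog"]
new_vector = vectorize(new_doc, vocab)
unseen = [tok for tok in new_doc if tok not in vocab]

print(new_vector)  # [0, 0, 1, 1, 0, 0, 0, 2]
print(unseen)      # ['chased']
```

Because the vector length is fixed by the vocabulary, the unseen word "chased" is silently dropped. This is the usual behavior for BoW at prediction time.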


### **Using a Library (scikit-learn)**

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat."
]


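# Default settings lowercase the text, drop punctuation, and keep stopwords
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix (a SciPy sparse matrix)
bow_matrix = vectorizer.fit_transform(corpus)

# Feature names are returned in column order, which is alphabetical
vocabulary = vectorizer.get_feature_names_out()

print("Vocabulary:", vocabulary)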

# Display the Bag of Words matrix
print("\nBag of Words Matrix (as dense array):")
print(bow_matrix.toarray())

Vocabulary: ['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']

Bag of Words Matrix (as dense array):
[[0 0 1 0 1 1 1 2]
 [1 1 1 1 0 0 0 2]]
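The library output matches the hand-written class even though the input here was raw text. As far as I understand the defaults, `CountVectorizer` lowercases the input and extracts tokens with the regular expression `(?u)\b\w\w+\b`, which keeps runs of two or more word characters. The sketch below imitates that behavior with the standard library only; the function name is hypothetical and this is an approximation, not the library's actual code path.

```python
import re

# Pattern assumed to match CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize_like_countvectorizer(text):
    """Lowercase, then keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

corpus = [
    "The cat sat on the mat.",
    "The dog barked at the cat.",
]

tokens = [tokenize_like_countvectorizer(doc) for doc in corpus]
vocab = sorted({tok for doc in tokens for tok in doc})

print(vocab)
# ['at', 'barked', 'cat', 'dog', 'mat', 'on', 'sat', 'the']

# Single-character tokens such as "a" are dropped by this pattern
print(tokenize_like_countvectorizer("A cat sat."))  # ['cat', 'sat']
```

Stopwords stay in the vocabulary under these defaults, which is why "the", "on", and "at" appear as columns.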
