<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Building a Bag of Words Model</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Feature Engineering for NLP in Python)</span></div>

# Table of Contents

1. [Recap of Data Format for ML Algorithms](#section-1)
2. [The Bag of Words (BoW) Model](#section-2)
3. [BoW Example: From Corpus to Vectors](#section-3)
4. [Text Preprocessing](#section-4)
5. [Implementing BoW using Scikit-Learn](#section-5)
6. [Building a BoW Naive Bayes Classifier](#section-6)
7. [Text Preprocessing with CountVectorizer Arguments](#section-7)
8. [Full Workflow: Building the Spam Filter](#section-8)
9. [BoW Shortcomings & Introduction to n-grams](#section-9)
10. [Building n-gram Models](#section-10)
11. [Shortcomings of n-gram Models](#section-11)
12. [Conclusion](#section-12)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Recap of Data Format for ML Algorithms</span><br>

Before diving into Natural Language Processing (NLP), it is crucial to understand the data requirements for standard Machine Learning (ML) algorithms.

For almost any ML algorithm to function correctly, the data must adhere to two specific rules:
1.  **Tabular Form**: The data must be structured in rows (observations) and columns (features).
2.  **Numerical Features**: The training features must be numbers. Algorithms cannot natively understand raw text strings like "The lion is the king".

This necessitates **Feature Engineering**: converting raw text into a numerical, tabular representation.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. The Bag of Words (BoW) Model</span><br>

The **Bag of Words (BoW)** model is a fundamental technique to convert text into numerical vectors. It simplifies text by disregarding grammar and word order, focusing only on word multiplicity.

### Core Steps:
1.  **Extract word tokens**: Break down the text into individual words.
2.  **Compute frequency**: Count how many times each word appears in a document.
3.  **Construct a word vector**: Create a vector for each document based on these frequencies and the entire vocabulary of the corpus.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. BoW Example: From Corpus to Vectors</span><br>

Let's look at a concrete example of how a corpus is transformed into vectors.

### The Corpus
Consider the following three documents (sentences):

1.  "The lion is the king of the jungle"
2.  "Lions have lifespans of a decade"
3.  "The lion is an endangered species"

### The Vocabulary
First, we extract the unique words (vocabulary) from the entire corpus. Note that in this raw example, case sensitivity matters ("The" vs "the") and plurals are distinct ("lion" vs "Lions").

**Vocabulary**: `a`, `an`, `decade`, `endangered`, `have`, `is`, `jungle`, `king`, `lifespans`, `lion`, `Lions`, `of`, `species`, `the`, `The`

### The Vectors
Each sentence is converted into a vector of counts corresponding to the vocabulary order above.

| Document | Vector Representation |
| :--- | :--- |
| "The lion is the king of the jungle" | `[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]` |
| "Lions have lifespans of a decade" | `[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]` |
| "The lion is an endangered species" | `[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]` |

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In the first sentence, lowercase "the" appears twice and capitalized "The" appears once. Because no preprocessing has been applied, the two forms occupy separate vocabulary slots, with counts of 2 and 1 respectively.</div>
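
The three core steps can be sketched in plain Python to reproduce the table above. This is a minimal illustration, assuming whitespace tokenization and a vocabulary sorted alphabetically with the lowercase form listed before its capitalized variant (the ordering used in the vocabulary list above); it is not how a library such as scikit-learn orders its features.

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (plain whitespace split, case preserved)
tokenized = [doc.split() for doc in corpus]

# Build the vocabulary: unique tokens, sorted alphabetically ignoring case,
# with the lowercase form placed before its capitalized variant
vocabulary = sorted(
    {token for tokens in tokenized for token in tokens},
    key=lambda w: (w.lower(), w != w.lower()),
)

# Steps 2 and 3: count each token, then lay the counts out in vocabulary order
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vec in zip(corpus, vectors):
    print(vec, "<-", doc)
```

Each vector has one slot per vocabulary word, so every document maps to the same number of columns regardless of its length.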

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Text Preprocessing</span><br>

To improve the efficiency and accuracy of the BoW model, we perform text preprocessing. This reduces the size of the vocabulary (dimensionality) and groups similar words.

### Common Preprocessing Steps:
1.  **Lowercasing**: Converting "Lions" $\rightarrow$ "lions" and "The" $\rightarrow$ "the". (Merging "lions" with "lion" additionally requires stemming or lemmatization.)
2.  **Removing Punctuation**: Stripping symbols like `. , ! ?`.
3.  **Removing Stopwords**: Removing common words that carry little meaning (e.g., "is", "of", "the").

### Benefits:
*   **Smaller Vocabularies**: Fewer columns in our data table.
*   **Improved Performance**: Reducing the number of dimensions (features) often helps ML algorithms generalize better.
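
The effect of these steps on vocabulary size can be sketched as follows. The `preprocess` helper and the tiny `STOPWORDS` set are hypothetical, written only for this illustration, and the corpus is the lion example with sentence-ending punctuation added so that the punctuation step has something to remove.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# real lists (e.g. the ones shipped with NLP libraries) are far longer
STOPWORDS = {"a", "an", "is", "of", "the", "have"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

corpus = [
    "The lion is the king of the jungle.",
    "Lions have lifespans of a decade!",
    "The lion is an endangered species.",
]

# Vocabulary before and after preprocessing
raw_vocab = {tok for doc in corpus for tok in doc.split()}
clean_vocab = {tok for doc in corpus for tok in preprocess(doc)}

print(len(raw_vocab), "raw tokens ->", len(clean_vocab), "after preprocessing")
print(sorted(clean_vocab))
```

Note that "lion" and "lions" remain separate entries: lowercasing alone does not merge singular and plural forms.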

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Implementing BoW using Scikit-Learn</span><br>

We can implement the Bag of Words model easily using Python's `pandas` and `scikit-learn`.

### Step 1: Define the Corpus


In [None]:
import pandas as pd

# Define the corpus as a pandas Series
corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])

print("Corpus:")
print(corpus)



### Step 2: Generate the BoW Matrix
We use `CountVectorizer` from `sklearn.feature_extraction.text`. This class handles tokenization, counting, and basic preprocessing automatically.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
# fit_transform learns the vocabulary and transforms the text
bow_matrix = vectorizer.fit_transform(corpus)

# Convert to array to visualize the matrix
print("BoW Matrix (Dense Array):")
print(bow_matrix.toarray())

# Optional: View the vocabulary mapping
print("\nVocabulary Mapping:")
print(vectorizer.vocabulary_)



**Explanation of Output**:
The output array holds the count of each vocabulary word for each document. By default, `CountVectorizer` lowercases the text and ignores single-character tokens such as "a", so the matrix has 13 columns rather than the 15 of the case-sensitive manual example.
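
To see where the difference from the manual example comes from, the default tokenization can be imitated with the standard library alone. This sketch does not call scikit-learn; it assumes only the regular expression documented as the default `token_pattern` and the default lowercasing, so the vocabulary it prints should match the one learned by `CountVectorizer()` on this corpus.

```python
import re

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Regex documented as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Lowercase each document, tokenize it, and collect the sorted unique tokens
vocabulary = sorted(
    {tok for doc in corpus for tok in re.findall(DEFAULT_TOKEN_PATTERN, doc.lower())}
)

print(len(vocabulary))
print(vocabulary)
```

The single-letter word "a" is missing and "The"/"the" have collapsed into one entry, which accounts for the smaller vocabulary.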

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Building a BoW Naive Bayes Classifier</span><br>

One of the most common applications of BoW is **Spam Filtering**. We will build a classifier to detect whether a message is "spam" or "ham" (legitimate).

### The Dataset
Imagine we have a dataset with two columns: `message` and `label`.

| message | label |
| :--- | :--- |
| WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461 | spam |
| Ah, work. I vaguely remember that. What does it feel like? | ham |

### The Workflow
1.  **Text Preprocessing**: Clean the data.
2.  **Building BoW Model**: Convert text to numerical vectors.
3.  **Machine Learning**: Train a classifier (Naive Bayes) on the vectors.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Text Preprocessing with CountVectorizer Arguments</span><br>

`CountVectorizer` is powerful because it can handle preprocessing steps via arguments during initialization.

### Key Arguments:
*   `lowercase`: `True` (default) or `False`. Converts all characters to lowercase.
*   `strip_accents`: `'unicode'`, `'ascii'`, or `None`. Removes accents/special characters.
*   `stop_words`: `'english'`, a custom `list`, or `None`. Removes common stop words.
*   `token_pattern`: regex. Defines what constitutes a "token" (word).
*   `tokenizer`: function. Allows you to pass a custom tokenizer function.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Full Workflow: Building the Spam Filter</span><br>

Let's implement the full pipeline. First, we will create a dummy dataset to simulate the spam data.

### Step 1: Setup Data and Split


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Creating a dummy dataset for demonstration
data = {
    'message': [
        'WINNER!! As a valued network customer you have been selected to receive a $900 prize reward!',
        'Ah, work. I vaguely remember that. What does it feel like?',
        'Congratulations! You won a free ticket to the Bahamas. Call now!',
        'I am going to the grocery store later, do you need anything?',
        'URGENT! Your mobile number has been awarded with a £2000 prize.',
        'Hey, are we still meeting for lunch tomorrow?'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
}
df = pd.DataFrame(data)

# Split into training and test sets
# We use 25% of data for testing
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], 
    df['label'], 
    test_size=0.25, 
    random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")



### Step 2: Build the BoW Model
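We configure the vectorizer to strip accents, remove English stop words, and keep the original casing. Keeping the casing is for demonstration only: lowercasing is usually the better choice, and because the built-in stop word list is lowercase, capitalized stop words such as "You" are not removed when `lowercase=False`.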



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object with specific preprocessing
vectorizer = CountVectorizer(
    strip_accents='ascii', 
    stop_words='english', 
    lowercase=False
)

# Generate training BoW vectors
# We fit AND transform on training data
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
# We ONLY transform test data (using the vocabulary learned from training)
X_test_bow = vectorizer.transform(X_test)

print("Shape of Training Matrix:", X_train_bow.shape)
print("Shape of Test Matrix:", X_test_bow.shape)



### Step 3: Train the Naive Bayes Classifier
We use `MultinomialNB`, which is standard for classification with discrete features (like word counts).



In [None]:
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(f"Model Accuracy: {accuracy}")
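
To give a feel for what a multinomial Naive Bayes classifier estimates from word counts, here is a simplified pure-Python sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the `train_nb` and `predict_nb` helpers and the toy training data are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors and add-one-smoothed word likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {
        label: log_prior + sum(log_like[w] for w in doc if w in log_like)
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy training data, already tokenized and lowercased
train_docs = [
    ["win", "prize", "now"],
    ["free", "prize", "call", "now"],
    ["lunch", "tomorrow"],
    ["call", "me", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["free", "prize"]))
print(predict_nb(model, ["lunch", "tomorrow"]))
```

Smoothing matters here: without the added 1, any word that never appeared with a class would give that class zero probability for the whole message.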



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. BoW Shortcomings & Introduction to n-grams</span><br>

While BoW is effective, it has significant limitations regarding **context**.

### The Context Problem
Consider these two reviews:
1.  "The movie was good and not boring" (Positive)
2.  "The movie was not good and boring" (Negative)

**Problem**: Both sentences contain the exact same words. A standard BoW model produces the **exact same vector representation** for both, losing the meaning entirely. The sentiment depends heavily on the position of the word "not".
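
This can be checked directly by counting words with the standard library. The sketch assumes simple lowercasing and whitespace tokenization.

```python
from collections import Counter

review_pos = "The movie was good and not boring"
review_neg = "The movie was not good and boring"

# Word counts for each review (a bag of words without fixing a column order)
bow_pos = Counter(review_pos.lower().split())
bow_neg = Counter(review_neg.lower().split())

# The two bags are equal even though the reviews mean opposite things
print(bow_pos == bow_neg)
```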

### Solution: n-grams
An **n-gram** is a contiguous sequence of $n$ elements (words) in a given document.
*   **n = 1**: Unigram (Bag of Words)
*   **n = 2**: Bigram
*   **n = 3**: Trigram

**Example**: "for you a thousand times over"

**n=2 (Bigrams)**:


In [None]:
[
    'for you',
    'you a',
    'a thousand',
    'thousand times',
    'times over'
]



**n=3 (Trigrams)**:


In [None]:
[
    'for you a',
    'you a thousand',
    'a thousand times',
    'thousand times over'
]


By capturing sequences, n-grams retain more context (e.g., "not good" vs "good").
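
A minimal way to generate these sequences is a sliding window over the word list. The `ngrams` helper below is a hypothetical name for this sketch, and it assumes whitespace tokenization with no further preprocessing.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences in `text`, in order of appearance."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "for you a thousand times over"
print(ngrams(sentence, 2))
print(ngrams(sentence, 3))

# Bigrams separate the two reviews that unigram counts could not tell apart
pos = set(ngrams("The movie was good and not boring", 2))
neg = set(ngrams("The movie was not good and boring", 2))
print(sorted(neg - pos))
```

The last line lists the bigrams that occur only in the negative review, including "not good".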

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. Building n-gram Models</span><br>

We can build n-gram models using `CountVectorizer` by adjusting the `ngram_range` argument.

### Applications
*   Sentence completion
*   Spelling correction
*   Machine translation correction

### Implementation
The `ngram_range` argument takes a tuple `(min_n, max_n)`.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

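text_data = ['for you a thousand times over']

# CountVectorizer's default token_pattern drops single-character tokens such as "a".
# Widening it keeps every word, so the output matches the lists shown earlier.
keep_all_words = r"(?u)\b\w+\b"

# Generates ONLY bigrams (n=2)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=keep_all_words)
bigrams = bigram_vectorizer.fit_transform(text_data)

print("Bigrams Vocabulary:")
print(bigram_vectorizer.get_feature_names_out())

# Generates unigrams, bigrams, AND trigrams (n=1 to n=3)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=keep_all_words)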
ngrams = ngram_vectorizer.fit_transform(text_data)

print("\nUnigrams, Bigrams, and Trigrams Vocabulary:")
print(ngram_vectorizer.get_feature_names_out())



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 11. Shortcomings of n-gram Models</span><br>

While n-grams capture context, they introduce new challenges:

1.  **Curse of Dimensionality**: As $n$ increases, the size of the vocabulary explodes. If you have $V$ unique words, you could theoretically have $V^n$ n-grams. This creates massive, sparse matrices that are computationally expensive to process.
2.  **Rarity**: Any particular higher-order n-gram (e.g., a specific 5-gram) typically occurs only a handful of times in a corpus, so features built on such n-grams generalize poorly and encourage overfitting.
3.  **Guidance**: It is generally recommended to keep $n$ small (usually 1, 2, or 3).
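
Both effects can be made concrete with a little arithmetic. In the sketch below, the vocabulary size `V = 10_000` and the toy sentence are assumptions chosen only for illustration; the first loop prints the theoretical upper bound $V^n$, and the second counts how many distinct n-grams in the toy text occur exactly once.

```python
from collections import Counter

V = 10_000  # hypothetical vocabulary size

# Upper bound on the number of distinct n-grams: V ** n
upper_bounds = {n: V ** n for n in (1, 2, 3)}
for n, bound in upper_bounds.items():
    print(f"n={n}: at most {bound:,} distinct n-grams")

# Toy text (an assumption for illustration), split on whitespace
words = "the lion is the king of the jungle and the lion is endangered".split()

# For each n, record (n-grams seen exactly once, distinct n-grams)
singleton_share = {}
for n in (1, 2, 3):
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_share[n] = (singletons, len(counts))
    print(f"n={n}: {singletons} of {len(counts)} distinct n-grams occur only once")
```

Even in this tiny text, the share of n-grams seen only once rises with $n$, which is the rarity problem in miniature.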

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 12. Conclusion</span><br>

In this notebook, we explored the foundational techniques of Feature Engineering for NLP:

*   **Bag of Words (BoW)**: We learned how to convert raw text into numerical vectors by counting word frequencies.
*   **Preprocessing**: We saw how cleaning text (lowercasing, removing stopwords) improves model efficiency.
*   **Scikit-Learn Implementation**: We utilized `CountVectorizer` to automate tokenization and vectorization.
*   **Classification**: We successfully built a pipeline to classify spam messages using BoW and Naive Bayes.
*   **n-grams**: We addressed the context limitations of BoW by introducing n-grams, which capture sequences of words, while noting the trade-off with dimensionality.

**Next Steps**:
To further improve NLP models, one might explore **TF-IDF** (Term Frequency-Inverse Document Frequency) to weigh the importance of words, or **Word Embeddings** (like Word2Vec or GloVe) for capturing semantic meaning beyond simple counts.
