# Bag of Words (BoW) — Simplified Example

This example demonstrates how **Bag of Words (BoW)** represents text numerically, along with its **binary BoW** variant.

---

## 🧾 Input Sentences

| **Text**                                 | **O/P** | **Preprocessing Step**       | **Result (S)**                        |
| ---------------------------------------- | ------- | ---------------------------- | ------------------------------------- |
| He is a good boy and a good student      | 1       | Lowercase + Stopword removal | **S1 → good boy good student**        |
| She is a good girl and good in studies   | 1       | Lowercase + Stopword removal | **S2 → good girl good studies**       |
| Boy and girl are good and bright student | 1       | Lowercase + Stopword removal | **S3 → boy girl good bright student** |
| Student studies well and is a good boy   | 1       | Lowercase + Stopword removal | **S4 → student studies good boy**     |

> **Stopwords removed:** he, is, a, she, and, are, in, well
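
The preprocessing column above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `preprocess` helper is a hypothetical name, and splitting on whitespace is an assumption that only works here because the sentences contain no punctuation.

```python
# Stopword list copied from the note above.
STOPWORDS = {"he", "is", "a", "she", "and", "are", "in", "well"}

def preprocess(text):
    """Lowercase the text and drop stopwords, keeping the original word order."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

sentences = [
    "He is a good boy and a good student",
    "She is a good girl and good in studies",
    "Boy and girl are good and bright student",
    "Student studies well and is a good boy",
]

cleaned = [preprocess(s) for s in sentences]
for i, s in enumerate(cleaned, start=1):
    print(f"S{i} -> {s}")
# S1 -> good boy good student
# S2 -> good girl good studies
# S3 -> boy girl good bright student
# S4 -> student studies good boy
```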

---

## 🧩 Vocabulary and Frequency Table

| **Vocabulary** | **Frequency** |
| -------------- | ------------- |
| good           | 6             |
| boy            | 3             |
| girl           | 2             |
| student        | 3             |
| studies        | 2             |
| bright         | 1             |
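
The corpus-wide frequencies can be checked by counting tokens across the four preprocessed sentences. The sketch below uses `collections.Counter` from the standard library; the variable names are illustrative only.

```python
from collections import Counter

# Preprocessed sentences S1..S4 from the input table.
cleaned = [
    "good boy good student",
    "good girl good studies",
    "boy girl good bright student",
    "student studies good boy",
]

# Count every token over the whole corpus.
freq = Counter(word for sentence in cleaned for word in sentence.split())

for word in ["good", "boy", "girl", "student", "studies", "bright"]:
    print(word, freq[word])
```

Note that *good* occurs twice in S1, twice in S2, and once each in S3 and S4, so its total is 6.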

---

## 1️⃣ Normal Bag of Words Representation

| **Sentence** | **good** | **boy** | **girl** | **student** | **studies** | **bright** | **O/P** |
| ------------ | -------- | ------- | -------- | ----------- | ----------- | ---------- | ------- |
| **S1**       | 2        | 1       | 0        | 1           | 0           | 0          | 1       |
| **S2**       | 2        | 0       | 1        | 0           | 1           | 0          | 1       |
| **S3**       | 1        | 1       | 1        | 1           | 0           | 1          | 1       |
| **S4**       | 1        | 1       | 0        | 1           | 1           | 0          | 1       |

### 💡 Explanation

* Each number represents **how many times** a word occurs in the sentence.
* Example: In **S1**, the word *good* appears **2 times**, while *student* appears once.
* This representation gives the model information about **word frequency** in each document.
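
As a rough illustration, the count table can be built by hand: fix a vocabulary order, then count each vocabulary word in each sentence. The `count_vector` helper is a hypothetical name, and the column order is taken from the table above.

```python
# Column order used in the table above.
vocab = ["good", "boy", "girl", "student", "studies", "bright"]

cleaned = {
    "S1": "good boy good student",
    "S2": "good girl good studies",
    "S3": "boy girl good bright student",
    "S4": "student studies good boy",
}

def count_vector(sentence, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = sentence.split()
    return [tokens.count(word) for word in vocab]

count_vectors = {name: count_vector(s, vocab) for name, s in cleaned.items()}
for name, vec in count_vectors.items():
    print(name, vec)
# S1 [2, 1, 0, 1, 0, 0]
# S2 [2, 0, 1, 0, 1, 0]
# S3 [1, 1, 1, 1, 0, 1]
# S4 [1, 1, 0, 1, 1, 0]
```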

---

## 2️⃣ Binary Bag of Words Representation

| **Sentence** | **good** | **boy** | **girl** | **student** | **studies** | **bright** | **O/P** |
| ------------ | -------- | ------- | -------- | ----------- | ----------- | ---------- | ------- |
| **S1**       | 1        | 1       | 0        | 1           | 0           | 0          | 1       |
| **S2**       | 1        | 0       | 1        | 0           | 1           | 0          | 1       |
| **S3**       | 1        | 1       | 1        | 1           | 0           | 1          | 1       |
| **S4**       | 1        | 1       | 0        | 1           | 1           | 0          | 1       |

> In **binary BoW**, each word is represented as **1 if present**, or **0 if absent**, ignoring frequency.
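
One simple way to obtain the binary table is to clip every positive count to 1. The sketch below starts from the count vectors of the previous section; it is an illustration, not the only way to compute the result.

```python
# Count vectors in the column order: good, boy, girl, student, studies, bright.
count_vectors = {
    "S1": [2, 1, 0, 1, 0, 0],
    "S2": [2, 0, 1, 0, 1, 0],
    "S3": [1, 1, 1, 1, 0, 1],
    "S4": [1, 1, 0, 1, 1, 0],
}

# Any positive count becomes 1; zeros stay 0.
binary_vectors = {
    name: [int(count > 0) for count in vec]
    for name, vec in count_vectors.items()
}

for name, vec in binary_vectors.items():
    print(name, vec)
# S1 [1, 1, 0, 1, 0, 0]
# S2 [1, 0, 1, 0, 1, 0]
# S3 [1, 1, 1, 1, 0, 1]
# S4 [1, 1, 0, 1, 1, 0]
```

Only S1 and S2 change, because they are the only sentences in which a word (*good*) occurs more than once.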

---

## ⚙️ Summary of Steps

1. **Lowercase conversion** — normalize text.
2. **Stopword removal** — remove common, non-informative words.
3. **Vocabulary creation** — list all unique words in the dataset.
4. **Vectorization** — represent text as numeric vectors:

   * **Normal BoW:** counts occurrences.
   * **Binary BoW:** marks presence or absence.

---

## 🐍 Python Example — Normal & Binary BoW

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a good boy and a good student",
    "She is a good girl and good in studies",
    "Boy and girl are good and bright student",
    "Student studies well and is a good boy"
]

# Normal Bag of Words
cv_normal = CountVectorizer(stop_words=['he','is','a','she','and','are','in','well'])
X_normal = cv_normal.fit_transform(corpus)
print("Normal BoW:")
print(cv_normal.get_feature_names_out())
print(X_normal.toarray())

# Binary Bag of Words
cv_binary = CountVectorizer(stop_words=['he','is','a','she','and','are','in','well'], binary=True)
X_binary = cv_binary.fit_transform(corpus)
print("\nBinary BoW:")
print(cv_binary.get_feature_names_out())
print(X_binary.toarray())
```

### Example Output

```
Normal BoW:
['boy' 'bright' 'girl' 'good' 'student' 'studies']
[[1 0 0 2 1 0]
 [0 0 1 2 0 1]
 [1 1 1 1 1 0]
 [1 0 0 1 1 1]]

Binary BoW:
['boy' 'bright' 'girl' 'good' 'student' 'studies']
[[1 0 0 1 1 0]
 [0 0 1 1 0 1]
 [1 1 1 1 1 0]
 [1 0 0 1 1 1]]
```

---

## ✅ Advantages of Bag of Words

1. **Simple and intuitive** — easy to implement and understand.
2. **Efficient for smaller datasets** — works well when vocabulary size is manageable.
3. **Fixed-size input** — output vectors are of consistent length, suitable for ML models.
4. **Good baseline** — often used as a benchmark before trying advanced models.
5. **Works with many algorithms** — especially useful for Naïve Bayes, SVM, and logistic regression.

---

## ⚠️ Disadvantages of Bag of Words

1. **Sparse matrix problem** — large vocabulary creates huge vectors with mostly zeros → risk of **overfitting**.
2. **Loss of word order** — sentence structure and grammar are ignored.
3. **Out-of-Vocabulary (OOV)** — unknown/new words in test data cannot be represented.
4. **No semantic meaning** — each word is only a column index, so context and word sense are not captured.
5. **High dimensionality** — vocabulary growth increases computational cost.
6. **Insensitive to synonyms** — words with similar meanings are treated as unrelated features (e.g., “good” and “excellent” get separate columns).
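
Two of these limitations, loss of word order and out-of-vocabulary words, are easy to see with a hand-rolled count vector. The sketch below reuses the six-word vocabulary from the example; `count_vector` is a hypothetical helper and the test phrases are made up for illustration.

```python
vocab = ["good", "boy", "girl", "student", "studies", "bright"]

def count_vector(text, vocab):
    """Count each vocabulary word in the text; words outside vocab are ignored."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

# Loss of word order: different orderings map to the same vector.
a = count_vector("boy good student", vocab)
b = count_vector("student good boy", vocab)
print(a == b)        # True

# Out-of-vocabulary: "excellent" was never seen, so it leaves no trace.
oov = count_vector("excellent student", vocab)
print(oov)           # [0, 0, 0, 1, 0, 0]
```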

---

### 🧠 Summary

> Bag of Words is a **foundational text representation technique**. Despite its simplicity and limitations, it is the direct basis for weighted variants such as **TF-IDF**, and it remains a common baseline against which learned representations like **Word2Vec** and **Transformer-based embeddings (e.g., BERT)** are compared.

---



#### **Bag of Words Practicals**

```python
import pandas as pd

messages = pd.read_csv('SpamClassifier-master/smsspamcollection/SMSSpamCollection',
                       sep='\t', names=["label", "message"])
```

```python
messages
```

|   | label | message |
| - | ----- | ------- |
| 0 | ham Go until jurong point, crazy.. Available o... | NaN |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | WINNER!! As a valued network customer you have... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 8 | spam | XXXMobileMovieClub: To use your credit, click ... |
| 9 | ham | Oh k...i'm watching here:) |


#### **Data Cleaning and Preprocessing**

```python
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
```

```
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True
```

```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
```

```python
# Build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(messages)):
    msg = str(messages['message'][i])          # str() turns a missing value into 'nan'
    review = re.sub('[^a-zA-Z]', ' ', msg)     # keep letters only
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

print(corpus[:6])   # first six cleaned messages
```

```
['nan', 'ok lar joke wif u oni', 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli', 'u dun say earli hor u c alreadi say', 'nah think goe usf live around though', 'winner valu network custom select receivea prize reward claim call claim code kl valid hour']
```
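
The `'nan'` entry at the start of the corpus appears to come from row 0, whose `message` column is empty in the DataFrame shown earlier. The sketch below reproduces the effect with a plain float NaN standing in for the missing pandas value (an assumption about what that cell contains): `str()` turns it into the literal text `nan`, which then passes through the letters-only regex unchanged.

```python
import re

missing = float("nan")              # stand-in for a missing cell value
text = str(missing)                 # the string 'nan'

# Same cleaning steps as the loop above, without stopwords or stemming.
tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
print(tokens)                       # ['nan']
```

Dropping or repairing such rows before the loop (for example with pandas `dropna`) would be one option; whether row 0 should instead be re-parsed depends on the raw file.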

#### **Create Bag of Words**
> Source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

```python
from sklearn.feature_extraction.text import CountVectorizer

# For binary BoW, set binary=True
cv = CountVectorizer(max_features=100, binary=True)
```

```python
X = cv.fit_transform(corpus).toarray()
X
```

```
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
```

```python
X.shape
```

```
(48, 100)
```
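
`fit_transform` returns a SciPy sparse matrix, and `.toarray()` converts it to a dense NumPy array. To get a feel for how sparse a BoW matrix is, the fraction of non-zero cells can be read from the sparse form. The sketch below uses the four-sentence toy corpus rather than the SMS data, so the numbers are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "He is a good boy and a good student",
    "She is a good girl and good in studies",
    "Boy and girl are good and bright student",
    "Student studies well and is a good boy",
]
toy_stop = ['he', 'is', 'a', 'she', 'and', 'are', 'in', 'well']

X_sparse = CountVectorizer(stop_words=toy_stop).fit_transform(toy_corpus)

rows, cols = X_sparse.shape
density = X_sparse.nnz / (rows * cols)   # share of cells that are non-zero
print(X_sparse.shape, X_sparse.nnz, density)
# (4, 6) 15 0.625
```

On a real corpus with thousands of vocabulary terms the density is typically far lower, which is why keeping the matrix in sparse form is often preferable to calling `.toarray()`.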

```python
cv.vocabulary_
```

```
{'lar': np.int64(64),
 'joke': np.int64(61),
 'free': np.int64(43),
 'entri': np.int64(35),
 'comp': np.int64(18),
 'win': np.int64(99),
 'fa': np.int64(39),
 'cup': np.int64(25),
 'final': np.int64(41),
 'st': np.int64(92),
 'may': np.int64(73),
 'text': np.int64(94),
 'receiv': np.int64(86),
 'std': np.int64(93),
 'rate': np.int64(85),
 'appli': np.int64(2),
 'dun': np.int64(32),
 'earli': np.int64(33),
 'hor': np.int64(54),
 'alreadi': np.int64(1),
 'goe': np.int64(47),
 'live': np.int64(71),
 'custom': np.int64(26),
 'select': np.int64(91),
 'prize': np.int64(84),
 'reward': np.int64(89),
 'claim': np.int64(14),
 'call': np.int64(11),
 'kl': np.int64(63),
 'hour': np.int64(56),
 'even': np.int64(36),
 'date': np.int64(28),
 'credit': np.int64(23),
 'click': np.int64(15),
 'link': np.int64(70),
 'messag': np.int64(76),
 'congratul': np.int64(20),
 'gift': np.int64(45),
 'go': np.int64(46),
 'http': np.int64(57),
 'bit': np.int64(7),
 'babe': np.int64(3),
 'see': np.int64(90),
 'remi
```

#### **N-Grams**

```python
# Create the Bag of Words with n-grams
from sklearn.feature_extraction.text import CountVectorizer

# For binary BoW, set binary=True. Uncomment one configuration at a time.

# cv = CountVectorizer(max_features=100, binary=True, ngram_range=(1, 1))   # unigrams only (1,1)
# cv = CountVectorizer(max_features=200, binary=True, ngram_range=(1, 2))   # unigrams and bigrams (1,2)
# cv = CountVectorizer(max_features=200, binary=True, ngram_range=(2, 2))   # bigrams only (2,2)
# cv = CountVectorizer(max_features=200, binary=True, ngram_range=(1, 3))   # unigrams, bigrams, trigrams (1,3)
cv = CountVectorizer(max_features=200, binary=True, ngram_range=(3, 3))     # trigrams only (3,3)

X = cv.fit_transform(corpus).toarray()
```

```python
cv.vocabulary_
```

```
{'ok lar joke': np.int64(107),
 'lar joke wif': np.int64(73),
 'joke wif oni': np.int64(71),
 'free entri wkli': np.int64(53),
 'entri wkli comp': np.int64(43),
 'wkli comp win': np.int64(173),
 'comp win fa': np.int64(27),
 'win fa cup': np.int64(170),
 'fa cup final': np.int64(47),
 'cup final tkt': np.int64(34),
 'final tkt st': np.int64(51),
 'tkt st may': np.int64(154),
 'st may text': np.int64(136),
 'may text fa': np.int64(86),
 'text fa receiv': np.int64(141),
 'fa receiv entri': np.int64(48),
 'receiv entri question': np.int64(115),
 'entri question std': np.int64(42),
 'question std txt': np.int64(113),
 'std txt rate': np.int64(138),
 'txt rate appli': np.int64(158),
 'dun say earli': np.int64(38),
 'say earli hor': np.int64(126),
 'earli hor alreadi': np.int64(39),
 'hor alreadi say': np.int64(64),
 'nah think goe': np.int64(98),
 'think goe usf': np.int64(151),
 'goe usf live': np.int64(60),
 'usf live around': np.int64(161),
 'live around though': np.int64(80),
 'winner v
```

```python
X
```

```
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
```
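
To see what `ngram_range` does, n-grams can be generated by hand from one cleaned message. The sketch below is a simplified stand-in, not scikit-learn's implementation: `ngrams` is a hypothetical helper, and it mimics `CountVectorizer`'s default token pattern, which keeps only tokens of two or more characters. That is why the single-letter token `u` disappears and `joke wif oni` shows up as a trigram in the vocabulary above.

```python
import re

def ngrams(text, n_min, n_max):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    # Keep tokens with at least two word characters, like CountVectorizer's default.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

message = "ok lar joke wif u oni"

trigrams = ngrams(message, 3, 3)
print(trigrams)
# ['ok lar joke', 'lar joke wif', 'joke wif oni']

uni_and_bi = ngrams(message, 1, 2)
print(uni_and_bi)
# ['ok', 'lar', 'joke', 'wif', 'oni', 'ok lar', 'lar joke', 'joke wif', 'wif oni']
```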