### Introduction: Bag-of-Words (BoW)

- Text data is **unstructured** and cannot be directly understood by machines
- **Bag-of-Words (BoW)** converts text into a **numerical representation**
- It represents text using **word frequency**, ignoring grammar and word order
- BoW enables machine learning models to **analyze and predict from text**

#### Example
**Data**
- Document 1: *"The cat sat on the mat."*
- Document 2: *"The dog chased the cat."*

**Step 1: Tokenization**
- Doc 1 Tokens: `["the", "cat", "sat", "on", "the", "mat"]`
- Doc 2 Tokens: `["the", "dog", "chased", "the", "cat"]`

**Step 2: Vocabulary Creation**
- Vocabulary: `{"the", "cat", "sat", "on", "mat", "dog", "chased"}`
- Vocabulary Size = **7**

**Step 3: Word Frequency Count**

| Word     | Doc 1 | Doc 2 |
|---------|-------|-------|
| the     | 2     | 2     |
| cat     | 1     | 1     |
| sat     | 1     | 0     |
| on      | 1     | 0     |
| mat     | 1     | 0     |
| dog     | 0     | 1     |
| chased  | 0     | 1     |

**Step 4: Vector Representation**
- Vocabulary Order: `["the", "cat", "sat", "on", "mat", "dog", "chased"]`
- Doc 1 Vector: `[2, 1, 1, 1, 1, 0, 0]`
- Doc 2 Vector: `[2, 1, 0, 0, 0, 1, 1]`

**Key Idea**
- Word order is ignored → only **frequency matters**
- Text is converted into **numerical vectors** usable by ML models
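
The four steps above can be reproduced with the Python standard library alone. This is a minimal sketch, not a production tokenizer: the helper name `tokenize` is made up for illustration, and it assumes that stripping ASCII punctuation and splitting on whitespace is good enough for these two sentences.

```python
from collections import Counter
import string

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace (illustrative helper)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 1: tokenization
tokenized = [tokenize(doc) for doc in documents]

# Step 2: vocabulary, kept in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 3 and 4: count each vocabulary word per document
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

The printed vectors follow the same vocabulary order as Step 4, so they can be compared directly with the table in Step 3.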

### Process of Vocabulary Creation
1. **Gather Documents**: Collect all text documents for analysis
2. **Tokenization**:
   - Split text into words/tokens.
   - Can use whitespace, punctuation, or NLP libraries
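3. **Normalization**: Convert text to a consistent format
   - Lowercasing words
   - Removing punctuation
   - Removing numbers (optional)
   - Removing stop words (optional; the code below does this, which is why its vocabulary is smaller than in the introductory example)
4. **Identify Unique Tokens**
   - Collect all normalized tokens
   - Unique tokens form the **vocabulary**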


In [2]:
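import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: the stop word list and tokenizer data must be downloaded first,
# e.g. nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead of 'punkt').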

In [3]:
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat."
] # data

# Get English stop words and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

In [6]:
all_tokens = []
for doc in corpus:
    # Tokenize the document
    tokens = word_tokenize(doc.lower()) # Convert to lowercase and tokenize

    # Remove punctuation and stop words
    filtered_tokens = [
        word for word in tokens
        if word not in stop_words and word not in punctuation
    ]
    all_tokens.extend(filtered_tokens)

# Create the vocabulary (sorted list of unique tokens)
vocabulary = sorted(set(all_tokens))
print(f"Original Corpus:{corpus}")
print(f"All Tokens:{all_tokens}") #(after lowercasing, tokenization, stop word & punctuation removal)
print(f"Vocabulary:{vocabulary}")

Original Corpus:['The cat sat on the mat.', 'The dog chased the cat.']
All Tokens:['cat', 'sat', 'mat', 'dog', 'chased', 'cat']
Vocabulary:['cat', 'chased', 'dog', 'mat', 'sat']


## Document-Term Matrix: Numerical Representation

The **Document-Term Matrix (DTM)** is the core numerical representation in **Bag-of-Words**. It is the form in which machine learning models receive text data.

- **Rows** represent **documents**
- **Columns** represent **terms (vocabulary words)**
- **Cell values** show **frequency of a term in a document**

In [7]:
import numpy as np
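# Reuses `vocabulary` from the previous cell: ['cat', 'chased', 'dog', 'mat', 'sat']
# The counts below are entered by hand and follow that column order.

# Word counts for Document 1 ("cat", "mat", "sat" remain after stop word removal)
doc1_counts = [1, 0, 0, 1, 1]

# Word counts for Document 2 ("cat", "chased", "dog" remain after stop word removal)
doc2_counts = [1, 1, 1, 0, 0]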

# Create the Document-Term Matrix
document_term_matrix = np.array([doc1_counts, doc2_counts])

print("Document-Term Matrix:")
print(document_term_matrix)
print("Vocabulary (column headers):")
print(vocabulary)

Document-Term Matrix:
[[1 0 0 1 1]
 [1 1 1 0 0]]
Vocabulary (column headers):
['cat', 'chased', 'dog', 'mat', 'sat']
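
The counts in the cell above were typed in by hand. As a hedged sketch, the same matrix can be derived from the filtered tokens. The per-document token lists below are copied from the `All Tokens` output shown earlier, and `doc_tokens` and `dtm_rows` are names introduced only for this illustration.

```python
from collections import Counter

# Filtered tokens per document (lowercased, punctuation and stop words removed),
# copied from the earlier output rather than recomputed with NLTK.
doc_tokens = [
    ["cat", "sat", "mat"],       # Document 1
    ["dog", "chased", "cat"],    # Document 2
]

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({token for tokens in doc_tokens for token in tokens})

# One row per document, one column per vocabulary term
dtm_rows = []
for tokens in doc_tokens:
    counts = Counter(tokens)
    dtm_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm_rows:
    print(row)
```

Deriving the rows this way keeps the matrix in sync with the vocabulary if the corpus or the filtering rules change.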


### Implementing Bag-of-Words with Scikit-learn's CountVectorizer

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "This is a sample document for demonstration purposes."
]

In [12]:
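# With default settings, CountVectorizer lowercases the text and keeps only
# tokens of two or more alphanumeric characters, so punctuation and the
# single-letter word "a" are dropped.
vectorizer = CountVectorizer()

# Fit the vectorizer to the corpus and transform the corpus into a DTM.
# fit_transform does two things:
# a) It learns the vocabulary from the corpus (fit).
# b) It converts the corpus into a Document-Term Matrix (transform).
dtm = vectorizer.fit_transform(corpus)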

print("Type of the resulting DTM:", type(dtm))
print("Shape of the DTM (n_documents, n_features):", dtm.shape)

# To see the vocabulary (the features/column names)
feature_names = vectorizer.get_feature_names_out()
print("Vocabulary (Features):", feature_names)

# To view the DTM as a dense Pandas DataFrame for better readability
# Note: Converting large sparse matrices to dense can consume a lot of memory.
# For demonstration with a small corpus, it's fine.
dtm_df = pd.DataFrame(dtm.toarray(), columns=feature_names)
# Label the rows; assigning a new index replaces any previously set index name,
# so no index name is set here.
dtm_df.index = [f'Doc {i}' for i in range(len(corpus))]

print("Document-Term Matrix (as Pandas DataFrame):")
print(dtm_df)

Type of the resulting DTM: <class 'scipy.sparse._csr.csr_matrix'>
Shape of the DTM (n_documents, n_features): (5, 13)
Vocabulary (Features): ['and' 'demonstration' 'document' 'first' 'for' 'is' 'one' 'purposes'
 'sample' 'second' 'the' 'third' 'this']
Document-Term Matrix (as Pandas DataFrame):
       and  demonstration  document  first  for  is  one  purposes  sample  \
Doc 0    0              0         1      1    0   1    0         0       0   
Doc 1    0              0         2      0    0   1    0         0       0   
Doc 2    1              0         0      0    0   1    1         0       0   
Doc 3    0              0         1      1    0   1    0         0       0   
Doc 4    0              1         1      0    1   1    0         1       1   

       second  the  third  this  
Doc 0       0    1      0     1  
Doc 1       1    1      0     1  
Doc 2       0    1      1     1  
Doc 3       0    1      0     1  
Doc 4       0    0      0     1  


## Understanding Sparsity in Document-Term Matrices

#### What Causes Sparsity in BoW?
- **Large Vocabulary**: Thousands or millions of unique words
- **Short Documents**: Each document uses only a small subset of the vocabulary
- **Uneven Word Distribution**: Many words appear in very few documents

#### Why is Sparsity Important?
**1. Computational Efficiency**
- Saves memory using sparse matrix formats (CSR, CSC)
- Speeds up computations by ignoring zero values

**2. Algorithmic Implications**
- Helps identify informative features
- Some algorithms, especially distance-based ones such as KNN, struggle with high-dimensional, highly sparse vectors

**3. Dimensionality Reduction**
- Indicates a need for techniques such as truncated SVD or PCA (note that standard PCA centers the data, which destroys sparsity)
- Improves performance of certain models
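
To make the memory argument concrete, here is a rough pure-Python sketch of the CSR (Compressed Sparse Row) layout. It is an illustration of the idea, not SciPy's implementation. The helper `to_csr` is hypothetical, and the dense rows are copied from the DataFrame output shown earlier.

```python
# Dense DTM copied from the CountVectorizer output (5 documents x 13 terms)
dense_rows = [
    [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1],  # Doc 0
    [0, 0, 2, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1],  # Doc 1
    [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1],  # Doc 2
    [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1],  # Doc 3
    [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1],  # Doc 4
]

def to_csr(rows):
    """Return (data, indices, indptr) for a list of dense rows (illustrative helper)."""
    data = []      # non-zero values, row by row
    indices = []   # column index of each non-zero value
    indptr = [0]   # indptr[i]:indptr[i + 1] slices out row i
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr

data, indices, indptr = to_csr(dense_rows)

# Recover the non-zero entries of Doc 1 without touching any zeros
start, end = indptr[1], indptr[2]
doc1_entries = list(zip(indices[start:end], data[start:end]))

print("indptr:", indptr)
print("Doc 1 (column, count) pairs:", doc1_entries)
print("stored non-zero values:", len(data))
```

For a matrix this small the three arrays are barely smaller than the dense table. The saving becomes significant when the vocabulary has thousands of columns and each document touches only a few of them.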


In [14]:
# Finding sparsity
print(f"Number of non-zero elements: {dtm.nnz}")

Number of non-zero elements: 28
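
The non-zero count can be turned into a sparsity ratio by dividing by the total number of cells. The sketch below uses the numbers printed above (28 non-zero entries in a 5 x 13 matrix) as plain constants so that it runs on its own. With the actual matrix, the same values should be available as `dtm.nnz` and `dtm.shape`.

```python
# Values taken from the outputs above; with SciPy these would be
# dtm.shape and dtm.nnz respectively.
n_documents, n_terms = 5, 13
non_zero = 28

total_cells = n_documents * n_terms
density = non_zero / total_cells   # fraction of cells that hold a count
sparsity = 1 - density             # fraction of cells that are zero

print(f"Total cells: {total_cells}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

A sparsity of roughly 57% is low by text-mining standards. Realistic corpora with large vocabularies are typically far sparser, since each document uses only a small share of the vocabulary.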


In [21]:
from scipy.sparse import issparse

# Convert to a dense array (use with caution for large matrices)
dtm_dense = dtm.toarray()
print("Is the DTM sparse?", issparse(dtm))  # True: CSR is a SciPy sparse format
print("Is the dense version sparse?", issparse(dtm_dense))  # False: it is a NumPy array
# Note: NumPy arrays such as dtm_dense have no getformat() method,
# so issparse() is the check that works for both objects.

Is the DTM sparse? True
Is the dense version sparse? False
