# Bag Of Words

## **Step 1: Sample Sentences**
We have two sentences:
1. "I love machine learning"
2. "Machine learning is amazing"

## **Step 2: Vocabulary**
The unique words in both sentences are:



In [1]:
['I', 'love', 'machine', 'learning', 'is', 'amazing']

['I', 'love', 'machine', 'learning', 'is', 'amazing']


## **Step 3: Convert to Vector**
We represent each sentence as a vector where each value indicates the frequency of a word.

| Sentence                 | I | love | machine | learning | is | amazing |
|-------------------------|---|------|---------|----------|----|---------|
| "I love machine learning" | 1 | 1    | 1       | 1        | 0  | 0       |
| "Machine learning is amazing" | 0 | 0    | 1       | 1        | 1  | 1       |
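
The table above can be reproduced by hand in a few lines. This is a minimal sketch in plain Python, assuming lowercase whitespace tokenization (so "I" is stored as "i"). The helper names `tokenize` and `to_vector` are illustrative and not part of any library.

```python
from collections import Counter

sentences = ["I love machine learning", "Machine learning is amazing"]

def tokenize(text):
    # Assumption: lowercase and split on whitespace (no punctuation handling)
    return text.lower().split()

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

def to_vector(text):
    # Count each token, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(s) for s in sentences]
print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(vector, sentence)
```

Each printed vector corresponds to one row of the table, with the columns in vocabulary order.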

## **Limitations of BoW**
- **Ignores word meaning** (context is lost).
- **Ignores word order** (e.g., "not good" and "good" differ by a single count, and "dog bites man" and "man bites dog" get identical vectors).
- **Creates large sparse vectors** for big datasets.
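
The word-order limitation is easy to see with a toy example. The sketch below uses `collections.Counter` as a stand-in for a bag of words, assuming lowercase whitespace tokenization; the helper `bag` is a hypothetical name used only here.

```python
from collections import Counter

def bag(text):
    # Assumption: lowercase whitespace tokenization
    return Counter(text.lower().split())

# Same words, opposite meaning, identical bags
same_bag = bag("dog bites man") == bag("man bites dog")
print(same_bag)

# "not good" and "good" differ only by one count for "not"
difference = bag("not good") - bag("good")
print(dict(difference))
```

Because only counts survive, a model trained on these vectors has to infer negation and word roles from a single extra count, if at all.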

## **Alternatives to BoW**
- **TF-IDF (Term Frequency-Inverse Document Frequency)** – Weights words by how important they are to a document relative to the corpus.
- **Word embeddings (e.g., Word2Vec, GloVe) and contextual models (e.g., BERT)** – Capture semantic meaning.




In [2]:
!pip install numpy scikit-learn




In [3]:
from sklearn.feature_extraction.text import CountVectorizer



In [4]:
# Sample sentences (note: the first one uses "learn", not "love" as in Step 1)
sentences = ["I learn machine learning", "Machine learning is amazing"]

In [5]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

In [6]:
# Fit and transform sentences into BoW representation
X = vectorizer.fit_transform(sentences)

In [7]:
print("Feature Names (Vocabulary)", vectorizer.get_feature_names_out())
print("\nBag of Words Representation:\n", X.toarray())

Feature Names (Vocabulary) ['amazing' 'is' 'learn' 'learning' 'machine']

Bag of Words Representation:
 [[0 0 1 1 1]
 [1 1 0 1 1]]
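
Note that the vocabulary above is lowercase and has no "I". That follows from the `CountVectorizer` defaults: `lowercase=True` and a `token_pattern` of `r"(?u)\b\w\w+\b"`, which keeps only tokens of two or more word characters. The sketch below imitates those defaults with the standard `re` module. It is an approximation for illustration, not scikit-learn's own code, and `default_like_tokens` is a hypothetical helper name.

```python
import re

# Default token pattern documented for CountVectorizer: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    # Lowercase first, then extract tokens, as the default analyzer does
    return TOKEN_PATTERN.findall(text.lower())

sentences = ["I learn machine learning", "Machine learning is amazing"]
tokens = [default_like_tokens(s) for s in sentences]

# CountVectorizer orders its feature names alphabetically
vocabulary = sorted(set(t for doc in tokens for t in doc))
print(tokens)
print(vocabulary)
```

To keep single-character tokens such as "i", one option is to pass `token_pattern=r"(?u)\b\w+\b"` when creating the `CountVectorizer`.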


### Checking for Sparse Matrix

In [8]:
# Larger corpus with more sentences
corpus = [
    "Machine learning is fun",
    "Deep learning is a subset of machine learning",
    "Natural language processing is a part of AI",
    "AI and machine learning are transforming the world",
    "AI is the future"
]

In [9]:
# Fit and transform the corpus into a sparse matrix
X = vectorizer.fit_transform(corpus)

In [10]:
# Print sparse matrix representation
print("Sparse Matrix Representation:\n", X)

Sparse Matrix Representation:
 <Compressed Sparse Row sparse matrix of dtype 'int64'
	with 29 stored elements and shape (5, 18)>
  Coords	Values
  (0, 9)	1
  (0, 8)	1
  (0, 6)	1
  (0, 4)	1
  (1, 9)	1
  (1, 8)	2
  (1, 6)	1
  (1, 3)	1
  (1, 14)	1
  (1, 11)	1
  (2, 6)	1
  (2, 11)	1
  (2, 10)	1
  (2, 7)	1
  (2, 13)	1
  (2, 12)	1
  (2, 0)	1
  (3, 9)	1
  (3, 8)	1
  (3, 0)	1
  (3, 1)	1
  (3, 2)	1
  (3, 16)	1
  (3, 15)	1
  (3, 17)	1
  (4, 6)	1
  (4, 0)	1
  (4, 15)	1
  (4, 5)	1


In [11]:
# Print dense representation for comparison
print("\nReconstructed Dense Matrix:\n", X.toarray())


Reconstructed Dense Matrix:
 [[0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 1 0 2 1 0 1 0 0 1 0 0 0]
 [1 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0]
 [1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1]
 [1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0]]


In [12]:
# Check if the result is a sparse matrix
from scipy.sparse import csr_matrix


print("\nIs it a Sparse Matrix?", isinstance(X, csr_matrix))


Is it a Sparse Matrix? True
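
To see what the sparse format saves, it helps to count the non-zero entries. The sketch below works on the dense matrix copied from the output above and uses plain Python only. The `(column, value)` pairs it builds are a simplified picture of what a CSR matrix stores per row, not the actual SciPy layout.

```python
# Dense matrix copied from the "Reconstructed Dense Matrix" output above
dense = [
    [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]

rows, cols = len(dense), len(dense[0])
stored = sum(1 for row in dense for value in row if value != 0)
density = stored / (rows * cols)

# Simplified sparse view of row 1: only (column index, value) for non-zeros
row1 = [(j, value) for j, value in enumerate(dense[1]) if value != 0]

print(rows, cols, stored, round(density, 3))
print(row1)
```

With the real matrix, `X.shape` and `X.nnz` report the same shape and stored-element count without converting to a dense array.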


In [16]:
corpus = [
    "Books are a gateway to knowledge, imagination, and history. They transport readers to different worlds, introduce new ideas, and expand perspectives.",
    "Fiction books allow readers to experience adventure, romance, and mystery, while non-fiction books educate about science, history, and personal growth.",
    "Classics like ‘Pride and Prejudice’ and ‘1984’ continue to inspire generations. Modern books explore contemporary issues and futuristic possibilities.",
    "Libraries and bookstores offer endless choices, from biographies to fantasy novels. Reading helps improve vocabulary, critical thinking, and creativity.",
    "Digital books and audiobooks have made reading more accessible, but many still prefer the feel of a physical book.",
    "Some people collect rare books, while others enjoy bestsellers. Whether it’s a thrilling novel or an insightful self-help book, reading is a habit that enriches the mind.",
    "Books preserve knowledge, document culture, and inspire new ideas, making them an essential part of human civilization."
]

In [17]:
# Reuse the vectorizer created earlier; fit_transform() relearns the vocabulary on the new corpus
X = vectorizer.fit_transform(corpus)

In [19]:
vectorizer.vocabulary_

{'books': 12,
 'are': 7,
 'gateway': 40,
 'to': 100,
 'knowledge': 58,
 'imagination': 50,
 'and': 6,
 'history': 47,
 'they': 97,
 'transport': 101,
 'readers': 87,
 'different': 24,
 'worlds': 105,
 'introduce': 54,
 'new': 68,
 'ideas': 49,
 'expand': 32,
 'perspectives': 79,
 'fiction': 37,
 'allow': 4,
 'experience': 33,
 'adventure': 3,
 'romance': 89,
 'mystery': 67,
 'while': 104,
 'non': 69,
 'educate': 27,
 'about': 1,
 'science': 90,
 'personal': 78,
 'growth': 42,
 'classics': 17,
 'like': 60,
 'pride': 85,
 'prejudice': 83,
 '1984': 0,
 'continue': 20,
 'inspire': 53,
 'generations': 41,
 'modern': 65,
 'explore': 34,
 'contemporary': 19,
 'issues': 56,
 'futuristic': 39,
 'possibilities': 81,
 'libraries': 59,
 'bookstores': 13,
 'offer': 73,
 'endless': 28,
 'choices': 15,
 'from': 38,
 'biographies': 10,
 'fantasy': 35,
 'novels': 71,
 'reading': 88,
 'helps': 46,
 'improve': 51,
 'vocabulary': 102,
 'critical': 22,
 'thinking': 98,
 'creativity': 21,
 'digital': 25,
 '

### Using CountVectorizer with trigrams

In [20]:
cv = CountVectorizer(binary=True, ngram_range=(3,3)) # Trigrams only; binary=True records presence (0/1) instead of counts

In [22]:
X = cv.fit_transform(corpus)

In [23]:
cv.vocabulary_

{'books are gateway': 24,
 'are gateway to': 17,
 'gateway to knowledge': 53,
 'to knowledge imagination': 119,
 'knowledge imagination and': 73,
 'imagination and history': 63,
 'and history they': 12,
 'history they transport': 60,
 'they transport readers': 112,
 'transport readers to': 120,
 'readers to different': 99,
 'to different worlds': 115,
 'different worlds introduce': 38,
 'worlds introduce new': 125,
 'introduce new ideas': 68,
 'new ideas and': 82,
 'ideas and expand': 61,
 'and expand perspectives': 10,
 'fiction books allow': 50,
 'books allow readers': 22,
 'allow readers to': 4,
 'readers to experience': 100,
 'to experience adventure': 116,
 'experience adventure romance': 46,
 'adventure romance and': 3,
 'romance and mystery': 104,
 'and mystery while': 14,
 'mystery while non': 81,
 'while non fiction': 123,
 'non fiction books': 84,
 'fiction books educate': 51,
 'books educate about': 25,
 'educate about science': 41,
 'about science history': 1,
 'science his
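
The trigram keys above can be rebuilt by sliding a window of three tokens over each document. The sketch below does this for the first document only, approximating the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters) with the `re` module. The helpers `tokens` and `ngrams` are hypothetical names for illustration.

```python
import re

def tokens(text):
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def ngrams(words, n):
    # Slide a window of length n over the token list
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

doc = ("Books are a gateway to knowledge, imagination, and history. "
       "They transport readers to different worlds, introduce new ideas, "
       "and expand perspectives.")

words = tokens(doc)
trigrams = ngrams(words, 3)
print(len(words), len(trigrams))
print(trigrams[:3])
```

Two details are visible here: the single-character token "a" is dropped before the n-grams are formed (hence "books are gateway"), and n-grams run across sentence boundaries (hence "and history they").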

### Using Bigrams and Trigrams 

In [25]:
cv = CountVectorizer(binary=True, ngram_range=(2,3)) # Bigrams and Trigrams 

In [26]:
X = cv.fit_transform(corpus)

In [27]:
cv.vocabulary_

{'books are': 49,
 'are gateway': 35,
 'gateway to': 108,
 'to knowledge': 242,
 'knowledge imagination': 149,
 'imagination and': 129,
 'and history': 25,
 'history they': 122,
 'they transport': 228,
 'transport readers': 244,
 'readers to': 202,
 'to different': 234,
 'different worlds': 77,
 'worlds introduce': 254,
 'introduce new': 139,
 'new ideas': 167,
 'ideas and': 125,
 'and expand': 21,
 'expand perspectives': 93,
 'books are gateway': 50,
 'are gateway to': 36,
 'gateway to knowledge': 109,
 'to knowledge imagination': 243,
 'knowledge imagination and': 150,
 'imagination and history': 130,
 'and history they': 26,
 'history they transport': 123,
 'they transport readers': 229,
 'transport readers to': 245,
 'readers to different': 203,
 'to different worlds': 235,
 'different worlds introduce': 78,
 'worlds introduce new': 255,
 'introduce new ideas': 140,
 'new ideas and': 168,
 'ideas and expand': 126,
 'and expand perspectives': 22,
 'fiction books': 102,
 'books allow

### This is where the problem shows up: a sparse matrix

In [28]:
corpus[0]

'Books are a gateway to knowledge, imagination, and history. They transport readers to different worlds, introduce new ideas, and expand perspectives.'

In [30]:
X[0].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=int64)
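
The row above is mostly zeros because one document contains only a small share of all bigrams and trigrams in the corpus. As a rough check, the sketch below counts the distinct bigrams and trigrams in the first document and compares that with the row length. It approximates the `CountVectorizer` defaults with the `re` module; `ngram_set` is a hypothetical helper, and the column count is read off the printed array.

```python
import re

def ngram_set(text, n_min, n_max):
    # Approximation of CountVectorizer(ngram_range=(n_min, n_max)) on one document
    words = re.findall(r"(?u)\b\w\w+\b", text.lower())
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams

doc = ("Books are a gateway to knowledge, imagination, and history. "
       "They transport readers to different worlds, introduce new ideas, "
       "and expand perspectives.")

active = len(ngram_set(doc, 2, 3))  # columns set to 1 for this document
total_columns = 256                 # length of the array printed above
print(active, total_columns, round(active / total_columns, 3))
```

Adding longer n-grams or more documents grows the number of columns much faster than the number of non-zero entries per row, which is why the sparse format matters.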

# TF-IDF 

* TF-IDF evaluates how important a word is in a document relative to the entire corpus.

* **Term Frequency (TF)** measures how often a word appears in a document (or sentence).
* **Formula:**
```

    Term Frequency (TF) = (Number of times the word appears in the sentence) / (Total number of words in the sentence)

```

* **Inverse Document Frequency (IDF)** down-weights words that are common across documents and increases the importance of rare words.
* **Formula:**
```

    IDF = log( Total number of documents / Number of documents containing the word )

```

* **TF-IDF** is the product of the two: `TF-IDF = TF × IDF`.
* A **higher TF-IDF** means the word is frequent in that document but rare in the rest of the corpus.
* A **lower TF-IDF** means the word is common across documents and less informative.
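
The two formulas can be applied by hand to a toy corpus. This is a minimal sketch in plain Python, assuming whitespace tokenization, the natural logarithm, and that every queried term occurs in at least one document. The three documents and the helper names `tf`, `idf` and `tf_idf` are made up for illustration.

```python
import math

docs = [
    "machine learning is fun",
    "deep learning is a subset of machine learning",
    "ai is the future",
]
tokenized = [d.split() for d in docs]

def tf(term, words):
    # Occurrences of the term divided by the document length
    return words.count(term) / len(words)

def idf(term, all_docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for words in all_docs if term in words)
    return math.log(len(all_docs) / containing)

def tf_idf(term, words, all_docs):
    return tf(term, words) * idf(term, all_docs)

# Scores for three words of the first document
scores = {term: tf_idf(term, tokenized[0], tokenized)
          for term in ["is", "learning", "fun"]}
for term, score in scores.items():
    print(term, round(score, 4))
```

"is" appears in every document, so its IDF is log(1) = 0 and its score vanishes, while "fun" appears in one document only and gets the highest score of the three.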

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
cv = TfidfVectorizer()

In [33]:
X = cv.fit_transform(corpus)

In [35]:
corpus[0]

'Books are a gateway to knowledge, imagination, and history. They transport readers to different worlds, introduce new ideas, and expand perspectives.'

In [34]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.23683366, 0.24928945, 0.        , 0.        ,
        0.        , 0.        , 0.11841683, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.24928945,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.24928945, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.24928945, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.20693165, 0.        , 0.20693165,
        0.24928945, 0.        , 0.        , 0.        , 0.24928945,
        0.        , 0.        , 0.        , 0.20693165, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.20693165, 0.        ,
        0.        , 0.        , 0.        , 0.  

### Let's say I want to see the top features only

In [36]:
cv = TfidfVectorizer(max_features=10)

In [37]:
X = cv.fit_transform(corpus)

In [41]:
# The 10 terms kept by max_features (ranked by term frequency across the corpus, not by TF-IDF score)
cv.vocabulary_

{'books': 2,
 'to': 9,
 'and': 0,
 'history': 4,
 'readers': 6,
 'new': 5,
 'fiction': 3,
 'reading': 7,
 'the': 8,
 'book': 1}

In [38]:
corpus[0]

'Books are a gateway to knowledge, imagination, and history. They transport readers to different worlds, introduce new ideas, and expand perspectives.'

In [39]:
X[0].toarray()

array([[0.43760176, 0.        , 0.21880088, 0.        , 0.38235129,
        0.38235129, 0.38235129, 0.        , 0.        , 0.56749745]])
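
These values do not come from the plain formulas given earlier. By default, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below recomputes the first row in plain Python under those assumptions. The vocabulary order is copied from the output above, tokenization is approximated with the `re` module, and the helper names are hypothetical.

```python
import math
import re

corpus = [
    "Books are a gateway to knowledge, imagination, and history. They transport readers to different worlds, introduce new ideas, and expand perspectives.",
    "Fiction books allow readers to experience adventure, romance, and mystery, while non-fiction books educate about science, history, and personal growth.",
    "Classics like ‘Pride and Prejudice’ and ‘1984’ continue to inspire generations. Modern books explore contemporary issues and futuristic possibilities.",
    "Libraries and bookstores offer endless choices, from biographies to fantasy novels. Reading helps improve vocabulary, critical thinking, and creativity.",
    "Digital books and audiobooks have made reading more accessible, but many still prefer the feel of a physical book.",
    "Some people collect rare books, while others enjoy bestsellers. Whether it’s a thrilling novel or an insightful self-help book, reading is a habit that enriches the mind.",
    "Books preserve knowledge, document culture, and inspire new ideas, making them an essential part of human civilization.",
]

# Column order copied from the cv.vocabulary_ output above
vocabulary = ["and", "book", "books", "fiction", "history",
              "new", "readers", "reading", "the", "to"]

def tokens(text):
    # Approximates the default analyzer: lowercase, tokens of 2+ word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokens(text) for text in corpus]
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for words in docs if term in words)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Raw count times smoothed IDF, then L2-normalise the row
raw = [docs[0].count(term) * smooth_idf(term) for term in vocabulary]
norm = math.sqrt(sum(value * value for value in raw))
row0 = [round(value / norm, 8) for value in raw]
print(row0)
```

If these assumptions hold, the printed row agrees with the array above to the displayed precision. "to" scores highest in this row because it occurs twice in the document and appears in fewer documents than "and".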