# What is One-Hot Encoding?

One-hot encoding represents each word as a vector of zeros with a single 1 at that word's index in the vocabulary.


vocab = ["cat", "dog", "fish"]

**One-hot encoded vectors:**

"cat"  → [1, 0, 0]  
"dog"  → [0, 1, 0]  
"fish" → [0, 0, 1]

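The mapping above can be reproduced in a few lines. This is a minimal sketch using NumPy; the helper name `one_hot` and the `word_to_index` dictionary are hypothetical, chosen only for illustration.

```python
import numpy as np

vocab = ["cat", "dog", "fish"]
# Map each word to its position in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index (hypothetical helper)."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

for word in vocab:
    print(word, one_hot(word))
```
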

# ❌ Major Cons of One-Hot Encoding

**1️⃣ Sparsity (Wasted Space & Memory)**

With a large vocabulary, every vector is as long as the vocabulary and holds exactly one non-zero entry, so almost all of the storage goes to zeros.

In [1]:
import numpy as np

# A large vocab (e.g., 10,000 words)
vocab_size = 10000
word_index = 4321  # Arbitrary word index

# One-hot encoding that word
vector = np.zeros(vocab_size)
vector[word_index] = 1

print(f"Vector size: {len(vector)}")
print(f"Non-zero entries: {np.count_nonzero(vector)}")


Vector size: 10000
Non-zero entries: 1
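To put a number on the wasted space, the sketch below compares the memory of the dense vector with the memory needed to store only the index of the 1. It assumes NumPy's default `float64` dtype for `np.zeros`; the variable names are illustrative.

```python
import numpy as np

vocab_size = 10000
word_index = 4321

dense = np.zeros(vocab_size)  # float64 by default: 8 bytes per entry
dense[word_index] = 1

dense_bytes = dense.nbytes                   # memory used by the full vector
index_bytes = np.dtype(np.int64).itemsize    # memory needed to store just the index

print(f"Dense vector: {dense_bytes} bytes")
print(f"Index only:   {index_bytes} bytes")
```

The same information (which word it is) fits in a single integer, which is why sparse formats store positions instead of full vectors.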


**2️⃣ No Semantic Meaning or Similarity**

Words like "king" and "queen", or "run" and "jog", are treated as completely unrelated: any two distinct one-hot vectors are orthogonal, so their similarity is always 0.

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

# One-hot vectors
king = np.array([[1, 0, 0, 0]])
queen = np.array([[0, 1, 0, 0]])
run = np.array([[0, 0, 1, 0]])
bottle = np.array([[0, 0, 0, 1]])

# Cosine similarity is 0 for every pair of distinct words, related or not
print("king vs queen:", cosine_similarity(king, queen)[0][0])
print("run vs bottle:", cosine_similarity(run, bottle)[0][0])


king vs queen: 0.0
run vs bottle: 0.0
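The same result holds for every pair at once. As a small sketch, the rows of an identity matrix can stand in for the one-hot vectors of a four-word vocabulary (the word list here is just for labelling), and the full pairwise similarity matrix comes out as the identity: each word is similar only to itself.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

words = ["king", "queen", "run", "bottle"]

# Row i of the identity matrix is the one-hot vector of words[i]
one_hot_vectors = np.eye(len(words))

# Pairwise cosine similarity between all one-hot vectors
similarity = cosine_similarity(one_hot_vectors)
print(similarity)
```
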


**3️⃣ No Context / Word Order Information**

One-hot encoding can’t tell if the word is the subject or object, or what other words are nearby.

Sentence 1: "Dog bites man"
Sentence 2: "Man bites dog"

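Both sentences use the same words, and each word maps to the same one-hot vector wherever it appears, even though the meanings are opposite.

The vector for a word carries no information about its position or role, so there is no way to tell who did what to whom.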
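One way to see this is to summarize each sentence by adding up the one-hot vectors of its words. That summing step is an assumption made here for illustration (a common simplification), and `sentence_vector` is a hypothetical helper. Under that assumption both sentences collapse to the same vector.

```python
import numpy as np

vocab = {"dog": 0, "bites": 1, "man": 2}

def sentence_vector(sentence):
    """Sum the one-hot vectors of all words in the sentence (hypothetical helper)."""
    vec = np.zeros(len(vocab), dtype=int)
    for word in sentence.lower().split():
        vec[vocab[word]] += 1
    return vec

v1 = sentence_vector("Dog bites man")
v2 = sentence_vector("Man bites dog")

print(v1, v2)
print("Identical:", np.array_equal(v1, v2))
```
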

**4️⃣ Out-of-Vocabulary (OOV) Problem**

Problem:
If a word appears at test time that was not in the training vocabulary, it has no index, so it cannot be encoded.

In [3]:
vocab = {"cat": 0, "dog": 1, "fish": 2}
test_word = "lion"


if test_word in vocab:
    vec = np.zeros(len(vocab))
    vec[vocab[test_word]] = 1
else:
    print(f"'{test_word}' not in vocabulary!")


'lion' not in vocabulary!


# 🧺 What is Bag of Words (BoW)?

BoW is a text representation technique that represents each document as a vector of word counts over a fixed vocabulary.

It counts the occurrences of each word in a document.

It ignores grammar, word order, and context, and keeps only word frequency.

It's used for text classification, sentiment analysis, and many traditional NLP tasks.

📘 Example:

Corpus:


Doc1: "I love NLP"
Doc2: "NLP is fun"


Vocabulary:


["I", "love", "NLP", "is", "fun"]

Vector for:

Doc1 → [1, 1, 1, 0, 0]

Doc2 → [0, 0, 1, 1, 1]
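The vectors above can be computed by hand with plain Python. This is a minimal sketch that splits on whitespace and keeps the original casing; `bow_vector` is a hypothetical helper, not a library function.

```python
vocabulary = ["I", "love", "NLP", "is", "fun"]
docs = ["I love NLP", "NLP is fun"]

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document (hypothetical helper)."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocabulary]

vectors = [bow_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(f"{doc!r} -> {vec}")
```

Note that scikit-learn's `CountVectorizer`, with its default settings, lowercases text and drops single-character tokens such as "I", so its vocabulary for the same corpus would differ slightly from this hand-built one.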

# ❌ Main Disadvantages of BoW

**1️⃣ No Understanding of Word Meaning (No Semantics)**

Problem:
Words like "good" and "great" are treated as completely unrelated — even though they mean similar things.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This phone is good", "This phone is great"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['good' 'great' 'is' 'phone' 'this']
[[1 0 1 1 1]
 [0 1 1 1 1]]


**Even though "good" and "great" are near-synonyms, they get separate, unrelated columns, so the two vectors differ in exactly those positions.**

**BoW doesn’t understand meaning or similarity.**
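A quick way to check this is to add a third, hypothetical review with the opposite sentiment and compare cosine similarities. In this sketch the "good" review is exactly as similar to the "terrible" review as it is to the "great" one, because the similarity comes only from the shared words "this", "phone", and "is".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The third document is a hypothetical addition with opposite sentiment
docs = ["This phone is good", "This phone is great", "This phone is terrible"]

X = CountVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

good_vs_great = sim[0, 1]
good_vs_terrible = sim[0, 2]

print("good vs great:   ", good_vs_great)
print("good vs terrible:", good_vs_terrible)
```
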

**2️⃣ No Word Order or Grammar**

Problem:
BoW doesn't know who is doing what to whom — so two opposite sentences can have the same vector.

In [5]:
docs = ["dog bites man", "man bites dog"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())


['bites' 'dog' 'man']
[[1 1 1]
 [1 1 1]]



**Both vectors are identical even though "dog bites man" and "man bites dog" have different meanings!**

**Word position and sentence structure are completely lost.**


**3️⃣ High Dimensionality (Sparsity)**

Problem:
If your vocabulary is large (e.g. 50,000+ words), each document becomes a huge sparse vector — full of zeros.

In [6]:
import numpy as np

vocab_size = 10000
vector = np.zeros(vocab_size)
vector[100] = 1  # Only one word used in the doc

print(f"Vector length: {len(vector)}")
print(f"Non-zero entries: {np.count_nonzero(vector)}")


Vector length: 10000
Non-zero entries: 1
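In practice `CountVectorizer` already returns a SciPy sparse matrix, which stores only the non-zero counts. The sketch below, using a small made-up corpus, reports how many entries are actually non-zero; with realistic vocabularies the density is far lower.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Small illustrative corpus (an assumption, not real data)
docs = ["I love NLP", "NLP is fun", "dog bites man"]

X = CountVectorizer().fit_transform(docs)
n_docs, n_features = X.shape
density = X.nnz / (n_docs * n_features)  # fraction of entries that are non-zero

print("Is sparse:", issparse(X))
print("Shape:", X.shape)
print("Stored non-zero entries:", X.nnz)
print(f"Density: {density:.2f}")
```
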


**4️⃣ Out-of-Vocabulary (OOV) Problem**


Problem:
A BoW model can't handle new words that were not seen during training.

In [7]:
train_docs = ["NLP is fun"]
test_docs = ["AI is fun"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# Now transform test data
X_test = vectorizer.transform(test_docs)

print(vectorizer.get_feature_names_out())
print(X_test.toarray())


['fun' 'is' 'nlp']
[[1 1 0]]


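**The new word "AI" is silently dropped, so the test vector only reflects "is" and "fun", and a downstream model may misbehave on text that relies on unseen words.**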
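Because unseen words are dropped without any warning, it can help to check for them explicitly. The sketch below uses the vectorizer's own analyzer (`build_analyzer()`) to tokenize a test sentence the same way the vectorizer does, then lists tokens missing from the fitted vocabulary. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["NLP is fun"])

# The analyzer applies the same lowercasing and tokenization as transform()
analyzer = vectorizer.build_analyzer()
test_tokens = analyzer("AI is fun")

# Tokens that the fitted vocabulary has never seen
oov_tokens = [t for t in test_tokens if t not in vectorizer.vocabulary_]

print("Test tokens:", test_tokens)
print("Out-of-vocabulary:", oov_tokens)
```
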

# ✅ Advantages of Bag of Words (BoW)

**1️⃣ Simple and Easy to Implement**

Very intuitive: just count how many times each word appears.

Easy to implement using tools like CountVectorizer in scikit-learn.

🔧 Good for beginners and quick prototypes.

**2️⃣ Fast to Train and Use**

BoW needs no learned embedding model such as Word2Vec or BERT: fitting it only builds a vocabulary and counts words, so it's fast and lightweight.

Works well with classical ML models like:

Naive Bayes

Logistic Regression

SVM

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["I love this phone", "This phone is terrible"]
labels = [1, 0]  # 1 = Positive, 0 = Negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

model = MultinomialNB()
model.fit(X, labels)
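To use the trained model, new text has to pass through the same fitted vectorizer (with `transform`, not `fit_transform`) before calling `predict`. The sketch below repeats the training step so it runs on its own; the two new sentences are hypothetical, and with only two training documents the predictions are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["I love this phone", "This phone is terrible"]
labels = [1, 0]  # 1 = Positive, 0 = Negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Hypothetical unseen texts; reuse the fitted vocabulary via transform()
new_docs = ["I love it", "terrible"]
X_new = vectorizer.transform(new_docs)
predictions = model.predict(X_new)

for text, label in zip(new_docs, predictions):
    print(f"{text!r} -> {label}")
```
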


**3️⃣ Good Performance on Small or Clean Datasets**

If the text is short and vocabulary is limited (e.g., SMS spam detection, news headlines), BoW can give surprisingly good results.

✔️ Works best when:

Grammar doesn’t matter much.

Word presence matters more than order.

# N-grams (Uni-grams, Bi-grams, Tri-grams)


Concept

N-grams are contiguous sequences of N words from a sentence.
They preserve some local word order and context that plain BoW discards.

| N-gram Type | Example for sentence: `"People watch movies"` |
| ----------- | --------------------------------------------- |
| Uni-grams   | ["People", "watch", "movies"]                 |
| Bi-grams    | ["People watch", "watch movies"]              |
| Tri-grams   | ["People watch movies"]                       |


So, instead of just individual words, we also consider word pairs, triplets, etc.
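The table can be reproduced with a sliding window over the tokens. This is a minimal sketch that splits on whitespace; `ngrams` is a hypothetical helper written for illustration.

```python
def ngrams(sentence, n):
    """Return all contiguous n-word sequences in the sentence (hypothetical helper)."""
    tokens = sentence.split()
    # Slide a window of width n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "People watch movies"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```
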

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love NLP",
    "NLP is not boring",
    "I do not like spam"
]

# Bi-grams only (n=2)
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(sentences)

print("Bi-gram Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nBi-gram Vectors:")
print(X.toarray())


Bi-gram Vocabulary:
['do not' 'is not' 'like spam' 'love nlp' 'nlp is' 'not boring' 'not like']

Bi-gram Vectors:
[[0 0 0 1 0 0 0]
 [0 1 0 0 1 1 0]
 [1 0 1 0 0 0 1]]


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love natural language processing",
    "Natural language processing is fun"
]

# Tri-gram representation
vectorizer = CountVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(sentences)

# Output
print("Tri-gram Vocabulary:")
print(vectorizer.get_feature_names_out())

print("\nTri-gram Vectors:")
print(X.toarray())


Tri-gram Vocabulary:
['language processing is' 'love natural language'
 'natural language processing' 'processing is fun']

Tri-gram Vectors:
[[0 1 1 0]
 [1 0 1 1]]


# Advantages of N-grams

**1. Captures context**
   
Explanation:
Unlike unigram Bag-of-Words, which treats "New York" and "York New" the same because it ignores order, N-grams keep local word order, so swapping adjacent words changes the features.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I live in New York", "I live in York New"]

# Using unigrams (BoW)
vectorizer_uni = CountVectorizer(ngram_range=(1,1))
unigrams = vectorizer_uni.fit_transform(texts)
print("Unigrams vocabulary:", vectorizer_uni.vocabulary_)
print(unigrams.toarray())

# Using bigrams
vectorizer_bi = CountVectorizer(ngram_range=(2,2))
bigrams = vectorizer_bi.fit_transform(texts)
print("\nBigrams vocabulary:", vectorizer_bi.vocabulary_)
print(bigrams.toarray())


Unigrams vocabulary: {'live': 1, 'in': 0, 'new': 2, 'york': 3}
[[1 1 1 1]
 [1 1 1 1]]

Bigrams vocabulary: {'live in': 2, 'in new': 0, 'new york': 3, 'in york': 1, 'york new': 4}
[[1 0 1 1 0]
 [0 1 1 0 1]]


Math Insight: In unigram BoW, the vector is

[count(in), count(live), count(new), count(york)]
Both sentences → [1, 1, 1, 1] (identical).

With bigrams, word order changes token identity: "new york" ≠ "york new".

**2. Detects phrases**

Explanation:
Some phrases have meaning that cannot be inferred from individual words ("machine learning" ≠ "machine" + "learning" separately).

In [12]:
texts = ["I love machine learning", "I love learning about machines"]

vectorizer_bi = CountVectorizer(ngram_range=(2,2))
bigrams = vectorizer_bi.fit_transform(texts)
print("Bigrams vocabulary:", vectorizer_bi.vocabulary_)
print(bigrams.toarray())


Bigrams vocabulary: {'love machine': 3, 'machine learning': 4, 'love learning': 2, 'learning about': 1, 'about machines': 0}
[[0 0 0 1 1]
 [1 1 1 0 0]]


Meaning: "machine learning" becomes a single feature rather than two separate unigrams, so the phrase is kept as a unit.

**3. Improves prediction**

Explanation:
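In next-word prediction tasks, conditioning on more preceding words (one for a bigram model, two for a trigram model) generally gives better accuracy than conditioning on fewer, as long as there is enough data to estimate the counts.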

In [13]:
from collections import defaultdict

corpus = "I am hungry. I am sleepy. I am happy.".lower().split()

# Build bigram counts
bigrams = defaultdict(int)
for i in range(len(corpus)-1):
    pair = (corpus[i], corpus[i+1])
    bigrams[pair] += 1

print("Bigram counts:", dict(bigrams))

# Predict the next word after "am"
# (a bigram model conditions only on the single previous word)
prev_word = "am"
predictions = {k[1]: v for k, v in bigrams.items() if k[0] == prev_word}
print("Likely next words after 'I am':", predictions)


Bigram counts: {('i', 'am'): 3, ('am', 'hungry.'): 1, ('hungry.', 'i'): 1, ('am', 'sleepy.'): 1, ('sleepy.', 'i'): 1, ('am', 'happy.'): 1}
Likely next words after 'I am': {'hungry.': 1, 'sleepy.': 1, 'happy.': 1}


Math Insight:
In probability terms,

$$P(\text{word}_t \mid \text{word}_{t-1}, \text{word}_{t-2})$$

is higher for the correct context than

$$P(\text{word}_t \mid \text{word}_{t-1})$$

because more context narrows the distribution.
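Both probabilities can be estimated from raw counts as count(context + word) / count(context). The sketch below does this on a tiny made-up corpus (an assumption chosen so the effect is visible; it is not real data), with hypothetical helpers `p_bigram` and `p_trigram`. Here one extra word of context raises the estimate for "sat" after "cat", though on real data longer contexts can also suffer from too few observations.

```python
from collections import Counter

# Toy corpus, invented for illustration
tokens = "the cat sat . the dog sat . a cat ran .".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_bigram(word, prev):
    """Estimate P(word | prev) from bigram counts (hypothetical helper)."""
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return bigram_counts[(prev, word)] / context_total

def p_trigram(word, prev2, prev1):
    """Estimate P(word | prev2, prev1) from trigram counts (hypothetical helper)."""
    context_total = sum(
        c for (w1, w2, _), c in trigram_counts.items() if (w1, w2) == (prev2, prev1)
    )
    return trigram_counts[(prev2, prev1, word)] / context_total

print("P(sat | cat)      =", p_bigram("sat", "cat"))
print("P(sat | the, cat) =", p_trigram("sat", "the", "cat"))
```
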

**4. Negation handling**
   
Explanation:
With unigrams, "not good" is treated as "not" and "good" separately, losing the negation meaning.
With bigrams, "not good" is a single feature that clearly signals negative sentiment.

In [14]:
texts = ["This is good", "This is not good"]

vectorizer_uni = CountVectorizer(ngram_range=(1,1))
print("Unigrams:", vectorizer_uni.fit_transform(texts).toarray())

vectorizer_bi = CountVectorizer(ngram_range=(2,2))
print("Bigrams:", vectorizer_bi.fit_transform(texts).toarray())
print("Bigrams vocab:", vectorizer_bi.vocabulary_)


Unigrams: [[1 1 0 1]
 [1 1 1 1]]
Bigrams: [[1 0 0 1]
 [0 1 1 1]]
Bigrams vocab: {'this is': 3, 'is good': 0, 'is not': 1, 'not good': 2}
