#### One-Hot Encoding in NLP
One-hot encoding is a method of converting categorical text data (words) into numerical binary vectors.

Each word in the vocabulary is assigned a unique index and is represented as a vector whose length equals the vocabulary size, with a `1` at the word's index and `0` in every other position.
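As a minimal sketch of that definition, the mapping can be written by hand without any library. The three-word vocabulary and the helper name `one_hot` are illustrative assumptions, not part of any library.

```python
# Illustrative vocabulary (an assumption for this sketch); its order defines the indices
vocab = ["i", "love", "nlp"]

# Assign each word a unique index
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of length len(vocab) with a 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in vocab:
    print(word, one_hot(word))
# i [1, 0, 0]
# love [0, 1, 0]
# nlp [0, 0, 1]
```

The vector length grows with the vocabulary, which is why libraries usually offer a sparse representation for larger vocabularies.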

---

#### **Example: One-Hot Encoding Words in a Sentence**
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

# Define vocabulary
words = ["nlp", "love", "i", "is", "amazing", "and"]

# Convert words to integer labels (LabelEncoder assigns indices in sorted order)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(words)

# Reshape to a column, since OneHotEncoder expects 2D input
integer_encoded = integer_encoded.reshape(-1, 1)

# Apply One-Hot Encoding (sparse_output requires scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

# Display Results
print("Vocabulary:", words)
print("Integer Encoded:", integer_encoded.flatten())
print("One-Hot Encoded:\n", onehot_encoded)
```

`LabelEncoder` sorts the words alphabetically before numbering them, so `amazing` gets index 0 and `nlp` gets index 5:

```
Vocabulary: ['nlp', 'love', 'i', 'is', 'amazing', 'and']
Integer Encoded: [5 4 2 3 0 1]
One-Hot Encoded:
 [[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]]
```

The same encoding can be built from raw text by tokenizing a sentence first:

```python
# Requires: pip install nltk scikit-learn
import nltk
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import OneHotEncoder

# Tokenizer data used by word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

text = "I love NLP and NLP is amazing"
tokens = word_tokenize(text.lower())  # Tokenizing the text

print("Tokens:", tokens)
```

```
Tokens: ['i', 'love', 'nlp', 'and', 'nlp', 'is', 'amazing']
```

```python
vocab = list(dict.fromkeys(tokens))  # Unique words, in order of first appearance
print("Vocabulary:", vocab)
```

```
Vocabulary: ['i', 'love', 'nlp', 'and', 'is', 'amazing']
```

```python
onehot_encoder = OneHotEncoder(sparse_output=False)

# OneHotEncoder accepts strings directly; reshape the vocabulary into a column
vocab_column = np.array(vocab).reshape(-1, 1)
onehot_vectors = onehot_encoder.fit_transform(vocab_column)

# Creating a dictionary for word-to-one-hot mapping
one_hot_dict = {word: onehot_vectors[i] for i, word in enumerate(vocab)}

print("\nOne-Hot Encoding:")
for word, vec in one_hot_dict.items():
    print(f"{word}: {vec}")
```

```
One-Hot Encoding:
i: [0. 0. 1. 0. 0. 0.]
love: [0. 0. 0. 0. 1. 0.]
nlp: [0. 0. 0. 0. 0. 1.]
and: [0. 1. 0. 0. 0. 0.]
is: [0. 0. 0. 1. 0. 0.]
amazing: [1. 0. 0. 0. 0. 0.]
```


#### Advantages of One-Hot Encoding
- ✅ Simple and easy to implement
- ✅ Works well with small vocabularies
- ✅ Good for rule-based NLP models

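#### Limitations of One-Hot Encoding
- ❌ Creates high-dimensional sparse vectors: each vector is as long as the vocabulary, which becomes costly for large vocabularies
- ❌ Does not capture word relationships or context: every pair of distinct words is equally dissimilar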
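
Both limitations can be seen with a few lines of NumPy. This is only a sketch under assumed values: the vocabulary size of 10,000 and the word indices are hypothetical, chosen purely for illustration, and `one_hot` is a helper defined here rather than a library function.

```python
import numpy as np

vocab_size = 10_000  # hypothetical vocabulary size

def one_hot(index, size):
    """Build a one-hot vector of the given size with a 1 at `index`."""
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector

# Hypothetical indices for two related words, e.g. "good" and "great"
good = one_hot(42, vocab_size)
great = one_hot(1337, vocab_size)

# Sparsity: only one of the 10,000 entries is non-zero
sparsity = 1.0 - np.count_nonzero(good) / vocab_size
print("Fraction of zeros:", sparsity)

# Similarity: the dot product of any two different one-hot vectors is 0,
# so the encoding cannot express that "good" is closer to "great"
# than to any unrelated word
print("good . great =", np.dot(good, great))
print("good . good  =", np.dot(good, good))
```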

#### When to Use One-Hot Encoding?
- Text classification with a small vocabulary
- Rule-based NLP tasks
- Baseline models before moving to word embeddings (Word2Vec, BERT, etc.)
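
For the baseline use case, one common approach is to sum the one-hot vectors of a sentence's words into a single count vector (a bag-of-words representation) that a simple classifier can consume. The sketch below assumes a small hand-written vocabulary and naive whitespace tokenization. `encode_sentence` is a hypothetical helper, and words outside the vocabulary are simply skipped.

```python
import numpy as np

# Small fixed vocabulary (an assumption for this sketch)
vocab = ["i", "love", "nlp", "and", "is", "amazing"]
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def encode_sentence(sentence):
    """Sum the one-hot vectors of known words into one count vector."""
    counts = np.zeros(len(vocab), dtype=int)
    for token in sentence.lower().split():  # naive whitespace tokenization
        idx = word_to_index.get(token)
        if idx is not None:  # skip out-of-vocabulary words
            counts[idx] += 1
    return counts

print(encode_sentence("I love NLP and NLP is amazing"))  # [1 1 2 1 1 1]
print(encode_sentence("Python is amazing"))              # [0 0 0 0 1 1]
```

Each sentence maps to a vector of the same length regardless of how many words it contains, which is what makes this representation convenient as a first baseline.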