
https://medium.com/@kalyan45/natural-language-processing-one-hot-encoding-5b31f76b09a0

One-Hot Encoding is a common method in Natural Language Processing (NLP) for representing text data as numerical values. It transforms each word or token in a text dataset into a unique vector where only one element is "hot" (set to 1), and all others are "cold" (set to 0). This binary representation is especially useful in machine learning, where algorithms require numerical input.

### How One-Hot Encoding Works
1. Vocabulary Creation: First, a vocabulary is created, consisting of all unique words (or tokens) in the dataset.
2. Binary Vector Representation: Each word is assigned a binary vector whose length equals the total number of unique words in the vocabulary.
3. Single Active Position: For each word, only one element in the vector is set to 1 (the position of that word's index in the vocabulary), while all other elements are set to 0.

### Example
Let's say we have a small vocabulary based on the sentence: "I like NLP and NLP likes me."

Vocabulary: ["I", "like", "NLP", "and", "likes", "me"]
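
As a minimal sketch of the three steps above applied to this vocabulary, the snippet below builds one vector per word using plain Python lists. The helper name `one_hot` is illustrative and does not come from the original article.

```python
# Vocabulary from the example sentence, in order of first appearance
vocabulary = ["I", "like", "NLP", "and", "likes", "me"]

def one_hot(word, vocabulary):
    """Return a list with a 1 at the word's vocabulary index and 0 elsewhere."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]

# Print one vector per vocabulary word
for word in vocabulary:
    print(f"{word:>5}: {one_hot(word, vocabulary)}")
```

For instance, "NLP" sits at index 2 of the six-word vocabulary, so its vector is `[0, 0, 1, 0, 0, 0]`. The repeated "NLP" in the sentence does not add a second entry, because the vocabulary holds each word only once.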



### Advantages of One-Hot Encoding
- Simple and easy to implement.
- Efficient for small vocabularies.
### Disadvantages
- High Dimensionality: Each vector is as long as the vocabulary, so large vocabularies produce very high-dimensional vectors that are inefficient in terms of storage and computation.
- Sparse Vectors: All but one element in each vector is 0, so most of the stored values carry no information.
- No Semantic Information: Every pair of distinct words gets vectors that are equally far apart, even when the words have similar meanings, so the encoding carries no information about word relationships or context.
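
The last point can be checked directly: the cosine similarity between the one-hot vectors of any two different words is 0, however related the words are. The sketch below is only an illustration; the three-word vocabulary and the helper names are assumptions made for this example.

```python
import math

vocabulary = ["like", "likes", "NLP"]  # hypothetical toy vocabulary

def one_hot(word):
    """One-hot vector for a word in the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(one_hot("like"), one_hot("likes")))  # 0.0
print(cosine_similarity(one_hot("like"), one_hot("NLP")))    # 0.0
print(cosine_similarity(one_hot("like"), one_hot("like")))   # 1.0
```

"like" and "likes" come out exactly as dissimilar as "like" and "NLP", which is the limitation described above.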

### Example 1

In [17]:
paragraph = """I love NLP. I love machine learning. NLP loves me"""

In [29]:
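import re  # regular expressions, used below to strip punctuation
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download('punkt')      # uncomment on first run: tokenizer models
# nltk.download('stopwords')  # uncomment on first run: stop-word lists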

# Tokenize paragraph into sentences
sentences = sent_tokenize(paragraph)

# Remove punctuation from each sentence
sentences = [re.sub(r'[^\w\s]', '', sentence) for sentence in sentences]

print("Sentences:", sentences)

Sentences: ['I love NLP', 'I love machine learning', 'NLP loves me']


In [39]:
# Step 1: Load the stop-word list, then lowercase and tokenize each sentence
# (stop words are filtered out in Step 2, when the vocabulary is built)
stop_words = set(stopwords.words('english'))
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

In [41]:
tokenized_sentences

[['i', 'love', 'nlp'],
 ['i', 'love', 'machine', 'learning'],
 ['nlp', 'loves', 'me']]

In [43]:
# Step 2: Build a sorted vocabulary of unique words, excluding stop words
vocabulary = sorted(set(word for sentence in tokenized_sentences for word in sentence if word not in stop_words))
vocabulary

['learning', 'love', 'loves', 'machine', 'nlp']

In [45]:
# Step 3: Create one-hot encoding dictionary (row i of the identity matrix encodes word i)
identity = np.eye(len(vocabulary))
one_hot_dict = {word: identity[i] for i, word in enumerate(vocabulary)}
one_hot_dict

{'learning': array([1., 0., 0., 0., 0.]),
 'love': array([0., 1., 0., 0., 0.]),
 'loves': array([0., 0., 1., 0., 0.]),
 'machine': array([0., 0., 0., 1., 0.]),
 'nlp': array([0., 0., 0., 0., 1.])}

In [47]:
# Step 4: Encode sentences (stop words are not in one_hot_dict, so they are skipped)
encoded_sentences = [[one_hot_dict[word] for word in sentence if word in one_hot_dict] for sentence in tokenized_sentences]
encoded_sentences

[[array([0., 1., 0., 0., 0.]), array([0., 0., 0., 0., 1.])],
 [array([0., 1., 0., 0., 0.]),
  array([0., 0., 0., 1., 0.]),
  array([1., 0., 0., 0., 0.])],
 [array([0., 0., 0., 0., 1.]), array([0., 0., 1., 0., 0.])]]

In [49]:
# Display the one-hot encoding for each sentence.
# Stop words were skipped during encoding, so pair each vector with the
# word that was actually kept rather than indexing into the full token list.
for i, sentence_tokens in enumerate(tokenized_sentences):
    print(f"\nSentence {i+1}:", sentences[i])
    kept_words = [word for word in sentence_tokens if word in one_hot_dict]
    for word, vector in zip(kept_words, encoded_sentences[i]):
        print(f"  Word '{word}' -> One-hot: {vector}")


Sentence 1: I love NLP
  Word 'love' -> One-hot: [0. 1. 0. 0. 0.]
  Word 'nlp' -> One-hot: [0. 0. 0. 0. 1.]

Sentence 2: I love machine learning
  Word 'love' -> One-hot: [0. 1. 0. 0. 0.]
  Word 'machine' -> One-hot: [0. 0. 0. 1. 0.]
  Word 'learning' -> One-hot: [1. 0. 0. 0. 0.]

Sentence 3: NLP loves me
  Word 'nlp' -> One-hot: [0. 0. 0. 0. 1.]
  Word 'loves' -> One-hot: [0. 0. 1. 0. 0.]


### Example 2

In [64]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.preprocessing import OneHotEncoder
import numpy as np


# Sample paragraph
paragraph = """Natural Language Processing (NLP) is a fascinating field that combines computer science, 
artificial intelligence, and linguistics. NLP enables computers to understand, interpret, and respond to 
human language in a valuable way. With applications in various domains such as chatbots, translation, 
and sentiment analysis, NLP has transformed how we interact with technology."""

# Step 1: Define stop words
stop_words = set(stopwords.words('english'))


In [62]:
# Step 2: Tokenize the paragraph into sentences
tokenized_sentences = sent_tokenize(paragraph)
tokenized_sentences

['Natural Language Processing (NLP) is a fascinating field that combines computer science, \nartificial intelligence, and linguistics.',
 'NLP enables computers to understand, interpret, and respond to \nhuman language in a valuable way.',
 'With applications in various domains such as chatbots, translation, \nand sentiment analysis, NLP has transformed how we interact with technology.']

In [73]:
# Step 2 (continued): Tokenize each sentence into words (punctuation and stop words are still present)
tokenized_words = [word_tokenize(sentence) for sentence in tokenized_sentences]
tokenized_words

[['Natural',
  'Language',
  'Processing',
  '(',
  'NLP',
  ')',
  'is',
  'a',
  'fascinating',
  'field',
  'that',
  'combines',
  'computer',
  'science',
  ',',
  'artificial',
  'intelligence',
  ',',
  'and',
  'linguistics',
  '.'],
 ['NLP',
  'enables',
  'computers',
  'to',
  'understand',
  ',',
  'interpret',
  ',',
  'and',
  'respond',
  'to',
  'human',
  'language',
  'in',
  'a',
  'valuable',
  'way',
  '.'],
 ['With',
  'applications',
  'in',
  'various',
  'domains',
  'such',
  'as',
  'chatbots',
  ',',
  'translation',
  ',',
  'and',
  'sentiment',
  'analysis',
  ',',
  'NLP',
  'has',
  'transformed',
  'how',
  'we',
  'interact',
  'with',
  'technology',
  '.']]

In [77]:
# Step 3: Lowercase and tokenize each sentence, keeping only alphanumeric tokens that are not stop words
cleaned_sentences = [
    [word for word in word_tokenize(sentence.lower()) if word.isalnum() and word not in stop_words]
    for sentence in tokenized_sentences
]
cleaned_sentences

[['natural',
  'language',
  'processing',
  'nlp',
  'fascinating',
  'field',
  'combines',
  'computer',
  'science',
  'artificial',
  'intelligence',
  'linguistics'],
 ['nlp',
  'enables',
  'computers',
  'understand',
  'interpret',
  'respond',
  'human',
  'language',
  'valuable',
  'way'],
 ['applications',
  'various',
  'domains',
  'chatbots',
  'translation',
  'sentiment',
  'analysis',
  'nlp',
  'transformed',
  'interact',
  'technology']]

Flattening Process:

The next step flattens the list of lists (cleaned_sentences) into a single list of words.
Before flattening, each element of cleaned_sentences is itself a list (one per sentence).
After flattening, every word from every sentence sits in one list, which is the form needed to build the vocabulary.

In [68]:
# Step 4: Flatten the list of cleaned sentences to create a single list of words
flattened_words = [word for sentence in cleaned_sentences for word in sentence]
flattened_words

['natural',
 'language',
 'processing',
 'nlp',
 'fascinating',
 'field',
 'combines',
 'computer',
 'science',
 'artificial',
 'intelligence',
 'linguistics',
 'nlp',
 'enables',
 'computers',
 'understand',
 'interpret',
 'respond',
 'human',
 'language',
 'valuable',
 'way',
 'applications',
 'various',
 'domains',
 'chatbots',
 'translation',
 'sentiment',
 'analysis',
 'nlp',
 'transformed',
 'interact',
 'technology']

In [71]:
# Step 5: Create a unique vocabulary (set order is arbitrary, so the list order can vary between runs)
vocabulary = list(set(flattened_words))
vocabulary

['nlp',
 'various',
 'interact',
 'field',
 'translation',
 'language',
 'way',
 'processing',
 'valuable',
 'transformed',
 'combines',
 'computer',
 'technology',
 'understand',
 'natural',
 'fascinating',
 'respond',
 'human',
 'interpret',
 'artificial',
 'intelligence',
 'linguistics',
 'sentiment',
 'analysis',
 'chatbots',
 'enables',
 'domains',
 'computers',
 'science',
 'applications']

In [81]:
# Step 6: One-hot encoding
# Reshape the vocabulary into a column (one word per row), the 2-D shape OneHotEncoder expects
vocabulary_array = np.array(vocabulary).reshape(-1, 1)
vocabulary_array

array([['nlp'],
       ['various'],
       ['interact'],
       ['field'],
       ['translation'],
       ['language'],
       ['way'],
       ['processing'],
       ['valuable'],
       ['transformed'],
       ['combines'],
       ['computer'],
       ['technology'],
       ['understand'],
       ['natural'],
       ['fascinating'],
       ['respond'],
       ['human'],
       ['interpret'],
       ['artificial'],
       ['intelligence'],
       ['linguistics'],
       ['sentiment'],
       ['analysis'],
       ['chatbots'],
       ['enables'],
       ['domains'],
       ['computers'],
       ['science'],
       ['applications']], dtype='<U12')

In [89]:
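# Initialize the OneHotEncoder to return a dense array
# (the sparse_output parameter requires scikit-learn 1.2 or later; older releases use sparse=False)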
encoder = OneHotEncoder(sparse_output=False)

In [93]:
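# Fit and transform the vocabulary.
# Columns follow the encoder's alphabetically sorted categories (see encoder.categories_),
# not the order of the vocabulary list, so row i has its 1 at the sorted position of word i.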
one_hot_encoded = encoder.fit_transform(vocabulary_array)
one_hot_encoded

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0 ... (output truncated)

In [95]:
# Step 7: Print one-hot encoded vectors
print("\nOne-Hot Encoded Vectors:")
for word, encoding in zip(vocabulary, one_hot_encoded):
    print(f"{word}: {encoding}")



One-Hot Encoded Vectors:
nlp: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
various: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0.]
interact: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
field: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
translation: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0.]
language: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
way: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1.]
processing: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
valuable: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0.]
transformed: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... (output truncated)

### Example 3

In [99]:
import string
import numpy as np

# Define the corpus of text
corpus = [
	"The quick brown fox jumped over the lazy dog.",
	"She sells seashells by the seashore.",
	"Peter Piper picked a peck of pickled peppers."
]

def tokenize(sentence):
	"""Lowercase, split on whitespace, and strip surrounding punctuation."""
	return [word.strip(string.punctuation) for word in sentence.lower().split()]

# Create a set of unique words in the corpus
unique_words = set()
for sentence in corpus:
	unique_words.update(tokenize(sentence))

# Create a dictionary to map each unique word to an index.
# Set iteration order is arbitrary, so the indices (and the position
# of the 1 in each vector below) can differ from run to run.
word_to_index = {word: i for i, word in enumerate(unique_words)}

# Create one-hot encoded vectors for each word in the corpus
one_hot_vectors = []
for sentence in corpus:
	sentence_vectors = []
	for word in tokenize(sentence):
		vector = np.zeros(len(unique_words))
		vector[word_to_index[word]] = 1
		sentence_vectors.append(vector)
	one_hot_vectors.append(sentence_vectors)

# Print the one-hot encoded vectors 
# for the first sentence
print("One-hot encoded vectors for the first sentence:")
for vector in one_hot_vectors[0]:
	print(vector)


One-hot encoded vectors for the first sentence:
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


### Example 4

In [103]:
import string
import numpy as np

# Define the sentences
sentences = [
	'The cat sat on the mat.',
	'The dog chased the cat.',
	'The mat was soft and fluffy.'
]

# Lowercase each sentence, split on whitespace, and strip surrounding
# punctuation so that "cat" and "cat." map to the same token
tokenized = [
	[word.strip(string.punctuation) for word in sentence.lower().split()]
	for sentence in sentences
]

# Create a sorted vocabulary so that indices are reproducible across runs
vocab = sorted({word for words in tokenized for word in words})

# Create a dictionary to map words to integers
word_to_int = {word: i for i, word in enumerate(vocab)}

# Create a binary vector for each word in each sentence
vectors = []
for words in tokenized:
	sentence_vectors = []
	for word in words:
		binary_vector = np.zeros(len(vocab))
		binary_vector[word_to_int[word]] = 1
		sentence_vectors.append(binary_vector)
	vectors.append(sentence_vectors)

# Print the one-hot encoded vectors for each word in each sentence
for i, words in enumerate(tokenized):
	print(f"Sentence {i + 1}:")
	for word, vector in zip(words, vectors[i]):
		print(f"{word}: {vector}")


Sentence 1:
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
sat: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
on: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
Sentence 2:
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
dog: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
chased: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
cat: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Sentence 3:
the: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
mat: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
was: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
soft: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
and: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
fluffy: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]


### Drawbacks of One-Hot Encoding in NLP
A major drawback of one-hot encoding in NLP is that it produces high-dimensional, sparse vectors that are costly to store and process. Each unique word gets its own binary vector whose length equals the vocabulary size, so the feature space grows with every new word. In addition, one-hot vectors do not capture semantic relationships between words, so models that take them as input may perform poorly on tasks that depend on word meaning. For these reasons, other representations such as word embeddings are frequently used in NLP tasks. Word embeddings map words to low-dimensional dense vectors that capture meaningful relationships between words, which makes them more useful for many NLP tasks.
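
One way to see how the two representations relate: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sketch below uses plain Python to show this. The vocabulary, the embedding dimension of 3, and the random (untrained) values are all assumptions for illustration; they are not learned embeddings and carry no meaning.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

vocabulary = ["learning", "love", "loves", "machine", "nlp"]
embedding_dim = 3  # hypothetical; real embeddings are usually much wider

# Untrained stand-in for an embedding matrix: one dense row per word
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
              for _ in vocabulary]

def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    """Multiply a row vector (length V) by a V x D matrix."""
    return [sum(v * row[d] for v, row in zip(vector, matrix))
            for d in range(len(matrix[0]))]

index = vocabulary.index("nlp")
sparse = one_hot(index, len(vocabulary))
dense = vector_matrix_product(sparse, embeddings)

# The product is exactly the embedding row at the word's index
print(dense == embeddings[index])
# The vector shrinks from vocabulary size to embedding size
print(len(sparse), "->", len(dense))
```

In practice, libraries skip the multiplication and look up the row by index, which avoids materializing the sparse vector at all.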