Word to Vector using One-Hot Encoding
One-Hot Encoding (OHE) is a simple way to represent words as vectors in Natural Language Processing (NLP). However, it is not an actual word embedding technique like Word2Vec or GloVe. Instead, it is a basic way of converting words into numerical format.

1. Understanding One-Hot Encoding for Words
One-Hot Encoding represents each word as a unique binary vector. In this representation:

Each word in the vocabulary is assigned a unique index.

The vector for a word has 1 at the assigned index and 0 elsewhere.

For example, if we have the following vocabulary:

["cat", "dog", "fish", "bird"]
Then, One-Hot Encoding for each word would be:

"cat" → [1, 0, 0, 0]

"dog" → [0, 1, 0, 0]

"fish" → [0, 0, 1, 0]

"bird" → [0, 0, 0, 1]

2. Pros and Cons of One-Hot Encoding
✅ Advantages:
Simple and Easy to Implement – It does not require complex computations.

No Loss of Information – Each word is represented uniquely, maintaining its identity.

Good for Small Vocabulary – Works well when the vocabulary size is small.

❌ Disadvantages:
High Dimensionality – If the vocabulary has N words, each word vector will have N dimensions, leading to sparsity.

No Semantic Meaning – The vectors do not capture relationships between words. For example, "cat" and "dog" are close in meaning, yet their vectors are orthogonal: every pair of distinct words is equally dissimilar.

Memory Inefficient – It requires large amounts of memory for big vocabularies.
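
The last two disadvantages can be checked numerically. The following is a minimal sketch, assuming NumPy and a dense float64 representation; the helper names `cosine_similarity` and `dense_bytes` are illustrative and not part of the original notebook.

```python
import numpy as np

vocab = ["cat", "dog", "fish", "bird"]
# Row i of the identity matrix is the one-hot vector for vocab[i].
vectors = np.eye(len(vocab))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cat, dog, fish = vectors[0], vectors[1], vectors[2]
sim_cat_dog = cosine_similarity(cat, dog)    # distinct words: 0.0
sim_cat_fish = cosine_similarity(cat, fish)  # distinct words: 0.0
sim_cat_cat = cosine_similarity(cat, cat)    # same word: 1.0


def dense_bytes(n_words):
    """Bytes needed for n_words dense vectors of n_words float64 values each."""
    return n_words * n_words * 8


print(sim_cat_dog, sim_cat_fish, sim_cat_cat)
print(dense_bytes(50_000))  # 20_000_000_000 bytes, about 20 GB
```

"cat" is exactly as far from "dog" as it is from "fish", and dense storage grows with the square of the vocabulary size.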



In [1]:
a="Hello i am nasim akram currently learning how to convert the word to vector using the one hot encoder"

In [2]:
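import numpy as np

vocab = ['cat', 'dog', 'fish', 'bird']
# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot_encod(word, vocab_size):
    # Start from an all-zero vector and set the word's index to 1.
    vector = np.zeros(vocab_size)
    index = word_to_index[word]  # raises KeyError for a word outside the vocabulary
    vector[index] = 1
    return vector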


In [3]:
one_hot_vectores={word: one_hot_encod(word, len(vocab)) for  word in vocab}

In [4]:
one_hot_vectores

{'cat': array([1., 0., 0., 0.]),
 'dog': array([0., 1., 0., 0.]),
 'fish': array([0., 0., 1., 0.]),
 'bird': array([0., 0., 0., 1.])}
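
The encoder above looks words up in a dictionary, so a word outside the vocabulary raises a `KeyError`. One common workaround is to reserve an extra slot for unknown words. The sketch below assumes that approach; the `<UNK>` token and the name `one_hot_or_unk` are hypothetical additions, not part of the original notebook.

```python
import numpy as np

UNK = "<UNK>"  # hypothetical placeholder for out-of-vocabulary words
vocab_with_unk = ["cat", "dog", "fish", "bird", UNK]
index_of = {word: i for i, word in enumerate(vocab_with_unk)}


def one_hot_or_unk(word):
    """Return the one-hot vector for `word`, or the <UNK> vector if it is unseen."""
    vector = np.zeros(len(vocab_with_unk))
    vector[index_of.get(word, index_of[UNK])] = 1
    return vector


known = one_hot_or_unk("dog")     # 1 at index 1
unseen = one_hot_or_unk("horse")  # falls back to the <UNK> slot at index 4
print(known, unseen)
```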

In [5]:
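# Change them to vectors.
# Note: 'Brassicae' appears four times, so len(vocabulary) is 12 although only 9 words are unique.

vocabulary=['Brat','Brahman','Brahmin','Brahma','Brass','Brassica','Brassard','Brassiere','Brassicae','Brassicae','Brassicae','Brassicae']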

In [6]:
# Rebuild the index and the encoder for the new vocabulary.
word__to__index={word:i for i ,word in enumerate(vocabulary)}
def one_hot_encod(word,vocab_size):
    vector=np.zeros(vocab_size)
    index=word__to__index[word]
    vector[index]=1
    return vector

one_hot__vectors={word: one_hot_encod(word,len(vocabulary)) for word in vocabulary}

In [7]:
np.unique(vocabulary)

array(['Brahma', 'Brahman', 'Brahmin', 'Brass', 'Brassard', 'Brassica',
       'Brassicae', 'Brassiere', 'Brat'], dtype='<U9')

In [8]:
one_hot__vectors

{'Brat': array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'Brahman': array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'Brahmin': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'Brahma': array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'Brass': array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 'Brassica': array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]),
 'Brassard': array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]),
 'Brassiere': array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]),
 'Brassicae': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])}
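
The vectors above have 12 dimensions for only 9 distinct words. The repeated 'Brassicae' entries keep overwriting the same dictionary key, so the word ends up at index 11 and positions 8 to 10 are never used. A minimal sketch of one way to avoid this is to deduplicate the vocabulary first; the names `raw_vocabulary`, `deduped`, and `one_hot_dedup` are illustrative.

```python
import numpy as np

raw_vocabulary = ['Brat', 'Brahman', 'Brahmin', 'Brahma', 'Brass', 'Brassica',
                  'Brassard', 'Brassiere', 'Brassicae', 'Brassicae',
                  'Brassicae', 'Brassicae']

# dict.fromkeys keeps the first occurrence of each word and preserves order.
deduped = list(dict.fromkeys(raw_vocabulary))
dedup_index = {word: i for i, word in enumerate(deduped)}


def one_hot_dedup(word):
    """One-hot vector whose length equals the number of distinct words."""
    vector = np.zeros(len(deduped))
    vector[dedup_index[word]] = 1
    return vector


print(len(raw_vocabulary), len(deduped))  # 12 9
print(one_hot_dedup('Brassicae'))         # 1 at the last of 9 positions
```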

In [9]:
vocabu=["Nasim","Zasim","Machine learning","Sammer Tishima"]
word_to___index={word:i for i, word in enumerate(vocabu)}

word_to___index

{'Nasim': 0, 'Zasim': 1, 'Machine learning': 2, 'Sammer Tishima': 3}

In [10]:
def Encoder_function(word,vocab_size):
    vector=np.zeros(vocab_size)
    index=word_to___index[word]
    vector[index]=1
    return vector

In [11]:
encoded_vector={word:Encoder_function(word,len(vocabu)) for word in vocabu}

In [12]:
encoded_vector

{'Nasim': array([1., 0., 0., 0.]),
 'Zasim': array([0., 1., 0., 0.]),
 'Machine learning': array([0., 0., 1., 0.]),
 'Sammer Tishima': array([0., 0., 0., 1.])}
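
Because each vector contains a single 1, the encoding is reversible: the position of the 1 identifies the word. The following sketch illustrates the round trip, assuming the same four-entry vocabulary; `encode` and `decode` are hypothetical helper names.

```python
import numpy as np

vocab_list = ["Nasim", "Zasim", "Machine learning", "Sammer Tishima"]


def encode(word):
    """One-hot vector with a 1 at the word's position in vocab_list."""
    vector = np.zeros(len(vocab_list))
    vector[vocab_list.index(word)] = 1
    return vector


def decode(vector):
    """Return the vocabulary entry at the position of the single 1."""
    return vocab_list[int(np.argmax(vector))]


round_trip = [decode(encode(word)) for word in vocab_list]
print(round_trip == vocab_list)  # True
```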

In [33]:
# One-hot encoding for the words of a set of sentences
sentences = [
    "Python is a powerful programming language.",
    "Machine learning is changing the world.",
    "I love solving coding challenges.",
    "Data science is a fascinating field.",
    "The weather is nice today.",
    "How are you doing today?",
    "What is your favorite programming language?",
    "Can you help me with this problem?",
    "I enjoy learning new things every day.",
    "Let’s meet at 5 PM.",
    "Never stop learning, because life never stops teaching.",
    "Hard work always pays off in the end.",
    "Believe in yourself and all that you are.",
    "Success comes to those who never give up.",
    "Challenges make you stronger and wiser.",
    "I told my computer I needed a break, and now it won’t turn on!",
    "Debugging is like being a detective in a crime movie where you are also the murderer.",
    "I have a love-hate relationship with Python’s indentation.",
    "My brain has too many tabs open right now.",
    "Why do programmers prefer dark mode? Because light attracts bugs!"
]





In [34]:
for i in sentences:
    print(i)

Python is a powerful programming language.
Machine learning is changing the world.
I love solving coding challenges.
Data science is a fascinating field.
The weather is nice today.
How are you doing today?
What is your favorite programming language?
Can you help me with this problem?
I enjoy learning new things every day.
Let’s meet at 5 PM.
Never stop learning, because life never stops teaching.
Hard work always pays off in the end.
Believe in yourself and all that you are.
Success comes to those who never give up.
Challenges make you stronger and wiser.
I told my computer I needed a break, and now it won’t turn on!
Debugging is like being a detective in a crime movie where you are also the murderer.
I have a love-hate relationship with Python’s indentation.
My brain has too many tabs open right now.
Why do programmers prefer dark mode? Because light attracts bugs!


In [35]:
l1=[]
for i in sentences:
    
    l1.extend(i.split())
l1


['Python',
 'is',
 'a',
 'powerful',
 'programming',
 'language.',
 'Machine',
 'learning',
 'is',
 'changing',
 'the',
 'world.',
 'I',
 'love',
 'solving',
 'coding',
 'challenges.',
 'Data',
 'science',
 'is',
 'a',
 'fascinating',
 'field.',
 'The',
 'weather',
 'is',
 'nice',
 'today.',
 'How',
 'are',
 'you',
 'doing',
 'today?',
 'What',
 'is',
 'your',
 'favorite',
 'programming',
 'language?',
 'Can',
 'you',
 'help',
 'me',
 'with',
 'this',
 'problem?',
 'I',
 'enjoy',
 'learning',
 'new',
 'things',
 'every',
 'day.',
 'Let’s',
 'meet',
 'at',
 '5',
 'PM.',
 'Never',
 'stop',
 'learning,',
 'because',
 'life',
 'never',
 'stops',
 'teaching.',
 'Hard',
 'work',
 'always',
 'pays',
 'off',
 'in',
 'the',
 'end.',
 'Believe',
 'in',
 'yourself',
 'and',
 'all',
 'that',
 'you',
 'are.',
 'Success',
 'comes',
 'to',
 'those',
 'who',
 'never',
 'give',
 'up.',
 'Challenges',
 'make',
 'you',
 'stronger',
 'and',
 'wiser.',
 'I',
 'told',
 'my',
 'computer',
 'I',
 'needed',
 ...]
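
Splitting on whitespace keeps punctuation and case, so 'language.' and 'language?' (or 'Never' and 'never') become different tokens and would each get their own dimension. A possible normalization step is sketched below, assuming that lowercasing and dropping punctuation is acceptable for the task; the regular expression and the `tokenize` helper are illustrative choices, not part of the original notebook.

```python
import re


def tokenize(sentence):
    """Lowercase the text and keep runs of letters/digits, allowing inner apostrophes."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z0-9]+)*", sentence.lower())


sample = [
    "Python is a powerful programming language.",
    "What is your favorite programming language?",
    "Let’s meet at 5 PM.",
]

tokens = [tok for s in sample for tok in tokenize(s)]
# dict.fromkeys removes repeats while keeping first-seen order.
vocab_norm = list(dict.fromkeys(tokens))

print(tokenize("Let’s meet at 5 PM."))
print(len(tokens), len(vocab_norm))
```

With this tokenizer, both spellings of "language" collapse into one vocabulary entry. Note that hyphenated words such as "love-hate" are split into two tokens.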

In [42]:
from collections import OrderedDict
def unique_words(sentences):
    # Split every sentence on whitespace and keep the first occurrence of each
    # token; a plain dict preserves insertion order (Python 3.7+).
    tokens = (word for sentence in sentences for word in sentence.split())
    return list(dict.fromkeys(tokens))





In [43]:
unique_words(sentences)

['Python',
 'is',
 'a',
 'powerful',
 'programming',
 'language.',
 'Machine',
 'learning',
 'changing',
 'the',
 'world.',
 'I',
 'love',
 'solving',
 'coding',
 'challenges.',
 'Data',
 'science',
 'fascinating',
 'field.',
 'The',
 'weather',
 'nice',
 'today.',
 'How',
 'are',
 'you',
 'doing',
 'today?',
 'What',
 'your',
 'favorite',
 'language?',
 'Can',
 'help',
 'me',
 'with',
 'this',
 'problem?',
 'enjoy',
 'new',
 'things',
 'every',
 'day.',
 'Let’s',
 'meet',
 'at',
 '5',
 'PM.',
 'Never',
 'stop',
 'learning,',
 'because',
 'life',
 'never',
 'stops',
 'teaching.',
 'Hard',
 'work',
 'always',
 'pays',
 'off',
 'in',
 'end.',
 'Believe',
 'yourself',
 'and',
 'all',
 'that',
 'are.',
 'Success',
 'comes',
 'to',
 'those',
 'who',
 'give',
 'up.',
 'Challenges',
 'make',
 'stronger',
 'wiser.',
 'told',
 'my',
 'computer',
 'needed',
 'break,',
 'now',
 'it',
 'won’t',
 'turn',
 'on!',
 'Debugging',
 'like',
 'being',
 'detective',
 'crime',
 'movie',
 'where',
 'also',
 

In [52]:
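def OneHotEncoded(words, size):
    # Return one vector per word, in the same order as `words`.
    # The i-th vector has its 1 at position i, so `size` must be >= len(words).
    list_of_vec = []
    for i in range(len(words)):
        vector = np.zeros(size)
        vector[i] = 1
        list_of_vec.append(vector)
    return list_of_vec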

In [53]:
OneHotEncoded(unique_words(sentences),len(unique_words(sentences)))

[array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0.]),
 array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.
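
The function above encodes the vocabulary itself. To encode a sentence, each token is looked up in the vocabulary and the corresponding one-hot rows are stacked into a matrix of shape (number of tokens, vocabulary size). The following is a minimal sketch on a two-sentence toy corpus; `toy_vocab`, `toy_index`, and `encode_sentence` are illustrative names.

```python
import numpy as np

toy_sentences = [
    "I love solving coding challenges.",
    "I enjoy learning new things every day.",
]
# Ordered, duplicate-free vocabulary built from whitespace tokens.
toy_vocab = list(dict.fromkeys(w for s in toy_sentences for w in s.split()))
toy_index = {w: i for i, w in enumerate(toy_vocab)}


def encode_sentence(sentence):
    """Return a (num_tokens, vocab_size) matrix with one one-hot row per token."""
    ids = [toy_index[w] for w in sentence.split()]
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(len(toy_vocab))[ids]


matrix = encode_sentence("I love solving coding challenges.")
print(matrix.shape)           # (5, 11)
print(matrix.argmax(axis=1))  # [0 1 2 3 4]
```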

In [54]:
from sklearn.preprocessing import OneHotEncoder

In [55]:
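wordss=np.array(unique_words(sentences)).reshape(-1,1)  # one column, one word per row
# scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was
# removed in 1.4. On versions before 1.2, pass sparse=False instead.
encoder=OneHotEncoder(sparse_output=False)
one_hot_encoded=encoder.fit_transform(wordss)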



In [56]:
one_hot_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
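
Unlike the manual encoder, scikit-learn's `OneHotEncoder` does not assign columns in order of first appearance. It sorts the categories (available as `encoder.categories_`), so the first word in the input is generally not the one with a 1 in column 0. The sketch below mimics that ordering with `np.unique`, which also returns sorted distinct values. It uses a short illustrative word list rather than the full vocabulary.

```python
import numpy as np

words = ["Python", "is", "a", "powerful", "programming", "language.",
         "Machine", "learning"]

# Sorted distinct values; uppercase letters sort before lowercase ones.
categories = np.unique(words)
column_of = {word: i for i, word in enumerate(categories)}

print(categories.tolist())
print(column_of["Python"], column_of["is"], column_of["a"])  # 1 3 2
```

Here "Python" is the first input word but lands in column 1, because "Machine" sorts before it.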

In [57]:
for word, vector in zip(wordss,one_hot_encoded):
    print(f"{word}:{vector}")

['Python']:[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
['is']:[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
['a']:[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.