<a href="https://colab.research.google.com/github/nishkalavallabhi/practicalnlp/blob/V_2_0/Ch2/One-Hot%20Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## One Hot Encoding of text

One Hot Encoding treats each word as a category and makes a sparse vector to represent each text.
To keep this notebook simple, we have pre-processed the wages data in the notebook "Pre-Processing of Wages Data for One-Hot Encoding". Pre-processing essentially involved removing features (columns) that duplicate another column; for example, Education and education-num carry the same information, so education is removed.

In this notebook, we will try to predict the salary of a person given their details, and we will see how One-Hot Encoding impacts the accuracy.

In [2]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
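The cell above removes only periods. As a minimal sketch, assuming the corpus may also contain other ASCII punctuation, the same step can be written with `str.translate`; the name `strip_punct` is introduced here for illustration and is not part of the original notebook.

```python
import string

documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Translation table that deletes every ASCII punctuation character
# (strip_punct is a hypothetical name used only in this sketch).
strip_punct = str.maketrans("", "", string.punctuation)

# Lowercase each document, then drop punctuation in a single pass.
processed_docs = [doc.lower().translate(strip_punct) for doc in documents]
print(processed_docs)
```

For this toy corpus the result is the same as with `replace(".", "")`; the difference only shows up on text containing commas, question marks, and similar characters.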

In [3]:
#Build the vocabulary.
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count +1
            vocab[word] = count
print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
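The vocabulary maps each word to a 1-based id in order of first appearance. A minimal sketch of the same construction using `dict.setdefault`, together with an inverse lookup for turning ids back into words, might look like this; `id_to_word` is a hypothetical name that does not appear in the original notebook.

```python
# Same toy corpus as above, repeated so this cell runs on its own.
processed_docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

vocab = {}
for doc in processed_docs:
    for word in doc.split():
        # setdefault only assigns an id the first time a word is seen;
        # len(vocab) + 1 keeps the ids 1-based, matching the cell above.
        vocab.setdefault(word, len(vocab) + 1)

# Inverse mapping (illustrative): id -> word, useful for decoding vectors later.
id_to_word = {idx: word for word, idx in vocab.items()}

print(vocab)
print(id_to_word)
```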


In [4]:
print(vocab)
#Get one hot representation for any string based on this vocabulary.
#If the word exists in the vocabulary, its representation is returned.
#If not, a list of zeroes is returned for that word.
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            #vocab ids start at 1, so subtract 1 to get a valid 0-based list index.
            #Without this, 'food' (id 6) would raise an IndexError.
            temp[vocab[word]-1] = 1
        onehot_encoded.append(temp)
    return onehot_encoded

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [5]:
get_onehot_vector(processed_docs[1]) #one hot representation for a text from our corpus.

[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

In [6]:
get_onehot_vector("man and dog are good")
#one hot representation for a random text, using the above vocabulary

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

In [7]:
get_onehot_vector("man and man are good")


[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]
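Each one-hot vector holds at most a single 1, so it can be stored as just the position of that 1. The sketch below illustrates this sparse view under the assumption that the word with id `i` occupies position `i - 1`; `onehot_indices` and `to_dense` are hypothetical helper names, not functions from the original notebook.

```python
vocab = {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}

def onehot_indices(text, vocab):
    """Return the 0-based position of the 1 for each word, or None for unknown words."""
    # Assumption: ids are 1-based, so id - 1 is the position in the vector.
    return [vocab[w] - 1 if w in vocab else None for w in text.split()]

def to_dense(indices, size):
    """Expand a list of positions back into full one-hot vectors of length `size`."""
    dense = []
    for idx in indices:
        row = [0] * size
        if idx is not None:  # unknown words stay all-zero
            row[idx] = 1
        dense.append(row)
    return dense

indices = onehot_indices("man and dog are good", vocab)
print(indices)
print(to_dense(indices, len(vocab)))
```

With a six-word vocabulary the saving is negligible, but the dense form grows with vocabulary size while the index form stays at one number per word.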