
## **Word Embedding - NLP with Deep Learning** ([Using Google Colab](https://colab.research.google.com/drive/1NkJmlUZWK21flAkqsxV86tXbaORRH5ys?authuser=5#scrollTo=-uqVOAju1zTa))

In [3]:
import tensorflow as tf
print(tf.__version__)

2.15.0


In [6]:
# Requires TensorFlow 2.x
from tensorflow.keras.preprocessing.text import one_hot

We start with a small sample corpus.

In [7]:
### Sample sentences or corpus
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'your videos are good']

In [8]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

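Next we define the **vocabulary size**, which acts like a hyperparameter: it sets the range of indices that words are hashed into. For example, with a vocabulary size of 100 every word is mapped to an index between 1 and 99. Because the mapping is a hash, two different words can land on the same index, and a larger vocabulary size makes that less likely.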

In [9]:
voc_size =10000

### 1. One Hot Representation of the words

Despite its name, Keras's `one_hot` does not build one-hot vectors. It replaces each word with an integer index that depends on the vocabulary size we chose, so every sentence becomes a list of indices, e.g.

In [20]:
onehot_repr = [one_hot(words, voc_size) for words in sent]
print(onehot_repr)

print(f"First sentence : {sent[0]}")
print(f"OHE representation : {onehot_repr[0]}")

[[3741, 615, 135, 245], [3741, 615, 135, 6998], [3741, 926, 135, 8206], [7249, 3753, 6665, 7492, 7506], [7249, 3753, 6665, 7492, 5082], [2240, 3741, 9692, 135, 5878], [9704, 3152, 8309, 7492]]
First sentence : the glass of milk
OHE representation : [3741, 615, 135, 245]
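
The output above shows that the same word always gets the same index (`the` is 3741 in every sentence). A minimal pure-Python sketch of that idea follows. It is not the Keras implementation: the helper names `hash_index` and `encode` are hypothetical, and MD5 is an assumption chosen only to make the result reproducible, so the indices will not match the ones printed above.

```python
import hashlib

def hash_index(word, voc_size):
    # Hypothetical helper: map a word to a stable index in [1, voc_size - 1].
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (voc_size - 1) + 1

def encode(sentence, voc_size):
    # Split on whitespace and hash every word independently.
    return [hash_index(w, voc_size) for w in sentence.split()]

sample = ['the glass of milk', 'the glass of juice']
encoded = [encode(s, 10000) for s in sample]
print(encoded)

# With a tiny vocabulary size, different words are forced onto the same index.
tiny = encode('the glass of milk', 3)
print(tiny)
```

The first three indices of both sentences agree because the words agree. With `voc_size=3` only the indices 1 and 2 are available, so four distinct words must collide, which is why the vocabulary size should comfortably exceed the number of distinct words.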


### 2. Word Embedding Representation

In [14]:
from tensorflow.keras.layers import Embedding # Trainable lookup table that maps each word index to a dense vector
from tensorflow.keras.preprocessing.sequence import pad_sequences # Pads (or truncates) sequences so all sentences have the same length
from tensorflow.keras.models import Sequential

In [15]:
import numpy as np

### 2.1. Padding
The sentences have different lengths, but the model needs inputs of a fixed length. We therefore pad each sequence with 0s: at the end for post-padding, or at the beginning for pre-padding.

In [17]:
sent_length = 8 # fixed length for every sentence; the longest sentence in this corpus has 5 words, so 8 leaves headroom
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length) # 'pre' adds the 0s at the beginning
print(embedded_docs)

[[   0    0    0    0 3741  615  135  245]
 [   0    0    0    0 3741  615  135 6998]
 [   0    0    0    0 3741  926  135 8206]
 [   0    0    0 7249 3753 6665 7492 7506]
 [   0    0    0 7249 3753 6665 7492 5082]
 [   0    0    0 2240 3741 9692  135 5878]
 [   0    0    0    0 9704 3152 8309 7492]]
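
To make the effect of the `padding` argument concrete, here is a small pure-Python sketch of pre- and post-padding. The function `pad` is a hypothetical stand-in, not `pad_sequences` itself; it assumes `maxlen` is positive and keeps the last `maxlen` items when a sequence is too long, which mirrors the Keras default of truncating from the front.

```python
def pad(seq, maxlen, padding="pre", value=0):
    # Keep at most the last `maxlen` items (assumes maxlen > 0).
    seq = list(seq)[-maxlen:]
    # Fill the remaining slots with the padding value.
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

first = [3741, 615, 135, 245]          # 'the glass of milk' from the output above
pre_padded = pad(first, 8, padding="pre")
post_padded = pad(first, 8, padding="post")
truncated = pad([1, 2, 3], 2)

print(pre_padded)   # [0, 0, 0, 0, 3741, 615, 135, 245]
print(post_padded)  # [3741, 615, 135, 245, 0, 0, 0, 0]
print(truncated)    # [2, 3]
```

The pre-padded result matches the first row printed by `pad_sequences` above.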


Let's look at the first sentence (document) again, together with its padded OHE representation:

In [19]:
print(f"First sentence : {sent[0]}")
print(f"OHE representation : {embedded_docs[0]}")

First sentence : the glass of milk
OHE representation : [   0    0    0    0 3741  615  135  245]


### 2.2. Feature Representation
With the OHE step done, each word in a sentence is now an index into the vocabulary. The next step is to represent every index as a **vector of some fixed dimension**; this is called the feature representation.




In [24]:
dim = 10 # embedding dimension; for a large data set a higher value such as 300 per word is a common choice

In [25]:
model = Sequential() # The Sequential model is a linear stack of layers, where you can add layers one by one.
model.add(Embedding(voc_size, dim, input_length=sent_length)) # Embedding layer: a trainable lookup table with one dim-sized vector per vocabulary index
model.compile('adam', 'mse') # Adam optimizer with Mean Squared Error (MSE) loss; the model is never trained here, so the embeddings keep their random initial values

In [26]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 8, 10)             100000    
                                                                 
Total params: 100000 (390.62 KB)
Trainable params: 100000 (390.62 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
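
The 100000 parameters in the summary are exactly `voc_size * dim`: the layer stores one `dim`-sized row per vocabulary index and simply looks rows up. A rough pure-Python sketch of that lookup is below. The names `table` and `embed` are hypothetical, and the uniform range of ±0.05 is an assumption meant to resemble untrained weights, not the actual values inside the Keras layer.

```python
import random

voc_size, dim = 10000, 10
rng = random.Random(0)

# One row of `dim` small random numbers per vocabulary index (untrained weights).
table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(voc_size)]

def embed(indices):
    # Embedding is a row lookup: index i returns row i of the table.
    return [table[i] for i in indices]

padded = [0, 0, 0, 0, 3741, 615, 135, 245]   # first sentence after pre-padding
vectors = embed(padded)
n_params = voc_size * dim

print(n_params)                    # 100000
print(len(vectors), len(vectors[0]))  # 8 10
```

Every padding position holds index 0, so all four padding rows come out identical, while the real words each get their own row.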


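Let's see the first sentence after the embedding layer. Since the model has not been trained, the vectors are just the layer's random initial values: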

In [29]:
print("My first sentence")
print(sent[0])
print("=========================")
print("After One Hot Encoding")
print(embedded_docs[0])
print("=========================")
print("After Embedding")
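print(model.predict(embedded_docs[:1])[0]) # pass a batch of shape (1, 8) as the model expects, then take the single result of shape (8, 10)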

My first sentence
the glass of milk
After One Hot Encoding
[   0    0    0    0 3741  615  135  245]
After Embedding
[[ 0.02334119  0.02603856 -0.02011134  0.03384751  0.04289399  0.04751729
   0.01939005  0.02223737  0.03556294  0.03270935]
 [ 0.02334119  0.02603856 -0.02011134  0.03384751  0.04289399  0.04751729
   0.01939005  0.02223737  0.03556294  0.03270935]
 [ 0.02334119  0.02603856 -0.02011134  0.03384751  0.04289399  0.04751729
   0.01939005  0.02223737  0.03556294  0.03270935]
 [ 0.02334119  0.02603856 -0.02011134  0.03384751  0.04289399  0.04751729
   0.01939005  0.02223737  0.03556294  0.03270935]
 [-0.04506241  0.01671313 -0.03848625 -0.00520509  0.00547225  0.03812024
  -0.0225917   0.00381789  0.02134356  0.03596712]
 [-0.04015439  0.00291693  0.0259621  -0.02744615  0.02922304 -0.03540211
  -0.04640775  0.02629754  0.01908923  0.01543989]
 [-0.01597004 -0.0012911  -0.00037912 -0.03518567  0.04150781  0.02370092
  -0.04217914  0.04465172  0.04971534 -0.01529777]
 [-0.026

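Let's see the 10-dimensional feature representation of the word **glass**. Because of pre-padding, positions 0 to 3 of the first sentence hold the padding index 0, `the` sits at position 4, and `glass` at position 5:

In [32]:
print(model.predict(embedded_docs[:1])[0][5]) # 10-dimensional word vector for glass (position 5 after pre-padding)

[-0.04015439  0.00291693  0.0259621  -0.02744615  0.02922304 -0.03540211
 -0.04640775  0.02629754  0.01908923  0.01543989]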
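
When picking out the vector of a single word, the padding offset is easy to get wrong. The short sketch below computes the row of a word inside a pre-padded sentence. The helper `padded_position` is hypothetical; it assumes one index per whitespace-separated word and a sentence no longer than `sent_length`.

```python
def padded_position(sentence, word, sent_length):
    words = sentence.split()
    n_pad = sent_length - len(words)   # number of 0s that 'pre' padding puts in front
    return n_pad + words.index(word)   # shift the word's position by the padding

pos_glass = padded_position('the glass of milk', 'glass', 8)
pos_good = padded_position('I am a good boy', 'good', 8)

print(pos_glass)  # 5
print(pos_good)   # 6
```

So in the padded first sentence `[0 0 0 0 3741 615 135 245]`, the index 615 for `glass` is at position 5, and any position below 4 returns the vector of the padding index 0.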


### *So we have done the word embedding for the given corpus successfully*

