### Word Embedding Techniques using Embedding Layer in Keras

### Libraries used: TensorFlow > 2.0 and Keras
Word embeddings provide a dense representation of words and their relative meanings.

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of deep learning.

Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations such as one-hot encoding.
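
To make the size difference concrete, here is a minimal NumPy sketch comparing a sparse one-hot vector with a dense embedding vector. The names `vocab_size`, `embedding_dim` and `word_index` are illustrative assumptions, and random numbers stand in for learned weights.

```python
import numpy as np

vocab_size = 10000   # assumed vocabulary size, for illustration only
embedding_dim = 10   # assumed dense vector size, for illustration only
word_index = 42      # hypothetical index of some word

# Sparse one-hot vector: one slot per vocabulary word, with a single 1.
one_hot_vec = np.zeros(vocab_size, dtype=np.float32)
one_hot_vec[word_index] = 1.0

# Dense embedding: one small real-valued vector per word.
# Random values stand in for learned weights here.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)
dense_vec = embedding_matrix[word_index]

print(one_hot_vec.shape, int(one_hot_vec.sum()))  # (10000,) 1
print(dense_vec.shape)                            # (10,)

# Multiplying the one-hot vector by the matrix selects the same row,
# which is why an embedding can be treated as a table lookup.
same_row = np.allclose(one_hot_vec @ embedding_matrix, dense_vec)
print(same_row)  # True
```

The one-hot vector needs 10,000 entries to describe one word, while the dense vector needs only 10.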



In [1]:
# Requires TensorFlow > 2.0
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
# Example sentences
sent=['the glass of milk',
      'the glass of juice',
      'the cup of tea',
      'I am a good boy',
      'I am a good developer',
      'understand the meaning of words',
      'your videos are good']

In [3]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [5]:
# Vocabulary size: upper bound on the integer indices assigned to words
voc_size=10000

### One-Hot Representation

In [7]:
# Encode each sentence as a list of integer word indices.
# one_hot(text, n) splits the text into words and hashes each word to an
# integer in [1, n). It does not look words up in a stored dictionary, so two
# different words can occasionally collide on the same index.
onehot_repr=[one_hot(words,voc_size) for words in sent]
print(onehot_repr)

# 'the' -> 8611 and 'glass' -> 3959 in both the 1st and 2nd sentences


[[8611, 3959, 9692, 5943], [8611, 3959, 9692, 758], [8611, 1636, 9692, 2394], [5113, 6445, 3482, 3926, 5085], [5113, 6445, 3482, 3926, 3957], [7824, 8611, 407, 9692, 2095], [4342, 8339, 8998, 3926]]
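
The hashing idea behind this output can be sketched in plain Python. This is an illustrative stand-in, not the Keras implementation: `hashed_index` and `encode` are hypothetical helpers that use MD5 so the result is reproducible, whereas `one_hot` uses Python's built-in `hash` by default, so the indices below will not match the ones printed above.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] using a stable MD5 hash."""
    digest = int(hashlib.md5(word.lower().encode("utf-8")).hexdigest(), 16)
    return digest % (n - 1) + 1

def encode(sentence, n):
    """Split on whitespace and hash every word to an index."""
    return [hashed_index(w, n) for w in sentence.split()]

voc_size = 10000
a = encode("the glass of milk", voc_size)
b = encode("the glass of juice", voc_size)

# The same word always hashes to the same index, so the shared
# prefix 'the glass of' is encoded identically in both sentences.
print(a)
print(b)
print(a[:3] == b[:3])  # True

# With a tiny index space, collisions are unavoidable:
# 4 distinct words cannot fit into the 2 available indices.
small = encode("the glass of milk", 3)
print(len(set(small)) < 4)  # True
```

Choosing a vocabulary size well above the number of distinct words keeps such collisions rare.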


### Word Embedding Representation

In [8]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
# pad_sequences is important: the Embedding layer expects every input sentence to have the same length
from tensorflow.keras.models import Sequential

In [9]:
import numpy as np

In [11]:
# Embedding procedure:
# the Embedding layer takes the integer-encoded sentences as input,
# so every sentence must first be converted to the same length.

# The input sentences have at most 5 words, but sent_length is set to 8.
# With padding='pre', zeros are added at the front: a 4-word sentence
# gets 4 leading zeros and a 5-word sentence gets 3 leading zeros.

sent_length=8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

# The result is a 2D matrix with one row of 8 indices per sentence.


[[   0    0    0    0 8611 3959 9692 5943]
 [   0    0    0    0 8611 3959 9692  758]
 [   0    0    0    0 8611 1636 9692 2394]
 [   0    0    0 5113 6445 3482 3926 5085]
 [   0    0    0 5113 6445 3482 3926 3957]
 [   0    0    0 7824 8611  407 9692 2095]
 [   0    0    0    0 4342 8339 8998 3926]]
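
What pre-padding does can be reproduced with a few lines of NumPy. The helper `pad_pre` below is a hypothetical sketch written for illustration, not the Keras function; it assumes sequences longer than `maxlen` are cut from the front.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = list(seq)[-maxlen:]       # keep the last maxlen items if too long
        if seq:                         # skip empty sequences
            out[row, -len(seq):] = seq  # right-align the real indices
    return out

# Two of the encoded sentences from the output above.
encoded = [
    [8611, 3959, 9692, 5943],          # 4 words -> 4 leading zeros
    [5113, 6445, 3482, 3926, 5085],    # 5 words -> 3 leading zeros
]
padded = pad_pre(encoded, maxlen=8)
print(padded)
print(padded.shape)  # (2, 8)
```

Each row ends with the original indices, and the zeros in front only fill the row up to 8 entries.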


In [12]:
dim=10
# Number of features in the embedding vector of each word

In [13]:
model=Sequential()
model.add(Embedding(voc_size,dim,input_length=sent_length))
# The Embedding layer is configured by the vocabulary size, the number of dimensions, and the sentence length

model.compile('adam','mse')  # compile the model

# This only creates the embedding model; the input sentences have not been passed in yet



In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 8, 10)             100000    
                                                                 
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
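
The numbers in this summary follow from the layer's configuration: the embedding table stores one vector of 10 values for each of the 10,000 vocabulary indices. A small arithmetic check, reusing the notebook's values under the same variable names:

```python
voc_size = 10000    # number of rows in the embedding table
dim = 10            # values stored per row
sent_length = 8     # word indices per padded sentence

# One weight per (vocabulary index, feature) pair; the layer has no bias.
n_params = voc_size * dim

# None is the batch dimension, which is left unspecified.
output_shape = (None, sent_length, dim)

print(n_params)      # 100000
print(output_shape)  # (None, 8, 10)
```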


In [15]:
print(model.predict(embedded_docs))
# Each sentence is passed through the model.
# For each sentence the output is an array of shape (8, 10): every word index
# is converted into a vector of 10 features.
# The model has not been trained, so these vectors are the layer's random initial weights.



[[[ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444
    0.040605    0.03608351  0.03407392 -0.02174805  0.03766486]
  [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444
    0.040605    0.03608351  0.03407392 -0.02174805  0.03766486]
  [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444
    0.040605    0.03608351  0.03407392 -0.02174805  0.03766486]
  [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444
    0.040605    0.03608351  0.03407392 -0.02174805  0.03766486]
  [-0.04975528  0.03627307 -0.0166804   0.015333    0.01016755
   -0.0360647  -0.02347807 -0.0420367   0.03846068 -0.0059222 ]
  [-0.04386101  0.04193321 -0.02103049  0.04083448 -0.01698311
    0.02147334  0.04542402  0.01374302  0.0479172   0.04452595]
  [-0.03387966 -0.035608    0.04849163  0.00199207 -0.04251428
   -0.0419197   0.00797311 -0.01449846 -0.04899523  0.0294641 ]
  [-0.00712545 -0.04838101  0.00919814  0.00970845  0.03730467
   -0.00813708  0.00768455 -0.02008561 -0.024630

In [20]:
# considering 1st sentence
embedded_docs[0]

array([   0,    0,    0,    0, 8611, 3959, 9692, 5943], dtype=int32)

In [22]:
print(model.predict(embedded_docs)[0])

# The first 4 rows are identical because all 4 correspond to the padding index 0
# The 5th row is the vector for 'the', the 6th row is the vector for 'glass', etc.


[[ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444  0.040605
   0.03608351  0.03407392 -0.02174805  0.03766486]
 [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444  0.040605
   0.03608351  0.03407392 -0.02174805  0.03766486]
 [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444  0.040605
   0.03608351  0.03407392 -0.02174805  0.03766486]
 [ 0.02784664 -0.01569377  0.01188022  0.02965844  0.04715444  0.040605
   0.03608351  0.03407392 -0.02174805  0.03766486]
 [-0.04975528  0.03627307 -0.0166804   0.015333    0.01016755 -0.0360647
  -0.02347807 -0.0420367   0.03846068 -0.0059222 ]
 [-0.04386101  0.04193321 -0.02103049  0.04083448 -0.01698311  0.02147334
   0.04542402  0.01374302  0.0479172   0.04452595]
 [-0.03387966 -0.035608    0.04849163  0.00199207 -0.04251428 -0.0419197
   0.00797311 -0.01449846 -0.04899523  0.0294641 ]
 [-0.00712545 -0.04838101  0.00919814  0.00970845  0.03730467 -0.00813708
   0.00768455 -0.02008561 -0.02463096  0.02042559]]
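
Conceptually, the layer behaves like a row lookup in a weight matrix, which explains why the four padded positions produce identical vectors. The NumPy sketch below illustrates this under assumptions: `weights` is a random stand-in for the layer's untrained weights (drawn uniformly from a small range, which is an assumption about the default initializer), so the values differ from the output above.

```python
import numpy as np

voc_size, dim = 10000, 10
rng = np.random.default_rng(0)

# Stand-in for the untrained embedding table: one row per word index.
weights = rng.uniform(-0.05, 0.05, size=(voc_size, dim)).astype(np.float32)

# The padded first sentence from the notebook.
first_sentence = np.array([0, 0, 0, 0, 8611, 3959, 9692, 5943])

# Indexing with an integer array picks one row per word index.
vectors = weights[first_sentence]

print(vectors.shape)                               # (8, 10)
print(np.array_equal(vectors[0], vectors[3]))      # True
print(np.array_equal(vectors[4], weights[8611]))   # True
```

Because index 0 always selects row 0, every padded position maps to the same vector, while each real word selects its own row.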
