# Word Embeddings


A word embedding is a learned representation for text in which words with similar meanings have similar representations.
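To make "similar representation" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure closeness. The three-dimensional vectors are hand-picked, hypothetical values for illustration; they are not learned from any corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand so that the two drinks point
# in nearly the same direction and "developer" points elsewhere.
toy_vectors = {
    "milk": [0.9, 0.1, 0.0],
    "juice": [0.8, 0.2, 0.1],
    "developer": [0.0, 0.2, 0.9],
}

sim_drinks = cosine_similarity(toy_vectors["milk"], toy_vectors["juice"])
sim_mixed = cosine_similarity(toy_vectors["milk"], toy_vectors["developer"])

# The related pair scores higher than the unrelated pair.
print(sim_drinks > sim_mixed)  # True
```

In a trained embedding the same comparison would be run on learned vectors rather than hand-written ones.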

# Word Embedding Algorithms
Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.

The learning process is either joint with a neural network model on some task, such as document classification, or unsupervised, using document statistics.

This section reviews three techniques that can be used to learn a word embedding from text data.

# Embedding Layer

An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.

# Word2Vec

Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.
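The text above does not describe how Word2Vec learns. As background, one of its two architectures (skip-gram) is trained to predict the words surrounding a given center word. The sketch below only shows how such (center, context) training pairs could be generated; `skipgram_pairs` and its `window` parameter are hypothetical names, and no model is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the glass of milk".split(), window=1)
print(pairs)
# [('the', 'glass'), ('glass', 'the'), ('glass', 'of'),
#  ('of', 'glass'), ('of', 'milk'), ('milk', 'of')]
```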

# GloVe
The Global Vectors for Word Representation, or GloVe, algorithm is an extension to the Word2Vec method for efficiently learning word vectors, developed by Pennington et al. at Stanford.
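GloVe fits its vectors to global word-word co-occurrence statistics gathered over the whole corpus. The sketch below covers only that counting step, under simplifying assumptions: `cooccurrence_counts` is a hypothetical helper, the counts are unweighted, and the vector-fitting stage is not shown.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(
    ["the glass of milk", "the glass of juice"], window=1
)
print(counts[("glass", "of")])  # 2, once per sentence
print(counts[("of", "milk")])   # 1
```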

# Word Embedding Techniques using Embedding Layer in Keras

In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [3]:

### sentences
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [4]:
### Vocabulary size
voc_size=10000

# One Hot Representation

In [5]:
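# Despite its name, one_hot hashes each word to an integer index below voc_size;
# it does not build one-hot vectors
onehot_repr = [one_hot(words, voc_size) for words in sent]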
print(onehot_repr)

[[6, 7453, 925, 8119], [6, 7453, 925, 1437], [6, 6426, 925, 1426], [8383, 3931, 1260, 5111, 8299], [8383, 3931, 1260, 5111, 6417], [3220, 6, 5874, 925, 8067], [9292, 8770, 4750, 5111]]
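In the output above, repeated words share an index: "the" is always 6 and "of" is always 925. That is what a hashing-based encoder produces. The sketch below is a hypothetical stand-in for illustration, not the Keras implementation: `hashed_indices` is an invented name, and MD5 is assumed only so the result is repeatable between runs. Its indices will not match the numbers printed above, and two different words can collide on the same index.

```python
import hashlib

def hashed_indices(text, n):
    """Map each lowercase, whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

a = hashed_indices("the glass of milk", 10000)
b = hashed_indices("the glass of juice", 10000)

# The shared prefix "the glass of" maps to the same three indices.
print(a[:3] == b[:3])  # True
```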


# Word Embedding Representation

In [7]:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

import numpy as np

In [8]:
sent_length = 8
# Left-pad every sequence with zeros so that all rows have length sent_length
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0    6 7453  925 8119]
 [   0    0    0    0    6 7453  925 1437]
 [   0    0    0    0    6 6426  925 1426]
 [   0    0    0 8383 3931 1260 5111 8299]
 [   0    0    0 8383 3931 1260 5111 6417]
 [   0    0    0 3220    6 5874  925 8067]
 [   0    0    0    0 9292 8770 4750 5111]]
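The effect of `padding='pre'` can be reproduced with plain lists. This is a minimal sketch of the idea, not the Keras code: `pad_pre` is a hypothetical helper, and keeping the last `maxlen` items of an over-long sequence is an assumption meant to mirror the library's default truncation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre(
    [[6, 7453, 925, 8119], [8383, 3931, 1260, 5111, 8299]], maxlen=8
)
print(rows)
# [[0, 0, 0, 0, 6, 7453, 925, 8119], [0, 0, 0, 8383, 3931, 1260, 5111, 8299]]
```

Padding on the left keeps the real words at the end of each row, which is why the zeros appear first in the array printed above.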


In [9]:
dim=10

In [10]:

model = Sequential()
# Map each of the voc_size indices to a dim-dimensional vector
model.add(Embedding(voc_size, dim, input_length=sent_length))
model.compile('adam', 'mse')

In [11]:

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________


In [12]:
print(model.predict(embedded_docs))

[[[ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [-1.69889107e-02 -1.37569755e-03 -3.92897241e-02 -1.78460479e-02
   -4.20581102e-02 -3.44532840e-02  1.49007104e-02  4.55289744e-02
   -7.55148008e-03  2.39036717e-02]
  [ 3.77750434e-02 -4.34159301e-02 -1.61685944e-02  9.92812961e-03
   -4.57546823e-02  2.84096263e-02 -2.33545545e-02 -2.57159602e-02
    3.63175757e-

In [13]:

print(model.predict(embedded_docs)[0])

[[ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [-0.01698891 -0.0013757  -0.03928972 -0.01784605 -0.04205811 -0.03445328
   0.01490071  0.04552897 -0.00755148  0.02390367]
 [ 0.03777504 -0.04341593 -0.01616859  0.00992813 -0.04575468  0.02840963
  -0.02335455 -0.02571596  0.03631758  0.04784216]
 [ 0.02253621 -0.01142471  0.04809308  0.00100965 -0.04898838  0.02958932
   0.00636113 -0.03732587 -0.01762884 -0.0301518 ]
 [-0.01348916 -0.02871039 -0.0475859   0.04592964  0.03675845  0.03470277
   0.01562592  0.03713954  0.00154363  0.04721672]]
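Two details of the output above are worth spelling out. The first four rows are identical because they all correspond to the padding index 0, and the values are small and arbitrary because the model was compiled but never trained. Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below imitates that with plain lists; the random table is a hypothetical stand-in (its value range is an assumption picked to resemble the printed numbers), not the weights of the model above.

```python
import random

voc_size, dim = 10000, 10
rng = random.Random(0)

# Hypothetical weight table: one row of `dim` small random values per index.
table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(voc_size)]

def embed(sequence):
    """Replace every integer index with its row from the table."""
    return [table[index] for index in sequence]

padded = [0, 0, 0, 0, 6, 7453, 925, 8119]  # first padded sentence from above
vectors = embed(padded)

print(len(table) * dim)               # 100000, the Param # in the summary
print(len(vectors), len(vectors[0]))  # 8 10, one 10-d vector per position
print(vectors[0] == vectors[3])       # True: both positions hold padding index 0
```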