# Word Embedding Techniques Using the Embedding Layer in Keras

### Load Libraries

In [1]:
from tensorflow.keras.preprocessing.text import one_hot

### Load Data

In [2]:
english_text = """Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before."""

In [3]:
english_text

'Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.'

### Data Cleaning

In [5]:
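import re
import nltk
from nltk.corpus import stopwords

# Download the tokenizer models and the stop-word list (only needed once)
nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once instead of rebuilding it for every token
stop_words = set(stopwords.words("english"))

english_sentences = nltk.sent_tokenize(english_text)
corpus = []
for sent in english_sentences:
  # Replace every non-letter character with a space, then lowercase and tokenize
  cleaned = re.sub('[^a-zA-Z]', ' ', sent)
  tokens = nltk.word_tokenize(cleaned.lower())
  # Drop stop words and rejoin the remaining tokens into a single string
  kept = [x for x in tokens if x not in stop_words]
  corpus.append(' '.join(kept))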

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
corpus

['perhaps one significant advances made arabic mathematics began time work al khwarizmi namely beginnings algebra',
 'important understand significant new idea',
 'revolutionary move away greek concept mathematics essentially geometry',
 'algebra unifying theory allowedrational numbers irrational numbers geometrical magnitudes etc treated algebraic objects',
 'gave mathematics whole new development path much broader concept existed provided vehicle future development subject',
 'another important aspect introduction algebraic ideas allowed mathematics applied itselfin way happened']

In [8]:
len(corpus)

6
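
The cleaning step can be illustrated without NLTK. The sketch below is a simplified stand-in, not the notebook's pipeline: it splits on whitespace instead of calling `nltk.word_tokenize`, and `TINY_STOPWORDS` is a hypothetical hand-picked subset rather than the full NLTK English list.

```python
import re

# Hypothetical, hand-picked subset of stop words (the notebook uses NLTK's full list)
TINY_STOPWORDS = {"it", "was", "a", "from", "the", "of", "which"}

def clean_sentence(sentence, stop_words=TINY_STOPWORDS):
    """Keep letters only, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned_example = clean_sentence("It was a revolutionary move away from the Greek concept.")
print(cleaned_example)
```

Punctuation such as the hyphen in "al-Khwarizmi" becomes a space under this regex, which is why the corpus above contains `al` and `khwarizmi` as separate tokens.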

### Define Parameters

In [9]:
vocabulary_size = 10000
dimensions = 3

## One Hot Representation

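For each word in each sentence, `one_hot` returns an integer index computed by hashing the word into the range 1 to `vocabulary_size - 1`. Despite the name, the result is a list of integer indices, not one-hot vectors, and two different words can hash to the same index.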

In [10]:
one_hot_rep = [one_hot(sent, vocabulary_size) for sent in corpus]

In [11]:
print(one_hot_rep)

[[7553, 2745, 5746, 7083, 9041, 9454, 1257, 4426, 8852, 1321, 9781, 6648, 8747, 1348, 6288], [1675, 1639, 5746, 7810, 9662], [4780, 7322, 8197, 809, 1027, 1257, 4297, 7852], [6288, 9095, 8742, 9760, 3014, 9792, 3014, 6695, 9932, 312, 3554, 4893, 168], [7881, 1257, 4290, 7810, 9495, 5159, 4400, 1337, 1027, 2115, 7887, 864, 1721, 9495, 1924], [6604, 1675, 26, 9914, 4893, 5503, 1921, 1257, 2413, 5365, 2199, 2576]]
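
Because the default hash behind `one_hot` is Python's built-in `hash`, the indices above are generally not reproducible across Python sessions. The sketch below shows the same word-to-index idea with a stable MD5 hash. It is an illustration under assumptions, not the Keras implementation: `stable_word_index` and `encode_sentence` are hypothetical helper names, and the `% (n - 1) + 1` mapping is assumed to mirror how Keras keeps index 0 free.

```python
import hashlib

def stable_word_index(word, vocabulary_size):
    """Map a word to an integer in [1, vocabulary_size - 1] using MD5."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (vocabulary_size - 1) + 1

def encode_sentence(sentence, vocabulary_size):
    """Encode a whitespace-separated sentence as a list of hashed indices."""
    return [stable_word_index(w, vocabulary_size) for w in sentence.split()]

sample = "algebra unifying theory numbers irrational numbers"
encoded = encode_sentence(sample, 10000)
print(encoded)
```

The repeated word `numbers` always receives the same index, just as 3014 appears twice in the fourth sentence above. With a vocabulary size of 10000 and only a few dozen distinct words, collisions between different words are unlikely but not impossible.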


## Word Embedding Representation

In [12]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [13]:
# Print the length of each sentence to find the maximum
for sent in corpus:
  print(len(sent.split()))

15
5
8
13
15
12


In [14]:
max_lenght = 15
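
Instead of reading the maximum off the printed lengths, it can be computed directly. A minimal sketch, using only the first three cleaned sentences from the corpus above so that the block stays self-contained:

```python
# First three cleaned sentences, copied from the corpus shown earlier
corpus = [
    "perhaps one significant advances made arabic mathematics began time work "
    "al khwarizmi namely beginnings algebra",
    "important understand significant new idea",
    "revolutionary move away greek concept mathematics essentially geometry",
]

# Count the words in each sentence, then take the largest count
sentence_lengths = [len(sent.split()) for sent in corpus]
max_lenght = max(sentence_lengths)
print(sentence_lengths, max_lenght)
```

Computing the value this way keeps it correct if the input text changes.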

#### Pad all sequences to the same length

In [16]:
Embedded_docs = pad_sequences(one_hot_rep, padding='pre', maxlen=max_lenght)
print(Embedded_docs)

[[7553 2745 5746 7083 9041 9454 1257 4426 8852 1321 9781 6648 8747 1348
  6288]
 [   0    0    0    0    0    0    0    0    0    0 1675 1639 5746 7810
  9662]
 [   0    0    0    0    0    0    0 4780 7322 8197  809 1027 1257 4297
  7852]
 [   0    0 6288 9095 8742 9760 3014 9792 3014 6695 9932  312 3554 4893
   168]
 [7881 1257 4290 7810 9495 5159 4400 1337 1027 2115 7887  864 1721 9495
  1924]
 [   0    0    0 6604 1675   26 9914 4893 5503 1921 1257 2413 5365 2199
  2576]]
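
What `padding='pre'` does can be sketched in plain Python. The helper `pad_pre` below is a hypothetical name and only an approximation of `pad_sequences`: it left-pads with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` items, which is assumed to match the default `truncating='pre'` behavior.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so every row has length `maxlen`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)
        if len(trimmed) > maxlen:
            # Keep only the last `maxlen` items when the sequence is too long
            trimmed = trimmed[len(trimmed) - maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

# Second sentence from the encoded corpus above (5 words)
rows = pad_pre([[1675, 1639, 5746, 7810, 9662]], maxlen=15)
print(rows[0])
```

Since the hashed word indices start at 1, index 0 is free to act as the padding value.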


In [17]:
dimensions = 20  # overrides the earlier value of 3: each word index maps to a 20-dimensional vector

In [20]:
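model = Sequential()

# Map each of the max_lenght word indices in a sentence to a dense vector of size `dimensions`
model.add(Embedding(vocabulary_size, dimensions, input_length=max_lenght))
# Optimizer and loss only matter if fit() is called; no training step is shown here
model.compile('adam', 'mse')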

In [26]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 15, 20)            200000
=================================================================
Total params: 200,000
Trainable params: 200,000
Non-trainable params: 0
_________________________________________________________________
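
The numbers in the summary follow from the layer's shape: the embedding layer stores one vector of length `dimensions` for every index in the vocabulary, and it outputs one such vector per position in each padded sentence. A short arithmetic check, with the values copied from the cells above:

```python
vocabulary_size = 10000
dimensions = 20
max_lenght = 15

# One trainable weight per (vocabulary index, embedding dimension) pair
embedding_params = vocabulary_size * dimensions

# None stands for the batch size, which is not fixed in advance
output_shape = (None, max_lenght, dimensions)
print(embedding_params, output_shape)
```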


#### Embedding Matrix

In [22]:
print(model.predict(Embedded_docs))

[[[-0.0427902  -0.04847389 -0.04603308 ... -0.04576445 -0.02128353
    0.00405754]
  [ 0.01261211 -0.0113859  -0.04223588 ...  0.01363099  0.00328436
   -0.04418647]
  [-0.01511942 -0.00636585  0.02926222 ...  0.04974481  0.000705
   -0.04855983]
  ...
  [-0.01945938 -0.02246976  0.02172699 ...  0.04347995 -0.02497318
    0.02813511]
  [-0.02353998  0.02811073  0.04367189 ... -0.00953877  0.0168567
    0.03753323]
  [ 0.04361914  0.00519378  0.02555257 ... -0.01409789 -0.00665547
   -0.01486976]]

 [[-0.03935535  0.01950193 -0.04968914 ... -0.02960539  0.01438368
   -0.01437724]
  [-0.03935535  0.01950193 -0.04968914 ... -0.02960539  0.01438368
   -0.01437724]
  [-0.03935535  0.01950193 -0.04968914 ... -0.02960539  0.01438368
   -0.01437724]
  ...
  [-0.01511942 -0.00636585  0.02926222 ...  0.04974481  0.000705
   -0.04855983]
  [ 0.03856099 -0.01051296  0.00967028 ... -0.03198502  0.0376736
   -0.0315864 ]
  [-0.00089426  0.03352723  0.00718429 ... -0.01812184 -0.04363843
   -0.032697

In [24]:
Embedded_docs[0]

array([7553, 2745, 5746, 7083, 9041, 9454, 1257, 4426, 8852, 1321, 9781,
       6648, 8747, 1348, 6288], dtype=int32)
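
Conceptually, the embedding layer is a lookup table: each integer index selects one row of a weight matrix. The NumPy sketch below illustrates this under stated assumptions. `embedding_matrix` is a hypothetical stand-in for the layer's weights, filled with uniform random values in [-0.05, 0.05] on the assumption that this matches the layer's default initializer, so the actual numbers differ from the notebook's output.

```python
import numpy as np

vocabulary_size, dimensions = 10000, 20
rng = np.random.default_rng(0)

# Hypothetical stand-in for the layer's weight matrix (random, untrained)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocabulary_size, dimensions))

# A shortened padded sentence: three padding zeros followed by three word indices
padded_doc = np.array([[0, 0, 0, 1675, 1639, 5746]])

# Integer-array indexing picks one row of the matrix per index
embedded = embedding_matrix[padded_doc]
print(embedded.shape)
```

Every padding position holds index 0 and therefore receives the same vector, which is why the first rows of the second sentence in the `predict` output above are identical.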

In [25]:
print(model.predict(Embedded_docs)[0])

[[-4.27902006e-02 -4.84738946e-02 -4.60330844e-02 -5.36127016e-03
   3.83434556e-02  3.71821411e-02 -1.00387558e-02 -1.40544027e-03
  -4.17988077e-02  3.93988937e-03  4.98488657e-02 -9.01416689e-03
   4.52711321e-02 -3.29205878e-02  1.50269978e-02  1.03565082e-02
  -4.63032238e-02 -4.57644463e-02 -2.12835316e-02  4.05753776e-03]
 [ 1.26121081e-02 -1.13858953e-02 -4.22358774e-02  3.21129598e-02
  -2.01909784e-02  2.59012021e-02  4.53926250e-03  2.08758004e-02
  -2.91516669e-02  2.31784247e-02 -4.56436276e-02 -3.05805095e-02
   1.40910037e-02 -1.33065805e-02 -2.35776305e-02 -1.26406923e-02
  -1.92021616e-02  1.36309899e-02  3.28435749e-03 -4.41864729e-02]
 [-1.51194222e-02 -6.36584684e-03  2.92622186e-02 -3.64180580e-02
  -3.86753678e-02 -4.39455360e-03 -3.68689299e-02  3.10438015e-02
   2.99994834e-02  3.69051360e-02 -4.09891978e-02  2.53283046e-02
  -2.69523393e-02  1.06595047e-02  3.94217633e-02  4.97905053e-02
  -2.42841728e-02  4.97448109e-02  7.05003738e-04 -4.85598333e-02]
 [-7.43