## A Detailed Guide to Understanding Word Embeddings and the Embedding Layer in Keras

In this kernel I explain the Keras Embedding layer. To do so, I have created a sample corpus of just 3 documents, which should be sufficient to show how the layer works.


Embeddings are useful in a variety of machine learning applications. For that reason, I have attached many data sources to the kernel where I feel that embeddings and the Keras Embedding layer may prove useful.

Before diving in, let us skim through some of the applications of embeddings:

**1 ) The first application that comes to mind is collaborative filtering (CF) based recommender systems, where we create user embeddings and movie embeddings by decomposing the utility matrix that contains the user-item ratings.**

For a complete tutorial on CF based recommender systems using embeddings in Keras, see **[this](https://www.kaggle.com/rajmehra03/cf-based-recsys-by-low-rank-matrix-factorization)** kernel of mine.
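
As a rough illustration of the idea (not the method used in the linked kernel), a predicted rating can be read as the dot product of a user embedding and a movie embedding. The vectors, the names and the `predict_rating` helper below are made up for this example; in practice the vectors are learned from the ratings.

```python
# Hypothetical 3-dimensional embeddings; in a real system these are learned
# by factorizing the user-item rating matrix.
user_embeddings = {
    "user_a": [1.0, 0.5, 0.0],
    "user_b": [0.0, 1.0, 2.0],
}
movie_embeddings = {
    "movie_x": [2.0, 2.0, 0.0],
    "movie_y": [0.0, 1.0, 1.5],
}


def predict_rating(user, movie):
    """Dot product of the user vector and the movie vector."""
    u = user_embeddings[user]
    m = movie_embeddings[movie]
    return sum(ui * mi for ui, mi in zip(u, m))


for user in user_embeddings:
    for movie in movie_embeddings:
        print(user, movie, predict_rating(user, movie))
```

A user and a movie whose vectors point in similar directions get a high score, which is what makes the learned vectors usable for recommendations.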


**2 ) The second use is in Natural Language Processing and related applications, where we create word embeddings for all the words present in the documents of our corpus.**

This is the setting, and the terminology (words, documents, corpus), that I shall use in this kernel.


**Thus the Embedding layer in Keras can be used whenever we want to embed higher dimensional data into a lower dimensional vector space.**
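
To make "higher dimensional to lower dimensional" concrete, here is a minimal sketch in plain Python. It assumes a toy vocabulary of 5 indices and a made-up 2-dimensional embedding matrix, and shows that multiplying a sparse one-hot vector by the matrix gives the same result as simply looking up a row. The helper names are hypothetical.

```python
# Toy embedding matrix: 5 indices in the vocabulary, 2 dimensions each.
# The numbers are arbitrary and only serve the illustration.
embedding_matrix = [
    [0.0, 0.0],   # index 0
    [0.1, 0.3],   # index 1
    [0.2, -0.1],  # index 2
    [-0.4, 0.5],  # index 3
    [0.7, 0.2],   # index 4
]
vocab_size = len(embedding_matrix)


def one_hot_vector(index, size):
    """Sparse representation: a vector of zeros with a single 1."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul_row(row, matrix):
    """Multiply a row vector by a matrix (row @ matrix)."""
    n_cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(n_cols)]


word_index = 3
dense_via_matmul = matmul_row(one_hot_vector(word_index, vocab_size), embedding_matrix)
dense_via_lookup = embedding_matrix[word_index]

print(dense_via_matmul)
print(dense_via_lookup)
```

The lookup form is why an embedding layer can be fed plain integers instead of large one-hot vectors.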

#### IMPORTING MODULES

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\g.gagliano\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [3]:
import tensorflow.keras as keras

In [4]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
%matplotlib inline  
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)

#nltk
import nltk

#stop-words
from nltk.corpus import stopwords
stop_words=set(nltk.corpus.stopwords.words('english'))

# tokenizing
from nltk import word_tokenize,sent_tokenize

#keras
from keras.preprocessing.text import one_hot,Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense , Flatten ,Embedding,Input
from keras.models import Model

Using TensorFlow backend.


#### CREATING SAMPLE CORPUS OF DOCUMENTS ie TEXTS

In [5]:
sample_text_1="bitty bought a bit of butter"
sample_text_2="but the bit of butter was a bit bitter"
sample_text_3="so she bought some better butter to make the bitter butter better"

corp=[sample_text_1,sample_text_2,sample_text_3]
no_docs=len(corp)


#### INTEGER ENCODING ALL THE DOCUMENTS

After this step, every unique word is represented by an integer. For this we use the **one_hot** function from Keras, which hashes each word to an integer between 1 and **vocab_size** - 1. Choosing a **vocab_size** well above the number of unique words makes collisions less likely, but it does not guarantee a **unique integer encoding** for every word: in the output below, 'bought' and 'bitter' both map to 29.

**Note one important thing: the integer encoding of a word stays the same across documents, e.g. 'butter' is denoted by 10 in every document.**

In [6]:
vocab_size=50 
encod_corp=[]
for i,doc in enumerate(corp):
    encoded_doc=one_hot(doc,vocab_size) # hash each word to an integer in [1, vocab_size-1]
    encod_corp.append(encoded_doc)
    print("The encoding for document",i+1," is : ",encoded_doc)

The encoding for document 1  is :  [13, 29, 22, 49, 34, 10]
The encoding for document 2  is :  [39, 36, 49, 34, 10, 20, 22, 49, 29]
The encoding for document 3  is :  [12, 31, 29, 24, 45, 10, 19, 27, 36, 29, 10, 45]
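
The behaviour seen above (same word, same integer; occasionally two words sharing an integer) follows from the hashing trick. Below is a minimal sketch of that idea, assuming the scheme "hash of the word modulo n - 1, plus 1". It uses md5 so the result is reproducible, whereas Keras uses Python's built-in `hash` by default, so the integers will not match the output above. The function names are hypothetical.

```python
import hashlib


def stable_hash(word):
    """Deterministic hash based on md5 (Python's hash() of a str varies between runs)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_encode(text, n):
    """Map every word to an integer in [1, n - 1]; 0 is never produced."""
    words = text.lower().split()
    return [stable_hash(w) % (n - 1) + 1 for w in words]


corpus = [
    "bitty bought a bit of butter",
    "but the bit of butter was a bit bitter",
    "so she bought some better butter to make the bitter butter better",
]
encoded = [hashing_encode(doc, 50) for doc in corpus]
for doc_ids in encoded:
    print(doc_ids)

# Compare the number of distinct words with the number of distinct integers:
# fewer integers than words means at least one collision.
word_to_id = {w: hashing_encode(w, 50)[0] for doc in corpus for w in doc.split()}
n_words = len(word_to_id)
n_ids = len(set(word_to_id.values()))
print(n_words, "distinct words mapped to", n_ids, "distinct integers")
```

If collisions matter for your task, a dictionary-based encoding (one integer per distinct word) avoids them at the cost of having to store the vocabulary.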


#### PADDING THE DOCS (to make every doc of same length)

**The Keras Embedding layer requires all individual documents to be of the same length.**  Hence we will pad the shorter documents with 0 for now (one_hot never produces 0, so the padding value cannot clash with a word). The **'input_length'** of the Keras Embedding layer will therefore be equal to the length (i.e. the number of words) of the longest document.

To pad the shorter documents I am using the **pad_sequences** function from the Keras library.

In [10]:
# length of the longest document. will be needed whenever we create embeddings for the words
maxlen=-1
for doc in corp:
    tokens=nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen=len(tokens)
print("The maximum number of words in any document is : ",maxlen)

The maximum number of words in any document is :  12


In [11]:
# now to create embeddings all of our docs need to be of same length. hence we can pad the docs with zeros.
pad_corp=pad_sequences(encod_corp,maxlen=maxlen,padding='post',value=0.0)
print("No of padded documents: ",len(pad_corp))

No of padded documents:  3


In [12]:
for i,doc in enumerate(pad_corp):
    print("The padded encoding for document",i+1," is : ",doc)

The padded encoding for document 1  is :  [13 29 22 49 34 10  0  0  0  0  0  0]
The padded encoding for document 2  is :  [39 36 49 34 10 20 22 49 29  0  0  0]
The padded encoding for document 3  is :  [12 31 29 24 45 10 19 27 36 29 10 45]
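
For readers who want to see what the padding step does without Keras, here is a small plain-Python sketch of post-padding. `pad_post` is a hypothetical helper written for this example; it only imitates the `padding='post'` behaviour used above, plus the front truncation that I understand to be the default of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` entries.

    Sequences longer than `maxlen` keep their last `maxlen` entries,
    which mirrors truncating from the front. Assumes maxlen > 0.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop leading entries if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


encoded_docs = [
    [13, 29, 22, 49, 34, 10],
    [39, 36, 49, 34, 10, 20, 22, 49, 29],
    [12, 31, 29, 24, 45, 10, 19, 27, 36, 29, 10, 45],
]
longest = max(len(doc) for doc in encoded_docs)
padded_docs = pad_post(encoded_docs, longest)
for row in padded_docs:
    print(row)
```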


#### ACTUALLY CREATING THE EMBEDDINGS using KERAS EMBEDDING LAYER

Now all the documents have the same length (after padding), so we are ready to create and use the embeddings.

**I will embed the words into vectors of 8 dimensions.**

In [13]:
# batch-level input tensor (not used by the model built in the next cell)
batch_input=Input(shape=(no_docs,maxlen),dtype='float64')

In [14]:
# shape of the input:
# each document has 12 elements (words), which is the value of our maxlen variable.
word_input=Input(shape=(maxlen,),dtype='float64')  

# creating the embedding
word_embedding=Embedding(input_dim=vocab_size,output_dim=8,input_length=maxlen)(word_input)

word_vec=Flatten()(word_embedding) # flatten
embed_model =Model([word_input],word_vec) # combining all into a Keras model

**PARAMETERS OF THE EMBEDDING LAYER**

**'input_dim' = the vocab size that we will choose**. 
In other words, it is the number of distinct integer indices the layer can look up, so it must be larger than the largest index in the encoded documents.

**'output_dim'  = the number of dimensions we wish to embed into**. Each word will be represented by a vector with this many dimensions.

**'input_length' = length of the longest document**, which is stored in the maxlen variable in our case.
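
These three parameters fully determine the size of the layer. The arithmetic below is a small sketch, using the values chosen in this kernel, of how the parameter count and output shapes reported in the model summary further down come about. The variable names are mine.

```python
vocab_size = 50     # input_dim
embedding_dim = 8   # output_dim
maxlen = 12         # input_length

# One trainable vector of size embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Output shapes for a batch of unknown size (None).
embedding_output_shape = (None, maxlen, embedding_dim)
flatten_output_shape = (None, maxlen * embedding_dim)

print(embedding_params)
print(embedding_output_shape)
print(flatten_output_shape)
```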

In [15]:
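embed_model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),loss='binary_crossentropy',metrics=['acc']) 
# compiling the model. parameters can be tuned as always.
# note: the model is never trained in this kernel, so the optimizer and loss are not actually used.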

In [16]:
print(type(word_embedding))
print(word_embedding)

<class 'tensorflow.python.framework.ops.Tensor'>
Tensor("embedding_1/embedding_lookup/Identity_1:0", shape=(None, 12, 8), dtype=float32)


In [17]:
embed_model.summary() # summary of the model

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 12)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 12, 8)             400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 96)                0         
=================================================================
Total params: 400
Trainable params: 400
Non-trainable params: 0
_________________________________________________________________


In [18]:
embeddings=embed_model.predict(pad_corp) # finally getting the embeddings (untrained, i.e. the initial random weights).

In [19]:
print("Shape of embeddings : ",embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 96)
[[ 2.2178996e-02 -8.4844977e-04  5.0595514e-03 -1.5644442e-02
   1.3148475e-02 -2.1111691e-02 -4.6953607e-02  1.6933132e-02
  -1.9370759e-02 -3.6944319e-02  3.8881872e-02  3.6577333e-02
  -1.6315721e-02  4.2205278e-02 -8.3308443e-03  4.1705500e-02
  -7.5347051e-03  2.3108069e-02 -2.4566937e-02  2.9110324e-02
   3.4486245e-02 -7.0220828e-03  1.1225618e-02 -4.4404279e-02
   4.4594098e-02  4.1802470e-02  1.9291509e-02  1.0375597e-02
   3.2623593e-02  4.2842541e-02  4.1719783e-02  2.6841711e-02
  -4.3161776e-02 -1.2376916e-02  3.5929929e-02 -1.7572306e-02
  -3.0960847e-02 -4.2403806e-02  8.1663355e-03  3.5919357e-02
  -4.7829151e-03  2.1874461e-02  2.7669106e-02 -4.1286945e-02
  -2.4988502e-04  3.8018156e-02 -5.2970536e-03 -4.0472373e-03
  -1.3843823e-02 -9.7693093e-03  2.9820051e-02 -4.8851348e-02
  -3.9608669e-02 -4.3516647e-02 -2.3724068e-02  2.1602139e-03
  -1.3843823e-02 -9.7693093e-03  2.9820051e-02 -4.8851348e-02
  -3.9608669e-02 -4.3516647e-02 -2.3724

In [20]:
embeddings=embeddings.reshape(-1,maxlen,8)
print("Shape of embeddings : ",embeddings.shape) 
print(embeddings)

Shape of embeddings :  (3, 12, 8)
[[[ 2.2178996e-02 -8.4844977e-04  5.0595514e-03 -1.5644442e-02
    1.3148475e-02 -2.1111691e-02 -4.6953607e-02  1.6933132e-02]
  [-1.9370759e-02 -3.6944319e-02  3.8881872e-02  3.6577333e-02
   -1.6315721e-02  4.2205278e-02 -8.3308443e-03  4.1705500e-02]
  [-7.5347051e-03  2.3108069e-02 -2.4566937e-02  2.9110324e-02
    3.4486245e-02 -7.0220828e-03  1.1225618e-02 -4.4404279e-02]
  [ 4.4594098e-02  4.1802470e-02  1.9291509e-02  1.0375597e-02
    3.2623593e-02  4.2842541e-02  4.1719783e-02  2.6841711e-02]
  [-4.3161776e-02 -1.2376916e-02  3.5929929e-02 -1.7572306e-02
   -3.0960847e-02 -4.2403806e-02  8.1663355e-03  3.5919357e-02]
  [-4.7829151e-03  2.1874461e-02  2.7669106e-02 -4.1286945e-02
   -2.4988502e-04  3.8018156e-02 -5.2970536e-03 -4.0472373e-03]
  [-1.3843823e-02 -9.7693093e-03  2.9820051e-02 -4.8851348e-02
   -3.9608669e-02 -4.3516647e-02 -2.3724068e-02  2.1602139e-03]
  [-1.3843823e-02 -9.7693093e-03  2.9820051e-02 -4.8851348e-02
   -3.9608669e

The resulting shape is (3,12,8).

**3---> no of documents**

**12---> each document is made of 12 words which was our maximum length of any document.**

**& 8---> each word is 8 dimensional.**
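
The reshape from (3, 96) to (3, 12, 8) only regroups the numbers; nothing is recomputed. Here is a minimal sketch in plain Python, with placeholder numbers instead of real embeddings and a hypothetical `unflatten` helper, of how each flat row of 96 values splits into 12 consecutive chunks of 8.

```python
n_docs, maxlen, dim = 3, 12, 8

# Stand-in for the flattened model output: n_docs rows of maxlen * dim numbers.
flat = [[doc * 1000 + k for k in range(maxlen * dim)] for doc in range(n_docs)]


def unflatten(row, n_words, n_dims):
    """Split one flat row into n_words consecutive chunks of n_dims values."""
    return [row[w * n_dims:(w + 1) * n_dims] for w in range(n_words)]


reshaped = [unflatten(row, maxlen, dim) for row in flat]

print(len(reshaped), len(reshaped[0]), len(reshaped[0][0]))

# Component k of word w in document d sits at flat position w * dim + k.
d, w, k = 1, 4, 3
print(reshaped[d][w][k] == flat[d][w * dim + k])
```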

 

#### GETTING ENCODING FOR A PARTICULAR WORD IN A SPECIFIC DOCUMENT

In [21]:
for i,doc in enumerate(embeddings):
    for j,word in enumerate(doc):
        print("The encoding for ",j+1,"th word","in",i+1,"th document is : \n\n",word)

The encoding for  1 th word in 1 th document is : 

 [ 0.022179   -0.00084845  0.00505955 -0.01564444  0.01314848 -0.02111169
 -0.04695361  0.01693313]
The encoding for  2 th word in 1 th document is : 

 [-0.01937076 -0.03694432  0.03888187  0.03657733 -0.01631572  0.04220528
 -0.00833084  0.0417055 ]
The encoding for  3 th word in 1 th document is : 

 [-0.00753471  0.02310807 -0.02456694  0.02911032  0.03448625 -0.00702208
  0.01122562 -0.04440428]
The encoding for  4 th word in 1 th document is : 

 [0.0445941  0.04180247 0.01929151 0.0103756  0.03262359 0.04284254
 0.04171978 0.02684171]
The encoding for  5 th word in 1 th document is : 

 [-0.04316178 -0.01237692  0.03592993 -0.01757231 -0.03096085 -0.04240381
  0.00816634  0.03591936]
The encoding for  6 th word in 1 th document is : 

 [-0.00478292  0.02187446  0.02766911 -0.04128695 -0.00024989  0.03801816
 -0.00529705 -0.00404724]
The encoding for  7 th word in 1 th document is : 

 [-0.01384382 -0.00976931  0.02982005 -0.048

#### Now this makes it easier to visualize that we have 3 (size of corp) documents, each consisting of 12 (maxlen) words, with each word mapped to an 8-dimensional vector.
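
One detail worth noticing in the output above: the 7th and 8th vectors of document 1 start with the same numbers. Those are padded positions, and every padded position holds index 0, so each one receives the same row of the weight matrix. The sketch below reproduces this with a plain-Python lookup table. The random weights are an assumption standing in for the layer's untrained initial weights, so the numbers will differ from the ones printed above.

```python
import random

random.seed(0)
vocab_size, dim = 50, 8

# Untrained weights: small random values, one row per integer index.
weights = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
           for _ in range(vocab_size)]

padded_doc = [13, 29, 22, 49, 34, 10, 0, 0, 0, 0, 0, 0]

# The embedding of a document is just the rows selected by its indices.
doc_vectors = [weights[idx] for idx in padded_doc]

print(len(doc_vectors), len(doc_vectors[0]))

# Every padded position holds index 0, so they all receive the same vector.
print(doc_vectors[6] == doc_vectors[11])

# Different indices select different rows.
print(doc_vectors[0] == doc_vectors[1])
```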

#### HOW TO WORK WITH A REAL PIECE OF TEXT

Just like above, we can now use any other document. We can split the document into sentences with sent_tokenize.

Each sentence is a list of words, which we integer encode using the 'one_hot' function as shown above.

Each sentence will have a different number of words, so we need to pad the sequences to the length of the longest sentence.

**At this point we are ready to feed the input to Keras Embedding layer as shown above.**

**'input_dim' = the vocab size that we will choose**

**'output_dim'  = the number of dimensions we wish to embed into**

**'input_length' = length of the longest document**
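
The steps above can be sketched end to end in plain Python. Everything here is a stand-in for illustration: the regex splitter replaces sent_tokenize, the md5-based `encode` replaces one_hot, `pad_post` replaces pad_sequences, and the sample text is made up. The resulting padded rows are what would be fed to the Embedding layer.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter on ., ! and ? (stand-in for nltk.sent_tokenize)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def encode(sentence, n):
    """Hashing-trick integer encoding into [1, n - 1] (stand-in for one_hot)."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]


def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value` (stand-in for pad_sequences(padding='post'))."""
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


text = ("Bitty bought a bit of butter. But the bit of butter was bitter! "
        "So she bought some better butter.")
vocab_size = 50

sentences = split_sentences(text)
encoded = [encode(s, vocab_size) for s in sentences]
maxlen = max(len(seq) for seq in encoded)
padded = pad_post(encoded, maxlen)

for row in padded:
    print(row)
```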

## THE END !!!

**If you want to see the Keras Embedding layer applied to a real task, e.g. text classification, please check out [this](https://github.com/mrc03/IMDB-Movie-Review-Sentiment-Analysis) repo of mine on GitHub, in which I use embeddings to perform sentiment analysis on the IMDb movie review dataset.**
