# Word Embedding

Artificial neural networks (ANNs) have proven very powerful for many natural language processing tasks. In this notebook, we first describe how to preprocess words into numeric inputs that neural networks can understand. We then describe embedding, a technique that not only converts words and sentences into numbers but also serves as a general-purpose dimensionality reduction method.

## One-Hot Encoding

Let's first understand how to preprocess textual data for our neural networks. Neural networks take numbers as input, not raw text, so we need a way to convert words into a numerical format. The traditional method is one-hot encoding.

In one-hot encoding, we represent each word as a vector of length V, where V is the total number of unique words in the entire text corpus. V is also called the vocabulary size. Each word’s one-hot vector is a binary vector with the value 1 at an index unique to that word and the value 0 at every other index.
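To make the definition concrete, here is a minimal sketch that builds true one-hot vectors with NumPy. The five-word vocabulary and the helper name `one_hot_vector` are assumptions made up for this illustration.

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only).
vocab = ["a", "wonderful", "book", "love", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)  # vocabulary size


def one_hot_vector(word):
    """Return a length-V binary vector with a single 1 at the word's index."""
    vec = np.zeros(V, dtype=int)
    vec[word_to_index[word]] = 1
    return vec


for word in ["wonderful", "book"]:
    print(word, one_hot_vector(word))
```

Each vector has exactly one non-zero entry, and its length grows with the vocabulary.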

Let's create some text to understand One-Hot Encoding:

In [31]:
reviews = [
    'A wonderful book',
    'Love the book',
    'Pulitzer level quality',
    'A beach reading',
    'Writing is awesome',
    'Horrid ... never read again',
    'Has lots of mistakes',
    'Book is fine, bad delivery',
    'A 5th grader job',
    'Bad bad bad'
]

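We will use a vocabulary size of 50 and encode the words with the one_hot function from Keras. Note that one_hot does not return binary vectors: it hashes each word to an integer index between 1 and VOCAB_SIZE - 1, which is the position where the 1 would sit in that word's one-hot vector. Because the index comes from a hash, two different words can collide on the same index (in the output below, "wonderful" and "is" both map to 9).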

In [32]:
from tensorflow.keras.preprocessing.text import one_hot

VOCAB_SIZE = 50
encoded_reviews = [one_hot(d,VOCAB_SIZE) for d in reviews]
print(f'encoded reviews: {encoded_reviews}')

encoded reviews: [[33, 9, 2], [43, 38, 2], [2, 10, 24], [33, 11, 46], [8, 9, 7], [18, 31, 25, 45], [26, 31, 13, 24], [2, 9, 16, 26, 13], [33, 41, 39, 28], [26, 26, 26]]
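The idea behind this kind of encoding can be sketched in plain Python. This is only an approximation written for illustration: the helper name `hashed_indices`, the regex tokenizer, and the choice of md5 are assumptions, and the indices it produces will not match the Keras output above.

```python
import hashlib
import re


def hashed_indices(text, n):
    """Map each word to an index in [1, n - 1] using a stable md5 hash."""
    # Simplified tokenizer (assumption): lowercase, keep alphanumeric runs.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]


print(hashed_indices("Bad bad bad", 50))       # the same word always gets the same index
print(hashed_indices("A wonderful book", 50))
```

A stable hash such as md5 gives the same indices on every run, whereas Python's built-in `hash` for strings can change between sessions. Index 0 is never produced, which leaves it free for padding later on.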


One-hot encoding creates two problems for machine learning:

- The size of each word’s one-hot encoded vector will always be V, the size of the entire vocabulary. In the example it was 50, but in practice V can reach 10k or even 100k. This means we need a huge vector just to represent a single word, which can lead to excessive memory usage when representing text as vectors.
- The index assigned to each word holds no semantic meaning; it is merely an arbitrary value. Ideally, words of similar meaning would have similar vectors, but any two distinct one-hot vectors are equally far apart.
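Both problems can be seen in a few lines of NumPy. In this sketch the vocabulary size of 10,000, the word indices, and the 8-dimensional dense vector are arbitrary assumptions chosen for illustration.

```python
import numpy as np

V = 10_000                      # assumed vocabulary size
dog, cat, computer = 3, 7, 42   # arbitrary (hypothetical) word indices


def one_hot_vector(index, size):
    """Return a float32 vector of zeros with a single 1.0 at `index`."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


dog_vec = one_hot_vector(dog, V)
cat_vec = one_hot_vector(cat, V)
computer_vec = one_hot_vector(computer, V)

# Problem 2: every pair of distinct one-hot vectors has dot product 0,
# so "dog" looks exactly as unrelated to "cat" as it does to "computer".
print(dog_vec @ cat_vec, dog_vec @ computer_vec)

# Problem 1: bytes needed for one word, one-hot versus an 8-dim dense vector.
print(dog_vec.nbytes, np.zeros(8, dtype=np.float32).nbytes)
```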


## Word Embedding

Word embedding (or word vector) is a much more efficient way of representing words. Word embeddings take up far less space than one-hot encoded vectors, and they also carry semantic information about the word.

The intuition behind embedding is this principle:

> Similar words occur more frequently together than dissimilar words.

When a word occurs within the vicinity of another word, it doesn’t always mean it has a similar meaning, but when we consider the frequency of words which are found close together, we find that words of similar meaning are often found together.

For example, the word “Dog” will be found in the vicinity of the word “Cat” far more often than in the vicinity of the word “Computer”. This is because “Dog” shares a certain semantic similarity with “Cat”, so many sentences contain both “Dog” and “Cat”. This is the key observation that deep learning researchers have exploited to come up with word vectors.
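The co-occurrence intuition can be checked by simple counting. The sketch below uses a tiny made-up corpus (the sentences are assumptions for illustration) and counts how often two words share a sentence. It only illustrates the principle and is not how embedding methods are actually trained.

```python
from collections import Counter
from itertools import combinations

# Tiny made-up corpus; the sentences are assumptions for illustration.
corpus = [
    "the dog chased the cat",
    "my cat and my dog sleep together",
    "the dog barked at the cat",
    "the computer runs the program",
    "my computer needs a new keyboard",
]

cooccurrence = Counter()
for sentence in corpus:
    words = sorted(set(sentence.split()))   # unique words, alphabetical order
    for pair in combinations(words, 2):     # every unordered word pair
        cooccurrence[pair] += 1

print(cooccurrence[("cat", "dog")])        # sentences containing both words
print(cooccurrence[("computer", "dog")])
```

In this toy corpus "dog" and "cat" share three sentences, while "dog" and "computer" share none.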

## Creating Word Vectors with Embedding 
An embedding layer converts each word into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0’s and 1’s. The fixed length lets us represent words more compactly, with far fewer dimensions than one-hot vectors.

In this way, the embedding layer works like a lookup table. The word indices are the keys in this table, while the dense word vectors are the values.
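The lookup-table view can be sketched with plain NumPy indexing. The random table below is only a stand-in for learned weights, and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 10, 4
# Stand-in for learned weights: small random values, one row per word index.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

sequence = np.array([3, 1, 3])        # a "sentence" of word indices
vectors = embedding_table[sequence]   # look up one row per index

print(vectors.shape)   # (3, 4)
```

The same index always returns the same row, so the first and third vectors here are identical.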

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

In [10]:
from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten,Embedding,Dense

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

Let's create a simple model with just 1 embedding layer:

In [5]:
model = Sequential()
embedding_layer = Embedding(input_dim=10,output_dim=4,input_length=10)
model.add(embedding_layer)
model.compile('adam','mse')

The embedding layer takes three parameters:

- input_dim : Size of the vocabulary
- output_dim : Length of the vector for each word
- input_length : Length of each input sequence

In the above example, we set the vocabulary size to 10, as we will be encoding the numbers 0 to 9. We want each word vector to have length 4, hence output_dim is set to 4. The input sequence to the embedding layer will have length 10, matching the sample input used next.

Let's give this a go with sample inputs:

In [9]:
input_data = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
pred = model.predict(input_data)
print(input_data.shape)
print(pred)

(1, 10)
[[[-0.02180538 -0.01846747  0.00881319 -0.04496161]
  [-0.02187794  0.01439232  0.03302777 -0.01083774]
  [-0.01575593 -0.04280739  0.04219476 -0.00234377]
  [-0.02075353 -0.02370724  0.01560435 -0.03962264]
  [-0.00951229  0.02094544 -0.04710805  0.04518663]
  [-0.04701214  0.01733999 -0.04683822  0.04251542]
  [-0.03966779  0.02537152 -0.01777736  0.04978364]
  [-0.04238213 -0.00793669  0.0344539  -0.02890325]
  [-0.02519262  0.04823616  0.01561889 -0.03657982]
  [ 0.00447683  0.0488834   0.02157776  0.03869686]]]


The "embedding" vectors are initialized by Keras but not _trained_ yet. Let's first talk about what they mean.

These weights are the "embedding" vector representations of the "words" in the vocabulary {0, 1, ... 9}. This is a lookup table of size 10 x 4, for words 0 to 9. The first word (0) is represented by the first row in this table, which is:

```[-0.02180538 -0.01846747  0.00881319 -0.04496161]```
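One way to connect this table back to one-hot encoding: looking up a row is mathematically the same as multiplying a one-hot vector by the weight matrix. The NumPy sketch below checks this, using random values as a stand-in for the 10 x 4 table (an assumption for illustration, not the Keras weights shown above).

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=(10, 4))   # stand-in for the 10 x 4 embedding table

indices = np.array([0, 5, 9])
one_hot = np.eye(10)[indices]        # shape (3, 10), one one-hot row per index

by_lookup = weights[indices]         # direct row lookup
by_matmul = one_hot @ weights        # same rows via a matrix product

print(np.allclose(by_lookup, by_matmul))   # True
```

The lookup avoids ever building the large one-hot vectors, which is why an embedding layer is implemented as a table rather than a matrix product.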

## Training Word Embedding with Real Data

Let's reuse the sample book reviews from earlier, this time with a sentiment label for each review (1 for positive, 0 for negative):

In [14]:
reviews = [
    'A wonderful book',
    'Love the book',
    'Pulitzer level quality',
    'A beach reading',
    'Writing is awesome',
    'Horrid ... never read again',
    'Has lots of mistakes',
    'Book is fine, bad delivery',
    'A 5th grader job',
    'Bad bad bad'
]

labels = array([1,1,1,1,1,0,0,0,0,0])

As before, we use a vocabulary size of 50 and encode the words with the one_hot function from Keras.

In [17]:
from tensorflow.keras.preprocessing.text import one_hot

VOCAB_SIZE = 50
encoded_reviews = [one_hot(d,VOCAB_SIZE) for d in reviews]
print(f'encoded reviews: {encoded_reviews}')

encoded reviews: [[33, 9, 2], [43, 38, 2], [2, 10, 24], [33, 11, 46], [8, 9, 7], [18, 31, 25, 45], [26, 31, 13, 24], [2, 9, 16, 26, 13], [33, 41, 39, 28], [26, 26, 26]]


As the output shows, the length of each encoded review equals the number of words in that review, because Keras one_hot maps each word to an integer index. The embedding layer expects all inputs to have the same length, so we now need to pad the encoded reviews. Let’s define 5 as the maximum length and pad the shorter reviews with 0’s at the end. Index 0 is safe to use for padding because one_hot never assigns it to a word.

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_length = 5
padded_reviews = pad_sequences(encoded_reviews,maxlen=max_length,padding='post')
print(padded_reviews)

[[33  9  2  0  0]
 [43 38  2  0  0]
 [ 2 10 24  0  0]
 [33 11 46  0  0]
 [ 8  9  7  0  0]
 [18 31 25 45  0]
 [26 31 13 24  0]
 [ 2  9 16 26 13]
 [33 41 39 28  0]
 [26 26 26  0  0]]
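What padding does can be sketched in plain Python. The helper `pad_post` below is a hypothetical stand-in written for illustration, not the Keras implementation. It pads at the end and, for sequences longer than the maximum, keeps the last items, which mirrors the default truncation behaviour of pad_sequences.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                            # truncate from the front
        padded.append(kept + [value] * (maxlen - len(kept)))  # pad at the end
    return padded


print(pad_post([[33, 9, 2], [2, 9, 16, 26, 13], [1, 2, 3, 4, 5, 6]], maxlen=5))
# [[33, 9, 2, 0, 0], [2, 9, 16, 26, 13], [2, 3, 4, 5, 6]]
```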


Now that we have padded, integer-encoded reviews, we are ready to pass them as input to the embedding layer. Let's create a simple Keras model. We fix the length of the embedding vector for each word at 8, and the input length is the maximum length we have already defined as 5.

In [25]:
model = Sequential()
embedding_layer = Embedding(input_dim=VOCAB_SIZE,output_dim=8,input_length=max_length)
model.add(embedding_layer)

In [26]:
model.output_shape

(None, 5, 8)

In [27]:
model.add(Flatten())
model.output_shape

(None, 40)

In [28]:
model.add(Dense(1,activation='sigmoid'))
model.output_shape

(None, 1)

In [29]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
print(model.summary())

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 5, 8)              400       
                                                                 
 flatten_3 (Flatten)         (None, 40)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 41        
                                                                 
Total params: 441
Trainable params: 441
Non-trainable params: 0
_________________________________________________________________
None
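The parameter counts in the summary can be reproduced by hand. The short sketch below just redoes the arithmetic from the layer sizes used above.

```python
vocab_size, embedding_dim, max_length = 50, 8, 5

embedding_params = vocab_size * embedding_dim   # one 8-dim vector per index
flattened_units = max_length * embedding_dim    # 5 positions x 8 values each
dense_params = flattened_units * 1 + 1          # 40 weights plus one bias

print(embedding_params, flattened_units, dense_params)   # 400 40 41
print(embedding_params + dense_params)                   # 441
```

Almost all of the trainable parameters sit in the embedding table, and that share grows with the vocabulary size.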


Let's train the model for 100 epochs.

In [22]:
model.fit(padded_reviews,labels,epochs=100,verbose=0)

<keras.callbacks.History at 0x7f90010be640>

Once training is complete, the embedding layer has learned its weights, which are simply the vector representations of each word index. Let's check the shape of the weight matrix.

In [23]:
print(embedding_layer.get_weights()[0].shape)

(50, 8)


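If we check the embedding stored at index 0, the first row of the table, we get the following vector. Note that in this setup index 0 is the padding value rather than a real word.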

In [30]:
print(embedding_layer.get_weights()[0][0])

[ 0.02152041 -0.02873616 -0.02618781  0.03694941 -0.03886317 -0.04280834
  0.00144855 -0.04730001]


So this is how we train an embedding layer on our text corpus and get the embedded vectors for each word. These vectors are then used to represent words in a sentence.

Embeddings are a great way to deal with NLP problems for two reasons:
- They reduce dimensionality compared with one-hot encoding, since we control the number of features.
- They can capture the context in which a word is used, so that similar words end up with similar embeddings.
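A common way to measure whether two embeddings are "similar" is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors, which are assumptions for illustration and not learned values. With a trained model, the rows of `embedding_layer.get_weights()[0]` could be compared in the same way.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-picked vectors, purely illustrative (not learned).
embeddings = {
    "dog": np.array([0.9, 0.8, 0.1]),
    "cat": np.array([0.8, 0.9, 0.2]),
    "computer": np.array([0.1, 0.2, 0.9]),
}

dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_computer = cosine_similarity(embeddings["dog"], embeddings["computer"])
print(round(dog_cat, 2), round(dog_computer, 2))
```

Here "dog" scores much higher against "cat" than against "computer", which is the behaviour we hope a well-trained embedding will show.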