# Word Embeddings

# 1. What Are Word Embeddings?

A word embedding is a learned representation for text where words that have the same meaning have a similar representation.

It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

>One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.

## 1.1. One Hot Encoding

One-hot encoding is often the go-to technique when a data set contains a categorical feature (assuming the feature is useful for the model, obviously). It is simple and straightforward: each category is replaced with a vector of zeros, except at the position of the category’s index, which holds a 1.

When applying one-hot encoding to a text document, tokens are replaced by their one-hot vectors, and a sentence is in turn transformed into a 2D matrix of shape (n, m), where n is the number of tokens in the sentence and m is the size of the vocabulary.

<img src="figures/words_ohe.png" width="600px">


```
As an example, say that we were to use one-hot encoding for the following sentences:

s1. "This is sentence one."
s2. "Now, here is our sentence number two."

The vocabulary from the two sentences is:
vocabulary = {"here": 0, "is": 1, "now": 2, "number": 3, "one": 4, "our": 5, "sentence": 6, "this": 7, "two": 8}
The two sentences represented by one-hot vectors are:
indices                             words
      0  1  2  3  4  5  6  7  8
s1: [[0, 0, 0, 0, 0, 0, 0, 1, 0], - "this"
     [0, 1, 0, 0, 0, 0, 0, 0, 0], - "is"
     [0, 0, 0, 0, 0, 0, 1, 0, 0], - "sentence"
     [0, 0, 0, 0, 1, 0, 0, 0, 0], - "one"
     [0, 0, 0, 0, 0, 0, 0, 0, 0], - PAD
     [0, 0, 0, 0, 0, 0, 0, 0, 0], - PAD
     [0, 0, 0, 0, 0, 0, 0, 0, 0]] - PAD
     
s2: [[0, 0, 1, 0, 0, 0, 0, 0, 0], - "now"
     [1, 0, 0, 0, 0, 0, 0, 0, 0], - "here"
     [0, 1, 0, 0, 0, 0, 0, 0, 0], - "is"
     [0, 0, 0, 0, 0, 1, 0, 0, 0], - "our"
     [0, 0, 0, 0, 0, 0, 1, 0, 0], - "sentence"
     [0, 0, 0, 1, 0, 0, 0, 0, 0], - "number"
     [0, 0, 0, 0, 0, 0, 0, 0, 1]] - "two"
```

The vocabulary only grows as the training corpus gets larger. As a result, each token is represented by an increasingly long vector, and the matrices become increasingly sparse.

For NLP tasks such as Text Generation or Classification, one-hot representations or count vectors might capture enough information for the model to make good decisions. However, they are less effective for tasks such as Sentiment Analysis, Neural Machine Translation, and Question Answering, where a deeper understanding of the context is required to achieve great results.
Take one-hot encoding as an example: it will not lead to a well-generalized model for these tasks because no meaningful comparison can be made between two words. All the vectors are orthogonal to one another: the inner product of any two distinct vectors is zero, and every pair of distinct words is equally far apart, so similarity cannot be measured by distance.
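As a minimal sketch of the point above, the snippet below one-hot encodes the two example sentences with NumPy and checks the inner product between two word vectors. The helper name `one_hot_encode` and the padding length are assumptions made for illustration, not part of any library.

```python
import numpy as np

# Vocabulary from the two example sentences (word -> index).
vocabulary = {"here": 0, "is": 1, "now": 2, "number": 3, "one": 4,
              "our": 5, "sentence": 6, "this": 7, "two": 8}

def one_hot_encode(tokens, vocabulary, max_len):
    """Return a (max_len, vocab_size) matrix; padding rows stay all-zero."""
    matrix = np.zeros((max_len, len(vocabulary)), dtype=int)
    for position, token in enumerate(tokens):
        matrix[position, vocabulary[token]] = 1
    return matrix

s1 = one_hot_encode(["this", "is", "sentence", "one"], vocabulary, max_len=7)
s2 = one_hot_encode(["now", "here", "is", "our", "sentence", "number", "two"],
                    vocabulary, max_len=7)
print(s1.shape)  # (7, 9)

# Distinct words never share a non-zero position, so their inner product is 0.
this_vec, is_vec = s1[0], s1[1]
print(this_vec @ is_vec)    # 0
print(this_vec @ this_vec)  # 1

# Every pair of distinct one-hot vectors is the same distance apart (sqrt(2)),
# so distance carries no information about how related two words are.
print(np.linalg.norm(this_vec - is_vec))
```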

## 1.2. Embeddings

For this, we turn to Word Embeddings, a featurized word-level representation capable of capturing the semantic meanings of words.
With embeddings, each word is represented by a dense vector of fixed size (generally between 50 and 300 dimensions), with values corresponding to a set of features, e.g. Masculinity, Femininity, Age. As shown in the figure below, these features can be seen as different aspects of a word’s semantic meaning. Their values are randomly initialized and then updated during training, just like the other parameters of the model.

<img src="figures/words_embeddings.png">

When training embeddings, we do not tell the model what these features should be; it is up to the model to decide which ones are best for the learning task. When setting up an embedding matrix (a set of word embeddings), we only define its shape: the number of words and the length of each vector. What each feature represents is generally difficult to interpret.

The ability of word embeddings to capture semantic meaning can be illustrated by projecting these high-dimensional vectors into 2D space with t-SNE. If the embeddings were learned successfully, the plot shows words with similar meanings ending up close to one another.

<img src="figures/words_embeddings_2D.png" width="600px">

## 1.3. Training Word Embeddings

Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.

The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process, using document statistics.

<img src="figures/catagorical_embedding.svg" width="800px">

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset (it can also be used with sparse categorical inputs).

It is a flexible layer that can be used in a variety of ways, such as:

* It can be used alone to learn a word embedding that can be saved and used in another model later.
* It can be used as part of a deep learning model where the embedding is learned along with the model itself.
* It can be used to load a pre-trained word embedding model, a type of transfer learning.

# 2. TF Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.

**It requires that the input data be integer encoded**, so that each word is represented by a unique integer. (TF Keras looks up embeddings directly from these integers, which is more efficient than multiplying a sparse one-hot vector by a dense matrix.) This data preparation step can be performed using the Tokenizer API, also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

The Embedding layer is defined as the first hidden layer of a network. It is typically configured with 3 arguments:

* input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
* output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
* input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents consist of 1000 words, this would be 1000. This argument is optional, and newer Keras releases deprecate it.

The output of the Embedding layer is a 2D matrix for each input sequence (input document), with one embedding vector per word. A batch of sequences therefore yields a 3D tensor of shape (batch_size, input_length, output_dim).

If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
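The shapes involved can be sketched in NumPy without Keras. The snippet below assumes the example values from the argument list (a vocabulary of 11 words, 32-dimensional vectors, documents of 1000 words) and a hypothetical batch of 3 documents; the table is random, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, output_dim, input_length = 11, 32, 1000
batch_size = 3  # hypothetical number of documents

# Stand-in for the layer's weights: one row per word in the vocabulary.
embedding_table = rng.normal(size=(input_dim, output_dim))

# Integer-encoded documents with values between 0 and 10.
documents = rng.integers(low=0, high=input_dim, size=(batch_size, input_length))

# Looking up every integer gives one vector per word: a 2D matrix per document.
embedded = embedding_table[documents]
print(embedded.shape)   # (3, 1000, 32)

# Flattening each document's matrix gives the 1D vector a Dense layer expects.
flattened = embedded.reshape(batch_size, -1)
print(flattened.shape)  # (3, 32000)
```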

# 3. How Does the Embedding Layer Work in TF Keras?

The Embedding layer performs a select (lookup) operation: each integer input is used as an index into a table that contains all possible vectors.

<img src="figures/embedding_layer.jpeg" >



Let’s see a simple text processing example. Our training set consists only of two phrases:

Hope to see you soon → [0, 1, 2, 3, 4]

Nice to see you again → [5, 1, 2, 3, 6]

We encoded these phrases by assigning each word a unique integer (for example, by order of appearance in the training dataset), which serves as its row index in the embedding matrix. This is why the size of the vocabulary must be specified as the first argument below: it allows the table to be initialised.
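The order-of-appearance encoding used for the two phrases can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the Keras Tokenizer (which, among other differences, starts its indices at 1); the variable names are made up for illustration.

```python
phrases = ["Hope to see you soon", "Nice to see you again"]

word_to_index = {}  # filled in order of first appearance
encoded = []
for phrase in phrases:
    ids = []
    for word in phrase.lower().split():
        if word not in word_to_index:
            # The next unused integer becomes this word's index.
            word_to_index[word] = len(word_to_index)
        ids.append(word_to_index[word])
    encoded.append(ids)

print(encoded)             # [[0, 1, 2, 3, 4], [5, 1, 2, 3, 6]]
print(len(word_to_index))  # 7
```

The vocabulary size (7) is the number of rows the embedding table needs.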


```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# The two integer-encoded phrases
X = np.array([
    [0, 1, 2, 3, 4],
    [5, 1, 2, 3, 6]
])

# 7 words in the vocabulary, each mapped to a 2-dimensional vector
model = Sequential()
model.add(Embedding(input_dim=7,
                    output_dim=2,
                    input_length=5))

model.compile(loss='mse', optimizer='sgd')
print(model.predict(X, batch_size=32))  # output shape: (2, 5, 2)
```

```
[[[-0.00717581 -0.01187963]
  [-0.04137641 -0.00176752]
  [ 0.02428936  0.02234766]
  [-0.02888275 -0.0383324 ]
  [-0.02276908 -0.04999352]]

 [[ 0.0312614  -0.01271391]
  [-0.04137641 -0.00176752]
  [ 0.02428936  0.02234766]
  [-0.02888275 -0.0383324 ]
  [ 0.00499725 -0.02520583]]]
```


```python
# The layer's weights are the lookup table: row i is the vector for integer i
model.layers[0].weights
```

```
[<tf.Variable 'embedding/embeddings:0' shape=(7, 2) dtype=float32, numpy=
 array([[-0.00717581, -0.01187963],
        [-0.04137641, -0.00176752],
        [ 0.02428936,  0.02234766],
        [-0.02888275, -0.0383324 ],
        [-0.02276908, -0.04999352],
        [ 0.0312614 , -0.01271391],
        [ 0.00499725, -0.02520583]], dtype=float32)>]
```

**You can emulate an embedding layer with a fully-connected (Dense) layer by converting every integer input into a one-hot encoding and performing a matrix multiplication with the weights (the embedding matrix).**

The whole point of dense embedding is to avoid one-hot representation. In NLP, the vocabulary size can be of the order of 100k (sometimes even a million). On top of that, it is often necessary to process sequences of words in batches. Processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and the backward pass.
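The equivalence between the lookup and the one-hot matrix multiplication can be checked with a short NumPy sketch. The weight matrix here is random and merely stands in for a trained embedding matrix; the shapes follow the two-phrase example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 7, 2
W = rng.normal(size=(vocab_size, embedding_dim))  # stand-in embedding matrix

X = np.array([
    [0, 1, 2, 3, 4],
    [5, 1, 2, 3, 6],
])

# Gather: use each integer as a row index into the table.
gathered = W[X]                  # shape (2, 5, 2)

# Dense emulation: one-hot encode, then multiply by the same weights.
one_hot = np.eye(vocab_size)[X]  # shape (2, 5, 7)
multiplied = one_hot @ W         # shape (2, 5, 2)

print(gathered.shape, one_hot.shape)      # (2, 5, 2) (2, 5, 7)
print(np.allclose(gathered, multiplied))  # True
```

Both paths give the same result, but the one-hot path materialises a tensor whose last dimension is the vocabulary size, which is what makes it impractical for large vocabularies.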

# 4. How Is the Embedding Layer Trained in Keras?

Embedding layers in Keras are trained just like any other layer in your network architecture: they are tuned to minimize the loss function using the selected optimization method. The main difference from other layers is that the output is not computed by arithmetic on the input values. Instead, the input to the layer is used to index a table of embedding vectors, as we have seen above. The underlying automatic differentiation engine nevertheless treats the embeddings like any other weights in the network and optimizes these vectors to minimize the loss function.

Embeddings can be learned with dedicated setups such as word2vec, which aims to capture the semantics of words, or as part of a supervised task such as sentiment classification. Let’s implement a small binary classification task and visualise the details of the architecture to gain a deeper understanding.
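To make the training step concrete, here is a hand-written NumPy sketch of one gradient-descent update on a lookup table. It is not how Keras implements the layer internally; the all-zero target, the mean-squared-error loss, and the learning rate of 0.1 are arbitrary assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(7, 2))          # embedding table: 7 words, 2 dimensions
W_before = W.copy()

indices = np.array([5, 1, 2, 3, 6])  # one integer-encoded phrase
target = np.zeros((5, 2))            # arbitrary target, for illustration only

# Forward pass: the lookup selects rows of W.
output = W[indices]
loss_before = np.mean((output - target) ** 2)

# Backward pass: the gradient w.r.t. the output is scattered back onto
# the rows that were selected; all other rows receive zero gradient.
grad_output = 2 * (output - target) / output.size
grad_W = np.zeros_like(W)
np.add.at(grad_W, indices, grad_output)

W -= 0.1 * grad_W                    # plain SGD step

loss_after = np.mean((W[indices] - target) ** 2)
changed_rows = np.where(np.any(W != W_before, axis=1))[0]
print(changed_rows)                  # [1 2 3 5 6]
print(loss_after < loss_before)      # True
```

Only the rows for words that appear in the input are updated, which is why words never seen during training keep their random initial vectors.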

We will have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”.

<img src="figures/embedding_layer_learning.jpeg">



```python
from numpy import array
from tensorflow.keras.preprocessing import text
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding

# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels (1 = positive, 0 = negative)
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# tokenize documents
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(docs)
word2idx = tokenizer.word_index

# integer encode the documents
encoded_docs = [[word2idx[w] for w in text.text_to_word_sequence(doc)] for doc in docs]
print(encoded_docs)

# pad documents to a max length of 4 words (index 0 is used for padding)
max_length = 4
vocab_size = 20  # upper bound; the tokenizer finds 14 distinct words
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

# define the model: 4-dimensional embeddings, flattened into a sigmoid classifier
model = Sequential()
model.add(Embedding(vocab_size, 4, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

# fit the model
model.fit(padded_docs, labels, epochs=100, verbose=1)
# evaluate the model on the training data
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))
```

```
[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 4, 4)              80        
_________________________________________________________________
flatten_8 (Flatten)          (None, 16)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 17        
Total params: 97
Trainable params: 97
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
...
Epoch 100/100
Accuracy: 100.000000
```


# 5. Pre-trained Embeddings

## 5.1. Word2Vec


Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.

It was developed by Tomas Mikolov et al. at Google in 2013 to make neural-network-based training of embeddings more efficient, and it has since become the de facto standard for developing pre-trained word embeddings.

Additionally, the work involved analysis of the learned vectors and the exploration of vector math on the representations of words. For example, subtracting the “man-ness” from “King” and adding “woman-ness” results in the word “Queen”, capturing the analogy “king is to queen as man is to woman”.

>We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King – Man + Woman” results in a vector very close to “Queen.”
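The vector arithmetic behind the analogy can be illustrated with a toy example. The 2-dimensional vectors below are hand-made (one "gender" axis and one "royalty" axis) and are purely hypothetical; real word2vec vectors are learned, have many more dimensions, and only approximately satisfy such relations.

```python
import numpy as np

# Hypothetical 2-feature embeddings: [gender, royalty]. Values are made up.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "apple": np.array([0.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King - Man + Woman"
query = vectors["king"] - vectors["man"] + vectors["woman"]

# Search the remaining words for the vector closest to the query.
candidates = {w: v for w, v in vectors.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # queen
```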


Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:

* **Continuous Bag-of-Words Model** which predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
* **Continuous Skip-gram Model** which predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below.


<img src="figures/word2vec.png">

Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words. This window is a configurable parameter of the model.

The size of the sliding window has a strong effect on the resulting vector similarities. Large windows tend to produce more topical similarities, while smaller windows tend to produce more functional and syntactic similarities.

While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word) where context_word appears in the neighboring context of target_word.

Consider the following sentence of 8 words.

>The wide road shimmered in the hot sun.

The context words for each of the 8 words of this sentence are defined by a window size. The window size determines the span of words on either side of a target_word that can be considered context words. Take a look at this table of skip-grams for target words based on different window sizes.

<img src="figures/skip_grams.png" width="800px">
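The (target_word, context_word) pairs for the example sentence can also be generated with a few lines of plain Python. The function `skip_gram_pairs` is a hand-rolled helper written for this illustration, not a library API, and it ignores refinements such as subsampling or negative sampling.

```python
def skip_gram_pairs(tokens, window_size):
    """Return every (target_word, context_word) pair within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "The wide road shimmered in the hot sun".lower().split()

pairs = skip_gram_pairs(sentence, window_size=2)
print(pairs[:5])
# [('the', 'wide'), ('the', 'road'), ('wide', 'the'), ('wide', 'road'), ('wide', 'shimmered')]
print(len(pairs))  # 26
```

A larger `window_size` produces more pairs per target word, which is one way the window affects what the embeddings end up capturing.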

## 5.2. GloVe


* GloVe is another algorithm for learning word embeddings. It is arguably the simplest of the approaches covered here.

* It is not used as much as the word2vec models (CBOW or skip-gram), but it has some enthusiasts because of its simplicity.

* GloVe stands for Global Vectors for Word Representation.

* Let's use this example: "I want a glass of orange juice to go along with my cereal".

* We will choose a context and a target from the choices we have mentioned in the previous sections.

* Then we will calculate this for every pair: X_ct = number of times t appears in the context of c

* X_ct = X_tc if the context is a symmetric window around the target, but the two are not equal if the context is, for example, only the previous words. In GloVe a window is used, which means they are equal.

* The model is defined like this:

<img src="figures/glove.png" width="800px">
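The co-occurrence counts X_ct described in the list above can be sketched in plain Python. The helper `cooccurrence_counts` and the window size of 2 are assumptions made for illustration; this only builds the counts and does not train a GloVe model.

```python
from collections import Counter

def cooccurrence_counts(tokens, before, after):
    """counts[(c, t)] = number of times t appears in the context of c."""
    counts = Counter()
    for i, c in enumerate(tokens):
        start = max(0, i - before)
        end = min(len(tokens), i + after + 1)
        for j in range(start, end):
            if j != i:
                counts[(c, tokens[j])] += 1
    return counts

sentence = "I want a glass of orange juice to go along with my cereal"
tokens = sentence.lower().split()

# Symmetric window: 2 words on each side, so X_ct == X_tc for every pair.
symmetric = cooccurrence_counts(tokens, before=2, after=2)
print(symmetric[("orange", "juice")], symmetric[("juice", "orange")])  # 1 1

# Context = previous words only: the counts are no longer symmetric.
left_only = cooccurrence_counts(tokens, before=2, after=0)
print(left_only[("juice", "orange")], left_only[("orange", "juice")])  # 1 0
```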
