# Natural Language Neural Nets

NLP really blew up with the 2013 paper by Mikolov et al. \[1\] that introduced word2vec. It offered a way to represent similarity and relationships between words through the use of word vectors.

![The words 'hello' and 'world' in vector format](../../assets/images/word2vec_example.png)

These initial word vectors had between 50 and 100 dimensions. The way they were encoded meant that similar words would be grouped together (Monday, Tuesday, etc.), and calculations on the vector space could produce genuinely insightful relationships.

![Arithmetic using word vectors](../../assets/images/king_man_woman_queen_example.png)

A well-known example is taking the vector for *King*, subtracting the vector for *Man*, and adding the vector for *Woman*. The nearest datapoint to the result is *Queen*.
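The arithmetic can be sketched in a few lines of NumPy. The three-dimensional vectors below are hand-picked, hypothetical values chosen only to make the example work; real word2vec vectors are learned from text and have many more dimensions. The `nearest` helper is also an illustrative name, not part of any library.

```python
import numpy as np

# Hypothetical toy vectors, hand-picked for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.3, 0.3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# King - Man + Woman
result = vectors["king"] - vectors["man"] + vectors["woman"]

# The input words are excluded from the search, as is common for analogy queries.
answer = nearest(result, exclude=("king", "man", "woman"))
print(answer)  # queen
```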

## Recurrence

During this boom in NLP, the recurrent neural network (RNN) quickly became the favorite for most language applications. RNNs suited language well thanks to their *recurrence*.

![Recurrent neural net](../../assets/images/rnn.png)

A recurrent neural network unit will consume the first time-step *'the'* and pass its output state on to the next time-step *'quick'*. This recurrent process continues for a specified number of time-steps (the sequence length).

This recurrence allowed the neural nets to consider the order of words and their effect on preceding and subsequent words, so the nuances of human language could be better represented.
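As a rough sketch of that recurrence, the snippet below runs a plain (Elman-style) RNN over a four-word sequence. The embeddings and weights are random stand-ins for learned values, and the sizes, the `tanh` activation, and names such as `rnn_forward` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, hidden_dim = 4, 3
sentence = ["the", "quick", "brown", "fox"]

# Random stand-ins for learned word vectors and learned weights.
embeddings = {word: rng.normal(size=embed_dim) for word in sentence}
W_x = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b = np.zeros(hidden_dim)

def rnn_forward(words):
    """Consume one word per time-step, passing the hidden state forward."""
    h = np.zeros(hidden_dim)
    states = []
    for word in words:
        # The new state depends on the current word AND the previous state.
        h = np.tanh(W_x @ embeddings[word] + W_h @ h + b)
        states.append(h)
    return np.stack(states)

states = rnn_forward(sentence)
reversed_states = rnn_forward(sentence[::-1])

print(states.shape)  # (4, 3): one hidden state per time-step
```

Feeding the same words in reverse order gives a different final state, which is one way to see that the network is sensitive to word order.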

Although RNNs did not see popular usage until 2013, the concept and methodology were already being discussed across several papers in the 80s \[1\], \[2\].

### Vanishing Gradients

RNNs came with their problems, primarily the vanishing gradient problem. The recurrence of these networks means that they are by nature very deep networks with many points containing an operation between the incoming data and the neuron weight.

When calculating the error of the network, and using that to update the network weights, we walk back through the network updating weight after weight.

If the update gradient is a small number, we multiply by it again and again as we walk back through the time-steps, producing a smaller and smaller value. The contribution from the earliest time-steps all but disappears, meaning the full network either takes a very long time to train or simply does not work.

On the other hand, if the recurrent weight values are too high, we suffer from the exploding gradient problem. Here, the network weights will oscillate without learning any meaningful representation.
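A deliberately simplified illustration: treat the backward pass as multiplying the gradient by the same scalar once per time-step. Real backpropagation through time multiplies by Jacobian matrices rather than a single number, so the scalar factors `0.9` and `1.1` and the 50-step sequence length are assumptions chosen only to show the trend.

```python
time_steps = 50

def gradient_scale(recurrent_factor, steps=time_steps):
    """Size of a unit gradient after being multiplied by the same factor at every step."""
    return recurrent_factor ** steps

vanishing = gradient_scale(0.9)  # factor slightly below 1
exploding = gradient_scale(1.1)  # factor slightly above 1

print(f"factor 0.9 over {time_steps} steps: {vanishing:.5f}")
print(f"factor 1.1 over {time_steps} steps: {exploding:.1f}")
```

Even factors this close to 1 shrink the gradient to well under 1% of its original size, or grow it more than a hundredfold, over 50 steps.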

### Long Short Term Memory

The solution to the vanishing gradients problem came with the introduction of long short-term memory (LSTM) units.

![Flow of information through a single LSTM unit](../../assets/images/lstm.png)

LSTM units introduced a more stable passage of information: the cell state, shown in black above. This additional stream of information is carried along the chain of time-steps with a minimal number of transformations, each controlled by *'gates'*.

![Flow of information over several time-steps](../../assets/images/lstm_cell_state.png)

This allowed long-term dependencies to be learned by allowing information from much earlier in a sequence to be retained and applied to states much later in the sequence.
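The sketch below shows one time-step of a standard LSTM formulation (forget, input, and output gates plus a candidate value). It may not match the diagram above detail for detail, and the weight layout, sizes, and the `lstm_step` name are assumptions made for illustration. Note how the cell state `c` is only scaled and added to, which is what lets information survive across many time-steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time-step. W has shape (4 * hidden, input + hidden)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: what to keep from c_prev
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much new info to write
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values to write
    c = f * c_prev + i * g                 # cell state: elementwise scale and add only
    h = o * np.tanh(c)                     # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Random stand-ins for learned parameters.
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # a sequence of 5 time-steps
    h, c = lstm_step(x, h, c, W, b)

print(h.shape, c.shape)  # (3,) (3,)
```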

### Attention

Very quickly recurrent encoder-decoder models were complemented with additional hidden states and neural network layers — these produced the attention mechanism.

![Attention mechanism in LSTM network](../../assets/images/lstm_attention.png)

With these additions, the output layers of an encoder-decoder model receive not only the final state of the RNN units but also information from every state of the input layer, creating an ‘attention’ mechanism.

![Attention between encoder and decoder neurons during English-French translation task](../../assets/images/attention_example.png)

Using this approach, we find that similarity between the encoder and decoder states will result in a higher weight — producing outcomes like that of the French translation image above.

With this encoder-decoder implementation, three tensors are used in the attention operation: the *Query*, *Key*, and *Value*. The *Query* is pulled from the hidden state of the decoder at every time-step; alignment between this and the *Key-Value* tensors is assessed to produce the context vector.

![Context vector in LSTM attention network](../../assets/images/context_vector_lstm.png)

The context vector is then passed back into the decoder, where it is used to produce a prediction for that time-step. This process is repeated for every time-step in the decoder space.
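A minimal sketch of one decoder time-step, assuming dot-product scoring for the alignment (other scoring functions exist, and the models pictured above may use a different one). The encoder states act as both Keys and Values, the decoder state is the Query, and all values are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
src_len, hidden_dim = 6, 8

encoder_states = rng.normal(size=(src_len, hidden_dim))  # Keys and Values
decoder_state = rng.normal(size=hidden_dim)              # Query for this time-step

scores = encoder_states @ decoder_state  # alignment: one score per input time-step
weights = softmax(scores)                # attention weights, summing to 1
context = weights @ encoder_states       # context vector: weighted sum of the Values

print(weights.round(2))
print(context.shape)  # (8,)
```

Repeating this for each decoder time-step, with a fresh Query each time, gives one row of a weight matrix like the translation figure above.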

## Attention Is All You Need

Solo-attention began with the 2017 paper *'Attention Is All You Need'* \[3\]. You may have guessed it already: this paper introduced the idea that we don’t need to use these complex convolutional or recurrent neural networks alongside attention. Attention is, in fact, all you need.

### Self-Attention

Self-attention was a key factor for this to function. It meant that whereas before the Query came from the output decoder, it is now generated directly from the input values alongside Key and Value.

![Self attention, attention changes with a slight variation in the sentences](../../assets/images/self_attention_two_diff_phrases.png)

Because the Query, Key, and Value are all produced by the input, we are able to encode the alignment between different parts of the same input sequence. If we take the image above, we can see that changing the final word from *tired* to *wide* shifts the attention focus from *animal* to *street*.

This allows the attention mechanism to encode relationships between all of the words in the input data.
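Self-attention can be sketched as follows, using the scaled dot-product form from the Vaswani et al. paper. The projection matrices are random stand-ins for learned weights, and the small sizes are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    # Query, Key, and Value are all projections of the same input.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) alignment scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))  # stand-in for embedded input tokens
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)  # (5, 8) (5, 5)
```

Row `t` of `weights` shows how strongly token `t` attends to every token in the same sequence, itself included.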

### Multi-Head Attention

The next big change to the attention mechanism was the addition of multiple attention heads — essentially many self-attention operations performed in parallel and initialized with different weights.

![Multi-head attention](../../assets/images/multi_head_attention.png)

Multi-head attention refers to the processing of multiple attention ‘heads’ in parallel. The outputs of these multiple heads are concatenated together.

Without multi-head attention, the A. Vaswani et al. transformer model actually performed worse than many of its predecessors \[3\].

The parallel mechanism allowed the model to represent several *subspaces* of the same sequence. These different levels of attention were then concatenated and processed by a linear unit.
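The following sketch runs several self-attention heads, concatenates their outputs, and applies a final linear projection. The heads are computed in a Python loop here for clarity, whereas real implementations batch them. The toy sizes (4 heads, `d_model` of 16) and random weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """Run each head's self-attention, concatenate the results, then project."""
    head_outputs = []
    for W_q, W_k, W_v in heads:  # each head has its own projection weights
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)
    concatenated = np.concatenate(head_outputs, axis=-1)  # (seq_len, num_heads * d_head)
    return concatenated @ W_o                             # final linear layer

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
d_head = d_model // num_heads

# Random stand-ins for learned weights: one (W_q, W_k, W_v) triple per head.
heads = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
    for _ in range(num_heads)
]
W_o = rng.normal(scale=0.1, size=(d_model, d_model))
X = rng.normal(size=(seq_len, d_model))

mha_output = multi_head_attention(X, heads, W_o)
print(mha_output.shape)  # (5, 16)
```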

### Positional Encoding

The input to a Transformer model is not processed sequentially as it is in an RNN. In the past, it was this sequential operation that allowed us to consider the position and order of words.

To maintain the positional information of words, a positional encoding is added to the word embedding before entering the attention mechanism.

The approach taken in the Attention Is All You Need paper was to produce a different sinusoidal function for every dimension of the embedding.

Remember how we said word2vec introduced the concept of representing a word as many numbers in a 50 to 100-dimensional vector? Here, in the Vaswani et al. paper, they use the same idea but to represent the position of a word.

However, this time, rather than calculating the vector values using an ML model, the values are calculated using modified sinusoidal functions:

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
$$

The two formulae give our alternating positional encoding values, using the word position **pos**, the embedding dimension index **i**, and the number of embedding dimensions **d_model**. Together they produce this:

![Sine followed by cosine function](../../assets/images/sin_cos_alternating_pos_encoding.png)

The indices of the vector alternate between sine and cosine functions (index 0 is sine, index 1 is cosine, and so on). Next, as the index value increases away from zero towards d (the embedding dimensionality), the frequency of the sinusoidal function decreases:

![Frequency change with increasing embedding indices](../../assets/images/sin_func_embeddings_frequency.png)

We can take the same unruly sinusoidal plot from above, extend it to the 512 embedding dimensions used in the Vaswani et al. paper, and map these onto an *easier* to understand heatmap:

![Positional encoding heatmap](../../assets/images/positional_encoding_heatmap.png)

We can see the higher frequency in the lower embedding dimensions (left) — which decreases as the embedding dimension increases. Around dimension 24, the frequency has decreased so much that we no longer see any change in the remaining (alternating) sine-cosine waves.

These positional encodings are then added to the word embeddings.

*As a side note, this means that both the word embedding dimensionality and positional encoding dimensionality must match.*
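The two formulae translate fairly directly into NumPy. This is a minimal sketch that assumes an even `d_model`; the sequence length of 50 is an arbitrary choice, and the word embeddings are random stand-ins used only to show the final addition.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model); d_model must be even."""
    positions = np.arange(max_len)[:, None]        # pos, shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the 2i values, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices use sine
    pe[:, 1::2] = np.cos(angles)  # odd indices use cosine
    return pe

max_len, d_model = 50, 512
pe = positional_encoding(max_len, d_model)

# Random stand-in for word embeddings; shapes must match for the addition to work.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(max_len, d_model))
model_input = word_embeddings + pe

print(pe.shape, model_input.shape)  # (50, 512) (50, 512)
```

Plotting `pe` as an image (positions on one axis, embedding indices on the other) should give a heatmap similar in shape to the one shown above.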

### The Transformer

![First transformer architecture from A. Vaswani et al. paper](../../assets/images/first_transformer.png)

Together, these changes to the attention model produced the world’s first Transformer.

Beyond the word embedding, positional encodings, and multi-head self-attention operations already discussed, the model is reasonably easy to comprehend.