# Transformer fundamentals

## How do RNNs relate to NLP and Transformers?

Recurrent Neural Networks (RNNs), often built with Long Short-Term Memory (LSTM) blocks, have historically been used for NLP tasks
that depend on context, such as translation and text-to-speech.
An interesting [article](https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7)
from 2017 cites Siri and Google Translate as products that use this architecture. However successful,
these models have major drawbacks in trainability and performance on tasks that require
attention to context:

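1. RNNs struggle with long sequences of text, because information from early tokens fades by the time later tokens are processed.
2. RNNs process tokens one at a time, so they are slow to train and slow to run: the computation cannot be parallelized across the sequence.
3. Highly sequential models suffer from vanishing/exploding gradients, because the gradient is multiplied through every time step during backpropagation.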
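
Drawbacks 2 and 3 can be made concrete with a small sketch. It uses a deliberately simplified scalar, linear recurrence (`h_t = w * h_{t-1} + x_t`), which is an assumption made for illustration and not a real RNN or LSTM cell. The function names are hypothetical.

```python
def run_recurrence(inputs, w, h0=0.0):
    """Process inputs one step at a time; step t needs the result of step t-1."""
    h = h0
    states = []
    for x in inputs:          # sequential: this loop cannot run in parallel over time
        h = w * h + x
        states.append(h)
    return states


def gradient_through_time(w, steps):
    """Return d h_T / d h_0 for the linear recurrence: w multiplied `steps` times."""
    grad = 1.0
    for _ in range(steps):
        grad *= w             # chain rule contributes one factor of w per time step
    return grad


if __name__ == "__main__":
    print(run_recurrence([1.0, 1.0, 1.0], w=0.5))  # [1.0, 1.5, 1.75]
    print(gradient_through_time(0.5, 50))          # shrinks toward zero (vanishing)
    print(gradient_through_time(1.5, 50))          # grows very large (exploding)
```

With a recurrent weight below 1 the gradient signal from early steps all but disappears after 50 steps, and with a weight above 1 it blows up. Real RNNs have matrices and nonlinearities in place of the scalar `w`, but the repeated multiplication is the same.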

![Sample RNN architecture](https://miro.medium.com/v2/resize:fit:720/format:webp/1*4KwIUHWL3sTyguTahIxmJw.png)

## Enter Transformers

In 2017, Google researchers published an influential [paper](http://arxiv.org/abs/1706.03762), "Attention Is All You Need". The paper introduced a model
built on _self-attention_ mechanisms. This model, dubbed the _Transformer_, arranges attention layers in
an _encoder_-_decoder_ architecture and was originally applied to sequence-to-sequence tasks such as machine translation.
The figure below, taken from the paper, shows the general architecture of a Transformer model.

![Transformer architecture](https://daleonai.com/images/screen-shot-2021-05-06-at-12.12.21-pm.png)

The architecture has the following key features:

- Positional Encodings:
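	- Positional encodings are added to the token embeddings at the encoder input and at the decoder input (the shifted outputs), so each embedding also carries information about the token's position in the sequence. Without them, attention alone has no notion of word order.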
- Attention:
	- This is the key takeaway from the paper "Attention Is All You Need". The original Transformer stacks N=6 encoder layers and N=6 decoder layers, each containing so-called "Multi-Head Attention" sublayers. Each Multi-Head Attention sublayer runs several attention heads in parallel, and each head applies attention to its own learned linear projection of the inputs. We will discuss the mechanism in depth in the next lessons.
- Masking:
	- In the decoder, the self-attention sublayer is masked to prevent "illegal" connections from being learned. That is, a position in the decoder can attend to itself and to earlier positions, but not to later ones, so a prediction depends only on tokens that have already been produced.
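
Two of the features above can be sketched in a few lines of plain Python. The sketch assumes the sinusoidal positional encoding described in the paper and a decoder mask in which a position may attend to itself and to earlier positions. The function names are hypothetical, and nested lists stand in for the tensors a real implementation would use.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one vector of size d_model per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: sin on even, cos on odd.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    for row in positional_encoding(seq_len=3, d_model=4):
        print([round(v, 3) for v in row])
    for row in causal_mask(3):
        print(row)
```

In the paper, the encoding vectors are summed with the token embeddings, and the attention scores at masked positions are set to negative infinity before the softmax so that they receive zero weight.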
	
## Attention mechanism

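Attention is a technique that lets the model focus on specific parts of the input sequence by assigning a weight to each part. Compared with a classic encoder-decoder model without attention:

- The encoder passes more data to the decoder: all of its hidden states, not only the last one.
- The decoder uses all of this hidden state information.
- The decoder performs extra steps before producing each output:
	1. Look at the set of encoder hidden states it has received.
	2. Give each hidden state a score.
	3. Multiply each hidden state by its softmax score, so that high-scoring states are amplified and low-scoring ones fade, then sum the results into a context vector.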
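
The decoder steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the dot product is an assumed scoring function (the text does not fix one), the final sum into a context vector is the usual last step for this style of attention, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoder step."""
    # Score every encoder hidden state against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    # Scale each hidden state by its weight, then sum them into one context vector.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


if __name__ == "__main__":
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, context = attend([2.0, 0.0], encoder_states)
    print([round(w, 3) for w in weights])
    print([round(c, 3) for c in context])
```

In this toy input, the first and third encoder states align with the decoder state and receive equal, larger weights, while the second state is mostly drowned out.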

## Natural Language Processing

The following infographic from Fabio Chiusano's [article](https://medium.com/nlplanet/a-brief-timeline-of-nlp-from-bag-of-words-to-the-transformer-family-7caad8bbba56) shows the evolution of NLP models. Transformer-based models dominate the timeline after the original Transformer paper was published in 2017.

![Timeline of NLP models, from bag-of-words to the Transformer family](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*7pdmDaLMF6GX2PU-POdjXg.png)

Although there have been useful applications in other areas such as [vision](https://jbcordonnier.com/posts/attention-cnn/) and [speech recognition](https://ieeexplore.ieee.org/document/8462506), transformers and transformer-based architectures have seen their main applications in the domain of Natural Language Processing.

The best-known transformer-based architectures are GPT and BERT.

BERT is short for "Bidirectional Encoder Representations from Transformers". It uses only the encoder side of the Transformer and applies self-attention in both directions, so each token can attend to the tokens before and after it in a sentence. It is pre-trained by masking some of the input tokens and learning to predict them.

GPT is short for "Generative Pre-trained Transformer". The name covers a family of Transformer models pre-trained on large datasets. Perhaps the most famous GPT models at the time of this writeup were GPT-3 and ChatGPT from OpenAI, which popularized using GPT models for text generation, translation, and article summarization, to name some of the most common NLP tasks nowadays.

## References
- [NeuroMatch Academy](https://deeplearning.neuromatch.io/tutorials/W2D5_AttentionAndTransformers/student/W2D5_Tutorial1.html)
- [Google Cloud on Transformers](https://www.youtube.com/watch?v=SZorAJ4I-sA)
- [How Recurrent Neural Networks Work](https://towardsdatascience.com/learn-how-recurrent-neural-networks-work-84e975feaaf7)
- [Dale on AI blog](https://daleonai.com/transformers-explained)
- [Attention Mechanism: Google Cloud](https://www.youtube.com/watch?v=fjJOgb-E41w)
- [Attention Is All You Need](http://arxiv.org/abs/1706.03762)