# Beyond RNNs: Transformers and BERT

In this lesson, we will cover the following topics:
- The Evolution of Natural Language Processing
- Benefits of Transformer Architectures
- BERT, GPT-3, and Their Capabilities
- New Ways of Evaluating Our Models

## Evolution of NLP
- Data scientists experimenting with ways to improve RNNs discovered a new technique called "Attention"
- The goal of Attention is to "focus" the model on certain parts of the data
- This technique showed benefits for RNNs and architectures like Seq2Seq
- Google then released a landmark paper, "Attention Is All You Need" (2017), demonstrating that attention inside an encoder-decoder architecture could outperform the recurrence used in RNNs
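
To make the idea of "focusing" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The toy vectors are made up for illustration, and real models use learned projections for queries, keys, and values, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the values: higher weight means more "focus"
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy data (assumed, not from the lesson): three token vectors of dimension 2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` shows how strongly one token attends to every token in the sequence. The first token puts more weight on itself and the third token than on the second, because their vectors point in more similar directions.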

Architecture of the Transformer:
- The Transformer keeps the Encoder and Decoder from Seq2Seq
- The Encoder takes input embeddings and passes them through a block of attention followed by a feed-forward NN, repeated N times
- Then, it passes the output to the Decoder, where the data goes through another loop of attention and feed-forward NN blocks
- The Decoder output then goes through a Linear layer and a Softmax layer
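
The data flow described above can be sketched roughly as follows. This is a heavily simplified illustration, not the real architecture: the weights are random, and it omits positional encodings, multi-head attention, masking, residual connections, and layer normalization. All names and sizes (`d_model`, `n_layers`, `vocab_size`) are assumptions chosen for the example.

```python
import math
import random

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Each query vector becomes a weighted average of the memory vectors."""
    scale = math.sqrt(len(memory[0]))
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, m)) / scale for m in memory])
        out.append([sum(wi * m[d] for wi, m in zip(w, memory)) for d in range(len(q))])
    return out

def linear(vectors, weight):
    """Apply the same weight matrix (a list of output rows) to every vector."""
    return [[sum(x * w for x, w in zip(v, row)) for row in weight] for v in vectors]

def feed_forward(vectors, weight):
    """Position-wise layer with a ReLU activation."""
    return [[max(0.0, y) for y in v] for v in linear(vectors, weight)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, vocab_size, n_layers = 4, 5, 2  # toy sizes

# Encoder: N rounds of self-attention followed by a feed-forward layer
encoded = random_matrix(3, d_model, rng)  # stand-in for 3 input embeddings
for _ in range(n_layers):
    encoded = feed_forward(attend(encoded, encoded), random_matrix(d_model, d_model, rng))

# Decoder: attends to its own inputs, then to the Encoder output
decoded = random_matrix(2, d_model, rng)  # stand-in for 2 output embeddings
for _ in range(n_layers):
    decoded = attend(decoded, decoded)
    decoded = attend(decoded, encoded)
    decoded = feed_forward(decoded, random_matrix(d_model, d_model, rng))

# Linear layer maps to vocabulary size, Softmax turns scores into probabilities
logits = linear(decoded, random_matrix(vocab_size, d_model, rng))
probabilities = [softmax(row) for row in logits]
```

Each row of `probabilities` is a distribution over the toy vocabulary for one output position.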

Transformer architectures do not use recurrence. 

## Benefits of Transformer Architectures

### Faster to Train
The replacement of recurrent cells with feedforward networks improves the parallelization of Transformers. Current high-performance computing systems are designed to work well with this type of parallelization.
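
A small sketch of why this matters, using made-up scalar weights rather than a real model: a recurrent update needs the previous hidden state, so its steps must run in order, while a position-wise feed-forward layer treats every position independently.

```python
import math

inputs = [0.5, -1.0, 2.0, 0.1]  # toy sequence (assumed values)

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Recurrent update: step t depends on the hidden state from step t-1."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def feed_forward(x, w=1.0):
    """Position-wise layer: the output depends only on this position's input."""
    return max(0.0, w * x)

# Positions can be processed in any order (or all at once) with the same result
in_order = [feed_forward(x) for x in inputs]
reversed_order = [feed_forward(x) for x in reversed(inputs)][::-1]
```

Here `in_order` and `reversed_order` are identical, whereas running `rnn_states` on the reversed sequence gives different states, because each state carries the history before it.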

### Better Performance
Transformers offer better performance than RNNs across most natural language tasks. Therefore, we can use them to solve new problems.

### Versatility
The Transformer architecture can be applied across different domains, such as NLP and Computer Vision.

## Difference between RNNs and Transformers

Udacity answer:
>RNNs use a concept called “memory” to learn underlying patterns in a sequence and project them forward. They can do this inside of a greater pattern or outside of a greater pattern. In sequence-to-sequence architectures, we learned that an RNN can fit within an encoder-decoder construct to solve NLP problems. Transformers are reliant on the encoder-decoder pattern to solve problems. Specifically, Transformers remove the RNN entirely and rely on ever greater stacks of encoder-decoders to solve their problem.

My answer:
>The most important difference between Transformers and RNNs is that Transformers do not rely on recurrence and therefore don't use the concept of "memory" the way RNNs do. Instead, Transformers rely on a technique called Attention to "focus" the model on certain parts of the data. Replacing recurrent cells with feedforward networks improves the speed of training. Transformers have also shown better performance than RNNs across most natural language tasks.

## Example code using BERT

Using BERT with Hugging Face:

```python
# Load a question-answering model with the pipeline function,
# which also sets up a compatible tokenizer
from transformers import pipeline

qa_model = pipeline("question-answering")  # Question-answering pipeline defaults to DistilBERT

# Provide question and context
question = "Who is the Great Pumpkin?"
context = '''The Great Pumpkin is a supernatural figure who rises from the pumpkin patch on Halloween evening, and flies around bringing toys to sincere and believing children.'''

# Ask question
qa_model(question=question, context=context)
```

Output:

```
No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'score': 0.911424458026886,
 'start': 0,
 'end': 17,
 'answer': 'The Great Pumpkin'}
```

## BERT

BERT is a state-of-the-art Transformer released by Google in 2018. BERT pioneered several techniques to achieve cutting-edge results.

First, BERT is pretrained on a natural language corpus to develop a general understanding of a language. Then, the model is fine-tuned to perform specific tasks.

Next, BERT used bidirectional embeddings to capture relationships on both sides of a given word. This improved performance over unidirectional embeddings.
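
As a rough illustration of the difference, the sketch below lists which neighboring tokens are available when representing a given word. The helper `visible_context` is hypothetical and only shows the idea; it is not how BERT is implemented.

```python
tokens = ["the", "great", "pumpkin", "rises"]

def visible_context(tokens, position, bidirectional):
    """Return the (left, right) context available for the token at `position`."""
    left = tokens[:position]
    # A unidirectional (left-to-right) model cannot look at later tokens
    right = tokens[position + 1:] if bidirectional else []
    return left, right

unidirectional = visible_context(tokens, 1, bidirectional=False)
bidirectional = visible_context(tokens, 1, bidirectional=True)
```

For the word "great", the unidirectional view contains only "the", while the bidirectional view also includes "pumpkin" and "rises".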

Finally, BERT stacked Transformer encoders on top of one another. Unlike the original Transformer, it has no decoder. The stacked encoder architecture increased the size of the model and dramatically increased the performance.

## GLUE
GLUE stands for General Language Understanding Evaluation. It is a benchmark, rather than a single metric, that measures a model's general understanding of language across 9 language tasks.

See http://gluebenchmark.com

Other metrics:
- BLEU is a specialized metric for language translation.
- ROUGE is a metric for text summarization. Summarization is a new capability of Transformers.
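
Both metrics are built on counting overlapping n-grams between a generated text and a reference. The sketch below is a simplified illustration on single words only: BLEU is based on this kind of clipped precision (the full metric also combines longer n-grams and a brevity penalty), and the recall value corresponds to ROUGE-1 recall. The function name and example sentences are made up for illustration.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall) of word overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())  # BLEU-style: share of candidate words that match
    recall = overlap / sum(ref.values())      # ROUGE-1-style: share of reference words covered
    return precision, recall

reference = "the great pumpkin rises from the pumpkin patch"
candidate = "the great pumpkin rises"
precision, recall = unigram_overlap(candidate, reference)
print(precision, recall)  # 1.0 0.5
```

Every word in the candidate appears in the reference, so precision is perfect, but the candidate covers only half of the reference words, so recall is 0.5.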

## Future of NLP

Research areas:
- **Types of Attention**: Attention is at the forefront of these new models. Modifications to the attention mechanism may improve modeling results.
- **Few-shot Learning**: Some problems don't have enough data to be solved with current architectures. Streamlining architectures to learn from a few examples will open up new application areas.
- **Combining Multiple Tasks**: Developing models that can perform multiple tasks may lead to the development of general intelligence.