# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 5: Transformer Architecture</font>

# <font color="#003660">Notebook 1: A Tour of Hugging Face Transformers</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will have a high-level understanding of the Transformer architecture, <br>
        ... will know the basic types of Transformers (i.e., encoder, decoder, encoder-decoder), and <br>
        ... will know how to use pre-trained NLP pipelines (e.g., sentiment analysis, NER, translation) from the Hugging Face Transformers library.
    </font>
</div>
</center>
</p>

The content below is heavily inspired by the following excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# What are Transformers?

## Overview

The general **Transformer** architecture consists of two components:

*   **Encoder**: The encoder processes an input sequence and builds a numerical representation (feature vector) of it. This means that this component is optimized for understanding the input.
*   **Decoder**: The decoder processes the feature vector produced by the encoder, plus other inputs, to generate an output sequence. This means that this component is optimized for generating outputs.







<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/architecture_high-level.png"/><br></center>

Depending on the task at hand, the components of the Transformer architecture can be used independently or in combination:

*   **Encoders**: Good for tasks that require understanding of the input, such as *text classification* and *named entity recognition*. Watch [this video](https://youtu.be/MUqNwgPjJvQ) to learn more.
*   **Decoders**: Good for generative tasks such as *text generation*. Watch [this video](https://youtu.be/d_ixlCubqQw) to learn more.
*   **Encoder-decoder Models** (aka sequence-to-sequence models): Good for generative tasks that require an input, such as *translation* or *summarization*. Watch [this video](https://youtu.be/0_4KEb08xrE) to learn more.







<center><br><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transformer_models.png"/><br></center>

## Attention is all you need!

In general, the encoder and decoder components can be any kind of neural network architecture that is suited for modeling sequences. For example, in the diagram below, simple recurrent neural networks (RNN) are used to implement the encoder and decoder components.

In this example, the English sentence “Transformers are great!” is encoded into a hidden state vector that is then decoded to produce the German translation “Transformer sind grossartig!” (Note: besides the hidden state, the decoder normally also uses the already generated output tokens as an additional input).

Note how this network processes inputs and generates outputs sequentially, as indicated by the vertical lines between the RNN cells. This sequential processing is slow because it cannot take full advantage of the parallel processing capabilities of a GPU.

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transformer_rnn.png"/><br></center>

One weakness of the above architecture is that only the final hidden state of the encoder is passed to the decoder, which creates an information bottleneck. The meaning of the whole input sequence has to be captured in just one vector. This is especially challenging for long sequences, where information from the start of the sequence might be lost while compressing everything into a single vector representation.

Of course, a straightforward way to avoid this bottleneck is to pass all of the encoder’s hidden states to the decoder (not shown in the diagrams). This way, almost no information would be lost. Yet, at the same time, we would probably pass a lot of irrelevant information to the decoder.

A meet-in-the-middle approach between the two extremes (passing just one hidden state vs. passing all hidden states) is a mechanism called **Attention**. Attention allows the decoder to assign a weight to each individual hidden input state (see diagram below), indicating which states are important or unimportant for producing the next element of the output sequence. Like all other weights of the network, the attention weights are learned during training.

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transformer_att1.png"/><br></center>
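
The weighting idea can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the hidden states are random numbers instead of outputs of a trained network, and a plain dot product is assumed as the scoring function (other scoring functions exist). The names `softmax` and `attend` are our own.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """Weight each encoder hidden state by its relevance to the decoder state.

    decoder_state:  shape (d,),   current hidden state of the decoder
    encoder_states: shape (n, d), one hidden state per input token
    """
    scores = encoder_states @ decoder_state   # (n,) one relevance score per input token
    weights = softmax(scores)                 # (n,) non-negative, sums to 1
    context = weights @ encoder_states        # (d,) weighted average of encoder states
    return weights, context

# Assumption: random vectors stand in for the hidden states of a trained model.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 4))      # e.g., "Transformers", "are", "great!"
decoder_state = rng.normal(size=4)

weights, context = attend(decoder_state, encoder_states)
print("attention weights:", weights)
print("context vector:   ", context)
```

The decoder receives `context` instead of only the final encoder state, so input tokens with a high weight contribute more to the next output token.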

The figure below visualizes the attention weights for an English-to-French translation model. Note how the decoder is able to correctly align the words “zone” and “Area”, which are ordered differently in the two languages.

<center><br><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/att_translation.png"/><br></center>

In the famous paper "[Attention Is All You Need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)", Vaswani et al. showed that the RNNs inside the encoder and decoder components can be replaced entirely with Attention and feed-forward layers.

This architecture has three main advantages:

* First, without RNN cells, all tokens can be fed through the model in parallel, which makes training faster and allows the model to be trained on larger corpora.

* Second, the Attention mechanism makes the network more effective on tasks that require retaining information over long sequences.

* Third, the Attention mechanism creates a representation for each token that depends on its surrounding tokens. This makes the representation of each token context-aware, such that the representation of the word “apple” (fruit) is different from that of “apple” (computer manufacturer).

<center><br><img width=600 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transformer_att2.png"/><br></center>
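
The following NumPy sketch illustrates the first and third points with scaled dot-product self-attention, the form of Attention used by Vaswani et al. It is a simplified illustration, not a full Transformer layer: it assumes a single attention head, random (untrained) projection matrices, and no positional encodings or masking. All names are our own.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence.

    X has shape (n_tokens, d_model). All tokens are processed at once
    through matrix products, so no step waits for a previous token.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_tokens, n_tokens)
    weights = softmax(scores)                 # each row sums to 1
    return weights @ V, weights

# Assumption: random embeddings and projections stand in for trained ones.
rng = np.random.default_rng(42)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

contextual, attn = self_attention(X, W_q, W_k, W_v)
print(contextual.shape)   # (4, 8)

# Keep the first token but change its neighbours: its output changes too.
X_other = X.copy()
X_other[1:] = rng.normal(size=(n_tokens - 1, d_model))
contextual_other, _ = self_attention(X_other, W_q, W_k, W_v)
print(np.allclose(contextual[0], contextual_other[0]))
```

The last comparison mirrors the “apple” example: the same input embedding ends up with a different representation once its context changes.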

In modern Transformer architectures, several blocks of Attention and feed-forward layers are stacked in the encoder to produce rich hidden states which are then passed to the decoder.

This information should be enough to build an intuitive understanding of Attention. If you want to learn more about its technical details, read Chapter 3 of Tunstall et al. (2021).

## Transfer Learning

Another feature of Transformers is their use of transfer learning. Transfer learning is a machine learning approach that involves applying knowledge gained from solving one problem to solve a different but related problem.

This usually works by splitting a model into a **body** and a **head**. During pretraining, the weights of the body are optimized to represent broad features of the source domain (e.g., vocabulary, grammar). These weights are then used to initialize the new model for the new task. The head is a task-specific network that is only trained during fine-tuning.

The figure below illustrates this idea and contrasts it with traditional machine learning. For example, in the models on the right, Body A could be pretrained with a language modeling task in Domain A and then used to initialize Body B of a sentiment analysis model in Domain B. The weights of Head B, the sentiment classifier, are learned from scratch.

<center><br><img width=700 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/transfer_learning.png"/><br></center>

Transfer learning typically produces models that can be fine-tuned efficiently on a variety of downstream tasks (i.e., with less time and data).
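
To make the body/head split tangible, here is a toy NumPy sketch. It rests on several simplifying assumptions: the "pretrained" body is just a random projection, the data are synthetic, the head is a logistic regression, and the body is kept completely frozen (in practice, the body's weights are often updated during fine-tuning as well). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" body weights. Assumption: random values stand in for weights
# that would normally come from pretraining on a large corpus.
W_body = rng.normal(size=(10, 5))
W_body_before = W_body.copy()

def body(X):
    """Frozen body: maps raw inputs to general-purpose features."""
    return np.tanh(X @ W_body)

def head_loss_and_grads(features, y, w, b):
    """Binary cross-entropy of a logistic-regression head and its gradients."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss, features.T @ (p - y) / len(y), np.mean(p - y)

# Hypothetical labeled data for the new task (Domain B).
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)

features = body(X)                  # body is frozen, so features are computed once
w_head, b_head = np.zeros(5), 0.0   # the head starts from scratch

initial_loss, _, _ = head_loss_and_grads(features, y, w_head, b_head)
for _ in range(500):
    _, grad_w, grad_b = head_loss_and_grads(features, y, w_head, b_head)
    w_head -= 0.5 * grad_w          # only the head's parameters are updated
    b_head -= 0.5 * grad_b
final_loss, _, _ = head_loss_and_grads(features, y, w_head, b_head)

print(f"head loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because only the small head is trained, each update is cheap and comparatively little labeled data is needed, which is the efficiency argument made above.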

# Import Packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `sklearn` is a free software machine learning library for the Python programming language.
- `transformers` provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages.

In [None]:
!pip install transformers[sentencepiece]

In [None]:
import pandas as pd
import numpy as np
from sklearn import metrics
from transformers import pipeline

# Pre-trained Pipelines

In the following, we will use some pre-trained NLP pipelines from the Hugging Face Transformers library.

These pipelines are an easy way to use models for inference. They abstract away most of the library's complex code (e.g., code for tokenizing input sequences or for post-processing model outputs), offering a simple API dedicated to specific tasks.
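
Conceptually, a pipeline chains three steps: it converts raw text into model inputs, runs the model, and turns the raw model outputs into readable results. The sketch below illustrates only the last step for a classifier. It is not the library's actual implementation; the function `postprocess`, the logits, and the label mapping are hypothetical and chosen for illustration.

```python
import numpy as np

def postprocess(logits, id2label):
    """Turn raw classifier logits into pipeline-style label/score dictionaries."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    best = probs.argmax(axis=-1)                             # most likely class per input
    return [
        {"label": id2label[int(i)], "score": float(p[i])}
        for i, p in zip(best, probs)
    ]

# Hypothetical logits for two input texts and a two-class label mapping.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
results = postprocess([[-2.0, 3.0], [1.5, -0.5]], id2label)
print(results)
```

The returned list of dictionaries with a `label` and a `score` per input has the same shape as the output of the classification pipelines used in this notebook.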

Below are some examples of pre-trained Transformer pipelines. For more, see: https://huggingface.co/transformers/main_classes/pipelines.html

## Sentiment Analysis

The sentiment analysis pipeline uses a model that was fine-tuned on the Stanford Sentiment Treebank, which is an English corpus of annotated movie reviews. This is an example of an encoder model.

In [None]:
sa_classifier = pipeline("sentiment-analysis")

In [None]:
sa_classifier([
            "This is the best movie I have ever seen.",
            "I hate this movie so much!",
            "I don't like the new iPhone.",
            "Becks is not bad.",
     ])
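
If gold labels are available, the pipeline output can be compared against them. The helper below is a minimal sketch. It assumes that the pipeline returns one dictionary per input with a `label` key (`POSITIVE` or `NEGATIVE` for the default model); the function name and the gold labels in the usage comment are our own.

```python
def accuracy(results, gold_labels):
    """Share of pipeline predictions whose label matches the gold label."""
    if len(results) != len(gold_labels):
        raise ValueError("results and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("need at least one labeled example")
    hits = sum(r["label"] == g for r, g in zip(results, gold_labels))
    return hits / len(gold_labels)

# Usage with the classifier defined above (texts and gold labels are hypothetical):
# texts = ["This is the best movie I have ever seen.", "I hate this movie so much!"]
# gold = ["POSITIVE", "NEGATIVE"]
# print(accuracy(sa_classifier(texts), gold))
```

For larger evaluations, `metrics.accuracy_score` from `sklearn` computes the same quantity from two lists of labels.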

## Named Entity Recognition

Named entities are names of products, places, or people. Detecting and extracting them from text is called named entity recognition (NER). This is an example of an encoder model.

In [None]:
# aggregation_strategy="simple" merges sub-word tokens into whole entities
ner = pipeline("ner", aggregation_strategy="simple")

In [None]:
ner("My name is Oliver and I work at Paderborn University.")

## Question Answering

In question answering we provide the model with a passage of text called the context, along with a question whose answer we’d like to extract. The model then returns the span of text corresponding to the answer. This is an example of an encoder model.

In [None]:
question_answerer = pipeline("question-answering")

In [None]:
question_answerer(
    question="What's my name?",
    context="My name is Oliver and I work at Paderborn University.",
)
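
The returned dictionary contains the answer text together with `start` and `end` character offsets into the context, which is what makes this an extractive task. The helper below is a small sketch that uses these offsets to mark the answer span; the function name `highlight_answer` and the marker are our own choices.

```python
def highlight_answer(context, result, marker="**"):
    """Wrap the answer span in `marker`, using the start/end character offsets."""
    start, end = result["start"], result["end"]
    return context[:start] + marker + context[start:end] + marker + context[end:]

# Usage with the pipeline defined above:
# context = "My name is Oliver and I work at Paderborn University."
# result = question_answerer(question="What's my name?", context=context)
# print(highlight_answer(context, result))
```

Since the model only selects a span, the answer is always a literal substring of the context, never newly generated text.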

## Text Generation

Text generation is the task of producing text that is indistinguishable from human-written text. This is an example of a decoder model.

In [None]:
txt_generator = pipeline("text-generation")

In [None]:
txt_generator("In this course, you will learn how to",
              num_return_sequences=2, max_length=20)

## Translation

Translation is the task of translating text or speech from one language to another. This is an example of an encoder-decoder model.

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

In [None]:
translator("La bibliothèque Hugging Face est tout simplement géniale.")

## Summarization

The goal of text summarization is to take a long text as input and generate a short version with all relevant facts. This is an example of an encoder-decoder model.

In [None]:
summarizer = pipeline("summarization")

In [None]:
summarizer(
  """
  Darth Vader is a fictional character in the Star Wars franchise. The character
  is the primary antagonist in the original trilogy and, as Anakin Skywalker,
  is a primary protagonist in the prequel trilogy. Star Wars creator George
  Lucas has collectively referred to the first six episodic films of the
  franchise as "the tragedy of Darth Vader". He has become one of the most
  iconic villains in popular culture, and has been listed among the greatest
  villains and fictional characters ever.
  """,
  max_length=60
)

## Text-to-Image Generation

Conditional image generation is the task of generating new images conditioned on a class label. Instead of choosing from a predefined set of classes, we can also specify the desired content through a textual prompt.

In [None]:
!pip install diffusers["torch"]
!pip install transformers

In [None]:
from diffusers import DiffusionPipeline


Load the diffusion model and move it to the GPU.

In [None]:
generator = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")
generator = generator.to("cuda")  # assumes a CUDA-capable GPU is available

Pass a prompt to the model.

In [None]:
image = generator("An image of a dog in Van Gogh style").images[0]

In [None]:
image.save("image_of_dog_painting.png")