In [0]:
import torch  # needed by LayerNorm below (torch.ones, torch.zeros)
from torch import nn

<!--@slideshow slide-->
<center><h1>Transformers</h1></center>

<!--@slideshow slide-->
<center>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/gpt-3.png">

<a href="https://medium.com/@Synced/openai-unveils-175-billion-parameter-gpt-3-language-model-3d3f453124cd">link</a>

</center>

<!--@slideshow slide-->
# Outline
1. Sequence-to-Sequence (seq2seq) learning. Encoder-decoder architecture.
1. Motivation of Transformer.
1. Attention
1. Transformer block
1. Encoder
1. Decoder
1. BERT


<!--@slideshow slide-->
# seq2seq. Encoder-decoder architecture

<!--@slideshow slide-->
<center>
RNN for classification
<br>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/rnn-classification.png">
</center>

<!--@slideshow slide-->
<center>
RNN for language modeling
<br>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/rnn-lm.png">
</center>


<!--@slideshow slide-->
In the demo [language model with RNN](https://colab.research.google.com/drive/1gg6RZqjz6d-1o6pguZaktwHUVAFyScoX#scrollTo=YCjdUfGrsqWr&line=11&uniqifier=1) we trained a <font color="red">conditioned</font> RNN:

$$
h_t = f(h_{t-1}, x_t, \color{red}{c})
$$
$$
y_t = g(h_t)
$$


<!--@slideshow fragment-->
We need conditioning in many applications:
- Machine translation (English $\rightarrow$ Spanish)
- Text summarization (long text $\rightarrow$ short text)
- Dialogue / Chatbots / Question answering
- ...

<!--@slideshow fragment-->
**Idea**:
1. _Learn_ the condition from the first sequence (many-to-one RNN).
1. Use the learned condition to generate the second sequence (many-to-many language model RNN).

> This is called sequence-to-sequence (seq2seq) learning.
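<!--@slideshow slide-->

The two-step idea can be sketched in a few lines of PyTorch. This is only a minimal illustration, not the model from the demo: the class name `TinySeq2Seq`, the use of `nn.GRU`, and all sizes are assumptions, and the condition $c$ is passed to the decoder as its initial hidden state rather than as an extra input at every step.

```python
import torch
from torch import nn


class TinySeq2Seq(nn.Module):
    """Hypothetical minimal encoder-decoder; names and sizes are illustrative."""

    def __init__(self, src_vocab, tgt_vocab, hidden=32):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder (many-to-one): keep only the final hidden state as the condition c.
        _, c = self.encoder(self.src_emb(src))
        # Decoder (many-to-many): a language model whose initial state is c.
        states, _ = self.decoder(self.tgt_emb(tgt_in), c)
        return self.out(states)  # logits over the target vocabulary


model = TinySeq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 7))     # batch of 2 source sequences, length 7
tgt_in = torch.randint(0, 120, (2, 5))  # target sequences fed to the decoder, length 5
logits = model(src, tgt_in)
print(logits.shape)
```

The output has one row of target-vocabulary logits per decoder position, so the source and target lengths are free to differ.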


<!--@slideshow slide-->
<center>
RNN for seq2seq
<br>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/rnn-seq2seq.png">
</center>


<!--@slideshow slide-->
Definitions:
- The part that _learns_ the condition is called **encoder**.
- The part that generates the result based on the condition is called **decoder**.

**Intuition**:
- Encoder learns ("encodes") the representation of the input sequence
- Decoder reconstructs ("decodes") the text from the learned representation.

<!--@slideshow slide-->
# Motivation of Transformer

**Problem**: RNN is sequential (= slow).

**Q**: But _why_ do we use it?

<!--@slideshow fragment-->
**A**: RNN accepts texts of any length and should learn _long-term dependencies_ between words (at least, in theory).

<!--@slideshow slide-->
This is what dependencies may look like:
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/word-dependencies.png" style="width:70%;">
<table>
<tbody>
  <tr>
    <td></td>
    <td><b>This</b></td>
    <td><b>is</b></td>
    <td><b>a</b></td>
    <td><b>sentence</b></td>
  </tr>
  <tr>
    <td><b>This</b></td>
    <td>DET</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td><b>is</b></td>
    <td>nsubj</td>
    <td>AUX</td>
    <td></td>
    <td>attr</td>
  </tr>
  <tr>
    <td><b>a</b></td>
    <td></td>
    <td></td>
    <td>DET</td>
    <td></td>
  </tr>
  <tr>
    <td><b>sentence</b></td>
    <td></td>
    <td></td>
    <td>det</td>
    <td>NOUN</td>
  </tr>
</tbody>
</table>

<!--@slideshow slide-->
# Attention


<!--@slideshow slide-->
RNNs produce **word vectors** based on **dependencies** between words.

**Idea**: represent words as vectors and measure dependency using dot product.

**Our goal**: represent the word $q$ given the sequence $k_1, k_2, \dots$

> $q$ does not necessarily belong to the sequence!

<!--@slideshow slide-->
**Definitions**:
- $k_i$ is a **key** vector of size $d_k$.
- Each key $k_i$ has an associated **value** vector $v_i$ of size $d_v$.
- $q$ is a **query** vector of size $\color{red}{d_k}$.

<!--@slideshow fragment-->
- Dependency between $q$ and $k_i$ is quantified by the dot product $q \cdot k_i$, normalized with softmax over all keys:
$$
\mathrm{softmax}(q \cdot k_i) = \dfrac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)}
$$

<!--@slideshow fragment-->
- The vector for $q$ is the weighted average of the value vectors:
$$
A(q, k_1, v_1, k_2, v_2, \dots) = \sum_i v_i \cdot \mathrm{softmax}(q \cdot k_i)
$$
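<!--@slideshow slide-->

A plain-Python sketch of attention for a single query may help to make the definitions concrete. The function names `softmax` and `attend` and the toy keys and values are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Return (output, weights) for one query q over (key, value) pairs."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three keys, d_k = 2
values = [[10.0], [20.0], [30.0]]             # one value per key, d_v = 1
q = [2.0, 0.0]                                # aligned with keys 0 and 2
output, weights = attend(q, keys, values)
print(weights, output)
```

The weights sum to one, and the key that is orthogonal to the query (the second one) receives the smallest weight.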


<!--@slideshow slide-->
Put all keys into matrix $K$ and all values into matrix $V$:
$$
A(q, K, V) = \sum_i v_i \cdot \mathrm{softmax}(q \cdot k_i)
$$


<!--@slideshow fragment-->
If we have many queries, put them into matrix $Q$:
$$
A(Q, K, V) = \mathrm{softmax}(Q K^T) V
$$

<!--@slideshow fragment-->
**Problem**: for large $d_k$, dot products are large and softmax is "peaked".

**Solution**: normalize the dot product:
$$
A(Q, K, V) = \mathrm{softmax}\left(\dfrac{Q K^T}{\color{red}{\sqrt{d_k}}}\right) V
$$
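<!--@slideshow slide-->

The effect of the $\sqrt{d_k}$ factor can be checked empirically. The sketch below assumes that the entries of $q$ and $k$ are independent standard normal variables (an assumption made for the illustration, not a property of trained models). Under that assumption the variance of $q \cdot k$ grows like $d_k$, while the variance of the scaled product stays near 1. The helper name `dot_product_variance` is hypothetical.

```python
import random
import statistics


def dot_product_variance(d_k, scale, n_samples=4000, seed=0):
    """Empirical variance of scale * (q . k) for q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(scale * sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)


for d_k in (4, 64):
    raw = dot_product_variance(d_k, scale=1.0)
    scaled = dot_product_variance(d_k, scale=d_k ** -0.5)
    print(f"d_k={d_k:3d}  var(q.k)={raw:7.2f}  var(q.k/sqrt(d_k))={scaled:5.2f}")
```

Larger scores push softmax towards a one-hot output, which is why the unscaled version becomes "peaked" as $d_k$ grows.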

<!--@slideshow slide-->
<center>
<img src="http://nlp.seas.harvard.edu/images/the-annotated-transformer_33_0.png">
</center>

<!--@slideshow fragment-->
The function $A(Q, K, V)$ is called the **Scaled Dot-Product Attention** function.

<!--@slideshow slide-->
In the **encoder**: $Q = K = V$.

> In other words: the word vectors themselves select each other

We’ll see in the **decoder** why we separate them in the definition.

<!--@slideshow slide-->
**Problem**: there is only one way for words to interact with one another.

**Solution**: _multi-head_ attention

<!--@slideshow slide-->
- First map $Q, K, V$ into $h=8$ lower dimensional spaces via linear layers.
- Then apply attention.
- Then concatenate outputs and pipe through linear layer.

<img src="http://nlp.seas.harvard.edu/images/the-annotated-transformer_38_0.png">
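<!--@slideshow slide-->

PyTorch ships this layer as `nn.MultiheadAttention`. The snippet below is a small usage sketch rather than the implementation from the picture; the sizes (`d_model = 64`, batch of 2, sequence length 5) are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, n_heads = 64, 8              # each head works in d_model / n_heads = 8 dims
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(2, 5, d_model)        # (batch, sequence length, d_model)
out, attn = mha(x, x, x)              # self-attention: Q = K = V = x
print(out.shape)                      # same shape as the input
print(attn.shape)                     # attention weights, averaged over the heads
```

The output keeps the input shape, so the layer can be stacked or wrapped in a residual connection without any reshaping.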

<!--@slideshow slide-->
# Transformer block

<!--@slideshow slide-->
A **Transformer block** is a layer with two “sublayers”:
1. Multihead attention
2. 2-layer feed-forward neural network (with ReLU)

<!--@slideshow slide-->
<center>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/transformer-block.png">
</center>

<!--@slideshow slide-->

Techniques to speed up and regularize training:
- Residual connection
- Dropout
- Layer normalization



<!--@slideshow slide-->
## Residual connection

<!--@slideshow fragment-->
For a layer $\mathcal{F}(\mathbf{x})$, the _residual connection_ is $\mathcal{F}(\mathbf{x}) + \mathbf{x}$.

<br>
<center>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/residual.JPEG" style="width:50%">
</center>

<!--@slideshow fragment-->
**Idea**: it helps to propagate the gradient
$$
\dfrac{\partial (\mathcal{F}(\mathbf{x}) + \mathbf{x})}{\partial \mathbf{x}} = \dfrac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} \color{red}{+ \mathbf{I}}
$$
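<!--@slideshow slide-->

A scalar toy example shows what the extra term buys in a deep stack. In the scalar case the extra term is simply $+1$. The sketch assumes layers of the form $\mathcal{F}(x) = \tanh(w x)$ with a shared weight $w$; the function name `input_gradient` and the numbers are illustrative assumptions.

```python
import math


def input_gradient(x, w, depth, residual):
    """d(output)/d(input) for `depth` stacked scalar layers F(x) = tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        y = math.tanh(w * x)
        local = w * (1.0 - y * y)      # dF/dx for this layer
        if residual:
            x = x + y                  # layer output is F(x) + x
            grad *= local + 1.0        # chain-rule factor: dF/dx + 1
        else:
            x = y                      # layer output is F(x)
            grad *= local              # chain-rule factor: dF/dx
    return grad


plain = input_gradient(x=1.0, w=0.5, depth=30, residual=False)
skip = input_gradient(x=1.0, w=0.5, depth=30, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

Without the skip path every factor is below 0.5, so the product vanishes after 30 layers. With the skip path every factor is at least 1, so the gradient reaches the input.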

<!--@slideshow slide-->
## Dropout
Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training.

![dropout](https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/pic/dropout.png)
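<!--@slideshow slide-->

A plain-Python sketch of the operation is given below. The rescaling of the surviving features by $1/(1-p)$ is the common "inverted dropout" convention, which `nn.Dropout` also follows; it is not stated on the slide above. The function name `dropout` and its arguments are illustrative.

```python
import random


def dropout(features, p, training, rng):
    """Inverted dropout: zero each feature with probability p while training."""
    if not training or p == 0.0:
        return list(features)          # inference: the layer is left unchanged
    keep = 1.0 - p
    # Survivors are divided by `keep` so the expected output equals the input.
    return [x / keep if rng.random() < keep else 0.0 for x in features]


rng = random.Random(0)
features = [1.0] * 10
train_out = dropout(features, p=0.5, training=True, rng=rng)
eval_out = dropout(features, p=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

With `p=0.5` every training-time output is either `0.0` or `2.0`, while the inference-time output equals the input.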

<!--@slideshow slide-->
**Intuition**: voting over $2^N$ thinned networks with shared weights

![thinned networks](https://raw.githubusercontent.com/horoshenkih/harbour-space-ds210/master/pic/thinned-networks.png)

<!--@slideshow slide-->
## Layer normalization

<!--@slideshow fragment-->

Layer normalization rescales the input to have mean 0 and variance 1 across the features, separately for each layer and each training example, and then applies a learned gain and bias (two extra parameter vectors).

In [0]:
#@slideshow fragment
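class LayerNorm(nn.Module):
    "Construct a layernorm module."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # Learned per-feature gain (a_2) and bias (b_2): the two extra parameters.
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps  # avoids division by zero when the std is tiny

    def forward(self, x):
        # Statistics are taken over the last (feature) dimension,
        # independently for every position and every example in the batch.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2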

<!--@slideshow fragment-->

From the [original paper](https://arxiv.org/pdf/1607.06450.pdf):

> Layer normalization is very effective at **stabilizing** the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially **reduce the training time** compared with previously published techniques.

<!--@slideshow slide-->
Back to the **transformer block**: 2 sublayers
1. Multihead attention
2. 2-layer feed-forward neural network (with ReLU)

<!--@slideshow fragment-->
To speed up training, each sublayer is wrapped so that its actual output is
$$
\mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Sublayer}(x)))
$$

In [0]:
#@slideshow fragment
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

<!--@slideshow slide-->
# Encoder

<!--@slideshow slide-->
## Input: byte-pair encoding

<!--@slideshow fragment-->
**Problem**: represent rare words (like "athazagoraphobia").

**Solution**: use subword units.
> In FastText, $n$-grams are used.

<!--@slideshow fragment-->
**Q**: which $n$ shall we use?

**A**: don't fix $n$, extract frequent subwords instead.

> This is inspired by the data compression technique called Byte Pair Encoding.

<!--@slideshow slide-->
![](https://miro.medium.com/max/1400/1*x1Y_n3sXGygUPSdfXTm9pQ.gif)

<!--@slideshow slide-->
In text mining, BPE is slightly modified: frequently occurring subword pairs are **merged together instead of being replaced** by another byte, as the compression algorithm would do.

> This would basically lead the rare word `athazagoraphobia` to be split up into more frequent subwords such as `['▁ath', 'az', 'agor', 'aphobia']`.


<!--@slideshow fragment-->
1. Initialize the vocabulary.
2. Represent each word in the corpus as a sequence of characters followed by the special end-of-word token `</w>`.
3. Count the adjacent character pairs in all tokens of the vocabulary.
4. Merge every occurrence of the most frequent pair and add the new character n-gram to the vocabulary.
5. Repeat steps 3 and 4 until the desired number of merge operations is completed or the desired vocabulary size is reached (a hyperparameter).
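<!--@slideshow slide-->

The procedure above can be sketched in plain Python. This is a minimal illustration and not the code of any particular tokenizer library: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy corpus are assumptions, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    # Steps 1-2: split words into characters plus the end-of-word token.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)        # step 3: count pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair
        vocab = merge_pair(best, vocab)       # step 4: merge it everywhere
        merges.append(best)
    return merges, vocab


# Toy corpus (word -> frequency), chosen only for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(list(vocab))
```

On this toy corpus the frequent suffix `est</w>` is assembled in three merges, while the rarer stems stay split into characters.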

<!--@slideshow slide-->
## Input: positional encoding

<!--@slideshow fragment-->
**Problem**: we don't want the model to be sequential, but word order is important.

<!--@slideshow fragment-->
**Solution**: represent the _position_ as a vector.

<!--@slideshow slide-->
The exact formula:
$$
PE(pos) = \begin{pmatrix}
\sin\left(\dfrac{pos}{1}\right) \\
\cos\left(\dfrac{pos}{1}\right) \\
\sin\left(\dfrac{pos}{10000^{\frac{2}{d}}}\right) \\
\cos\left(\dfrac{pos}{10000^{\frac{2}{d}}}\right) \\
\dots \\
\sin\left(\dfrac{pos}{10000^{\frac{d-2}{d}}}\right) \\
\cos\left(\dfrac{pos}{10000^{\frac{d-2}{d}}}\right) \\
\end{pmatrix}
$$

From the [original paper](https://arxiv.org/pdf/1706.03762.pdf):
> it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE(pos+k)$ can be represented as a linear function of $PE(pos)$.
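<!--@slideshow slide-->

The "linear function" claim can be verified numerically. The sketch below follows the convention of the original paper (sine at even indices, cosine at odd indices, frequencies $1/10000^{2i/d}$). The helper names `positional_encoding` and `shift` are hypothetical. For each sine/cosine pair, moving from $pos$ to $pos + k$ is a 2×2 rotation whose entries depend only on $k$.

```python
import math


def positional_encoding(pos, d):
    """PE(pos) with sine at even indices and cosine at odd indices (d must be even)."""
    pe = []
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe, k, d):
    """Compute PE(pos + k) from PE(pos) with a linear map that does not depend on pos."""
    shifted = []
    for i in range(0, d, 2):
        w = 1.0 / (10000 ** (i / d))
        s, c = pe[i], pe[i + 1]
        # Angle-addition formulas: a rotation by the angle w * k.
        shifted.append(s * math.cos(w * k) + c * math.sin(w * k))
        shifted.append(c * math.cos(w * k) - s * math.sin(w * k))
    return shifted


d, pos, k = 8, 5, 3
direct = positional_encoding(pos + k, d)
via_linear_map = shift(positional_encoding(pos, d), k, d)
print(max(abs(a - b) for a, b in zip(direct, via_linear_map)))
```

The printed difference is at the level of floating-point rounding, so the two ways of computing $PE(pos + k)$ agree.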

<!--@slideshow slide-->
## Complete encoder
<center>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/transformer-encoder.png">
</center>
The block is repeated 6 times (stacked vertically).

<!--@slideshow slide-->
![](https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/attention-1.png)

<!--@slideshow slide-->
![](https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/attention-2.png)

<!--@slideshow slide-->
# Decoder

<!--@slideshow slide-->
## Masked self-attention
  $$
    A(q, k_1, v_1, k_2, v_2, \dots) = \sum_{i \color{red}{ \le t}} v_i \cdot \mathrm{softmax}(q \cdot k_i)
  $$
  where $q = k_t$ for some $t$.


<!--@slideshow fragment-->
Why do we need masking?
- Suppose $k_t$ attends to some future word $k_{t + \delta}$.
- The word $k_{t + \delta}$ may attend to the word $k_t$ **in one of the previous transformer blocks**.
- So $k_t$ may effectively "see itself".
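<!--@slideshow slide-->

A small plain-Python sketch shows the effect of the mask. It lets position $t$ attend to positions $0, \dots, t$ (itself included), as implementations such as The Annotated Transformer do. Masking is done here by slicing, which is equivalent to setting the scores of future positions to $-\infty$ before the softmax. The function name `masked_self_attention` and the toy vectors are assumptions.

```python
import math


def masked_self_attention(x, t):
    """Output for position t, attending only to positions 0..t (Q = K = V = x)."""
    q = x[t]
    visible = x[: t + 1]                      # future positions are masked out
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in visible]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(len(q))]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
before = masked_self_attention(x, t=1)
x[3] = [100.0, 100.0]                         # change a future word
after = masked_self_attention(x, t=1)
print(before == after)
```

Changing a future word leaves the output at position 1 untouched, so no information can leak backwards through the stack.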

<!--@slideshow slide-->
## Encoder-Decoder Attention
- Queries come from the previous decoder layer.
- Keys and values come from the output of the encoder.


<!--@slideshow slide-->
<center>
<img src="http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png">
</center>

<!--@slideshow slide-->
# BERT (Bidirectional Encoder Representations from Transformers)

<!--@slideshow slide-->
**Problem**: The Transformer decoder uses _masking_ so that words cannot "see themselves", but _language understanding_ is bidirectional!

<!--@slideshow fragment-->
Remember this example?
1. "Students opened their ___ as the proctor started the clock."
1. "Students opened their ___ and started coding."

<!--@slideshow fragment-->
**Solution**: Mask out 15% of the input words, and then predict the masked words.
> the man went to the [MASK] to buy a [MASK] of milk
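<!--@slideshow slide-->

The masking step can be sketched as follows. This is a simplified illustration: the function name `mask_tokens` is hypothetical, whitespace tokenization is assumed, and every selected token is replaced by `[MASK]`. The full BERT recipe also sometimes substitutes a random token or keeps the original one, and that refinement is omitted here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens by [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_prob * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    inputs = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # The model is trained to predict the original token at masked positions only.
    targets = {i: tokens[i] for i in sorted(positions)}
    return inputs, targets


tokens = "the man went to the store to buy a gallon of milk".split()
inputs, targets = mask_tokens(tokens)
print(" ".join(inputs))
print(targets)
```

Because the unmasked words on both sides of each `[MASK]` stay visible, the model can use left and right context at the same time.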

<!--@slideshow slide-->
<center>
<img src="https://raw.githubusercontent.com/horoshenkih/harbour-space-text-mining-course/master/pic/elmo-gpt-bert.png">
</center>

<!--@slideshow slide-->
# Demo: [transformer notebooks](https://huggingface.co/transformers/notebooks.html)

<!--@slideshow slide-->
# Summary
1. Sequence-to-Sequence (seq2seq): learn the condition with the _encoder_, then generate text conditioned on it with the _decoder_.
1. Motivation of Transformer: learn long-range dependencies faster.
1. Attention: measure dependency between words using dot product of vectors.
1. Transformer block:
  - multi-head attention + FFNN
  - residual connections, dropout, layer normalization
1. Encoder
  - input: BPE, positional encoding
1. Decoder
  - masked self-attention, encoder-decoder attention
1. BERT
  - bidirectional model with masking

<!--@slideshow slide-->
# Recommended resources
- [CS224n Lecture 13: Contextual Word Representations and Pretraining](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture13-contextual-representations.pdf)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- [Transformer Notebooks](https://huggingface.co/transformers/notebooks.html)
- [Byte Pair Encoding — The Dark Horse of Modern NLP](https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10)
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)