# 1. Encoder-Decoder

For problems such as machine translation, we often use the encoder-decoder architecture:

![jupyter](./encoder-decoder.svg)

# 2. Seq2Seq

Combining RNNs with the encoder-decoder architecture gives the seq2seq model.

It behaves differently in training and in prediction. In training, the decoder is fed the ground-truth target sequence as input, which is known as teacher forcing:

![jupyter](seq_new_train.svg)

Here the red rectangles represent the encoder RNN and the yellow rectangles represent the decoder RNN. The RNN can be a simple RNN, a GRU, an LSTM, etc.

Seq2seq in prediction, where each decoder input is the token predicted at the previous step:

![jupyter](seq_predict.svg)

In machine translation, raw text tokens are represented as one-hot vectors, so we need to embed them before feeding them into the encoder and decoder.

After the decoder output, we also need a fully connected layer to map the hidden vectors back to vocabulary-sized vectors.

Multi-layer RNNs can also be used. Putting all of this together:

![jupyter](seq2seq-details.svg)

In [1]:
from torch import nn


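class Seq2SeqEncoder(nn.Module):
    """GRU encoder: embeds source token indices and encodes the sequence."""

    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.input_dim = input_dim  # source vocabulary size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers)

    def forward(self, enc_x):
        # enc_x: (src_len, batch) token indices; nn.GRU defaults to batch_first=False
        embedded = self.embedding(enc_x)  # (src_len, batch, embed_dim)
        # output: (src_len, batch, hidden_dim); state: (num_layers, batch, hidden_dim)
        output, state = self.rnn(embedded)
        return output, state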

In [2]:
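class Seq2SeqDecoder(nn.Module):
    """GRU decoder: starts from the encoder state and predicts target tokens."""

    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim  # target vocabulary size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers)
        self.linear = nn.Linear(hidden_dim, output_dim)
        # Normalize over the vocabulary (last) dimension; omitting dim is deprecated
        self.soft_max = nn.Softmax(dim=-1)

    def forward(self, dec_x, state):
        # dec_x: (tgt_len, batch) token indices; state: (num_layers, batch, hidden_dim)
        embedded = self.embedding(dec_x)  # (tgt_len, batch, embed_dim)
        output, state = self.rnn(embedded, state)
        # prediction: (tgt_len, batch, output_dim), probabilities over the vocabulary
        prediction = self.soft_max(self.linear(output))
        return prediction, state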

In [3]:
class Seq2Seq(nn.Module):

    def __init__(self, encoder: Seq2SeqEncoder, decoder: Seq2SeqDecoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

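    def forward(self, enc_x, dec_x):
        # The final encoder state initializes the decoder, so the encoder and
        # decoder must share hidden_dim and num_layers.
        _, state = self.encoder(enc_x)
        prediction, state = self.decoder(dec_x, state)
        return prediction, state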
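
The difference between training with teacher forcing and step-by-step prediction can be sketched as follows. This is a minimal illustration rather than the notebook's actual training code: the toy sizes, the `BOS` token index, and the random data are all assumptions, and the standalone layers merely stand in for the encoder and decoder classes above. Unlike the decoder above, this sketch keeps raw logits, because `cross_entropy` applies log-softmax itself.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
src_vocab, tgt_vocab, embed_dim, hidden_dim = 11, 13, 8, 16
src_len, tgt_len, batch_size = 5, 6, 2
BOS = 1  # assumed index of the begin-of-sequence token

# Minimal stand-ins for the encoder and decoder (single-layer GRUs).
enc_embedding = nn.Embedding(src_vocab, embed_dim)
enc_rnn = nn.GRU(embed_dim, hidden_dim)
dec_embedding = nn.Embedding(tgt_vocab, embed_dim)
dec_rnn = nn.GRU(embed_dim, hidden_dim)
dec_linear = nn.Linear(hidden_dim, tgt_vocab)

# Random token indices standing in for a real parallel corpus.
enc_x = torch.randint(0, src_vocab, (src_len, batch_size))
target = torch.randint(0, tgt_vocab, (tgt_len, batch_size))

# Encode once; the final hidden state initializes the decoder.
_, state = enc_rnn(enc_embedding(enc_x))

# Training (teacher forcing): the decoder input is the ground-truth target
# shifted right by one step, so all time steps run in a single call.
bos = torch.full((1, batch_size), BOS, dtype=torch.long)
dec_x = torch.cat([bos, target[:-1]], dim=0)
output, _ = dec_rnn(dec_embedding(dec_x), state)
logits = dec_linear(output)  # (tgt_len, batch, tgt_vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, tgt_vocab), target.reshape(-1)
)

# Prediction (greedy decoding): feed the previous prediction back in.
with torch.no_grad():
    token, dec_state, steps = bos, state, []
    for _ in range(tgt_len):
        output, dec_state = dec_rnn(dec_embedding(token), dec_state)
        token = dec_linear(output).argmax(dim=-1)  # (1, batch)
        steps.append(token)
    predicted = torch.cat(steps, dim=0)  # (tgt_len, batch)
```

In training all decoder steps are computed in parallel from known inputs, while in prediction each step has to wait for the previous one.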

# 3. Attention mechanism

Using the nonvolitional cue based on saliency (red cup, non-paper), attention is involuntarily directed to the coffee:

![jupyter](eye-coffee.svg)


Using the volitional cue (want to read a book) that is task-dependent, attention is directed to the book under volitional control.

![jupyter](eye-book.svg)

The attention mechanism is inspired by the biological observation above, with the following correspondence:

1. nonvolitional cues $\rightarrow$ keys
2. volitional cues $\rightarrow$ queries
3. sensory inputs $\rightarrow$ values

Attention mechanisms bias selection over values (sensory inputs) via attention pooling, which incorporates queries (volitional cues) and keys (nonvolitional cues):

![jupyter](qkv.svg)

# 4. Attention pooling: Nadaraya-Watson regression

Given a dataset $\left\{(x_{1},y_{1}),...,(x_{n},y_{n})\right\}$, where $x_{i} \in \mathbb{R}, y_{i} \in \mathbb{R}$, how can we learn a function $f$ that predicts the output $\hat{y}=f(x)$ for any new input $x$?

The simplest ("dumbest") estimator for this regression problem ignores $x$ and averages all training outputs:

$$f(x)= \frac{1}{n}\sum_{i=1}^{n}y_{i}$$

![jupyter](nadaraya-waston-1.svg)

Nadaraya-Watson regression measures the similarity between $x$ and each $x_{i}$ with a positive kernel function $k(x,x_{i})$, then predicts with the kernel-weighted average of the outputs:

$$f(x) = \sum_{i=1}^{n}\frac{k(x, x_{i})}{\sum_{j=1}^{n}k(x, x_{j})}y_{i}$$

To illustrate, consider the Gaussian kernel:

$$k(x, x_{i}) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(x - x_{i})^{2}\right)$$

In this setting the constant $\frac{1}{\sqrt{2\pi}}$ cancels, and:

$$f(x) = \sum_{i=1}^{n}\frac{\exp\left(-\frac{1}{2}(x - x_{i})^{2}\right)}{\sum_{j=1}^{n}\exp\left(-\frac{1}{2}(x - x_{j})^{2}\right)}y_{i} = \sum_{i=1}^{n}\mathrm{softmax}\left(-\frac{1}{2}(x - x_{i})^{2}\right)y_{i}$$

![jupyter](nadaraya-waston-2.svg)
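
The Gaussian-kernel estimator above can be sketched in a few lines. The toy data (noisy samples of an assumed target function) and all variable names are illustrative assumptions, not part of the original notes.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: noisy samples of an assumed target function.
n = 50
x_train, _ = torch.sort(torch.rand(n) * 5)
y_train = 2 * torch.sin(x_train) + x_train ** 0.8 + 0.5 * torch.randn(n)
x_test = torch.linspace(0, 5, 40)

# Scores -(x - x_i)^2 / 2 for every (test input, training input) pair.
scores = -0.5 * (x_test.unsqueeze(1) - x_train.unsqueeze(0)) ** 2  # (40, n)
# Softmax over the training inputs turns the scores into weights.
weights = torch.softmax(scores, dim=1)
# Each prediction is a weighted average of the training outputs.
y_hat = weights @ y_train  # (40,)
```

Since every row of `weights` is a probability distribution, each prediction lies between the smallest and largest training output.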

Parametric Nadaraya-Watson regression introduces a learnable parameter $w$ that scales the distance between $x$ and $x_{i}$:

$$f(x) = \sum_{i=1}^{n}\frac{\exp\left(-\frac{1}{2}((x - x_{i})w)^{2}\right)}{\sum_{j=1}^{n}\exp\left(-\frac{1}{2}((x - x_{j})w)^{2}\right)}y_{i} = \sum_{i=1}^{n}\mathrm{softmax}\left(-\frac{1}{2}((x - x_{i})w)^{2}\right)y_{i}$$

![jupyter](nadaraya-waston-3.svg)
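
One way to fit $w$ is gradient descent on the squared error, sketched below. The toy data, learning rate, number of epochs, and the choice to mask each point's own label (so a point cannot trivially predict itself) are all assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data, as in a typical 1-D regression demo.
n = 50
x_train, _ = torch.sort(torch.rand(n) * 5)
y_train = 2 * torch.sin(x_train) + x_train ** 0.8 + 0.5 * torch.randn(n)

w = torch.ones(1, requires_grad=True)  # learnable scale inside the kernel
optimizer = torch.optim.SGD([w], lr=0.1)

# Assumed design choice: exclude each point's own (x_i, y_i) pair
# when predicting y_i, by masking the diagonal of the score matrix.
self_mask = torch.eye(n, dtype=torch.bool)

losses = []
for epoch in range(20):
    diff = x_train.unsqueeze(1) - x_train.unsqueeze(0)  # (n, n)
    scores = -0.5 * (diff * w) ** 2
    scores = scores.masked_fill(self_mask, float("-inf"))
    y_hat = torch.softmax(scores, dim=1) @ y_train
    loss = ((y_hat - y_train) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

A larger $|w|$ narrows the kernel, so the prediction relies more heavily on the nearest training inputs.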

# 5. Attention scoring function

We can formalize Nadaraya-Watson regression as:

1. Compute a score for each query-key pair: $a(x, x_{i}) = -\frac{1}{2}(x - x_{i})^{2}$
2. Apply softmax to the scores to obtain attention weights, which form a distribution over the keys: $w(x, x_{i}) = \mathrm{softmax}(a(x, x_{i}))$
3. Output the weighted average of the values: $f(x) = \sum_{i=1}^{n}w(x, x_{i})y_{i}$

This is exactly one type of attention mechanism, with $x$ as the query, $x_{i}$ as the keys, and $y_{i}$ as the values:

![jupyter](attention-output.svg)

When queries and keys are vectors of different lengths, we can use additive attention:

$$a(\mathbf{q}, \mathbf{k}) = w_{h}^{T}tanh(W_{q}\mathbf{q} + W_{k}\mathbf{k})$$

Here $\mathbf{q} \in \mathbb{R}^{q}, \mathbf{k} \in \mathbb{R}^{k}$, and the learnable parameters are $W_{q} \in \mathbb{R}^{h\times{q}}, W_{k} \in \mathbb{R}^{h\times{k}}, w_{h} \in \mathbb{R}^{h}$.

The hidden size $h$ is a hyperparameter.

This is equivalent to concatenating the query and the key, then feeding the result into a two-layer MLP (without bias terms) with hidden size $h$, tanh activation, and output size 1.
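
A minimal sketch of additive attention as a PyTorch module follows. The class name `AdditiveAttention`, the batched tensor layout, and the example sizes are assumptions made for illustration.

```python
import torch
from torch import nn


class AdditiveAttention(nn.Module):
    """Additive scoring: a(q, k) = w_h^T tanh(W_q q + W_k k)."""

    def __init__(self, query_dim, key_dim, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(query_dim, hidden_dim, bias=False)
        self.W_k = nn.Linear(key_dim, hidden_dim, bias=False)
        self.w_h = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, queries, keys, values):
        # queries: (batch, n, q); keys: (batch, m, k); values: (batch, m, v)
        q = self.W_q(queries).unsqueeze(2)  # (batch, n, 1, h)
        k = self.W_k(keys).unsqueeze(1)     # (batch, 1, m, h)
        # Broadcasting the sum scores every query against every key.
        scores = self.w_h(torch.tanh(q + k)).squeeze(-1)  # (batch, n, m)
        weights = torch.softmax(scores, dim=-1)
        return weights @ values, weights


torch.manual_seed(0)
attention = AdditiveAttention(query_dim=20, key_dim=2, hidden_dim=8)
queries = torch.randn(2, 1, 20)   # one query per batch element
keys = torch.randn(2, 10, 2)      # ten key-value pairs
values = torch.randn(2, 10, 4)
output, weights = attention(queries, keys, values)
```

Here the queries have length 20 and the keys length 2, yet both are projected to the shared hidden size before being added.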

When the query and key have the same dimension, we can use scaled dot-product attention:

$$a(q, k) = \frac{q^{T}k}{\sqrt{d}}$$

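Here $q, k \in \mathbb{R}^{d}$, and we divide by $\sqrt{d}$ so that the variance of the score does not grow with $d$.

Suppose the elements of $q$ and $k$ are independent with zero mean and unit variance. Then $q^{T}k$ has zero mean and variance $d$, so dividing by $\sqrt{d}$ restores unit variance.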
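
The variance claim can be checked empirically. The sketch below assumes standard normal entries (one distribution that satisfies the zero-mean, unit-variance assumption) and an arbitrary choice of $d = 64$.

```python
import torch

torch.manual_seed(0)

d, num_samples = 64, 20_000  # assumed sizes for the experiment
q = torch.randn(num_samples, d)
k = torch.randn(num_samples, d)

dots = (q * k).sum(dim=1)                      # one dot product per sample
var_raw = dots.var().item()                    # close to d
var_scaled = (dots / d ** 0.5).var().item()    # close to 1
```

Without the scaling, large $d$ produces large scores, which pushes the softmax toward a near one-hot distribution with small gradients.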

Scaled dot-product attention is more computationally efficient than additive attention. In practice, we also compute it for a whole minibatch at once.

Suppose we compute attention for $n$ queries and $m$ key-value pairs. The scaled dot-product attention is then:

$$\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \in \mathbb{R}^{n\times{v}}$$

Here $Q \in \mathbb{R}^{n\times{d}}, K \in \mathbb{R}^{m\times{d}}, V\in \mathbb{R}^{m\times{v}}$, and the softmax is applied to each row, i.e. over the $m$ keys.
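
The matrix form translates almost directly into code. The function name and the toy sizes below are assumptions for illustration.

```python
import math

import torch


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d)) V together with the attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (n, m)
    weights = torch.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V, weights


torch.manual_seed(0)
n, m, d, v = 3, 5, 4, 2  # hypothetical sizes
Q, K, V = torch.randn(n, d), torch.randn(m, d), torch.randn(m, v)
output, weights = scaled_dot_product_attention(Q, K, V)
```

Because only the last two dimensions are used, the same function also works on inputs with leading batch dimensions.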

# 6. Bahdanau attention

Recall the vanilla RNN encoder-decoder:

![jupyter](seq2seq1.svg)

Here the context variable $c$ is the same at every decoder step.

Bahdanau attention replaces $c$ with a step-dependent context $c_{t^{'}}$:

$$c_{t^{'}} = \sum_{t=1}^{T}\alpha(s_{t^{'}-1}, h_{t})h_{t}$$

Here $T$ is the input sequence length.

The decoder hidden state $s_{t^{'}-1}$ at time step $t^{'}-1$ is the query.

The encoder hidden states $h_{t}$ are both the keys and the values.

The attention weight $\alpha$ is the softmax of the additive attention score, i.e.:

$$a(\mathbf{q}, \mathbf{k}) = w_{h}^{T}tanh(W_{q}\mathbf{q} + W_{k}\mathbf{k})$$

Data flow of Bahdanau attention:

![jupyter](seq2seq2.svg)
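
Computing the context for a single decoder step can be sketched as follows. The sizes and the random tensors standing in for real encoder and decoder states are assumptions, and batching is omitted to keep the shapes easy to follow.

```python
import torch
from torch import nn

torch.manual_seed(0)

T, enc_dim, dec_dim, attn_dim = 7, 16, 16, 8  # hypothetical sizes

# Parameters of the additive scoring function.
W_q = nn.Linear(dec_dim, attn_dim, bias=False)
W_k = nn.Linear(enc_dim, attn_dim, bias=False)
w_h = nn.Linear(attn_dim, 1, bias=False)

enc_hidden = torch.randn(T, enc_dim)  # h_1, ..., h_T: keys and values
dec_state = torch.randn(dec_dim)      # s_{t'-1}: the query

# One score per encoder time step, then softmax over the T steps.
scores = w_h(torch.tanh(W_q(dec_state) + W_k(enc_hidden))).squeeze(-1)  # (T,)
alpha = torch.softmax(scores, dim=0)
# Context for this decoder step: weighted sum of encoder hidden states.
context = alpha @ enc_hidden  # (enc_dim,)
```

This computation is repeated at every decoder step with the updated decoder state, so each step gets its own context vector.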

# 7. Multi-head attention

Given the same set of queries, keys, and values, we may want the model to combine knowledge from different behaviors of the same attention mechanism,

e.g. shorter-range vs. longer-range dependencies.

In multi-head attention, the queries, keys, and values are first transformed with $h$ independently learned linear projections.

Each of the $h$ projected sets is fed into attention pooling in parallel. The $h$ outputs are then concatenated and passed through another fully connected layer:

![jupyter](multi-head-attention.svg)
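
A compact sketch of multi-head attention is given below. It assumes scaled dot-product attention inside each head and implements the $h$ projections as one large linear layer that is split into heads, which is a common implementation trick; the class name and sizes are illustrative.

```python
import math

import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Each layer holds the projections of all heads side by side.
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)  # final fully connected layer

    def split_heads(self, x):
        # (batch, length, embed_dim) -> (batch, heads, length, head_dim)
        batch, length, _ = x.shape
        x = x.reshape(batch, length, self.num_heads, self.head_dim)
        return x.transpose(1, 2)

    def forward(self, queries, keys, values):
        q = self.split_heads(self.W_q(queries))
        k = self.split_heads(self.W_k(keys))
        v = self.split_heads(self.W_v(values))
        # Scaled dot-product attention, computed for all heads in parallel.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        heads = torch.softmax(scores, dim=-1) @ v  # (batch, heads, n, head_dim)
        # Concatenate the heads back into one vector per query.
        batch, _, length, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(batch, length, -1)
        return self.W_o(concat)


torch.manual_seed(0)
mha = MultiHeadAttention(embed_dim=32, num_heads=4)
queries = torch.randn(2, 4, 32)     # 4 queries per batch element
keys = values = torch.randn(2, 6, 32)  # 6 key-value pairs
output = mha(queries, keys, values)
```

The output keeps one vector per query, with the same width as the input embedding.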

# 8. Self-attention and positional encoding

If the queries, keys, and values all come from the same sequence, we call the attention self-attention.

Suppose we have a sequence $x_{1}, ..., x_{n}$ where $x_{i} \in \mathbb{R}^{d}$. The self-attention output is then:

$$y_{i} = f(x_{i}, (x_{1}, x_{1}),...,(x_{n}, x_{n})) \in \mathbb{R}^{d}$$

Comparing CNNs, RNNs, and self-attention for mapping a sequence of $n$ tokens to another sequence of $n$ tokens:

![jupyter](cnn-rnn-self-attention.svg)

Unlike RNNs, self-attention ditches sequential operations in favor of parallel computation.

To use sequence order information, we can inject positional information by adding a positional encoding to the input.

Positional encodings can be either learned or fixed. We introduce a fixed positional encoding here.

Suppose the input $X \in \mathbb{R}^{n\times{d}}$ contains the $d$-dimensional embeddings of the $n$ tokens of a sequence.

The positional encoding outputs $X+P$, where $P \in \mathbb{R}^{n\times{d}}$ has the following elements in the $i^{th}$ row and the $(2j)^{th}$ or $(2j+1)^{th}$ column:

$$p_{i, 2j} = sin\left(\frac{i}{10000^{2j/d}}\right)$$

$$p_{i, 2j + 1} = cos\left(\frac{i}{10000^{2j/d}}\right)$$

Rows correspond to positions within a sequence, and columns correspond to different positional encoding dimensions.

![jupyter](self-attention1.svg)
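
The matrix $P$ can be built directly from the two formulas above. The function name and the assumption that $d$ is even are choices made for this sketch.

```python
import torch


def positional_encoding(n, d):
    """Return P of shape (n, d): sin in even columns, cos in odd columns."""
    assert d % 2 == 0, "this sketch assumes an even encoding dimension"
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # (n, 1)
    # 1 / 10000^(2j/d) for j = 0, ..., d/2 - 1
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(position * inv_freq)
    P[:, 1::2] = torch.cos(position * inv_freq)
    return P


P = positional_encoding(n=60, d=32)
X = torch.zeros(60, 32)  # placeholder embeddings
X_with_position = X + P
```

At position 0 every sine column is 0 and every cosine column is 1, and all entries stay within $[-1, 1]$.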

In binary representations, a higher bit has a lower frequency than a lower bit.

Similarly, the positional encoding decreases in frequency along the encoding dimension, as demonstrated below:

![jupyter](self-attention2.svg)

Besides capturing absolute positional information, the above encoding also allows a model to easily learn to attend by relative positions.

Denoting $\omega_{j} = 1/10000^{2j/d}$, any pair $(p_{i, 2j}, p_{i, 2j+1})$ can be linearly projected to $(p_{i + \delta, 2j}, p_{i + \delta, 2j+1})$ for any fixed offset $\delta$:

$$
\begin{equation}
\begin{split}
&\begin{bmatrix}
 cos(\delta\omega_{j}) &sin(\delta\omega_{j}) \\
 -sin(\delta\omega_{j}) &cos(\delta\omega_{j})
\end{bmatrix}
\begin{bmatrix}
 p_{i, 2j}\\
p_{i, 2j+1}
\end{bmatrix}\\
=&
\begin{bmatrix}
cos(\delta\omega_{j})sin(i\omega_{j}) + sin(\delta\omega_{j})cos(i\omega_{j})\\
-sin(\delta\omega_{j})sin(i\omega_{j}) + cos(\delta\omega_{j})cos(i\omega_{j})
\end{bmatrix}\\
=&
\begin{bmatrix}
 sin((i+\delta)\omega_{j})\\
cos((i+\delta)\omega_{j})
\end{bmatrix}\\
=&
\begin{bmatrix}
 p_{i+\delta, 2j}\\
p_{i+\delta, 2j+1}
\end{bmatrix}
\end{split}
\end{equation}
$$

The $2\times 2$ projection matrix depends only on $\delta$ and $\omega_{j}$, not on the position index $i$.
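
This identity is easy to verify numerically. The sketch below takes $\omega_{j} = 1/10000^{2j/d}$, consistent with the sine and cosine definitions of $p_{i,2j}$ and $p_{i,2j+1}$; the particular values of $d$, $j$, $i$, and $\delta$ are arbitrary assumptions.

```python
import math

import torch

d, j, i, delta = 32, 3, 5, 7  # hypothetical values
omega = 1.0 / 10000 ** (2 * j / d)


def pair(pos):
    """Return (p_{pos, 2j}, p_{pos, 2j+1}) as a 2-vector."""
    return torch.tensor([math.sin(pos * omega), math.cos(pos * omega)])


# The projection matrix depends only on delta and omega.
projection = torch.tensor([
    [math.cos(delta * omega), math.sin(delta * omega)],
    [-math.sin(delta * omega), math.cos(delta * omega)],
])

shifted = projection @ pair(i)         # should equal pair(i + delta)
shifted_other = projection @ pair(20)  # same matrix, different position
```

The same matrix shifts the encoding by $\delta$ from any starting position, which is what makes relative offsets easy to represent.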

# 9. Transformer

The transformer architecture is shown below.

In the decoder's second sublayer (encoder-decoder attention), the queries come from the previous decoder layer, while the keys and values come from the encoder outputs.

![jupyter](transformer.svg)

The transformer is an instance of the encoder-decoder architecture. Multi-head attention and positional encoding are as described above.

The positionwise feed-forward network transforms the representation at every sequence position using the same MLP.
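
A short sketch shows what "the same MLP at every position" means in practice. The layer sizes and the use of ReLU are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

d_model, d_hidden = 8, 32  # hypothetical sizes
ffn = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, d_model),
)

X = torch.randn(2, 5, d_model)  # (batch, positions, features)
Y = ffn(X)                      # nn.Linear acts on the last dimension only

# Applying the MLP to one position in isolation gives the same result.
y_position_3 = ffn(X[:, 3])
```

The output at a position depends only on the input at that position, so information is mixed across positions only by the attention sublayers.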

Since the same MLP is applied regardless of position, the name "position-free" may be more appropriate.

The add & norm layer is a residual connection immediately followed by layer normalization.

Layer normalization: normalizes across the features of each individual example.

Batch normalization: normalizes each feature across the examples in a minibatch.
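
The difference between the two normalizations, and the add & norm layer itself, can be sketched as follows. The class name `AddNorm`, the dropout on the sublayer output, and the toy sizes are assumptions for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

X = torch.randn(4, 6) * 3 + 2  # (batch, features), hypothetical data

layer_norm = nn.LayerNorm(6)    # statistics over the 6 features of each row
batch_norm = nn.BatchNorm1d(6)  # statistics over the 4 rows of each column
                                # (batch statistics, since modules start in training mode)

ln_out = layer_norm(X)  # every row has roughly zero mean
bn_out = batch_norm(X)  # every column has roughly zero mean


class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""

    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, sublayer_output):
        return self.norm(x + self.dropout(sublayer_output))


add_norm = AddNorm(dim=6)
out = add_norm(X, torch.randn(4, 6))
```

Layer normalization does not depend on the other examples in the batch, which makes it convenient for variable-length sequences.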