## Multihead Attention

- one attention head can capture one relationship

- in a sentence there are many relationships

- multiple heads, each with its own $W^Q$, $W^K$ and $W^V$

- additional $W^O$ to combine heads





<img src="MultiHead.svg" width=40% style="margin-left:auto; margin-right:auto">

## Multihead Attention Math


- $W^O$ with $d_{model} \times H d_v$

- $W^{O}_h$ with $d_{model} \times d_v$ (the $h$-th column block of $W^O$)

- $c_j$ with $1 \times d_v$


The $c_j$ of head $h$ has dimensionality $d_v$ and is denoted $c_{hj}$.

$$z_j = \sum_{h=1}^H W^{O}_h c_{hj}^T = W^{O} \cdot [c_{1j} \dots c_{Hj}]^T$$

$z_j$ with $(d_{model} \times H d_v) \times (H d_v \times 1) = d_{model} \times 1$

$z$ with $inputs \times d_{model}$ (one row $z_j^T$ per input position)
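
The two forms of $z_j$ above (sum over per-head blocks versus one product with the concatenated head outputs) can be checked numerically. This is only a minimal sketch: the sizes and the random values are assumptions chosen for illustration, and `W_O_blocks` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the text)
d_model, d_v, H = 8, 4, 3

# One context vector c_hj of length d_v per head, for a single position j
c_heads = rng.random((H, d_v))

# W^O has shape d_model x (H * d_v); W^O_h is its h-th column block
W_O = rng.random((d_model, H * d_v))
W_O_blocks = [W_O[:, h * d_v:(h + 1) * d_v] for h in range(H)]

# Left-hand form: sum over heads of W^O_h c_hj
z_sum = sum(W_O_blocks[h] @ c_heads[h] for h in range(H))

# Right-hand form: W^O times the concatenated head outputs
z_concat = W_O @ c_heads.reshape(H * d_v)

print(np.allclose(z_sum, z_concat))
```

Both forms give the same vector of length $d_{model}$, which is why implementations usually concatenate the heads and apply a single matrix.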


In [2]:
import random
import numpy as np
import math

# input vectors: 3 positions with 8 features each (maybe take values from Stephan/Ziwei)
x = np.random.random_sample((3, 8))

# we need to set the dimensions
d_model = x.shape[1] # always the length of the input vectors
d_q = d_model // 4 # dimension of queries and keys, in principle freely choosable
d_v = d_model // 2 # dimension of the values, can differ from d_q but usually does not
h_count = 3 # number of heads

W_Q = np.random.random_sample((h_count,d_q, d_model))
W_K = np.random.random_sample((h_count,d_q, d_model))
W_V = np.random.random_sample((h_count,d_v, d_model))


c_jh = np.zeros((x.shape[0], h_count, d_v))

for hi in range(h_count):
    # project every input vector into the key, query and value space of head hi
    k_stars = np.array([np.dot(W_K[hi], xi) for xi in x])
    q_stars = np.array([np.dot(W_Q[hi], xi) for xi in x])
    v_stars = np.array([np.dot(W_V[hi], xi) for xi in x])

    for j in range(x.shape[0]):
        qj_star = q_stars[j]
        # scores of query j against all keys, scaled by sqrt of the key/query dimension
        all_gj = np.array([np.dot(qj_star, k_stars[i]) / np.sqrt(d_q) for i in range(x.shape[0])])
        # softmax over the scores
        exp_gj = np.exp(all_gj)
        alpha_j = exp_gj / np.sum(exp_gj)
        # context vector: attention-weighted sum of the values
        c_jh[j][hi] = np.sum([alpha_j[i] * v_stars[i] for i in range(x.shape[0])], axis=0)

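# one output projection shared by all positions, shape d_model x (H * d_v)
W_O = np.random.random_sample((d_model, h_count*d_v))

z = np.zeros((x.shape[0], d_model))
# concatenate the head outputs of each position
c = c_jh.reshape(x.shape[0], h_count * d_v)

for i in range(x.shape[0]):
    z[i] = np.dot(W_O, c[i])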


## Pointwise Feedforward Network

- each position gets the same transformation (with the same weights)

- fully connected

- $d_{inner} > d_{model}$

- takes up approx. $2/3$ of the transformer parameters

- might act as key-value memories (https://arxiv.org/pdf/2012.14913.pdf)

- input and output dimension are both $d_{model}$


$$PFF(z_j) = W_2 F(W_1 z_j + b_1) + b_2$$

$$F(x) = \max(0,x) = \mathrm{ReLU}(x)$$


- So attention might not be everything you need.
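
The $2/3$ share mentioned above can be reproduced with a rough parameter count. The sketch below rests on assumptions that are not stated in the text: $d_{inner} = 4 \cdot d_{model}$, head dimensions chosen so that $H \cdot d_q = H \cdot d_v = d_{model}$, and biases, normalization and embedding parameters ignored. `layer_param_counts` is a hypothetical helper.

```python
def layer_param_counts(d_model, d_inner):
    """Weight counts of one layer, ignoring biases and normalization."""
    # W^Q, W^K, W^V and W^O together: four d_model x d_model matrices
    # (assumes H * d_q = H * d_v = d_model)
    attention = 4 * d_model * d_model
    # W_1 (d_inner x d_model) and W_2 (d_model x d_inner)
    feed_forward = 2 * d_model * d_inner
    return attention, feed_forward


d_model = 512            # assumed example value
d_inner = 4 * d_model    # assumed ratio

attention, feed_forward = layer_param_counts(d_model, d_inner)
share = feed_forward / (attention + feed_forward)
print(attention, feed_forward, round(share, 3))
```

Under these assumptions the feedforward weights are twice as many as the attention weights, independent of the value of `d_model`.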


In [4]:
d_inner = d_model * 4

W_1 = np.random.random_sample((d_inner, d_model))
W_2 = np.random.random_sample((d_model, d_inner))
b_1 = np.random.random_sample((d_inner))
b_2 = np.random.random_sample((d_model))
y = np.zeros((x.shape[0], d_model))
for i in range(x.shape[0]):
    # ReLU: element-wise maximum with 0
    hidden_layer = np.maximum(np.dot(W_1, z[i]) + b_1, 0)
    y[i] = np.dot(W_2, hidden_layer) + b_2


## Residual Connection and Layer Optimization

- aim to improve the convergence of optimization algorithms



## Residual Connection

Substitute $y = f(x,p)$ by $y = x + g(x,q)$, where $p$ and $q$ are parameter vectors.

This means that $g(x,q) = f(x,p) - x$

- $g(x,q)$ can be easier to optimize if $f$ is close to the identity function $id(x) = x$
- $g(x,q)$ learns how much the input $x$ needs to change

- if the initial weights are $0$, then without residual $output \approx$ zero function
- if the initial weights are $0$, then with residual $output \approx identity(x)$

Reasoning:

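- gradient of the objective function $E(y)$ ($E$ = error function) with respect to the input $x$
- chain rule:
$$ \frac{dE}{dx} = E'(y)\cdot y'(x)$$


$$ \frac{\partial E}{\partial x} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial x}$$

Without residual connection: $y = Px$

$$ \frac{\partial E}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial E}{\partial y}P$$

With residual connection: $y = x + Qx$, where $I$ is the identity matrix (the derivative of $x$ with respect to $x$)

$$ \frac{\partial E}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial E}{\partial y}(I+Q)$$

Even if $Q$ is close to zero, the gradient $\frac{\partial E}{\partial y}$ still passes through the identity term.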
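
The two Jacobians above can be checked with finite differences. This is a minimal sketch under assumptions: the layers are purely linear, the weights are small random numbers (as they might be early in training), and `numerical_jacobian` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Q = 0.01 * rng.standard_normal((d, d))  # small weights (assumption)
P = Q.copy()                            # same weights, used without the skip connection


def plain(x):
    return P @ x


def residual(x):
    return x + Q @ x


def numerical_jacobian(f, x, h=1e-6):
    """Central finite differences: column k approximates dy/dx_k."""
    cols = []
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        cols.append((f(x + step) - f(x - step)) / (2 * h))
    return np.stack(cols, axis=1)


x = rng.standard_normal(d)
J_plain = numerical_jacobian(plain, x)
J_res = numerical_jacobian(residual, x)

print(np.allclose(J_plain, P, atol=1e-6))               # Jacobian is P
print(np.allclose(J_res, np.eye(d) + Q, atol=1e-6))     # Jacobian is I + Q

# Backpropagate some upstream gradient dE/dy through both layers
dE_dy = rng.standard_normal(d)
print(np.linalg.norm(dE_dy @ J_plain), np.linalg.norm(dE_dy @ J_res))
```

With small weights the gradient passed back by the plain layer is much smaller than the upstream gradient, while the residual layer passes it on almost unchanged.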

### One more justification - Vanishing Gradient Problem

- Deep (stacked) network with H layers

$$ y_h = Ix + F_h(x) = (I + F_h)x $$

- stack of layers $1, \dots, H$

$$ y = ( \prod_{h=1}^H (I + F_h) ) x $$


$$ \prod_{h=1}^H (I + F_h) = I + \sum_{h\leq H}F_h + \sum_{i<j\leq H}F_jF_i + \dots + \prod_{h=H,\dots,1}F_h $$

- without residual connections, only the last term would remain:

$$  \prod_{h=H,\dots,1}F_h$$

- the size of the gradient tends to decrease as layers are chained -> vanishing gradient

- with residual connections, the product also contains $I$ and the term $\sum_hF_h$ (the sum of the individual layers)

- this prevents gradients from vanishing
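
The effect of chaining can be illustrated by comparing the two products directly. The sketch assumes that every layer is a linear map $F_h$ with small random weights; the depth, width and weight scale are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 8, 20

# Assumed: each layer is a linear map F_h with small random weights
F = [0.05 * rng.standard_normal((d, d)) for _ in range(H)]

plain = np.eye(d)
residual = np.eye(d)
for F_h in F:
    plain = F_h @ plain                       # F_H ... F_1
    residual = (np.eye(d) + F_h) @ residual   # (I + F_H) ... (I + F_1)

print(np.linalg.norm(plain))     # shrinks towards zero
print(np.linalg.norm(residual))  # stays well away from zero
```

The product without skip connections collapses after a few layers, whereas the residual product keeps the identity and the first-order terms.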



In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

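inputs = layers.Input(shape=(1,3))
# all kernels are initialized to zero, so the attention output is the zero function
result = layers.MultiHeadAttention(key_dim=2, num_heads=2, use_bias=False, kernel_initializer='zeros')(inputs, inputs)
# uncomment to add the residual connection: the output then equals the input
#result = layers.Add()([inputs, result])
model = keras.models.Model(inputs=inputs, outputs=result)
test_input = tf.constant([[[1., 2., 3.]]]) # float input
result = model(test_input)
print(result)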

## Normalization

### Reasons

- input variables might have different scaling

- in a linear model, scaling is accounted for by the pseudoinverse $(X'X)^{-1}X'$

- but this might lead to bad numerical conditioning of the inverse

- in a stacked network, outputs could also hit problematic regions of nonlinear activation functions

- weight matrices are unbounded -> could also grow without limit


### Batch Normalization

- requires the mean $m$ and variance $v$ of variable $z$; they are estimated from the samples of the current batch

$$ \hat{y} = \frac{z-m}{\sqrt{v}}$$


- For the fit measure $E$ = MSE, the gradient is:

$$ \frac{\partial E}{\partial z} = \frac {\partial E}{\partial \hat{y}}\frac{1}{\sqrt{v}} \left[\left(1-\frac {1}{H}\right) - (z-m)^2 \frac{1}{Hv}\right] $$

where $H$ is here the number of samples in the batch

- this gives bounds for the norm of the gradients and of the Hessian matrix of second derivatives

- batch normalization seems to make the mapping smoother
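
The bracketed factor in the gradient formula above can be compared against finite differences. This sketch makes several assumptions: $H$ is read as the number of samples in the batch, $v$ is the biased batch variance, and only the diagonal derivative (how the normalized value of sample $i$ reacts to its own input $z_i$) is checked. `batch_norm` is a hypothetical helper without learnable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 6                      # number of samples in the batch (assumed)
z = rng.standard_normal(H)


def batch_norm(z):
    m = z.mean()
    v = z.var()            # biased variance: 1/H * sum((z - m)^2)
    return (z - m) / np.sqrt(v)


m, v = z.mean(), z.var()

# Closed form for the derivative of the i-th normalized value w.r.t. z_i
analytic = ((1 - 1 / H) - (z - m) ** 2 / (H * v)) / np.sqrt(v)

# Central finite differences for the same derivatives
h = 1e-6
numeric = np.empty(H)
for i in range(H):
    step = np.zeros(H)
    step[i] = h
    numeric[i] = (batch_norm(z + step)[i] - batch_norm(z - step)[i]) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Because $m$ and $v$ depend on every sample, perturbing one $z_i$ also changes the normalized values of the other samples; those cross terms are not covered by this check.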


## Layer Normalization

- in batch normalization, the gradient of one sample depends on all other samples as well

- layer normalization calculates mean and variance over the feature dimension

- Layer Normalization in Transformer: 

$$ \hat{x} = \frac{x -m}{\sqrt{v^2}}$$

Applied after Residual Connection of MultiHeadAttention and PFF:

>$ z^*_j = LayerNorm(z_j + x_j)$ where $ z_j = MultiHeadAtt(x_j, x)$


>$ y^*_j = LayerNorm(y_j + z^*_j)$ where $y_j = PFF(z^*_j)$
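
The wiring of the two sublayers can be sketched as follows. The functions `multi_head_att` and `pff` are hypothetical stand-ins (random linear maps, not real attention), and `layer_norm` omits the learnable gain and offset; only the order of residual addition and normalization is meant to be illustrated.

```python
import numpy as np


def layer_norm(t, eps=1e-5):
    """Normalize each row (position) over its feature dimension."""
    m = t.mean(axis=-1, keepdims=True)
    v2 = t.var(axis=-1, keepdims=True)
    return (t - m) / (np.sqrt(v2) + eps)


rng = np.random.default_rng(0)
n_inputs, d_model = 3, 8
x = rng.random((n_inputs, d_model))

# Stand-ins for the real sublayers (hypothetical placeholders)
W_att = rng.random((d_model, d_model))
W_pff = rng.random((d_model, d_model))


def multi_head_att(t):
    return t @ W_att.T


def pff(t):
    return np.maximum(t @ W_pff.T, 0)


# Residual connection first, then layer normalization
z_star = layer_norm(multi_head_att(x) + x)
y_star = layer_norm(pff(z_star) + z_star)

print(y_star.mean(axis=-1))  # close to 0 for every position
print(y_star.std(axis=-1))   # close to 1 for every position
```

Each position is normalized on its own, so the result does not depend on the other positions or on other samples in a batch.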


<img src="BatchLayerNorm.png" width=40% style="margin-left:auto; margin-right:auto">

In [None]:
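import random
# learnable gain and offset, drawn at random here for illustration
gamma = random.uniform(-1, 1)
beta = random.uniform(-1, 1)
epsilon = 0.00001 # small constant for numerical stability

v_i = x[0] # first input vector
mu = np.mean(v_i) # mean over the feature dimension
sigma_squared = np.mean((v_i - mu)**2) # variance over the feature dimension
Layer_Norm = gamma * (v_i - mu) / (np.sqrt(sigma_squared) + epsilon) + beta
print(Layer_Norm)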

### How it is done

$$ LayerNorm(x) = \gamma \frac{x-m}{\sqrt{v^2} + \epsilon}+ \beta $$
> $ x \in \mathbb{R}^{d_{model}}$

> $ mean = m = \frac{1}{d_{model}} \sum_{i=1}^{d_{model}} x_i$

> $ v^2 = \frac{1}{d_{model}} \sum_{i=1}^{d_{model}} (x_i - m)^2$

> $\gamma$ and $\beta$ are two learnable parameters; $\epsilon$ is a small constant that avoids division by zero
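
As a small worked example (the values are made up for illustration, with $\gamma = 1$, $\beta = 0$ and $\epsilon$ neglected), take $x = (1, 2, 3, 4)$, so $d_{model} = 4$:

$$ m = \frac{1 + 2 + 3 + 4}{4} = 2.5 $$

$$ v^2 = \frac{(-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2}{4} = 1.25, \quad \sqrt{v^2} \approx 1.118 $$

$$ LayerNorm(x) \approx (-1.342, -0.447, 0.447, 1.342) $$

The normalized vector has mean $0$ and variance $1$; other values of $\gamma$ and $\beta$ would rescale and shift it.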
