# TensorFlow Experiments 0x02
----
(C) Maxim Gansert, 2020, Mindscan Engineering

In [None]:
import sys
sys.path.insert(0, '../../src')

In [None]:
import math
import numpy as np

from sklearn.utils.extmath import softmax


I want to experiment with the attention mechanism, which is one of the things I do not understand yet. The following steps shall be achieved:

* reuse my learned embeddings **done**
* use a fixed vector **done**
* do the attention calculation **done**
* visualize the attention **done**


# Load embedding data

In [None]:
from de.mindscan.fluentgenesis.embedding.embedder import Embedder



In [None]:
embedder = Embedder()
embedder.load("../../data/16k-full-embeddings/syn0.txt")


In [None]:
# this will embed a sequence of int32 into a matrix of (len(input) x 512)

bpe = [461, 124, 648, 92, 94, 2128, 645, 640, 62, 864, 47, 3357, 41, 5946, 42, 60, 10160, 1712, 62, 10160, 47, 1465, 41, 35, 4151, 10423, 42, 60, 320, 1712, 47, 5438, 41, 2128, 645, 640, 42, 60, 126, 633, 41, 349, 102, 42, 124, 320, 346, 60, 126]
E = embedder.embed(bpe)

print(E.shape)
print(E)

In [None]:
K = E
Q = E
V = E


In [None]:
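# scaled dot-product scores Q K^T / sqrt(d_model); Q and K are both E here,
# so the order of the operands does not change the result
scores = np.dot(Q, K.T) / math.sqrt(512)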
print(scores.shape)

In [None]:
# This softmax (sklearn) function works row-wise (line by line). We can see that because
# the matrix is no longer symmetric.

softscores = softmax(scores)


In [None]:
print(softscores)
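
The row-wise behaviour can be checked without the embeddings. The following is a minimal numpy-only sketch: `row_softmax` and the random matrix `E_demo` are hypothetical stand-ins for sklearn's `softmax` and the real embedding matrix `E`, so the numbers are not the ones from this notebook.

```python
import math
import numpy as np

def row_softmax(x):
    # subtract the row maximum for numerical stability, then normalize each row
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# random stand-in for the (sequence length x embedding size) matrix E
rng = np.random.default_rng(0)
E_demo = rng.normal(size=(5, 16))

demo_scores = np.dot(E_demo, E_demo.T) / math.sqrt(E_demo.shape[1])
demo_soft = row_softmax(demo_scores)

print(np.allclose(demo_scores, demo_scores.T))  # are the raw scores symmetric?
print(demo_soft.sum(axis=1))                    # sum of every row after softmax
print(np.allclose(demo_soft, demo_soft.T))      # is the softmax result symmetric?
```

Each row is normalized with its own sum, which is why the symmetric score matrix turns into a non-symmetric attention matrix.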


## Visualize the Results

In [None]:
import matplotlib
import matplotlib.pyplot as plt

from matplotlib.colors import NoNorm

In [None]:
def plot_attention( attn ):
    fig, ax = plt.subplots(figsize=(8,8) , dpi=150)
    im = ax.imshow(attn, cmap=plt.get_cmap('gray'), norm=NoNorm(), interpolation='none')

    ax.set_xticks(np.arange(len(bpe)))
    ax.set_yticks(np.arange(len(bpe)))

    ax.set_xticklabels(bpe)
    ax.set_yticklabels(bpe)

    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    ax.set_title("Attention")
    fig.tight_layout()
    plt.show()
    
def plot_attention2( attn ):
    fig, ax = plt.subplots(figsize=(12,30) , dpi=150)
    im = ax.imshow(attn, cmap=plt.get_cmap('gray'))

    # ax.set_xticks(np.arange(len(bpe)))
    ax.set_yticks(np.arange(len(bpe)))

    # ax.set_xticklabels(bpe)
    ax.set_yticklabels(bpe)

    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    ax.set_title("Weighted Embeddings.")
    fig.tight_layout()
    plt.show()

In [None]:
plot_attention(softscores)

## Simple Attention-Mechanism (PyTorch)

This code is for reference and is equivalent to the examples in "The Annotated Transformer", which implements the Transformer network described in "Attention Is All You Need".

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Q=torch.tensor(E, device=device).float()
K=torch.tensor(E, device=device).float()
V=torch.tensor(E, device=device).float()

In [None]:
t_scores = torch.matmul(Q,K.transpose(-2,-1)) / math.sqrt(512)

print(t_scores)

In [None]:
p_attn = F.softmax(t_scores, dim = -1)

print(p_attn)

What we hope to see is that the PyTorch implementation produces results similar to the numpy implementation above, to be sure that the numpy implementation performs the same calculations as the PyTorch (reference) implementation.

Since I don't know PyTorch very well, I want to have a consistent view in numpy, so I can translate that later to a TensorFlow implementation.
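
Besides comparing the plots by eye, the two results can also be compared numerically. The following is a minimal sketch under stated assumptions: `numpy_attention_weights`, `torch_attention_weights` and the random `E_demo` are hypothetical names and stand-in data, and the PyTorch part only runs if `torch` is installed.

```python
import importlib.util
import math
import numpy as np

def numpy_attention_weights(x):
    # softmax(x x^T / sqrt(d)) computed row-wise in plain numpy
    scores = x @ x.T / math.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def torch_attention_weights(x):
    # same calculation with the PyTorch calls used in this notebook
    import torch
    import torch.nn.functional as F
    t = torch.tensor(x).float()
    scores = torch.matmul(t, t.transpose(-2, -1)) / math.sqrt(t.size(-1))
    return F.softmax(scores, dim=-1).numpy()

# random stand-in for the embedding matrix E
rng = np.random.default_rng(0)
E_demo = rng.normal(size=(6, 16)).astype(np.float32)
np_weights = numpy_attention_weights(E_demo)

if importlib.util.find_spec("torch") is not None:
    torch_weights = torch_attention_weights(E_demo)
    print("max abs difference:", np.abs(np_weights - torch_weights).max())
    print("match:", np.allclose(np_weights, torch_weights, atol=1e-5))
```

A small tolerance is used because both sides compute in float32 and may round differently.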

In [None]:
    
plot_attention(p_attn.cpu())

In [None]:
result=torch.matmul(p_attn, V)

plot_attention2(result.cpu())

In [None]:
result.cpu().size()

## Simple Attention-Mechanism (numpy)

In [None]:
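def attention(query, key, value):
    # scale by the actual key dimension (512 for the full embedding), so the
    # function also works for narrower slices of the embedding
    d_k = key.shape[-1]
    scores = np.dot(query, key.T) / math.sqrt(d_k)
    # row-wise softmax: each row holds the attention weights of one token
    p_attn = softmax(scores)
    
    return np.dot(p_attn, value), p_attn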

In [None]:
def run_simple_attention(x):
    # self-attention: the same matrix is used as query, key and value
    out, pattn = attention(x, x, x)
    plot_attention(pattn)
    plot_attention2(out)

run_simple_attention(E)

## Multi-Head Attention-Mechanism (numpy)

How does multi-head attention work? Instead of computing one attention for the whole embedding (sequence length x embedding dimensions, e.g. 512), we divide the embedding into smaller ones by "splitting" the embedding vectors. **(But unfortunately this is not how it is done)**

Let's assume we have 16 attention heads: we split a 49x512 embedding into 16 adjacent tiles of size 49x32. With 8 attention heads, we split the 49x512 embedding into 8 adjacent tiles of size 49x64. The embeddings used have different statistical properties in every dimension, resulting in a different self-attention matrix for each tile (they don't look the same). Thus the weighting of the values will be different. You can see the different attention matrices below.

The question remains how to proceed with those attention matrices and how to combine the different results.
We can simply calculate different attentions. But what then?

  * use each attention only on the particular block it is derived from?
  * concatenate these attentions and do some magic with the "value"?
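
To make the first option concrete, here is a minimal sketch under stated assumptions: random data stands in for the embeddings, `split_heads` and `per_tile_attention` are hypothetical helper names, and each tile's attention is applied only to the values of that same tile before the per-tile outputs are concatenated again. It only illustrates the bookkeeping of the shapes and does not claim that this is how the Transformer combines its heads.

```python
import math
import numpy as np

def row_softmax(x):
    # numerically stable row-wise softmax
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def split_heads(x, heads):
    # (seq_len, d_model) -> (heads, seq_len, d_model // heads), adjacent column tiles
    return np.stack(np.split(x, heads, axis=1))

def per_tile_attention(x, heads):
    tiles = split_heads(x, heads)
    d_tile = tiles.shape[-1]
    outputs, weights = [], []
    for tile in tiles:
        p_attn = row_softmax(tile @ tile.T / math.sqrt(d_tile))
        weights.append(p_attn)
        outputs.append(p_attn @ tile)  # attention is used only on its own tile
    # concatenate the per-tile outputs back to (seq_len, d_model)
    return np.concatenate(outputs, axis=1), np.stack(weights)

# random stand-in for the 49x512 embedding matrix
rng = np.random.default_rng(0)
E_demo = rng.normal(size=(49, 512))

out, weights = per_tile_attention(E_demo, heads=8)
print(out.shape)      # (49, 512)
print(weights.shape)  # (8, 49, 49)
```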

In [None]:
## https://stackoverflow.com/questions/16856788/slice-2d-array-into-smaller-2d-arrays

def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (n, nrows, ncols) where
    n * nrows * ncols = arr.size

    If arr is a 2D array, the returned array should look like n subblocks with
    each subblock preserving the "physical" layout of arr.
    """
    h, w = arr.shape
    assert h % nrows == 0, "{} rows is not evenly divisible by {}".format(h, nrows)
    assert w % ncols == 0, "{} cols is not evenly divisible by {}".format(w, ncols)
    return (arr.reshape(h//nrows, nrows, -1, ncols)
               .swapaxes(1,2)
               .reshape(-1, nrows, ncols))

heads = 8
# keep all rows (49 tokens) in each tile and split only the embedding columns
eSplitted = blockshaped(E, E.shape[0], E.shape[1] // heads)
print(eSplitted.shape)

for i in range(0,heads):
    run_simple_attention(eSplitted[i])

## Conclusion

Self-attention describes how strongly a word (row) is connected/related to the i-th word (column) in the sentence (contextual relationship). If we split the attention by splitting the embeddings, we get attentions over different groups of dimensions encoded in the embedding vector. We do not care what each dimension in the vector encodes, but having multiple attentions can help to keep track of multiple ideas/concepts in the given input sentence.
**(Sorry, but the conclusion is wrong here...)**

In [None]:
def sum_simple_attention(tiles):
    results = []
    for i in range(0, heads):
        out, pattn = attention(tiles[i], tiles[i], tiles[i])
        results.append(pattn)
    
    # average the per-head attention matrices
    plot_attention(sum(results) / len(results))
    
sum_simple_attention(eSplitted)

## Multi-head-Attention -- Part 2
The real multi-head attention is implemented using learned weightings for V, K and Q. We have three weight matrices for each attention head, which reduce the dimensionality of K and Q to d_k, d_q = d_model // heads. For a model using 512-dimensional embeddings and 8 heads we have d_k = 64 and d_q = 64, because 64 = 512 // 8.

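In the Transformer paper the notation keeps d_k (shared by queries and keys, since they are multiplied with each other) and d_v separate, so they could differ, but the configuration described in the paper uses d_k = d_v = d_model / h = 64. In the TensorFlow implementation these are equal as well. I still have to investigate that further.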
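
The learned projections described above can be sketched in numpy. This is only a sketch under stated assumptions: the weight matrices are random stand-ins for learned parameters, `multi_head_attention` and `E_demo` are hypothetical names, and the final output projection `W_o` follows the description in "Attention Is All You Need".

```python
import math
import numpy as np

def row_softmax(x):
    # numerically stable row-wise softmax
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (heads, d_model, d_head); W_o: (heads * d_head, d_model)
    head_outputs = []
    for w_q, w_k, w_v in zip(W_q, W_k, W_v):
        q, k, v = x @ w_q, x @ w_k, x @ w_v  # each (seq_len, d_head)
        p_attn = row_softmax(q @ k.T / math.sqrt(k.shape[-1]))
        head_outputs.append(p_attn @ v)
    concat = np.concatenate(head_outputs, axis=1)  # (seq_len, heads * d_head)
    return concat @ W_o

d_model, heads = 512, 8
d_head = d_model // heads  # 64

rng = np.random.default_rng(0)
E_demo = rng.normal(size=(49, d_model))  # stand-in for the embedding matrix

# random stand-ins; in a real model these matrices are learned
scale = d_model ** -0.5
W_q = rng.normal(scale=scale, size=(heads, d_model, d_head))
W_k = rng.normal(scale=scale, size=(heads, d_model, d_head))
W_v = rng.normal(scale=scale, size=(heads, d_model, d_head))
W_o = rng.normal(scale=scale, size=(heads * d_head, d_model))

out = multi_head_attention(E_demo, W_q, W_k, W_v, W_o)
print(out.shape)  # (49, 512)
```

In contrast to the tile splitting above, every head here sees the full 512-dimensional embedding and projects it down to 64 dimensions with its own matrices.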