# Transformer Layers
Investigate and experiment with our transformer implementation.

#### References
* https://www.tensorflow.org/text/tutorials/transformer
* https://nlp.seas.harvard.edu/2018/04/03/attention.html#full-model
* https://theaisummer.com/self-attention/
* https://arxiv.org/pdf/2009.06732.pdf
* https://github.com/jadore801120/attention-is-all-you-need-pytorch
* https://bgg.medium.com/seq2seq-pay-attention-to-self-attention-part-2-cf81bf32c73d

In [9]:
import sys
sys.path.append('../')
from model.sublayers.transformer import scaled_dot_product_attn, MultiHeadedAttention
import torch

### Scaled-Dot-Product
<img src="./imgs/Scaled_Dot-Product_Attention.png" width="320" height="240" align="right">
This operator gives transformers their information-retrieval behaviour: a query tensor is used to search for concepts encoded in the values, and the key tensor acts as the index that lets each query find the relevant values.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

It is worth mentioning that the matrix multiplication between Q and K produces one score for every query-key pair, so in self-attention its memory and compute cost grow quadratically with the sequence length. Many other papers focus on this product and try to make the cost non-quadratic.
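
As a rough illustration of the formula above, here is a minimal sketch of scaled dot-product attention in plain PyTorch. The name `reference_attention` is hypothetical and is not the `scaled_dot_product_attn` imported from this repository, which may differ (for example in masking or dropout). The last lines show the square score matrix that self-attention produces.

```python
import math
import torch


def reference_attention(query, keys, values):
    """Hypothetical helper: softmax(Q K^T / sqrt(d_k)) V, no masking or dropout."""
    d_k = keys.size(-1)
    # [len_q, d_k] @ [d_k, len_k] -> [len_q, len_k]: one score per query-key pair
    scores = query @ keys.transpose(-2, -1) / math.sqrt(d_k)
    # Normalise over the keys so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # [len_q, len_k] @ [len_k, d_v] -> [len_q, d_v]: weighted average of the values
    return weights @ values, weights


torch.manual_seed(0)
keys = torch.rand(4, 3)
query = torch.rand(1, 3)
values = torch.rand(4, 2)
out, weights = reference_attention(query, keys, values)
print(out.shape)      # torch.Size([1, 2])
print(weights.shape)  # torch.Size([1, 4])

# Self-attention over 60 tokens: the weight matrix is 60 x 60,
# which is where the quadratic cost comes from.
x = torch.rand(60, 3)
_, self_weights = reference_attention(x, x, x)
print(self_weights.shape)  # torch.Size([60, 60])
```

Each row of `weights` is a probability distribution over the keys, so the output is a convex combination of the value rows.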

In [8]:
keys = torch.rand(4, 3)
query = torch.rand(1, 3)
values = torch.rand(4, 2)
attn, attn_weights = scaled_dot_product_attn(query, keys, values)
print('Scaled dot product attention shape:', attn.shape)
print('Attention Weights shape:', attn_weights.shape)

Scaled dot product attention shape: torch.Size([1, 2])
Attention Weights shape: torch.Size([1, 4])


### Multi-Headed Attention
<img src="./imgs/MultiHeadAttn.png" width="320" height="240" align="right">
This is the module that allows the transformer to learn (that is, it has parameters) how the words (tokens) in an input phrase (or any sequence) relate to each other at different positions. The multiple "heads" also make it possible to represent those relations in different subspaces.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots,
\mathrm{head}_h)W^O    \\
    \text{where}~\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
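
The equation above can be sketched as a small module. This is only an illustrative version under stated assumptions: the class `SketchMultiHead` and its attribute names are hypothetical, it omits masking and dropout, and it is not the repository's `MultiHeadedAttention`. It assumes `d_model` is divisible by `num_heads` and uses one `d_model x d_model` linear layer per projection, which is equivalent to stacking the per-head matrices $W^Q_i$, $W^K_i$, $W^V_i$ side by side.

```python
import math
import torch
from torch import nn


class SketchMultiHead(nn.Module):
    """Hypothetical minimal multi-head attention (no masking, no dropout)."""

    def __init__(self, num_heads, d_model):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must split evenly across heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # all W^Q_i stacked
        self.w_k = nn.Linear(d_model, d_model)  # all W^K_i stacked
        self.w_v = nn.Linear(d_model, d_model)  # all W^V_i stacked
        self.w_o = nn.Linear(d_model, d_model)  # W^O

    def _split(self, x):
        # [batch, seq, d_model] -> [batch, heads, seq, d_head]
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split(self.w_q(query))
        k = self._split(self.w_k(key))
        v = self._split(self.w_v(value))
        # Scaled dot-product attention, computed for every head in parallel
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)  # [batch, heads, len_q, len_k]
        heads = weights @ v                      # [batch, heads, len_q, d_head]
        # Concat(head_1, ..., head_h): merge the head axis back into d_model
        b, _, len_q, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, len_q, self.num_heads * self.d_head)
        return self.w_o(concat), weights


sketch = SketchMultiHead(num_heads=8, d_model=512)
x = torch.rand(1, 60, 512)
out, per_head_weights = sketch(query=x, key=x, value=x)
print(out.shape)               # torch.Size([1, 60, 512])
print(per_head_weights.shape)  # torch.Size([1, 8, 60, 60])
```

Note that this sketch returns one weight matrix per head; an implementation may choose to reduce or select over the head axis before returning the weights.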

In [23]:
mha = MultiHeadedAttention(num_heads=8, d_model=512)
print(mha)
# [batch_size, encoder_sequence, d_model]
x = torch.rand(1, 60, 512)
attn, attn_weights = mha(key=x, query=x, value=x)

# [batch_size, encoder_sequence, d_model]
print(f'Multiheaded attention output shape: {attn.shape}')

# [batch_size, seq_len_q, seq_len_k] (the printed shape has no separate num_heads dimension)
print(f'Multiheaded attention weights: {attn_weights.shape}')

MultiHeadedAttention(
  (linear_q): Linear(in_features=512, out_features=512, bias=True)
  (linear_k): Linear(in_features=512, out_features=512, bias=True)
  (linear_v): Linear(in_features=512, out_features=512, bias=True)
  (linear_concat): Linear(in_features=512, out_features=512, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
Multiheaded attention output shape: torch.Size([1, 60, 512])
Multiheaded attention weights: torch.Size([1, 60, 60])
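
The printed module lists four `Linear(in_features=512, out_features=512, bias=True)` layers, so its learnable parameter count can be worked out by hand. The short calculation below is a sketch: the per-head width assumes `d_model` is split evenly across the heads, which the printout does not show directly.

```python
d_model = 512
num_heads = 8

# Assumption: each head works on an equal slice of d_model
d_head = d_model // num_heads

# One Linear(512, 512) with bias: weight matrix plus bias vector
params_per_linear = d_model * d_model + d_model

# linear_q, linear_k, linear_v and linear_concat; Dropout has no parameters
total_params = 4 * params_per_linear

print(d_head)             # 64
print(params_per_linear)  # 262656
print(total_params)       # 1050624
```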
