In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

import numpy as np

# PyTorch Transformer

> Based on the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762)

> Google's original implementation in TensorFlow [here](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py)

The Transformer network is an attention-only sequence transduction model, "dispensing with recurrence and convolutions entirely".

# Transformer Architecture 

The encoder and the decoder are each a stack of $N$ identical layers ($N=6$ in the paper).

![Transformer Model Architecture](./images/network_architecture.png)

Each encoder/decoder layer is made up of two types of sub-layers:

- A Multi-Head Attention Layer
- A Feed-Forward Network

**TODO**: Layer normalization (together with a residual connection) is applied around each sub-layer before its output is passed to the next. Don't forget to implement this in the full model later.

We're going to implement this more or less from scratch, so we need to dig down into the sub-layers and build up.

## Sub-Layer: Multi-Head Attention

> An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

>-- Attention is All You Need, 3.2

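In practice, the queries, keys and values are packed together into the matrices $Q$, $K$ and $V$, so that attention is computed for many positions at once.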

The sub-layer looks like:

![Multi-head attention](./images/multi-head.png)

Where:

$$MultiHead(Q, K, V) = Concat(head_{1}, \ldots{}, head_{h}) W^{O}$$

and 

$$head_i = Attention(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$$


### Implementation: Scaled Attention Function

Recursing ever further into our model, we need the math behind a single Attention calculation:

![Attention Function](./images/attention_calc.png)

or, in notation:

$$Attention(Q,K,V) = softmax \left( \frac{ QK^{T} }{ \sqrt{ d_{k} } } \right) V$$

This is a version of multiplicative (dot-product) attention. The scores are scaled by $\sqrt{d_k}$, the square root of the key dimension, to keep large dot products from pushing the softmax into regions with extremely small gradients (see this [blog post](http://ruder.io/deep-learning-nlp-best-practices/index.html#attention) for a nice overview of additive vs. multiplicative attention).
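To get a feel for why the scaling helps, here is a minimal sketch that measures the variance of raw and scaled dot products. It assumes queries and keys whose components are independent with zero mean and unit variance; the helper name `dot_product_variance` and the sample sizes are illustrative choices, not something taken from the paper or the Google implementation.

```python
import torch

torch.manual_seed(0)

def dot_product_variance(dim_key, num_samples=10000):
    # Assumption: query/key components are independent, zero mean, unit variance
    q = torch.randn(num_samples, dim_key)
    k = torch.randn(num_samples, dim_key)

    # One dot product per (query, key) pair
    dots = (q * k).sum(dim=-1)

    raw_var = dots.var().item()
    scaled_var = (dots / dim_key ** 0.5).var().item()
    return raw_var, scaled_var

for dim_key in (16, 64, 256):
    raw_var, scaled_var = dot_product_variance(dim_key)
    print(dim_key, round(raw_var, 1), round(scaled_var, 2))
```

Under these assumptions the raw variance grows roughly in proportion to `dim_key`, while the scaled variance stays close to 1 regardless of the key dimension.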


Some notes about the Google implementation:
- The optional mask is implemented simply by setting masked values to $-\infty$.
- Later in the paper they add dropout to the result of the softmax. We'll just put it in now, as they found it improved the model
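The masking trick can be tried on its own before it goes into a module. The sketch below is only an illustration: the causal (upper-triangular) mask and the all-zero scores are hypothetical stand-ins, chosen so the effect of $-\infty$ on the softmax is easy to see.

```python
import torch
import torch.nn.functional as F

# Stand-in attention scores for 1 batch, 4 queries and 4 keys
scores = torch.zeros(1, 4, 4)

# Hypothetical causal mask: True marks key positions that a query may not look at
# (here, any key index greater than the query index)
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool().unsqueeze(0)

# Masked scores become -inf, which the softmax turns into a weight of exactly 0
masked_scores = scores.masked_fill(mask, -float('inf'))
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Each row of `weights` still sums to 1, but the masked positions contribute nothing to the weighted sum of the values.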

Let's do this:


In [3]:
class ScaledAttention(nn.Module):
    
    def __init__(self, dim_key, drop_percent=0.1):
        super(ScaledAttention, self).__init__()
        
        # The value to scale by will be constant throughout model
        self.scale_value = float(np.sqrt(dim_key))
        
        # Layers
        self.dropout = nn.Dropout(drop_percent)
        # Normalize over the key axis (last dim) so each query's weights sum to 1
        self.softmax = nn.Softmax(dim=-1)
        
    def forward(self, Q, K, V, mask=None):
        
        # Remember, the 1st dim of torch tensors is the batch size
        # <batch> x <len_q> x <dim_key> times <batch> x <dim_key> x <len_k>, then scale
        attention = torch.bmm(Q, K.transpose(1, 2)) / self.scale_value
        
        # mask is a boolean tensor, True wherever a position should be ignored
        if mask is not None:
            attention = attention.masked_fill(mask, -float('inf'))
            
        # Ugh... python... get you a pipe operator or something
        attention = self.dropout(self.softmax(attention))
        
        attention = torch.bmm(attention, V)
        
        return attention
        

### Implementation: Multi-Head Attention

Now we have what we need for the Multi-Head sub-layer itself. From the paper:

> Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values ...

>-- Attention is All You Need, 3.2.2


Relevant formulas from above:
$$MultiHead(Q, K, V) = Concat(head_{1}, \ldots{}, head_{h}) W^{O}$$

$$head_i = Attention(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$$

As for the dimensions of the projection matrices:

$W_{i}^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_{i}^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_{i}^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$

where in the paper: 
- $h = 8$
- $d_{model} = 512$
- $d_k = d_v = d_{model}/h = 64$
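These shapes can be checked with a few lines of PyTorch. The sketch below rests on one assumption that the paper does not prescribe: all $h$ query projections are stored side by side in a single bias-free `nn.Linear`, so that $W_{i}^{Q}$ is one block of its weight. The names `stacked`, `W_i` and `head_i` are illustrative only. Note that `nn.Linear` applies the same weights along every leading dimension, which is why separate per-head matrices have to live in separate blocks of the weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_heads, dim_model = 8, 512
dim_key = dim_model // num_heads   # 64, as in the paper

# One bias-free Linear holding every head's query projection.
# Its weight has shape <num_heads * dim_key> x <dim_model>.
stacked = nn.Linear(dim_model, num_heads * dim_key, bias=False)

x = torch.randn(2, 10, dim_model)   # <batch> x <length> x <dim_model>
all_heads = stacked(x)              # <batch> x <length> x <num_heads * dim_key>

# nn.Linear computes x @ weight.T, so W_i^Q is a <dim_model> x <dim_key>
# block of the transposed weight
i = 3
W_i = stacked.weight.t()[:, i * dim_key:(i + 1) * dim_key]
head_i = x @ W_i                    # <batch> x <length> x <dim_key>

# Projecting with the block alone matches the matching slice of the stacked output
matches = torch.allclose(
    head_i, all_heads[..., i * dim_key:(i + 1) * dim_key], atol=1e-5
)
print(W_i.shape, head_i.shape, matches)
```

Stacking the heads this way trades $h$ small matrix multiplications for one large one; the per-head results are then just slices along the last axis.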

In [44]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, num_heads, dim_model, dim_key, dim_value, drop_percent=0.1):
        super(MultiHeadAttention, self).__init__()
        
        self.num_heads = num_heads
        self.dim_model = dim_model
        self.dim_key   = dim_key
        self.dim_value = dim_value
        
        # Projecting w/ a learned weight matrix is just a linear transform w/out bias.
        # Note: nn.Linear applies the *same* weights along every leading dimension, so
        # feeding it num_heads copies of the input would NOT learn num_heads separate
        # projections. Instead, each Linear below holds all heads' matrices side by side,
        # [W_1 | ... | W_h], and forward() slices its output into per-head chunks.
        self.q_projection = nn.Linear(dim_model, num_heads * dim_key, bias=False)
        self.k_projection = nn.Linear(dim_model, num_heads * dim_key, bias=False)
        self.v_projection = nn.Linear(dim_model, num_heads * dim_value, bias=False)
        
        self.o_projection = nn.Linear(num_heads * dim_value, dim_model, bias=False)
        
        # Layers
        self.scaled_attention = ScaledAttention(dim_key, drop_percent)
        
    def forward(self, q, k, v, mask=None):
        
        # We'll be getting in Tensors of shape
        # <Minibatch Size> x <Length Of Sequence> x <Dimension of Embedding Space>
        # where dim(embedding) would be dim(model)
        mini_batch_size = q.size(0)
        
        # Project for every head at once:
        # <mini_batch> x <length> x <num_heads * dim_key/value>
        Q = self.q_projection(q)
        K = self.k_projection(k)
        V = self.v_projection(v)
        
        # Now cut the last axis into one chunk per head and stack the chunks along the
        # batch axis, giving <num_heads * mini_batch> x <length> x <dim_key/value>
        # (which our attention function expects)
        Q = torch.cat(torch.split(Q, self.dim_key, dim=-1), dim=0)
        K = torch.cat(torch.split(K, self.dim_key, dim=-1), dim=0)
        V = torch.cat(torch.split(V, self.dim_value, dim=-1), dim=0)
        
        # The mask is <mini_batch> x <len_q> x <len_k>; every head needs an identical
        # copy, tiled in the same head-major order as Q, K and V
        if mask is not None:
            mask = mask.repeat(self.num_heads, 1, 1)
            
        # Pass our matrices to our Attention layer
        attention = self.scaled_attention(Q, K, V, mask)
        
        # Chunk so first dimension is the expected mini batch size using split()
        # Then concat along the last axis to get expected size to match with dimensions of W^O
        attention = torch.cat(torch.split(attention, mini_batch_size, dim=0), dim=-1)
        
        # Project back to tensor of <mini_batch> x <len_q> x <dim_model>
        attention = self.o_projection(attention)
        
        return attention
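The split-then-concatenate step near the end of `forward` can be checked in isolation. The sketch below uses a random tensor as a stand-in for the per-head attention output and assumes a head-major layout along the first axis (all batch entries of head 0, then all of head 1, and so on); the sizes are the paper's defaults plus an arbitrary batch size and sequence length.

```python
import torch

num_heads, mini_batch_size, length, dim_value = 8, 2, 10, 64

# Stand-in for the per-head attention output:
# <num_heads * mini_batch> x <length> x <dim_value>, laid out head-major
per_head = torch.randn(num_heads * mini_batch_size, length, dim_value)

# Merge: one chunk per head along dim 0, concatenated on the feature axis
merged = torch.cat(torch.split(per_head, mini_batch_size, dim=0), dim=-1)

# Split again: one chunk per head along the feature axis, stacked back on dim 0
restored = torch.cat(torch.split(merged, dim_value, dim=-1), dim=0)

print(merged.shape)
print(torch.equal(restored, per_head))
```

The merged tensor has `num_heads * dim_value` features per position, which is the input size $W^{O}$ expects, and the inverse split recovers the original per-head tensor exactly.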