----
# Transformer-Paper Exploration - Attention is all you need
----

Inspired by The Annotated Transformer - https://nlp.seas.harvard.edu/2018/04/03/attention.html


This notebook explores the Transformer presented in "Attention is all you need".

It is an annotated version of the paper, written as a line-by-line implementation. The implementation does not work (yet), but it helps to work through the primitives one at a time, along with the architectural design decisions.




My goal is to really understand how to implement a transformer network architecture myself. The goal is not to teach others, but to genuinely understand the architecture and be able to implement it from scratch. I don't want to use other people's code without understanding it. I might switch to an existing, working codebase later, but to start with I want to understand how this model is used, trained and extended for my own purposes.

I have reasons not to use the available pretrained transformer models and to train my own instead: I am not targeting natural language processing with this implementation. I will therefore try to implement a much smaller baseline with only a few million parameters and, if that works out, decide how to move on. I might go for a bigger model, and then either pay for the training or use my own hardware.

## Let's start with the overall architecture first

An encoder-decoder architecture is the standard right now: there is an encoder on the left side (gray box) and a decoder on the right side (also a gray box).

![arxiv_1706_03762_fig1](images/arxiv.1706.03762.fig1.png "Figure 1 of arxiv 1706.03762")

One inference step (forward step) of the whole model consists of an encode step followed by a decode step. We combine these operations:

    decode( encode( inputs ), previous_outputs_shifted_right )

We have to define two more operations, `encode` and `decode`. 

The `encode` operation embeds the inputs (left red box) and then runs the encoding step on them (left gray box).

The `decode` operation embeds the previously generated outputs (right red box) and then runs the decoding step on them (right gray box), using the output of the encoding step as an additional input (the arrows from the left gray box into the right gray box).
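
The term `previous_outputs_shifted_right` can be made concrete with a small example. The following is only a sketch of how the shift is commonly set up during training; the token ids and the names `BOS`, `EOS` and `shift_right` are assumptions for illustration and do not come from the paper.

```python
# Hypothetical token ids; the BOS/EOS values are assumptions for illustration.
BOS, EOS = 1, 2
target = [BOS, 11, 12, 13, EOS]

def shift_right(sequence):
    """Return (decoder_input, expected_output) for one target sequence."""
    decoder_input = sequence[:-1]    # what the decoder sees: BOS, 11, 12, 13
    expected_output = sequence[1:]   # what it should predict: 11, 12, 13, EOS
    return decoder_input, expected_output

decoder_input, expected_output = shift_right(target)
for seen, predicted in zip(decoder_input, expected_output):
    print(f"decoder sees {seen:>2} -> should predict {predicted}")
```

At every position the decoder input is one step behind the expected output, so the model always predicts the next token from the tokens before it.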

In [None]:
import torch.nn as nn

class EncoderDecoder(nn.Module):
    '''
    This is a simple implementation of an encoder-decoder architecture. It is not specific to the implementation of the 
    transformer architecture.
    '''
    
    def __init__(self, encoder, decoder, source_embedder, target_embedder, generator):
        super().__init__()  # lets PyTorch register the submodules
        self.encoder = encoder
        self.decoder = decoder
        self.source_embed = source_embedder
        self.target_embed = target_embedder
        self.generator = generator
        
    def forward(self, source, target, source_mask, target_mask):
        # encode the source once, then decode the target against that memory
        return self.decode(self.encode(source, source_mask), source_mask, target, target_mask)
    
    def encode(self, source, source_mask):
        return self.encoder(self.source_embed(source), source_mask)
    
    def decode(self, memory, source_mask, target, target_mask):
        return self.decoder(self.target_embed(target), memory, source_mask, target_mask)

After modeling the red and gray boxes, we need to consider the light blue `Linear` layer and the green `Softmax` layer. Together they are usually called the projection layer, because they project the output of the last decoder layer onto one score per entry of the output dictionary. The softmax turns these scores into a probability distribution, from which one word/token is selected. The number of outputs is therefore the same as the number of words/tokens in the output dictionary.

Because the projection layer also generates the next word/token, it is also called the `Generator`.
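
To see what the projection does to the shapes, here is a minimal NumPy sketch. The sizes `d_model = 8` and `vocab_size = 5`, the random weights and the greedy `argmax` selection are assumptions chosen for illustration; a real model learns the weights and may select tokens differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5                    # assumed toy sizes

W = rng.normal(size=(d_model, vocab_size))    # stands in for the learned Linear weights
b = np.zeros(vocab_size)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

decoder_output = rng.normal(size=(3, d_model))     # 3 positions, d_model features each
log_probs = log_softmax(decoder_output @ W + b)    # shape (3, vocab_size)
next_tokens = log_probs.argmax(axis=-1)            # greedy choice per position

print(log_probs.shape, next_tokens)
```

Each row of `log_probs` is a log-probability distribution over the vocabulary, so exponentiating a row and summing it gives 1.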

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    '''
    This is a standard projection layer, used for generating the next word in a standard linear and softmax layout.
    '''
    def __init__(self, output_embedding_dimensions, vocabsize):
        super().__init__()
        # linear layer: the input size is the output embedding dimension, the output size is vocabsize
        self.projection = nn.Linear(output_embedding_dimensions, vocabsize)
        
    def forward(self, x):
        # log-probabilities over the vocabulary for every position
        return F.log_softmax(self.projection(x), dim=-1)

## Encoder and Decoder Stacks

### Encoder
Let's have a look at the encoder again and work through it.

![arxiv_1706_03762_fig1.zoomencoder](images/arxiv.1706.03762.fig1.zoomencoder.png "Figure 1 of arxiv 1706.03762 Encoder zoomed in")

The encoder is a stack of N identical layers, and the paper uses N = 6. It turned out that six is quite a good choice for language understanding / language modelling tasks.

In [None]:
import copy
import torch.nn as nn

def clone_layer(module, N):
    '''
    Clone a module and produce N identical layers that do not share parameters
    '''
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
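
The `copy.deepcopy` call in `clone_layer` matters: without it, all N entries would point to the same object and therefore share their weights. The following plain-Python sketch shows the difference; `ToyLayer` is a hypothetical stand-in for a real layer.

```python
import copy

class ToyLayer:
    """Hypothetical stand-in for a layer with one trainable weight."""
    def __init__(self):
        self.weight = [0.0]

template = ToyLayer()
shared = [template] * 3                                 # three references to ONE object
cloned = [copy.deepcopy(template) for _ in range(3)]    # three independent objects

shared[0].weight[0] = 1.0    # changes every entry of `shared`
cloned[0].weight[0] = 1.0    # changes only the first clone

print([layer.weight[0] for layer in shared])   # [1.0, 1.0, 1.0]
print([layer.weight[0] for layer in cloned])   # [1.0, 0.0, 0.0]
```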

I wondered why there is a normalization as the last element of the Encoder. The reason lies in the implementation that follows, which reuses the same pattern several times: it applies the layer normalization before each sublayer instead of after it.

This is a slightly different network than the one in the paper, which normalizes after the residual addition, `LayerNorm(x + Sublayer(x))`. Because x itself is never normalized inside the layers here, the output of the last layer would leave the stack unnormalized. I guess that is why the final normalization is added.
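
The difference between the two orderings can be checked numerically. This is only a sketch: `toy_sublayer` is a made-up stand-in for attention or feed-forward, dropout and the learnable gain and bias are left out, and NumPy's default (population) standard deviation is used.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def toy_sublayer(x):
    # hypothetical stand-in for attention or feed-forward
    return 0.5 * x + 1.0

def post_norm_step(x):
    # ordering described in the paper: LayerNorm(x + Sublayer(x))
    return layer_norm(x + toy_sublayer(x))

def pre_norm_step(x):
    # ordering used in this notebook: x + Sublayer(LayerNorm(x))
    return x + toy_sublayer(layer_norm(x))

x = np.array([[1.0, 2.0, 4.0, 9.0]])
post, pre = x, x
for _ in range(6):                  # six stacked layers
    post = post_norm_step(post)
    pre = pre_norm_step(pre)

print("post-norm mean/std:      ", post.mean(), post.std())
print("pre-norm mean/std:       ", pre.mean(), pre.std())
print("pre-norm + final norm:   ", layer_norm(pre).mean(), layer_norm(pre).std())
```

With the post-norm ordering the output of the stack is already normalized. With the pre-norm ordering it is not, and only the extra normalization at the end brings it back to mean 0 and standard deviation 1.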

In [None]:
import torch.nn as nn

class Encoder(nn.Module):
    '''
    The Encoder is a stack of N identical layers
    '''
    
    def __init__(self, singlelayer, N):
        super().__init__()
        self.layers = clone_layer(singlelayer, N)
        self.norm = LayerNorm(singlelayer.size)
        
    def forward(self, x, mask):
        '''Pass the input and mask through each layer and do a final normalization'''
        for layer in self.layers:
            x = layer(x, mask)
            
        return self.norm(x)

In [None]:
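import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    '''Normalize the last dimension, then apply a learnable gain and bias'''
    
    def __init__(self, number_of_features, epsilon=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(number_of_features))   # gain
        self.b_2 = nn.Parameter(torch.zeros(number_of_features))  # bias
        self.eps = epsilon
        
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # eps avoids a division by zero for constant inputs
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2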
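
One detail of the `LayerNorm` cell is easy to miss. Assuming `x.std` is the PyTorch call, it uses the sample standard deviation by default and the epsilon is added to the standard deviation. The more common textbook form uses the population variance with the epsilon inside the square root. The NumPy sketch below compares both variants; the function names are made up for this comparison.

```python
import numpy as np

def layer_norm_notebook(x, gain, bias, eps=1e-6):
    # mirrors the notebook cell: sample std (ddof=1), eps added to the std
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True, ddof=1)
    return gain * (x - mean) / (std + eps) + bias

def layer_norm_biased(x, gain, bias, eps=1e-6):
    # textbook variant: population variance (ddof=0), eps inside the square root
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gain, bias = np.ones(4), np.zeros(4)

a = layer_norm_notebook(x, gain, bias)
b = layer_norm_biased(x, gain, bias)
print(a)
print(b)
```

Both results have mean 0, but they are scaled slightly differently. The gap shrinks as the number of features grows, so for realistic model sizes the choice should matter little.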

Each layer contains two residual connections, each with an add and normalization step (yellow boxes) that includes dropout. The dropout is applied to the output of the sublayer, that is, to the blue box or the orange box. Dropout is only active during training and is disabled during inference. It makes a neural network more robust: because any connection may be dropped during training, the model cannot rely on a single strong connection and has to learn to use its other inputs as well.
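
The training/inference difference can be sketched in a few lines of NumPy. This is a hand-written "inverted dropout", which to my understanding matches the scaling that `nn.Dropout` applies; the function name and signature are assumptions for illustration.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        return x                      # inference: the input passes through unchanged
    keep = rng.random(x.shape) >= p   # True for the elements that survive
    return x * keep / (1.0 - p)       # rescale so the expected value stays the same

rng = np.random.default_rng(0)
x = np.ones((2, 8))

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)

print(train_out)   # a mix of 0.0 and 2.0
print(eval_out)    # all ones
```

During training the surviving values are scaled up by `1 / (1 - p)`, which is why no rescaling is needed at inference time.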

Okay, let's explain this. The following component implements the yellow box together with the connection on its left, where `Feed Forward` (blue box) or `Multi-Head Attention` (orange box) is the given sublayer. It executes the whole thing: normalization, sublayer, dropout and residual addition. Note that the normalization is applied to the input, so it normalizes the output of the previous layer or operation.

x is the black line that runs around each sublayer component on the left, and also the input that enters each sublayer component from below.

In [None]:
import torch.nn as nn

class SubLayerConnection(nn.Module):
    '''
    A residual connection around a sublayer, with layer normalization applied to the sublayer input
    '''
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, sublayer):
        '''apply the layer on the normalized input of the previous layer or operation'''
        return x + self.dropout(sublayer(self.norm(x)))

Each layer consists of two sublayers. The first is a multi-head self-attention mechanism and the second is a feed-forward network.

We will incorporate both in one encoder layer. For now we will not investigate how they are constructed; the goal is to implement the overall structure first and care about the details later.

In [None]:
import torch.nn as nn

class EncoderLayer(nn.Module):
    '''
    An encoder layer consists of an attention mechanism and a feed forward network
    '''
    
    def __init__(self, size, self_attention, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attention
        self.feed_forward = feed_forward
        self.sublayer = clone_layer(SubLayerConnection(size, dropout), 2)
        self.size = size
        
    def forward(self, x, mask):
        # apply the self attention mechanism and the residual connection
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # apply the feed forward network and the residual connection
        x = self.sublayer[1](x, self.feed_forward)
        return x

### Decoder

![arxiv_1706_03762_fig1.zoomdecoder](images/arxiv.1706.03762.fig1.zoomdecoder.png "Figure 1 of arxiv 1706.03762 Decoder zoomed in")


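The decoder also consists of six identical layers. In addition to the two sublayers of an encoder layer, each decoder layer has a third sublayer that attends over the output of the encoder (the `memory`).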



In [None]:
import torch.nn as nn

class Decoder(nn.Module):
    '''
    The Decoder is a stack of N identical layers with masking
    '''
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clone_layer(layer, N)
        self.norm = LayerNorm(layer.size)
    
    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In [None]:
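import torch.nn as nn

class DecoderLayer(nn.Module):
    '''
    One decoding layer consists of self attention, source attention over the encoder memory
    and a feed forward network
    '''
    
    def __init__(self, size, self_attention, source_attention, feed_forward, dropout):
        super().__init__()
        self.size = size  # read by Decoder to size the final LayerNorm
        self.self_attn = self_attention
        self.source_attention = source_attention
        self.feed_forward = feed_forward
        # one residual connection per sublayer
        self.sublayer = clone_layer(SubLayerConnection(size, dropout), 3)
    
    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        # masked self attention over the previously generated outputs
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # attention over the encoder output
        x = self.sublayer[1](x, lambda x: self.source_attention(x, m, m, src_mask))
        x = self.sublayer[2](x, self.feed_forward)
        return x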

In [None]:
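import numpy as np
import matplotlib.pyplot as plt

def subsequent_mask(size):
    '''Mask out subsequent positions: position i may only attend to positions <= i'''
    attn_shape = (1, size, size)
    # ones strictly above the diagonal mark the future positions
    future = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return future == 0

plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])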
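
How such a mask is used is not covered yet, so the following is only a sketch under assumptions. It supposes that the mask is applied to the attention scores before a softmax, by replacing the scores of forbidden positions with a large negative number. The helper names and the fill value `-1e9` are chosen for illustration.

```python
import numpy as np

def causal_mask(size):
    # True where attention is allowed: position i may look at positions <= i
    return np.triu(np.ones((size, size)), k=1) == 0

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)                 # hide the future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))    # assumed toy scores: all positions equally relevant
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
```

With equal scores, the first position puts all of its weight on itself, the second splits it evenly over the first two positions, and so on. No position gives any weight to a position after it.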
