----
# Transformer-Paper Exploration - Attention is all you need
----

Inspired by The Annotated Transformer - https://nlp.seas.harvard.edu/2018/04/03/attention.html


This is an annotated version of "Attention Is All You Need" (arXiv 1706.03762), the paper that presented the Transformer, written as a line-by-line implementation.




My goal is to really understand how to implement a transformer network architecture myself. It is not to teach others, but to genuinely understand the architecture and be able to implement it from scratch. I don't want to use other people's code without understanding it. I might switch to an existing, working codebase later, but to start with, I want to understand how this model is used, trained, and extended for my own purposes.

I have reasons not to use the available pretrained transformer models and to train my own instead: I am not targeting natural language processing with this implementation. I will therefore try to implement a much smaller baseline with only a few million parameters, and if that works out, I will decide how to move on. I might go for a bigger model, which could mean paying for training compute or using my own hardware.

## Let's start with the overall architecture first

An encoder-decoder architecture is standard right now. In Figure 1 of the paper, the encoder is on the left side (gray box) and the decoder is on the right side (also a gray box).

![arxiv_1706_03762_fig1](images/arxiv.1706.03762.fig1.png "Figure 1 of arxiv 1706.03762")

One inference step (forward step) of the whole model consists of an encode step followed by a decode step. Combined, these operations look like this:

    decode( encode( inputs ), previous_outputs_shifted_right )

That leaves two operations to define: `encode` and `decode`.

The `encode` operation embeds the inputs (left red box) and then runs the encoding step on the result (left gray box).

The `decode` operation embeds the previously generated outputs (right red box) and then runs the decoding step on the result (right gray box), using the output of the encoding step as an additional input (the arrows from the left gray box into the right gray box).
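
The `previous_outputs_shifted_right` argument can be made concrete with a small sketch. It assumes tokens are plain integer ids and that id `0` marks the start of a sequence. Both are assumptions for illustration, not something the paper or this notebook fixes. The helper name `shift_right` is also hypothetical.

```python
# Hypothetical token ids for illustration; no vocabulary is defined in this notebook.
START = 0  # assumed id of the start-of-sequence token


def shift_right(target, start_token=START):
    """Return the decoder input: the target moved one position to the right.

    The start token fills the first slot and the last target token is dropped,
    so the input at position i is the token that was produced at position i - 1.
    """
    return [start_token] + list(target[:-1])


target = [11, 12, 13, 14]            # tokens the decoder should produce
decoder_input = shift_right(target)  # what the decoder is fed instead

for seen, expected in zip(decoder_input, target):
    print(f"decoder sees {seen:>2} -> should predict {expected}")
```

At every position the decoder is fed the token before the one it has to predict, which is why the right-hand input in the figure is labelled "shifted right".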

In [2]:
class EncoderDecoder(object):
    '''
    This is a simple implementation of an encoder-decoder architecture. It is not specific to the implementation of the 
    transformer architecture.
    '''
    
    def __init__(self, encoder, decoder, source_embedder, target_embedder, generator):
        self.encoder = encoder
        self.decoder = decoder
        self.source_embed = source_embedder
        self.target_embed = target_embedder
        self.generator = generator
        
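    def forward(self, source, target, source_mask, target_mask):
        # Encode the source once, then decode the target against that memory.
        return self.decode(self.encode(source, source_mask), source_mask, target, target_mask)

    def encode(self, source, source_mask):
        # Embed the source tokens, then run the encoder on the embeddings.
        return self.encoder(self.source_embed(source), source_mask)

    def decode(self, memory, source_mask, target, target_mask):
        # Embed the (shifted) target tokens, then run the decoder, which also
        # reads the encoder output (`memory`).
        return self.decoder(self.target_embed(target), memory, source_mask, target_mask)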
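
One way to check the wiring of a class like this before any real layers exist is to pass in stand-in callables that only record when they are called. The sketch below does that. The class is repeated in condensed form so the snippet runs on its own, and all `toy_*` functions are hypothetical placeholders, not part of the transformer.

```python
class EncoderDecoder:
    """Condensed copy of the class above, so this snippet runs on its own."""

    def __init__(self, encoder, decoder, source_embedder, target_embedder, generator):
        self.encoder = encoder
        self.decoder = decoder
        self.source_embed = source_embedder
        self.target_embed = target_embedder
        self.generator = generator

    def forward(self, source, target, source_mask, target_mask):
        memory = self.encode(source, source_mask)
        return self.decode(memory, source_mask, target, target_mask)

    def encode(self, source, source_mask):
        return self.encoder(self.source_embed(source), source_mask)

    def decode(self, memory, source_mask, target, target_mask):
        return self.decoder(self.target_embed(target), memory, source_mask, target_mask)


calls = []  # records the order in which the stand-ins are invoked


def toy_source_embed(tokens):
    calls.append("source_embed")
    return [("src_emb", t) for t in tokens]


def toy_target_embed(tokens):
    calls.append("target_embed")
    return [("tgt_emb", t) for t in tokens]


def toy_encoder(embedded, source_mask):
    calls.append("encoder")
    return {"encoded": embedded, "mask": source_mask}


def toy_decoder(embedded, memory, source_mask, target_mask):
    calls.append("decoder")
    return {"decoder_input": embedded, "memory": memory}


model = EncoderDecoder(toy_encoder, toy_decoder,
                       toy_source_embed, toy_target_embed, generator=None)
result = model.forward(source=[1, 2, 3], target=[0, 7, 8],
                       source_mask=None, target_mask=None)
print(calls)
```

The recorded order shows the data flow from the figure: the source is embedded and encoded first, and the encoder output then reaches the decoder as `memory`. Note that `generator` is stored by the class but never called in `forward`.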