This is a short notebook to walk through some of the applications of Transformer neural nets for tokenized analytic data, and demonstrate the functionality of the repo through examples. First let's load in the necessary libraries and modules.

Each architecture does something a little different. Roughly, the inputs and outputs look like the following:

1. Encoder-Only: [3,4,2,5,...,5,6,3,4,3,3] --> 8
2. Decoder-Only: [[5,6,3,4,3,3],[8,2,3,5,1,3],...]
3. Encoder-Decoder: [5,6,3,4,3,3] --> [3,4,2,5,1,3,2]

In words, this looks like:

1. Encoders take a sequence and map it to new vectors in the embedding space, which are then mapped to a single category.
2. Decoders take a sequence and predict the next token; they are thus trained on a set of sequences.
3. Encoder-Decoders condition the next-token prediction of the decoder layers on the encoder output vectors.

Obviously, these all overlap, and in many ways you can reproduce the behavior of encoders with decoders and vice versa (just have the output of the encoder map to the next token in the sequence, as opposed to some completely different semantic category).

But for historical reasons, we'll keep all three of these architectures distinct as they have been used for different types of token prediction tasks.

Let's start with the encoder-only case and "train" a neural network to identify the largest token in a sequence, i.e. effectively implement a MAX function acting on a list using a neural network.
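To make the MAX task concrete, here is a minimal sketch of how training pairs could be generated. The helper name `make_max_example` and the default sequence length and vocabulary size are assumptions for illustration, not something taken from the repo.

```python
import random

def make_max_example(seq_len=6, vocab_size=100, rng=random):
    """Return (sequence, label), where label is the largest token in the sequence."""
    seq = [rng.randrange(vocab_size) for _ in range(seq_len)]
    return seq, max(seq)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
dataset = [make_max_example(rng=rng) for _ in range(4)]
for seq, label in dataset:
    print(seq, "-->", label)
```

Each pair has the same form as the encoder-only example above: a list of tokens on the left and a single category on the right.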

In [172]:
import torch

Let's just run through each step of what makes a transformer. This will help us better understand what functions we need to write.

In [174]:
src = torch.randint(100,[6])
encoder_layer = torch.nn.TransformerEncoderLayer(d_model = 16, nhead = 4, dim_feedforward = 16, dropout = 0)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers = 4)
encoder_embedding = torch.nn.Embedding(100,16)
print(src)

tensor([90, 10, 38, 49, 24, 14])


For now, we will let the encoder have dropout = 0 so that the model produces a consistent output for debugging.
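As a quick check of that claim, the sketch below runs the same input through an encoder twice in training mode and compares the results. The layer sizes mirror the cell above, but the two-layer depth and the random input `x` are assumptions chosen to keep the example small.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=16, dropout=0)
enc = torch.nn.TransformerEncoder(layer, num_layers=2)
x = torch.randn(6, 16)  # unbatched input: [sequence length, d_model]

enc.train()  # dropout layers are active in training mode, but p = 0 here
first = enc(x)
second = enc(x)
print(torch.allclose(first, second))
```

With a nonzero dropout probability, calling `enc.eval()` before the forward pass is the usual way to get the same repeatability.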

In [175]:
x = encoder_embedding(src)
internal_rep = encoder(x)
print(internal_rep)

tensor([[ 0.6463,  1.5202,  0.2964, -0.4024,  1.6908,  0.3492, -1.0988,  1.0006,
         -0.1530,  0.5313,  0.3044, -2.0887,  0.1483, -0.9133, -0.3969, -1.4345],
        [-0.2057,  1.5150,  0.1638, -0.0061,  1.7000,  0.8537, -1.2499,  1.1544,
         -0.9514, -0.2319, -0.3340, -2.0115, -0.3033, -0.3549,  1.0673, -0.8053],
        [ 0.6144,  1.3180,  0.1553, -0.4885,  1.9265, -0.3660, -1.3322,  0.5868,
         -0.5834,  0.5751,  0.1226, -1.5521,  0.3205,  0.3377,  0.4092, -2.0442],
        [-0.6160,  1.8545, -0.6061, -2.0125,  0.3472,  0.6439, -1.5567,  0.1975,
          0.9158,  0.9141, -1.0344, -0.8321,  0.0472,  0.5315,  0.2046,  1.0014],
        [-1.6581, -0.3990,  1.3529,  0.4723, -1.9835,  0.1635, -0.4968, -0.7485,
          1.3000, -0.0625,  1.3928, -0.9147, -0.0121,  0.5573, -0.1715,  1.2079],
        [-1.0931,  1.9596,  0.3180,  0.4556,  0.2201,  0.1674,  0.2010,  0.7495,
          0.1940, -1.1141,  1.1850, -1.2857,  0.5750,  0.2279, -0.5288, -2.2314]],
       grad_fn=<Nativ

Notice what we have done here. We have taken a list of integers (tokens), mapped it to the embedding space (now a list of vectors), and transformed those vectors using our encoder layers (which are a combination of self-attention and feed-forward networks). Below are the shapes of the data at each step:

In [176]:
print("src size is:", src.shape)
print("embedded input is: ",encoder_embedding(src).shape)
print("encoded input is: ",internal_rep.shape)

src size is: torch.Size([6])
embedded input is:  torch.Size([6, 16])
encoded input is:  torch.Size([6, 16])


Now if we just had an encoder-only transformer, we would essentially be done. We can take these encoded vectors and map them to a new token space. Let's try this out by contracting first the embedding dimension (dim=1) and then the sequence dimension (dim=0). We will do this with two linear projection layers.

In [177]:
token_projection = torch.nn.Linear(6,1)
embedding_projection = torch.nn.Linear(16, 2)

Using these, we can now classify our internal representation into two outcomes and assign a probability to each.

In [178]:
x = embedding_projection(internal_rep)
output = token_projection(x.transpose(0,1))
softmax = torch.nn.Softmax(dim=0)

print(softmax(output).tolist())

[[0.4085923731327057], [0.5914076566696167]]
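The steps so far can be collected into a single module. This is only a sketch: the class name `EncoderOnlyClassifier` and its constructor arguments are hypothetical, and the layer sizes simply repeat the values used in the cells above.

```python
import torch

class EncoderOnlyClassifier(torch.nn.Module):
    """Hypothetical wrapper: tokens -> embedding -> encoder -> class logits."""

    def __init__(self, vocab_size=100, seq_len=6, d_model=16, num_classes=2):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)
        layer = torch.nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=16, dropout=0)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=4)
        self.embedding_projection = torch.nn.Linear(d_model, num_classes)
        self.token_projection = torch.nn.Linear(seq_len, 1)

    def forward(self, src):
        x = self.encoder(self.embedding(src))         # [seq_len, d_model]
        x = self.embedding_projection(x)              # [seq_len, num_classes]
        x = self.token_projection(x.transpose(0, 1))  # [num_classes, 1]
        return x.squeeze(-1)                          # [num_classes] logits

model = EncoderOnlyClassifier()
probs = torch.softmax(model(torch.randint(100, [6])), dim=0)
print(probs.tolist())
```

For the MAX task, one natural choice would be `num_classes=vocab_size`, so that each possible token is its own category.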


Now we're cooking. Likewise, we could also pass this through a decoder stack that maps the sequence to a desired output sequence. Let's give this a try by first creating instances of decoder_embedding and decoder_layer.

In [179]:
decoder_layer = torch.nn.TransformerDecoderLayer(d_model = 16, nhead = 4, dim_feedforward = 16, dropout = 0)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers = 4)
decoder_embedding = torch.nn.Embedding(100,16)
final_projection = torch.nn.Linear(16, 9)
tgt = torch.randint(100,[6])

In [180]:
tgt_emb = decoder_embedding(tgt)
decoded_seq = decoder(tgt_emb, internal_rep)

softmax = torch.nn.Softmax(dim=1)

proj = final_projection(decoded_seq)
print(softmax(proj).argmax(dim=-1))

tensor([8, 6, 6, 7, 8, 0])
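Note that the decoder call above passes no target mask, so every target position can attend to later ones. For next-token prediction one would usually add a causal mask. The sketch below builds one by hand and checks that changing the last target token leaves the earlier outputs untouched. The random `memory` tensor is a stand-in for internal_rep, and the two-layer depth is an assumption to keep the example small.

```python
import torch

torch.manual_seed(0)
decoder_layer = torch.nn.TransformerDecoderLayer(d_model=16, nhead=4, dim_feedforward=16, dropout=0)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=2)
decoder_embedding = torch.nn.Embedding(100, 16)

memory = torch.randn(6, 16)  # stand-in for the encoder output internal_rep
tgt = torch.randint(100, [6])

# Causal mask: -inf above the diagonal blocks attention to later positions.
n = tgt.shape[0]
causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

out = decoder(decoder_embedding(tgt), memory, tgt_mask=causal_mask)

# Changing the last target token should leave the earlier positions unchanged.
tgt_changed = tgt.clone()
tgt_changed[-1] = (tgt_changed[-1] + 1) % 100
out_changed = decoder(decoder_embedding(tgt_changed), memory, tgt_mask=causal_mask)
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-6))
```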


And that's it! We've just gone through the entire mapping of a transformer. Now what we want to build is something that can do translation: take some high-complexity integral and translate it to a basis of master integrals weighted by integer coefficients over a large prime field. Simple enough. In practice it will look something like this:

model = EncoderDecoderModel(arg1,arg2,...)

"{3;5;2;3}" --> "{2341;6734;98432;325}"

We want this model to learn how to take the input sequence,

In [192]:
print(list("{3;5;2;3}"))

['{', '3', ';', '5', ';', '2', ';', '3', '}']


and translate it into a sequence of the form,

In [191]:
print(list("{2341;6734;98432;325}"))

['{', '2', '3', '4', '1', ';', '6', '7', '3', '4', ';', '9', '8', '4', '3', '2', ';', '3', '2', '5', '}']
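Since the model works on integers rather than characters, each character has to be mapped to a token id. Here is a minimal sketch of such a mapping. The vocabulary (ten digits, three delimiters, and a PAD token), the id ordering, and the helper names `encode` and `decode` are all assumptions for illustration.

```python
# Assumed vocabulary: the ten digits, the three delimiters, and a PAD token.
SYMBOLS = list("0123456789") + ["{", ";", "}", "PAD"]
token_to_id = {tok: i for i, tok in enumerate(SYMBOLS)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(text, max_length=None):
    """Map a string to integer ids, right-padding with PAD up to max_length."""
    ids = [token_to_id[ch] for ch in text]
    if max_length is not None:
        ids += [token_to_id["PAD"]] * (max_length - len(ids))
    return ids

def decode(ids):
    """Map ids back to a string, dropping PAD tokens."""
    return "".join(id_to_token[i] for i in ids if id_to_token[i] != "PAD")

src_ids = encode("{3;5;2;3}")
tgt_ids = encode("{2341;6734;98432;325}", max_length=24)
print(src_ids)          # [10, 3, 11, 5, 11, 2, 11, 3, 12]
print(decode(tgt_ids))  # {2341;6734;98432;325}
```

Padding to a fixed length keeps every target sequence the same size, which is what the PAD entries in the mock-up further down rely on.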


To achieve this, we will make the base model an autoregressive encoder-decoder transformer, while the high level model that we send sequences to will give the output once the loop is terminated. Here's a rough mock-up:

base_model = EncoderDecoderModel(arg1,arg2,...)

where base_model is autoregressive and takes in a src (input of the full model), and a tgt_t with some non-trivial entries. We will train the base_model, and then the full model simply runs a routine to generate the output. 

In [None]:
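class AI_IBP_model(torch.nn.Module):

    def __init__(self, base_model, start_id, pad_id, output_max_length):
        super().__init__()
        # torch.tensor cannot hold strings, so "{" and "PAD" are stored as
        # integer ids; start_id and pad_id are assumed vocabulary ids.
        initial_tgt = torch.full((output_max_length,), pad_id, dtype=torch.long)
        initial_tgt[0] = start_id
        self.register_buffer("initial_tgt", initial_tgt)
        self.output_max_length = output_max_length
        self.base_model = base_model

    def forward(self, src):
        # Clone so that generation does not overwrite the stored template.
        tgt = self.initial_tgt.clone()
        src = torch.as_tensor(src)
        for i in range(self.output_max_length - 1):
            # Assumes base_model returns one predicted token id per position.
            tgt[i + 1] = self.base_model(src, tgt)[i + 1]
        return tgt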

To be clear, the base_model is what gets trained with back propagation, and the high-level model AI_IBP_model is what is actually used to map one integral sequence to the next.
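The generation loop can be tried out without a trained network. The sketch below is a standalone version that stops early once a closing token appears. Everything in it is an assumption for illustration: the function `greedy_decode`, the token ids, and `toy_base_model`, a stand-in that returns logits of shape [max_length, vocab_size] and at each position predicts the previous token plus one. It follows the mock-up's convention of reading the prediction for position i+1 from output index i+1.

```python
import torch

def greedy_decode(base_model, src, start_id, end_id, pad_id, max_length):
    """Fill tgt one position at a time; stop early once end_id is produced."""
    tgt = torch.full((max_length,), pad_id, dtype=torch.long)
    tgt[0] = start_id
    with torch.no_grad():  # generation needs no gradients
        for i in range(max_length - 1):
            logits = base_model(src, tgt)  # assumed shape: [max_length, vocab_size]
            tgt[i + 1] = logits[i + 1].argmax(dim=-1)
            if tgt[i + 1] == end_id:
                break
    return tgt

def toy_base_model(src, tgt, vocab_size=14):
    """Hypothetical stand-in: position j predicts (tgt[j-1] + 1) mod vocab_size."""
    previous = torch.roll(tgt, shifts=1)
    return torch.nn.functional.one_hot((previous + 1) % vocab_size, vocab_size).float()

out = greedy_decode(toy_base_model, src=torch.tensor([10, 3, 12]),
                    start_id=10, end_id=12, pad_id=13, max_length=8)
print(out.tolist())  # [10, 11, 12, 13, 13, 13, 13, 13]
```

Stopping at the closing token is a design choice: positions after it stay as PAD, so the output can be cut off there when it is converted back to characters.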