This is a short notebook to walk through some of the applications of Transformer neural nets for tokenized analytic data, and demonstrate the functionality of the repo through examples. First let's load in the necessary libraries and modules.

There are three common Transformer architectures, and each does something a little different. Roughly, the inputs and outputs look like the following:

1. Encoder-Only: [3,4,2,5,...,5,6,3,4,3,3] --> 8
2. Decoder-Only: [[5,6,3,4,3,3],[8,2,3,5,1,3],...]
3. Encoder-Decoder: [5,6,3,4,3,3] --> [3,4,2,5,1,3,2]

In words, this looks like:

1. Encoder-only models take a sequence and map it to a vector in the embedding space, which is then mapped to a single category.
2. Decoder-only models take a sequence and predict the next token, so they are trained on a set of sequences.
3. Encoder-decoder models condition the next-token prediction of the decoder layers on the encoder output vectors.

Obviously, these all overlap, and in many ways you can reproduce the behavior of encoders with decoders and vice versa (for example, have the output of the encoder map to the next token in the sequence, as opposed to some completely different semantic category).

But for historical reasons, we'll keep all three of these architectures distinct as they have been used for different types of token prediction tasks.

Let's start with the encoder-only case and "train" a neural network to identify the largest token in a sequence, i.e. effectively implement a MAX function acting on a list using a neural network.
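Training data for the MAX task can be generated on the fly. The following is a minimal sketch, assuming a vocabulary of 100 tokens and sequences of length 6 to match the cells below. The names vocab_size, seq_len, and n_examples are illustrative and are not taken from the repo.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

vocab_size = 100  # assumed: tokens are the integers 0..99
seq_len = 6       # assumed sequence length
n_examples = 4    # assumed number of examples

# Each row is one input sequence of random tokens
sequences = torch.randint(vocab_size, (n_examples, seq_len))

# The label for each sequence is its largest token
labels = sequences.max(dim=1).values

print(sequences.shape, labels.shape)
```

Because the label is itself a token, the classifier for this task would need one output class per token in the vocabulary.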

In [1]:
import torch

Let's just run through each step of what makes a transformer. This will help us better understand what functions we need to write.

In [2]:
src = torch.randint(100,[6])
encoder_layer = torch.nn.TransformerEncoderLayer(d_model = 16, nhead = 4, dim_feedforward = 16, dropout = 0)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers = 4)
encoder_embedding = torch.nn.Embedding(100,16)
print(src)

tensor([24, 76, 55, 44, 94, 72])




For now, we will let the encoder have dropout = 0 so that the model produces a consistent output for debugging.

In [3]:
x = encoder_embedding(src)
internal_rep = encoder(x)
print(internal_rep)

tensor([[-1.2127,  1.6537,  0.0723,  1.3046,  0.2210,  1.5384,  0.5104, -1.4163,
         -0.4024, -1.4259,  0.4463, -0.9402, -1.1685, -0.3235,  0.5261,  0.6167],
        [-0.5606,  0.8476, -0.6768,  0.5698,  0.2271,  1.2122, -0.2183, -1.6810,
         -0.2648, -0.4991,  0.9850, -2.1261, -0.2526,  1.8321, -0.1312,  0.7368],
        [ 1.1186,  1.7833,  0.9453, -1.3795,  0.8694, -0.4449, -1.0556,  0.4666,
         -2.0631, -1.0185, -0.0567,  0.5114,  0.2645,  0.0031, -0.6264,  0.6822],
        [ 0.9288,  0.9889,  0.2136,  0.9784, -1.1983,  0.7845,  1.0701, -1.5109,
         -1.6719,  0.1763, -0.3626, -0.4841, -0.8155,  0.2127, -0.9847,  1.6748],
        [ 1.6971,  1.1108,  0.8574, -0.7106, -0.0139,  0.3790, -1.2149, -1.1007,
         -1.9908, -0.7293, -0.0161, -0.5163, -0.0650,  1.0827, -0.1328,  1.3633],
        [ 1.8538,  1.2696,  0.6135,  0.5131, -1.2424,  0.8327, -0.2386,  0.2915,
         -1.6153, -0.3127, -0.2418,  0.1944, -1.6435,  0.6072, -1.3548,  0.4733]],
       grad_fn=<Nativ

Notice what we have done here. We have taken a list of integers (tokens), mapped it to the embedding space (now a list of vectors), and transformed those vectors using our encoder layers (which are a combination of self-attention and feed-forward networks). Below are the shapes of the data at each step:

In [4]:
print("src size is:", src.shape)
print("embedded input is: ",encoder_embedding(src).shape)
print("encoded input is: ",internal_rep.shape)

src size is: torch.Size([6])
embedded input is:  torch.Size([6, 16])
encoded input is:  torch.Size([6, 16])


Now if we just had an encoder-only transformer, we would essentially be done. We can take these encoded vectors and map them to a new token space. Let's try this out by first performing the contraction on the embedding space (dim=1), and then on the sequence dimension (dim=0). We will do this with two linear projection layers.

In [5]:
token_projection = torch.nn.Linear(6,1)
embedding_projection = torch.nn.Linear(16, 2)

Using these, we can classify our internal representation into two outcomes and assign a probability to each.

In [6]:
x = embedding_projection(internal_rep)
output = token_projection(x.transpose(0,1))
softmax = torch.nn.Softmax(dim=0)

print(softmax(output).tolist())

[[0.47317489981651306], [0.5268250703811646]]
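The steps so far can be collected into a single module. This is only a sketch: EncoderOnlyClassifier and its default arguments are hypothetical names chosen to mirror the cells above, not something defined in the repo.

```python
import torch
from torch import nn

class EncoderOnlyClassifier(nn.Module):
    """Hypothetical wrapper around the steps above; not part of the repo."""

    def __init__(self, vocab_size=100, seq_len=6, d_model=16, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=16, dropout=0)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.embedding_projection = nn.Linear(d_model, n_classes)
        self.token_projection = nn.Linear(seq_len, 1)

    def forward(self, src):
        x = self.encoder(self.embedding(src))          # (seq_len, d_model)
        x = self.embedding_projection(x)               # (seq_len, n_classes)
        x = self.token_projection(x.transpose(0, 1))   # (n_classes, 1)
        return x.squeeze(-1)                           # (n_classes,) logits

model = EncoderOnlyClassifier()
src = torch.randint(100, [6])
probs = torch.softmax(model(src), dim=0)
print(probs.shape)
```

Returning logits and applying softmax outside the module keeps it compatible with torch.nn.CrossEntropyLoss, which expects unnormalized scores.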


Now we're cooking. Likewise, we could also pass this through a decoder stack that maps the sequence to a desired output sequence. Let's give this a try by first creating a decoder_embedding, a decoder_layer, and a decoder stack.

In [7]:
decoder_layer = torch.nn.TransformerDecoderLayer(d_model = 16, nhead = 4, dim_feedforward = 16, dropout = 0)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers = 4)
decoder_embedding = torch.nn.Embedding(100,16)
final_projection = torch.nn.Linear(16, 9)
tgt = torch.randint(100,[6])

In [8]:
tgt_emb = decoder_embedding(tgt)
decoded_seq = decoder(tgt_emb, internal_rep)

softmax = torch.nn.Softmax(dim=1)

proj = final_projection(decoded_seq)
print(softmax(proj).argmax(dim=-1))

tensor([0, 3, 3, 8, 3, 3])


And that's it! We've just gone through the entire mapping of a transformer. Now what we want to build is something that can do translation --> take some high-complexity integral, and translate it to a basis of master integrals weighted by integer coefficients over a large prime field. Simple enough. In practice it will look something like this:

model = EncoderDecoderModel(arg1,arg2,...)

"{3;5;2;3}" --> "{2341;6734;98432;325}"

We want this model to learn how to take the input sequence,

In [9]:
print(list("{3;5;2;3}"))

['{', '3', ';', '5', ';', '2', ';', '3', '}']


and translate it into a sequence of the form,

In [10]:
print(list("{2341;6734;98432;325}"))

['{', '2', '3', '4', '1', ';', '6', '7', '3', '4', ';', '9', '8', '4', '3', '2', ';', '3', '2', '5', '}']
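Before these character lists can be fed to an embedding layer, each character has to become an integer id. The following is a minimal sketch of a character-level tokenizer. The vocabulary layout (ids 0, 1, 2 reserved for padding, start, and end tokens) and the names encode and decode are assumptions for illustration, not the repo's tokenizer.

```python
# Hypothetical vocabulary: ids 0, 1, 2 are reserved for PAD, BOS, EOS
specials = ["<pad>", "<bos>", "<eos>"]
symbols = list("0123456789;{}")
token_to_id = {tok: i for i, tok in enumerate(specials + symbols)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(text):
    """Map a string to [BOS, character ids..., EOS]."""
    body = [token_to_id[ch] for ch in text]
    return [token_to_id["<bos>"]] + body + [token_to_id["<eos>"]]

def decode(ids):
    """Invert encode, dropping the special tokens."""
    return "".join(id_to_token[i] for i in ids if id_to_token[i] not in specials)

ids = encode("{3;5;2;3}")
print(ids)
print(decode(ids))
```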


To achieve this, we will make the base model an autoregressive encoder-decoder transformer, while the high-level model that we send sequences to returns the output once the generation loop terminates. Here's a rough mock-up:

base_model = EncoderDecoderModel(arg1,arg2,...)

where base_model is autoregressive and takes in a src (input of the full model), and a tgt_t with some non-trivial entries. We will train the base_model, and then the full model simply runs a routine to generate the output. 

In [11]:
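class AI_IBP_model(torch.nn.Module):

    # bos_tok, pad_tok and output_max_length are assumed defaults for this mock-up
    def __init__(self, base_model, bos_tok=1, pad_tok=0, output_max_length=10):
        super().__init__()
        # Tensors cannot hold strings, so the start sequence is stored as token ids:
        # [BOS, PAD, PAD, ...] plays the role of ["{", "PAD", "PAD", ...]
        self.initial_tgt = torch.tensor([bos_tok] + [pad_tok] * (output_max_length - 1))
        self.output_max_length = output_max_length
        self.base_model = base_model

    @torch.no_grad()
    def forward(self, src):
        tgt = self.initial_tgt.clone()  # copy so the stored template is not overwritten
        src = torch.as_tensor(src)
        for i in range(self.output_max_length - 1):
            # base_model returns one row of logits per target position;
            # take the most likely token at position i+1
            logits = self.base_model(src, tgt)
            tgt[i + 1] = logits[i + 1].argmax(dim=-1)
        return tgt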

To be clear, the base_model is what gets trained with back propagation, and the high-level model AI_IBP_model is what is actually used to map one integral sequence to the next.

Now let's think about the training run. We will train the base model, which does autoregressive decoding. That means that for every pair of input_seq and output_seq, we want to prepare a data set with len(output_seq) training examples. For each example, we want the model to predict the token output_seq[i] given input_seq and the preceding tokens output_seq[:i].

Let's do this one step at a time to get a sense for what we're dealing with.

In [59]:
from encoder_decoder import EncoderDecoderModel

base = EncoderDecoderModel(10,20,64,4,2,64)

src = [1,6,3,4,5,2]
tgt = [1,3,7,13,3,15,12,13,2]
pad_tok = 0
bos_tok = 1
eos_tok = 2
seq_len = 10

Now we have made an instance of the EncoderDecoder model, and we're ready to cook. To standardize the training, we are going to normalize the input and output sequences, so that they all have seq_len = 10. 

To do this, we need to add padding tokens.

In [60]:
def pad_seq(seq, max_len):
    return seq + [pad_tok] * (max_len - len(seq))

print("input: ",pad_seq(src, seq_len))
print("output: ",pad_seq(tgt, seq_len))

input:  [1, 6, 3, 4, 5, 2, 0, 0, 0, 0]
output:  [1, 3, 7, 13, 3, 15, 12, 13, 2, 0]


In [64]:
masked_data = [tgt]
for i in range(len(tgt)-1):
    masked_data.append(tgt[:-i-1]+[0]*(i+1))
masked_data.reverse()
torch.tensor(masked_data)

tensor([[ 1,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  7,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  0,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  3,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  3, 15,  0,  0,  0],
        [ 1,  3,  7, 13,  3, 15, 12,  0,  0],
        [ 1,  3,  7, 13,  3, 15, 12, 13,  0],
        [ 1,  3,  7, 13,  3, 15, 12, 13,  2]])
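Each masked row still needs the token it is supposed to predict. A minimal sketch of that pairing, assuming the first token (BOS) is never itself predicted, which gives len(tgt) - 1 pairs. The name examples is illustrative.

```python
pad_tok = 0
tgt = [1, 3, 7, 13, 3, 15, 12, 13, 2]

examples = []
for i in range(1, len(tgt)):
    # Tokens seen so far, padded back to the full length
    prefix = tgt[:i] + [pad_tok] * (len(tgt) - i)
    # The token the model should predict next
    examples.append((prefix, tgt[i]))

for prefix, label in examples[:3]:
    print(prefix, "->", label)
```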

Rather than doing all of this in series (which is what our for-loop is doing), PyTorch has native functions such as torch.triu that build the whole mask in a single parallel operation.

In [63]:
def causal_mask(tensor_input):
    seq_length = tensor_input.size(1)  # Assume tensor_input has shape (batch_size, seq_length)
    return torch.triu(torch.ones(seq_length,seq_length), diagonal=1) == 0
torch.tensor(tgt).unsqueeze(0)*causal_mask(torch.tensor(tgt).unsqueeze(0))

tensor([[ 1,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  0,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  7,  0,  0,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  0,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  3,  0,  0,  0,  0],
        [ 1,  3,  7, 13,  3, 15,  0,  0,  0],
        [ 1,  3,  7, 13,  3, 15, 12,  0,  0],
        [ 1,  3,  7, 13,  3, 15, 12, 13,  0],
        [ 1,  3,  7, 13,  3, 15, 12, 13,  2]])
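In practice the mask is usually handed to the decoder rather than multiplied into the tokens. Note that for boolean attention masks PyTorch uses True to mark positions that are blocked, which is the opposite of causal_mask above, where True marks allowed positions. The sketch below uses random stand-ins for the encoder output and the embedded target (both are assumptions), and checks that changing the last target position leaves the earlier outputs untouched.

```python
import torch

torch.manual_seed(0)
d_model, seq_len = 16, 6

decoder_layer = torch.nn.TransformerDecoderLayer(d_model=d_model, nhead=4,
                                                 dim_feedforward=16, dropout=0)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=2)

memory = torch.randn(seq_len, d_model)   # stand-in for the encoder output
tgt_emb = torch.randn(seq_len, d_model)  # stand-in for the embedded target

# True above the diagonal: position i may not attend to positions j > i
tgt_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out = decoder(tgt_emb, memory, tgt_mask=tgt_mask)

# Change only the last target position and decode again
tgt_emb_changed = tgt_emb.clone()
tgt_emb_changed[-1] = torch.randn(d_model)
out_changed = decoder(tgt_emb_changed, memory, tgt_mask=tgt_mask)

# Earlier positions cannot see the last one, so their outputs should match
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```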

In [None]:
from torch import nn

class EncoderDecoderModel(nn.Module):
    def __init__(self, input_dim, output_dim, d_model, num_heads, num_layers, ff_dim, dropout = 0, pad_tok = 0):
        super().__init__()
        self.pad_tok = pad_tok  # padding token id (assumed to be 0, as in the cells above)

        # Encoder embedding and layer
        self.encoder_embedding = nn.Embedding(input_dim, d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, ff_dim, dropout)

        # Decoder embedding and layer
        self.decoder_embedding = nn.Embedding(output_dim, d_model)
        self.decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, ff_dim, dropout)

        # Encoder/Decoder stack
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers)
        self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers)

        self.fc_out = nn.Linear(d_model, output_dim)


    def forward(self, src, tgt):
        # src and tgt are unbatched 1-D tensors of token ids.
        # Boolean masks follow the PyTorch convention: True marks positions to ignore.
        src_padding_mask = src == self.pad_tok
        tgt_padding_mask = tgt == self.pad_tok
        tgt_len = tgt.size(0)
        tgt_causal_mask = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device), diagonal=1)

        # Source encoded with padding mask
        src_emb = self.encoder_embedding(src)
        src_encoded = self.encoder(src_emb, src_key_padding_mask=src_padding_mask)

        # Target decoded with causal & padding mask
        tgt_emb = self.decoder_embedding(tgt)
        tgt_decoded = self.decoder(tgt_emb, src_encoded,
                                   tgt_mask=tgt_causal_mask,
                                   tgt_key_padding_mask=tgt_padding_mask,
                                   memory_key_padding_mask=src_padding_mask)

        return self.fc_out(tgt_decoded)

In [91]:
basemodel = EncoderDecoderModel(10,20,64,4,2,64,0)
src,tgt



(tensor([1, 6, 3, 4, 5, 2, 0, 0, 0, 0]),
 tensor([ 1,  3,  7, 13,  3, 15, 12, 13,  2,  0]))

In [106]:
src_emb = basemodel.encoder_embedding(src)
src_encoded = basemodel.encoder(src_emb)

# Target decoded with causal & padding mask
tgt_emb = basemodel.decoder_embedding(tgt)
tgt_decoded = basemodel.decoder(tgt_emb, src_encoded)

basemodel.fc_out(tgt_decoded)

tensor([[ 0.6365, -0.1251,  0.0741,  0.6195, -0.3606,  0.2211,  0.1788, -0.6724,
          0.8437, -0.7596, -0.2102, -0.3413,  0.1840,  0.0579,  0.0699,  0.5463,
          0.8563, -0.5731,  0.4686, -0.1620],
        [ 0.1209, -0.0113, -0.4144,  0.6071,  1.0980,  0.3065, -0.2806, -0.1779,
         -0.2318, -0.9086,  0.5979, -0.2905,  0.4655,  0.4133,  0.4207, -0.6185,
          0.8042,  0.8250,  0.0447,  1.1392],
        [ 0.0860,  0.0327, -0.1629,  0.7807,  0.4243, -0.4513, -0.0532, -0.4585,
          1.3042, -1.1243, -0.8062,  0.2256, -0.2201,  0.9695, -0.9161,  0.5189,
          0.6786,  0.4888, -0.1673,  0.0120],
        [ 1.4958,  0.2586, -0.1913,  0.5204, -0.1927,  1.2984,  0.2823, -0.6646,
         -0.2679, -0.0415, -0.1828, -0.7941, -0.0805, -0.1785, -0.3469,  0.1435,
          1.1888,  0.2179,  0.8081, -0.1433],
        [ 0.1209, -0.0113, -0.4144,  0.6071,  1.0980,  0.3065, -0.2806, -0.1779,
         -0.2318, -0.9086,  0.5979, -0.2905,  0.4655,  0.4133,  0.4207, -0.6185,
      

Now we want to scale this up so that the model takes in objects of size (batch_size, seq_length), and the embedding layer transforms them into objects of the form (batch_size, seq_length, d_model).

In [113]:
embedding = torch.nn.Embedding(20,4)
src = torch.randint(20,(4,5))

In [121]:
print(src)

tensor([[ 0, 18, 14,  1, 10],
        [ 2,  5,  9, 13, 12],
        [17,  3, 18,  2,  3],
        [15,  4,  0, 12,  7]])


In [123]:
embedding(src)

tensor([[[ 0.9217,  0.0278,  1.2198, -0.2151],
         [-0.2040,  0.2743,  0.4099,  1.1374],
         [-0.8172,  1.3363, -2.9211,  2.0755],
         [-0.2288,  0.1933, -0.5393, -0.0338],
         [-0.7728, -0.3497, -0.3871,  0.3064]],

        [[ 0.8770, -0.9357, -1.1362, -0.2797],
         [ 1.8444,  0.2990,  0.5008, -0.4369],
         [-1.4631,  0.2147, -1.2931,  0.1227],
         [-1.0897, -0.0143,  0.6449, -1.4886],
         [-0.0460, -0.0438, -0.7277, -1.0486]],

        [[ 0.5102, -0.7681,  0.8339,  1.2929],
         [ 0.3597, -1.5874, -1.8975,  0.7586],
         [-0.2040,  0.2743,  0.4099,  1.1374],
         [ 0.8770, -0.9357, -1.1362, -0.2797],
         [ 0.3597, -1.5874, -1.8975,  0.7586]],

        [[-0.4619,  0.6276, -0.4414,  0.9545],
         [ 0.6068, -0.5590, -0.1337,  1.2721],
         [ 0.9217,  0.0278,  1.2198, -0.2151],
         [-0.0460, -0.0438, -0.7277, -1.0486],
         [-0.8824,  0.4871,  1.0233,  0.7304]]], grad_fn=<EmbeddingBackward0>)

In [None]:
encoder_layer = torch.nn.TransformerEncoderLayer(4,2,4, dropout=0)

tensor([[ 0.4745, -0.2995,  1.2671, -1.4421],
        [-1.3554, -0.2683,  0.1903,  1.4334],
        [-0.8411,  0.6342, -1.0942,  1.3011],
        [-0.7749,  1.3655, -1.1200,  0.5294],
        [-1.3628, -0.1172,  0.0211,  1.4589]],
       grad_fn=<NativeLayerNormBackward0>)

In [133]:
encoder_layer(embedding(src))

tensor([[[ 0.2125, -0.5008,  1.4974, -1.2091],
         [-1.3716, -0.2727,  0.2337,  1.4105],
         [-0.8343,  0.6351, -1.1005,  1.2998],
         [ 1.4154, -0.0690, -1.4099,  0.0635],
         [-0.9779, -0.7814,  0.2143,  1.5450]],

        [[ 1.5279, -0.8658, -0.9213,  0.2592],
         [ 1.3318, -0.2139,  0.3230, -1.4409],
         [-1.4141,  0.8465, -0.4655,  1.0330],
         [-0.5280,  0.1536,  1.5340, -1.1596],
         [ 0.9437,  0.5621,  0.1580, -1.6639]],

        [[-0.3311, -1.4595,  0.6151,  1.1755],
         [ 0.5754, -1.0402, -0.8809,  1.3457],
         [-1.3335, -0.2427,  0.1096,  1.4666],
         [ 1.5448, -1.2277, -0.3260,  0.0089],
         [ 0.7964, -1.1951, -0.7645,  1.1632]],

        [[-1.4114,  0.6657, -0.4293,  1.1750],
         [ 0.1549, -1.2710, -0.3740,  1.4901],
         [ 0.5193, -0.2930,  1.2319, -1.4583],
         [ 1.0820,  0.2249,  0.3275, -1.6344],
         [-1.5275, -0.0852,  1.2300,  0.3827]]],
       grad_fn=<NativeLayerNormBackward0>)
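One caveat on the layout: torch.nn.TransformerEncoderLayer defaults to batch_first=False, so a tensor of shape (4, 5, 4) is read as (seq_length, batch_size, d_model) unless told otherwise. The following sketch, with an arbitrary seed, passes batch_first=True and checks that the sequences in a batch are then processed independently of one another.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(20, 4)
src = torch.randint(20, (4, 5))  # (batch_size, seq_length)

# batch_first=True makes the layer read inputs as (batch_size, seq_length, d_model)
encoder_layer = torch.nn.TransformerEncoderLayer(4, 2, 4, dropout=0, batch_first=True)

x = embedding(src)
y = encoder_layer(x)

# Replace the first sequence in the batch; the other sequences should be unaffected
x_changed = x.detach().clone()
x_changed[0] = torch.randn(5, 4)
y_changed = encoder_layer(x_changed)

print(y.shape)
print(torch.allclose(y[1:], y_changed[1:], atol=1e-5))
```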

In [134]:
linear_layer_model = torch.nn.Linear(4,2)
linear_layer_seq = torch.nn.Linear(5,1)

In [135]:
x = encoder_layer(embedding(src))
x = linear_layer_model(x)
print(x)

tensor([[[-0.2718, -0.1872],
         [ 0.5062, -0.2132],
         [ 0.5326, -0.2159],
         [ 0.0754, -0.8034],
         [ 0.4913, -0.4594]],

        [[ 0.0624, -1.0178],
         [-0.3442, -0.4599],
         [ 0.4932,  0.0727],
         [-0.1853,  0.1573],
         [-0.3324, -0.1510]],

        [[ 0.3183, -0.7044],
         [ 0.3721, -0.9750],
         [ 0.5228, -0.2336],
         [-0.0351, -1.0272],
         [ 0.3038, -1.0332]],

        [[ 0.5175,  0.0096],
         [ 0.4052, -0.9060],
         [-0.3320, -0.1985],
         [-0.3550, -0.2599],
         [ 0.2450,  0.1022]]], grad_fn=<ViewBackward0>)
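The remaining contraction over the sequence dimension mirrors the unbatched case, except that the transpose now swaps dims 1 and 2 so the batch dimension stays in front. A minimal sketch, using a random tensor as a stand-in for the projected output printed above:

```python
import torch

torch.manual_seed(0)
batch_size, seq_length, n_classes = 4, 5, 2

# Stand-in for the (batch_size, seq_length, n_classes) tensor printed above
x = torch.randn(batch_size, seq_length, n_classes)
linear_layer_seq = torch.nn.Linear(seq_length, 1)

# (4, 5, 2) -> transpose -> (4, 2, 5) -> linear -> (4, 2, 1) -> squeeze -> (4, 2)
logits = linear_layer_seq(x.transpose(1, 2)).squeeze(-1)

# One probability distribution over the two outcomes per sequence in the batch
probs = torch.softmax(logits, dim=-1)
print(probs.shape)
```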
