This short notebook walks through some applications of Transformer neural networks to tokenized analytic data and demonstrates the functionality of the repo through examples. First, let's load the necessary libraries and modules.

In [61]:
import encoder_decoder, encoder_only, decoder_only

ImportError: cannot import name 'Dataset' from 'torch' (/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/__init__.py)

Each architecture does something a little different. Roughly, the inputs and outputs look like the following:

1. Encoder-Only: [3,4,2,5,...,5,6,3,4,3,3] --> 8
2. Decoder-Only: [[5,6,3,4,3,3],[8,2,3,5,1,3],...]
3. Encoder-Decoder: [5,6,3,4,3,3] --> [3,4,2,5,1,3,2]

In words, this looks like:

1. Encoder-only models take a sequence and map it to a vector in the embedding space, which is then mapped to a single category.
2. Decoder-only models take a sequence and predict the next token; they are therefore trained on a set of sequences.
3. Encoder-decoder models condition the next-token prediction of the decoder layers on an encoder output vector.

Obviously, these all overlap, and in many ways you can reproduce encoder behavior with decoders and vice versa (for example, have the encoder output map to the next token in the sequence, as opposed to some completely different semantic category).

But for historical reasons, we'll keep all three of these architectures distinct as they have been used for different types of token prediction tasks.
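
To make the overlap concrete, here is a minimal sketch of an encoder stack used as a next-token predictor. It is not the repo's implementation: the sizes, the `next_token_head` layer, and the `next_token_logits` helper are assumptions made for illustration. The only change from a plain encoder is a causal mask, which stops each position from attending to later positions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes chosen for illustration only
vocab_size, d_model, seq_len = 11, 16, 6

embedding = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
next_token_head = nn.Linear(d_model, vocab_size)  # maps each position to vocabulary logits

# True above the diagonal means "position i may not attend to position j > i"
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def next_token_logits(tokens):
    """tokens: (batch, seq_len) integers -> (batch, seq_len, vocab_size) logits."""
    hidden = encoder(embedding(tokens), mask=causal_mask)
    return next_token_head(hidden)

tokens_a = torch.tensor([[5, 6, 3, 4, 3, 3]])
tokens_b = torch.tensor([[5, 6, 3, 9, 1, 2]])  # same first three tokens, different tail

logits_a = next_token_logits(tokens_a)
logits_b = next_token_logits(tokens_b)

# With the causal mask, the outputs at the first three positions ignore the tail
prefix_matches = torch.allclose(logits_a[:, :3], logits_b[:, :3], atol=1e-5)
print(logits_a.shape, prefix_matches)
```

Because the first three positions never see the tokens that follow them, their logits can be trained against the next token, which is the decoder-style behavior described above.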

Let's start with the encoder-only case and "train" a neural network to identify the largest token in a sequence, i.e. effectively implement a MAX function acting on a list using a neural network.

In [None]:
import torch
from torch.utils.data import DataLoader

In [None]:
# Define Dataset Class
class SequenceDataset(torch.utils.data.Dataset):
    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long), torch.tensor(self.targets[idx], dtype=torch.long)

In [None]:
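import random

seq = []
tgt = []

# Build 20 random sequences of 7 tokens, each token drawn from 1..10
for i in range(20):
    start = []
    for j in range(7):
        start.append(random.randint(1,10))
    seq.append(start)
    # The label is the largest token in the sequence (the MAX task described above)
    tgt.append(max(start))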

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = SequenceDataset(seq,tgt)
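
For orientation, here is one way an encoder-only classifier for the MAX task could be put together. This is a sketch under stated assumptions, not the repo's `encoder_only` module: the class name `MaxTokenClassifier`, the layer sizes, and the mean-pooling step are all hypothetical choices. The vocabulary and class counts are set to 11 so that token ids 1 through 10 are valid indices.

```python
import torch
from torch import nn

class MaxTokenClassifier(nn.Module):
    """Hypothetical encoder-only classifier: token sequence -> one class per sequence."""

    def __init__(self, vocab_size=11, d_model=16, nhead=4, num_layers=2, num_classes=11):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=32, dropout=0.0, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, tokens):
        hidden = self.encoder(self.embedding(tokens))  # (batch, seq_len, d_model)
        pooled = hidden.mean(dim=1)                    # average over sequence positions
        return self.classifier(pooled)                 # (batch, num_classes) logits

model = MaxTokenClassifier()
batch = torch.tensor([[1, 10, 6, 7, 3, 6, 7], [8, 6, 4, 2, 10, 9, 6]])
targets = batch.max(dim=1).values  # the label is the largest token in each row
logits = model(batch)
loss = nn.functional.cross_entropy(logits, targets)
print(logits.shape, targets.tolist(), loss.item())
```

Mean pooling is only one option for collapsing the sequence dimension. Since MAX does not depend on token order, the sketch also leaves out positional encodings, which an order-sensitive task would likely need.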



In [None]:
from torch import nn
encoder_layer = nn.TransformerEncoderLayer(d_model=96,nhead=8)
encoder_only = nn.TransformerEncoder(encoder_layer,num_layers=4)
src = torch.rand(4,4,96)
out = encoder_only(src)
out[0,1]

tensor([-9.7131e-01,  6.4653e-01,  1.1058e+00, -1.5218e+00, -2.9165e-01,
         1.7518e-01, -6.2922e-02, -1.3742e+00,  5.7713e-01,  9.7474e-02,
        -5.5063e-01, -4.2551e-01, -2.4405e+00,  3.3425e-01, -9.2217e-01,
         2.9956e-02, -1.7427e+00, -2.9713e-04,  2.0554e+00, -3.3534e-01,
         4.4978e-01, -1.4612e+00, -9.6993e-01, -1.7550e+00, -4.9043e-01,
         2.1186e-01, -2.5202e+00, -3.3610e-01, -6.8310e-01, -3.2452e-02,
         2.7511e+00, -9.1493e-01,  1.3785e-01,  1.0419e+00, -1.4911e+00,
         3.2904e-01,  4.1584e-02,  1.4884e-01,  2.4676e-01,  1.1247e+00,
         1.3813e+00, -1.2214e-01,  1.9641e-01,  4.9167e-01,  2.5802e-01,
         3.2950e-01,  1.3350e+00,  1.0029e+00,  9.8470e-01, -8.4471e-01,
         1.3989e+00,  7.9151e-01,  2.5862e-01, -1.2320e+00,  8.8090e-02,
        -4.4815e-01,  7.1815e-02,  5.7713e-01,  1.3305e+00,  1.1457e+00,
        -2.8976e-01, -1.2213e-01, -1.1351e+00,  9.9267e-01, -1.4445e+00,
         1.2487e+00, -1.0512e+00,  9.6878e-01, -1.2

In [None]:
loader=DataLoader(data,batch_size=5, shuffle=True)

for epoch in range(5):
    print(epoch)
    for seq,tgt in loader:
        seq,tgt = seq.to(device), tgt.to(device)
        print(seq,seq.transpose(1,0),seq.transpose(1,0))

0
tensor([[ 1, 10,  6,  7,  3,  6,  7],
        [ 8,  6,  4,  2, 10,  9,  6],
        [ 1,  8,  4,  7,  8,  8,  7],
        [ 8,  6,  1,  9,  3,  6,  9],
        [ 5,  8,  1,  3,  7,  1,  8]]) tensor([[ 1,  8,  1,  8,  5],
        [10,  6,  8,  6,  8],
        [ 6,  4,  4,  1,  1],
        [ 7,  2,  7,  9,  3],
        [ 3, 10,  8,  3,  7],
        [ 6,  9,  8,  6,  1],
        [ 7,  6,  7,  9,  8]]) tensor([[ 1,  8,  1,  8,  5],
        [10,  6,  8,  6,  8],
        [ 6,  4,  4,  1,  1],
        [ 7,  2,  7,  9,  3],
        [ 3, 10,  8,  3,  7],
        [ 6,  9,  8,  6,  1],
        [ 7,  6,  7,  9,  8]])
tensor([[ 2,  3,  3, 10,  7,  9,  3],
        [ 7,  3,  6,  5,  1,  3,  2],
        [ 9,  2,  9,  6, 10,  5,  8],
        [ 7, 10, 10,  5,  3,  4,  1],
        [ 3,  2,  8, 10,  3, 10,  9]]) tensor([[ 2,  7,  9,  7,  3],
        [ 3,  3,  2, 10,  2],
        [ 3,  6,  9, 10,  8],
        [10,  5,  6,  5, 10],
        [ 7,  1, 10,  3,  3],
        [ 9,  3,  5,  4, 10],
        [ 3,  

In [None]:
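embedding = nn.Embedding(10,9)  # 10 token ids (0-9), each mapped to a 9-dimensional vector
softmax=torch.nn.Softmax(dim=1)  # normalize across the embedding dimension
# Note: despite the name, `logits` holds softmax outputs (each row sums to 1), not raw logits
logits = softmax(embedding(torch.tensor([1,2,3,4,9,0,5,6,7,8,8])))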

Let's just run through each step of what makes a transformer. This will help us better understand what functions we need to write.

In [66]:
src = torch.randint(100,[6])
encoder_layer = torch.nn.TransformerEncoderLayer(d_model = 16, nhead = 4, dim_feedforward = 16, dropout = 0)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers = 4)
encoder_embedding = torch.nn.Embedding(100,16)
print(src)

tensor([97, 19, 28, 15, 63, 22])


For now, we will let the encoder have dropout = 0 so that the model produces a consistent output for debugging.
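
The effect of the dropout setting can be checked directly. The sketch below uses a hypothetical helper, `make_encoder`, and arbitrary sizes. It compares two forward passes on the same input, first with `dropout=0`, and then with a nonzero rate in `eval()` mode, which also switches dropout off.

```python
import torch
from torch import nn

def make_encoder(dropout):
    # Hypothetical helper: a small encoder stack with a configurable dropout rate
    layer = nn.TransformerEncoderLayer(
        d_model=16, nhead=4, dim_feedforward=16, dropout=dropout
    )
    return nn.TransformerEncoder(layer, num_layers=2)

x = torch.rand(6, 16)  # unbatched input of shape (seq_len, d_model)

# dropout=0: repeated forward passes agree even in training mode
no_dropout = make_encoder(dropout=0.0)
same_without_dropout = torch.allclose(no_dropout(x), no_dropout(x))

# dropout=0.5: eval() disables dropout regardless of the configured rate
with_dropout = make_encoder(dropout=0.5)
with_dropout.eval()
same_in_eval = torch.allclose(with_dropout(x), with_dropout(x))

print(same_without_dropout, same_in_eval)
```

With a nonzero rate in training mode, each forward pass zeroes a different random subset of activations, so repeated calls would generally disagree.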

In [78]:
x = encoder_embedding(src)
internal_rep = encoder(x)
print(internal_rep)

tensor([[ 1.3283,  0.0077,  0.3913,  0.9870,  0.6921,  0.3539, -1.3776,  1.3551,
         -0.2632,  0.2011, -1.9349, -0.9768,  1.0233, -0.5457,  0.3402, -1.5817],
        [ 1.2455, -0.0666,  1.1090,  0.5991,  0.1802,  1.1465, -1.9347,  1.0548,
         -0.5469,  1.0107, -1.6405, -0.8075,  0.2264, -0.3305,  0.1026, -1.3482],
        [ 1.1028, -0.3962,  1.3035,  1.0214, -0.2523,  1.2202, -1.2183,  1.0586,
         -0.7378,  0.2708, -2.2377, -0.7819,  0.7078, -0.5491,  0.2720, -0.7837],
        [ 0.9776, -0.3862,  0.9163,  0.4753,  0.5099,  1.3121, -1.6841,  1.4081,
         -0.6459,  0.6655, -1.6363, -0.9853,  0.6782, -0.6420,  0.2881, -1.2512],
        [ 1.6680, -0.3225,  0.3550,  0.7286, -0.0035,  0.3544, -1.1634,  1.3554,
         -0.4003,  0.1139, -2.0916, -0.8644,  1.4365, -1.0879,  0.5066, -0.5847],
        [ 2.0897,  0.1409, -0.0174, -0.3433,  0.2823,  0.4397, -1.3754,  1.0591,
         -0.1134, -0.1459, -1.7945, -0.9701,  1.6416, -0.8249,  0.5295, -0.5980]],
       grad_fn=<Nativ

Notice what we have done here. We have taken a list of integers (tokens), mapped it to the embedding space (now a list of vectors), and transformed those vectors using our encoder layers (which are a combination of self-attention and feed-forward networks). Below are the shapes of the data at each step:

In [99]:
print("src size is:", src.shape)
print("embedded input is: ",encoder_embedding(src).shape)
print("encoded input is: ",internal_rep.shape)

src size is: torch.Size([6])
embedded input is:  torch.Size([6, 16])
encoded input is:  torch.Size([6, 16])
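
The shapes above are for a single unbatched sequence. As a hedged sketch of the batched case (the sizes and variable names here are illustrative assumptions), note that PyTorch's encoder expects `(seq_len, batch, d_model)` by default, and `(batch, seq_len, d_model)` only when the layer is built with `batch_first=True`.

```python
import torch
from torch import nn

embedding = nn.Embedding(100, 16)
tokens = torch.randint(100, (5, 7))  # (batch=5, seq_len=7)

# Default layout: the encoder expects (seq_len, batch, d_model)
seq_first = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=16, dropout=0.0),
    num_layers=2,
)
out_seq_first = seq_first(embedding(tokens).transpose(0, 1))

# With batch_first=True the encoder takes (batch, seq_len, d_model) directly
batch_first = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=16, nhead=4, dim_feedforward=16, dropout=0.0, batch_first=True
    ),
    num_layers=2,
)
out_batch_first = batch_first(embedding(tokens))

print(out_seq_first.shape, out_batch_first.shape)
```

Either layout works as long as the input matches what the layer was constructed for. Mixing them up runs without raising an error whenever the shapes happen to be compatible, so it is worth printing shapes as done above.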


Now, if we just had an encoder-only transformer, we would essentially be done. We can take these encoded vectors and map them to a new token space. Let's try this out by first projecting the embedding dimension (dim=1) down to two outputs, and then contracting the sequence dimension (dim=0) down to one. We will do this with two linear projection layers.

In [103]:
token_projection = torch.nn.Linear(6,1)
embedding_projection = torch.nn.Linear(16, 2)

Using these, we can now classify our internal representation into two outcomes and assign a probability to each.

In [115]:
x = embedding_projection(internal_rep)
output = token_projection(x.transpose(0,1))
softmax = torch.nn.Softmax(dim=0)

print(softmax(output).tolist())

[[0.6231492161750793], [0.37685078382492065]]
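
To train these projections, the two-class scores would typically feed a cross-entropy loss. The sketch below is an illustration rather than the repo's training code: `internal_rep` is a random stand-in for the encoder output, and the target label is hypothetical. It shows that `cross_entropy` takes the raw scores before the softmax and returns the negative log of the probability assigned to the target class.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random stand-in for the encoder output, shape (seq_len, d_model)
internal_rep = torch.rand(6, 16)
embedding_projection = nn.Linear(16, 2)
token_projection = nn.Linear(6, 1)

x = embedding_projection(internal_rep)                    # (6, 2)
logits = token_projection(x.transpose(0, 1)).squeeze(1)   # (2,) raw class scores
probs = torch.softmax(logits, dim=0)                      # (2,) probabilities summing to 1

target = torch.tensor(1)  # hypothetical label for this sequence

# cross_entropy expects a batch dimension and raw logits, not softmax outputs
loss = nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
manual = -torch.log(probs[1])

print(probs.tolist(), loss.item(), manual.item())
```

Passing softmax outputs into `cross_entropy` would apply the normalization twice, so the softmax is best kept for inspection and inference only.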
