# Tutorial Notebook
This is a short Jupyter notebook that explores the functionality of each module in the codebase. Let's start with the model-building code.

## Building
This is the most important module -- here we actually build constructor classes for the model, and distinguish between the model that is trained and the model that does inference.

The module is essentially split into 3 sections:

1 & 2) Dataloading & Tokenization

3 & 4) Tech Stack: Embedding and Encoder/Decoder Stacks

5 & 6) Models: KernelModel (for training) & ProjectionModel (for inference)

This section of the notebook just explicates the important functions that we use in the implementation for each. Let's begin.

In [1]:
from building import *

### Dataloading and Tokenization

This is pretty self-explanatory: there's a tokenizer class (CustomTokenizer), and a Dataset class (FIRE6Dataset). 

In [2]:
# ---------------------
# Dataloader & Tokenizer
# ---------------------
tokenizer = CustomTokenizer()
data = FIRE6Dataset("../data/train", tokenizer, max_len=1000)
batchsize = 5
dataloader = DataLoader(data,batch_size=batchsize,shuffle=True)

The tokenizer methods, tokenize and detokenize, will be useful for our inference model. But really all you need to know is that data is an object that takes the strings in the .dat files, and turns them into a bunch of padded tensors of tokens.
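To make the "padded tensors of tokens" idea concrete, here is a minimal sketch of a character-level tokenizer. ToyTokenizer, its vocabulary, and its special tokens are assumptions made for illustration; the real CustomTokenizer may use a different vocabulary and different method signatures.

```python
# Hypothetical stand-in for CustomTokenizer: character-level, PAD id of 0.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


class ToyTokenizer:
    def __init__(self):
        # Assumed vocabulary: special tokens, digits, and the punctuation
        # that appears in the integral strings.
        symbols = [PAD, SOS, EOS] + list("0123456789") + ["{", "}", ",", "-"]
        self.vocab = {s: i for i, s in enumerate(symbols)}
        self.inverse = {i: s for s, i in self.vocab.items()}

    def tokenize(self, text, max_len):
        ids = [self.vocab[SOS]] + [self.vocab[c] for c in text] + [self.vocab[EOS]]
        ids = ids[:max_len]  # truncate sequences that are too long
        # Right-pad with PAD so every sequence has the same length.
        return ids + [self.vocab[PAD]] * (max_len - len(ids))

    def detokenize(self, ids):
        specials = {self.vocab[PAD], self.vocab[SOS], self.vocab[EOS]}
        return "".join(self.inverse[i] for i in ids if i not in specials)


tok = ToyTokenizer()
ids = tok.tokenize("{2,1,-1}", max_len=12)
print(ids)
print(tok.detokenize(ids))
```

Padding to a fixed length is what lets a DataLoader stack several sequences into one batch tensor.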

### Tech Stack

You can in principle use the PyTorch classes for this section, but I find it useful to tease out the actual building blocks of the model. This way it's very clear what's going on inside the KernelModel, and you can play around with individual blocks. 

This is essentially the flow of the KernelModel:

In [3]:
def kernel_model_forward(self, src, tgt):
    # embed the input sequence in model space
    x0 = self.embedding(src)
    x1 = self.positional_encoding(x0)
    memory = self.encoder_stack(x1)

    # embed a padded sequence up to the n-th token in output space
    y0 = self.embedding(tgt)
    tgt_input = self.positional_encoding(y0)

    # throw this into a decoder that hides all the tokens after the n-th token
    tgt_output = self.decoder_stack(tgt_input, memory, self.causal_mask)
    logits = self.fc_out(tgt_output)

    # return the logits -- NB: these are not probabilities!
    return logits

All of these building blocks are defined in sections 3 & 4 of the building module.
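Two details of the forward pass are worth seeing in isolation: the causal mask that hides future tokens, and the fact that the returned logits are not probabilities. The sketch below assumes a boolean mask where True means "may not attend" (the convention torch.nn.MultiheadAttention uses for boolean masks). How self.causal_mask is actually built in the building module is not shown in this notebook.

```python
import torch

seq_len, vocab_size = 4, 6

# Upper-triangular boolean mask: position i may not look at positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask)

# Stand-in logits with shape (batch, seq_len, vocab_size).
torch.manual_seed(0)
logits = torch.randn(1, seq_len, vocab_size)

# Softmax over the vocabulary axis turns logits into probabilities.
probs = torch.softmax(logits, dim=-1)
print(probs.sum(dim=-1))  # each position sums to 1, up to floating point

# Most likely next token at every position.
next_tokens = probs.argmax(dim=-1)
print(next_tokens)
```

Since softmax is monotonic, taking the argmax of the logits directly gives the same tokens. The probabilities only matter if you want to sample or report confidences.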

### Models

This is the meat and potatoes. Here we can define a model that is a blank canvas using our KernelModel class. After we define the hyperparameters and use them to instantiate the model, we can do 1 of 2 things:

1) train it and mold the parameters to our liking with data

2) load a model state dictionary from a .pth file that is consistent with this instance of the KernelModel

Let's see this in practice:

In [4]:
# ---------------------
# Hyperparameters
# ---------------------

vocab_size = len(CustomTokenizer().vocab)
d_model = 16
d_ffn = 16
nhead = 4
encoder_layers = 4
decoder_layers = 4
dropout = 0.1
PAD_TOKEN_ID = 0

# ---------------------
# Model
# ---------------------

model = KernelModel(vocab_size,d_model,d_ffn,nhead,encoder_layers,decoder_layers,dropout)
model.load_state_dict(torch.load("../models/test_model.pth"))



<All keys matched successfully>
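The requirement that the .pth file be "consistent" with the model instance can be illustrated with a tiny stand-in network. This is a sketch, not the KernelModel itself: make_model is a hypothetical helper, and an in-memory buffer replaces the file on disk.

```python
import io

import torch
import torch.nn as nn


def make_model(d_model=4):
    # Tiny stand-in for KernelModel; shapes depend on the hyperparameters.
    return nn.Sequential(nn.Embedding(10, d_model), nn.Linear(d_model, 10))


trained = make_model()

# Save only the parameters (the state dict), as a .pth file would.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)

# A fresh model built with the same hyperparameters accepts the parameters.
buffer.seek(0)
fresh = make_model()
result = fresh.load_state_dict(torch.load(buffer))
print(result)

# A model built with different hyperparameters does not.
buffer.seek(0)
mismatched = make_model(d_model=8)
load_failed = False
try:
    mismatched.load_state_dict(torch.load(buffer))
except RuntimeError as err:
    load_failed = True
    print("load failed:", type(err).__name__)
```

This is why the hyperparameters above have to match the ones used when test_model.pth was trained.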

Now we can run some dummy data through this model. The model takes as input a tensor with tokens in the vocabulary of our tokenizer.

In [5]:
src = torch.randint(15, (10,10))
tgt = torch.randint(15, (10,10))

In [7]:
model(src,tgt)

tensor([[[-4.5918,  1.0286,  1.3364,  ..., -3.6634,  0.0891, -4.7492],
         [-4.0936,  1.2809,  1.5021,  ..., -3.4740, -0.7971, -4.5059],
         [-4.6641,  1.6571,  1.5042,  ..., -4.4009, -0.0971, -4.6453],
         ...,
         [-4.6669,  1.1684,  1.4334,  ..., -3.6470,  0.0199, -4.8630],
         [-4.5707,  0.1531,  0.5614,  ..., -0.8732,  2.6705, -4.5953],
         [-4.0723,  1.5911,  1.5782,  ..., -3.4980, -1.0484, -4.5811]],

        [[-4.4722,  1.1234,  1.4237,  ..., -3.5539, -0.1702, -4.7558],
         [-4.0286,  1.1482,  1.4078,  ..., -3.0399, -0.8441, -4.5372],
         [-4.5493,  4.2288,  2.0093,  ..., -4.6730, -1.3699, -4.4064],
         ...,
         [-4.3903,  0.8531,  1.3447,  ..., -3.6983,  0.2342, -4.6064],
         [-4.7878,  0.7533,  1.0412,  ..., -2.4487,  1.0180, -4.9894],
         [-4.0476,  1.3867,  1.4238,  ..., -3.5978, -0.8200, -4.4834]],

        [[-3.9507,  4.2087,  2.0719,  ..., -3.9265, -2.4222, -4.0103],
         [-4.3250,  0.1591,  0.2106,  ...,  0

Notice that if you run these dummy inputs through the model multiple times, the numbers change. This is because we gave this instance of the model dropout = 0.1, which randomly zeroes roughly 10% of the activations on every forward pass. If we want to remove this and have a static model, we simply need to instantiate a new model without dropout.

In [8]:
model_static = KernelModel(vocab_size,d_model,d_ffn,nhead,encoder_layers,decoder_layers,dropout=0)
model_static.load_state_dict(torch.load("../models/test_model.pth"))

<All keys matched successfully>

In [9]:
model_static(src,tgt)

tensor([[[-4.5825,  1.0400,  1.3531,  ..., -3.3819,  0.0482, -4.8566],
         [-4.3006,  1.3465,  1.4643,  ..., -3.7228, -0.6543, -4.5880],
         [-4.5805,  1.1599,  1.4061,  ..., -3.7923,  0.2424, -4.7464],
         ...,
         [-4.5218,  1.2905,  1.4187,  ..., -3.7169, -0.2926, -4.7285],
         [-4.8254,  0.3129,  0.3601,  ..., -0.7590,  2.9198, -4.5413],
         [-4.1157,  1.3464,  1.4823,  ..., -3.5534, -0.8285, -4.5053]],

        [[-4.5825,  1.0400,  1.3531,  ..., -3.3819,  0.0482, -4.8566],
         [-4.3671,  1.2937,  1.4176,  ..., -3.5659, -0.5293, -4.6208],
         [-4.0769,  4.3929,  2.0561,  ..., -4.2985, -2.1331, -4.1152],
         ...,
         [-4.5171,  1.2208,  1.4519,  ..., -3.7090, -0.2367, -4.7048],
         [-4.7472,  1.1454,  1.3040,  ..., -2.7055,  0.5830, -4.9792],
         [-4.1230,  1.4266,  1.5192,  ..., -3.6172, -0.8433, -4.5173]],

        [[-3.9264,  4.2062,  2.0580,  ..., -3.9244, -2.3467, -4.0382],
         [-4.6810,  0.1557,  0.3756,  ..., -0

Now the outputs are fixed: as the tokens run through the model, nothing is dropped, so every activation contributes on each pass.
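For reference, here is how PyTorch's own nn.Dropout behaves in isolation. It is only active in training mode, so if the KernelModel builds its dropout from nn.Dropout (an assumption, not confirmed by this notebook), calling model.eval() would be an alternative to instantiating a second model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.1)
x = torch.ones(1000)

# Training mode (the default for a new module): each element is zeroed
# with probability p, and the survivors are scaled by 1 / (1 - p).
layer.train()
y_train = layer(x)
zero_fraction = (y_train == 0).float().mean().item()
print(f"fraction zeroed: {zero_fraction:.3f}")

# Evaluation mode: dropout is the identity, so outputs are deterministic.
layer.eval()
y_eval = layer(x)
print(torch.equal(y_eval, x))
```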

## Training

Now let's try to train a new model. The key here is that we need to teach the model how to predict the next token given the input sequence and all target tokens up to the next one. To do this we've implemented a training function.

In [10]:
from training import *

To see this in action, let's load in some hypers for a new model, model_train.

In [14]:
import torch.optim as optim

# ---------------------
# Build a trainable model with the same hyperparameters above.
# ---------------------

model_train = KernelModel(vocab_size,d_model,d_ffn,nhead,encoder_layers,decoder_layers,dropout)

# ---------------------
# Training Functions & Device
# ---------------------
epochs = 3
criterion = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)
optimizer = optim.Adam(model_train.parameters(), lr=1e-3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_train.to(device)

print("parameters loaded :-)")

# ---------------------
# Batch the dataloader to desired specs
# ---------------------

batchsize = 5
dataloader = DataLoader(data,batch_size=batchsize,shuffle=True)

parameters loaded :-)


The training function takes in a model (which could have been preloaded with parameters), along with training-specific objects (device, criterion, optimizer, dataloader, epochs). Once training has completed, it saves the model to "example_model.pth" in the models directory.

In [15]:
train(model_train, device, criterion, optimizer, dataloader, epochs, "example_model")

Starting training...


Epoch 1/3: 100%|██████████| 1000/1000 [00:38<00:00, 25.71it/s, loss=2.08]


Epoch 1/3, Avg. Loss: 1.3886


Epoch 2/3: 100%|██████████| 1000/1000 [00:40<00:00, 24.97it/s, loss=1.82]


Epoch 2/3, Avg. Loss: 1.2766


Epoch 3/3: 100%|██████████| 1000/1000 [00:43<00:00, 22.97it/s, loss=1.6] 


Epoch 3/3, Avg. Loss: 1.2755
Training completed in 122.51 seconds
Model saved to ../models/example_model.pth
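The body of train is not shown in this notebook, but a single next-token training step typically looks like the sketch below. TinySeq2Seq is a crude, hypothetical stand-in that only mimics the (src, tgt) -> logits call of the KernelModel, and the shift of the target by one position is the standard teacher-forcing setup, assumed here rather than taken from the training module.

```python
import torch
import torch.nn as nn

PAD_TOKEN_ID = 0
vocab_size = 15


class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in with the same call signature as KernelModel."""

    def __init__(self, vocab_size, d_model=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # Mean-pool the source and add it to every target position.
        context = self.embedding(src).mean(dim=1, keepdim=True)
        return self.fc_out(self.embedding(tgt) + context)


torch.manual_seed(0)
model = TinySeq2Seq(vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(1, vocab_size, (5, 10))
tgt = torch.randint(1, vocab_size, (5, 10))
tgt[:, -3:] = PAD_TOKEN_ID  # pretend the last three positions are padding

# Teacher forcing: feed tokens 0..n-1, ask for tokens 1..n.
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]

model.train()
optimizer.zero_grad()
logits = model(src, tgt_in)  # (batch, seq_len - 1, vocab_size)
# CrossEntropyLoss takes raw logits; padded targets are skipped.
loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

Passing ignore_index=PAD_TOKEN_ID means padded positions contribute nothing to the loss, so the model is not rewarded for predicting padding.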


Now that the model has been trained, we can start testing it.

### Testing

The obvious first step is to check how well it builds reductions from the input integral. For this we have built a wrapper model class, ProjectionModel, which, unlike the KernelModel, takes only an input string and then decodes auto-regressively to build an output string. (EDIT: Still need to git push the token and sequence testing functions)

Let's see this in practice.

In [20]:
from testing import *

fullmodel = ProjectionModel(model_train,tokenizer,max_length = 100)

This is a small example model (with d_model=16), and we're running it on the tenniscourt integrals, so performance is obviously going to be quite bad. But it's worth showing how ProjectionModel, which is our honest sequence-to-sequence model, performs once the KernelModel has been trained.
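The auto-regressive loop inside a wrapper like ProjectionModel can be sketched in a few lines. Everything here is an assumption for illustration: the special-token ids, the greedy (argmax) choice of the next token, and toy_scores, which stands in for a trained KernelModel followed by a lookup of the last position's logits.

```python
SOS, EOS = 1, 2  # assumed special-token ids, not taken from CustomTokenizer


def greedy_decode(next_token_scores, src_ids, max_length):
    """Generate tokens one at a time, feeding each prediction back in."""
    generated = [SOS]
    for _ in range(max_length):
        # One score per vocabulary entry for the next position.
        scores = next_token_scores(src_ids, generated)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == EOS:
            break
        generated.append(next_id)
    return generated[1:]  # drop the start token


def toy_scores(src_ids, generated):
    # Stand-in "model": copies the source token by token, then emits EOS.
    position = len(generated) - 1
    target = src_ids[position] if position < len(src_ids) else EOS
    return [1.0 if i == target else 0.0 for i in range(20)]


print(greedy_decode(toy_scores, [5, 7, 9], max_length=10))  # → [5, 7, 9]
```

The max_length argument plays the same role as in ProjectionModel: it caps the output when the model never produces an end token, which is why an undertrained model returns strings cut off mid-sequence.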

Take the following integral as an example:

In [25]:
fullmodel('{2,1,1,1,0,-1,1,0,1,0,-1,0,0,0,0}')

'{010,0006010220100106,13078100,100000000,0,,7827701270030001293306,176021200007,72207162660276020002'

Again, since we used our model_train KernelModel with dropout = 0.1, the output above changes from run to run. We can resolve this by instantiating a static model without dropout, loading the trained parameters into it, and using that to define a new inference model.

In [26]:
model_static = KernelModel(vocab_size,d_model,d_ffn,nhead,encoder_layers,decoder_layers,dropout=0)
model_static.load_state_dict(torch.load("../models/example_model.pth"))

fullmodel_static = ProjectionModel(model_static,tokenizer,max_length = 50)

In [36]:
for i in range(10):
    print(f"Prediction: {fullmodel_static(data.input_strings[i])}")
    print(f"True: {data.output_strings[i]}")

Prediction: {0111110111100001000,1,17,0,1600000000007,,,,77771
True: {4988942163698054660,0,11073643427184746115,12786563879566983934,11297505512155254377,0,16071282845234746120,0,3877329585571272712,0,0,0,0,0,2351088370595720278,0,0,0,0,0,0,0,9485612509064294181,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Prediction: {0111110111100001000,1,,,0,1,177000000007,,,,77771
True: {10860739550675742401,1575801467500097271,15432346270711039692,0,0,16908624229613386083,0,0,0,6908755195386501367,0,12230655570764797522,9632817245834476874,15588438248097191445,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Prediction: {0111110111000001000,1,,,0,07600000000007,,,,77771
True: {0,0,15292746396285261898,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Prediction: {0111110111100001000,1,,,07,1600000000007,,,,77771
True: {1428586150493445959,0,3674960714066933736,0,0,0,0,0,14658749303094349744,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
Prediction: {0111