## Intro

Goal: Build an LLM from scratch and train it to do basic integer arithmetic: addition, subtraction and multiplication.
I want to check how well an LLM, like this one trained from scratch, can learn math. It is not a symbolic system; it operates on tokens. Can tokens be used for math calculations? Further steps:
1) Maybe the embeddings that represent tokens are in fact a kind of symbolic representation? This is worth checking: how the embeddings of numbers and operations behave and how they relate to each other.
2) If vanilla LLMs with token embeddings prove not very good at math, can we extend them so that symbolic math operations arise within the model? I mean creating LLMs with internal representations that have a flavour of Wolfram Alpha/Mathematica.

The motivation is that LLMs are, and increasingly will be, used for mathy problems. For example, Microsoft recently started offering "Copilot for Finance" based on ChatGPT. I imagine that it is token-based and wonder how reliable the math side of it can be. And maybe: how could such models be modified, both in their architectures and in the way they represent things like math, to be better at math? For now I focus on elementary operations: integer addition, subtraction and multiplication.

Let's do it :)


## Plan
1. ~~Generating and processing a small initial dataset to be able to perform initial model training~~
2. Convert the dataset to the right format, save it as a Dataset and push it to the HF Hub.
3. Create a custom tokenizer for our dataset
4. ~~Design and implement model architecture~~
5. Design the training objective (e.g. add some masking as in the original Transformer paper, or train without masking)
6. Training a model on GPU
7. Gathering and processing a very large dataset (multiple operations beyond addition, more complex structure of operations, etc.)
8. Iterate on the other steps and refine the model


## Questions

* Should I build a Transformer encoder model or a Transformer decoder model?

* How to generate math data?


* I should think about whether the math should be just strings of words or lists of tokens like in NER/POS.
* I think I want to build an encoder-based transformer model to capture deeper bidirectional relations.
* Or maybe keep it a decoder model for better generation of results?
* I think an encoder model is better here, as I want the model to analyze all the relations at once and bidirectionally, unlike a generative model.


* If I implement (more or less) the encoder part of the original "Attention Is All You Need" model, how many parameters will it have? Will it have billions? How many parameters does the paper's model have?
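
A back-of-the-envelope count helps with the last question. This is only a sketch under assumptions that are not fixed anywhere in this notebook yet: the paper's base hyperparameters (6 layers, model dimension 512, feed-forward dimension 2048), biases on every linear layer, two LayerNorms per layer, parameter-free sinusoidal positional encodings, and a hypothetical helper name `encoder_param_count`. For comparison, the paper reports roughly 65M parameters for its base encoder-decoder model and 213M for the big one.

```python
def encoder_param_count(vocab_size, d_model=512, d_ff=2048, num_layers=6):
    """Rough parameter count for an encoder-only Transformer (assumed layout)."""
    # Q, K, V and output projections: each d_model x d_model plus a bias
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers, d_model -> d_ff -> d_model, each with a bias
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms
    # Token embedding table; sinusoidal positional encodings add no parameters
    embeddings = vocab_size * d_model
    return num_layers * per_layer + embeddings


print(encoder_param_count(vocab_size=100))  # 18965504
```

Under these assumptions the encoder alone, with a 100-token vocabulary, lands at about 19M parameters, so millions rather than billions.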



## Notes to myself
* Build a Transformer-like model for generating math.
* We will see how to work with the huge amounts of data we want.
* We will need to generate huge amounts of data.
* I will explore the pretraining step and how to train a transformer model from scratch.
* We will have to cover the following steps:
    * Gathering and processing a very large dataset
    * Creating a custom tokenizer for our dataset
    * Training a model on GPU
* To train such a large model we will probably use distributed training with the Accelerate library for PyTorch.
* I will probably use a notebook to prototype and experiment, but the final preprocessing and training code should probably live in a script that can run on multiple GPUs. We'll see.
* For an example of such a script, check out the Transformers repository.


## 1. Generating and processing a small initial dataset

* For now I can train an initial model on a very small and structurally simple experimental dataset.
* I think I will initially train the model on addition of numbers from 0 to 99.
* The model will also have to learn what these numbers mean.
* Starting with such a simple example, I can postpone engineering a real training dataset and focus on coding the model, the training pipeline and the evaluation.

Let's generate all four-element combinations of four arrays containing the numbers from 0 to 99.

That's 100M (100^4) initial addition records to teach our model how to add numbers from 0 to 99.

In [None]:
import numpy as np

nums = np.stack(np.meshgrid(np.arange(100), np.arange(100), np.arange(100), np.arange(100)), -1).reshape(-1, 4)

In [None]:
nums

array([[ 0,  0,  0,  0],
       [ 0,  0,  0,  1],
       [ 0,  0,  0,  2],
       ...,
       [99, 99, 99, 97],
       [99, 99, 99, 98],
       [99, 99, 99, 99]])
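
The same construction can be checked on a smaller range before committing memory to the full grid. A minimal sketch, assuming a hypothetical helper `all_combinations`, explicit `indexing="ij"` and a 16-bit dtype (none of which appear in the cell above): `"ij"` makes the first column the slowest-varying one, and 16 bits are enough for values up to 99 and sums up to 396.

```python
import numpy as np


def all_combinations(n, k=4, dtype=np.int16):
    """Return every k-tuple of integers in [0, n) as rows of an (n**k, k) array."""
    axes = [np.arange(n, dtype=dtype)] * k
    # indexing="ij" keeps the first column as the slowest-varying one
    grid = np.meshgrid(*axes, indexing="ij")
    return np.stack(grid, axis=-1).reshape(-1, k)


small = all_combinations(3)
print(small.shape)  # (81, 4)
print(small[:3])    # rows [0 0 0 0], [0 0 0 1], [0 0 0 2]
```

With `n=100` the same call would produce the 100M-row array; the smaller dtype is a memory-saving choice, not something the original cell does.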

We will now create a DataFrame with these four-number rows and the sum of each row. We will save it as a CSV file to serve as our temporary training dataset.

In [None]:
import pandas as pd

sums = pd.DataFrame({"a": nums[:, 0], "b": nums[:, 1], "c": nums[:, 2], "d": nums[:, 3], "sum": nums[:, 0]+nums[:, 1]+nums[:, 2]+nums[:, 3]})

In [None]:
sums

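| | a | b | c | d | sum |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 2 | 2 |
| 3 | 0 | 0 | 0 | 3 | 3 |
| 4 | 0 | 0 | 0 | 4 | 4 |
| ... | ... | ... | ... | ... | ... |
| 99999995 | 99 | 99 | 99 | 95 | 392 |
| 99999996 | 99 | 99 | 99 | 96 | 393 |
| 99999997 | 99 | 99 | 99 | 97 | 394 |
| 99999998 | 99 | 99 | 99 | 98 | 395 |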


In [None]:
sums.to_csv("./data/sums_4v_0_99.csv.gz", index=False)

This small initial dataset with sums of 4 numbers ranging from 0 to 99 takes 1.5GB (400MB when compressed), and that is just for this one simple operation. We saved it to a CSV file for later use elsewhere.

Let's treat this as our base training dataset to teach our simplest model how to add numbers from 0 to 99.

## 2. Preprocess data and save to hub


In [None]:
# We will implement it next - decided to first implement model architecture so we can fit the shape of dataset to the final architecture design

## 3. Create a custom tokenizer for our dataset

In [None]:
# We will implement it next - decided to first implement model architecture so we can fit the tokenizer's nuts and bolts to the final architecture design

## 4. Design and implement model architecture

Because I want to build something simple, as far as transformers go of course, I decided to use the original Transformer architecture presented in the "Attention Is All You Need" paper for this experiment. To be more specific: the encoder part of it.

We only need the encoder, for two reasons. First, we are not solving a seq2seq problem like the original paper did. Second, I think we should care more about building a model that creates a rich numerical representation of the relations in the input data (e.g. between tokens, mathematical relations, etc.), and that is what an encoder does best. A decoder is more one-directional because of the masking it applies, and puts more emphasis on sequence generation, which is obvious when you consider that it is the second "seq" in "seq2seq".
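
The difference in directionality can be seen on a toy example. This is a minimal sketch with dummy, all-equal similarity scores (an assumption made purely for illustration): encoder-style attention spreads weight over every position, while a decoder-style causal mask hides the positions to the right.

```python
import torch

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # dummy, all-equal similarity scores

# Encoder-style: every position attends to every position
bidirectional = torch.softmax(scores, dim=-1)

# Decoder-style: positions above the diagonal (future tokens) are masked out
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(bidirectional[0])  # the first token sees all four positions equally
print(causal[0])         # the first token sees only itself
```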

(Sidenote: I now think this problem could also be posed as seq2seq: the mathematical expression is the input sequence (e.g. "2 + 2") and the result of the operation is the output sequence (e.g. "4"). This would model basic math operations as a kind of translation problem, which in a sense it is. Just an idea for a later implementation; we stick to the encoder-only architecture for now.)
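
To make the seq2seq framing concrete, here is one way a row of addends could be turned into a source/target text pair. It is only an illustration: the helper name `to_seq2seq_pair` and the `" + "` separator are assumptions, not a format this notebook has settled on.

```python
def to_seq2seq_pair(numbers):
    """Turn a row of addends into a (source_text, target_text) pair."""
    source = " + ".join(str(n) for n in numbers)
    target = str(sum(numbers))
    return source, target


print(to_seq2seq_pair([2, 2]))            # ('2 + 2', '4')
print(to_seq2seq_pair([99, 99, 99, 95]))  # ('99 + 99 + 99 + 95', '392')
```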

### 4.1 Multi-Head Attention

In [1]:
import torch

We first create the configuration for the architecture parameters. I mostly use the values from the "Attention Is All You Need" paper.

In [45]:
from transformers import PretrainedConfig


class Config(PretrainedConfig):
    def __init__(self):
        super().__init__()
        self.vocab_size = 100  # number of distinct token ids
        self.embed_dim = 512  # d_model in the paper
        self.num_attention_heads = 8  # h in the paper

In [46]:
config = Config()

Just for prototyping, I create some mock tokenized sequence data.

In [27]:
from transformers import BatchEncoding
tokenized_inputs = BatchEncoding({"input_ids": torch.tensor([[1, 2, 3, 4]])})

In [28]:
tokenized_inputs.input_ids

tensor([[1, 2, 3, 4]])

In [30]:
tokenized_inputs.input_ids.shape # [batch_size, tokenized_sequence_length]

torch.Size([1, 4])

Also for prototyping, we initialize an embedding layer and use it to get embeddings for our example sequence.

In [33]:
embeddings = torch.nn.Embedding(config.vocab_size, config.embed_dim)
embeddings # [vocab_size, embed_dim]

Embedding(100, 512)

In [35]:
inputs_embeds = embeddings(tokenized_inputs.input_ids)

In [36]:
inputs_embeds.size() # [batch_size, tokenized_sequence_length, embed_dim]

torch.Size([1, 4, 512])

In [37]:
inputs_embeds

tensor([[[ 0.5223, -0.1004,  0.4019,  ..., -0.7209,  0.1582,  0.4808],
         [-1.2000,  1.3288, -1.4894,  ..., -0.5783,  0.6096,  2.3303],
         [ 0.4402,  1.4194, -1.3263,  ..., -0.1755,  0.1818, -0.0594],
         [ 0.0694,  0.7195,  0.3353,  ...,  1.0410,  0.0963,  1.1397]]],
       grad_fn=<EmbeddingBackward0>)

We now implement the key function that carries out essentially the whole self-attention calculation as defined in the "Attention Is All You Need" paper.

We start with the query `Q`, key `K` and value `V` as function inputs. Each is the hidden state representation projected through its own set of weights, trained separately for each of the three. We will implement those projections in a moment; here we just take the values as function arguments.

We then set `d_k`, the dimensionality of the query and key vectors (the last dimension of `Q`). We use it to scale the dot product as specified in the original Transformer paper. If the mockup embeddings above are passed in directly it is 512; inside an attention head it equals the head dimensionality.

`QK_T` is the matrix of dot products between the `Q` and `K` vectors, measuring the similarity between every pair of positions. We scale these scores by dividing by the square root of `d_k` to avoid huge values for high-dimensional representations.

`attention_weights` is the softmax of the scaled `QK_T` scores, which normalizes the weights so that each lies between 0 and 1 and each row sums to 1.

Finally, we calculate the `attention` value itself, which is the matrix product of the attention weights and the value vectors `V`, i.e. a weighted sum of the values for each position.
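
Written as a single formula in the paper's notation, the steps above are:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Here the softmax is applied row-wise, so each query position gets its own set of weights over all key positions.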


In [52]:
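import numpy as np

def attention_Q_K_V(Q, K, V):
    # Q and K: [batch_size, sequence_length, d_k]; V: [batch_size, sequence_length, d_v]
    d_k = Q.size(-1)
    # Scaled similarity scores: [batch_size, sequence_length, sequence_length]
    QK_T = torch.bmm(Q, K.transpose(1, 2)) / np.sqrt(d_k)
    # Softmax over the key dimension, so each row of weights sums to 1
    attention_weights = torch.nn.functional.softmax(QK_T, dim=-1)
    # Weighted sum of the value vectors: [batch_size, sequence_length, d_v]
    attention = torch.bmm(attention_weights, V)
    return attention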
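
As a sanity check, the function can be compared against PyTorch's built-in scaled dot-product attention. This is a sketch under two assumptions: PyTorch 2.0 or newer is installed (that is where `torch.nn.functional.scaled_dot_product_attention` is available), and random data stands in for `inputs_embeds`. The function is repeated, with `math.sqrt` in place of `np.sqrt`, only so that the block runs on its own.

```python
import math

import torch
import torch.nn.functional as F


def attention_Q_K_V(Q, K, V):
    # Same computation as the cell above, repeated so this block runs alone
    d_k = Q.size(-1)
    QK_T = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    attention_weights = F.softmax(QK_T, dim=-1)
    return torch.bmm(attention_weights, V)


torch.manual_seed(0)
x = torch.randn(1, 4, 512)  # stand-in for inputs_embeds: [batch, seq_len, embed_dim]

out = attention_Q_K_V(x, x, x)
reference = F.scaled_dot_product_attention(x, x, x)  # assumes PyTorch >= 2.0

print(out.shape)  # torch.Size([1, 4, 512])
print(torch.allclose(out, reference, atol=1e-4))
```

The output keeps the input shape, and agreement with the built-in suggests the scaling and softmax axis are right.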

We initialize the attention head with the query `Q`, key `K` and value `V` projections as linear (dense) layers that map from the embedding dimensionality to the head dimensionality.

Then on the forward step we feed each of the three the same current hidden state. Because the `Q`, `K` and `V` layers have their own separately trained weights, the resulting query, key and value differ from each other even though they are generated from the same hidden state.

We then pass these values to the attention function and return the result on each `Attention` forward step.

In [59]:
class Attention(torch.nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.Q = torch.nn.Linear(embed_dim, head_dim)
        self.K = torch.nn.Linear(embed_dim, head_dim)
        self.V = torch.nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attention = attention_Q_K_V(self.Q(hidden_state), self.K(hidden_state), self.V(hidden_state))
        return attention
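
A quick shape check shows what one head produces. This is a sketch, not the final design: it assumes the head dimensionality is `embed_dim // num_attention_heads` (64 here, the split used in the paper), uses random data in place of `inputs_embeds`, and restates the class compactly so the block runs on its own. The concatenation at the end is only a preview of what a multi-head layer would do.

```python
import math

import torch


class Attention(torch.nn.Module):
    """Single attention head, restated from the cell above so this runs alone."""

    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.Q = torch.nn.Linear(embed_dim, head_dim)
        self.K = torch.nn.Linear(embed_dim, head_dim)
        self.V = torch.nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        Q, K, V = self.Q(hidden_state), self.K(hidden_state), self.V(hidden_state)
        scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(Q.size(-1))
        return torch.bmm(torch.softmax(scores, dim=-1), V)


embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads  # assumption: equal split across heads, 64 here

hidden_state = torch.randn(1, 4, embed_dim)  # stand-in for inputs_embeds
single_head_out = Attention(embed_dim, head_dim)(hidden_state)
print(single_head_out.shape)  # torch.Size([1, 4, 64])

# Preview only: concatenating the outputs of all heads restores embed_dim
heads = torch.nn.ModuleList([Attention(embed_dim, head_dim) for _ in range(num_heads)])
concatenated = torch.cat([head(hidden_state) for head in heads], dim=-1)
print(concatenated.shape)  # torch.Size([1, 4, 512])
```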

In [None]:
# NEXT: Implement MultiHeadAttention class

In [None]:
# TODO: Implement feed forward layer

In [None]:
# TODO: Implement residual layers connections

In [None]:
# TODO: Implement positional encodings

In [None]:
# TODO: Put it all together and add classification (?) head