## Intro - remove "we" and an "I"

Goal: Build an LLM from scratch (not pre-trained) and train it to learn basic integer math: addition, subtraction and multiplication.
I want to check how well an LLM trained from scratch, like this one, can learn math. It is not a symbolic system; it works on tokens. Can tokens be used for math calculations? Further steps:
1) Maybe the embeddings that represent tokens are in fact a kind of symbolic representation? This is worth checking: how the embeddings of numbers and operations behave and how they relate to each other.
2) If vanilla LLMs with token embeddings prove not very good at math, can they be extended so that symbolic math operations arise within the model? I mean LLMs with internal representations with a flavour of Wolfram Alpha/Mathematica.

The motivation is that LLMs are, and increasingly will be, used for math-heavy problems. For example, Microsoft recently started offering "Copilot for Finance" based on ChatGPT. I imagine it is token-based and wonder how reliable its math side can be. And maybe: how could such models be modified, both in their architecture and in the way they represent things like math, to be better at math? For now the scope is elementary operations: integer addition, subtraction and multiplication.

Let's do it :)


## Plan
1. ~~Generate and process a small initial dataset to be able to perform initial model training~~
2. Format the dataset, save it as a Dataset and push it to the HF Hub.
3. Create a custom tokenizer for our dataset
4. ~~Design and implement model architecture~~
5. Design the training objective (e.g. with masking as in the original Transformer paper, or without masking)
6. Train a model on GPU
7. Gather and process a very large dataset (multiple operations beyond addition, more complex expression structure, etc.)
8. Iterate on the other steps and refine the model


## Questions

* Should I build a Transformer encoder model or a Transformer decoder model?

* How to generate math data?


* Should the math be represented as plain strings or as lists of tokens, like in NER/POS tagging?
* I think I want to build an encoder-based transformer model to capture deeper bidirectional relations.
* Or maybe keep it a decoder model for better result generation?
* I think an encoder model is better here because I want the model to analyze all the relations at once and bidirectionally, unlike a generative model.


* If I implement (more or less) the encoder part of the original "Attention Is All You Need" model, how many parameters will it have? Will it have billions? How many parameters does the model in the paper have?
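A rough answer to the parameter question can be worked out by hand. The sketch below counts the weights of an encoder stack, assuming biases on every linear layer, two layer norms per layer, and no embedding tables (their size depends on the vocabulary, which is not designed yet). The function name `encoder_param_count` is made up for this note. The paper reports about 65M parameters for its full base model (encoder, decoder and embeddings) and about 213M for the big one, so the encoder alone should land well below that.

```python
def encoder_param_count(d_model: int, d_ff: int, n_layers: int) -> int:
    """Count trainable parameters of a Transformer encoder stack (embeddings excluded)."""
    # Multi-head attention: Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward network: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return n_layers * (attention + feed_forward + layer_norms)


# Base configuration from the paper: d_model=512, d_ff=2048, 6 layers.
base = encoder_param_count(d_model=512, d_ff=2048, n_layers=6)
print(f"{base:,}")  # 18,914,304
```

So under these assumptions the encoder stack has roughly 19M parameters: tens of millions, not billions.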



## Notes to myself
* Build a Transformer-like model for generating math.
* We will see how to work with the huge amounts of data we want.
* We will need to generate huge amounts of data.
* I will explore the pretraining step and how to train a transformer model from scratch.
* We will have to cover the following steps:
    * Gathering and processing a very large dataset
    * Creating a custom tokenizer for our dataset
    * Training a model on GPU
* To train such a large model we will probably use distributed training with the Hugging Face Accelerate library for PyTorch.
* I will probably use a notebook to prototype and experiment, but the final preprocessing and training code should probably live in a script that can run on multiple GPUs. We'll see.
* For an example of such a script, check out the Transformers repository.


## 1. Generating and processing a small initial dataset

* For now I can train an initial model on a very small and structurally simple experimental dataset.
* I think I will initially train the model on addition of numbers from 0 to 99.
* The model will also have to learn what these numbers mean.
* Starting with such a simple example lets me set aside engineering the training dataset for now and focus on coding the model, the training pipeline and evaluation.

Let's generate all four-element combinations of four arrays containing the numbers from 0 to 99.

That's 100M (100^4) initial addition records to teach our model how to add numbers from 0 to 99.

In [7]:
import numpy as np

nums = np.stack(np.meshgrid(np.arange(100), np.arange(100), np.arange(100), np.arange(100)), -1).reshape(-1, 4)

In [8]:
nums

array([[ 0,  0,  0,  0],
       [ 0,  0,  0,  1],
       [ 0,  0,  0,  2],
       ...,
       [99, 99, 99, 97],
       [99, 99, 99, 98],
       [99, 99, 99, 99]])
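Before trusting the 100M-row array, the same `meshgrid` trick can be checked on a tiny range. This is a minimal sketch; the helper name `all_combinations` and the choice of 3 values are mine, picked only so the result is small enough to inspect.

```python
import numpy as np


def all_combinations(n_values: int, n_operands: int) -> np.ndarray:
    """Return every n_operands-tuple over range(n_values), one tuple per row."""
    grids = np.meshgrid(*[np.arange(n_values)] * n_operands)
    return np.stack(grids, -1).reshape(-1, n_operands)


small = all_combinations(n_values=3, n_operands=4)
print(small.shape)                    # (81, 4), i.e. 3**4 rows
print(len(np.unique(small, axis=0)))  # 81, so no row is repeated
```

With 100 values the same call yields 100**4 rows. If NumPy's default integer type is 64-bit, that array takes about 3.2 GB of memory, so a smaller dtype may be worth considering.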

We will now create a DataFrame with these quadruples and the sum of each row. We will save it as a CSV file to serve as our temporary model training dataset.

In [9]:
import pandas as pd

sums = pd.DataFrame({"a": nums[:, 0], "b": nums[:, 1], "c": nums[:, 2], "d": nums[:, 3], "sum": nums[:, 0]+nums[:, 1]+nums[:, 2]+nums[:, 3]})

In [10]:
sums

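| | a | b | c | d | sum |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 0 | 2 | 2 |
| 3 | 0 | 0 | 0 | 3 | 3 |
| 4 | 0 | 0 | 0 | 4 | 4 |
| ... | ... | ... | ... | ... | ... |
| 99999995 | 99 | 99 | 99 | 95 | 392 |
| 99999996 | 99 | 99 | 99 | 96 | 393 |
| 99999997 | 99 | 99 | 99 | 97 | 394 |
| 99999998 | 99 | 99 | 99 | 98 | 395 |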


In [12]:
sums.to_csv("./data/sums_4v_0_99.csv.gz", index=False)

That small initial dataset of sums of 4 numbers ranging from 0 to 99 takes 1.5GB (400MB compressed) for this one simple operation alone. We saved it to a CSV file for later use elsewhere.

Let's treat this as our base training dataset to teach our simplest model how to add numbers from 0 to 99. 

## 2. Preprocess data and save to hub


In [None]:
# We will implement it next - decided to first implement model architecture so we can fit the shape of dataset to the final architecture design
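Until the real preprocessing is written, here is one possible shape for it: each row becomes an expression string plus a result string. The function name `format_example`, the field names and the `a+b+c+d` layout are placeholders of mine, not a decided format. The resulting records could later be wrapped with `datasets.Dataset.from_pandas` and uploaded with `push_to_hub`.

```python
def format_example(a: int, b: int, c: int, d: int) -> dict:
    """Turn one row of operands into an expression string and its result string."""
    return {"expression": f"{a}+{b}+{c}+{d}", "result": str(a + b + c + d)}


# Two sample rows taken from the ends of the generated dataset.
rows = [(0, 0, 0, 1), (99, 99, 99, 98)]
examples = [format_example(*row) for row in rows]
print(examples[1])  # {'expression': '99+99+99+98', 'result': '395'}
```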

## 3. Create a custom tokenizer for our dataset

In [1]:
# We will implement it next - decided to first implement model architecture so we can fit the tokenizer's nuts and bolts to the final architecture design
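As a placeholder, a character-level tokenizer is probably the simplest thing that could work for this data, since the whole vocabulary is digits plus a few operators. The class `CharTokenizer`, its special tokens and the symbol set below are assumptions of mine for illustration; the final tokenizer may look quite different.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer for arithmetic expressions."""

    def __init__(self, symbols: str = "0123456789+-*="):
        specials = ["[PAD]", "[MASK]"]
        self.id_to_token = specials + list(symbols)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text: str) -> list[int]:
        # Spaces are dropped; any other unknown character raises KeyError.
        return [self.token_to_id[ch] for ch in text if ch != " "]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_token[i] for i in ids)


tok = CharTokenizer()
ids = tok.encode("12+7=19")
print(ids)              # [3, 4, 12, 9, 15, 3, 11]
print(tok.decode(ids))  # 12+7=19
```

One design question this raises: tokenizing digit by digit gives a vocabulary of 16 entries, while one token per number (0 to 99) gives a larger vocabulary but shorter sequences.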

## 4. Design and implement model architecture

Because I want to build something simple, as far as transformers go, I decided to use the original Transformer architecture presented in the "Attention Is All You Need" paper for this experiment. More specifically, only its encoder part.

We need only the encoder for two reasons. First, we are not solving a seq2seq problem like the original paper did. Second, I think we should care more about building a model that creates a rich numerical representation of the relations in the input data (e.g. between tokens, mathematical relations, etc.), and that is what the encoder does best. The decoder is more one-directional because of its masking and puts more emphasis on sequence generation, which is natural given that it produces the second "seq" in "seq2seq".

(Side note: I now think this problem could also be posed as seq2seq: the mathematical expression is the input sequence (e.g. "2 + 2") and its result is the output sequence (e.g. "4"). This would model basic math operations as a kind of translation problem, which in a way it is. Just an idea for a later implementation; we stick to the encoder-only architecture for now.)

### 4.1 Self-attention

In [None]:
# We will use BERT as initial example
from transformers import AutoTokenizer, AutoModel

CHECKPOINT = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)

In [1]:
text = "this is example text"

In [2]:
# We tokenize the text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

In [None]:
inputs

In [3]:
# We now build dense embeddings.
# We will use model parameter for it.
from torch import nn
from transformers import AutoConfig

# TODO: Remove using BERT and config it ourselves.
# We load bert-base-uncased config.json - to get the config params.
# Here each input ID is mapped to one of 30522 embedding vectors stored in nn.Embedding
# Each embedding has 768 dimensions
config = AutoConfig.from_pretrained(CHECKPOINT)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [None]:
# We created a lookup table that translates each input ID into its corresponding embedding.
# We do it by feeding the input IDs to the embedding lookup.
input_embeds = token_emb(inputs.input_ids)
input_embeds.size() # [batch_size, seq_len, hidden_dim] -> [1, 4, 768]

In [4]:
# TODO:
# We should now implement and add the positional embeddings component.
# We skip it for now for simplicity and come back to it.
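For when I come back to this: the original paper uses fixed sinusoidal position encodings (BERT uses learned position embeddings instead). A minimal NumPy sketch of the sinusoidal variant follows, assuming an even `d_model`; the function name `sinusoidal_positions` is my own.

```python
import numpy as np


def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a [seq_len, d_model] matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]        # [seq_len, 1]
    even_dims = np.arange(0, d_model, 2)[None, :]  # [1, d_model / 2]
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return encoding


pos_enc = sinusoidal_positions(seq_len=4, d_model=768)
print(pos_enc.shape)  # (4, 768)
```

The result could then be added to the token embeddings, for example after converting it with `torch.from_numpy(pos_enc).float()`.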

In [None]:
# We now focus on implementing the query, key and values representations.
import torch
from math import sqrt

# TODO:
# We use a vast simplification here: Q, K and V are all equal to the input embeddings.
# In the real model each of them is produced by its own learned linear projection.
# We will implement these later.

query = key = value = input_embeds

In [None]:
# We now calculate the attention scores: dot products between queries and keys, which serve as the similarity measure in the Transformer architecture.
# Dividing by the square root of the key dimensionality is a scaling step: without it the dot products grow large in high dimensions
#  and push the softmax into regions with very small gradients.
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
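The remaining steps of scaled dot-product attention can be sketched as follows: a softmax turns the scores into weights, and the weights average the value vectors. This version also swaps in separate linear projections for Q, K and V instead of the `query = key = value` shortcut. It is a single-head sketch on random stand-in embeddings, with layer names (`q_proj`, `k_proj`, `v_proj`) chosen by me, not the final implementation.

```python
from math import sqrt

import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, hidden_dim = 1, 4, 768
input_embeds = torch.randn(batch_size, seq_len, hidden_dim)  # stand-in for real embeddings

# Separate learned projections for queries, keys and values (single head).
q_proj = nn.Linear(hidden_dim, hidden_dim)
k_proj = nn.Linear(hidden_dim, hidden_dim)
v_proj = nn.Linear(hidden_dim, hidden_dim)
query, key, value = q_proj(input_embeds), k_proj(input_embeds), v_proj(input_embeds)

# Scaled dot-product scores, then softmax over the key axis so each row sums to 1.
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(key.size(-1))
weights = F.softmax(scores, dim=-1)

# Each output vector is a weighted average of the value vectors.
attn_output = torch.bmm(weights, value)
print(weights.shape, attn_output.shape)
```

Multi-head attention would repeat this with smaller per-head projections and concatenate the outputs.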