## Intro

Goal: build an LLM from scratch (not pre-trained) and train it to do basic integer math: addition, subtraction and multiplication. 
I want to check how well an LLM like this one, trained from scratch, can learn math. It is not a symbolic system; it works on tokens. Can tokens be used for math calculations? Further steps: 
1) Maybe the embeddings that represent tokens are in fact a kind of symbolic representation? This is worth checking: how the embeddings of numbers and operations behave and how they relate to each other.
2) If vanilla LLMs with token embeddings prove not very good at math, can we extend them so that symbolic math operations arise inside the model? I mean creating LLMs whose internal representations have a flavour of Wolfram Alpha/Mathematica.

The motivation is that LLMs are, and will increasingly be, used for mathy problems. For example, Microsoft recently started offering "Copilot for Finance" based on ChatGPT. I imagine that it is token based and wonder how reliable its math can be. And maybe: how could such models be modified, both in their architecture and in the way they represent things like math, to be better at it? For now the focus is on elementary operations: integer addition, subtraction and multiplication.

Let's do it :)


* Build a Transformer-like model for generating math.
* We will see how to work with the huge amounts of data we want.
* We will need to generate huge amounts of data.
* I will explore the pretraining step and how to train a transformer model from scratch.
* We will have to cover the following steps:
    * Gathering and processing a very large dataset
    * Creating a custom tokenizer for our dataset
    * Training a model on GPU
* To train such a large model we will probably use distributed training with the Hugging Face Accelerate library, which builds on PyTorch.
* I will probably use a notebook to prototype and experiment, but the final preprocessing and training code should probably live in a script that can run on multiple GPUs. We'll see.
* For an example of such a script, check out the Transformers repository.



## General plan

* Generate and process a small initial dataset to be able to perform initial model training
* Convert the dataset to the right format, save it as a Dataset and push it to the HF Hub.
* Create a custom tokenizer for our dataset
* Design and implement the architecture of the model to be trained
* Train the model on GPU

* Gather and process a very large dataset
    * multiple operations beyond addition etc.
    * complex structure of operations etc.
* Iterate over the other steps and refine the model


## Problems to solve (messy notebook):

* Should I build a Transformer encoder model or a Transformer decoder model?

* How to generate the math data?


* I should think about whether math should be represented as plain strings or as lists of tokens, as in NER/POS.  
* I think I want to build an encoder-based transformer model to capture deeper bidirectional relations.
* Or maybe keep it a decoder model, which is better at generating results?
* I think an encoder model is better here, as I want the model to analyze all the relations at once and bidirectionally, unlike a generative model.
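To make the two representation options concrete, here is a minimal sketch of the same addition example written once as a plain string and once as a list of tagged tokens in the style of NER/POS data. The function names, the `a+b=c` layout and the `NUM`/`OP`/`EQ` tag set are assumptions made for illustration, not a settled format.

```python
def as_string(a: int, b: int) -> str:
    """Render one addition example as a plain string, e.g. '12+34=46'."""
    return f"{a}+{b}={a + b}"


def as_tagged_tokens(a: int, b: int) -> list[tuple[str, str]]:
    """Render the same example as (token, tag) pairs, NER/POS style.

    The tag names NUM, OP and EQ are hypothetical.
    """
    return [
        (str(a), "NUM"),
        ("+", "OP"),
        (str(b), "NUM"),
        ("=", "EQ"),
        (str(a + b), "NUM"),
    ]


print(as_string(12, 34))
print(as_tagged_tokens(12, 34))
```

The string form leaves all structure to the tokenizer, while the tagged form hands the model explicit roles for each token; which one works better is exactly the open question above.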


* If I implement (more or less) the encoder part of the original "Attention Is All You Need" model, how many parameters will it have? Billions? How many parameters does the model in the paper have?
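A rough answer to the encoder question can be worked out by hand. The sketch below counts the weights in a stack of encoder layers using the base configuration I remember from the paper (`d_model=512`, `d_ff=2048`, 6 layers). It assumes a bias on every linear layer and two LayerNorms per layer, and it leaves out token embeddings and positional encodings. The number of attention heads does not change the count, because the heads split the same `d_model x d_model` projections.

```python
def encoder_param_count(d_model: int = 512, d_ff: int = 2048, n_layers: int = 6) -> int:
    """Count trainable parameters in a stack of Transformer encoder layers.

    Assumes biases on every linear layer and two LayerNorms per layer;
    token embeddings and positional encodings are not counted.
    """
    # Self-attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector of size d_model.
    layer_norms = 2 * (2 * d_model)
    return n_layers * (attention + feed_forward + layer_norms)


print(f"{encoder_param_count():,}")  # 18,914,304
```

So under these assumptions the encoder stack alone is about 19M parameters, far from billions. As far as I recall, the paper lists roughly 65M parameters for its full base model, which also includes the decoder and the embeddings.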



## 1. Building up large training dataset

### 1.1 Xsw

* For now I can train on a very small experimental dataset of 100 samples.
* I think I will initially train the model on addition for numbers from 1 to 100.
* The model will also have to learn what these numbers mean.
* Starting with such a simple example, I can for now skip engineering the training dataset from some external source and focus on coding the model, the training pipeline and the evaluation.
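A tiny experimental dataset like the one described in the list above could be produced with the standard library alone. This is only a sketch: the function name, the two-operand `(a, b, a + b)` layout and the fixed seed are assumptions for illustration.

```python
import random


def make_addition_samples(n_samples: int = 100, low: int = 1, high: int = 100, seed: int = 0):
    """Return n_samples rows of (a, b, a + b) with a and b drawn from [low, high]."""
    rng = random.Random(seed)  # fixed seed so the toy dataset is reproducible
    rows = []
    for _ in range(n_samples):
        a = rng.randint(low, high)  # randint is inclusive on both ends
        b = rng.randint(low, high)
        rows.append((a, b, a + b))
    return rows


samples = make_addition_samples()
print(len(samples), samples[:3])
```

Sampling with replacement can produce duplicate pairs, which is fine for a smoke test of the training pipeline but would matter for a real train/test split.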

Let's generate all four-element combinations drawn from four arrays, each containing the numbers from 0 to 99. 

That's 100M (100^4) initial addition records to teach our model how to add numbers from 0 to 99.
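Before materializing the full grid it helps to check the row count and the raw memory it needs. The sketch below is plain arithmetic; the 8 bytes per value reflects the usual NumPy default integer on 64-bit Linux/macOS, which is an assumption about the platform.

```python
n_values = 100      # each operand ranges over 0..99
n_operands = 4      # a, b, c, d

n_rows = n_values ** n_operands
print(f"{n_rows:,} rows")  # 100,000,000 rows

# Raw size of the operand array for two element widths.
# 8 bytes per value is the usual NumPy default integer on 64-bit Linux/macOS
# (an assumption about the platform); 1 byte is enough for values 0..99.
for name, bytes_per_value in [("int64", 8), ("uint8", 1)]:
    size_gb = n_rows * n_operands * bytes_per_value / 1e9
    print(f"{name}: {size_gb:.1f} GB")
```

If memory gets tight, passing a narrower `dtype` to `np.arange` is one option. The sum column still needs a wider type, though, since four operands of 99 add up to 396, which does not fit in 8 bits.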

In [7]:
import numpy as np

# Cartesian product of four 0..99 ranges: one row per (a, b, c, d) combination.
grids = np.meshgrid(np.arange(100), np.arange(100), np.arange(100), np.arange(100))
nums = np.stack(grids, -1).reshape(-1, 4)

In [8]:
nums

array([[ 0,  0,  0,  0],
       [ 0,  0,  0,  1],
       [ 0,  0,  0,  2],
       ...,
       [99, 99, 99, 97],
       [99, 99, 99, 98],
       [99, 99, 99, 99]])

We will now create a DataFrame with these four operands and the sum of each row. We will save it as a CSV file to serve as our temporary model training dataset.

In [9]:
import pandas as pd

# One column per operand plus the row-wise sum of all four.
sums = pd.DataFrame({"a": nums[:, 0], "b": nums[:, 1], "c": nums[:, 2], "d": nums[:, 3], "sum": nums.sum(axis=1)})

In [10]:
sums

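|          | a   | b   | c   | d   | sum |
|----------|-----|-----|-----|-----|-----|
| 0        | 0   | 0   | 0   | 0   | 0   |
| 1        | 0   | 0   | 0   | 1   | 1   |
| 2        | 0   | 0   | 0   | 2   | 2   |
| 3        | 0   | 0   | 0   | 3   | 3   |
| 4        | 0   | 0   | 0   | 4   | 4   |
| ...      | ... | ... | ... | ... | ... |
| 99999995 | 99  | 99  | 99  | 95  | 392 |
| 99999996 | 99  | 99  | 99  | 96  | 393 |
| 99999997 | 99  | 99  | 99  | 97  | 394 |
| 99999998 | 99  | 99  | 99  | 98  | 395 |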


In [12]:
sums.to_csv("./data/sums_4v_0_99.csv.gz", index=False)

That small initial dataset with sums of 4 numbers ranging from 0 to 99 takes 1.5GB (... when compressed) for this one simple operation alone. We saved it to a CSV file for later use elsewhere.
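For later use, the compressed file does not have to be loaded into memory in one go. Here is a minimal sketch that streams it row by row with the standard library; the function name `iter_sums` is hypothetical, and it assumes the `a,b,c,d,sum` header that `to_csv(..., index=False)` writes.

```python
import csv
import gzip
from typing import Iterator


def iter_sums(path: str) -> Iterator[dict[str, int]]:
    """Yield one row at a time from a gzip-compressed CSV as a dict of ints."""
    # "rt" opens the gzip stream in text mode so csv can read it line by line.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            yield {key: int(value) for key, value in row.items()}
```

Calling `iter_sums("./data/sums_4v_0_99.csv.gz")` in a `for` loop would then visit all rows while holding only one in memory at a time, at the cost of being much slower than a vectorized pandas read.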

Let's treat this as our base training dataset to teach our simplest model how to add numbers from 0 to 99. 

NEXT: Build a Hugging Face Dataset out of it.

In [None]:
tbd

NEXT: Prepare the tokenizer.

In [None]:
tbd
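Until the real tokenizer exists, a character-level one is probably the simplest starting point for this vocabulary. The sketch below is an assumption about what it could look like, not the final design: the class name, the alphabet and the `<pad>`/`<bos>`/`<eos>` special tokens are all hypothetical.

```python
class CharTokenizer:
    """Character-level tokenizer for expressions such as '12+34=46'."""

    def __init__(self, alphabet: str = "0123456789+-*=", specials=("<pad>", "<bos>", "<eos>")):
        # Special tokens come first so that <pad> gets id 0.
        self.specials = set(specials)
        self.id_to_token = list(specials) + list(alphabet)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text: str) -> list[int]:
        """Map a string to ids, wrapped in <bos> ... <eos>."""
        ids = [self.token_to_id[ch] for ch in text]  # raises KeyError on unknown characters
        return [self.token_to_id["<bos>"], *ids, self.token_to_id["<eos>"]]

    def decode(self, ids: list[int]) -> str:
        """Map ids back to a string, dropping special tokens."""
        tokens = (self.id_to_token[i] for i in ids)
        return "".join(tok for tok in tokens if tok not in self.specials)


tokenizer = CharTokenizer()
encoded = tokenizer.encode("1+1=2")
print(encoded)
print(tokenizer.decode(encoded))
```

Splitting numbers into single digits keeps the vocabulary at 17 entries; whether whole numbers should instead be single tokens is one of the things the embedding experiments could help decide.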

NEXT: Prepare the basic Transformer encoder architecture.