This repo contains code to train a GPT-style model from scratch. The training data is sampled from the RedPajama 1T dataset; only a subset of the full corpus is used for training. The transformer implementation is similar to LitGPT's.
The trained model has roughly 160M parameters. The final training loss was 3.2154.
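As a rough sanity check on the ~160M figure, the sketch below estimates the parameter count of a GPT-style transformer from its dimensions. The configuration used (12 layers, 768-wide embeddings, GPT-2-style vocabulary, untied LM head) is an assumption for illustration only, not the repo's actual config.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size, tied_head=False):
    """Approximate parameter count of a GPT-style decoder-only transformer."""
    # Token and learned positional embeddings
    emb = vocab_size * n_embd + block_size * n_embd
    # Per block: attention (QKV + output projection, with biases)
    attn = 4 * n_embd * n_embd + 4 * n_embd
    # Per block: MLP with 4x expansion (two linear layers, with biases)
    mlp = 8 * n_embd * n_embd + 5 * n_embd
    # Per block: two LayerNorms (scale + shift each)
    ln = 4 * n_embd
    blocks = n_layer * (attn + mlp + ln)
    final_ln = 2 * n_embd
    # A separate (untied) LM head adds another vocab_size x n_embd matrix
    head = 0 if tied_head else vocab_size * n_embd
    return emb + blocks + final_ln + head

# Hypothetical config, chosen only because it lands near 160M parameters
total = gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257,
                        block_size=1024, tied_head=False)
print(f"{total / 1e6:.1f}M")  # ~163M with an untied head; ~124M if tied
```

With tied input/output embeddings the same dimensions give roughly 124M parameters (the GPT-2 small figure), so whether the head is tied makes a large difference at this scale.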
The Hugging Face implementation can be found here.
Training details are in the attached notebooks. The first training phase was stopped when the loss reached about 4. Training was then resumed from that checkpoint and stopped once the loss dropped below 3.5.
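The two-phase training described above relies on checkpointing. The sketch below shows a minimal PyTorch save/resume pattern; the key names (`"model"`, `"optimizer"`, `"step"`), the checkpoint path, and the stand-in model are illustrative assumptions, not necessarily what the repo's notebooks use.

```python
import torch
import torch.nn as nn

# Stand-in model; the real repo trains a GPT, but the pattern is identical
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# End of phase 1: save weights AND optimizer state, plus the step counter,
# so AdamW's moment estimates survive the restart
ckpt = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 10_000,  # hypothetical step at which phase 1 stopped
}
torch.save(ckpt, "checkpoint.pt")

# Start of phase 2: restore everything and continue from the saved step
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"]
```

Restoring the optimizer state matters here: reinitializing AdamW at resume time would reset its running gradient statistics and typically causes a temporary loss spike.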