This project pushes the performance of small models as far as possible. We aim to create a model that is both large and small at the same time.
Large, to take advantage of:
- large-scale pretraining data corpora
- the large amounts of GPU memory available at inference time
Small, to take advantage of:
- fast inference paths
To accomplish this we take inspiration from a few recent models (Mixtral and DeepSeek-MoE), mainly:
- Mixture of Experts (MoE): increases model size without activating all parameters at inference
- Grouped-Query Attention (GQA): fewer key/value heads shared across query heads, for more efficient attention and a smaller KV cache
- Expert specialization: more efficient experts
- Per-layer configuration of sliding-window attention and GQA sizes: many early layers with smaller windows and fewer attention heads for speed, plus a few layers of denser global attention (see the sketch below)
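As a rough sketch of how these pieces fit together (all names, sizes, and the local/global layer split below are hypothetical placeholders, not the project's actual hyperparameters), in PyTorch:

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class LayerConfig:
    """Attention settings for one transformer layer (hypothetical fields)."""
    sliding_window: int | None  # local attention window; None = global attention
    n_heads: int                # number of query heads
    n_kv_heads: int             # shared key/value heads (GQA)


def make_layer_schedule(n_layers: int = 24, n_global: int = 4) -> list[LayerConfig]:
    """Many cheap local layers plus a few dense global layers at the end.

    The split and sizes are placeholders, not tuned values.
    """
    local = LayerConfig(sliding_window=512, n_heads=8, n_kv_heads=2)
    global_ = LayerConfig(sliding_window=None, n_heads=16, n_kv_heads=8)
    return [local] * (n_layers - n_global) + [global_] * n_global


class Top2Gate(nn.Module):
    """Minimal top-2 MoE gate: each token activates only 2 of n_experts FFNs,
    so active parameters stay small while total parameters grow with n_experts."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        logits = self.gate(x)                         # (tokens, n_experts)
        weights, expert_idx = logits.topk(2, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)          # normalize over the chosen 2
        return weights, expert_idx
```

The many local layers keep attention cost and KV-cache size small, while the gate keeps per-token compute near the active-parameter budget even as total parameters grow.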
We aim for 440M active parameters and 5B trainable parameters, ideally running at GPT-2-medium speeds for inference. We target 40 GB cards for serving and 24 GB cards for quantized inference.
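A back-of-the-envelope check on those serving targets (a sketch assuming fp16 weights and 4-bit quantized weights; it ignores KV cache, activations, and runtime overhead):

```python
TOTAL_PARAMS = 5e9     # trainable parameters
ACTIVE_PARAMS = 440e6  # parameters active per token

fp16_gb = TOTAL_PARAMS * 2 / 1024**3    # ~9.3 GiB of weights at 2 bytes/param
int4_gb = TOTAL_PARAMS * 0.5 / 1024**3  # ~2.3 GiB of weights at 4 bits/param

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"fp16 weights: {fp16_gb:.1f} GiB -> leaves headroom on a 40 GB card")
print(f"4-bit weights: {int4_gb:.1f} GiB -> fits comfortably on a 24 GB card")
```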
Current status:
- Model architecture: done
- Tuning the model architecture hyperparameters for inference speed
- Training simple variants on efficient web and synthetic data
- Training the model on 1T+ tokens