TinyMoE

This model pushes the performance of small models as far as possible. We aim to create a model that is both large and small at the same time.

Large to take advantage of:

  • large-scale pretraining corpora
  • the large amounts of GPU memory available at inference time

Small to take advantage of:

  • fast inference paths

To accomplish this we take inspiration from a few recent models (Mixtral and DeepSeek-MoE), mainly:

  • MoE (Mixture of Experts, which increases model size without using all parameters at inference) [0]
  • Grouped Query Attention (fewer shared KV heads to increase attention efficiency) [1]
  • Expert specialization (more efficient experts) [2]
  • Per-layer configuration of sliding-window attention and grouped-query attention sizes: we use lots of early layers with smaller windows and fewer attention heads for speed, and a few layers of denser global attention (see the sketch after this list)
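The per-layer configuration idea can be sketched as a list of layer settings: most layers use a small sliding window and aggressive KV-head sharing, and a handful of final layers use dense global attention. The class, field names, layer count, and the specific window/head numbers below are illustrative assumptions, not the repository's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class LayerConfig:
    sliding_window: int | None  # local attention window size; None means global attention
    num_heads: int              # query heads
    num_kv_heads: int           # shared KV heads for grouped-query attention

def build_layer_configs(num_layers: int = 24, num_global: int = 4) -> list[LayerConfig]:
    """Mostly cheap local-attention layers, with a few dense global layers at the end."""
    configs = []
    for i in range(num_layers):
        if i >= num_layers - num_global:
            # A few layers of denser global attention with more heads.
            configs.append(LayerConfig(sliding_window=None, num_heads=16, num_kv_heads=4))
        else:
            # Early layers: small window, fewer heads, heavy KV sharing for speed.
            configs.append(LayerConfig(sliding_window=512, num_heads=8, num_kv_heads=2))
    return configs
```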

We aim for 440M active parameters and 5B trainable parameters. Ideally the model runs at GPT-2-medium speeds for inference. We are targeting 40GB cards for serving inference and 24GB cards for quantized inference.
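As a rough illustration of how an MoE separates active from trainable parameters: with top-k routing only k of the E expert FFNs run per token, so active parameters ≈ non-expert parameters + k expert FFNs, while the trainable total includes all E. The expert counts and sizes below are hypothetical numbers chosen only to land near the stated 440M/5B targets, not the model's actual configuration.

```python
def moe_param_counts(dense_params: float, expert_params_each: float,
                     num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total trainable, active per token) parameter counts for a top-k MoE.

    dense_params: parameters outside the expert FFNs (attention, embeddings, router, ...)
    expert_params_each: parameters of one expert FFN, summed across all MoE layers
    """
    total = dense_params + num_experts * expert_params_each
    active = dense_params + top_k * expert_params_each
    return total, active

# Hypothetical sizes: ~0.3B dense, 64 experts of ~73M each, top-2 routing.
total, active = moe_param_counts(dense_params=0.3e9, expert_params_each=73e6,
                                 num_experts=64, top_k=2)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e6:.0f}M")
# total ≈ 5.0B, active ≈ 446M
```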

Currently

  • Model architecture done
  • Tuning the model architecture hyperparameters for inference speed
  • Train simple variants on efficient web and synthetic data
  • Train the model on 1T+ tokens
