This project pushes the performance of small models as far as possible. We aim to create a model that is both large and small at the same time.
Large, to take advantage of:
- large-scale pretraining data corpora
- the large amounts of GPU memory available at inference time
Small, to take advantage of:
- fast inference paths
To accomplish this we take inspiration from a few recent models (Mixtral and DeepSeek-MoE), mainly:
- Mixture of Experts (MoE): increases model size without activating all parameters at inference
- Grouped-Query Attention (GQA): fewer key/value heads shared across query heads, for more efficient attention and a smaller KV cache
- Expert specialization: more efficient experts
- Per-layer configuration of sliding-window attention and GQA sizes: many early layers with smaller windows and fewer attention heads for speed, plus a few layers of denser global attention (see the sketch below)
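As a rough sketch of how these pieces fit together (all names, sizes, and the local/global layer split below are hypothetical placeholders, not the project's actual hyperparameters), in PyTorch:

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class LayerConfig:
    """Attention settings for one transformer layer (hypothetical fields)."""
    sliding_window: int | None  # local attention window; None = global attention
    n_heads: int                # number of query heads
    n_kv_heads: int             # shared key/value heads (GQA)


def make_layer_schedule(n_layers: int = 24, n_global: int = 4) -> list[LayerConfig]:
    """Many cheap local layers plus a few dense global layers at the end.

    The split and sizes are placeholders, not tuned values.
    """
    local = LayerConfig(sliding_window=512, n_heads=8, n_kv_heads=2)
    global_ = LayerConfig(sliding_window=None, n_heads=16, n_kv_heads=8)
    return [local] * (n_layers - n_global) + [global_] * n_global


class Top2Gate(nn.Module):
    """Minimal top-2 MoE gate: each token activates only 2 of n_experts FFNs,
    so active parameters stay small while total parameters grow with n_experts."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        logits = self.gate(x)                         # (tokens, n_experts)
        weights, expert_idx = logits.topk(2, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)          # normalize over the chosen 2
        return weights, expert_idx
```

The many local layers keep attention cost and KV-cache size small, while the gate keeps per-token compute near the active-parameter budget even as total parameters grow.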
We aim for 440M active parameters and 5B trainable parameters, ideally running at GPT-2-medium speeds for inference. We target 40 GB cards for serving and 24 GB cards for quantized inference.
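A back-of-the-envelope check on those serving targets (a sketch assuming fp16 weights and 4-bit quantized weights; it ignores KV cache, activations, and runtime overhead):

```python
TOTAL_PARAMS = 5e9     # trainable parameters
ACTIVE_PARAMS = 440e6  # parameters active per token

fp16_gb = TOTAL_PARAMS * 2 / 1024**3    # ~9.3 GiB of weights at 2 bytes/param
int4_gb = TOTAL_PARAMS * 0.5 / 1024**3  # ~2.3 GiB of weights at 4 bits/param

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
print(f"fp16 weights: {fp16_gb:.1f} GiB -> leaves headroom on a 40 GB card")
print(f"4-bit weights: {int4_gb:.1f} GiB -> fits comfortably on a 24 GB card")
```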
Current status:
- Model architecture: done
- Tuning the model architecture hyperparameters for inference speed
- Training simple variants on efficient web and synthetic data
- Training the model on 1T+ tokens