
# JetMoe

## Overview

JetMoe-8B is an 8B-parameter Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to deliver LLaMA2-level performance with an efficient language model trained on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. Given the input tokens, each layer activates only a subset of its experts to process them. This sparse activation scheme enables JetMoe to achieve much higher training throughput than dense models of a similar size: JetMoe-8B trains at around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.

This model was contributed by Yikang Shen.
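The snippet below is a minimal text-generation sketch using the standard `transformers` auto classes. The Hub checkpoint name `jetmoe/jetmoe-8b` is an assumption about where the released weights live; substitute the actual checkpoint if it differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub checkpoint name is an assumption; replace with the actual JetMoe checkpoint if needed.
model_id = "jetmoe/jetmoe-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Encode a prompt and generate a short continuation.
inputs = tokenizer("The Mixture-of-Experts architecture", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```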

## JetMoeConfig

[[autodoc]] JetMoeConfig
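As a quick illustration of how the configuration ties into the model classes, the sketch below builds a randomly initialized model from a default `JetMoeConfig`; the default values are the library's, not necessarily those of the released JetMoe-8B checkpoint.

```python
from transformers import JetMoeConfig, JetMoeModel

# Build a configuration with the library defaults (not necessarily the JetMoe-8B settings).
configuration = JetMoeConfig()

# Instantiate a randomly initialized model from that configuration.
model = JetMoeModel(configuration)

# The configuration can be read back from the model.
configuration = model.config
```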

## JetMoeModel

[[autodoc]] JetMoeModel
    - forward

## JetMoeForCausalLM

[[autodoc]] JetMoeForCausalLM
    - forward

## JetMoeForSequenceClassification

[[autodoc]] JetMoeForSequenceClassification
    - forward