
# JetMoe

## Overview

JetMoe-8B is an 8B-parameter Mixture-of-Experts (MoE) language model developed by Yikang Shen and MyShell. The JetMoe project aims to deliver LLaMA2-level performance with an efficient language model trained on a limited budget. To achieve this goal, JetMoe uses a sparsely activated architecture inspired by ModuleFormer. Each JetMoe block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. Given the input tokens, each layer activates only a subset of its experts to process them. This sparse activation scheme enables JetMoe to achieve much higher training throughput than dense models of a similar size: JetMoe-8B trains at around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.

This model was contributed by Yikang Shen.
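The snippet below is a minimal text-generation sketch using the standard `transformers` auto classes. The Hub checkpoint name `jetmoe/jetmoe-8b` is an assumption about where the released weights live; substitute the actual checkpoint if it differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub checkpoint name is an assumption; replace with the actual JetMoe checkpoint if needed.
model_id = "jetmoe/jetmoe-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Encode a prompt and generate a short continuation.
inputs = tokenizer("The Mixture-of-Experts architecture", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```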

## JetMoeConfig

[[autodoc]] JetMoeConfig
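As a quick illustration of how the configuration ties into the model classes, the sketch below builds a randomly initialized model from a default `JetMoeConfig`; the default values are the library's, not necessarily those of the released JetMoe-8B checkpoint.

```python
from transformers import JetMoeConfig, JetMoeModel

# Build a configuration with the library defaults (not necessarily the JetMoe-8B settings).
configuration = JetMoeConfig()

# Instantiate a randomly initialized model from that configuration.
model = JetMoeModel(configuration)

# The configuration can be read back from the model.
configuration = model.config
```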

## JetMoeModel

[[autodoc]] JetMoeModel
    - forward

## JetMoeForCausalLM

[[autodoc]] JetMoeForCausalLM
    - forward

## JetMoeForSequenceClassification

[[autodoc]] JetMoeForSequenceClassification
    - forward