Add EXAONE 4.0 model support for Inference V2 #7853
tohtana merged 5 commits into deepspeedai:master
Conversation
Force-pushed from bd52e9d to 400d05a
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 400d05a36a
| """ | ||
| tokens = hidden_states.shape[0] | ||
| local_n_heads = self.n_heads // max(self.tp_size, 1) | ||
| local_n_heads_kv = self.n_heads_kv // max(self.tp_size, 1) |
As EXAONE4 has uneven Q/KV heads (GQA), I think this can produce incorrect results. Shouldn't we use these?

- `self.n_heads_q_local` instead of `self.n_heads // self.tp_size`
- `self.n_heads_kv_local` instead of `self.n_heads_kv // self.tp_size`
@tohtana Thanks for the review! I've updated the code to use `n_heads_q_local` and `n_heads_kv_local`. I'll validate the model with coherent text generation and share the results.
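To illustrate the point raised in the review: under tensor parallelism, a GQA model's query and key/value head counts must be sharded independently, because they are unequal. A minimal sketch (the function name and the example head counts are illustrative, not taken from the DeepSpeed code):

```python
def local_head_counts(n_heads_q: int, n_heads_kv: int, tp_size: int):
    """Per-rank head counts for a GQA model under tensor parallelism.

    With grouped-query attention n_heads_q != n_heads_kv, so each count
    must be divided by tp_size separately; reusing the query head count
    for the KV heads (the bug flagged in review) over-counts local KV heads.
    """
    tp_size = max(tp_size, 1)
    assert n_heads_q % tp_size == 0 and n_heads_kv % tp_size == 0
    return n_heads_q // tp_size, n_heads_kv // tp_size

# e.g. 40 query heads sharing 8 KV heads, split across 4 TP ranks
print(local_head_counts(40, 8, 4))  # -> (10, 2)
```

If the KV count were derived from `n_heads_q`, each rank would expect 10 KV heads here instead of 2, corrupting the attention layout.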
Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Use n_heads_q_local and n_heads_kv_local for GQA compatibility
Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Force-pushed from fced31a to 8ece3c1
## Summary

Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed Inference V2.

Closes deepspeedai#7453

## Changes

- New model implementation: `deepspeed/inference/v2/model_implementations/exaone4/`
  - `container.py`: Transformer and non-transformer parameter containers
  - `model.py`: Inference model with post-norm architecture and QK-Norm support
  - `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`

## Key architectural differences from Mistral/Llama

- **Post-norm**: RMSNorm is applied after attention/MLP outputs (not before), followed by residual addition
- **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the QKV linear layer
- **Hybrid attention**: the 32B model uses a 3:1 sliding window/full attention ratio (via the `layer_types` config)

## Supported models

- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B) (all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B) (hybrid sliding/full attention)

Requires `transformers >= 4.54.0`.

## Related

- Supersedes deepspeedai#7456 (draft, inactive for 6 months)

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
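The QK-Norm described in the summary amounts to an RMSNorm applied independently over each head's feature dimension, after the QKV projection is split into heads. A minimal NumPy illustration (the shapes, function name, and eps value are assumptions for exposition, not the DeepSpeed kernel):

```python
import numpy as np

def per_head_rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm applied independently to each attention head.

    x:      (tokens, n_heads, head_dim) -- Q or K after the QKV linear
            layer, already reshaped into heads
    weight: (head_dim,) -- learned scale, shared across heads
    """
    # Normalize over the last axis only, so each (token, head) slice
    # is scaled by its own root-mean-square.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

tokens, n_heads, head_dim = 4, 8, 64
q = np.random.randn(tokens, n_heads, head_dim).astype(np.float32)
q_normed = per_head_rmsnorm(q, np.ones(head_dim, dtype=np.float32))
```

After normalization each head of `q_normed` has (approximately) unit RMS, which is the property QK-Norm relies on to keep attention logits well-scaled.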

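As a footnote on the hybrid attention scheme from the summary: a 3:1 sliding-window/full-attention ratio means every fourth layer attends over the full context. A sketch of how such a `layer_types` list could be laid out (the helper and the string values are illustrative assumptions, not the exact HuggingFace/DeepSpeed schema):

```python
def build_layer_types(num_layers: int, ratio: int = 3):
    """Every (ratio+1)-th layer uses full attention; the rest use a sliding window."""
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "sliding_attention"
        for i in range(num_layers)
    ]

# First 8 layers of a 3:1 hybrid layout
print(build_layer_types(8))
# -> ['sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention',
#     'sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention']
```

An inference engine consuming such a config would dispatch each layer's KV cache and attention kernel based on its entry, which is why the 32B model needs per-layer handling while the 1.2B model (all full attention) does not.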