Add EXAONE 4.0 model support for Inference V2 #7853
tohtana merged 5 commits into deepspeedai:master
Conversation
Force-pushed from bd52e9d to 400d05a
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 400d05a36a
| """ | ||
| tokens = hidden_states.shape[0] | ||
| local_n_heads = self.n_heads // max(self.tp_size, 1) | ||
| local_n_heads_kv = self.n_heads_kv // max(self.tp_size, 1) |
As EXAONE4 has uneven Q/KV heads (GQA), I think this can produce incorrect results. Shouldn't we use these?

- `self.n_heads_q_local` instead of `self.n_heads // self.tp_size`
- `self.n_heads_kv_local` instead of `self.n_heads_kv // self.tp_size`
@tohtana Thanks for the review! I've updated the code to use `n_heads_q_local` and `n_heads_kv_local`. I'll validate the model with coherent text generation and share the results.
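To illustrate the point raised in the review: under tensor parallelism, a GQA model's query and key/value head counts must be sharded independently, because they are unequal. A minimal sketch (the function name and the example head counts are illustrative, not taken from the DeepSpeed code):

```python
def local_head_counts(n_heads_q: int, n_heads_kv: int, tp_size: int):
    """Per-rank head counts for a GQA model under tensor parallelism.

    With grouped-query attention n_heads_q != n_heads_kv, so each count
    must be divided by tp_size separately; reusing the query head count
    for the KV heads (the bug flagged in review) over-counts local KV heads.
    """
    tp_size = max(tp_size, 1)
    assert n_heads_q % tp_size == 0 and n_heads_kv % tp_size == 0
    return n_heads_q // tp_size, n_heads_kv // tp_size

# e.g. 40 query heads sharing 8 KV heads, split across 4 TP ranks
print(local_head_counts(40, 8, 4))  # -> (10, 2)
```

If the KV count were derived from `n_heads_q`, each rank would expect 10 KV heads here instead of 2, corrupting the attention layout.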
Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Use n_heads_q_local and n_heads_kv_local for GQA compatibility
Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Force-pushed from fced31a to 8ece3c1
## Summary

Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed Inference V2.

Closes deepspeedai#7453

## Changes

- New model implementation: `deepspeed/inference/v2/model_implementations/exaone4/`
  - `container.py`: Transformer and non-transformer parameter containers
  - `model.py`: Inference model with post-norm architecture and QK-Norm support
  - `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`

## Key architectural differences from Mistral/Llama

- **Post-norm**: RMSNorm is applied after attention/MLP outputs (not before), followed by residual addition
- **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the QKV linear layer
- **Hybrid attention**: the 32B model uses a 3:1 sliding window/full attention ratio (via the `layer_types` config)

## Supported models

- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B) (all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B) (hybrid sliding/full attention)

Requires `transformers >= 4.54.0`.

## Related

- Supersedes deepspeedai#7456 (draft, inactive for 6 months)

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
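The QK-Norm described in the summary amounts to an RMSNorm applied independently over each head's feature dimension, after the QKV projection is split into heads. A minimal NumPy illustration (the shapes, function name, and eps value are assumptions for exposition, not the DeepSpeed kernel):

```python
import numpy as np

def per_head_rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm applied independently to each attention head.

    x:      (tokens, n_heads, head_dim) -- Q or K after the QKV linear
            layer, already reshaped into heads
    weight: (head_dim,) -- learned scale, shared across heads
    """
    # Normalize over the last axis only, so each (token, head) slice
    # is scaled by its own root-mean-square.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

tokens, n_heads, head_dim = 4, 8, 64
q = np.random.randn(tokens, n_heads, head_dim).astype(np.float32)
q_normed = per_head_rmsnorm(q, np.ones(head_dim, dtype=np.float32))
```

After normalization each head of `q_normed` has (approximately) unit RMS, which is the property QK-Norm relies on to keep attention logits well-scaled.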

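As a footnote on the hybrid attention scheme from the summary: a 3:1 sliding-window/full-attention ratio means every fourth layer attends over the full context. A sketch of how such a `layer_types` list could be laid out (the helper and the string values are illustrative assumptions, not the exact HuggingFace/DeepSpeed schema):

```python
def build_layer_types(num_layers: int, ratio: int = 3):
    """Every (ratio+1)-th layer uses full attention; the rest use a sliding window."""
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "sliding_attention"
        for i in range(num_layers)
    ]

# First 8 layers of a 3:1 hybrid layout
print(build_layer_types(8))
# -> ['sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention',
#     'sliding_attention', 'sliding_attention', 'sliding_attention', 'full_attention']
```

An inference engine consuming such a config would dispatch each layer's KV cache and attention kernel based on its entry, which is why the 32B model needs per-layer handling while the 1.2B model (all full attention) does not.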