[1/N] Initial vllm-ext evaluation support (MXFP4 MOE) #935

yiliu30 · 2025-10-23T02:35:50Z

Usage

source apply_ext.sh 
VLLM_ENABLE_AR_EXT=1 vllm serve ...

Initial support for out-of-tree AutoRound integration with vLLM, verified on Qwen3-15B-A2B-Base.

Added AutoRoundExtensionConfig
Added monkey-patching support
Added AutoRoundMoEMethod and AutoRoundMoEMethodMXFp4Impl
Added a unit test (UT)
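To illustrate the opt-in monkey patching listed above, here is a minimal sketch of an environment-gated class swap. Only the VLLM_ENABLE_AR_EXT flag comes from this PR; the registry, the class names, and the apply_extension helper are hypothetical stand-ins and do not reflect vLLM's or AutoRound's actual internals.

```python
import os


class DefaultMoEMethod:
    """Stand-in for a quantization method shipped by the host framework."""

    name = "default"


class ExtensionMoEMethod(DefaultMoEMethod):
    """Stand-in for an out-of-tree replacement (hypothetical)."""

    name = "extension"


# Hypothetical registry that a host framework might consult at model-load time.
QUANT_METHODS = {"moe": DefaultMoEMethod}


def apply_extension(registry, env=None):
    """Swap in the extension class only when the opt-in flag equals "1".

    Returns True if the patch was applied, False otherwise.
    """
    env = os.environ if env is None else env
    if env.get("VLLM_ENABLE_AR_EXT", "0") != "1":
        return False
    registry["moe"] = ExtensionMoEMethod
    return True
```

Gating the patch on an explicit flag keeps the default behavior untouched when the variable is unset, which is presumably why the usage above sets VLLM_ENABLE_AR_EXT=1 on the command line.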

There are three paths for MXFP4:

VLLM_ENABLE_STATIC_MOE=1: static MOE that unpacks the weights on the fly. Very slow; intended for correctness checks.
VLLM_ENABLE_STATIC_MOE=1 + VLLM_MXFP4_PRE_UNPACK_WEIGHTS=1: static MOE with the weights unpacked to FP8 before inference. Still slow, but acceptable.
VLLM_AR_MXFP4_MODULAR_MOE=1 + VLLM_MXFP4_PRE_UNPACK_WEIGHTS=1: unpacks the weights to BF16 and uses an FP32 group GEMM. Fast, but requires the same memory as a BF16 model. Work in progress; to be supported in the next PR.
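The three flag combinations above can be summarized as a small selection function. This is a sketch only: the environment variable names are taken from the description, while select_mxfp4_path, the returned labels, the "1" means enabled convention, and the precedence between flags are assumptions rather than the extension's real logic.

```python
import os


def _flag(env, name):
    """Treat a variable as enabled only when it is exactly "1" (assumed convention)."""
    return env.get(name, "0") == "1"


def select_mxfp4_path(env=None):
    """Map the documented flag combinations to a descriptive label (hypothetical)."""
    env = os.environ if env is None else env
    pre_unpack = _flag(env, "VLLM_MXFP4_PRE_UNPACK_WEIGHTS")

    # Path 3: modular MOE with pre-unpacked weights (described as WIP in the PR).
    if _flag(env, "VLLM_AR_MXFP4_MODULAR_MOE") and pre_unpack:
        return "modular_moe_pre_unpacked"

    # Paths 1 and 2: static MOE, with or without pre-unpacking.
    if _flag(env, "VLLM_ENABLE_STATIC_MOE"):
        return "static_moe_pre_unpacked" if pre_unpack else "static_moe_on_the_fly"

    # The PR does not say what happens when none of the flags is set.
    return "unspecified"
```

For example, `select_mxfp4_path({"VLLM_ENABLE_STATIC_MOE": "1"})` returns the label for the slow on-the-fly path used for correctness checks.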

Signed-off-by: yiliu30 <yi4.liu@intel.com>

auto_round_extension/vllm_ext/tests/conftest.py

Copilot

Pull Request Overview

This PR introduces initial support for out-of-tree AutoRound integration with vLLM, specifically for MXFP4 MOE quantization. The implementation adds extension configuration, MOE quantization methods, and utilities for FP4/MXFP4 quantization and dequantization operations.

Key changes:

Added AutoRoundExtensionConfig to extend AutoRound's quantization support with MXFP4 MOE capabilities
Implemented AutoRoundMoEMethod and AutoRoundMoEMethodMXFp4Impl for handling MOE layers with MXFP4 quantization
Created utility modules for MXFP4 quantization/dequantization, FP4 conversions, and environment variable extensions
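As background for the dequantization utilities mentioned above, the following pure-Python sketch decodes MXFP4 data in the layout defined by the OCP Microscaling format: 4-bit E2M1 elements in blocks of 32 that share one E8M0 power-of-two scale with bias 127. The function names and the nibble order (low nibble first) are assumptions for illustration; the PR's own modules may pack and name things differently, and the reserved E8M0 NaN encoding (255) is not handled here.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in MXFP4
E8M0_BIAS = 127   # a stored scale byte e means a factor of 2 ** (e - 127)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4(packed):
    """Split each byte into two E2M1 values (assumes the low nibble comes first)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_mxfp4(packed, scales):
    """Multiply every block of 32 decoded values by its shared power-of-two scale."""
    values = unpack_fp4(packed)
    if len(values) != len(scales) * BLOCK_SIZE:
        raise ValueError("expected one E8M0 scale per block of 32 elements")
    return [
        math.ldexp(v, scales[i // BLOCK_SIZE] - E8M0_BIAS)
        for i, v in enumerate(values)
    ]
```

Because the scale is a pure power of two, `math.ldexp` applies it exactly, without any rounding of the decoded element.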

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File	Description
auto_round/experimental/vllm_ext/utils.py	MXFP4 scale derivation and quantization utilities for E8M0 exponent handling
auto_round/experimental/vllm_ext/tests/test_mxfp4_moe.py	Basic test validating AutoRound MXFP4 model inference
auto_round/experimental/vllm_ext/tests/conftest.py	Test fixtures and runners copied from vLLM
auto_round/experimental/vllm_ext/sitecustomize.py	Bootstrap script for enabling AutoRound extension via environment variable
auto_round/experimental/vllm_ext/quant_method_moe.py	MOE quantization method dispatcher for AutoRound
auto_round/experimental/vllm_ext/mxfp4_qdq_utils.py	MXFP4 quantization/dequantization implementation
auto_round/experimental/vllm_ext/moe_impl_mxfp4.py	MXFP4 MOE layer implementation with weight processing
auto_round/experimental/vllm_ext/fp4_utils.py	FP4 E2M1 format packing/unpacking utilities
auto_round/experimental/vllm_ext/envs_ext.py	Extension environment variables for MXFP4 configuration
auto_round/experimental/vllm_ext/auto_round_ext.py	AutoRound extension configuration class
auto_round/experimental/vllm_ext/__init__.py	Extension application entry point
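The "scale derivation" and "E8M0 exponent handling" entries in the table can be made concrete with a short sketch based on the OCP Microscaling convention: the shared exponent of a block is floor(log2(max|x|)) minus the largest E2M1 exponent (2), stored with a bias of 127. Everything below is illustrative: the function names are hypothetical, the all-zero-block fallback and the tie-breaking rule are arbitrary choices, and the PR itself marks its rounding method as a TODO.

```python
import math

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # largest E2M1 value is 6 = 1.5 * 2 ** 2
E8M0_BIAS = 127


def derive_e8m0_scale(block):
    """Return the biased E8M0 exponent for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return E8M0_BIAS  # arbitrary choice for an all-zero block: scale of 1.0
    # frexp gives amax = m * 2 ** e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, exp = math.frexp(amax)
    shared = (exp - 1) - E2M1_EMAX
    # Clamp to the finite E8M0 range; 255 is reserved for NaN.
    return max(0, min(254, shared + E8M0_BIAS))


def quantize_block(block):
    """Quantize a block to (biased scale, list of 4-bit E2M1 codes)."""
    biased = derive_e8m0_scale(block)
    codes = []
    for x in block:
        scaled = math.ldexp(x, E8M0_BIAS - biased)  # divide by the shared scale
        mag = min(abs(scaled), E2M1_MAGNITUDES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties resolve to the smaller value here,
        # which is a simplification and not round-to-nearest-even.
        idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
        codes.append(idx | (0x8 if scaled < 0 else 0))
    return biased, codes
```

With this rule the largest element of a block always lands in the top binade of E2M1, so it saturates at 6.0 only when its scaled value exceeds 6.0.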

auto_round_extension/vllm_ext/auto_round_ext.py

auto_round_extension/vllm_ext/tests/test_mxfp4_moe.py

auto_round_extension/vllm_ext/mxfp4_qdq_utils.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

auto_round/experimental/vllm_ext/fp4_utils.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

auto_round_extension/vllm_ext/tests/conftest.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

…llm-ext

auto_round_extension/vllm_ext/__init__.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

auto_round_extension/vllm_ext/utils.py

auto_round_extension/vllm_ext/quant_method_moe.py

auto_round_extension/vllm_ext/moe_impl_mxfp4.py

auto_round_extension/vllm_ext/fp4_utils.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

wenhuach21 · 2025-10-30T01:55:24Z

I’m not very familiar with this part. @mengniwang95 @n1ck-guo, could you please review it again and approve the PR if everything looks good?

auto_round_extension/vllm_ext/moe_impl_mxfp4.py

Signed-off-by: yiliu30 <yi4.liu@intel.com>

* Fix rtn tuning_device issue (#893) Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> * fix vlm gguf ut (#895) Signed-off-by: n1ck-guo <heng.guo@intel.com> * update alg_ext.abi3.so with python compatible version (#894) * move ste from quant to round for nvfp4 (#889) Signed-off-by: He, Xin3 <xin3.he@intel.com> * Add GPT-OSS quant support (#887) * better help printing information (#883) * better help printing information Signed-off-by: n1ck-guo <heng.guo@intel.com> * speedup quant and evaluation, fix recompile issue (#897) * rewrite the implementation for ease-of-maintain Signed-off-by: He, Xin3 <xin3.he@intel.com> * fix bug Signed-off-by: He, Xin3 <xin3.he@intel.com> * fix quant performance Signed-off-by: He, Xin3 <xin3.he@intel.com> * Update auto_round/compressors/base.py --------- Signed-off-by: He, Xin3 <xin3.he@intel.com> * fix nvfp act quantization bug (#891) * fix nvfp act quantization bug Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * add cuda ut for moe nvfp quantize Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * add cpu UT, refine cuda UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ut typo Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix cpu ut Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * enhance experts amax match, refine UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * support automatic mixed bits assignment (#851) * try to fix gguf issue (#886) * remove numba from requirments (#905) Signed-off-by: yiliu30 
<yi4.liu@intel.com> * Extend mxfp loading dtypes (#907) * block dataset logger info (#908) Signed-off-by: n1ck-guo <heng.guo@intel.com> * fix torch compile issue in AutoScheme (#909) * Revert "Extend mxfp loading dtypes (#907)" (#915) This reverts commit 0c2619c. * support disable_opt_rtn in auto-scheme (#913) * fix llama 4 ut (#896) * fix ut of llama 4 Signed-off-by: n1ck-guo <heng.guo@intel.com> * add numba for cpu lib (#919) Signed-off-by: yiliu30 <yi4.liu@intel.com> * Loosen the packing restrictions for mxfp&nvfp (#911) * Loosen the packing restrictions for mxfp&nvfp, enable Qwen1.5-MoE-A2.7B quantize Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine mxfp&nvfp layer checker Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix pylint Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Extend mxfp loading dtypes (#916) Signed-off-by: root <root@clx5673.ra.intel.com> Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: root <root@clx5673.ra.intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix act config exporting for mixed schemes (#903) * fp8 exporting bugfix Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix act related config saving Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add ut for act_config check Signed-off-by: Zhang, 
Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine extra_config saving, add UTs Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix ut typo Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix ut typo Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fixtypo Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CI Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix scan issue Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix scan issue Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * rm global variable Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rerun ut Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine ut Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * optimize rtn for int woq (#924) * fix bug of gguf and support for LiquidAI/LFM2-1.2B (#927) Signed-off-by: n1ck-guo <heng.guo@intel.com> * remove numpy<2.0 limitation (#921) * enable regex quantization config saving for mixed bits (#825) * enable dynamic quantization config saving Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixtypo Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for 
more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rebase code, refine config saving Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine ut Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * fix UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable hf loading for regex, add UTs Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine export, enhance gptq UT Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix Flux tuning issue (#936) Signed-off-by: Mengni Wang <mengni.wang@intel.com> * gguf support for inclusionAI/Ling-flash-2.0 (#940) * remove low_cpu_mem (#934) * Add compatibility test (#918) * Add commit hash to version (#941) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> * gguf weight type align with original, output.weight, token_embed (#900) * support attention mask in user's dataset (#930) * Add diffusion README (#923) * update readme (#949) * refactor utils file (#943) * refact utils Signed-off-by: n1ck-guo <heng.guo@intel.com> * update readme for sglang support (#953) * update readme for sglang support Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * refine doc Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> * Update README.md --------- Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Co-authored-by: Wenhua Cheng 
<wenhua.cheng@intel.com> * update gguf and support for CompressedLinear (#950) * Reduce AutoSchem VRAM usage by up to 10X (#944) * add self attribution and fix avg_bits error (#956) * add self attribution and fix avg_bits error --------- Signed-off-by: He, Xin3 <xin3.he@intel.com> Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com> * add logo (#960) * refine AutoScheme readme/code (#958) * update readme (#962) * fix critic disable_opt_rtn regression (#963) * [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) (#935) Signed-off-by: yiliu30 <yi4.liu@intel.com> * fix bug of imatrix contains 0 (#955) * fix rtn bug (#966) * enhance flux doc (#967) * clean code (#968) * support for model scope (#957) * support for model scope Signed-off-by: n1ck-guo <heng.guo@intel.com> * merge main branch to alg_ext (#970) * fix cuda CI backend issue, fixtypo (#974) * disable compile packing by default (#975) Signed-off-by: yiliu30 <yi4.liu@intel.com> * enhance auto device map and support XPU (#961) * enhance auto device map and support XPU --------- Signed-off-by: He, Xin3 <xin3.he@intel.com> * refine readme (#978) * cli support for positional arguments model (#979) Signed-off-by: n1ck-guo <heng.guo@intel.com> * update bits (#986) Signed-off-by: He, Xin3 <xin3.he@intel.com> * fix guff scheme and device_map bug (#969) * add support for Magistral-Small (#980) * support model_dtype and fix bug of scheme contains quotes, mllm eval (#985) --------- Signed-off-by: Kaihui-intel <kaihui.tang@intel.com> Signed-off-by: n1ck-guo <heng.guo@intel.com> Signed-off-by: He, Xin3 <xin3.he@intel.com> Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com> Signed-off-by: yiliu30 <yi4.liu@intel.com> Signed-off-by: root <root@clx5673.ra.intel.com> Signed-off-by: Mengni Wang <mengni.wang@intel.com> Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com> Co-authored-by: Tang Kaihui <kaihui.tang@intel.com> Co-authored-by: Heng Guo <heng.guo@intel.com> Co-authored-by: Xin He <xin3.he@intel.com> Co-authored-by: Yi 
Liu <yi4.liu@intel.com> Co-authored-by: Weiwei <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Wenhua Cheng <wenhua.cheng@intel.com> Co-authored-by: root <root@clx5673.ra.intel.com> Co-authored-by: Wang, Mengni <mengni.wang@intel.com> Co-authored-by: Sun, Xuehao <xuehao.sun@intel.com>

init moe support

5ee2c2d

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 changed the title ~~init moe support~~ [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) Oct 23, 2025

add test

c278f9d

Signed-off-by: yiliu30 <yi4.liu@intel.com>

github-advanced-security bot found potential problems Oct 23, 2025

auto_round_extension/vllm_ext/tests/conftest.py (fixed)

yiliu30 requested a review from Copilot October 23, 2025 02:42

Copilot AI reviewed Oct 23, 2025

auto_round_extension/vllm_ext/auto_round_ext.py (resolved)

auto_round_extension/vllm_ext/tests/test_mxfp4_moe.py (outdated, resolved)

auto_round_extension/vllm_ext/mxfp4_qdq_utils.py (outdated, resolved)

yiliu30 added 4 commits October 22, 2025 22:45

fix import

418e6a0

Signed-off-by: yiliu30 <yi4.liu@intel.com>

clean envs

184783f

Signed-off-by: yiliu30 <yi4.liu@intel.com>

add script for apply ext

b9da06f

Signed-off-by: yiliu30 <yi4.liu@intel.com>

clean docs

187f38d

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 requested review from XuehaoSun, mengniwang95 and wenhuach21 October 23, 2025 02:57

yiliu30 marked this pull request as ready for review October 23, 2025 03:05

yiliu30 mentioned this pull request Oct 23, 2025

MXFP4/MXFP8 evaluation #937

Open

5 tasks

mengniwang95 reviewed Oct 23, 2025

auto_round/experimental/vllm_ext/fp4_utils.py (outdated, resolved)

yiliu30 added 2 commits October 23, 2025 02:35

fix license

4031724

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

5fe01ef

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 requested review from mengniwang95 and n1ck-guo October 23, 2025 08:11

yiliu30 added 2 commits October 23, 2025 06:35

fix import and sitecustomize

73f1e9b

Signed-off-by: yiliu30 <yi4.liu@intel.com>

move to ext

8495854

Signed-off-by: yiliu30 <yi4.liu@intel.com>

github-advanced-security bot found potential problems Oct 24, 2025

auto_round_extension/vllm_ext/tests/conftest.py (dismissed)

yiliu30 added 6 commits October 24, 2025 04:04

update mxfp4

c473934

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

9f65bd1

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix model name

8038a5f

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Merge branch 'main' into vllm-ext

e0872b6

fix

c82bce1

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Merge branch 'vllm-ext' of https://github.com/intel/auto-round into v…

19e18c7

…llm-ext

n1ck-guo reviewed Oct 27, 2025

auto_round_extension/vllm_ext/__init__.py (outdated, resolved)

yiliu30 added 2 commits October 27, 2025 02:22

use absolute path

adf7ebf

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Merge branch 'main' into vllm-ext

59f5cd2

n1ck-guo reviewed Oct 27, 2025

auto_round_extension/vllm_ext/utils.py (resolved)

wenhuach21 reviewed Oct 27, 2025

auto_round_extension/vllm_ext/quant_method_moe.py (resolved)

mengniwang95 reviewed Oct 27, 2025

auto_round_extension/vllm_ext/moe_impl_mxfp4.py (outdated, resolved)

mengniwang95 reviewed Oct 27, 2025

auto_round_extension/vllm_ext/fp4_utils.py (outdated, resolved)

yiliu30 added 2 commits October 29, 2025 20:45

Merge branch 'main' into vllm-ext

8f27041

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

ad8537c

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 requested review from mengniwang95, n1ck-guo and wenhuach21 October 30, 2025 00:55

mengniwang95 reviewed Oct 30, 2025

auto_round_extension/vllm_ext/moe_impl_mxfp4.py (resolved)

mark round method as todo

77844f6

Signed-off-by: yiliu30 <yi4.liu@intel.com>

mengniwang95 approved these changes Oct 30, 2025

yiliu30 merged commit e8bc353 into main Oct 30, 2025
21 checks passed

yiliu30 deleted the vllm-ext branch October 30, 2025 07:22


5 participants
