
Disable replace FP8Expert #1379

Merged
chensuyue merged 13 commits into main from fp8-experts on Feb 3, 2026

Conversation

yiliu30 (Contributor) commented on Feb 2, 2026

Resolve the FP8 part of #1248

  • patch fp8 experts replacement
  • add fp8 linear

Description

Please briefly describe your main changes and the motivation.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Copilot AI review requested due to automatic review settings February 2, 2026 01:31
yiliu30 changed the title from "Disable replace FP8Experts" to "Disable replace FP8Expert" on Feb 2, 2026
Copilot AI left a comment


Pull request overview

This PR patches the FP8 experts replacement functionality in the transformers library by disabling the automatic conversion of expert modules during FP8 quantization, while preserving standard linear layer conversion.

Changes:

  • Adds a version check utility to determine if transformers >= 5.0.0 is installed
  • Introduces a custom FP8 linear replacement function that explicitly disables expert module conversion
  • Automatically applies the patch at import time for compatible transformers versions

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files changed:

  • auto_round/utils/common.py: Adds a utility function to check the transformers version against 5.0.0
  • auto_round/modeling/fp8_quant.py: Implements the patched FP8 linear replacement without expert conversion and applies it automatically
  • auto_round/modeling/__init__.py: Imports the fp8_quant module to ensure the patch is applied at package initialization
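
To make the mechanism concrete, here is a minimal sketch of the idea: gate on the installed transformers version, then convert plain nn.Linear layers while skipping expert modules. The helper names, the FP8 linear class passed in, and the name-based expert filter are illustrative assumptions; the merged code patches transformers' own fine-grained FP8 replacement helper rather than re-implementing it.

```python
# Illustrative sketch only -- not the code merged in this PR.
from packaging import version
import torch.nn as nn
import transformers


def is_transformers_ge_5() -> bool:
    """Version-check utility: True if transformers >= 5.0.0 is installed."""
    return version.parse(transformers.__version__) >= version.parse("5.0.0")


def replace_linears_skip_experts(model: nn.Module, fp8_linear_cls) -> nn.Module:
    """Swap nn.Linear submodules for `fp8_linear_cls`, leaving expert blocks untouched."""
    for name, child in model.named_children():
        if "expert" in type(child).__name__.lower():
            # Skip MoE expert containers entirely -- the conversion this PR
            # disables for transformers >= 5.0.0.
            continue
        if isinstance(child, nn.Linear):
            setattr(model, name, fp8_linear_cls(
                child.in_features, child.out_features, bias=child.bias is not None))
        else:
            replace_linears_skip_experts(child, fp8_linear_cls)
    return model
```

Running such a replacement only when `is_transformers_ge_5()` is true, triggered from `auto_round/modeling/__init__.py`, matches the "applied at import time" behavior described above.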

Signed-off-by: yiliu30 <yi4.liu@intel.com>
wenhuach21 (Contributor) commented:
Does this PR work on A100 and B200? On B200, transformers will keep the FP8 layers, while on A100 it will dequantize the model to BF16.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
yiliu30 (Contributor, Author) commented on Feb 3, 2026

Does this PR work on A100 and B200? On B200, transformers will keep the FP8 layers, while on A100 it will dequantize the model to BF16.

I have verified it on A100 and B200; it works on both nodes.

Generated Output:
Explain the theory of relativity in simple terms. The theory of relativity, developed by Albert Einstein, is a fundamental concept in physics that explains how
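
For context on the hardware difference mentioned above, a rough illustration of how one might detect native FP8 support on the current device; the capability threshold is an assumption for illustration and is not the check transformers itself performs.

```python
import torch


def device_has_native_fp8() -> bool:
    """Rough heuristic: assume FP8 tensor cores need compute capability >= 8.9."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# A100 reports compute capability (8, 0), so FP8 weights are dequantized to
# BF16; B200 reports a higher capability and keeps the FP8 layers.
print("native FP8 support:", device_has_native_fp8())
```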

Signed-off-by: yiliu30 <yi4.liu@intel.com>
yiliu30 requested a review from wenhuach21 on February 3, 2026 06:09
wenhuach21 (Contributor) commented:
Does this PR work on A100 and B200? On B200, transformers will keep the FP8 layers, while on A100 it will dequantize the model to BF16.

I have verified it on A100 and B200; it works on both nodes.

Generated Output:
Explain the theory of relativity in simple terms. The theory of relativity, developed by Albert Einstein, is a fundamental concept in physics that explains how

Thanks, nice work!

wenhuach21 (Contributor) left a comment


Another concern is that transformers may change its behavior. Shall we add a try/except around the core code, or use some other way to avoid the potential issue?

yiliu30 (Contributor, Author) commented on Feb 3, 2026

Another concern is that transformers may change its behavior. Shall we add a try/except around the core code, or use some other way to avoid the potential issue?

I agree. Currently, only the FineGrainedFP8HfQuantizer is imported when initializing AutoRound; the other imports are inside a try/except block.
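
A minimal sketch of one way to do that, assuming the patch lives in `auto_round/modeling/fp8_quant.py`: wrap the transformers-dependent pieces in try/except so a future API change downgrades to a warning instead of breaking AutoRound at import time. The helper name and the import path are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def _apply_fp8_replacement_patch() -> None:
    # Hypothetical stand-in for the patch described in this PR. The import path
    # below is an assumption and may move between transformers releases, which
    # is exactly why the caller guards it.
    from transformers.quantizers.quantizer_finegrained_fp8 import (  # noqa: F401
        FineGrainedFP8HfQuantizer,
    )
    # ... monkey-patch the FP8 linear replacement helper here ...


try:
    _apply_fp8_replacement_patch()
except (ImportError, AttributeError) as exc:
    # Degrade gracefully if transformers renames or removes what the patch relies on.
    logger.warning("Skipping FP8 replacement patch: %s", exc)
```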

chensuyue merged commit 137da55 into main on Feb 3, 2026
28 of 29 checks passed
chensuyue deleted the fp8-experts branch on February 3, 2026 08:08
lvliang-intel pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: yiliu30 <yi4.liu@intel.com>
