Add static FP8 attention support #1045

Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Pull Request Overview
This PR adds static FP8 attention support to the auto-round library. The implementation enables quantization of attention mechanisms using FP8 format, building on existing KV cache quantization infrastructure.
Key Changes
- Introduces a `QuantizedAttentionImpl` module to handle FP8 quantized attention operations
- Refactors shared quantization utilities into a new `experimental/utils.py` module
- Updates the test suite to validate FP8 attention quantization and correct scale tensor shapes
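To make the overview above concrete, here is a minimal, hedged sketch of the general idea behind static FP8 attention quantization: query, key, and value activations are fake-quantized (quantize/dequantize) to FP8 with per-tensor scales fixed at calibration time, before attention is computed. The helper name, scale values, and tensor shapes below are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only; names, scales, and shapes are assumptions,
# not the PR's actual implementation.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def static_fp8_qdq(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Fake-quantize x to FP8 e4m3 with a fixed (static) per-tensor scale."""
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return (q.float() * scale).to(x.dtype)

# Toy query/key/value tensors and made-up calibrated per-tensor scales.
q = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
scales = {name: torch.tensor(0.05) for name in ("q", "k", "v")}

out = torch.nn.functional.scaled_dot_product_attention(
    static_fp8_qdq(q, scales["q"]),
    static_fp8_qdq(k, scales["k"]),
    static_fp8_qdq(v, scales["v"]),
)
```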
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| test/test_cpu/test_export.py | Adds test for static FP8 attention, updates model path and corrects expected scale tensor shapes |
| auto_round/experimental/utils.py | New utility module consolidating shared FP8 quantization and module manipulation functions |
| auto_round/experimental/qmodules/fp8_static.py | Preserves original dtype in forward pass for FP8 linear operations |
| auto_round/experimental/kv_cache.py | Refactors utilities to new module, initializes scale parameters, removes debug logging |
| auto_round/experimental/attention.py | New module implementing hooked attention mechanism for FP8 quantization |
| auto_round/compressors/base.py | Adds static_attention_dtype parameter and integrates attention quantization context |
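The "hooked attention mechanism" described for `auto_round/experimental/attention.py` can be pictured roughly as follows. This is a conceptual sketch using a PyTorch forward pre-hook on a stand-in attention module, not the PR's `QuantizedAttentionImpl`; the scale value and module choice are assumptions.

```python
# Conceptual sketch, not the PR's QuantizedAttentionImpl; the scale is a
# made-up calibration result and nn.MultiheadAttention is only a stand-in.
import torch
from torch import nn

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def fp8_qdq(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    q = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return (q.float() * scale).to(x.dtype)

def make_fp8_input_hook(scale: torch.Tensor):
    """Return a forward pre-hook that fake-quantizes all tensor inputs."""
    def hook(module: nn.Module, args):
        return tuple(fp8_qdq(a, scale) if torch.is_tensor(a) else a for a in args)
    return hook

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
handle = attn.register_forward_pre_hook(make_fp8_input_hook(torch.tensor(0.05)))
x = torch.randn(2, 16, 64)
out, _ = attn(x, x, x)  # inputs are fake-quantized by the hook before forward
handle.remove()
```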
auto_round/experimental/utils.py (outdated)

    from auto_round.utils import logger

    def fp8_per_tensor_qdq(
The function name is a bit strange; I suggest changing it to a more easily understood name.
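For illustration, here is an assumption-only sketch of what a more descriptively named helper could look like; the real signature and behavior of `fp8_per_tensor_qdq` are not visible in this diff, so everything below is hypothetical.

```python
# Hypothetical rename sketch; the actual fp8_per_tensor_qdq in the PR may
# differ in signature and behavior.
from typing import Optional

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quant_dequant_fp8_per_tensor(
    tensor: torch.Tensor, scale: Optional[torch.Tensor] = None
):
    """Fake-quantize `tensor` to FP8 e4m3 using a single per-tensor scale.

    If `scale` is None it is derived from the tensor's absolute maximum;
    otherwise the provided (e.g. calibrated) scale is reused. Returns the
    dequantized tensor and the scale so callers can store it.
    """
    if scale is None:
        scale = (tensor.abs().max().float() / FP8_MAX).clamp(min=1e-12)
    q = (tensor.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return (q.float() * scale).to(tensor.dtype), scale
```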
auto_round/compressors/base.py

    enable_deterministic_algorithms = kwargs.pop("enable_deterministic_algorithms", False)
    self.momentum = kwargs.pop("momentum", 0.0)
    static_kv_dtype = kwargs.pop("static_kv_dtype", None)
    static_attention_dtype = kwargs.pop("static_attention_dtype", None)
Does `__main__.py` also need to add this parameter?
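For reference, a hedged usage sketch of how the new option might be passed from the Python API. The accepted dtype value ("fp8"), the placeholder model, and everything beyond the kwargs shown in the diff above are assumptions.

```python
# Usage sketch; "fp8" as the accepted value is an assumption based on the PR
# title, and the model below is just a small placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    static_kv_dtype="fp8",          # existing static KV cache option (value assumed)
    static_attention_dtype="fp8",   # option introduced by this PR (value assumed)
)
autoround.quantize()
```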
LGTM
Resolves #938.
LLMC format export is not supported for now, as vLLM loading is still a work in progress, but it can be used on HPU.