selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2477
Closed
hanbitmyths wants to merge 5 commits into
…TI_GPU dispatch

- Normalize per-layer quant config overrides so Q/K/V projections in the same attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU, LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with _no_split_modules honored, plus a coalescing pass that pins every model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and the MULTI_GPU device-map coalescing path.
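The coalescing pass and the span check from the MULTI_GPU item can be illustrated with a minimal sketch. This is not the code in this PR: the helper names (`coalesce_layer_devices`, `layers_spanning_devices`, `per_device_layer_counts`) are hypothetical, and the only assumption about the input is that it is a plain dict mapping module names to devices, which is the shape `infer_auto_device_map` returns.

```python
import re
from collections import Counter

# Matches keys such as "model.layers.12" or "model.layers.12.mlp.up_proj".
_LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = _LAYER_RE.match(name)
        if match:
            # Reuse the device of the first entry seen for this decoder layer.
            device = first_device.setdefault(match.group(1), device)
        coalesced[name] = device
    return coalesced


def layers_spanning_devices(device_map):
    """Return decoder layers whose entries sit on more than one device."""
    devices_per_layer = {}
    for name, device in device_map.items():
        match = _LAYER_RE.match(name)
        if match:
            devices_per_layer.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devs in devices_per_layer.items() if len(devs) > 1)


def per_device_layer_counts(device_map):
    """Count decoder layers per device (first entry wins for each layer)."""
    layer_device = {}
    for name, device in device_map.items():
        match = _LAYER_RE.match(name)
        if match:
            layer_device.setdefault(match.group(1), device)
    return Counter(layer_device.values())


# Hypothetical device map in which layer 1 was split across GPUs 0 and 1.
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "lm_head": 1,
}
coalesced = coalesce_layer_devices(device_map)
```

In this sketch a caller would fall back to the low-memory path whenever `layers_spanning_devices(coalesced)` is non-empty, and could log `per_device_layer_counts(coalesced)` as the diagnostic.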
This PR hardens `SelectiveMixedPrecision` (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

- QKV-aware quant config overrides (`olive/passes/pytorch/quant_utils.py`): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.
- AUTO `kld_memory_mode` (`olive/passes/pytorch/selective_mixed_precision.py`): A new `auto` setting selects among `full`, `multi_gpu`, `low_memory`, and `offload` based on visible GPU memory and estimated model footprint, and logs the decision (e.g. `KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)`).
- New `multi_gpu` mode: Uses `accelerate.dispatch_model` + `infer_auto_device_map` with `_no_split_modules` honored. After `infer_auto_device_map`, every `model.layers.N.*` entry is coalesced to the first device assigned for that layer, and a defensive check falls back to `low_memory` if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.

Validation (A100 VM)
`new_missing_qkv_partners=[]`), same 657 MB output, ~301 vs 309 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)
The 14B model is essentially lossless; the small-model deltas are inherent to int4 SMP on sub-2B-parameter models, not regressions introduced here.
Checklist before requesting a review
test_selective_mixed_precision.py)
`lintrunner -a`

Release note:
`SelectiveMixedPrecision` now supports an `auto` setting for `kld_memory_mode` and a new `multi_gpu` mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
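As a rough illustration of the Q/K/V normalization, the sketch below propagates an override to every projection in the same attention block. It is not the implementation in `quant_utils.py`. Several things are assumptions made only for this example: the override schema (a dict with a `bits` key), the `q_proj`/`k_proj`/`v_proj` leaf names, the rule of resolving conflicts by taking the highest bit width, and the function name `normalize_qkv_overrides`.

```python
# Assumed leaf names for the attention projections (common in HF Llama/Qwen-style models).
QKV_LEAVES = ("q_proj", "k_proj", "v_proj")


def normalize_qkv_overrides(overrides):
    """Return a copy of `overrides` where Q, K, and V in a block share one config.

    `overrides` maps module names to config dicts, e.g.
    {"model.layers.0.self_attn.q_proj": {"bits": 8}} (hypothetical schema).
    """
    normalized = dict(overrides)

    # Group the overrides that touch a Q/K/V projection by their attention block.
    groups = {}
    for name, cfg in overrides.items():
        prefix, _, leaf = name.rpartition(".")
        if prefix and leaf in QKV_LEAVES:
            groups.setdefault(prefix, []).append(cfg)

    for prefix, cfgs in groups.items():
        # Assumption: when partners disagree, keep the highest-precision config.
        shared = max(cfgs, key=lambda cfg: cfg["bits"])
        for leaf in QKV_LEAVES:
            # Copy so that later edits to one entry do not leak into its partners.
            normalized[f"{prefix}.{leaf}"] = dict(shared)

    return normalized


# Hypothetical partial override: only q_proj of layer 0 was selected for 8-bit.
overrides = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.mlp.down_proj": {"bits": 8},
}
normalized = normalize_qkv_overrides(overrides)
```

Here `normalized` gains matching entries for `k_proj` and `v_proj` in layer 0, while the `mlp.down_proj` override and the input dict are left untouched.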