Summary
The compiler module has its own internal quantization pipeline (calibrate → qdq stages) that duplicates the quant module's functionality with bugs and config mismatches. This needs to be unified.
Key Findings
1. Compiler's internal quantization is broken
CalibrateStage collects calibration ranges via ORT's MinMaxCalibrater, but QDQStage ignores them entirely — it creates a PrecomputedCalibrationReader with all-ones dummy data and passes that to quantize_static(). The calibration work is wasted. The quant module's get_qdq_config() + quantize() does not have this bug.
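To make the impact concrete, here is a toy sketch in plain Python of why discarding the collected ranges matters. It is not ORT's implementation: `minmax_range` and `uint8_scale` are hypothetical helpers, and the sample values are invented for illustration. It only shows that a quantization scale derived from all-ones dummy data has nothing to do with the scale the real activations would need.

```python
def minmax_range(samples):
    """Return (min, max) observed across all calibration samples."""
    flat = [value for sample in samples for value in sample]
    return min(flat), max(flat)


def uint8_scale(lo, hi):
    """Scale of an asymmetric uint8 quantizer whose range covers lo, hi and zero."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    return (hi - lo) / 255.0


# Activations as a real calibration run might observe them (invented values).
real_samples = [[-2.0, 0.5, 3.0], [1.0, 6.0, -1.5]]
# All-ones dummy data of the kind the QDQ stage hands to the quantizer.
dummy_samples = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

real_scale = uint8_scale(*minmax_range(real_samples))
dummy_scale = uint8_scale(*minmax_range(dummy_samples))

print(f"scale from real calibration: {real_scale:.5f}")
print(f"scale from all-ones dummy:   {dummy_scale:.5f}")
```

With the dummy-derived scale, any activation outside the range seen in the dummy data would saturate, which is the practical cost of ignoring the calibration output.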
2. Duplicate configs with different defaults
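QDQConfig (compiler) and WinMLQuantizationConfig (quant) overlap on 4 fields:
weight_type: compiler defaulted to int8, quant defaults to uint8. The mismatch caused QNN EP to reject LayerNorm nodes (30 CPU fallback partitions instead of 1 EPContext). Fixed in #232, but the root cause is the duplication.
activation_type, per_channel, symmetric: same semantics, different classes.
samples: quant defaults to 10, compiler's CalibrationConfig defaults to 100.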
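A quick way to surface this kind of drift is to diff the defaults of the two classes. The sketch below uses stand-in dataclasses rather than the real ones: the weight_type defaults (int8 in the compiler before the fix, uint8 in quant) and the samples default of 10 come from this issue, while every other default is an assumption made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class CompilerQDQConfig:  # stand-in for the compiler's QDQConfig
    weight_type: str = "int8"       # pre-fix default, per this issue
    activation_type: str = "uint8"  # assumed default
    per_channel: bool = False       # assumed default
    symmetric: bool = False         # assumed default


@dataclass
class QuantConfig:  # stand-in for WinMLQuantizationConfig
    weight_type: str = "uint8"      # per this issue
    activation_type: str = "uint8"  # assumed default
    per_channel: bool = False       # assumed default
    symmetric: bool = False         # assumed default
    samples: int = 10               # per this issue


def default_mismatches(a, b):
    """Return {field: (a_value, b_value)} for shared fields whose values differ."""
    shared = {f.name for f in fields(a)} & {f.name for f in fields(b)}
    return {
        name: (getattr(a, name), getattr(b, name))
        for name in sorted(shared)
        if getattr(a, name) != getattr(b, name)
    }


mismatches = default_mismatches(CompilerQDQConfig(), QuantConfig())
print(mismatches)
```

A check like this could run as a unit test until the duplicate class is removed, at which point it becomes unnecessary.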
3. compile.quantize=True is misleading
WinMLCompileConfig.to_dict() always emits quantize: True when qdq_config is not None. But DetectStage silently overrides this when Q/DQ nodes already exist (the normal build pipeline path). The flag appears to control behavior but is actually a no-op in production.
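The override can be modeled in a few lines. This is a sketch of the described behavior, not the actual DetectStage code: `effective_quantize` is a hypothetical helper, and the graph is reduced to a list of op type names. `QuantizeLinear` and `DequantizeLinear` are the standard ONNX Q/DQ operators.

```python
QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}


def effective_quantize(requested: bool, op_types: list[str]) -> bool:
    """Return whether quantization would actually run for this graph.

    Models the described override: if the graph already contains Q/DQ
    nodes, the requested flag is ignored and quantization is skipped.
    """
    already_quantized = any(op in QDQ_OPS for op in op_types)
    return requested and not already_quantized


# Graph as it arrives from the build pipeline's quant step (already has Q/DQ).
prequantized = ["QuantizeLinear", "Conv", "DequantizeLinear"]
# Plain float graph, e.g. when the compiler is invoked on its own.
float_graph = ["Conv", "Relu"]

for flag in (True, False):
    print(flag, effective_quantize(flag, prequantized), effective_quantize(flag, float_graph))
```

For the pre-quantized graph the result is the same whichever value the flag has, which is what makes the emitted `quantize: True` misleading.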
Proposed Design
Step 1: Unify quantization config
Make WinMLQuantizationConfig the single source of truth. Remove QDQConfig + CalibrationConfig from compiler. Port missing compiler-only fields (distribution, seed, load_path) to WinMLQuantizationConfig.
Step 2: Replace compiler's calibrate+qdq with quant module call
Replace CalibrateStage + QDQStage with a single QuantizeStage that calls quantize_onnx() from the quant module. Fixes the broken calibration pipeline.
Step 3: Thread quant_config through CompileContext
Add quant_config: WinMLQuantizationConfig | None to CompileContext. Update context.quantize to check self.quant_config is not None.
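Steps 2 and 3 together might look roughly like the sketch below. The class shapes are assumptions: `QuantConfig` stands in for WinMLQuantizationConfig, and because the real signature of quantize_onnx() is not given here, the stage takes the quantize function as an injected callable (`fake_quantize` is a placeholder that only renames the path).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QuantConfig:  # stand-in for WinMLQuantizationConfig
    weight_type: str = "uint8"
    samples: int = 10


@dataclass
class CompileContext:
    model_path: str
    quant_config: Optional[QuantConfig] = None

    @property
    def quantize(self) -> bool:
        # Quantization is requested exactly when a config is present.
        return self.quant_config is not None


class QuantizeStage:
    """Single stage replacing the separate calibrate and qdq stages."""

    def __init__(self, quantize_fn: Callable[[str, QuantConfig], str]):
        # quantize_fn stands in for the quant module's quantize_onnx();
        # its real signature is an assumption in this sketch.
        self._quantize_fn = quantize_fn

    def run(self, ctx: CompileContext) -> CompileContext:
        if ctx.quantize:
            ctx.model_path = self._quantize_fn(ctx.model_path, ctx.quant_config)
        return ctx


def fake_quantize(path: str, cfg: QuantConfig) -> str:
    """Placeholder that only derives an output path; no model is touched."""
    return path.replace(".onnx", f".{cfg.weight_type}.onnx")


stage = QuantizeStage(fake_quantize)
print(stage.run(CompileContext("model.onnx", QuantConfig())).model_path)
print(stage.run(CompileContext("model.onnx")).model_path)
```

Deriving `quantize` from the presence of the config removes the separate boolean, so the flag and the config can no longer disagree.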
Step 4: Fix serialization round-trip
WinMLCompileConfig.to_dict() / from_dict() should serialize/deserialize via WinMLQuantizationConfig.to_dict() instead of the flat field extraction that caused the int8/uint8 mismatch.
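A minimal sketch of the delegating round trip follows, with stand-in classes. The `device` field and the existence of a `from_dict()` on the quant config are assumptions for illustration; the point is only that the compile config never copies quant fields one by one.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class QuantConfig:  # stand-in for WinMLQuantizationConfig
    weight_type: str = "uint8"
    activation_type: str = "uint8"
    samples: int = 10

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QuantConfig":
        return cls(**data)


@dataclass
class CompileConfig:  # stand-in for WinMLCompileConfig
    device: str = "qnn"  # hypothetical field
    quant_config: Optional[QuantConfig] = None

    def to_dict(self) -> dict[str, Any]:
        # Delegate to the nested config instead of extracting flat fields,
        # so no field can be dropped or silently re-defaulted.
        quant = self.quant_config.to_dict() if self.quant_config is not None else None
        return {"device": self.device, "quant_config": quant}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "CompileConfig":
        quant = data.get("quant_config")
        return cls(
            device=data["device"],
            quant_config=QuantConfig.from_dict(quant) if quant is not None else None,
        )


original = CompileConfig(quant_config=QuantConfig(weight_type="int8", samples=100))
restored = CompileConfig.from_dict(original.to_dict())
print(restored == original)
```

A round-trip equality test of this shape is a cheap regression guard against the kind of mismatch described above.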
Step 5: Add build config validation
When both config.quant and config.compile.quant_config are set, raise ValueError — the build pipeline runs quant before compile, so they can't both be active.
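The check itself is small. In the sketch below, `BuildConfig` and `CompileSection` are hypothetical stand-ins that carry the configs as plain dicts; only the attribute names `quant` and `compile.quant_config` and the ValueError come from the proposal.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompileSection:  # hypothetical stand-in for config.compile
    quant_config: Optional[dict] = None


@dataclass
class BuildConfig:  # hypothetical stand-in for the build config
    quant: Optional[dict] = None
    compile: CompileSection = field(default_factory=CompileSection)


def validate_build_config(config: BuildConfig) -> None:
    """Reject configs that request quantization in both build stages."""
    if config.quant is not None and config.compile.quant_config is not None:
        raise ValueError(
            "config.quant and config.compile.quant_config are both set; "
            "the build pipeline quantizes before compiling, so set only one."
        )


# Accepted: only the quant stage is configured.
validate_build_config(BuildConfig(quant={"weight_type": "uint8"}))

try:
    validate_build_config(
        BuildConfig(
            quant={"weight_type": "uint8"},
            compile=CompileSection(quant_config={"weight_type": "uint8"}),
        )
    )
except ValueError as exc:
    print(f"rejected: {exc}")
```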
Step 6: Propagate QuantizeResult into build manifest
Thread nodes_quantized, calibration_time_seconds, qdq_insertion_time_seconds from QuantizeResult into the build manifest for observability.
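One possible shape for this, assuming the manifest is a JSON-serializable dict: the three field names are taken from the proposal, while the `quantize` manifest key, the `add_quantize_metrics` helper, and the sample values are invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class QuantizeResult:  # only the three fields named in the proposal
    nodes_quantized: int
    calibration_time_seconds: float
    qdq_insertion_time_seconds: float


def add_quantize_metrics(manifest: dict, result: QuantizeResult) -> dict:
    """Return a copy of the manifest with quantization metrics attached."""
    updated = dict(manifest)
    updated["quantize"] = {
        "nodes_quantized": result.nodes_quantized,
        "calibration_time_seconds": result.calibration_time_seconds,
        "qdq_insertion_time_seconds": result.qdq_insertion_time_seconds,
    }
    return updated


# Illustrative values only, not measurements.
manifest = add_quantize_metrics(
    {"model": "model.onnx"},
    QuantizeResult(
        nodes_quantized=42,
        calibration_time_seconds=1.5,
        qdq_insertion_time_seconds=0.25,
    ),
)
print(json.dumps(manifest, indent=2))
```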
Additional: wmk build for pre-exported ONNX models
The build command currently requires a HuggingFace model ID and runs Export → Optimize → Analyze → Quantize → Compile. We also need a path where:
    # Build from existing ONNX (skip export/optimize/analyze)
    wmk build --onnx model.onnx --device qnn --precision uint8 -o output/

This would:
Accept a pre-exported ONNX model (--onnx flag)
Accept --device (qnn, dml, cpu) and --precision (uint8, int8, fp16, fp32)
Map --device + --precision to the appropriate WinMLQuantizationConfig + WinMLCompileConfig
This enables the "I already have an ONNX model, just quantize and compile it for my target device" workflow.
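As a rough illustration of the flag handling, the sketch below parses the proposed flags with argparse and maps them to two plain dicts standing in for WinMLQuantizationConfig and WinMLCompileConfig. The mapping policy (integer precisions quantize both weights and activations to that type, float precisions skip quantization) and the `--output` long form of `-o` are assumptions, not decisions made in this issue.

```python
import argparse
from typing import Optional

QUANTIZED_PRECISIONS = {"uint8", "int8"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wmk build")
    parser.add_argument("--onnx", required=True, help="pre-exported ONNX model")
    parser.add_argument("--device", choices=["qnn", "dml", "cpu"], required=True)
    parser.add_argument(
        "--precision", choices=["uint8", "int8", "fp16", "fp32"], required=True
    )
    parser.add_argument("-o", "--output", required=True)  # long form is assumed
    return parser


def to_configs(device: str, precision: str) -> tuple[Optional[dict], dict]:
    """Map (device, precision) to (quant config, compile config) as dicts.

    Assumed policy: integer precisions quantize weights and activations to
    that type; float precisions skip quantization entirely.
    """
    quant = None
    if precision in QUANTIZED_PRECISIONS:
        quant = {"weight_type": precision, "activation_type": precision}
    compile_cfg = {"device": device, "precision": precision}
    return quant, compile_cfg


args = build_parser().parse_args(
    ["--onnx", "model.onnx", "--device", "qnn", "--precision", "uint8", "-o", "output/"]
)
quant_cfg, compile_cfg = to_configs(args.device, args.precision)
print(quant_cfg, compile_cfg)
```

Keeping the mapping in one pure function makes the device and precision matrix easy to test without touching any model files.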
Acceptance Criteria
QDQConfig and CalibrationConfig removed from compiler
WinMLQuantizationConfig is the single quant config used by both modules
Compiler uses quantize_onnx() internally (no more broken calibrate+qdq)
wmk build --onnx model.onnx --device qnn --precision uint8 works
[...] wmk perf for all existing models
🤖 Generated with Claude Code