Summary
winml quantize fails with an ORT NOT_IMPLEMENTED error when the input .onnx is a compiled QNN EPContext model (e.g. output of winml build). ORT's calibration step tries to create an InferenceSession over the EPContext model using the CPU provider, which cannot execute EPContext nodes.
Repro
# model.onnx is the output of `winml config` + `winml build` for ResNet-50
winml quantize -m C:\Users\...\resnet-50\model.onnx \
--weight-type int8 --activation-type uint16 \
-o C:\Users\...\resnet-50\quantize\model_w8a16.onnx
Error
[2026-04-03T11:54:41] ERROR: Quantization failed
Traceback (most recent call last):
File "winml/modelkit/quant/quantizer.py", line 166, in quantize_onnx
quantize(...)
...
File "onnxruntime/quantization/calibrate.py", line 235, in create_inference_session
self.infer_session = onnxruntime.InferenceSession(...)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED :
Could not find an implementation for EPContext(1) node with name
'QNNExecutionProvider_QNN_16222565441986465013_1_0'
Root Cause
ORT's static quantization calibrator creates an InferenceSession on the input model using default (CPU) providers. A QNN EPContext model contains EPContext nodes that can only be executed by the QNN execution provider, so the CPU provider has no kernel for them and session creation fails with NOT_IMPLEMENTED.
Quantization of a pre-compiled EPContext model is not a meaningful operation anyway, as the weights are embedded in an opaque binary blob.
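Because EPContext nodes are ordinary nodes in the ONNX graph, their presence can in principle be checked before calibration starts. The following is a minimal sketch of such a check, not the actual winml implementation. The helper names (`find_epcontext_nodes`, `ensure_not_epcontext`, `check_model_file`) and the `EPContextInputError` type are hypothetical, and the sketch only inspects top-level graph nodes, not subgraphs.

```python
from typing import List

# Op type reported in the traceback: "EPContext(1) node".
EPCONTEXT_OP_TYPE = "EPContext"


class EPContextInputError(ValueError):
    """Hypothetical error type for rejecting compiled EPContext inputs."""


def find_epcontext_nodes(model) -> List[str]:
    """Return the names of top-level EPContext nodes.

    `model` is expected to look like an onnx.ModelProto, i.e. it exposes
    `model.graph.node`, and each node has `op_type` and `name`.
    """
    return [
        node.name
        for node in model.graph.node
        if node.op_type == EPCONTEXT_OP_TYPE
    ]


def ensure_not_epcontext(model) -> None:
    """Raise early if the model contains any EPContext node."""
    if find_epcontext_nodes(model):
        # Message text taken from the expected behavior described in this issue.
        raise EPContextInputError(
            "Input model is a compiled QNN EPContext artifact. "
            "Run winml quantize on the original ONNX model before compilation."
        )


def check_model_file(path: str) -> None:
    """Load only the graph structure from disk and run the guard."""
    import onnx  # imported lazily so the helpers above work without onnx installed

    # External tensor data is not needed to inspect node op types.
    model = onnx.load(path, load_external_data=False)
    ensure_not_epcontext(model)
```

Calling something like `check_model_file(...)` before handing the model to the ORT calibrator would surface the clear message instead of the NOT_IMPLEMENTED traceback. Where exactly such a guard belongs in the winml quantize code path is left open here.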
Expected behavior
winml quantize should detect that the input is an EPContext model and fail early with a clear message:
"Input model is a compiled QNN EPContext artifact. Run winml quantize on the original ONNX model before compilation."
Notes
- Reporter: AdinaTru
- Input model: output of winml config + winml build for ResNet-50 on Qualcomm device
- Same root cause as the winml optimize EPContext failure (see related issue)