QDQ implementation #7033
Conversation
Out of curiosity, I've seen some mentions of QDQ in the quantizer tool (and now this graph transform) and have been wondering what exactly it is. Is there an explanation of what this "QDQ format" is and what advantages it gives over the standard conversion of e.g. Conv -> QLinearConv?

Edit: From reading the code, it seems like with QDQ a node like …
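A minimal sketch of the two styles, built with the standard `onnx` Python helpers (the tensor names are placeholders for illustration, not from this PR):

```python
# Sketch: the same quantized Conv expressed in "QOperator" style vs. "QDQ"
# style, built with onnx.helper. Names like "x_q"/"x_scale" are placeholders.
from onnx import helper

# QOperator style: one fused node. QLinearConv is standard ONNX, but many
# of the other QLinear* operators ORT uses are not.
qlinear_conv = helper.make_node(
    "QLinearConv",
    inputs=["x_q", "x_scale", "x_zp",
            "w_q", "w_scale", "w_zp",
            "y_scale", "y_zp"],
    outputs=["y_q"],
)

# QDQ style: only standard ONNX ops. An EP that can't run a quantized Conv
# simply executes the float Conv between the DequantizeLinear/QuantizeLinear
# pairs; an EP that can will fuse the whole pattern.
dq_x = helper.make_node("DequantizeLinear", ["x_q", "x_scale", "x_zp"], ["x_f"])
dq_w = helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_f"])
conv = helper.make_node("Conv", ["x_f", "w_f"], ["y_f"])
q_y = helper.make_node("QuantizeLinear", ["y_f", "y_scale", "y_zp"], ["y_q"])
```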
OnnxRuntime (ORT) CPU ExecutionProvider (EP) quantizes models with standard ONNX quantization operators (QuantizeLinear, QLinearConv, DequantizeLinear, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, like QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., to run efficiently. This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (like TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize an operator, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities, and thus different preferences about which quantized operators to support. Add too many, and supporting them all becomes a burden, so each accelerator ends up supporting only a subset. Add too few, and the customization problem remains unsolved. QuantizeLinear + DequantizeLinear (QDQ) solves this flexibility issue: a backend engine can quantize the model however it wants, and simply skip quantizing operators it does not support.
@yufenglee Ah ok, that makes sense. We ran into the same issue adding our own accelerator backend and had to implement all those non-standard operators. Just to clarify though, the EPs will still have to fuse the …
Yes, that's it.
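A rough sketch of the kind of pattern match such a fusion implies (a hypothetical Python helper for illustration only; ORT's actual transformer is C++ and handles far more cases):

```python
# Hypothetical sketch, not ORT's actual code: find
# DequantizeLinear -> Conv -> QuantizeLinear triples that an EP could
# replace with its own quantized Conv kernel.
import onnx

def find_qdq_convs(graph: onnx.GraphProto):
    # Map each tensor name to the node that produces / the nodes that consume it.
    producers = {out: n for n in graph.node for out in n.output}
    consumers = {}
    for n in graph.node:
        for i in n.input:
            consumers.setdefault(i, []).append(n)

    matches = []
    for node in graph.node:
        if node.op_type != "Conv":
            continue
        # Both data and weight inputs must come from DequantizeLinear nodes.
        dqs = [producers.get(i) for i in node.input[:2]]
        # The Conv output must feed exactly one QuantizeLinear node.
        qs = consumers.get(node.output[0], [])
        if (all(dq is not None and dq.op_type == "DequantizeLinear" for dq in dqs)
                and len(qs) == 1 and qs[0].op_type == "QuantizeLinear"):
            matches.append((dqs, node, qs[0]))
    return matches
```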
graph_.AddInitializedTensor(zp_tensor_proto);
}

input_defs.push_back(&graph_.GetOrCreateNodeArg(zp_tensor_proto.name(), nullptr));
Could we just add these at the start and not bother checking here? Graph::Resolve will throw away anything that isn't used later on.
The data type of DequantizeLinear's zero_point is determined by its first input; the check is mainly for that.
Sorry - I could have been clearer. I meant add both the int8 and uint8 zero points at the start, as Graph::Resolve will throw anything unused away. Optional to change.
/azp run Windows GPU TensorRT CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).
Demo? Code?
Description
Add basic support for the QDQ format in ORT, covering the QLinear versions of Conv, MatMul, Reshape, Add, Mul, and MaxPool.
Motivation and Context
OnnxRuntime (ORT) CPU ExecutionProvider (EP) quantizes models with standard ONNX quantization operators (QuantizeLinear, QLinearConv, DequantizeLinear, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, like QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., to run efficiently.
This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (like TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize an operator, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities, and thus different preferences about which quantized operators to support. Add too many, and supporting them all becomes a burden, so each accelerator ends up supporting only a subset. Add too few, and the customization problem remains unsolved.
QuantizeLinear + DequantizeLinear (QDQ) solves this flexibility issue: a backend engine can quantize the model however it wants, and simply skip quantizing operators it does not support.
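For context, a sketch of producing a QDQ-format model with ORT's Python quantization tool, assuming a recent onnxruntime release where quantize_static accepts a quant_format argument (the model path, input name, and shape below are placeholders):

```python
# Hedged sketch: quantize a model into QDQ format with ORT's tooling.
# "model.onnx", "input", and the (1, 3, 224, 224) shape are placeholders.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, quantize_static)

class RandomReader(CalibrationDataReader):
    """Feeds a few random batches for calibration (illustrative only;
    real calibration should use representative data)."""
    def __init__(self, n=8):
        self.batches = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(n)])

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    "model.onnx", "model_qdq.onnx", RandomReader(),
    quant_format=QuantFormat.QDQ)  # emit Q/DQ pairs instead of QLinear ops
```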