QDQ implementation #7033
Conversation
Out of curiosity, I've seen some mentions of QDQ in the quantizer tool (and now this graph transform) and have been wondering what exactly it is. Is there an explanation of what this "QDQ format" is and what advantages it gives over the standard conversion of e.g. Conv -> QLinearConv?

Edit: From reading the code, it seems like with QDQ a node like …
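A minimal sketch of the two styles, built with the standard `onnx` Python helpers (the tensor names are placeholders for illustration, not from this PR):

```python
# Sketch: the same quantized Conv expressed in "QOperator" style vs. "QDQ"
# style, built with onnx.helper. Names like "x_q"/"x_scale" are placeholders.
from onnx import helper

# QOperator style: one fused node. QLinearConv is standard ONNX, but many
# of the other QLinear* operators ORT uses are not.
qlinear_conv = helper.make_node(
    "QLinearConv",
    inputs=["x_q", "x_scale", "x_zp",
            "w_q", "w_scale", "w_zp",
            "y_scale", "y_zp"],
    outputs=["y_q"],
)

# QDQ style: only standard ONNX ops. An EP that can't run a quantized Conv
# simply executes the float Conv between the DequantizeLinear/QuantizeLinear
# pairs; an EP that can will fuse the whole pattern.
dq_x = helper.make_node("DequantizeLinear", ["x_q", "x_scale", "x_zp"], ["x_f"])
dq_w = helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_f"])
conv = helper.make_node("Conv", ["x_f", "w_f"], ["y_f"])
q_y = helper.make_node("QuantizeLinear", ["y_f", "y_scale", "y_zp"], ["y_q"])
```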
OnnxRuntime (ORT) CPU ExecutionProvider (EP) quantizes models with standard ONNX quantization operators (QuantizeLinear, QLinearConv, DequantizeLinear, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, like QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., to run efficiently. This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (like TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize an operator, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities, and thus different preferences about which quantized operators to support. Add too many, and supporting them all becomes a burden, so each accelerator ends up supporting only a subset. Add too few, and the customization problem remains unsolved. QuantizeLinear + DequantizeLinear (QDQ) solves this flexibility issue: a backend engine can quantize the model however it wants, and simply skip quantizing operators it does not support.
@yufenglee Ah ok, that makes sense. We ran into the same issue adding our own accelerator backend and had to implement all those non-standard operators. Just to clarify though, the EPs will still have to fuse the …
Yes, that's it.
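A rough sketch of the kind of pattern match such a fusion implies (a hypothetical Python helper for illustration only; ORT's actual transformer is C++ and handles far more cases):

```python
# Hypothetical sketch, not ORT's actual code: find
# DequantizeLinear -> Conv -> QuantizeLinear triples that an EP could
# replace with its own quantized Conv kernel.
import onnx

def find_qdq_convs(graph: onnx.GraphProto):
    # Map each tensor name to the node that produces / the nodes that consume it.
    producers = {out: n for n in graph.node for out in n.output}
    consumers = {}
    for n in graph.node:
        for i in n.input:
            consumers.setdefault(i, []).append(n)

    matches = []
    for node in graph.node:
        if node.op_type != "Conv":
            continue
        # Both data and weight inputs must come from DequantizeLinear nodes.
        dqs = [producers.get(i) for i in node.input[:2]]
        # The Conv output must feed exactly one QuantizeLinear node.
        qs = consumers.get(node.output[0], [])
        if (all(dq is not None and dq.op_type == "DequantizeLinear" for dq in dqs)
                and len(qs) == 1 and qs[0].op_type == "QuantizeLinear"):
            matches.append((dqs, node, qs[0]))
    return matches
```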
graph_.AddInitializedTensor(zp_tensor_proto);
}

input_defs.push_back(&graph_.GetOrCreateNodeArg(zp_tensor_proto.name(), nullptr));
Could we just add these at the start and not bother checking here? Graph::Resolve will throw away anything that isn't used later on.
The data type of DequantizeLinear's zero_point is determined by its first input; the check is mainly for that.
Sorry - I could have been clearer. I meant add both the int8 and uint8 zero points at the start, as Graph::Resolve will throw anything unused away. Optional to change.
/azp run Windows GPU TensorRT CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).
Demo? Code?
Description
Add basic support for the QDQ format in ORT, covering the QLinear versions of Conv, MatMul, Reshape, Add, Mul, and MaxPool.
Motivation and Context
OnnxRuntime (ORT) CPU ExecutionProvider (EP) quantizes models with standard ONNX quantization operators (QuantizeLinear, QLinearConv, DequantizeLinear, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, like QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., to run efficiently.
This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (like TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize an operator, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities, and thus different preferences about which quantized operators to support. Add too many, and supporting them all becomes a burden, so each accelerator ends up supporting only a subset. Add too few, and the customization problem remains unsolved.
QuantizeLinear + DequantizeLinear (QDQ) solves this flexibility issue: a backend engine can quantize the model however it wants, and simply skip quantizing operators it does not support.
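For context, a sketch of producing a QDQ-format model with ORT's Python quantization tool, assuming a recent onnxruntime release where quantize_static accepts a quant_format argument (the model path, input name, and shape below are placeholders):

```python
# Hedged sketch: quantize a model into QDQ format with ORT's tooling.
# "model.onnx", "input", and the (1, 3, 224, 224) shape are placeholders.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, quantize_static)

class RandomReader(CalibrationDataReader):
    """Feeds a few random batches for calibration (illustrative only;
    real calibration should use representative data)."""
    def __init__(self, n=8):
        self.batches = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(n)])

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    "model.onnx", "model_qdq.onnx", RandomReader(),
    quant_format=QuantFormat.QDQ)  # emit Q/DQ pairs instead of QLinear ops
```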