
QDQ implementation #7033

Merged
merged 7 commits on Mar 25, 2021

Conversation

@yufenglee (Member) commented Mar 16, 2021

Description:
Add basic support for the QDQ format in ORT. It includes support for the quantized (QLinear) handling of Conv, MatMul, Reshape, Add, Mul, and MaxPool.

Motivation and Context
OnnxRuntime (ORT) quantizes models for the CPU ExecutionProvider (EP) using the standard ONNX quantization operators (QuantizeLinear, DequantizeLinear, QLinearConv, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, such as QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., so that quantized models run efficiently.

This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (such as TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize operators, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities and therefore different preferences for which quantized operators to support. Add too many, and every accelerator is burdened with supporting all of them and ends up supporting only a subset; add too few, and the need for customization is still not met.

Representing quantization with QuantizeLinear + DequantizeLinear (QDQ) pairs solves the flexibility issue: each backend can quantize the model the way it wants and simply leave the operators it does not support in floating point.
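
To make the format concrete, here is a minimal sketch of the QDQ pattern around a single Conv, built with onnx.helper; the names, shapes, and scale/zero-point values are made up purely for illustration and are not taken from this PR:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# Hypothetical shapes and quantization parameters, for illustration only.
x = helper.make_tensor_value_info("x", TensorProto.UINT8, [1, 3, 8, 8])
y = helper.make_tensor_value_info("y", TensorProto.UINT8, [1, 4, 8, 8])

w_q = numpy_helper.from_array(np.zeros((4, 3, 3, 3), dtype=np.int8), "w_q")
x_scale = numpy_helper.from_array(np.array(0.02, dtype=np.float32), "x_scale")
x_zp = numpy_helper.from_array(np.array(128, dtype=np.uint8), "x_zp")
w_scale = numpy_helper.from_array(np.array(0.01, dtype=np.float32), "w_scale")
w_zp = numpy_helper.from_array(np.array(0, dtype=np.int8), "w_zp")
y_scale = numpy_helper.from_array(np.array(0.05, dtype=np.float32), "y_scale")
y_zp = numpy_helper.from_array(np.array(128, dtype=np.uint8), "y_zp")

nodes = [
    # dequantize the quantized activation and weight back to float
    helper.make_node("DequantizeLinear", ["x", "x_scale", "x_zp"], ["x_dq"]),
    helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_dq"]),
    # the operator itself stays a plain float Conv
    helper.make_node("Conv", ["x_dq", "w_dq"], ["conv_out"], pads=[1, 1, 1, 1]),
    # quantize the float output again
    helper.make_node("QuantizeLinear", ["conv_out", "y_scale", "y_zp"], ["y"]),
]

graph = helper.make_graph(
    nodes, "qdq_conv", [x], [y],
    initializer=[w_q, x_scale, x_zp, w_scale, w_zp, y_scale, y_zp])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
```

A backend that understands the pattern can fuse it into a quantized kernel; any other backend can run the Q/DQ nodes and the float Conv as-is.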

@yufenglee yufenglee requested a review from a team as a code owner March 16, 2021 21:15
@pranav-prakash (Contributor) commented Mar 17, 2021

Out of curiosity, I've seen some mentions of QDQ in the quantizer tool (and now this graph transform) and have been wondering what exactly it is. Is there an explanation of what this "QDQ format" is and what advantages it gives over the standard conversion of e.g. Conv -> QLinearConv?

Edit: From reading the code it seems like with QDQ a node like Conv is converted to Dequant -> Conv -> Quant. But I fail to see what benefit this gives; doesn't doing the convolution in fp32 negate the whole point of having a quantized model? Are EPs expected to internally fuse that node sequence into an int8 conv? Or is this meant to be used for training?

@yufenglee yufenglee force-pushed the yufeng/qdq_basic branch 2 times, most recently from d40da41 to 83555c1 on March 18, 2021 21:32
@yufenglee (Member, Author) commented Mar 18, 2021

> Out of curiosity, I've seen some mentions of QDQ in the quantizer tool (and now this graph transform) and have been wondering what exactly it is. Is there an explanation of what this "QDQ format" is and what advantages it gives over the standard conversion of e.g. Conv -> QLinearConv?
>
> Edit: From reading the code it seems like with QDQ a node like Conv is converted to Dequant -> Conv -> Quant. But I fail to see what benefit this gives; doesn't doing the convolution in fp32 negate the whole point of having a quantized model? Are EPs expected to internally fuse that node sequence into an int8 conv? Or is this meant to be used for training?

OnnxRuntime (ORT) quantizes models for the CPU ExecutionProvider (EP) using the standard ONNX quantization operators (QuantizeLinear, DequantizeLinear, QLinearConv, QLinearMatMul, ConvInteger, MatMulInteger) plus tens of non-standard quantization operators, such as QLinearAdd, QLinearMul, QLinearSigmoid, QLinearLeakyRelu, QLinearAveragePool, DynamicQuantizeLSTM, etc., so that quantized models run efficiently.

This approach works well for the CPU EP. However, as more and more accelerators start supporting ONNX quantization and plan to integrate their quantization capabilities into ORT, problems appear: a quantized model that runs well on the CPU EP does not work on other EPs (such as TensorRT or OpenVINO) because of the non-standard operators. Pushing those non-standard operators into the ONNX standard sounds like a solution, but it will not work. Setting aside the long time it takes to standardize operators, it is almost impossible to decide which quantized operators should become standard. Different accelerators have different capabilities and therefore different preferences for which quantized operators to support. Add too many, and every accelerator is burdened with supporting all of them and ends up supporting only a subset; add too few, and the need for customization is still not met.

Representing quantization with QuantizeLinear + DequantizeLinear (QDQ) pairs solves the flexibility issue: each backend can quantize the model the way it wants and simply leave the operators it does not support in floating point.
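
To make the fusion question concrete: ORT's actual QDQ handling added in this PR is a C++ graph transformer, but the idea can be sketched as a toy Python pass over a plain ONNX graph that matches DequantizeLinear -> Conv -> QuantizeLinear and replaces it with a single QLinearConv. This is illustrative only, not the PR's implementation:

```python
import onnx
from onnx import helper

def fuse_qdq_conv(model: onnx.ModelProto) -> None:
    """Toy fusion: DequantizeLinear -> Conv -> QuantizeLinear  =>  QLinearConv."""
    g = model.graph
    producer = {out: n for n in g.node for out in n.output}
    consumers = {}
    for n in g.node:
        for i in n.input:
            consumers.setdefault(i, []).append(n)

    for conv in [n for n in g.node if n.op_type == "Conv"]:
        dq_x = producer.get(conv.input[0])
        dq_w = producer.get(conv.input[1])
        q_y = (consumers.get(conv.output[0]) or [None])[0]
        if not (dq_x and dq_w and q_y
                and dq_x.op_type == "DequantizeLinear"
                and dq_w.op_type == "DequantizeLinear"
                and q_y.op_type == "QuantizeLinear"):
            continue  # pattern not matched: leave the float Conv as-is

        # QLinearConv inputs: x, x_scale, x_zp, w, w_scale, w_zp, y_scale, y_zp
        # (bias handling omitted; a real pass also checks that each DQ/Q node
        # has no other consumers and that the quantization params are supported)
        qconv = helper.make_node(
            "QLinearConv",
            list(dq_x.input) + list(dq_w.input) + list(q_y.input[1:]),
            [q_y.output[0]],
            name=(conv.name or "conv") + "_quant")
        qconv.attribute.extend(conv.attribute)  # carry over pads, strides, etc.

        # insert the fused node where the QuantizeLinear was, then drop the rest
        g.node.insert(list(g.node).index(q_y), qconv)
        for n in (dq_x, dq_w, conv, q_y):
            g.node.remove(n)
```

An EP that does not recognize the pattern simply executes the DequantizeLinear, float Conv, and QuantizeLinear nodes as they are, which is exactly the fallback behavior the QDQ format is meant to provide.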

@yufenglee yufenglee changed the title [WIP] QDQ implementation QDQ implementation Mar 18, 2021
@pranav-prakash (Contributor) commented:

@yufenglee Ah, OK, that makes sense. We ran into the same issue adding our own accelerator backend and had to implement all those non-standard operators. Just to clarify though, the EPs will still have to fuse the Dequant -> Conv -> Quant sequence into their internal quantized convolution node, right? But the EPs can do so selectively and unfused ops will still run, so the benefit is that a single quantized model can work on all EPs regardless of which ones they choose to fuse?

@yufenglee (Member, Author) commented:

> @yufenglee Ah, OK, that makes sense. We ran into the same issue adding our own accelerator backend and had to implement all those non-standard operators. Just to clarify though, the EPs will still have to fuse the Dequant -> Conv -> Quant sequence into their internal quantized convolution node, right? But the EPs can do so selectively and unfused ops will still run, so the benefit is that a single quantized model can work on all EPs regardless of which ones they choose to fuse?

Yes, that's it.
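
For readers asking how such a model is produced in the first place, here is a hedged sketch using ORT's Python quantization tool. It assumes an onnxruntime release where quantize_static exposes QuantFormat.QDQ; the model path, input name, and shape below are placeholders:

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a handful of random samples, purely for illustration."""
    def __init__(self, input_name: str, shape, count: int = 8):
        self._data = iter(
            {input_name: np.random.rand(*shape).astype(np.float32)}
            for _ in range(count))

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model.onnx",                 # float32 input model (placeholder path)
    "model_qdq.onnx",             # output model with Q/DQ pairs inserted
    RandomCalibrationReader("input", shape=(1, 3, 224, 224)),
    quant_format=QuantFormat.QDQ,  # emit QuantizeLinear/DequantizeLinear pairs
)
```

The resulting model stays valid standard ONNX, so each EP can fuse whichever Q/DQ patterns it supports and fall back to float execution for the rest.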

graph_.AddInitializedTensor(zp_tensor_proto);
}

input_defs.push_back(&graph_.GetOrCreateNodeArg(zp_tensor_proto.name(), nullptr));
Review comment (Contributor):

Could we just add these at the start and not bother checking here? Graph::Resolve will throw away anything that isn't used later on.

Reply (Member, Author):

The data type of DequantizeLinear's zero_point is determined by its 1st input; the check is mainly for that.

Reply (Contributor):

Sorry - I could have been clearer. I meant adding both the int8 and uint8 zero points at the start, as Graph::Resolve will throw anything unused away. Optional to change.
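
As a side note on the zero-point discussion above (illustrative only, not the PR's C++ code): DequantizeLinear requires x and x_zero_point to share the same element type, so a pass that inserts a missing zero point has to pick int8 or uint8 based on the tensor being dequantized. A hypothetical helper:

```python
import numpy as np
from onnx import TensorProto, numpy_helper

# map ONNX element types to the numpy dtype a matching zero point must use
_ZP_DTYPE = {TensorProto.UINT8: np.uint8, TensorProto.INT8: np.int8}

def make_zero_point(name: str, x_elem_type: int):
    """Create a scalar zero-point initializer whose dtype matches the input."""
    return numpy_helper.from_array(np.array(0, dtype=_ZP_DTYPE[x_elem_type]), name)
```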



@snnn (Member) commented Mar 24, 2021

/azp run Windows GPU TensorRT CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@skottmckay (Contributor) left a review comment:

:shipit:

@aimen123 commented:

Is there a demo or example code?
