
DynamicQuantizeLinear opset 20 and float 8 #5472

Closed (33 commits)
Conversation

@xadupre (Contributor) commented Aug 4, 2023

Description

DynamicQuantizeLinear only supports uint8. This PR adds support for int8 and float8.

Motivation and Context

The operator is used to dynamically quantize an input.
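For context, the existing uint8 behavior of the operator can be sketched in NumPy. This is an illustrative sketch based on the published DynamicQuantizeLinear definition, not the reference implementation; the zero-scale fallback is an assumption for the all-zero input case.

```python
import numpy as np

def dynamic_quantize_linear_u8(x: np.ndarray):
    """Sketch of DynamicQuantizeLinear for uint8 (illustrative only)."""
    qmin, qmax = 0, 255
    # The range is extended to include 0 so that zero is exactly representable.
    x_min = min(0.0, float(x.min()))
    x_max = max(0.0, float(x.max()))
    y_scale = (x_max - x_min) / (qmax - qmin)
    if y_scale == 0.0:
        y_scale = 1.0  # assumption: avoid division by zero for an all-zero input
    # The zero point maps the real value 0.0 onto the integer grid.
    zp = np.clip(np.rint(qmin - x_min / y_scale), qmin, qmax)
    y = np.clip(np.rint(x / y_scale) + zp, qmin, qmax).astype(np.uint8)
    return y, np.float32(y_scale), np.uint8(zp)

y, scale, zp = dynamic_quantize_linear_u8(
    np.array([-1.0, 0.0, 1.0, 3.0], dtype=np.float32))
```

The new int8 and float8 paths in this PR differ mainly in the saturation range and, for float8, in how the scale is estimated, since the zero point is fixed at 0.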

Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
@xadupre xadupre added the operator (Issues related to ONNX operators) label Aug 4, 2023
Review comments on docs/Changelog.md and docs/Operators.md (outdated, resolved)
@yufenglee (Contributor) commented:

We also need to add fp8 support for MatMulInteger to support dynamic quantization for fp8.

@xadupre xadupre marked this pull request as ready for review August 8, 2023 09:36
@xadupre xadupre requested review from a team as code owners August 8, 2023 09:36
@xadupre (Contributor, Author) commented Aug 8, 2023

We also need to add fp8 support for MatMulInteger to support dynamic quantization for fp8.

The CUDA function cublasLtMatMul allows more than one option for the output type with the same input types. Since there is no scale for the output, the output type could be float32, float16, or bfloat16. I started to modify QLinearMatMul in PR #5473, which can be seen as a more generic version of MatMulInteger. Transposition is also a factor: cublasLtMatMul only supports A.T @ B with float 8 (and column-major order). Zero point is not used for float 8 types, and the name MatMulInteger implies integer types. Is it possible to modify the quantization tools to use QLinearMatMul instead?

@justinchuby (Contributor) commented:

Nit: "convertion" -> "conversion"

@justinchuby (Contributor) commented:

Is this ready for reviews?

@jcwchen (Member) left a comment:
We also see more interest in adding 16-bit support in QuantizeLinear/DequantizeLinear: #3971 (comment). If we do bump DynamicQuantizeLinear in this opset version, it might be a good time to add 16-bit support for QuantizeLinear/DequantizeLinear as well (if it makes sense, it can be done in another PR).

@xadupre (Contributor, Author) commented Aug 19, 2023

The only thing which would require a larger consensus is the method I used to estimate the scale for float 8. Models are usually trained with float 8, and the scale estimation is part of the training. It is different from what I came up with.

@justinchuby (Contributor) commented:
Cc @gramalingam

@justinchuby justinchuby added this to the 1.15 milestone Aug 30, 2023
Data quantization formula is:
```
y = saturate (round (x / y_scale) + y_zero_point)
```

* for saturation, it saturates to [0, 255] if it's uint8, or [-127, 127] if it's int8. Right now only uint8 is supported.
* rounding to nearest ties to even.

y_zero_point must be 0 for any float 8 type.
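The formula above can be exercised directly in NumPy; `np.rint` rounds ties to even, matching the rounding rule quoted here. This is an illustrative sketch, with the qmin/qmax arguments standing in for the per-type saturation bounds:

```python
import numpy as np

def quantize(x, y_scale, y_zero_point, qmin, qmax):
    """y = saturate(round(x / y_scale) + y_zero_point)."""
    # np.rint implements round-half-to-even, the rounding mode in the spec text.
    return np.clip(np.rint(x / y_scale) + y_zero_point, qmin, qmax)

x = np.array([-1.5, -0.5, 0.5, 200.0], dtype=np.float32)
y_u8 = quantize(x, y_scale=1.0, y_zero_point=0, qmin=0, qmax=255)     # uint8 range
y_i8 = quantize(x, y_scale=1.0, y_zero_point=0, qmin=-127, qmax=127)  # int8 range
```

Note how 0.5 rounds to 0 (the nearest even integer) rather than 1, and how the int8 range saturates 200.0 to 127.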
Contributor review comment:
But this is a problem. If we are forced to use 0 as the zeropoint, the above computation of scale will not guarantee that all values can be reasonably represented as a float8. We may need to change the computation of scale as well, using something like "max ( max(x)/qmax, min(x)/qmin )" with some adjustments for rounding etc.

But, better still, if this is being used in practice, what are they doing?
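The suggested scale computation could be sketched as follows. The bounds use the float8 e4m3fn finite maximum of 448 as qmax (and -448 as qmin); the bound values and the zero-input fallback are assumptions for illustration, not part of the proposal:

```python
import numpy as np

def fp8_scale(x: np.ndarray, qmax: float = 448.0, qmin: float = -448.0) -> float:
    """Scale for a forced zero_point of 0: max(max(x)/qmax, min(x)/qmin).

    Whichever endpoint of the data range is farther, relative to its
    quantized bound, determines the scale, so both max(x) and min(x)
    remain representable after division by the scale.
    """
    s = max(float(x.max()) / qmax, float(x.min()) / qmin)
    return s if s > 0.0 else 1.0  # assumption: fallback for an all-zero input

scale = fp8_scale(np.array([-900.0, 0.0, 450.0], dtype=np.float32))
```

With this scale, -900.0 / scale lands exactly on qmin, so the negative extreme is not clipped even though the zero point is 0. Rounding adjustments and sign handling, as noted below, would still need care.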

Contributor review comment:
May need some adjustments to ensure signs are handled correctly as well.

@liqunfu liqunfu modified the milestones: 1.15, 1.16 Sep 22, 2023
@justinchuby justinchuby modified the milestones: 1.16, 1.17 Feb 8, 2024
@xadupre xadupre closed this Apr 4, 2024