DynamicQuantizeLinear opset 20 and float 8 #5472
Conversation
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
We also need to add fp8 support for MatMulInteger to support dynamic quantization for fp8.
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
The function defined by CUDA's cublasLtMatMul allows more than one option for the output type with the same input types. Since there is no scale for the output, the output type could be float32, float16, or bfloat16. I started to modify QLinearMatMul in PR #5473, which can be seen as a more generic version of MatMulInteger. There is also the transposition to take out of the equation, and cublasLtMatMul only supports
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Nit: "convertion" -> "conversion" |
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
Is this ready for review?
We also see more interest in adding 16-bit support to QuantizeLinear/DequantizeLinear: #3971 (comment). If we do bump DynamicQuantizeLinear in this opset version, it might be a good time to add 16-bit support to QuantizeLinear/DequantizeLinear as well (if it makes sense, it can be done in another PR).
The only thing which would require a larger consensus is the method I used to estimate the scale for float 8. Models are usually trained with float 8 and the scale estimation is part of the training, so it is different from what I came up with.
Cc @gramalingam
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
* rounding to nearest ties to even.

Data quantization formula is:

```
y = saturate (round (x / y_scale) + y_zero_point)
```

* for saturation, it saturates to [0, 255] if it's uint8, or [-127, 127] if it's int8. Right now only uint8 is supported.
* rounding to nearest ties to even.

y_zero_point must be 0 for any float 8 type.
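A minimal numpy sketch of this formula for the uint8/int8 cases, assuming the saturation bounds listed above (the helper name is illustrative, not code from the PR):

```python
import numpy as np

def quantize(x: np.ndarray, y_scale: float, y_zero_point: int, dtype=np.uint8) -> np.ndarray:
    # Saturation range: [0, 255] for uint8, [-127, 127] for int8.
    qmin, qmax = (0, 255) if dtype == np.uint8 else (-127, 127)
    # np.rint rounds to nearest with ties to even, matching the spec's rounding rule.
    y = np.rint(x / y_scale) + y_zero_point
    return np.clip(y, qmin, qmax).astype(dtype)
```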
But this is a problem: if we are forced to use 0 as the zero point, the above computation of the scale will not guarantee that all values can be reasonably represented as a float 8. We may need to change the computation of the scale as well, using something like "max ( max(x)/qmax, min(x)/qmin )" with some adjustments for rounding etc.
But, better still, if this is already being used in practice, what are people doing?
May need some adjustments to ensure signs are handled correctly as well.
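A rough sketch of the scale computation suggested above for a float 8 type with the zero point fixed at 0. The helper is hypothetical, not code from this PR; qmax stands for the largest finite value of the target float 8 type (e.g. 448 for float8e4m3fn), and the symmetric ranges mean qmin = -qmax:

```python
import numpy as np

def dynamic_float8_scale(x: np.ndarray, qmax: float = 448.0) -> float:
    # With y_zero_point forced to 0, the scale alone must cover both the most
    # positive and the most negative input values. Since qmin = -qmax here,
    # min(x)/qmin reduces to -min(x)/qmax; taking the max keeps the sign right.
    scale = max(np.max(x) / qmax, np.min(x) / -qmax)
    # Guard against an all-zero input producing a zero scale.
    return float(scale) if scale > 0 else 1.0
```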
Description
DynamicQuantizeLinear only supports uint8. This PR adds support for int8 and float 8.
Motivation and Context
The operator is used to dynamically quantize an input.
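For context, a compact numpy sketch of the existing uint8 behaviour this PR extends, following the scale and zero-point formulas in the operator's documentation (the function name is illustrative):

```python
import numpy as np

def dynamic_quantize_linear(x: np.ndarray):
    qmin, qmax = 0, 255  # uint8 quantization range
    # Adjust the data range to include 0 so that 0.0 stays exactly representable.
    x_min, x_max = min(np.min(x), 0.0), max(np.max(x), 0.0)
    y_scale = (x_max - x_min) / (qmax - qmin)
    if y_scale == 0:  # all-zero input; any positive scale works
        y_scale = 1.0
    # Zero point: shift so x_min maps to qmin, then round (ties to even) and saturate.
    y_zero_point = np.uint8(np.clip(np.rint(qmin - x_min / y_scale), qmin, qmax))
    y = np.clip(np.rint(x / y_scale) + y_zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(y_scale), y_zero_point
```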