Extend supported types by QLinearMatMul (float16, float 8 types) #5473
Conversation
Signed-off-by: Xavier Dupre <xadupre@microsoft.com>
At some point, we should try to turn this into a function "Dequantize => MatMul => Quantize". But fine if that is done separately. (There may be some questions about precision of intermediate values etc. there.)
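A minimal numpy sketch of that decomposition, assuming the uint8 case with float32 intermediates; the function name, rounding, and saturation details here are illustrative rather than the spec's exact semantics, and the precision of the intermediate matmul is exactly the open question raised above:

```python
import numpy as np

def qlinear_matmul_reference(a, a_scale, a_zero_point,
                             b, b_scale, b_zero_point,
                             y_scale, y_zero_point):
    """Illustrative QLinearMatMul as Dequantize => MatMul => Quantize."""
    # Dequantize both uint8 inputs to float32.
    a_fp = (a.astype(np.float32) - np.float32(a_zero_point)) * np.float32(a_scale)
    b_fp = (b.astype(np.float32) - np.float32(b_zero_point)) * np.float32(b_scale)
    # MatMul with float32 accumulation (the intermediate precision choice).
    y_fp = a_fp @ b_fp
    # Quantize back to uint8: round, shift by the zero point, saturate.
    y = np.rint(y_fp / np.float32(y_scale)) + np.float32(y_zero_point)
    return np.clip(y, 0, 255).astype(np.uint8)
```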
Thanks @xadupre. Does this mean int4 is not yet supported?
Int4 is not defined yet in onnx. Defining it would be the first step before adding it to the list of supported types. Maybe it would be worth discussing during one of the SIG meetings.
(As per offline discussion, Xavier suggests holding off on this PR's changes until better float 8 quantization support is available. Adding this comment to avoid an accidental merge.)
Moved to 1.16
That's unfortunate. What about reducing the scope and at least adding uint8/int8 (no float8)?
Codecov Report
Attention:
Additional details and impacted files

@@ Coverage Diff @@
##             main    #5473      +/-   ##
==========================================
- Coverage   56.06%   56.04%   -0.02%
==========================================
  Files         501      501
  Lines       29366    29409     +43
  Branches     4404     4413      +9
==========================================
+ Hits        16463    16482     +19
- Misses      12091    12115     +24
  Partials      812      812

☔ View full report in Codecov by Sentry.
…hts (#18043)

### Description

Whenever a QuantizeLinear or DequantizeLinear node is created, the type of the weights before quantization must be known so that the scale is created with the expected type. Another option would be to add many CastLike operators, but that would push the burden onto the onnxruntime optimizer. This PR tries to avoid changing the signature. To do so, it modifies the scale computation to store the result in a numpy array rather than a Python float; the numpy array must have the same dtype as the weights being quantized. The PR adds many `assert` statements to check that the scale is not a Python type or a float64. These were added to make sure all the code follows the same logic, and the lines were kept for the first review. DequantizeLinear and QuantizeLinear cannot be tested with onnx==1.15: PR onnx/onnx#5709 is needed to fix shape inference, and PR onnx/onnx#5473 is needed to support QLinearMatMul with float16. That explains why some tests are disabled with float16.

### Motivation and Context

The current quantization tool assumes every weight is float32. For large models such as LLAMA, weights are usually float16, and the quantization tool needs to be able to quantize them.
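To illustrate the scale-dtype point, here is a hypothetical sketch (compute_scale is not the actual quantizer helper) of computing a scale as a numpy scalar that inherits the weight dtype, instead of a Python float, which would silently be float64:

```python
import numpy as np

def compute_scale(weights: np.ndarray, qmin: int, qmax: int) -> np.ndarray:
    """Illustrative only: keep the scale in the same dtype as the weights."""
    rmin = min(float(weights.min()), 0.0)
    rmax = max(float(weights.max()), 0.0)
    # Store the result as a numpy scalar with the weight dtype,
    # not as a Python float.
    scale = np.array((rmax - rmin) / (qmax - qmin), dtype=weights.dtype)
    assert scale.dtype == weights.dtype  # mirrors the asserts added by the PR
    return scale

w16 = np.random.rand(4, 4).astype(np.float16)
print(compute_scale(w16, 0, 255).dtype)  # float16, not float64
```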
Description
QLinearMatMul operates on quantized types. This PR extends the list of supported quantized types with the float 8 types, and the list of supported input scale types with float16 and bfloat16.
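As a rough sketch of what the extension enables, the model below uses float16 scales with uint8 data in a QLinearMatMul node; the shapes are illustrative, and opset 21 is assumed to be the version that carries this change:

```python
from onnx import TensorProto, helper

node = helper.make_node(
    "QLinearMatMul",
    inputs=["a", "a_scale", "a_zero_point",
            "b", "b_scale", "b_zero_point",
            "y_scale", "y_zero_point"],
    outputs=["y"],
)
graph = helper.make_graph(
    [node], "qlinear_matmul_fp16",
    [helper.make_tensor_value_info("a", TensorProto.UINT8, [2, 3]),
     # float16 scales are what this PR adds to the allowed types.
     helper.make_tensor_value_info("a_scale", TensorProto.FLOAT16, []),
     helper.make_tensor_value_info("a_zero_point", TensorProto.UINT8, []),
     helper.make_tensor_value_info("b", TensorProto.UINT8, [3, 2]),
     helper.make_tensor_value_info("b_scale", TensorProto.FLOAT16, []),
     helper.make_tensor_value_info("b_zero_point", TensorProto.UINT8, []),
     helper.make_tensor_value_info("y_scale", TensorProto.FLOAT16, []),
     helper.make_tensor_value_info("y_zero_point", TensorProto.UINT8, [])],
    [helper.make_tensor_value_info("y", TensorProto.UINT8, [2, 2])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```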