[CPU EP] Int4 support for QuantizeLinear, DequantizeLinear, and Transpose #20362

Merged
79 commits merged into main on May 31, 2024

Conversation

@adrianlizarraga (Contributor) commented Apr 18, 2024

Description

Notes

To calculate a tensor's storage size, we normally take the number of elements from the shape (i.e., tensor_shape.Size()) and multiply by the size of a single element. This does not work directly for sub-byte element types like int4: each element in a Tensor&lt;Int4x2&gt; stores two packed int4 values in one byte. Tensor::CalculateTensorStorageSize should be called instead, as it performs the correct calculation for any tensor element type.
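
For illustration, here is a minimal standalone sketch (not ORT's exact implementation) of the packed-size calculation; CalcNumInt4Pairs mirrors the helper this PR uses elsewhere:

    #include <cstddef>

    // Two int4 values are packed into each byte, so a tensor of N int4
    // elements needs ceil(N / 2) bytes, not N * sizeof(element) bytes.
    constexpr size_t CalcNumInt4Pairs(size_t num_int4_elements) {
      return (num_int4_elements + 1) / 2;
    }

    // Example: a [7] int4 tensor occupies 4 bytes of storage.
    static_assert(CalcNumInt4Pairs(7) == 4, "7 int4 elements pack into 4 bytes");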

Motivation and Context

ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4 type to ORT and adds int4 implementations for the Quant, Dequant, and Transpose ops on CPU EP. We still need to add int4 support for many ops and execution providers. See the ONNX 1.16 release notes: https://github.com/onnx/onnx/releases.

@adrianlizarraga changed the title from "[NOT READY] DequantizeLinear int4 support" to "[NOT READY] Q/DQ int4 support for CPU EP" on Apr 18, 2024
} \
} \
assert(output_index == static_cast<size_t>(N * broadcast_dim * block_size)); \
}
@adrianlizarraga (Contributor, Author) commented May 29, 2024:

Although this variable is called "block_size", it is not the same thing as the new block_size attribute. #Resolved

for (size_t bd = 0; bd < static_cast<size_t>(broadcast_dim); bd++) { \
size_t bd_i = bd >> 1; /*bd / 2*/ \
size_t bd_j = bd & 0x1; /*bd % 2*/ \
INT4_TYPE::UnpackedType zp = zero_point ? zero_point[bd_i].GetElem(bd_j) : 0; \
@adrianlizarraga (Contributor, Author) commented May 29, 2024:

The scale and zero-point inputs do have the same shape. Please refer to the ONNX spec: https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html

Both zero-point and scale have the same shape in this code as well. The zero-point input is stored as a packed int4, so we have to get the correct 4-bit element.

Can you please clarify what you think needs to be updated? #Resolved
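
For context, here is a hedged sketch of how a packed pair might expose its two 4-bit elements; the actual Int4x2Base::GetElem may differ in detail:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical nibble accessor: index 0 is the low nibble, index 1 the
    // high nibble, sign-extended for the signed int4 case.
    inline int8_t GetInt4Elem(uint8_t packed, size_t index) {
      uint8_t nibble = (packed >> (4 * index)) & 0xF;
      // Sign-extend a 4-bit value: place it in the top nibble of an int8_t,
      // then arithmetic-shift it back down.
      return static_cast<int8_t>(static_cast<int8_t>(nibble << 4) >> 4);
    }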

@yufenglee (Member) left a comment: :shipit:

@fajin-corp (Contributor) left a comment: :shipit:

@jywu-msft merged commit b02d5e6 into main on May 31, 2024
96 checks passed
@jywu-msft deleted the adrianl/dq-transpose-int4 branch on May 31, 2024 at 01:56

static bool Pack(gsl::span<Int4x2Base<Signed>> dst, gsl::span<const UnpackedType> src) {
if (src.empty() || (CalcNumInt4Pairs(src.size()) != dst.size())) {
return false;
A reviewer (Contributor) commented:

Does a return value of false mean it failed? Regarding the return value, can the handling of an empty src be made consistent with Unpack()?
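
For reference, a minimal sketch of the packing semantics under discussion (assumed behavior, not the exact ORT implementation, which operates on Int4x2Base&lt;Signed&gt; pairs and uses gsl::span):

    #include <cstddef>
    #include <cstdint>
    #include <span>

    // Packs unpacked 8-bit values into 4-bit pairs (low nibble first).
    // Returns false only if dst cannot hold ceil(src.size() / 2) bytes;
    // here an empty src trivially succeeds, which is the consistent
    // handling the comment above asks about.
    inline bool PackInt4(std::span<uint8_t> dst, std::span<const int8_t> src) {
      if ((src.size() + 1) / 2 != dst.size()) {
        return false;
      }
      for (size_t i = 0; i < src.size(); i += 2) {
        uint8_t lo = static_cast<uint8_t>(src[i]) & 0xF;
        uint8_t hi = (i + 1 < src.size()) ? (static_cast<uint8_t>(src[i + 1]) & 0xF) : 0;
        dst[i / 2] = static_cast<uint8_t>(lo | (hi << 4));
      }
      return true;
    }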

/// </summary>
/// <param name="elt_type">Data type of the tensor elements.</param>
/// <param name="shape_size">The number of elements indicated by the shape (i.e., shape.Size()).</param>
/// <returns>Number of Tensor elements. Returns -1 if shape_size is negative.</returns>
A reviewer (Contributor) commented:

Nit: it actually returns shape_size for a negative shape_size, but I guess -1 is the only expected value for a negative shape_size.

\
gsl::span<const INT4_TYPE> src_span = gsl::make_span(reinterpret_cast<const INT4_TYPE*>(unpacked_tensor.data()), \
num_packed_pairs); \
gsl::span<INT4_TYPE> dst_span = gsl::make_span(p_data, expected_num_elements); \
A reviewer (Contributor) commented:

gsl::make_span(p_data, expected_num_elements)

Should the span length be num_packed_pairs?

Is there much benefit to using spans here if they're just provided to memcpy?

using UnpackedType = typename Int4Traits<Signed>::UnpackedType;

for (size_t n = 0; n < N; n++) {
float FloatValue = std::nearbyintf(Input[n] / Scale) + static_cast<float>(ZeroPoint);
A reviewer (Contributor) commented:

Will std::nearbyintf round to nearest even here? Assuming we want that mode, as it's specified for ONNX QuantizeLinear:
https://github.com/onnx/onnx/blob/093a8d335a66ea136eb1f16b3a1ce6237ee353ab/docs/Operators.md?plain=1#L20288
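
For what it's worth, std::nearbyintf rounds according to the current floating-point environment, which defaults to FE_TONEAREST (round half to even), i.e., the mode ONNX QuantizeLinear specifies. A quick standalone check:

    #include <cassert>
    #include <cfenv>
    #include <cmath>

    int main() {
      assert(std::fegetround() == FE_TONEAREST);  // default rounding mode
      assert(std::nearbyintf(0.5f) == 0.0f);      // ties round to even
      assert(std::nearbyintf(1.5f) == 2.0f);
      assert(std::nearbyintf(2.5f) == 2.0f);
      return 0;
    }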

MLASCALL
MlasQuantizeLinearU4(
const float* Input,
uint8_t* Output,
A reviewer (Contributor) commented:

Nit: Output is to be interpreted as bytes vs. 8-bit unsigned integers, right? If so, would std::byte be clearer?

adrianlizarraga added a commit that referenced this pull request Jun 1, 2024
…led (#20889)

### Description
The recent [PR for int4 support](#20362) breaks builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled.

This PR adds utility functions for debug printing of int4 tensor
statistics and data.



### Motivation and Context
auto ShortVector1 = vec_pack(IntegerVector2, IntegerVector3);

auto CharVector = vec_pack(ShortVector0, ShortVector1);
vec_xst(CharVector, 0, static_cast<int8_t *>(&TmpOutput[0]));
@ChipKerchner (Contributor) commented:

This line has broken the build for some compiler versions. Vector commands need C-style casting.

        vec_xst(CharVector, 0, (int8_t *)(&TmpOutput[0]));

Let me know if you are fixing this or if you want me to create a PR

@adrianlizarraga (Contributor, Author) replied:

Hi @ChipKerchner, apologies for the inconvenience. Here's the PR: #20957

adrianlizarraga added a commit that referenced this pull request Jun 6, 2024
### Description
Uses C-style casting for Power vector instructions in
`MlasQuantizeLinearInt4Kernel`.



### Motivation and Context
Vector commands (e.g., vec_xst) need C-style casting to support various
compiler versions.
ONNX Runtime CI pipelines do not build with all compiler versions. The recent int4 PR broke the PowerPC build for certain compiler versions because it used a C++-style `static_cast<>`.

See:
#20362 (comment)

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>