[CPU EP] Int4 support for QuantizeLinear, DequantizeLinear, and Transpose #20362

Merged
79 commits merged into main on May 31, 2024

Conversation

@adrianlizarraga (Contributor) commented Apr 18, 2024

Description

Notes

To calculate a tensor's storage size, we normally take the number of elements from the shape (i.e., tensor_shape.Size()) and multiply by the size of a single element. This does not work directly for sub-byte element types like int4: each element in a Tensor&lt;Int4x2&gt; stores two packed int4 values in one byte. Tensor::CalculateTensorStorageSize should be called instead, as it performs the correct calculation for any tensor element type.
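
For illustration, here is a minimal standalone sketch (not ORT's exact implementation) of the packed-size calculation; CalcNumInt4Pairs mirrors the helper this PR uses elsewhere:

    #include <cstddef>

    // Two int4 values are packed into each byte, so a tensor of N int4
    // elements needs ceil(N / 2) bytes, not N * sizeof(element) bytes.
    constexpr size_t CalcNumInt4Pairs(size_t num_int4_elements) {
      return (num_int4_elements + 1) / 2;
    }

    // Example: a [7] int4 tensor occupies 4 bytes of storage.
    static_assert(CalcNumInt4Pairs(7) == 4, "7 int4 elements pack into 4 bytes");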

Motivation and Context

ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4 type to ORT and adds int4 implementations for the Quant, Dequant, and Transpose ops on CPU EP. We still need to add int4 support for many ops and execution providers. See the ONNX 1.16 release notes: https://github.com/onnx/onnx/releases.

@adrianlizarraga changed the title from "[NOT READY] DequantizeLinear int4 support" to "[NOT READY] Q/DQ int4 support for CPU EP" on Apr 18, 2024
} \
} \
assert(output_index == static_cast<size_t>(N * broadcast_dim * block_size)); \
}
@adrianlizarraga (Contributor, Author) commented May 29, 2024:

Although this variable is called "block_size", it is not the same thing as the new block_size attribute. #Resolved

for (size_t bd = 0; bd < static_cast<size_t>(broadcast_dim); bd++) { \
size_t bd_i = bd >> 1; /*bd / 2*/ \
size_t bd_j = bd & 0x1; /*bd % 2*/ \
INT4_TYPE::UnpackedType zp = zero_point ? zero_point[bd_i].GetElem(bd_j) : 0; \
@adrianlizarraga (Contributor, Author) commented May 29, 2024:

The scale and zero-point inputs do have the same shape. Please refer to the ONNX spec: https://onnx.ai/onnx/operators/onnx__QuantizeLinear.html

Both zero-point and scale have the same shape in this code as well. The zero-point input is stored as a packed int4, so we have to get the correct 4-bit element.

Can you please clarify what you think needs to be updated? #Resolved
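
For context, here is a hedged sketch of how a packed pair might expose its two 4-bit elements; the actual Int4x2Base::GetElem may differ in detail:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical nibble accessor: index 0 is the low nibble, index 1 the
    // high nibble, sign-extended for the signed int4 case.
    inline int8_t GetInt4Elem(uint8_t packed, size_t index) {
      uint8_t nibble = (packed >> (4 * index)) & 0xF;
      // Sign-extend a 4-bit value: place it in the top nibble of an int8_t,
      // then arithmetic-shift it back down.
      return static_cast<int8_t>(static_cast<int8_t>(nibble << 4) >> 4);
    }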

@yufenglee (Member) left a comment: :shipit:

@fajin-corp (Contributor) left a comment: :shipit:

@jywu-msft merged commit b02d5e6 into main on May 31, 2024
96 checks passed
@jywu-msft deleted the adrianl/dq-transpose-int4 branch on May 31, 2024 at 01:56

static bool Pack(gsl::span<Int4x2Base<Signed>> dst, gsl::span<const UnpackedType> src) {
if (src.empty() || (CalcNumInt4Pairs(src.size()) != dst.size())) {
return false;
A reviewer (Contributor) commented:

Does a return value of false mean it failed? Regarding the return value, can the handling of an empty src be made consistent with Unpack()?
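
For reference, a minimal sketch of the packing semantics under discussion (assumed behavior, not the exact ORT implementation, which operates on Int4x2Base&lt;Signed&gt; pairs and uses gsl::span):

    #include <cstddef>
    #include <cstdint>
    #include <span>

    // Packs unpacked 8-bit values into 4-bit pairs (low nibble first).
    // Returns false only if dst cannot hold ceil(src.size() / 2) bytes;
    // here an empty src trivially succeeds, which is the consistent
    // handling the comment above asks about.
    inline bool PackInt4(std::span<uint8_t> dst, std::span<const int8_t> src) {
      if ((src.size() + 1) / 2 != dst.size()) {
        return false;
      }
      for (size_t i = 0; i < src.size(); i += 2) {
        uint8_t lo = static_cast<uint8_t>(src[i]) & 0xF;
        uint8_t hi = (i + 1 < src.size()) ? (static_cast<uint8_t>(src[i + 1]) & 0xF) : 0;
        dst[i / 2] = static_cast<uint8_t>(lo | (hi << 4));
      }
      return true;
    }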

/// </summary>
/// <param name="elt_type">Data type of the tensor elements.</param>
/// <param name="shape_size">The number of elements indicated by the shape (i.e., shape.Size()).</param>
/// <returns>Number of Tensor elements. Returns -1 if shape_size is negative.</returns>
A reviewer (Contributor) commented:

Nit: it actually returns shape_size for a negative shape_size, but I guess -1 is the only expected value for a negative shape_size.

\
gsl::span<const INT4_TYPE> src_span = gsl::make_span(reinterpret_cast<const INT4_TYPE*>(unpacked_tensor.data()), \
num_packed_pairs); \
gsl::span<INT4_TYPE> dst_span = gsl::make_span(p_data, expected_num_elements); \
A reviewer (Contributor) commented:

gsl::make_span(p_data, expected_num_elements)

Should the span length be num_packed_pairs?

Is there much benefit to using spans here if they're just provided to memcpy?

using UnpackedType = typename Int4Traits<Signed>::UnpackedType;

for (size_t n = 0; n < N; n++) {
float FloatValue = std::nearbyintf(Input[n] / Scale) + static_cast<float>(ZeroPoint);
A reviewer (Contributor) commented:

Will std::nearbyintf round to nearest even here? Assuming we want that mode, as it's specified for ONNX QuantizeLinear:
https://github.com/onnx/onnx/blob/093a8d335a66ea136eb1f16b3a1ce6237ee353ab/docs/Operators.md?plain=1#L20288
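
For what it's worth, std::nearbyintf rounds according to the current floating-point environment, which defaults to FE_TONEAREST (round half to even), i.e., the mode ONNX QuantizeLinear specifies. A quick standalone check:

    #include <cassert>
    #include <cfenv>
    #include <cmath>

    int main() {
      assert(std::fegetround() == FE_TONEAREST);  // default rounding mode
      assert(std::nearbyintf(0.5f) == 0.0f);      // ties round to even
      assert(std::nearbyintf(1.5f) == 2.0f);
      assert(std::nearbyintf(2.5f) == 2.0f);
      return 0;
    }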

MLASCALL
MlasQuantizeLinearU4(
const float* Input,
uint8_t* Output,
A reviewer (Contributor) commented:

Nit: Output is to be interpreted as bytes vs. 8-bit unsigned integers, right? If so, would std::byte be clearer?

adrianlizarraga added a commit that referenced this pull request Jun 1, 2024
…led (#20889)

### Description
The recent [PR for int4 support](#20362) breaks builds with the onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS option enabled.

This PR adds utility functions for debug printing of int4 tensor
statistics and data.



### Motivation and Context
auto ShortVector1 = vec_pack(IntegerVector2, IntegerVector3);

auto CharVector = vec_pack(ShortVector0, ShortVector1);
vec_xst(CharVector, 0, static_cast<int8_t *>(&TmpOutput[0]));
@ChipKerchner (Contributor) commented:

This line has broken the build for some compiler versions. Vector commands need C-style casting.

        vec_xst(CharVector, 0, (int8_t *)(&TmpOutput[0]));

Let me know if you are fixing this or if you want me to create a PR

@adrianlizarraga (Contributor, Author) replied:

Hi @ChipKerchner, apologies for the inconvenience. Here's the PR: #20957

adrianlizarraga added a commit that referenced this pull request Jun 6, 2024
### Description
Uses C-style casting for Power vector instructions in
`MlasQuantizeLinearInt4Kernel`.



### Motivation and Context
Vector commands (e.g., vec_xst) need C-style casting to support various
compiler versions.
ONNX Runtime CI pipelines do not build with all compiler versions. The recent int4 PR broke the PowerPC build for certain compiler versions because it used a C++-style `static_cast<>`.

See:
#20362 (comment)

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>