# [Feature request] Add support for Int4 data-type #5776
Edit: Original message: While implementing, I'm trying to remove this restriction, allowing int4 tensors of any shape, but I've encountered a few issues and I'm not sure of the best way to resolve them. For an even innermost dimension, the computation is simple:
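A minimal sketch of the even case, assuming `unpacked_dims` is the tensor's shape as a Python list:

```python
# With an even innermost dimension, two int4 values fit in each uint8 byte,
# so halving the last dimension is lossless.
packed_dims = unpacked_dims[:-1] + [unpacked_dims[-1] // 2]
```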
But for an odd innermost dimension this computation loses data: `unpacked_dims[-1] // 2` rounds down. I can think of a few solutions, but none is ideal:

1. Keep the restriction on the tensor shape (innermost dimension has to be even). Weights typically have even dimensions, so that's unlikely to pose a real-world usage problem, but it adds a restriction to the standard that's rooted in an implementation detail.
2. Store every int4 within an int8 element (unpacked). That is quite straightforward and allows any dimension to be used, but means that we will not benefit from the compression of model weights in ONNX.
3. Add padding to make the innermost dimension even, and pass this information through the custom data-type's description. When packing we can set the data-type according to the parity of the innermost dimension, and when unpacking we can check the data-type. It's functional, but it's hacky (see the sketch after this list).
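A numpy sketch of option 3's padding step; the helper name and the out-of-band size record are assumptions for illustration, not part of the actual proposal:

```python
import numpy as np

def pad_and_pack(unpacked: np.ndarray) -> tuple[np.ndarray, int]:
    """Pad an odd innermost dimension with a zero nibble, then pack."""
    orig_last = unpacked.shape[-1]
    if orig_last % 2 == 1:
        pad = [(0, 0)] * (unpacked.ndim - 1) + [(0, 1)]
        unpacked = np.pad(unpacked, pad)  # one extra zero element per row
    nib = unpacked.astype(np.uint8) & 0x0F
    packed = (nib[..., 0::2] << 4) | nib[..., 1::2]  # i_0 high, i_1 low
    # orig_last must travel with the tensor (e.g. in the data-type's
    # description) so the unpacker knows to drop the padding nibble.
    return packed, orig_last
```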
Another thing to consider is how this mechanism will be extended when adding future sub-byte data-types with different widths (e.g. int3, int6). Creative ideas would be appreciated!
If I understand correctly, this is essentially option 2 from your list (focusing specifically on the question of the numpy representation of an int4 tensor, not the TensorProto representation, etc.)?
Option 2 referred to saving the data uncompressed, in the TensorProto as well. The obvious downside here is that the benefit of compression is lost.
As you stated earlier, I think we can break this into two separate questions:
For numpy, keeping the int4 values uncompressed is reasonable from a pure specification perspective (e.g., in the reference implementation). However, I think sooner or later we might need utility methods that deal with the numpy representation and are used in production tools, where a compressed numpy representation would be beneficial too. In that situation, I think your option 3 might be a good choice. This is just for the sake of discussion (whether we add this now or later is a different question).
### Description

- Add INT4 and UINT4 quantized data types
- Support for packing and unpacking int4x2->byte
- Implementation of operators: Cast, CastLike, DequantizeLinear, QuantizeLinear
- Type support for non-compute operators: Constant, ConstantOfShape, Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze, Unsqueeze, Transpose

### Motivation and Context

See details in issue #5776

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
Signed-off-by: galagam <ghubaraagam@nvidia.com>
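For illustration, a minimal sketch of exercising the new type support, assuming onnx >= 1.16 with opset 21; the tensor names and shapes are arbitrary:

```python
from onnx import TensorProto, helper

# A small graph that dequantizes an INT4 weight to float.
# TensorProto.INT4/UINT4 are assumed to exist (introduced by this change).
w = helper.make_tensor_value_info("w", TensorProto.INT4, [256, 256])
scale = helper.make_tensor_value_info("scale", TensorProto.FLOAT, [])
w_dq = helper.make_tensor_value_info("w_dq", TensorProto.FLOAT, [256, 256])

node = helper.make_node("DequantizeLinear", ["w", "scale"], ["w_dq"])
graph = helper.make_graph([node], "int4_dequant", [w, scale], [w_dq])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```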
Implemented in #5811. Closing.
### System information

onnx v1.16.0 (main, top-of-tree)
### What is the problem that this feature solves?
LLMs dominate DL research and development today. Recent networks require tens of GBs of memory, and the main inference bottleneck is memory access (capacity and bandwidth). Recent papers show promising results using sub-byte data types, specifically weight-only quantization using int4.
The motivation behind weight-only quantization is improving performance (larger batches improve MMA utilization; reduced memory traffic benefits memory-bound generation) as well as enabling large models that would otherwise not fit on a single GPU.
By quantizing data to 4 bits, we can both reduce the model size and significantly accelerate memory-bound inference use cases.
### Alternatives considered
N/A
### Describe the feature
Every two int4 elements (i_0, i_1) will be packed into a single uint8, as follows:

```
buffer = (i_0 << 4) | (i_1 & 0x0F)
```

Tensors with an odd number of elements are not supported.
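A minimal numpy sketch of this packing scheme; the function names are illustrative, not part of the proposal:

```python
import numpy as np

def pack_int4(flat: np.ndarray) -> np.ndarray:
    """Pack an even-length array of int4 values (held in int8) into uint8."""
    assert flat.size % 2 == 0, "odd element counts are not supported"
    nib = flat.astype(np.uint8) & 0x0F
    return (nib[0::2] << 4) | nib[1::2]  # buffer = (i_0 << 4) | (i_1 & 0x0F)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover sign-extended int4 values into int8."""
    i0 = packed >> 4
    i1 = packed & 0x0F
    flat = np.stack([i0, i1], axis=-1).reshape(-1).astype(np.int8)
    return np.where(flat > 7, flat - 16, flat)  # 4-bit two's complement
```

For example, `pack_int4(np.array([1, -2], dtype=np.int8))` yields `0x1E`, and unpacking restores `[1, -2]`.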
### Will this influence the current API (Y/N)?
Yes
An additional data type will be available.
Adding int4 to a subset of the operators, including QuantizeLinear and DequantizeLinear (optionally: Shape, Size, Transpose, Reshape, Constant).
### Feature Area
data-types, operators
### Are you willing to contribute it (Y/N)?
Yes
### Notes
Relevant papers using int4 quantization for LLMs:

- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers