Describe the feature request
Summary
Add native support in the `MatMulNBits` operator for 2-bit quantized weights paired with float16/bfloat16 zero points.
Motivation
2-bit quantization (e.g., from QAD/Quark-quantized models) uses non-uniform quantization levels such as `[-1, -1/3, 1/3, 1]`. These levels fit the `MatMulNBits` dequantization formula `(index - zero_point) * scale` only when `zero_point = 1.5` and the stored scale is folded by a factor of `2/3`; a zero point of 1.5 is fractional and cannot be represented as an integer zero-point.
Currently, onnxruntime-genai works around this by packing 2-bit weights into `MatMulNBits` nodes and passing a float16 zero-point tensor of value `1.5` per group. However, the `MatMulNBits` kernels do not implement support for:
- `bits = 2`
- float16/bfloat16 zero points (as opposed to the packed uint8 integer zero points used for 4-bit)

This creates a fragile dependency on undocumented runtime behavior.
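For illustration, here is a minimal Python sketch (not code from onnxruntime-genai) of how such a workaround node could be assembled with `onnx.helper`. The dimensions, the `(N, n_blocks, blob)` weight layout, and the flattened scale/zero-point vectors are assumptions for the 2-bit case, extrapolated from the 4-bit convention:

```python
import numpy as np
from onnx import helper, numpy_helper

# Hypothetical layer dimensions; real models vary.
K, N, block_size = 64, 32, 32
n_blocks = K // block_size

# 2-bit weights packed 4 values per byte (block_size * 2 bits per blob).
packed_b = np.zeros((N, n_blocks, block_size * 2 // 8), dtype=np.uint8)
# One scale and one fractional zero point (1.5) per group.
scales = np.full(N * n_blocks, 2.0 / 3.0, dtype=np.float16)
zero_points = np.full(N * n_blocks, 1.5, dtype=np.float16)

node = helper.make_node(
    "MatMulNBits",
    inputs=["A", "B", "scales", "zero_points"],
    outputs=["Y"],
    domain="com.microsoft",
    K=K, N=N, bits=2, block_size=block_size,
)
initializers = [
    numpy_helper.from_array(packed_b, name="B"),
    numpy_helper.from_array(scales, name="scales"),
    numpy_helper.from_array(zero_points, name="zero_points"),
]
```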
Requested Changes
- Kernel support for `bits = 2`: ensure the `MatMulNBits` CUDA, CPU, and other EP kernels correctly handle 2-bit packed weight tensors (4 values per byte, LSB-first); a reference unpacking sketch follows this list.
- Float16/float32 zero-point input: formally specify and implement support for float-typed zero points in `MatMulNBits`. For 2-bit non-uniform quantization the zero point is fractional (e.g., `1.5`) and must be stored as a float.
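To make the requested semantics concrete, here is a minimal numpy sketch covering both items: LSB-first 2-bit unpacking and dequantization with a float zero point. The helper name `dequant_2bit` is hypothetical; this is a reference sketch of the intended behavior, not kernel code.

```python
import numpy as np

def dequant_2bit(packed: np.ndarray, scale: np.float16, zero_point: np.float16) -> np.ndarray:
    """Unpack 2-bit indices (4 per byte, LSB-first) and dequantize with a
    float zero point: value = (index - zero_point) * scale."""
    # LSB-first: index i within a byte lives at bit offset 2*i.
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    indices = (packed[:, None] >> shifts) & 0b11  # shape (n_bytes, 4)
    return (indices.reshape(-1).astype(np.float16) - zero_point) * scale

# One byte holding indices [0, 1, 2, 3] LSB-first: 0b11100100 == 0xE4.
packed = np.array([0xE4], dtype=np.uint8)
print(dequant_2bit(packed, np.float16(2.0 / 3.0), np.float16(1.5)))
# -> approximately [-1.0, -0.3333, 0.3333, 1.0], the QAD levels (up to fp16 rounding)
```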
Describe scenario use case
Example Dequantization for 2-bit QAD
`dequant_value = (uint2_index - 1.5f16) * (original_scale * (2/3))`
| uint2 index | dequant value (relative to scale s) |
|-------------|-------------------------------------|
| 0           | -s                                  |
| 1           | -s/3                                |
| 2           | +s/3                                |
| 3           | +s                                  |
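As a quick sanity check, the following lines reproduce the table for s = 1, i.e. with the stored scale `original_scale * (2/3) = 2/3`:

```python
import numpy as np

# (index - 1.5) * (s * 2/3) for s = 1 yields the four QAD levels.
indices = np.arange(4, dtype=np.float32)
print((indices - 1.5) * (2.0 / 3.0))
# -> [-1.0, -0.33333334, 0.33333334, 1.0], i.e. -s, -s/3, +s/3, +s
```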