
Figure out the story for hybrid quantization #1575

Closed
burmako opened this issue Jun 4, 2023 · 3 comments

Comments


burmako commented Jun 4, 2023

At the moment, dot_general (as well as convolution, as proposed in #1477) doesn't support hybrid quantization, e.g. a float lhs and a quantized rhs. However, this is an important practical use case. How do we represent it?


sdasgup3 commented Jun 7, 2023

Thanks @burmako for bringing up the topic. Let me add a bit of context around it to further the discussion.

A few definitions that might be handy in the discussion:

Quantization Techniques

  • Weight-only quantization: Only the weights are quantized. Sometimes weight-only quantization is simulated, meaning a dequantize op follows the quantized constant and a floating-point kernel runs at inference.
  • Dynamic Range Quantization (also known as DRQ): Convert weights to reduced-precision integers ahead of time, while quantizing activations based on the data range (min/max) observed at runtime.

Let us have a look at a convolution op that can be used to implement either of the above techniques:

%result = stablehlo.convolution(%arg0, %arg1)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}   
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :     
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>

We note that we also call this a hybrid op because its operands have different types: the activation %arg0 is non-quantized (f32), while the weight %arg1 is quantized (say qi8, where qi8 denotes a quantized tensor type with an 8-bit storage_type). There are two possible interpretations for such a hybrid op, depending on whether the op's semantics unify %arg0/%arg1 to (qi8, qi8) or to (f32, f32).

Op unifying the operand types to f32, also known as weight-only

The op, as part of its semantics, dequantizes the weight and performs a floating-point convolution, producing a floating-point result. We note that, in general, the above op can be emulated by explicitly dequantizing the weight and then performing a convolution between floating-point types.
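
For concreteness, here is a minimal sketch of that emulation, reusing the operand types and illustrative quantization parameters from the snippet above:

    // Explicitly dequantize the weight, then run an all-floating-point convolution.
    %weight_f32 = stablehlo.uniform_dequantize %arg1
        : (tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>) -> tensor<3x3x3x38xf32>
    %result = stablehlo.convolution(%arg0, %weight_f32)
        dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
        {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
        (tensor<?x5x5x3xf32>, tensor<3x3x3x38xf32>)
        -> tensor<?x3x3x38xf32>

As noted under the pros below, the downside of this decomposed form is that the dequantize of a constant weight is an easy target for constant folding.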

Op unifying the operand types to qi8, also known as DRQ:

The op, as part of its semantics, calculates quantization parameters for the input activation at runtime, quantizes the activation, performs convolution(qi8, qi8), and dequantizes the resulting accumulation.
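
Conceptually, this corresponds to something like the sketch below. Note that the activation scale/zero point (0.02:0) and the i32 accumulator parameters are made-up placeholders: in true DRQ these parameters are derived from the runtime-observed data range, which is exactly what cannot be expressed as static type attributes.

    // Quantize the activation (in real DRQ the parameters come from runtime min/max).
    %q_act = stablehlo.uniform_quantize %arg0
        : (tensor<?x5x5x3xf32>) -> tensor<?x5x5x3x!quant.uniform<i8:f32, 0.02:0>>
    // Integer convolution between two quantized tensors, accumulating into a
    // quantized i32 type whose scale is the product of the operand scales.
    %q_acc = stablehlo.convolution(%q_act, %arg1)
        dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
        {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
        (tensor<?x5x5x3x!quant.uniform<i8:f32, 0.02:0>>,
         tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
        -> tensor<?x3x3x38x!quant.uniform<i32:f32, 0.68:0>>
    // Dequantize the accumulation back to floating point.
    %result = stablehlo.uniform_dequantize %q_acc
        : (tensor<?x3x3x38x!quant.uniform<i32:f32, 0.68:0>>) -> tensor<?x3x3x38xf32>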

Next, let us talk about some of the pros and cons of expressing such a hybrid op in StableHLO.

Pros

  • Less prone to constant folding: For example, the problem with explicitly emulating the dequantization step for a weight-only op is that the dequantization can be constant-folded, producing an all-floating-point convolution. This is a problem for downstream consumers who expect to pattern match "dequant + floating-point conv" in order to re-create the hybrid op supported in their compilers.
  • Less pattern-matching overhead for downstream consumers of StableHLO.

Cons

  • Ambiguous: With two different interpretations for a hybrid op, convolution(f32, qi8) is ambiguous; the specification of each op supporting the hybrid scheme would need to spell out both variants.
  • Implementation-defined: The op's behavior depends on the implementation: some implementations may choose dynamic range quantization while others may treat the op as weight-only.

Current state with expressing hybrid ops in StableHLO

StableHLO, in its current form, does not support hybrid ops, but we are excited to gather community feedback to learn more about the use cases for such ops and their associated trade-offs.

A few additional notes

  • The hybrid op, with DRQ semantics, is currently supported only in the TFLite CPU runtime (ref).
  • Beyond the TFLite CPU runtime, the hybrid op can be executed on other hardware implementations:
    • Some implementations, with efficient floating-point support, can execute such ops in the decomposed form (first dequantize to floating point, then perform a floating-point computation). If such an implementation consumes StableHLO, then the pros and cons discussed above play an important role.
    • Following the feedback, it seems that in the long run there could be ideas to support dynamic range quantization explicitly at the graph level. In the short term, we can use the stablehlo.custom_call operation to carry DRQ (see the sketch after this list).
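
As a purely hypothetical sketch of that short-term option (the call target name @drq_convolution is made up for illustration and is not an agreed-upon contract), a DRQ convolution could be carried through StableHLO as an opaque custom call:

    // Hypothetical custom_call wrapping a dynamic-range-quantized convolution;
    // the semantics would live entirely in the downstream consumer.
    %result = stablehlo.custom_call @drq_convolution(%arg0, %arg1)
        : (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
        -> tensor<?x3x3x38xf32>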

Please let me know your comments and feedback.

sdasgup3 added a commit that referenced this issue Jun 28, 2023
### Summary

The PR proposes the specification for quantized elementwise operations (47 in total).

### Details

Overall, we propose treating quantized elements as floating-point elements, and therefore ops on quantized tensors as ops on floating-point tensors, along the lines of dequantize -> float computation -> quantize. This principle works for most elementwise ops, although there are some exceptions discussed below.
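
For example, here is a sketch (with made-up shapes and quantization parameters) of what this decomposition looks like for `add`:

    // dequantize -> float computation -> quantize, spelled out for add
    %lhs_f32 = stablehlo.uniform_dequantize %lhs
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %rhs_f32 = stablehlo.uniform_dequantize %rhs
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %sum_f32 = stablehlo.add %lhs_f32, %rhs_f32 : tensor<4xf32>
    %sum = stablehlo.uniform_quantize %sum_f32
        : (tensor<4xf32>) -> tensor<4x!quant.uniform<i8:f32, 0.1:0>>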

Furthermore, the proposal is to only support per-tensor quantization for elementwise ops. We haven't yet come across use cases for per-axis quantization for these ops, so let's start small. The story for per-axis quantization will be worked out in #1574.

Finally, we propose to not support hybrid quantization (i.e. situations when some inputs/outputs are quantized and some are not) for now. The story for this will be worked out in #1575.

- **Ops that support tensors of floating-point types (33)**: These ops support quantized tensors, with semantics following dequantize -> float computation -> quantize. We are using the `dequantize_op_quantize` function to express it:
  - Binary (10): `add, atan2, compare, divide, maximum, minimum, multiply, power, remainder, subtract`.
  - Unary (20): `abs, cbrt, ceil, cosine, exponential, exponential_minus_one, floor, is_finite, log, logistic, log_plus_one, negate, reduce_precision, round_nearest_afz, round_nearest_even, rsqrt, sign, sine, sqrt, tanh`.
  - Ternary (2): `clamp, select`.
  - Other (1): `map`.
  
- **Ops that don't support tensors of floating-point types (9)**: These ops (`and, count_leading_zeros, not, or, popcnt, shift_left, shift_right_arithmetic, shift_right_logical, xor`) don't support quantized tensors. If there is a need to perform computations on the underlying integer representation of these tensors, they can be bitcast_convert'ed to integers.

- **Ops that involve complex types (3)**: These ops (`complex`, `imag`, `real`) don't support quantized tensors because quantization doesn't compose with complex types at the moment.

- **Conversion ops (2)**:
  - `convert`: A convert from a quantized type to any type can be realized using `stablehlo.uniform_dequantize` followed by `stablehlo.convert` to convert the dequantized floating-point type to the type of choice. Similarly, a convert from any type to a quantized type can be realized using `stablehlo.convert` to a floating-point type followed by `stablehlo.uniform_quantize` (see the sketch after this list). It's not necessarily great that we have 3 ops to represent something that could theoretically be represented by 1 op, and we're planning to explore a potential simplification in #1576.
  - `bitcast_convert`: Works with low-level representations, so it treats quantized elements as integer elements.
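
As a sketch of the `convert` decomposition described above (the i32 destination type and the quantization parameters are only examples):

    // quantized -> arbitrary type: dequantize first, then convert
    %f = stablehlo.uniform_dequantize %q
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %i = stablehlo.convert %f : (tensor<4xf32>) -> tensor<4xi32>
    // arbitrary type -> quantized: convert to float first, then quantize
    %g = stablehlo.convert %i : (tensor<4xi32>) -> tensor<4xf32>
    %q2 = stablehlo.uniform_quantize %g
        : (tensor<4xf32>) -> tensor<4x!quant.uniform<i8:f32, 0.1:0>>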
sdasgup3 commented

#1792 proposes semantic changes in StableHLO to support weight-only quantization for the convolution and dot_general ops.

Remaining tasks:

  1. Figure out whether ops other than dot_general and convolution need hybrid op support.
  2. Figure out the story for dynamic range quantization.

GleasonK pushed a commit that referenced this issue Apr 2, 2024
…y quantization (#1792)

This RFC proposes to add hybrid quantized convolution and dot_general for weight-only quantization. Please let me know your feedback on this.

The RFC partially addresses issue #1575 w.r.t. supporting weight-only quantization in StableHLO. The remaining tasks are highlighted [here](#1575 (comment)).

sdasgup3 commented Apr 8, 2024

With #1792 merged, let us close this issue. We will open separate issues for the remaining tasks (#1575 (comment)) once we have more information around them.
