
Figure out the story for hybrid quantization #1575

Closed
burmako opened this issue Jun 4, 2023 · 3 comments

Comments


burmako commented Jun 4, 2023

At the moment, dot_general (as well as convolution, as proposed in #1477) doesn't support hybrid quantization, e.g. a float lhs and a quantized rhs. However, this is an important practical use case. How do we represent it?


sdasgup3 commented Jun 7, 2023

Thanks @burmako for bringing up the topic. Let me add a bit of context around it to further the discussion.

A few definitions that might be handy in the discussion:

Quantization Techniques

  • Weight-only quantization: Only the weights are quantized. Sometimes weight-only quantization is simulated, meaning a dequantize op follows the quantized constant and a floating-point kernel runs at inference.
  • Dynamic Range Quantization (also known as DRQ): Convert weights to reduced-precision integers ahead of time, while quantizing activations based on the data range (min/max) observed at runtime.

Let us have a look at a convolution op that can be used to implement either of the above techniques:

%result = stablehlo.convolution(%arg0, %arg1)
    dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
    window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}   
    {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :     
    (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
    -> tensor<?x3x3x38xf32>

We note that we also call this a hybrid op because its operands have different types: the activation %arg0 is non-quantized (f32), while the weight %arg1 is quantized (say qi8, where qi8 denotes a quantized tensor type with an 8-bit storage_type). There are two possible interpretations for such a hybrid op, depending on whether the op's semantics unify %arg0/%arg1 to (qi8, qi8) or to (f32, f32).

Op unifying the operand types to f32, also known as weight-only

The op, as part of its semantics, dequantizes the weight and performs a floating-point convolution, producing a floating-point result. We note that, in general, the above op can be emulated by explicitly dequantizing the weight and then performing a convolution between floating-point types.
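
For concreteness, here is a minimal sketch of that emulation, reusing the operand types and illustrative quantization parameters from the snippet above:

    // Explicitly dequantize the weight, then run an all-floating-point convolution.
    %weight_f32 = stablehlo.uniform_dequantize %arg1
        : (tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>) -> tensor<3x3x3x38xf32>
    %result = stablehlo.convolution(%arg0, %weight_f32)
        dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
        {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
        (tensor<?x5x5x3xf32>, tensor<3x3x3x38xf32>)
        -> tensor<?x3x3x38xf32>

As noted under the pros below, the downside of this decomposed form is that the dequantize of a constant weight is an easy target for constant folding.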

Op unifying the operand types to qi8, also known as DRQ:

The op, as part of its semantics, calculates quantization parameters for the input activation at runtime, quantizes the activation, performs convolution(qi8, qi8), and dequantizes the resulting accumulation.
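
Conceptually, this corresponds to something like the sketch below. Note that the activation scale/zero point (0.02:0) and the i32 accumulator parameters are made-up placeholders: in true DRQ these parameters are derived from the runtime-observed data range, which is exactly what cannot be expressed as static type attributes.

    // Quantize the activation (in real DRQ the parameters come from runtime min/max).
    %q_act = stablehlo.uniform_quantize %arg0
        : (tensor<?x5x5x3xf32>) -> tensor<?x5x5x3x!quant.uniform<i8:f32, 0.02:0>>
    // Integer convolution between two quantized tensors, accumulating into a
    // quantized i32 type whose scale is the product of the operand scales.
    %q_acc = stablehlo.convolution(%q_act, %arg1)
        dim_numbers = [b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f],
        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
        {batch_group_count = 1 : i64, feature_group_count = 1 : i64} :
        (tensor<?x5x5x3x!quant.uniform<i8:f32, 0.02:0>>,
         tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
        -> tensor<?x3x3x38x!quant.uniform<i32:f32, 0.68:0>>
    // Dequantize the accumulation back to floating point.
    %result = stablehlo.uniform_dequantize %q_acc
        : (tensor<?x3x3x38x!quant.uniform<i32:f32, 0.68:0>>) -> tensor<?x3x3x38xf32>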

Next, let us talk about some of the pros and cons of expressing such a hybrid op in StableHLO.

Pros

  • Less prone to constant folding: For example, the problem with explicitly emulating the dequantization step for a weight-only op is that the dequantization can be constant-folded, producing an all-floating-point convolution. This is a problem for downstream consumers who expect to pattern match "dequant + floating-point conv" in order to re-create the hybrid op supported in their compilers.
  • Less pattern-matching overhead for downstream consumers of StableHLO.

Cons

  • Ambiguous: With two different interpretations for a hybrid op, convolution(f32, qi8) is ambiguous; the specification of each op supporting the hybrid scheme would need to spell out both variants.
  • Implementation-defined: The op's behavior depends on the implementation: some implementations may choose dynamic range quantization while others may treat the op as weight-only.

Current state with expressing hybrid ops in StableHLO

StableHLO, in its current form, does not support hybrid ops, but we are excited to gather community feedback to learn more about the use cases for such ops and their associated trade-offs.

A few additional notes

  • The hybrid op, with DRQ semantics, is currently supported only in the TFLite CPU runtime (ref).
  • Beyond the TFLite CPU runtime, the hybrid op can be executed on other hardware implementations:
    • Some implementations, with efficient floating-point support, can execute such ops in the decomposed form (first dequantize to floating point, then perform a floating-point computation). If such an implementation consumes StableHLO, then the pros and cons discussed above play an important role.
    • Following the feedback, it seems that in the long run there could be ideas to support dynamic range quantization explicitly at the graph level. In the short term, we can use the stablehlo.custom_call operation to carry DRQ (see the sketch after this list).
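
As a purely hypothetical sketch of that short-term option (the call target name @drq_convolution is made up for illustration and is not an agreed-upon contract), a DRQ convolution could be carried through StableHLO as an opaque custom call:

    // Hypothetical custom_call wrapping a dynamic-range-quantized convolution;
    // the semantics would live entirely in the downstream consumer.
    %result = stablehlo.custom_call @drq_convolution(%arg0, %arg1)
        : (tensor<?x5x5x3xf32>, tensor<3x3x3x38x!quant.uniform<i8:f32, 34.0:16>>)
        -> tensor<?x3x3x38xf32>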

Please let me know your comments and feedback.

sdasgup3 added a commit that referenced this issue Jun 28, 2023
### Summary

The PR proposes the specification for quantized elementwise operations (47 in total).

### Details

Overall, we propose treating quantized elements as floating-point elements, and therefore ops on quantized tensors as ops on floating-point tensors, along the lines of dequantize -> float computation -> quantize. This principle works for most elementwise ops, although there are some exceptions discussed below.
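
For example, here is a sketch (with made-up shapes and quantization parameters) of what this decomposition looks like for `add`:

    // dequantize -> float computation -> quantize, spelled out for add
    %lhs_f32 = stablehlo.uniform_dequantize %lhs
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %rhs_f32 = stablehlo.uniform_dequantize %rhs
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %sum_f32 = stablehlo.add %lhs_f32, %rhs_f32 : tensor<4xf32>
    %sum = stablehlo.uniform_quantize %sum_f32
        : (tensor<4xf32>) -> tensor<4x!quant.uniform<i8:f32, 0.1:0>>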

Furthermore, the proposal is to only support per-tensor quantization for elementwise ops. We haven't yet come across use cases for per-axis quantization for these ops, so let's start small. The story for per-axis quantization will be worked out in #1574.

Finally, we propose to not support hybrid quantization (i.e. situations when some inputs/outputs are quantized and some are not) for now. The story for this will be worked out in #1575.

- **Ops that support tensors of floating-point types (33)**: These ops support quantized tensors, with semantics following dequantize -> float computation -> quantize. We are using the `dequantize_op_quantize` function to express it:
  - Binary (10): `add, atan2, compare, divide, maximum, minimum, multiply, power, remainder, subtract`.
  - Unary (20): `abs, cbrt, ceil, cosine, exponential, exponential_minus_one, floor, is_finite, log, logistic, log_plus_one, negate, reduce_precision, round_nearest_afz, round_nearest_even, rsqrt, sign, sine, sqrt, tanh`.
  - Ternary (2): `clamp, select`.
  - Other (1): `map`.
  
- **Ops that don't support tensors of floating-point types (9)**: These ops (`and, count_leading_zeros, not, or, popcnt, shift_left, shift_right_arithmetic, shift_right_logical, xor`) don't support quantized tensors. If there is a need to perform computations on the underlying integer representation of these tensors, they can be bitcast_convert'ed to integers.

- **Ops that involve complex types (3)**: These ops (`complex`, `imag`, `real`) don't support quantized tensors because quantization doesn't compose with complex types at the moment.

- **Conversion ops (2)**:
  - `convert`: A convert from a quantized type to any type can be realized using `stablehlo.uniform_dequantize` followed by `stablehlo.convert` to convert the dequantized floating-point type to the type of choice. Similarly, a convert from any type to a quantized type can be realized using `stablehlo.convert` to a floating-point type followed by `stablehlo.uniform_quantize` (see the sketch after this list). It's not necessarily great that we have 3 ops to represent something that could theoretically be represented by 1 op, and we're planning to explore a potential simplification in #1576.
  - `bitcast_convert`: Works with low-level representations, so it treats quantized elements as integer elements.
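
As a sketch of the `convert` decomposition described above (the i32 destination type and the quantization parameters are only examples):

    // quantized -> arbitrary type: dequantize first, then convert
    %f = stablehlo.uniform_dequantize %q
        : (tensor<4x!quant.uniform<i8:f32, 0.1:0>>) -> tensor<4xf32>
    %i = stablehlo.convert %f : (tensor<4xf32>) -> tensor<4xi32>
    // arbitrary type -> quantized: convert to float first, then quantize
    %g = stablehlo.convert %i : (tensor<4xi32>) -> tensor<4xf32>
    %q2 = stablehlo.uniform_quantize %g
        : (tensor<4xf32>) -> tensor<4x!quant.uniform<i8:f32, 0.1:0>>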
sdasgup3 commented

#1792 proposes semantic changes in StableHLO to support weight-only quantization for the convolution and dot_general ops.

Remaining tasks:

  1. Figure out whether ops other than dot_general and convolution need hybrid op support.
  2. Figure out the story for dynamic range quantization.

GleasonK pushed a commit that referenced this issue Apr 2, 2024
…y quantization (#1792)

This RFC proposes to add hybrid quantized convolution and dot_general for weight-only quantization. Please let me know your feedback on this.

The RFC partially addresses issue #1575 w.r.t. supporting weight-only quantization in StableHLO. The remaining tasks are highlighted [here](#1575 (comment)).

sdasgup3 commented Apr 8, 2024

With #1792 merged, let us close this issue. We will open separate issues for the remaining tasks (#1575 (comment)) once we have more information around them.
