[QDQ Quant] Support mixed-precision integer quantization via overrides #19925

Merged
adrianlizarraga merged 11 commits into main from adrianl/qdq-quant-mixed-prec-overrides
Mar 23, 2024

@adrianlizarraga adrianlizarraga commented Mar 15, 2024

Description

Adds support for specifying mixed precision QDQ models via tensor quantization overrides.

Motivation and Context

This PR implements an approach for supporting "mixed precision" models. The following figure shows an example mixed precision model as defined in this PR.

![image](https://github.com/microsoft/onnxruntime/assets/19691973/40ae3bf9-b21a-4ba5-a1cd-41c1e08c21e7)

A mixed precision QDQ model consists of regions with different activation/weight quantization data types. The boundary between regions converts between activation quantization data types (e.g., uint8 to uint16) using a DQ to Q sequence.
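To make the boundary conversion concrete, here is a minimal pure-Python sketch of a DQ to Q sequence that converts a uint8 activation to uint16. The scale and zero-point values are made up for illustration (in a real model they come from calibration or from overrides), and the helper names `dequantize`, `quantize`, and `convert_u8_to_u16` are hypothetical, not part of the ORT API.

```python
def dequantize(q, scale, zero_point):
    """DequantizeLinear-style step: map an integer value back to a real value."""
    return (q - zero_point) * scale


def quantize(x, scale, zero_point, qmin, qmax):
    """QuantizeLinear-style step: scale, round, shift, then saturate to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


# Assumed parameters: both types cover the same real range [0.0, 2.55].
U8_SCALE, U8_ZP = 0.01, 0
U16_SCALE, U16_ZP = 2.55 / 65535, 0


def convert_u8_to_u16(q_u8):
    """Boundary conversion: DQ with the uint8 parameters, then Q with the uint16 ones."""
    real_value = dequantize(q_u8, U8_SCALE, U8_ZP)
    return quantize(real_value, U16_SCALE, U16_ZP, 0, 65535)


for q in (0, 128, 255):
    print(q, "->", convert_u8_to_u16(q))
```

The conversion itself adds no precision to the incoming tensor; the benefit comes from the operators inside the promoted region computing and storing their results with the finer 16-bit grid.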

The ability to specify regions with different quantization data types enables exploring the tradeoffs between accuracy and latency. A higher integer precision may improve accuracy at the expense of latency, so selectively promoting certain regions to a higher precision can aid in achieving a desirable balance in key metrics.

Current support

By default, the ORT quantizer supports specifying default activation and weight quantization data types for the entire model. A recent PR added support for specifying basic quantization overrides at the tensor level via the extra_options["TensorQuantOverrides"] configuration:

TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
           'quant_type' = QuantType : The tensor's quantization data type.
           'scale' =  Float         : The scale value to use. Must also specify `zero_point` if set.
           'zero_point' = Int       : The zero-point value to use. Must also specify `scale` if set.
           'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if also
                                      set `scale` or `zero_point`.
           'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if also
                                      set `scale` or `zero_point`.
           'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                      Invalid if also set `scale` or `zero_point`.
           'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                      Invalid if also set `scale` or `zero_point`.

The tensor-level overrides are currently used to override the quantization type for weights/initializers or to set specific scale/zero-point values for a tensor (e.g., QNN requires Sigmoid to use a specific scale/zero-point at its output).
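As a rough illustration of the rules listed in the option description above, the following sketch builds a per-tensor and a per-channel override and checks each dictionary against the stated constraints. The tensor names, the numeric values, and the `check_override` helper are assumptions made for this example; plain strings stand in for `QuantType` members so the snippet runs without onnxruntime installed, and the quantizer's own validation may differ.

```python
def check_override(entry):
    """Return a list of rule violations for a single override dictionary."""
    errors = []
    has_scale = "scale" in entry
    has_zero_point = "zero_point" in entry
    # 'scale' and 'zero_point' must be specified together.
    if has_scale != has_zero_point:
        errors.append("'scale' and 'zero_point' must be specified together")
    # These keys are invalid once an explicit scale or zero-point is given.
    if has_scale or has_zero_point:
        for key in ("symmetric", "reduce_range", "rmax", "rmin"):
            if key in entry:
                errors.append(f"'{key}' is invalid when 'scale' or 'zero_point' is set")
    return errors


overrides = {
    # Per-tensor override: the list holds a single dictionary.
    "sigmoid_out": [{"quant_type": "QUInt16", "scale": 1.0 / 65536.0, "zero_point": 0}],
    # Per-channel override: one dictionary per channel (two channels here).
    "conv_weight": [
        {"quant_type": "QInt8", "symmetric": True},
        {"quant_type": "QInt8", "symmetric": True},
    ],
    # Deliberately invalid: 'scale' without 'zero_point', combined with 'rmax'.
    "bad_tensor": [{"scale": 0.1, "rmax": 6.0}],
}

problems = {
    name: [err for entry in entries for err in check_override(entry)]
    for name, entries in overrides.items()
}
print(problems)
```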

However, these overrides are not typically used to override activation quantization types, largely because of operator data type constraints. Consider, for example, that all inputs and outputs of an Add operator must have the same data type. Consequently, using a tensor-level override to promote the Add’s output to 16-bit would force its inputs to be overridden to 16-bit as well. In turn, this could cascade through potentially the entire graph. The solution implemented by this PR is to allow the specification of tensor boundaries where the activation quantization data type changes.
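The cascading effect can be sketched with a toy graph. This is only an illustration under a simplifying assumption: every node behaves like Add and requires all of its inputs and outputs to share one type, and no conversion boundaries are available. The graph, tensor names, and the `tensors_forced_to_16bit` helper are hypothetical.

```python
# Hypothetical chain of Add-like nodes: node -> (input tensors, output tensors).
nodes = {
    "Add1": (["x", "y"], ["a"]),
    "Add2": (["a", "z"], ["b"]),
    "Add3": (["b", "w"], ["c"]),
}


def tensors_forced_to_16bit(start_tensor):
    """Collect every tensor that must become 16-bit once `start_tensor` is promoted."""
    promoted = {start_tensor}
    changed = True
    while changed:  # iterate until no node introduces a new promotion
        changed = False
        for inputs, outputs in nodes.values():
            touched = set(inputs) | set(outputs)
            # A node touching a promoted tensor drags all of its tensors along.
            if touched & promoted and not touched <= promoted:
                promoted |= touched
                changed = True
    return promoted


# Promoting only the output of Add2 ends up promoting every tensor in the graph.
print(sorted(tensors_forced_to_16bit("b")))
```

Explicit conversion boundaries stop this propagation: a tensor keeps its original type for most consumers and is converted only for the nodes inside the promoted region.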

The approach

The following figure shows a model with a region that has been promoted to 16-bit from the default 8-bit activation type.

![image](https://github.com/microsoft/onnxruntime/assets/19691973/5998c301-ae20-4ac9-8a43-37f335cfcf8b)

Note the following observations:

  • Op2’s output is consumed by Op4, Op7, and Op8. Op4 consumes the converted u16 type, while Op7 and Op8 consume the original u8 type.
  • Op3’s output is converted from u8 to u16. Op5 consumes the converted u16 type.
  • Op4’s output is just u16 (not converted).
  • Op5’s output is converted from u16 to u8. Op6 consumes the u8 type.

The approach implemented by this PR uses the tensor-level quantization overrides to specify a tensor’s quantization type at both the producer and consumer ends. The following shows the overrides necessary to create this mixed precision QDQ model.

overrides = {
  "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
  "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
  "Op4_out": [{"quant_type": QUInt16}],
  "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}]
}
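One way to read these overrides is to ask which quantization type a given consumer node sees for a tensor. The sketch below answers that question for the example above. It is not the quantizer's implementation: `type_seen_by` is a hypothetical helper, strings stand in for `QuantType` members, and treating a missing `recv_nodes` as "all consumers get the converted type" is an assumption of this example.

```python
overrides = {
    "Op2_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op4"}}}],
    "Op3_out": [{"quant_type": "QUInt8", "convert": {"quant_type": "QUInt16", "recv_nodes": {"Op5"}}}],
    "Op4_out": [{"quant_type": "QUInt16"}],
    "Op5_out": [{"quant_type": "QUInt16", "convert": {"quant_type": "QUInt8", "recv_nodes": {"Op6"}}}],
}


def type_seen_by(tensor, consumer, default="QUInt8"):
    """Return the quantization type that `consumer` receives for `tensor`."""
    entries = overrides.get(tensor)
    if not entries:
        return default  # no override: the model-wide activation type applies
    entry = entries[0]  # activations use per-tensor overrides (a single dictionary)
    convert = entry.get("convert")
    if convert is not None:
        recv_nodes = convert.get("recv_nodes")
        # Assumption: without recv_nodes, every consumer gets the converted type.
        if recv_nodes is None or consumer in recv_nodes:
            return convert["quant_type"]
    return entry.get("quant_type", default)


for tensor, consumer in [("Op2_out", "Op4"), ("Op2_out", "Op7"), ("Op4_out", "Op5"), ("Op5_out", "Op6")]:
    print(f"{consumer} reads {tensor} as {type_seen_by(tensor, consumer)}")
```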

@adrianlizarraga adrianlizarraga marked this pull request as ready for review March 15, 2024 12:41
@adrianlizarraga

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, Linux QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@adrianlizarraga

/azp run Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed, Linux MIGraphX CI Pipeline, Big Models

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@adrianlizarraga

/azp run ONNX Runtime React Native CI Pipeline, orttraining-amd-gpu-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

jywu-msft previously approved these changes Mar 20, 2024
@jywu-msft jywu-msft left a comment

thanks! great work. i'm signing off, but it's good to also have an approval from either @xadupre or @yufenglee

@adrianlizarraga adrianlizarraga merged commit cdc5d72 into main Mar 23, 2024
@adrianlizarraga adrianlizarraga deleted the adrianl/qdq-quant-mixed-prec-overrides branch March 23, 2024 18:05
adrianlizarraga added a commit that referenced this pull request Mar 25, 2024
…0028)

### Description
- Adds a utility to the QNN quantization scripts that "fixes" an initial
set of tensor quantization overrides for mixed-precision QDQ models.
Follow-up to #19925
- Moves existing overrides for QNN compatibility (matmul, layernorm,
sigmoid, tanh) to separate functions. PR adds missing unit tests for
these.
- Adds `weight_symmetric=None` parameter to the `get_qnn_qdq_config()`
function to enable user specification (instead of always using default
behavior).
- If weight_symmetric is set to `None`, it will be set to
`weight_symmetric = weight_type in (QUInt8, QUInt16)`.
  - Otherwise, the user's value is used.

#### Example
Float model:

```
    input_0 --> Op1 --> Op3 --> Op5 --> Op6 --> output_0
                                 ^
                                 |
    input_1 --> Op2 -+-> Op4 ----+
                     |
                     +-> Op7 --> output_1
                     |
                     +-> Op8 --> output_2
```

If we'd like to quantize this model to uint8 precision, but would like
to make sure tensor "Op4_out" is quantized to 16-bit, then we would
specify the following initial tensor quantization overrides:
```python
# Op4_out could be an inaccurate tensor that should be upgraded to 16bit
initial_overrides = {"Op4_out": [{"quant_type": QuantType.QUInt16}]}
```

These initial overrides may not create a valid model because Op4 and Op5
may require both the input and output to be the same type (e.g.,
uint16). This helper fixes the overrides so that input/output data types
are valid:

```python
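from onnxruntime.quantization import QuantType
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

# float_model_path and data_reader (a calibration data reader) are supplied by the caller.
qnn_config = get_qnn_qdq_config(
    float_model_path,
    data_reader,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    init_overrides=initial_overrides,  # These initial overrides will be "fixed"
)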
```

The above snippet generates the following "fixed" overrides (available via
`qnn_config.extra_options["TensorQuantOverrides"]`):
```python
    {
      "Op2_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op4"}}}],
      "Op3_out": [{"quant_type": QUInt8, "convert": {"quant_type": QUInt16, "recv_nodes": {"Op5"}}}],
      "Op4_out": [{"quant_type": QUInt16}],
      "Op5_out": [{"quant_type": QUInt16, "convert": {"quant_type": QUInt8, "recv_nodes": {"Op6"}}}]
    }
```

How to interpret the fixed overrides:
- Op2's output is consumed by Op4, Op7, and Op8. Op4 consumes the
converted u16 type, but Op7 and Op8 consume the original u8 type.
- Op3's output is converted from u8 to u16. Op5 consumes the converted
u16 type.
- Op4's output is just u16 (not converted). All consumers of Op4_out get
the u16 type.
- Op5's output is converted from u16 to u8. Op6 consumes the u8 type.
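The fixing step can be approximated with a small graph walk. The sketch below is a simplified illustration, not the utility added by this PR: it assumes that the producer and every direct consumer of an overridden tensor must use the overridden type on all of their inputs and outputs, and that the region ends right after those consumers. The graph is transcribed from the float model diagram above; `fix_overrides` and its helpers are hypothetical names, and strings stand in for `QuantType` members.

```python
DEFAULT = "QUInt8"  # model-wide activation type

# node -> (input tensors, output tensors)
graph = {
    "Op1": (["input_0"], ["Op1_out"]),
    "Op2": (["input_1"], ["Op2_out"]),
    "Op3": (["Op1_out"], ["Op3_out"]),
    "Op4": (["Op2_out"], ["Op4_out"]),
    "Op5": (["Op3_out", "Op4_out"], ["Op5_out"]),
    "Op6": (["Op5_out"], ["output_0"]),
    "Op7": (["Op2_out"], ["output_1"]),
    "Op8": (["Op2_out"], ["output_2"]),
}


def producer_of(tensor):
    return next((n for n, (_, outs) in graph.items() if tensor in outs), None)


def consumers_of(tensor):
    return [n for n, (ins, _) in graph.items() if tensor in ins]


def add_convert(fixed, tensor, to_type, node):
    """Record that `node` must receive `tensor` converted to `to_type`."""
    entry = fixed.setdefault(tensor, [{"quant_type": DEFAULT}])[0]
    if entry["quant_type"] == to_type:
        return  # the tensor already has the required type
    convert = entry.setdefault("convert", {"quant_type": to_type, "recv_nodes": set()})
    convert["recv_nodes"].add(node)


def fix_overrides(initial):
    fixed = {name: [dict(entries[0])] for name, entries in initial.items()}
    for tensor, entries in initial.items():
        qtype = entries[0]["quant_type"]
        producer = producer_of(tensor)
        if producer is not None:
            # The producer's inputs must be converted to the overridden type.
            for inp in graph[producer][0]:
                add_convert(fixed, inp, qtype, producer)
        for consumer in consumers_of(tensor):
            ins, outs = graph[consumer]
            # The consumer's other inputs must match the overridden type ...
            for inp in ins:
                if inp != tensor:
                    add_convert(fixed, inp, qtype, consumer)
            # ... and its outputs use that type, converted back for downstream nodes.
            for out in outs:
                fixed.setdefault(out, [{"quant_type": qtype}])
                for downstream in consumers_of(out):
                    add_convert(fixed, out, DEFAULT, downstream)
    return fixed


fixed = fix_overrides({"Op4_out": [{"quant_type": "QUInt16"}]})
for name in sorted(fixed):
    print(name, fixed[name])
```

For this particular graph the sketch arrives at the same four entries as the fixed overrides listed above; more general graphs (multiple overridden tensors, conflicting conversions, graph inputs and outputs) would need handling that this example leaves out.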

### Motivation and Context
Generating mixed-precision quantization overrides is currently a manual
process. This PR adds a utility that helps generate valid overrides.
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
microsoft#19925)
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
…crosoft#20028)
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Aug 19, 2025
…0028)
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Sep 15, 2025
…0028)