
Quantization

  1. Quantization Introduction
  2. Quantization Fundamentals
  3. Accuracy Aware Tuning
  4. Get Started
    4.1 Post Training Quantization
    4.2 Specify Quantization Rules
    4.3 Specify Quantization Backend and Device
  5. Examples

Quantization Introduction

Quantization is a widely used deep learning model optimization technique for improving inference speed. It reduces the number of bits required by converting a set of real-valued numbers into a lower-bit data representation, such as int8 and int4, mainly during the inference phase with minimal to no loss in accuracy. This reduces the memory requirement, cache miss rate, and computational cost of running neural networks, and ultimately achieves higher inference performance. On 3rd Gen Intel® Xeon® Scalable Processors, users can expect up to a 4x theoretical performance speedup. Further performance improvement is expected with Intel® Advanced Matrix Extensions on 4th Gen Intel® Xeon® Scalable Processors.

Quantization Fundamentals

Affine quantization and Scale quantization are two common range mapping techniques used in tensor conversion between different data types.

In both cases, the quantized value is computed as $$X_{int8} = round(X_{fp32} / Scale + ZeroPoint)$$; the two techniques differ in how Scale and ZeroPoint are derived.

Affine Quantization

This is the so-called asymmetric quantization, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

where:

If INT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 127$ and $ZeroPoint = -128 - X_{f_{min}} / Scale$.

or

If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPoint = - X_{f_{min}} / Scale$.
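
For illustration, the affine mapping above can be sketched in a few lines of NumPy (the function name and the clipping step are illustrative, not part of the library API):

```python
import numpy as np


def affine_quantize_uint8(x_fp32: np.ndarray):
    """Asymmetric (affine) quantization of a float tensor to uint8 (assumes max > min)."""
    x_min, x_max = float(x_fp32.min()), float(x_fp32.max())
    scale = abs(x_max - x_min) / 255        # Scale = |X_max - X_min| / 255
    zero_point = round(-x_min / scale)      # ZeroPoint = -X_min / Scale
    x_uint8 = np.clip(np.round(x_fp32 / scale) + zero_point, 0, 255).astype(np.uint8)
    return x_uint8, scale, zero_point
```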

Scale Quantization

This is the so-called symmetric quantization, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is the same as above, where:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.

or

If UINT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 255$ and $ZeroPoint = 128$.
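
Similarly, a minimal NumPy sketch of the symmetric int8 case (illustrative only, not a library API):

```python
import numpy as np


def scale_quantize_int8(x_fp32: np.ndarray):
    """Symmetric (scale) quantization of a float tensor to int8."""
    scale = max(abs(float(x_fp32.max())), abs(float(x_fp32.min()))) / 127  # Scale = max(|X_max|, |X_min|) / 127
    zero_point = 0                                                         # ZeroPoint = 0 for int8
    x_int8 = np.clip(np.round(x_fp32 / scale) + zero_point, -128, 127).astype(np.int8)
    return x_int8, scale, zero_point
```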

NOTE

Sometimes the reduce_range feature, which uses a 7-bit width (1 sign bit + 6 data bits) to represent the int8 range, may be needed on some early Xeon platforms. Those platforms can overflow because the int8 dot product operation uses fp16 intermediate results. After the AVX512_VNNI instruction was introduced, this issue was solved by supporting fp32 intermediate data.

Quantization Support Matrix

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
| --- | --- | --- | --- |
| ONNX Runtime | MLAS | Weight (int8) | Activation (uint8) |

Quantization Scheme

  • Symmetric Quantization
    • int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
  • Asymmetric Quantization
    • uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
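
As a quick numeric check of the two formulas above, using an illustrative float range of rmin = -2.0 and rmax = 6.0:

```python
rmin, rmax = -2.0, 6.0  # illustrative float range, not from a real model

# Asymmetric uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8))
u8_scale = (rmax - rmin) / (255 - 0)        # 8 / 255 ≈ 0.0314
u8_zero_point = 0 - round(rmin / u8_scale)  # 0 - round(-63.75) = 64

# Symmetric int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
s8_scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1)  # 12 / 254 ≈ 0.0472
```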


Quantization Approaches

Quantization has three different approaches:

  1. post training dynamic quantization
  2. post training static quantization
  3. quantization aware training

The first two approaches belong to optimization during inference. The last belongs to optimization during training. Currently, ONNX Runtime doesn't support the last one.

Post Training Dynamic Quantization

The weights of the neural network get quantized into int8 format from float32 format offline. The activations of the neural network are quantized as well, with the min/max range collected during inference runtime.

This approach is widely used in dynamic-length neural networks, such as NLP models.
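
A minimal sketch of applying it with the DynamicQuantConfig interface shown in the Get Started section below (model and q_model_path stand for the user's own input model and output path):

```python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize

# Dynamic quantization needs no calibration data reader: weights are quantized
# offline while activation ranges are collected at inference runtime.
qconfig = config.DynamicQuantConfig()
quantize(model, q_model_path, qconfig)  # model / q_model_path supplied by the user
```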

Post Training Static Quantization

Compared with post training dynamic quantization, the min/max ranges of weights and activations are collected offline on a so-called calibration dataset. This dataset should represent the data distribution of the unseen inference data. The calibration process runs on the original fp32 model and dumps out all the tensor distributions for Scale and ZeroPoint calculations. Usually preparing 100 samples is enough for calibration.

This is the main quantization approach to try, because it usually provides better performance than post training dynamic quantization.
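
For illustration, a calibration data reader for static quantization might wrap roughly 100 pre-collected samples like this (a sketch assuming a single model input named "input" with a 1x3x224x224 shape; real input names and shapes depend on the user's model):

```python
import numpy as np

from onnx_neural_compressor.quantization import calibrate


class CalibDataReader(calibrate.CalibrationDataReader):
    """Feeds ~100 representative samples to the calibration process."""

    def __init__(self, samples):
        self.samples = samples  # list of numpy arrays matching the model input
        self._iter = iter(self.samples)

    def get_next(self):
        # Return a feed dict keyed by the model input name, or None when exhausted.
        sample = next(self._iter, None)
        return {"input": sample} if sample is not None else None

    def rewind(self):
        self._iter = iter(self.samples)


calibration_data_reader = CalibDataReader(
    [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(100)]
)
```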

With or Without Accuracy Aware Tuning

Accuracy aware tuning is one of the unique features provided by Neural Compressor compared with other 3rd party model compression tools. This feature can be used to address the accuracy loss introduced by low-precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates quantized graphs, and evaluates the accuracy of these quantized graphs. The optimal model is yielded when the pre-defined accuracy goal is met.

Neural Compressor also supports quantizing all quantizable ops without accuracy tuning; users can decide whether to tune the model accuracy or not. Please refer to "Get Started" below.

Working Flow

Currently, accuracy aware tuning only supports post training quantization.

Users can refer to the chart below to understand the whole tuning flow.

*Figure: accuracy aware tuning working flow*

Get Started

The design philosophy of the quantization interface of ONNX Neural Compressor is ease of use. It requires the user to provide a model, calibration dataloader, and evaluation function. These parameters are used to quantize and tune the model.

model is the framework model location or the framework model object.

calibration dataloader is used to load the data samples for the calibration phase. In most cases, it could be a subset of the evaluation dataset.

If users need to tune the model accuracy, they should provide an evaluation function.

evaluation function is a function used to evaluate model accuracy. It is optional. This function should be consistent with how the user evaluates the fp32 model: it takes the model as input and returns a scalar value representing the evaluation accuracy.
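
For example, an evaluation function could run the candidate model with ONNX Runtime and return top-1 accuracy (a sketch; eval_samples, eval_labels, and the single-input assumption are placeholders for the user's own evaluation setup):

```python
import numpy as np
import onnxruntime as ort


def eval_fn(model) -> float:
    """Return a scalar accuracy; use the same logic as for the fp32 baseline."""
    # `model` may be a model path or an in-memory ONNX model depending on the workflow.
    model_or_bytes = model if isinstance(model, (str, bytes)) else model.SerializeToString()
    session = ort.InferenceSession(model_or_bytes, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name

    correct = 0
    for sample, label in zip(eval_samples, eval_labels):  # user-provided evaluation data
        logits = session.run(None, {input_name: sample})[0]
        correct += int(np.argmax(logits) == label)
    return correct / len(eval_labels)
```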

Users can execute either of the following:

Post Training Quantization

  1. Without Accuracy Aware Tuning

This means users can leverage ONNX Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It's the user's responsibility to ensure the accuracy of the quantized model meets expectations. ONNX Neural Compressor supports Post Training Static Quantization and Post Training Dynamic Quantization.

```python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize
from onnx_neural_compressor.quantization import calibrate


class DataReader(calibrate.CalibrationDataReader):
    def get_next(self): ...

    def rewind(self): ...


calibration_data_reader = DataReader()  # only needed by StaticQuantConfig
qconfig = config.StaticQuantConfig(calibration_data_reader)  # or qconfig = config.DynamicQuantConfig()
quantize(model, q_model_path, qconfig)
```

  2. With Accuracy Aware Tuning

This means users can leverage the advanced feature of ONNX Neural Compressor to tune out a quantized model with the best accuracy and good performance. Users should provide eval_fn.

```python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor.quantization import tuning


class DataReader(calibrate.CalibrationDataReader):
    def get_next(self): ...

    def rewind(self): ...


data_reader = DataReader()

# TuningConfig can accept:
# 1) a set of candidate configs like tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
# 2) one config with a set of candidate parameters like tuning.TuningConfig(config_set=[config.GPTQConfig(weight_group_size=[32, 64])])
# 3) our pre-defined config set like tuning.TuningConfig(config_set=config.get_woq_tuning_config())
custom_tune_config = tuning.TuningConfig(config_set=[config.RTNConfig(weight_bits=4), config.GPTQConfig(weight_bits=4)])
best_model = tuning.autotune(
    model_input=model,
    tune_config=custom_tune_config,
    eval_fn=eval_fn,
    calibration_data_reader=data_reader,
)
```

Specify Quantization Rules

ONNX Neural Compressor supports specifying quantization rules by operator name. Users can use the set_local API of configs to achieve this, as shown in the code below:

```python
fp32_config = config.GPTQConfig(weight_dtype="fp32")
quant_config = config.GPTQConfig(
    weight_bits=4,
    weight_dtype="int",
    weight_sym=False,
    weight_group_size=32,
)
quant_config.set_local("/h.4/mlp/fc_out/MatMul", fp32_config)
```
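
In this snippet, "/h.4/mlp/fc_out/MatMul" is the name of one operator in the model graph; assigning it the fp32_config keeps that MatMul in fp32, while the remaining quantizable ops follow quant_config.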

Specify Quantization Backend and Device

Neural Compressor will quantize models with a user-specified backend, or detect the hardware and software status automatically to decide which backend should be used. The automatically selected priority is: GPU/NPU > CPU.

| Backend | Backend Library | Support Device (cpu as default) |
| --- | --- | --- |
| CPUExecutionProvider | MLAS | cpu |
| TensorrtExecutionProvider | TensorRT | gpu |
| CUDAExecutionProvider | CUDA | gpu |
| DnnlExecutionProvider | OneDNN | cpu |
| DmlExecutionProvider* | OneDNN | npu |


Note

DmlExecutionProvider support is experimental; please expect exceptions.

Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; multi-batch and dynamic batch are not supported yet.

Examples

Users can refer to the examples to learn how to quantize a new model.