84 changes: 84 additions & 0 deletions docs/source/features/quantization.md
@@ -198,3 +198,87 @@ Olive consolidates the NVIDIA TensorRT Model Optimizer-Windows quantization into
```

Please refer to [Phi3.5 example](https://github.com/microsoft/olive-recipes/tree/main/microsoft-Phi-3.5-mini-instruct/NvTensorRtRtx) for usability and setup details.


## Quantize with AI Model Efficiency Toolkit
Olive supports quantizing models with Qualcomm's [AI Model Efficiency Toolkit](https://github.com/quic/aimet) (AIMET).

AIMET is a software toolkit for quantizing trained ML models so they can be deployed efficiently on edge devices such as mobile phones and laptops. It employs post-training and fine-tuning techniques to minimize accuracy loss during quantization.

Olive consolidates AIMET quantization into a single pass, `AimetQuantization`, which supports LPBQ, SeqMSE, and AdaRound. Multiple techniques can be applied in a single pass by listing them in the `techniques` array (a combined sketch follows the example configuration below). If no techniques are specified, AIMET applies basic static quantization to the model using the provided calibration data.

| Technique | Description |
|--------------------------------|-----------------------------------------------------------------------------|
| **LPBQ** | An alternative to blockwise quantization which allows backends to leverage existing per-channel quantization kernels while significantly improving encoding granularity. |
| **SeqMSE** | Optimizes the weight encodings of each layer of a model to minimize the difference between the layer's original and quantized outputs. |
| **AdaRound** | Tunes the rounding direction for quantized model weights to minimize the local quantization error at each layer output. |

### Example Configuration

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config"
}
```
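
Because multiple techniques can be listed in the `techniques` array, a single pass can apply them together. A minimal sketch combining LPBQ and SeqMSE; the specific pairing and ordering are illustrative assumptions, not a recommendation from this documentation:

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "techniques": [
        {"name": "lpbq", "block_size": 64},
        {"name": "seqmse", "num_candidates": 20}
    ]
}
```

Each entry accepts the technique-specific options described in the sections below.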

#### LPBQ

Configurations:

- `block_size`: Number of input channels to group in each block (default: `64`).
- `op_types`: List of operator types for which to enable LPBQ (default: `["Gemm", "MatMul", "Conv"]`).
- `nodes_to_exclude`: List of node names to exclude from LPBQ weight quantization (default: `None`).

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "techniques": [
        {"name": "lpbq", "block_size": 64}
    ]
}
```
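
A variant exercising the remaining LPBQ options; the op-type restriction and the excluded node name are illustrative placeholders (the node name mirrors the AdaRound example below):

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "techniques": [
        {
            "name": "lpbq",
            "block_size": 64,
            "op_types": ["Gemm", "MatMul"],
            "nodes_to_exclude": ["/lm_head/MatMul"]
        }
    ]
}
```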

#### SeqMSE

Configurations:

- `data_config`: Data config to use for SeqMSE optimization. Defaults to the calibration set if not specified (a sketch with a separate data config follows the example below).
- `num_candidates`: Number of encoding candidates to sweep for each weight (default: `20`).

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "precision": "int4",
    "techniques": [
        {"name": "seqmse", "num_candidates": 20}
    ]
}
```
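
As noted above, SeqMSE can also take its own `data_config`, separate from the pass-level calibration set. A sketch assuming a second data config named `seqmse_data_config` has been defined elsewhere in the workflow (the name is a hypothetical placeholder):

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "precision": "int4",
    "techniques": [
        {"name": "seqmse", "data_config": "seqmse_data_config", "num_candidates": 20}
    ]
}
```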

#### AdaRound

Configurations:

- `num_iterations`: Number of optimization steps to take for each layer (default: `10000`). The recommended value is 10K for weight bitwidths >= 8 bits and 15K for bitwidths < 8 bits.
- `nodes_to_exclude`: List of node names to exclude from AdaRound optimization (default: `None`).

```json
{
    "type": "AimetQuantization",
    "data_config": "calib_data_config",
    "techniques": [
        {"name": "adaround", "num_iterations": 10000, "nodes_to_exclude": ["/lm_head/MatMul"]}
    ]
}
```

Please refer to [AimetQuantization](aimet_quantization) for more details about the pass and its config parameters.
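
For orientation, a minimal sketch of where this pass might sit in a complete Olive workflow config. The `input_model` and `data_configs` entries are illustrative assumptions modeled on typical Olive recipes, not part of this pass's documentation:

```json
{
    "input_model": {"type": "OnnxModel", "model_path": "model.onnx"},
    "data_configs": [{"name": "calib_data_config", "type": "HuggingfaceContainer"}],
    "passes": {
        "quantize": {
            "type": "AimetQuantization",
            "data_config": "calib_data_config",
            "techniques": [{"name": "lpbq", "block_size": 64}]
        }
    },
    "output_dir": "models/aimet-quantized"
}
```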

7 changes: 7 additions & 0 deletions docs/source/reference/pass.rst
@@ -194,6 +194,13 @@ ModelBuilder
------------
.. autoconfigclass:: olive.passes.ModelBuilder

.. _aimet_quantization:

AimetQuantization
-----------------

.. autoconfigclass:: olive.passes.AimetQuantization

Pytorch
=================================
