Intel® Neural Compressor v2.3 Release

Highlights

Integrated Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469).
Supported low precision (INT4, NF4, FP4) and Weight-Only Quantization algorithms including RTN, AWQ, GPTQ and TEQ on ONNX Runtime and PyTorch for LLM optimization.
Supported SparseGPT pruner (88adfc).
Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids).

Features

[Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
[Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
[Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
[Quantization] Support NF4 and FP4 data type for PyTorch Weight-Only Quantization (3d11b5)
[Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
[Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
[Pruning] Add SparseGPT pruner and refactor pruning class (88adfc)
[Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
[Model Export] Support PT2ONNX dynamic quantization export (165532)
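
For readers new to the Weight-Only Quantization items above, here is a minimal sketch of the idea behind RTN (round-to-nearest) with one scale per group of weights. It is plain Python for illustration only and is not the Intel Neural Compressor implementation; the function names `rtn_quantize` and `rtn_dequantize`, the symmetric scheme, and the group size are assumptions chosen for the example.

```python
def rtn_quantize(weights, bits=4, group_size=4):
    """Symmetric round-to-nearest quantization with one scale per group.

    Illustrative sketch only (hypothetical helper, not a library API).
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit signed integers
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit signed integers
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)                      # round to nearest integer
            quantized.append(max(qmin, min(qmax, q))) # clip to the integer range
    return quantized, scales


def rtn_dequantize(quantized, scales, group_size=4):
    """Recover approximate float weights from integers and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


weights = [0.7, -0.3, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
q, s = rtn_quantize(weights, bits=4, group_size=4)
restored = rtn_dequantize(q, s, group_size=4)
```

Only the weights are stored as low-bit integers; activations stay in floating point, which is why this family of methods is called weight-only. Smaller groups give more scales and usually lower error at the cost of extra storage.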

Improvement

[Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
[Common] Enhance ONNX Runtime backend check (4ce9de)
[Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
[Strategy] Enhance strategy exit policy (d19b42)
[Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
[Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
[Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
[Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
[Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
[Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
[Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
[Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
[Pruning] Refactor pruning (92d14d)
[Mixed Precision] Update the list of supported layers for Keras mixed precision (692c8b)
[Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)
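
As background for the SmoothQuant alpha-tuning item above, the sketch below shows what alpha controls in the published SmoothQuant formulation: a per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) that moves quantization difficulty from activations to weights. It is plain Python for illustration, not the tuning code from this release; `smooth_scales` and the `eps` guard are hypothetical names introduced here.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Per-input-channel smoothing scales (hypothetical helper).

    act_absmax[j]    : max |activation| seen on input channel j
    weight_absmax[j] : max |weight| on input channel j
    """
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


# Two input channels: the first has large activations, the second large weights.
scales = smooth_scales([16.0, 1.0], [1.0, 4.0], alpha=0.5)

# Applying the scales: divide activations, multiply weights.
x = [8.0, 0.5]   # one activation vector
w = [0.5, 2.0]   # one output column of the weight matrix
x_smooth = [xi / si for xi, si in zip(x, scales)]
w_smooth = [wi * si for wi, si in zip(w, scales)]

# The layer output is mathematically unchanged by the rescaling.
y = sum(xi * wi for xi, wi in zip(x, w))
y_smooth = sum(xi * wi for xi, wi in zip(x_smooth, w_smooth))
```

With alpha = 0.5 the activation and weight ranges of each channel are balanced; a larger alpha shifts more of the range onto the weights. Alpha tuning searches for the value that gives the best accuracy after quantization.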

Productivity

[Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples (#411, #412, #469)
[Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization (#16288)
[Neural Insights] Support PyTorch FX inspect tensor and integrate with Neural Insights (775def, 74a785)
[Neural Insights] Add step-by-step diagnosis cases (99c3b0)
[Neural Solution] Resource management and user-facing API enhancement (fbba10)
[Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)

Bug Fixes

Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
Fix PyTorch dataloader batch size issue (6a98d0)
Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
Fix ONNX Runtime diagnosis bug (f10e26)
Update Neural Solution example and fix grpc port issue (528868)
Fix the objective initialization issue (9d7546)
Fix reshape issue for Bayesian strategy (77cb83)
Fix CVEs (d86922, 2bbfcd, fc71fa)

Examples

Add Weight-Only LLM examples for PyTorch (4b24be, 66f7c1, aa457a)
Add Weight-Only LLM examples for ONNX Runtime (10c133)
Enable 3 ONNX Runtime examples, CodeBert (5e584e), LayoutLMv2 FUNSD (5f0b17), Table Transformer (eb8a95)
Add ONNX Runtime LLM SmoothQuant example Llama-7B (7fbcf5)
Enable 2 TensorFlow examples, ViT (94df99), GraphSage (29ec82)
Add easy getting-started notebooks (d7b608, 6ee846)
Add multi-card magnitude pruning use case (909618)
Unify ONNX Runtime prepare model scripts (5ecb13)

Validated Configurations
