Releases: intel/neural-compressor

Intel® Neural Compressor v2.5.1 Release

03 Apr 14:03
  • Improvement
  • Bug Fixes
  • Validated Configurations

Improvement

  • Improve WOQ AutoRound export (409231, 7ee721)
  • Adapt example evaluation to the ITREX v1.4 release (9d7a05)
  • Update more supported LLM recipes (ce9b16)

Bug Fixes

  • Fix WOQ RTN supported layer checking condition (079177)
  • Fix in-place processing error in quant_weight function (92533a)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.15
  • ITEX 2.14.0
  • PyTorch/IPEX 2.2
  • ONNX Runtime 1.17

Intel® Neural Compressor v2.5 Release

26 Mar 10:21
24419c9
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • External Contributions
  • Validated Configurations

Highlights

  • Integrated Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU (see the sketch after this list)
  • Applied SmoothQuant & Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 & INT4 quantization and published the recipes
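
A minimal sketch of driving AutoRound through the 2.x Weight-Only Quantization API. The op_type_dict layout mirrors the documented WOQ configuration style, but the "AUTOROUND" algorithm string, the export call, and the model/dataloader objects are assumptions that may vary by version:

    from neural_compressor import PostTrainingQuantConfig, quantization

    # Weight-Only Quantization with AutoRound (assumed algorithm key "AUTOROUND").
    conf = PostTrainingQuantConfig(
        approach="weight_only",
        op_type_dict={
            ".*": {  # apply to all supported op types
                "weight": {
                    "bits": 4,
                    "group_size": 128,
                    "scheme": "sym",
                    "algorithm": "AUTOROUND",
                },
            },
        },
    )
    q_model = quantization.fit(model, conf, calib_dataloader=dataloader)
    # Export in the optimum-compatible layout (use_optimum_format, see v2.4 notes).
    q_model.export_compressed_model(use_optimum_format=True)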

Features

  • [Quantization] Integrate Weight-Only Quantization algorithm AutoRound (5c7f33, dfd083, 9a7ddd, cf1de7)
  • [Quantization] Quantize weight with in-place mode in Weight-Only Quantization (deb1ed)
  • [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
  • [Pruning] Support new pruning approaches Wanda and DSnoT for PyTorch LLM (7a3671)

Improvement

  • [Quantization] SmoothQuant code structure refactor (a8d81c)
  • [Quantization] Optimize the workflow of parsing Keras model (b816d7)
  • [Quantization] Support static_groups option in GPTQ API (1c426a)
  • [Quantization] Update TEQ train dataloader (d1e994)
  • [Quantization] WeightOnlyLinear keeps self.weight after recovery (2835bd)
  • [Quantization] Add version condition for IPEX prepare init (d96e14)
  • [Quantization] Enhance the ORT node name checking (f1597a)
  • [Pruning] Stop the tuning process early when enabling SmoothQuant (844a03)

Productivity

  • ORT LLM examples support latest optimum version (26b260)
  • Add coding style docs and recommended VS Code setting (c1f23c)
  • Adapt transformers 4.37 loading (6133f4)
  • Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
  • Support CI summary in PR comments (d4bcdd)
  • Notebook example update to install latest INC & TF, add metric in fit (4239d3)

Bug Fixes

  • Fix QA IPEX example fp32 input issue (c4de19)
  • Update conditions for getting min-max during TF MatMul requantization (d07175)
  • Fix TF saved_model issues (d8e60b)
  • Fix comparison of module_type and MulLinear (ba3aba)
  • Fix ORT calibration issue (cd6d24)
  • Fix ORT example bart export failure (b0dc0d)
  • Fix TF example accuracy diff during benchmark and quantization (5943ea)
  • Fix bugs for GPTQ exporting with static_groups (b4e37b)
  • Fix ORT quant issue caused by tensors having same name (0a20f3)
  • Fix Neural Solution SQL/CMD injection (14b7b0)
  • Fix the best qmodel recovery issue (f2d9b7)
  • Fix logger issue (83bc77)
  • Store token in protected file (c6f9cc)
  • Define the default SSL context (b08725)
  • Fix IPEX stats bug (5af383)
  • Fix ORT calibration for Dml EP (c58aea)
  • Fix wrong socket number retrieval for non-English systems (5b2a88)
  • Fix trust remote for llm examples (2f2c9a)

External Contributions

  • Intel Mac support (21cfeb)
  • Add PTQ example for PyTorch CV Segment Anything Model (bd5e69)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 11 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 2.13.0, 2.14.0
  • PyTorch/IPEX 2.0, 2.1, 2.2
  • ONNX Runtime 1.15, 1.16, 1.17

Intel® Neural Compressor v2.4.1 Release

29 Dec 13:12
b8c7f1a
  • Improvement
  • Bug Fixes
  • Examples
  • Validated Configurations

Improvement

  • Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
  • Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
  • Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)

Bug Fixes

  • Fix SmoothQuant alpha-space generation (33ece9)
  • Fix inputs error for SmoothQuant example_inputs (39f63a)
  • Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
  • Fix quantizable add ops detection on IPEX backend (4c004d)
  • Fix range step bug in ORTSmoothQuant (40275c)
  • Fix unit test bugs and update CI versions (6c78df, 835805)
  • Fix notebook issues (08221e)

Examples

  • Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
  • Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.14
  • ITEX 2.14.0.1
  • PyTorch/IPEX 2.1.0
  • ONNX Runtime 1.16.3

Intel® Neural Compressor v2.4 Release

17 Dec 03:26
111b3ce
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • Validated Configurations

Highlights

  • Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization.
  • Supported Weight-Only Quantization tuning for the ONNX Runtime backend (see the sketch after this list).
  • Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with FW extension API.
  • Supported SmoothQuant of Big Saved Model for the TensorFlow backend.
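
A rough sketch of accuracy-aware Weight-Only Quantization tuning on an ONNX model with the 2.x API; the model path, calib_dataloader, and eval_func names are illustrative placeholders, and quant_level="auto" is assumed to be the default search mode:

    from neural_compressor import PostTrainingQuantConfig, quantization

    # fit() tries weight-only configurations until the user-supplied
    # eval_func (returning a scalar accuracy) meets the tuning criterion.
    conf = PostTrainingQuantConfig(
        approach="weight_only",
        quant_level="auto",
    )
    q_model = quantization.fit(
        "model.onnx",                       # hypothetical ONNX model path
        conf,
        calib_dataloader=calib_dataloader,  # user-supplied calibration data
        eval_func=eval_func,                # user-supplied accuracy function
    )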

Features

  • [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
  • [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
  • [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2)
  • [Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
  • [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
  • [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
  • [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
  • [Common] [Experimental] Implement FW extension API (76b8b3, 8447d7, 258236)
  • [Quantization] [Experimental] FW extension API for PT backend supports Weight-Only Quantization (915018, dc9328)
  • [Quantization] [Experimental] FW extension API for TF backend supports Keras quantization (2627d3)
  • [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)

Improvement

  • [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
  • [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
  • [Quantization] Support restoring IPEX model from JSON (c3214c)
  • [Quantization] Add attribute to ONNX Runtime MatMulNBits (7057e3)
  • [Quantization] Increase SmoothQuant auto alpha running speed (173c18)
  • [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
  • [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
  • [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
  • [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
  • [Quantization] Support trace with dictionary type example_inputs (afe315)
  • [Quantization] Support Falcon Weight-Only Quantization (595d3a)
  • [Common] Add deprecation decorator in experimental folder (aeb3ed)
  • [Common] Remove 1.x API dependency (ee617a)
  • [Mixed Precision] Support PyTorch eager mode BF16 mixed precision (3bfb76), as shown in the sketch below
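
A minimal sketch of the BF16 mixed-precision conversion path in the 2.x API, assuming model is a PyTorch eager-mode module:

    from neural_compressor import MixedPrecisionConfig
    from neural_compressor.mix_precision import fit

    # Convert supported ops of an eager-mode PyTorch model to BF16.
    conf = MixedPrecisionConfig()  # BF16 is the default target precision
    converted_model = fit(model, conf=conf)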

Productivity

  • Support quantization and benchmark on macOS (16d6a0)
  • Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
  • Support TensorFlow new API for GNR base (8160c7)

Bug Fixes

  • Fix 'GraphModule' object has no attribute 'bias' error (7f53d1)
  • Fix ONNX model export issue (af0aea, eaa57f)
  • Add clip for ONNX Runtime SmoothQuant (cbb69b)
  • Fix SmoothQuant minmax observer init (b1db1c)
  • Fix SmoothQuant issue in get/set_module (dffcfe)
  • Align sparsity with block-wise masks in progressive pruning (fcdc29)

Examples

  • Support PEFT model with SmoothQuant (5e21b7)
  • Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04 & Windows 10 & macOS Ventura 13.5
  • Python 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.13, 2.14, 2.15
  • ITEX 1.2.0, 2.13.0.0, 2.14.0.1
  • PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
  • ONNX Runtime 1.14.1, 1.15.1, 1.16.3
  • MXNet 1.9.1

Intel® Neural Compressor v2.3.2 Release

23 Nov 15:30
  • Features
  • Bug Fixes
  • Validated Configurations

Features

  • Reduce memory consumption in ONNXRT adaptor (f64833)
  • Support MatMulFpQ4 for onnxruntime 1.16 (1beb43)
  • Support MatMulNBits for onnxruntime 1.17 (67a31b)

Bug Fixes

  • Update ITREX version in ONNXRT WOQ example and fix bugs in HF models (0ca51a)
  • Update ONNXRT WOQ example to llama-2-7b (7f2063)
  • Fix ONNXRT WOQ failed with None model_path (cbd0a4)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.13
  • ITEX 2.13
  • PyTorch/IPEX 2.0.1+cpu
  • ONNX Runtime 1.15.1
  • MXNet 1.9.1

Intel® Neural Compressor v2.3.1 Release

28 Sep 09:48
  • Bug Fixes
  • Productivity
  • Validated Configurations

Bug Fixes

  • Fix PyTorch SmoothQuant for auto alpha (e9c14a, 35def7)
  • Fix PyTorch SmoothQuant calibration memory overhead (49e950)
  • Fix PyTorch SmoothQuant issue in get/set_module (Issue #1265) (6de9ce)
  • Support Falcon Weight-Only Quantization (bf7b5c)
  • Remove Conv2d in Weight-Only Quantization adaptor white list (1a6526)
  • Fix TensorFlow ssd_resnet50_v1 Example for TF New API (c63fc5)

Productivity

  • Adapt Example for TensorFlow 2.14 AutoTrackable API Change (424cf3)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.10
  • TensorFlow 2.13, 2.14
  • ITEX 2.13
  • PyTorch/IPEX 2.0.1+cpu
  • ONNX Runtime 1.15.1
  • MXNet 1.9.1

Intel® Neural Compressor v2.3 Release

15 Sep 07:56
3e1b9d4
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • Validated Configurations

Highlights

  • Integrated Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469).
  • Supported low precision (INT4, NF4, FP4) and Weight-Only Quantization algorithms including RTN, AWQ, GPTQ and TEQ on ONNX Runtime and PyTorch for LLM optimization (see the sketch after this list).
  • Supported SparseGPT pruner (88adfc).
  • Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids).
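
A minimal sketch of the PyTorch Weight-Only Quantization path for these algorithms; the keys follow the 2.x WOQ configuration style, while model and dataloader are placeholders and only one of the supported algorithm/dtype combinations is shown:

    from neural_compressor import PostTrainingQuantConfig, quantization

    # 4-bit weight-only GPTQ; "algorithm" may also be "RTN", "AWQ" or "TEQ",
    # and "dtype": "nf4" / "fp4" selects the new low-precision data types.
    conf = PostTrainingQuantConfig(
        approach="weight_only",
        op_type_dict={
            ".*": {
                "weight": {
                    "bits": 4,
                    "group_size": 32,
                    "scheme": "sym",
                    "algorithm": "GPTQ",
                },
            },
        },
    )
    q_model = quantization.fit(model, conf, calib_dataloader=dataloader)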

Features

  • [Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
  • [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
  • [Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
  • [Quantization] Support NF4 and FP4 data type for PyTorch Weight-Only Quantization (3d11b5)
  • [Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
  • [Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
  • [Pruning] Add sparseGPT pruner and refactor pruning class (88adfc)
  • [Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
  • [Model Export] Support PT2ONNX dynamic quantization export (165532)

Improvement

  • [Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
  • [Common] Enhance ONNX Runtime backend check (4ce9de)
  • [Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
  • [Strategy] Enhance strategy exit policy (d19b42)
  • [Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
  • [Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
  • [Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
  • [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
  • [Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
  • [Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
  • [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
  • [Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
  • [Pruning] Refactor pruning (92d14d)
  • [Mixed Precision] Update the list of supported layers for Keras mixed precision (692c8b)
  • [Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)

Productivity

  • [Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples (#411, #412, #469)
  • [Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization (#16288)
  • [Neural Insights] Support PyTorch FX inspect tensor and integrate with Neural Insights (775def, 74a785)
  • [Neural Insights] Add step-by-step diagnosis cases (99c3b0)
  • [Neural Solution] Resource management and user-facing API enhancement (fbba10)
  • [Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)

Bug Fixes

  • Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
  • Fix PyTorch dataloader batch size issue (6a98d0)
  • Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
  • Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
  • Fix ONNX Runtime diagnosis bug (f10e26)
  • Update Neural Solution example and fix grpc port issue (528868)
  • Fix the objective initialization issue (9d7546)
  • Fix reshape issue for bayesian strategy (77cb83)
  • Fix CVEs (d86922, 2bbfcd, fc71fa)

Examples


Intel® Neural Compressor v2.2 Release

21 Jun 13:30
da99d28
  • Highlights
  • Features
  • Improvement
  • Productivity
  • Bug Fixes
  • Examples
  • External Contributions
  • Validated Configurations

Highlights

  • Expanded SmoothQuant support on mainstream frameworks including PyTorch/IPEX, TensorFlow/ITEX, and ONNX Runtime, and validated popular large language models (LLMs) such as GPT-J, LLaMA, OPT, BLOOM, Dolly, MPT, LaMini-LM and RedPajama-INCITE (see the sketch after this list).
  • Introduced two productivity components: Neural Solution for distributed quantization and Neural Insights for quantization accuracy debugging.
  • Successfully integrated Intel Neural Compressor into MSFT Olive (#157) and DeepSpeed (#3300).
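
A minimal sketch of enabling SmoothQuant through the 2.x recipes interface; alpha=0.5 is the commonly cited default, and model/dataloader are placeholders:

    from neural_compressor import PostTrainingQuantConfig, quantization

    # SmoothQuant migrates activation outliers into weights before INT8
    # quantization; alpha balances activation vs. weight quantization difficulty.
    conf = PostTrainingQuantConfig(
        recipes={
            "smooth_quant": True,
            "smooth_quant_args": {"alpha": 0.5},
        },
    )
    q_model = quantization.fit(model, conf, calib_dataloader=dataloader)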

Features

  • [Quantization] Support TensorFlow SmoothQuant (1f4127)
  • [Quantization] Support ITEX SmoothQuant (1f4127)
  • [Quantization] Support PyTorch FX SmoothQuant (6a39f6, 603811)
  • [Quantization] Support ONNX Runtime SmoothQuant (3df647, 1e1d70)
  • [Quantization] Support dictionary inputs for IPEX quantization (4ba233)
  • [Quantization] Enable calibration algorithm Entropy/KL & Percentile for ONNX Runtime (dae494)
  • [MixedPrecision] Support mixed precision op name/type dict option (a9c2cb)
  • [Strategy] Support block wise tuning (9c26ed)
  • [Strategy] Enable mse_v2 for ONNX Runtime (62122d)
  • [Pruning] Support retrain-free sparse pruning (d29aa0)
  • [Pruning] Support TensorFlow pruning with 2.x API (072c13)

Improvement

  • [Quantization] Enhance Keras functional model quantization with Keras model in, quantized Keras model out (699751)
  • [Quantization] Enhance MatMul and Gather quantization for ONNX Runtime (1f9c4f)
  • [Quantization] Add new recipe for ONNX Runtime NLP models (10d82c)
  • [MixedPrecision] Add more FP16 OPs support for ONNX Runtime (15d551)
  • [MixedPrecision] Add more BF16 OPs support for TensorFlow (369b9d)
  • [Pruning] Enhance multi-head attention slimming (f3de50)
  • [Pruning] Enable progressive pruning in N:M pattern (483e80)
  • [Model Export] Refine PT2ONNX export (877adb)
  • Remove redundant classes for quantization, benchmark and mixed precision (c51096)

Productivity

  • [Neural Solution] Support multi-node distributed tuning with model-level parallelism (ee049c)
  • [Neural Insights] Support quantization and benchmark diagnosis with GUI (5dc9ea, 3bde2e, 898344)
  • [Neural Coder] Migrate Neural Coder support into 2.x API (113ca1, e74a8a)
  • [Ecosystem] MSFT Olive integration (#157)
  • [Ecosystem] MSFT DeepSpeed integration (#3300)
  • Support ITEX 1.2 (5519e2)
  • Support Python 3.11 (6fa053)
  • Enhance documentation for mixed precision, diagnosis, dataloader, metric, etc.

Bug Fixes

Examples

External Contributions

  • Add a mathematical check for SmoothQuant transform (5c04ac)
  • Fix mismatch absorb layers due to tracing and named modules for SmoothQuant (bccc89)
  • Fix trace issue when input is dictionary for SmoothQuant (6a3c64)
  • Allow dictionary model inputs for ONNX export (17b642)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.7, 3.8, 3.9, 3.10, 3.11
  • TensorFlow 2.10.0, 2.11.0, 2.12.0
  • ITEX 1.1.0, 1.2.0
  • PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.1+cpu
  • ONNX Runtime 1.13.1, 1.14.1, 1.15.0
  • MXNet 1.9.1

Intel® Neural Compressor v2.1.1 Release

11 May 12:21
  • Bug Fixes
  • Examples
  • Validated Configurations

Bug Fixes

  • Fix calibration max value issue for SmoothQuant (commit b28bfd)
  • Fix exception for untraceable model during SmoothQuant (commit b28bfd)
  • Fix depthwise conv issue for SmoothQuant (commit 0e5942)
  • Fix Keras model mixed precision conversion issue (commit 997c57)

Examples

  • Add gpt-j alpha-tuning example (commit 3b7d28)
  • Migrate notebook example to INC 2.0 API (commit 54d2f5)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.8
  • TensorFlow 2.11.0
  • ITEX 1.1.0
  • PyTorch/IPEX 1.13.0+cpu
  • ONNX Runtime 1.13.1
  • MXNet 1.9.1

Intel® Neural Compressor v2.1 Release

29 Mar 10:17
  • Highlights
  • Features
  • Improvement
  • Bug Fixes
  • Examples
  • Documentation
  • Validated Configurations

Highlights

  • Support and enhance SmoothQuant on popular large language models (LLMs) such as BLOOM-176B, OPT-30B, and GPT-J-6B
  • Support native Keras model quantization (Keras model in, quantized Keras model out; see the sketch after this list)
  • Provide an auto-tuning strategy to improve quantization productivity
  • Support model conversion from TensorFlow INT8 to ONNX INT8
  • Polish documentation to make it easier for users to get started
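
A minimal sketch of the Keras-in/Keras-out flow under the 2.x API; keras_model and calib_dataloader are placeholders and the save path is illustrative:

    from neural_compressor import PostTrainingQuantConfig
    from neural_compressor.quantization import fit

    # Quantize a native Keras model; the result is still a Keras model.
    conf = PostTrainingQuantConfig()  # framework inferred from the model object
    q_model = fit(keras_model, conf, calib_dataloader=calib_dataloader)
    q_model.save("quantized_keras_model")  # illustrative output path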

Features

  • [Quantization] Support SmoothQuant and verify with LLMs (commit cbb5cf) (commit 08e255) (commit 12c101)
  • [Quantization] Support Keras functional model quantization with Keras model in, quantized Keras model out (commit efd737)
  • [Strategy] Add auto quantization level as the default tuning process (commit cdfb99); see the sketch after this list
  • [Strategy] Integrate quantization recipes into tuning strategy (commit 44d176)
  • [Strategy] Extend the strategy capability for adding the new data type (commit d0059c)
  • [Strategy] Enable multi-node distributed quantization at the tuning strategy level (commit e1fe50)
  • [AMP] Support ONNX Runtime with FP16 (commit 108c24)
  • [Productivity] Export TensorFlow models into ONNX QDQ mode at both FP32 and INT8 precision (commit 33a235)
  • [Productivity] Support PT/IPEX v2.0 (commit dbf138)
  • [Productivity] Support ONNX Runtime v1.14.1 (commit 146759)
  • [Productivity] GitHub IO docs support historical versions
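
A rough sketch of steering the default auto tuning process with tuning and accuracy criteria from the 2.x config API; the numeric values and the eval_func/dataloader names are illustrative assumptions:

    from neural_compressor import PostTrainingQuantConfig, quantization
    from neural_compressor.config import AccuracyCriterion, TuningCriterion

    conf = PostTrainingQuantConfig(
        quant_level="auto",  # assumed spelling of the default auto level
        tuning_criterion=TuningCriterion(timeout=0, max_trials=100),
        accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),  # 1% relative loss
    )
    q_model = quantization.fit(
        model, conf, calib_dataloader=dataloader, eval_func=eval_func
    )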

Improvement

  • Remove the dependency on experimental API (commit 6e10ef)
  • Enhance GUI diagnosis function for model graph and tensor histogram display (commit 9f0891)
  • Optimize memory usage for PyTorch adaptor (commit c295a7), ONNX adaptor (commit 8cbf2e), TensorFlow adaptor (commit ad0f1e), and tuning strategy (commit c49300) to support LLM
  • Refine ONNX Runtime QDQ quantization graph (commit c64a5b)
  • Enable ONNX model quantization with NVIDIA GPU TRT EP (commit ba42d0)
  • Improve code line coverage to 85%

Bug Fixes

  • Fix mixed precision config setting (commit 4b71a8)
  • Fix multi-instance benchmark on Windows (commit 1f89aa)
  • Fix domain detection for large ONNX model (commit 70a566)

Examples

  • Migrate examples to INC v2.0 API
  • Enable LLMs (e.g., GPT-NeoX, T5 Large, BLOOM-176B, OPT-30B, GPT-J-6B, etc.)
  • Enable examples for Keras in Keras out (commit efd737)
  • Enable multi-node training examples on CPU (e.g., RN50 distillation, QAT, pruning examples)
  • Add 15+ Hugging Face (HF) examples with ONNX Runtime backend and upload quantized models to HF (commit a4228d)
  • Add 2 examples for PT2ONNX model export (commit 26db4a)

Documentation

  • Polish documentation: simplified GitHub main page, easier-to-read IO Docs structure, hands-on API migration user guide, more detailed new API instructions, refreshed API docs template, etc.

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.7, 3.8, 3.9, 3.10
  • TensorFlow 2.10.1, 2.11.0, 2.12.0
  • ITEX 1.0.0, 1.1.0
  • PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.0+cpu
  • ONNX Runtime 1.12.1, 1.13.1, 1.14.1
  • MXNet 1.9.1