ONNX Runtime v1.16.0

@er3x3 released this 20 Sep 21:27 (commit e7a0495)

General

  • Support for serialization of models >=2GB

APIs

  • New session option to disable the default CPU EP fallback: session.disable_cpu_ep_fallback (see the Python sketch after this list)
  • Java
    • Support for fp16 and bf16 tensors as inputs and outputs, along with utilities to convert between these and fp32 data. On JDK 20 and newer the fp16 conversion methods use the JDK's Float.float16ToFloat and Float.floatToFloat16 methods which can be hardware accelerated and vectorized on some platforms.
    • Support for external initializers so that large models can be instantiated without filesystem access
  • C#
    • Expose the OrtValue API as the new preferred way to run inference in C#. This reduces garbage and exposes direct native memory access via Slice-like interfaces.
    • Make Float16 and BFloat16 full-featured fp16 types that support conversion and expose floating-point properties (e.g., IsNaN, IsInfinity)
  • C++
    • Make Float16_t and BFloat16_t full-featured fp16 types that support conversion and expose floating-point properties (e.g., IsNaN, IsInfinity)
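
A minimal Python sketch of the disable_cpu_ep_fallback option above. The model path and provider choice are placeholders, and the value "1" is assumed to be what disables the fallback.

    import onnxruntime as ort

    so = ort.SessionOptions()
    # With the fallback disabled, session creation fails if the model cannot
    # be placed on the requested EPs instead of silently running on CPU.
    # (Key from the release notes; the value "1" is an assumption.)
    so.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

    # "model.onnx" and the provider list are placeholders.
    sess = ort.InferenceSession(
        "model.onnx", sess_options=so, providers=["CUDAExecutionProvider"]
    )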

Performance

  • Improve LLM quantization accuracy with SmoothQuant
  • Support 4-bit quantization on CPU (see the hedged sketch after this list)
  • Optimize BeamScore to improve BeamSearch performance
  • Add FlashAttention v2 support for Attention, MultiHeadAttention and PackedMultiHeadAttention ops
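
A hedged sketch of the 4-bit weight quantization path. The MatMul4BitsQuantizer class and module path below are assumptions and may differ in this release; verify against the quantization tools shipped with the package.

    import onnx
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

    model = onnx.load("model.onnx")  # placeholder path
    # Quantizes eligible MatMul weights to blockwise 4-bit; block_size and
    # symmetry here are illustrative choices, not prescriptions.
    quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
    quant.process()
    onnx.save(quant.model.model, "model_int4.onnx")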

Execution Providers

  • CUDA EP
    • Initial fp8 support (QDQ, Cast, MatMul)
    • Relax CUDA Graph constraints so that more models can use it (see the sketch after the Execution Providers list)
    • Allow CUDA allocator to be registered with ONNX Runtime externally
    • Fixed a build issue with CUDA 12.2 (#16713)
  • TensorRT EP
    • CUDA Graph support
    • Support for a user-provided CUDA compute stream
    • Misc bug fixes and improvements
  • OpenVINO EP
    • Support OpenVINO 2023.1
  • QNN EP
    • Enable context binary cache to reduce initialization time
    • Support QNN 2.12
    • Support for Resize with the asymmetric coordinate transformation mode on the HTP backend
    • Ops support: Equal, Less, LessOrEqual, Greater, GreaterOrEqual, LayerNorm, Asin, Sign, DepthToSpace, SpaceToDepth
    • Support 1D Conv/ConvTranspose
    • Misc bug fixes and improvements
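
A sketch of running with CUDA Graphs enabled, assuming a static-shape model whose I/O names are "input" and "output" (hypothetical). CUDA Graph replay requires tensors at fixed device addresses, hence the IOBinding.

    import numpy as np
    import onnxruntime as ort

    providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
    sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder

    # Pre-allocate device buffers once and bind them, so every run reuses
    # the same addresses that the captured graph expects.
    x = ort.OrtValue.ortvalue_from_numpy(
        np.zeros((1, 3, 224, 224), np.float32), "cuda", 0)
    y = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)
    io = sess.io_binding()
    io.bind_ortvalue_input("input", x)     # hypothetical input name
    io.bind_ortvalue_output("output", y)   # hypothetical output name

    sess.run_with_iobinding(io)  # first run captures the CUDA graph
    sess.run_with_iobinding(io)  # subsequent runs replay it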

Mobile

  • Initial support for Azure EP
  • Dynamic shape support for CoreML
  • Improve React Native performance with JSI
  • Mobile support for CLIPImageProcessor pre-processing and CLIP scenario
  • Swift Package Manager support for ONNX Runtime inference and ONNX Runtime extensions via onnxruntime-swift-package-manager

Web

  • WebGPU op coverage improvements (SAM, T5, Whisper)
  • WebNN op coverage improvements (SAM, Stable Diffusion)
  • Stability/usability improvements for WebGPU

Large model training

  • ORTModule + OpenAI Triton integration is now available (see the sketch after this list)
  • Label sparsity compute optimization support is complete and enabled by default starting with release 1.16
  • New experimental embedding sparsity related optimizations available (disabled by default).
    • Improves training performance of RoBERTa in Hugging Face Transformers by 20-30%
  • Other compute optimizations enabled, such as upstream support for Gather/Slice/Reshape
  • Optimizations for LLaMAv2 (~10% acceleration) and OpenAI Whisper
  • Improvements to the logging and metrics system (initialization overhead, memory usage, statistics convergence tool, etc.)
  • PythonOp enhancements: bool and tuple[bool] constants, materialize grads, empty inputs, save in context, customized shape inference, and use of fully qualified names for export
  • SCELossInternal/SCELossGradInternal CUDA kernels can handle element counts greater than std::numeric_limits<int32_t>::max()
  • Improvements to LayerNorm fusion
  • A model cache for the exported ONNX model is introduced to avoid repeatedly exporting a model that has not changed across runs
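
A minimal ORTModule sketch (requires the onnxruntime-training package; the wrapped model and training step are placeholders, and the Triton integration mentioned above is configured separately):

    import torch
    from onnxruntime.training.ortmodule import ORTModule

    # ORTModule wraps a torch.nn.Module and accelerates forward/backward by
    # exporting it to ONNX and executing it with ONNX Runtime.
    model = ORTModule(torch.nn.Sequential(
        torch.nn.Linear(784, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
    ))

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x, target = torch.randn(32, 784), torch.randint(0, 10, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), target)
    loss.backward()
    optimizer.step()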

On-Device Training

  • iOS support is available starting with this release
  • A minimal build is now available for On-Device Training, with a basic binary size of ~1.5 MB
  • ORT Extensions custom op support enabled through onnxblock for on-device training scenarios (see the hedged sketch below)
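
A hedged sketch of wiring ORT Extensions custom ops into training artifact generation. The generate_artifacts helper and its custom_op_library argument are assumptions about the training API of this era, and the trainable parameter names are hypothetical.

    import onnx
    from onnxruntime.training import artifacts
    from onnxruntime_extensions import get_library_path

    model = onnx.load("model.onnx")  # placeholder inference model
    artifacts.generate_artifacts(
        model,
        requires_grad=["fc.weight", "fc.bias"],   # hypothetical parameter names
        loss=artifacts.LossType.CrossEntropyLoss,
        optimizer=artifacts.OptimType.AdamW,
        artifact_directory="training_artifacts",
        custom_op_library=get_library_path(),     # ORT Extensions custom ops
    )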

ORT Extensions

This ORT release is accompanied by updates to onnxruntime-extensions. Features include:

  • New Python API gen_processing_models to export ONNX data processing models from Hugging Face tokenizers such as LLaMA, CLIP, XLM-Roberta, Falcon, BERT, etc. (see the sketch after this list)
  • New TrieTokenizer operator for RWKV-like LLM models, and other tokenizer operator enhancements.
  • New operators for Azure EP compatibility: AzureAudioToText, AzureTextToText, AzureTritonInvoker for Python and NuGet packages.
  • Processing operators have been migrated to the new Lite Custom Op API
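
A brief sketch of gen_processing_models with a Hugging Face tokenizer; the pre_kwargs keyword and the returned (pre, post) pair are assumptions about the exact signature.

    import onnx
    from transformers import AutoTokenizer
    from onnxruntime_extensions import gen_processing_models

    tok = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch16")
    # Exports the tokenizer's pre-processing (and optionally post-processing)
    # as ONNX models runnable with the extensions custom ops.
    pre_model, post_model = gen_processing_models(tok, pre_kwargs={})
    onnx.save(pre_model, "clip_tokenizer.onnx")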

Known Issues

  • The ORT CPU Python package requires the execution provider to be explicitly provided when creating an inference session. See #17631. A fix is in progress and will be released as a patch.
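
Until the patch lands, the workaround is to pass the provider list explicitly when creating a session:

    import onnxruntime as ort

    # Explicitly naming the provider avoids the issue tracked in #17631.
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])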

Contributions

Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
fs-eire, edgchen1, snnn, pengwa, mszhanyi, PeixuanZuo, tianleiwu, adrianlizarraga, baijumeswani, cloudhan, satyajandhyala, yuslepukhin, RandyShuai, RandySheriffH, skottmckay, Honry, dependabot[bot], HectorSVC, jchen351, chilo-ms, YUNQIUGUO, justinchuby, PatriceVignola, guschmue, yf711, Craigacp, smk2007, RyanUnderhill, jslhcl, wschin, kunal-vaishnavi, mindest, xadupre, fdwr, hariharans29, AdamLouly, wejoncy, chenfucn, pranavsharma, yufenglee, zhijxu-MS, jeffdaily, natke, jeffbloo, liqunfu, wangyems, er3x3, nums11, yihonglyu, sumitsays, zhanghuanrong, askhade, wenbingl, jingyanwangms, ashari4, gramalingam, georgen117, sfatimar, BowenBao, hanbitmyths, stevenlix, jywu-msft