
ONNX Runtime v1.13.1

Released by @jchen351 on 24 Oct 2022 (commit b353e0b)

Announcements

  • Security issues addressed by this release
    1. A protobuf security issue, CVE-2022-1941, that impacts users who load ONNX models from untrusted sources, for example a deep learning inference service that lets users upload models and then runs inference in a shared environment.
    2. An ONNX security vulnerability that allows reading of tensor_data outside the model directory, enabling attackers to read or write arbitrary files on an affected system that loads ONNX models from untrusted sources. (#12915)
  • Deprecations
    • CUDA 10.x support at source code level
    • Windows 8.x support in NuGet/C API prebuilt binaries. Support for Windows 7+ desktop versions (including Windows Server) can be retained by building ONNX Runtime from source.
    • NUPHAR EP code is removed
  • Dependency versioning updates
    • A C++17 compiler is now required to build ORT from source. On Linux, GCC >= 7.0 is required.
    • Minimum numpy version bumped to 1.21.6 (from 1.21.0) for ONNX Runtime Python packages
    • Official ONNX Runtime GPU packages now require CUDA version >=11.6 instead of 11.4.

General

  • Expose all arena configs in the Python API in an extensible way (see the sketch after this list)
  • Fix ARM64 NuGet packaging
  • Fix EP allocator setup issue affecting TVM EP
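
A minimal sketch of using the extensible arena configuration from Python, assuming the dict-based `OrtArenaCfg` constructor and the shared-allocator registration API; the config values and model path are illustrative only:

```python
import onnxruntime as ort

# Arena knobs are passed by key, so new configs can be added without
# changing the constructor signature. Values shown are illustrative.
arena_cfg = ort.OrtArenaCfg({
    "max_mem": 0,                         # 0 = no limit
    "arena_extend_strategy": 0,           # 0 = kNextPowerOfTwo
    "initial_chunk_size_bytes": -1,       # -1 = use default
    "max_dead_bytes_per_chunk": -1,
    "initial_growth_chunk_size_bytes": -1,
})

# Register a shared CPU arena allocator built from this config.
mem_info = ort.OrtMemoryInfo(
    "Cpu", ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR, 0, ort.OrtMemType.DEFAULT
)
ort.create_and_register_allocator(mem_info, arena_cfg)

# Sessions opt in to the shared (environment-level) allocator.
so = ort.SessionOptions()
so.add_session_config_entry("session.use_env_allocators", "1")
sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])
```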

Performance

  • Transformers CUDA improvements
    • Quantization on GPU for BERT: notebook, documentation on QAT, transformer optimization toolchain, and quantized kernels (see the optimizer sketch after this list).
    • Add fused attention CUDA kernels for BERT.
    • Fuse Add (bias) and Transpose of Q/K/V into one kernel for Attention and LongformerAttention.
    • Reduce GEMM computation in LongformerAttention with a new weight format.
  • General quantization (tool and kernel)
    • Quantization debugging tool to identify the sensitive nodes/layers responsible for accuracy drops
    • New quantize API based on QuantConfig (see the quantization sketch after this list)
    • New quantized operators: Softmax, Split, Where
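
As one example of the transformer optimization toolchain referenced above, the `onnxruntime.transformers` optimizer can fuse attention and related subgraphs in an exported BERT model; a sketch assuming a BERT-base export, whose head count and hidden size must match the actual model:

```python
from onnxruntime.transformers import optimizer

# Fuse attention/LayerNorm/GELU subgraphs in an exported BERT model so
# that the fused CUDA kernels above can be used. num_heads/hidden_size
# must match the exported model (BERT-base values shown).
opt_model = optimizer.optimize_model(
    "bert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt_model.save_model_to_file("bert_opt.onnx")
```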
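A minimal sketch of the new config-based quantize API, assuming the import paths in `onnxruntime.quantization.quantize`; dynamic quantization is shown because it needs no calibration data, while static quantization takes a `StaticQuantConfig` with a `CalibrationDataReader`:

```python
from onnxruntime.quantization import QuantType
from onnxruntime.quantization.quantize import DynamicQuantConfig, quantize

# The new API bundles all quantization choices into a QuantConfig object
# and passes it to a single quantize() entry point.
config = DynamicQuantConfig(weight_type=QuantType.QInt8)
quantize("model.onnx", "model_int8.onnx", config)
```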

Execution Providers

  • CUDA EP
    • Official ONNX Runtime GPU packages are now built with CUDA version 11.6 instead of 11.4, but should still be backwards compatible with 11.4
  • TensorRT EP
    • Build option to link against the pre-built onnx-tensorrt parser; this enables potential "no-code" TensorRT minor-version upgrades and can be used to build against TensorRT 8.5 EA
    • Improved nested control flow support
    • Improved HashId generation used to uniquely identify TRT engines; addresses issues such as unnecessary TRT engine cache regeneration (see the engine-cache sketch after this list)
    • TensorRT uint8 support
  • OpenVINO EP
    • OpenVINO version upgraded to 2022.2.0
    • Support for INT8 QDQ models from NNCF
    • Support for Intel 13th Gen Core processors (Raptor Lake)
    • Preview support for Intel discrete graphics cards: Intel Data Center GPU Flex Series and Intel Arc GPU
    • Increased test coverage for GPU Plugin
  • SNPE EP
  • DirectML EP
  • [new] CANN EP - Initial integration of CANN EP contributed by Huawei to support Ascend 310 (#11477)
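
A sketch of enabling the TensorRT EP engine cache, which is where the improved HashId matters (a stable hash keeps cached engines valid across runs instead of forcing regeneration); the cache path is an arbitrary example:

```python
import onnxruntime as ort

# Enable the TRT engine cache; the improved HashId determines whether a
# cached engine can be reused or must be rebuilt.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)
```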

Mobile

  • EP infrastructure
    • Implemented support for additional EPs that use static kernels
      • Required for EPs like XNNPACK to be supported in minimal build
      • Removes need for kernel hashes to reduce maintenance overhead for developers
      • NOTE: ORT format models will need to be regenerated, as the format change is NOT backwards compatible (see the conversion sketch after this list). We're replacing hashes for the CPU EP kernels with operator constraint information for the operators used by the model, so that any static kernels available at runtime can be matched.
  • XNNPACK
    • Added more kernels, including QDQ format model support
      • AveragePool, Softmax
      • QLinearConv, QLinearAveragePool, QLinearSoftmax
    • Added support for XNNPACK using threadpool
      • See the documentation for recommendations on how to configure the XNNPACK threadpool (a sketch also follows this list)
  • Reduced ORT format model peak memory usage
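
Since the ORT format change is not backwards compatible, existing ORT format models need to be regenerated from their source ONNX models with the converter that ships in the Python package (the model path is an example):

```
python -m onnxruntime.tools.convert_onnx_models_to_ort model.onnx
```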
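A sketch of configuring the XNNPACK threadpool via provider options, assuming the `intra_op_num_threads` provider option; ORT's own intra-op pool is left at 1 so the two pools do not compete (see the documentation for the actual recommendations):

```python
import onnxruntime as ort

# Keep ORT's intra-op threadpool at 1 and give XNNPACK its own threads,
# so the two pools do not contend for cores (illustrative values).
so = ort.SessionOptions()
so.intra_op_num_threads = 1

sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[
        ("XnnpackExecutionProvider", {"intra_op_num_threads": 4}),
        "CPUExecutionProvider",
    ],
)
```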

Web

  • Support for 4 GB memory in WebAssembly
  • Upgraded Emscripten to 3.1.19
  • Build-from-source support for onnxruntime-extensions and sentencepiece
  • Initial XNNPACK support for Wasm optimizations

Training

  • Training packages updated to CUDA version 11.6; CUDA 10.2 and 11.3 packages removed
  • Performance improvements via op fusions such as BiasSoftmax fusion, Dropout fusion, and Gather-to-Split fusion, targeting SOTA models (a brief ORTModule sketch follows this list)
  • Added ATen support for GroupNorm, InstanceNormalization, and Upsample (nearest)
  • Bug fixes for SimplifiedLayerNorm and a segfault in alltoall
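
The fusion-based training speedups above are picked up automatically when a model is trained through ORTModule; a minimal sketch, assuming the onnxruntime-training package is installed and using a placeholder model:

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

# Wrapping a torch.nn.Module in ORTModule routes forward/backward through
# ONNX Runtime, where the op fusions above are applied automatically.
model = ORTModule(torch.nn.Linear(784, 10))

x = torch.randn(32, 784)
loss = model(x).sum()
loss.backward()
```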

Contributions

Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
snnn, baijumeswani, edgchen1, iK1D, skottmckay, cloudhan, tianleiwu, fs-eire, mszhanyi, WilBrady, hariharans29, chenfucn, fdwr, yuslepukhin, wejoncy, PeixuanZuo, pengwa, yufenglee, jchen351, justinchuby, dependabot[bot], RandySheriffH, sumitsays, wschin, wangyems, YUNQIUGUO, ytaous, pranavsharma, vvchernov, natke, Craigacp, RandyShuai, smk2007, zhangyaobit, jcwchen, yihonglyu, georgen117, chilo-ms, ashbhandare, faxu, jstoecker, gramalingam, garymm, jeffbloo, xadupre, jywu-msft, askhade, RyanUnderhill, thiagocrepaldi, mindest, jingyanwangms, wenbingl, ashari4, sfatimar, MaajidKhan, souptc, HectorSVC, weixingzhang, zhanghuanrong