ONNX Runtime QNN Execution Provider v2.1.0

qti-mbadnara released this 20 Apr 18:32 (commit 44a228b)
228 commits to main since this release

ONNX Runtime Compatibility: >= 1.24.1 (compiled with v1.24.4)
QAIRT SDK Compatibility: 2.45.40

pip install onnxruntime
pip install onnxruntime-qnn==2.1.0

Highlights

QNN EP 2.1.0 adds Genie backend support for on-device LLM inference, 64-bit uDMA (user-DMA) for next-gen hardware, seven new operators, four graph-level fusions, and significant memory and startup performance improvements for large model deployments on ARM64 Windows.


New Features

Genie Backend Support

Added Genie backend integration for ONNX Runtime, enabling on-device LLM inference through the Qualcomm Genie runtime. Genie operates on pre-compiled context binaries and provides optimized execution for generative AI workloads.

64-bit Extended uDMA (user-DMA) Mode

Enabled 64-bit Extended uDMA (user-DMA) for v81+ hardware via the extended_udma provider option. Allows far-mapping of weights and spill/fill buffers, expanding addressable memory for large models on next-generation Qualcomm devices.
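
As a rough sketch of how the option might be supplied: ONNX Runtime provider options are string key/value pairs, so the helper below only builds such a dictionary. The helper name is hypothetical, and the value "1" for extended_udma is an assumption; these notes name the option but not its accepted values.

```python
def build_qnn_provider_options(extended_udma=False, backend_path="QnnHtp.dll"):
    """Build a provider-options dict for the QNN EP (hypothetical helper)."""
    # backend_path selects the QNN backend library (HTP here).
    options = {"backend_path": backend_path}
    if extended_udma:
        # Assumption: the option is switched on with the string "1".
        options["extended_udma"] = "1"
    return options


options = build_qnn_provider_options(extended_udma=True)
print(options)
```

The resulting dictionary would be passed as the QNN EP's provider options when the session is created. Per the note above, the option only applies to v81+ hardware.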

File-Mapped Weights on ARM64 Windows

Enabled Windows file mapping of weights and context binaries on ARM64. Multiple ORT sessions can share a single loaded context binary, eliminating per-session heap allocation. Significantly improves memory efficiency and initialization time for LLM-scale deployments. Automatically disabled when context_embed_mode=1.
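
The general idea behind file mapping can be illustrated with Python's standard mmap module. This is a conceptual sketch, not the EP's implementation: the file contents and the open_shared_view helper are made up for illustration. Two read-only mappings of the same file can be served from the same physical pages by the operating system, rather than each consumer copying the file into its own heap.

```python
import mmap
import os
import tempfile


def open_shared_view(path):
    """Map a file read-only; the OS can back all such mappings with the same pages."""
    f = open(path, "rb")
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, view


# Stand-in for a context binary (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"QNNCTX" + bytes(4096))
    path = tmp.name

# Two "sessions" map the same file instead of each reading it into memory.
file_a, session_a_view = open_shared_view(path)
file_b, session_b_view = open_shared_view(path)

header_a = session_a_view[:6]
header_b = session_b_view[:6]
print(header_a, header_b)

# Release the mappings and file handles before deleting the file.
for handle in (session_a_view, session_b_view, file_a, file_b):
    handle.close()
os.remove(path)
```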

HTP Weight Sharing on ARM64

Enabled HTP weight sharing on ARM64. A warning is logged on unsupported platforms.

Parallelized Graph Prepare

Introduced QnnJobThreadPool to parallelize graph finalization during ComposeGraph. Reduces ORT session creation time for models with many subgraphs.
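
The pattern can be sketched with a thread pool in Python. This is only an analogy for the approach described above: finalize_graph is a hypothetical stand-in for per-subgraph finalization, and nothing here reflects the actual QnnJobThreadPool interface.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def finalize_graph(name):
    """Hypothetical stand-in for finalizing one subgraph."""
    time.sleep(0.05)  # simulate blocking work
    return f"{name}:finalized"


def finalize_all(graph_names, max_workers=4):
    """Finalize independent subgraphs concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(finalize_graph, graph_names))


results = finalize_all([f"subgraph_{i}" for i in range(8)])
print(results)
```

With independent subgraphs, wall-clock time approaches the cost of the slowest batch rather than the sum of all finalizations, which is why the benefit grows with the number of subgraphs.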

Runtime Version Query

Added __version__ attribute to the QNN EP Python package:

import onnxruntime_qnn as qnn_ep
print(qnn_ep.__version__)
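
A version string like this can be checked against a minimum, for example the ONNX Runtime floor of 1.24.1 listed at the top of these notes. The sketch below uses two hypothetical helpers and assumes plain numeric versions such as "2.1.0". Comparing the raw strings would be wrong, since "1.9.0" sorts after "1.24.1" as text.

```python
def parse_version(text):
    """Turn '2.1.0' into (2, 1, 0); only handles plain numeric versions."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Compare versions component by component as integers."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.24.4", "1.24.1"))
print(meets_minimum("1.9.0", "1.24.1"))
```

For versions with pre-release or local suffixes, a dedicated parser such as packaging.version.Version is more robust than this sketch.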

GetHardwareDeviceIncompatibilityDetails API

Implemented the GetHardwareDeviceIncompatibilityDetails API for the QNN ABI EP, enabling callers to retrieve structured details about hardware incompatibilities at runtime.

Offline x64 Compilation with MEMHANDLE IO Type

Extended offline compilation support to x64 platforms for graphs using MEMHANDLE IO type (previously ARM64-only).


New Operator Support (7 new ops)

Operator                     | Backends | Notes
Softplus                     | CPU, HTP | U8/U16 quantized on HTP; ranks > 4 on CPU only
RotaryEmbedding              | HTP      | QNN_OP_ROPE via new RopeOpBuilder
Tan                          | CPU, HTP | Standard trigonometric op
IsNaN                        | CPU, HTP | Boolean output
GroupNormalization           | HTP      | Group normalization
SimplifiedLayerNormalization | HTP      | RMSNorm-style normalization
RoiAlign                     | HTP      | Region of interest alignment
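
For reference, two of the operators listed above have compact mathematical definitions: Softplus computes log(1 + exp(x)), and SimplifiedLayerNormalization scales its input by the reciprocal root mean square. The pure-Python sketch below shows those definitions only. It is not the QNN kernels, and the default epsilon is an assumption.

```python
import math


def softplus(x):
    """log(1 + exp(x)), arranged so that exp() never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simplified_layer_norm(values, scale, epsilon=1e-5):
    """RMSNorm-style: divide by the root mean square, then apply a per-element scale."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    return [v * inv_rms * s for v, s in zip(values, scale)]


print(softplus(0.0))
print(softplus(1000.0))
print(simplified_layer_norm([3.0, 4.0], [1.0, 1.0], epsilon=0.0))
```

Unlike full layer normalization, the simplified form does not subtract the mean and has no bias term.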

Improved Operators

  • SpaceToDepth — Added DCR (Depth-Column-Row) mode attribute.
  • Resize — Added cubic interpolation mode.

For a complete list of operators supported by QNN and ORT QNN EP, refer to the ORT QNN EP operator list. For detailed QNN operator definitions and backend-specific constraints, see the QNN Master Operator Definition.


Graph Optimizations & Fusions

Reshape-Transpose-Reshape → SpaceToDepth Fusion

Folds Reshape-Transpose-Reshape sequences (including surrounding pre/post transpose ops) into a single SpaceToDepth op. Resolves context binary failures caused by 6D transpose/reshape graphs and brings parity with QAIRT converter behavior.
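
To see why such a sequence can collapse into one op, the sketch below follows the ONNX reference definition of SpaceToDepth: reshape to 6D, transpose with permutation (0, 3, 5, 1, 2, 4), then reshape back to 4D. It checks this against a direct index formula on flat row-major lists. The exact patterns the EP matches are not described in these notes, so treat this as one illustrative instance.

```python
from itertools import product


def space_to_depth_direct(x, n, c, h, w, b):
    """x is a flat row-major NCHW list; output is flat (N, C*b*b, H/b, W/b)."""
    oh, ow = h // b, w // b
    out = [0] * len(x)
    for ni, ci, i, j, bh, bw in product(
        range(n), range(c), range(oh), range(ow), range(b), range(b)
    ):
        src = ((ni * c + ci) * h + (i * b + bh)) * w + (j * b + bw)
        oc = (bh * b + bw) * c + ci  # block offsets vary slowest, channel fastest
        dst = ((ni * (c * b * b) + oc) * oh + i) * ow + j
        out[dst] = x[src]
    return out


def reshape_transpose_reshape(x, n, c, h, w, b):
    """Same result via a 6D view, an axis permutation, and a final reshape."""
    src_dims = (n, c, h // b, b, w // b, b)  # Reshape to 6D
    perm = (0, 3, 5, 1, 2, 4)                # Transpose
    dst_dims = tuple(src_dims[p] for p in perm)
    out = [0] * len(x)
    for idx in product(*(range(d) for d in src_dims)):
        src = 0
        for k, d in zip(idx, src_dims):
            src = src * d + k
        dst = 0
        for p, d in zip(perm, dst_dims):
            dst = dst * d + idx[p]
        out[dst] = x[src]  # the final Reshape keeps the flat order
    return out


n, c, h, w, b = 1, 2, 4, 4, 2
x = list(range(n * c * h * w))
fused = space_to_depth_direct(x, n, c, h, w, b)
unfused = reshape_transpose_reshape(x, n, c, h, w, b)
print(fused == unfused)
```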

LayerNorm Decomposed Pattern Fusion

Detects decomposed LayerNorm subgraphs in ONNX models and fuses them into a single QNN_OP_LAYER_NORM operator.
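
A decomposed LayerNorm typically appears as a chain of elementwise and reduction ops. The sketch below labels each step with a plausible ONNX op name and checks the chain against a single fused computation. The specific op chain is an assumption, since the patterns the EP actually recognizes are not listed here.

```python
import math


def layer_norm_decomposed(x, gamma, beta, eps=1e-5):
    """LayerNorm written as the separate steps an exporter might emit."""
    mean = sum(x) / len(x)                     # ReduceMean
    centered = [v - mean for v in x]           # Sub
    squared = [cv ** 2 for cv in centered]     # Pow
    variance = sum(squared) / len(squared)     # ReduceMean
    std = math.sqrt(variance + eps)            # Add, Sqrt
    normalized = [cv / std for cv in centered] # Div
    return [nv * g + bv for nv, g, bv in zip(normalized, gamma, beta)]  # Mul, Add


def layer_norm_fused(x, gamma, beta, eps=1e-5):
    """The same math as a single operation."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [(v - mean) * inv_std * g + bv for v, g, bv in zip(x, gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0, 0.5, 2.0, 1.0]
beta = [0.0, 0.1, -0.1, 0.2]
decomposed = layer_norm_decomposed(x, gamma, beta)
fused = layer_norm_fused(x, gamma, beta)
print(decomposed)
print(fused)
```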

Gemm Decomposition for Dynamic Bias

Extended Gemm handling for cases where the bias tensor is an intermediate (NATIVE) tensor. Decomposes Gemm into FC + ElementWiseAdd, fixing compatibility with models where ORT's MatMulAddFusion produces Gemm ops with dynamic bias (e.g., CLIP-style text projection layers).
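
The decomposition relies on the identity Y = A·B + C = (A·B) + C. The sketch below checks this on small matrices, assuming alpha = beta = 1, no transposes, and a 1-D bias broadcast across rows. The function names are illustrative, not the EP's.

```python
def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def gemm(a, b, c):
    """Reference Gemm with alpha = beta = 1: Y = A @ B + C in one step."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) + c[j] for j in range(cols)]
        for row in a
    ]


def fc_then_add(a, b, c):
    """Decomposed form: a bias-free product followed by an elementwise add."""
    product = matmul(a, b)  # FC without bias
    return [[v + bias for v, bias in zip(row, c)] for row in product]  # ElementWiseAdd


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [10, 20]  # stands in for a bias produced at runtime by an earlier node
print(gemm(a, b, c))
print(fc_then_add(a, b, c))
```

Splitting out the add means the bias no longer has to be a constant known at compile time, which is the situation described above.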

6D Reshape-Einsum-Reshape Pattern

Added handling for 6D Reshape-Einsum-Reshape patterns that previously caused compilation failures.


Bug Fixes

Various correctness and stability fixes across the EP.


Infrastructure & Packaging

  • Unified x64 + ARM64EC wheel — Single Python wheel supports both Windows x64 and Windows ARM64 targets

Platform Support

Package Windows ARM64 Windows x64
Python Wheel Inference AOT compilation + Inference
NuGet Inference
ZIP Inference

Known Issues

  • SpaceToDepth FP32 accuracy — FP32 validation tests for the SpaceToDepth operator have been disabled due to known accuracy issues. A fix is targeted for the next release.

Contributors

This release includes contributions from:

Arnav Deshpande, Ashwath Shankarnarayan, Badri Narayanan, Calvin Nguyen, Cheng-Hsin Weng, Chun-Chih Teng, Hua-Yu Chou, Hung-Jui Wang, Jeff Kilpatrick, Kuan-Yu Lin, Kyle Romero, Matthew Sinclair, Mike Hsu, Min Fong Hong, Samrat Dutta, Shubham Patel, Tirupathi Reddy T, Yathindra Kota, Yuduo Wu, Yu-Hung Chuang