Release ONNX Runtime QNN Execution Provider v2.1.0 · onnxruntime/onnxruntime-qnn

ONNX Runtime Compatibility: >= 1.24.1 (compiled with v1.24.4)
QAIRT SDK Compatibility: 2.45.40

pip install onnxruntime
pip install onnxruntime-qnn==2.1.0
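
After installing, it can be useful to confirm that the local ONNX Runtime meets the minimum stated above. The following is a minimal sketch, not part of the package: the helper names (parse_version, ort_is_compatible, installed_ort_version) are hypothetical, and the check only compares the leading numeric release components.

```python
from importlib import metadata

MIN_ORT = (1, 24, 1)  # minimum ONNX Runtime version stated in these notes


def parse_version(text):
    """Return the leading numeric release components, e.g. '1.24.4' -> (1, 24, 4)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def ort_is_compatible(installed, minimum=MIN_ORT):
    """True when the installed version is at least the stated minimum."""
    return parse_version(installed) >= minimum


def installed_ort_version():
    """Look up the installed onnxruntime distribution, or None if absent."""
    try:
        return metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_ort_version()
    if found is None:
        print("onnxruntime is not installed")
    else:
        print(found, "compatible" if ort_is_compatible(found) else "too old")
```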

Highlights

QNN EP 2.1.0 adds Genie backend support for on-device LLM inference, 64-bit uDMA (user-DMA) for next-gen hardware, seven new operators, four graph-level fusions, and significant memory and startup performance improvements for large model deployments on ARM64 Windows.

New Features

Genie Backend Support

Added Genie backend integration for ONNX Runtime, enabling on-device LLM inference through the Qualcomm Genie runtime. Genie operates on pre-compiled context binaries and provides optimized execution for generative AI workloads.

64-bit Extended uDMA (user-DMA) Mode

Enabled 64-bit Extended uDMA (user-DMA) for v81+ hardware via the extended_udma provider option. Allows far-mapping of weights and spill/fill buffers, expanding addressable memory for large models on next-generation Qualcomm devices.
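
As a rough illustration of how the option might be supplied, the sketch below builds a provider-options mapping. Only the key name extended_udma comes from these notes. The "1"/"0" string encoding of the value and the helper name qnn_provider_options are assumptions made for illustration.

```python
def qnn_provider_options(enable_extended_udma):
    """Build a provider-options mapping for the QNN EP.

    Only the key name "extended_udma" comes from the release notes; the
    "1"/"0" value encoding is an assumption, not a documented contract.
    """
    return {"extended_udma": "1" if enable_extended_udma else "0"}


options = qnn_provider_options(True)
print(options)
```

A mapping like this would be passed as the QNN EP's provider options when the session is created. The registration call itself is not shown because it is not described in these notes.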

File-Mapped Weights on ARM64 Windows

Enabled Windows file mapping of weights and context binaries on ARM64. Multiple ORT sessions can share a single loaded context binary, which eliminates per-session heap allocation and significantly improves memory efficiency and initialization time for LLM-scale deployments. File mapping is automatically disabled when context_embed_mode=1.
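
The interaction with the embed mode can be sketched as follows. This assumes that context_embed_mode here corresponds to ONNX Runtime's ep.context_embed_mode session config key, alongside ep.context_enable and ep.context_file_path. The helper names are hypothetical, and file_mapping_eligible only restates the rule given above rather than querying the EP.

```python
def ep_context_config(context_path, embed_mode=0):
    """Session config entries for generating an EP context model.

    embed_mode=0 keeps the context binary in a separate file;
    embed_mode=1 embeds it inside the generated ONNX model.
    """
    if embed_mode not in (0, 1):
        raise ValueError("embed_mode must be 0 or 1")
    return {
        "ep.context_enable": "1",
        "ep.context_file_path": str(context_path),
        "ep.context_embed_mode": str(embed_mode),
    }


def file_mapping_eligible(embed_mode):
    """Per the notes, file mapping is disabled when the embed mode is 1."""
    return embed_mode == 0


entries = ep_context_config("model_ctx.onnx", embed_mode=0)
for key, value in entries.items():
    print(key, "=", value)
```

Each entry would typically be applied with SessionOptions.add_session_config_entry(key, value) before the session is created.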

HTP Weight Sharing on ARM64

Enabled HTP weight sharing on ARM64. A warning is logged on unsupported platforms.

Parallelized Graph Prepare

Introduced QnnJobThreadPool to parallelize graph finalization during ComposeGraph. Reduces ORT session creation time for models with many subgraphs.
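
The general idea of finalizing independent subgraphs on a pool of worker threads can be sketched with Python's standard library. This is a conceptual illustration only: it does not reflect the internals of QnnJobThreadPool, and finalize_graph is a hypothetical stand-in for the real per-graph work.

```python
from concurrent.futures import ThreadPoolExecutor


def finalize_graph(name):
    """Hypothetical stand-in for finalizing one subgraph."""
    return f"{name}:finalized"


def finalize_all(graph_names, max_workers=4):
    """Finalize every subgraph concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though the calls run concurrently.
        return list(pool.map(finalize_graph, graph_names))


print(finalize_all(["subgraph_0", "subgraph_1", "subgraph_2"]))
```

The benefit grows with the number of subgraphs, since each one can be finalized without waiting for the others.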

Runtime Version Query

Added __version__ attribute to the QNN EP Python package:

import onnxruntime_qnn as qnn_ep
print(qnn_ep.__version__)

GetHardwareDeviceIncompatibilityDetails API

Implemented the GetHardwareDeviceIncompatibilityDetails API for the QNN ABI EP, enabling callers to retrieve structured details about hardware incompatibilities at runtime.

Offline x64 Compilation with MEMHANDLE IO Type

Extended offline compilation support to x64 platforms for graphs using MEMHANDLE IO type (previously ARM64-only).

New Operator Support (7 new ops)

Operator	Backends	Notes
Softplus	CPU, HTP	U8/U16 quantized on HTP; ranks > 4 on CPU only
RotaryEmbedding	HTP	QNN_OP_ROPE via new RopeOpBuilder
Tan	CPU, HTP	Standard trigonometric op
IsNaN	CPU, HTP	Boolean output
GroupNormalization	HTP	Group normalization
SimplifiedLayerNormalization	HTP	RMSNorm-style normalization
RoiAlign	HTP	Region of interest alignment
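
For readers unfamiliar with the RMSNorm-style normalization behind SimplifiedLayerNormalization, here is a minimal pure-Python reference over a single vector. It is a sketch of the math only (normalize by the root mean square, then scale), assuming normalization over the last axis. It says nothing about how the HTP backend implements the op.

```python
import math


def simplified_layer_norm(x, scale, epsilon=1e-5):
    """RMSNorm-style normalization: y = x / sqrt(mean(x^2) + epsilon) * scale."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    return [v * inv_rms * s for v, s in zip(x, scale)]


print(simplified_layer_norm([3.0, 4.0], [1.0, 1.0]))
```

Unlike full layer normalization, no mean is subtracted and no bias is added.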

Improved Operators

SpaceToDepth — Added DCR (Depth-Column-Row) mode attribute.
Resize — Added cubic interpolation mode.

For a complete list of operators supported by QNN and ORT QNN EP, refer to the ORT QNN EP operator list. For detailed QNN operator definitions and backend-specific constraints, see the QNN Master Operator Definition.

Graph Optimizations & Fusions

Reshape-Transpose-Reshape → SpaceToDepth Fusion

Folds Reshape-Transpose-Reshape sequences (including surrounding pre/post transpose ops) into a single SpaceToDepth op. Resolves context binary failures caused by 6D transpose/reshape graphs and brings parity with QAIRT converter behavior.
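
To make the target of this fusion concrete, the sketch below computes SpaceToDepth directly by indexing, for a single C x H x W input stored as nested lists. The channel ordering follows ONNX's SpaceToDepth definition (block row, then block column, then input channel), which is assumed here to be what DCR ordering refers to. The function name is hypothetical.

```python
def space_to_depth_dcr(x, block):
    """Rearrange a C x H x W nested list into (C*block*block) x (H/block) x (W/block).

    Output channel index = (by * block + bx) * C + c, where (by, bx) is the
    position inside each block x block spatial tile.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % block or width % block:
        raise ValueError("H and W must be divisible by block")
    out_h, out_w = height // block, width // block
    out = []
    for by in range(block):
        for bx in range(block):
            for c in range(channels):
                out.append([
                    [x[c][h * block + by][w * block + bx] for w in range(out_w)]
                    for h in range(out_h)
                ])
    return out


print(space_to_depth_dcr([[[1, 2], [3, 4]]], 2))
```

In ONNX terms the same result is obtained by reshaping to six dimensions, transposing, and reshaping back, which is why such sequences can be replaced by one op.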

LayerNorm Decomposed Pattern Fusion

Detects decomposed LayerNorm subgraphs in ONNX models and fuses them into a single QNN_OP_LAYER_NORM operator.
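
A decomposed LayerNorm typically appears as a chain of elementwise and reduction ops. The sketch below spells out one common form of that chain for a single vector, with comments naming the ONNX op each step usually maps to. The exact patterns the EP matches are not listed in these notes, so treat the op sequence as an assumption.

```python
import math


def layer_norm_decomposed(x, scale, bias, epsilon=1e-5):
    """Layer normalization written as the step-by-step chain a fusion would collapse."""
    n = len(x)
    mean = sum(x) / n                                   # ReduceMean
    centered = [v - mean for v in x]                    # Sub
    variance = sum(d * d for d in centered) / n         # Pow + ReduceMean
    std = math.sqrt(variance + epsilon)                 # Add + Sqrt
    normalized = [d / std for d in centered]            # Div
    return [v * s + b for v, s, b in zip(normalized, scale, bias)]  # Mul + Add


print(layer_norm_decomposed([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

Fusing the chain replaces these intermediate tensors with a single operator that produces the same output.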

Gemm Decomposition for Dynamic Bias

Extended Gemm handling for cases where the bias tensor is an intermediate (NATIVE) tensor. Decomposes Gemm into FC + ElementWiseAdd, fixing compatibility with models where ORT's MatMulAddFusion produces Gemm ops with dynamic bias (e.g., CLIP-style text projection layers).
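
The decomposition relies on Gemm being equivalent to a matrix product followed by an elementwise add. The sketch below checks that equivalence on small matrices, assuming alpha = beta = 1, no transposes, and a bias with the same shape as the output. The function names are hypothetical, and matmul merely stands in for the FC step.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gemm(a, b, c):
    """Single-op reference: Y = A @ B + C, computed in one pass."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) + c[i][j]
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gemm_decomposed(a, b, c):
    """Two-step form: a matrix product (FC) followed by an elementwise add."""
    product = matmul(a, b)
    return [[p + q for p, q in zip(p_row, c_row)] for p_row, c_row in zip(product, c)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(gemm(a, b, c) == gemm_decomposed(a, b, c))
```

Because the add is a separate op in the decomposed form, the bias no longer needs to be a constant known at graph build time.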

6D Reshape-Einsum-Reshape Pattern

Added handling for 6D Reshape-Einsum-Reshape patterns that previously caused compilation failures.

Bug Fixes

Various correctness and stability fixes across the EP.

Infrastructure & Packaging

Unified x64 + ARM64EC wheel — Single Python wheel supports both Windows x64 and Windows ARM64 targets

Platform Support

Package	Windows ARM64	Windows x64
Python Wheel	Inference	AOT compilation + Inference
NuGet	Inference	—
ZIP	Inference	—

Known Issues

SpaceToDepth FP32 accuracy — FP32 validation tests for the SpaceToDepth operator have been disabled due to known accuracy issues. A fix is targeted for the next release.

Contributors

This release includes contributions from:

Arnav Deshpande, Ashwath Shankarnarayan, Badri Narayanan, Calvin Nguyen, Cheng-Hsin Weng, Chun-Chih Teng, Hua-Yu Chou, Hung-Jui Wang, Jeff Kilpatrick, Kuan-Yu Lin, Kyle Romero, Matthew Sinclair, Mike Hsu, Min Fong Hong, Samrat Dutta, Shubham Patel, Tirupathi Reddy T, Yathindra Kota, Yuduo Wu, Yu-Hung Chuang
