forked from microsoft/onnxruntime
Sync with Microsoft ONNX Runtime - 17/09/2025 #814
Merged
Conversation
This PR adds a block-wise quantization kernel for QMoE on CPU.
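As background, here is a minimal numpy sketch of symmetric block-wise quantization; the int8 target, block size of 32, and function names are illustrative assumptions rather than the actual layout of the QMoE kernel.

```python
import numpy as np

def blockwise_quantize(w: np.ndarray, block_size: int = 32):
    """Symmetric int8 quantization with one scale per block of `block_size`
    consecutive values along the last axis (assumes divisibility)."""
    rows = w.reshape(-1, block_size)
    scales = np.abs(rows).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)   # guard all-zero blocks
    q = np.clip(np.round(rows / scales), -127, 127).astype(np.int8)
    return q.reshape(w.shape), scales.ravel()

def blockwise_dequantize(q: np.ndarray, scales: np.ndarray, block_size: int = 32):
    rows = q.reshape(-1, block_size).astype(np.float32)
    return (rows * scales[:, None]).reshape(q.shape)

w = np.random.randn(4, 64).astype(np.float32)
q, s = blockwise_quantize(w)
print(np.abs(blockwise_dequantize(q, s) - w).max())  # small per-block error
```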
(microsoft#25945)

### Description
This PR unifies `present_sequence_length` in flash attention and removes the dependency on `total_sequence_length`. This is preparation for supporting graph capture. microsoft#25868
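Roughly, the idea is that the present length is derivable from the kernel's own tensor inputs (hypothetical shapes below, not the kernel's actual signature), so the separate `total_sequence_length` input can be dropped:

```python
import numpy as np

# Past KV cache and new tokens, laid out (batch, heads, seq, head_dim).
past_key = np.zeros((1, 8, 10, 64), dtype=np.float32)
query = np.zeros((1, 8, 1, 64), dtype=np.float32)

# present_sequence_length is recoverable from tensor shapes alone, so no
# separate total_sequence_length input is needed.
present_seq_len = past_key.shape[2] + query.shape[2]
print(present_seq_len)  # 11
```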
### Description
Attention on CPU follows the ONNX specification. This change replicates the changes introduced by onnx/onnx#7274.
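For reference, here is a minimal numpy sketch of the core scaled-dot-product computation that the ONNX Attention operator defines; masking and the operator's optional attributes are omitted, and this is illustrative rather than the kernel's actual code:

```python
import numpy as np

def attention_reference(q, k, v):
    """Scaled dot-product attention on (batch, heads, seq, head_dim) tensors."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 1, 3, 2)) * scale
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

q = k = v = np.random.randn(1, 2, 4, 8).astype(np.float32)
out = attention_reference(q, k, v)                 # shape (1, 2, 4, 8)
```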
### Description
Attention BFloat16 support for CUDA: extends the kernel implementations to accept BF16 input/output tensors.

### Motivation and Context
We already have BFloat16 support for GQA (Group Query Attention), but not for regular Attention, which many models (e.g. the visual encoder of Gemma 3) require for inference due to its FP32-like stability at lower memory/compute cost.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
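As a side note on why BF16 fits this use case, here is a tiny sketch using the third-party ml_dtypes package (not part of ONNX Runtime): BF16 keeps FP32's 8-bit exponent, trading mantissa precision for half the memory.

```python
import numpy as np
import ml_dtypes  # third-party package providing a numpy bfloat16 dtype

x = np.float32(3.1415927)
print(x.astype(ml_dtypes.bfloat16))                 # 3.140625: only 7 mantissa bits survive
print(np.float32(1e38).astype(ml_dtypes.bfloat16))  # ~1e38: FP32's dynamic range is preserved
```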
Inference/Core DRI
This pull request introduces a significant refactoring of the Azure Pipelines CI/CD infrastructure. The primary goals of these changes are to:

1. Solve the problem that the vcpkg/cmake version can change when the CI build machine image changes, which can suddenly break pipelines and interrupt our release process.
2. Reduce the `Zip-Nuget-Java-Nodejs Packaging Pipeline`'s running time by changing how macOS universal2 binaries are built.
3. Add support for RC releases of Java packages.

#### Key Changes:

**1. Standardized Build Tool Setup (`setup-build-tools.yml`)**

A new reusable template, `setup-build-tools.yml`, has been created to centralize the setup of essential build tools.

* **Pinned Dependencies:** This new template allows us to pin specific versions of `cmake` and `vcpkg`, which was not previously possible. This ensures a more stable and predictable build environment across all pipelines.
* **Reduced Redundancy:** By consolidating the setup logic into a single template, we have significantly reduced code duplication across numerous pipeline files, making them cleaner and easier to maintain.

Currently this file is only used in macOS and Windows pipelines, since most Linux pipelines use Docker to manage their environment.

**2. Reworked macOS Universal Binary Build Process**

The methodology for building macOS `universal2` binaries has been fundamentally changed to improve reliability and flexibility.

* **Python Packaging:** The Python packaging pipeline will no longer produce `universal2` wheels. Instead, it will generate separate wheels for the `x86_64` and `arm64` architectures.
* **NuGet C-API Packaging:** The NuGet C-API pipeline has been updated to first build the `x86_64` and `arm64` binaries independently. These single-architecture binaries are then combined to create the final universal package, rather than building a `universal2` package in a single pass.

The change is made mainly because:

- Building for both architectures in a single pass is too slow; it may take about 5 hours in the ADO machine pool.
- Many MLAS features are disabled when ORT is built that way.

**3. Java Packaging and Testing Overhaul**

The Download_Java_Tools stage in the "Zip-Nuget-Java-Nodejs Packaging Pipeline" is deleted because it is no longer used. It was originally added to avoid repeatedly downloading the same Java packages, which reduced download errors; now we have set up a private ADO feed for this purpose. Besides that, there are some major changes to the pipeline:

1. MD5 and SHA1 checksum files are provided along with the Java package files instead of SHA256. Sonatype's official docs say MD5/SHA1 checksums are required while the others are optional (see https://central.sonatype.org/publish/requirements/#supply-javadoc-and-sources), and publishing now fails if the MD5/SHA1 checksum files are missing.
2. The format of the checksum files is changed. Previously we used Linux's sha256sum command to generate these files, so each checksum file contained both a hash value and a filename. However, that was not the expected format: Sonatype expects the file to contain only the hash value. This PR fixes the issue.
3. A few PowerShell scripts were rewritten in Python to improve error checking and robustness.
4. Added support for generating RC packages. Previously we had to manually modify the version numbers and manually do the GPG signing.
5. Two new files, `jar-packaging.yml` and `setup-maven.yml`, were added. We will use Maven to fetch dependent packages (instead of direct HTTP fetching) to improve supply-chain security, because Maven allows us to use a private feed.

**4. Dockerfile Enhancements**

The Dockerfiles used in the CI have been updated to use a `BASEIMAGE` argument. This makes them more flexible, allowing the base image to be specified at build time, which simplifies maintenance and updates. It allows us to use different base image repos in different CI environments. In the future we will change the GitHub Actions workflows to fetch base images only from public Docker repos, while ADO packaging pipelines will continue to use private repos.

**5. Improved Release Management**

The run_packaging_pipelines.py script has been updated to provide more robust and explicit control over package versioning for different build scenarios. This clarifies the process for generating nightly, release candidate (RC), and final release packages. The script now handles three distinct cases for package versioning, sketched in the example after this list:

* **Nightly Packages:** For regular CI builds (e.g., on the main branch), the script triggers the packaging pipelines in "nightly" mode. This sets the IsReleaseBuild parameter to false and results in packages with a development version suffix (e.g., 1.2.3-dev-20250909-abcdef).
* **Release Candidate (RC) Builds:** To create a pre-release or RC build, the script is run with the --build-mode release flag, along with the --pre-release-suffix-string (e.g., rc) and --pre-release-suffix-number (e.g., 1) arguments. This sets the IsReleaseBuild parameter to true and passes the suffix information to the pipelines, resulting in a semantically versioned pre-release package (e.g., 1.2.3-rc.1).
* **Final Release Builds:** For a final release, the script is run with --build-mode release without any pre-release suffix arguments. This sets IsReleaseBuild to true, and the resulting package will have a clean, final version number (e.g., 1.2.3) based on the VERSION_NUMBER file.

Please note:

- Java packages still only support the second and third modes.
- Python packages only support the first and third modes.
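The sketch below illustrates the three versioning cases; the function and argument names mirror the flags described above but are an illustrative reconstruction, not the actual logic of run_packaging_pipelines.py.

```python
from datetime import datetime, timezone

def package_version(base: str, build_mode: str,
                    suffix_string: str | None = None,
                    suffix_number: int | None = None,
                    commit: str = "abcdef") -> str:
    if build_mode != "release":                       # nightly: IsReleaseBuild=false
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
        return f"{base}-dev-{stamp}-{commit}"
    if suffix_string:                                 # RC: IsReleaseBuild=true + suffix
        return f"{base}-{suffix_string}.{suffix_number}"
    return base                                       # final release: clean version

print(package_version("1.2.3", "nightly"))            # 1.2.3-dev-<date>-abcdef
print(package_version("1.2.3", "release", "rc", 1))   # 1.2.3-rc.1
print(package_version("1.2.3", "release"))            # 1.2.3
```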
### Description
This PR implements optimized Arm NEON kernels for NCHWc (channels-last with channel blocking) convolution and pooling operations in MLAS, significantly improving performance on Arm64 platforms.

### Motivation and Context
Fixes microsoft#24790

The new NCHWc kernels improve performance by 5-6x, depending on the configuration of threads, model, etc. For example, here is the performance gain witnessed during mobilenet inference; focus on the "Number of inferences per second" (93 inf/s -> 498 inf/s).

<details>
<summary>System configuration</summary>

```
Architecture:             aarch64
CPU op-mode(s):           64-bit
Byte Order:               Little Endian
CPU(s):                   64
On-line CPU(s) list:      0-63
Vendor ID:                ARM
Model name:               Neoverse-V2
Model:                    1
Thread(s) per core:       1
Core(s) per socket:       64
Socket(s):                1
Stepping:                 r0p1
BogoMIPS:                 2000.00
Flags:                    fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):
  L1d:                    4 MiB (64 instances)
  L1i:                    4 MiB (64 instances)
  L2:                     128 MiB (64 instances)
  L3:                     36 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-63
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```

</details>

<details>
<summary>Perf with current upstream kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0238608 s
First inference time cost: 11 ms
Total inference time cost: 10.7458 s
Total inference requests: 1000
Average inference time cost: 10.7458 ms
Total inference run time: 10.7465 s
Number of inferences per second: 93.0534
Avg CPU usage: 50 %
Peak working set size: 70410240 bytes
Avg CPU usage:50
Peak working set size:70410240
Runs:1000
Min Latency: 0.0106707 s
Max Latency: 0.0113617 s
P50 Latency: 0.0107453 s
P90 Latency: 0.0107695 s
P95 Latency: 0.0107785 s
P99 Latency: 0.0107965 s
P999 Latency: 0.0113617 s
```

</details>

<details>
<summary>Perf with NCHWc kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx
Setting intra_op_num_threads to 32
Session creation time cost: 0.0358121 s
First inference time cost: 2 ms
Total inference time cost: 2.00561 s
Total inference requests: 1000
Average inference time cost: 2.00561 ms
Total inference run time: 2.00607 s
Number of inferences per second: 498.488
Avg CPU usage: 50 %
Peak working set size: 92467200 bytes
Avg CPU usage:50
Peak working set size:92467200
Runs:1000
Min Latency: 0.00198387 s
Max Latency: 0.00204784 s
P50 Latency: 0.00200537 s
P90 Latency: 0.0020155 s
P95 Latency: 0.00201822 s
P99 Latency: 0.0020251 s
P999 Latency: 0.00204784 s
```

</details>

Happy to run further performance tests as required.
…plugin EPs (microsoft#25689)

### Description
Move provider-specific unit tests that were formerly in `onnxruntime_test_all` to a new test program, `onnxruntime_provider_test`. Notably, this includes the op tests, which test different provider implementations.

Enable some tests in `onnxruntime_provider_test` (those using `ModelTester` or `OpTester`) to use a plugin EP. The plugin EP usage is specified at runtime, so it is referred to as "dynamic". The dynamic plugin EP configuration can be specified with the environment variable `ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON`.

This is an example value for `ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON`. The test infrastructure will register the plugin EP library at `/path/to/example_plugin_ep.dll`, and the op tests will use a plugin EP instance with the name `example_ep`.

```json
{
  "ep_library_registration_name": "example_ep",
  "ep_library_path": "/path/to/example_plugin_ep.dll",
  "selected_ep_name": "example_ep"
}
```

### Motivation and Context
Enable use of plugin EPs in some provider-specific unit tests.
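For illustration, a test run could be pointed at this config roughly as follows (a hypothetical invocation; the binary path and config values are placeholders):

```python
import json, os, subprocess

config = {
    "ep_library_registration_name": "example_ep",
    "ep_library_path": "/path/to/example_plugin_ep.dll",
    "selected_ep_name": "example_ep",
}
# Pass the JSON config through the environment variable the test main reads.
env = dict(os.environ,
           ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON=json.dumps(config))
subprocess.run(["./onnxruntime_provider_test"], env=env, check=True)
```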
### Description
Disable the profiler test in the Windows CUDA build.

### Motivation and Context
Temporarily mitigate Windows CUDA build failures.
### Description
Fix the constraints of the Resize operator in QNN-EP.
### Description
The Expand op builder for QNN did not handle FP16 data. This change enables it and adds Expand tests for the GPU backend.
(microsoft#26044)

### Description
`num_heads` is not necessarily required from users when the input shape is 4D.

### Motivation and Context
To follow the ONNX spec (https://github.com/onnx/onnx/blob/main/docs/Operators.md#RotaryEmbedding), the original constraints on attributes were wrong.

NOTE: 3 rotary embedding tests are expected to be wrong until the next release.
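The shapes make the reasoning concrete (illustrative values only): with a 4D input the head count is already encoded in the shape, while a 3D input still needs the attribute to split the hidden dimension.

```python
import numpy as np

x4d = np.zeros((2, 8, 16, 64), dtype=np.float32)  # (batch, num_heads, seq, head_size)
batch, num_heads, seq_len, head_size = x4d.shape
print(num_heads)  # 8; recoverable from the shape, so the attribute is optional

x3d = np.zeros((2, 16, 512), dtype=np.float32)    # (batch, seq, hidden)
hidden = x3d.shape[-1]
# hidden = num_heads * head_size; the split is ambiguous from the shape alone,
# so num_heads is still required for 3D inputs.
```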
### Description
Convert the QNN x64 CI pipeline from an ADO pipeline to GitHub Actions.

### Motivation and Context
We shall move all PR pipelines to GitHub Actions.
This PR upgrades the com.diffplug.spotless Gradle plugin to version 7.2.1 in the java/build.gradle file. This brings in the latest features and bug fixes from the Spotless code formatter.
### Description
This fixes somewhat contrived edge cases that are present in our tests:

- an input propagates directly to an output
- an output is produced by an initializer

### Motivation and Context
An upcoming Python API PR does not pass tests without it.
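A minimal sketch of the two edge cases using onnx.helper (graph and tensor names are made up for illustration):

```python
import onnx
from onnx import TensorProto, helper

# Output "c" is backed directly by an initializer; graph input "x" is also
# listed as a graph output, so it propagates straight through with no nodes.
c = helper.make_tensor("c", TensorProto.FLOAT, [1], [1.0])
graph = helper.make_graph(
    nodes=[],
    name="edge_cases",
    inputs=[helper.make_tensor_value_info("x", TensorProto.FLOAT, [1])],
    outputs=[
        helper.make_tensor_value_info("x", TensorProto.FLOAT, [1]),
        helper.make_tensor_value_info("c", TensorProto.FLOAT, [1]),
    ],
    initializer=[c],
)
model = helper.make_model(graph)
print([o.name for o in model.graph.output])  # ['x', 'c']
```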
(microsoft#26012) …in_memory (closes microsoft#25873)

### Description
Adds a Python binding to load external initializer files from in-memory buffers. Mirrors the existing C/C++ API and Node binding to enable full in-memory model loading. Adds explicit type validation to avoid pybind dumping large raw bytes on argument errors.

### Motivation and Context
Problem: Models that use external_data for initializers can't be fully loaded from bytes in Python because the weights are expected on disk.
Goal: Allow providing the external file contents directly from Python memory (e.g., bytes, memoryview, numpy), eliminating the filesystem dependency and supporting serverless/remote-asset scenarios.
Issue: Fixes microsoft#25873.

### Changes

#### New Python API on SessionOptions
- Name: add_external_initializers_from_files_in_memory
- Signature: names: list[str], buffers: list[bytes-like], lengths: list[int]
- File: onnxruntime/python/onnxruntime_pybind_state.cc

#### Input validation to prevent noisy errors
- Validates that the top-level types are lists and that the list lengths match.
- Validates that each name is a str, each buffer supports the buffer protocol, and each length is an int.
- Raises clear RuntimeError messages instead of pybind11's verbose dumps for mismatched types.

#### Test added
onnxruntime/test/python/onnxruntime_test_python.py, test_session_options_add_external_initializers_from_files_in_memory: supplies "Pads_not_on_disk.bin" content from a numpy array's bytes and loads model_with_external_initializer_come_from_user.onnx without touching the filesystem.

### Usage
Provide the external file name(s) as referenced by the model's external_data "location", plus their bytes and lengths:

```python
so.add_external_initializers_from_files_in_memory(["weights.bin"], [weights_bytes], [len(weights_bytes)])
sess = onnxruntime.InferenceSession(model_bytes, sess_options=so)
```

---------

Signed-off-by: Jonah Bernard <jb2528@cornell.edu>
Co-authored-by: Jonah Bernard <jb2528@cornell.edu>
ankitm3k approved these changes on Sep 17, 2025.
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.