Conversation

Jaswanth51

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

apsonawane and others added 17 commits September 15, 2025 08:32
This PR adds a block-wise quantization kernel for QMoE on CPU.
…oft#25945)

### Description
This PR unifies the present_sequence_length in flash attention and
removes the dependency on total_sequence_length, in preparation for
graph capture support. microsoft#25868
### Description
The CPU Attention operator follows the ONNX specification. This change
replicates the changes introduced by
onnx/onnx#7274.
### Description
Attention BFloat16 Support for CUDA - extends kernel implementations to
accept BF16 input/output tensors.

### Motivation and Context
We already have BFloat16 support for GQA (Group Query Attention), but
not for regular Attention, which many models (e.g. the visual encoder
of Gemma 3) require for inference because BF16 provides FP32-like
stability at lower memory/compute cost.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This pull request introduces a significant refactoring of the Azure
Pipelines CI/CD infrastructure. The primary goals of these changes are
to:
1. Prevent the vcpkg/cmake versions from changing when the CI build
machine image changes, which could suddenly break pipelines and
interrupt our release process.
2. Reduce the `Zip-Nuget-Java-Nodejs Packaging Pipeline`'s running time
by changing how macOS universal2 binaries are built.
3. Add support for RC releases of Java packages.

#### Key Changes:

**1. Standardized Build Tool Setup (`setup-build-tools.yml`)**

A new reusable template, `setup-build-tools.yml`, has been created to
centralize the setup of essential build tools.

* **Pinned Dependencies:** This new template allows us to pin specific
versions of `cmake` and `vcpkg`, which was not previously possible. This
ensures a more stable and predictable build environment across all
pipelines.
* **Reduced Redundancy:** By consolidating the setup logic into a single
template, we have significantly reduced code duplication across numerous
pipeline files, making them cleaner and easier to maintain.

Currently this template is only used in the macOS and Windows pipelines,
since most Linux pipelines use Docker to manage their environment.

**2. Reworked macOS Universal Binary Build Process**

The methodology for building macOS `universal2` binaries has been
fundamentally changed to improve reliability and flexibility.

* **Python Packaging:** The Python packaging pipeline will no longer
produce `universal2` wheels. Instead, it will generate separate wheels
for `x86_64` and `arm64` architectures.
* **NuGet C-API Packaging:** The NuGet C-API pipeline has been updated
to first build the `x86_64` and `arm64` binaries independently. These
single-architecture binaries are then combined to create the final
universal package, rather than building a `universal2` package in a
single pass.

The change is made mainly because:
- Building for both architectures in a single pass is too slow; it can
take about 5 hours in the ADO machine pool.
- Many MLAS features are disabled when ORT is built that way.

A sketch of the combining step is shown below.
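For reference, combining single-architecture binaries into a universal binary is conventionally done with macOS's `lipo` tool. The PR does not name the exact tooling here, so the sketch below is an assumption, with illustrative paths:

```python
# Hedged sketch: merge per-architecture dylibs into one universal2 binary.
# `lipo -create` is the standard macOS mechanism; all paths are placeholders.
import subprocess

subprocess.run(
    ["lipo", "-create",
     "build-x86_64/libonnxruntime.dylib",
     "build-arm64/libonnxruntime.dylib",
     "-output", "universal2/libonnxruntime.dylib"],
    check=True,
)
```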

**3. Java Packaging and Testing Overhaul**

The Download_Java_Tools stage in the "Zip-Nuget-Java-Nodejs Packaging
Pipeline" is deleted because it is no longer used. It was originally
added to avoid downloading the same Java packages repeatedly, which in
turn reduced download errors. We have now set up a private ADO feed for
this purpose.

In addition, the pipeline has several major changes:

1. MD5 and SHA1 checksum files are provided along with the Java package
files instead of SHA256. Sonatype's official docs say MD5/SHA1 checksums
are required while the others are optional (see:
https://central.sonatype.org/publish/requirements/#supply-javadoc-and-sources).
Publishing now fails if the MD5/SHA1 checksum files are missing.
2. The format of the checksum files is changed. Previously we used
Linux's sha256sum command to generate these files, so each checksum file
contained a hash value followed by a filename. That is not the expected
format: Sonatype expects the file to contain only the hash value. This
PR fixes the issue (see the sketch after this list).
3. A few PowerShell scripts were rewritten in Python to improve error
checking and robustness.
4. Added support for generating RC packages. Previously we had to
manually modify the version numbers and manually perform GPG signing.
5. Two new files, `jar-packaging.yml` and `setup-maven.yml`, were added.
We will use Maven to fetch dependent packages (instead of fetching them
directly over HTTP) to improve supply-chain security, because Maven
allows us to use a private feed.
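
For item 2, here is a minimal sketch of generating hash-only checksum files. The helper below is illustrative and uses only Python's standard `hashlib`; it is not the PR's actual script:

```python
# Hedged sketch, not the PR's actual code: write Sonatype-style checksum
# files that contain only the hex digest (no trailing filename, unlike the
# output of sha256sum/md5sum).
import hashlib
from pathlib import Path

def write_checksums(artifact: Path) -> None:
    data = artifact.read_bytes()
    for algo in ("md5", "sha1"):  # Sonatype requires MD5 and SHA1
        digest = hashlib.new(algo, data).hexdigest()
        (artifact.parent / f"{artifact.name}.{algo}").write_text(digest)

# e.g. produces onnxruntime-1.2.3.jar.md5 and onnxruntime-1.2.3.jar.sha1
write_checksums(Path("onnxruntime-1.2.3.jar"))
```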


**4. Dockerfile Enhancements**

The Dockerfiles used in the CI have been updated to accept a `BASEIMAGE`
argument. This makes them more flexible, allowing the base image to be
specified at build time, which simplifies maintenance and updates. It
also lets us use different base-image repositories in different CI
environments: in the future, GitHub Actions will fetch base images only
from public Docker repositories, while ADO packaging pipelines will
continue to use private repositories.

**5. Improved Release Management**

The run_packaging_pipelines.py script has been updated to provide more
robust and explicit control over package versioning for different build
scenarios. This clarifies the process for generating nightly, release
candidate (RC), and final release packages.

The script now handles three distinct cases for package versioning:

* Nightly packages: For regular CI builds (e.g., on the main branch),
the script triggers the packaging pipelines in "nightly" mode. This sets
the IsReleaseBuild parameter to false and produces packages with a
development version suffix (e.g., 1.2.3-dev-20250909-abcdef).

* Release candidate (RC) builds: To create a pre-release or RC build,
the script is run with the --build-mode release flag, along with the
--pre-release-suffix-string (e.g., rc) and --pre-release-suffix-number
(e.g., 1) arguments. This sets the IsReleaseBuild parameter to true and
passes the suffix information to the pipelines, resulting in a
semantically versioned pre-release package (e.g., 1.2.3-rc.1).

* Final release builds: For a final release, the script is run with
--build-mode release and no pre-release suffix arguments. This sets
IsReleaseBuild to true, and the resulting package has a clean, final
version number (e.g., 1.2.3) based on the VERSION_NUMBER file.

Please note:
- Java packages only support the second and third modes.
- Python packages only support the first and third modes.

The sketch below illustrates the three resulting version strings.
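
A minimal sketch of the three versioning modes, using a hypothetical helper (this is illustrative, not the actual code in run_packaging_pipelines.py):

```python
# Hedged sketch of the three versioning modes described above; the function
# and its arguments are illustrative, not the script's real interface.
from typing import Optional

def package_version(base: str,
                    build_mode: str = "nightly",
                    suffix_string: Optional[str] = None,
                    suffix_number: Optional[int] = None,
                    date: str = "20250909",
                    commit: str = "abcdef") -> str:
    if build_mode != "release":
        # Nightly: IsReleaseBuild=false, dev suffix with date and commit hash.
        return f"{base}-dev-{date}-{commit}"
    if suffix_string is not None:
        # RC: IsReleaseBuild=true plus a semantic pre-release suffix.
        return f"{base}-{suffix_string}.{suffix_number}"
    # Final release: clean version from the VERSION_NUMBER file.
    return base

assert package_version("1.2.3") == "1.2.3-dev-20250909-abcdef"
assert package_version("1.2.3", "release", "rc", 1) == "1.2.3-rc.1"
assert package_version("1.2.3", "release") == "1.2.3"
```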
### Description
This PR implements optimized Arm NEON kernels for NCHWc (channels-last
with channel blocking) convolution and pooling operations in MLAS,
significantly improving performance on Arm64 platforms.

### Motivation and Context
Fixes microsoft#24790 

The new NCHWc kernels improve performance by 5-6x, depending on the
thread configuration, model, etc.
For example, here is the performance gain observed during MobileNet
inference; note the "Number of inferences per second" line (93 inf/s ->
498 inf/s).

<details>
  <summary>System configuration</summary>

```
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                ARM
  Model name:             Neoverse-V2
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   64
    Socket(s):            1
    Stepping:             r0p1
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp 
                          sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):      
  L1d:                    4 MiB (64 instances)
  L1i:                    4 MiB (64 instances)
  L2:                     128 MiB (64 instances)
  L3:                     36 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-63
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
</details>
<details>
  <summary>Perf with current upstream kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx

Setting intra_op_num_threads to 32
Session creation time cost: 0.0238608 s
First inference time cost: 11 ms
Total inference time cost: 10.7458 s
Total inference requests: 1000
Average inference time cost: 10.7458 ms
Total inference run time: 10.7465 s
Number of inferences per second: 93.0534 
Avg CPU usage: 50 %
Peak working set size: 70410240 bytes
Avg CPU usage:50
Peak working set size:70410240
Runs:1000
Min Latency: 0.0106707 s
Max Latency: 0.0113617 s
P50 Latency: 0.0107453 s
P90 Latency: 0.0107695 s
P95 Latency: 0.0107785 s
P99 Latency: 0.0107965 s
P999 Latency: 0.0113617 s
```
</details>
<details>
  <summary>Perf with NCHWc kernels</summary>

```
./build/Linux/Release/onnxruntime_perf_test -x 32 -I -m times -r 1000 ~/scripts/mobilenet.onnx

Setting intra_op_num_threads to 32
Session creation time cost: 0.0358121 s
First inference time cost: 2 ms
Total inference time cost: 2.00561 s
Total inference requests: 1000
Average inference time cost: 2.00561 ms
Total inference run time: 2.00607 s
Number of inferences per second: 498.488 
Avg CPU usage: 50 %
Peak working set size: 92467200 bytes
Avg CPU usage:50
Peak working set size:92467200
Runs:1000
Min Latency: 0.00198387 s
Max Latency: 0.00204784 s
P50 Latency: 0.00200537 s
P90 Latency: 0.0020155 s
P95 Latency: 0.00201822 s
P99 Latency: 0.0020251 s
P999 Latency: 0.00204784 s
```
</details>

Happy to run further performance tests as required.
…plugin EPs (microsoft#25689)

### Description

Move provider-specific unit tests that were formerly in
`onnxruntime_test_all` to a new test program,
`onnxruntime_provider_test`. Notably, this includes the op tests which
test different provider implementations.

Enable some tests in `onnxruntime_provider_test` (those using
`ModelTester` or `OpTester`) to use a plugin EP. The plugin EP usage is
specified at runtime, so it is referred to as "dynamic". The dynamic
plugin EP configuration can be specified with environment variable
`ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON`.

This is an example value for
`ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON`. The test
infrastructure will register the plugin EP library at
`path/to/example_plugin_ep.dll`. The op tests will use a plugin EP
instance with the name `example_ep`.
```json
{
  "ep_library_registration_name": "example_ep",
  "ep_library_path": "/path/to/example_plugin_ep.dll",
  "selected_ep_name": "example_ep"
}
```
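
For illustration, the tests could be launched with this configuration roughly as follows. The binary name `onnxruntime_provider_test` comes from this PR; the launcher script itself is an assumption:

```python
# Hedged sketch: run the new provider test binary with a dynamic plugin EP
# selected through the environment variable described above.
import json
import os
import subprocess

config = {
    "ep_library_registration_name": "example_ep",
    "ep_library_path": "/path/to/example_plugin_ep.dll",
    "selected_ep_name": "example_ep",
}
env = dict(os.environ)
env["ORT_UNIT_TEST_MAIN_DYNAMIC_PLUGIN_EP_CONFIG_JSON"] = json.dumps(config)
subprocess.run(["./onnxruntime_provider_test"], env=env, check=True)
```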

### Motivation and Context

Enable use of plugin EPs in some provider-specific unit tests.
### Description
Disable profiler test on Windows CUDA build


### Motivation and Context
Temporarily mitigate Windows CUDA build failures
### Description
Fix the constraints of the Resize operator in QNN-EP.
### Description
The Expand op builder for QNN did not handle FP16 data. This change
enables it and adds Expand tests for the GPU backend.
…ft#26044)

### Description

`num_heads` does not necessarily need to be provided by users when the
input shape is 4D.

### Motivation and Context

To follow the ONNX spec
(https://github.com/onnx/onnx/blob/main/docs/Operators.md#RotaryEmbedding),
the original constraints on the attributes were wrong.

NOTE: 3 rotary embedding tests are expected to be wrong until the next
release.
### Description
Convert the QNN x64 CI pipeline from an ADO pipeline to GitHub Actions.

### Motivation and Context
We are moving all PR pipelines to GitHub Actions.
This PR upgrades the com.diffplug.spotless Gradle plugin to version
7.2.1 in the java/build.gradle file. This brings in the latest features
and bug fixes from the Spotless code formatter.
### Description
This fixes somewhat contrived edge cases that are present in our tests:
  - an input propagates to an output
  - an output is produced by an initializer.

### Motivation and Context
An upcoming Python API PR does not pass its tests without this fix.
microsoft#26012)

…in_memory (closes microsoft#25873)

### Description
Adds a Python binding to load external initializer files from in-memory
buffers. It mirrors the existing C/C++ API and Node binding to enable
fully in-memory model loading, and adds explicit type validation to
avoid pybind dumping large raw bytes on argument errors.



### Motivation and Context
Problem: Models that use external_data for initializers can't be fully
loaded from bytes in Python because the weights are expected on disk.
Goal: Allow providing the external file contents directly from Python
memory (e.g., bytes, memoryview, numpy), eliminating the filesystem
dependency and supporting serverless/remote-asset scenarios.
Issue: Fixes microsoft#25873.

### Changes

#### New Python API on SessionOptions:

* Name: `add_external_initializers_from_files_in_memory`
* Signature: `names: list[str]`, `buffers: list[bytes-like]`, `lengths: list[int]`
* File: `onnxruntime/python/onnxruntime_pybind_state.cc`

#### Input validation to prevent noisy errors:

* Validates top-level types are lists and that list lengths match.
* Validates each name is a `str`, each buffer supports the buffer protocol, and each length is an `int`.
* Raises clear `RuntimeError` messages instead of pybind11's verbose dumps for mismatched types.

#### Test added:

`test_session_options_add_external_initializers_from_files_in_memory` in
`onnxruntime/test/python/onnxruntime_test_python.py`: supplies
"Pads_not_on_disk.bin" content from a numpy array's bytes and loads
`model_with_external_initializer_come_from_user.onnx` without touching
the filesystem.


### Usage

Provide the external file name(s) as referenced by the model's
external_data "location", plus their bytes and lengths:

```python
so.add_external_initializers_from_files_in_memory(["weights.bin"], [weights_bytes], [len(weights_bytes)])
sess = onnxruntime.InferenceSession(model_bytes, sess_options=so)
```
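
A fuller, self-contained sketch (the model file name, external file name, and weight buffer below are illustrative):

```python
# Hedged sketch: fully in-memory load of a model whose external_data
# "location" is "weights.bin"; model and weights here are placeholders.
import numpy as np
import onnxruntime

with open("model_with_external_initializer.onnx", "rb") as f:
    model_bytes = f.read()

weights_bytes = np.arange(8, dtype=np.float32).tobytes()

so = onnxruntime.SessionOptions()
so.add_external_initializers_from_files_in_memory(
    ["weights.bin"], [weights_bytes], [len(weights_bytes)]
)
sess = onnxruntime.InferenceSession(model_bytes, sess_options=so)
```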

---------

Signed-off-by: Jonah Bernard <jb2528@cornell.edu>
Co-authored-by: Jonah Bernard <jb2528@cornell.edu>
@Jaswanth51 Jaswanth51 requested a review from ankitm3k September 17, 2025 05:52
@ankitm3k ankitm3k merged commit f7a5656 into ovep-develop Sep 17, 2025
4 of 7 checks passed
@ankitm3k ankitm3k deleted the sync_msft_17092025 branch September 17, 2025 06:49