Sync with Microsoft ONNX Runtime - 09/12/2025 #879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
### Description

Resolved all security vulnerabilities in JavaScript packages under `/js` by running `npm audit fix`. All updates are non-breaking patch/minor version bumps.

**Fixed vulnerabilities:**

- `/js` root: 1 high severity
  - `glob` 10.4.5 → 10.5.0 (command injection - GHSA-5j98-mcp5-4vw2)
- `/js/react_native`: 7 vulnerabilities (1 high, 3 moderate, 3 low)
  - `image-size` → 1.2.1 (high: DoS via infinite loop - GHSA-m5qc-5hw7-8vg7)
  - `@babel/helpers` 7.25.6 → 7.28.4 (moderate: RegExp complexity - GHSA-968p-4wvh-cqc8)
  - `@babel/runtime` 7.25.6 → 7.28.4 (moderate: RegExp complexity - GHSA-968p-4wvh-cqc8)
  - `js-yaml` → fixed (moderate: prototype pollution - GHSA-mh29-5h37-fv8m)
  - `brace-expansion` 2.0.1 → 2.0.2 (low: ReDoS - GHSA-v6h2-p8h4-qcjw)
  - `on-headers` → fixed (low: header manipulation - GHSA-76c9-3jph-rj3q)

**Files modified:**

- `js/package-lock.json`
- `js/react_native/package-lock.json`

**Result:** All JS packages (`/js`, `/js/common`, `/js/web`, `/js/node`, `/js/react_native`) now report 0 vulnerabilities.

### Motivation and Context

Security maintenance to address dependency vulnerabilities identified by `npm audit`. No breaking changes or code modifications required.

<details>
<summary>Original prompt</summary>

> Please create a pull request that runs `npm audit fix` for the JavaScript/TypeScript portion of the repository under the `/js` directory of [microsoft/onnxruntime](https://github.com/microsoft/onnxruntime).
>
> Requirements:
>
> 1. **Scope**
>    - Work only within the `/js` folder and its subpackages (e.g., `js/web`, `js/node`, `js/common`, etc.).
>    - Do not modify files outside `/js`.
>
> 2. **Dependency updates**
>    - Run `npm audit fix` (and, if necessary to fully resolve high/critical issues while staying non-breaking, `npm audit fix --force` on specific subpackages) to address security vulnerabilities.
>    - Prefer minimal, non-breaking version bumps (patch and minor) that satisfy `npm audit` while keeping semver ranges sensible.
>    - If any **major** upgrades are required to clear vulnerabilities, handle them cautiously:
>      - Apply the upgrade only if tests still pass and typings/build setup remain compatible.
>      - If a major bump would require code changes or creates breaking behavior, **do not** apply it; instead, leave a TODO comment in the PR description summarizing which packages remain vulnerable and why.
>
> 3. **Validation**
>    - Run the existing JS-related checks that the repo supports from `/js`, such as:
>      - `npm test` or package-specific test scripts.
>      - Any documented lint/build/test commands for JS packages (e.g., `npm run build`, `npm run lint`) where applicable.
>    - Ensure the updated lockfiles (if present) are consistent, and the project installs cleanly with `npm ci` (or the repo's documented install command) in the `/js` area.
>
> 4. **Files to update**
>    - Update `package.json` and lockfiles under `/js` (e.g., `package-lock.json`, `npm-shrinkwrap.json`, or workspace-specific lock files) to reflect the audited dependency tree.
>    - Do not manually edit `node_modules`; rely on `npm` to manage dependencies and only commit manifest/lockfile changes.
>
> 5. **Repository conventions**
>    - Follow this repo's existing conventions for formatting, commit messages, and JS tooling.
>    - Keep the diff focused on the dependency and lockfile updates plus any absolutely necessary code tweaks to maintain compatibility.
>
> 6. **Pull request description**
>    - In the PR body, include:
>      - A short summary: that `npm audit fix` was run in `/js` to address dependency vulnerabilities.
>      - A bullet list of notable dependency changes (especially any major version bumps), with packages and old/new versions.
>      - A brief testing summary (commands run and their results).
>      - A note about any remaining vulnerabilities that could not be fixed without breaking changes (if applicable), including the affected packages and advisories if available.
>
> The goal is a clean, minimal PR that improves the security posture of the JS packages under `/js` in `microsoft/onnxruntime` without introducing breaking changes.

</details>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
…se (microsoft#26626)

### Description

This PR optimizes `InstanceNormalization` by removing a redundant transpose. Given that the implementation of `InstanceNormalization` for `NCHW` is more efficient, we don't need to add a wrapper `Transpose` to make it run in `NHWC`, which helps us elide the redundant transpose and improve performance. Testing on Lunar Lake shows about a `60%` performance improvement in `InstanceNormalization` operations.

#### `InstanceNormalization` OP benchmark

The input tensor shape: `(1,32,1048576)`
The scale tensor shape: `(32)`
The B tensor shape: `(32)`

| time cost (ms) | baseline | opt | diff |
| ---------------- | -------- | ---- | ---- |
| Lunar Lake | 82.6 | 34.2 | 58% |

#### Model benchmark

| time cost (ms) | baseline | opt | diff |
| ---------------- | -------- | ---- | ---- |
| sd-turbo-vae-decoder-fp16-demo | 2437.6 | 1835.9 | 25% |

### Motivation and Context

Please see above.
### Description

This PR refactors a few "context" classes to make them clearer and to support new features.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.

<details>
<summary>Release notes</summary>

*Sourced from [actions/checkout's releases](https://github.com/actions/checkout/releases).*

**v6.0.0**

- Update README to include Node.js 24 support details and requirements by @salmanmkc in actions/checkout#2248
- Persist creds to a separate file by @ericsciple in actions/checkout#2286
- v6-beta by @ericsciple in actions/checkout#2298
- update readme/changelog for v6 by @ericsciple in actions/checkout#2311

**Full Changelog**: https://github.com/actions/checkout/compare/v5.0.0...v6.0.0

**v6-beta**

Updated persist-credentials to store the credentials under `$RUNNER_TEMP` instead of directly in the local git config. This requires a minimum Actions Runner version of [v2.329.0](https://github.com/actions/runner/releases/tag/v2.329.0) to access the persisted credentials for [Docker container action](https://docs.github.com/en/actions/tutorials/use-containerized-services/create-a-docker-container-action) scenarios.

**v5.0.1**

- Port v6 cleanup to v5 by @ericsciple in actions/checkout#2301

**Full Changelog**: https://github.com/actions/checkout/compare/v5...v5.0.1

</details>

<details>
<summary>Changelog</summary>

*Sourced from [actions/checkout's changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md).*

- **V6.0.0**: Persist creds to a separate file (actions/checkout#2286); update README for Node.js 24 support (actions/checkout#2248)
- **V5.0.1**: Port v6 cleanup to v5 (actions/checkout#2301)
- **V5.0.0**: Update actions checkout to use node 24 (actions/checkout#2226)
- **V4.3.1**: Port v6 cleanup to v4 (actions/checkout#2305)
- **V4.3.0**: README and documentation updates (actions/checkout#1971, actions/checkout#2043, actions/checkout#2044, actions/checkout#2194), internal repos for checking out multiple repositories (actions/checkout#1977), CODEOWNERS update (actions/checkout#2224), package dependency updates (actions/checkout#2236)
- **v4.2.2**: `url-helper.ts` now leverages well-known environment variables (actions/checkout#1941); expanded unit test coverage for `isGhes` (actions/checkout#1946)
- **v4.2.1**: Check out other refs/* by commit if provided, fall back to ref (actions/checkout#1924)
- **v4.2.0**: Add Ref and Commit outputs (actions/checkout#1180); dependency updates (actions/checkout#1777, actions/checkout#1872)
- **v4.1.7**: Dependency bumps (actions/checkout#1739, actions/checkout#1697); check out other refs/* by commit (actions/checkout#1774); pin actions/checkout's own workflows to a known, good, stable version (actions/checkout#1776)
- **v4.1.6**: Check platform to set archive extension appropriately (actions/checkout#1732)
- **v4.1.5**: NPM and action dependency updates (actions/checkout#1703, actions/checkout#1694, actions/checkout#1696, actions/checkout#1695)
- ... (truncated)

</details>

<details>
<summary>Commits</summary>

- [`1af3b93`](https://github.com/actions/checkout/commit/1af3b93b6815bc44a9784bd300feb67ff0d1eeb3) update readme/changelog for v6 (#2311)
- [`71cf226`](https://github.com/actions/checkout/commit/71cf2267d89c5cb81562390fa70a37fa40b1305e) v6-beta (#2298)
- [`069c695`](https://github.com/actions/checkout/commit/069c6959146423d11cd0184e6accf28f9d45f06e) Persist creds to a separate file (#2286)
- [`ff7abcd`](https://github.com/actions/checkout/commit/ff7abcd0c3c05ccf6adc123a8cd1fd4fb30fb493) Update README to include Node.js 24 support details and requirements (#2248)
- See full diff in [compare view](https://github.com/actions/checkout/compare/v5...v6)

</details>

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description

Add `LogEvaluationStart` for `ReplayGraph` to match `LogEvaluationStop`.

### Motivation and Context

So that, by using ETW, run time can be captured correctly.

Co-authored-by: hualxie <hualxie@microsoft.com>
### Description

Add `LogCompileModel` to mark the session usage as Compile, because that session will not be used for inference. We could also use it to log compile-model parameters if needed.

### Motivation and Context

We are building a profiling tool for WinML and we want to differentiate compile sessions from inference sessions. There are two ways to do it, and it isn't clear which is better: microsoft#26646 or microsoft#26647.

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
Fix a bug introduced by microsoft#26563, which accidentally used the wrong condition and produced incorrect results in graph capture mode.
…test.exe (microsoft#26396)

### Description

- The change allows users to better debug unit tests by adding the following environment variables:
  - `QNN_DUMP_ONNX`: Dump the input ONNX model
  - `QNN_DUMP_JSON`: Dump the JSON QNN graph with provider option `dump_json_qnn_graph`
  - `QNN_DUMP_DLC`: Dump the DLC with provider option `qnn_ir_backend_path`
  - `QNN_VERBOSE`: Use the log level `ORT_LOGGING_LEVEL_VERBOSE`
- Developers can use the environment variables above to save the artifacts of QNN-EP test cases to a directory named `<TestSuite>_<TestName>`:

```
.
├── QnnCPUBackendTests_BatchNorm2D_fp32      # RunQnnModelTest
│   ├── dumped_f32_model.onnx                # float32 ONNX model
│   ├── QNNExecutionProvider_QNN_XXXX_X_X.dlc
│   └── QNNExecutionProvider_QNN_XXXX_X_X.json
├── QnnHTPBackendTests_BatchNorm_FP16        # TestFp16ModelAccuracy
│   ├── dumped_f16_model.onnx                # float16 ONNX model
│   ├── dumped_f32_model.onnx                # float32 ONNX model
│   ├── QNNExecutionProvider_QNN_XXXX_X_X.dlc
│   └── QNNExecutionProvider_QNN_XXXX_X_X.json
└── QnnHTPBackendTests_BatchNorm2D_U8U8S32   # TestQDQModelAccuracy
    ├── dumped_f32_model.onnx                # float32 ONNX model
    ├── dumped_qdq_model.onnx                # QDQ ONNX model
    ├── QNNExecutionProvider_QNN_XXXX_X_X.dlc
    └── QNNExecutionProvider_QNN_XXXX_X_X.json

# All artifact files are placed under the current working directory from which the test binary is invoked.
```

### Motivation and Context

- The JSON QNN graph/DLC are helpful for debugging backend performance/accuracy issues.
- By comparing the ONNX model against the JSON QNN graph/DLC, we can locate issues in graph manipulation.
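As a usage sketch (the test binary name, gtest filter, and the `"1"` values below are illustrative assumptions, not taken from this PR), the variables can be enabled for a single test run like so:

```py
# Hypothetical invocation of a QNN-EP unit test with the dump variables enabled.
# The binary name and gtest filter are assumptions for illustration only.
import os
import subprocess

env = dict(
    os.environ,
    QNN_DUMP_ONNX="1",  # dump the input ONNX model
    QNN_DUMP_JSON="1",  # dump the JSON QNN graph
    QNN_DUMP_DLC="1",   # dump the DLC
    QNN_VERBOSE="1",    # use ORT_LOGGING_LEVEL_VERBOSE
)

# Artifacts land under ./<TestSuite>_<TestName>/ in the current working directory.
subprocess.run(
    ["./onnxruntime_test_all", "--gtest_filter=QnnHTPBackendTests.BatchNorm2D_U8U8S32"],
    env=env,
    check=True,
)
```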
…t#26667)

### Description

More accurately compute Pow(2.0) on the WebGPU EP.

Reproduction script:

```py
from onnx import helper, TensorProto
import onnxruntime as ort
import numpy as np

# 1. Create the ONNX model
# Define input and output
input_info = helper.make_tensor_value_info('X', TensorProto.FLOAT, [1, 1])
output_info = helper.make_tensor_value_info('Y', TensorProto.FLOAT, [1, 1])

# Create a constant tensor for the exponent (2.0)
exponent_tensor = helper.make_tensor('exponent', TensorProto.FLOAT, [], [2.0])
exponent_node = helper.make_node('Constant', [], ['exponent_out'], value=exponent_tensor)

# Create the Pow node
# Pow takes two inputs: Base (X) and Power (exponent_out)
pow_node = helper.make_node(
    'Pow',
    inputs=['X', 'exponent_out'],
    outputs=['Y'],
    name='PowNode'
)

# Create the graph
graph_def = helper.make_graph(
    [exponent_node, pow_node],
    'test-model',
    [input_info],
    [output_info]
)

# Create the model
model_def = helper.make_model(graph_def, producer_name='onnx-example')
opset = model_def.opset_import[0]
opset.version = 13  # Ensure opset version supports the operations

# 2. Convert model to string (bytes)
model_str = model_def.SerializeToString()

# 3. Prepare input data
np.random.seed(0)
input_data = np.array([[-2e3]], dtype=np.float32)

# 4. Run on CPUExecutionProvider
sess_cpu = ort.InferenceSession(model_str, providers=['CPUExecutionProvider'])
res_cpu = sess_cpu.run(['Y'], {'X': input_data})[0]
print("CPU Result:", res_cpu)

# 5. Run on WebGpuExecutionProvider
sess_webgpu = ort.InferenceSession(model_str, providers=['WebGpuExecutionProvider'])
res_webgpu = sess_webgpu.run(['Y'], {'X': input_data})[0]
print("WebGPU Result:", res_webgpu)

# Compare results
diff = np.abs(res_cpu - res_webgpu)
max_diff = diff.max().item()
assert max_diff < 1e-5, f"Results do not match within tolerance! Max diff: {max_diff}"
print("Results match!")
```

currently produces

```
CPU Result: [[4.e+06]]
WebGPU Result: [[3.999999e+06]]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 56
     54 diff = np.abs(res_cpu - res_webgpu)
     55 max_diff = diff.max().item()
---> 56 assert max_diff < 1e-5, f"Results do not match within tolerance! Max diff: {max_diff}"
     57 print("Results match!")

AssertionError: Results do not match within tolerance! Max diff: 1.0
```

but with this PR:

```
CPU Result: [[4.e+06]]
WebGPU Result: [[4.e+06]]
Results match!
```

### Motivation and Context

Leads to downstream issues/inaccuracies for certain models, especially those which have larger values to compute pow(x, 2) for.

cc @guschmue
### Description

While profiling session creation time for large graphs (large in number of nodes, not in tensor sizes), we noticed that the creation and subsequent destruction of protobuf objects was the major hotspot. This PR avoids that creation.

Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
…microsoft#26682)

### Description

Use `std::string_view` directly as the key in the `find` method of `flat_hash_map`. This part of the absl documentation may provide further insight: https://abseil.io/docs/cpp/guides/container#heterogeneous-lookup

### Motivation and Context

We noticed this when profiling the session creation of large models (in terms of the number of nodes).

Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
In debug mode, the following error is reported: `webgpu_context.cc:257 Run Uniform variable[5] (head_size) data type mismatch in program "SplitPackedQKVWithRotaryEmbeddingAndCopyKV", Expected: u32, Actual: i32`. There is no issue in release mode. Convert i32 to u32 to avoid this.
…crosoft#26659)

### Description

Test model (happens with any 2D inputs): [2191__visual_projection_visual_projection.1_BatchNormalization.onnx.zip](https://github.com/user-attachments/files/23758390/2191__visual_projection_visual_projection.1_BatchNormalization.onnx.zip)

Command:

```
python -c "import onnxruntime as ort; ort.InferenceSession('2191__visual_projection_visual_projection.1_BatchNormalization.onnx', providers=['WebGpuExecutionProvider'])"
```

Before (failure):

```
Op (BatchNormalization) [ShapeInferenceError] Tensor must have at least 3 dimensions to convert between channels first and channels last.
```

After (success):

```
(nothing, meaning success)
```

### Motivation and Context

This fixes BatchNormalization on WebGPU, matching the CPU version.

cc @guschmue
### Description

The CudaMemPool test checks whether it is supported in a given environment. We need to clear the error so that it does not affect subsequent tests.

### Motivation and Context

Avoids a potential test failure.
…soft#26546)

### Description

The original error message only shows: "Failed to setup QNN input tensors for graph: <graph_name>". This change adds more detailed error information by logging the failure reason from [SetupTensors](https://github.com/microsoft/onnxruntime/blob/ea55c160a36d658eae61a4c7aeda6cb55dd54dec/onnxruntime/core/providers/qnn/builder/qnn_model.cc#L386), making it easier to debug issues.

### Motivation and Context

Users require detailed error logging for ORT online context binary generation.
…osoft#26662)

### Description

This patch replaces `global_id` and `workgroup_id` with `logical_global_id` and `logical_workgroup_id`, which are computed from `workgroup_idx` and the dispatch workgroup sizes set in `ProgramBase::SetDispatchGroupSize()`.

### Motivation and Context

We shouldn't use `global_id` or `workgroup_id` directly because the dispatch workgroup sizes may be normalized in `ProgramManager::NormalizeDispatchGroupSize()`.
…ive_number) in float32 (microsoft#26670)

### Description

The correct definition of the most negative number is `-3.40282346638528e+38`, according to IEEE 754, but it is being incorrectly registered inline as a truncated version, `-3.402823e+38f`.

```py
>>> import numpy as np
>>> np.finfo(np.float32).min
np.float32(-3.4028235e+38)
>>> np.finfo(np.float32).min.item()
-3.4028234663852886e+38
```

For this reason, values less than this threshold were handled incorrectly. While this may seem like a small/irrelevant detail, it's essential in attention masking, where we do in fact use this value, leading to large numerical errors down the line.

Reproduction:

```py
from onnx import helper, TensorProto
import onnxruntime as ort
import numpy as np

# 1. Create the ONNX model
# Define input and output
input_shape = [1, 2]
input_info = helper.make_tensor_value_info('X', TensorProto.FLOAT, input_shape)
output_info = helper.make_tensor_value_info('Y', TensorProto.FLOAT, input_shape)

# Create the Softmax node
# Softmax takes one input: X
softmax_node = helper.make_node(
    'Softmax',
    inputs=['X'],
    outputs=['Y'],
    name='SoftmaxNode',
    axis=-1  # Default axis is -1, usually applied to the last dimension
)

# Create the graph
graph_def = helper.make_graph(
    [softmax_node],
    'test-model',
    [input_info],
    [output_info]
)

# Create the model
model_def = helper.make_model(graph_def, producer_name='onnx-example')
opset = model_def.opset_import[0]
opset.version = 13  # Ensure opset version supports the operations

# 2. Convert model to string (bytes)
model_str = model_def.SerializeToString()

# 3. Prepare input data
np.random.seed(0)
input_data = np.array(
    [[-3.40282346638528e+38, -3.40282346638528e+38]]
    # [[-3.4028234663852886e+38, -3.4028234663852886e+38]]
).astype(np.float32)
print(input_data.tolist())

# 4. Run on CPUExecutionProvider
sess_cpu = ort.InferenceSession(model_str, providers=['CPUExecutionProvider'])
res_cpu = sess_cpu.run(['Y'], {'X': input_data})[0]
print("CPU Result:", res_cpu)

# 5. Run on WebGpuExecutionProvider
sess_webgpu = ort.InferenceSession(model_str, providers=['WebGpuExecutionProvider'])
res_webgpu = sess_webgpu.run(['Y'], {'X': input_data})[0]
print("WebGPU Result:", res_webgpu)

# Compare results
diff = np.abs(res_cpu - res_webgpu)
max_diff = diff.max().item()
print(diff)
print(f"Max diff: {max_diff}")
assert max_diff < 1e-5, f"Results do not match within tolerance! Max diff: {max_diff}"
print("Results match!")
```

Before:

```
[[-3.4028234663852886e+38, -3.4028234663852886e+38]]
CPU Result: [[0.5 0.5]]
WebGPU Result: [[0. 0.]]
[[0.5 0.5]]
Max diff: 0.5
AssertionError: Results do not match within tolerance! Max diff: 0.5
```

After:

```
[[-3.4028234663852886e+38, -3.4028234663852886e+38]]
CPU Result: [[0.5 0.5]]
WebGPU Result: [[0.5 0.5]]
[[0. 0.]]
Max diff: 0.0
Results match!
```

cc @guschmue
…Capability/IndexedSubGraph (microsoft#26444)

### Description

For TRT EP's `GetCapability()`, in some cases `GetSubGraph()` won't add the graph's output to the `ComputeCapability/IndexedSubGraph` returned to ORT. The issue comes from the following code:

```c++
...
if (node->GetOutputEdgesCount() > node->OutputDefs().size()) {
  ...  // executes here
} else {
  ...
  if (graph_output_names.find(output->Name()) != graph_output_names.end()) {
    graph_outputs_to_add[output] = output_order;  // missing this
  }
}
```

Update the TRT RTX EP as well.

### Motivation and Context

microsoft#25373
…rosoft#26697)

### Description

This is a follow-up to microsoft#25181 that removes ROCm EP related files to avoid confusion. Documents will be updated later.

### Motivation and Context

microsoft#26692
### Description

This PR optimizes the `Conv` operation by implementing two new compute shaders: `oihw_to_ohwi` and `im2col-matmul`.

`oihw_to_ohwi`: Improves performance over the default Transpose shader by utilizing workgroup memory to ensure contiguous memory read/write patterns.

`im2col-matmul`:
- Employs a workgroup size of 64.
- Dynamically selects tile sizes (32x64 or 16x64) based on the source/weight shape.
- Each invocation handles a dedicated weight element.
- Uses subgroupShuffle to efficiently access the source tile, leveraging k_vec4 vectorization for better memory throughput.

Testing on Lunar Lake demonstrated **up to an 87%** performance improvement in Conv_2D operations.

### Motivation and Context

See above.
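For context on the second shader's name: the im2col-matmul formulation rewrites a convolution as one large matrix multiply over unfolded input patches. The NumPy sketch below illustrates only the general idea (stride 1, no padding, single image); it is not the WGSL shader from this PR.

```py
# Illustrative im2col + matmul formulation of Conv2D (stride 1, no padding).
import numpy as np

def conv2d_im2col(x, w):
    # x: (C_in, H, W); w: (C_out, C_in, kH, kW)
    c_in, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    oh, ow = h - kh + 1, wd - kw + 1
    # im2col: unfold each receptive field into one column
    cols = np.empty((c_in * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + kh, j:j + kw].ravel()
    # the convolution then reduces to a single matrix multiply
    return (w.reshape(c_out, -1) @ cols).reshape(c_out, oh, ow)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)  # (4, 6, 6)
```

Casting the reduction as one large GEMM is what makes tiling and subgroup-level data sharing, as described above, effective.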
)

### Description

Calling an operator's `TypeAndShapeInferenceFunction()` alone is sometimes insufficient for complete shape inference. For example, the `Shape` operator only infers the output's rank (a 1-dimensional tensor) but not its actual dimension values. For instance, given an input of shape [1, 3, 64, 64], the `Shape` operator's `TypeAndShapeInferenceFunction()` produces a 1-dimensional output shape tensor of type int64[4], representing the rank of the input tensor.

Therefore, the output shape of the graph below can't be properly inferred (even though the input shape is known) because the shape data is lost at the `Shape` operator.

<img width="563" height="488" alt="image" src="https://github.com/user-attachments/assets/bfa9fd8f-5291-4c6d-a679-3ce4a8c48669" />

To solve the issue, the `PartialDataPropagationFunction()`, defined in the ONNX operator schema, must also be executed to obtain the concrete output shape values, allowing accurate propagation of shape information throughout the graph. This PR adds support for executing an operator's `PartialDataPropagationFunction()` in ORT and makes sure the shape values are properly propagated throughout the graph.

### Motivation and Context

When using the Compile API to generate an EPContext model, all graph optimizations are disabled by default except for free dimension overrides. However, for certain models, such as a VAE decoder, the output shape may still fail to be properly inferred even when free dimension override values are provided beforehand. You won't hit this issue if all graph optimizations are enabled, since some nodes, e.g. `Shape` and `Reshape`, will be constant folded.
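The gap is easy to see with onnx's Python shape inference, whose `data_prop` flag plays a role analogous to running `PartialDataPropagationFunction()` (a minimal sketch, not taken from this PR):

```py
# Minimal sketch: without data propagation, inferring shapes over a Shape node
# yields only the output's rank (int64[4]); the values [1, 3, 64, 64] are lost.
import onnx
from onnx import helper, TensorProto

x = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 3, 64, 64])
y = helper.make_tensor_value_info("Y", TensorProto.INT64, None)
shape_node = helper.make_node("Shape", ["X"], ["Y"])
model = helper.make_model(helper.make_graph([shape_node], "g", [x], [y]))

inferred = onnx.shape_inference.infer_shapes(model, data_prop=False)
# Output is known to be a 1-D int64 tensor of length 4 -- the rank -- but any
# downstream Reshape consuming Y still cannot recover the concrete dimensions.
print(inferred.graph.output[0].type.tensor_type.shape)
```

With `data_prop=True`, onnx propagates the concrete values through the `Shape` node; this PR brings the equivalent propagation into ORT itself.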
### Description
The onnxruntime-github-Ubuntu2204-AMD-CPU pool has been running out of resources.
Based on the documentation for self-hosted pools, they require a unique
identifier set to: `JobId=<uniquejobprefix>-${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}`
This PR adds the above JobId to all instances of self-hosted pools.
### Motivation and Context
We are seeing long queue times on the self-hosted pools. When we reached out
to the pool owners, they recommended adding a JobId across all self-hosted
instances.
…latform-specific checks (microsoft#26668)

### Description

Add an `MlasIsDynamicQGemmAvailable()` helper function and use it in place of platform-specific checks.

### Motivation and Context

Try to reduce platform-specific code.
Update SECURITY.md as requested by MSRC, since the team no longer receives security research submissions through email.
…t#26724)

Co-authored-by: wp <webgraphics@intel.com>
### Description

Adding a SUPPORT.md file to clarify the Microsoft support policy for the project.
…osoft#26706) @yuslepukhin Am I missing something, or is there some kind of bridge missing to safely convert the OrtSyncStream pointer?
…icrosoft#26725)

### Description

This patch adds support for `float16x4` and `float32x4` in `ValidateVariableDataType()`, as we may declare them as atomic variables when Split-K is used.

### Motivation and Context

Previously we failed to discover this issue because `ValidateVariableDataType()` is only used in the Debug build.
…26679)

### Description

Get data from `bias` and `output` with `GetByOffset()` instead of accessing the arrays directly in `gemm_utils.cc`.

### Motivation and Context

When the input `bias` or the output is very large, both may be split into multiple buffers, so we must query the data with `GetByOffset()` or `GetByIndices()`, which implement the logic to read from the correct split buffer.
…soft#26721)

This PR moves the in-memory conversion of initializers from the Graph constructor to early in graph transforms, before partitioning. This is done to avoid conversion when subgraphs are constructed. It also addresses bugs in the TRT and NV TRT providers.

Addresses issue: microsoft#26653

**Graph Initializer Conversion and Handling:**

* Added a new method `Graph::ConvertInitializersIntoOrtValues()` to convert all graph TensorProto initializers into OrtValues and create in-memory external data references, separating this logic from graph construction and making it reusable. (`include/onnxruntime/core/graph/graph.h`, `onnxruntime/core/graph/graph.cc`) [[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1457-R1463) [[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cR3416-R3447)
* Removed the previous lambda for converting large tensor initializers within the graph constructor, delegating this responsibility to the new method above for a clearer separation of concerns. (`onnxruntime/core/graph/graph.cc`) [[1]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cL1234-L1255) [[2]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cL1275-L1276) [[3]](diffhunk://#diff-e231a92b40d89409cc8e82436be0a15bc87ef95c93b303b9feaeab6e50c8835cL1353-R1327)

**Provider Interface Enhancements:**

* Introduced move assignment operators for `GraphProto` and `TensorProto` in both the provider interface (`ProviderHost`) and wrapper structs, allowing for more efficient object transfers and assignment. (`onnxruntime/core/providers/shared_library/provider_interfaces.h`, `onnxruntime/core/providers/shared_library/provider_wrappedtypes.h`) [[1]](diffhunk://#diff-d62681d5e83139cfbc272f32afc4ff897dbfd84a709f02a932666e18240fa094L442-R457) [[2]](diffhunk://#diff-d62681d5e83139cfbc272f32afc4ff897dbfd84a709f02a932666e18240fa094L495-R511) [[3]](diffhunk://#diff-bf62a34e53927025e7a7bcf7f294532a366ec4ee069bbe541fcdc87e3b1eaa8fL178-R179) [[4]](diffhunk://#diff-bf62a34e53927025e7a7bcf7f294532a366ec4ee069bbe541fcdc87e3b1eaa8fL244-R248)
* Added iterator interfaces (`TensorProto_ConstIterator`, `TensorProto_Iterator`) and corresponding methods to `TensorProtos` for clean iteration over initializer lists, improving code readability and maintainability. (`onnxruntime/core/providers/shared_library/provider_interfaces.h`, `onnxruntime/core/providers/shared_library/provider_wrappedtypes.h`) [[1]](diffhunk://#diff-d62681d5e83139cfbc272f32afc4ff897dbfd84a709f02a932666e18240fa094L73-R93) [[2]](diffhunk://#diff-d62681d5e83139cfbc272f32afc4ff897dbfd84a709f02a932666e18240fa094L524-R545) [[3]](diffhunk://#diff-bf62a34e53927025e7a7bcf7f294532a366ec4ee069bbe541fcdc87e3b1eaa8fL286-R295)

**Execution Provider Logic Simplification:**

* Refactored how initializers are processed in the NVExecutionProvider, using the new initializer conversion and iteration logic to simplify handling of external and in-memory data, and ensuring correct assignment and ownership of user-provided weights. (`onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.cc`) [[1]](diffhunk://#diff-b7114b8cae911bdd2c3523a09019f9a9b9f9d7cce4fdd50b282603c81a6137aaL1657-R1658) [[2]](diffhunk://#diff-b7114b8cae911bdd2c3523a09019f9a9b9f9d7cce4fdd50b282603c81a6137aaR1709-R1733) [[3]](diffhunk://#diff-b7114b8cae911bdd2c3523a09019f9a9b9f9d7cce4fdd50b282603c81a6137aaR2558-R2587)

**Other Minor Improvements:**

* Improved const-correctness and interface consistency for size and iterator methods in `TensorProtos`. (`onnxruntime/core/providers/shared_library/provider_interfaces.h`, `onnxruntime/core/providers/shared_library/provider_wrappedtypes.h`) [[1]](diffhunk://#diff-d62681d5e83139cfbc272f32afc4ff897dbfd84a709f02a932666e18240fa094L524-R545) [[2]](diffhunk://#diff-bf62a34e53927025e7a7bcf7f294532a366ec4ee069bbe541fcdc87e3b1eaa8fL286-R295)
## [VitisAI] Add External EP Loader

### Description

This PR introduces a dynamic external execution provider loading mechanism for the VitisAI execution provider, enabling runtime loading of alternative execution providers through a plugin-style architecture.

### Key Changes

#### 1. **New External EP Library Infrastructure** (`global_api.cc`)

- Added an `ExternalEpLibaray` class to dynamically load external execution provider libraries at runtime
- Implemented complete library lifecycle management (loading, unloading, symbol resolution)
- Added a global registry (`g_external_ep_libaries`) with caching to avoid redundant library loading
- Created a `CreateExecutionProviderFromAnotherEp()` function to instantiate execution providers from external libraries

**Implementation Details:**

- **Simplified symbol resolution**: Only resolves the essential `GetProvider` symbol (required)
- **Removed optional symbols**: No longer attempts to resolve `CreateEpFactories` or `RyzenAI_SetSessionOptions`
- Lazy initialization pattern with an `Ensure()` method
- Safe cleanup with a `Clear()` method and proper error handling
- Platform-agnostic library loading using the `LIBRARY_PREFIX` and `LIBRARY_EXTENSION` macros

#### 2. **API Extension** (`global_api.h`)

- Declared a new public function: `CreateExecutionProviderFromAnotherEp()`
- Added required includes:
  - `core/framework/execution_provider.h` for the `IExecutionProvider` interface
  - `<memory>` for smart pointer support

#### 3. **Factory Integration** (`vitisai_provider_factory.cc`)

- Integrated external EP loading into the VitisAI provider factory workflow
- Added a provider option check for the `external_ep_libray` key
- **Logic Flow**:
  1. Check if the `external_ep_libray` option is specified
  2. If yes, load and return the external execution provider
  3. If no, create and return the standard VitisAI execution provider

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
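For illustration, a hypothetical session setup using the new option might look like this (the option key `external_ep_libray` is taken verbatim from this PR, including its spelling; the model and library paths are placeholders, not from the PR):

```py
# Hypothetical usage sketch: route VitisAI to an external EP library at
# session creation time. Paths below are placeholders.
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("VitisAIExecutionProvider", {"external_ep_libray": "/opt/eps/libexternal_ep.so"}),
    ],
)
```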
### Description

This pull request updates the OpenVINO version used in the internal CI pipelines to OpenVINO 2025.3.0.

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
This PR adds an initial set of C APIs necessary to support kernel
registration for plugin EPs.
### Example use
The example plugin EP implementation now registers `MemcpyFromHost` and
`MemcpyToHost` operator kernels using the new APIs. New utilities in the
example implementation make the process of defining operator kernels
very similar to the existing process used by provider-bridge EPs.
First, the operator kernel class is defined:
```c++
// File: onnxruntime/test/autoep/library/kernels/memcpy.h
struct Memcpy : public OrtKernelImpl {
  static OrtStatus* Create(const OrtKernelInfo* info, void* state, /*out*/ std::unique_ptr<Memcpy>& kernel);

  Memcpy(const OrtKernelInfo* info, void* state);

  static OrtStatus* ORT_API_CALL ComputeImpl(OrtKernelImpl* this_ptr, OrtKernelContext* kernel_ctx) noexcept;
  static void ORT_API_CALL ReleaseImpl(OrtKernelImpl* this_ptr) noexcept;

  OrtStatus* DoCompute(OrtKernelContext* kernel_ctx) noexcept;

 private:
  const OrtKernelInfo* info_;
  void* state_;  // Custom state passed from OrtEp
};
```
Then, a macro defines a function that can be called to register the
operator with the EP's kernel registry:
```c++
// File: onnxruntime/test/autoep/library/kernels/memcpy.cc
ONNX_OPERATOR_KERNEL_EX(
    MemcpyFromHost,
    kOnnxDomain,
    1,
    (Ort::KernelDefBuilder()
         .SetInputMemType(0, OrtMemType::OrtMemTypeCPUInput)
         .AddTypeConstraint("T", MLDataTypes::GetTensorType(ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT))),
    Memcpy)

ONNX_OPERATOR_KERNEL_EX(
    MemcpyToHost,
    kOnnxDomain,
    1,
    (Ort::KernelDefBuilder()
         .SetOutputMemType(0, OrtMemType::OrtMemTypeCPUOutput)
         .AddTypeConstraint("T", MLDataTypes::GetTensorType(ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT))),
    Memcpy)
```
Lastly, the functions defined by the above macro are entered into a
table:
```c++
// File: onnxruntime/test/autoep/library/ep_kernel_registration.cc

// Include kernel files:
#include "kernels/memcpy.h"

// Forward declarations of kernel classes used as template args for BuildKernelCreateInfo
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kOnnxDomain, 1, MemcpyFromHost);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kOnnxDomain, 1, MemcpyToHost);

// Table of BuildKernelCreateInfo functions for each operator
static const BuildKernelCreateInfoFn build_kernel_create_info_funcs[] = {
    BuildKernelCreateInfo<void>,  // Dummy to avoid table becoming empty.
    BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kOnnxDomain, 1, MemcpyFromHost)>,
    BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kOnnxDomain, 1, MemcpyToHost)>,
};
```
The [example EP processes the entries in the above
table](https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi-kernel-based-eps/onnxruntime/test/autoep/library/ep_kernel_registration.cc)
to add information about the supported operator kernels to the EP's
kernel registry (`OrtKernelRegistry`).
Additionally, during the call to `OrtEp::GetCapability`, an EP can now
look up registered kernel definitions via the new API
`EpGraphSupportInfo_LookUpKernel`. Note that an EP would not normally
look up kernels for `Memcpy**Host`, which are inserted by ORT. Instead,
the API would be used to look up other registered operator kernels,
such as `Conv`.
```c++
static OrtStatus* ORT_API_CALL GetCapabilityImpl(OrtEp* this_ptr, const OrtGraph* graph,
                                                 OrtEpGraphSupportInfo* graph_support_info) noexcept {
  // ...
  for (const OrtNode* node : nodes) {
    const OrtKernelDef* kernel_def = nullptr;
    OrtStatus* status = this_ep->ep_api->EpGraphSupportInfo_LookUpKernel(graph_support_info, node, &kernel_def);
    if (status != nullptr) {
      return status;
    }

    if (kernel_def != nullptr) {  // Take node if this EP has a registered kernel for it.
      if (OrtStatus* st = this_ep->ep_api->EpGraphSupportInfo_AddSingleNode(graph_support_info, node);
          st != nullptr) {
        return st;
      }
    }
  }

  return nullptr;
}
```
### EP implementation details
An EP instance (i.e., `OrtEp`) that needs to register operator kernels
with ONNX Runtime must implement the following
`OrtEp::GetKernelRegistry()` function:
| Function Signature | Description |
|--------------------|-------------|
|**GetKernelRegistry**<br/><br/>**Returns**:`OrtStatus*`<br/><br/>**Parameters:**<br/><ul><li>`OrtEp* this_ptr`: The OrtEp instance.</li><li>`const OrtKernelRegistry** kernel_registry`: Output parameter set to the EP's kernel registry, which must remain valid throughout the lifetime of the EP.</li></ul>| Gets the execution provider's kernel registry, if any.<br/><br/>**Remarks:** A kernel registry contains kernel creation information for operator kernels supported by an EP.<br/><br/>**Note:** Implementation of this function is optional. If set to NULL, ORT assumes the EP compiles nodes. |
If defined by the EP, the `OrtEp::GetKernelRegistry()` function is
[called by ONNX
Runtime](https://github.com/microsoft/onnxruntime/blob/0f7145f3809103c123de2d281a6b310677e6d56c/onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc#L146-L147)
after creating an instance of the `OrtEp` in order to retrieve the EP's
kernel registry.
#### APIs used by EP to add entries to kernel registry
An EP's kernel registry (`OrtKernelRegistry`) contains **information**
necessary for the (later) creation of operator kernels supported by an
EP. Conceptually, a kernel registry contains an array of "kernel
creation information" elements, one per operator. Each such element
consists of:
- A kernel **definition** (`OrtKernelDef`), which specifies operator
type, supported versions, type constraints, I/O memory types, etc.
- A function of type `OrtKernelCreateFunc` that ORT calls to create an
instance of the kernel (`OrtKernelImpl`).
- Custom opaque state (provided by the `OrtEp`) that is passed to the
`OrtKernelCreateFunc`.
An EP uses the following `OrtEpApi::KernelRegistry_AddKernel()` function
to add an entry for one supported operator.
| Function Signature | Description |
|--------------------|-------------|
|**KernelRegistry_AddKernel**<br/><br/>**Returns**:`OrtStatus*`<br/><br/>**Parameters:**<br/><ul><li>`OrtKernelRegistry* kernel_registry`: The OrtKernelRegistry instance.</li><li>`const OrtKernelDef* kernel_def`: The kernel definition, which includes operator type, version, EP name, type constraints, etc.</li><li>`OrtKernelCreateFunc kernel_create_func`: Function that creates an instance of the operator kernel as an OrtKernelImpl instance.</li><li>`void* kernel_create_func_state`: Custom state passed to the kernel creation function. Can be null.</li></ul>| Adds kernel creation information for a supported operator kernel to the given kernel registry.<br/><br/>**Remarks:** Refer to OrtEp::GetKernelRegistry, which returns an EP's kernel registry to ORT. |
##### Building a kernel definition
An EP uses a kernel definition builder (`OrtKernelDefBuilder`) to create
a kernel definition (`OrtKernelDef`). The following table lists **some**
of the C APIs related to building a kernel definition. The above
`ONNX_OPERATOR_KERNEL_EX` macro [uses these
APIs](https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi-kernel-based-eps/onnxruntime/test/autoep/library/kernels/utils.h#L42).
| Function Signature | Description |
|--------------------|-------------|
|**KernelDefBuilder_SetOperatorType**<br/><br/>**Returns**:`OrtStatus*`<br/><br/>**Parameters:**<br/><ul><li>`OrtKernelDefBuilder* kernel_def_builder`: The OrtKernelDefBuilder instance.</li><li>`const char* op_type`: A null-terminated string representing the operator type.</li></ul>| Sets the kernel's operator type. |
|**KernelDefBuilder_SetDomain**<br/><br/>**Returns**:`OrtStatus*`<br/><br/>**Parameters:**<br/><ul><li>`OrtKernelDefBuilder* kernel_def_builder`: The OrtKernelDefBuilder instance.</li><li>`const char* domain`: A null-terminated string representing the operator's domain.</li></ul>| Sets the kernel's domain. |
| ... | ... |
|**KernelDefBuilder_Build**<br/><br/>**Returns**:`OrtStatus*`<br/><br/>**Parameters:**<br/><ul><li>`OrtKernelDefBuilder* kernel_def_builder`: The OrtKernelDefBuilder instance.</li><li>`OrtKernelDef** kernel_def_out`: The new OrtKernelDef instance.</li></ul>| Creates an OrtKernelDef instance from the given kernel definition builder. |
##### Defining a kernel implementation
An EP defines a kernel implementation by initializing an instance of
`OrtKernelImpl` (shown below) with function pointers for computation,
release, etc.
```c++
struct OrtKernelImpl {
  uint32_t ort_version_supported;  ///< Must be initialized to ORT_API_VERSION

  /** \brief Computation function called to execute the kernel on an EP.
   *
   * \param[in] this_ptr The OrtKernelImpl instance.
   * \param[in] context The OrtKernelContext instance that provides access to the inputs and outputs.
   *
   * \snippet{doc} snippets.dox OrtStatus Return Value
   *
   * \since Version 1.24.
   */
  ORT_API2_STATUS(Compute, _In_ OrtKernelImpl* this_ptr, _In_ OrtKernelContext* context);

  /** \brief Called by ORT to release the OrtKernelImpl instance and its resources.
   *
   * \param[in] this_ptr The OrtKernelImpl instance.
   *
   * \since Version 1.24.
   */
  ORT_API_T(void, Release, _In_ OrtKernelImpl* this_ptr);
};
```
As shown previously, the example EP creates a `Memcpy` class that
inherits from `OrtKernelImpl` and [implements the above
functions](https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi-kernel-based-eps/onnxruntime/test/autoep/library/kernels/memcpy.cc).
##### Defining a kernel creation function
An EP must provide a function of type `OrtKernelCreateFunc` that ORT can
later call to create an instance of a kernel (`OrtKernelImpl`). The
signature of the `OrtKernelCreateFunc` is shown below.
```c++
/** \brief Type definition for a function that creates an OrtKernelImpl instance for an operator kernel.
 *
 * \param[in] ctx Unused/reserved for future use.
 * \param[in] kernel_create_func_state Opaque state initially provided by the EP that registered the kernel.
 *                                     Refer to OrtEpApi::KernelRegistry_AddKernel(). May be null.
 * \param[in] info The OrtKernelInfo instance that provides access to the kernel's input and output characteristics.
 * \param[out] kernel_out Output parameter set to the new OrtKernelImpl instance.
 *
 * \snippet{doc} snippets.dox OrtStatus Return Value
 *
 * \since Version 1.24.
 */
typedef OrtStatus*(ORT_API_CALL* OrtKernelCreateFunc)(_In_ OrtKernelCreateContext* ctx,  // unused/reserved as of 1.24
                                                      _In_ void* kernel_create_func_state,
                                                      _In_ const OrtKernelInfo* info,
                                                      _Outptr_result_maybenull_ OrtKernelImpl** kernel_out);
```
The example EP declares kernel creation functions via use of the
previously mentioned `ONNX_OPERATOR_KERNEL_EX`
[macro](https://github.com/microsoft/onnxruntime/blob/adrianl/ep-abi-kernel-based-eps/onnxruntime/test/autoep/library/kernels/utils.h#L56-L64).
If one were to expand the macro call, the kernel creation function for
`MemcpyFromHost` would look similar to the following snippet:
```c++
OrtStatus* ORT_API_CALL CreateMemcpyKernel(OrtKernelCreateContext* /*ctx*/, void* kernel_create_func_state,
                                           const OrtKernelInfo* info, OrtKernelImpl** kernel_out) {
  *kernel_out = nullptr;

  std::unique_ptr<Memcpy> kernel;
  RETURN_IF_ERROR(Memcpy::Create(info, kernel_create_func_state, kernel));

  *kernel_out = kernel.release();
  return nullptr;
}
```
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…oft#26523)

### Description

Conv2D has supported per-channel uint8 quantized weights since QNN SDK 2.36. This PR updates outdated comments related to signed quantization checks for Conv. The check itself was removed in microsoft#25986, which has been merged.

### Motivation and Context

microsoft#25986

Co-authored-by: -qti <@qti.qualcomm.com>
### Description

- Moved `ReplaceUpsampleWithResize` in the quantization preprocessing pipeline to occur before `SymbolicShapeInference`, due to the current lack of shape inference support for the `Upsample` operator.
- Prevented unnecessary modifications to `model.opset_import`. Setting `model.opset_import` prior to invoking `onnx.version_converter` can interfere with successful opset conversion.

### Motivation and Context

- This change ensures the quantization preprocessing functions correctly by addressing limitations in shape inference for `Upsample`.
- It also avoids potential issues with opset conversion caused by premature modification of `model.opset_import`.
### Description
Remove `USE_ROCM` and ROCm EP related code. Public APIs (C/C++ APIs and corresponding C# native methods) are not touched.

### Motivation and Context
Follow-up on microsoft#25181 and microsoft#26697.
…/nextjs-default (microsoft#26719)

Bumps [next](https://github.com/vercel/next.js) from 15.4.7 to 15.4.8.

<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/vercel/next.js/releases">next's releases</a>.</em></p>
<blockquote>
<h2>v15.4.8</h2>
<p>Please see <a href="https://nextjs.org/blog/CVE-2025-66478">CVE-2025-66478</a> for additional details about this release.</p>
</blockquote>
</details>

<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/vercel/next.js/commit/49668475daba15ef8cea1d8e469dc0f9a765b635"><code>4966847</code></a> v15.4.8</li>
<li><a href="https://github.com/vercel/next.js/commit/bf8d31c89caf0fc18efe91fb2dc3463fc03795c0"><code>bf8d31c</code></a> update version script</li>
<li><a href="https://github.com/vercel/next.js/commit/bed530f7294241b9f92aa2ee5abc50a92e97b7fe"><code>bed530f</code></a> Update React Version for Next.js 15.4.8 (<a href="https://redirect.github.com/vercel/next.js/issues/9">#9</a>)</li>
<li><a href="https://github.com/vercel/next.js/commit/4309d936b36e5fbdfdc9ee743dd9161c26e7220f"><code>4309d93</code></a> update tag</li>
<li><a href="https://github.com/vercel/next.js/commit/17e6873ee8320bd6bfa8f35c3d7769c0e08e1ebf"><code>17e6873</code></a> [backport]: <code>experimental.middlewareClientMaxBodySize</code> (<a href="https://redirect.github.com/vercel/next.js/issues/84722">#84722</a>)</li>
<li><a href="https://github.com/vercel/next.js/commit/4da39f22c51e2653ad74a91f0a0d60de9916e2ec"><code>4da39f2</code></a> [backport] fix: unstable_cache should perform blocking revalidation during IS...</li>
<li>See full diff in <a href="https://github.com/vercel/next.js/compare/v15.4.7...v15.4.8">compare view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Fix issue microsoft#26692 by upgrading the base image.
…stream (microsoft#26542)

### Description
- Don't register the set-device function when an existing stream is used.
- Fix a bug in `nv_execution_provider.cc`: set the device only if the user did not provide an existing stream.

### Motivation and Context
In some use cases, we push a user-generated CUDA context, create streams using this context, and then provide these streams to TRT-RTX. However, we noticed that after calling `Run()`, the custom context is replaced by another CUDA context created by ORT, meaning TRT-RTX no longer uses the original CUDA context. Further investigation showed that the new context is created in `onnxruntime/core/framework/stream_execution_context.cc`. The proposed solution is to not register the set-device function when the stream is provided. There is also a bug in `onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.cc`: the device should be set only if the user has not provided a stream (consistent with the original comment).
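A minimal sketch of the proposed guard, with hypothetical names (the actual EP code differs); the point is that the device is only selected, and thus a CUDA context only made current, when ORT owns the stream:
```c++
#include <cuda_runtime_api.h>

// Hypothetical illustration, not the actual ORT/TRT-RTX code.
// If the caller supplied its own CUDA stream, its context must stay current,
// so the EP must not switch devices (and implicitly create a context) itself.
void ConfigureDeviceForRun(cudaStream_t user_stream, int device_id) {
  if (user_stream == nullptr) {
    // ORT created the stream itself, so selecting the device here is safe.
    cudaSetDevice(device_id);
  }
  // When user_stream != nullptr we deliberately skip cudaSetDevice:
  // that call is what replaced the user's context after Run().
}
```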
### Description
Allow specifying auxiliary streams for the TensorRT RTX EP.

### Motivation and Context
In some use cases, we want full control over all the streams used by TRT-RTX, including auxiliary ones.
…osoft#26671)

### Description
Adds the missing `#include "test/common/cuda_op_test_utils.h"` to `conv_fp16_test.cc`. The `ConvBF16Test_Conv2D_1` test calls `CudaHasBF16Support()`, but the header declaring it was not included.

### Motivation and Context
Fixes a Linux CPU packaging pipeline build failure:
```
/onnxruntime_src/onnxruntime/test/providers/cpu/nn/conv_fp16_test.cc:281:8: error: 'CudaHasBF16Support' was not declared in this scope
  281 |   if (!CudaHasBF16Support()) {
      |        ^~~~~~~~~~~~~~~~~~
```
Other test files using this function (`pool_fp16_op_test.cc`, `element_wise_ops_test.cc`, `skiplayernorm_op_test.cc`) already include this header.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
…rosoft#26715)

This pull request improves the WebGPU BERT attention implementation by enhancing FlashAttention support, generalizing tensor layout handling, and increasing batch size flexibility. The changes focus on supporting both BSNH and BNSH tensor layouts, enabling FlashAttention for multi-batch scenarios, and ensuring correct broadcasting and dispatch sizing for attention bias and batch dimensions.

Key improvements include:

**FlashAttention Support & Generalization:**
* Added support for both BSNH and BNSH tensor layouts by introducing the `q_BNSH` parameter and updating shader code, program classes, and kernel logic to handle either layout correctly, including the WGSL template and the C++ logic for offset calculations and program instantiation (a hedged sketch of this offset math follows the profiling tables below).
* Updated the `CanApplyFlashAttention` and `ApplyFlashAttention` logic to allow multi-batch operation by removing the restriction to batch size 1 and ensuring present key/value tensors are always created for FlashAttention.

**Batch & Bias Handling:**
* Modified dispatch group size calculations and uniform variables throughout the FlashAttention pipeline to properly account for batch size, ensuring correct parallelization for multi-batch scenarios.
* Added logic to extract and pass attention bias dimensions as uniforms for correct broadcasting in both the compute and shader code.

**Other Enhancements:**
* Improved handling of QKV format detection and generalized code to support more format variants in `CopyKVCache`.
* Updated includes and dependencies to ensure all necessary headers for FlashAttention are present.

These changes collectively make the WebGPU BERT attention implementation more robust, flexible, and performant across different tensor layouts and batch sizes.

Profiling `phi-4-mm-vision.onnx`:

Before

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|AttentionProbs | 159.66 | 11.14 |
| Attention\|VxAttentionScore | 122.56 | 8.55 |
| Attention\|InPlaceSoftmax | 51.83 | 3.62 |

After

| Kernel | Time (ms) | Percentage (%) |
| -- | -- | -- |
| Attention\|FlashAttention | 60.23 | 5.38 |
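As a hedged illustration of the layout handling (names hypothetical; the real kernel does this in WGSL and C++ with different structure): BSNH stores Q as `[batch, seq_len, num_heads, head_size]` while BNSH stores it as `[batch, num_heads, seq_len, head_size]`, so the flat element offset for a given (batch, head, token) differs between the two:
```c++
#include <cstddef>

// Hypothetical sketch of the layout-dependent offset math.
// BSNH: [batch, seq_len, num_heads, head_size]
// BNSH: [batch, num_heads, seq_len, head_size]
std::size_t QElementOffset(bool q_BNSH, std::size_t b, std::size_t n, std::size_t s,
                           std::size_t seq_len, std::size_t num_heads,
                           std::size_t head_size) {
  return q_BNSH ? ((b * num_heads + n) * seq_len + s) * head_size   // BNSH
                : ((b * seq_len + s) * num_heads + n) * head_size;  // BSNH
}
```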
ankitm3k approved these changes on Dec 9, 2025.
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.