Conversation

@jatinwadhwa921
Backmerging with Msft commits

kevinch-nv and others added 8 commits July 4, 2025 08:20
…icrosoft#25263)

### Description
Adds an `include_initializer_data` option to `GraphViewerToProto` to skip
writing initializer raw data and external data when serializing.

### Motivation and Context
For TensorRT EP, partitioned graphs must be serialized to proto in order
for getCapability() to run. For cases where the weights are not strictly
needed (i.e. weightless engines), serializing the graph without
initializer data reduces the overall memory required.

---------

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
This pull request includes a wide range of feature updates,
optimizations, and bug fixes aimed at improving performance, memory
efficiency, dynamic shaped model support, ORT GenAI support for
generative models (LLMs/SLMs), and the overall stability of the OpenVINO
Execution Provider (OVEP).

### Key Enhancements

- Dynamic Shaped Model Support:
Added support for inferencing dynamic shaped models via the
`reshape_input` provider option (see the sketch below).
Enabled workload type handling for dynamic-shaped models.
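
For illustration, a minimal sketch of creating a session with `reshape_input`; the input name and shape string here are assumptions, not taken from this PR:

```
import onnxruntime as ort

# Pin the model's dynamic input to a concrete shape; the name and
# shape string are illustrative.
session = ort.InferenceSession(
    "model.onnx",
    providers=[("OpenVINOExecutionProvider", {"reshape_input": "input_ids[1,128]"})],
)
```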

- Performance Optimizations:
Reduced peak memory usage by optimizing fallback logic and model proto
handling.
Improved CPU inference path efficiency.
Removed unintended model copies during compilation.

- ORT GenAI Feature Pass:
[ORT GenAI](https://github.com/microsoft/onnxruntime-genai) is now
supported with the OpenVINO EP by setting the `enable_causallm` provider
option to `True` (see the sketch below).
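
A similar sketch for enabling the GenAI path (the model path is illustrative):

```
import onnxruntime as ort

# Opt in to the causal-LM (ORT GenAI) path in the OpenVINO EP.
session = ort.InferenceSession(
    "decoder_model.onnx",
    providers=[("OpenVINOExecutionProvider", {"enable_causallm": "True"})],
)
```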

- EPContext OVIR Encapsulation Feature:
ORT now supports EPContext models with OVIR (i.e. model.xml & model.bin)
stored in the `ep_cache_context` attribute, covering compilation,
inference, and pre-compiled cached blob support.

- Quantization Enhancements:
Enabled the QDQ stripping path using adaptive stripping.
Enabled QDQ channel-wise quantization for Intel NPU-friendly
quantization via `MatMul4BitsQuantizer` / `DefaultWeightOnlyQuantConfig`,
with the `channel_wised_quantize` option set to `True`:

```
import onnx
from onnxruntime.quantization import matmul_nbits_quantizer, quant_utils

# Define quantization configuration and process the model
model = onnx.load("model.onnx")
quant_config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
    block_size=128, is_symmetric=True, quant_format=quant_utils.QuantFormat.QDQ, channel_wised_quantize=True)
quantizer = matmul_nbits_quantizer.MatMul4BitsQuantizer(model, algo_config=quant_config)
quantizer.process()
```

- Operator & Backend Improvements:
Added support for the HardSwish operator.
Fixed logic for unsupported op modes and improved precision accuracy.

- Bug Fixes:
Fixed metadata naming and file path validation.
Addressed device selection issues and provider key verification.
Resolved deprecated OV element types and LUID check issues.

---------

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com>
Co-authored-by: co63oc <co63oc@users.noreply.github.com>
Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Alessio Soldano <services@soldano.it>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: wp <webgraphics@intel.com>
Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com>
Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…crosoft#25272)

### Description

Add OrtEpFactory::GetVersion and store EP version in EP metadata.

### Motivation and Context

Enforce plugin EP version specification and make it accessible from EP
metadata.

---------

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
…djusting weight scale and requantizing (microsoft#25278)

### Overview

This PR introduces a critical fix for **QOperator INT8 symmetric
quantization** in ONNX Runtime. It addresses a situation where the
computed **bias scale** (`input_scale * weight_scale`) becomes too
small, leading to **int32 overflow** or **precision clipping** during
bias quantization.

### Problem

In symmetric quantization (i.e., zero_point = 0), the bias tensor is
quantized using a fixed-point scale:
**bias_scale = input_scale * weight_scale**


When this value is too small, the quantized int32 bias may exceed the
range of `int32`, causing saturation or significant quantization error.
For example, with `input_scale = 1e-4` and `weight_scale = 1e-5`, a float
bias of 10.0 quantizes to 10.0 / 1e-9 = 1e10, far beyond the int32
maximum of about 2.1e9. This was observed to cause **>51% accuracy
loss** in some models.

### Solution

This PR adds two new functions to mitigate this:

---

#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`

Located in `onnx_quantizer.py`, this function (sketched after this list):

- **Inspects the float bias range** to compute the smallest valid bias
scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If too small, **scales up the weight scale** accordingly, to prevent
overflow
- Supports both per-tensor and per-channel weight quantization cases

This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`
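
A minimal sketch of the adjustment for the per-tensor case; the function body is illustrative, not the literal implementation:

```
import numpy as np

def adjust_weight_scale_for_int32_bias(bias_float, input_scale, weight_scale):
    # The bias scale must be large enough that round(bias / bias_scale)
    # stays within the int32 range.
    int32_max = np.iinfo(np.int32).max
    min_bias_scale = np.max(np.abs(bias_float)) / int32_max
    if input_scale * weight_scale < min_bias_scale:
        # Scale up the weight scale so that bias_scale = input_scale * weight_scale
        # no longer overflows int32.
        weight_scale = min_bias_scale / input_scale
    return weight_scale
```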

---

#### 🔄 `_requantize_weight(...)`

After weight scale adjustment, this function (sketched after this list):
- **Finds the original quantized weight** (`q_weight`), scale, and zero
point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and
the same zero point
- **Re-inserts** them into the model to maintain consistency
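
And a matching sketch of the re-quantization step, assuming symmetric int8 weights (zero point 0); the names are again illustrative:

```
import numpy as np

def requantize_weight(weight_float, new_scale, zero_point=0):
    # Re-quantize the original float weights with the adjusted scale,
    # keeping the same zero point.
    q = np.round(weight_float / new_scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)
```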

---

### Summary of Benefits

- ✅ Prevents int32 overflow or saturation during symmetric bias
quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to QOperator + symmetric INT8/INT16 flow
(safe for other modes)
- ✅ Improves robustness of static quantization for hardware that
performs integer-only inference

---

### Code Location

- `onnxruntime/quantization/onnx_quantizer.py`
  - `def _adjust_weight_scale_for_int32_bias(...)`
  - `def _requantize_weight(...)`
  - Integrated in `quantize_bias_static(...)`

---

Please let me know if you'd like additional test coverage or integration
points. Thanks!
This PR enables graph capture capabilities in the WebGPU provider,
similar to the JSEP one (microsoft#18989).

All limitations are similar to the JS/CUDA EPs:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models where all ops in the
model can be partitioned to the WebGPU EP or CPU EP, with no memory
copies between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IOBinding is required, and all inputs/outputs are pre-allocated GPU
buffers.

When users use the graph capture feature, we expect them to do some
pre-processing and post-processing of the inference inputs and outputs
so that the whole pipeline stays on the GPU, avoiding unnecessary
CPU-to-GPU and GPU-to-CPU copies. The usage will look like the following:
```
// Initialize Dawn
{
  // 1. Create Dawn instance
  ...
  instance_ = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create device from adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";
// add WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// create webgpu session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));
// Create IoBinding objects
Ort::IoBinding webgpu_binding(*webgpu_session_);
// Upload cpu data to input_buffer or copy gpu buffer to input_buffer
...
// Create an OrtValue tensor backed by data on gpu memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()), input_tensor_size_,
                input_dims_.data(), input_dims_.size());

Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()), output_tensor_size_,
                output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // normal run + capturing
...
// post process output_buffer's content
...
// Update input_buffer's content
...

// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // replay()
...
// post process output_buffer's content
...
```
MatMulNBits is a decomposed op in the WebNN EP. Previously, we shared
the WebNN constant for zero_points when they had the same value and data
type. However, this added a lot of complexity for developers fusing it
back into MatMulNBits in the underlying WebNN implementation in Chromium.

In this PR, we always create a new constant for zero_points.
…ft#25305)

### Description

1. Rename `SessionState` to `GraphCaptureState`, since ORT already has
a `SessionState` type.
2. Optimize the implementation of `ComputeContext::BufferManager()`.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k July 7, 2025 12:14
@ankitm3k ankitm3k merged commit 66eceb9 into ovep-develop Jul 7, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_7_7_25 branch July 7, 2025 13:41