forked from microsoft/onnxruntime
Backmerging with Msft commits #731
Merged
Conversation
…icrosoft#25263)

### Description
Adds `include_initializer_data` option to `GraphViewerToProto` to skip writing initializer raw data and external data when serializing.

### Motivation and Context
For TensorRT EP, partitioned graphs must be serialized to proto in order for getCapability() to run. For cases where the weights are not strictly needed (i.e. weightless engines), serializing the graph without initializer data reduces the overall memory required.

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
This pull request includes a wide range of feature updates, optimizations, and bug fixes aimed at improving performance, memory efficiency, dynamic shaped model support, ORT GenAI support for GenAI models (LLMs/SLMs), and overall stability of the OpenVINO Execution Provider (OVEP).

### Key Enhancements
- **Dynamic Shaped Model Support**: Added support for inferencing dynamic shaped models via the `reshape_input` provider option. Enabled workload type handling for dynamic-shaped models.
- **Performance Optimizations**: Reduced peak memory usage by optimizing fallback logic and model proto handling. Improved CPU inference path efficiency. Removed unintended model copies during compilation.
- **ORT GenAI Feature Pass**: [ORT GenAI](https://github.com/microsoft/onnxruntime-genai) is now supported with the OpenVINO EP by setting the `enable_causallm` provider option to `True`.
- **EPContext OVIR Encapsulation Feature**: ORT now supports EPContext models with OVIR (i.e. model.xml & model.bin) stored in the `ep_cache_context` attribute, covering compilation, inference, and pre-compiled cached blob support.
- **Quantization Enhancements**: Enabled the QDQ stripping path using adaptive stripping. Enabled QDQ channel-wise quantization for Intel NPU friendly quantization via `MatMul4BitsQuantizer`/`DefaultWeightOnlyQuantConfig` with the `channel_wised_quantize` option set to `True`:

  ```
  from onnxruntime.quantization import matmul_nbits_quantizer

  # Define quantization configuration and process
  quant_config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
      block_size=128,
      is_symmetric=True,
      quant_format=quant_utils.QuantFormat.QDQ,
      channel_wised_quantize=True)
  ```

- **Operator & Backend Improvements**: Added support for the HardSwish operator. Fixed logic for unsupported op modes and improved precision accuracy.
- **Bug Fixes**: Fixed metadata naming and file path validation. Addressed device selection issues and provider key verification. Resolved deprecated OV element types and LUID check issues.

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: bfilipek <bartlomiej.filipek@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com> Co-authored-by: co63oc <co63oc@users.noreply.github.com> Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com> Co-authored-by: Hector Li <hecli@microsoft.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Alessio Soldano <services@soldano.it> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com> Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com> Co-authored-by: Jie Chen <jie.a.chen@intel.com> Co-authored-by: wp <webgraphics@intel.com> Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com> Co-authored-by: Prathik Rao <prathik.rao@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Tianlei Wu
<tlwu@microsoft.com> Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: xhcao <xinghua.cao@intel.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: Mark Schofield <mschofie@microsoft.com> Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com> Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: vraspar <vrajang@outlook.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com> Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Javier Martinez <javier.e.martinez@intel.com> Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com> Co-authored-by: bopeng1234 <bo.peng@intel.com> Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com> Co-authored-by: TejalKhade28 <tejal.khade@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: Yaru Du <yaru.du@intel.com> Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com> Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
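As a point of reference, here is a minimal sketch of how these provider options could be passed from the ONNX Runtime Python API. The option names `enable_causallm` and `reshape_input` come from this description; the device type, model path, and the exact `reshape_input` value format are assumptions for illustration.

```
import onnxruntime as ort

# Sketch only: option names are taken from the PR description above;
# the device type, model path, and reshape_input format are assumed.
ov_options = {
    "device_type": "NPU",                # assumed target device
    "enable_causallm": "True",           # ORT GenAI / LLM support
    "reshape_input": "input_ids[1,64]",  # assumed shape-string format
}

session = ort.InferenceSession(
    "model.onnx",  # hypothetical model path
    providers=["OpenVINOExecutionProvider"],
    provider_options=[ov_options],
)
```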
…crosoft#25272)

### Description
Add `OrtEpFactory::GetVersion` and store the EP version in EP metadata.

### Motivation and Context
Enforce plugin EP version specification and make it accessible from EP metadata.

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
…djusting weight scale and requantizing (microsoft#25278)

### Overview
This PR introduces a critical fix for **QOperator INT8 symmetric quantization** in ONNX Runtime. It addresses a situation where the computed **bias scale** (`input_scale * weight_scale`) becomes too small, leading to **int32 overflow** or **precision clipping** during bias quantization.

### Problem
In symmetric quantization (i.e., zero_point = 0), the bias tensor is quantized using a fixed-point scale:

**bias_scale = input_scale * weight_scale**

When this value is too small, the quantized int32 bias may exceed the range of `int32`, causing saturation or significant quantization error. This was observed to cause **>51% accuracy loss** in some models.

### Solution
This PR adds two new functions to mitigate this:

#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`
Located in `onnx_quantizer.py`, this function:
- **Inspects the float bias range** to compute the smallest valid bias scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If too small, **scales up the weight scale** accordingly, to prevent overflow
- Supports both per-tensor and per-channel weight quantization cases

This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`

#### 🔄 `_requantize_weight(...)`
After weight scale adjustment, this function:
- **Finds the original quantized weight** (`q_weight`), scale, and zero point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and the same zero point
- **Re-inserts** them into the model to maintain consistency

### Summary of Benefits
- ✅ Prevents int32 overflow or saturation during symmetric bias quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to the QOperator + symmetric INT8/INT16 flow (safe for other modes)
- ✅ Improves robustness of static quantization for hardware that performs integer-only inference

### Code Location
- `onnxruntime/quantization/onnx_quantizer.py`
  - `def _adjust_weight_scale_for_int32_bias(...)`
  - `def _requantize_weight(...)`
  - Integrated in `quantize_bias_static(...)`

Please let me know if you'd like additional test coverage or integration points. Thanks!
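To make the failure mode concrete, below is a minimal illustrative sketch (not the code added by this PR) of symmetric bias quantization, showing how a very small `input_scale * weight_scale` pushes the quantized bias outside the int32 range. The function name and numeric values are made up for the example.

```
import numpy as np

# Illustrative only -- not the ORT implementation. Shows how a tiny
# bias_scale = input_scale * weight_scale saturates the int32 bias.
def quantize_bias_symmetric(bias_fp32, input_scale, weight_scale):
    bias_scale = input_scale * weight_scale
    q = np.round(np.asarray(bias_fp32, dtype=np.float64) / bias_scale)
    i32 = np.iinfo(np.int32)
    if np.any(q > i32.max) or np.any(q < i32.min):
        # This is the overflow the PR guards against: the fix scales up
        # weight_scale until the quantized bias fits, then re-quantizes
        # the weights with the adjusted scale.
        print("bias would overflow int32; weight scale needs adjustment")
    return np.clip(q, i32.min, i32.max).astype(np.int32)

# A modest bias with a very small combined scale already saturates.
quantize_bias_symmetric([3.0], input_scale=1e-6, weight_scale=1e-4)
```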
This PR enables graph capture capabilities in the WebGPU provider, similar to the JSEP implementation in microsoft#18989. The limitations are the same as for the JS/CUDA EPs:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported.
2. Usage of graph capture is limited to models where all ops in the model can be partitioned to the WebGPU EP or CPU EP, with no memory copy between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IOBinding is required, and all inputs/outputs are pre-allocated GPU buffers.

When using the graph capture feature, users are expected to do some pre-processing and post-processing of the inference inputs and outputs so that the whole pipeline stays on the GPU, avoiding unnecessary CPU-to-GPU or GPU-to-CPU copies. Usage looks like the following:

```
// Initialize Dawn
{
  // 1. Create Dawn instance
  ...
  instance = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create device from adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";

// Add the WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// Create the WebGPU session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));

// Create IoBinding objects
Ort::IoBinding webgpu_binding(*webgpu_session_);

// Upload cpu data to input_buffer or copy a gpu buffer to input_buffer
...

// Create OrtValue tensors backed by data in gpu memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()),
                                              input_tensor_size_, input_dims_.data(), input_dims_.size());
Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()),
                                              output_tensor_size_, output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // normal run + capturing
...
// Post-process output_buffer's content
...
// Update input_buffer's content
...
// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // replay()
...
// Post-process output_buffer's content
...
```
MatMulNBits is a decomposed op in the WebNN EP. Previously, we shared the WebNN constant for zero_points when they had the same value and data type. However, this added a lot of complexity when fusing it back into MatMulNBits in the underlying WebNN implementation in Chromium. In this PR, we always create a new constant for zero_points.
…ft#25305)

### Description
1. Rename `SessionState` to `GraphCaptureState`, since there is already a `SessionState` type in ORT.
2. Optimize the implementation of `ComputeContext::BufferManager()`.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ankitm3k approved these changes on Jul 7, 2025.