forked from microsoft/onnxruntime
Backmerging with Msft commits #731
Merged
Conversation
…icrosoft#25263)

### Description
Adds `include_initializer_data` option to `GraphViewerToProto` to skip writing initializer raw data and external data when serializing.

### Motivation and Context
For TensorRT EP, partitioned graphs must be serialized to proto in order for getCapability() to run. For cases where the weights are not strictly needed (i.e. weightless engines), serializing the graph without initializer data reduces the overall memory required.

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
This pull request includes a wide range of feature updates, optimizations, and bug fixes aimed at improving performance, memory efficiency, dynamic shaped model support, ORT GenAI support for GenAI models (LLMs/SLMs), and overall stability of the OpenVINO Execution Provider (OVEP).

### Key Enhancements
- **Dynamic Shaped Model Support**: Added support for inferencing dynamic shaped models via the `reshape_input` provider option. Enabled workload type handling for dynamic-shaped models.
- **Performance Optimizations**: Reduced peak memory usage by optimizing fallback logic and model proto handling. Improved CPU inference path efficiency. Removed unintended model copies during compilation.
- **ORT GenAI Feature Pass**: [ORT GenAI](https://github.com/microsoft/onnxruntime-genai) is now supported with the OpenVINO EP by setting the `enable_causallm` provider option to `True`.
- **EPContext OVIR Encapsulation Feature**: ORT now supports EPContext models with OVIR (i.e. model.xml & model.bin) stored in the `ep_cache_context` attribute, covering compilation, inference, and pre-compiled cached blob support.
- **Quantization Enhancements**: Enabled the QDQ stripping path using adaptive stripping. Enabled QDQ channel-wise quantization for Intel NPU friendly quantization via `MatMul4BitsQuantizer`/`DefaultWeightOnlyQuantConfig` with the `channel_wised_quantize` option set to `True`:

  ```
  from onnxruntime.quantization import matmul_nbits_quantizer

  # Define quantization configuration and process
  quant_config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
      block_size=128,
      is_symmetric=True,
      quant_format=quant_utils.QuantFormat.QDQ,
      channel_wised_quantize=True)
  ```

- **Operator & Backend Improvements**: Added support for the HardSwish operator. Fixed logic for unsupported op modes and improved precision accuracy.
- **Bug Fixes**: Fixed metadata naming and file path validation. Addressed device selection issues and provider key verification. Resolved deprecated OV element types and LUID check issues.

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: bfilipek <bartlomiej.filipek@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: Sushanth Rajasankar <44513542+sushraja-msft@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com> Co-authored-by: Seungtaek Kim <seungtaek.kim.94@gmail.com> Co-authored-by: co63oc <co63oc@users.noreply.github.com> Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com> Co-authored-by: Hector Li <hecli@microsoft.com> Co-authored-by: Jian Chen <cjian@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Alessio Soldano <services@soldano.it> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com> Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com> Co-authored-by: Jie Chen <jie.a.chen@intel.com> Co-authored-by: wp <webgraphics@intel.com> Co-authored-by: Satya Kumar Jandhyala <satya.k.jandhyala@gmail.com> Co-authored-by: Prathik Rao <prathik.rao@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Tianlei Wu
<tlwu@microsoft.com> Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com> Co-authored-by: xhcao <xinghua.cao@intel.com> Co-authored-by: Wanming Lin <wanming.lin@intel.com> Co-authored-by: Mark Schofield <mschofie@microsoft.com> Co-authored-by: jiangzhaoming <zhaoming.jiang@microsoft.com> Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: vraspar <vrajang@outlook.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Ranjit Ranjan <165394499+ranjitshs@users.noreply.github.com> Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com> Co-authored-by: Nikolay Proshunin <nikolay.proshunin@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Javier Martinez <javier.e.martinez@intel.com> Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com> Co-authored-by: bopeng1234 <bo.peng@intel.com> Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com> Co-authored-by: TejalKhade28 <tejal.khade@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: Yaru Du <yaru.du@intel.com> Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com> Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
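As a point of reference, here is a minimal sketch of how these provider options could be passed from the ONNX Runtime Python API. The option names `enable_causallm` and `reshape_input` come from this description; the device type, model path, and the exact `reshape_input` value format are assumptions for illustration.

```
import onnxruntime as ort

# Sketch only: option names are taken from the PR description above;
# the device type, model path, and reshape_input format are assumed.
ov_options = {
    "device_type": "NPU",                # assumed target device
    "enable_causallm": "True",           # ORT GenAI / LLM support
    "reshape_input": "input_ids[1,64]",  # assumed shape-string format
}

session = ort.InferenceSession(
    "model.onnx",  # hypothetical model path
    providers=["OpenVINOExecutionProvider"],
    provider_options=[ov_options],
)
```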
…crosoft#25272)

### Description
Add `OrtEpFactory::GetVersion` and store the EP version in EP metadata.

### Motivation and Context
Enforce plugin EP version specification and make it accessible from EP metadata.

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
…djusting weight scale and requantizing (microsoft#25278)

### Overview
This PR introduces a critical fix for **QOperator INT8 symmetric quantization** in ONNX Runtime. It addresses a situation where the computed **bias scale** (`input_scale * weight_scale`) becomes too small, leading to **int32 overflow** or **precision clipping** during bias quantization.

### Problem
In symmetric quantization (i.e., zero_point = 0), the bias tensor is quantized using a fixed-point scale:

**bias_scale = input_scale * weight_scale**

When this value is too small, the quantized int32 bias may exceed the range of `int32`, causing saturation or significant quantization error. This was observed to cause **>51% accuracy loss** in some models.

### Solution
This PR adds two new functions to mitigate this:

#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`
Located in `onnx_quantizer.py`, this function:
- **Inspects the float bias range** to compute the smallest valid bias scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If too small, **scales up the weight scale** accordingly, to prevent overflow
- Supports both per-tensor and per-channel weight quantization cases

This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`

#### 🔄 `_requantize_weight(...)`
After weight scale adjustment, this function:
- **Finds the original quantized weight** (`q_weight`), scale, and zero point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and the same zero point
- **Re-inserts** them into the model to maintain consistency

### Summary of Benefits
- ✅ Prevents int32 overflow or saturation during symmetric bias quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to the QOperator + symmetric INT8/INT16 flow (safe for other modes)
- ✅ Improves robustness of static quantization for hardware that performs integer-only inference

### Code Location
- `onnxruntime/quantization/onnx_quantizer.py`
  - `def _adjust_weight_scale_for_int32_bias(...)`
  - `def _requantize_weight(...)`
  - Integrated in `quantize_bias_static(...)`

Please let me know if you'd like additional test coverage or integration points. Thanks!
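To make the failure mode concrete, below is a minimal illustrative sketch (not the code added by this PR) of symmetric bias quantization, showing how a very small `input_scale * weight_scale` pushes the quantized bias outside the int32 range. The function name and numeric values are made up for the example.

```
import numpy as np

# Illustrative only -- not the ORT implementation. Shows how a tiny
# bias_scale = input_scale * weight_scale saturates the int32 bias.
def quantize_bias_symmetric(bias_fp32, input_scale, weight_scale):
    bias_scale = input_scale * weight_scale
    q = np.round(np.asarray(bias_fp32, dtype=np.float64) / bias_scale)
    i32 = np.iinfo(np.int32)
    if np.any(q > i32.max) or np.any(q < i32.min):
        # This is the overflow the PR guards against: the fix scales up
        # weight_scale until the quantized bias fits, then re-quantizes
        # the weights with the adjusted scale.
        print("bias would overflow int32; weight scale needs adjustment")
    return np.clip(q, i32.min, i32.max).astype(np.int32)

# A modest bias with a very small combined scale already saturates.
quantize_bias_symmetric([3.0], input_scale=1e-6, weight_scale=1e-4)
```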
This PR enables graph capture capabilities in the WebGPU provider, similar to the JSEP implementation in microsoft#18989. The limitations are the same as for the JS/CUDA EPs:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not supported.
2. Usage of graph capture is limited to models where all ops in the model can be partitioned to the WebGPU EP or CPU EP, with no memory copy between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IOBinding is required, and all inputs/outputs are pre-allocated GPU buffers.

When using the graph capture feature, users are expected to do some pre-processing and post-processing of the inference inputs and outputs so that the whole pipeline stays on the GPU, avoiding unnecessary CPU-to-GPU or GPU-to-CPU copies. Usage looks like the following:

```
// Initialize Dawn
{
  // 1. Create Dawn instance
  ...
  instance = wgpu::CreateInstance(&instanceDescriptor);
  // 2. Create the adapter
  ...
  instance.RequestAdapter
  // 3. Create device from adapter
  ...
  adapter.RequestDevice
}

// Create session options
webgpu_options_ = std::make_unique<Ort::SessionOptions>();
std::unordered_map<std::string, std::string> provider_options;
provider_options["dawnProcTable"] = std::to_string(reinterpret_cast<size_t>(&dawn::native::GetProcs()));
provider_options["webgpuInstance"] = std::to_string(reinterpret_cast<size_t>(instance_.Get()));
provider_options["webgpuDevice"] = std::to_string(reinterpret_cast<size_t>(device_.Get()));
provider_options["deviceId"] = "1";
provider_options["enableGraphCapture"] = "1";

// Add the WebGPU provider
webgpu_options_->AppendExecutionProvider("WebGPU", provider_options);
...
// Create the WebGPU session
webgpu_session_ = std::make_unique<Ort::Session>(*env_, model_path_.c_str(), *webgpu_options_);
...
Ort::MemoryInfo memory_info_gpu("WebGPU_Buffer", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
Ort::Allocator allocator(*webgpu_session_, memory_info_gpu);
auto input_buffer = allocator.GetAllocation(input_tensor_size_ * sizeof(float));
auto output_buffer = allocator.GetAllocation(output_tensor_size_ * sizeof(float));

// Create IoBinding objects
Ort::IoBinding webgpu_binding(*webgpu_session_);

// Upload cpu data to input_buffer or copy a gpu buffer to input_buffer
...

// Create OrtValue tensors backed by data in gpu memory
Ort::Value bound_x = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(input_buffer.get()),
                                              input_tensor_size_, input_dims_.data(), input_dims_.size());
Ort::Value bound_y = Ort::Value::CreateTensor(memory_info_gpu, reinterpret_cast<float*>(output_buffer.get()),
                                              output_tensor_size_, output_dims_.data(), output_dims_.size());
webgpu_binding.BindInput("input", bound_x);
webgpu_binding.BindOutput("output", bound_y);

// Run inference
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // normal run + capturing
...
// Post-process output_buffer's content
...
// Update input_buffer's content
...
// Run again
webgpu_session_->Run(Ort::RunOptions{nullptr}, webgpu_binding); // replay()
...
// Post-process output_buffer's content
...
```
MatMulNBits is a decomposed op in the WebNN EP. Previously, we shared the WebNN constant for zero_points when they had the same value and data type. However, this added a lot of complexity when fusing it back into MatMulNBits in the underlying WebNN implementation in Chromium. In this PR, we always create a new constant for zero_points.
…ft#25305)

### Description
1. Rename `SessionState` to `GraphCaptureState`, since there is already a `SessionState` type in ORT.
2. Optimize the implementation of `ComputeContext::BufferManager()`.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
ankitm3k approved these changes on Jul 7, 2025.