Allow configuration template to disable some SIMD. #3

Open. Wants to merge 837 commits into base: main.

Conversation

jslap-ubi (Owner)

Description

Motivation and Context

jslap-ubi pushed a commit that referenced this pull request Aug 1, 2024
### Description
Security fuzz test with address sanitizer found several bugs
tianleiwu and others added 28 commits August 2, 2024 15:45
Add a check of node.InputDefs()[2]->Exists() for Layernorm bias (Follow up https://github.com/microsoft/onnxruntime/pull/21528/files#r1694026327)

Format the file: break long lines to stay within the 120-char limit.
### Description
Changes to add setting the external data path for model weight files.
Additional fixes to ensure this compiles against the latest v1.19
Onnxruntime.


### Motivation and Context
Separating the weights used for larger models (like Stable Diffusion) is the
motivation for this change set.

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
WebNN only supports test mode, so we do not care about the other inputs or
attributes related to training mode; use WebNN's identity op to implement the
Dropout op directly.
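As a minimal illustration of why this works (a sketch, not the EP code itself), Dropout in test/inference mode is just a pass-through of its input:

```python
import numpy as np

def dropout_inference(x: np.ndarray) -> np.ndarray:
    # In test/inference mode Dropout does nothing, which is why the WebNN
    # identity op is sufficient; a requested mask output would be all ones.
    return x

x = np.random.rand(2, 3).astype(np.float32)
assert np.array_equal(dropout_inference(x), x)
```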
### Description
Several tests result in segfaults during the minimal CUDA build.
Although test failures are expected due to the limitations of the minimal
CUDA EP, failing gracefully would be much preferred.



### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN      ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
…microsoft#21536)

### Description

Refactor framework directory structure for MacOS packages

### Motivation and Context
Apple started enforcing a specific [framework
structure](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Concepts/FrameworkAnatomy.html)
for macOS packages. We need to change how we package for macOS to follow
the guidelines.

Fixes the following issue: [Malformed
Framework](microsoft/onnxruntime-swift-package-manager#19)
Bump up version in main from 1.19.0 to 1.20.0 since the release branch
has been cut.
### Description
Add the ability to test packaging without rebuilding every time.
Add the ability to comment out some platforms/architectures without breaking
the scripts that assemble the C/Obj-C packages.
Update a couple of commands to preserve symlinks.


### Motivation and Context
Make debugging packaging issues faster.
Creates the correct package for mac-catalyst and doesn't require setting
symlinks via a bash script.
### Description
Update the script with cmake 3.30 to unblock EP Perf.


### Motivation and Context
…rosoft#21625)

### Description
Fix two typos in the MLAS AVX 4-bit GEMM implementation so that the correct
VNNI functions are called under the VNNI condition.



### Motivation and Context
needed for 1.19.0 release

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
… transient connection exceptions. (microsoft#21612)

### Description
Improve the docker commands to make docker image layer caching work.
This can make docker builds faster and more stable.
So far, the A100 pool's system disk is too small to use the docker cache.
We won't use the pipeline cache for docker images, and we remove some legacy
code.

### Motivation and Context
We often see an exception like
```
64.58 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
286.4 curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
because the Onnxruntime pipelines have been sending too many requests to
download Node.js during docker builds, which is the major reason pipelines
fail now.

In fact, docker image layer caching never worked.
We can always see that the scripts are still running:
```
microsoft#9 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
microsoft#9 0.234 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
microsoft#9 0.235 /bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
microsoft#9 0.235 /tmp/scripts/install_centos.sh: line 1: !/bin/bash: No such file or directory
microsoft#9 0.235 ++ '[' '!' -f /etc/yum.repos.d/microsoft-prod.repo ']'
microsoft#9 0.236 +++ tr -dc 0-9.
microsoft#9 0.236 +++ cut -d . -f1
microsoft#9 0.238 ++ os_major_version=8
....
microsoft#9 60.41 + curl https://nodejs.org/dist/v18.17.1/node-v18.17.1-linux-x64.tar.gz -sSL --retry 5 --retry-delay 30 --create-dirs -o /tmp/src/node-v18.17.1-linux-x64.tar.gz --fail
microsoft#9 60.59 + return 0
...
```

This PR improves the docker commands so that image layer caching works.
Thus, CI won't send so many redundant requests to download Node.js.
```
microsoft#9 [2/5] ADD scripts /tmp/scripts
microsoft#9 CACHED

microsoft#10 [3/5] RUN cd /tmp/scripts && /tmp/scripts/install_centos.sh && /tmp/scripts/install_deps.sh && rm -rf /tmp/scripts
microsoft#10 CACHED

microsoft#11 [4/5] RUN adduser --uid 1000 onnxruntimedev
microsoft#11 CACHED

microsoft#12 [5/5] WORKDIR /home/onnxruntimedev
microsoft#12 CACHED
```

### Reference
https://docs.docker.com/build/drivers/

---------

Co-authored-by: Yi Zhang <your@email.com>
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)



### Motivation and Context
Use the latest QNN SDK
Fix usability checker CoreML config file path. The files got renamed but one place was still referring to the old name.
### Description
Improve the speed of combining `per-channel` data by using a single
`np.concatenate` instead of multiple `np.concatenate` calls within a for
loop.
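A rough sketch of the difference (hypothetical data, not the quantization tool's actual code):

```python
import numpy as np

# Hypothetical per-channel arrays collected during quantization.
channels = [np.random.rand(1024).astype(np.float32) for _ in range(256)]

# Slow: repeated concatenation inside a loop re-copies the result each time.
slow = np.array([], dtype=np.float32)
for c in channels:
    slow = np.concatenate((slow, c))

# Fast: one np.concatenate over the whole list copies each element only once.
fast = np.concatenate(channels)

assert np.array_equal(slow, fast)
```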

### Motivation and Context

Fix the issue microsoft#21562

Signed-off-by: duansheng.liu <44742794+duanshengliu@users.noreply.github.com>
### Description



### Motivation and Context
To fix a whisper test failure.
Fix a missed ORT_ENFORCE check which caused a heap buffer overflow because
of out-of-bounds access.
### Description
Update to match microsoft#21627 and make the info for Split consistent.

As a Split that doesn't split anything is a no-op, it doesn't seem
meaningful to call that limitation out in the docs.


### Motivation and Context
…21664)

### Description
Allow op tests to use the f16 type for inputs/outputs.

This PR introduces "@petamoriken/float16" as a Float16Array polyfill but
restricts it to be used only by the test runner.
### Description
* Fix a migraphx build error caused by
microsoft#21598:
Add a conditional compile guard on the code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.

Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in the Linux GPU training CI pipelines due to
microsoft#19470:
Sometimes, the cuDNN frontend throws an exception that the cuDNN graph does
not support a Conv node of the keras_lotus_resnet3D model on a V100 GPU.
Note that the same test does not throw an exception in other GPU pipelines.
The failure might be related to the cuDNN 8.9 and V100 GPU used in the
pipeline (Ampere GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force-install torch for CUDA 11.8. (The docker image has torch 2.4.0 for
CUDA 12.1 to build the torch extension, which is not compatible with CUDA
11.8.) Note that this is a temporary workaround. A more elegant fix is to
make sure the right torch version is installed in the docker build step,
which might need updates to install_python_deps.sh and the corresponding
requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation fault.
The root cause needs more investigation (maybe due to the cuDNN frontend as
well).
* Skip test_aten_attention since it causes an assert failure. The root cause
needs more investigation (maybe due to the torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it has an error that
the compiler for the torch extension does not support C++17. One possible
fix is to set the following compile argument inside setup.py of the
fused_adam extension: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* Skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix a typo of "deterministic".

### Motivation and Context
### Description
The xcframework now uses symlinks to have the correct structure
according to Apple requirements. Symlinks are not supported by nuget on
Windows.

In order to work around that we can store a zip of the xcframeworks in
the nuget package.

### Motivation and Context
Fix nuget packaging build break
### Description
Fix a check of mask type introduced by me in a recent commit. Add tests.
### Description



### Motivation and Context
### Description

This change allows matching an external data path like `a.data` to
`./a.data`.
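A tiny sketch of the intended matching behavior (illustration only, not the actual C++ implementation):

```python
import os

def same_external_data_path(requested: str, candidate: str) -> bool:
    # Normalize so that "a.data" and "./a.data" refer to the same file.
    return os.path.normpath(requested) == os.path.normpath(candidate)

assert same_external_data_path("a.data", "./a.data")
```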


…efault (microsoft#19707)

### Description
Update the build script for webgpu to enable model dump by default.

Now, when using build_jsep.bat to build debug, model dump is enabled.
Setting
[`optimizedModelFilePath`](https://onnxruntime.ai/docs/api/js/interfaces/InferenceSession.SessionOptions.html#optimizedModelFilePath)
in the session options dumps the optimized model in the browser.
### Motivation and Context
Helps to debug or rule out problems that may be related to the model optimizer.
### Description
This change enhances the existing Pad Fusion to fuse Pad even if a Cast
operator is present between Pad and Conv/MaxPool/AveragePool. It keeps
the Cast as it is.
<pre>
/*
 * Before Fusion:
 *     Pad
 *      |
 *    Cast (Optional)
 *      |
 *   Conv/MaxPool/AveragePool
 * 
 * After Fusion:
 *    Cast (Optional)
 *      |
 *   Conv/MaxPool/AveragePool
 */
</pre>


### Motivation and Context
### Description
Add a gather that supports block-quantized input data.


### Motivation and Context
To support the Web inference scenario with quantized vocabulary embeddings.
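A rough numpy sketch of the idea (hypothetical layout and names, not the actual contrib-op signature): gather the quantized rows and their per-block scales, then dequantize only what was gathered:

```python
import numpy as np

def gather_block_dequant(q_data, scales, indices, block_size=32):
    # q_data: (rows, cols) int8 quantized embedding table
    # scales: (rows, cols // block_size) per-block scales
    q_rows = q_data[indices]                          # gather quantized rows
    s_rows = np.repeat(scales[indices], block_size, axis=-1)
    return q_rows.astype(np.float32) * s_rows         # dequantize only gathered rows

rows, cols = 1000, 64
q = np.random.randint(-8, 8, size=(rows, cols), dtype=np.int8)
s = np.random.rand(rows, cols // 32).astype(np.float32)
out = gather_block_dequant(q, s, np.array([1, 42, 7]))
assert out.shape == (3, cols)
```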
### Description
Fix wrong per-tensor quantized weight type for matmul.


### Motivation and Context
Fix related bug as described in
microsoft#21346
…rosoft#21693)

### Description
When quantizing MatMul to DQ + MatMul using the 4-bit QDQ tool chain,
previously the opsets of the domains were not changed.
Now, when quantizing MatMul to DQ + MatMul in QDQ format, force-upgrade the
onnx domain to opset 21.

### Motivation and Context
In QDQ format, DQ with int4 and blocked quantization is used. This
requires DQ with opset >= 21.
When quantizing MatMul to DQ + MatMul, force-upgrade the onnx domain to opset
21.
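A minimal sketch of what such an opset upgrade looks like with the onnx Python API (the helper name is hypothetical):

```python
import onnx

def force_min_onnx_opset(model: onnx.ModelProto, min_version: int = 21) -> None:
    # DequantizeLinear with int4/blocked quantization needs opset >= 21,
    # so bump the default onnx domain if it is older.
    for opset in model.opset_import:
        if opset.domain in ("", "ai.onnx") and opset.version < min_version:
            opset.version = min_version
```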
chilo-ms and others added 29 commits September 17, 2024 09:52
`supportsModel` is deprecated in TRT 10.1.
Add `supportsModelV2`, but still keep `supportsModel` as we still need to
support TRT 8.6, where `supportsModelV2` is not
supported.
### Description
See microsoft/onnxruntime-extensions#476
and actions/runner-images#7671

### Motivation and Context

### Current issue
- [ ] For the default Xcode 15.2 that comes with macOS-13, we need to
update the boost container header boost/container_hash/hash.hpp version
to pass the build
- [x] For Xcode 14.2, the build passed but the `Run React Native Detox
Android e2e Test` failed.
Possibly a flaky test, microsoft#21969
- [x] For Xcode 14.3.1, we encountered the following issue in `Build React
Native Detox iOS e2e Tests`
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Applied the following code at the end of ios/Podfile, which fixed the
issue
```
post_install do |installer|
    installer.generated_projects.each do |project|
        project.targets.each do |target|
            target.build_configurations.each do |config|
                config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
            end
        end
    end
end
```


- [x] facebook/react-native#32483

Applying changes to ios/Podfile
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."

  # Recommended fix for facebook/react-native#32483
  # from facebook/react-native#32483 (comment)
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```

- [ ] Detox environment setup exceeded the timeout of 120000 ms during the
iOS e2e test


### Dependent

- [x] microsoft#21159

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
Update XNNPack to the latest version (Sep 4).
- Some op outputs are changed; channel or stride params are moved into the
reshape func.
e.g.
google/XNNPACK@96962a6
- The input params of xnnpack's resize-related functions changed a lot.
- KleidiAI is added as a dependency on ARM64.
- The latest XNNPACK includes 2 static libs: microkernels-prod and
xnnpack.
Without microkernels-prod, it throws an exception of undefined symbols.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake.
### Description
This PR refactors the `CPU` kernel for the `CumSum` operator. The new
implementation strives to have as little indirection as possible.


### Motivation and Context
Currently the `CumSum` operator performs very poorly in the case of 1D
tensors (it was slower than a Python loop). This is caused by the
extensive use of `SliceIterator`s.

Here is a relevant snippet:
```python
import time
import ndonnx as ndx
import onnxruntime as ort
import numpy as np
import onnx

def test_cumsum(sz):
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")

    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start

def test_cumsum_by_hand(sz):
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start

print(test_cumsum(int(1e7))) 
print(test_cumsum_by_hand(int(1e7))) 
```

Before
```console
0.9794480800628662
0.4518160820007324
```

After
```console
0.02483987808227539
0.5496008396148682
```

The `model.onnx`: 
<img width="214" alt="image"
src="https://github.com/user-attachments/assets/a213d6ff-86c3-49b5-a493-ebfd97deaa41">

The flame graph:

![profile-3](https://github.com/user-attachments/assets/c7418a05-cb65-4d72-a76d-6a6b05b4ba4d)
### Description
Builds arm64 python 3.12 wheel for QNN EP.


### Motivation and Context
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add ONNX export script for segment anything v2 (SAM2).

### Limitations
* Does not support video. Only supports images right now.
* The decoder does not support batch inference.

### Credits
The demo is based on the [SAM2
notebook](https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/image_predictor_example.ipynb),
modified to run with ORT.

The export of the decoder is inspired by
https://github.com/vietanhdev/samexporter.

### Demo
Example output of demo:

![sam2_demo](https://github.com/user-attachments/assets/9a9fa360-8c20-482e-9935-a7aba9cf15de)

### Motivation and Context
To support optimization of SAM2 image segmentation.
### Description
Fix regression caused by microsoft#17361 



### Motivation and Context
…MM (microsoft#21984)

### Description
The ONNXRuntime implementation of S8S8 was using the default C++
implementation; with this new ISA, all variants of QGemm Int8 can
support the VNNI dot product and full AVX2 instructions.

All signed/unsigned variants support VNNI instructions starting with
LNL.
Renamed structs and functions to better indicate support of all Int8 vs
U8X8.


### Motivation and Context
LNL HW implements a new ISA, and this code enables that ISA in QGemm.
Speed is improved for S8S8 to match the existing U8S8 code. S8U8 would
also match that speed if ONNX formally accepted the data type.
Set up a pipeline to produce nightly Linux x64 wheels for onnxruntime-qnn.
These can be used for offline context binary generation.
…iptor.shape (microsoft#22121)

The spec renames MLOperandDescriptor.dimensions to
MLOperandDescriptor.shape. In order to support older Chromium versions,
we will keep both in the WebNN EP for a while.

Fixed microsoft#22120
### Description
Add linker flags to support 16KB page sizes on Android.

See
https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment

### Motivation and Context
microsoft#21837
…ty is supported. (microsoft#22141)

### Description
Use the latest nuget.exe so that the `readme` property is supported.

### Motivation and Context
microsoft#22137
…osoft#21927)

### Description
Update the DML EP so that the `FusedMatMul` ORT graph node has the TransA/B
attributes set instead of updating the strides.
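A numpy sketch of the `FusedMatMul` semantics being targeted (illustration only, not the DML EP code): the transposes are carried as node attributes rather than stride changes:

```python
import numpy as np

def fused_matmul(a, b, trans_a=False, trans_b=False, alpha=1.0):
    # TransA/TransB are attributes of the node; the input tensors keep
    # their original layout (no stride manipulation).
    if trans_a:
        a = np.swapaxes(a, -1, -2)
    if trans_b:
        b = np.swapaxes(b, -1, -2)
    return alpha * (a @ b)

a = np.random.rand(4, 3).astype(np.float32)
b = np.random.rand(4, 5).astype(np.float32)
assert fused_matmul(a, b, trans_a=True).shape == (3, 5)
```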



### Motivation and Context
### Description
Fix a random crash in QNN UTs with multi-threaded runs like
QnnHTPBackendTests.MultithreadHtpPowerCfgDefaultAndRunOption.

Root cause: a last-minute code change in
microsoft@b4e26bd
changed `static std::mutex mutex;` to `OrtMutex mutex;` and
missed `static`.
…22157)

Followed the ROCm example below it, which isn't the naming convention we
want to follow. Didn't fix ROCm because I'm not sure whether there are
consumers using its naming convention.
…oft#22140)

### Description
Decouple implementation for different A types to improve readability and
maintainability.

### Motivation and Context
As more types are added, the implementation can differ a lot between
types. Besides, different hardware may require different
implementations.
This PR creates an abstraction boundary where different implementations
can plug in easily.
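A hypothetical Python sketch of the kind of abstraction boundary described (the names and placeholder math are made up for illustration, not the actual kernel code):

```python
from abc import ABC, abstractmethod
import numpy as np

class QuantGemmKernel(ABC):
    """One implementation per activation (A) type / hardware target."""
    @abstractmethod
    def compute(self, a, b): ...

_KERNELS = {}

def register(a_dtype):
    def deco(cls):
        _KERNELS[np.dtype(a_dtype)] = cls()  # new A types plug in here
        return cls
    return deco

@register(np.float32)
class Fp32Kernel(QuantGemmKernel):
    def compute(self, a, b):
        return a @ b  # placeholder math for illustration

def quant_gemm(a, b):
    # Dispatch to the implementation registered for A's type.
    return _KERNELS[a.dtype].compute(a, b)
```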
…microsoft#22135)

### Description

Fixes the logic for getting the number of elements for the input and
output spans in the `MinMaxMLFloat16` method. This was incorrectly using
the full number of elements in the output rather than the number of
elements in the current span, which worked fine with 1D inputs but
breaks with 2D inputs.

This meant that as the `BroadcastLooper` iterated over spans,
`MinMaxMLFloat16` would start at a position further forward in the input
and output and read and write further beyond the end of the input and
output respectively, causing the asan error in microsoft#21558 and sometimes
segfaults in larger examples.

### Motivation and Context

Fixes microsoft#21558.

From further testing, this issue didn't only cause asan errors in tests
but also caused segfaults with larger-sized inputs.
### Description
The optional `axes` input may exist with an empty name and be a nullptr.

Update the CUDA implementation to handle this.

### Motivation and Context

microsoft#22035
)

### Description
Fix usage of C++ std::chrono::operator<< in Mac builds for a wider range
of Xcode versions/targets.

### Motivation and Context

microsoft#21033
### Description



### Motivation and Context
By default, CMAKE_SYSTEM_PROCESSOR is the same as CMAKE_HOST_SYSTEM_PROCESSOR:
https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_PROCESSOR.html
KleidiAI uses CMAKE_SYSTEM_PROCESSOR to determine whether to include
some arm64 ukernels:
https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CMakeLists.txt#L134
We use a Mac with an Intel CPU to cross-compile for Mac with ARM in the iOS
packaging pipeline, so we need to make CMAKE_SYSTEM_PROCESSOR the same as
ORT_TARGET_PROCESSOR.
### Description
When K == 0, output an MxN matrix filled with the bias if present, or filled
with zeros.
This brings it in line with MatMul behavior, especially when Gemm is used
to fuse MatMul with Add.
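A numpy sketch of the intended behavior (illustration, not the kernel code):

```python
import numpy as np

def gemm_k_zero(M, N, bias=None, dtype=np.float32):
    # A is (M, 0) and B is (0, N); their product contributes nothing,
    # so the output is the broadcast bias, or zeros when no bias is given.
    out = np.zeros((M, N), dtype=dtype)
    if bias is not None:
        out = out + bias
    return out

# Matches numpy: an (M, 0) @ (0, N) matmul yields an all-zero (M, N) matrix.
assert np.array_equal(np.zeros((3, 0)) @ np.zeros((0, 4)), gemm_k_zero(3, 4))
```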


### Motivation and Context
* Comply with the numpy spec of MatMul.
* Address the case when empty initializers are used for computation.
### Description
* Add MultiHeadAttention fusion for SAM2.
* Add LayerNormalization fusion for NCHW format by inserting a Transpose
from NCHW to NHWC before layer normalization, and another Transpose
after layer norm to convert NHWC back to NCHW (a sketch of this pattern
follows the list). Hopefully, those extra
Transpose nodes will be removed when prefer_nhwc is enabled later.
* Add a condition that the input shall be 3D when fusing SkipLayerNorm.
* Update convert_to_onnx.py to add `--optimize` and `--use_gpu` options
to output an optimized onnx model for CPU/CUDA EPs.
* Add an option `--dtype fp16|fp32` in convert_to_onnx.py to support
converting the optimized model to float16.
* Update the demo to use the optimized onnx models.
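A numpy sketch of the transpose-wrapped LayerNormalization pattern mentioned above (an illustration of the fused graph's math, not the fusion code):

```python
import numpy as np

def layernorm_nchw_via_nhwc(x, gamma, beta, eps=1e-5):
    # x is NCHW; move C last (NHWC) so LayerNorm normalizes over channels,
    # then transpose back to NCHW after normalization.
    x_nhwc = np.transpose(x, (0, 2, 3, 1))
    mean = x_nhwc.mean(axis=-1, keepdims=True)
    var = x_nhwc.var(axis=-1, keepdims=True)
    y = (x_nhwc - mean) / np.sqrt(var + eps) * gamma + beta
    return np.transpose(y, (0, 3, 1, 2))

x = np.random.rand(1, 8, 4, 4).astype(np.float32)
out = layernorm_nchw_via_nhwc(x, np.ones(8, np.float32), np.zeros(8, np.float32))
assert out.shape == x.shape
```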

### Motivation and Context
To support optimization of SAM2 (exported in
microsoft#22119) for CPU/CUDA EPs.
### Description
Add a benchmark script for Segment Anything v2 (SAM2).
It depends on microsoft#22119 for
onnx export, and microsoft#22167 for
sam2 graph fusion.

### Motivation and Context

Benchmark SAM2 model performance.
…n logger from that session (microsoft#22170)

### Description
Fix an issue where QNN models shared from another session also use the session logger from that producer session, which causes confusion. Make the QNN model compute function use the session logger from the current session.
…osoft#22056)

### Description

Specify the path of `ar`, `ld` and `libtool` when building the Apple
framework.


### Motivation and Context
Sometimes non-system executables come before the system-provided
ones. This PR intends to prevent that from happening.