
[WebNN EP] Cache MLTensors between runs#22278

Merged
guschmue merged 1 commit into microsoft:main from egalli:cache_mltensors on Oct 18, 2024

Conversation

egalli (Contributor) commented on Sep 30, 2024

Description

This change enables caching MLTensors between inference runs. This is done by keeping a reference to each MLTensor alive after it has been released; MLTensors are only destroyed once the session goes out of scope.

Motivation and Context

Creating and destroying MLTensors on every run carries a non-trivial performance penalty. The penalty shows up when using ort.Tensors with location=cpu for inputs/outputs, or when using the CPU EP as a fallback for unsupported operators. The former can be mitigated by developers using ort.Tensors with location=ml-tensor; the latter cannot be mitigated by developers.

@fdwr fdwr requested a review from fs-eire October 3, 2024 00:20
fdwr (Contributor) commented on Oct 3, 2024

@bbernhar, @Honry

Honry (Contributor) left a comment


LGTM, thanks!

huningxin (Contributor) commented, quoting @egalli:

> Creating and destroying MLTensors on every run has a non-trivial performance penalty.

Thanks for fixing this issue by caching MLTensors between inference runs.

Dispatching a graph with different MLTensor bindings causes the DML graph execution to be re-recorded. That is another significant inference overhead when it happens, and it is especially expensive on NPU devices. Should we try to bind the same MLTensor to a particular graph input or output between runs?
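The suggestion above, binding the same MLTensor to each graph input/output across runs, could be sketched as a per-name pinning table (hypothetical names, not code from this PR): dispatch then always sees identical bindings, so the backend has no reason to re-record the graph.

```javascript
// Sketch of pinned per-graph bindings (hypothetical names). Each
// input/output name gets one tensor, created lazily and returned
// unchanged on every later run, so dispatch() sees stable bindings.
class PinnedBindings {
  constructor(createFn) {
    this.createFn = createFn; // e.g. (desc) => mlContext.createTensor(desc)
    this.byName = new Map();  // input/output name -> pinned tensor
  }

  get(name, desc) {
    let tensor = this.byName.get(name);
    if (!tensor) {
      tensor = this.createFn(desc);
      this.byName.set(name, tensor); // same object on every later run
    }
    return tensor;
  }
}
```

A run would then copy fresh data into the pinned tensor rather than allocating a new one, trading a copy for a stable binding.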

guschmue (Contributor):
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue (Contributor):
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

guschmue (Contributor):
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

azure-pipelines:
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

guschmue (Contributor):
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines:
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

egalli (Contributor, Author) commented on Oct 16, 2024

@huningxin Along with using READ/WRITE usages and removing the extra copy to wasm, pinning tensors to graph inputs is one of the planned optimizations. I just didn't want to include it in this PR.

huningxin (Contributor) commented, quoting @egalli:

> Along with using READ/WRITE usages and removing the extra copy to wasm, pinning tensors to graph inputs is one of the planned optimizations. I just didn't want to include it in this PR.

Your plan sounds good to me.

@guschmue guschmue merged commit 1e5bda8 into microsoft:main Oct 18, 2024
guschmue pushed a commit that referenced this pull request Oct 18, 2024
tianleiwu pushed a commit that referenced this pull request Oct 18, 2024

5 participants