
[WebNN EP] Cache MLTensors between runs#22278

Merged
guschmue merged 1 commit into microsoft:main from egalli:cache_mltensors on Oct 18, 2024

Conversation

egalli (Contributor) commented on Sep 30, 2024

Description

This change enables caching MLTensors between inference runs. This is done by keeping a reference to each MLTensor alive after it has been released; MLTensors are only destroyed once the session goes out of scope.

Motivation and Context

Creating and destroying MLTensors on every run carries a non-trivial performance penalty. The penalty shows up when using ort.Tensors with location=cpu for inputs/outputs, or when using the CPU EP as a fallback for unsupported operators. The former can be mitigated by developers using ort.Tensors with location=ml-tensor; the latter cannot be mitigated by developers.

@fdwr fdwr requested a review from fs-eire October 3, 2024 00:20
fdwr (Contributor) commented on Oct 3, 2024

@bbernhar, @Honry

Honry (Contributor) left a comment


LGTM, thanks!

huningxin (Contributor) commented, quoting @egalli:

> Creating and destroying MLTensors on every run has a non-trivial performance penalty.

Thanks for fixing this issue by caching MLTensors between inference runs.

Dispatching a graph with different MLTensor bindings causes the DML graph execution to be re-recorded. That is another significant inference overhead when it happens, and it is especially expensive on NPU devices. Should we try to bind the same MLTensor to a particular graph input or output between runs?
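The suggestion above, binding the same MLTensor to each graph input/output across runs, could be sketched as a per-name pinning table (hypothetical names, not code from this PR): dispatch then always sees identical bindings, so the backend has no reason to re-record the graph.

```javascript
// Sketch of pinned per-graph bindings (hypothetical names). Each
// input/output name gets one tensor, created lazily and returned
// unchanged on every later run, so dispatch() sees stable bindings.
class PinnedBindings {
  constructor(createFn) {
    this.createFn = createFn; // e.g. (desc) => mlContext.createTensor(desc)
    this.byName = new Map();  // input/output name -> pinned tensor
  }

  get(name, desc) {
    let tensor = this.byName.get(name);
    if (!tensor) {
      tensor = this.createFn(desc);
      this.byName.set(name, tensor); // same object on every later run
    }
    return tensor;
  }
}
```

A run would then copy fresh data into the pinned tensor rather than allocating a new one, trading a copy for a stable binding.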

guschmue (Contributor):
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue (Contributor):
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

guschmue (Contributor):
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

azure-pipelines:
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

guschmue (Contributor):
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines:
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

egalli (Contributor, Author) commented on Oct 16, 2024

@huningxin Along with using READ/WRITE usages and removing the extra copy to wasm, pinning tensors to graph inputs is one of the planned optimizations. I just didn't want to include it in this PR.

huningxin (Contributor) commented, quoting @egalli:

> Along with using READ/WRITE usages and removing the extra copy to wasm, pinning tensors to graph inputs is one of the planned optimizations. I just didn't want to include it in this PR.

Your plan sounds good to me.

@guschmue guschmue merged commit 1e5bda8 into microsoft:main Oct 18, 2024
guschmue pushed a commit that referenced this pull request Oct 18, 2024
tianleiwu pushed a commit that referenced this pull request Oct 18, 2024

5 participants