
Allow saving on CPU usage for infrequent inference requests by reducing thread spinning #11841

Merged
merged 5 commits into master from yuslepukhin/start_stop_spinning on Jun 23, 2022

Conversation

yuslepukhin
Member

@yuslepukhin yuslepukhin commented Jun 13, 2022

Description:
  • Introduce a Start/Stop threadpool spinning switch
  • Add a session config option to force spinning stop

Motivation and Context
Some real-time customers report that residual thread pool spinning consumes too much CPU when no processing is taking place. This affects infrequent requests, such as those driven by human interaction, that arrive at least some milliseconds apart.

This change introduces optional enforcement of no spinning between inference requests, which reduces CPU usage but currently increases inference latency.

The latency increase remains minimal for continuously incoming requests, such as those in video calls.
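
As a sketch of how a client might opt in, assuming the C++ API and a config key named "session.force_spinning_stop" (the key name here is inferred from the PR description; check onnxruntime_session_options_config_keys.h for the exact constant):

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
  Ort::SessionOptions options;
  // Ask worker threads to stop spinning at the end of each Run(),
  // trading a little per-request latency for idle CPU savings.
  options.AddConfigEntry("session.force_spinning_stop", "1");
  Ort::Session session(env, ORT_TSTR("model.onnx"), options);
  return 0;
}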

Add a session config option to force spinning stop
@garymm garymm removed their request for review June 13, 2022 23:39
@garymm
Contributor

garymm commented Jun 13, 2022

I'm not very familiar with the code here, so hopefully you can find someone more knowledgeable to review. If not, you can add me back as a reviewer and I'll do my best.


In reply to: 1154553121

class ThreadPoolParallelSection {
 public:
  // State accessed only by the main thread
  // --------------------------------------

  // Tasks successfully submitted to the work queues. This sets the
  // maximum degree of parallelism that the section will support.
-  std::vector<std::pair<int,unsigned>> tasks;
+  std::vector<std::pair<int, unsigned>> tasks;
Contributor

@RandySheriffH RandySheriffH Jun 14, 2022

Shall we replace the container with an inlined one? #Resolved

Member Author

I would do it in a separate PR if required. I tried it and I did not see a perf gain for the problem at hand.

Contributor

Maybe we should update the coding conventions to say that inlined containers should be used only when they give us a perf boost. Currently it reads: "Use InlinedVector<T> typedef instead of std::vector. By default, it provides 64 bytes of inlined storage. You can customize inlined size with the second template non-type parameter N."

Member Author

@yuslepukhin yuslepukhin Jun 14, 2022

I do not think the usage of InlinedVector should be conditional on perf, just like std::string with its small buffer is not. You can't profile every single thing; it is enough to know that it is faster in most cases, especially on the stack.
In this PR there are other things to worry about; that is the only reason I did not change it.
The second argument creates more instantiations, which we probably do not need, except maybe for shape and convolutional paddings.
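
For illustration, a minimal sketch of the convention under discussion, assuming ORT's InlinedVector follows the small-buffer pattern of absl::InlinedVector (the wrapper shown here is an assumption, not necessarily ORT's exact typedef):

#include <utility>
#include "absl/container/inlined_vector.h"

// With inline capacity for 8 elements, a small task list lives entirely on
// the stack; the heap is touched only if the vector grows past 8 entries.
using TaskList = absl::InlinedVector<std::pair<int, unsigned>, 8>;

void Demo() {
  TaskList tasks;
  tasks.push_back({0, 42u});  // no heap allocation while size() <= 8
}

The saving is the avoided heap allocation, which only shows up on hot paths; that may be why no perf gain was visible for the problem at hand.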

@RandySheriffH
Contributor

RandySheriffH commented Jun 14, 2022

Further:

  • Do we need a UT for the change?
  • Should we list the performance implications for those who want to consume this feature? Like thread wake-up overheads on general architectures?

In reply to: 1154675242

@yuslepukhin
Member Author

yuslepukhin commented Jun 14, 2022

I am open to ideas for the UT. The change was designed to work functionally in both cases; it only affects CPU usage between the Run() calls.
I will document the perf implications in the PR.


In reply to: 1154675242

Contributor

@pranavsharma pranavsharma left a comment

How do we address workloads where there's a burst of activity followed by a lull and then another burst? The current set of changes would hurt performance during the burst, no? Should we also have a configurable number of runs before we stop spinning?

Also, this requires documentation on the performance page.

@yuslepukhin
Member Author

yuslepukhin commented Jun 14, 2022

I have verified that burst activity, described by the customer as continuous calls, does not degrade performance. Threads and core caches remain hot, and restarting the threads is very quick and sometimes not necessary.


In reply to: 1006323124
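
To make the trade-off concrete, here is a conceptual sketch (not ORT's actual thread pool code) of the spin-then-park pattern under discussion: a worker busy-waits briefly for new work while spinning is allowed, and otherwise blocks on a condition variable so it consumes no CPU between Run() calls:

#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> allow_spinning{true};  // the Start/Stop switch
std::atomic<int> pending_tasks{0};
std::mutex m;
std::condition_variable cv;

void WorkerLoop() {
  for (;;) {
    // Spin briefly, hoping work arrives while the core is still hot.
    int spins = 0;
    while (allow_spinning.load(std::memory_order_relaxed) &&
           pending_tasks.load(std::memory_order_acquire) == 0 &&
           ++spins < 1000) {
      // busy-wait
    }
    if (pending_tasks.load(std::memory_order_acquire) == 0) {
      // Park until the producer notifies; the producer must update
      // pending_tasks and notify while holding m to avoid lost wakeups.
      std::unique_lock<std::mutex> lk(m);
      cv.wait(lk, [] { return pending_tasks.load() > 0; });
    }
    pending_tasks.fetch_sub(1, std::memory_order_acq_rel);
    // ... dequeue and execute one task here ...
  }
}

With allow_spinning set to false, the spin loop exits immediately and the worker parks as soon as its queue drains; waking it for a burst costs one notify plus a context switch, which is consistent with the observation above that restarting threads is quick.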

Contributor

@tlh20 tlh20 left a comment

Approve the main changes in the thread pool logic. A couple of overall comments:

  • Can we keep the change to use ParallelSection experimental and outside this PR? It impacts the GRU operator, and will have negative effects on CPU load if we have longer-running jobs that do not parallelize well.
  • Do we want the enable/disable spinning behavior to be the default? Are the perf tests via Anubis and release testing broad enough now that we can get a clear picture of whether it is a net benefit, or whether it would be risky?

@yuslepukhin
Member Author

  • This change is off by default, unless explicitly enabled. I will think about how to still enable the GRU ParallelSection when the spinning start/stop behavior is not in effect.

In reply to: 1010419580

  Revert ParallelSection-related changes.
@yuslepukhin yuslepukhin merged commit 607b7df into master Jun 23, 2022
@yuslepukhin yuslepukhin deleted the yuslepukhin/start_stop_spinning branch June 23, 2022 17:04
RandySheriffH pushed a commit that referenced this pull request Jun 29, 2022
Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)

Introduce Start/Stop threadpool spinning switch
Add a session config option to force spinning stop at the end of the Run()
RandySheriffH pushed a commit that referenced this pull request Jul 6, 2022
Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)

Introduce Start/Stop threadpool spinning switch
Add a session config option to force spinning stop at the end of the Run()
RandySheriffH added a commit that referenced this pull request Jul 7, 2022
* Update ONNX to 1.12 (#11924)

Follow-ups that need to happen after this and before the next ORT release:
* Support SequenceMap with #11731
* Support signal ops with #11778

Follow-ups that need to happen after this but don't necessarily need to happen before the release:
* Implement LayerNormalization kernel for opset version 17: #11916

Fixes #11640

* Dll version fix ovep4.1 (#11953)

* Setting default version values for ovep dlls as well

* Update backend_manager.cc

Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: mohsin <mohsinx.mohammad@intel.com>

* Optimize t5 encoder in beam search (#11926)

* optimize t5 encoder

* update

* update

* update

* refactor expand impl

* cuda tests passed

* update

* alignment

* more alignments

* review comments

* Allow saving on CPU usage for infrequent inference requests by reducing thread spinning (#11841)

Introduce Start/Stop threadpool spinning switch
Add a session config option to force spinning stop at the end of the Run()

* Restructure function inliner (#11731)

* Add nested function call tests

* Add overload for Specialize

* Pass symboltable to onnx shape inference

* Avoid renaming empty names

* Enable sequence_map tests which failed before this change

* Deprecate APIs returning raw ptrs and provide replacements (#11922)

Provide better documentation

* register signal ops for opset 17 (#11778)

* Register signal ops for op set 17

Note code is mostly being moved, not added. These ops were previously
only registered as Microsoft contrib ops and only built if
`BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx
standard op set in version 17.

Main components of this change:

* Move the kernels from the contrib_ops directory to the
  core directory.
* Add function bodies for ms experimental ops. This will allow
  old models that use the contrib ops to continue to function.
  All the function bodies consist of a single op (the
  new standard op), so performance overhead should be minimal.

Minor clean-up also in this change:

* De-duplicate get_scalar_value_from_tensor: put it in a new utils.h.
* Fix some bugs that caused compilation errors with the experimental
  ops. Tested with `build.sh --ms_experimental`
* Fix some spelling errors and lint violations.
* Replace a couple of switch statements with `MLTypeCallDispatcher`.
* Use `InlinedVector` instead of `std::vector`.

Unblocks #11640

* Include opset 15 in Conv+BatchNormalization fusion (#11960)

* Fix WinML Tests are still targeting deprecated (deleted) experimental signal op definitions (#12006)

* fix winml tests

* remove legacy test

* switch idft -> dft+inverse attr

* upgrade opset 13->17 for signal ops tests

* [C# Tests] Add support for double tensor output in TestPreTrainedModels. (#12008)

Add support for double tensor output in TestPreTrainedModels.

* DML EP ResNet50 opset 15 fails in ONNX checker for FusedBatchNormalization lacking training_mode attribute (#12010)

FusedBatchNormalization include training_mode attribute

* Generalize native op creation (#11539)

* create op from ep

* read input count from context

* create holder to host nodes

* fix typo

* cast type before comparison

* throw error on API fail

* silence warning from minimal build

* switch to unique_ptr with deleter to host nodes

* fix typo

* fix build err for minimal

* fix build err for minimal

* add UT for conv

* enable test on CUDA

* add comment

* fix typo

* use gsl::span and string view for Node constructor

* Added two APIs - CopyKernelInfo and ReleaseKernelInfo

* pass gsl::span by value

* switch to span<NodeArg* const> to allow for reference to const containers

* fix typo

* fix reduced build err

* fix reduced build err

* refactoring node construction logic

* rename exceptions

* add input and output count as arguments for op creation

* refactor static member

* use ORT_CATCH instead of catch

* cancel try catch

* add static value name map

* format input definition and set err code

* fix comments

* fix typo

* [DML EP] Pad operator: Handle negative pad counts (#11974)

* Pad fallback to CPU

* Added queryPad in operatorRegistration.cpp

* Acknowledged PR comments

* Used any_of

* used none_of instead of any_of

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>

* Add warning about future computation change for ConvTranspose with auto_pad (#11984)

* Add warning about future computation change for Convtranspose with auto_pad

* improve msg

* update TODO to make lint happy

* update more contents for warning and add if

* valid was not affected

* move it into kernel registration

* parse auto_pad myself

* try to use conv_transpose_attrs_.auto_pad directly

* update roialign cuda impl to onnx opset16 (#12036)

* roialign opset16

* fix

* fix

* Fix windows eager build break by pinning to torch version 1.11.0 (#12033)

Fix windows and linux eager build to torch 1.11.0.

* Skip Constant Folding for ops producing an optional type output (#11839)

* Disable sequence-type tests since C# infra doesn't support well (#12037)

* Extend lifetime of KernelDef when creating a standalone op (#12057)

place tmp kernel def as local variable to cover the lifetime of kernel creation

* Add targets files for new .net6 frameworks (#12016)

* Add net6 targets.
Remove maccatalyst as we don't have a native build targeting that.

* Set platform in macos targets

* Add targetFramework entries

* Move NativeLib.DllName definition and set using preprocessor values for simplicity. Couldn't get it to build with the preprocessor based setup when it was in a separate file.

Update the nuspec generation to set platform version for .net6 targets. TODO: Validate versions. I copied them from the managed nuget package the packaging pipeline generated prior to adding targets. Possibly we could/should lower some of the versions.

Hopefully the need to specify a version goes away when the release version of VS2022 supports .net6.

* Try android 31.1 as https://github.com/actions/virtual-environments/blob/main/images/win/Windows2022-Readme.md suggests that should be available on the CI machines

* Fix patch version mismatch
Add some extra debug info in case it helps

* Debug nuget location in CI

* Add workspace entry back in

* Add steps

* One more attempt with hardcoded nuget.exe path and original android31.0 version

* Better fix - found explicit nuget download and updated version there.

* flake8 fixes

* Fix black complaints.

* Exit Microsoft_ML_OnnxRuntime_CheckPrerequisites for net6 iOS.

* Removed outdated comment

* Fix DML custom operators which set descriptor heap to command list (#12059)

* Make C# runtest.sh automatically set latest opset (#12039)

* Update C# runtest.sh for opset 17

Should have been part of #11924

* get appropriate opset version from onnx doc

* use absolute rather than relative path

* fix typo in var name

* Disable DML command list reuse for Xbox (#12063)

disable cl reuse for xbox

* Add data type check in ConvAddRelu fusion (#12058)

* Add undocumented attribute to disable generation of Java bindings from the Android AAR. (#12075)

The generated bindings causes C# build errors that require workaround code. Disabling generation should avoid the need for any workarounds.

As the user has the C# ORT package with the C# to C bindings there's no need for binding generation that calls the ORT Java API (which is C# -> Java ->C).

* enable the extensions custom build for java and android (#11823)

* generate quantization parameter for outputs (#12089)

* DML EP Update to DML 1.9 (#12090)

* Update to DML 1.9

* Appease obnoxious Python formatting tool

* Fix orttraining-linux-ci-pipeline - Symbolic shape infer (#11965)

fix symbolic shape error due to upgraded numpy + legacy sympy

* check consumers of dq node before swap dq and transpose (#12099)

* check consumers of dq node before swap dq and transpose

* add unit test

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: mohsin <mohsinx.mohammad@intel.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: G. Ramalingam <grama@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
Co-authored-by: Sheil Kumar <smk2007@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: sumitsays <sumitagarwal330@gmail.com>
Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
Co-authored-by: Chun-Wei Chen <jacky82226@gmail.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: Wil Brady <25513670+WilBrady@users.noreply.github.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: Justin Stoecker <justoeck@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: pengwa <pengwa@microsoft.com>