
Add Algorithm Search for ConvGrad #8613

Merged
Lafi7e merged 17 commits into master from weicwang/conv_grad on Sep 3, 2021

Conversation

@Lafi7e
Contributor

@Lafi7e Lafi7e commented Aug 4, 2021

This PR includes:

  • Add algorithm search for ConvGrad and cache the chosen algorithm globally (illustrated in the sketch below).
  • Use a bigger workspace size for algorithm search for both Conv and ConvGrad.
  • Use IAllocator::Reserve() to allocate the memory used for algorithm search benchmarking, so that with the Arena allocator this memory can be released after use instead of being cached.
  • Add empty_cache() for ExternalAllocator. If memory was allocated by Reserve(), call empty_cache() when releasing it.

With this change, the densenet model from torchvision runs 20% faster than before.
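To make the caching bullet concrete, here is a minimal illustrative sketch of the pattern (not the actual ORT code; all names such as ConvParamsKey, AlgoId and FindBestAlgo are placeholders): the result of an expensive algorithm search is stored in a process-wide, mutex-protected map keyed by the convolution parameters, so later calls with the same shapes reuse it.

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Simplified key describing one convolution configuration.
struct ConvParamsKey {
  int64_t n, c, h, w;   // input shape
  int64_t k, r, s;      // filter shape
  int64_t pad, stride;  // (simplified) conv attributes
  bool operator==(const ConvParamsKey& o) const {
    return n == o.n && c == o.c && h == o.h && w == o.w &&
           k == o.k && r == o.r && s == o.s && pad == o.pad && stride == o.stride;
  }
};

struct ConvParamsKeyHash {
  size_t operator()(const ConvParamsKey& p) const {
    size_t h = 0;
    for (int64_t v : {p.n, p.c, p.h, p.w, p.k, p.r, p.s, p.pad, p.stride})
      h = h * 31 + std::hash<int64_t>{}(v);
    return h;
  }
};

using AlgoId = int;  // stand-in for a cudnnConvolution*Algo_t value

// FindBestAlgo is a placeholder for the benchmarking search
// (e.g. cudnnFindConvolutionBackwardDataAlgorithmEx for ConvGrad).
AlgoId GetCachedAlgo(const ConvParamsKey& key, AlgoId (*FindBestAlgo)(const ConvParamsKey&)) {
  static std::unordered_map<ConvParamsKey, AlgoId, ConvParamsKeyHash> cache;
  static std::mutex cache_mutex;
  std::lock_guard<std::mutex> lock(cache_mutex);
  auto it = cache.find(key);
  if (it != cache.end()) return it->second;  // reuse the previously found algorithm
  AlgoId algo = FindBestAlgo(key);           // expensive one-time search
  cache.emplace(key, algo);
  return algo;
}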

@Lafi7e Lafi7e added the training label (issues related to ONNX Runtime training; typically submitted using template) on Aug 4, 2021
@Lafi7e Lafi7e requested a review from SherlockNoMad August 4, 2021 06:57
@SherlockNoMad
Contributor

return CreateAllocator(default_memory_info);

In the regular inference scenario, CUDAExecutionProvider::allocator_ is still a BFCArena instance.

So inference users will still see a significant amount of CUDA memory hanging around in the arena.


Refers to: onnxruntime/core/providers/cuda/cuda_execution_provider.cc:119 in 9c87733.

@SherlockNoMad
Contributor

return CreateAllocator(default_memory_info);

Oh, I see...
BFCArena::Reserve() doesn't keep the allocated memory in the reserved chunks.


In reply to: 899883906


Refers to: onnxruntime/core/providers/cuda/cuda_execution_provider.cc:119 in 9c87733.

@pranavsharma
Contributor

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

@Lafi7e
Contributor Author

Lafi7e commented Aug 24, 2021

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.
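For illustration, a minimal sketch of that calculation (assuming the CUDA runtime API; the exact logic in the PR may differ): query the free device memory and cap the search workspace at 90% of it, falling back to the old fixed 32 MB limit if the query fails.

#include <cstddef>
#include <cuda_runtime.h>

// Sketch only: pick a max workspace size for the algo search as 90% of the
// currently free device memory, falling back to the previous 32 MB limit.
size_t GetMaxAlgoSearchWorkspaceBytes() {
  constexpr size_t kDefaultWorkspaceBytes = 32 * 1024 * 1024;  // old fixed limit
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    return kDefaultWorkspaceBytes;
  }
  return static_cast<size_t>(free_bytes * 0.9);  // leave ~10% headroom
}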

@pranavsharma
Contributor

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

@hariharans29
Member

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

I second the idea of making this an "opt-in" configuration (at least until all the implications are fully understood). Please see issue #7966: this feature seems to have made that model slower (when it was in 1.8.0). Until we understand why it makes some models slower, it is better to keep this opt-in only.


// By default the session uses a fixed memory size (32 MB) for the Conv algo search, so the final algo might not be the best.
// If this is set to true, try to use as much memory as possible for the algo search.
bool use_more_mem_for_conv = false;
Contributor

@SherlockNoMad SherlockNoMad Aug 28, 2021


SessionOptions should host options that are more universal to how the session should be run.
This flag is very specific to the CUDA cuDNN conv kernel.
I think we should follow cudnn_conv_algo_search's pattern and have it as a config in CUDAExecutionProviderInfo.

Contributor

@SherlockNoMad SherlockNoMad Aug 28, 2021


Also, let's rename this to something more explicit, e.g. cudnn_conv_use_max_workspace.

const void* x_data;
const void* w_data;
const void* dy_data;
void* y_data;
Contributor


y_data is not needed for ConvGrad

template <typename T>
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes) const {
return provider_->GetScratchBuffer<T>(count_or_bytes);
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes, bool is_reserve = false) const {
Contributor


The name of "is_reserve" flag is counterintuitive. Allocated buffer will be kept in Arena if "is_reserve==false", while it will not be reserved by Arena if "is_reserved==true".

I suggest we create a new API for this purpose, since GetScratchBuffer() is very widely used and let's not add cognitive burden for developer when using this.

We can have another API, e.g. GetTransientScratchBuffer, and please document its behavior.
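A minimal sketch of what the suggested wrapper could look like on the kernel-base side, assuming it mirrors the existing GetScratchBuffer signature (the provider-side plumbing down to IAllocator::Reserve() is omitted):

// Sketch: same shape as GetScratchBuffer, but the allocation is routed through
// IAllocator::Reserve(), so the arena does not keep the memory once the
// returned unique_ptr is destroyed.
template <typename T>
inline IAllocatorUniquePtr<T> GetTransientScratchBuffer(size_t count_or_bytes) const {
  return provider_->GetTransientScratchBuffer<T>(count_or_bytes);
}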

Comment thread onnxruntime/core/providers/cuda/cuda_allocator.cc
}

void CUDAExternalAllocator::Free(void* p) {
std::lock_guard<OrtMutex> lock(lock_);
Contributor


Same here: free_(p) doesn't need to be under the lock?
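If the callback indeed touches no allocator state, one possible restructuring would be (a sketch only; everything other than free_ and lock_ is an assumption about the surrounding code):

void CUDAExternalAllocator::Free(void* p) {
  free_(p);  // user-supplied deleter; assumed not to touch allocator-internal state
  std::lock_guard<OrtMutex> lock(lock_);
  // ... only the allocator's own bookkeeping is updated under the lock ...
}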

0.3532f, -0.1369f, 1.1986f, -0.4355f, 1.1206f, -0.3642f, -1.0039f, -2.8045f, 1.3698f, -1.0553f,
0.7075f, -0.4902f, 0.0947f, -0.0937f, 0.1146f, 1.1363f, 0.6955f, 0.5441f, -1.6661f};
vector<int64_t> X_shape = {1, 1, 7, 7};
vector<float> W = {-1.1080f};
Contributor


maybe have a 3x3 W?

Comment thread orttraining/orttraining/training_ops/cuda/nn/conv_grad.cc
args.handle, args.w_desc, args.w_data, args.y_tensor, args.dy_data, args.conv_desc, args.x_tensor,
args.dx_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
Contributor


I see... OrtCudnnConvAlgoSearch::DEFAULT should have been handled in OnlyDefaultAlgorithm().

Update the message to "Algo mode should be HEURISTIC or EXHAUSTIVE".

args.handle, args.x_tensor, args.x_data, args.y_tensor, args.dy_data, args.conv_desc, args.w_desc,
args.dw_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
Contributor


Same here, update the error message.

@Lafi7e Lafi7e force-pushed the weicwang/conv_grad branch from a35447f to 0b72e5f on August 30, 2021 08:43
"gpu_external_free": str(self._torch_free)}, {}]
else:
provider_options = [{"device_id": str(self._device.index)}, {}]
cuda_provider_option["gpu_external_alloc"] = str(self._torch_alloc)
Contributor


Nit: this should still be named provider_options, since the gpu_external_* options also apply to the ROCm EP.

SherlockNoMad previously approved these changes Aug 31, 2021
Contributor

@SherlockNoMad SherlockNoMad left a comment


LGTM. Could you please ping Pranav and Hari for their sign-off on the allocator-related changes?

@Lafi7e
Contributor Author

Lafi7e commented Aug 31, 2021

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

I second the idea of making this an "opt-in" configuration (at least until all the implications are fully understood). Please see issue #7966: this feature seems to have made that model slower (when it was in 1.8.0). Until we understand why it makes some models slower, it is better to keep this opt-in only.

@pranavsharma and @hariharans29, I've made this an "opt-in" configuration. Besides the Conv changes, this PR also contains some allocator-related changes; could you please help review it? Thanks!

// TODO remove deprecated global config
extern bool do_copy_in_default_stream;
extern onnxruntime::CUDAExecutionProviderExternalAllocatorInfo external_allocator_info;
extern bool cudnn_conv_use_max_workspace;
Contributor


We shouldn't add more globals.

CUDAExecutionProviderExternalAllocatorInfo external_allocator_info{};
// By default, use a fixed workspace size (32 MB) for the Conv algo search; the final algo might not be the best.
// If set to true, try to use as much memory as possible for the algo search.
bool cudnn_conv_use_max_workspace{false};
Contributor


I suppose we'll need a separate PR to expose this to inferencing.

Contributor Author


Sure. Just removed this configuration from onnxruntime_pybind_state.

Member


So that PR will expose this in the public-facing headers/pybind?

Contributor Author


I am actually not sure how to expose this to inferencing... I may need Pranav's or your suggestion.

Member


For that, you can check how other CUDA configs are wired up from the C API. For example, see how users set cudnn_conv_algo_search via the OrtCUDAProviderOptions struct in onnxruntime_c_api.h. I assume you will just have to follow the same pattern.
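For reference, a sketch of that existing pattern from the user's side: OrtCUDAProviderOptions and its cudnn_conv_algo_search field come from onnxruntime_c_api.h, while the rest of the session setup here is only illustrative.

#include <onnxruntime_cxx_api.h>

// Sketch: how a user sets cudnn_conv_algo_search today through the CUDA
// provider options struct; a new knob would presumably be wired the same way.
void CreateSessionWithExhaustiveSearch(Ort::Env& env, const char* model_path) {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider_CUDA(cuda_options);
  Ort::Session session(env, model_path, session_options);
  // ... run inference with the session ...
}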

Contributor Author


cudnn_conv_algo_search is defined as a global variable in onnxruntime_pybind_state. I did the same thing, but Pranav suggested we shouldn't add more globals, which is why I removed it. I am thinking maybe we want a new way to do this.

Member


Yes, the globals approach is deprecated. But users can provide EP-specific config options via a provider options map. Please see the logic here:

const CUDAExecutionProviderInfo info = GetCudaExecutionProviderInfo(cuda_provider_info,

@Lafi7e
Contributor Author

Lafi7e commented Sep 3, 2021

@pranavsharma, @hariharans29 and @SherlockNoMad, if there are no new comments, could you please sign off so I can close this one out?

@hariharans29
Member

@pranavsharma, @hariharans29 and @SherlockNoMad, if there are no new comments, could you please sign off so I can close this one out?

Overall the feature LGTM.

Sorry for the naive question, but how does a user turn on this feature for inferencing without the necessary knob being exposed in the C API? Is this going to be future work (after the release)?

Contributor

@pranavsharma pranavsharma left a comment


We can expose this feature to inferencing users in a separate PR.

@Lafi7e Lafi7e merged commit c343f7c into master Sep 3, 2021
@Lafi7e Lafi7e deleted the weicwang/conv_grad branch September 3, 2021 03:25