Add Algorithm Search for ConvGrad #8613
Conversation
In the regular inference scenario, CUDAExecutionProvider::allocator_ is still a BFCArena instance, so inference users will still see a significant amount of CUDA memory hanging around in the arena. Refers to: onnxruntime/core/providers/cuda/cuda_execution_provider.cc:119 in 9c87733.
Can we make this configurable (maybe a session or run option) with the default turned off? That way existing inferencing users won't be impacted and won't see an uptick in their peak memory usage. If they want this functionality they can opt in intentionally.
The (free memory * 90%) is used as the max space for the workspace calculation; normally the workspace won't actually need that much memory. For the case in issue #7212 as an example, it ends up taking several hundred MB (~200MB, compared to the current 32MB), but speeds up the perf significantly. I am guessing a hundred-MB uptick like that should be fine? PyTorch also tries to use all free memory as the max for this calculation.
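For concreteness, the kind of calculation being described is roughly this (a minimal sketch, not this PR's actual code; the helper name and fallback value are illustrative):

```cpp
// Sketch only: derive a workspace cap from currently free device memory,
// leaving ~10% headroom.
#include <cuda_runtime.h>
#include <cstddef>

inline size_t GetMaxWorkspaceBytes() {
  size_t free_bytes = 0, total_bytes = 0;
  // cudaMemGetInfo reports free/total memory on the current device.
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    return static_cast<size_t>(32) * 1024 * 1024;  // fall back to the fixed 32MB default
  }
  return static_cast<size_t>(free_bytes * 0.9);
}
```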
It's difficult to generalize whether all customers would be OK with trading off memory for speed (esp. hundreds of MBs). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are this conspicuous is not ideal for inferencing scenarios.
Second the idea of making this an "opt-in" configuration (at least until all the implications are fully understood). Please see this issue: #7966. This feature seems to have made that model slower (when it was in 1.8.0). Until we have an understanding of why it makes some models slower, it is better to keep this for users opting in only.
// By default the session uses a fixed memory size (32M) for Conv algo search, so the final algo might not be the best.
// If this is set to true, try to use as much memory as possible for the algo search.
bool use_more_mem_for_conv = false;
SessionOptions should host options that are universal to how the session is run.
This flag is very specific to the CUDA cuDNN Conv kernel.
I think we should follow cudnn_conv_algo_search's pattern and have it as a config in CUDAExecutionProviderInfo.
Also, let's rename this to something more explicit, e.g. cudnn_conv_use_max_workspace.
const void* x_data;
const void* w_data;
const void* dy_data;
void* y_data;
y_data is not needed for ConvGrad
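In other words, ConvGrad only reads x, w and dy and writes dx and dw, so the args could be shaped roughly like this (illustrative sketch, not the PR's actual struct):

```cpp
// Illustrative sketch: ConvGrad consumes x, w and dy and produces dx and dw,
// so no y_data member is required.
struct ConvGradArgs {
  const void* x_data;   // input activation
  const void* w_data;   // filter weights
  const void* dy_data;  // gradient w.r.t. the output
  void* dx_data;        // gradient w.r.t. the input
  void* dw_data;        // gradient w.r.t. the weights
};
```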
template <typename T>
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes) const {
  return provider_->GetScratchBuffer<T>(count_or_bytes);
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes, bool is_reserve = false) const {
The name of the "is_reserve" flag is counterintuitive: the allocated buffer is kept in the arena when is_reserve == false, but is not retained by the arena when is_reserve == true.
I suggest we create a new API for this purpose; GetScratchBuffer() is very widely used and we shouldn't add cognitive burden for developers using it.
We can have another API, e.g. GetTransientScratchBuffer, and please document its behavior.
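Roughly the split being proposed (a sketch under the assumption that the provider exposes a matching non-arena-cached allocation path; GetTransientScratchBuffer here is the suggested name, not existing API):

```cpp
// Sketch of the suggested API split. GetScratchBuffer keeps today's
// arena-backed behavior; GetTransientScratchBuffer would return memory that
// the arena does not retain after the buffer is freed (e.g. a Reserve-style
// allocation), which suits one-off conv algo search workspaces.
template <typename T>
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes) const {
  return provider_->GetScratchBuffer<T>(count_or_bytes);
}

template <typename T>
inline IAllocatorUniquePtr<T> GetTransientScratchBuffer(size_t count_or_bytes) const {
  // Assumes the provider offers a corresponding transient allocation helper.
  return provider_->GetTransientScratchBuffer<T>(count_or_bytes);
}
```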
}

void CUDAExternalAllocator::Free(void* p) {
  std::lock_guard<OrtMutex> lock(lock_);
Same here, free_(p) doesn't need to be under the lock?
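i.e. something along these lines (sketch; assumes only the internal bookkeeping needs lock_ and free_ is the user-supplied free callback):

```cpp
void CUDAExternalAllocator::Free(void* p) {
  {
    std::lock_guard<OrtMutex> lock(lock_);
    // ... bookkeeping that actually needs lock_ ...
  }
  free_(p);  // invoke the external free callback outside the critical section
}
```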
0.3532f, -0.1369f, 1.1986f, -0.4355f, 1.1206f, -0.3642f, -1.0039f, -2.8045f, 1.3698f, -1.0553f,
0.7075f, -0.4902f, 0.0947f, -0.0937f, 0.1146f, 1.1363f, 0.6955f, 0.5441f, -1.6661f};
vector<int64_t> X_shape = {1, 1, 7, 7};
vector<float> W = {-1.1080f};
args.handle, args.w_desc, args.w_data, args.y_tensor, args.dy_data, args.conv_desc, args.x_tensor,
args.dx_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
  ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
I see... OrtCudnnConvAlgoSearch::DEFAULT should have been handled in OnlyDefaultAlgorithm().
Please update the message to "Algo mode should be HEURISTIC or EXHAUSTIVE".
args.handle, args.x_tensor, args.x_data, args.y_tensor, args.dy_data, args.conv_desc, args.w_desc,
args.dw_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
  ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
Same here, please update the error message.
Force-pushed from a35447f to 0b72e5f.
| "gpu_external_free": str(self._torch_free)}, {}] | ||
| else: | ||
| provider_options = [{"device_id": str(self._device.index)}, {}] | ||
| cuda_provider_option["gpu_external_alloc"] = str(self._torch_alloc) |
Nit: this should still be named provider_options, since the gpu_external_* options also apply to the ROCm EP.
SherlockNoMad left a comment
LGTM, could you please ping Pranav and Hari for their sign-off on the allocator-related changes?
@pranavsharma and @hariharans29, I've made this an "opt-in" configuration. Besides the Conv changes, this PR also contains some changes related to the allocator; could you please help review this PR? Thanks!
// TODO remove deprecated global config
extern bool do_copy_in_default_stream;
extern onnxruntime::CUDAExecutionProviderExternalAllocatorInfo external_allocator_info;
extern bool cudnn_conv_use_max_workspace;
We shouldn't add more globals.
CUDAExecutionProviderExternalAllocatorInfo external_allocator_info{};
// By default use a fixed workspace size (32M) for Conv algo search, so the final algo might not be the best.
// If set to true, try to use as much memory as possible for the algo search.
bool cudnn_conv_use_max_workspace{false};
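For context, the flag would then be populated from the string-keyed provider options alongside the other CUDAExecutionProviderInfo fields; a rough sketch (the helper below is illustrative, not the actual parsing code):

```cpp
#include <string>
#include <unordered_map>

// Sketch: read the new flag from string-keyed provider options. The real
// wiring would sit next to the other CUDAExecutionProviderInfo options.
using ProviderOptions = std::unordered_map<std::string, std::string>;

bool ParseUseMaxWorkspace(const ProviderOptions& options) {
  auto it = options.find("cudnn_conv_use_max_workspace");
  return it != options.end() && (it->second == "1" || it->second == "true");
}
```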
I suppose we'll need a separate PR to expose this to inferencing.
Sure. Just removed this configuration from onnxruntime_pybind_state.
So that PR will expose this in the public-facing headers/pybind?
I am actually not sure how to expose this to inferencing... May need Pranav's or your suggestion.
For that, you can check how other CUDA configs are getting wired from the C API. For example, you can check how users get to set cudnn_conv_algo_search via the OrtCUDAProviderOptions struct in onnxruntime_c_api.h. I assume you will just have to follow the same pattern.
cudnn_conv_algo_search defines a global variable in onnxruntime_pybind_state; I did the same thing, but Pranav suggested we shouldn't add more globals, which is why I removed it. I am thinking maybe we want a new way to do this.
Yes, the globals approach is deprecated. But users can provide EP-specific config options via a provider options map. Please see the logic here -
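Separately, for the inferencing side, the existing cudnn_conv_algo_search knob is reachable today through the public C/C++ API roughly like this (sketch; "model.onnx" is a placeholder and the new flag would need analogous plumbing):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
  Ort::SessionOptions session_options;

  // Existing pattern: CUDA EP options are passed via the OrtCUDAProviderOptions
  // struct from onnxruntime_c_api.h.
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::EXHAUSTIVE;

  session_options.AppendExecutionProvider_CUDA(cuda_options);
  Ort::Session session(env, "model.onnx", session_options);  // placeholder model path
  return 0;
}
```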
@pranavsharma, @hariharans29 and @SherlockNoMad, if there are no new comments, could you please help sign off so I can close this one?
Overall the feature LGTM. Sorry for the naive question, but how does a user turn on this feature for inferencing without the necessary knob being exposed in the C API? Is this going to be future work (after the release)?
pranavsharma left a comment
We can expose this feature to inferencing users in a separate PR.
This PR includes:
With this change, the densenet from torchvision can run 20% faster than before the change.