
Add Algorithm Search for ConvGrad #8613

Merged
Lafi7e merged 17 commits into master from weicwang/conv_grad on Sep 3, 2021

Conversation

@Lafi7e
Contributor

@Lafi7e Lafi7e commented Aug 4, 2021

This PR includes:

  • Add algorithm search for ConvGrad and cache the chosen algorithm globally (illustrated in the sketch below).
  • Use a bigger workspace size for algorithm search for both Conv and ConvGrad.
  • Use IAllocator::Reserve() to allocate the memory used for algorithm search benchmarking, so that with the Arena allocator this memory can be released after use instead of being cached.
  • Add empty_cache() for ExternalAllocator. If memory was allocated by Reserve(), call empty_cache() when releasing it.

With this change, the densenet model from torchvision runs 20% faster than before.
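To make the caching bullet concrete, here is a minimal illustrative sketch of the pattern (not the actual ORT code; all names such as ConvParamsKey, AlgoId and FindBestAlgo are placeholders): the result of an expensive algorithm search is stored in a process-wide, mutex-protected map keyed by the convolution parameters, so later calls with the same shapes reuse it.

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Simplified key describing one convolution configuration.
struct ConvParamsKey {
  int64_t n, c, h, w;   // input shape
  int64_t k, r, s;      // filter shape
  int64_t pad, stride;  // (simplified) conv attributes
  bool operator==(const ConvParamsKey& o) const {
    return n == o.n && c == o.c && h == o.h && w == o.w &&
           k == o.k && r == o.r && s == o.s && pad == o.pad && stride == o.stride;
  }
};

struct ConvParamsKeyHash {
  size_t operator()(const ConvParamsKey& p) const {
    size_t h = 0;
    for (int64_t v : {p.n, p.c, p.h, p.w, p.k, p.r, p.s, p.pad, p.stride})
      h = h * 31 + std::hash<int64_t>{}(v);
    return h;
  }
};

using AlgoId = int;  // stand-in for a cudnnConvolution*Algo_t value

// FindBestAlgo is a placeholder for the benchmarking search
// (e.g. cudnnFindConvolutionBackwardDataAlgorithmEx for ConvGrad).
AlgoId GetCachedAlgo(const ConvParamsKey& key, AlgoId (*FindBestAlgo)(const ConvParamsKey&)) {
  static std::unordered_map<ConvParamsKey, AlgoId, ConvParamsKeyHash> cache;
  static std::mutex cache_mutex;
  std::lock_guard<std::mutex> lock(cache_mutex);
  auto it = cache.find(key);
  if (it != cache.end()) return it->second;  // reuse the previously found algorithm
  AlgoId algo = FindBestAlgo(key);           // expensive one-time search
  cache.emplace(key, algo);
  return algo;
}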

@Lafi7e Lafi7e added the training label (issues related to ONNX Runtime training; typically submitted using template) on Aug 4, 2021
@Lafi7e Lafi7e requested a review from SherlockNoMad August 4, 2021 06:57
@SherlockNoMad
Contributor

return CreateAllocator(default_memory_info);

In the regular inference scenario, CUDAExecutionProvider::allocator_ is still a BFCArena instance.

So inference users will still see a significant amount of CUDA memory hanging around in the arena.


Refers to: onnxruntime/core/providers/cuda/cuda_execution_provider.cc:119 in 9c87733.

@SherlockNoMad
Contributor

return CreateAllocator(default_memory_info);

Oh, I see...
BFCArena::Reserve() doesn't keep the allocated memory in the reserved chunks.


In reply to: 899883906


Refers to: onnxruntime/core/providers/cuda/cuda_execution_provider.cc:119 in 9c87733.

@pranavsharma
Contributor

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

@Lafi7e
Contributor Author

Lafi7e commented Aug 24, 2021

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.
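For illustration, a minimal sketch of that calculation (assuming the CUDA runtime API; the exact logic in the PR may differ): query the free device memory and cap the search workspace at 90% of it, falling back to the old fixed 32 MB limit if the query fails.

#include <cstddef>
#include <cuda_runtime.h>

// Sketch only: pick a max workspace size for the algo search as 90% of the
// currently free device memory, falling back to the previous 32 MB limit.
size_t GetMaxAlgoSearchWorkspaceBytes() {
  constexpr size_t kDefaultWorkspaceBytes = 32 * 1024 * 1024;  // old fixed limit
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    return kDefaultWorkspaceBytes;
  }
  return static_cast<size_t>(free_bytes * 0.9);  // leave ~10% headroom
}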

@pranavsharma
Contributor

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

@hariharans29
Member

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

I second the idea of making this an "opt-in" configuration (at least until all the implications are fully understood). Please see issue #7966: this feature seems to have made that model slower (when it was in 1.8.0). Until we understand why it makes some models slower, it is better to keep this opt-in only.


// By default the session uses a fixed memory size (32 MB) for the Conv algo search, so the final algo might not be the best.
// If this is set to true, try to use as much memory as possible for the algo search.
bool use_more_mem_for_conv = false;
Contributor

@SherlockNoMad SherlockNoMad Aug 28, 2021


SessionOptions should host options that are more universal to how the session should be run.
This flag is very specific to the CUDA cuDNN conv kernel.
I think we should follow cudnn_conv_algo_search's pattern and have it as a config in CUDAExecutionProviderInfo.

Contributor

@SherlockNoMad SherlockNoMad Aug 28, 2021


Also, let's rename this to something more explicit, e.g. cudnn_conv_use_max_workspace.

const void* x_data;
const void* w_data;
const void* dy_data;
void* y_data;
Contributor


y_data is not needed for ConvGrad

template <typename T>
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes) const {
return provider_->GetScratchBuffer<T>(count_or_bytes);
inline IAllocatorUniquePtr<T> GetScratchBuffer(size_t count_or_bytes, bool is_reserve = false) const {
Contributor


The name of "is_reserve" flag is counterintuitive. Allocated buffer will be kept in Arena if "is_reserve==false", while it will not be reserved by Arena if "is_reserved==true".

I suggest we create a new API for this purpose, since GetScratchBuffer() is very widely used and let's not add cognitive burden for developer when using this.

We can have another API, e.g. GetTransientScratchBuffer, and please document its behavior.
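A minimal sketch of what the suggested wrapper could look like on the kernel-base side, assuming it mirrors the existing GetScratchBuffer signature (the provider-side plumbing down to IAllocator::Reserve() is omitted):

// Sketch: same shape as GetScratchBuffer, but the allocation is routed through
// IAllocator::Reserve(), so the arena does not keep the memory once the
// returned unique_ptr is destroyed.
template <typename T>
inline IAllocatorUniquePtr<T> GetTransientScratchBuffer(size_t count_or_bytes) const {
  return provider_->GetTransientScratchBuffer<T>(count_or_bytes);
}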

Comment thread onnxruntime/core/providers/cuda/cuda_allocator.cc
}

void CUDAExternalAllocator::Free(void* p) {
std::lock_guard<OrtMutex> lock(lock_);
Contributor


Same here: free_(p) doesn't need to be under the lock?
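If the callback indeed touches no allocator state, one possible restructuring would be (a sketch only; everything other than free_ and lock_ is an assumption about the surrounding code):

void CUDAExternalAllocator::Free(void* p) {
  free_(p);  // user-supplied deleter; assumed not to touch allocator-internal state
  std::lock_guard<OrtMutex> lock(lock_);
  // ... only the allocator's own bookkeeping is updated under the lock ...
}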

0.3532f, -0.1369f, 1.1986f, -0.4355f, 1.1206f, -0.3642f, -1.0039f, -2.8045f, 1.3698f, -1.0553f,
0.7075f, -0.4902f, 0.0947f, -0.0937f, 0.1146f, 1.1363f, 0.6955f, 0.5441f, -1.6661f};
vector<int64_t> X_shape = {1, 1, 7, 7};
vector<float> W = {-1.1080f};
Contributor


maybe have a 3x3 W?

Comment thread orttraining/orttraining/training_ops/cuda/nn/conv_grad.cc
args.handle, args.w_desc, args.w_data, args.y_tensor, args.dy_data, args.conv_desc, args.x_tensor,
args.dx_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
Contributor


I see... OrtCudnnConvAlgoSearch::DEFAULT should have been handled in OnlyDefaultAlgorithm().

Update the message to "Algo mode should be HEURISTIC or EXHAUSTIVE".

args.handle, args.x_tensor, args.x_data, args.y_tensor, args.dy_data, args.conv_desc, args.w_desc,
args.dw_data, num_algos, &perf_count, candidates.get(), workspace.get(), max_workspace_size));
} else {
ORT_ENFORCE(false, "Algo mode should be 0, 1 or 2, but got ", args.params.algo_mode);
Contributor


Same here, update the error message.

@Lafi7e Lafi7e force-pushed the weicwang/conv_grad branch from a35447f to 0b72e5f on August 30, 2021 08:43
"gpu_external_free": str(self._torch_free)}, {}]
else:
provider_options = [{"device_id": str(self._device.index)}, {}]
cuda_provider_option["gpu_external_alloc"] = str(self._torch_alloc)
Contributor


Nit: this should still be named provider_options, since the gpu_external_* options also apply to the ROCm EP.

SherlockNoMad previously approved these changes Aug 31, 2021
Contributor

@SherlockNoMad SherlockNoMad left a comment


LGTM. Could you please ping Pranav and Hari for their sign-off on the allocator-related changes?

@Lafi7e
Contributor Author

Lafi7e commented Aug 31, 2021

Can we make this configurable (maybe a session or run option) with the default turned off? This way existing inferencing users won't be impacted, in that they won't see an uptick in their peak memory usage. If they want to opt in to this functionality, they can do so intentionally.

The (free memory * 90%) is used as the maximum for the workspace calculation; normally the workspace won't actually take that much memory. Taking issue #7212 as an example, it ends up using a few hundred MB (~200 MB, compared to the current 32 MB), but speeds up perf significantly. I am guessing an uptick of a few hundred MB should be fine? PyTorch also tries to use all free memory as the maximum for this calculation.

It's difficult to generalize whether all customers would be OK with trading off memory for speed (especially hundreds of MB). It's much better to have them explicitly opt in with a full understanding of what they're signing up for. Changing the default behavior when its effects are conspicuous is not ideal for inferencing scenarios.

I second the idea of making this an "opt-in" configuration (at least until all the implications are fully understood). Please see issue #7966: this feature seems to have made that model slower (when it was in 1.8.0). Until we understand why it makes some models slower, it is better to keep this opt-in only.

@pranavsharma and @hariharans29, I've made this an "opt-in" configuration. Besides the Conv changes, this PR also contains some allocator-related changes; could you please help review it? Thanks!

// TODO remove deprecated global config
extern bool do_copy_in_default_stream;
extern onnxruntime::CUDAExecutionProviderExternalAllocatorInfo external_allocator_info;
extern bool cudnn_conv_use_max_workspace;
Contributor


We shouldn't add more globals.

CUDAExecutionProviderExternalAllocatorInfo external_allocator_info{};
// By default, use a fixed workspace size (32 MB) for the Conv algo search; the final algo might not be the best.
// If set to true, try to use as much memory as possible for the algo search.
bool cudnn_conv_use_max_workspace{false};
Contributor


I suppose we'll need a separate PR to expose this to inferencing.

Contributor Author


Sure. Just removed this configuration from onnxruntime_pybind_state.

Member


So that PR will expose this in the public-facing headers/pybind?

Contributor Author


I am actually not sure how to expose this to inferencing... I may need Pranav's or your suggestion.

Member


For that, you can check how other CUDA configs are wired up from the C API. For example, see how users set cudnn_conv_algo_search via the OrtCUDAProviderOptions struct in onnxruntime_c_api.h. I assume you will just have to follow the same pattern.
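For reference, a sketch of that existing pattern from the user's side: OrtCUDAProviderOptions and its cudnn_conv_algo_search field come from onnxruntime_c_api.h, while the rest of the session setup here is only illustrative.

#include <onnxruntime_cxx_api.h>

// Sketch: how a user sets cudnn_conv_algo_search today through the CUDA
// provider options struct; a new knob would presumably be wired the same way.
void CreateSessionWithExhaustiveSearch(Ort::Env& env, const char* model_path) {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;

  Ort::SessionOptions session_options;
  session_options.AppendExecutionProvider_CUDA(cuda_options);
  Ort::Session session(env, model_path, session_options);
  // ... run inference with the session ...
}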

Contributor Author


cudnn_conv_algo_search is defined as a global variable in onnxruntime_pybind_state. I did the same thing, but Pranav suggested we shouldn't add more globals, which is why I removed it. I am thinking maybe we want a new way to do this.

Member


Yes, the globals approach is deprecated. But users can provide EP-specific config options via a provider options map. Please see the logic here:

const CUDAExecutionProviderInfo info = GetCudaExecutionProviderInfo(cuda_provider_info,

@Lafi7e
Contributor Author

Lafi7e commented Sep 3, 2021

@pranavsharma, @hariharans29 and @SherlockNoMad, if there are no new comments, could you please sign off so I can close this one out?

@hariharans29
Member

@pranavsharma, @hariharans29 and @SherlockNoMad, if there are no new comments, could you please sign off so I can close this one out?

Overall the feature LGTM.

Sorry for the naive question, but how does a user turn on this feature for inferencing without the necessary knob being exposed in the C API? Is this going to be future work (after the release)?

Contributor

@pranavsharma pranavsharma left a comment


We can expose this feature to inferencing users in a separate PR.

@Lafi7e Lafi7e merged commit c343f7c into master Sep 3, 2021
@Lafi7e Lafi7e deleted the weicwang/conv_grad branch September 3, 2021 03:25