[CUDA] GatherElements[Grad]/ScatterElements Bugfix and Perf Improve#11374

Merged: Lafi7e merged 11 commits into master from weicwang/gather_elements on Jun 15, 2022

@Lafi7e Lafi7e commented Apr 27, 2022

Bugfix:

  • PR #2628 ("Optimize cuda scatter() on 2D compatible") introduced dim coalescing to improve perf for ScatterElements and GatherElementsGrad, but the algorithm had a bug: it failed to handle the case where the dim values along the axis and the next outer axis are the same. The code coalesced these dims when it should not. No UT covered such cases. This PR fixes the bug and also supports more coalescing cases that the original PR didn't cover.
  • The within-bound check in the CPU kernel compares a size_t dim with axis_ (which could be negative); it should compare with axis instead.
  • The kernels for all three ops follow the same pattern, so the duplicated code is removed and only one copy is kept.
  • Added more UT cases.
  • Specialized the templates for fewer data types to reduce binary size.

Perf improve:

  • Previously, dim coalescing was applied to ScatterElements and GatherElementsGrad only. This PR also uses it for GatherElements.
  • Previously, the 2D kernel optimization was used by ScatterElements and GatherElementsGrad only. This PR also uses it for GatherElements.
  • Previously, the masked input stride optimization was used by GatherElements only. This PR also applies it to GatherElementsGrad and ScatterElements.
  • Adjusted the thread work size and used a local array to separate reads from writes. Nsight Compute was used to confirm the perf gain.

Tested on V100 using cases from a real model:

  • Input size: [4,32,512,1023], indices size: [4,32,512,512], axis=-1. With the 2D kernel applied to the forward pass, GatherElements now shows a 1.10x to 1.15x perf gain, depending on the indices values.
  • Input size: [4,32,1023,512], indices size: [4,32,512,512], axis=-2. We observe a ~3.4x perf gain for the forward pass due to dim coalescing, and an 8.8x perf gain for the backward pass due to the masked input stride optimization and some other changes.
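
To make the dim coalescing idea concrete, here is a minimal host-side sketch. It is not the PR's CoalesceDimensions: `CoalesceSketch` and `Coalesced` are hypothetical names, and the rule used here (merge neighbouring non-axis dims only when each has the same size in input and indices, and never merge the axis dim) is a deliberately conservative assumption that covers fewer cases than the PR describes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: coalesced shapes plus the new position of the axis.
struct Coalesced {
  std::vector<int64_t> input_shape;
  std::vector<int64_t> indices_shape;
  size_t axis = 0;
};

// Simplified sketch (assumption, not the ONNX Runtime implementation).
// `axis` must already be non-negative and smaller than the rank.
Coalesced CoalesceSketch(const std::vector<int64_t>& input_shape,
                         const std::vector<int64_t>& indices_shape,
                         size_t axis) {
  Coalesced out;
  bool prev_mergeable = false;  // previous dim was non-axis with equal sizes
  for (size_t d = 0; d < input_shape.size(); ++d) {
    const bool is_axis = (d == axis);
    const bool same = (input_shape[d] == indices_shape[d]);
    if (!is_axis && same && prev_mergeable) {
      // Fold this dim into the previous output dim.
      out.input_shape.back() *= input_shape[d];
      out.indices_shape.back() *= indices_shape[d];
    } else {
      // The axis dim always gets its own slot, even if its size matches a neighbour.
      if (is_axis) out.axis = out.input_shape.size();
      out.input_shape.push_back(input_shape[d]);
      out.indices_shape.push_back(indices_shape[d]);
    }
    prev_mergeable = !is_axis && same;
  }
  return out;
}
```

For the second test case above (axis=-2, that is axis 2 for rank 4), this sketch folds the leading 4 and 32 into 128, giving input [128,1023,512], indices [128,512,512] and axis 1. With input and indices both [2,3,3] and axis 1, nothing is merged, which is the kind of equal-sized neighbour case the bugfix is about.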

@Lafi7e Lafi7e added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Apr 27, 2022
Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc

@pengwa pengwa left a comment

This is a good refactoring of the four kernels mentioned, and a good consolidation of the perf optimizations across them. I have a few comments. I'm not sure I've caught up with all the ideas yet; I'll circle back with more if possible.

// Reverse for better calculation.
std::reverse(input_shape.begin(), input_shape.end());
std::reverse(indices_shape.begin(), indices_shape.end());
size_t reverse_axis = rank - 1 - static_cast<size_t>(axis);
Contributor

Should we check axis >= 0 in case the caller passes an invalid axis?

Contributor Author

CoalesceDimensions is currently called by the kernel's ComputeInternal, and we have HandleNegativeAxis to guarantee that axis is a valid number.

Contributor

I still feel we'd better add a check somewhere to guard against naive misuse in the future, especially since a negative axis would be cast to size_t here.
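
As an illustration of the concern, a defensive version of the reverse-axis computation might look like the sketch below. `ReverseAxisChecked` is a hypothetical helper, not part of ONNX Runtime, and a plain exception stands in for whatever assertion mechanism the codebase would actually use.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (illustration only): validate axis before converting it
// to size_t. Without the check, a negative axis such as -1 would wrap around
// to a huge unsigned value and the subtraction below would be meaningless.
size_t ReverseAxisChecked(int64_t axis, size_t rank) {
  if (axis < 0 || static_cast<size_t>(axis) >= rank) {
    throw std::out_of_range("axis must be in [0, rank)");
  }
  return rank - 1 - static_cast<size_t>(axis);
}
```

With this guard, a caller that forgets to normalize a negative axis fails loudly instead of silently computing a wrapped-around index.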

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc
Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc
}

if (curr == reverse_axis) {
if (curr > 0) Move(input_shape, indices_shape, curr, 0);
Contributor

Will this overwrite the merged value at position 0?

Contributor Author

"curr == reverse_axis" here is the special case where, after skipping all 1-dim axes, reverse_axis is the leading axis, so there is no valid axis at position 0 yet. If curr (and therefore reverse_axis) is not at position 0, it is moved to position 0 as the leading axis.

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc
return ONNX_NAMESPACE::TensorProto_DataType_INT8;
case sizeof(int16_t):
return ONNX_NAMESPACE::TensorProto_DataType_FLOAT16;
case sizeof(int32_t):
Contributor

Why not sizeof(float)?

Contributor Author

Similar to the reply above, we care only about the size of the data, and sizeof(float) is the same as sizeof(int32_t). I made all the cases consistent only because the first case is sizeof(int8_t). But if you think sizeof(float) is better (because I return TensorProto_DataType_FLOAT), I can change them to float16, float and double.

Contributor

Yeah, I think using sizeof(float)/sizeof(double) would be better.

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc
Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements_impl.cu
- utils::MLTypeCallDispatcher<float, MLFloat16, int16_t, int8_t, int32_t,
-                             int64_t, uint8_t, uint16_t, uint32_t, uint64_t, double, bool>
-     t_disp(data_tensor->GetElementType());
+ utils::MLTypeCallDispatcher<int8_t, MLFloat16, float, double> t_disp(dtype);
Contributor

Why don't we use "t_disp(data_tensor->GetElementType());" directly, or "t_disp(input_tensor->GetElementType());"?

Contributor Author

This is because we specialize the template functions for only 4 data types (for less specialization code in the kernel file and a smaller binary size), as mentioned in the comment above. If we used GetElementType, we would need to specialize the template functions for all of these types.

- constexpr int threads_per_block = GridDim::maxThreadsPerBlock;
- constexpr int thread_worksize = 16;
+ constexpr int kThreadsPerBlock = GridDim::maxThreadsPerBlock;
+ constexpr int kThreadWorkSize = 4;
Contributor

Does the kThreadWorkSize change bring better perf for both the 2D case and the other cases, on both V100 and A100? Do you know why a lighter-weight kernel might bring better perf?

Contributor Author

Actually, values from 4 to 16 make no big difference according to my testing. I changed it to 4 only because most other kernels use this number, so it is for consistency. A large work size per thread reduces the number of threads, which is not good for some smaller data shapes.
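
The trade-off in the reply above is simple arithmetic. In the sketch below, `NumBlocks` and `CeilDiv` are hypothetical helpers, and 256 threads per block is an assumed value used only for illustration (the real value comes from GridDim::maxThreadsPerBlock).

```cpp
#include <cstdint>

// Integer division rounded up.
constexpr int64_t CeilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Number of thread blocks launched when every thread handles `work_size`
// elements. Hypothetical helper for illustration only.
constexpr int64_t NumBlocks(int64_t num_elements, int threads_per_block,
                            int work_size) {
  return CeilDiv(num_elements,
                 static_cast<int64_t>(threads_per_block) * work_size);
}

// Indices tensor from the PR's test case: 4 * 32 * 512 * 512 elements.
constexpr int64_t kLarge = int64_t{4} * 32 * 512 * 512;
static_assert(NumBlocks(kLarge, 256, 16) == 8192, "work size 16");
static_assert(NumBlocks(kLarge, 256, 4) == 32768, "work size 4");

// A small tensor: a work size of 16 leaves a single block, 4 gives two.
static_assert(NumBlocks(2048, 256, 16) == 1, "small, work size 16");
static_assert(NumBlocks(2048, 256, 4) == 2, "small, work size 4");
```

For the large case both settings launch thousands of blocks, which is consistent with the observation that the value makes little difference there; for small shapes a large work size leaves very few blocks to spread across the GPU.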

Contributor

Got it. Maybe you can consider triggering some inference benchmarks via Anubis to check for any perf impact, just in case it affects existing inference models.

}
}

// GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64
Contributor

OK, thanks for the explanation. So we are using float types inside the kernel (for value reads and writes) even for int-typed input data, and assuming that this does not cause accuracy problems.
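
For context on why the gradient kernel needs an atomic add while the forward gather does not, here is a minimal 1-D CPU reference. `GatherGrad1D` is a hypothetical name and this is a sketch of the general gather-gradient semantics (scatter-add), not the PR's kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 1-D reference: the gradient of a gather accumulates each
// output gradient into the input position it was gathered from. Indices are
// assumed to be in range.
std::vector<float> GatherGrad1D(const std::vector<float>& grad_output,
                                const std::vector<int64_t>& indices,
                                size_t input_size) {
  std::vector<float> grad_input(input_size, 0.0f);
  for (size_t i = 0; i < indices.size(); ++i) {
    // Repeated indices add up. On a GPU, different threads can hit the same
    // element at the same time, so this += has to be an atomic add.
    grad_input[static_cast<size_t>(indices[i])] += grad_output[i];
  }
  return grad_input;
}
```

With grad_output {1, 2, 3} and indices {0, 2, 0}, position 0 receives 1 + 3 and position 2 receives 2. The forward gather and ScatterElements only copy values, which is why a same-width type can stand in for the real element type there.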

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc
}
}

// GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64
Contributor

nit: BTW, shall we explain a bit further in the comment here that "compute is done only in those 4 types, regardless of the real data type"? I guess others might have the same question when they revisit this code.

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements_impl.cu
Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements_impl.cu

@pengwa pengwa left a comment

Left a few comments; overall LGTM.

pengwa previously approved these changes Jun 14, 2022
pengwa previously approved these changes Jun 15, 2022

@pengwa pengwa left a comment

:shipit:

@Lafi7e Lafi7e merged commit 02457ec into master Jun 15, 2022
@Lafi7e Lafi7e deleted the weicwang/gather_elements branch June 15, 2022 08:29