[CUDA] GatherElements[Grad]/ScatterElements Bugfix and Perf Improve by Lafi7e · Pull Request #11374 · microsoft/onnxruntime

Lafi7e · 2022-04-27T02:20:13Z

Bugfix:

PR #2628 (Optimize cuda scatter() on 2D compatible) introduced dim coalescing to improve perf for ScatterElements and GatherElementsGrad, but the algorithm had a bug: it fails when the dim values along axis and the next outer axis are the same. The code would coalesce these dims, but it should not, and no UT covered such cases. This PR fixes that and also supports more coalescing cases that the original PR didn't cover.
The within-bound check in the CPU kernel compares a size_t dim with axis_ (which could be negative); it should compare with axis.
The kernels for all three Ops follow the same pattern, so the duplicated code is removed and only one copy is kept.
Added more UT cases.
Specialized the templates for fewer data types to reduce binary size.
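
To illustrate the coalescing rule described in the first item, here is a minimal sketch, not the code from this PR. It assumes a simplified rule: two adjacent dims may be merged only when neither is the axis and the input and indices shapes agree on both. The names `CoalescedShapes` and `CoalesceSketch` are hypothetical, and the sketch ignores the 1-dim skipping and indices-stride handling that the real `CoalesceDimensions` deals with.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct CoalescedShapes {
  std::vector<int64_t> input;
  std::vector<int64_t> indices;
  std::size_t axis;  // position of the gather/scatter axis after coalescing
};

// Hypothetical helper: merges adjacent dims only when neither is the axis and
// the input and indices shapes agree on both, so flattening them cannot change
// which elements are addressed.
CoalescedShapes CoalesceSketch(const std::vector<int64_t>& input,
                               const std::vector<int64_t>& indices,
                               std::size_t axis) {
  assert(input.size() == indices.size());
  assert(axis < input.size());
  CoalescedShapes out{{}, {}, 0};
  bool prev_mergeable = false;
  for (std::size_t i = 0; i < input.size(); ++i) {
    // The axis dim is never merged, even if its size equals a neighbor's size.
    const bool mergeable = (i != axis) && (input[i] == indices[i]);
    if (mergeable && prev_mergeable) {
      out.input.back() *= input[i];
      out.indices.back() *= indices[i];
    } else {
      if (i == axis) out.axis = out.input.size();
      out.input.push_back(input[i]);
      out.indices.push_back(indices[i]);
    }
    prev_mergeable = mergeable;
  }
  return out;
}
```

Under this simplified rule, input [4,32,1023,512] with indices [4,32,512,512] and axis 2 collapses to [128,1023,512] and [128,512,512] with axis 1, while input and indices [2,3,3] with axis 2 collapse to [6,3] with the axis dim left unmerged.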

Perf improve:

Previously, dim coalescing was applied to ScatterElements and GatherElementsGrad only. This PR also uses it for GatherElements.
Previously, the 2D kernel optimization was used by ScatterElements and GatherElementsGrad only. This PR also uses it for GatherElements.
Previously, the masked input stride optimization was used by GatherElements only. This PR also applies it to GatherElementsGrad and ScatterElements.
Adjusted the thread work size and used a local array to separate reads and writes. Nsight Compute was used to confirm the perf gain.

Tested on V100 using cases from a real model:

Input size: [4,32,512,1023], indices size: [4,32,512,512], axis=-1: when applying the 2D kernel to forward, GatherElements now has a 1.10x to 1.15x perf gain, depending on the indices values.
Input size: [4,32,1023,512], indices size: [4,32,512,512], axis=-2: we observe a ~3.4x perf gain for forward due to dim coalescing, and an 8.8x perf gain for backward due to the masked input stride optimization and some other changes.

pengwa

This is a good refactoring of the mentioned four kernels, and a consolidation of the perf optimizations for each kernel. I have a few comments. I'm not sure I've caught up with all the ideas; I will circle back with more if possible.

pengwa · 2022-06-13T02:40:55Z

+  // Reverse for better calculation.
+  std::reverse(input_shape.begin(), input_shape.end());
+  std::reverse(indices_shape.begin(), indices_shape.end());
+  size_t reverse_axis = rank - 1 - static_cast<size_t>(axis);


Should we check axis >= 0 in case the caller passes an invalid axis?

CoalesceDimensions is currently called by the kernel's ComputeInternal, and we have HandleNegativeAxis to guarantee axis is a valid number.

I still feel we'd better add a check somewhere to avoid naive misuse in the future, especially since here we are casting a possibly negative number to size_t.
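
The kind of guard being asked for could look like the sketch below. It is an illustration under assumptions, not the check that was eventually added: `ReverseAxisChecked` is a hypothetical name, and it assumes the caller has already normalized negative axes (as HandleNegativeAxis is said to do), so any negative value reaching it is treated as an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: validates the axis before the cast. Without the check,
// static_cast<std::size_t>(-1) wraps around to a huge value and the
// subtraction below silently produces a meaningless index.
std::size_t ReverseAxisChecked(int64_t axis, std::size_t rank) {
  if (axis < 0 || axis >= static_cast<int64_t>(rank)) {
    throw std::out_of_range("axis must be in [0, rank)");
  }
  return rank - 1 - static_cast<std::size_t>(axis);
}
```

For a rank-4 tensor, axis 3 maps to reversed position 0 and axis 0 maps to reversed position 3, while -1 or 4 raises instead of wrapping.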

pengwa · 2022-06-13T02:57:10Z

+  }
+
+  if (curr == reverse_axis) {
+    if (curr > 0) Move(input_shape, indices_shape, curr, 0);


Will this override the merged value at position 0?

"curr == reverse_axis" here is the special case that after skipping all 1-dim axes, reverse_axis is the leading axis, so there is no valid axis at position 0. If curr (and also reverse_axis) is not at position 0, move it to position 0 as the leading axis.

pengwa · 2022-06-13T03:10:22Z

+      return ONNX_NAMESPACE::TensorProto_DataType_INT8;
+    case sizeof(int16_t):
+      return ONNX_NAMESPACE::TensorProto_DataType_FLOAT16;
+    case sizeof(int32_t):


Why not sizeof(float)?

Similar to the reply above, we care only about the size of the data. sizeof(float) is the same as sizeof(int32_t). The first case is sizeof(int8_t), so I made all the cases consistent. But if you think sizeof(float) is better (because I return TensorProto_DataType_FLOAT), I can change them to float16, float and double.

Yeah, I think using sizeof(float)/sizeof(double) would be better.
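
The size-based mapping discussed in this thread can be sketched as follows. This is an assumption-laden illustration: `DispatchType` is a hypothetical stand-in for the `TensorProto_DataType_*` values, `DispatchTypeForElementSize` is a hypothetical name, and `uint16_t` stands in for the 16-bit half storage type because standard C++ has no portable half type.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the four types the kernels are specialized for.
enum class DispatchType { kInt8, kFloat16, kFloat, kDouble };

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "this sketch assumes 32-bit float and 64-bit double");

// Hypothetical helper: picks the dispatch type from the element size alone,
// so every 4-byte type (int32_t, uint32_t, float) shares one specialization.
DispatchType DispatchTypeForElementSize(std::size_t element_size) {
  switch (element_size) {
    case sizeof(int8_t):
      return DispatchType::kInt8;
    case sizeof(uint16_t):  // stand-in for the size of a half value
      return DispatchType::kFloat16;
    case sizeof(float):
      return DispatchType::kFloat;
    case sizeof(double):
      return DispatchType::kDouble;
    default:
      throw std::invalid_argument("unsupported element size");
  }
}
```

With this mapping, an int32_t or uint32_t tensor is routed to the float specialization and an int64_t tensor to the double specialization.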

pengwa · 2022-06-13T03:21:54Z

-  utils::MLTypeCallDispatcher<float, MLFloat16, int16_t, int8_t, int32_t,
-                              int64_t, uint8_t, uint16_t, uint32_t, uint64_t, double, bool>
-      t_disp(data_tensor->GetElementType());
+  utils::MLTypeCallDispatcher<int8_t, MLFloat16, float, double> t_disp(dtype);


Why don't we use "t_disp(data_tensor->GetElementType());" directly, or "t_disp(input_tensor->GetElementType());"?

This is because we specialize the template functions for only 4 data types (less specialization code in the kernel file and a smaller binary), as mentioned in the comment above. If we used GetElementType, we would need to specialize the template functions for all of these types.

pengwa · 2022-06-13T03:29:55Z

-constexpr int threads_per_block = GridDim::maxThreadsPerBlock;
-constexpr int thread_worksize = 16;
+constexpr int kThreadsPerBlock = GridDim::maxThreadsPerBlock;
+constexpr int kThreadWorkSize = 4;


Does the kThreadWorkSize change bring better perf for both 2D and other cases, on both V100 and A100? Do you know why a lighter-weight kernel might perform better?

Actually, values from 4 to 16 make no big difference according to my testing. I changed it to 4 just because most other kernels use this number, so this keeps it consistent. A big per-thread work size reduces the number of threads, which is not good for some smaller data shapes.

Got it. Maybe you can consider triggering some inferencing benchmarks via Anubis to see whether there is any perf impact, just in case it affects existing inferencing models.
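
The point about work size and thread count can be made concrete with a small host-side calculation. This is a sketch under assumptions: `kAssumedThreadsPerBlock = 256` is an illustrative value rather than the confirmed value of `GridDim::maxThreadsPerBlock`, and `LaunchSize`, `CeilDiv` and `ComputeLaunchSize` are hypothetical names.

```cpp
#include <cstdint>

// Assumed block size for illustration only.
constexpr int64_t kAssumedThreadsPerBlock = 256;

constexpr int64_t CeilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

struct LaunchSize {
  int64_t blocks;
  int64_t threads;
};

// Hypothetical helper: each thread handles work_size elements, so a larger
// work_size means fewer blocks and fewer threads in flight.
constexpr LaunchSize ComputeLaunchSize(int64_t num_elements, int64_t work_size) {
  const int64_t blocks = CeilDiv(num_elements, kAssumedThreadsPerBlock * work_size);
  return LaunchSize{blocks, blocks * kAssumedThreadsPerBlock};
}
```

Under these assumptions, 4096 elements launch 4 blocks (1024 threads) with a work size of 4 but only 1 block (256 threads) with a work size of 16, which is the small-shape effect described in the reply. For the [4,32,512,512] indices tensor the counts are 32768 and 8192 blocks respectively.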

pengwa · 2022-06-14T14:43:31Z

+  }
+}
+
+// GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64


OK, thanks for the explanation. So we are using float in the kernel (for value reads and writes) for int-typed input data, and assuming that this should not cause accuracy problems.

pengwa · 2022-06-14T14:50:57Z

+  }
+}
+
+// GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64


nit: BTW, shall we explain a bit further in the comment here that "compute is done only in those 4 types, regardless of the real data type"? I guess others might have the same question when they revisit this code.

pengwa

Left a few comments; overall LGTM.

pengwa

gather elements bugfix and perf improve

33658be

Lafi7e added the training label Apr 27, 2022

Lafi7e requested review from hariharans29, pengwa and weixingzhang April 27, 2022 02:20

Lafi7e added 4 commits April 27, 2022 16:08

fix win build

80db29f

fix ut on some eps

864d728

Merge branch 'master' into weicwang/gather_elements

fe26f73

Merge branch 'master' into weicwang/gather_elements

6bea811

pengwa reviewed Apr 29, 2022

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements.cc

Lafi7e added 2 commits June 1, 2022 15:57

Merge branch 'master' into weicwang/gather_elements

407bb55

ut change

7180342

pengwa reviewed Jun 13, 2022

Lafi7e added 2 commits June 13, 2022 16:55

resove comments

b82057c

Merge branch 'master' into weicwang/gather_elements

c8a89a4

pengwa reviewed Jun 14, 2022

Comment thread onnxruntime/core/providers/cuda/tensor/gather_elements_impl.cu

pengwa reviewed Jun 14, 2022

pengwa previously approved these changes Jun 14, 2022

resolve comments

ea99489

Lafi7e dismissed pengwa’s stale review via ea99489 June 15, 2022 04:09

pengwa previously approved these changes Jun 15, 2022

fix win build

9bd4455

Lafi7e dismissed pengwa’s stale review via 9bd4455 June 15, 2022 05:51

pengwa approved these changes Jun 15, 2022

Lafi7e merged commit 02457ec into master Jun 15, 2022

Lafi7e deleted the weicwang/gather_elements branch June 15, 2022 08:29
