[CUDA] GatherElements[Grad]/ScatterElements Bugfix and Perf Improve#11374
Conversation
pengwa
left a comment
There was a problem hiding this comment.
This is a good refactoring for the mentioned four kernels, and consolidation on perf optimizers for each kernel together. I have a few comments. Not sure I catch up with all the ideas, will circle back for more if possible.
| // Reverse for better calculation. | ||
| std::reverse(input_shape.begin(), input_shape.end()); | ||
| std::reverse(indices_shape.begin(), indices_shape.end()); | ||
| size_t reverse_axis = rank - 1 - static_cast<size_t>(axis); |
There was a problem hiding this comment.
should we check axis >=0 in case caller pass a invalid axis?
There was a problem hiding this comment.
CoalesceDimensions is currently called by kernel's ComputeInternal, and we have HandleNegativeAxis to guarantee axis is a valid number.
There was a problem hiding this comment.
While I still feel we'd better do a check somewhere avoiding naïve mis-usage for future , especially here we are casting a negative number to size_t.
| } | ||
|
|
||
| if (curr == reverse_axis) { | ||
| if (curr > 0) Move(input_shape, indices_shape, curr, 0); |
There was a problem hiding this comment.
will this override the merged value at position 0?
There was a problem hiding this comment.
"curr == reverse_axis" here is the special case that after skipping all 1-dim axes, reverse_axis is the leading axis, so there is no valid axis at position 0. If curr (and also reverse_axis) is not at position 0, move it to position 0 as the leading axis.
| return ONNX_NAMESPACE::TensorProto_DataType_INT8; | ||
| case sizeof(int16_t): | ||
| return ONNX_NAMESPACE::TensorProto_DataType_FLOAT16; | ||
| case sizeof(int32_t): |
There was a problem hiding this comment.
Similar to above reply, we care about the size of data only. sizeof(float) is same as sizeof(int32_t). It's just because the first case is sizeof(int8_t) so I made all case consistant. But if you think sizeof(float) is better (because I return TensorProto_DataType_FLOAT), I can change them to float16, float and double.
There was a problem hiding this comment.
Yeah, I think using sizeof(float)/sizeof(double) would be better.
| utils::MLTypeCallDispatcher<float, MLFloat16, int16_t, int8_t, int32_t, | ||
| int64_t, uint8_t, uint16_t, uint32_t, uint64_t, double, bool> | ||
| t_disp(data_tensor->GetElementType()); | ||
| utils::MLTypeCallDispatcher<int8_t, MLFloat16, float, double> t_disp(dtype); |
There was a problem hiding this comment.
why we don't use "t_disp(data_tensor->GetElementType());" directly, or "t_disp(input_tensor->GetElementType());"
There was a problem hiding this comment.
This is because we just specializes the template functions for 4 data types (for less specialization code in kernel file and smaller binary size), as mentioned in above comment. If we use GetElementType, then we need to specialize the template functions for all these types.
| constexpr int threads_per_block = GridDim::maxThreadsPerBlock; | ||
| constexpr int thread_worksize = 16; | ||
| constexpr int kThreadsPerBlock = GridDim::maxThreadsPerBlock; | ||
| constexpr int kThreadWorkSize = 4; |
There was a problem hiding this comment.
Does kThreadWorkSize change bring better perf or both 2D and other cases, on both V100 and A100? Do you know what the possible reason a light-weighter kernel bring a better perf?
There was a problem hiding this comment.
Actually number from 4 to 16 make no big difference according to my testing. Change to 4 is just because most of other kernels use this number, so to make it consistent. Big worksize in thread will reduce the number of threads, this is actually not good for some smaller data shapes.
There was a problem hiding this comment.
Got it. Maybe you can consider trigger some inferencing benchmark via Anubis to see any impact on the perf, just in case it affects existing inferencing models.
| } | ||
| } | ||
|
|
||
| // GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64 |
There was a problem hiding this comment.
OK, thanks for the explanation. So we are using float during kernel (value read and write) for int typed input data, and assuming that should not bring us accuracy problems.
| return ONNX_NAMESPACE::TensorProto_DataType_INT8; | ||
| case sizeof(int16_t): | ||
| return ONNX_NAMESPACE::TensorProto_DataType_FLOAT16; | ||
| case sizeof(int32_t): |
There was a problem hiding this comment.
Yeah, I think using sizeof(float)/sizeof(double) would be better.
| } | ||
| } | ||
|
|
||
| // GatherElementsGrad needs atomic_add which supports float types only, so use half, float and double for 16, 32, and 64 |
There was a problem hiding this comment.
nit: BTW, shall we explain a bit further in the comment here for "compute is done only in those 4 types, despite of the real data type", I guess someone else might have same questions when they re-visited this code.
| constexpr int threads_per_block = GridDim::maxThreadsPerBlock; | ||
| constexpr int thread_worksize = 16; | ||
| constexpr int kThreadsPerBlock = GridDim::maxThreadsPerBlock; | ||
| constexpr int kThreadWorkSize = 4; |
There was a problem hiding this comment.
Got it. Maybe you can consider trigger some inferencing benchmark via Anubis to see any impact on the perf, just in case it affects existing inferencing models.
| // Reverse for better calculation. | ||
| std::reverse(input_shape.begin(), input_shape.end()); | ||
| std::reverse(indices_shape.begin(), indices_shape.end()); | ||
| size_t reverse_axis = rank - 1 - static_cast<size_t>(axis); |
There was a problem hiding this comment.
While I still feel we'd better do a check somewhere avoiding naïve mis-usage for future , especially here we are casting a negative number to size_t.
pengwa
left a comment
There was a problem hiding this comment.
Leave few comments, overall LGTM.
Bugfix:
Perf improve:
Tested in V100 using case from real model: