GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel#5511
Conversation
| .SinceVersion(1) | ||
| .SetSupportLevel(OpSchema::SupportType::EXPERIMENTAL) | ||
| .SetDoc("TopKGrad") | ||
| .AllowUncheckedAttributes() |
There was a problem hiding this comment.
need the axis attribute?
The Indices output is index on a specific axis
There was a problem hiding this comment.
Do you know if it would be possible to use GatherGrad (or maybe GatherElementsGrad... I'm not sure) to implement this gradient, instead of adding a new kernel?
I did some tests with PyTorch and found that the following equivalence seemed to hold:
x = torch.rand([2, 3, 4, 5])
topk_values, topk_indices = x.topk(3, dim=2)
gather_values = x.gather(topk_indices, dim=2)
print((topk_values == gather_values).all()) # prints "tensor(True)"
I think this implies that we could use the gradient of gather to implement the gradient of top-k, but there could be a corner case that I haven't seen. It would be nice to reuse GatherGrad though, because there's been a lot of performance engineering on it recently.
There was a problem hiding this comment.
Thanks @mrry ! You are right that GatherElementsGrad can be reused here according to the Op definition. But current GatherElementsGrad doesn't have CPU kernel implementation, then my CPU code can be changed to GatherElementsGrad. :-) Then both GatherElementsGrad and TopKGrad will have both CPU and CUDA. Another bad news is when I use GatherElementsGrad for TopKGrad, it failed one of my UT using CUDA (my CPU implementation works OK), I then added a new UT with same data size and attributes for GatherElementsGrad and it also failed. It means current CUDA implementation has bug somewhere when axis attribute is not default. I will investigate and fix it.
There was a problem hiding this comment.
Just pushed a new version to use Scatter CPU/CUDA impl for both GatherElementsGrad and TopKGrad.
| outputs.push_back(GI(i)); | ||
| } else { | ||
| outputs.push_back(ArgDef("", nullptr)); | ||
| outputs.push_back(IA("ConvInput_" + I(i).name)); |
There was a problem hiding this comment.
Derek had a fix for this. The gradient for inputs are made optional, so this change is probably not needed anymore.
There was a problem hiding this comment.
Maybe we can keep my change here as it reads better then empty string.
There was a problem hiding this comment.
Just for my education: will it still be treated as an optional output if there's a non-empty string there? (If so, I'm happy with putting a more descriptive string in the graph.)
There was a problem hiding this comment.
@mrry , I had a quick test and indeed empty string can skip the output calculation while non-empty string cannot. I've rollbacked the change. Thanks!
There was a problem hiding this comment.
Thanks for testing this! I wonder if this means there are other places in the code base where we give names to unused outputs and end up doing wasted computation :).
| NodeDef(OpDef{"TopKGrad", kMSDomain, 1}, | ||
| {GO(0), O(1), I(0)}, | ||
| {GI(0)}, | ||
| SrcNodeAttributes())}; |
There was a problem hiding this comment.
Shall we make the copy of axis attributed explicit, as don't need the other two attributes.
Add GatherElementsGrad CPU kernel and TopKGrad CPU/CUDA kernel using Scatter. Also fix a bug on Scatter CUDA implementation.
Motivation and Context