Skip to content

GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel#5511

Merged
Lafi7e merged 3 commits into
masterfrom
weicwang/topkgradcpu
Oct 21, 2020
Merged

GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel#5511
Lafi7e merged 3 commits into
masterfrom
weicwang/topkgradcpu

Conversation

@Lafi7e
Copy link
Copy Markdown
Contributor

@Lafi7e Lafi7e commented Oct 16, 2020

Add GatherElementsGrad CPU kernel and TopKGrad CPU/CUDA kernel using Scatter. Also fix a bug on Scatter CUDA implementation.

Motivation and Context

  • It's required by one of customer's models, which is small and only run on CPU.

@Lafi7e Lafi7e added the training issues related to ONNX Runtime training; typically submitted using template label Oct 16, 2020
@Lafi7e Lafi7e requested a review from a team as a code owner October 16, 2020 07:59
.SinceVersion(1)
.SetSupportLevel(OpSchema::SupportType::EXPERIMENTAL)
.SetDoc("TopKGrad")
.AllowUncheckedAttributes()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need the axis attribute?

The Indices output is index on a specific axis

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if it would be possible to use GatherGrad (or maybe GatherElementsGrad... I'm not sure) to implement this gradient, instead of adding a new kernel?

I did some tests with PyTorch and found that the following equivalence seemed to hold:

x = torch.rand([2, 3, 4, 5])
topk_values, topk_indices = x.topk(3, dim=2)
gather_values = x.gather(topk_indices, dim=2)
print((topk_values == gather_values).all())  # prints "tensor(True)"

I think this implies that we could use the gradient of gather to implement the gradient of top-k, but there could be a corner case that I haven't seen. It would be nice to reuse GatherGrad though, because there's been a lot of performance engineering on it recently.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrry ! You are right that GatherElementsGrad can be reused here according to the Op definition. But current GatherElementsGrad doesn't have CPU kernel implementation, then my CPU code can be changed to GatherElementsGrad. :-) Then both GatherElementsGrad and TopKGrad will have both CPU and CUDA. Another bad news is when I use GatherElementsGrad for TopKGrad, it failed one of my UT using CUDA (my CPU implementation works OK), I then added a new UT with same data size and attributes for GatherElementsGrad and it also failed. It means current CUDA implementation has bug somewhere when axis attribute is not default. I will investigate and fix it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a new version to use Scatter CPU/CUDA impl for both GatherElementsGrad and TopKGrad.

outputs.push_back(GI(i));
} else {
outputs.push_back(ArgDef("", nullptr));
outputs.push_back(IA("ConvInput_" + I(i).name));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derek had a fix for this. The gradient for inputs are made optional, so this change is probably not needed anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can keep my change here as it reads better then empty string.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my education: will it still be treated as an optional output if there's a non-empty string there? (If so, I'm happy with putting a more descriptive string in the graph.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrry , I had a quick test and indeed empty string can skip the output calculation while non-empty string cannot. I've rollbacked the change. Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing this! I wonder if this means there are other places in the code base where we give names to unused outputs and end up doing wasted computation :).

NodeDef(OpDef{"TopKGrad", kMSDomain, 1},
{GO(0), O(1), I(0)},
{GI(0)},
SrcNodeAttributes())};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we make the copy of axis attributed explicit, as don't need the other two attributes.

@Lafi7e Lafi7e changed the title TopKGrad CPU kernel GatherElementsGrad CPU Kernel and TopKGrad CPU/CUDA Kernel Oct 19, 2020
@Lafi7e Lafi7e merged commit b48f596 into master Oct 21, 2020
@Lafi7e Lafi7e deleted the weicwang/topkgradcpu branch October 21, 2020 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

training issues related to ONNX Runtime training; typically submitted using template

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants