SoftmaxCrossEntropyLossInternalGrad and Sum Fusion #12746
Conversation
```cpp
      continue;
    }
    // ...
    bool has_same_shape = true;
```
I recall that `Shape` can be compared with `==` directly:

```cpp
if (input->Shape() != skip->Shape()) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "skip is expected to have same shape as input");
}
```
There are two different "shapes" here: ORT's runtime tensor shape is `onnxruntime::TensorShape`, while the "shape" in the ONNX graph (which the transformer code works with) is `onnx::TensorShapeProto`.
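For illustration, a minimal sketch of the `==` comparison idea, using `std::vector<int64_t>` as a stand-in for `onnxruntime::TensorShape` (the `Shape` alias and `HasSameShape` helper here are hypothetical, not ORT API):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for onnxruntime::TensorShape, which supports operator== directly.
using Shape = std::vector<int64_t>;

// Mirrors the suggested check: two tensors are compatible only if their
// shapes compare equal element-wise.
bool HasSameShape(const Shape& input, const Shape& skip) {
  return input == skip;  // element-wise comparison, like TensorShape::operator==
}
```

Note this stands in only for the runtime `TensorShape` comparison; comparing `onnx::TensorShapeProto` in transformer code needs dimension-by-dimension handling of symbolic dims.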
```cpp
namespace onnxruntime {
// ...
Status SceLossGradBiasFusion::ApplyImpl(Graph& graph, bool& modified, int graph_level,
```
Does the fused graph pattern look like this?
`SoftmaxCrossEntropyLossInternalGrad --> optional Reshape --> Add|Sum`?
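To make the pattern concrete, here is a toy sketch of that match on a linear chain of op types (the `ToyNode` struct and `MatchesSceLossGradBiasPattern` helper are hypothetical illustrations, not the actual transformer code, which walks `onnxruntime::Graph` nodes and checks edges and attributes):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy node: just an op type, for pattern illustration only.
struct ToyNode {
  std::string op_type;
};

// Sketch of the pattern in question:
// SoftmaxCrossEntropyLossInternalGrad -> optional Reshape -> Add|Sum
bool MatchesSceLossGradBiasPattern(const std::vector<ToyNode>& path) {
  size_t i = 0;
  if (i >= path.size() ||
      path[i].op_type != "SoftmaxCrossEntropyLossInternalGrad") return false;
  ++i;
  if (i < path.size() && path[i].op_type == "Reshape") ++i;  // Reshape is optional
  return i + 1 == path.size() &&
         (path[i].op_type == "Add" || path[i].op_type == "Sum");
}
```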
```cpp
    d_logit->Reshape(new_shape);
  }
  // ...
  // Bias.
```
I am a bit surprised we did not use a parallel for here. At least for the newly added elementwise add, it should be simple to turn it into a parallel loop.
It may mean nobody uses CPU for training, so I didn't take the effort to refactor the code and just made it work.
If it's critical for on-device training, we can optimize this in a new PR.
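As a sketch of what a parallelized elementwise add could look like, using plain `std::thread` in place of ORT's thread pool (`ParallelAdd` is a hypothetical helper for illustration, not the kernel in this PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split the elementwise add across a few worker threads, each handling a
// contiguous chunk. The real code would use ORT's thread pool instead of
// spawning raw std::thread objects.
void ParallelAdd(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& out, size_t num_threads = 4) {
  const size_t n = a.size();
  out.resize(n);
  const size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t begin = t * chunk;
    const size_t end = std::min(n, begin + chunk);
    if (begin >= end) break;  // fewer elements than threads
    workers.emplace_back([&, begin, end] {
      for (size_t i = begin; i < end; ++i) out[i] = a[i] + b[i];
    });
  }
  for (auto& w : workers) w.join();
}
```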
```cpp
  };
  // ...
  std::unique_ptr<GraphTransformer> transformer = std::make_unique<SceLossGradBiasFusion>();
  TestGraphTransformer(build_test_case, 14, logger, std::move(transformer), TransformerLevel::Level2, 1,
```
Shall we also run opset 12 and 13? We had models onboarded running on opset 12; they will probably get refreshed to re-train with new ORT + new data, and would benefit from this fusion.
Added the test. Note that old ORT releases will not have this fusion. If a user wants to re-train with new ORT and get this fusion, the default opset version for ORTModule is now 15, unless the user sets the opset version to an older one via environment variables, which is not recommended.
|
FYI @baijumeswani, let's check whether on-device training models (GPU/CPU) have this pattern or not.
We observed the SoftmaxCrossEntropyLossInternalGrad+Sum pattern in more than one customer model; each of the two nodes needs ~10ms to compute when the input tensor shape is relatively large, especially with a large vocab size. This PR fuses the two ops into a single one. On the CUDA/ROCm EPs only one fused kernel is launched, so the total execution time for this pattern is roughly halved. We also observe a >2% throughput gain for the whole model.
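To illustrate why the fusion roughly halves the work, here is a toy CPU sketch contrasting two separate passes (with an intermediate buffer, like two kernel launches) against a single fused pass (both functions are hypothetical stand-ins for the real CUDA/ROCm kernels, which also avoid the second launch overhead):

```cpp
#include <cstddef>
#include <vector>

// Unfused: the grad kernel writes an intermediate tensor, then a separate
// Sum/Add kernel reads it back and adds the bias -- two full memory passes.
std::vector<float> UnfusedGradPlusBias(const std::vector<float>& grad,
                                       const std::vector<float>& bias) {
  std::vector<float> tmp(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) tmp[i] = grad[i];           // "grad" kernel
  std::vector<float> out(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) out[i] = tmp[i] + bias[i];  // "Sum" kernel
  return out;
}

// Fused: one pass, no intermediate tensor, one kernel launch.
std::vector<float> FusedGradPlusBias(const std::vector<float>& grad,
                                     const std::vector<float>& bias) {
  std::vector<float> out(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) out[i] = grad[i] + bias[i];
  return out;
}
```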