SoftmaxCrossEntropyLossInternalGrad and Sum Fusion #12746
Conversation
```cpp
      continue;
    }
    // ...
    bool has_same_shape = true;
```
I recall that `Shape` can be compared with `==` directly:

```cpp
if (input->Shape() != skip->Shape()) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "skip is expected to have same shape as input");
}
```
There are two different "shapes" here: ORT's runtime tensor shape is `onnxruntime::TensorShape`, while the "shape" in the ONNX graph (which the transformer code works with) is `onnx::TensorShapeProto`.
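For illustration, a minimal sketch of the `==` comparison idea, using `std::vector<int64_t>` as a stand-in for `onnxruntime::TensorShape` (the `Shape` alias and `HasSameShape` helper here are hypothetical, not ORT API):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for onnxruntime::TensorShape, which supports operator== directly.
using Shape = std::vector<int64_t>;

// Mirrors the suggested check: two tensors are compatible only if their
// shapes compare equal element-wise.
bool HasSameShape(const Shape& input, const Shape& skip) {
  return input == skip;  // element-wise comparison, like TensorShape::operator==
}
```

Note this stands in only for the runtime `TensorShape` comparison; comparing `onnx::TensorShapeProto` in transformer code needs dimension-by-dimension handling of symbolic dims.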
```cpp
namespace onnxruntime {
// ...
Status SceLossGradBiasFusion::ApplyImpl(Graph& graph, bool& modified, int graph_level,
```
Does the fused graph pattern look like this?
`SoftmaxCrossEntropyLossInternalGrad --> optional Reshape --> Add|Sum`?
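To make the pattern concrete, here is a toy sketch of that match on a linear chain of op types (the `ToyNode` struct and `MatchesSceLossGradBiasPattern` helper are hypothetical illustrations, not the actual transformer code, which walks `onnxruntime::Graph` nodes and checks edges and attributes):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy node: just an op type, for pattern illustration only.
struct ToyNode {
  std::string op_type;
};

// Sketch of the pattern in question:
// SoftmaxCrossEntropyLossInternalGrad -> optional Reshape -> Add|Sum
bool MatchesSceLossGradBiasPattern(const std::vector<ToyNode>& path) {
  size_t i = 0;
  if (i >= path.size() ||
      path[i].op_type != "SoftmaxCrossEntropyLossInternalGrad") return false;
  ++i;
  if (i < path.size() && path[i].op_type == "Reshape") ++i;  // Reshape is optional
  return i + 1 == path.size() &&
         (path[i].op_type == "Add" || path[i].op_type == "Sum");
}
```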
```cpp
    d_logit->Reshape(new_shape);
  }
  // ...
  // Bias.
```
I am a bit surprised we did not use a parallel for here. At least for the newly added elementwise add, it should be simple to turn it into a parallel loop.
It may mean nobody uses CPU for training, so I didn't take the effort to refactor the code and just made it work.
If it's critical for on-device training, we can optimize this in a new PR.
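As a sketch of what a parallelized elementwise add could look like, using plain `std::thread` in place of ORT's thread pool (`ParallelAdd` is a hypothetical helper for illustration, not the kernel in this PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split the elementwise add across a few worker threads, each handling a
// contiguous chunk. The real code would use ORT's thread pool instead of
// spawning raw std::thread objects.
void ParallelAdd(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& out, size_t num_threads = 4) {
  const size_t n = a.size();
  out.resize(n);
  const size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t begin = t * chunk;
    const size_t end = std::min(n, begin + chunk);
    if (begin >= end) break;  // fewer elements than threads
    workers.emplace_back([&, begin, end] {
      for (size_t i = begin; i < end; ++i) out[i] = a[i] + b[i];
    });
  }
  for (auto& w : workers) w.join();
}
```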
```cpp
  };
  // ...
  std::unique_ptr<GraphTransformer> transformer = std::make_unique<SceLossGradBiasFusion>();
  TestGraphTransformer(build_test_case, 14, logger, std::move(transformer), TransformerLevel::Level2, 1,
```
Shall we also run opset 12 and 13? We had models onboarded running on opset 12; they will probably get refreshed to re-train with new ORT + new data, and would benefit from this fusion.
Added the test. Note that old ORT releases will not have this fusion. If a user wants to re-train with new ORT and get this fusion, the default opset version for ORTModule is now 15, unless the user sets the opset version to an older one via environment variables, which is not recommended.
|
FYI @baijumeswani, let's check whether on-device training models (GPU/CPU) have this pattern or not.
We observed the SoftmaxCrossEntropyLossInternalGrad+Sum pattern in more than one customer model; each of the two nodes needs ~10ms to compute when the input tensor shape is relatively large, especially with a large vocab size. This PR fuses the two ops into a single one. On the CUDA/ROCm EPs only one fused kernel is launched, so the total execution time for this pattern is roughly halved. We also observe a >2% throughput gain for the whole model.
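To illustrate why the fusion roughly halves the work, here is a toy CPU sketch contrasting two separate passes (with an intermediate buffer, like two kernel launches) against a single fused pass (both functions are hypothetical stand-ins for the real CUDA/ROCm kernels, which also avoid the second launch overhead):

```cpp
#include <cstddef>
#include <vector>

// Unfused: the grad kernel writes an intermediate tensor, then a separate
// Sum/Add kernel reads it back and adds the bias -- two full memory passes.
std::vector<float> UnfusedGradPlusBias(const std::vector<float>& grad,
                                       const std::vector<float>& bias) {
  std::vector<float> tmp(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) tmp[i] = grad[i];           // "grad" kernel
  std::vector<float> out(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) out[i] = tmp[i] + bias[i];  // "Sum" kernel
  return out;
}

// Fused: one pass, no intermediate tensor, one kernel launch.
std::vector<float> FusedGradPlusBias(const std::vector<float>& grad,
                                     const std::vector<float>& bias) {
  std::vector<float> out(grad.size());
  for (size_t i = 0; i < grad.size(); ++i) out[i] = grad[i] + bias[i];
  return out;
}
```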