Optimize computation orders by pengwa · Pull Request #13672 · microsoft/onnxruntime

pengwa · 2022-11-16T14:11:36Z

Optimize computation orders

In Roberta/Electra, when ClassificationHead is used, the features are sliced along the sequence_length dimension, and the loss calculation depends only on the sliced data. The slice is taken at axis 1: before slicing the shape is [batch, sequence_length, hidden]; after slicing it becomes [batch, hidden].
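
As a rough illustration of the slice described above, here is a minimal NumPy sketch. The sizes (batch=2, sequence_length=4, hidden=3) are made up for the example and do not come from this PR; the point is only that slicing at axis 1 drops that axis and that a small fraction of the activations feeds the loss.

```python
import numpy as np

batch, sequence_length, hidden = 2, 4, 3  # illustrative sizes, not from the PR
features = np.arange(batch * sequence_length * hidden, dtype=np.float32).reshape(
    batch, sequence_length, hidden
)

# Same indexing as `features[:, 0, :]` in the classification head: slice at axis 1.
sliced = features[:, 0, :]

print(features.shape)  # (2, 4, 3)
print(sliced.shape)    # (2, 3)

# Fraction of the activations that the loss actually depends on.
kept_fraction = sliced.size / features.size
```

With these sizes only one quarter of the elements survive the slice, which is why doing the slice earlier can save work in the ops that precede it.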

We have an opportunity to move this slicing as early as possible in the graph by passing it through simple elementwise ops (like Add/Div), through LayerNorm/Softmax (if their reduce axis is after the slicing axis), or even through MatMul's left operand (only if the slice does not affect the last dims).
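
The pass-through idea can be checked numerically. The following is a minimal NumPy sketch, not the optimizer's implementation: the helper names (`take_first_token`, `layer_norm`, `softmax`) and the tensor sizes are assumptions chosen for illustration. It compares "compute on the full tensor, then slice" against "slice first, then compute" for an Add, a LayerNorm and Softmax over the last axis, and a MatMul whose left operand is the sliced tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden = 2, 5, 4  # illustrative sizes (assumptions)
x = rng.standard_normal((batch, seq, hidden))
bias = rng.standard_normal(hidden)
weight = rng.standard_normal((hidden, hidden))


def take_first_token(t):
    """Slice at axis 1, as `features[:, 0, :]` does."""
    return t[:, 0, :]


def layer_norm(t, eps=1e-5):
    """Normalize over the last axis only (it comes after the sliced axis)."""
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mean) / np.sqrt(var + eps)


def softmax(t):
    """Softmax over the last axis only."""
    e = np.exp(t - t.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Original order: run every op on the full tensor, slice at the end.
late = take_first_token(softmax(layer_norm(x + bias) @ weight))

# Reordered: slice first, then run the same ops on the smaller tensor.
early = softmax(layer_norm(take_first_token(x) + bias) @ weight)

same = np.allclose(late, early)
```

Both orders give the same result here because none of the ops mixes data across the sequence axis; an op that reduced over axis 1 would not allow the slice to pass through.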

Operators like Reshape/Transpose are special: they either have shape data specified (which must be updated after slicing) or have perm specified, which requires the input rank to remain unchanged. For those operators we keep the original rank and leave the sliced dim as 1; after the computation completes, we do a Squeeze.
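
A minimal NumPy sketch of the keep-rank-then-Squeeze approach for Reshape/Transpose follows. The sizes and the `perm` value are assumptions for illustration only; the sketch is not taken from the optimizer code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, heads, head_dim = 2, 5, 3, 4  # illustrative sizes (assumptions)
x = rng.standard_normal((batch, seq, heads * head_dim))
perm = (0, 2, 1, 3)

# Original order: Reshape -> Transpose -> slice the sequence axis (now axis 2).
late = x.reshape(batch, seq, heads, head_dim).transpose(perm)[:, :, 0, :]

# Reordered: slice first but keep the sliced axis with size 1,
# so the same `perm` is still valid for the Transpose.
kept = x[:, 0:1, :]                                 # [batch, 1, hidden]
reshaped = kept.reshape(batch, 1, heads, head_dim)  # shape data updated: seq -> 1
transposed = reshaped.transpose(perm)               # [batch, heads, 1, head_dim]
early = np.squeeze(transposed, axis=2)              # [batch, heads, head_dim]

same = np.array_equal(late, early)
```

Keeping the size-1 axis means only the Reshape target shape changes; the Transpose attributes stay untouched, and the final Squeeze restores the shape that the original slice would have produced.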

class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py

Benchmark

A simple benchmark shows Roberta training latency dropped from 208ms to 199ms, a reduction of roughly 4% (9ms out of 208ms).
More comprehensive tests are on the way.

Motivation and Context


baijumeswani

Looks good. Sorry for the delay in the review. Please update the branch to address failing pipelines.


askhade · 2022-12-21T06:19:04Z

      transformers.emplace_back(std::make_unique<ReshapeFusion>(compatible_eps));
      transformers.emplace_back(std::make_unique<ConcatSliceElimination>(compatible_eps));
-#if defined(USE_CUDA) || defined(USE_ROCM)
-      transformers.emplace_back(std::make_unique<ComputationReductionTransformer>(compatible_eps));


Why are we deleting this transformer?

Oh, this is a renaming; it is now called ComputeOptimizer.

askhade · 2022-12-21T06:26:08Z

    "${ONNXRUNTIME_INCLUDE_DIR}/core/optimizer/*.h"
    "${ONNXRUNTIME_ROOT}/core/optimizer/*.h"
    "${ONNXRUNTIME_ROOT}/core/optimizer/*.cc"
+    "${ONNXRUNTIME_ROOT}/core/optimizer/compute_optimizer/*.h"


All the code in these files and in the test files is wrapped in ENABLE_TRAINING so why add these files here? You can add them within the if (onnxruntime_ENABLE_TRAINING) condition

Because the optimizer is also applicable to inferencing. I intend to make it easier to enable it for inference later without changing file locations and CMake macros.

askhade · 2022-12-21T06:47:05Z

+          // 3. Should all inputs be allowed when track back further (bottom-up);
+          //    if not, add the input index restriction as MatMul did.
+          {GetFullQualifiedOpName("Add", kOnnxDomain),
+           OpPassThroughConfig({}, std::make_shared<SimplePassThroughActor>(), opset_14_13_7_6_1)},


What is the benefit of creating these const initializer lists? Any opportunity for reuse across the ops is purely coincidental.

When a temporary list of opsets is defined, on Windows some of its values are later read back as zero. So I had to make it a constant for it to run correctly.


askhade

LGTM


### Slice op upstream refactor

A refactoring follow-up to #13672.

### Motivation and Context

There are similar optimization opportunities for upstreaming other operators to reduce compute FLOPs, so this PR refactors the existing code base to make it easier to support other ops.

The changes in this PR are mainly renaming and moving:

- Move common logic (from compute_optimizer.h/cc) into upstream_transformer_base.h/cc and shared_utils.h/cc.
  - Common upstream logic is moved into upstream_transformer_base.h/cc.
  - Shared utilities are moved into shared_utils.h/cc.
- After the move, compute_optimizer.h/cc mainly contains the upstream Gather implementation (inheriting from upstream_transformer_base.h/cc). Ideally it should be renamed, but to keep this review easier, its name is kept.

optimzier compute orders

1c4ae75

pengwa added the training label (issues related to ONNX Runtime training; typically submitted using template) Nov 16, 2022

pengwa requested review from Lafi7e, askhade, baijumeswani, mindest and zhijxu-MS November 16, 2022 14:12

github-advanced-security AI found potential problems Nov 16, 2022


Comment thread onnxruntime/test/testdata/transform/computation_reduction/gather_roberta_e2e.py Fixed

Comment thread onnxruntime/test/testdata/transform/computation_reduction/gather_roberta_e2e.py Fixed

pengwa added 4 commits November 30, 2022 10:46

refinement

2c5e469

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

e686c0c

…pengwa/optimize_compute

add more test cases

fc89ad6

add ignored test files

7e480cd

github-advanced-security AI found potential problems Nov 30, 2022


Comment thread onnxruntime/test/testdata/transform/computation_reduction/gather/gather_matmul.py Fixed

pengwa added 7 commits December 1, 2022 08:51

refactor

9ebd598

fixes

0470221

fix builds

477cea8

fix tests and typos

fdd8e16

fix win builds

6d49e0b

fix win cpu tests

c922e3d

fix window build related to std::initializer_list

f03a2f6

snnn reviewed Dec 2, 2022


Comment thread onnxruntime/core/optimizer/compute_optimizer/compute_optimizer.h Outdated

snnn reviewed Dec 2, 2022


Comment thread onnxruntime/core/optimizer/compute_optimizer/compute_optimizer.h Outdated

baijumeswani reviewed Dec 6, 2022


pengwa added 3 commits December 6, 2022 12:58

refine based on comments

d80eb86

minor

6ad1051

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

8994934

…pengwa/optimize_compute

baijumeswani reviewed Dec 7, 2022


pengwa added 3 commits December 8, 2022 10:51

refine based on comments

140718b

add switch for turn off for debug purpose

9aff55a

formatting

a72489c

baijumeswani previously approved these changes Dec 12, 2022


Comment thread docs/ORTModule_Training_Guidelines.md

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

aeef3e7

…pengwa/optimize_compute

askhade reviewed Dec 21, 2022


pengwa added 2 commits December 21, 2022 09:34

Merge branch 'main' of https://github.com/microsoft/onnxruntime into …

8c05a20

…pengwa/optimize_compute

enable optimizer for c++ trainer

c4613cb

pengwa dismissed baijumeswani’s stale review via c4613cb December 22, 2022 02:07

askhade approved these changes Dec 22, 2022


pengwa merged commit 2f5bf75 into main Dec 22, 2022

pengwa deleted the pengwa/optimize_compute branch December 22, 2022 07:12

pengwa mentioned this pull request Feb 26, 2023

Op slicing upstream refactor #14832

Merged
