Add pipeline online partition logic for pipeline#3996
Conversation
Tested with onnxruntime_training_bert with both naive pipeline and pipedream logic. Will add some unit tests too. #Resolved
onnxruntime/core/graph/graph.cc
Outdated
}

void Graph::AddValueInfo(const NodeArg* new_value_info){
  for(auto info : value_info_){
auto [](start = 6, length = 4)
const auto& #Resolved
Not sure it's worth changing to a reference -- value_info_ is a vector of pointers, and a reference is implemented as a pointer, so taking it by value vs. by reference takes the same memory space.
I believe auto will deduce the 'const', but it is a good idea to write it explicitly.
In reply to: 428256226 [](ancestors = 428256226)
ok, better to use const auto* or auto* for pointer types.
In reply to: 428394644 [](ancestors = 428394644,428256226)
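To illustrate the `const auto*` suggestion with a generic sketch (the `NodeArg` stand-in below is hypothetical, not ORT's actual class): copying the pointer is as cheap as referencing it, and spelling out `const auto*` documents both the pointer type and the read-only access.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ORT's NodeArg; only the name matters here.
struct NodeArg {
  std::string name;
};

// Iterating a vector of pointers: `const auto*` copies the pointer (cheap)
// and makes it explicit that the pointee is not modified. `const auto&`
// would bind a reference to the pointer itself, which saves nothing.
inline bool ContainsName(const std::vector<const NodeArg*>& value_info,
                         const std::string& name) {
  for (const auto* info : value_info) {
    if (info->name == name) return true;
  }
  return false;
}
```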
// original node_arg pointer from the origin graph. This serves as the key in the
// updated_node_arg map and any reference for original node_arg name
auto original_node_arg = producer_node->MutableOutputDefs()[idx];
auto [](start = 6, length = 4)
better to use auto* for pointer type #Resolved
}

auto& new_receive_output = CreateNodeArg(graph, updated_node_arg);
auto old_shape = *(updated_node_arg->Shape());
auto [](start = 6, length = 4)
const auto& ? #Pending
I am not sure it is legitimate for two nodes to share the same shape object -- updated_node_arg->Shape() returns its shape object, and new_receive_output.SetShape(old_shape) takes old_shape by reference rather than by copy. So if old_shape is a reference to *updated_node_arg->Shape(), both updated_node_arg and new_receive_output would share the same shape object.
I tested the bert application with const auto&, and it seems to work, but I am not sure whether I just got lucky.
So I want to confirm before making the change.
In reply to: 428259076 [](ancestors = 428259076)
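A quick sketch of the aliasing question, using hypothetical simplified types (not ORT's real Shape/NodeArg): whether `const auto&` is safe here depends entirely on whether SetShape stores a copy. If it copies, as below, the two node args never alias; if it stored the reference, they would share one shape object.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for a shape proto: just a list of dimensions.
struct Shape {
  std::vector<int64_t> dims;
};

// Simplified stand-in for a node arg that owns its shape.
struct SimpleNodeArg {
  Shape shape;
  // Takes the shape by const reference but assigns (copies) it into the
  // member, so the caller's Shape and this node arg's Shape never alias.
  void SetShape(const Shape& s) { shape = s; }
};
```

Under these copy semantics, binding `const auto& old_shape` to the producer's shape is safe: mutating the producer afterwards does not affect the receiver.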
}

Status ApplyPipelinePartitionToMainGraph(Graph& graph,
                                         std::vector<TrainingSession::TrainingConfiguration::CutInfo> cut_info,
std::vector<TrainingSession::TrainingConfiguration::CutInfo> [](start = 41, length = 60)
const & #Resolved
                                         std::string& backward_waited_event_after_recv_name,
                                         std::string& backward_recorded_event_before_send_name);

Status ApplyPipelinePartitionToMainGraph(Graph& graph,
Graph& graph, [](start = 41, length = 13)
nit: put in new line as above? #Resolved
  return graph_inputs_including_initializers_;
}

bool IsInputsIncludingInitializers(const NodeArg* node_arg) const noexcept{
// Return true if "node_arg" is an input or an initializer. Otherwise, returns false.
bool IsInputsIncludingInitializers(const NodeArg* node_arg) const noexcept{
Could you add comments for those newly-added functions (like other existing ones)? Also, I am not sure if this should go in graph_utils or be a member of graph. If the design is to keep the core structure minimal, it'd be better to move those helper functions into a separate class. #Resolved
comment added.
I am not against moving it to util. The main concern is that this method should stay consistent with the graph object's graph_inputs_including_initializers_ variable; having it outside the class makes it hard to keep the two in sync.
In reply to: 429517645 [](ancestors = 429517645)
ConstReverseNodeIterator rbegin() const noexcept {
  return {nodes_.crbegin(), nodes_.crend()};
}
Are we sure the reverse of a topological sort can be computed like this? I think we need to invert all edges and do an actual sort.
[Update] I found a proof. Your design is correct. #Resolved
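The claim can be sanity-checked mechanically: reversing a topological order of a DAG yields a valid topological order of the graph with all edges inverted, since if u precedes v for every edge u->v, then v precedes u for every inverted edge v->u. A minimal generic checker sketch (not ORT code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns true if `order` is a valid topological order for `edges`,
// i.e. every edge's source appears before its destination.
// Nodes are assumed to be labeled 0..order.size()-1.
inline bool IsTopoOrder(const std::vector<int>& order,
                        const std::vector<std::pair<int, int>>& edges) {
  std::vector<int> pos(order.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    pos[order[i]] = static_cast<int>(i);
  }
  for (const auto& e : edges) {
    if (pos[e.first] >= pos[e.second]) return false;
  }
  return true;
}
```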
                   onnx::TensorProto_DataType type,
                   std::vector<NodeArg*>& new_node_args,
                   std::vector<std::string>& new_names) {
  auto& event_id = CreateTypedNodeArg(graph, type, op_name);
event_id [](start = 8, length = 8)
new_node_arg #Resolved
}

template <typename T>
void AddNewNodeArgAndInitializer(Graph& graph,
AddNewNodeArgAndInitializer [](start = 5, length = 27)
AddNewScalarNodeArgAndInitializer #Resolved
}

template <typename T>
void AddNewNodeArgAndInitializer(Graph& graph,
// This function is used when you want to create a scalar constant in a graph.
// It may create a NodeArg so that other nodes can reference its value.
// It also creates an initializer to store its value. #Resolved
proto_data.set_data_type(type);

switch (type) {
  case ONNX_NAMESPACE::TensorProto_DataType_BOOL:
switch is not good. Can we deduce its type from typename T? #Resolved
the switch is inevitable even though we deduce the type from T. Because we still need to differentiate the calling methods (add_int32_data vs add_int64_data), it would just be one more step of deducing and then entering the switch statement.
In reply to: 429519064 [](ancestors = 429519064)
How about this, let's write

    void set_int32_scalar(const int32_t value, ONNX_NAMESPACE::TensorProto& tensor) {
      tensor.set_data_type(onnx::TensorProto_DataType::INT32);
      tensor.add_int32_data(value);
    }

    void set_int64_scalar(const int64_t value, ONNX_NAMESPACE::TensorProto& tensor) {
      tensor.set_data_type(onnx::TensorProto_DataType::INT64);
      tensor.add_int64_data(value);
    }

and call them in AddNewScalarNodeArgAndInitializer. Then proto_data.set_data_type(type) and the switch can be removed.
In reply to: 430577195 [](ancestors = 430577195,429519064)
Even though we have set_int32_scalar and set_int64_scalar, how are we going to call them without a type check (i.e. an if or switch)?
We still need to distinguish between data types to decide which function to call.
On another note, I am not sure why switch is so bad that we want to avoid it...
In reply to: 430690470 [](ancestors = 430690470,430577195,429519064)
The type parameter "T" determines which one to call.
In reply to: 430724590 [](ancestors = 430724590,430690470,430577195,429519064)
That's what I was referring to. Once inside AddNewNodeArgAndInitializer&lt;T&gt;, we still need to check T's type to decide which one to call. This check is inevitable.
In reply to: 430729278 [](ancestors = 430729278,430724590,430690470,430577195,429519064)
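For what it's worth, one way to reconcile the two views is compile-time dispatch with `if constexpr`: the branch is selected by the template parameter T at instantiation, so no runtime type check remains in the generated code. A hedged sketch against a hypothetical simplified tensor struct (not ONNX's actual TensorProto):

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for ONNX's TensorProto, holding one scalar.
// 6 and 7 mirror the ONNX enum values for INT32 and INT64.
struct FakeTensor {
  int data_type = 0;
  std::vector<int32_t> int32_data;
  std::vector<int64_t> int64_data;
};

// Compile-time dispatch on T: each instantiation contains exactly one
// branch, so there is no runtime switch over the data type.
template <typename T>
void SetScalar(FakeTensor& tensor, T value) {
  if constexpr (std::is_same_v<T, int32_t>) {
    tensor.data_type = 6;  // INT32
    tensor.int32_data.push_back(value);
  } else {
    static_assert(std::is_same_v<T, int64_t>, "only int32/int64 supported");
    tensor.data_type = 7;  // INT64
    tensor.int64_data.push_back(value);
  }
}
```

The check does still exist, but it happens at compile time rather than per call, which is the distinction the thread was circling around.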
                                                        TrainingSession::TrainingConfiguration::CutEdge("71", {"273"})};
TrainingSession::TrainingConfiguration::CutInfo cut1 = {TrainingSession::TrainingConfiguration::CutEdge("308"),
                                                        TrainingSession::TrainingConfiguration::CutEdge("71", {"395"})};
params.pipeline_partition_cut_list.emplace_back(cut0);
Those should be input arguments of main. For example, --cuts 186-71-273,308-71-395. #Resolved
yeah, preferably it should come as flexible command-line input instead of C++ code (see TODO on line 385).
Currently the parser is not capable of parsing structured data like this. (The example you gave is not sufficient, as there are many ways to interpret 186-71-273.) But it shouldn't be hard; I'll add the parser later.
In reply to: 429520899 [](ancestors = 429520899)
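Such a parser is indeed straightforward once the syntax is unambiguous. The sketch below assumes a hypothetical syntax (cuts separated by '|', edges by ',', optional consumer node names after ':' joined with '+'); the ParsedEdge struct, ParseCuts function, and the flag format itself are illustrative, not part of this PR:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one cut edge: a tensor name plus optional
// consumer node names (mirroring the shape of CutEdge in this PR).
struct ParsedEdge {
  std::string tensor;
  std::vector<std::string> consumers;
};

// Parses e.g. "186,71:273|308,71:395" into two cuts of two edges each.
// '|' separates cuts, ',' separates edges, ':' introduces consumers,
// and '+' separates multiple consumers of one tensor.
inline std::vector<std::vector<ParsedEdge>> ParseCuts(const std::string& arg) {
  std::vector<std::vector<ParsedEdge>> cuts;
  std::istringstream cut_stream(arg);
  std::string cut;
  while (std::getline(cut_stream, cut, '|')) {
    std::vector<ParsedEdge> edges;
    std::istringstream edge_stream(cut);
    std::string edge;
    while (std::getline(edge_stream, edge, ',')) {
      ParsedEdge parsed;
      const auto colon = edge.find(':');
      parsed.tensor = edge.substr(0, colon);  // whole edge if no ':'
      if (colon != std::string::npos) {
        std::istringstream consumer_stream(edge.substr(colon + 1));
        std::string consumer;
        while (std::getline(consumer_stream, consumer, '+')) {
          parsed.consumers.push_back(consumer);
        }
      }
      edges.push_back(parsed);
    }
    cuts.push_back(edges);
  }
  return cuts;
}
```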
// If not provided, no optimizer is added.
optional<OptimizerConfiguration> optimizer_config{};

// struct to describe a specific edge (not node_arg). This information is
struct to describe a specific edge (not node_arg) [](start = 7, length = 49)
struct to describe a specific edge (not node_arg) ---> struct to describe the edges associated with the same tensor. Note that one edge is a connection from a tensor to an operator. If a tensor is consumed by multiple operators, there are multiple edges from that tensor to its consumers. #Resolved
wschin
left a comment
Overall looks good but I do feel there are some corners we can further clean up.
// Each pipeline stage should pick up some strings from this field..
std::vector<std::string> fetch_names;
// [TODO] Add cut information.
std::vector<CutInfo> cut_list;
std::vector<CutInfo> cut_list; [](start = 6, length = 30)
What is this? cut_list[i] contains the edges between the first and the second pipeline stages? For vector fields, we need to put a definition for its elements. #Resolved
cut_list[i] contains the CutInfo to make the partition between stage i and stage i+1. Comment updated.
In reply to: 429521450 [](ancestors = 429521450)
// split the graph into disconnected subgraph based on provided CutInfo
common::Status SplitGraph(Graph& graph,
                          std::vector<TrainingSession::TrainingConfiguration::CutInfo> split_edge_groups_,
split_edge_groups_ [](start = 87, length = 18)
split_edge_groups #Resolved
// add output node_arg for send/recv
AddNewNodeArg(graph, "send_output_signal" + cut_index_str, ONNX_NAMESPACE::TensorProto_DataType_BOOL, send_output_args, new_output_names);
AddNewNodeArg(graph, "receive_output_signal" + cut_index_str, ONNX_NAMESPACE::TensorProto_DataType_BOOL, recv_output_args, new_output_names);
Just curious. How long a line can be in ORT code base? #Resolved
for (auto& id : edgeIds) {
  // find node whose output contains id.node_arg_name
  auto producer_node = graph.GetMutableProducerNode(id.node_arg_name);
  if (producer_node == nullptr) {
producer_node == nullptr [](start = 10, length = 24)
if (!producer_node) #Resolved
I think basically they are the same. I have no problem changing it; just curious whether there is any guidance on preferring one over the other.
In reply to: 429526366 [](ancestors = 429526366)
It's shorter and more readable if people have smaller screens. :)
In reply to: 430585524 [](ancestors = 430585524,429526366)
wschin
left a comment
Thank you! Let's consider adding bool_data in TensorProto in ONNX.
Thank you all for reviewing :)
Description: Support online graph partition for pipeline training. Users can now switch between offline mode (partition the graph offline) and online mode (online partition) for pipeline training.
Motivation and Context
Users no longer need to pre-partition their graph before running ORT pipeline training.