
Add pipeline online partition logic for pipeline#3996

Merged
xzhu1900 merged 9 commits into master from xuzhu/online-partition
May 27, 2020

Conversation

@xzhu1900
Contributor

Description: Support graph online partition for pipeline training. Users can now switch between offline mode (graph partitioned offline) and online mode (graph partitioned at runtime) for pipeline training.

Motivation and Context

  • Why is this change required? What problem does it solve?
Users no longer need to pre-partition their graph before running ORT pipeline training.
  • If it fixes an open issue, please link to the issue here.

@xzhu1900 xzhu1900 requested a review from a team as a code owner May 19, 2020 21:44
@xzhu1900
Contributor Author

xzhu1900 commented May 19, 2020

Tested with onnxruntime_training_bert with both naive pipeline and pipedream logic. Will add some unit tests too. #Resolved

@xzhu1900 xzhu1900 changed the title Add online partition logic for pipeline Add pipeline online partition logic for pipeline May 19, 2020
}

void Graph::AddValueInfo(const NodeArg* new_value_info) {
  for (auto info : value_info_) {
Contributor

@ke1337 ke1337 May 20, 2020

auto

const auto& #Resolved

Contributor Author

Not sure it's worth changing it to a reference -- value_info_ is a vector of pointers, and a reference is implemented as a pointer anyway, so taking the element by value vs. by reference uses the same memory.

I think auto will deduce the 'const', but it is a good idea to put it there explicitly.


In reply to: 428256226

Contributor

@ke1337 ke1337 May 22, 2020

ok, better to use const auto* or auto* for pointer types.


In reply to: 428394644

Contributor Author

Updated.


In reply to: 428987507


// original node_arg pointer from the origin graph. This serves as the key in the
// updated_node_arg map and any reference for original node_arg name
auto original_node_arg = producer_node->MutableOutputDefs()[idx];
Contributor

@ke1337 ke1337 May 20, 2020

auto

better to use auto* for pointer type #Resolved

}

auto& new_receive_output = CreateNodeArg(graph, updated_node_arg);
auto old_shape = *(updated_node_arg->Shape());
Contributor

@ke1337 ke1337 May 20, 2020

auto

const auto& ? #Pending

Contributor Author

I am not sure it is legit for two nodes to share the same shape object -- updated_node_arg->Shape() returns its shape object, and new_receive_output.SetShape(old_shape) takes old_shape by reference instead of by copy. So if old_shape is a reference to *updated_node_arg->Shape(), both updated_node_arg and new_receive_output would share the same shape object.

I tested the bert application with const auto&, and it seems to work. But I am not sure whether that is just because I got lucky.

So I just want to confirm before I make the change.


In reply to: 428259076

}

Status ApplyPipelinePartitionToMainGraph(Graph& graph,
std::vector<TrainingSession::TrainingConfiguration::CutInfo> cut_info,
Contributor

@ke1337 ke1337 May 20, 2020

std::vector<TrainingSession::TrainingConfiguration::CutInfo>

const & #Resolved

std::string& backward_waited_event_after_recv_name,
std::string& backward_recorded_event_before_send_name);

Status ApplyPipelinePartitionToMainGraph(Graph& graph,
Contributor

@ke1337 ke1337 May 20, 2020

Graph& graph,

nit: put in new line as above? #Resolved

ke1337
ke1337 previously approved these changes May 22, 2020
Contributor

@ke1337 ke1337 left a comment

:shipit:


return graph_inputs_including_initializers_;
}

bool IsInputsIncludingInitializers(const NodeArg* node_arg) const noexcept{
Contributor

@wschin wschin May 23, 2020

Suggested change
bool IsInputsIncludingInitializers(const NodeArg* node_arg) const noexcept{
// Return true if "node_arg" is a input or an initializer. Otherwise, returns false.
bool IsInputsIncludingInitializers(const NodeArg* node_arg) const noexcept{

Could you add comments for those newly-added functions (like the other existing ones)? Also, I am not sure if this should be in graph_utils or a member of Graph. If the design is to keep the core structure minimal, it'd be better to move those helper functions into a separate class. #Resolved

Contributor Author

Comment added.

I am not against moving it to util. The main concern here is that this should stay consistent with the graph object's graph_inputs_including_initializers_ variable. Having this method outside makes it hard to keep the two in sync.


In reply to: 429517645


ConstReverseNodeIterator rbegin() const noexcept {
return {nodes_.crbegin(), nodes_.crend()};
}
Contributor

@wschin wschin May 23, 2020

Are we sure the reverse of a topological sort can be computed like this? I think we need to invert all edges and do an actual sort.

[Update] I found a proof. Your design is correct. #Resolved

onnx::TensorProto_DataType type,
std::vector<NodeArg*>& new_node_args,
std::vector<std::string>& new_names) {
auto& event_id = CreateTypedNodeArg(graph, type, op_name);
Contributor

@wschin wschin May 23, 2020

event_id

new_node_arg #Resolved

}

template <typename T>
void AddNewNodeArgAndInitializer(Graph& graph,
Contributor

@wschin wschin May 23, 2020

AddNewNodeArgAndInitializer

AddNewScalarNodeArgAndInitializer #Resolved

}

template <typename T>
void AddNewNodeArgAndInitializer(Graph& graph,
Contributor

@wschin wschin May 23, 2020

// This function is used when you want to create a scalar constant in a graph.
// It may create a NodeArg so that other nodes can reference its value.
// It also creates an initializer to store its value. #Resolved

proto_data.set_data_type(type);

switch (type) {
case ONNX_NAMESPACE::TensorProto_DataType_BOOL:
Contributor

@wschin wschin May 23, 2020

switch is not good. Can we deduce its type from typename T? #Resolved

Contributor Author

The switch is inevitable even though we deduce the type from T, because we still need to differentiate between the different calls (add_int32_data vs add_int64_data). It would just be one more step of deducing and then entering the switch statement.


In reply to: 429519064 [](ancestors = 429519064)

Contributor

@wschin wschin May 26, 2020

How about this, let's write

void set_int32_scalar(const int32_t value, ONNX_NAMESPACE::TensorProto& tensor) {
  tensor.set_data_type(onnx::TensorProto_DataType::INT32);
  tensor.add_int32_data(value);
}

void set_int64_scalar(const int64_t value, ONNX_NAMESPACE::TensorProto& tensor) {
  tensor.set_data_type(onnx::TensorProto_DataType::INT64);
  tensor.add_int64_data(value);
}

and call them in AddNewScalarNodeArgAndInitializer. Then, proto_data.set_data_type(type); and switch can be removed.


In reply to: 430577195 [](ancestors = 430577195,429519064)

Contributor Author

Even though we have set_int32_scalar and set_int64_scalar, how are we going to call them without a type check (i.e. an if or switch)?

We still need to distinguish between data types to decide which function to call.

On another note, I am not sure why switch is so bad that we want to avoid it...


In reply to: 430690470 [](ancestors = 430690470,430577195,429519064)

Contributor

The type parameter "T" determines which one to call.


In reply to: 430724590 [](ancestors = 430724590,430690470,430577195,429519064)

Contributor Author

That's what I was referring to. Once inside AddNewNodeArgAndInitializer<T>, we still need to check T's type to decide which one to call. This check is inevitable.


In reply to: 430729278

TrainingSession::TrainingConfiguration::CutEdge("71", {"273"})};
TrainingSession::TrainingConfiguration::CutInfo cut1 = {TrainingSession::TrainingConfiguration::CutEdge("308"),
TrainingSession::TrainingConfiguration::CutEdge("71", {"395"})};
params.pipeline_partition_cut_list.emplace_back(cut0);
Contributor

@wschin wschin May 23, 2020

Those should be input arguments of main. For example, --cuts 186-71-273,308-71-395. #Resolved

Contributor Author

Yeah, preferably it should come as flexible command-line input instead of C++ code (see the TODO on line 385).

Currently the parser is not capable of parsing structured data like this. (The example you put is not sufficient, as there are many ways to interpret 186-71-273.) But it shouldn't be hard. I'll add the parser later.


In reply to: 429520899

// If not provided, no optimizer is added.
optional<OptimizerConfiguration> optimizer_config{};

// struct to describe a specific edge (not node_arg). This information is
Contributor

@wschin wschin May 23, 2020

struct to describe a specific edge (not node_arg)

struct to describe a specific edge (not node_arg) ---> struct to describe edges associated to the same tensor. Note that one edge is a connection from a tensor to an operator. If a tensor is consumed by multiple operators, there would be multiple edges from that tensor to its consumers. #Resolved

Contributor

@wschin wschin left a comment

Overall looks good but I do feel there are some corners we can further clean up.

// Each pipeline stage should pick up some strings from this field.
std::vector<std::string> fetch_names;
// [TODO] Add cut information.
std::vector<CutInfo> cut_list;
Contributor

@wschin wschin May 23, 2020

std::vector<CutInfo> cut_list;

What is this? cut_list[i] contains the edges between the first and the second pipeline stages? For vector fields, we need to put a definition for its elements. #Resolved

Contributor Author

cut_list[i] contains the CutInfo to make the partition between stage i and stage i+1. Comment updated.


In reply to: 429521450


// split the graph into disconnected subgraph based on provided CutInfo
common::Status SplitGraph(Graph& graph,
std::vector<TrainingSession::TrainingConfiguration::CutInfo> split_edge_groups_,
Contributor

@wschin wschin May 23, 2020

split_edge_groups_

split_edge_groups #Resolved

// add output node_arg for send/recv
AddNewNodeArg(graph, "send_output_signal" + cut_index_str, ONNX_NAMESPACE::TensorProto_DataType_BOOL, send_output_args, new_output_names);
AddNewNodeArg(graph, "receive_output_signal" + cut_index_str, ONNX_NAMESPACE::TensorProto_DataType_BOOL, recv_output_args, new_output_names);

Contributor

@wschin wschin May 23, 2020

Just curious. How long a line can be in ORT code base? #Resolved


for (auto& id : edgeIds) {
// find node whose output contains id.node_arg_name
auto producer_node = graph.GetMutableProducerNode(id.node_arg_name);
if (producer_node == nullptr) {
Contributor

@wschin wschin May 23, 2020

producer_node == nullptr

if (!producer_node) #Resolved

Contributor Author

I think basically they are the same. I have no problem changing it. Just curious if there is any guidance on preferring one over the other.


In reply to: 429526366

Contributor

It's shorter and more readable if people have smaller screens. :)


In reply to: 430585524

@xzhu1900 xzhu1900 added the training label (issues related to ONNX Runtime training; typically submitted using template) May 26, 2020
wschin
wschin previously approved these changes May 26, 2020
Contributor

@wschin wschin left a comment

Thank you! Let's consider adding bool_data in TensorProto in ONNX.

@xzhu1900 xzhu1900 merged commit 633008b into master May 27, 2020
@xzhu1900 xzhu1900 deleted the xuzhu/online-partition branch May 27, 2020 00:44
@xzhu1900
Contributor Author

Thank you all for reviewing :)
