
Training Proposal: Spec Changes and Gradient Operator #2314

Open · wants to merge 9 commits into base: master from wschin:training
Conversation

wschin (Contributor) commented Sep 16, 2019

This PR includes all necessary changes for enabling training. It currently includes #2013, #1955, #1970, #1959, #2168, #1939. Per the Training WG's discussion, we want to put everything into a single place for convenience and to give a global view of our status.

Please see #2038 and those sub-PRs' messages for details.

To speed up the code review process and properly distribute the work, this PR now only contains content from #2517 by @tbennun, #2013, and #2168. These are the minimal requirements for expressing a training graph. Losses and optimizers will be added separately.

This PR introduces several major changes.

  1. Add a protobuf message, TrainingInfoProto, originally designed in #2013, to store training information.
  2. In TrainingInfoProto, the user can store the training algorithm in the algorithm field as a GraphProto.
  3. The user can also store an initialization algorithm for resetting the model in TrainingInfoProto.initialization (proposed by @tbennun in #2517 and agreed by the Training WG).
  4. ModelProto.graph is callable inside TrainingInfoProto.algorithm. ModelProto.graph.initializer is visible to nodes in TrainingInfoProto.algorithm.node.
  5. This PR also introduces a Gradient operator to differentiate a function represented by a (sub-)graph. This idea is from #2168. The domain of Gradient is ai.onnx.training.
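The nesting described in the list above can be sketched with plain Python dataclasses. This is a simplified illustrative model only, not the actual onnx protobuf API; field names follow the proposal, everything else (tensor values as floats, nodes as strings) is an assumption for brevity:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GraphProto:
    name: str
    node: List[str] = field(default_factory=list)                 # stand-in for NodeProto list
    initializer: Dict[str, float] = field(default_factory=dict)   # name -> tensor (simplified)

@dataclass
class TrainingInfoProto:
    # Graph run to (re)initialize the model; may be empty.
    initialization: GraphProto = field(default_factory=lambda: GraphProto("init"))
    # Graph run once per training iteration; may call ModelProto.graph and
    # read ModelProto.graph.initializer.
    algorithm: GraphProto = field(default_factory=lambda: GraphProto("algo"))
    # Maps a graph output name to the initializer it overwrites.
    update_binding: Dict[str, str] = field(default_factory=dict)

@dataclass
class ModelProto:
    graph: GraphProto
    training_info: List[TrainingInfoProto] = field(default_factory=list)

# A model whose weight W is overwritten each iteration by the output W_new.
model = ModelProto(graph=GraphProto("inference", initializer={"W": 0.0}))
model.training_info.append(TrainingInfoProto(update_binding={"W_new": "W"}))
```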
@wschin wschin added the training label Sep 16, 2019
@wschin wschin added this to the 1.7 milestone Sep 16, 2019
@wschin wschin requested review from onnx/sig-archinfra-approvers as code owners Sep 16, 2019
CLAassistant commented Sep 17, 2019

CLA assistant check
All committers have signed the CLA.

prasanthpul (Member) commented Sep 17, 2019

Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training"; it can be "ai.onnx.lossfunction" or something like that.

wschin (Contributor, Author) commented Sep 17, 2019

> Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training"; it can be "ai.onnx.lossfunction" or something like that.

Sounds good. I will put them into ai.onnx.training.

spandantiwari (Contributor) commented Oct 1, 2019

How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable', or only those that are captured in update_binding? This question comes up when backends try to optimize graphs using constant folding and need to decide which of the initializers are true constants (non-trainable) that can be folded.

Review thread on onnx/onnx.in.proto (outdated, resolved)
wschin (Contributor, Author) commented Dec 3, 2019

> How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable', or only those that are captured in update_binding? This question comes up when backends try to optimize graphs using constant folding and need to decide which of the initializers are true constants (non-trainable) that can be folded.

There is no direct concept of trainable. As long as you use update_binding to update an initializer, that initializer is considered trainable because its value is altered at the end of each training iteration. In contrast, as long as an initializer's name is not the value of any key-value pair in update_binding, that initializer is considered a constant.
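The decision rule described here can be sketched as a small helper that a constant-folding pass might use. This is a hypothetical illustration (dict-based stand-ins, not the real protobuf types); it only encodes the rule stated above: an initializer is trainable iff its name appears as a value in some update_binding:

```python
def is_trainable(initializer_name, training_infos):
    """An initializer is trainable iff it appears as a *value* in some
    update_binding; otherwise it is a constant and safe to fold."""
    return any(
        initializer_name in ti["update_binding"].values()
        for ti in training_infos
    )

# Hypothetical example: W is updated each iteration, B is never touched.
training_infos = [{"update_binding": {"W_new": "W"}}]
assert is_trainable("W", training_infos)      # updated -> trainable
assert not is_trainable("B", training_infos)  # untouched -> constant, foldable
```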

@wschin wschin force-pushed the wschin:training branch from c78f181 to 3081768 Dec 3, 2019
chinhuang007 (Contributor) commented Dec 10, 2019

Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable.

wschin (Contributor, Author) commented Dec 10, 2019

> Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable.

Yes!

@wschin wschin requested a review from chinhuang007 Jan 9, 2020
wschin and others added 2 commits Jan 23, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store the training algorithm in
     the `algorithm` field as a `GraphProto`.
  3. The user can also store an initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed by the Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
@wschin wschin force-pushed the wschin:training branch from 0ef8fc2 to 9bd2c39 Jan 24, 2020
@wschin wschin requested a review from chinhuang007 Jan 24, 2020
chinhuang007 (Contributor) left a comment

LGTM!

spandantiwari (Contributor) commented Jan 24, 2020

> How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable', or only those that are captured in update_binding? This question comes up when backends try to optimize graphs using constant folding and need to decide which of the initializers are true constants (non-trainable) that can be folded.

> There is no direct concept of trainable. As long as you use update_binding to update an initializer, that initializer is considered trainable because its value is altered at the end of each training iteration. In contrast, as long as an initializer's name is not the value of any key-value pair in update_binding, that initializer is considered a constant.

OK. To summarize, the necessary and sufficient condition for an initializer to be considered trainable is that it is included in update_binding. If it is not, then it is considered a constant, and it is fair game to use it for optimizations such as constant folding. There is no other condition required to mark an initializer as trainable/non-trainable (e.g., including the initializer in the list of graph inputs to mark it as trainable).

Is this correct? If yes, we should add this to the description somewhere to be clear.

spandantiwari (Contributor) left a comment

LGTM. Minor feedback added above.

wschin (Contributor, Author) commented Jan 24, 2020

> How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable', or only those that are captured in update_binding? This question comes up when backends try to optimize graphs using constant folding and need to decide which of the initializers are true constants (non-trainable) that can be folded.

> There is no direct concept of trainable. As long as you use update_binding to update an initializer, that initializer is considered trainable because its value is altered at the end of each training iteration. In contrast, as long as an initializer's name is not the value of any key-value pair in update_binding, that initializer is considered a constant.

> OK. To summarize, the necessary and sufficient condition for an initializer to be considered trainable is that it is included in update_binding. If it is not, then it is considered a constant, and it is fair game to use it for optimizations such as constant folding. There is no other condition required to mark an initializer as trainable/non-trainable (e.g., including the initializer in the list of graph inputs to mark it as trainable).

> Is this correct? If yes, we should add this to the description somewhere to be clear.

Yes, it's correct. I will extend the spec to state that point explicitly. Thank you!

[Update] Extended a small paragraph right above update_binding:

// Whether an initializer is trainable is determined by the values of
// key-value pairs in "update_binding". If the name of an initializer
// cannot be found as a value in the "update_binding"s of all
// TrainingInfoProto's, that initializer is not trainable and therefore
// considered as a constant.

repeated StringStringEntryProto update_binding = 3;  
Review thread on docs/Operators.md (outdated, resolved)
prasanthpul (Member) left a comment

Ops SIG should weigh in on experimental. We deprecated the concept of experimental ops previously. This PR is re-introducing it.

Feedback to Training WG was to have all training functionality as preview/experimental in initial release. But this does not necessarily mean experimental ops, especially since the IR has no such concept.

Review thread on onnx/defs/training/defs.cc (outdated, resolved)
@wschin wschin requested review from gramalingam and prasanthpul Jan 25, 2020
Review thread on docs/Changelog.md (outdated, resolved)
Review thread on onnx/defs/training/defs.cc (outdated, resolved)
// and assign its random output to the specific tensor using "update_binding".
//
// By default, this field is an empty graph and its evaluation does not
// produce any output.

gramalingam (Contributor) commented Jan 26, 2020

I don't understand this. (a) Does this mean that this field is optional and may be missing? (b) If it is missing, what is the initial value assigned to initializers?

wschin (Contributor, Author) commented Jan 27, 2020

(a) This field can be missing. (b) Because the default graph doesn't produce any output, the user cannot use update_binding to assign initial values to any initializers. Thus, the default initialization behavior is not to change any initializer.

// "algorithm" and "initialization" to any initializer in
// ModelProto.graph.initializer and TrainingInfoProto.algorithm.initializer.
// One initializer can only be assigned up to once; it implies that the
// values in "update_binding" must be unique.
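The "assigned at most once" constraint quoted above can be checked mechanically. Below is a hedged sketch of such a validator over (output, initializer) binding pairs; the helper name and list-of-pairs representation are hypothetical, not part of the ONNX checker:

```python
def check_bindings_unique(update_binding):
    """Each initializer may be assigned at most once, i.e. the values
    (initializer names) of update_binding must be unique."""
    seen = set()
    for out_name, init_name in update_binding:
        if init_name in seen:
            raise ValueError(f"initializer {init_name!r} assigned more than once")
        seen.add(init_name)

check_bindings_unique([("W_new", "W"), ("B_new", "B")])   # OK: distinct targets
try:
    check_bindings_unique([("W_new", "W"), ("W_other", "W")])
except ValueError:
    pass  # duplicate assignment to W is rejected
```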

gramalingam (Contributor) commented Jan 26, 2020

A couple of related points: (a) Should we have a separate update_binding for initialization and a separate one for algorithm? Otherwise, we need to be clear whether we expect the two graphs to use same output-names for one given initializer, or allow both (init_x, x) and (update_x, x) in the bindings. (b) Should we swap the order of names in the update_bindings? That has two advantages: it is more consistent with the ordering in an assignment statement. It also clarifies that only one update-value should be specified for an initializer (since the first entry is the "key").

wschin (Contributor, Author) commented Jan 27, 2020

(a) My understanding is that we allow both (init_x, x) and (update_x, x) in update_binding. It means update_binding is a union of assignments from initialization and training.

(b) I personally feel (key, value) = (W_new, W) is closer to what will happen in the runtime. After finishing the computation of algorithm and producing an output W_new, we need a mapping from W_new to the updated initializer. If we store (key, value) = (W, W_new), we will need to convert (W, W_new) to (W_new, W) when creating the output-to-initializer mapping.

gramalingam (Contributor) commented Jan 27, 2020

Re. (a): I think splitting the two simplifies the semantics. Re. (b): I also think we should define the semantics formally (using appropriate programming/mathematical notation). In that sense the difference is not large, but the order (lhs, rhs) is common for an assignment, and similarly (variable, value) is common for a binding in functional languages.

Mathematically, the underlying state is a "map" M that maps tensor-variables to their values. The execution of a single update-binding (W, W_new) updates M to M[W => W_new].
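The map-update semantics sketched here can be written directly in Python, treating the state as a dict M from tensor names to values. This is an illustrative toy only; the pair order (W, W_new) follows gramalingam's proposed convention, and the helper name is hypothetical:

```python
def apply_update_bindings(M, bindings, outputs):
    """Apply bindings of the form (W, W_new): initializer W takes the value
    the training graph produced for output W_new, i.e. M[W => W_new].

    M        : dict mapping tensor names to current values
    bindings : list of (initializer_name, graph_output_name) pairs
    outputs  : dict mapping graph output names to computed values
    """
    new_M = dict(M)
    for w, w_new in bindings:
        new_M[w] = outputs[w_new]
    return new_M

M = {"W": 1.0}
outputs = {"W_new": 0.9}   # produced by one run of the algorithm graph
M = apply_update_bindings(M, [("W", "W_new")], outputs)
# M["W"] is now 0.9
```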

wschin (Contributor, Author) commented Jan 27, 2020

Re Re (a): Ok. I will create initialization_binding and algorithm_binding and remove update_binding.

Re Re (b): Ok. I will swap keys and values.


// Training-specific information. Sequentially executing all stored
// "TrainingInfoProto"s is one training iteration.
repeated TrainingInfoProto training_info = 21;

gramalingam (Contributor) commented Jan 26, 2020

What does this mean, given that the initialization is also inside a TrainingInfoProto? I think I understand repeating a list of "update algorithm" in one iteration, but the initialization is a problem.

wschin (Contributor, Author) commented Jan 27, 2020

There will be two execution modes.

  1. Training: sequentially execute all TrainingInfoProto's algorithm graphs.
  2. Initialization: sequentially execute all TrainingInfoProto's initialization graphs.
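The two modes above can be sketched as a toy interpreter over hypothetical TrainingInfoProto-like dicts, where each graph is modeled as a callable returning its outputs. The initialization_binding/update_binding naming follows the convention adopted later in this thread, with bindings mapping a graph output name to the initializer it overwrites; everything else is illustrative:

```python
def run_initialization(training_infos, state):
    # Mode 2: sequentially execute every TrainingInfoProto's
    # initialization graph, committing outputs via its bindings.
    for ti in training_infos:
        outputs = ti["initialization"](state)
        for out_name, init_name in ti["initialization_binding"].items():
            state[init_name] = outputs[out_name]

def run_training_iteration(training_infos, state):
    # Mode 1: sequentially execute every TrainingInfoProto's
    # algorithm graph; one full pass is one training iteration.
    for ti in training_infos:
        outputs = ti["algorithm"](state)
        for out_name, init_name in ti["update_binding"].items():
            state[init_name] = outputs[out_name]

# Toy example: one TrainingInfoProto that zeroes W on reset
# and decays it by 0.1 on each training iteration.
ti = {
    "initialization": lambda s: {"W_init": 0.0},
    "initialization_binding": {"W_init": "W"},
    "algorithm": lambda s: {"W_new": s["W"] - 0.1},
    "update_binding": {"W_new": "W"},
}
state = {"W": 5.0}
run_initialization([ti], state)       # state["W"] -> 0.0
run_training_iteration([ti], state)   # state["W"] -> -0.1
```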
Update Gradient test models
@wschin wschin force-pushed the wschin:training branch from d44b566 to 4801755 Jan 27, 2020
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
@@ -246,6 +351,10 @@ message ModelProto {

// Named metadata values; keys should be distinct.
repeated StringStringEntryProto metadata_props = 14;

// Training-specific information. Sequentially executing all stored
// "TrainingInfoProto"s is one training iteration.

gramalingam (Contributor) commented Jan 27, 2020

Can we clarify the semantics by adding the following?

In particular, this defines two functionalities: an initialization-step and a training-iteration-step.
The semantics of the initialization-step is that all initializers (in ModelProto.graph and in all
TrainingInfoProto.algorithm) are first initialized as specified by the initializers in the graph,
and then by the initialization_bindings in every instance in training_info.
The semantics of the training-iteration-step is that all the initializers are updated as specified
by the update_bindings in every instance in training_info.

In particular, I have a question: is my understanding of the initialization above correct? And what happens in the ambiguous case where multiple initialization values are specified for W (e.g., the graph specifies an initializer value for W, and one or more initialization_bindings also specify a value for W)?

wschin (Contributor, Author) commented Jan 27, 2020

I think the semantics of initialization apply to training as well. We need to compute new weights from those initialized ones.

Regarding ambiguity, only one of algorithm and initialization can be executed at a time. If the user chooses to train the model, they need to execute all algorithms and ignore all initializations. If the user chooses to reset the model, all algorithms will be ignored and only initializations will be sequentially executed.

wschin (Contributor, Author) commented Jan 27, 2020

I added your clarification to the protobuf files. Thanks.

gramalingam (Contributor) left a comment

Thanks Wei-Sheng! Greatly appreciate this multi-month effort!
