Make Dropout and BatchNormalization Training-friendly #1887
Conversation
LGTM 👍
This is a breaking change; please bump up the opset version.
Please register the opset 10 batchnorm here: https://github.com/onnx/onnx/blob/master/onnx/defs/operator_sets.h#L533,L554
cc: @linkerzhang We need to enable the debug build in CI to detect this problem, and furthermore we should fix our op registry.
Do you see any issue with the 'is_train' input being defined separately on each operation, which could potentially result in some operations in training mode and some in testing mode in the same model?
@hobei Imagine you have a model with two dropout nodes, one located in the first few layers and one in the last few layers. If you wish to fine-tune only the last few layers, you can set the second dropout node to training mode while keeping the first dropout in test mode.
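A minimal sketch of that per-node control, built with onnx.helper. The third Dropout input (`is_train`) and the input ordering follow this PR's proposal rather than a released opset, and all tensor names are illustrative:

```python
import onnx
from onnx import helper, TensorProto

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 64])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 64])

# Per-node mode flags and a shared dropout ratio, stored as initializers.
ratio = helper.make_tensor("ratio", TensorProto.FLOAT, [], [0.5])
front_mode = helper.make_tensor("is_train_front", TensorProto.BOOL, [], [False])  # frozen layers: test mode
tail_mode = helper.make_tensor("is_train_tail", TensorProto.BOOL, [], [True])     # fine-tuned layers: training mode

# The first Dropout stays in test mode, the second runs in training mode.
drop1 = helper.make_node("Dropout", ["X", "ratio", "is_train_front"], ["H"])
drop2 = helper.make_node("Dropout", ["H", "ratio", "is_train_tail"], ["Y"])

graph = helper.make_graph(
    [drop1, drop2], "mixed_mode_dropout", [X], [Y],
    initializer=[ratio, front_mode, tail_mode],
)
model = helper.make_model(graph)  # checking this would require a schema that defines the proposed inputs
```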
@SherlockNoMad
@SherlockNoMad I think the proposed solution is sound, handling the different behaviors of certain operators between training and inference modes. The question is, how do we know we have covered all operators that need this optional input? Another question is, what do we do with a Dropout/BatchNorm with is_train=true while creating/optimizing the inference graph?
@hobei While awaiting Sherlock's response, I just want to provide my view... The feature of "defining the fine-tune layers" seems like a separate, interesting topic. The is_train input determines the expected behavior at the individual operator level. To selectively fine-tune certain layers, or some sub-graphs, during training, we might need to introduce something new.
@chinhuang007
@hobei @chinhuang007
.Input(
    5,
    "is_train",
    "If set to nonzero, run spatial batch normalization in training mode, default is 0.",
I think is_train is of type tensor(bool). Maybe we should consider using {true, false} instead.
@SherlockNoMad
Can you please elaborate on how the value of the new 'is_train' input relates to the number of outputs of BatchNormalization? As I understand it, the number of outputs in a model is fixed, as they may be used as inputs to other operations.
@hobei, I have updated the doc to elaborate further.
Output case #1 (training mode): Y, mean, var, saved_mean, saved_var
Output case #2 (test mode): Y
The number of outputs is indeed fixed. During testing, the latter 4 outputs are not populated and should not be consumed by other nodes.
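A small sketch of the two output arities described above, using onnx.helper; the `is_train` input (input 5) follows this PR's proposed BatchNormalization signature rather than a released opset, and the tensor names are illustrative:

```python
from onnx import helper

# Training mode: the node produces all five outputs.
bn_train = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "mean", "var", "is_train"],  # is_train fed as true
    outputs=["Y", "out_mean", "out_var", "saved_mean", "saved_var"],
)

# Test mode: only Y is populated, so the trailing outputs are simply omitted.
bn_test = helper.make_node(
    "BatchNormalization",
    inputs=["X", "scale", "B", "mean", "var"],
    outputs=["Y"],
)
```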
.Input(
    1,
    "ratio",
    "The ratio of random dropout, with value in [0, 1]. If this input was not set, "
ratio's range should be [0,1)
Since #2568 has been merged, we can close this PR.
As we have discussed in the ONNX training working group, we wish to make ONNX ops training-friendly.
To start with, we are adding "is_train" as an input to Dropout and BatchNormalization.
Why an input, why not an attribute?
During training, we also need to perform evaluation periodically to check the model's performance. This requires flipping the operation mode on the fly. Since an attribute is usually a constant value in a model, it would be tricky to override it during training. Exposing the flag as an input allows the user to change the value by feeding data through a model input.
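As a rough illustration, assuming a runtime such as ONNX Runtime and a model that exposes the proposed flag as a graph input named `is_train` (the file and tensor names here are illustrative), the mode can be flipped per call simply by changing what is fed:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model_with_is_train.onnx")
x = np.random.rand(32, 64).astype(np.float32)

# Training step: Dropout/BatchNormalization behave in training mode.
train_out = sess.run(None, {"X": x, "is_train": np.array(True)})

# Periodic evaluation: same session and graph, only the fed flag changes.
eval_out = sess.run(None, {"X": x, "is_train": np.array(False)})
```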
Why 'is_train', why not 'is_test'?
In the previous versions of Dropout and BatchNormalization, the mode flag is named 'is_test'. Since ONNX is still mainly used for inference, it is better to make inference the default mode, so that the change won't affect existing models. Setting is_train = true enables training mode, which is more intuitive than setting is_test = false.
This PR should also address issue #1042.
@houseroad @pranavsharma @ebarsoum @linkerzhang @prasanthpul @yuanbyu