From 3193b023cff0c984816ea007b3fe2046e2fa9fef Mon Sep 17 00:00:00 2001 From: Chin Huang Date: Mon, 30 Mar 2020 19:27:11 -0700 Subject: [PATCH] Rel 1.7.103 verify (#2687) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Fix Greater/LessOrEqual function definition (#2645) * Fix Greater/LessOrEqual function definition * Update test data Co-authored-by: Ke Zhang * Suppress a warning in unsqueeze (#2637) I keep getting this warning when building PyTorch: ``` In file included from /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/tensor/utils.h:6, from /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/tensor/defs.cc:4: /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/tensor/defs.cc: In lambda function: /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/tensor/defs.cc:1414:22: warning: unnecessary parentheses in declaration of ‘i’ [-Wparentheses] for (size_t(i) = 0; i < axes.size(); ++i) { ^ /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/schema.h:959:12: note: in definition of macro ‘ONNX_OPERATOR_SET_SCHEMA_EX’ return impl.SetName(#name) \ ^~~~ /home/hong/wsrc/pytorch/third_party/onnx/onnx/defs/tensor/defs.cc:1369:1: note: in expansion of macro ‘ONNX_OPERATOR_SET_SCHEMA’ ONNX_OPERATOR_SET_SCHEMA( ``` This commit should fix it and modernize the code a bit. Co-authored-by: Ke Zhang * [Training] Add Adagrad optimizer operator (#1955) * Adagrad draft * MIMO * Support multiple tensors to be optimized * Address comments * Move optimizers to a new place Remove copied Add momentum Save Remove momentum Fix Move constants to attributes * Fix build * Add shape test Add two node tests Update test coverage * Fix shape inf * Fix shape inf * fix shape inf * Format * Add function type * Merge lines * Format * Fix version number * Update op version in model files * Fix a test function and update related test files * Update onnx/backend/test/case/node/adagrad.py * Remove unused file * sync docs * Fix shape test * sync doc * sync with master * Update onnx/defs/training/defs.cc Co-Authored-By: Michał Karzyński * sync doc * address comments * address a minor comment * Polish one line Co-authored-by: Michał Karzyński * [Training] SG with Momentum Optimizer (#1959) * SG with Momentum * Register Op Fix Update other docs * Add shape inference code and polish definition * Update docs * Add test cases and fix several bugs * Remove accidentally added copy * Alpha -> alpha & Beta -> beta * Clarify an attribute * Fix an attribute * Fix bug * Fix missing attributes * sync doc * Remove unused domain * sync with master Co-authored-by: Chin Huang * Change type of label tensor to int32/int64 in SoftmaxCrossEntropyLoss spec. (#2667) * Update Pow input types in Opset 12 (#2666) * Update Pow input types in Opset 12 * gen doc and tests * remove uints and 8 bit ints * add tests * remove uint int x tests * Adding CI for ONNX Debug mode (Linux, OSX) (#2651) * adding an osx build, linux build, with and without onnx_ml for debug mode * test debug mode with ONNX_ML=1 * Rename OPTIONAL to OPTIONAL_VALUE (#2682) Co-authored-by: G. 
Ramalingam * Update Batchnorm test (#2674) * Update Batchnorm test * relax shape inference on scalar * Remove unnecessary copies and std::move (#2684) * Update sequence test case so input is not scalar and splits are specified (#2675) * Update sequence test case so input is not scalar and splits are specified * Add spaces to make the checker happy * Use cmake GNUInstallDirs (#2661) https://cmake.org/cmake/help/latest/module/GNUInstallDirs.html this allows installing the libraries (and headers) in a different location than `lib` (Gentoo uses lib64 for 64-bit libs); also change the .cmake files to avoid conflicts when building both 32-bit and 64-bit (avoids conflicting/overwritten files) Co-authored-by: Ke Zhang * Add 'ignore_index' input in the spec for SoftmaxCrossEntropyLoss and NLLLoss. (#2680) * Add 'ignore_index' input in the spec for SoftmaxCrossEntropyLoss and NLLLoss. * Add tests. * build break. * build break. * clean up. * build break. * Change ignore_index to attribute. * Change ignore_index to attribute. * PR feedback. * PR feedback. * Make ignore_index optional in NLLLoss. * Build break. * remove trailing spaces to fix build break. * Build break. * Update spec doc. * Fix NLLLoss function definition to fix test: test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded * PR feedback. * Fix test for softmax cross entropy loss to exclude ignore_index'ed weights from the sum of weights. * Build break. * Reduce binary size of libraries consuming ONNX (part 1/2) (#2643) * Change the return type for the zipmap operator to match the description in the spec. * Reduce binary size of libraries consuming ONNX (part 1/2) * Fix build error * Replace separate Get*Doc() functions with an easy macro for greater convenience * Add one more macro for complicated operator documentation. Co-authored-by: Ke Zhang * Update pybind (#2340) (#2688) * Change version number for release verification Change version number for release verification Co-authored-by: Takeshi Watanabe Co-authored-by: Ke Zhang Co-authored-by: Hong Xu Co-authored-by: Wei-Sheng Chin Co-authored-by: Michał Karzyński Co-authored-by: M. Zeeshan Siddiqui Co-authored-by: Lara Haidar Co-authored-by: Vinitra Swamy Co-authored-by: Changming Sun Co-authored-by: G. 
Ramalingam Co-authored-by: Changming Sun Co-authored-by: Scott McKay Co-authored-by: Gustavo Alvarez <462213+sl1pkn07@users.noreply.github.com> Co-authored-by: Pranav Sharma --- .travis.yml | 13 + .travis/install.sh | 4 + CMakeLists.txt | 12 +- VERSION_NUMBER | 2 +- docs/Changelog.md | 153 +++- docs/Operators.md | 399 ++++++++++- docs/TestCoverage.md | 266 ++++++- onnx/backend/test/case/model/sequence.py | 15 +- onnx/backend/test/case/node/batchnorm.py | 8 +- onnx/backend/test/case/node/momentum.py | 147 ++++ .../case/node/negativeloglikelihoodloss.py | 44 +- onnx/backend/test/case/node/pow.py | 69 +- .../test/case/node/softmaxcrossentropy.py | 47 +- .../test_batchnorm_epsilon_old/model.onnx | 2 +- .../model.onnx | Bin 451 -> 455 bytes .../test_data_set_0/input_5.pb | 2 +- .../test_data_set_0/output_0.pb | Bin 496 -> 496 bytes .../test_data_set_0/output_1.pb | 2 +- .../test_data_set_0/output_2.pb | 2 +- .../test_data_set_0/output_3.pb | 4 +- .../test_data_set_0/output_4.pb | 2 +- .../test_batchnorm_example_old/model.onnx | 2 +- .../model.onnx | Bin 431 -> 435 bytes .../test_data_set_0/input_5.pb | 2 +- .../test_data_set_0/output_0.pb | Bin 39 -> 39 bytes .../test_data_set_0/output_1.pb | Bin 27 -> 27 bytes .../test_data_set_0/output_2.pb | 2 +- .../test_data_set_0/output_3.pb | Bin 20 -> 26 bytes .../test_data_set_0/output_4.pb | 2 +- .../test/data/node/test_momentum/model.onnx | Bin 0 -> 317 bytes .../test_momentum/test_data_set_0/input_0.pb | 1 + .../test_momentum/test_data_set_0/input_1.pb | Bin 0 -> 15 bytes .../test_momentum/test_data_set_0/input_2.pb | 1 + .../test_momentum/test_data_set_0/input_3.pb | Bin 0 -> 17 bytes .../test_momentum/test_data_set_0/input_4.pb | 1 + .../test_momentum/test_data_set_0/output_0.pb | 1 + .../test_momentum/test_data_set_0/output_1.pb | 1 + .../node/test_momentum_multiple/model.onnx | Bin 0 -> 462 bytes .../test_data_set_0/input_0.pb | 1 + .../test_data_set_0/input_1.pb | Bin 0 -> 15 bytes .../test_data_set_0/input_2.pb | Bin 0 -> 14 bytes .../test_data_set_0/input_3.pb | Bin 0 -> 18 bytes .../test_data_set_0/input_4.pb | Bin 0 -> 14 bytes .../test_data_set_0/input_5.pb | Bin 0 -> 18 bytes .../test_data_set_0/input_6.pb | Bin 0 -> 14 bytes .../test_data_set_0/input_7.pb | Bin 0 -> 18 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 18 bytes .../test_data_set_0/output_1.pb | Bin 0 -> 22 bytes .../test_data_set_0/output_2.pb | 1 + .../test_data_set_0/output_3.pb | 1 + .../model.onnx | 2 +- .../model.onnx | Bin 1757 -> 1757 bytes .../model.onnx | 2 +- .../model.onnx | Bin 1837 -> 1837 bytes .../model.onnx | Bin 246 -> 246 bytes .../model.onnx | Bin 2319 -> 2319 bytes .../model.onnx | Bin 244 -> 244 bytes .../model.onnx | Bin 2302 -> 2302 bytes .../model.onnx | 2 +- .../model.onnx | Bin 2572 -> 2572 bytes .../model.onnx | Bin 288 -> 288 bytes .../model.onnx | Bin 3897 -> 3897 bytes .../model.onnx | Bin 286 -> 286 bytes .../model.onnx | Bin 3130 -> 3130 bytes .../model.onnx | Bin 0 -> 320 bytes .../test_data_set_0/input_0.pb | Bin 0 -> 2180 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 451 bytes .../test_data_set_0/input_2.pb | 1 + .../test_data_set_0/output_0.pb | 1 + .../model.onnx | Bin 0 -> 4425 bytes .../test_data_set_0/input_0.pb | Bin 0 -> 2180 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 451 bytes .../test_data_set_0/input_2.pb | 1 + .../test_data_set_0/output_0.pb | 1 + .../node/test_nesterov_momentum/model.onnx | Bin 0 -> 326 bytes .../test_data_set_0/input_0.pb | 1 + .../test_data_set_0/input_1.pb | Bin 0 -> 15 bytes 
.../test_data_set_0/input_2.pb | 1 + .../test_data_set_0/input_3.pb | Bin 0 -> 17 bytes .../test_data_set_0/input_4.pb | 1 + .../test_data_set_0/output_0.pb | 1 + .../test_data_set_0/output_1.pb | 1 + .../test/data/node/test_pow/model.onnx | 4 +- .../node/test_pow/test_data_set_0/output_0.pb | Bin 254 -> 254 bytes .../data/node/test_pow_bcast_array/model.onnx | 4 +- .../node/test_pow_bcast_scalar/model.onnx | Bin 108 -> 108 bytes .../data/node/test_pow_example/model.onnx | 4 +- .../data/node/test_pow_types_float/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 33 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 33 bytes .../test_pow_types_float32_int32/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_float32_int64/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 33 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_float32_uint32/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_float32_uint64/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 33 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../data/node/test_pow_types_int/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 33 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_int32_float32/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_int32_int32/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 21 bytes .../test_pow_types_int64_float32/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 33 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 33 bytes .../test_pow_types_int64_int64/model.onnx | 16 + .../test_data_set_0/input_0.pb | Bin 0 -> 33 bytes .../test_data_set_0/input_1.pb | Bin 0 -> 33 bytes .../test_data_set_0/output_0.pb | Bin 0 -> 33 bytes .../model.onnx | Bin 165 -> 165 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | Bin 176 -> 176 bytes .../test_data_set_0/input_1.pb | Bin 59 -> 35 bytes .../model.onnx | Bin 1327 -> 1327 bytes .../test_data_set_0/input_1.pb | Bin 59 -> 35 bytes .../model.onnx | Bin 1277 -> 1277 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | Bin 192 -> 192 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | Bin 1395 -> 1395 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | Bin 0 -> 226 bytes .../test_data_set_0/input_0.pb | 1 + .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_2.pb | 1 + .../test_data_set_0/output_0.pb | 1 + .../model.onnx | Bin 0 -> 1598 bytes .../test_data_set_0/input_0.pb | 1 + .../test_data_set_0/input_1.pb | Bin 0 -> 21 bytes .../test_data_set_0/input_2.pb | 1 + .../test_data_set_0/output_0.pb | 1 + .../model.onnx | 2 +- .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes 
.../model.onnx | 2 +- .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | 2 +- .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | 2 +- .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../test_softmax_cross_entropy_sum/model.onnx | Bin 163 -> 163 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes .../model.onnx | Bin 1262 -> 1262 bytes .../test_data_set_0/input_1.pb | Bin 33 -> 21 bytes onnx/common/constants.h | 2 +- onnx/cpp2py_export.cc | 4 +- onnx/defs/generator/defs.cc | 18 +- onnx/defs/logical/defs.cc | 78 +- onnx/defs/logical/old.cc | 29 +- onnx/defs/math/defs.cc | 678 ++++++++++++------ onnx/defs/math/old.cc | 135 ++-- onnx/defs/nn/defs.cc | 296 ++++---- onnx/defs/nn/old.cc | 147 ++-- onnx/defs/operator_sets-training.h | 2 + onnx/defs/operator_sets.h | 2 + onnx/defs/reduction/defs.cc | 41 +- onnx/defs/reduction/old.cc | 24 +- onnx/defs/rnn/defs.cc | 21 +- onnx/defs/rnn/old.cc | 22 +- onnx/defs/schema.cc | 61 +- onnx/defs/schema.h | 116 ++- onnx/defs/sequence/defs.cc | 2 +- onnx/defs/tensor/defs.cc | 16 +- onnx/defs/tensor/old.cc | 18 +- onnx/defs/traditionalml/defs.cc | 136 ++-- onnx/defs/traditionalml/old.cc | 2 +- onnx/defs/training/defs.cc | 154 +++- onnx/optimizer/pass_manager.cc | 4 +- onnx/shape_inference/implementation.cc | 4 +- onnx/test/shape_inference_test.py | 41 ++ onnx/version_converter/convert.cc | 4 +- third_party/pybind11 | 2 +- 189 files changed, 2662 insertions(+), 807 deletions(-) create mode 100644 onnx/backend/test/case/node/momentum.py create mode 100644 onnx/backend/test/data/node/test_momentum/model.onnx create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/input_3.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/input_4.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_momentum/test_data_set_0/output_1.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/model.onnx create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_3.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_4.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_5.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_6.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_7.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_1.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_2.pb create mode 100644 onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_3.pb create mode 100644 
onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/model.onnx create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/model.onnx create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/model.onnx create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_3.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_4.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/output_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int32/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int32/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int32/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int32/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int64/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_1.pb create mode 100644 
onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint32/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint64/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_float32/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_int32/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_float32/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_int64/model.onnx create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_int64/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_int64/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_pow_types_int64_int64/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/model.onnx create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/output_0.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/model.onnx 
create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_0.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_1.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_2.pb create mode 100644 onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/output_0.pb diff --git a/.travis.yml b/.travis.yml index 5abb6f90e99..8d7c5580a80 100644 --- a/.travis.yml +++ b/.travis.yml @@ -12,6 +12,16 @@ matrix: env: PYTHON_VERSION=python3 ONNX_ML=0 language: python python: "3.6" + - os: linux + sudo: required + env: PYTHON_VERSION=python3 ONNX_ML=0 ONNX_DEBUG=1 + language: python + python: "3.6" + - os: linux + sudo: required + env: PYTHON_VERSION=python3 ONNX_ML=1 ONNX_DEBUG=1 + language: python + python: "3.6" - os: osx osx_image: xcode9.3 env: PYTHON_VERSION=python2 ONNX_ML=0 @@ -34,6 +44,9 @@ matrix: - os: osx osx_image: xcode9.3 env: PYTHON_VERSION=python3 + - os: osx + osx_image: xcode9.3 + env: PYTHON_VERSION=python3 ONNX_DEBUG=1 - os: linux sudo: required env: PYTHON_VERSION=python2 LITE=1 diff --git a/.travis/install.sh b/.travis/install.sh index 1c6555eea83..edfc75db787 100755 --- a/.travis/install.sh +++ b/.travis/install.sh @@ -13,5 +13,9 @@ fi export CMAKE_ARGS="${CMAKE_ARGS} -DONNXIFI_DUMMY_BACKEND=ON" export ONNX_NAMESPACE=ONNX_NAMESPACE_FOO_BAR_FOR_CI +if [ "${ONNX_DEBUG}" == "1" ]; then + export DEBUG=1 +fi + time python setup.py --quiet bdist_wheel --universal --dist-dir . find . -maxdepth 1 -name "*.whl" -ls -exec pip install {} \; diff --git a/CMakeLists.txt b/CMakeLists.txt index ca3a65d7fd8..0aa9fda2451 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -640,12 +640,14 @@ if(MSVC) add_msvc_runtime_flag(onnxifi_dummy) endif() +include(GNUInstallDirs) + install(DIRECTORY ${ONNX_ROOT}/onnx - DESTINATION include + DESTINATION ${CMAKE_INSTALL_INCLUDEDIR} FILES_MATCHING PATTERN "*.h") install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/onnx - DESTINATION include + DESTINATION ${CMAKE_INSTALL_INCLUDEDIR} FILES_MATCHING PATTERN "*.h") @@ -660,13 +662,13 @@ configure_file( install(FILES ${PROJECT_BINARY_DIR}/ONNXConfigVersion.cmake ${PROJECT_BINARY_DIR}/ONNXConfig.cmake - DESTINATION share/cmake/ONNX + DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/ONNX COMPONENT dev) -install(EXPORT ONNXTargets DESTINATION share/cmake/ONNX) +install(EXPORT ONNXTargets DESTINATION "${CMAKE_INSTALL_LIBDIR}/cmake/ONNX") install(TARGETS onnx onnx_proto onnxifi onnxifi_dummy onnxifi_loader - EXPORT ONNXTargets DESTINATION lib) + EXPORT ONNXTargets DESTINATION ${CMAKE_INSTALL_LIBDIR}) if(NOT ANDROID AND NOT IOS) install(TARGETS onnxifi_wrapper diff --git a/VERSION_NUMBER b/VERSION_NUMBER index a5e19fb8444..8f27ab8c269 100644 --- a/VERSION_NUMBER +++ b/VERSION_NUMBER @@ -1 +1 @@ -1.7.102 \ No newline at end of file +1.7.103 \ No newline at end of file diff --git a/docs/Changelog.md b/docs/Changelog.md index 0f66c6e22eb..e10bcaf83d3 100644 --- a/docs/Changelog.md +++ b/docs/Changelog.md @@ -14801,6 +14801,8 @@ This version of the operator has been available since version 12 of the default #### Attributes
+
ignore_index : int
+
Specifies a target value that is ignored and does not contribute to the input gradient. It is an optional value and valid values are [0, C).
reduction : string (default is mean)
Type of reduction to apply to loss: none, sum, mean (default). 'none': the output is the loss for each sample. 'sum': the output will be summed. 'mean': the sum of the output will be divided by the sum of applied weights.
@@ -14832,6 +14834,42 @@ This version of the operator has been available since version 12 of the default
Constrain target to integer types
+### **Pow-12** + + Pow takes input data (Tensor) and exponent Tensor, and + produces one output data (Tensor) where the function `f(x) = x^exponent` + is applied to the data tensor elementwise. + This operator supports **multidirectional (i.e., Numpy-style) broadcasting**; for more details please check [the doc](Broadcasting.md). + +#### Version + +This version of the operator has been available since version 12 of the default ONNX operator set. + +#### Inputs +
+
X : T
+
First operand, base of the exponent.
+
Y : T1
+
Second operand, power of the exponent.
+
+ +#### Outputs + +
+
Z : T
+
Output tensor (same size as X)
+
+ +#### Type Constraints + +
+
T : tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
+
Constrain input X and output types to float/int tensors.
+
T1 : tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
+
Constrain input Y types to float/int tensors.
+
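To make the asymmetric type constraints above concrete: the output Z takes the base X's type T, not the exponent's type T1. A minimal NumPy illustration of that rule (an editor's sketch of the intent, not code from this patch; `np.power` alone would promote the dtype):

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.float32)  # base, type T
y = np.array([4, 5, 6], dtype=np.int64)    # exponent, type T1
z = np.power(x, y).astype(x.dtype)         # cast mirrors the "Z has type T" rule
print(z, z.dtype)                          # [  1.  32. 729.] float32
```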
+ ### **ReduceMax-12** Computes the max of the input tensor's element along the provided axes. The resulted @@ -14956,6 +14994,8 @@ This version of the operator has been available since version 12 of the default #### Attributes
+
ignore_index : int
+
Specifies a target value that is ignored and does not contribute to the input gradient. It is an optional value and valid values are [0, C).
reduction : string (default is mean)
Type of reduction to apply to loss: none, sum, mean (default). 'none': no reduction will be applied. 'sum': the output will be summed. 'mean': the sum of the output will be divided by the number of elements in the output.
@@ -14965,7 +15005,7 @@ This version of the operator has been available since version 12 of the default
scores : T
The predicted outputs with shape [batch_size, class_size], or [batch_size, class_size, D1, D2 , ..., Dk], where K is the number of dimensions.
-
labels : T
+
labels : Tind
The ground truth output tensor, with shape [batch_size], or [batch_size, D1, D2, ..., Dk], where K is the number of dimensions.
weights (optional) : T
A manual rescaling weight given to each class. If given, it has to be a 1D Tensor assigning weight to each of the classes. Otherwise, it is treated as if having all ones.
@@ -14985,6 +15025,8 @@ This version of the operator has been available since version 12 of the default
T : tensor(float16), tensor(float), tensor(double)
Constrain input and output types to float tensors.
+
Tind : tensor(int32), tensor(int64)
+
Constrain target to integer types
### **UnfoldToDepth-12** @@ -15424,3 +15466,112 @@ This version of the operator has been available since version 1 of the 'ai.onnx.
Allow inputs and outputs to be any kind of tensor.
+### **ai.onnx.training.Momentum-1** + + Compute one iteration of stochastic gradient update with momentum. + This operator can conduct the optimization of multiple tensor variables. + + Let's define the behavior of this operator. As you can imagine, SG with momentum requires + several parameters: + + - The learning-rate "R". + - The update count "T". That is, the number of conducted training iterations. It should + be zero in the first training iteration. + - An L2-norm regularization coefficient "norm_coefficient". + - A decay coefficient of previous accumulated gradient (i.e., momentum) "alpha". + - The scaling coefficient of current gradient "beta". + - An attribute "mode" to choose whether standard momentum or Nesterov's momentum should + be used. + + For the sake of simplicity, assume that there is only one tensor (called "X") to be optimized. + Other necessary inputs are "X"'s gradient (called "G") and "X"'s momentum (called "V"). This + Momentum operator maps all these inputs to the new value of "X" (called "X_new") and its new + momentum (called "V_new"). + + This operator supports two different momentum algorithms. Set the attribute "mode" to + "nesterov" if Nesterov's momentum is desired. Otherwise, set the attribute "mode" to + "standard" to use standard momentum. Computation details are described subsequently. + + Let "+", "-", "*", and "/" be element-wise operations with numpy-style broadcasting. + + Pseudo code for SG with standard momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized + + // Update X. + X_new = X - R * V_new + + Pseudo code for SG with Nesterov's momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G; + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized; + + // Compute final update direction and then update X. + X_new = X - R * (G_regularized + alpha * V_new) + + If one assigns this operator to optimize multiple inputs, for example "X_1" and "X_2", the same + pseudo code is extended to handle all tensors jointly. More specifically, we can view "X" as a + concatenation of "X_1" and "X_2" (of course, their gradients and momentums should + be concatenated too) and then the pseudo code becomes applicable. + +#### Version + +This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set. + +#### Attributes +
+
alpha : float (required)
+
The decay factor of momentum. It should be a scalar.
+
beta : float (required)
+
The coefficient of gradient in computing new momentum. It should be a scalar.
+
mode : string (required)
+
Its value should be either "nesterov" or "standard". The value "nesterov" leads to the use of Nesterov's momentum while "standard" invokes the stochastic gradient method with standard momentum.
+
norm_coefficient : float (required)
+
Coefficient of 0.5 * norm_coefficient * ||X||^2.
+
+ +#### Inputs (3 - ∞) + +
+
R : T1
+
The learning rate.
+
T : T2
+
Update count of "X". It should be a scalar.
+
inputs (variadic, heterogeneous) : T3
+
It sequentially contains the current values of optimized tensors, then their gradient tensors, and finally their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the expected input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", momentum of "X_1", momentum of "X_2"].
+
+ +#### Outputs (1 - ∞) + +
+
outputs (variadic, heterogeneous) : T3
+
It sequentially contains the new values of optimized tensors and then the new values of their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new momentum of "X_1", new momentum of "X_2"].
+
+ +#### Type Constraints + +
+
T1 : tensor(float), tensor(double)
+
Constrain input types to float scalars.
+
T2 : tensor(int64)
+
Constrain input types to 64-bit integer scalars.
+
T3 : tensor(float), tensor(double)
+
Constrain input types to float tensors.
+
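A minimal NumPy sketch of the update rules just described, covering both modes and the variadic input layout (the helper name is hypothetical and this is an illustration only, not the normative definition):

```python
import numpy as np

def run_momentum(r, t, tensors, norm_coefficient, alpha, beta, mode="standard"):
    # The variadic inputs hold the optimized tensors first, then their
    # gradients, then their momentums: [X_1..X_n, G_1..G_n, V_1..V_n].
    n = len(tensors) // 3
    xs, gs, vs = tensors[:n], tensors[n:2 * n], tensors[2 * n:]
    new_xs, new_vs = [], []
    for x, g, v in zip(xs, gs, vs):
        g_regularized = norm_coefficient * x + g  # add L2-regularizer gradient
        beta_adjusted = beta if t > 0 else 1.0    # beta is forced to 1 at t == 0
        v_new = alpha * v + beta_adjusted * g_regularized
        if mode == "nesterov":
            x_new = x - r * (g_regularized + alpha * v_new)
        else:
            x_new = x - r * v_new
        new_xs.append(x_new)
        new_vs.append(v_new)
    # Outputs list the new tensor values first, then the new momentums.
    return new_xs + new_vs
```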
+ diff --git a/docs/Operators.md b/docs/Operators.md index 10c0d12a6a2..88d8345c3b5 100644 --- a/docs/Operators.md +++ b/docs/Operators.md @@ -175,6 +175,7 @@ * ai.onnx.training.Adagrad * ai.onnx.training.Gradient * ai.onnx.training.GraphCall + * ai.onnx.training.Momentum ## ai.onnx (default) ### **Abs** @@ -1968,7 +1969,9 @@ s = np.array([1.0, 1.5]).astype(np.float32) bias = np.array([0, 1]).astype(np.float32) mean = np.array([0, 3]).astype(np.float32) var = np.array([1, 1.5]).astype(np.float32) -training_mode = np.ones(1, dtype=bool) +# np.bool(1) raises "'bool' object has no attribute 'dtype'" while generating test data, +# so work around it by using np.byte(1).astype(bool) +training_mode = np.byte(1).astype(bool) y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var) node = onnx.helper.make_node( @@ -1987,7 +1990,7 @@ s = np.random.randn(3).astype(np.float32) bias = np.random.randn(3).astype(np.float32) mean = np.random.randn(3).astype(np.float32) var = np.random.rand(3).astype(np.float32) -training_mode = np.ones(1, dtype=bool) +training_mode = np.byte(1).astype(bool) momentum = 0.9 epsilon = 1e-2 y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var, momentum, epsilon) @@ -10739,6 +10742,8 @@ This version of the operator has been available since version 12 of the default #### Attributes
+
ignore_index : int
+
Specifies a target value that is ignored and does not contribute to the input gradient. It is an optional value and valid values are [0, C).
reduction : string (default is mean)
Type of reduction to apply to loss: none, sum, mean (default). 'none': the output is the loss for each sample. 'sum': the output will be summed. 'mean': the sum of the output will be divided by the sum of applied weights.
@@ -10958,6 +10963,36 @@ expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_lo +
+input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index + +```python +reduction = 'sum' +ignore_index = np.int64(0) +node = onnx.helper.make_node( + 'NegativeLogLikelihoodLoss', + inputs=['input', 'target', 'weight'], + outputs=['loss'], + reduction=reduction, + ignore_index=ignore_index +) + +N, C, dim1, dim2 = 3, 5, 6, 6 +np.random.seed(0) +input = np.random.rand(N, C, dim1, dim2).astype(np.float32) +target = np.random.randint(0, high=C, size=(N, dim1, dim2)) +target[0][0][0] = 0 +weight = np.random.rand(C).astype(np.float32) + +negative_log_likelihood_loss = compute_negative_log_likelihood_loss(input, target, weight=weight, reduction=reduction, ignore_index=ignore_index) + +expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_loss], + name='test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index') +``` + +
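For reference, a rough sketch of the semantics the `compute_negative_log_likelihood_loss` helper above implements for this case (an approximation written against the spec text, not the verbatim contents of onnx/backend/test/case/node/negativeloglikelihoodloss.py):

```python
import numpy as np

def nll_loss_sketch(input, target, weight=None, reduction='sum', ignore_index=None):
    # Gather the (optionally class-weighted) negative score of the target
    # class for every element; ignored targets contribute neither loss nor weight.
    N, C = input.shape[0], input.shape[1]
    flat_input = input.reshape(N, C, -1)   # [N, C, D] with D = prod(d1..dk)
    flat_target = target.reshape(N, -1)    # [N, D]
    loss = np.zeros(flat_target.shape, dtype=input.dtype)
    applied_weight = np.zeros(flat_target.shape, dtype=input.dtype)
    for i in range(flat_target.shape[0]):
        for j in range(flat_target.shape[1]):
            c = flat_target[i, j]
            if ignore_index is not None and c == ignore_index:
                continue
            w = weight[c] if weight is not None else 1.0
            loss[i, j] = -flat_input[i, c, j] * w
            applied_weight[i, j] = w
    if reduction == 'none':
        return loss.reshape(target.shape)
    if reduction == 'sum':
        return np.array(loss.sum(), dtype=input.dtype)
    # 'mean' normalizes by the sum of the weights actually applied.
    return np.array(loss.sum() / applied_weight.sum(), dtype=input.dtype)
```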
+ + ### **NonMaxSuppression** Filter out boxes that have high intersection-over-union (IOU) overlap with previously selected boxes. @@ -11953,16 +11988,16 @@ for mode in ['edge', 'reflect']: #### Version -This version of the operator has been available since version 7 of the default ONNX operator set. +This version of the operator has been available since version 12 of the default ONNX operator set. -Other versions of this operator: Pow-1 +Other versions of this operator: Pow-1, Pow-7 #### Inputs
X : T
First operand, base of the exponent.
-
Y : T
+
Y : T1
Second operand, power of the exponent.
@@ -11976,8 +12011,10 @@ Other versions of this operator: Pow-1 #### Type Constraints
-
T : tensor(float16), tensor(float), tensor(double)
-
Constrain input and output types to float tensors.
+
T : tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
+
Constrain input X and output types to float/int tensors.
+
T1 : tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double)
+
Constrain input Y types to float/int tensors.
@@ -11995,13 +12032,13 @@ node = onnx.helper.make_node( x = np.array([1, 2, 3]).astype(np.float32) y = np.array([4, 5, 6]).astype(np.float32) -z = np.power(x, y) # expected output [1., 32., 729.] +z = pow(x, y) # expected output [1., 32., 729.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_example') x = np.arange(60).reshape(3, 4, 5).astype(np.float32) y = np.random.randn(3, 4, 5).astype(np.float32) -z = np.power(x, y) +z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow') ``` @@ -12021,7 +12058,7 @@ node = onnx.helper.make_node( x = np.array([1, 2, 3]).astype(np.float32) y = np.array(2).astype(np.float32) -z = np.power(x, y) # expected output [1., 4., 9.] +z = pow(x, y) # expected output [1., 4., 9.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_scalar') @@ -12033,7 +12070,7 @@ node = onnx.helper.make_node( x = np.array([[1, 2, 3], [4, 5, 6]]).astype(np.float32) y = np.array([1, 2, 3]).astype(np.float32) # expected output [[1, 4, 27], [4, 25, 216]] -z = np.power(x, y).astype(np.float32) +z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_array') ``` @@ -12041,6 +12078,68 @@ expect(node, inputs=[x, y], outputs=[z], +
+types + +```python +node = onnx.helper.make_node( + 'Pow', + inputs=['x', 'y'], + outputs=['z'], +) + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.int64) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int64') + +x = np.array([1, 2, 3]).astype(np.int64) +y = np.array([4, 5, 6]).astype(np.float32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int64_float32') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.int32) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int32') + +x = np.array([1, 2, 3]).astype(np.int32) +y = np.array([4, 5, 6]).astype(np.float32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int32_float32') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.uint64) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_uint64') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.uint32) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_uint32') + +x = np.array([1, 2, 3]).astype(np.int64) +y = np.array([4, 5, 6]).astype(np.int64) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int64_int64') + +x = np.array([1, 2, 3]).astype(np.int32) +y = np.array([4, 5, 6]).astype(np.int32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int32_int32') +``` + +
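Note that these snippets call a local `pow` helper instead of `np.power` directly. Presumably (its definition lives in onnx/backend/test/case/node/pow.py and is not shown in this excerpt) it casts the result back to the base's dtype so the output matches the `T` constraint above; a sketch:

```python
import numpy as np

def pow(x, y):  # shadows the builtin, as the test file appears to do
    # np.power promotes mixed dtypes (e.g. float32 ** int64 -> float64);
    # Pow-12 says the output Z has the base X's type T, so cast back.
    return np.power(x, y).astype(x.dtype)
```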
+ + ### **QLinearConv** The convolution operator consumes a quantized input tensor, its scale and zero point, @@ -18208,6 +18307,8 @@ This version of the operator has been available since version 12 of the default #### Attributes
+
ignore_index : int
+
Specifies a target value that is ignored and does not contribute to the input gradient. It is an optional value and valid values are [0, C).
reduction : string (default is mean)
Type of reduction to apply to loss: none, sum, mean (default). 'none': no reduction will be applied. 'sum': the output will be summed. 'mean': the sum of the output will be divided by the number of elements in the output.
@@ -18217,7 +18318,7 @@ This version of the operator has been available since version 12 of the default
scores : T
The predicted outputs with shape [batch_size, class_size], or [batch_size, class_size, D1, D2 , ..., Dk], where K is the number of dimensions.
-
labels : T
+
labels : Tind
The ground truth output tensor, with shape [batch_size], or [batch_size, D1, D2, ..., Dk], where K is the number of dimensions.
weights (optional) : T
A manual rescaling weight given to each class. If given, it has to be a 1D Tensor assigning weight to each of the classes. Otherwise, it is treated as if having all ones.
@@ -18237,6 +18338,8 @@ This version of the operator has been available since version 12 of the default
T : tensor(float16), tensor(float), tensor(double)
Constrain input and output types to float tensors.
+
Tind : tensor(int32), tensor(int64)
+
Constrain target to integer types
@@ -18327,6 +18430,38 @@ expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cros +
+softmaxcrossentropy_mean_weights_ignore_index + +```python +# Define operator attributes. +reduction = 'mean' +ignore_index = np.int64(0) + +# Create operator. +node = onnx.helper.make_node('SoftmaxCrossEntropyLoss', + inputs=['x', 'y', 'w'], + outputs=['z'], + reduction=reduction, + ignore_index=ignore_index) + +# Define operator inputs. +np.random.seed(0) +x = np.random.rand(3, 5).astype(np.float32) +labels = np.random.randint(0, high=5, size=(3, )) +labels[0] = 0 +weights = np.array([0.9, 0.7, 0.8, 0.9, 0.9], dtype=np.float32) + +# Compute SoftmaxCrossEntropyLoss +sce = softmaxcrossentropy(x, labels, weight=weights, ignore_index=ignore_index) + +# Check results +expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cross_entropy_mean_weight_ignore_index') +``` + +
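A rough sketch of what the `softmaxcrossentropy` helper computes for this [N, C], reduction='mean' case (illustrative and written against the spec text; the reference implementation is in onnx/backend/test/case/node/softmaxcrossentropy.py):

```python
import numpy as np

def softmaxcrossentropy_sketch(x, labels, weight=None, ignore_index=None):
    # Numerically naive log-softmax over the class axis.
    log_prob = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    losses, applied = [], []
    for i in range(x.shape[0]):
        c = labels[i]
        if ignore_index is not None and c == ignore_index:
            continue  # ignored samples add neither loss nor weight
        w = 1.0 if weight is None else weight[c]
        losses.append(-log_prob[i, c] * w)
        applied.append(w)
    # 'mean' divides by the sum of the weights actually applied, which is
    # exactly how ignore_index'ed samples are excluded from the average.
    total = np.array(losses, dtype=x.dtype).sum()
    return np.array(total / np.array(applied, dtype=x.dtype).sum(), dtype=x.dtype)
```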
+ +
softmaxcrossentropy_none @@ -21529,3 +21664,243 @@ This version of the operator has been available since version 1 of the 'ai.onnx.
+### **ai.onnx.training.Momentum** + + Compute one iteration of stochastic gradient update with momentum. + This operator can conduct the optimization of multiple tensor variables. + + Let's define the behavior of this operator. As you can imagine, SG with momentum requires + several parameters: + + - The learning-rate "R". + - The update count "T". That is, the number of conducted training iterations. It should + be zero in the first training iteration. + - An L2-norm regularization coefficient "norm_coefficient". + - A decay coefficient of previous accumulated gradient (i.e., momentum) "alpha". + - The scaling coefficient of current gradient "beta". + - An attribute "mode" to choose whether standard momentum or Nesterov's momentum should + be used. + + For the sake of simplicity, assume that there is only one tensor (called "X") to be optimized. + Other necessary inputs are "X"'s gradient (called "G") and "X"'s momentum (called "V"). This + Momentum operator maps all these inputs to the new value of "X" (called "X_new") and its new + momentum (called "V_new"). + + This operator supports two different momentum algorithms. Set the attribute "mode" to + "nesterov" if Nesterov's momentum is desired. Otherwise, set the attribute "mode" to + "standard" to use standard momentum. Computation details are described subsequently. + + Let "+", "-", "*", and "/" be element-wise operations with numpy-style broadcasting. + + Pseudo code for SG with standard momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized + + // Update X. + X_new = X - R * V_new + + Pseudo code for SG with Nesterov's momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G; + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized; + + // Compute final update direction and then update X. + X_new = X - R * (G_regularized + alpha * V_new) + + If one assigns this operator to optimize multiple inputs, for example "X_1" and "X_2", the same + pseudo code is extended to handle all tensors jointly. More specifically, we can view "X" as a + concatenation of "X_1" and "X_2" (of course, their gradients and momentums should + be concatenated too) and then the pseudo code becomes applicable. + +#### Version + +This version of the operator has been available since version 1 of the 'ai.onnx.training' operator set. + +#### Attributes +
+
alpha : float (required)
+
The decay factor of momentum. It should be a scalar.
+
beta : float (required)
+
The coefficient of gradient in computing new momentum. It should be a scalar.
+
mode : string (required)
+
Its value should be either "nesterov" or "standard". The value "nesterov" leads to the use of Nesterov's momentum while "standard" invokes the stochastic gradient method with standard momentum.
+
norm_coefficient : float (required)
+
Coefficient of 0.5 * norm_coefficient * ||X||^2.
+
+ +#### Inputs (3 - ∞) + +
+
R : T1
+
The learning rate.
+
T : T2
+
Update count of "X". It should be a scalar.
+
inputs (variadic, heterogeneous) : T3
+
It sequentially contains the current values of optimized tensors, then their gradient tensors, and finally their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the expected input list would be ["X_1", "X_2", gradient of "X_1", gradient of "X_2", momentum of "X_1", momentum of "X_2"].
+
+ +#### Outputs (1 - ∞) + +
+
outputs (variadic, heterogeneous) : T3
+
It sequentially contains the new values of optimized tensors and then the new values of their momentum tensors. For example, if two tensors "X_1" and "X_2" are optimized, the output list would be [new value of "X_1", new value of "X_2", new momentum of "X_1", new momentum of "X_2"].
+
+ +#### Type Constraints + +
+
T1 : tensor(float), tensor(double)
+
Constrain input types to float scalars.
+
T2 : tensor(int64)
+
Constrain input types to 64-bit integer scalars.
+
T3 : tensor(float), tensor(double)
+
Constrain input types to float tensors.
+
+ + +#### Examples + +
+momentum + +```python +# Define operator attributes. +norm_coefficient = 0.001 +alpha = 0.95 +beta = 0.1 + +# Create operator. +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X', 'G', 'V'], + outputs=['X_new', 'V_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='standard', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar +x = np.array([1.2, 2.8], dtype=np.float32) +g = np.array([-0.94, -2.5], dtype=np.float32) +v = np.array([1.7, 3.6], dtype=np.float32) + +# Compute expected outputs of Momentum. +x_new, v_new = apply_momentum(r, t, x, g, v, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x, g, v], + outputs=[x_new, v_new], name='test_momentum', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` + +
+ + +
+momentum_multiple + +```python +# Define operator attributes. +norm_coefficient = 0.001 +alpha = 0.95 +beta = 0.85 + +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X1', 'X2', + 'G1', 'G2', 'H1', 'H2'], + outputs=['X1_new', 'X2_new', + 'V1_new', 'V2_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='standard', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar + +x1 = np.array([1.0], dtype=np.float32) +g1 = np.array([-1.0], dtype=np.float32) +v1 = np.array([2.0], dtype=np.float32) + +x2 = np.array([1.0, 2.0], dtype=np.float32) +g2 = np.array([-1.0, -3.0], dtype=np.float32) +v2 = np.array([4.0, 1.0], dtype=np.float32) + +# Compute expected outputs of Momentum. +x1_new, v1_new = apply_momentum(r, t, x1, g1, v1, + norm_coefficient, alpha, beta) +x2_new, v2_new = apply_momentum(r, t, x2, g2, v2, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x1, x2, g1, g2, v1, v2], + outputs=[x1_new, x2_new, v1_new, v2_new], name='test_momentum_multiple', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` + +
+ + +
+nesterov_momentum + ```python +# Define operator attributes. +norm_coefficient = 0.01 +alpha = 0.95 +beta = 1.0 + +# Create operator. +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X', 'G', 'V'], + outputs=['X_new', 'V_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='nesterov', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar +x = np.array([1.2, 2.8], dtype=np.float32) +g = np.array([-0.94, -2.5], dtype=np.float32) +v = np.array([1.7, 3.6], dtype=np.float32) + +# Compute expected outputs of Momentum in Nesterov mode. +x_new, v_new = apply_nesterov(r, t, x, g, v, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x, g, v], + outputs=[x_new, v_new], name='test_nesterov_momentum', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` +
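The `apply_momentum` and `apply_nesterov` helpers these examples call are defined in the new onnx/backend/test/case/node/momentum.py. Roughly, following the operator's pseudo code (a sketch, not the file's verbatim contents):

```python
import numpy as np

def apply_momentum(r, t, x, g, v, norm_coefficient, alpha, beta):
    g_regularized = norm_coefficient * x + g  # add the L2-regularizer gradient
    beta_adjusted = beta if t > 0 else 1.0    # beta is forced to 1 at t == 0
    v_new = alpha * v + beta_adjusted * g_regularized
    x_new = x - r * v_new                     # standard momentum step
    return x_new, v_new

def apply_nesterov(r, t, x, g, v, norm_coefficient, alpha, beta):
    g_regularized = norm_coefficient * x + g
    beta_adjusted = beta if t > 0 else 1.0
    v_new = alpha * v + beta_adjusted * g_regularized
    # Nesterov look-ahead: step along the gradient plus the decayed new momentum.
    x_new = x - r * (g_regularized + alpha * v_new)
    return x_new, v_new
```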
+ + diff --git a/docs/TestCoverage.md b/docs/TestCoverage.md index 3a2b7cc7eb3..f0bdd734020 100644 --- a/docs/TestCoverage.md +++ b/docs/TestCoverage.md @@ -5,7 +5,7 @@ * [Overall Test Coverage](#overall-test-coverage) # Node Test Coverage ## Summary -Node tests have covered 145/163 (88.96%, 5 generators excluded) common operators. +Node tests have covered 146/164 (89.02%, 5 generators excluded) common operators. Node tests have covered 0/0 (N/A) experimental operators. @@ -1257,7 +1257,9 @@ s = np.array([1.0, 1.5]).astype(np.float32) bias = np.array([0, 1]).astype(np.float32) mean = np.array([0, 3]).astype(np.float32) var = np.array([1, 1.5]).astype(np.float32) -training_mode = np.ones(1, dtype=bool) +# np.bool(1) raises "'bool' object has no attribute 'dtype'" while generating test data, +# so work around it by using np.byte(1).astype(bool) +training_mode = np.byte(1).astype(bool) y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var) node = onnx.helper.make_node( @@ -1276,7 +1278,7 @@ s = np.random.randn(3).astype(np.float32) bias = np.random.randn(3).astype(np.float32) mean = np.random.randn(3).astype(np.float32) var = np.random.rand(3).astype(np.float32) -training_mode = np.ones(1, dtype=bool) +training_mode = np.byte(1).astype(bool) momentum = 0.9 epsilon = 1e-2 y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var, momentum, epsilon) @@ -6011,6 +6013,132 @@ expect(node, inputs=[x, y], outputs=[z], +### Momentum +There are 3 test cases, listed as following: +
+momentum + +```python +# Define operator attributes. +norm_coefficient = 0.001 +alpha = 0.95 +beta = 0.1 + +# Create operator. +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X', 'G', 'V'], + outputs=['X_new', 'V_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='standard', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar +x = np.array([1.2, 2.8], dtype=np.float32) +g = np.array([-0.94, -2.5], dtype=np.float32) +v = np.array([1.7, 3.6], dtype=np.float32) + +# Compute expected outputs of Momentum. +x_new, v_new = apply_momentum(r, t, x, g, v, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x, g, v], + outputs=[x_new, v_new], name='test_momentum', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` + +
+
+momentum_multiple + +```python +# Define operator attributes. +norm_coefficient = 0.001 +alpha = 0.95 +beta = 0.85 + +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X1', 'X2', + 'G1', 'G2', 'H1', 'H2'], + outputs=['X1_new', 'X2_new', + 'V1_new', 'V2_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='standard', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar + +x1 = np.array([1.0], dtype=np.float32) +g1 = np.array([-1.0], dtype=np.float32) +v1 = np.array([2.0], dtype=np.float32) + +x2 = np.array([1.0, 2.0], dtype=np.float32) +g2 = np.array([-1.0, -3.0], dtype=np.float32) +v2 = np.array([4.0, 1.0], dtype=np.float32) + +# Compute expected outputs of Momentum. +x1_new, v1_new = apply_momentum(r, t, x1, g1, v1, + norm_coefficient, alpha, beta) +x2_new, v2_new = apply_momentum(r, t, x2, g2, v2, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x1, x2, g1, g2, v1, v2], + outputs=[x1_new, x2_new, v1_new, v2_new], name='test_momentum_multiple', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` + +
+
+nesterov_momentum + ```python +# Define operator attributes. +norm_coefficient = 0.01 +alpha = 0.95 +beta = 1.0 + +# Create operator. +node = onnx.helper.make_node('Momentum', + inputs=['R', 'T', 'X', 'G', 'V'], + outputs=['X_new', 'V_new'], + norm_coefficient=norm_coefficient, + alpha=alpha, + beta=beta, + mode='nesterov', + domain='ai.onnx.training' + ) + +# Define operator inputs. +r = np.array(0.1, dtype=np.float32) # scalar +t = np.array(0, dtype=np.int64) # scalar +x = np.array([1.2, 2.8], dtype=np.float32) +g = np.array([-0.94, -2.5], dtype=np.float32) +v = np.array([1.7, 3.6], dtype=np.float32) + +# Compute expected outputs of Momentum in Nesterov mode. +x_new, v_new = apply_nesterov(r, t, x, g, v, + norm_coefficient, alpha, beta) + +# Check results. +expect(node, inputs=[r, t, x, g, v], + outputs=[x_new, v_new], name='test_nesterov_momentum', + opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)]) +``` +
+ + ### Mul There are 2 test cases, listed as following:
@@ -6084,7 +6212,7 @@ expect(node, inputs=[x], outputs=[y], ### NegativeLogLikelihoodLoss -There are 7 test cases, listed as following: +There are 8 test cases, listed as following:
input_shape_is_NC @@ -6255,6 +6383,34 @@ expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_lo name='test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum') ``` +
+
+input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index + +```python +reduction = 'sum' +ignore_index = np.int64(0) +node = onnx.helper.make_node( + 'NegativeLogLikelihoodLoss', + inputs=['input', 'target', 'weight'], + outputs=['loss'], + reduction=reduction, + ignore_index=ignore_index +) + +N, C, dim1, dim2 = 3, 5, 6, 6 +np.random.seed(0) +input = np.random.rand(N, C, dim1, dim2).astype(np.float32) +target = np.random.randint(0, high=C, size=(N, dim1, dim2)) +target[0][0][0] = 0 +weight = np.random.rand(C).astype(np.float32) + +negative_log_likelihood_loss = compute_negative_log_likelihood_loss(input, target, weight=weight, reduction=reduction, ignore_index=ignore_index) + +expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_loss], + name='test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index') +``` +
@@ -6841,7 +6997,7 @@ for mode in ['edge', 'reflect']: ### Pow -There are 2 test cases, listed as following: +There are 3 test cases, listed as following:
pow @@ -6854,13 +7010,13 @@ node = onnx.helper.make_node( x = np.array([1, 2, 3]).astype(np.float32) y = np.array([4, 5, 6]).astype(np.float32) -z = np.power(x, y) # expected output [1., 32., 729.] +z = pow(x, y) # expected output [1., 32., 729.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_example') x = np.arange(60).reshape(3, 4, 5).astype(np.float32) y = np.random.randn(3, 4, 5).astype(np.float32) -z = np.power(x, y) +z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow') ``` @@ -6878,7 +7034,7 @@ node = onnx.helper.make_node( x = np.array([1, 2, 3]).astype(np.float32) y = np.array(2).astype(np.float32) -z = np.power(x, y) # expected output [1., 4., 9.] +z = pow(x, y) # expected output [1., 4., 9.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_scalar') @@ -6890,11 +7046,71 @@ node = onnx.helper.make_node( x = np.array([[1, 2, 3], [4, 5, 6]]).astype(np.float32) y = np.array([1, 2, 3]).astype(np.float32) # expected output [[1, 4, 27], [4, 25, 216]] -z = np.power(x, y).astype(np.float32) +z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_array') ``` +
+
+types + +```python +node = onnx.helper.make_node( + 'Pow', + inputs=['x', 'y'], + outputs=['z'], +) + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.int64) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int64') + +x = np.array([1, 2, 3]).astype(np.int64) +y = np.array([4, 5, 6]).astype(np.float32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int64_float32') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.int32) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int32') + +x = np.array([1, 2, 3]).astype(np.int32) +y = np.array([4, 5, 6]).astype(np.float32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int32_float32') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.uint64) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_uint64') + +x = np.array([1, 2, 3]).astype(np.float32) +y = np.array([4, 5, 6]).astype(np.uint32) +z = pow(x, y) # expected output [1., 32., 729.] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_uint32') + +x = np.array([1, 2, 3]).astype(np.int64) +y = np.array([4, 5, 6]).astype(np.int64) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int64_int64') + +x = np.array([1, 2, 3]).astype(np.int32) +y = np.array([4, 5, 6]).astype(np.int32) +z = pow(x, y) # expected output [1, 32, 729] +expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int32_int32') +``` +
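+The `pow` helper used above wraps `np.power` and casts the result back to the
+dtype of the base tensor; it is added by this patch in
+onnx/backend/test/case/node/pow.py. The cast matters because NumPy's type
+promotion would otherwise change the output dtype for mixed inputs, while
+Pow-12 keeps the output type of the first input:
+
+```python
+import numpy as np
+
+def pow(x, y):
+    # Cast back to the base dtype, mirroring the Pow-12 output type rule.
+    return np.power(x, y).astype(x.dtype)
+
+x = np.array([1, 2, 3], dtype=np.float32)
+y = np.array([4, 5, 6], dtype=np.int64)
+assert np.power(x, y).dtype == np.float64  # promotion without the cast
+assert pow(x, y).dtype == np.float32       # dtype follows the base tensor
+```
+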
@@ -10485,7 +10701,7 @@ expect(node, inputs=[x], outputs=[y], ### SoftmaxCrossEntropyLoss -There are 6 test cases, listed as following: +There are 7 test cases, listed as following:
softmaxcrossentropy_mean @@ -10564,6 +10780,36 @@ sce = softmaxcrossentropy(x, labels, weight=weights) expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cross_entropy_mean_weight') ``` +
+
+softmaxcrossentropy_mean_weights_ignore_index + +```python +# Define operator attributes. +reduction = 'mean' +ignore_index = np.int64(0) + +# Create operator. +node = onnx.helper.make_node('SoftmaxCrossEntropyLoss', + inputs=['x', 'y', 'w'], + outputs=['z'], + reduction=reduction, + ignore_index=ignore_index) + +# Define operator inputs. +np.random.seed(0) +x = np.random.rand(3, 5).astype(np.float32) +labels = np.random.randint(0, high=5, size=(3, )) +labels[0] = 0 +weights = np.array([0.9, 0.7, 0.8, 0.9, 0.9], dtype=np.float32) + +# Compute SoftmaxCrossEntropyLoss +sce = softmaxcrossentropy(x, labels, weight=weights, ignore_index=ignore_index) + +# Check results +expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cross_entropy_mean_weight_ignore_index') +``` +
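+Because reduction='mean' divides by the sum of the gathered weights, zeroing
+the weight of ignored targets removes them from both the numerator and the
+denominator. A small worked check with illustrative numbers (not taken from
+the test above):
+
+```python
+import numpy as np
+
+# Per-sample weighted losses and gathered weights for 3 samples;
+# sample 0 hit ignore_index, so its weight (and loss) is 0.
+weighted_loss = np.array([0.0, 1.4, 0.6], dtype=np.float32)
+gather_weight = np.array([0.0, 0.8, 0.9], dtype=np.float32)
+
+mean_loss = weighted_loss.sum() / gather_weight.sum()
+# = (1.4 + 0.6) / (0.8 + 0.9); only the non-ignored samples contribute.
+```
+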
softmaxcrossentropy_none

diff --git a/onnx/backend/test/case/model/sequence.py b/onnx/backend/test/case/model/sequence.py
index 560d440ca8a..1322c8db1bb 100644
--- a/onnx/backend/test/case/model/sequence.py
+++ b/onnx/backend/test/case/model/sequence.py
@@ -330,14 +330,15 @@ def make_graph(
     expect(model, inputs=[x], outputs=[out], name="test_sequence_model7")
 
     #8th testcase - split zero length
-    seq_split_node = onnx.helper.make_node('SplitToSequence', ['X'], ['seq_1'])
+    seq_split_node = onnx.helper.make_node('SplitToSequence', ['X', 'Splits'], ['seq_1'])
     seq_len_node = onnx.helper.make_node('SequenceLength', ['seq_1'], ['len'])
 
-    tensor_shape = []  # type: ignore
-    len_shape = []  # type: ignore
+    tensor_shape = ['n']  # type: ignore
+    splits_shape = [3]  # type: ignore
 
     x = np.array([]).astype(np.float32)
-    out_len = np.int64(0)
+    splits = np.array([0, 0, 0]).astype(np.int64)
+    out_len = np.int64(3)
 
     graph = onnx.helper.make_graph(
         nodes=[seq_split_node, seq_len_node],
@@ -348,9 +349,9 @@ def make_graph(
                 onnx.TensorProto.FLOAT,
                 tensor_shape),  # type: ignore
             onnx.helper.make_tensor_value_info(
-                'Split',
+                'Splits',
                 onnx.TensorProto.INT64,
-                len_shape)],  # type: ignore
+                splits_shape)],  # type: ignore
         outputs=[
             onnx.helper.make_tensor_value_info(
                 'len',
@@ -358,4 +359,4 @@ def make_graph(
                 len_shape)])  # type: ignore
 
     model = onnx.helper.make_model(graph, producer_name='backend-test')
-    expect(model, inputs=[x], outputs=[out_len], name="test_sequence_model8")
+    expect(model, inputs=[x, splits], outputs=[out_len], name="test_sequence_model8")
diff --git a/onnx/backend/test/case/node/batchnorm.py b/onnx/backend/test/case/node/batchnorm.py
index 44ae086329f..52d30d03d45 100644
--- a/onnx/backend/test/case/node/batchnorm.py
+++ b/onnx/backend/test/case/node/batchnorm.py
@@ -23,7 +23,7 @@ def batchnorm_test_mode(x, s, bias, mean, var, epsilon=1e-5):  # type: ignore
 
 def batchnorm_training_mode(x, s, bias, mean, var, momentum=0.9, epsilon=1e-5):  # type: ignore
     axis = np.arange(len(x.shape))
-    np.delete(axis, 1)
+    axis = np.delete(axis, 1)
     axis = tuple(axis)
     saved_mean = x.mean(axis=axis)
     saved_var = x.var(axis=axis)
@@ -43,7 +43,9 @@ def export_train():  # type: () -> None
     bias = np.array([0, 1]).astype(np.float32)
     mean = np.array([0, 3]).astype(np.float32)
     var = np.array([1, 1.5]).astype(np.float32)
-    training_mode = np.ones(1, dtype=bool)
+    # np.bool(1) fails during test data generation with "'bool' object has no attribute 'dtype'",
+    # so work around it by using np.byte(1).astype(bool)
+    training_mode = np.byte(1).astype(bool)
     y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var)
 
     node = onnx.helper.make_node(
@@ -62,7 +64,7 @@ def export_train():  # type: () -> None
     bias = np.random.randn(3).astype(np.float32)
     mean = np.random.randn(3).astype(np.float32)
     var = np.random.rand(3).astype(np.float32)
-    training_mode = np.ones(1, dtype=bool)
+    training_mode = np.byte(1).astype(bool)
     momentum = 0.9
     epsilon = 1e-2
     y, saved_mean, saved_var, output_mean, output_var = batchnorm_training_mode(x, s, bias, mean, var, momentum, epsilon)
diff --git a/onnx/backend/test/case/node/momentum.py b/onnx/backend/test/case/node/momentum.py
new file mode 100644
index 00000000000..530f6062ccb
--- /dev/null
+++ b/onnx/backend/test/case/node/momentum.py
@@ -0,0 +1,147 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from __future__ import unicode_literals
+
+import numpy as np  # type: ignore
+
+import onnx
+from ..base import Base
+from . import expect
+
+
+def apply_momentum(r, t, x, g, v, norm_coefficient, alpha, beta):  # type: ignore
+    # Add gradient of regularization term.
+    g_regularized = norm_coefficient * x + g
+    # Coefficient of gradient should be 1 at the first iteration.
+    beta_adjusted = beta if t > 0 else 1
+    # Update momentum.
+    v_new = alpha * v + beta_adjusted * g_regularized
+    # Apply SG with momentum update rule.
+    x_new = x - r * v_new
+    return x_new, v_new
+
+
+def apply_nesterov(r, t, x, g, v, norm_coefficient, alpha, beta):  # type: ignore
+    # Add gradient of regularization term.
+    g_regularized = norm_coefficient * x + g
+    # Coefficient of gradient should be 1 at the first iteration.
+    beta_adjusted = beta if t > 0 else 1
+    # Update momentum.
+    v_new = alpha * v + beta_adjusted * g_regularized
+    # Apply Nesterov with momentum update rule.
+    x_new = x - r * (g_regularized + alpha * v_new)
+    return x_new, v_new
+
+
+class Momentum(Base):
+
+    @staticmethod
+    def export_momentum():  # type: () -> None
+        # Define operator attributes.
+        norm_coefficient = 0.001
+        alpha = 0.95
+        beta = 0.1
+
+        # Create operator.
+        node = onnx.helper.make_node('Momentum',
+                                     inputs=['R', 'T', 'X', 'G', 'V'],
+                                     outputs=['X_new', 'V_new'],
+                                     norm_coefficient=norm_coefficient,
+                                     alpha=alpha,
+                                     beta=beta,
+                                     mode='standard',
+                                     domain='ai.onnx.training'
+                                     )
+
+        # Define operator inputs.
+        r = np.array(0.1, dtype=np.float32)  # scalar
+        t = np.array(0, dtype=np.int64)  # scalar
+        x = np.array([1.2, 2.8], dtype=np.float32)
+        g = np.array([-0.94, -2.5], dtype=np.float32)
+        v = np.array([1.7, 3.6], dtype=np.float32)
+
+        # Compute expected outputs of Momentum.
+        x_new, v_new = apply_momentum(r, t, x, g, v,
+                                      norm_coefficient, alpha, beta)
+
+        # Check results.
+        expect(node, inputs=[r, t, x, g, v],
+               outputs=[x_new, v_new], name='test_momentum',
+               opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
+
+    @staticmethod
+    def export_nesterov_momentum():  # type: () -> None
+        # Define operator attributes.
+        norm_coefficient = 0.01
+        alpha = 0.95
+        beta = 1.0
+
+        # Create operator.
+        node = onnx.helper.make_node('Momentum',
+                                     inputs=['R', 'T', 'X', 'G', 'V'],
+                                     outputs=['X_new', 'V_new'],
+                                     norm_coefficient=norm_coefficient,
+                                     alpha=alpha,
+                                     beta=beta,
+                                     mode='nesterov',
+                                     domain='ai.onnx.training'
+                                     )
+
+        # Define operator inputs.
+        r = np.array(0.1, dtype=np.float32)  # scalar
+        t = np.array(0, dtype=np.int64)  # scalar
+        x = np.array([1.2, 2.8], dtype=np.float32)
+        g = np.array([-0.94, -2.5], dtype=np.float32)
+        v = np.array([1.7, 3.6], dtype=np.float32)
+
+        # Compute expected outputs of Momentum (nesterov mode).
+        x_new, v_new = apply_nesterov(r, t, x, g, v,
+                                      norm_coefficient, alpha, beta)
+
+        # Check results.
+        expect(node, inputs=[r, t, x, g, v],
+               outputs=[x_new, v_new], name='test_nesterov_momentum',
+               opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
+
+    @staticmethod
+    def export_momentum_multiple():  # type: () -> None
+        # Define operator attributes.
+        norm_coefficient = 0.001
+        alpha = 0.95
+        beta = 0.85
+
+        node = onnx.helper.make_node('Momentum',
+                                     inputs=['R', 'T', 'X1', 'X2',
+                                             'G1', 'G2', 'V1', 'V2'],
+                                     outputs=['X1_new', 'X2_new',
+                                              'V1_new', 'V2_new'],
+                                     norm_coefficient=norm_coefficient,
+                                     alpha=alpha,
+                                     beta=beta,
+                                     mode='standard',
+                                     domain='ai.onnx.training'
+                                     )
+
+        # Define operator inputs.
+        r = np.array(0.1, dtype=np.float32)  # scalar
+        t = np.array(0, dtype=np.int64)  # scalar
+
+        x1 = np.array([1.0], dtype=np.float32)
+        g1 = np.array([-1.0], dtype=np.float32)
+        v1 = np.array([2.0], dtype=np.float32)
+
+        x2 = np.array([1.0, 2.0], dtype=np.float32)
+        g2 = np.array([-1.0, -3.0], dtype=np.float32)
+        v2 = np.array([4.0, 1.0], dtype=np.float32)
+
+        # Compute expected outputs of Momentum.
+        x1_new, v1_new = apply_momentum(r, t, x1, g1, v1,
+                                        norm_coefficient, alpha, beta)
+        x2_new, v2_new = apply_momentum(r, t, x2, g2, v2,
+                                        norm_coefficient, alpha, beta)
+
+        # Check results.
+        expect(node, inputs=[r, t, x1, x2, g1, g2, v1, v2],
+               outputs=[x1_new, x2_new, v1_new, v2_new], name='test_momentum_multiple',
+               opset_imports=[onnx.helper.make_opsetid('ai.onnx.training', 1)])
diff --git a/onnx/backend/test/case/node/negativeloglikelihoodloss.py b/onnx/backend/test/case/node/negativeloglikelihoodloss.py
index 7f655c7c87f..1d08addc7af 100644
--- a/onnx/backend/test/case/node/negativeloglikelihoodloss.py
+++ b/onnx/backend/test/case/node/negativeloglikelihoodloss.py
@@ -10,7 +10,7 @@ from . import expect
 
 
-def compute_negative_log_likelihood_loss(input, target, weight=None, reduction='mean'):  # type: ignore
+def compute_negative_log_likelihood_loss(input, target, weight=None, reduction='mean', ignore_index=None):  # type: ignore
     ''' Compute negative_log_likelihood_loss '''
     input_shape = input.shape
@@ -19,20 +19,37 @@ def compute_negative_log_likelihood_loss(input, target, weight=None, reduction='
         N, C = input_shape
         neg_gather_element_input = np.zeros((N, ), dtype=np.float32)
         for i in range(N):
-            neg_gather_element_input[i] = -input[i][target[i]]
+            if target[i] != ignore_index:
+                neg_gather_element_input[i] = -input[i][target[i]]
     else:
         N, C, dim1, dim2 = input_shape
         neg_gather_element_input = np.zeros((N, dim1, dim2), dtype=np.float32)
         for i in range(N):
             for d1 in range(dim1):
                 for d2 in range(dim2):
-                    neg_gather_element_input[i][d1][d2] = -input[i][target[i][d1][d2]][d1][d2]
+                    if target[i][d1][d2] != ignore_index:
+                        neg_gather_element_input[i][d1][d2] = -input[i][target[i][d1][d2]][d1][d2]
 
     loss = neg_gather_element_input
 
     if weight is not None:
         # Gather(input=weight, index=target)
         gather_weight = np.take(weight, target)
+
+        if ignore_index is not None:
+            if len(input_shape) == 2:
+                for i in range(input_shape[0]):
+                    if target[i] == ignore_index:
+                        gather_weight[i] = 0
+
+            # Inputs here are either (N, C) or (N, C, d1, d2), so the spatial
+            # case is len(input_shape) == 4 with target of shape (N, d1, d2).
+            if len(input_shape) == 4:
+                for i in range(input_shape[0]):
+                    for d1 in range(input_shape[2]):
+                        for d2 in range(input_shape[3]):
+                            if target[i][d1][d2] == ignore_index:
+                                gather_weight[i][d1][d2] = 0
+
         loss = gather_weight * loss
     if reduction == 'mean':
         return loss.sum() / gather_weight.sum()
@@ -189,3 +206,27 @@ def export_input_shape_is_NCd1d2_with_weight_reduction_sum():  # type: () -> Non
 
         expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_loss],
                name='test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum')
+
+    @staticmethod
+    def export_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index():  # type: () -> None
+        reduction = 'sum'
+        ignore_index = np.int64(0)
+        node = onnx.helper.make_node(
+            'NegativeLogLikelihoodLoss',
+            inputs=['input', 'target', 'weight'],
+            outputs=['loss'],
+            reduction=reduction,
+            ignore_index=ignore_index
+        )
+
+        N, C, dim1, dim2 = 3, 5, 6, 6
+        np.random.seed(0)
+        input = np.random.rand(N, C, dim1, dim2).astype(np.float32)
+        target = np.random.randint(0, high=C, size=(N, dim1, dim2))
+        target[0][0][0] = 0
+        weight = np.random.rand(C).astype(np.float32)
+
+        negative_log_likelihood_loss
= compute_negative_log_likelihood_loss(input, target, weight=weight, reduction=reduction, ignore_index=ignore_index) + + expect(node, inputs=[input, target, weight], outputs=[negative_log_likelihood_loss], + name='test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index') diff --git a/onnx/backend/test/case/node/pow.py b/onnx/backend/test/case/node/pow.py index a48dfbdcabe..14d73aa6c66 100644 --- a/onnx/backend/test/case/node/pow.py +++ b/onnx/backend/test/case/node/pow.py @@ -10,6 +10,11 @@ from . import expect +def pow(x, y): # type: ignore + z = np.power(x, y).astype(x.dtype) + return z + + class Pow(Base): @staticmethod @@ -22,13 +27,13 @@ def export(): # type: () -> None x = np.array([1, 2, 3]).astype(np.float32) y = np.array([4, 5, 6]).astype(np.float32) - z = np.power(x, y) # expected output [1., 32., 729.] + z = pow(x, y) # expected output [1., 32., 729.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_example') x = np.arange(60).reshape(3, 4, 5).astype(np.float32) y = np.random.randn(3, 4, 5).astype(np.float32) - z = np.power(x, y) + z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow') @@ -42,7 +47,7 @@ def export_pow_broadcast(): # type: () -> None x = np.array([1, 2, 3]).astype(np.float32) y = np.array(2).astype(np.float32) - z = np.power(x, y) # expected output [1., 4., 9.] + z = pow(x, y) # expected output [1., 4., 9.] expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_scalar') @@ -54,6 +59,62 @@ def export_pow_broadcast(): # type: () -> None x = np.array([[1, 2, 3], [4, 5, 6]]).astype(np.float32) y = np.array([1, 2, 3]).astype(np.float32) # expected output [[1, 4, 27], [4, 25, 216]] - z = np.power(x, y).astype(np.float32) + z = pow(x, y) expect(node, inputs=[x, y], outputs=[z], name='test_pow_bcast_array') + + @staticmethod + def export_types(): # type: () -> None + node = onnx.helper.make_node( + 'Pow', + inputs=['x', 'y'], + outputs=['z'], + ) + + x = np.array([1, 2, 3]).astype(np.float32) + y = np.array([4, 5, 6]).astype(np.int64) + z = pow(x, y) # expected output [1., 32., 729.] + expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int64') + + x = np.array([1, 2, 3]).astype(np.int64) + y = np.array([4, 5, 6]).astype(np.float32) + z = pow(x, y) # expected output [1, 32, 729] + expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int64_float32') + + x = np.array([1, 2, 3]).astype(np.float32) + y = np.array([4, 5, 6]).astype(np.int32) + z = pow(x, y) # expected output [1., 32., 729.] + expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_int32') + + x = np.array([1, 2, 3]).astype(np.int32) + y = np.array([4, 5, 6]).astype(np.float32) + z = pow(x, y) # expected output [1, 32, 729] + expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_int32_float32') + + x = np.array([1, 2, 3]).astype(np.float32) + y = np.array([4, 5, 6]).astype(np.uint64) + z = pow(x, y) # expected output [1., 32., 729.] + expect(node, inputs=[x, y], outputs=[z], + name='test_pow_types_float32_uint64') + + x = np.array([1, 2, 3]).astype(np.float32) + y = np.array([4, 5, 6]).astype(np.uint32) + z = pow(x, y) # expected output [1., 32., 729.] 
+        expect(node, inputs=[x, y], outputs=[z],
+               name='test_pow_types_float32_uint32')
+
+        x = np.array([1, 2, 3]).astype(np.int64)
+        y = np.array([4, 5, 6]).astype(np.int64)
+        z = pow(x, y)  # expected output [1, 32, 729]
+        expect(node, inputs=[x, y], outputs=[z],
+               name='test_pow_types_int64_int64')
+
+        x = np.array([1, 2, 3]).astype(np.int32)
+        y = np.array([4, 5, 6]).astype(np.int32)
+        z = pow(x, y)  # expected output [1, 32, 729]
+        expect(node, inputs=[x, y], outputs=[z],
+               name='test_pow_types_int32_int32')
diff --git a/onnx/backend/test/case/node/softmaxcrossentropy.py b/onnx/backend/test/case/node/softmaxcrossentropy.py
index 1948bc02f43..3e8adca88f6 100644
--- a/onnx/backend/test/case/node/softmaxcrossentropy.py
+++ b/onnx/backend/test/case/node/softmaxcrossentropy.py
@@ -10,7 +10,7 @@ from . import expect
 
 
-def softmaxcrossentropy(x, target, weight=None, reduction='mean'):  # type: ignore
+def softmaxcrossentropy(x, target, weight=None, reduction='mean', ignore_index=None):  # type: ignore
     max_x = np.max(x, axis=1, keepdims=True)
     exp_x = np.exp(x - max_x)
     p = exp_x / np.sum(exp_x, axis=1, keepdims=True)
@@ -20,17 +20,32 @@ def softmaxcrossentropy(x, target, weight=None, reduction='mean'):  # type: igno
         N, C = input_shape
         neg_gather_element_input = np.zeros((N, ), dtype=np.float32)
         for i in range(N):
-            neg_gather_element_input[i] = -inp[i][target[i]]
+            if target[i] != ignore_index:
+                neg_gather_element_input[i] = -inp[i][target[i]]
 
     if len(input_shape) == 3:
         N, C, D = input_shape
         neg_gather_element_input = np.zeros((N, D), dtype=np.float32)
         for i in range(N):
             for d in range(D):
-                neg_gather_element_input[i][d] = -inp[i][target[i][d]][d]
+                if target[i][d] != ignore_index:
+                    neg_gather_element_input[i][d] = -inp[i][target[i][d]][d]
 
     loss = neg_gather_element_input
 
     if weight is not None:
         gather_weight = np.take(weight, target)
+
+        if ignore_index is not None:
+            if len(input_shape) == 2:
+                for i in range(input_shape[0]):
+                    if target[i] == ignore_index:
+                        gather_weight[i] = 0
+
+            if len(input_shape) == 3:
+                for i in range(input_shape[0]):
+                    for j in range(input_shape[2]):
+                        if target[i][j] == ignore_index:
+                            gather_weight[i][j] = 0
+
         loss = gather_weight * loss
     if reduction == 'mean':
         return loss.sum() / gather_weight.sum()
@@ -178,3 +193,29 @@ def export_softmaxcrossentropy_mean_weights():  # type: () -> None
 
         # Check results
         expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cross_entropy_mean_weight')
+
+    @staticmethod
+    def export_softmaxcrossentropy_mean_weights_ignore_index():  # type: () -> None
+        # Define operator attributes.
+        reduction = 'mean'
+        ignore_index = np.int64(0)
+
+        # Create operator.
+        node = onnx.helper.make_node('SoftmaxCrossEntropyLoss',
+                                     inputs=['x', 'y', 'w'],
+                                     outputs=['z'],
+                                     reduction=reduction,
+                                     ignore_index=ignore_index)
+
+        # Define operator inputs.
+ np.random.seed(0) + x = np.random.rand(3, 5).astype(np.float32) + labels = np.random.randint(0, high=5, size=(3, )) + labels[0] = 0 + weights = np.array([0.9, 0.7, 0.8, 0.9, 0.9], dtype=np.float32) + + # Compute SoftmaxCrossEntropyLoss + sce = softmaxcrossentropy(x, labels, weight=weights, ignore_index=ignore_index) + + # Check results + expect(node, inputs=[x, labels, weights], outputs=[sce], name='test_softmax_cross_entropy_mean_weight_ignore_index') diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_old/model.onnx b/onnx/backend/test/data/node/test_batchnorm_epsilon_old/model.onnx index a58bef46800..870cd21363c 100644 --- a/onnx/backend/test/data/node/test_batchnorm_epsilon_old/model.onnx +++ b/onnx/backend/test/data/node/test_batchnorm_epsilon_old/model.onnx @@ -1,4 +1,4 @@ - backend-test: + backend-test: A x s diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/model.onnx b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/model.onnx index 94158119efbc9108fa9f4a63ab6648871246b134..b34f0a5391af2f70e7b05760c795fa2c5bc74873 100644 GIT binary patch delta 119 zcmX@ie4JT`gI$OxDKR-aH7`ZCB(=E2YQsh$5k^^YF5Z%&#LT?Ry!80o{FGE7HZB$p zP9cUQX)eafi3cSn$1|FlNpNu$CzhqA#OJ0a<_U3ead0pSv2ZbQFeiy~aYAIu5{r-} IoR|c70CGwjvj6}9 delta 107 zcmX@ke3)5?gH4DhDKR-aH7`ZCB(=E2YRyI=5k?tlF5Z%&#LT?Ry!80o{FGE7E-nrZ zP9YX9CJx5Qc8tM#B3xX>iDjuN@wusqc|vSlEF6qN3`xRVoDk8n#3GoW6O#ZB0PA`g A4FCWD diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/input_5.pb b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/input_5.pb index 2b72d47c5f9..f72fe9452db 100644 --- a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/input_5.pb +++ b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/input_5.pb @@ -1 +1 @@ - B training_modeJ \ No newline at end of file + B training_modeJ \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_0.pb index 29af295c856d443c74b4fc4c2fb15286670b7847..77cd8aabed707547c796b76dd9094c23c8f03e0e 100644 GIT binary patch literal 496 zcmVL1PBEX0YU+JO5g)EN&rCB5V=1gXV*W6$_zl`00Te*uVFut=G8yh zC7eG3te!vooVh;)qpLqaR`5UY{L4SXcB(&sTf0ARl(j#QZ1O*?Q<*>1f3!dJnQlK{ z^$0-6+>XC;L#n?NIGR4dbuhq&AA>(YZ2!LGP=>zKkO07J7#zQ~9#+3*?d3ko zi^o4$1IxXhbsWFI(AmEaFxJ1I7jnI1?i9XNJ>$N95evWAtscKtxVJu_W|_Y3myf<{ zIsd-2tv5Z)Ss1?zaACfjL(aaab#%QEkD0zQR`tHe+%CQx3`f0K=@h=|fT+Ehd#b*K zMTNc+(||uaJCQ$1<#InYBNIOyW34}E1dl(kHXT26S-d~31!+IX&Za*cP|82s*s4FZ z)89YctVTa4F1bHVEr36l*=#?YyM;fCzmq?02d6)U5Ia9e{;t24-E6<;!;L=@So^=y zDHy=Y9@D=E7mB{4h9W=o^}@d}D=@vY%h;kZ29Jl*O5A2C#yY@L1PBEX0YU+JO5g*+{PaJ79JW8Y$k0C|d<8&E!udbsIBh?jcF#Yv zube;7=bk^gm$pCcO{zcGwB|p5zs5iDMX5hdAh zC?mj7!{xu%?!UbkhVRbum!(v`MW-~YMj2MFwMRI^te7Up>RH#+^{}k mijh6m&)vRa61P5jB!|BE0V+K@pQ$~(L~Xucsz|(0{F*%z= \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_3.pb b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_3.pb index 16d491d0a59..b6f0d22db61 100644 --- a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_3.pb +++ b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_3.pb @@ -1,2 +1,2 @@ -B -saved_meanJ`> \ No newline at end of file +B +saved_meanJ "`= 4t>O= \ No 
newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_4.pb b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_4.pb index 7f174abd3a1..9d7c5f6aeca 100644 --- a/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_4.pb +++ b/onnx/backend/test/data/node/test_batchnorm_epsilon_training_mode/test_data_set_0/output_4.pb @@ -1 +1 @@ -B saved_varJ{ҋ? \ No newline at end of file +B saved_varJ 0+X?- ?? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_example_old/model.onnx b/onnx/backend/test/data/node/test_batchnorm_example_old/model.onnx index b7b47f79c09..d21127e42c0 100644 --- a/onnx/backend/test/data/node/test_batchnorm_example_old/model.onnx +++ b/onnx/backend/test/data/node/test_batchnorm_example_old/model.onnx @@ -1,4 +1,4 @@ - backend-test: + backend-test: . x s diff --git a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/model.onnx b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/model.onnx index 97d51b19d7fd0deb73b4a502c1851cc8a4d286ac..8727e3e78bef95f27f6a5a37009ad6dfd3bed150 100644 GIT binary patch delta 110 zcmZ3_yqQ^ugI$OxDKR-aH7`ZCB(=E2YR*C-MhS5)-jbrk%)HFJ^!VKTlvE)$E*1_> zA%@9%jKO9STwKM8WvMCgxv7bHLR?%N9E?INTudBHN#b0b5Sg;XB4i0CCIKD*HLw`X delta 106 zcmdnYyq;NzgH4DhDKR-aH7`ZCB(=E2YQ{n#Mj2@?-jbrk%)HFJ^!VKTlvE)uE)EV( zAr>ws4#vq^jKO*$TwKM8WvMCgxv7bHLTp?t9E?H?Ny1#55Ye*4BAB2PlK>9@yww<) diff --git a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/input_5.pb b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/input_5.pb index 2b72d47c5f9..f72fe9452db 100644 --- a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/input_5.pb +++ b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/input_5.pb @@ -1 +1 @@ - B training_modeJ \ No newline at end of file + B training_modeJ \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_0.pb index e698ff662bf0754defaf14212c0ecdd0c1c59bef..5211daa8b6d8a206444ac22784393c083a5cafdb 100644 GIT binary patch literal 39 qcmd;JzsRICtjR^+; literal 39 vcmd;JPx# literal 27 icmd;J5@2-V&Mz$~C@qQ4O-;=6;+Qp4(k?B{%mDyq$_O|B diff --git a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_2.pb b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_2.pb index 1ea67621406..ab0f6f0d655 100644 --- a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_2.pb +++ b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_2.pb @@ -1,2 +1,2 @@ B -output_varJ?""? \ No newline at end of file +output_varJwww?UU? 
\ No newline at end of file diff --git a/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_3.pb b/onnx/backend/test/data/node/test_batchnorm_example_training_mode/test_data_set_0/output_3.pb index bee5bfe8c58006b1d5dab906dc05bcd5af4312e2..0593c9a9e913fdcfa48cd550bd009486df4864e5 100644 GIT binary patch literal 26 dcmd;J5@2-VDo!j*O^MG7xV6oq@!q*%8{X2*C_+2TgcS-fg%GPE-0qz0e8DWL6Mpz)^JX#kdw>-*y8M39(lI)jG z`xuFj&o_}wChhD1kUn5oN-L7n@i_H*fT1?E9J1C5szp;D;5vQ6m`tOK%E-+|Rm$B; z4FAsDLKX`(HblFzZG<94ul(NID)i;$%dPXqZ!l!Que?~DMbC-8^UD#L>x3Avz=OqK Ykas~jz)P~unuq6{-O2eQ{&NK-HOvS#`d{?sPGqVL5 zgNqS4Y0S!+i7{TY83DlN$^{1_OTr`KG$bBm;z1@EWK2K(WG9=#e_K68mbTsX)MmRT Rdup?7ug7!mr!Z*3Lw|4RTkik> literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_0.pb new file mode 100644 index 00000000000..d0483cc61f7 --- /dev/null +++ b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_0.pb @@ -0,0 +1 @@ +BRJ= \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..54656de61df113af74638d4e31110b7cb8c8c9f0 GIT binary patch literal 15 QcmWe&cVZ0j;$VOR01J-+0RR91 literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_2.pb new file mode 100644 index 0000000000000000000000000000000000000000..259b58bf7f3d11fde5b959e44f4a97ae791ed8c5 GIT binary patch literal 14 Vcmd;J6kv2>iZJwIVPI&m2LKBq0rda? 
literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_3.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_3.pb new file mode 100644 index 0000000000000000000000000000000000000000..eb51654d455357db15a81c2fc6a46a89d28f47b2 GIT binary patch literal 18 Zcmd;J5@2*ayRs1VPI(34*&}q0%QOH literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_5.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_5.pb new file mode 100644 index 0000000000000000000000000000000000000000..62e5b3a8f7515918af3f49d0b6536a2fcea0da0b GIT binary patch literal 18 Zcmd;J5@2*@-XybVPIfz000T20cHRI literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_7.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/input_7.pb new file mode 100644 index 0000000000000000000000000000000000000000..cb33a3de6635f4b8e576e11be4af29c88323dfdf GIT binary patch literal 18 Xcmd;J5@2*<@-Xt^U|?u)0AhOp6S@Mc literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..cc63aee5c44bdc27df70bb0c29c8593dda990200 GIT binary patch literal 18 Zcmd;J6kv2>i!hAOOD*?eVPI&m2LK%N1EK%` literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_1.pb b/onnx/backend/test/data/node/test_momentum_multiple/test_data_set_0/output_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..a3e2b32c78b8014b6d81469d0adad991b2143b0c GIT binary patch literal 22 dcmd;J5@2*x0a8IgMA~DHah?k69R1j delta 11 ScmZ3>x0a8IgKZ;|Hah?k4gzcd diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_mean/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_mean/model.onnx index aee8b92e5d9617b5b8626e76db65b087e2e1ec65..6455c78d411b2ec4b88aad83580bda7faecb46af 100644 GIT binary patch delta 10 Rcmeyy_>GZ?gMA{?7XTDt1BUGZ?gKZ+y7XTDp1BL(q diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_mean_expanded/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_mean_expanded/model.onnx index 5eb7e7cd05b1400214ee9a0ee202b6a875a34a8e..d7a6ef04b5c4d5c44e324e65d58bd5fb93fba7bc 100644 GIT binary patch delta 11 ScmeAd>K9_-VBg5Z%LxDv$pRe! 
delta 11 ScmeAd>K9_-VB5&V%LxDv!~z@u diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_sum/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_sum/model.onnx index 27a5fadc85ac98f84fbc3ddd71bf6d309777f6f1..4f134ce1b2aeef7049b710a0061fa8cbcfc7ffa7 100644 GIT binary patch delta 10 Rcmeyu_=S;)gMA{?Cjb;X1Azbl delta 10 Rcmeyu_=S;)gKZ+yCjb;T1AqVk diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_sum_expanded/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_reduction_sum_expanded/model.onnx index 37e8a8bc8213f8e3d6df34a3cca91689bd844bf4..99f9bfe5f25a43ca1f1f1849174234afad371fc0 100644 GIT binary patch delta 11 Scmew-_)n0DgMA~@9}WN-Km+Fh delta 11 Scmew-_)n0DgKZ?gigMA~D3?l#!>jEMG delta 11 ScmbQoG>?gigKZ;|3?l#!<^mxA diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_expanded/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_expanded/model.onnx index 129b0d2170914885694c1f701f5b9733f41695e0..b4b60b33b3d4ab8dde194b5482e8f60c2492b700 100644 GIT binary patch delta 11 Scmdlbu}gx9gMA~D1rGod>H@6* delta 11 Scmdlbu}gx9gKZ;|1rGod0}6LXUz6J$lq@=nv?@A?XgArE#}pH`britN+r8O@oJ* zWxk%7cg2;=>uR?rtDXY+@cI)&4Rd1;eSj=fohE*dWONf`+B)yocN@}GFke(UU79_$ zrg2V{Tzb5P+-U1MLSq$uuV$wx=Hzups^10U>^kxN=P63$$FL0TNW#`>Y$q0n-ZO%4 zK;Kjc#(CVlt7r9sM-v)6--8ntnTr26Zy!bvatR9&$pT`Lu;#36(Dwe~PD&_5CiAi5 bTwP3_0R+#d!)$#6e?K5INI5rqi;MghwVYZ+ literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..1a71820b36a1975cdfa96cecb1c5fcf20bc15a86 GIT binary patch literal 2180 zcmWO6izC+e0tWDs_r*a=#9_lCN^T>U^!~oj6Ro*KvxXwHS1ftmg^^3SZm}Yv%UUNY za>{KbtqW>vN?NsASv9(>#jHA0TbGrc&mZy8oTMq!lxxbTO!ShiPE1Z;6?nnS5x>?t&E!&U%}P1 zQaHH0$Ir@#akog1dcSXz!cEP%xcx1zd}PY3Z7Doc=0KHRCmK!-;%Tij|8C9@9&NKZ z+HA|x#HH|*spE(I-SD6LHR7J>QCTqrJ$Wt%|Gp^B-3`VP2P?ekJd?>S>*3q|5QqG- zG1_Z^2%S5^T{eqX^L268-xuX)k7K9Z5^CK#BQ_O=5Z`|z_&kfcV~?P&o6T9*OsJeM zs4RS2=%25^PMbeP>#kZ9&m0qN7wX_`as{VX6pEh~I3!{2@D5E1hv%#6tfVJi641<24Cf zALPOftJOr!-dS)pHmB#e`n2DkhJCY=IA~`|g}D}&svFUJcpg2E>_Xk$KKR@YfpR|( zeJX`>J`_}$zJzix7LybO_`4-R+)(|Dj{04wT;R;s*dAz%xI^3aoMiv?XQE})o=^N9 zz%xRN^UGuSQ%Npi5{bMn8#-Msfv??u$;8TwegpMT27iOG9fqv!3gQ#>>D2mo1rO<# zvLQ0k~Fm;Z*4F z2(#LVb?%1TV->-&LyyJNRCV5xG+YgUBA>Wq0<348i3D_i0R z8Dj^RNwvLg5bZ{=3th`6qfaoc={2$rS@l_03qk2Vk zXE^8UWpTmtbjl23`Pa<%IAL=Wa{j$fjI6(ivlkL@tw{qJKZc3VyFDn)Q%Jq&P?8<0>hhvUU4VX4}VFqaZ|l?|ix zHwP9NI5Wzo5@(A~p+4UOW2A0aHwN?>zcZai?^@C2`V?_E;wCaH)cDX#izkvC zs6Txp$A^v319Ay zrR|wOmJ3Vdj=#;S@jmQJS|@r;1|U+4P!$%+q&Q7+FF&0R{x}TZx(k9 zb?dQgjvQO&ER-}0N5#<~Gv0{J;{4KFZb-U>l3UieS8T}ut(W-IHTtF#y?w$y7bt41+r^G>^T4gl(O;X%WP_bMdrEmBhEM ziI|?!j=j4BY3rNAPbYuE_NG znb*^V|5nFyxI3KLN!}vq)_M%7?!nsoN0F*(hi0arvzr5Jg3S5OFRm=Q_5mi3HDjKc zJwKhMWR8P2)lOF6QOjRa-E~(6*-e!CzCQ`yOB&37Bc8wc91#bueUEIr@1(2-4e literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_1.pb 
b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..febe677b84bd42bb04ce84d35a913f994cf373da GIT binary patch literal 451 zcmYk0!3}^g3t%MGtEa6_|Pfx bC^U87NU3|sG8-P}c-+G$^r*q!@vvikna%{# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_2.pb new file mode 100644 index 00000000000..6d74ae43745 --- /dev/null +++ b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/input_2.pb @@ -0,0 +1 @@ +BweightJY?`o?p?W=K? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/output_0.pb new file mode 100644 index 00000000000..d97cebfae6a --- /dev/null +++ b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index/test_data_set_0/output_0.pb @@ -0,0 +1 @@ +BlossJbڿ \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/model.onnx b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/model.onnx new file mode 100644 index 0000000000000000000000000000000000000000..96173792901c7d2ff7370762369770a0fd8e895c GIT binary patch literal 4425 zcmd5=Pixdb6yM2av*~kbm$8%{1R=)-5B2CtEaGMD!Np52Lz=wJ4x62qDaSu)}j^{YP~G-A`^KlV24&hs%2$_p2bTPQj4R* z^dP-2&ZL=%GnCoPh!WE(F|trXR|_FCRg_j#rMMEfTE@1MifN@9k#@pii?*IC{h~rV z$I&hD;}zxIks1?I+Fnu^VcqW2yJ8jy<^a zfcgRWD?%D>!om7~8g*?z!Dq%_6GE31%7*eemPHgE*8NzS(QW9R#(9OKe#nNwh;=TQ z51$w*wL0L1!W(c~fA!%{4@kNL&~yV*%4&^F6BVg|a6EHhd8RkXWrFaP@%QAI=IT9v zL*zD$^NsO$tc%)drU^!~oj6Ro*KvxXwHS1ftmg^^3SZm}Yv%UUNY za>{KbtqW>vN?NsASv9(>#jHA0TbGrc&mZy8oTMq!lxxbTO!ShiPE1Z;6?nnS5x>?t&E!&U%}P1 zQaHH0$Ir@#akog1dcSXz!cEP%xcx1zd}PY3Z7Doc=0KHRCmK!-;%Tij|8C9@9&NKZ z+HA|x#HH|*spE(I-SD6LHR7J>QCTqrJ$Wt%|Gp^B-3`VP2P?ekJd?>S>*3q|5QqG- zG1_Z^2%S5^T{eqX^L268-xuX)k7K9Z5^CK#BQ_O=5Z`|z_&kfcV~?P&o6T9*OsJeM zs4RS2=%25^PMbeP>#kZ9&m0qN7wX_`as{VX6pEh~I3!{2@D5E1hv%#6tfVJi641<24Cf zALPOftJOr!-dS)pHmB#e`n2DkhJCY=IA~`|g}D}&svFUJcpg2E>_Xk$KKR@YfpR|( zeJX`>J`_}$zJzix7LybO_`4-R+)(|Dj{04wT;R;s*dAz%xI^3aoMiv?XQE})o=^N9 zz%xRN^UGuSQ%Npi5{bMn8#-Msfv??u$;8TwegpMT27iOG9fqv!3gQ#>>D2mo1rO<# zvLQ0k~Fm;Z*4F z2(#LVb?%1TV->-&LyyJNRCV5xG+YgUBA>Wq0<348i3D_i0R z8Dj^RNwvLg5bZ{=3th`6qfaoc={2$rS@l_03qk2Vk zXE^8UWpTmtbjl23`Pa<%IAL=Wa{j$fjI6(ivlkL@tw{qJKZc3VyFDn)Q%Jq&P?8<0>hhvUU4VX4}VFqaZ|l?|ix zHwP9NI5Wzo5@(A~p+4UOW2A0aHwN?>zcZai?^@C2`V?_E;wCaH)cDX#izkvC zs6Txp$A^v319Ay zrR|wOmJ3Vdj=#;S@jmQJS|@r;1|U+4P!$%+q&Q7+FF&0R{x}TZx(k9 zb?dQgjvQO&ER-}0N5#<~Gv0{J;{4KFZb-U>l3UieS8T}ut(W-IHTtF#y?w$y7bt41+r^G>^T4gl(O;X%WP_bMdrEmBhEM ziI|?!j=j4BY3rNAPbYuE_NG znb*^V|5nFyxI3KLN!}vq)_M%7?!nsoN0F*(hi0arvzr5Jg3S5OFRm=Q_5mi3HDjKc zJwKhMWR8P2)lOF6QOjRa-E~(6*-e!CzCQ`yOB&37Bc8wc91#bueUEIr@1(2-4e literal 0 HcmV?d00001 diff 
--git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..febe677b84bd42bb04ce84d35a913f994cf373da GIT binary patch literal 451 zcmYk0!3}^g3t%MGtEa6_|Pfx bC^U87NU3|sG8-P}c-+G$^r*q!@vvikna%{# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_2.pb new file mode 100644 index 00000000000..6d74ae43745 --- /dev/null +++ b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/input_2.pb @@ -0,0 +1 @@ +BweightJY?`o?p?W=K? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/output_0.pb new file mode 100644 index 00000000000..d97cebfae6a --- /dev/null +++ b/onnx/backend/test/data/node/test_negative_log_likelihood_loss_input_shape_is_NCd1d2_with_weight_reduction_sum_ignore_index_expanded/test_data_set_0/output_0.pb @@ -0,0 +1 @@ +BlossJbڿ \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_nesterov_momentum/model.onnx b/onnx/backend/test/data/node/test_nesterov_momentum/model.onnx new file mode 100644 index 0000000000000000000000000000000000000000..1f3fb547acead6c3d2cf783e4d4632074cc15b86 GIT binary patch literal 326 zcmd;J7vf1uOwLZtOVKS!EiSQ|%f!{q$i*1M#TdfH7{SHp&czre#2OKwms&2w8U~`2 zIDGSSQ}aqnbG7)nSQB#!G7?3Njf?FUFfwZKaj_(&mL!TYFf@Sq!dxu5`6;PN9C<*q zQ;YJ;7BDhvNpT6}N6k_9I;b0VE0C7UV9CjdwALwu)E-ntB3=0<%2NOuZ9bF&{ bSs+Q63+hEAZ6HCghmln}iEv>!QGgKueA-B# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_0.pb new file mode 100644 index 00000000000..d0483cc61f7 --- /dev/null +++ b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_0.pb @@ -0,0 +1 @@ +BRJ= \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..54656de61df113af74638d4e31110b7cb8c8c9f0 GIT binary patch literal 15 QcmWe&cVZ0j;$VOR01J-+0RR91 literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_2.pb new file mode 100644 index 00000000000..15244fd4b75 --- /dev/null +++ b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_2.pb @@ -0,0 +1 @@ +BXJ?333@ \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_3.pb 
b/onnx/backend/test/data/node/test_nesterov_momentum/test_data_set_0/input_3.pb new file mode 100644 index 0000000000000000000000000000000000000000..439d577ffbbfb621346b9f41aca167107861274a GIT binary patch literal 17 Ycmd;J5@2*qTJG#E!I=7L#LjkfIa{vGU delta 62 zcmV-E0KxzM0saAy8UeqTJG#E!I=GR$LjgD&aR2}S diff --git a/onnx/backend/test/data/node/test_pow_bcast_array/model.onnx b/onnx/backend/test/data/node/test_pow_bcast_array/model.onnx index 8b630fa48b5..0edfc2fb0ef 100644 --- a/onnx/backend/test/data/node/test_pow_bcast_array/model.onnx +++ b/onnx/backend/test/data/node/test_pow_bcast_array/model.onnx @@ -1,4 +1,4 @@ - backend-test:a + backend-test:a  x yz"Powtest_pow_bcast_arrayZ @@ -13,4 +13,4 @@ z   -B \ No newline at end of file +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_bcast_scalar/model.onnx b/onnx/backend/test/data/node/test_pow_bcast_scalar/model.onnx index b9eaabe2114b120ff0c798541e98105d77aa6b10..bea5d68cad06df8c466ae46682581112f02f9784 100644 GIT binary patch delta 10 Rcmd1FVd7w)$dt**0{{%B0rLO= delta 10 Rcmd1FVd7w($dt**2>=X>0qg(( diff --git a/onnx/backend/test/data/node/test_pow_example/model.onnx b/onnx/backend/test/data/node/test_pow_example/model.onnx index ddc1ea3ac00..a4aa9a398bf 100644 --- a/onnx/backend/test/data/node/test_pow_example/model.onnx +++ b/onnx/backend/test/data/node/test_pow_example/model.onnx @@ -1,4 +1,4 @@ - backend-test:U + backend-test:U  x yz"Powtest_pow_exampleZ @@ -13,4 +13,4 @@ z  -B \ No newline at end of file +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_float/model.onnx b/onnx/backend/test/data/node/test_pow_types_float/model.onnx new file mode 100644 index 00000000000..c0fd50393cd --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_float/model.onnx @@ -0,0 +1,16 @@ + backend-test:Y + +x +yz"Powtest_pow_types_floatZ +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..d91963f15c890955c4049fa33858b8d64ca4c264 GIT binary patch literal 33 Ycmd;J7GQT`tniXxWPkuBD9sF|0V1^lMgRZ+ literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..a760307aaa9d741605ce33d06bfeb54d3669be86 GIT binary patch literal 21 acmd;J7GQK@tn}hxU}$h)U|0ae2OIz(Yy-~# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_float/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..c77d2103a132ff11755f8c395343ece7c1ccd1ff GIT binary patch literal 33 acmd;J7GQT`tn!jzWPkt#D1DO&!TTL032)r>i_@% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_int64/model.onnx b/onnx/backend/test/data/node/test_pow_types_float32_int64/model.onnx new file mode 100644 index 00000000000..9fe81897e54 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_float32_int64/model.onnx @@ -0,0 +1,16 @@ + backend-test:a + +x +yz"Powtest_pow_types_float32_int64Z +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git 
a/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..62e4e87e30c2908b48e0e912c49f073faf7953fd GIT binary patch literal 21 acmd;J7GQK@tnlJtU}&&sU|?_nA_o8)lme{) literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..2a776165494d557dfd635a732120b459acb76fcf GIT binary patch literal 33 Ycmd;J7GQT`tn`v#VSoTuD9r|?0V7}mPyhe` literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_int64/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..0cc39708ca5cd984c0c32285657b92f0545914ba GIT binary patch literal 21 ccmd;J7GQK@tn%VvU}&&sU|?`!a4>TL032)r>i_@% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint32/model.onnx b/onnx/backend/test/data/node/test_pow_types_float32_uint32/model.onnx new file mode 100644 index 00000000000..110e4dc2fc7 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_float32_uint32/model.onnx @@ -0,0 +1,16 @@ + backend-test:b + +x +yz"Powtest_pow_types_float32_uint32Z +x + + +Z +y + +  +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..62e4e87e30c2908b48e0e912c49f073faf7953fd GIT binary patch literal 21 acmd;J7GQK@tnlJtU}&&sU|?_nA_o8)lme{) literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..7918aa428c7bc792a28f31407f517e45ba1c62f9 GIT binary patch literal 21 Zcmd;J7T|GWtn}hxVPIfj1!6WJ1^^SH0Z9M= literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint32/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..0cc39708ca5cd984c0c32285657b92f0545914ba GIT binary patch literal 21 ccmd;J7GQK@tn%VvU}&&sU|?`!a4>TL032)r>i_@% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint64/model.onnx b/onnx/backend/test/data/node/test_pow_types_float32_uint64/model.onnx new file mode 100644 index 00000000000..46cdbc7e253 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_float32_uint64/model.onnx @@ -0,0 +1,16 @@ + backend-test:b + +x +yz"Powtest_pow_types_float32_uint64Z +x + + +Z +y + +  +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..62e4e87e30c2908b48e0e912c49f073faf7953fd GIT binary patch literal 21 acmd;J7GQK@tnlJtU}&&sU|?_nA_o8)lme{) literal 0 HcmV?d00001 diff --git 
a/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..015d9afbace42a4d241bc77acb6ea8770177b674 GIT binary patch literal 33 Ycmd;J7T|Satn`v#VSoTuD9r|?0VEUwRsaA1 literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_float32_uint64/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..0cc39708ca5cd984c0c32285657b92f0545914ba GIT binary patch literal 21 ccmd;J7GQK@tn%VvU}&&sU|?`!a4>TL032)r>i_@% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int/model.onnx b/onnx/backend/test/data/node/test_pow_types_int/model.onnx new file mode 100644 index 00000000000..2e62ba5ad4f --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_int/model.onnx @@ -0,0 +1,16 @@ + backend-test:W + +x +yz"Powtest_pow_types_intZ +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..62e4e87e30c2908b48e0e912c49f073faf7953fd GIT binary patch literal 21 acmd;J7GQK@tnlJtU}&&sU|?_nA_o8)lme{) literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..2a776165494d557dfd635a732120b459acb76fcf GIT binary patch literal 33 Ycmd;J7GQT`tn`v#VSoTuD9r|?0V7}mPyhe` literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_int/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..0cc39708ca5cd984c0c32285657b92f0545914ba GIT binary patch literal 21 ccmd;J7GQK@tn%VvU}&&sU|?`!a4>TL032)r>i_@% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_float32/model.onnx b/onnx/backend/test/data/node/test_pow_types_int32_float32/model.onnx new file mode 100644 index 00000000000..bc90f825a44 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_int32_float32/model.onnx @@ -0,0 +1,16 @@ + backend-test:a + +x +yz"Powtest_pow_types_int32_float32Z +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..bb7ca3491e33a4debc334ff548aa5cca7ec638db GIT binary patch literal 21 Zcmd;J7GQH?tnlJtWME)m0%B$$1^^P@0XYBw literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..a760307aaa9d741605ce33d06bfeb54d3669be86 GIT binary patch literal 21 acmd;J7GQK@tn}hxU}$h)U|0ae2OIz(Yy-~# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/output_0.pb 
b/onnx/backend/test/data/node/test_pow_types_int32_float32/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..67fd11d8712c63df32379d3d5d267b267b5638de GIT binary patch literal 21 acmd;J7GQH?tn%VvWME)W0OFfW3=9AlO9C+f literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_int32/model.onnx b/onnx/backend/test/data/node/test_pow_types_int32_int32/model.onnx new file mode 100644 index 00000000000..ee0ccee1ac5 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_int32_int32/model.onnx @@ -0,0 +1,16 @@ + backend-test:_ + +x +yz"Powtest_pow_types_int32_int32Z +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..bb7ca3491e33a4debc334ff548aa5cca7ec638db GIT binary patch literal 21 Zcmd;J7GQH?tnlJtWME)m0%B$$1^^P@0XYBw literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..194c2ad3ae1744cdd30210b1f3770f246a19e3eb GIT binary patch literal 21 Zcmd;J7GQH?tn}hxVPIfj1!6WJ1^^Q_0Yd-) literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_int32_int32/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..67fd11d8712c63df32379d3d5d267b267b5638de GIT binary patch literal 21 acmd;J7GQH?tn%VvWME)W0OFfW3=9AlO9C+f literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int64_float32/model.onnx b/onnx/backend/test/data/node/test_pow_types_int64_float32/model.onnx new file mode 100644 index 00000000000..3aa62ec7f97 --- /dev/null +++ b/onnx/backend/test/data/node/test_pow_types_int64_float32/model.onnx @@ -0,0 +1,16 @@ + backend-test:a + +x +yz"Powtest_pow_types_int64_float32Z +x + + +Z +y + + +b +z + + +B \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_0.pb b/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..d91963f15c890955c4049fa33858b8d64ca4c264 GIT binary patch literal 33 Ycmd;J7GQT`tniXxWPkuBD9sF|0V1^lMgRZ+ literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..a760307aaa9d741605ce33d06bfeb54d3669be86 GIT binary patch literal 21 acmd;J7GQK@tn}hxU}$h)U|0ae2OIz(Yy-~# literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_pow_types_int64_float32/test_data_set_0/output_0.pb new file mode 100644 index 0000000000000000000000000000000000000000..c77d2103a132ff11755f8c395343ece7c1ccd1ff GIT binary patch literal 33 acmd;J7GQT`tn!jzWPkt#D1DO&!T^@Bjd3gakJL delta 36 qcmZ3_wVrE(Fslr^5C<0%2Qv^eC2?~xRtd3jv2ZX7F*q>^@Bjd3paeJo diff --git 
a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_3d_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_3d_expanded/test_data_set_0/input_1.pb index cd3dd54a5b93f253a108e3cb7a976f79bb38e27b..87a0fc4fe9ba2edea50d17f0ee92267c58f8a54a 100644 GIT binary patch literal 35 fcmd;J=3o+Fb7HLYl3-?FU;tqzAZCHmK#BnXB&Gpa literal 59 hcmd;J=3o+FcVevcGGJza02s{#<+DI(7$3@I002oy0dW8T diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_expanded/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_expanded/model.onnx index e7787066fa6e634d83f18e332d7cf14c8d1ebbe0..19875ee4259d07ba683cd5f0fa51ddf8cfce359a 100644 GIT binary patch delta 32 ncmey%`ImFUQx*v}Ar>ws4(23oF2*V$HZB$pMj-|#CIKD*gN_A+ delta 32 ncmey%`ImFUQx*w!Ar>ws4(23oF2*V$HZB$pMj-|#CIKD*gQf+A diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_expanded/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight/model.onnx index dfe74a2fa911457af577a64845e77027b2af7b34..4d23b428c71965804fd36da45832eeccd8a917c5 100644 GIT binary patch delta 11 ScmX@Wcz|)jbVjy`Gc*7hzXSpR delta 11 ScmX@Wcz|)jbVl}xGc*7h!vq5W diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_expanded/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_expanded/model.onnx index b27ae19a56bd7aa0ec9086a19243e7864b06928e..b5f370b67e56b9c6d0c9bc0c05a2e390b8d30471 100644 GIT binary patch delta 13 Ucmey&^_gpf7b_#%WN%gt03+7~3;+NC delta 13 Ucmey&^_gpf7b_$CWN%gt03+N44FCWD diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_expanded/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/model.onnx new file mode 100644 index 0000000000000000000000000000000000000000..0817553caac6ad330b6e5b03e70077303df0c407 GIT binary patch literal 226 zcmYk0y$*sf6ot8fSg%A)G0~aP#L3N3N4L?uA`}k6}usEFLp&%lC9#U+J6Br2sJ{3P*G^Z)m1)J9@WP}mgPyPLcW``uA2b;EETfgEE5T@ z*H#DiW{H!6C!bhlQiN{KBhi6FcSg<1LBwKXVti0BbHgg0rBB8FeYZf*pfmIdP=Ypb ibVj$i9!3!f@u+K{0aXv62c!EGp`{+W`pr7n3;zccP&xGg literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_0.pb 
b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_0.pb new file mode 100644 index 00000000000..aeb52e93c10 --- /dev/null +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_0.pb @@ -0,0 +1 @@ +BxJ<  ?7?N?w} ?H>QY%?n >~J?e?^k?l?Z{= \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..29f42478fda2a3b734ddaa7ee030d968c21e231f GIT binary patch literal 21 Xcmd;J7GQH?tn}hx00I^uW(Hya67m5% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_2.pb new file mode 100644 index 00000000000..2186ff73f9e --- /dev/null +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/input_2.pb @@ -0,0 +1 @@ +BwJfff?333?L?fff?fff? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/output_0.pb new file mode 100644 index 00000000000..840b42e6bd1 --- /dev/null +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index/test_data_set_0/output_0.pb @@ -0,0 +1 @@ +BzJ? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/model.onnx new file mode 100644 index 0000000000000000000000000000000000000000..fc50e774ebf05403000c572c6e42e6b374474261 GIT binary patch literal 1598 zcmcgsJx{|h5UrE6q(=nEWq{RVCH??n0=gB3W!xRnk6Hx~@mcRJz=bG39m_ zua8BZ*(w*uGGB>A@r{th+;CYaT?sB#E?O?yGQlM0vqoh`YW2onl9u@x;F};FoPf@` zq_|0$j{&}jb3I7oT+gU2nU4W}-0MLanmBH`NzD2akvEx$n-zR`&Ogb%oqJKv``}rw znUBGas_QvL8Y-Oj!BQ8ztTc}5SQqd~;52kVwm>$N?AYzAC=w&r0{O>sA(nEkb#A?N zcn$r^HmJ2o7Favo6Mr~>=&zgJboRuf5C8Gu>A+h21wQY%?n >~J?e?^k?l?Z{= \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_1.pb new file mode 100644 index 0000000000000000000000000000000000000000..29f42478fda2a3b734ddaa7ee030d968c21e231f GIT binary patch literal 21 Xcmd;J7GQH?tn}hx00I^uW(Hya67m5% literal 0 HcmV?d00001 diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_2.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_2.pb new file mode 100644 index 00000000000..2186ff73f9e --- /dev/null +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/input_2.pb @@ -0,0 +1 @@ +BwJfff?333?L?fff?fff? 
\ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/output_0.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/output_0.pb new file mode 100644 index 00000000000..840b42e6bd1 --- /dev/null +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_mean_weight_ignore_index_expanded/test_data_set_0/output_0.pb @@ -0,0 +1 @@ +BzJ? \ No newline at end of file diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_none/model.onnx index 53f852258e1..dce630c51ac 100644 --- a/onnx/backend/test/data/node/test_softmax_cross_entropy_none/model.onnx +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_none/model.onnx @@ -9,7 +9,7 @@ Z y - + b z diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_none/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/model.onnx index 4504fd8a571..bb633ae1ffd 100644 --- a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/model.onnx +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/model.onnx @@ -23,7 +23,7 @@ QSoftmaxCrossEntropyLoss_test_softmax_cross_entropy_none_expanded_functionlog_pr Z y - + b z diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_expanded/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/model.onnx index e19f2be1fe9..1f3dd146cbb 100644 --- a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/model.onnx +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/model.onnx @@ -10,7 +10,7 @@ Z y - + Z w diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/model.onnx index ea22b63b577..7e7af80dcdc 100644 --- a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/model.onnx +++ b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/model.onnx @@ -25,7 +25,7 @@ YSoftmaxCrossEntropyLoss_test_softmax_cross_entropy_none_weights_expanded_functi Z y - + Z w 
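The *_ignore_index test data added above pins down the semantics implemented further below in this patch: with reduction = "mean", the loss is divided by the sum of the weights that were actually applied, and samples whose target equals ignore_index contribute neither loss nor weight. As a reading aid only (not part of the patch), here is a minimal standalone C++ sketch of that computation; the scores, targets, weights, and ignore_index value are made up for illustration, and the reference outputs in the .pb blobs come from the ONNX backend test generator, not from this sketch.

```
// Sketch only: weighted-mean SoftmaxCrossEntropyLoss with ignore_index.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical scores for N=3 samples over C=3 classes (made-up values).
  std::vector<std::vector<double>> scores = {
      {1.0, 2.0, 0.5}, {0.1, 0.2, 0.3}, {2.0, 0.5, 1.0}};
  std::vector<int> target = {1, 2, 0};
  std::vector<double> weight = {0.9, 0.7, 0.8};  // per-class weights
  const int ignore_index = 2;  // targets equal to this are skipped entirely

  double loss_sum = 0.0, weight_sum = 0.0;
  for (std::size_t n = 0; n < scores.size(); ++n) {
    if (target[n] == ignore_index) continue;  // ignored: weight treated as 0
    // log-softmax over the class axis
    double mx = *std::max_element(scores[n].begin(), scores[n].end());
    double denom = 0.0;
    for (double s : scores[n]) denom += std::exp(s - mx);
    double log_prob = (scores[n][target[n]] - mx) - std::log(denom);
    loss_sum += -log_prob * weight[target[n]];
    weight_sum += weight[target[n]];
  }
  // reduction == "mean": divide by the sum of the applied weights, not by N.
  // This mirrors the Div(loss_sum, weight_gather_sum) node built later on.
  std::printf("loss = %f\n", loss_sum / weight_sum);
  return 0;
}
```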
diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_none_weights_expanded/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_sum/model.onnx b/onnx/backend/test/data/node/test_softmax_cross_entropy_sum/model.onnx index e9f7ea879ee06bb073d5ebdd9e7a89b0c07bd2bc..a3b4bb690971fa54434ec12f11836cc9e7a9a317 100644 GIT binary patch delta 31 mcmZ3?xR`N5w*;FI3l|dya}qZfW0epa7YhfY5Q7tw01p6M7z9E9 delta 31 mcmZ3?xR`N5w*ws4(23oF2*V$HZB$pMj-|#CIKD*epdxo delta 32 ncmaFI`HpkLH5LhWAr>ws4(23oF2*V$HZB$pMj-|#CIKD*es2X> diff --git a/onnx/backend/test/data/node/test_softmax_cross_entropy_sum_expanded/test_data_set_0/input_1.pb b/onnx/backend/test/data/node/test_softmax_cross_entropy_sum_expanded/test_data_set_0/input_1.pb index 6251a1cdb8d5f655406beff874d0b8f495b4e18b..bc4806043c89396003882cc95bab3524afb90a85 100644 GIT binary patch literal 21 Zcmd;J7GQH?tn}hxWME)m0b*t#1^^QN0XzTz literal 33 Ycmd;J7GQT`tn`v#WPkt`D9sF|0V41LNdN!< diff --git a/onnx/common/constants.h b/onnx/common/constants.h index fc2a2212280..1725d6ca475 100644 --- a/onnx/common/constants.h +++ b/onnx/common/constants.h @@ -12,7 +12,7 @@ namespace ONNX_NAMESPACE { constexpr const char* AI_ONNX_ML_DOMAIN = "ai.onnx.ml"; constexpr const char* AI_ONNX_TRAINING_DOMAIN = "ai.onnx.training"; constexpr const char* ONNX_DOMAIN = ""; -constexpr bool OPTIONAL = false; +constexpr bool OPTIONAL_VALUE = false; // For dimension denotation. constexpr const char* DATA_BATCH = "DATA_BATCH"; diff --git a/onnx/cpp2py_export.cc b/onnx/cpp2py_export.cc index 26d52ab6f2e..fc5cb5b2575 100644 --- a/onnx/cpp2py_export.cc +++ b/onnx/cpp2py_export.cc @@ -270,7 +270,7 @@ PYBIND11_MODULE(onnx_cpp2py_export, onnx_cpp2py_export) { [](const py::bytes& bytes, const std::vector& names) { ModelProto proto{}; ParseProtoFromPyBytes(&proto, bytes); - auto const result = optimization::Optimize(std::move(proto), names); + auto const result = optimization::Optimize(proto, names); std::string out; result.SerializeToString(&out); return py::bytes(out); @@ -282,7 +282,7 @@ PYBIND11_MODULE(onnx_cpp2py_export, onnx_cpp2py_export) { ModelProto proto{}; ParseProtoFromPyBytes(&proto, bytes); auto const result = - optimization::OptimizeFixed(std::move(proto), names); + optimization::OptimizeFixed(proto, names); std::string out; result.SerializeToString(&out); return py::bytes(out); diff --git a/onnx/defs/generator/defs.cc b/onnx/defs/generator/defs.cc index 0c0fe5c5130..2b5cd7f2d1a 100644 --- a/onnx/defs/generator/defs.cc +++ b/onnx/defs/generator/defs.cc @@ -184,7 +184,7 @@ ONNX_OPERATOR_SET_SCHEMA( "(Optional) The value of the output elements." "Should be a one-element tensor. If not specified, it defaults to a tensor of value 0 and datatype float32", AttributeProto::TENSOR, - OPTIONAL) + OPTIONAL_VALUE) .Input( 0, "input", @@ -313,7 +313,7 @@ ONNX_OPERATOR_SET_SCHEMA( "the data type of the input tensor T1 is used. 
If input tensor T1 is also not" "specified, then type defaults to 'float'.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input( 0, "input", @@ -397,7 +397,7 @@ ONNX_OPERATOR_SET_SCHEMA( "seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dtype", "The data type for the elements of the output tensor. If not specified, default is TensorProto::FLOAT.", @@ -447,7 +447,7 @@ ONNX_OPERATOR_SET_SCHEMA( "seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dtype", "The data type for the elements of the output tensor. Default is TensorProto::FLOAT.", @@ -497,13 +497,13 @@ ONNX_OPERATOR_SET_SCHEMA( "seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dtype", "(Optional) The data type for the elements of the output tensor, if not specified, we will use " "the data type of the input tensor.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input( 0, "input", @@ -562,13 +562,13 @@ ONNX_OPERATOR_SET_SCHEMA( "seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dtype", "(Optional) The data type for the elements of the output tensor, if not specified, we will use " "the data type of the input tensor.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input( 0, "input", @@ -617,7 +617,7 @@ ONNX_OPERATOR_SET_SCHEMA( "seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dtype", "(Optional) The data type for the elements of the output tensor, if not specified, we will use int32.", diff --git a/onnx/defs/logical/defs.cc b/onnx/defs/logical/defs.cc index 4bb95d3d234..40a80de642c 100644 --- a/onnx/defs/logical/defs.cc +++ b/onnx/defs/logical/defs.cc @@ -7,39 +7,42 @@ namespace ONNX_NAMESPACE { inline void unaryLogicalOpInference(InferenceContext& ctx) { - // Type inference - updateOutputElemType(ctx, 0, TensorProto::BOOL); - // Shape inference - if (hasInputShape(ctx, 0)) { - propagateShapeFromInputToOutput(ctx, 0, 0); - } + // Type inference + updateOutputElemType(ctx, 0, TensorProto::BOOL); + // Shape inference + if (hasInputShape(ctx, 0)) { + propagateShapeFromInputToOutput(ctx, 0, 0); + } } std::function BinaryLogicDocGenerator(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( Returns the tensor resulted from performing the `{name}` logical operation elementwise on the input tensors `A` and `B` (with Numpy-style broadcasting support). 
{broadcast_doc} )DOC"; ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str()); - schema.SetDoc(doc); - schema.Input(0, "A", "First input operand for the logical operator.", "T"); - schema.Input(1, "B", "Second input operand for the logical operator.", "T"); - schema.Output(0, "C", "Result tensor.", "T1"); - schema.TypeAndShapeInferenceFunction([](InferenceContext& ctx) { - // Type inference - updateOutputElemType(ctx, 0, TensorProto::BOOL); - // Shape inference - if (hasNInputShapes(ctx, 2)) - bidirectionalBroadcastShapeInference( - ctx.getInputType(0)->tensor_type().shape(), - ctx.getInputType(1)->tensor_type().shape(), - *ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape()); - }); - }; + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); + schema.SetDoc(doc); + schema.Input(0, "A", "First input operand for the logical operator.", "T"); + schema.Input(1, "B", "Second input operand for the logical operator.", "T"); + schema.Output(0, "C", "Result tensor.", "T1"); + schema.TypeAndShapeInferenceFunction([](InferenceContext& ctx) { + // Type inference + updateOutputElemType(ctx, 0, TensorProto::BOOL); + // Shape inference + if (hasNInputShapes(ctx, 2)) + bidirectionalBroadcastShapeInference( + ctx.getInputType(0)->tensor_type().shape(), + ctx.getInputType(1)->tensor_type().shape(), + *ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape()); + }); + }; } ONNX_OPERATOR_SET_SCHEMA( @@ -172,7 +175,8 @@ ONNX_OPERATOR_SET_SCHEMA( BitShift, 11, OpSchema() - .SetDoc(std::string(BitShift_ver11_doc) + GenerateBroadcastingDocMul()) + .SetDoc(GET_OP_DOC_STR( + std::string(BitShift_ver11_doc) + GenerateBroadcastingDocMul())) .Input(0, "X", "First operand, input to be shifted.", "T") .Input(1, "Y", "Second operand, amounts of shift.", "T") .Output(0, "Z", "Output tensor", "T") @@ -213,12 +217,11 @@ ONNX_OPERATOR_SET_SCHEMA( {"tensor(bool)"}, "Constrains output to boolean tensor.") .TypeAndShapeInferenceFunction(InferenceFunction()) - .FunctionBody(FunctionBodyHelper::BuildNodes({ - // nodes: {outputs, op, inputs, attributes} - {{"O1"}, "Less", {"A", "B"}}, - {{"O2"}, "Equal", {"A", "B"}}, - {{"C"}, "Or", {"O1", "O2"}} - }))); + .FunctionBody(FunctionBodyHelper::BuildNodes( + {// nodes: {outputs, op, inputs, attributes} + {{"O1"}, "Less", {"A", "B"}}, + {{"O2"}, "Equal", {"A", "B"}}, + {{"C"}, "Or", {"O1", "O2"}}}))); ONNX_OPERATOR_SET_SCHEMA( GreaterOrEqual, @@ -234,11 +237,10 @@ ONNX_OPERATOR_SET_SCHEMA( {"tensor(bool)"}, "Constrains output to boolean tensor.") .TypeAndShapeInferenceFunction(InferenceFunction()) - .FunctionBody(FunctionBodyHelper::BuildNodes({ - // nodes: {outputs, op, inputs, attributes} - {{"O1"}, "Greater", {"A", "B"}}, - {{"O2"}, "Equal", {"A", "B"}}, - {{"C"}, "Or", {"O1", "O2"}} - }))); - -} // namespace ONNX_NAMESPACE + .FunctionBody(FunctionBodyHelper::BuildNodes( + {// nodes: {outputs, op, inputs, attributes} + {{"O1"}, "Greater", {"A", "B"}}, + {{"O2"}, "Equal", {"A", "B"}}, + {{"C"}, "Or", {"O1", "O2"}}}))); + +} // namespace ONNX_NAMESPACE \ No newline at end of file diff --git a/onnx/defs/logical/old.cc b/onnx/defs/logical/old.cc index 770ae6cbbc4..424a2f9c22f 100644 --- a/onnx/defs/logical/old.cc +++ b/onnx/defs/logical/old.cc @@ -16,7 +16,8 @@ inline void logicalOpInference_opset1(InferenceContext& ctx) { std::function BinaryLogicDocGenerator_opset1( const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( 
Returns the tensor resulted from performing the `{name}` logical operation elementwise on the input tensors `A` and `B`. @@ -24,7 +25,7 @@ If broadcasting is enabled, the right-hand-side argument will be broadcasted to match the shape of left-hand-side argument. See the doc of `Add` for a detailed description of the broadcasting rules. )DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc); schema.Attr( "broadcast", @@ -35,7 +36,7 @@ detailed description of the broadcasting rules. "axis", "If set, defines the broadcast dimensions.", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input(0, "A", "Left input tensor for the logical operator.", "T"); schema.Input(1, "B", "Right input tensor for the logical operator.", "T"); schema.Output(0, "C", "Result tensor.", "T1"); @@ -43,16 +44,20 @@ detailed description of the broadcasting rules. }; } -std::function BinaryLogicDocGenerator_opset7(const char* name) { +std::function BinaryLogicDocGenerator_opset7( + const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( Returns the tensor resulted from performing the `{name}` logical operation elementwise on the input tensors `A` and `B` (with Numpy-style broadcasting support). {broadcast_doc} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str()); + ReplaceAll(doc, "{name}", name); + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); schema.SetDoc(doc); schema.Input(0, "A", "First input operand for the logical operator.", "T"); schema.Input(1, "B", "Second input operand for the logical operator.", "T"); @@ -173,9 +178,7 @@ ONNX_OPERATOR_SET_SCHEMA( .FillUsing(BinaryLogicDocGenerator_opset7("greater")) .TypeConstraint( "T", - {"tensor(float16)", - "tensor(float)", - "tensor(double)"}, + {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrains input to float tensors.") .TypeConstraint( "T1", @@ -189,15 +192,11 @@ ONNX_OPERATOR_SET_SCHEMA( .FillUsing(BinaryLogicDocGenerator_opset7("less")) .TypeConstraint( "T", - {"tensor(float16)", - "tensor(float)", - "tensor(double)"}, + {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrains input to float tensors.") .TypeConstraint( "T1", {"tensor(bool)"}, "Constrains output to boolean tensor.")); - - } // namespace ONNX_NAMESPACE diff --git a/onnx/defs/math/defs.cc b/onnx/defs/math/defs.cc index 3d387a58592..98fb875664d 100644 --- a/onnx/defs/math/defs.cc +++ b/onnx/defs/math/defs.cc @@ -1,9 +1,9 @@ // Copyright (c) ONNX Project Contributors. // Licensed under the MIT license. +#include #include #include "onnx/defs/function.h" -#include #include "onnx/defs/schema.h" #include "onnx/defs/tensor_proto_util.h" @@ -11,13 +11,16 @@ namespace ONNX_NAMESPACE { std::function MathDocGenerator(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( Performs element-wise binary {name} (with Numpy-style broadcasting support). 
{broadcast_doc} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str()); + ReplaceAll(doc, "{name}", name); + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); schema.SetDoc(doc); schema.Input(0, "A", "First operand.", "T"); schema.Input(1, "B", "Second operand.", "T"); @@ -41,7 +44,8 @@ std::function SoftmaxFamilyDocGenerator( const char* name, const char* description) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The operator computes the {name} ({description}) values for each layer in the batch of the given input. @@ -57,8 +61,8 @@ Each of these dimensions must be matched correctly, or else the operator will throw errors. The output tensor has the same shape and contains the {name} values of the corresponding input. )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{description}", description); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{description}", description);); schema.SetDoc(doc); schema.Attr( "axis", @@ -87,7 +91,7 @@ and contains the {name} values of the corresponding input. schema.TypeAndShapeInferenceFunction([](InferenceContext& ctx) { // Type inference propagateElemTypeFromInputToOutput(ctx, 0, 0); - + // Shape inference starts if (!hasNInputShapes(ctx, 1)) { return; @@ -95,12 +99,17 @@ and contains the {name} values of the corresponding input. // Validate the value of 'axis' const TensorShapeProto& input_shape = - ctx.getInputType(0)->tensor_type().shape(); + ctx.getInputType(0)->tensor_type().shape(); int r = input_shape.dim_size(); int axis = static_cast(getAttribute(ctx, "axis", 1)); if (axis < -r || axis >= r) { - fail_shape_inference( - "'axis' must be in [", -r, " , " , (r-1) , "]. Its actual value is: ", axis); + fail_shape_inference( + "'axis' must be in [", + -r, + " , ", + (r - 1), + "]. Its actual value is: ", + axis); } // Shape inference @@ -419,25 +428,31 @@ ONNX_OPERATOR_SET_SCHEMA( Celu, 12, OpSchema() - .SetDoc(celu_ver12_doc) - .Input(0, "X", "Input tensor", "T") - .Output(0, "Y", "Output tensor", "T") - .Attr( - "alpha", - "The Alpha value in Celu formula which control the shape of " - "the unit. The default value is 1.0.", - AttributeProto::FLOAT, - celu_default_alpha) - .TypeConstraint( - "T", - {"tensor(float16)", "tensor(float)", "tensor(double)"}, - "Constrain input and output types to floating-point tensors.") - .FunctionBody(FunctionBodyHelper::BuildNodes( - {// nodes: {outputs, op, inputs, attributes} - FunctionBodyHelper::NodeDef{{"alpha"}, "Constant", {}, {MakeRefAttribute("value_float", "alpha", AttributeProto::FLOAT)}}, - {{"X_alpha"}, "Div", {"X", "alpha"}}, - {{"Elu_Result"}, "Elu", {"X_alpha"}, {{"alpha", 1.f}}}, - {{"Y"}, "Mul", {"alpha", "Elu_Result"}}}))); + .SetDoc(celu_ver12_doc) + .Input(0, "X", "Input tensor", "T") + .Output(0, "Y", "Output tensor", "T") + .Attr( + "alpha", + "The Alpha value in Celu formula which control the shape of " + "the unit. 
The default value is 1.0.", + AttributeProto::FLOAT, + celu_default_alpha) + .TypeConstraint( + "T", + {"tensor(float16)", "tensor(float)", "tensor(double)"}, + "Constrain input and output types to floating-point tensors.") + .FunctionBody(FunctionBodyHelper::BuildNodes( + {// nodes: {outputs, op, inputs, attributes} + FunctionBodyHelper::NodeDef{{"alpha"}, + "Constant", + {}, + {MakeRefAttribute( + "value_float", + "alpha", + AttributeProto::FLOAT)}}, + {{"X_alpha"}, "Div", {"X", "alpha"}}, + {{"Elu_Result"}, "Elu", {"X_alpha"}, {{"alpha", 1.f}}}, + {{"Y"}, "Mul", {"alpha", "Elu_Result"}}}))); static const char* Exp_ver6_doc = R"DOC( Calculates the exponential of the given input tensor, element-wise. @@ -505,7 +520,7 @@ ONNX_OPERATOR_SET_SCHEMA( "Constrain input and output types to float tensors.") .TypeAndShapeInferenceFunction(propagateShapeAndTypeFromFirstInput)); -static const char* Pow_ver7_doc = R"DOC( +static const char* Pow_ver12_doc = R"DOC( Pow takes input data (Tensor) and exponent Tensor, and produces one output data (Tensor) where the function `f(x) = x^exponent`, is applied to the data tensor elementwise. @@ -513,16 +528,35 @@ is applied to the data tensor elementwise. ONNX_OPERATOR_SET_SCHEMA( Pow, - 7, + 12, OpSchema() - .SetDoc(std::string(Pow_ver7_doc) + GenerateBroadcastingDocMul()) + .SetDoc(GET_OP_DOC_STR( + std::string(Pow_ver12_doc) + GenerateBroadcastingDocMul())) .Input(0, "X", "First operand, base of the exponent.", "T") - .Input(1, "Y", "Second operand, power of the exponent.", "T") + .Input(1, "Y", "Second operand, power of the exponent.", "T1") .Output(0, "Z", "Output tensor (same size as X)", "T") .TypeConstraint( "T", - {"tensor(float16)", "tensor(float)", "tensor(double)"}, - "Constrain input and output types to float tensors.") + {"tensor(int32)", + "tensor(int64)", + "tensor(float16)", + "tensor(float)", + "tensor(double)"}, + "Constrain input X and output types to float/int tensors.") + .TypeConstraint( + "T1", + {"tensor(uint8)", + "tensor(uint16)", + "tensor(uint32)", + "tensor(uint64)", + "tensor(int8)", + "tensor(int16)", + "tensor(int32)", + "tensor(int64)", + "tensor(float16)", + "tensor(float)", + "tensor(double)"}, + "Constrain input Y types to float/int tensors.") .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { propagateElemTypeFromInputToOutput(ctx, 0, 0); if (hasNInputShapes(ctx, 2)) @@ -542,9 +576,9 @@ ONNX_OPERATOR_SET_SCHEMA( PRelu, 9, OpSchema() - .SetDoc( - PRelu_ver9_doc + - GenerateBroadcastingDocUni("tensor slope", "input tensor X")) + .SetDoc(GET_OP_DOC_STR( + std::string(PRelu_ver9_doc) + + GenerateBroadcastingDocUni("tensor slope", "input tensor X"))) .Input(0, "X", "Input tensor", "T") .Input( 1, @@ -605,17 +639,21 @@ ONNX_OPERATOR_SET_SCHEMA( "Constrain input and output types to float tensors.") .TypeAndShapeInferenceFunction(propagateShapeAndTypeFromFirstInput)); -// Generate opschema for element-wise ops. Leaves type constraint "T" unspecified. +// Generate opschema for element-wise ops. Leaves type constraint "T" +// unspecified. std::function ElementwiseMultiOpDocGenerator( const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( Element-wise {name} of each of the input tensors (with Numpy-style broadcasting support). All inputs and outputs must have the same data type. 
{broadcast_doc} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str()); + ReplaceAll(doc, "{name}", name); + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); schema.SetDoc(doc); schema.Input( 0, @@ -695,11 +733,7 @@ ONNX_OPERATOR_SET_SCHEMA( 12, OpSchema() .SetDoc(Clip_ver12_doc) - .Input( - 0, - "input", - "Input tensor whose elements to be clipped", - "T") + .Input(0, "input", "Input tensor whose elements to be clipped", "T") .Input( 1, "min", @@ -797,11 +831,10 @@ ONNX_OPERATOR_SET_SCHEMA( Gemm, 11, OpSchema() - .SetDoc( - Gemm_ver11_doc + - GenerateBroadcastingDocUni("tensor C", "tensor A * B") + - "\n" + - GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(Gemm_ver11_doc) + + GenerateBroadcastingDocUni("tensor C", "tensor A * B") + "\n" + + GenerateOptionalArgumentsDoc())) .Input( 0, "A", @@ -1601,40 +1634,49 @@ output = [5, 3, 0] )DOC"; ONNX_OPERATOR_SET_SCHEMA( - CumSum, - 11, + CumSum, + 11, OpSchema() - .SetDoc(CumSum_ver11_doc) - .Attr( - "exclusive", - "If set to 1 will return exclusive sum in which the top element is not included." - " In other terms, if set to 1, the j-th output element would be the sum of the first (j-1) elements." - " Otherwise, it would be the sum of the first j elements.", - AttributeProto::INT, - static_cast(0)) - .Attr( - "reverse", - "If set to 1 will perform the sums in reverse direction.", - AttributeProto::INT, - static_cast(0)) - .Input(0, "x", "An input tensor that is to be processed.", "T") - .Input(1, "axis", "(Optional) A 0-D tensor. Must be in the range [-rank(x), rank(x)-1]. " - "Negative value means counting dimensions from the back.", "T2") - .Output(0, "y", - "Output tensor of the same type as 'x' with cumulative sums of the x's elements", - "T") - .TypeConstraint("T", { - "tensor(uint32)", - "tensor(uint64)", - "tensor(int32)", - "tensor(int64)", - "tensor(float)", - "tensor(double)"}, "Input can be of any tensor type.") - .TypeConstraint("T2", { - "tensor(int32)", - "tensor(int64)"}, "axis tensor can be int32 or int64 only") - .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput) - ); + .SetDoc(CumSum_ver11_doc) + .Attr( + "exclusive", + "If set to 1 will return exclusive sum in which the top element is not included." + " In other terms, if set to 1, the j-th output element would be the sum of the first (j-1) elements." + " Otherwise, it would be the sum of the first j elements.", + AttributeProto::INT, + static_cast(0)) + .Attr( + "reverse", + "If set to 1 will perform the sums in reverse direction.", + AttributeProto::INT, + static_cast(0)) + .Input(0, "x", "An input tensor that is to be processed.", "T") + .Input( + 1, + "axis", + "(Optional) A 0-D tensor. Must be in the range [-rank(x), rank(x)-1]. 
" + "Negative value means counting dimensions from the back.", + "T2") + .Output( + 0, + "y", + "Output tensor of the same type as 'x' with cumulative sums of the x's elements", + "T") + .TypeConstraint( + "T", + {"tensor(uint32)", + "tensor(uint64)", + "tensor(int32)", + "tensor(int64)", + "tensor(float)", + "tensor(double)"}, + "Input can be of any tensor type.") + .TypeConstraint( + "T2", + {"tensor(int32)", "tensor(int64)"}, + "axis tensor can be int32 or int64 only") + .TypeAndShapeInferenceFunction( + ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput)); static const char* Round_ver11_doc = R"DOC( Round takes one input Tensor and rounds the values, element-wise, meaning @@ -1702,8 +1744,7 @@ ONNX_OPERATOR_SET_SCHEMA( const auto mat_w = input_shape.dim(rank - 1); const auto mat_h = input_shape.dim(rank - 2); - if (mat_w.has_dim_value() && - mat_h.has_dim_value() && + if (mat_w.has_dim_value() && mat_h.has_dim_value() && (mat_w.dim_value() != mat_h.dim_value())) { fail_shape_inference( "The inner-most 2 dimensions must have the same size (mat_w:", @@ -1713,7 +1754,7 @@ ONNX_OPERATOR_SET_SCHEMA( ")."); } - for (int i=0; i < rank - 2; ++i) { + for (int i = 0; i < rank - 2; ++i) { auto* dim = output_shape->add_dim(); *dim = input_shape.dim(i); } @@ -1810,45 +1851,156 @@ TensorProto ToDimensionOneTensor(int32_t value) { return t; } -bool BuildContextDependentFunctionBody(const FunctionBodyBuildContext& ctx, const OpSchema& schema, FunctionProto& functionProto) { +TensorProto ToDimensionOneFloatTensor(float value) { + auto t = ToTensor(std::vector({value})); + t.add_dims(1); + return t; +} + +bool BuildContextDependentFunctionBody( + const FunctionBodyBuildContext& ctx, + const OpSchema& schema, + FunctionProto& functionProto) { std::vector body; - body.push_back({{"expanded_target"}, "Unsqueeze", {"target"}, {MakeAttribute("axes", std::vector({1}))}}); - body.push_back({{"input_gather_element"}, "GatherElements", {"input", "expanded_target"}, {MakeAttribute("axis", (int64_t)1)}}); + body.push_back({{"expanded_target"}, + "Unsqueeze", + {"target"}, + {MakeAttribute("axes", std::vector({1}))}}); + body.push_back({{"input_gather_element"}, + "GatherElements", + {"input", "expanded_target"}, + {MakeAttribute("axis", (int64_t)1)}}); body.push_back({{"loss_NCdd"}, "Neg", {"input_gather_element"}}); - body.push_back({{"const_zero"}, "Constant", {}, {MakeAttribute("value", ToDimensionOneTensor(0))}}); - body.push_back({{"const_one"}, "Constant", {}, {MakeAttribute("value", ToDimensionOneTensor(1))}}); - body.push_back({{"loss_N1dd"}, "Slice", {"loss_NCdd", "const_zero", "const_one", "const_one"}}); - - if (!ctx.hasInput(2)) { - if (ctx.getAttribute("reduction")->s() == "none") { - body.push_back({{"loss"}, "Squeeze", {"loss_N1dd"}, {MakeAttribute("axes", std::vector({1}))}}); + body.push_back({{"const_zero"}, + "Constant", + {}, + {MakeAttribute("value", ToDimensionOneTensor(0))}}); + body.push_back({{"const_one"}, + "Constant", + {}, + {MakeAttribute("value", ToDimensionOneTensor(1))}}); + body.push_back({{"loss_N1dd"}, + "Slice", + {"loss_NCdd", "const_zero", "const_one", "const_one"}}); + + if (ctx.getAttribute("ignore_index") == nullptr) { + if (!ctx.hasInput(2)) { + if (ctx.getAttribute("reduction")->s() == "none") { + body.push_back({{"loss"}, + "Squeeze", + {"loss_N1dd"}, + {MakeAttribute("axes", std::vector({1}))}}); + } else { + body.push_back({{"loss_Ndd"}, + "Squeeze", + {"loss_N1dd"}, + {MakeAttribute("axes", std::vector({1}))}}); + if (ctx.getAttribute("reduction")->s() == "mean") { 
+ body.push_back({{"loss"}, + "ReduceMean", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + } else { + body.push_back({{"loss"}, + "ReduceSum", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + } + } } else { - body.push_back({{"loss_Ndd"}, "Squeeze", {"loss_N1dd"}, {MakeAttribute("axes", std::vector({1}))}}); - if(ctx.getAttribute("reduction")->s() == "mean") { - body.push_back({{"loss"}, "ReduceMean", {"loss_Ndd"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"weight_gather"}, "Gather", {"weight", "target"}}); + body.push_back({{"loss_unweighted"}, + "Squeeze", + {"loss_N1dd"}, + {MakeAttribute("axes", std::vector({1}))}}); + if (ctx.getAttribute("reduction")->s() == "none") { + body.push_back({{"loss"}, "Mul", {"loss_unweighted", "weight_gather"}}); } else { - body.push_back({{"loss"}, "ReduceSum", {"loss_Ndd"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back( + {{"loss_Ndd"}, "Mul", {"loss_unweighted", "weight_gather"}}); + if (ctx.getAttribute("reduction")->s() == "mean") { + body.push_back({{"loss_sum"}, + "ReduceSum", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"weight_gather_sum"}, + "ReduceSum", + {"weight_gather"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"loss"}, "Div", {"loss_sum", "weight_gather_sum"}}); + } else { + body.push_back({{"loss"}, + "ReduceSum", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + } } - } + } } else { - body.push_back({{"weight_gather"}, "Gather", {"weight", "target"}}); - body.push_back({{"loss_unweighted"}, "Squeeze", {"loss_N1dd"}, {MakeAttribute("axes", std::vector({1}))}}); + body.push_back( + {{"const_ignore_index"}, + "Constant", + {}, + {MakeAttribute( + "value", + ToDimensionOneTensor(ctx.getAttribute("ignore_index")->i()))}}); + body.push_back({{"const_zero_float"}, + "Constant", + {}, + {MakeAttribute("value", ToDimensionOneFloatTensor(0.0f))}}); + if (!ctx.hasInput(2)) { + body.push_back({{"input_shape"}, "Shape", {"input"}}); + body.push_back({{"input_class"}, + "Slice", + {"input_shape", "const_one", "const_one"}}); + body.push_back({{"const_weights_ones"}, + "ConstantOfShape", + {"input_class"}, + {MakeAttribute("value", ToDimensionOneFloatTensor(1))}}); + body.push_back( + {{"weights_default"}, + "ScatterElements", + {"const_weights_ones", "const_ignore_index", "const_zero_float"}}); + body.push_back( + {{"weight_gather"}, "Gather", {"weights_default", "target"}}); + } else { + body.push_back({{"weights_default"}, + "ScatterElements", + {"weight", "const_ignore_index", "const_zero_float"}}); + body.push_back( + {{"weight_gather"}, "Gather", {"weights_default", "target"}}); + } + + body.push_back({{"loss_unweighted"}, + "Squeeze", + {"loss_N1dd"}, + {MakeAttribute("axes", std::vector({1}))}}); if (ctx.getAttribute("reduction")->s() == "none") { body.push_back({{"loss"}, "Mul", {"loss_unweighted", "weight_gather"}}); } else { - body.push_back({{"loss_Ndd"}, "Mul", {"loss_unweighted", "weight_gather"}}); - if(ctx.getAttribute("reduction")->s() == "mean") { - body.push_back({{"loss_sum"}, "ReduceSum", {"loss_Ndd"}, {MakeAttribute("keepdims", (int64_t)0)}}); - body.push_back({{"weight_gather_sum"}, "ReduceSum", {"weight_gather"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back( + {{"loss_Ndd"}, "Mul", {"loss_unweighted", "weight_gather"}}); + if (ctx.getAttribute("reduction")->s() == "mean") { + body.push_back({{"loss_sum"}, + "ReduceSum", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); + 
body.push_back({{"weight_gather_sum"}, + "ReduceSum", + {"weight_gather"}, + {MakeAttribute("keepdims", (int64_t)0)}}); body.push_back({{"loss"}, "Div", {"loss_sum", "weight_gather_sum"}}); } else { - body.push_back({{"loss"}, "ReduceSum", {"loss_Ndd"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"loss"}, + "ReduceSum", + {"loss_Ndd"}, + {MakeAttribute("keepdims", (int64_t)0)}}); } } } auto func_nodes = FunctionBodyHelper::BuildNodes(body); - for (const auto node : func_nodes) { + for (const auto& node : func_nodes) { auto new_node = functionProto.add_node(); new_node->CopyFrom(node); } @@ -1888,6 +2040,12 @@ ONNX_OPERATOR_SET_SCHEMA( "'mean': the sum of the output will be divided by the sum of applied weights.", AttributeProto::STRING, std::string("mean")) + .Attr( + "ignore_index", + "Specifies a target value that is ignored and does not contribute to the input gradient. " + "It is an optional value and valid values are [0, C).", + AttributeProto::INT, + false) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -1896,73 +2054,87 @@ ONNX_OPERATOR_SET_SCHEMA( "Tind", {"tensor(int32)", "tensor(int64)"}, "Constrain target to integer types") - .SetContextDependentFunctionBodyBuilder(BuildContextDependentFunctionBody) + .SetContextDependentFunctionBodyBuilder( + BuildContextDependentFunctionBody) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { - // Type inference - propagateElemTypeFromInputToOutput(ctx, 0, 0); + // Type inference + propagateElemTypeFromInputToOutput(ctx, 0, 0); - // Shape inference - if (hasInputShape(ctx, 0) && hasInputShape(ctx, 1)) { - const TensorShapeProto& input_shape = ctx.getInputType(0)->tensor_type().shape(); - const TensorShapeProto& target_shape = ctx.getInputType(1)->tensor_type().shape(); - - const int input_rank = static_cast(input_shape.dim_size()); - const int target_rank = static_cast(target_shape.dim_size()); - - if (input_rank < 2) { - fail_shape_inference("Input rank must be >= 2.") - } - if (target_rank != input_rank - 1) { - fail_shape_inference("Target rank must be 1 less than the input rank.") - } - - // match input dimensions (N, C, d1, ..., dk) with target dimensions of (C, - // d1, ..., dk) - for (int dim = 0; dim < target_rank; dim++) { - const auto input_dim = dim == 0 ? 
input_shape.dim(dim) : input_shape.dim(dim + 1); - const auto target_dim = target_shape.dim(dim); - if (input_dim.has_dim_value() && target_dim.has_dim_value() && input_dim.dim_value() != target_dim.dim_value()) - fail_shape_inference("Input and target dimension value mismatch.") - } - - if (ctx.getNumInputs() == 3) { - const TensorShapeProto& weight_shape = ctx.getInputType(2)->tensor_type().shape(); - if (weight_shape.dim_size() != 1) - fail_shape_inference("Weight rank must be 1.") + // Shape inference + if (hasInputShape(ctx, 0) && hasInputShape(ctx, 1)) { + const TensorShapeProto& input_shape = + ctx.getInputType(0)->tensor_type().shape(); + const TensorShapeProto& target_shape = + ctx.getInputType(1)->tensor_type().shape(); + + const int input_rank = static_cast(input_shape.dim_size()); + const int target_rank = static_cast(target_shape.dim_size()); + + if (input_rank < 2) { + fail_shape_inference("Input rank must be >= 2.") + } + if (target_rank != input_rank - 1) { + fail_shape_inference( + "Target rank must be 1 less than the input rank.") + } + + // match input dimensions (N, C, d1, ..., dk) with target + // dimensions of (C, d1, ..., dk) + for (int dim = 0; dim < target_rank; dim++) { + const auto input_dim = + dim == 0 ? input_shape.dim(dim) : input_shape.dim(dim + 1); + const auto target_dim = target_shape.dim(dim); + if (input_dim.has_dim_value() && target_dim.has_dim_value() && + input_dim.dim_value() != target_dim.dim_value()) + fail_shape_inference( + "Input and target dimension value mismatch.") + } + + if (ctx.getNumInputs() == 3) { + const TensorShapeProto& weight_shape = + ctx.getInputType(2)->tensor_type().shape(); + if (weight_shape.dim_size() != 1) + fail_shape_inference("Weight rank must be 1.") const auto weight_dim = weight_shape.dim(0); - const auto input_dim_1 = input_shape.dim(1); - if (input_dim_1.has_dim_value() && weight_dim.has_dim_value() && weight_dim.dim_value() != input_dim_1.dim_value()) - fail_shape_inference("Input and weight dimension value mismatch.") - } - - TensorShapeProto* output_shape = ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape(); - - if (ctx.getAttribute("reduction")->s() == "none") { - // output tensor is of shape (N, d1, d2, ..., dk) if reduction attribute - // is "none". - for (int i = 0; i < input_rank - 1; i++) { - auto* dim = output_shape->add_dim(); - if (i == 0) - *dim = input_shape.dim(i); - else - *dim = input_shape.dim(i + 1); - } - } - // otherwise output is a scalar. - }})); + const auto input_dim_1 = input_shape.dim(1); + if (input_dim_1.has_dim_value() && weight_dim.has_dim_value() && + weight_dim.dim_value() != input_dim_1.dim_value()) + fail_shape_inference( + "Input and weight dimension value mismatch.") + } -void einsumRankInference( - ONNX_NAMESPACE::InferenceContext& ctx, std::string equation) { + TensorShapeProto* output_shape = + ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape(); + if (ctx.getAttribute("reduction")->s() == "none") { + // output tensor is of shape (N, d1, d2, ..., dk) if + // reduction attribute is "none". + for (int i = 0; i < input_rank - 1; i++) { + auto* dim = output_shape->add_dim(); + if (i == 0) + *dim = input_shape.dim(i); + else + *dim = input_shape.dim(i + 1); + } + } + // otherwise output is a scalar. 
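+ // (No dims are added to output_shape in that case; an empty shape
+ // proto is rank-0, i.e. a scalar.)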
+ } + })); + +void einsumRankInference( + ONNX_NAMESPACE::InferenceContext& ctx, + std::string equation) { const size_t numInputs = ctx.getNumInputs(); if (numInputs < 1 || !hasNInputShapes(ctx, static_cast(numInputs))) { return; } auto* output_shape = getOutputShape(ctx, 0); - std::string left_equation; + std::string left_equation; - equation.erase(std::remove(equation.begin(), equation.end(), ' '), equation.end()); // Remove space char + equation.erase( + std::remove(equation.begin(), equation.end(), ' '), + equation.end()); // Remove space char auto mid_index = equation.find("->"); if (mid_index != std::string::npos) { // Separate right and left hand sides of the equation @@ -1979,17 +2151,21 @@ void einsumRankInference( // Parse the left-hand side std::stringstream str(left_equation); - while(std::getline(str, term, ',')) { + while (std::getline(str, term, ',')) { auto ellipsis_index = term.find("..."); if (ellipsis_index != std::string::npos) { if (numInputs <= num_operands) { - fail_shape_inference("Number of input tensors does not match the operands in the equation."); + fail_shape_inference( + "Number of input tensors does not match the operands in the equation."); } - // If there is an ellipsis, the number of dimensions it represents must be total dim - letter dimensions - size_t rank = ctx.getInputType(num_operands)->tensor_type().shape().dim_size(); + // If there is an ellipsis, the number of dimensions it represents + // must be total dim - letter dimensions + size_t rank = + ctx.getInputType(num_operands)->tensor_type().shape().dim_size(); if (num_ellipsis == 0) { num_ellipsis_indices = rank - term.size() + 3; - } else { // ellipsis has been seen before. Check that if dimensions are compatible + } else { // ellipsis has been seen before. 
Check that if dimensions + // are compatible if (num_ellipsis_indices != rank - term.size() + 3) { fail_shape_inference("Ellipsis represents incompatible dimensions."); } @@ -2000,7 +2176,8 @@ void einsumRankInference( } if (numInputs != num_operands) { - fail_shape_inference("Number of input tensors does not match the operands in the equation."); + fail_shape_inference( + "Number of input tensors does not match the operands in the equation."); } const size_t number_of_letters = 26; @@ -2009,12 +2186,14 @@ void einsumRankInference( if (mid_index != std::string::npos) { std::string right_equation = equation.substr(mid_index + 2); auto right_ellipsis_index = right_equation.find("..."); - if (right_ellipsis_index != std::string::npos) { // Right-hand side contains ellipsis + if (right_ellipsis_index != + std::string::npos) { // Right-hand side contains ellipsis for (size_t i = 0; i < num_ellipsis; ++i) { output_shape->add_dim(); } } - for (char c: right_equation) { // Add a dimension per each character in right hand equation + for (char c : right_equation) { // Add a dimension per each character + // in right hand equation if (c != '.') { output_shape->add_dim(); } @@ -2024,7 +2203,8 @@ void einsumRankInference( for (size_t i = 0; i < num_ellipsis_indices; i++) { output_shape->add_dim(); } - for (size_t i = 0; i < left_equation.size(); i++) { // Count chars that appear exactly once on left hand side + for (size_t i = 0; i < left_equation.size(); + i++) { // Count chars that appear exactly once on left hand side if ((left_equation.at(i) != ',') && (left_equation.at(i) != '.')) { num_letter_occurrences[left_equation.at(i) - 'a']++; } @@ -2067,15 +2247,8 @@ ONNX_OPERATOR_SET_SCHEMA( 12, OpSchema() .SetDoc(Einsum_ver12_doc) - .Attr( - "equation", - "Einsum expression string.", - AttributeProto::STRING) - .Input(0, - "Inputs", - "Operands", - "T", - OpSchema::Variadic) + .Attr("equation", "Einsum expression string.", AttributeProto::STRING) + .Input(0, "Inputs", "Operands", "T", OpSchema::Variadic) .Output(0, "Output", "Output tensor", "T") .TypeConstraint( "T", @@ -2088,7 +2261,7 @@ ONNX_OPERATOR_SET_SCHEMA( if (equation.compare("") == 0) { return; } - einsumRankInference(ctx, equation); + einsumRankInference(ctx, equation); })); static const char* Inverse_ver12_doc = R"DOC( @@ -2125,8 +2298,7 @@ ONNX_OPERATOR_SET_SCHEMA( const auto mat_w = input_shape.dim(rank - 1); const auto mat_h = input_shape.dim(rank - 2); - if (mat_w.has_dim_value() && - mat_h.has_dim_value() && + if (mat_w.has_dim_value() && mat_h.has_dim_value() && (mat_w.dim_value() != mat_h.dim_value())) { fail_shape_inference( "The inner-most 2 dimensions must have the same size (mat_w:", @@ -2168,7 +2340,10 @@ L = ReduceSum(L), if reduction = 'sum'; .)DOC"; -bool BuildContextDependentFunctionBodyMSD(const FunctionBodyBuildContext& ctx, const OpSchema& schema, FunctionProto& functionProto) { +bool BuildContextDependentFunctionBodyMSD( + const FunctionBodyBuildContext& ctx, + const OpSchema& schema, + FunctionProto& functionProto) { std::vector body; body.push_back(FunctionBodyHelper::Const("Q_Pow", 2)); body.push_back({{"X_Sub"}, "Sub", {"scores", "labels"}}); @@ -2179,9 +2354,15 @@ bool BuildContextDependentFunctionBodyMSD(const FunctionBodyBuildContext& ctx, c } else { body.push_back({{"X_Pow"}, "Pow", {"X_Sub", "Q_Pow"}}); if (ctx.getAttribute("reduction")->s() == "sum") { - body.push_back({{"output"}, "ReduceSum", {"X_Pow"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"output"}, + "ReduceSum", + {"X_Pow"}, + 
{MakeAttribute("keepdims", (int64_t)0)}}); } else { - body.push_back({{"output"}, "ReduceMean", {"X_Pow"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"output"}, + "ReduceMean", + {"X_Pow"}, + {MakeAttribute("keepdims", (int64_t)0)}}); } } } else { @@ -2191,15 +2372,21 @@ bool BuildContextDependentFunctionBodyMSD(const FunctionBodyBuildContext& ctx, c } else { body.push_back({{"X_Mul"}, "Mul", {"weights", "X_Pow"}}); if (ctx.getAttribute("reduction")->s() == "sum") { - body.push_back({{"output"}, "ReduceSum", {"X_Mul"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"output"}, + "ReduceSum", + {"X_Mul"}, + {MakeAttribute("keepdims", (int64_t)0)}}); } else { - body.push_back({{"output"}, "ReduceMean", {"X_Mul"}, {MakeAttribute("keepdims", (int64_t)0)}}); + body.push_back({{"output"}, + "ReduceMean", + {"X_Mul"}, + {MakeAttribute("keepdims", (int64_t)0)}}); } } } auto func_nodes = FunctionBodyHelper::BuildNodes(body); - for (const auto node : func_nodes) { + for (const auto& node : func_nodes) { auto new_node = functionProto.add_node(); new_node->CopyFrom(node); } @@ -2241,18 +2428,18 @@ ONNX_OPERATOR_SET_SCHEMA( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrain input and output types to float tensors.") - .SetContextDependentFunctionBodyBuilder(BuildContextDependentFunctionBodyMSD) + .SetContextDependentFunctionBodyBuilder( + BuildContextDependentFunctionBodyMSD) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { - propagateElemTypeFromInputToOutput(ctx, 0, 0); - std::string reduction = getAttribute(ctx, "reduction", "mean"); - if (reduction.compare("none") == 0) { - if (hasInputShape(ctx, 0)) { - propagateShapeFromInputToOutput(ctx, 0, 0); - } - } else { - updateOutputShape(ctx, 0, TensorShapeProto()); + propagateElemTypeFromInputToOutput(ctx, 0, 0); + std::string reduction = getAttribute(ctx, "reduction", "mean"); + if (reduction.compare("none") == 0) { + if (hasInputShape(ctx, 0)) { + propagateShapeFromInputToOutput(ctx, 0, 0); } - + } else { + updateOutputShape(ctx, 0, TensorShapeProto()); + } })); const char* reduction_doc_sce = @@ -2293,7 +2480,10 @@ If reduction = 'mean', the output is scalar: ReduceMean(L), or if weight is prov where tensor W is of shape (N, D1, D2, ..., Dk) and W[n][d1][d2]...[dk] = weights[labels[i][d1][d2]...[dk]]. 
)DOC"; -bool BuildContextDependentFunctionBodySCE(const FunctionBodyBuildContext& ctx, const OpSchema& schema, FunctionProto& functionProto) { +bool BuildContextDependentFunctionBodySCE( + const FunctionBodyBuildContext& ctx, + const OpSchema& schema, + FunctionProto& functionProto) { std::vector body; body.push_back({{"X_Max"}, "Max", {"scores"}}); body.push_back({{"X_Sub"}, "Sub", {"scores", "X_Max"}}); @@ -2301,16 +2491,36 @@ bool BuildContextDependentFunctionBodySCE(const FunctionBodyBuildContext& ctx, c body.push_back({{"X_RS"}, "ReduceSum", {"X_Exp"}}); body.push_back({{"X_Div"}, "Div", {"X_Exp", "X_RS"}}); body.push_back({{"log_prob"}, "Log", {"X_Div"}}); - if (!ctx.hasInput(2)) { - body.push_back({ {"output"}, "NegativeLogLikelihoodLoss", {"log_prob", "labels"}, - {MakeRefAttribute("reduction", AttributeProto::STRING)}}); + if (ctx.getAttribute("ignore_index") == nullptr) { + if (!ctx.hasInput(2)) { + body.push_back({{"output"}, + "NegativeLogLikelihoodLoss", + {"log_prob", "labels"}, + {MakeRefAttribute("reduction", AttributeProto::STRING)}}); + } else { + body.push_back({{"output"}, + "NegativeLogLikelihoodLoss", + {"log_prob", "labels", "weights"}, + {MakeRefAttribute("reduction", AttributeProto::STRING)}}); + } } else { - body.push_back({{"output"}, "NegativeLogLikelihoodLoss", {"log_prob", "labels", "weights"}, - {MakeRefAttribute("reduction", AttributeProto::STRING)}}); + if (!ctx.hasInput(2)) { + body.push_back({{"output"}, + "NegativeLogLikelihoodLoss", + {"log_prob", "labels"}, + {MakeRefAttribute("reduction", AttributeProto::STRING), + MakeRefAttribute("ignore_index", AttributeProto::INT)}}); + } else { + body.push_back({{"output"}, + "NegativeLogLikelihoodLoss", + {"log_prob", "labels", "weights"}, + {MakeRefAttribute("reduction", AttributeProto::STRING), + MakeRefAttribute("ignore_index", AttributeProto::INT)}}); + } } auto func_nodes = FunctionBodyHelper::BuildNodes(body); - for (const auto node : func_nodes) { + for (const auto& node : func_nodes) { auto new_node = functionProto.add_node(); new_node->CopyFrom(node); } @@ -2329,6 +2539,12 @@ ONNX_OPERATOR_SET_SCHEMA( reduction_doc_sce, AttributeProto::STRING, std::string("mean")) + .Attr( + "ignore_index", + "Specifies a target value that is ignored and does not contribute to the input gradient. " + "It is an optional value and valid values are [0, C).", + AttributeProto::INT, + false) .Input( 0, "scores", @@ -2340,8 +2556,8 @@ ONNX_OPERATOR_SET_SCHEMA( "labels", "The ground truth output tensor, with shape [batch_size], or " "[batch_size, D1, D2, ..., Dk], where K is the number of dimensions.", - "T") - .Input( + "Tind") + .Input( 2, "weights", "A manual rescaling weight given to each class. If given, it has to " @@ -2356,27 +2572,31 @@ ONNX_OPERATOR_SET_SCHEMA( "shape of [batch_size], or [batch_size, D1, D2, ..., Dk] in case of " "K-dimensional loss. Otherwise, it is a scalar.", "T") - .Output( - 1, - "log_prob", - "Log probability tensor. If the output of softmax is prob, its value is log(prob).", - "T", - OpSchema::Optional) + .Output( + 1, + "log_prob", + "Log probability tensor. 
If the output of softmax is prob, its value is log(prob).", + "T", + OpSchema::Optional) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrain input and output types to float tensors.") - .SetContextDependentFunctionBodyBuilder(BuildContextDependentFunctionBodySCE) + .TypeConstraint( + "Tind", + {"tensor(int32)", "tensor(int64)"}, + "Constrain target to integer types") + .SetContextDependentFunctionBodyBuilder( + BuildContextDependentFunctionBodySCE) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { - propagateElemTypeFromInputToOutput(ctx, 0, 0); - std::string reduction = getAttribute(ctx, "reduction", "mean"); - if (reduction.compare("none") == 0) { - if (hasInputShape(ctx, 1)) { - propagateShapeFromInputToOutput(ctx, 1, 0); - } - } else { - updateOutputShape(ctx, 0, TensorShapeProto()); + propagateElemTypeFromInputToOutput(ctx, 0, 0); + std::string reduction = getAttribute(ctx, "reduction", "mean"); + if (reduction.compare("none") == 0) { + if (hasInputShape(ctx, 1)) { + propagateShapeFromInputToOutput(ctx, 1, 0); } - + } else { + updateOutputShape(ctx, 0, TensorShapeProto()); + } })); } // namespace ONNX_NAMESPACE diff --git a/onnx/defs/math/old.cc b/onnx/defs/math/old.cc index 7244c03af89..3e8ebf71a6d 100644 --- a/onnx/defs/math/old.cc +++ b/onnx/defs/math/old.cc @@ -11,7 +11,8 @@ std::function SoftmaxFamilyDocGenerator_opset1( const char* name, const char* description) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The operator computes the {name} ({description}) values for each layer in the batch of the given input. The input is a 2-D tensor (Tensor) of size (batch_size x input_feature_dimensions). The output tensor has the same shape @@ -28,8 +29,8 @@ In this situation, we must have a_0 = N and a_1 * ... * a_{n-1} = D. Each of these dimensions must be matched correctly, or else the operator will throw errors. )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{description}", description); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{description}", description);); schema.SetDoc(doc); schema.Attr( "axis", @@ -100,11 +101,12 @@ Attribute `broadcast=1` needs to be passed to enable broadcasting. std::function MathDocGenerator_old(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Performs element-wise binary {name} (with limited broadcast support). {broadcast_doc})DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", kBroadcastDoc_old); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{broadcast_doc}", kBroadcastDoc_old);); schema.SetDoc(doc); schema.Attr( "broadcast", @@ -119,12 +121,12 @@ Performs element-wise binary {name} (with limited broadcast support). "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "axis", "If set, defines the broadcast dimensions. See doc for details.", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input( 0, "A", @@ -146,11 +148,12 @@ Performs element-wise binary {name} (with limited broadcast support). std::function MathDocGenerator_old_opset6(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Performs element-wise binary {name} (with limited broadcast support). 
{broadcast_doc})DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", kBroadcastDoc_old); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{broadcast_doc}", kBroadcastDoc_old);); schema.SetDoc(doc); schema.Attr( "broadcast", @@ -161,7 +164,7 @@ Performs element-wise binary {name} (with limited broadcast support). "axis", "If set, defines the broadcast dimensions. See doc for details.", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input( 0, "A", @@ -249,7 +252,7 @@ ONNX_OPERATOR_SET_SCHEMA( "axis", "If set, defines the broadcast dimensions. See doc for details.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Output(0, "Z", "Output tensor (same size as X)", "T") .TypeConstraint( "T", @@ -257,6 +260,33 @@ ONNX_OPERATOR_SET_SCHEMA( "Constrain input and output types to float tensors.") .TypeAndShapeInferenceFunction(propagateShapeAndTypeFromFirstInput)); +static const char* Pow_ver7_doc = R"DOC( +Pow takes input data (Tensor) and exponent Tensor, and +produces one output data (Tensor) where the function `f(x) = x^exponent`, +is applied to the data tensor elementwise. +)DOC"; + +ONNX_OPERATOR_SET_SCHEMA( + Pow, + 7, + OpSchema() + .SetDoc(std::string(Pow_ver7_doc) + GenerateBroadcastingDocMul()) + .Input(0, "X", "First operand, base of the exponent.", "T") + .Input(1, "Y", "Second operand, power of the exponent.", "T") + .Output(0, "Z", "Output tensor (same size as X)", "T") + .TypeConstraint( + "T", + {"tensor(float16)", "tensor(float)", "tensor(double)"}, + "Constrain input and output types to float tensors.") + .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { + propagateElemTypeFromInputToOutput(ctx, 0, 0); + if (hasNInputShapes(ctx, 2)) + bidirectionalBroadcastShapeInference( + ctx.getInputType(0)->tensor_type().shape(), + ctx.getInputType(1)->tensor_type().shape(), + *ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape()); + })); + static const char* Neg_ver1_doc = R"DOC( Neg takes one input data (Tensor) and produces one output data (Tensor) where each element flipped sign, y = -x, is applied to @@ -277,7 +307,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -303,7 +333,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -329,7 +359,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -355,7 +385,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -381,7 +411,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -407,7 +437,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -433,7 +463,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy 
optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -464,7 +494,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -498,7 +528,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(Selu_ver1_doc) .Input(0, "X", "Input tensor", "T") .Output(0, "Y", "Output tensor", "T") @@ -530,7 +560,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(Elu_ver1_doc) .Input(0, "X", "1D input tensor", "T") .Output(0, "Y", "1D input tensor", "T") @@ -562,7 +592,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -591,7 +621,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -620,7 +650,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -654,7 +684,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -689,9 +719,9 @@ ONNX_OPERATOR_SET_SCHEMA( PRelu, 7, OpSchema() - .SetDoc( - PRelu_ver7_doc + - GenerateBroadcastingDocUni("tensor slope", "input tensor X")) + .SetDoc(GET_OP_DOC_STR( + std::string(PRelu_ver7_doc) + + GenerateBroadcastingDocUni("tensor slope", "input tensor X"))) .Input(0, "X", "Input tensor", "T") .Input( 1, @@ -726,7 +756,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -759,7 +789,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(HardSigmoid_ver1_doc) .Input(0, "X", "Input tensor", "T") .Output(0, "Y", "Output tensor", "T") @@ -787,7 +817,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -812,7 +842,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -837,7 +867,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -867,7 +897,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeConstraint( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, @@ -888,12 +918,12 @@ 
ONNX_OPERATOR_SET_SCHEMA( "min", "Minimum value, under which element is replaced by min", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "max", "Maximum value, above which element is replaced by max", AttributeProto::FLOAT, - OPTIONAL) + OPTIONAL_VALUE) // This attribute was added via AllowConsumed API in OpSchema. // After removing the API, we're now using the Attr API to simulate the // old definition. @@ -901,7 +931,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "input", "Input tensor whose elements to be clipped", "T") .Output(0, "output", "Output tensor with clipped input elements", "T") .TypeConstraint( @@ -1055,9 +1085,9 @@ ONNX_OPERATOR_SET_SCHEMA( Gemm, 7, OpSchema() - .SetDoc( - Gemm_ver7_doc + - GenerateBroadcastingDocUni("tensor C", "tensor A * B")) + .SetDoc(GET_OP_DOC_STR( + std::string(Gemm_ver7_doc) + + GenerateBroadcastingDocUni("tensor C", "tensor A * B"))) .Input( 0, "A", @@ -1145,9 +1175,9 @@ ONNX_OPERATOR_SET_SCHEMA( Gemm, 9, OpSchema() - .SetDoc( - Gemm_ver9_doc + - GenerateBroadcastingDocUni("tensor C", "tensor A * B")) + .SetDoc(GET_OP_DOC_STR( + std::string(Gemm_ver9_doc) + + GenerateBroadcastingDocUni("tensor C", "tensor A * B"))) .Input( 0, "A", @@ -1616,11 +1646,7 @@ ONNX_OPERATOR_SET_SCHEMA( 11, OpSchema() .SetDoc(Clip_ver11_doc) - .Input( - 0, - "input", - "Input tensor whose elements to be clipped", - "T") + .Input(0, "input", "Input tensor whose elements to be clipped", "T") .Input( 1, "min", @@ -1645,13 +1671,16 @@ ONNX_OPERATOR_SET_SCHEMA( std::function ElementwiseMultiOpDocGenerator_old( const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( Element-wise {name} of each of the input tensors (with Numpy-style broadcasting support). All inputs and outputs must have the same data type. {broadcast_doc} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str()); + ReplaceAll(doc, "{name}", name); + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); schema.SetDoc(doc); schema.Input( 0, diff --git a/onnx/defs/nn/defs.cc b/onnx/defs/nn/defs.cc index 1c0601d7456..76b8b880d27 100644 --- a/onnx/defs/nn/defs.cc +++ b/onnx/defs/nn/defs.cc @@ -89,10 +89,10 @@ void convPoolShapeInference( std::vector effective_kernel_shape = kernel_shape; for (int i = 0; i < static_cast(kernel_shape.size()); i++) { // accounting for dilation, how big is the kernel in this dimension - effective_kernel_shape[i] = (effective_kernel_shape[i] - 1) * dilations[i] + 1; + effective_kernel_shape[i] = + (effective_kernel_shape[i] - 1) * dilations[i] + 1; } - std::vector pads; if (getRepeatedAttribute(ctx, "pads", pads)) { if (pads.size() != n_input_dims * 2) { @@ -115,7 +115,9 @@ void convPoolShapeInference( residual -= stride; } } - int64_t total_pad = residual == 0 ? effective_kernel_shape[i] - stride : effective_kernel_shape[i] - residual; + int64_t total_pad = residual == 0 + ? 
effective_kernel_shape[i] - stride + : effective_kernel_shape[i] - residual; if (total_pad < 0) total_pad = 0; int64_t half_pad_small = total_pad >> 1; @@ -167,7 +169,8 @@ void convPoolShapeInference( if (ceil_mode == 1) strided_kernel_positions = (int64_t)(std::ceil( - (effective_input_size - effective_kernel_shape[i]) / float(strides[i]))); + (effective_input_size - effective_kernel_shape[i]) / + float(strides[i]))); else strided_kernel_positions = (effective_input_size - effective_kernel_shape[i]) / strides[i]; @@ -184,11 +187,15 @@ void convPoolShapeInference( } } -std::vector GetSupportedDataTypesForPoolingOps(bool supports8bit){ - if (supports8bit) { - return {"tensor(float16)", "tensor(float)", "tensor(double)", "tensor(int8)", "tensor(uint8)"}; - } - return {"tensor(float16)", "tensor(float)", "tensor(double)"}; +std::vector GetSupportedDataTypesForPoolingOps(bool supports8bit) { + if (supports8bit) { + return {"tensor(float16)", + "tensor(float)", + "tensor(double)", + "tensor(int8)", + "tensor(uint8)"}; + } + return {"tensor(float16)", "tensor(float)", "tensor(double)"}; } std::function PoolOpSchemaGenerator( @@ -198,7 +205,9 @@ std::function PoolOpSchemaGenerator( bool use_dilation, bool supports8bit = false) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( {name} consumes an input tensor X and applies {opName} pooling across the tensor according to kernel sizes, stride sizes, and pad lengths. {opName} pooling consisting of computing the {opName} on all values of a @@ -228,14 +237,14 @@ std::function PoolOpSchemaGenerator( ``` {additionalDescription} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{opName}", opName); - ReplaceAll(doc, "{additionalDescription}", additionalDescription); - ReplaceAll( - doc, - "{kernelSpatialShape}", - use_dilation ? "((kernel_spatial_shape[i] - 1) * dilations[i] + 1)" - : "kernel_spatial_shape[i]"); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{opName}", opName); + ReplaceAll(doc, "{additionalDescription}", additionalDescription); + ReplaceAll( + doc, + "{kernelSpatialShape}", + use_dilation ? "((kernel_spatial_shape[i] - 1) * dilations[i] + 1)" + : "kernel_spatial_shape[i]");); schema.SetDoc(doc); schema.Attr( "kernel_shape", @@ -245,13 +254,13 @@ std::function PoolOpSchemaGenerator( "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "ceil_mode", "Whether to use ceil or floor (default) to compute the output shape.", @@ -283,8 +292,9 @@ std::function PoolOpSchemaGenerator( schema.TypeConstraint( "T", GetSupportedDataTypesForPoolingOps(supports8bit), - supports8bit ? "Constrain input and output types to float and 8 bit tensors." - : "Constrain input and output types to float tensors."); + supports8bit + ? "Constrain input and output types to float and 8 bit tensors." + : "Constrain input and output types to float tensors."); schema.TypeAndShapeInferenceFunction([use_dilation](InferenceContext& ctx) { propagateElemTypeFromInputToOutput(ctx, 0, 0); if (ctx.getNumOutputs() > 1) { @@ -335,7 +345,7 @@ ONNX_OPERATOR_SET_SCHEMA( "dilations", "Dilation value along each spatial axis of filter. 
If not present, the dilation defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Output( 1, "Indices", @@ -475,8 +485,8 @@ ONNX_OPERATOR_SET_SCHEMA( "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL) - .Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL) + OPTIONAL_VALUE) + .Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL_VALUE) .Input( 0, "X", @@ -530,13 +540,14 @@ ONNX_OPERATOR_SET_SCHEMA( std::function LpPoolOpSchemaGenerator(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( {name} consumes an input tensor X and applies Lp pooling across the tensor according to kernel sizes, stride sizes, and pad lengths. Lp pooling consisting of computing the Lp norm on all values of a subset of the input tensor according to the kernel size and downsampling the data into the output tensor Y for further processing.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc); schema.Attr( "kernel_shape", @@ -546,13 +557,13 @@ std::function LpPoolOpSchemaGenerator(const char* name) { "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "p", "p value of the Lp norm used to pool over the input data.", @@ -636,11 +647,12 @@ void roiPoolTypeShapeInference(InferenceContext& ctx) { std::function RoiPoolOpSchemaGenerator(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( ROI {name} pool consumes an input tensor X and region of interests (RoIs) to apply {name} pooling across each RoI, to produce output 4-D tensor of shape (num_rois, channels, pooled_shape[0], pooled_shape[1]).)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc); schema.Attr( "pooled_shape", @@ -688,10 +700,11 @@ ONNX_OPERATOR_SET_SCHEMA( std::function ConvOpSchemaGenerator(const char* filter_desc) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The convolution operator consumes an input tensor and {filter_desc}, and computes the output.)DOC"; - ReplaceAll(doc, "{filter_desc}", filter_desc); + ReplaceAll(doc, "{filter_desc}", filter_desc);); schema.SetDoc(doc); schema.Input( 0, @@ -745,27 +758,23 @@ computes the output.)DOC"; "kernel_shape", "The shape of the convolution kernel. If not present, should be inferred from input W.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "dilations", "dilation value along each spatial axis of the filter. If not present, the dilation defaults is 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "strides", "Stride along each spatial axis. 
If not present, the stride defaults is 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr( - "pads", - pads_doc, - AttributeProto::INTS, - OPTIONAL); + schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "group", "number of groups input channels and output channels are divided into.", @@ -898,17 +907,17 @@ ONNX_OPERATOR_SET_SCHEMA( "kernel_shape", "The shape of the convolution kernel. If not present, should be inferred from input 'w'.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dilations", "dilation value along each spatial axis of the filter. If not present, the dilation defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "pads", "Padding for the beginning and ending along each spatial axis, it can take any value greater than or equal to 0." @@ -918,14 +927,13 @@ ONNX_OPERATOR_SET_SCHEMA( "This attribute cannot be used simultaneously with auto_pad attribute. If not present, the padding defaults" "to 0 along start and end of each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "group", "number of groups input channels and output channels are divided into. default is 1.", AttributeProto::INT, static_cast(1)) - .TypeAndShapeInferenceFunction([](InferenceContext& - ctx) { + .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { auto x_type = ctx.getInputType(0); auto w_type = ctx.getInputType(3); if (nullptr == x_type || nullptr == w_type || @@ -1038,17 +1046,17 @@ ONNX_OPERATOR_SET_SCHEMA( "kernel_shape", "The shape of the convolution kernel. If not present, should be inferred from input 'w'.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "dilations", "dilation value along each spatial axis of the filter. If not present, the dilation defaults to 1 along each axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "pads", "Padding for the beginning and ending along each spatial axis, it can take any value greater than or equal to 0." @@ -1058,14 +1066,13 @@ ONNX_OPERATOR_SET_SCHEMA( "This attribute cannot be used simultaneously with auto_pad attribute. If not present, the padding defaults" "to 0 along start and end of each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "group", "number of groups input channels and output channels are divided into. 
default is 1.", AttributeProto::INT, static_cast(1)) - .TypeAndShapeInferenceFunction([](InferenceContext& - ctx) { + .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { auto x_type = ctx.getInputType(0); auto w_type = ctx.getInputType(1); auto y_type = ctx.getOutputType(0); @@ -1077,8 +1084,7 @@ ONNX_OPERATOR_SET_SCHEMA( } // Right now we only support int32 - y_type->mutable_tensor_type()->set_elem_type( - TensorProto::INT32); + y_type->mutable_tensor_type()->set_elem_type(TensorProto::INT32); convPoolShapeInference(ctx, true, false, 0, 1); })); @@ -1180,7 +1186,7 @@ void convTransposeShapeInference(InferenceContext& ctx) { } } } - + std::vector output_shape; bool output_shape_presented = true; if (getRepeatedAttribute(ctx, "output_shape", output_shape)) { @@ -1243,7 +1249,8 @@ void convTransposeShapeInference(InferenceContext& ctx) { std::function ConvTransposeOpSchemaGenerator( const char* filter_desc) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The convolution transpose operator consumes an input tensor and {filter_desc}, and computes the output. @@ -1258,7 +1265,7 @@ output_shape can also be explicitly specified in which case pads values are auto Else: pads[start_i] = total_padding[i] - (total_padding[i]/2); pads[end_i] = (total_padding[i]/2). )DOC"; - ReplaceAll(doc, "{filter_desc}", filter_desc); + ReplaceAll(doc, "{filter_desc}", filter_desc);); schema.SetDoc(doc); schema.Input( 0, @@ -1304,13 +1311,13 @@ output_shape can also be explicitly specified in which case pads values are auto "kernel_shape", "The shape of the convolution kernel. If not present, should be inferred from input W.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_shape", "The shape of the output can be explicitly set which will cause pads values to be auto generated. If output_shape is specified " "pads values are ignored. See doc for details for equations to generate pads", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_padding", "Additional elements added to the side with higher coordinate indices in the output. " @@ -1324,23 +1331,23 @@ output_shape can also be explicitly specified in which case pads values are auto "participates in the computation of the needed padding amount. " "This is also called adjs or adjustment in some frameworks.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "dilations", "dilation value along each spatial axis of the filter. If not present, the dilation defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "strides", "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "group", "number of groups input channels and output channels are divided into.", @@ -1388,12 +1395,13 @@ std::function GlobalPoolingOpSchemaGenerator( const char* op_type, const char* op) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Global{op_type} consumes an input tensor X and applies {op} pooling across the values in the same channel. 
This is equivalent to {op_type} with kernel size equal to the spatial dimension of input tensor.)DOC"; - ReplaceAll(doc, "{op_type}", op_type); - ReplaceAll(doc, "{op}", op); + ReplaceAll(doc, "{op_type}", op_type); + ReplaceAll(doc, "{op}", op);); schema.SetDoc(doc); schema.Input( 0, @@ -1436,12 +1444,13 @@ std::function<void(OpSchema&)> GlobalLpPoolingOpSchemaGenerator( const char* op_type, const char* op) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Global{op_type} consumes an input tensor X and applies {op} pooling across the values in the same channel. This is equivalent to {op_type} with kernel size equal to the spatial dimension of input tensor.)DOC"; - ReplaceAll(doc, "{op_type}", op_type); - ReplaceAll(doc, "{op}", op); + ReplaceAll(doc, "{op_type}", op_type); + ReplaceAll(doc, "{op}", op);); schema.SetDoc(doc); schema.Attr( "p", @@ -1471,7 +1480,6 @@ std::function<void(OpSchema&)> GlobalLpPoolingOpSchemaGenerator( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrain input and output types to float tensors."); - schema.SetDoc(doc); schema.TypeAndShapeInferenceFunction( [](InferenceContext& ctx) { globalPoolTypeShapeInference(ctx); }); }; @@ -1521,7 +1529,9 @@ ONNX_OPERATOR_SET_SCHEMA( 12, OpSchema() .NumOutputs({1, 5}) - .SetDoc(BatchNormalization_ver12_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(BatchNormalization_ver12_doc) + + GenerateOptionalArgumentsDoc())) .Attr( "epsilon", "The epsilon value to use to avoid division by zero.", @@ -1598,70 +1608,94 @@ ONNX_OPERATOR_SET_SCHEMA( "Constrain input 'training_mode' types to boolean tensors.") .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { propagateElemTypeFromInputToOutput(ctx, 0, 0); - if(hasInputShape(ctx, 0)){ + if (hasInputShape(ctx, 0)) { propagateShapeFromInputToOutput(ctx, 0, 0); auto& x_input_shape = getInputShape(ctx, 0); int num_channels = 1; - if ( static_cast<int>(x_input_shape.dim_size()) > 1 && x_input_shape.dim(1).has_dim_value()) { - num_channels = static_cast<int>(x_input_shape.dim(1).dim_value()); + if (static_cast<int>(x_input_shape.dim_size()) > 1 && + x_input_shape.dim(1).has_dim_value()) { + num_channels = static_cast<int>(x_input_shape.dim(1).dim_value()); } if (hasInputShape(ctx, 1)) { - auto& scale_input_shape = getInputShape(ctx, 1); - if(static_cast<int>(scale_input_shape.dim_size()) != 1 || !scale_input_shape.dim(0).has_dim_value() || static_cast<int>(scale_input_shape.dim(0).dim_value()) != num_channels) { - fail_shape_inference("All scale, B, mean and var must be tensors of shape C."); - } + auto& scale_input_shape = getInputShape(ctx, 1); + if (static_cast<int>(scale_input_shape.dim_size()) != 1 || + !scale_input_shape.dim(0).has_dim_value() || + static_cast<int>(scale_input_shape.dim(0).dim_value()) != + num_channels) { + fail_shape_inference( + "All scale, B, mean and var must be tensors of shape C."); + } } if (hasInputShape(ctx, 2)) { - auto& b_input_shape = getInputShape(ctx, 2); - if(static_cast<int>(b_input_shape.dim_size()) != 1 || !b_input_shape.dim(0).has_dim_value() || static_cast<int>(b_input_shape.dim(0).dim_value()) != num_channels) { - fail_shape_inference("All scale, B, mean and var must be tensors of shape C."); - } + auto& b_input_shape = getInputShape(ctx, 2); + if (static_cast<int>(b_input_shape.dim_size()) != 1 || + !b_input_shape.dim(0).has_dim_value() || + static_cast<int>(b_input_shape.dim(0).dim_value()) != + num_channels) { + fail_shape_inference( + "All scale, B, mean and var must be tensors of shape C."); + } } if (hasInputShape(ctx, 3)) { - auto& mean_input_shape = getInputShape(ctx, 3); - if(static_cast<int>(mean_input_shape.dim_size() != 1)|| !mean_input_shape.dim(0).has_dim_value() || static_cast<int>(mean_input_shape.dim(0).dim_value()) != num_channels) { - fail_shape_inference("All scale, B, mean and var must be tensors of shape C."); - } + auto& mean_input_shape = getInputShape(ctx, 3); + if (static_cast<int>(mean_input_shape.dim_size() != 1) || + !mean_input_shape.dim(0).has_dim_value() || + static_cast<int>(mean_input_shape.dim(0).dim_value()) != + num_channels) { + fail_shape_inference( + "All scale, B, mean and var must be tensors of shape C."); + } } if (hasInputShape(ctx, 4)) { - auto& var_input_shape = getInputShape(ctx, 4); - if(static_cast<int>(var_input_shape.dim_size()) != 1 || !var_input_shape.dim(0).has_dim_value() || static_cast<int>(var_input_shape.dim(0).dim_value()) != num_channels) { - fail_shape_inference("All scale, B, mean and var must be tensors of shape C."); - } + auto& var_input_shape = getInputShape(ctx, 4); + if (static_cast<int>(var_input_shape.dim_size()) != 1 || + !var_input_shape.dim(0).has_dim_value() || + static_cast<int>(var_input_shape.dim(0).dim_value()) != + num_channels) { + fail_shape_inference( + "All scale, B, mean and var must be tensors of shape C."); + } } if (ctx.getNumInputs() > 5 && hasInputShape(ctx, 5)) { - auto& mode_input_shape = getInputShape(ctx, 5); - if (static_cast<int>(mode_input_shape.dim_size()) != 0) { - fail_shape_inference("Training_mode is not a scalar boolean."); + auto& mode_input_shape = getInputShape(ctx, 5); + // if mode is not scalar or tensor of rank 1, fail shape inference + if (static_cast<int>(mode_input_shape.dim_size()) != 0) { + if (static_cast<int>(mode_input_shape.dim_size()) > 1 || + !mode_input_shape.dim(0).has_dim_value() || + static_cast<int>(mode_input_shape.dim(0).dim_value()) != + 1) { + fail_shape_inference( + "Training_mode is not a scalar boolean."); } + } } if (ctx.getNumOutputs() > 1) { - TensorShapeProto outputs_shape; - *outputs_shape.add_dim() = x_input_shape.dim(1); // channel - - propagateElemTypeFromInputToOutput(ctx, 0, 1); - updateOutputShape(ctx, 1, outputs_shape); - - if (ctx.getNumOutputs() > 2){ - propagateElemTypeFromInputToOutput(ctx, 0, 2); - updateOutputShape(ctx, 2, outputs_shape); - } - - if (ctx.getNumOutputs() > 3){ - propagateElemTypeFromInputToOutput(ctx, 0, 3); - updateOutputShape(ctx, 3, outputs_shape); - } - - if (ctx.getNumOutputs() > 4){ - propagateElemTypeFromInputToOutput(ctx, 0, 4); - updateOutputShape(ctx, 4, outputs_shape); - } + TensorShapeProto outputs_shape; + *outputs_shape.add_dim() = x_input_shape.dim(1); // channel + + propagateElemTypeFromInputToOutput(ctx, 0, 1); + updateOutputShape(ctx, 1, outputs_shape); + + if (ctx.getNumOutputs() > 2) { + propagateElemTypeFromInputToOutput(ctx, 0, 2); + updateOutputShape(ctx, 2, outputs_shape); + } + + if (ctx.getNumOutputs() > 3) { + propagateElemTypeFromInputToOutput(ctx, 0, 3); + updateOutputShape(ctx, 3, outputs_shape); + } + + if (ctx.getNumOutputs() > 4) { + propagateElemTypeFromInputToOutput(ctx, 0, 4); + updateOutputShape(ctx, 4, outputs_shape); + } } } })); @@ -1763,13 +1797,23 @@ ONNX_OPERATOR_SET_SCHEMA( Dropout, 12, OpSchema() - .SetDoc(Dropout_ver12_doc + GenerateOptionalArgumentsDoc()) - .Attr("seed", "(Optional) Seed to the random generator, if not specified we will auto generate one.", AttributeProto::INT, OPTIONAL) + .SetDoc(GET_OP_DOC_STR( + std::string(Dropout_ver12_doc) + GenerateOptionalArgumentsDoc())) + .Attr( + "seed", + "(Optional) Seed to the random generator, if not specified we 
will auto generate one.", + AttributeProto::INT, + OPTIONAL_VALUE) .Input(0, "data", "The input data as Tensor.", "T") - .Input(1, "ratio", "The ratio of random dropout, with value in [0, 1). If this input was not set, " - "or if it was set to 0, the output would be a simple copy of the input. " - "If it's non-zero, output will be a random dropout of the scaled input, which is typically " - "the case during training.", "T1", OpSchema::Optional) + .Input( + 1, + "ratio", + "The ratio of random dropout, with value in [0, 1). If this input was not set, " + "or if it was set to 0, the output would be a simple copy of the input. " + "If it's non-zero, output will be a random dropout of the scaled input, which is typically " + "the case during training.", + "T1", + OpSchema::Optional) .Output(0, "output", "The output.", "T") .Output(1, "mask", "The output mask.", "T2", OpSchema::Optional) .TypeConstraint( @@ -1793,7 +1837,7 @@ ONNX_OPERATOR_SET_SCHEMA( if (ctx.getNumInputs() > 1 && hasInputShape(ctx, 1)) { auto& ratio_input_shape = getInputShape(ctx, 1); if (static_cast(ratio_input_shape.dim_size()) != 0) { - fail_shape_inference("Ratio of Dropout must be a scalar."); + fail_shape_inference("Ratio of Dropout must be a scalar."); } } if (ctx.getNumOutputs() == 2) { @@ -2000,14 +2044,14 @@ ONNX_OPERATOR_SET_SCHEMA( "It's an 1-D tensor starting with the collections of all 1-grams and ending with the collections of n-grams. " "The i-th element in pool stores the n-gram that should be mapped to coordinate ngram_indexes[i] in the output vector.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "pool_int64s", "List of int64 n-grams learned from the training set. Either this or pool_strings attributes must be present but not both. " "It's an 1-D tensor starting with the collections of all 1-grams and ending with the collections of n-grams. " "The i-th element in pool stores the n-gram that should be mapped to coordinate ngram_indexes[i] in the output vector.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "ngram_counts", "The starting indexes of 1-grams, 2-grams, and so on in pool. " @@ -2028,7 +2072,7 @@ ONNX_OPERATOR_SET_SCHEMA( "By default, weights is an all-one tensor.This attribute is used when mode is \"IDF\" or \"TFIDF\" " "to scale the associated word counts.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "mode", "The weighting criteria. It can be one of \"TF\" (term frequency), " @@ -2105,13 +2149,13 @@ ONNX_OPERATOR_SET_SCHEMA( "stopwords", "List of stop words. If not set, no word would be removed from X.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "locale", "Environment dependent string that denotes the locale according to which output strings needs to be upper/lowercased." 
"Default en_US or platform specific equivalent as decided by the implementation.", AttributeProto::STRING, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(StringNormalizer_ver10_doc) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { auto output_elem_type = ctx.getOutputType(0)->mutable_tensor_type(); diff --git a/onnx/defs/nn/old.cc b/onnx/defs/nn/old.cc index e6251590771..5b6173786ff 100644 --- a/onnx/defs/nn/old.cc +++ b/onnx/defs/nn/old.cc @@ -87,10 +87,10 @@ void convPoolShapeInference1( std::vector effective_kernel_shape = kernel_shape; for (int i = 0; i < static_cast(kernel_shape.size()); i++) { // accounting for dilation, how big is the kernel in this dimension - effective_kernel_shape[i] = (effective_kernel_shape[i] - 1) * dilations[i] + 1; + effective_kernel_shape[i] = + (effective_kernel_shape[i] - 1) * dilations[i] + 1; } - std::vector pads; if (getRepeatedAttribute(ctx, "pads", pads)) { if (pads.size() != n_input_dims * 2) { @@ -113,7 +113,9 @@ void convPoolShapeInference1( residual -= stride; } } - int64_t total_pad = residual == 0 ? effective_kernel_shape[i] - stride : effective_kernel_shape[i] - residual; + int64_t total_pad = residual == 0 + ? effective_kernel_shape[i] - stride + : effective_kernel_shape[i] - residual; if (total_pad < 0) total_pad = 0; int64_t half_pad_small = total_pad >> 1; @@ -128,7 +130,7 @@ void convPoolShapeInference1( } } } - + auto output_shape = ctx.getOutputType(0)->mutable_tensor_type()->mutable_shape(); @@ -165,7 +167,8 @@ void convPoolShapeInference1( if (ceil_mode == 1) strided_kernel_positions = (int64_t)(std::ceil( - (effective_input_size - effective_kernel_shape[i]) / float(strides[i]))); + (effective_input_size - effective_kernel_shape[i]) / + float(strides[i]))); else strided_kernel_positions = (effective_input_size - effective_kernel_shape[i]) / strides[i]; @@ -187,7 +190,9 @@ std::function PoolOpSchemaGenerator_9( const char* opName, const char* additionalDescription) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( {name} consumes an input tensor X and applies {opName} pooling across the tensor according to kernel sizes, stride sizes, and pad lengths. 
{opName} pooling consisting of computing the {opName} on all values of a @@ -210,22 +215,25 @@ std::function PoolOpSchemaGenerator_9( ``` {additionalDescription} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{opName}", opName); - ReplaceAll(doc, "{additionalDescription}", additionalDescription); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{opName}", opName); + ReplaceAll(doc, "{additionalDescription}", additionalDescription);); schema.SetDoc(doc); schema.Attr( "kernel_shape", "The size of the kernel along each axis.", AttributeProto::INTS); schema.Attr( - "strides", "Stride along each spatial axis.", AttributeProto::INTS, OPTIONAL); + "strides", + "Stride along each spatial axis.", + AttributeProto::INTS, + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc2, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE); schema.Input( 0, "X", @@ -275,7 +283,9 @@ std::function PoolOpSchemaGenerator_10( bool use_dilation, int opsetNum) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( {name} consumes an input tensor X and applies {opName} pooling across the tensor according to kernel sizes, stride sizes, and pad lengths. {opName} pooling consisting of computing the {opName} on all values of a @@ -305,30 +315,32 @@ std::function PoolOpSchemaGenerator_10( ``` {additionalDescription} )DOC"; - ReplaceAll(doc, "{name}", name); - ReplaceAll(doc, "{opName}", opName); - ReplaceAll(doc, "{additionalDescription}", additionalDescription); - ReplaceAll( - doc, - "{kernelSpatialShape}", - use_dilation ? "((kernel_spatial_shape[i] - 1) * dilations[i] + 1)" - : "kernel_spatial_shape[i]"); + ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{opName}", opName); + ReplaceAll(doc, "{additionalDescription}", additionalDescription); + ReplaceAll( + doc, + "{kernelSpatialShape}", + use_dilation ? "((kernel_spatial_shape[i] - 1) * dilations[i] + 1)" + : "kernel_spatial_shape[i]");); schema.SetDoc(doc); schema.Attr( "kernel_shape", "The size of the kernel along each axis.", AttributeProto::INTS); schema.Attr( - "strides", - opsetNum == 11 ? "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis." : "Stride along each spatial axis.", - AttributeProto::INTS, - OPTIONAL); + "strides", + opsetNum == 11 + ? "Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis." + : "Stride along each spatial axis.", + AttributeProto::INTS, + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc2, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "ceil_mode", "Whether to use ceil or floor (default) to compute the output shape.", @@ -470,7 +482,7 @@ ONNX_OPERATOR_SET_SCHEMA( "dilations", "Dilation value along each spatial axis of filter.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Output( 1, "Indices", @@ -506,7 +518,7 @@ ONNX_OPERATOR_SET_SCHEMA( "dilations", "Dilation value along each spatial axis of filter. 
If not present, the dilation defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Output( 1, "Indices", @@ -646,8 +658,8 @@ ONNX_OPERATOR_SET_SCHEMA( "strides", "Stride along each spatial axis.", AttributeProto::INTS, - OPTIONAL) - .Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL) + OPTIONAL_VALUE) + .Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE) .Input( 0, "X", @@ -732,18 +744,18 @@ ONNX_OPERATOR_SET_SCHEMA( "kernel_shape", "The size of the kernel along each axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "strides", "Stride along each axis.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "auto_pad", auto_pad_doc1, AttributeProto::STRING, std::string("NOTSET")) - .Attr("pads", pads_doc1, AttributeProto::INTS, OPTIONAL) + .Attr("pads", pads_doc1, AttributeProto::INTS, OPTIONAL_VALUE) .Attr( "p", "p value of the Lp norm used to pool over the input data, default is 2.0.", @@ -775,26 +787,30 @@ ONNX_OPERATOR_SET_SCHEMA( std::function LpPoolOpSchemaGenerator_10(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( {name} consumes an input tensor X and applies Lp pooling across the tensor according to kernel sizes, stride sizes, and pad lengths. Lp pooling consisting of computing the Lp norm on all values of a subset of the input tensor according to the kernel size and downsampling the data into the output tensor Y for further processing.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc); schema.Attr( "kernel_shape", "The size of the kernel along each axis.", AttributeProto::INTS); schema.Attr( - "strides", "Stride along each spatial axis.", AttributeProto::INTS, OPTIONAL); + "strides", + "Stride along each spatial axis.", + AttributeProto::INTS, + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc2, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "p", "p value of the Lp norm used to pool over the input data.", @@ -840,12 +856,14 @@ static const char* GlobalLpPool_ver1_doc = R"DOC( the values in the same channel. This is equivalent to LpPool with kernel size equal to the spatial dimension of input tensor.)DOC"; - std::function ConvOpSchemaGenerator_10(const char* filter_desc) { +std::function ConvOpSchemaGenerator_10( + const char* filter_desc) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The convolution operator consumes an input tensor and {filter_desc}, and computes the output.)DOC"; - ReplaceAll(doc, "{filter_desc}", filter_desc); + ReplaceAll(doc, "{filter_desc}", filter_desc);); schema.SetDoc(doc); schema.Input( 0, @@ -899,20 +917,23 @@ computes the output.)DOC"; "kernel_shape", "The shape of the convolution kernel. 
If not present, should be inferred from input W.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "dilations", "dilation value along each spatial axis of the filter.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( - "strides", "Stride along each spatial axis.", AttributeProto::INTS, OPTIONAL); + "strides", + "Stride along each spatial axis.", + AttributeProto::INTS, + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc2, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "group", "number of groups input channels and output channels are divided into.", @@ -1027,7 +1048,7 @@ void convTransposeShapeInference1(InferenceContext& ctx) { } } } - + std::vector output_shape; bool output_shape_presented = true; if (getRepeatedAttribute(ctx, "output_shape", output_shape)) { @@ -1090,7 +1111,8 @@ void convTransposeShapeInference1(InferenceContext& ctx) { std::function ConvTransposeOpSchemaGenerator_10( const char* filter_desc) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( The convolution transpose operator consumes an input tensor and {filter_desc}, and computes the output. @@ -1105,7 +1127,7 @@ output_shape can also be explicitly specified in which case pads values are auto Else: pads[start_i] = total_padding[i] - (total_padding[i]/2); pads[end_i] = (total_padding[i]/2). )DOC"; - ReplaceAll(doc, "{filter_desc}", filter_desc); + ReplaceAll(doc, "{filter_desc}", filter_desc);); schema.SetDoc(doc); schema.Input( 0, @@ -1151,32 +1173,35 @@ output_shape can also be explicitly specified in which case pads values are auto "kernel_shape", "The shape of the convolution kernel. If not present, should be inferred from input W.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_shape", "The shape of the output can be explicitly set which will cause pads values to be auto generated. If output_shape is specified " "pads values are ignored. See doc for details for equations to generate pads", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_padding", "The zero-padding added to one side of the output." 
" This is also called adjs/adjustment in some frameworks.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "dilations", "dilation value along each spatial axis of the filter.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( - "strides", "Stride along each spatial axis.", AttributeProto::INTS, OPTIONAL); + "strides", + "Stride along each spatial axis.", + AttributeProto::INTS, + OPTIONAL_VALUE); schema.Attr( "auto_pad", auto_pad_doc2, AttributeProto::STRING, std::string("NOTSET")); - schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL); + schema.Attr("pads", pads_doc2, AttributeProto::INTS, OPTIONAL_VALUE); schema.Attr( "group", "number of groups input channels and output channels are divided into.", @@ -1353,7 +1378,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "epsilon", "The epsilon value to use to avoid division by zero, default is 1e-5f.", @@ -1401,7 +1426,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "is_test", "(int, default 0) if nonzero, run dropout in test mode where " @@ -1463,7 +1488,8 @@ ONNX_OPERATOR_SET_SCHEMA( Dropout, 7, OpSchema() - .SetDoc(Dropout_ver7_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(Dropout_ver7_doc) + GenerateOptionalArgumentsDoc())) .Attr( "ratio", "The ratio of random dropout", @@ -1490,7 +1516,8 @@ ONNX_OPERATOR_SET_SCHEMA( Dropout, 10, OpSchema() - .SetDoc(Dropout_ver10_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(Dropout_ver10_doc) + GenerateOptionalArgumentsDoc())) .Attr( "ratio", "The ratio of random dropout", @@ -1647,7 +1674,9 @@ ONNX_OPERATOR_SET_SCHEMA( 9, OpSchema() .NumOutputs({1, 5}) - .SetDoc(BatchNormalization_ver9_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(BatchNormalization_ver9_doc) + + GenerateOptionalArgumentsDoc())) .Attr( "epsilon", "The epsilon value to use to avoid division by zero.", @@ -1835,8 +1864,10 @@ ONNX_OPERATOR_SET_SCHEMA( BatchNormalization, 7, OpSchema() + .SetDoc(GET_OP_DOC_STR( + std::string(BatchNormalization_ver7_doc) + + GenerateOptionalArgumentsDoc())) .NumOutputs({1, 5}) - .SetDoc(BatchNormalization_ver7_doc + GenerateOptionalArgumentsDoc()) .Attr( "spatial", "If true, compute the mean and variance across per activation. " diff --git a/onnx/defs/operator_sets-training.h b/onnx/defs/operator_sets-training.h index 902159ecd50..2cef74d018b 100644 --- a/onnx/defs/operator_sets-training.h +++ b/onnx/defs/operator_sets-training.h @@ -10,6 +10,7 @@ namespace ONNX_NAMESPACE { // Declare training operators. 
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Gradient); class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, GraphCall); +class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Momentum); class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Adagrad); // Iterate over schema from ai.onnx.training version 1 @@ -18,6 +19,7 @@ class OpSet_OnnxTraining_ver1 { static void ForEachSchema(std::function<void(OpSchema&&)> fn) { fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Gradient)>()); fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, GraphCall)>()); + fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Momentum)>()); fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(OnnxTraining, 1, Adagrad)>()); } }; diff --git a/onnx/defs/operator_sets.h b/onnx/defs/operator_sets.h index 8895988ff59..2668a1950d9 100644 --- a/onnx/defs/operator_sets.h +++ b/onnx/defs/operator_sets.h @@ -732,6 +732,7 @@ class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, MeanSquaredDistance); class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, LessOrEqual); class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, GreaterOrEqual); class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, SoftmaxCrossEntropyLoss); +class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, Pow); // Iterate over schema from ai.onnx version 12 class OpSet_Onnx_ver12 { @@ -758,6 +759,7 @@ class OpSet_Onnx_ver12 { fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, LessOrEqual)>()); fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, GreaterOrEqual)>()); fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, SoftmaxCrossEntropyLoss)>()); + fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 12, Pow)>()); } }; diff --git a/onnx/defs/reduction/defs.cc b/onnx/defs/reduction/defs.cc index 75e533b16b2..6fbc832daca 100644 --- a/onnx/defs/reduction/defs.cc +++ b/onnx/defs/reduction/defs.cc @@ -7,35 +7,39 @@ namespace ONNX_NAMESPACE { -std::vector<std::string> GetSupportedDataTypesForReductionOps(bool supports8bit){ - if (supports8bit) { - auto data_types = OpSchema::numeric_types_for_math_reduction(); - data_types.push_back("tensor(uint8)"); - data_types.push_back("tensor(int8)"); +std::vector<std::string> GetSupportedDataTypesForReductionOps( + bool supports8bit) { + if (supports8bit) { + auto data_types = OpSchema::numeric_types_for_math_reduction(); + data_types.push_back("tensor(uint8)"); + data_types.push_back("tensor(int8)"); - return data_types; - } + return data_types; + } - return OpSchema::numeric_types_for_math_reduction(); + return OpSchema::numeric_types_for_math_reduction(); } -std::function<void(OpSchema&)> ReduceDocGenerator(const char* name, bool supports_8bit_datatypes = false) { +std::function<void(OpSchema&)> ReduceDocGenerator( + const char* name, + bool supports_8bit_datatypes = false) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Computes the {name} of the input tensor's element along the provided axes. The resulted tensor has the same rank as the input if keepdims equal 1. If keepdims equal 0, then the resulted tensor have the reduced dimension pruned. The above behavior is similar to numpy, with the exception that numpy default keepdims to False instead of True.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc.c_str()); schema.Attr( "axes", "A list of integers, along which to reduce. The default is to reduce over " "all the dimensions of the input tensor. Accepted range is [-r, r-1] where r = rank(data).", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "keepdims", "Keep the reduced dimension or not, default 1 mean keep reduced dimension.", @@ -46,9 +50,9 @@ False instead of True.)DOC"; schema.TypeConstraint( "T", GetSupportedDataTypesForReductionOps(supports_8bit_datatypes), - supports_8bit_datatypes ? - "Constrain input and output types to high-precision and 8 bit numeric tensors." 
: - "Constrain input and output types to high-precision numeric tensors."); + supports_8bit_datatypes + ? "Constrain input and output types to high-precision and 8 bit numeric tensors." + : "Constrain input and output types to high-precision numeric tensors."); schema.TypeAndShapeInferenceFunction([](InferenceContext& ctx) { propagateElemTypeFromInputToOutput(ctx, 0, 0); if (!hasNInputShapes(ctx, 1)) { @@ -147,7 +151,8 @@ ONNX_OPERATOR_SET_SCHEMA( std::function ArgReduceDocGenerator(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Computes the indices of the {name} elements of the input tensor's element along the provided axis. The resulting tensor has the same rank as the input if keepdims equal 1. If keepdims equal 0, then the resulting tensor have the reduced dimension pruned. @@ -155,7 +160,7 @@ If select_last_index is True (default False), the index of the last occurrence o is selected if the {name} appears more than once in the input. Otherwise the index of the first occurrence is selected. The type of the output tensor is integer.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc.c_str()); schema.Attr( "axis", @@ -200,7 +205,7 @@ The type of the output tensor is integer.)DOC"; axis = axis_proto->i(); if (axis < -input_ndim || axis >= input_ndim) { fail_shape_inference( - "'axis' must be in [-rank(indices), rank(indices)-1]"); + "'axis' must be in [-rank(indices), rank(indices)-1]"); } if (axis < 0) axis += input_ndim; diff --git a/onnx/defs/reduction/old.cc b/onnx/defs/reduction/old.cc index 9a017f02135..d7ef7e05b5a 100644 --- a/onnx/defs/reduction/old.cc +++ b/onnx/defs/reduction/old.cc @@ -10,24 +10,25 @@ std::function ReduceDocGenerator_opset1( const char* name, int opset = 1) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Computes the {name} of the input tensor's element along the provided axes. The resulted tensor has the same rank as the input if keepdims equal 1. If keepdims equal 0, then the resulted tensor have the reduced dimension pruned. The above behavior is similar to numpy, with the exception that numpy default keepdims to False instead of True.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc.c_str()); schema.Attr( "axes", - opset >= 11 ? - "A list of integers, along which to reduce. The default is to reduce over " - "all the dimensions of the input tensor. Accepted range is [-r, r-1] where r = rank(data)." : - "A list of integers, along which to reduce. The default is to reduce over " - "all the dimensions of the input tensor.", + opset >= 11 + ? "A list of integers, along which to reduce. The default is to reduce over " + "all the dimensions of the input tensor. Accepted range is [-r, r-1] where r = rank(data)." + : "A list of integers, along which to reduce. The default is to reduce over " + "all the dimensions of the input tensor.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "keepdims", "Keep the reduced dimension or not, default 1 mean keep reduced dimension.", @@ -143,12 +144,13 @@ ONNX_OPERATOR_SET_SCHEMA( std::function ArgReduceDocGenerator_opset1(const char* name) { return [=](OpSchema& schema) { - std::string doc = R"DOC( + std::string doc; + POPULATE_OP_DOC_STR(doc = R"DOC( Computes the indices of the {name} elements of the input tensor's element along the provided axis. 
The resulted tensor has the same rank as the input if keepdims equal 1. If keepdims equal 0, then the resulted tensor have the reduced dimension pruned. The type of the output tensor is integer.)DOC"; - ReplaceAll(doc, "{name}", name); + ReplaceAll(doc, "{name}", name);); schema.SetDoc(doc.c_str()); schema.Attr( "axis", @@ -268,7 +270,7 @@ The type of the output tensor is integer.)DOC"; axis = axis_proto->i(); if (axis < -input_ndim || axis >= input_ndim) { fail_shape_inference( - "'axis' must be in [-rank(indices), rank(indices)-1]"); + "'axis' must be in [-rank(indices), rank(indices)-1]"); } if (axis < 0) axis += input_ndim; diff --git a/onnx/defs/rnn/defs.cc b/onnx/defs/rnn/defs.cc index a640669e28d..63875b19a46 100644 --- a/onnx/defs/rnn/defs.cc +++ b/onnx/defs/rnn/defs.cc @@ -62,7 +62,7 @@ std::function RNNDocGenerator(const char* /*name*/) { "hidden_size", "Number of neurons in the hidden layer", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_alpha", "Optional scaling values used by some activation functions. The values " @@ -70,21 +70,21 @@ std::function RNNDocGenerator(const char* /*name*/) { "in LSTM. Default values are the same as of corresponding ONNX operators." "For example with LeakyRelu, the default alpha is 0.01.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_beta", "Optional scaling values used by some activation functions. The values " "are consumed in the order of activation functions, for example (f, g, h) " "in LSTM. Default values are the same as of corresponding ONNX operators.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "clip", "Cell clip threshold. Clipping bounds the elements of a tensor " "in the range of [-threshold, +threshold] and is applied to the input " "of activations. No clip if not specified.", AttributeProto::FLOAT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input( 0, "X", @@ -197,7 +197,8 @@ ONNX_OPERATOR_SET_SCHEMA( RNN, 7, OpSchema() - .SetDoc(RNN_ver7_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(RNN_ver7_doc) + GenerateOptionalArgumentsDoc())) .Attr( "activations", "One (or two if bidirectional) activation function for " @@ -309,7 +310,8 @@ ONNX_OPERATOR_SET_SCHEMA( GRU, 7, OpSchema() - .SetDoc(GRU_ver7_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(GRU_ver7_doc) + GenerateOptionalArgumentsDoc())) .Attr( "activations", "A list of 2 (or 4 if bidirectional) activation functions " @@ -317,7 +319,7 @@ ONNX_OPERATOR_SET_SCHEMA( "of the activation functions specified above. Optional: See the equations " "for default if not specified.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "linear_before_reset", "When computing the output of the hidden gate, " @@ -437,7 +439,8 @@ ONNX_OPERATOR_SET_SCHEMA( LSTM, 7, OpSchema() - .SetDoc(LSTM_ver7_doc + GenerateOptionalArgumentsDoc()) + .SetDoc(GET_OP_DOC_STR( + std::string(LSTM_ver7_doc) + GenerateOptionalArgumentsDoc())) .Attr( "activations", "A list of 3 (or 6 if bidirectional) activation functions " @@ -445,7 +448,7 @@ ONNX_OPERATOR_SET_SCHEMA( "be one of the activation functions specified above. 
Optional: See the equations " "for default if not specified.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "input_forget", "Couple the input and forget gates if 1.", diff --git a/onnx/defs/rnn/old.cc b/onnx/defs/rnn/old.cc index 28c0861328e..a75858b3f2a 100644 --- a/onnx/defs/rnn/old.cc +++ b/onnx/defs/rnn/old.cc @@ -13,21 +13,21 @@ std::function RNNDocGeneratorOld(const char* /*name*/) { "hidden_size", "Number of neurons in the hidden layer", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_alpha", "Optional scaling values used by some activation functions. The values " "are consumed in the order of activation functions, for example (f, g, h) " "in LSTM.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_beta", "Optional scaling values used by some activation functions. The values " "are consumed in the order of activation functions, for example (f, g, h) " "in LSTM.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_sequence", "The sequence output for the hidden is optional if 0. Default 0.", @@ -39,7 +39,7 @@ std::function RNNDocGeneratorOld(const char* /*name*/) { "in the range of [-threshold, +threshold] and is applied to the input " "of activations. No clip if not specified.", AttributeProto::FLOAT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input( 0, "X", @@ -171,7 +171,7 @@ ONNX_OPERATOR_SET_SCHEMA( "of the activation functions specified above. Optional: See the equations " "for default if not specified.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Input( 1, "W", @@ -266,7 +266,7 @@ std::function RNNDocGenerator1(const char* /*name*/) { "hidden_size", "Number of neurons in the hidden layer", AttributeProto::INT, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_alpha", "Optional scaling values used by some activation functions. The values " @@ -274,14 +274,14 @@ std::function RNNDocGenerator1(const char* /*name*/) { "in LSTM. Default values are the same as of corresponding ONNX operators." "For example with LeakyRelu, the default alpha is 0.01.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "activation_beta", "Optional scaling values used by some activation functions. The values " "are consumed in the order of activation functions, for example (f, g, h) " "in LSTM. Default values are the same as of corresponding ONNX operators.", AttributeProto::FLOATS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "output_sequence", "The sequence output for the hidden is optional if 0. Default 0.", @@ -293,7 +293,7 @@ std::function RNNDocGenerator1(const char* /*name*/) { "in the range of [-threshold, +threshold] and is applied to the input " "of activations. No clip if not specified.", AttributeProto::FLOAT, - OPTIONAL); + OPTIONAL_VALUE); schema.Input( 0, "X", @@ -527,7 +527,7 @@ ONNX_OPERATOR_SET_SCHEMA( "of the activation functions specified above. Optional: See the equations " "for default if not specified.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "linear_before_reset", "When computing the output of the hidden gate, " @@ -655,7 +655,7 @@ ONNX_OPERATOR_SET_SCHEMA( "be one of the activation functions specified above. 
Optional: See the equations " "for default if not specified.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "input_forget", "Couple the input and forget gates if 1, default 0.", diff --git a/onnx/defs/schema.cc b/onnx/defs/schema.cc index 01d0829b23e..fb32c8518ec 100644 --- a/onnx/defs/schema.cc +++ b/onnx/defs/schema.cc @@ -28,36 +28,6 @@ DbgOperatorSetTracker& DbgOperatorSetTracker::Instance() { } #endif -OpSchema::FormalParameter::FormalParameter( - std::string name, - DataTypeSet allowed_type_set, - std::string type_str, - std::string description, - FormalParameterOption param_option, - bool is_homogeneous, - int min_arity) - : name_(std::move(name)), - type_set_(std::move(allowed_type_set)), - type_str_(std::move(type_str)), - description_(std::move(description)), - param_option_(param_option), - is_homogeneous_(is_homogeneous), - min_arity_(min_arity) {} - -OpSchema::FormalParameter::FormalParameter( - std::string name, - std::string description, - std::string type_str, - FormalParameterOption param_option, - bool is_homogeneous, - int min_arity) - : name_(std::move(name)), - type_str_(std::move(type_str)), - description_(std::move(description)), - param_option_(param_option), - is_homogeneous_(is_homogeneous), - min_arity_(min_arity) {} - const std::string& OpSchema::FormalParameter::GetName() const { return name_; } @@ -441,11 +411,6 @@ OpSchema& OpSchema::SetSupportLevel(SupportType support) { return *this; } -OpSchema& OpSchema::SetDoc(std::string doc) { - doc_ = std::move(doc); - return *this; -} - // Functions to specify name for the operator schema. OpSchema& OpSchema::SetName(std::string name) { name_ = std::move(name); @@ -607,7 +572,7 @@ OpSchema& OpSchema::AllowUncheckedAttributes() { OpSchema& OpSchema::Input( int n, std::string name, - std::string description, + const std::string& description, std::string type_str, OpSchema::FormalParameterOption param_option, bool is_homogeneous, @@ -617,7 +582,11 @@ OpSchema& OpSchema::Input( } inputs_[n] = FormalParameter( std::move(name), - std::move(description), +#ifndef __ONNX_NO_DOC_STRINGS + description, +#else + std::string(), +#endif std::move(type_str), param_option, is_homogeneous, @@ -636,7 +605,11 @@ OpSchema& OpSchema::Input( return Input( n, std::string(name), +#ifndef __ONNX_NO_DOC_STRINGS std::string(description), +#else + std::string(), +#endif std::string(type_str), param_option, is_homogeneous, @@ -646,7 +619,7 @@ OpSchema& OpSchema::Input( OpSchema& OpSchema::Output( int n, std::string name, - std::string description, + const std::string& description, std::string type_str, OpSchema::FormalParameterOption param_option, bool is_homogeneous, @@ -656,7 +629,11 @@ OpSchema& OpSchema::Output( } outputs_[n] = FormalParameter( std::move(name), - std::move(description), +#ifndef __ONNX_NO_DOC_STRINGS + description, +#else + std::string(), +#endif std::move(type_str), param_option, is_homogeneous, @@ -675,7 +652,11 @@ OpSchema& OpSchema::Output( return Output( n, std::string(name), +#ifndef __ONNX_NO_DOC_STRINGS std::string(description), +#else + std::string(), +#endif std::string(type_str), param_option, is_homogeneous, @@ -747,7 +728,7 @@ bool OpSchema::BuildContextDependentFunction( } OpSchema& OpSchema::FunctionBody(const std::vector& func_nodes) { - for (const auto node : func_nodes) { + for (const auto& node : func_nodes) { auto new_node = function_body_.add_node(); new_node->CopyFrom(node); } diff --git a/onnx/defs/schema.h b/onnx/defs/schema.h index 3a2dec8c20e..23cc72530fb 100644 --- 
a/onnx/defs/schema.h +++ b/onnx/defs/schema.h @@ -28,12 +28,13 @@ namespace ONNX_NAMESPACE { struct FunctionBodyBuildContext { virtual const AttributeProto* getAttribute(const std::string& name) const = 0; virtual bool hasInput(int i) const = 0; - virtual bool hasOutput(int i) const = 0; + virtual bool hasOutput(int i) const = 0; virtual ~FunctionBodyBuildContext() {} }; struct FunctionBodyBuildContextImpl : public FunctionBodyBuildContext { - FunctionBodyBuildContextImpl(NodeProto& node_proto) : node_proto_(node_proto) { + FunctionBodyBuildContextImpl(NodeProto& node_proto) + : node_proto_(node_proto) { for (auto& attr : *node_proto.mutable_attribute()) { attributesByName_[attr.name()] = &attr; } @@ -58,17 +59,19 @@ struct FunctionBodyBuildContextImpl : public FunctionBodyBuildContext { if (i >= node_proto_.output_size()) return false; return node_proto_.output(i) != ""; - } + } std::unordered_map attributesByName_; NodeProto node_proto_; }; -using FunctionBodyQueryFunction = std::function; +using FunctionBodyQueryFunction = + std::function; class OpSchema; -using ContextDependentFunctionBodyBuilder = std::function; +using ContextDependentFunctionBodyBuilder = std::function< + bool(const FunctionBodyBuildContext&, const OpSchema&, FunctionProto&)>; class SchemaError final : public std::runtime_error { public: @@ -148,20 +151,39 @@ class OpSchema final { explicit FormalParameter( std::string name, - DataTypeSet type_set, + DataTypeSet allowed_type_set, std::string type_str, - std::string description, + const std::string& description, FormalParameterOption param_option = Single, bool is_homogeneous = true, - int min_arity = 1); + int min_arity = 1) + : name_(std::move(name)), + type_set_(std::move(allowed_type_set)), + type_str_(std::move(type_str)), +#ifndef __ONNX_NO_DOC_STRINGS + description_(description), +#endif + param_option_(param_option), + is_homogeneous_(is_homogeneous), + min_arity_(min_arity) { + } explicit FormalParameter( std::string name, - std::string description, + const std::string& description, std::string type_str, FormalParameterOption param_option = Single, bool is_homogeneous = true, - int min_arity = 1); + int min_arity = 1) + : name_(std::move(name)), + type_str_(std::move(type_str)), +#ifndef __ONNX_NO_DOC_STRINGS + description_(description), +#endif + param_option_(param_option), + is_homogeneous_(is_homogeneous), + min_arity_(min_arity) { + } // Get formal parameter name. const std::string& GetName() const; @@ -329,7 +351,14 @@ class OpSchema final { return *this; } - OpSchema& SetDoc(std::string doc); + OpSchema& SetDoc(const std::string& doc) { +#ifndef __ONNX_NO_DOC_STRINGS + doc_ = doc; +#else + ONNX_UNUSED_PARAMETER(doc); +#endif + return *this; + } // Functions to specify name for the operator schema. 
OpSchema& SetName(const char* name); @@ -464,7 +493,7 @@ class OpSchema final { OpSchema& Input( int n, std::string name, - std::string description, + const std::string& description, std::string type_str, FormalParameterOption param_option = Single, bool is_homogeneous = true, @@ -483,7 +512,7 @@ class OpSchema final { OpSchema& Output( int n, std::string name, - std::string description, + const std::string& description, std::string type_str, FormalParameterOption param_option = Single, bool is_homogeneous = true, @@ -663,7 +692,9 @@ class OpSchema final { } OpSchema& FunctionBody(const std::vector& func_nodes); - OpSchema& FunctionBody(const std::vector& func_nodes, const std::vector& opsets); + OpSchema& FunctionBody( + const std::vector& func_nodes, + const std::vector& opsets); const FunctionProto* GetFunction() const; @@ -671,9 +702,12 @@ class OpSchema final { return functionBuilder_ != nullptr; } - OpSchema& SetContextDependentFunctionBodyBuilder(ContextDependentFunctionBodyBuilder); - - bool BuildContextDependentFunction(const FunctionBodyBuildContext& ctx, FunctionProto& functionProto) const; + OpSchema& SetContextDependentFunctionBodyBuilder( + ContextDependentFunctionBodyBuilder); + + bool BuildContextDependentFunction( + const FunctionBodyBuildContext& ctx, + FunctionProto& functionProto) const; // Verifies that the schema is valid and all specifications are compatible. // It will also parse all type strings specified for inputs/outputs into valid @@ -946,7 +980,8 @@ OpSchema GetOpSchema(); ONNX_OPERATOR_SET_SCHEMA_EX(name, OnnxML, AI_ONNX_ML_DOMAIN, ver, true, impl) #define ONNX_TRAINING_OPERATOR_SET_SCHEMA(name, ver, impl) \ - ONNX_OPERATOR_SET_SCHEMA_EX(name, OnnxTraining, AI_ONNX_TRAINING_DOMAIN, ver, true, impl) + ONNX_OPERATOR_SET_SCHEMA_EX( \ + name, OnnxTraining, AI_ONNX_TRAINING_DOMAIN, ver, true, impl) // Defines specialization of GetOpSchema for a class whose name is determined // based on a convention using name, domain, and version. Operator schema are @@ -1044,4 +1079,49 @@ inline std::string GenerateBroadcastingDocUni( return ret; } +/* + * Macros for setting operator documentation + * Use this macro for simple SetDoc() calls that generate documentation + * directly. This is the macro to use in almost all cases. + * Sample usage guidelines: + * const char* doc_str = "foo"; + * SetDoc(GET_OP_DOC_STR(doc_str)) + * + * SetDoc(GET_OP_DOC_STR( + std::string(BitShift_ver11_doc) + GenerateBroadcastingDocMul())) + */ +#ifndef __ONNX_NO_DOC_STRINGS +#define GET_OP_DOC_STR(doc_str) (doc_str) +#else +#define GET_OP_DOC_STR(doc_str) ("") +#endif + +/* + * Use this macro when the documentation needs to be populated in some + * complicated way like string substitutions, etc before calling SetDoc. + * Sample usage guidelines: + std::string doc; + POPULATE_OP_DOC_STR( + doc = R"DOC( +Returns the tensor resulted from performing the `{name}` logical operation +elementwise on the input tensors `A` and `B` (with Numpy-style broadcasting +support). 
+ +{broadcast_doc} +)DOC"; + ReplaceAll(doc, "{name}", name); + ReplaceAll( + doc, "{broadcast_doc}", GenerateBroadcastingDocMul().c_str());); + schema.SetDoc(doc); + * + */ +#ifndef __ONNX_NO_DOC_STRINGS +#define POPULATE_OP_DOC_STR(DocPopulatorCode) \ + do { \ + DocPopulatorCode \ + } while (0) +#else +#define POPULATE_OP_DOC_STR(DocPopulatorCode) +#endif + } // namespace ONNX_NAMESPACE diff --git a/onnx/defs/sequence/defs.cc b/onnx/defs/sequence/defs.cc index 571729cfb7b..fd4e5fd562f 100644 --- a/onnx/defs/sequence/defs.cc +++ b/onnx/defs/sequence/defs.cc @@ -23,7 +23,7 @@ ONNX_OPERATOR_SET_SCHEMA( "(Optional) The data type of the tensors in the output sequence. " "The default type is 'float'.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Output( 0, "output", diff --git a/onnx/defs/tensor/defs.cc b/onnx/defs/tensor/defs.cc index bc4b3afc2bb..66f54df264a 100644 --- a/onnx/defs/tensor/defs.cc +++ b/onnx/defs/tensor/defs.cc @@ -147,12 +147,12 @@ num_blocks[d] = floor((input_spatial_shape[d] + 2 * padding[d] - dilation[d] * ( "dilations", "Dilation value along each spatial axis of the extracted blocks. If not present, the dilation defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "strides", "Stride along each spatial axis of the input image. If not present, the stride defaults to 1 along each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.Attr( "pads", "Padding for the beginning and ending along each spatial axis, it can take any value greater " @@ -162,7 +162,7 @@ num_blocks[d] = floor((input_spatial_shape[d] + 2 * padding[d] - dilation[d] * ( "added at the beginning of axis `i` and xi_end, the number of pixels added at " "the end of axis `i`. If not present, the padding defaults to 0 along start and end of each spatial axis.", AttributeProto::INTS, - OPTIONAL); + OPTIONAL_VALUE); schema.TypeAndShapeInferenceFunction([](InferenceContext& ctx) { propagateElemTypeFromInputToOutput(ctx, 0, 0); unfoldToDepthShapeInference(ctx); @@ -566,7 +566,7 @@ ONNX_OPERATOR_SET_SCHEMA( "where r = rank(input).", AttributeProto::INT, static_cast(0)) - .Attr("split", "length of each output. Values should be >= 0.", AttributeProto::INTS, OPTIONAL) + .Attr("split", "length of each output. Values should be >= 0.", AttributeProto::INTS, OPTIONAL_VALUE) .SetDoc(Split_ver11_doc) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { for (int i = 0; i < static_cast(ctx.getNumOutputs()); ++i) { @@ -912,7 +912,7 @@ ONNX_OPERATOR_SET_SCHEMA( "A list of integers. By default, reverse the dimensions, " "otherwise permute the axes according to the values given.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "data", "An input tensor.", "T") .Output(0, "transposed", "Transposed output.", "T") .TypeConstraint( @@ -1464,7 +1464,7 @@ ONNX_OPERATOR_SET_SCHEMA( "List of integers indicating the dimensions to squeeze. Negative value means counting dimensions " "from the back. Accepted range is [-r, r-1] where r = rank(data).", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(Squeeze_ver11_doc) .Input(0, "data", "Tensors with at least max(dims) dimensions.", "T") .Output(0, "squeezed", "Reshaped tensor with same data as input.", "T") @@ -2007,7 +2007,7 @@ ONNX_OPERATOR_SET_SCHEMA( "input is flattened before elements being selected. Negative value means counting dimensions " "from the back. 
Accepted range is [-r, r-1] where r = rank(input).", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "input", "Tensor of rank r >= 1.", "T") .Input( 1, @@ -2514,7 +2514,7 @@ ONNX_OPERATOR_SET_SCHEMA( "flattened input are returned. Negative value means counting dimensions " "from the back. Accepted range is [-r, r-1] where r = rank(input).", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "X", "A N-D input tensor that is to be processed.", "T") .Output( 0, diff --git a/onnx/defs/tensor/old.cc b/onnx/defs/tensor/old.cc index 2c5af8d04a5..2c17914c3f4 100644 --- a/onnx/defs/tensor/old.cc +++ b/onnx/defs/tensor/old.cc @@ -136,7 +136,7 @@ ONNX_OPERATOR_SET_SCHEMA( "axis", "Which axis to concat on. Default value is 1.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(Concat_ver1_doc) .Input( 0, @@ -251,8 +251,8 @@ ONNX_OPERATOR_SET_SCHEMA( "T", {"tensor(float16)", "tensor(float)", "tensor(double)"}, "Constrain input types to float tensors.") - .Attr("axis", "Which axis to split on", AttributeProto::INT, OPTIONAL) - .Attr("split", "length of each output", AttributeProto::INTS, OPTIONAL) + .Attr("axis", "Which axis to split on", AttributeProto::INT, OPTIONAL_VALUE) + .Attr("split", "length of each output", AttributeProto::INTS, OPTIONAL_VALUE) .SetDoc(Split_ver1_doc)); static const char* Pad_ver1_doc = R"DOC( @@ -318,7 +318,7 @@ ONNX_OPERATOR_SET_SCHEMA( 1, OpSchema() .SetDoc(Reshape_ver1_doc) - .Attr("shape", "New shape", AttributeProto::INTS, OPTIONAL) + .Attr("shape", "New shape", AttributeProto::INTS, OPTIONAL_VALUE) // This attribute was added via AllowConsumed API in OpSchema. // After removing the API, we're now using the Attr API to simulate the // old definition. @@ -326,7 +326,7 @@ ONNX_OPERATOR_SET_SCHEMA( "consumed_inputs", "legacy optimization attribute.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "data", "An input tensor.", "T") .Output(0, "reshaped", "Reshaped data.", "T") .TypeConstraint( @@ -602,7 +602,7 @@ ONNX_OPERATOR_SET_SCHEMA( "It's optional. If not present, will be treated as " "[0, 1, ..., len(`starts`) - 1].", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "starts", "Starting indices of corresponding axis in `axes`", @@ -1167,7 +1167,7 @@ ONNX_OPERATOR_SET_SCHEMA( "axes", "List of non-negative integers, indicate the dimensions to squeeze.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .SetDoc(Squeeze_ver1_doc) .Input(0, "data", "Tensors with at least max(dims) dimensions.", "T") .Output(0, "squeezed", "Reshaped tensor with same data as input.", "T") @@ -1450,7 +1450,7 @@ ONNX_OPERATOR_SET_SCHEMA( "(Optional) Axis along which to take slices. If not specified, " "input is flattened before elements being selected.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Input(0, "input", "Tensor of rank r >= 1.", "T") .Input( 1, @@ -1500,7 +1500,7 @@ ONNX_OPERATOR_SET_SCHEMA( "Which axis to split on. 
", AttributeProto::INT, static_cast(0)) - .Attr("split", "length of each output", AttributeProto::INTS, OPTIONAL) + .Attr("split", "length of each output", AttributeProto::INTS, OPTIONAL_VALUE) .SetDoc(Split_ver2_doc) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { for (int i = 0; i < static_cast(ctx.getNumOutputs()); ++i) { diff --git a/onnx/defs/traditionalml/defs.cc b/onnx/defs/traditionalml/defs.cc index 696e3e8cb8d..ad1496d6d66 100644 --- a/onnx/defs/traditionalml/defs.cc +++ b/onnx/defs/traditionalml/defs.cc @@ -146,12 +146,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "cats_strings", "The strings of the map. This sequence must be the same length as the 'cats_int64s' sequence", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "cats_int64s", "The integers of the map. This sequence must be the same length as the 'cats_strings' sequence.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "default_string", "A string to use when an input integer value is not found in the map.
One and only one of the 'default_*' attributes must be defined.", @@ -217,12 +217,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "string_vocabulary", "A string vocabulary array.
One and only one of the vocabularies must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "int64_vocabulary", "An integer vocabulary array.
One and only one of the vocabularies must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { auto input_elem_type = ctx.getInputType(0) ->map_type() @@ -267,7 +267,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "inputdimensions", "The size of each input in the input list", AttributeProto::INTS, - OPTIONAL)); + OPTIONAL_VALUE)); static const char* Imputer_ver1_doc = R"DOC( Replaces inputs that equal one value with another, leaving all other elements alone.
@@ -298,7 +298,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "imputed_value_floats", "Value(s) to change to", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "replaced_value_float", "A value that needs replacing.", @@ -308,7 +308,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "imputed_value_int64s", "Value(s) to change to.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "replaced_value_int64", "A value that needs replacing.", @@ -354,28 +354,28 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "keys_strings", "A list of strings. One and only one of 'keys_*'s should be set.", AttributeProto::STRINGS, - OPTIONAL) - .Attr("keys_int64s", "A list of ints.", AttributeProto::INTS, OPTIONAL) + OPTIONAL_VALUE) + .Attr("keys_int64s", "A list of ints.", AttributeProto::INTS, OPTIONAL_VALUE) .Attr( "keys_floats", "A list of floats.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "values_strings", "A list of strings. One and only one of 'value_*'s should be set.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "values_int64s", "A list of ints.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "values_floats", "A list of floats.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "default_string", "A string.", @@ -493,7 +493,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "intercepts", "A collection of intercepts.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "multi_class", "Indicates whether to do OvR or multinomial (0=OvR is the default).", @@ -503,12 +503,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "classlabels_strings", "Class labels when using string labels. One and only one 'classlabels' attribute must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "classlabels_ints", "Class labels when using integer labels. One and only one 'classlabels' attribute must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "post_transform", "Indicates the transform to apply to the scores vector.
One of 'NONE,' 'SOFTMAX,' 'LOGISTIC,' 'SOFTMAX_ZERO,' or 'PROBIT'", @@ -607,12 +607,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "coefficients", "Weights of the model(s).", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "intercepts", "Weights of the intercepts, if used.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "targets", "The total number of regression targets, 1 if not defined.", @@ -690,12 +690,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "cats_int64s", "List of categories, ints.
One and only one of the 'cats_*' attributes must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "cats_strings", "List of categories, strings.
One and only one of the 'cats_*' attributes must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "zeros", "If true and a category is not present, will return all zeros; if false and a category is not found, the operator will fail.", @@ -724,12 +724,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "offset", "First, offset by this.
Can be length of features in an [N,F] tensor or length 1, in which case it applies to all features, regardless of dimension count.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "scale", "Second, multiply by this.
Can be length of features in an [N,F] tensor or length 1, in which case it applies to all features, regardless of dimension count.
Must be same length as 'offset'", AttributeProto::FLOATS, - OPTIONAL)); + OPTIONAL_VALUE)); static const char* SVMClassifier_ver1_doc = R"DOC( Support Vector Machine classifier @@ -767,21 +767,21 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "kernel_params", "List of 3 elements containing gamma, coef0, and degree, in that order. Zero if unused for the kernel.", AttributeProto::FLOATS, - OPTIONAL) - .Attr("vectors_per_class", "", AttributeProto::INTS, OPTIONAL) - .Attr("support_vectors", "", AttributeProto::FLOATS, OPTIONAL) - .Attr("coefficients", "", AttributeProto::FLOATS, OPTIONAL) + OPTIONAL_VALUE) + .Attr("vectors_per_class", "", AttributeProto::INTS, OPTIONAL_VALUE) + .Attr("support_vectors", "", AttributeProto::FLOATS, OPTIONAL_VALUE) + .Attr("coefficients", "", AttributeProto::FLOATS, OPTIONAL_VALUE) .Attr( "prob_a", "First set of probability coefficients.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "prob_b", "Second set of probability coefficients. This array must be same size as prob_a.
If these are provided then output Z are probability estimates, otherwise they are raw scores.", AttributeProto::FLOATS, - OPTIONAL) - .Attr("rho", "", AttributeProto::FLOATS, OPTIONAL) + OPTIONAL_VALUE) + .Attr("rho", "", AttributeProto::FLOATS, OPTIONAL_VALUE) .Attr( "post_transform", "Indicates the transform to apply to the score.
One of 'NONE,' 'SOFTMAX,' 'LOGISTIC,' 'SOFTMAX_ZERO,' or 'PROBIT'", @@ -791,12 +791,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "classlabels_strings", "Class labels if using string labels.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "classlabels_ints", "Class labels if using integer labels.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { std::vector label_strs; auto result = @@ -841,12 +841,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "kernel_params", "List of 3 elements containing gamma, coef0, and degree, in that order. Zero if unused for the kernel.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "support_vectors", "Chosen support vectors", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "one_class", "Flag indicating whether the regression is a one-class SVM or not.", @@ -856,7 +856,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "coefficients", "Support vector coefficients.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "n_supports", "The number of support vectors.", @@ -867,7 +867,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "Indicates the transform to apply to the score.
One of 'NONE,' 'SOFTMAX,' 'LOGISTIC,' 'SOFTMAX_ZERO,' or 'PROBIT.'", AttributeProto::STRING, std::string("NONE")) - .Attr("rho", "", AttributeProto::FLOATS, OPTIONAL)); + .Attr("rho", "", AttributeProto::FLOATS, OPTIONAL_VALUE)); static const char* TreeEnsembleClassifier_ver1_doc = R"DOC( Tree Ensemble classifier. Returns the top class for each of N inputs.
@@ -908,77 +908,77 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "nodes_treeids", "Tree id for each node.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_nodeids", "Node id for each node. Ids may restart at zero for each tree, but it is not required to.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_featureids", "Feature id for each node.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_values", "Thresholds to do the splitting on for each node.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_hitrates", "Popularity of each node, used for performance and may be omitted.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_modes", "The node kind, that is, the comparison to make at the node. There is no comparison to make at a leaf node.
One of 'BRANCH_LEQ', 'BRANCH_LT', 'BRANCH_GTE', 'BRANCH_GT', 'BRANCH_EQ', 'BRANCH_NEQ', 'LEAF'", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_truenodeids", "Child node if expression is true.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_falsenodeids", "Child node if expression is false.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_missing_value_tracks_true", "For each node, define what to do in the presence of a missing value: if a value is missing (NaN), use the 'true' or 'false' branch based on the value in this array.
This attribute may be left undefined, and the default value is false (0) for all nodes.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "class_treeids", "The id of the tree that this node is in.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "class_nodeids", "Node id that this weight is for.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "class_ids", "The index of the class list that each weight is for.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "class_weights", "The weight for the class in class_id.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "classlabels_strings", "Class labels if using string labels.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "classlabels_int64s", "Class labels if using integer labels.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "post_transform", "Indicates the transform to apply to the score.
One of 'NONE,' 'SOFTMAX,' 'LOGISTIC,' 'SOFTMAX_ZERO,' or 'PROBIT.'", @@ -988,7 +988,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "base_values", "Base values for classification, added to final class score; the size must be the same as the classes or can be left unassigned (assumed 0)", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { std::vector label_strs; auto result = @@ -1033,72 +1033,72 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "nodes_treeids", "Tree id for each node.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_nodeids", "Node id for each node. Node ids must restart at zero for each tree and increase sequentially.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_featureids", "Feature id for each node.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_values", "Thresholds to do the splitting on for each node.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_hitrates", "Popularity of each node, used for performance and may be omitted.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_modes", "The node kind, that is, the comparison to make at the node. There is no comparison to make at a leaf node.
One of 'BRANCH_LEQ', 'BRANCH_LT', 'BRANCH_GTE', 'BRANCH_GT', 'BRANCH_EQ', 'BRANCH_NEQ', 'LEAF'", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_truenodeids", "Child node if expression is true", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_falsenodeids", "Child node if expression is false", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "nodes_missing_value_tracks_true", "For each node, define what to do in the presence of a NaN: use the 'true' (if the attribute value is 1) or 'false' (if the attribute value is 0) branch based on the value in this array.
This attribute may be left undefined and the default value is false (0) for all nodes.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "target_treeids", "The id of the tree that each node is in.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "target_nodeids", "The node id of each weight.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "target_ids", "The index of the target that each weight is for.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "target_weights", "The weight for each target.", AttributeProto::FLOATS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "n_targets", "The total number of targets.", AttributeProto::INT, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "post_transform", "Indicates the transform to apply to the score.
One of 'NONE,' 'SOFTMAX,' 'LOGISTIC,' 'SOFTMAX_ZERO,' or 'PROBIT'", @@ -1113,7 +1113,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "base_values", "Base values for classification, added to final class score; the size must be the same as the classes or can be left unassigned (assumed 0)", AttributeProto::FLOATS, - OPTIONAL)); + OPTIONAL_VALUE)); static const char* ZipMap_ver1_doc = R"DOC( Creates a map from the input and the attributes.
@@ -1137,12 +1137,12 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "classlabels_strings", "The keys when using string keys.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "classlabels_int64s", "The keys when using int keys.
One and only one of the 'classlabels_*' attributes must be defined.", AttributeProto::INTS, - OPTIONAL) + OPTIONAL_VALUE) .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { std::vector classlabels_strings; bool result = getRepeatedAttribute( diff --git a/onnx/defs/traditionalml/old.cc b/onnx/defs/traditionalml/old.cc index 81f7646d0c3..cd3d7dddd37 100644 --- a/onnx/defs/traditionalml/old.cc +++ b/onnx/defs/traditionalml/old.cc @@ -37,7 +37,7 @@ ONNX_ML_OPERATOR_SET_SCHEMA( "classes_strings", "A list of labels.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "default_int64", "An integer to use when an input string value is not found in the map.
One and only one of the 'default_*' attributes must be defined.", diff --git a/onnx/defs/training/defs.cc b/onnx/defs/training/defs.cc index 96d3f7c873e..607f469a752 100644 --- a/onnx/defs/training/defs.cc +++ b/onnx/defs/training/defs.cc @@ -177,7 +177,7 @@ ONNX_TRAINING_OPERATOR_SET_SCHEMA( "intermediate variables) that can be generated from inputs " "cannot be included in this attribute.", AttributeProto::STRINGS, - OPTIONAL) + OPTIONAL_VALUE) .Attr( "y", "The targeted tensor. It can be viewed as the output of the " @@ -461,4 +461,154 @@ ONNX_TRAINING_OPERATOR_SET_SCHEMA( propagateShapeFromInputToOutput(ctx, i_in, i_out); }})); -} // namespace ONNX_NAMESPACE +static const char* Momentum_ver1_doc = R"DOC( + Compute one iteration of stochastic gradient update with momentum. + This operator can conduct the optimization of multiple tensor variables. + + Let's define the behavior of this operator. SG with momentum requires + several parameters: + + - The learning-rate "R". + - The update count "T". That is, the number of conducted training iterations. It should + be zero in the first training iteration. + - An L2-norm regularization coefficient "norm_coefficient". + - A decay coefficient of the previous accumulated gradient (i.e., momentum), "alpha". + - The scaling coefficient of the current gradient, "beta". + - An attribute "mode" that selects whether standard momentum or Nesterov's momentum should + be used. + + For the sake of simplicity, assume that there is only one tensor (called "X") to be optimized. + Other necessary inputs are "X"'s gradient (called "G") and "X"'s momentum (called "V"). This + Momentum operator maps all these inputs to the new value of "X" (called "X_new") and its new + momentum (called "V_new"). + + This operator supports two different momentum algorithms. Set the attribute "mode" to + "nesterov" if Nesterov's momentum is desired. Otherwise, set the attribute "mode" to + "standard" to use standard momentum. Computation details are described subsequently. + + Let "+", "-", "*", and "/" be element-wise operations with numpy-style broadcasting. + + Pseudo code for SG with standard momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized + + // Update X. + X_new = X - R * V_new + + Pseudo code for SG with Nesterov's momentum: + + // Add gradient of 0.5 * norm_coefficient * ||X||^2, where ||X||^2 is the sum of squared + // values of all elements in X. + G_regularized = norm_coefficient * X + G; + + // In the first training iteration, beta should always be 1. + beta_adjusted = T > 0 ? beta : 1 + + // Compute the current momentum based on previous momentum and the current gradient. + V_new = alpha * V + beta_adjusted * G_regularized; + + // Compute final update direction and then update X. + X_new = X - R * (G_regularized + alpha * V_new) + + If this operator is assigned to optimize multiple inputs, for example "X_1" and "X_2", the same + pseudo code is extended to handle all tensors jointly. More specifically, we can view "X" as a + concatenation of "X_1" and "X_2" (of course, their gradients and momentums should + be concatenated too) and then the pseudo code above becomes applicable. 
+)DOC"; + +ONNX_TRAINING_OPERATOR_SET_SCHEMA( + Momentum, + 1, + OpSchema() + .SetDoc(Momentum_ver1_doc) + .Input(0, "R", "The learning rate.", "T1") + .Input(1, "T", "Update count of \"X\". It should be a scalar.", "T2") + .Input( + 2, + "inputs", + "It sequentially contains the current values of optimized tensors, then their " + "gradient tensors, and finally their momentum tensors. For example, if two tensors " + "\"X_1\" and \"X_2\" are optimized, The expected input list would be " + "[\"X_1\", \"X_2\", gradient of \"X_1\", gradient of \"X_2\", momentum of \"X_1\", momentum of \"X_2\"].", + "T3", + OpSchema::Variadic, + false) + .Output( + 0, + "outputs", + "It sequentially contains the new values of optimized tensors and then the new " + "values of their momentum tensors. For example, if two tensors \"X_1\" and \"X_2\" are " + "optimized, the output list would be [new value of \"X_1,\" new value of \"X_2\" " + "new momentum of \"X_1\", new momentum of \"X_2\"].", + "T3", + OpSchema::Variadic, + false) + .Attr( + "alpha", + "The decay factor of momentum. It should be a scalar.", + AttributeProto::FLOAT) + .Attr( + "beta", + "The coefficient of gradient in computing new momentum. It should be a scalar.", + AttributeProto::FLOAT) + .Attr( + "norm_coefficient", + "Coefficient of 0.5 * norm_coefficient * ||X||^2.", + AttributeProto::FLOAT) + .Attr( + "mode", + "Its value should be either \"nesterov\" or \"standard\". The value \"nesterov\" leads " + "to the use of Nesterov's momentum while \"standard\" invokes stochastic gradient method " + "using standard momentum", + AttributeProto::STRING) + .TypeConstraint( + "T1", + {"tensor(float)", "tensor(double)"}, + "Constrain input types to float scalars.") + .TypeConstraint( + "T2", + {"tensor(int64)"}, + "Constrain input types to 64-bit integer scalars.") + .TypeConstraint( + "T3", + {"tensor(float)", "tensor(double)"}, + "Constrain input types to float tensors.") + .TypeAndShapeInferenceFunction([](InferenceContext& ctx) { + // Assume that the input list is [R, T, X1, X2, G1, G2, V1, V2] and + // output list is [X1_new, X2_new, V1_new, V2_new] for explaining + // the code below in a simpler way. + + // The count of input tensors excluding "R" and "T". + auto num_adjustable_tensors = ctx.getNumInputs() - 2; + + // Check number of (optimized tensor, gradient, momentum) tuples. + if (num_adjustable_tensors % 3 != 0) + fail_shape_inference( + "The sum of optimized tensor count and momentum tensor count ", + "should be a multiple of 2 in the input list of Momentum operator"); + + // The count of "X1" and "X2". + auto num_optimized_tensors = num_adjustable_tensors / 3; + for (size_t i = 0; i < num_optimized_tensors; ++i){ + // Pass X1's/X2's shapes to X1_new/X2_new. + size_t i_in = 2 + i; + size_t i_out = i; + propagateElemTypeFromInputToOutput(ctx, i_in, i_out); + propagateShapeFromInputToOutput(ctx, i_in, i_out); + // Pass V1's/V2's shapes to V1_new/V2_new. 
+ i_in = 2 + 2 * num_optimized_tensors + i; + i_out = i + num_optimized_tensors; + propagateElemTypeFromInputToOutput(ctx, i_in, i_out); + propagateShapeFromInputToOutput(ctx, i_in, i_out); + } + })); + +} // namespace ONNX_NAMESPACE \ No newline at end of file diff --git a/onnx/optimizer/pass_manager.cc b/onnx/optimizer/pass_manager.cc index 8f1ebe38ba5..7ad365a0636 100644 --- a/onnx/optimizer/pass_manager.cc +++ b/onnx/optimizer/pass_manager.cc @@ -14,7 +14,7 @@ void GeneralPassManager::add(std::shared_ptr pass) { } std::shared_ptr GeneralPassManager::run(Graph& graph) { - for (std::shared_ptr pass : this->passes) { + for (const std::shared_ptr& pass : this->passes) { auto pass_analysis = pass->runPass(graph); } return std::shared_ptr(new EmptyPassManagerAnalysis()); @@ -25,7 +25,7 @@ std::shared_ptr FixedPointPassManager::run(Graph& graph) { do { fixed_point_optimization_done = false; - for (std::shared_ptr pass : this->passes) { + for (const std::shared_ptr& pass : this->passes) { std::shared_ptr analysis = pass->runPass(graph); if (pass->getPassAnalysisType() == PassAnalysisType::Empty) { continue; diff --git a/onnx/shape_inference/implementation.cc b/onnx/shape_inference/implementation.cc index 237ca6c22e2..f1a0b16c68e 100644 --- a/onnx/shape_inference/implementation.cc +++ b/onnx/shape_inference/implementation.cc @@ -354,12 +354,12 @@ void InferShapeForFunctionNode( NodeProto copy_n(n); // Add attribute information into the temporary node copy_n.clear_attribute(); - for (auto attr : n.attribute()) { + for (const auto& attr : n.attribute()) { if (attr.has_ref_attr_name()) { if (attr_map.count(attr.ref_attr_name())) { auto copy_attr = *attr_map[attr.ref_attr_name()]; copy_attr.set_name(attr.name()); - copy_n.add_attribute()->CopyFrom(std::move(copy_attr)); + copy_n.add_attribute()->CopyFrom(copy_attr); } } else { copy_n.add_attribute()->CopyFrom(attr); diff --git a/onnx/test/shape_inference_test.py b/onnx/test/shape_inference_test.py index 9e6e121c84a..24df2c4797c 100644 --- a/onnx/test/shape_inference_test.py +++ b/onnx/test/shape_inference_test.py @@ -2893,6 +2893,47 @@ def test_adagrad_multiple(self): # type: () -> None make_tensor_value_info('H2_new', TensorProto.FLOAT, (3, 4))], opset_imports=[helper.make_opsetid('', 12), helper.make_opsetid('ai.onnx.training', 1)]) + def test_momentum(self): # type: () -> None + graph = self._make_graph( + [('R', TensorProto.FLOAT, ()), # scalar's shape is () + ('T', TensorProto.INT64, ()), # scalar's shape is () + ('X', TensorProto.FLOAT, (1, 2)), + ('G', TensorProto.FLOAT, (1, 2)), + ('V', TensorProto.FLOAT, (1, 2))], + [make_node('Momentum', ['R', 'T', 'X', 'G', 'V'], ['X_new', 'V_new'], + alpha=0.9, beta=1.0, norm_coefficient=0.02, mode='standard', + domain='ai.onnx.training')], + []) + self._assert_inferred( + graph, + [make_tensor_value_info('X_new', TensorProto.FLOAT, (1, 2)), + make_tensor_value_info('V_new', TensorProto.FLOAT, (1, 2))], + opset_imports=[helper.make_opsetid('', 12), helper.make_opsetid('ai.onnx.training', 1)]) + + def test_momentum_multiple(self): # type: () -> None + graph = self._make_graph( + [('R', TensorProto.FLOAT, ()), # scalar's shape is () + ('T', TensorProto.INT64, ()), # scalar's shape is () + ('X1', TensorProto.FLOAT, (1, 2)), + ('X2', TensorProto.FLOAT, (3, 4)), + ('G1', TensorProto.FLOAT, (1, 2)), + ('G2', TensorProto.FLOAT, (3, 4)), + ('V1', TensorProto.FLOAT, (1, 2)), + ('V2', TensorProto.FLOAT, (3, 4))], + [make_node('Momentum', ['R', 'T', 'X1', 'X2', 'G1', 'G2', 'V1', 'V2'], + ['X1_new', 'X2_new', 
'V1_new', 'V2_new'], + alpha=0.9, beta=1.0, norm_coefficient=0.02, mode='nesterov', + domain='ai.onnx.training')], + []) + + self._assert_inferred( + graph, + [make_tensor_value_info('X1_new', TensorProto.FLOAT, (1, 2)), + make_tensor_value_info('X2_new', TensorProto.FLOAT, (3, 4)), + make_tensor_value_info('V1_new', TensorProto.FLOAT, (1, 2)), + make_tensor_value_info('V2_new', TensorProto.FLOAT, (3, 4))], + opset_imports=[helper.make_opsetid('', 12), helper.make_opsetid('ai.onnx.training', 1)]) + def test_pad_opset10(self): # type: () -> None graph = self._make_graph( [('x', TensorProto.FLOAT, (1, None, 2))], diff --git a/onnx/version_converter/convert.cc b/onnx/version_converter/convert.cc index 256b2e405db..c2e4438ea0c 100644 --- a/onnx/version_converter/convert.cc +++ b/onnx/version_converter/convert.cc @@ -22,8 +22,8 @@ ModelProto DefaultVersionConverter::convert_version( const ModelProto& mp_in, const OpSetID& initial_version, const OpSetID& target_version) const { - const std::string initial_domain = initial_version.domain(); - const std::string target_domain = target_version.domain(); + const std::string& initial_domain = initial_version.domain(); + const std::string& target_domain = target_version.domain(); assertDefaultDomain(initial_domain, target_domain); for (auto it = mp_in.opset_import().begin(); it != mp_in.opset_import() diff --git a/third_party/pybind11 b/third_party/pybind11 index a1041190c8b..80d452484c5 160000 --- a/third_party/pybind11 +++ b/third_party/pybind11 @@ -1 +1 @@ -Subproject commit a1041190c8b8ff0cd9e2f0752248ad5e3789ea0c +Subproject commit 80d452484c5409444b0ec19383faa84bb7a4d351
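For reviewers who want to sanity-check the Momentum spec added above, here is a minimal NumPy sketch of the update rule from Momentum_ver1_doc. It is an illustrative aid, not part of the patch; the helper name `apply_momentum` and the sample values are made up for the example:

```python
import numpy as np

def apply_momentum(r, t, x, g, v, norm_coefficient, alpha, beta, mode="standard"):
    # Gradient of 0.5 * norm_coefficient * ||X||^2 is norm_coefficient * X.
    g_regularized = norm_coefficient * x + g
    # In the first training iteration (T == 0), beta is forced to 1.
    beta_adjusted = beta if t > 0 else 1.0
    # New momentum blends the previous momentum with the regularized gradient.
    v_new = alpha * v + beta_adjusted * g_regularized
    if mode == "standard":
        x_new = x - r * v_new
    else:  # "nesterov"
        x_new = x - r * (g_regularized + alpha * v_new)
    return x_new, v_new

# Mirrors test_momentum: a single (1, 2) tensor updated in "standard" mode.
x = np.array([[1.0, 2.0]], dtype=np.float32)
g = np.array([[0.1, 0.2]], dtype=np.float32)
v = np.zeros((1, 2), dtype=np.float32)
x_new, v_new = apply_momentum(0.1, 0, x, g, v,
                              norm_coefficient=0.02, alpha=0.9, beta=1.0)
```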