[DO NOT MERGE, FOR REVIEW ONLY] ort_training to master with squashed commits #3681
Conversation
* Introduce training changes.
* Enable CI for training.
* Change Tensor::[Set]ByteOffset() to use ptrdiff_t.
* Add back orttraining-linux-gpu-inference-only-ci-pipeline.yml. (#3182)
* Initial implementation of graph cut and pipeline. This is a draft of graph cut and wait/record to demonstrate the cut and Wait/Record design. You can find sub-models and profiling JSON under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline".
* Merged PR 5686: fix P100/fp16 issues.
  1. Fix misaligned address in atomic_add().
  2. Change GatherNDGradKernel to use atomic_add.
  3. Enable/add UTs for GatherNDGrad and reduction_ops using half; __CUDA_ARCH__ won't take effect in .cc code, so leverage HasCudaEnvironment() instead.
  4. Verified convergence graph and perf test; P100 is much slower than V100 on fp16, and fp16/128 needs the batch size reduced from 66 to 64 to avoid an OOM issue.
  5. Verified convergence test on Dev3/V100.
  TBD: broken UTs related to MatmulIntegerOpTest (works on V100/Windows, though).
* Merged PR 5688: Upgrade ONNX submodule to the latest from GitHub ONNX master. We want to implement the SoftmaxCrossEntropy and NegativeLogLikelihoodLoss forward training ops for opset-12, which requires the ONNX submodule to point to the latest commit to pick up the latest ONNX spec.
  - Reverse-integrate changes from *.in.proto files in the GitHub ONNX repo.
  - Regenerate csharp/test/Microsoft.ML.OnnxRuntime.Tests/OnnxMl.cs.
  - Disable ONNX tests that don't have an op implementation for the latest opset.
* Revert change from RelWithDebInfo to Release in OnnxRuntime.CSharp.sln.
* Tweak the dropout calculation.
* Update bert-base convergence values.
* Update License Header (#3212)
* Fix build issues (#3214)
  - Fixed issues with Python and inference-only build.
  - Handle ImportError for training imports.
  - Fix Windows build, compile errors, and CentOS build.
* Use SafeInt for allocation calculation, fix typo.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Register ONNX Training Ops (#3252)
* Add ort_training build status file. (#3257)
* Address PR comments (#3255)
  - Added comment for ntfw_remove().
  - Rewrite WindowsEnv::DeleteFolder(), some other clean-up.
* Address PR comments (#3256): fix paths.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Remove orttraining/tools/scripts/profile directory. (#3268)
* Refactor frontend (#3235)
  - Remove training Python files from the inferencing build.
  - Merge and refactor pybind_state.cc; add a missing forward declaration in ort_pybind_state.cc; passed pytest.
  - Move training_session.py into a subfolder per reviewer's comment; add copyright.
  Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Unit test comments (#3278)
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Fix build break.
* Make gradient clipping configurable. (#3243) Adds a control flag to the C++ and Python frontends; fix a pybind issue introduced by merge.
* Implement pipeline event generator (#3206) with a OneFWOneBW schedule in the timeline. Each pipeline stage contains the FW and BW of a subset of the model, scheduled in one worker thread per microbatch.
* Aggregated Send/Recv (#3232)
  - Add scalar check and Send's input type checking.
  - Swap order in the implementation after the spec changed.
  - CR refinements, typo fixes, reformatting.
* Update ort_trainer.py with lazy ONNX export (#3244)
  - Delay ONNX export to avoid extra info.
  - Handle cases where an ONNX model is provided at initialization.
  - Fix rebase and Python errors.
* Add bias correction in Adam & Lamb for C++ frontend & Python frontend (#3301): move env to .cc file; fix Windows build.
* Address PR comments (#3312); disable logging.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Expose frozen_weights in PyTorch frontend (#3317)
* Addressing PR comments (#3334); error out on bad shape.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Support Hugging Face's AdamW (#3318): add a weight decay mode to support both PyTorch's and Hugging Face's AdamW.
* Implement WhereGrad (#3343)
* Don't cast to fp16 in LayerNormGrad (#3328)
  Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Address PR comments (#3352): revert code for a couple of comments; add a negative test case.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Address master merge PR comments (#3348): addresses several comments on #3174.
* Fix code-base after breaking API changes.
* Add pipeline graph split script (#3275)
  - Pipeline graph cut: add element type, input wait event, and shape info.
  - Shape inference; support multiple cuts; format script; address feedback.
* Fix InferenceSession API.
* Update op domains and versions (#3356) for the Nccl ops, ZeroGradient, InPlaceAccumulator, SoftmaxGrad, LayerNormalizationGrad, BatchNormGrad, IsAllFinite, DivGrad, GatherGrad, IsFinite, ReduceAllL2, MixedPrecisionScale, AllOp, GroupOp, and ViewOp.
* PR comments (#3374)
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Disable tests (temporary).
* Revert _SliceKernel CUDA implementation.
* Revert Session and InferenceSession implementation.
* Disable GradientCheckerTest tests for GPU/Debug build (#3407)
* Revert "Addressing PR comments (#3334)" (#3412). This reverts commit 131c65d.
* Enable loss scale input from Python frontend (#3327). Made some fixes to enable loss scale to be wired up to ORT from the Python frontend. In particular, addition of loss scaling is now done unconditionally if mixed precision is enabled, and the generated loss scale input name is passed back to the frontend. Also fixed how inputs were added during training graph configuration: Graph::SetInputs() was causing some issues and seems to not be working correctly. Also added some mixed precision Python frontend tests.
  Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Reapply commit 131c65d; fix memory regression issue. (#3423)
* Disable gradient clipping for E2E test.
* View op: new unit tests, and add support for tensor memcpy by offset/size (#3439)
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Frontend test to use random seed (#3209)
* SafeInt for region bytes in BFC arena, and code clean-up (#3447): remove build issue workaround.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Raise rtol to unblock CI (#3457): raise rtol to avoid an expected CI test failure in onnxruntime_test_ort_trainer.py.
* Address comments around BFC arena (#3460): rename setting; TODO comments.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Fix onnxruntime_unittests.cmake after merge.
* Fix dynamicslice.cc after merge.
* Create pipeline for CI frontend tests (#3422): create pipeline for nightly Python front-end e2e tests.
* Rename ONNX OPTIONAL to OPTIONAL_VALUE.
* Get cuda_common.h from master.
* Get onnxruntime/core/providers/cuda/tensor/slice.h from ort_training.
* Get onnxruntime/contrib_ops/cuda/bert/fast_gelu.cc from ort_training.
* Get onnxruntime/core/providers/cuda/cu from ort_training.
* Get onnxruntime/core/providers/cuda/math/matmul_integer.cc from ort_training.
* Remove FastGelu from activations.
* Fixes for Where, ConcatGrad and ReduceSumGrad (#3415): roll back the Expand and CPU Expand changes; add tests for reduce grad; fix after merge.
  Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Remove orttraining/docker directory. (#3476) The docker images are not publicly available yet. Addressing PR comment: #3174 (comment)
* Put dropout_default, dropout_random, celu back in the list of broken tests.
* Fix internal loss scale (#3483): changed internal loss scale to 1-D; added test.
  Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
* Publish unit test results from Linux and Mac builds (#3480): added a publish-test-results step; fix test result file pattern.
* Add to list of failing backend tests from master.
* Get cudnn_common.cc from master; remove usage of DeviceProp (which is removed in ort_training) from cudnn_common.cc.
* Put back SubmoduleCheckoutMode parameter into mac-ci.yml.
* Update Graph SetInputs and SetOutputs for training (#3446). Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction ambiguous: even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.
* Fix fp16 type mismatch when graph output is an fp32-only node (#3411): verify the output node before changing its type in mixed precision mode.
* Remove cast to OpKernelContextInternal to get the thread pool; directly use OpKernelContext. (#3523)
* MaxBatchSize E2E test (#3454); update test data snapshot.
* Add Python API to set the random seed: onnxruntime.seed(<seed>); renamed to onnxruntime.set_seed(<seed>).
* Address PR comments and clean up. (#3536) Addresses comments on #3174.
* Put the safeint_interface include directory into onnxruntime_common's interface include directories to simplify usage by other targets. (#3546)
* SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (#3465)
  - Update ONNX submodule commit to the latest.
  - Add more tests and misc clean-up; revert unintended changes; address PR feedback.
* ORT training README (#3404): added README for ORT Training.
* Fix GraphTest.UnusedValueInfoSerializes.
* Add SafeInt include to WinML targets (#3558). Fixes Windows builds on the ort_training branch in preparation for the merge to master. SafeInt (included via onnxruntime/core/common/safeint.h) was recently made a dependency of onnxruntime/core/framework/bfc_arena.h, which requires consumers of bfc_arena to compile with the SafeInt include directory.
* Disable or update flaky tests; improve test random seed accessibility. (#3495)
  - Output the test random seed.
  - Allow setting the test random seed with an environment variable.
  - Disable or relax tolerance for flaky tests.
* Subgraph type override handling and unit test (#3560): re-wire input properly to subgraph; update args.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Clean up docs. (#3579): fix orttraining/README.md formatting; delete ORT_TRAINING_BUILDS.md; fix typo.
* Support ONNX test version parsing from path on Windows in onnx_test_runner. (#3588)
* Add front-end MNIST test (#3231)
  - Use torch nightly with torchvision; experiments with install_deps.sh and install_ubuntu.sh.
  - Use Ubuntu 18.04 and Python 3.6 for CI; update cmake version; install MPI; later go back to Ubuntu 16.04 for CI, fixing installs from the deadsnakes PPA; get the default Ubuntu Python version back to 3.5.
  - Increase tolerance and update expected values for the MNIST test.
  - Update ort_trainer.py from ort_training; add underscore to the opset_version parameter name in the ORTTrainer constructor.
  - Move loss/model wrap before the call for sample output.
  Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
  Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
* Sync onnx_backend_test_series.py disabled tests (#3603): make the set of disabled tests consistent between ort_training and master; fix some regex patterns.
* Fix merge issue.
* Fix GraphTransformationTests tests.
* Revert "Convert Gelu to use TryParallelFor (#3599)". This reverts commit 2579a72.
* Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky.
* Add --enable_onnx_tests to Windows builds to allow set-up of the test data directory; add --skip_onnx_tests to orttraining Windows builds.
* Update optimizer domain and opset (#3602) for SGD, Adam, and Lamb.
* Remove Windows CUDA 9 build definition and helper scripts. (#3615)
* Eliminate useless Cast during transformer. (#3606): check if the graph can remove the node.
  Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Clean up OPTIONAL name conflict workarounds in ort_training. (#3622): clean up unnecessary onnx_protobuf.h header files.
  Co-authored-by: Sherlock Huang
* Add Lamb shape inference (#3634)
* Refactor code related to WARP_SIZE. (#3623)
  1. Centralize its definition in common.cuh.
  2. Rename it to GPU_WARP_SIZE, which can be extended to AMD GPUs later.
  3. Centralize warp shuffle functions.
  Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Fixes for ort_trainer.py to resume from checkpoint (#3510): define self.state_dict_ during init; add unit test for restoring from checkpoint; fix file not found.
  Co-authored-by: suffian khan <sukha@microsoft.com>
* Add check for nullptr in PlannerImpl::FindReusableTensor(). (#3619)
* Expose training session so the training app can register custom kernels and transformers (#3642)
  Co-authored-by: Cheng Tang <chenta@microsoft.com>
* Expand elimination and Expand gradient. (#3610): check if the graph can remove the node; fix test break.
  Co-authored-by: Vincent Wang <weicwang@microsoft.com>
* Try not to modify base name (#3638)
* GatherElementsGrad kernels (#3627): CUDA kernel and tests; fix include path.
* Add pipeline transformer for wait/record node (#3513)
  - Add record/wait for the first stage and update the split script.
  - Make recv/send signal an initializer; unify input and initializer.
  - Address feedback, bug fixes, Windows build fix.
* Fixed MNIST bug (#3569): fixed train_step param.
* Simplify and clean code (#3655)
  1. It is not necessary to include cudnn_common.h for kernels that are not implemented with cuDNN.
  2. Minor change in the layer norm kernel to simplify the code and resolve a build warning.
  Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
* Change CentOS build to use agent pool because builds on hosted agents run out of disk space. (#3662)
* Disable broken test in DML (#3666): temporarily disable LSTM_Seq_lens_unpacked for the DML test.
  Co-authored-by: Ethan Tao <ettao@microsoft.com>
* Revert "Try not to modify base name (#3638)". This reverts commit d9641f2, to fix onnx_test_runner test failures.

Co-authored-by: Ke Deng <kedeng@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>
Co-authored-by: Zeeshan Siddiqui <mzs@microsoft.com>
Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: ytaous <4484531+ytaous@users.noreply.github.com>
Co-authored-by: liqunfu <liqun_fu@hotmail.com>
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Xueyun Zhu <xzhu1900@gmail.com>
Co-authored-by: Tixxx <tix@microsoft.com>
Co-authored-by: Li-Wen Chang <30609447+liwchang@users.noreply.github.com>
Co-authored-by: Bowen Bao <semisqg@gmail.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Xueyun Zhu <40807589+xzhu1900@users.noreply.github.com>
Co-authored-by: Weixing Zhang <weixingzhang@users.noreply.github.com>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
Co-authored-by: liqunfu <liqfu@microsoft.com>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
Co-authored-by: Vincent Wang <wangwchpku@outlook.com>
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
Co-authored-by: root <root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: harshitha <havenka@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: manashgoswami <magoswam@microsoft.com>
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: suffiank <suffiankh@gmail.com>
Co-authored-by: suffian khan <sukha@microsoft.com>
Co-authored-by: Tang, Cheng <souptc@gmail.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
Co-authored-by: XiaocenDong <63833153+XiaocenDong@users.noreply.github.com>
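The pipeline event generator commit (#3206) describes a OneFWOneBW (also known as 1F1B) schedule: each stage runs a warm-up of forward microbatches, then alternates forward and backward, then drains the remaining backwards. A minimal sketch of such a per-stage schedule, assuming a simple `num_stages - 1 - stage` warm-up formula and illustrative F/B labels (not ORT's actual scheduler code):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of a OneFWOneBW (1F1B) schedule for one pipeline stage:
// warm-up forwards, then alternating forward/backward per microbatch, then a
// drain of the remaining backward passes. Names and formula are illustrative.
std::vector<std::string> OneFwOneBwSchedule(int stage, int num_stages,
                                            int num_microbatches) {
  std::vector<std::string> ops;
  int warmup = num_stages - 1 - stage;  // earlier stages warm up with more forwards
  int fwd = 0, bwd = 0;
  while (fwd < warmup && fwd < num_microbatches) {
    ops.push_back("F" + std::to_string(fwd++));
  }
  while (bwd < num_microbatches) {
    if (fwd < num_microbatches) {
      ops.push_back("F" + std::to_string(fwd++));
    }
    ops.push_back("B" + std::to_string(bwd++));
  }
  return ops;
}
```

For stage 0 of a 2-stage pipeline with 3 microbatches this yields F0, F1, B0, F2, B1, B2: backwards are interleaved with forwards, which keeps fewer activations live at once than an all-forward-then-all-backward schedule.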
```cpp
if (!p_node_arg) {
  // TODO this should be an error case, needs more investigation
  // https://msdata.visualstudio.com/Vienna/_workitems/edit/724826/
  continue;
```
should we be checking in links to internal work items in the public github?
i think it's ok - i like the searchability of it. is there a good reason not to?
In reply to: 414367443
```cpp
// flaky test - disabled for now
// TODO fix flaky test (https://msdata.visualstudio.com/Vienna/_workitems/edit/596949)
TEST(BatchNormTest, DISABLED_ForwardTrainingTest) {
```
Is it ok to link internal task in github code or should we instead create a github issue for tracking this?
It is not ok. Here are the guidelines https://docs.opensource.microsoft.com/content/releasing/index.html.
Remove sensitive assets.
- Remove any reference in the code to internal or confidential information, including internal paths, tools, codenames, proprietary fonts, internal telemetry and email aliases.
- Remove any trademarks or product icons.
- Prepare the code and commit history for publication, including running PoliCheck.
```cpp
template <typename T>
Status FastGelu<T>::ComputeInternal(OpKernelContext* context) const {
  ORT_RETURN_IF_ERROR(bias_gelu_helper::CheckInputs(context));
```
Please keep the CheckInputs line for input validation. Input validation should be the same for CPU and GPU, so there is no need to duplicate the logic below. If the validation logic needs to change, it is better to update the CheckInputs function. #Resolved
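The point of this comment is that one shared validation helper keeps the CPU and GPU kernels from drifting apart. A hypothetical stand-in for such a helper, using plain dimension vectors instead of the actual `bias_gelu_helper`/`OpKernelContext` API (the checks shown are illustrative, not ORT's real FastGelu validation):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical shared validation helper: both the CPU and the GPU kernel call
// this one function, so the validation logic cannot diverge between providers.
void CheckInputs(const std::vector<std::int64_t>& input_shape,
                 const std::vector<std::int64_t>& bias_shape) {
  if (input_shape.empty()) {
    throw std::invalid_argument("input must have at least one dimension");
  }
  if (bias_shape.size() != 1) {
    throw std::invalid_argument("bias must be 1-D");
  }
  if (bias_shape[0] != input_shape.back()) {
    throw std::invalid_argument("bias length must match the last input dim");
  }
}
```

If a provider needs an extra check, it goes into this helper rather than into one kernel, so both code paths stay consistent.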
```cpp
std::unordered_map<std::string, std::unique_ptr<NodeArg>> node_args_;

// node arg to its producer node
std::unordered_map<std::string, NodeIndex> node_arg_to_producer_node_;
```
I suggest not storing this information, since it can be deduced from the graph. For example, in transformers, a user could delete nodes, add nodes, add a node_arg, or reuse a node_arg in a new node. All of these changes affect the consistency of this map with the graph, and from this PR I did not see changes to the existing transformers to keep it in sync with the graph.
If needed, we can expose a function to generate such a map. It might slow down some tasks, but it will avoid bugs caused by inconsistency.
will address this later, created a work item to track it
In reply to: 414817323
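The reviewer's alternative above, deriving the producer map on demand instead of caching it, can be sketched as follows. The `Node` type and function name here are illustrative stand-ins, not ORT's actual Graph API:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using NodeIndex = std::size_t;

// Minimal stand-in node: just the names of the node args it produces.
struct Node {
  std::vector<std::string> output_args;
};

// Hypothetical sketch of the suggestion: instead of caching a
// node_arg -> producer map that goes stale when transformers add, delete,
// or rewire nodes, derive it on demand from the current node list.
std::unordered_map<std::string, NodeIndex>
BuildProducerMap(const std::vector<Node>& nodes) {
  std::unordered_map<std::string, NodeIndex> producer;
  for (NodeIndex i = 0; i < nodes.size(); ++i) {
    for (const auto& name : nodes[i].output_args) {
      producer[name] = i;  // in a valid graph each node arg has one producer
    }
  }
  return producer;
}
```

The trade-off is exactly the one stated in the thread: an O(nodes) scan per query site instead of an O(1) lookup, in exchange for a map that is always consistent with the graph.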
Introduce training changes.
Enable CI for training.
Change Tensor::[Set]ByteOffset() to use ptrdiff_t.
Add back orttraining-linux-gpu-inference-only-ci-pipeline.yml. (Add back orttraining-linux-gpu-inference-only-ci-pipeline.yml. #3182)
Initial implementation of graph cut and pipeline
This is a draft of graph cut and wait/record to demonstrate cut and Wait/Record design. You may find sub models and profiling json under onnxruntime/test if you run "onnxruntime_test_all --gtest_filter=GradientGraphBuilderTest.TrainingSession_WithPipeline"
TBD - broken UTs related to MatmulIntegerOpTest (works on v100/windows, though)
We want to implement SoftmaxCrossentropy and NegativeLossLikelihoodLoss forward training ops for opset-12 but that requires ONNX submodule to point to the latest commit to have the latest and greatest ONNX spec!
Revert change from RelWithDebInfo to Release in OnnxRuntime.CSharp.sln.
Tweak the dropout calculation.
Update bert-base convergence values
Udpate License Header (Update License in Header files #3212)
Fix build issues (Fix build issues #3214)
Fixed issues with Python and inference-only build.
Handle ImportError for training imports.
fix windows build
fix compile error
fix centos build
fix windows build
fix compile error
Use SafeInt for allocation calculation, fix typo.
Co-authored-by: Ethan Tao ettao@microsoft.com
Register ONNX Training Ops (Register ONNX Training Ops #3252)
Add ort_training build status file. (Add ort_training build status file. #3257)
Address PR comments (Address PR comments #3255)
Added comment for ntfw_remove().
Rewrite WindowsEnv::DeleteFolder(), some other clean up.
Address PR comments (Address PR comments #3256)
comments
fix path
fix path
Co-authored-by: Ethan Tao ettao@microsoft.com
Remove orttraining/tools/scripts/profile directory. (Remove orttraining/tools/scripts/profile directory. #3268)
refactor frontend (refactor frontend #3235)
refactor frontend
remove training python files from inferencing build
update according to reviewer's comments
merge pybind_state.cc
refactor pybind_state.cc
code clean up
missed a forward declaration in ort_pybind_state.cc
passed pytest
move training_session.py into a subfolder per reviewer's comment
add copyright
Co-authored-by: liqun liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Co-authored-by: Ethan Tao ettao@microsoft.com
fix build break
Make gradient clipping configurable. (Make gradient clipping configurable. #3243)
Make gradient clipping configurable.
add control flag to c++ and python frontend
fix pybind issue introduced by merge
Implement pipeline event generator (Implement pipeline event generator #3206)
Implement pipeline event generator with OneFWOneBW schedule in timeline. Each stage of pipeline contains FW and BW of a subset of the model and are scheduled in one worker thread for each microbatch.
Aggregated Send/Recv (Aggregated Send/Recv #3232)
Aggregated Send/Recv
fix typos
CR refine
CR refine
CR refine
Add scalar check.
typo
reformat
CR refine
Forgot to swap order in the implementation after spec changed
CR refine
Cr refine
add Send's input type checking
Update ort_trainer.py with lazy onnx export (Update ort_trainer.py with lazy onnx export #3244)
Delay onnx export to avoid extra info
handle cases where onnx model is provided at initialization
address comments
fix rebase error
fix python error
Add bias correction in Adam & Lamb for C++ frontend & python frontend (Add bias correction in Adam & Lamb for C++ frontend & python frontend #3301)
move env to .cc file
fix windows build
address PR comments (address PR comments #3312)
address PR comments
PR comments
PR comments
disable logging
typo
Co-authored-by: Ethan Tao ettao@microsoft.com
Expose frozen_weights in PyTorch Frontend (Expose frozen_weights in PyTorch Frontend #3317)
Addressing PR comments (Addressing PR comments #3334)
PR comments
PR comments
PR comments
error out bad shape
Co-authored-by: Ethan Tao ettao@microsoft.com
support Huggingface's adamw (support Huggingface's adamw #3318)
add weight decay mode to support both pytorch and huggingface's adamw
Implement WhereGrad (Implement WhereGrad #3343)
Don't cast to fp16 in LayernormGrad (Don't cast to fp16 in LayernormGrad #3328)
Co-authored-by: Weixing Zhang wezhan@microsoft.com
Address PR comments (Address PR comments #3352)
PR comments
revert code for a couple comments
add negative test case
Co-authored-by: Ethan Tao ettao@microsoft.com
Address some comments from #3174.
Fix code-base after breaking API changes
add pipeline graph split script (add pipeline graph split script #3275)
pipeline graph cut
add element type
add input wait event and shape info
shape inference
support multiple cuts
format script
address feedback
address feedback
Fix InferenceSession API
Update Op's Domain and Version (Update Op's Domain and Version #3356)
Update Nccl ops domain opset
Update ZeroGradient Domain OpSet
Update InPlaceAccumulator Domain OpSet
Update SoftmaxGrad Domain and OpSet
Update LayerNormalizationGrad Domain and OpSet
Update BatchNormGrad Domain and Opset
Update IsAllFinite Domain and Opset
Update DivGrad Domain and Opset
Update GatherGrad Domain and Opset
Update IsFinite Domain and OpSet
Update ReduceAllL2 Domain and Opset
Update MixedPrecisionScale Doman and Opset
Update AllOp Domain and Opset
Update GroupOp Domain and OpSet
Update ViewOp Domain and OpSet
PR comments (PR comments #3374)
PR comments
PR comments
PR comments
PR comments
PR comments
PR comments
PR comments
Co-authored-by: Ethan Tao ettao@microsoft.com
Disable tests (temporary)
Revert _SliceKernel cuda implementation
Revert Session and InferenceSession implementation
Disable GradientCheckerTest tests for GPU/Debug build (Remove Windows GPU tests for Debug build #3407)
Disable GradientCheckerTest tests for GPU/Debug build (Remove Windows GPU tests for Debug build #3407)
Revert "Addressing PR comments (Addressing PR comments #3334)" (Revert "Addressing PR comments (#3334)" #3412)
This reverts commit 131c65d.
Made some fixes to enable loss scale to be wired up to ORT from the Python frontend. In particular, now addition of loss scaling is done unconditionally if mixed precision is enabled. The generated loss scale input name is passed back to the frontend.
Also fixed how inputs were added during the training graph configuration. Graph::SetInputs() was causing some issues - it seems to not be working correctly.
Also added some mixed precision Python frontend tests.
Co-authored-by: liqun liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Reapply commit 131c65d; Fix memory regression issue. (Reapply commit 131c65d; Fix memory regression issue. #3423)
Reapply commit 131c65d
fix merge error
Disable gradient clipping for E2E test.
View Op - new unit tests and add support for tensor memcpy by offset/size (View Op - new unit tests and add support for tensor memcpy by offset/size #3439)
view ops UTs
update per comments
PR comments - code clean up
code clean up per comments
Co-authored-by: Ethan Tao ettao@microsoft.com
frontend test to use random seed
safeint for region bytes in bfc arena and code clean up (safeint for region bytes in bfc arena and code clean up #3447)
PR comments
remove build issue workaround
SafeInt for region bytes
fix build
fix build
Co-authored-by: Ethan Tao ettao@microsoft.com
raise rtol to avoid expected CI test failure in onnxruntime_test_ort_trainer.py
Address comments around bfc arena (Address comments around bfc arena #3460)
rename setting
todo comments
fix build
Co-authored-by: Ethan Tao ettao@microsoft.com
Fix onnxruntime_unittests.cmake after merge.
Fix dynamicslice.cc after merge.
create pipeline for ci frontend tests (create pipeline for ci frontend tests #3422)
create pipeline for nightly python front-end e2e tests
Rename ONNX OPTIONAL to OPTIONAL_VALUE.
Get cuda_common.h from master.
Get onnxruntime/core/providers/cuda/tensor/slice.h from ort_training.
Get onnxruntime/contrib_ops/cuda/bert/fast_gelu.cc from ort_training.
Get onnxruntime/core/providers/cuda/cu from ort_training.
Get onnxruntime/core/providers/cuda/math/matmul_integer.cc from ort_training.
Remove FastGelu from activations.
Fixes for Where, ConcatGrad and ReduceSumGrad (Fixes for Where, ConcatGrad and ReduceSumGrad #3415)
Fixes for Expand, Where, ConcatGrad ReduceSumGrad.
Roll back expand, fix, add tests for reduce grad.
Roll back CPU Expand change.
Fix after merge.
Co-authored-by: Vincent Wang weicwang@microsoft.com
The docker images are not publicly available yet.
Addressing PR comment: #3174 (comment)
Put dropout_default, dropout_random, celu back in the list of broken tests.
fix internal loss scale (fix internal loss scale #3483)
Changed internal loss scale to 1-D
added test
Co-authored-by: root root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net
Publish unit test results from Linux and Mac builds (Publish unit test results from Linux and Mac builds #3480)
Added publish test results step to Linux and Mac builds.
Fix test result file pattern.
Add to list of failing backend tests from master.
Get cudnn_common.cc from master.
Remove usage of DeviceProp (which is removed in ort_training) from cudnn_common.cc.
Put back SubmoduleCheckoutMode parameter into mac-ci.yml.
Update Graph SetInputs and SetOutputs for training (Update Graph SetInputs and SetOutputs for training #3446)
Fix training modification of Graph SetInputs() and SetOutputs(). Originally there were distinct code paths in Graph based on whether the graph was loaded from a GraphProto or created from scratch. The training modifications made that distinction a bit ambiguous - i.e., even though the Graph is loaded from a GraphProto for training, sometimes we rely on the other code path, e.g., to deduce the graph inputs after modifying it. Consequently, there was some odd behavior when using SetInputs(). For correctness, this change separates the cases where the graph is loaded from a GraphProto and where it is created from scratch.
Fix fp16 type mismatch when graph output is an fp32-only node (Fix fp16 type mismatch when graph output is an fp32-only node #3411)
verify output node before changing its type in mixed precision mode
Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. (Remove cast to OpKernelContextInternal to get threadpool and directly use OpKernelContext. #3523)
MaxBatchSize E2E Test (MaxBatchSize E2E Test #3454)
max batch size e2e test
Update test data snapshot.
Add Python API to set random seed: onnxruntime.seed()
Rename API to onnxruntime.set_seed()
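The point of a single `set_seed()` entry point is run-to-run reproducibility of stochastic ops such as dropout. The sketch below is a stand-in using the Python standard library, not the actual `onnxruntime.set_seed()` implementation; it only demonstrates why seeding all RNGs from one call site makes repeated runs produce identical random masks.

```python
import random

# Hypothetical stand-in for a global seed API: one call seeds every RNG
# the "runtime" uses, so stochastic ops replay identically across runs.
def set_seed(seed: int) -> None:
    random.seed(seed)

def simulated_dropout_mask(n: int, p: float = 0.5):
    # Each element is kept (1) with probability 1 - p.
    return [1 if random.random() >= p else 0 for _ in range(n)]

set_seed(42)
mask_a = simulated_dropout_mask(8)
set_seed(42)
mask_b = simulated_dropout_mask(8)
assert mask_a == mask_b  # identical seeds -> identical masks
```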
Address PR comments and clean up. (Address PR comments and clean up. #3536)
Address PR comments and clean up.
Put safeint_interface include directory into onnxruntime_common interface include directories to simplify usage by other targets. (Simplify SafeInt include #3546)
SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. (SoftmaxCrossEntropyLoss-12 forward and backward kernel implementation. #3465)
Update ONNX submodule commit to the latest.
build break.
SoftmaxCrossEntropyLoss: Forward and backward kernel implementation.
Revert "build break."
This reverts commit 847cb50.
Add more tests and misc clean up.
revert unintended changes.
PR feedback.
cleanup.
PR feedback.
Ort training README (Ort training README #3404)
Added README for ORT Training
Fix GraphTest.UnusedValueInfoSerializes.
Add SafeInt include to WinML targets (Add SafeInt include to WinML targets #3558)
Fixing Windows builds on the ort_training branch in preparation for the merge to master.
SafeInt (included via onnxruntime/core/common/safeint.h) was recently made a dependency of onnxruntime/core/framework/bfc_arena.h. That requires consumers of bfc_arena to compile with the SafeInt include directory.
subgraph type override handling and unit test (subgraph type override handling and unit test #3560)
unit test for subgraph type override
unit test - re-wire input properly to subgraph
update args
Co-authored-by: Ethan Tao ettao@microsoft.com
Clean up docs. (Clean up docs. #3579)
Fix orttraining/README.md formatting.
Delete ORT_TRAINING_BUILDS.md.
Fix typo.
Support ONNX test version parsing from path on Windows in onnx_test_runner. (Support ONNX test version parsing from path on Windows in onnx_test_runner. #3588)
Add front-end MNIST test (Add front-end MNIST test #3231)
Add frontend MNIST test.
Use torch nightly with torchvision.
Remove incorrect comment per reviewer feedback.
Experiment with torchvision import failure.
Experiment with install_deps.sh.
More experiments with install_deps.sh.
Experiment with install_deps.sh using --upgrade.
Experiment with install_deps.sh.
Experiment with install_ubuntu.sh.
Use Ubuntu 18.04 and Python 3.6 for CI.
Update cmake version for CI.
Install MPI on Ubuntu 18.04 for CI.
Increase tolerance for MNIST test.
Go back to Ubuntu 16.04 for CI, fix installing from deadsnakes ppa.
Clean-up.
Update ort_trainer.py from ort_training.
Get default Ubuntu Python ver back to 3.5.
Add underscore to opset_version parameter name in ORTTrainer constructor.
Move loss/model wrap before the call for sample output.
Update expected values for MNIST test.
Co-authored-by: liqun liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Co-authored-by: Sergii Dymchenko sedymche@microsoft.com
Make the set of disabled tests consistent between ort_training and master. Fix some regex patterns.
Fix merge issue.
Fix GraphTransformationTests tests.
Revert "Convert Gelu to use TryParallelFor (Convert Gelu to use TryParallelFor #3599)"
This reverts commit 2579a72.
Disable CudaKernelTest.SoftmaxCrossEntropyLoss_LargeSizeTensor because it's flaky.
Add --enable_onnx_tests to Windows builds to allow set up of test data directory.
Add --skip_onnx_tests to orttraining Windows builds.
Update Optimizer Domain and Opset (Update Optimizer Domain and Opset #3602)
Update Domain and Opset for SGD
Update Adam Domain and Opset
Update Lamb Domain and Opset
Remove Windows CUDA 9 build definition and helper scripts. (Remove Windows CUDA 9 build definition and helper scripts. #3615)
Eliminate Useless Cast during Transformer. (Eliminate Useless Cast during Transformer. #3606)
Remove Useless Cast during Transformer.
Resolve comments.
Check if graph can remove the node.
Co-authored-by: Vincent Wang weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
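The useless-Cast elimination above can be sketched on a toy graph. A `Cast` whose target type equals its input's type is a no-op and can be dropped, provided the graph allows removing the node (e.g. its output is not a graph output). The data structures and function below are illustrative only, not the ORT graph-transformer API.

```python
from dataclasses import dataclass
from typing import Optional

# Toy sketch of "eliminate useless Cast": drop Cast nodes whose target type
# equals the input type, after checking the graph can remove the node.
@dataclass
class Node:
    op: str
    inputs: list
    output: str
    to_type: Optional[str] = None  # only meaningful for Cast

def eliminate_useless_casts(nodes, value_types, graph_outputs):
    kept, rename = [], {}
    for n in nodes:
        # Resolve renames introduced by previously removed casts.
        n.inputs = [rename.get(i, i) for i in n.inputs]
        if (n.op == "Cast"
                and value_types.get(n.inputs[0]) == n.to_type
                and n.output not in graph_outputs):
            rename[n.output] = n.inputs[0]  # bypass the no-op cast
        else:
            kept.append(n)
    return kept

nodes = [
    Node("Relu", ["x"], "r"),
    Node("Cast", ["r"], "c", to_type="float32"),  # float32 -> float32: useless
    Node("Add", ["c", "b"], "y"),
]
out = eliminate_useless_casts(nodes, {"x": "float32", "r": "float32"}, {"y"})
assert [n.op for n in out] == ["Relu", "Add"]
assert out[1].inputs == ["r", "b"]  # Add now consumes "r" directly
```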
Clean up OPTIONAL name conflict workarounds in ort_training. (Clean up OPTIONAL name conflict workarounds in ort_training. #3622)
Clean up OPTIONAL name conflict workarounds.
Clean up unnecessary header file onnx_protobuf.h.
Co-authored-by: Sherlock Huang
Add Lamb shape inference (Add Lamb Shape Inference #3634)
Refactoring code related to WARP_SIZE. (Refactoring code related to WARP_SIZE. #3623)
Co-authored-by: Weixing Zhang wezhan@microsoft.com
fixes for ort_trainer.py to resume from checkpoint (fixes for ort_trainer.py to resume from checkpoint #3510)
fixes for ort_trainer.py to resume from checkpoint
define self.state_dict_ during init
add comment of explanation
add unit test for restore from checkpoint
fix file not found
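The checkpoint-resume fixes above boil down to two patterns: define the state dict during `__init__` so restoring before the first step cannot hit an `AttributeError`, and guard the load path against a missing file. The `Trainer` class and its fields below are illustrative stand-ins, not the real `ort_trainer.py`.

```python
import json, os, tempfile

# Minimal stdlib sketch of the resume-from-checkpoint pattern; names are
# hypothetical, not the ort_trainer.py API.
class Trainer:
    def __init__(self):
        # Defining state_dict_ during init (as the fix does) means a checkpoint
        # can be restored even before the first training step.
        self.state_dict_ = {"step": 0, "loss_scale": 1.0}

    def train_step(self):
        self.state_dict_["step"] += 1

    def save_checkpoint(self, path):
        with open(path, "w") as f:
            json.dump(self.state_dict_, f)

    def load_checkpoint(self, path):
        if not os.path.exists(path):  # guard against "file not found"
            raise FileNotFoundError(path)
        with open(path) as f:
            self.state_dict_ = json.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
t = Trainer()
t.train_step(); t.train_step()
t.save_checkpoint(path)

resumed = Trainer()
resumed.load_checkpoint(path)
assert resumed.state_dict_["step"] == 2  # training resumes where it left off
```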
Co-authored-by: suffian khan sukha@microsoft.com
Add check for nullptr in PlannerImpl::FindReusableTensor(). (Add check for nullptr in PlannerImpl::FindReusableTensor(). #3619)
expose training session so the training app could register custom kernel and transformers (expose training session in training runner #3642)
Co-authored-by: Cheng Tang chenta@microsoft.com
Expand elimination and Expand gradient. (Expand elimination and Expand gradient. #3610)
Expand elimination and Expand gradient.
Resolve comments.
Fix test break.
Check if graph can remove the node.
Resolve comment.
Co-authored-by: Vincent Wang weicwang@microsoft.com
Try not to modify base name (Avoid appending index to names when possible #3638)
GatherElementsGrad Kernels (GatherElementsGrad Kernels #3627)
GatherElementsGrad cuda kernel & tests
Fix comments
Fix include path
Add pipeline transformer for wait/record node (Add pipeline transformer for wait/record node #3513)
pipeline transformer
clean up
address feedback
add record/wait for first stage and updated split script
address feedback
make recv/send signal as initializer
merge
address feedback
unify input and initializer
address feedback and bug fix
minor fix
windows build
fix
fixed mnist bug (fixed mnist bug #3569)
fixed mnist bug
fixed train_step param
Simplify and clean code (Simplify and clean code #3655)
Co-authored-by: Weixing Zhang wezhan@microsoft.com
Change CentOS build to use agent pool because builds on hosted agents run out of disk space. (Change CentOS build to use agent pool because builds on hosted agents run out of disk space. #3662)
disable broken test in DML (disable broken test in DML #3666)
Temporarily disable LSTM_Seq_lens_unpacked for DML test.
Co-authored-by: Ethan Tao ettao@microsoft.com
This reverts commit d9641f2.
Reverting to fix onnx_test_runner test failures.
Co-authored-by: Ke Deng kedeng@microsoft.com
Co-authored-by: Ethan Tao ettao@microsoft.com
Co-authored-by: Zeeshan Siddiqui mzs@microsoft.com
Co-authored-by: Jesse Benson benson.jesse@gmail.com
Co-authored-by: Sherlock baihan.huang@gmail.com
Co-authored-by: ytaous 4484531+ytaous@users.noreply.github.com
Co-authored-by: liqunfu liqun_fu@hotmail.com
Co-authored-by: liqun liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Co-authored-by: Xueyun Zhu xzhu1900@gmail.com
Co-authored-by: Tixxx tix@microsoft.com
Co-authored-by: Li-Wen Chang 30609447+liwchang@users.noreply.github.com
Co-authored-by: Bowen Bao semisqg@gmail.com
Co-authored-by: Wei-Sheng Chin wschin@outlook.com
Co-authored-by: Xueyun Zhu 40807589+xzhu1900@users.noreply.github.com
Co-authored-by: Weixing Zhang weixingzhang@users.noreply.github.com
Co-authored-by: Weixing Zhang wezhan@microsoft.com
Co-authored-by: Thiago Crepaldi thiago.crepaldi@microsoft.com
Co-authored-by: liqunfu liqfu@microsoft.com
Co-authored-by: Sergii Dymchenko sedymche@microsoft.com
Co-authored-by: Vincent Wang wangwchpku@outlook.com
Co-authored-by: Vincent Wang weicwang@microsoft.com
Co-authored-by: root root@525204a066204ea794f942530b05ae7f000000.axlncovkyjne5caro2tmz3zryb.xx.internal.cloudapp.net
Co-authored-by: pengwa pengwa@microsoft.com
Co-authored-by: harshitha havenka@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Co-authored-by: manashgoswami magoswam@microsoft.com
Co-authored-by: Vincent Wang weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net
Co-authored-by: suffiank suffiankh@gmail.com
Co-authored-by: suffian khan sukha@microsoft.com
Co-authored-by: Tang, Cheng souptc@gmail.com
Co-authored-by: Cheng Tang chenta@microsoft.com
Co-authored-by: XiaocenDong 63833153+XiaocenDong@users.noreply.github.com