
Update Convergence Baseline for github-onnxruntime-linux-gpu-training-e2e-tests #4465

Merged
Lafi7e merged 1 commit into master from weicwang/citest on Jul 9, 2020

Conversation


@Lafi7e Lafi7e commented Jul 9, 2020

Description: Update the convergence baseline for github-onnxruntime-linux-gpu-training-e2e-tests, since the FP32 loss scale fix generates different loss numbers.

Motivation and Context

  • Before the fix, the loss scale was cast to FP16 before being multiplied into the loss, while the optimizer unscaled gradients using the original FP32 value; the FP32 loss scale fix keeps the entire loss_scale calculation in FP32. FP16's largest finite value is 65504, but our initial loss_scale is 1<<16 = 65536, so the number multiplied into the loss and the number passed to the optimizer were different. I experimented with passing loss_scale from the frontend as a parameter: before the fix, using 65536 the loss curve did not decrease at all, while a smaller value such as 1024 worked fine. With the fix, we get a consistent loss curve no matter which value is used. Below are some test results for comparison.
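The overflow described above can be demonstrated in a few lines. This is a minimal sketch (not the actual ONNX Runtime code) showing why an initial loss scale of 1<<16 = 65536 misbehaves when cast to FP16: FP16's largest finite value is 65504, so 65536 overflows to infinity, and the multiply and the later unscale no longer use the same number.

```python
import numpy as np

scale = float(1 << 16)             # 65536.0, the default initial loss scale

# Casting to FP16 overflows: the FP16 max finite value is 65504.
fp16_scale = np.float16(scale)
print(fp16_scale)                  # inf

# Before the fix: the loss is multiplied by the FP16-cast scale...
loss = np.float32(11.24)
scaled_loss = loss * np.float32(fp16_scale)   # inf -> gradients blow up
assert np.isinf(scaled_loss)

# ...while the optimizer unscales gradients with the original FP32 value,
# so the scale used to multiply and the scale used to divide disagree.

# After the fix: the scale stays in FP32, where 65536 is represented
# exactly, so scaling and unscaling are consistent.
fp32_scale = np.float32(scale)
assert float(fp32_scale) == 65536.0
```

A smaller scale such as 1024 fits comfortably in FP16, which is why it worked even before the fix.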

Before fix, dynamic loss scaler, init value is 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5228,0.717476
5,10.1875,7.75453,2.43238
10,8.42578,7.63755,0.792425
15,8.35156,7.60502,0.744699
20,8.22656,7.4854,0.749099
25,8.29688,7.56207,0.73899
30,8.125,7.40926,0.716592
35,7.99219,7.26281,0.726583
40,7.94531,7.26573,0.679934
45,7.94141,7.27335,0.668663

After fix, dynamic loss scaler, init value is 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67238,7.51595,2.15643
10,8.27554,7.56897,0.706572
15,8.26999,7.55625,0.71374
20,8.16621,7.4605,0.70571
25,8.2243,7.53056,0.693737
30,8.07708,7.38574,0.691342
35,7.96605,7.25034,0.715708
40,7.94507,7.2564,0.688668
45,7.92962,7.26224,0.667383

Before fix, fixed loss scale = 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,11.2109,10.4883,0.725145
10,11.2266,10.5205,0.699967
15,11.1719,10.4668,0.701468
20,11.2109,10.5031,0.71233
25,11.1953,10.4858,0.711551
30,11.2031,10.4962,0.702751
35,11.2422,10.5054,0.733334
40,11.25,10.5118,0.73202
45,11.2109,10.4865,0.724446

After fix, fixed loss scale = 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.6722,7.51595,2.15625
10,8.27545,7.56892,0.706533
15,8.27002,7.55626,0.713755
20,8.16617,7.46048,0.705697
25,8.22435,7.53055,0.693796
30,8.07706,7.3857,0.691354
35,7.96593,7.25026,0.715675
40,7.94511,7.25639,0.688721
45,7.92963,7.26226,0.667372

Before fix, fixed loss scale = 1024:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,9.67188,7.51594,2.15556
10,8.27344,7.56891,0.706516
15,8.26562,7.55626,0.713709
20,8.16406,7.46044,0.705734
25,8.22656,7.53047,0.693739
30,8.07812,7.38566,0.691267
35,7.96484,7.25025,0.715929
40,7.94531,7.25627,0.688704
45,7.92969,7.26225,0.667403

After fix, fixed loss scale = 1024:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67239,7.51593,2.15646
10,8.27551,7.56894,0.706573
15,8.26996,7.55628,0.713679
20,8.16623,7.4605,0.705729
25,8.22414,7.53043,0.693709
30,8.07699,7.38571,0.691282
35,7.96615,7.25021,0.715936
40,7.94493,7.25626,0.688668
45,7.92965,7.26224,0.667409

@Lafi7e Lafi7e added the training issues related to ONNX Runtime training; typically submitted using template label Jul 9, 2020
@Lafi7e Lafi7e requested review from SherlockNoMad and suffiank July 9, 2020 04:34
@Lafi7e Lafi7e requested a review from a team as a code owner July 9, 2020 04:34
@Lafi7e Lafi7e merged commit 7fb194d into master Jul 9, 2020
@Lafi7e Lafi7e deleted the weicwang/citest branch July 9, 2020 07:29
