Update Convergence Baseline for github-onnxruntime-linux-gpu-training-e2e-tests#4465
Merged
Conversation
SherlockNoMad approved these changes on Jul 9, 2020.
Description: Update the convergence baseline for github-onnxruntime-linux-gpu-training-e2e-tests, since the FP32 loss-scale fix produces different loss values.
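The baseline update matters because the e2e test presumably compares the per-step losses of a fresh training run against stored reference values within some tolerance. A minimal sketch of such a comparison, assuming hypothetical names (`check_convergence`, `rtol`) rather than the actual onnxruntime test code:

```python
def check_convergence(actual, baseline, rtol=1e-3):
    """Return the steps whose loss deviates from the baseline by more
    than a relative tolerance. Empty list means the run converged as
    expected. (Hypothetical helper; not ONNX Runtime's test harness.)"""
    mismatches = []
    for step, expected in baseline.items():
        got = actual[step]
        if abs(got - expected) > rtol * abs(expected):
            mismatches.append((step, expected, got))
    return mismatches

# Example using total_loss entries from the "after fix" dynamic-scaler run.
baseline = {0: 11.2403, 5: 9.67238, 10: 8.27554}
actual = {0: 11.2403, 5: 9.67238, 10: 8.27554}
print(check_convergence(actual, baseline))  # [] -> run matches baseline
```

Once the fix changes the loss trajectory, the old baseline values fail this comparison, which is why the stored numbers must be regenerated.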
Motivation and Context
Before the fix, dynamic loss scaler, initial value 65536:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5228,0.717476
5,10.1875,7.75453,2.43238
10,8.42578,7.63755,0.792425
15,8.35156,7.60502,0.744699
20,8.22656,7.4854,0.749099
25,8.29688,7.56207,0.73899
30,8.125,7.40926,0.716592
35,7.99219,7.26281,0.726583
40,7.94531,7.26573,0.679934
45,7.94141,7.27335,0.668663
```
After the fix, dynamic loss scaler, initial value 65536:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67238,7.51595,2.15643
10,8.27554,7.56897,0.706572
15,8.26999,7.55625,0.71374
20,8.16621,7.4605,0.70571
25,8.2243,7.53056,0.693737
30,8.07708,7.38574,0.691342
35,7.96605,7.25034,0.715708
40,7.94507,7.2564,0.688668
45,7.92962,7.26224,0.667383
```
Before the fix, fixed loss scale = 65536:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,11.2109,10.4883,0.725145
10,11.2266,10.5205,0.699967
15,11.1719,10.4668,0.701468
20,11.2109,10.5031,0.71233
25,11.1953,10.4858,0.711551
30,11.2031,10.4962,0.702751
35,11.2422,10.5054,0.733334
40,11.25,10.5118,0.73202
45,11.2109,10.4865,0.724446
```
After the fix, fixed loss scale = 65536:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.6722,7.51595,2.15625
10,8.27545,7.56892,0.706533
15,8.27002,7.55626,0.713755
20,8.16617,7.46048,0.705697
25,8.22435,7.53055,0.693796
30,8.07706,7.3857,0.691354
35,7.96593,7.25026,0.715675
40,7.94511,7.25639,0.688721
45,7.92963,7.26226,0.667372
```
Before the fix, fixed loss scale = 1024:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,9.67188,7.51594,2.15556
10,8.27344,7.56891,0.706516
15,8.26562,7.55626,0.713709
20,8.16406,7.46044,0.705734
25,8.22656,7.53047,0.693739
30,8.07812,7.38566,0.691267
35,7.96484,7.25025,0.715929
40,7.94531,7.25627,0.688704
45,7.92969,7.26225,0.667403
```
After the fix, fixed loss scale = 1024:

```
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67239,7.51593,2.15646
10,8.27551,7.56894,0.706573
15,8.26996,7.55628,0.713679
20,8.16623,7.4605,0.705729
25,8.22414,7.53043,0.693709
30,8.07699,7.38571,0.691282
35,7.96615,7.25021,0.715936
40,7.94493,7.25626,0.688668
45,7.92965,7.26224,0.667409
```
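For context on the runs above: dynamic loss scaling multiplies the loss by a scale factor (here initialized to 65536) before the FP16 backward pass, shrinks the scale when gradients overflow, and grows it again after a stretch of overflow-free steps. The sketch below illustrates the general technique under assumed defaults; the class name, growth interval, and factor are illustrative, not ONNX Runtime's actual implementation.

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaler (hypothetical, not ORT's code)."""

    def __init__(self, init_scale=65536.0, growth_interval=2000, factor=2.0):
        self.scale = init_scale          # current loss-scale multiplier
        self.growth_interval = growth_interval
        self.factor = factor
        self.good_steps = 0              # consecutive steps without overflow

    def update(self, found_overflow):
        if found_overflow:
            # FP16 gradients hit inf/nan: shrink the scale and skip the step.
            self.scale /= self.factor
            self.good_steps = 0
        else:
            # Stable step: after enough clean steps, try a larger scale.
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= self.factor
                self.good_steps = 0

scaler = DynamicLossScaler(init_scale=65536.0)
scaler.update(found_overflow=True)
print(scaler.scale)  # 32768.0 -- the scale halves after an overflow
```

This is why the dynamic-scaler run diverges from the fixed-scale runs only in the first few steps: once the scale settles, both approaches apply a stable multiplier, and the post-fix losses for all three configurations track each other closely from step 10 onward.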