
Update Convergence Baseline for github-onnxruntime-linux-gpu-training-e2e-tests #4465

Merged
Lafi7e merged 1 commit into master from weicwang/citest on Jul 9, 2020

Conversation


@Lafi7e Lafi7e commented Jul 9, 2020

Description: Update the convergence baseline for github-onnxruntime-linux-gpu-training-e2e-tests, since the FP32 loss scale fix generates different loss numbers.

Motivation and Context

  • Before the fix, the loss scale was cast to FP16 before being multiplied into the loss, while the optimizer unscaled gradients using the original FP32 value; the FP32 loss scale fix keeps the entire loss_scale calculation in FP32. FP16's largest finite value is 65504, but our initial loss_scale is 1<<16 = 65536, so the number multiplied into the loss and the number passed to the optimizer were different. I experimented with passing loss_scale from the frontend as a parameter: before the fix, using 65536 the loss curve did not decrease at all, while a smaller value such as 1024 worked fine. With the fix, we get a consistent loss curve no matter which value is used. Below are some test results for comparison.
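The overflow described above can be demonstrated in a few lines. This is a minimal sketch (not the actual ONNX Runtime code) showing why an initial loss scale of 1<<16 = 65536 misbehaves when cast to FP16: FP16's largest finite value is 65504, so 65536 overflows to infinity, and the multiply and the later unscale no longer use the same number.

```python
import numpy as np

scale = float(1 << 16)             # 65536.0, the default initial loss scale

# Casting to FP16 overflows: the FP16 max finite value is 65504.
fp16_scale = np.float16(scale)
print(fp16_scale)                  # inf

# Before the fix: the loss is multiplied by the FP16-cast scale...
loss = np.float32(11.24)
scaled_loss = loss * np.float32(fp16_scale)   # inf -> gradients blow up
assert np.isinf(scaled_loss)

# ...while the optimizer unscales gradients with the original FP32 value,
# so the scale used to multiply and the scale used to divide disagree.

# After the fix: the scale stays in FP32, where 65536 is represented
# exactly, so scaling and unscaling are consistent.
fp32_scale = np.float32(scale)
assert float(fp32_scale) == 65536.0
```

A smaller scale such as 1024 fits comfortably in FP16, which is why it worked even before the fix.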

Before fix, dynamic loss scaler, init value is 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5228,0.717476
5,10.1875,7.75453,2.43238
10,8.42578,7.63755,0.792425
15,8.35156,7.60502,0.744699
20,8.22656,7.4854,0.749099
25,8.29688,7.56207,0.73899
30,8.125,7.40926,0.716592
35,7.99219,7.26281,0.726583
40,7.94531,7.26573,0.679934
45,7.94141,7.27335,0.668663

After fix, dynamic loss scaler, init value is 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67238,7.51595,2.15643
10,8.27554,7.56897,0.706572
15,8.26999,7.55625,0.71374
20,8.16621,7.4605,0.70571
25,8.2243,7.53056,0.693737
30,8.07708,7.38574,0.691342
35,7.96605,7.25034,0.715708
40,7.94507,7.2564,0.688668
45,7.92962,7.26224,0.667383

Before fix, fixed loss scale = 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,11.2109,10.4883,0.725145
10,11.2266,10.5205,0.699967
15,11.1719,10.4668,0.701468
20,11.2109,10.5031,0.71233
25,11.1953,10.4858,0.711551
30,11.2031,10.4962,0.702751
35,11.2422,10.5054,0.733334
40,11.25,10.5118,0.73202
45,11.2109,10.4865,0.724446

After fix, fixed loss scale = 65536:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.6722,7.51595,2.15625
10,8.27545,7.56892,0.706533
15,8.27002,7.55626,0.713755
20,8.16617,7.46048,0.705697
25,8.22435,7.53055,0.693796
30,8.07706,7.3857,0.691354
35,7.96593,7.25026,0.715675
40,7.94511,7.25639,0.688721
45,7.92963,7.26226,0.667372

Before fix, fixed loss scale = 1024:
step,total_loss,mlm_loss,nsp_loss
0,11.2422,10.5229,0.717479
5,9.67188,7.51594,2.15556
10,8.27344,7.56891,0.706516
15,8.26562,7.55626,0.713709
20,8.16406,7.46044,0.705734
25,8.22656,7.53047,0.693739
30,8.07812,7.38566,0.691267
35,7.96484,7.25025,0.715929
40,7.94531,7.25627,0.688704
45,7.92969,7.26225,0.667403

After fix, fixed loss scale = 1024:
step,total_loss,mlm_loss,nsp_loss
0,11.2403,10.5228,0.717444
5,9.67239,7.51593,2.15646
10,8.27551,7.56894,0.706573
15,8.26996,7.55628,0.713679
20,8.16623,7.4605,0.705729
25,8.22414,7.53043,0.693709
30,8.07699,7.38571,0.691282
35,7.96615,7.25021,0.715936
40,7.94493,7.25626,0.688668
45,7.92965,7.26224,0.667409

@Lafi7e Lafi7e added the training issues related to ONNX Runtime training; typically submitted using template label Jul 9, 2020
@Lafi7e Lafi7e requested review from SherlockNoMad and suffiank July 9, 2020 04:34
@Lafi7e Lafi7e requested a review from a team as a code owner July 9, 2020 04:34
@Lafi7e Lafi7e merged commit 7fb194d into master Jul 9, 2020
@Lafi7e Lafi7e deleted the weicwang/citest branch July 9, 2020 07:29
