Abnormal learning curve bumps at the early batches of each epoch during DS2 training #100

Closed
xinghai-sun opened this issue Jun 15, 2017 · 5 comments

@xinghai-sun
Contributor

After merging PR #74, we have seen the following abnormal learning curve:

[Figure: training cost curve, showing spikes at the start of each epoch]

The figure plots the training cost. Notice that in the tail of the curve there are many spikes, located exactly at the first batch of each epoch.

Besides, the phenomenon is not easy to reproduce on a small dataset.

@xinghai-sun
Contributor Author

xinghai-sun commented Jun 15, 2017

One weird thing: when I resumed training from a model saved during the run shown in the figure above, the phenomenon did not appear again.

@qingqing01
Collaborator

qingqing01 commented Jun 15, 2017

Sorry, I'll switch to Chinese :(

Training on libri.train-clean-100, convergence becomes strange starting from pass 14: the training cost suddenly jumps up and then falls back:

Pass: 13, Batch: 800, TrainCost: 33.875851
.................................................
Pass: 13, Batch: 850, TrainCost: 32.388756
.........................................
------- Time: 2996 sec,  Pass: 13, ValidationCost: 270.575763434

Pass: 14, Batch: 0, TrainCost: 45.968803
.................................................
Pass: 14, Batch: 50, TrainCost: 492.662450

After removing the following lines from batch-shuffle (i.e. dropping the short samples at the head and the long tail samples that cannot fill a complete batch):

# number of leftover samples at the tail that did not fill a complete batch
res_len = len(manifest) - shift_len - len(batch_manifest)
# append the leftover tail samples and the short head samples skipped by the shift
batch_manifest.extend(manifest[-res_len:])
batch_manifest.extend(manifest[0:shift_len])

convergence no longer shows a sudden jump and everything looks normal:

.........

Pass: 19, Batch: 890, TrainCost: 25.492654 CurCost: 14.879614

------- Time: 2977 sec,  Pass: 19, ValidationCost: 61.139245818

Pass: 20, Batch: 0, TrainCost: 27.037848 CurCost: 27.037848


@xinghai-sun
Contributor Author

I've given up the attempt to reproduce the phenomenon from a pre-trained model.

Now I've started three from-scratch jobs with three different shuffle methods, i.e.

  • instance shuffle
  • batch shuffle
  • batch shuffle with clipping

(For more details, please refer here; a rough sketch of the three strategies is given at the end of this comment.)

All three jobs use the full LibriSpeech data, in order to reproduce what @qingqing01 has observed on a small dataset.
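Roughly, the three shuffle strategies can be sketched as follows. This is only a minimal sketch based on the descriptions in this thread and the snippet quoted by @qingqing01, not the exact repo implementation; the manifest field name "duration" and the grouping details are assumptions.

import random

def instance_shuffle(manifest):
    # Shuffle every sample independently.
    manifest = list(manifest)
    random.shuffle(manifest)
    return manifest

def batch_shuffle(manifest, batch_size, clipped=False):
    # Sort by audio duration so that samples inside one batch have similar lengths.
    manifest = sorted(manifest, key=lambda x: x["duration"])
    # Random shift so that the batch composition changes from epoch to epoch.
    shift_len = random.randint(0, batch_size - 1)
    # Group the shifted list into full batches and shuffle the batch order.
    batches = [manifest[i:i + batch_size]
               for i in range(shift_len, len(manifest) - batch_size + 1, batch_size)]
    random.shuffle(batches)
    batch_manifest = [sample for batch in batches for sample in batch]
    if not clipped:
        # Plain "batch shuffle": put back the short head samples skipped by the
        # shift and the long tail samples that could not fill a complete batch.
        res_len = len(manifest) - shift_len - len(batch_manifest)
        if res_len > 0:
            batch_manifest.extend(manifest[-res_len:])
        batch_manifest.extend(manifest[0:shift_len])
    # "Batch shuffle with clipping" (clipped=True) simply drops those leftovers.
    return batch_manifest

The clipped variant corresponds to what @qingqing01 tested above by deleting the lines that re-append the leftovers.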

@xinghai-sun
Contributor Author

xinghai-sun commented Jun 22, 2017

Here are the results for batch size = 32: all three shuffle methods run into abnormal convergence. Moreover, the bumping points are no longer located at the first batches of an epoch (which contradicts what we observed previously).

[Figure: compare_shuffle — training cost for the three shuffle methods, batch size 32]

However, when we change the batch size from 32 to 256, the convergence is much more stable and we haven't seen the abnormal phenomenon so far.

[Figure: compare_shuffle256 — training cost for the three shuffle methods, batch size 256]

Larger batches reduce the gradient variance, thus stabilizing the convergence.
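
As a rough back-of-the-envelope argument (assuming approximately independent per-sample gradients with variance \sigma^2, which is an idealization), the variance of the averaged mini-batch gradient scales inversely with the batch size B:

\operatorname{Var}[\hat{g}_B] \approx \frac{\sigma^2}{B},
\qquad
\frac{\operatorname{std}[\hat{g}_{32}]}{\operatorname{std}[\hat{g}_{256}]} \approx \sqrt{\tfrac{256}{32}} = \sqrt{8} \approx 2.8,

so moving from batch size 32 to 256 cuts the gradient noise by roughly a factor of 2.8, which is consistent with the smoother curves above.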

Conclusion: batch size 32 is too small for stable training; use 256 or larger instead.

TODO:

  1. Try a smaller learning rate for batch size 32 (see the rough sketch below).
  2. Train more epochs to see whether batch size 256 can really stabilize the training.
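
For item 1, one common heuristic (not something decided in this thread, just an assumption worth testing) is the linear scaling rule: keep the learning rate proportional to the batch size. A minimal sketch with purely hypothetical numbers:

def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear scaling heuristic: learning rate proportional to batch size.
    return base_lr * float(new_batch_size) / base_batch_size

# Hypothetical example: if 5e-4 were stable for batch size 256,
# a first guess for batch size 32 would be:
print(scaled_lr(5e-4, 256, 32))  # 6.25e-05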

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure and thank you for your support of PaddlePaddle!
