OOM error occurred after having 100k+ train steps #38

trainchou · 2018-08-27T02:20:37Z

My device info:
NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] x 4
Model params:
AUDIO_FEATURE_LENGTH = 200
batch_size = 112

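With a batch size above 112, training hits OOM shortly after it starts; with a batch size of 112, it runs for about 100k steps before reporting OOM.
I have two questions:
1. After the crash at around 100k steps, if I load the last saved model and continue training, does training start over from the beginning of the data? In other words, would the early part of the data be trained on repeatedly while the later part never gets a chance to be used?
2. To solve this OOM problem, do I need to lower both AUDIO_FEATURE_LENGTH and batch_size?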

nl8590687 · 2018-08-27T03:20:15Z

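1. No. My code reads the data randomly each time.
2. I see you set batch_size to 112. Why so large? 32 is about the most you need; anything larger does not help during training, because the gradient it yields differs little from the gradient of 32 samples.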
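
To illustrate the first point, here is a minimal sketch of random batch reading. It is not the repository's actual data loader: the function name `random_batches` and the dataset size are hypothetical, chosen only for illustration. Because every batch is drawn independently from the whole dataset, there is no position in the data to resume from, so reloading a checkpoint does not replay the same leading portion of the data.

```python
import random

def random_batches(num_samples, batch_size, seed=None):
    """Yield batches of sample indices drawn at random, indefinitely.

    Hypothetical helper for illustration; not taken from the project.
    """
    rng = random.Random(seed)
    while True:
        # Each batch is sampled from the full index range, so a restarted
        # run sees the whole dataset again rather than a fixed prefix.
        yield [rng.randrange(num_samples) for _ in range(batch_size)]

# Hypothetical sizes, for illustration only.
gen = random_batches(num_samples=10_000, batch_size=32, seed=0)
first_batch = next(gen)
second_batch = next(gen)
```

With sampling like this, coverage of the dataset depends only on how many batches are drawn in total, not on where a previous run stopped.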

songmianmian · 2018-08-27T03:28:18Z

Hello, how can one tell that the gradient from 32 samples differs little from the gradient from more than 32 samples?

nl8590687 · 2018-08-27T03:42:35Z

@songmianmian
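32 is the standard batch size commonly used by researchers in image processing and computer vision. Some articles and course videos explain why mini-batch gradient descent is used instead of batch gradient descent over the whole dataset at once, and they also discuss which batch size works best in different fields; you may want to take a look.
In fact, speech recognition does not even need 32, but since my approach borrows from computer vision, I recommend 32 here as well.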
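
One rough way to examine the question numerically is to compare how noisy the mini-batch gradient estimate is at each batch size. The sketch below is a toy model, not a measurement on this project: it assumes per-sample gradients behave like independent unit-variance Gaussian noise, and the function name `minibatch_grad_std` is hypothetical. Under that assumption the standard deviation of the batch mean shrinks as 1/sqrt(batch_size), so going from 32 to 112 reduces the noise by a factor of about sqrt(112/32) ≈ 1.87.

```python
import random
import statistics

def minibatch_grad_std(batch_size, trials=2000, seed=0):
    """Estimate the std of a mini-batch mean of per-sample 'gradients'.

    Assumption for illustration: each per-sample gradient is an
    independent draw from a unit-variance Gaussian.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

std_32 = minibatch_grad_std(32)
std_112 = minibatch_grad_std(112)
# Theory under the stated assumption: sqrt(112 / 32), roughly 1.87.
ratio = std_32 / std_112
```

Whether a noise reduction of this size changes the final accuracy of a real model is an empirical question that this toy model cannot answer.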

trainchou · 2018-08-28T02:31:34Z

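Here is my latest progress, for everyone's reference:
I loaded the model that had been trained for 200k steps with batch_size 112 and continued training it with batch_size 32 for one day. It has now trained for 134k steps; the loss rose from about 16 to about 26, the error rate rose from about 20% to 30%-40%, and both are very unstable.
Judging from these results, batch_size 112 seems to work better than 32.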

nl8590687 · 2018-08-28T07:10:53Z

How about trying to train from scratch?

nl8590687 closed this as completed Oct 11, 2018
