OOM error occurred after having 100k+ train steps #38
Comments
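1. No, it won't. My code reads the data in random order every time.
Hello, how can one tell that the gradient represented by 32 samples differs little from the gradient represented by more than 32 samples?
@songmianmian
Let me share my latest progress, for everyone's reference:
How about trying to train from scratch?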
My device info:
NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] x 4
Model params:
AUDIO_FEATURE_LENGTH = 200
batch_size = 112
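With batch_size set above 112, training hits OOM shortly after it starts; with batch_size set to 112, it runs for about 100k steps before reporting OOM.
I have two questions:
1. After the crash at around 100k steps, if I load the last saved model and continue training, does training start over from the beginning of the original data? In other words, would the earlier part of the data be trained on multiple times while the later part never gets a chance to be used?
2. To fix this OOM problem, should I lower both AUDIO_FEATURE_LENGTH and batch_size?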
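On question 1, whether a resumed run replays the early data depends on how the loader picks samples. Below is a minimal sketch of one way to make the batch order resumable, assuming the number of completed steps is saved alongside the model. The function `batch_indices` and its parameters (`seed`, `start_step`) are hypothetical names for illustration and are not taken from this project's code.

```python
import random


def batch_indices(num_samples, batch_size, seed, start_step=0):
    """Yield (step, indices) indefinitely, reshuffling the data every epoch.

    The shuffle for each epoch depends only on (seed + epoch), so a run
    resumed with start_step=k sees the same batches an uninterrupted run
    would have seen from step k onward.
    """
    steps_per_epoch = num_samples // batch_size
    if steps_per_epoch == 0:
        raise ValueError("batch_size is larger than the dataset")

    step = start_step
    while True:
        # Work out which epoch the current step belongs to and how far in.
        epoch, offset = divmod(step, steps_per_epoch)
        order = list(range(num_samples))
        random.Random(seed + epoch).shuffle(order)
        for i in range(offset, steps_per_epoch):
            yield step, order[i * batch_size:(i + 1) * batch_size]
            step += 1


# Hypothetical usage: resume after a crash at step 100000.
gen = batch_indices(num_samples=1000, batch_size=112, seed=0, start_step=100000)
step, idx = next(gen)
```

With a scheme like this, loading the last checkpoint and passing the saved step count continues through the data rather than starting again from the first samples. If the loader already draws samples at random on every step, replaying the early data is much less of a concern.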