Sequentially training multiple models gets running process killed? #2788
**Original issue (@macabdul9):**

I am trying to sequentially train a model on multiple splits of data using a Hugging Face `Trainer` with DeepSpeed (ZeRO stage 2), but the process gets killed after the first iteration. Something like this:

Error from logs [main part]:

For the second iteration, when it tries to load the model again, the running process gets killed, possibly due to CUDA/CPU OOM.

ds_report

More details:

GPU: single node, 4x V100 32 GB GPUs

cc: @tjruwase @HeyangQin
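A minimal sketch of the kind of loop being described (the checkpoint name, split list, `load_split` helper, and DeepSpeed config path below are all hypothetical placeholders, not the author's actual script):

```python
# Minimal sketch only: every name here (checkpoint, splits, load_split,
# ds_config.json) is a hypothetical placeholder, not the author's code.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def load_split(name):
    """Hypothetical placeholder: return the tokenized dataset for one split."""
    raise NotImplementedError

for split in ["split_0", "split_1", "split_2"]:
    # A fresh model is loaded at the top of every iteration; nothing here
    # explicitly frees the previous iteration's model, optimizer, or engine.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    args = TrainingArguments(
        output_dir=f"out/{split}",
        deepspeed="ds_config.json",  # hypothetical ZeRO stage-2 config
        num_train_epochs=1,
        per_device_train_batch_size=8,
    )
    Trainer(model=model, args=args, train_dataset=load_split(split)).train()
```

Because all iterations run in a single Python process, anything the first iteration leaves referenced stays resident when the next `from_pretrained` call allocates.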
Comments

Hi @macabdul9, it seems the previous model is still in memory when you load the next one, which causes the OOM error. Could you try to refactor the script so it only trains one model? You can use a bash script to call the training script multiple times on different splits of the data (see the sketch below).
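The suggestion amounts to a driver that launches one fresh process per split, so the OS reclaims all CPU and GPU memory when each run exits. A plain bash `for` loop works; a minimal Python `subprocess` sketch of the same idea (`train.py`, the `--split` flag, and the split names are hypothetical) is:

```python
# Hypothetical driver: run the training script once per split in a separate
# process, so every run starts with clean memory and the OS reclaims
# everything on exit. train.py, --split, and the split names are placeholders.
import subprocess
import sys

for split in ["split_0", "split_1", "split_2"]:
    result = subprocess.run(["deepspeed", "train.py", "--split", split])
    if result.returncode != 0:
        sys.exit(f"training on {split} failed with return code {result.returncode}")
```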
Thanks, @HeyangQin, for the suggestion. I now execute the training script from bash, but the process still gets terminated right before the last iteration.

I am sure there is no issue with CUDA memory. Since the data size changes with each iteration, could there be an issue with CPU memory? @HeyangQin
@macabdul9 Yes. The log indicates a CPU out-of-memory issue. Maybe there is a hidden memory leak in your training script?
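If the splits must stay in one process, a common mitigation, sketched here under the assumption that the leak is simply lingering references and with a hypothetical `make_trainer` helper, is to drop every reference to the finished run before loading the next model:

```python
# Sketch: explicitly release the previous iteration's objects so the Python
# and CUDA allocators can actually free their memory before the next load.
import gc
import torch

for split in ["split_0", "split_1", "split_2"]:  # hypothetical splits
    trainer = make_trainer(split)  # hypothetical helper building model + Trainer
    trainer.train()
    del trainer                    # drop the reference to model/optimizer/engine
    gc.collect()                   # reclaim unreachable Python objects
    torch.cuda.empty_cache()       # return cached CUDA blocks to the driver
```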
I also get an "exits with return code = -9" error when using DeepSpeed. Does code = -9 mean CPU out of memory?

Hi @sxthunder, return code -9 doesn't necessarily mean OOM. You need to check the log.
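For context on the number itself: in Python's `subprocess` convention, which the reported message appears to follow, a negative return code is the number of the signal that killed the process, so -9 means SIGKILL. The kernel OOM killer is a common sender of SIGKILL, but not the only one. A hedged sketch, reusing the hypothetical driver above:

```python
# Hedged sketch: map a negative subprocess return code back to its signal name.
import signal
import subprocess

result = subprocess.run(["deepspeed", "train.py"])  # placeholders as above
if result.returncode < 0:
    # A negative return code means death-by-signal; -9 -> SIGKILL, which the
    # kernel OOM killer (among other causes) sends.
    print("killed by", signal.Signals(-result.returncode).name)
```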