
Sequentially training multiple models gets running process killed? #2788

Closed
macabdul9 opened this issue Feb 3, 2023 · 5 comments

@macabdul9

I am trying to sequentially train a model on multiple splits of data using the Hugging Face Trainer with DeepSpeed (ZeRO stage 2), but the process gets killed after the first iteration. Something like this:

for current_split in [s1, s2, ...]:

   model = AutoModel.from_pretrained("...")
   trainer = Trainer(model=model, args=args, ...)
   trainer.train()

   # evaluation and saving of the results and model go here

Error from logs [main part]:

02/03/2023 03:17:08 - INFO - __main__ -   ***** test metrics *****
02/03/2023 03:17:08 - INFO - __main__ -     test_loss = 1.4941
02/03/2023 03:17:08 - INFO - __main__ -     test_runtime = 234.9303
02/03/2023 03:17:08 - INFO - __main__ -     test_samples_per_second = 1.822
02/03/2023 03:17:08 - INFO - __main__ -     test_steps_per_second = 0.017

02/03/2023 03:17:19 - WARNING - datasets.arrow_dataset -   Loading cached shuffled indices for dataset at data/cache/path
loading configuration file config.json from cache at config/cache/path

loading weights file pytorch_model.bin from models/

[2023-02-03 03:17:53,592] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143917
[2023-02-03 03:17:53,593] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143918
[2023-02-03 03:17:59,687] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143919
[2023-02-03 03:18:03,471] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143920
[2023-02-03 03:18:04,988] [ERROR] [launch.py:324:sigkill_handler] exits with return code = -9

On the second iteration, when it tries to load the model again, the running process gets killed, possibly due to CUDA/CPU OOM.

ds_report

(venv) [awaheed@cdr2636 whisper-experiments]$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/awaheed/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.4
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed install path ........... ['/home/awaheed/venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2

More details:

GPU: single node with 4x V100 32GB GPUs

cc: @tjruwase @HeyangQin

@HeyangQin
Contributor

Hi @macabdul9

It seems the previous model is still in memory when you load the next one, which causes the OOM. Could you refactor the script so that it trains only one model per run? You can then use a bash script to call the training script multiple times on the different data splits, as in the sketch below.
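
A minimal sketch of that driver pattern (the script name, config file, split names, and --split flag below are placeholders, not taken from this thread): each split is trained in its own process, so all GPU and host memory is released when that process exits.

    # run_splits.sh -- hypothetical driver script
    set -e
    for split in s1 s2 s3; do
        # train.py, ds_config.json, and --split are placeholder names
        deepspeed train.py --deepspeed ds_config.json --split "$split"
    done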

@macabdul9 (Author) commented Feb 4, 2023

Thanks @HeyangQin for the suggestion.

I now launch the training script from a bash script, but the process still gets terminated right before the last iteration.

slurmstepd: error: Detected 23782 oom-kill event(s) in StepId=58498582.interactive. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cdr2639: task 0: Out Of Memory

I am sure there is no issue with CUDA memory. Since the data size changes with each iteration, could there be an issue with CPU memory? @HeyangQin

@tjruwase added the bug and training labels on Feb 5, 2023
@HeyangQin
Contributor

@macabdul9 Yes, the log indicates a CPU out-of-memory issue. Maybe there is a hidden memory leak in your training script? One way to check is sketched below.
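
A hedged way to check for that (not something prescribed in this thread): watch the resident set size of the training processes while the job runs and see whether host memory keeps growing from one split to the next.

    # Print the python processes sorted by resident memory every 10 seconds.
    # The command name may be python3 or similar on your node.
    watch -n 10 "ps -C python -o pid,rss,cmd --sort=-rss | head"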

@HeyangQin self-assigned this on Feb 7, 2023
@sxthunder

I also get an "exits with return code = -9" error when using DeepSpeed. Does return code -9 mean CPU out of memory?

@HeyangQin
Contributor

I also get a "exits with return code = -9" error using deepspeed, does code = -9 means cpu out of memory?

Hi @sxthunder, return code -9 just means the process was killed with SIGKILL, so it doesn't necessarily mean OOM. You need to check the logs to find the actual cause; one way is sketched below.
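
If the kernel OOM killer is what sent the SIGKILL, it usually leaves a trace in the kernel log. A minimal check (assuming you can read the kernel log on the node; the job id below is a placeholder):

    # Look for OOM-killer messages in the kernel log.
    dmesg -T | grep -i -E "out of memory|killed process"
    # On a Slurm cluster with accounting enabled, peak memory per job step can also be inspected:
    sacct -j <jobid> --format=JobID,MaxRSS,State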
