
Sequentially training multiple models gets running process killed? #2788

Closed
macabdul9 opened this issue Feb 3, 2023 · 5 comments

@macabdul9

I am trying to sequentially train a model on multiple splits of data using the Hugging Face Trainer with DeepSpeed (ZeRO stage 2), but the process gets killed after the first iteration. Something like this:

for current_split in [s1, s2, ...]:

   model = AutoModel.from_pretrained("...")
   trainer = Trainer(model=model, args=args, ...)
   trainer.train()

   # evaluation and saving of the results and model go here

Error from logs [main part]:

02/03/2023 03:17:08 - INFO - __main__ -   ***** test metrics *****
02/03/2023 03:17:08 - INFO - __main__ -     test_loss = 1.4941
02/03/2023 03:17:08 - INFO - __main__ -     test_runtime = 234.9303
02/03/2023 03:17:08 - INFO - __main__ -     test_samples_per_second = 1.822
02/03/2023 03:17:08 - INFO - __main__ -     test_steps_per_second = 0.017

02/03/2023 03:17:19 - WARNING - datasets.arrow_dataset -   Loading cached shuffled indices for dataset at data/cache/path
loading configuration file config.json from cache at config/cache/path

loading weights file pytorch_model.bin from models/

[2023-02-03 03:17:53,592] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143917
[2023-02-03 03:17:53,593] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143918
[2023-02-03 03:17:59,687] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143919
[2023-02-03 03:18:03,471] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 143920
[2023-02-03 03:18:04,988] [ERROR] [launch.py:324:sigkill_handler] exits with return code = -9

On the second iteration, when it tries to load the model again, the running process gets killed, possibly due to CUDA/CPU OOM.

ds_report

(venv) [awaheed@cdr2636 whisper-experiments]$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/awaheed/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.4
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed install path ........... ['/home/awaheed/venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2

More details:

GPU: single node with 4x V100 32GB GPUs

cc: @tjruwase @HeyangQin

@HeyangQin
Contributor

Hi @macabdul9

It seems the previous model is still in memory when you load the next one, which causes the OOM. Could you refactor the script so that it trains only one model per run? You can then use a bash script to call the training script multiple times on the different data splits, as in the sketch below.
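
A minimal sketch of that driver pattern (the script name, config file, split names, and --split flag below are placeholders, not taken from this thread): each split is trained in its own process, so all GPU and host memory is released when that process exits.

    # run_splits.sh -- hypothetical driver script
    set -e
    for split in s1 s2 s3; do
        # train.py, ds_config.json, and --split are placeholder names
        deepspeed train.py --deepspeed ds_config.json --split "$split"
    done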

@macabdul9 (Author) commented Feb 4, 2023

Thanks @HeyangQin for the suggestion.

I now launch the training script from a bash script, but the process still gets terminated right before the last iteration.

slurmstepd: error: Detected 23782 oom-kill event(s) in StepId=58498582.interactive. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cdr2639: task 0: Out Of Memory

I am sure there is no issue with CUDA memory. Since the data size changes with each iteration, could there be an issue with CPU memory? @HeyangQin

@tjruwase added the bug and training labels on Feb 5, 2023
@HeyangQin
Contributor

@macabdul9 Yes, the log indicates a CPU out-of-memory issue. Maybe there is a hidden memory leak in your training script? One way to check is sketched below.
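
A hedged way to check for that (not something prescribed in this thread): watch the resident set size of the training processes while the job runs and see whether host memory keeps growing from one split to the next.

    # Print the python processes sorted by resident memory every 10 seconds.
    # The command name may be python3 or similar on your node.
    watch -n 10 "ps -C python -o pid,rss,cmd --sort=-rss | head"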

@HeyangQin self-assigned this on Feb 7, 2023
@sxthunder

I also get an "exits with return code = -9" error when using DeepSpeed. Does return code -9 mean CPU out of memory?

@HeyangQin
Contributor

I also get a "exits with return code = -9" error using deepspeed, does code = -9 means cpu out of memory?

Hi @sxthunder, return code -9 just means the process was killed with SIGKILL, so it doesn't necessarily mean OOM. You need to check the logs to find the actual cause; one way is sketched below.
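
If the kernel OOM killer is what sent the SIGKILL, it usually leaves a trace in the kernel log. A minimal check (assuming you can read the kernel log on the node; the job id below is a placeholder):

    # Look for OOM-killer messages in the kernel log.
    dmesg -T | grep -i -E "out of memory|killed process"
    # On a Slurm cluster with accounting enabled, peak memory per job step can also be inspected:
    sacct -j <jobid> --format=JobID,MaxRSS,State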
