ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685) #1969

JayQine opened this issue May 18, 2022 · 44 comments

@JayQine

JayQine commented May 18, 2022

I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)".
I cannot locate the underlying problem from this error report. How can I resolve this issue?

@imabackstabber
Contributor

Hi, could you paste the full error report?

@JayQine
Author

JayQine commented May 18, 2022

sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937

error_log:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED

@imabackstabber
Contributor

Hmm, actually the stack trace doesn't give any information about mmcv itself.
First, make sure you installed mmcv_full correctly and that its version matches your CUDA version; see this issue for more detail.
If you're sure CUDA is not to blame, could you please paste your:

  1. running command
  2. training config

That would help us locate the problem.

@JayQine
Author

JayQine commented May 20, 2022

1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
2. the training config is too complex to paste here.

@imabackstabber
Contributor

1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py 2. the training config is too complex to paste here.

OK, but pasting the training config could help us locate potential bugs and solve the problem. Did checking the CUDA version help?

@FuNian788

I met the same problem; it happens only when the dataset is very large, e.g. Objects365 or BigDetection. Small datasets such as COCO do not trigger it. Hope this helps with debugging.

@ywdong

ywdong commented Jun 27, 2022

ERROR:torch.distributed.elastic.multiprocessing.api:failed

I also met the same issue. Any help?

@imabackstabber
Contributor

@FuNian788 @ywdong hi, can you paste your training config?

@alaa-shubbak

alaa-shubbak commented Jul 6, 2022

I faced the same error when trying to train on a small dataset. I really don't know what exactly the issue is; any help, please?

The error occurred with SoftTeacher.

@jianlong-yuan

I have met the same problem when the dataset is too large.

@CarloSaccardi

same problem here

@AmrElsayed14

I have the same issue.

@RenyunLi0116

Same problem.

@ataey

ataey commented Aug 6, 2022

Same issue when dataset is too large.

@aissak21

aissak21 commented Aug 9, 2022

I had a similar issue; training worked when I used 1% of my data, which is 11 GB. How would one go about using the full, larger dataset?

@zhouzaida
Member

Hi, thanks for your report. We are trying to reproduce the error.

@piglaker

same problem here

@yiyexy

yiyexy commented Aug 19, 2022

Hello everybody, I met the same problem and finally found its root cause.
If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will also get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'.
Both are caused by OOM. To verify this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, you can reduce your batch size and run 'top' in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.
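
For anyone who wants to try the batch-size suggestion above, here is a minimal sketch of the relevant settings in an mmcv 1.x / MMSegmentation-style config; the values are placeholders, not recommendations.

    # Hypothetical excerpt of a config: samples_per_gpu is the per-GPU batch
    # size, so lowering it directly reduces memory pressure; workers_per_gpu=0
    # only changes the symptom described above, it does not prevent the OOM kill.
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
    )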

@Nomi-Q

Nomi-Q commented Aug 20, 2022

You need to set the launcher in init_dist(launcher, backend) according to how you start your program; I set it to 'pytorch'.
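
For context, a minimal sketch of what that looks like with mmcv 1.x, assuming the job is started via torch.distributed.launch or torchrun (which export RANK, LOCAL_RANK and the master address):

    # 'pytorch' tells init_dist to read the rank/world-size variables that the
    # PyTorch launcher already set, instead of expecting SLURM or MPI.
    from mmcv.runner import init_dist

    init_dist('pytorch', backend='nccl')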

@haixiongli

I had a similar issue

@lh4027

lh4027 commented Nov 6, 2022

Maybe you can try this:
dist.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400))
i.e. increase the timeout.
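
Spelled out, that suggestion looks roughly like the following; the 5400-second value is just the one quoted above, not a recommendation.

    import datetime

    import torch.distributed as dist

    # A longer timeout keeps other ranks from being torn down while one rank is
    # stuck in a slow phase, e.g. evaluating a very large validation set.
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        timeout=datetime.timedelta(seconds=5400),
    )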

@gihwan-kim

I have the same issue. Did you resolve it?

@allanj

allanj commented Dec 26, 2022

Same issue here when loading a very large model.

@yitianlian

I also met this failure, and I think I found the solution!
It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.
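
A quick way to check whether torch and torchvision agree on the CUDA build is to print their build info; this is only a diagnostic sketch.

    import torch
    import torchvision

    # If these CUDA versions disagree with each other (or with the installed
    # driver/toolkit), the crash often surfaces only as the opaque
    # elastic-launch failure above rather than a clear import error.
    print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
    print("torchvision:", torchvision.__version__)
    print("cuDNN:", torch.backends.cudnn.version())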

@stdcoutzrh

Hi, thanks for your report. We are trying to reproduce the error.

Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches 6000-7000. It seems that the memory occupied by the predicted images is not released after prediction completes.

@walsvid

walsvid commented Jan 17, 2023

I reproduced the error and found that it is related to OOM. An intuitive workaround is to lower the batch_size on each GPU. During distributed training, one process gets killed because of OOM, so the overall training exits and raises the ERROR mentioned above.

@walsvid

walsvid commented Jan 18, 2023

I think this phenomenon is highly related to pytorch/pytorch#13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new mmengine fixes the problem; please see the doc here.
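
The workaround discussed in that issue is to keep per-sample metadata in one flat buffer instead of millions of small Python objects, so forked dataloader workers do not trigger copy-on-write on the whole annotation list. Below is a minimal sketch of the idea; the class name is made up and this is not mmengine's actual implementation.

    import pickle

    import numpy as np
    from torch.utils.data import Dataset

    class PackedAnnotations(Dataset):
        """Serialize each record into one shared numpy byte buffer so workers
        read it without copying a huge list of Python dicts."""

        def __init__(self, records):
            blobs = [pickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL) for r in records]
            self._ends = np.cumsum([len(b) for b in blobs])  # end offset of each record
            self._data = np.frombuffer(b"".join(blobs), dtype=np.uint8)

        def __len__(self):
            return len(self._ends)

        def __getitem__(self, idx):
            start = 0 if idx == 0 else int(self._ends[idx - 1])
            return pickle.loads(self._data[start:int(self._ends[idx])].tobytes())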

@xu19971109

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.
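
For mmcv 1.x-style configs this is a single top-level line, which is forwarded to DistributedDataParallel; it tolerates parameters that receive no gradient, at the cost of some speed.

    # Hypothetical one-line addition to the training config.
    find_unused_parameters = True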

@ggjy

ggjy commented Mar 23, 2023

Same problem. I find that a specific model (parameter count) combined with a specific batch size triggers the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

@aqppe

aqppe commented Jul 13, 2023

I got the same error, but then I noticed empty records within my JSON annotations. Fixing them solved the problem for me.

@annahambi

I got the same error, but then I noticed empty records within my JSON annotations. Fixing them solved the problem for me.

Can you give more details @aqppe ?

@Ignite616

Hello everybody, I met the same problem and finally found its root cause. If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will also get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'. Both are caused by OOM. To verify this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, you can reduce your batch size and run 'top' in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.

Hello, I would like to ask you for advice. When training my model, I hit this problem when epoch 1 reached 330/880. What might be the reason? Is it related to batch_size, the PyTorch version, or an abnormal loss?

@happybear1015

Same problem. I find that a specific model (parameter count) combined with a specific batch size triggers the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

The error still appears with batch_size=1; how should I deal with it?

@arazd

arazd commented Aug 11, 2023

Check out this thread tatsu-lab/stanford_alpaca#81

@zParquet

zParquet commented Aug 24, 2023

Thanks. I reduced the size of the training set and that solved it.

@morning12138

I also met this failure, and I think I found the solution! It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.

It's really useful!

@rbareja25

Has anyone been able to fix this? I have increased RAM, reduced the batch size, and downgraded the torchvision version as mentioned above, but nothing works.

@amundra15

I also met this failure, and I think I found the solution! It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.

This pointed me in the right direction. I had cudatoolkit-dev in my env (whose version did not match cudatoolkit). Uninstalling it resolved the issue for me.

@husnejahan

Hi,
I had the same problem; my dataset was too big.
First lower the batch size, then kill all processes to free GPU/CUDA memory. I hope that solves the issue.

@blablabLACK

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.

This works, thanks!

@puppystellar

I have reduced the batch size to 2 (1 leads to another error), but how can I use all the training data? Would more GPU memory help? Thanks!

@tanatomoe

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.

Hi!
How exactly do you change the batch size?

@badrinath89

If you are using Hugging Face, set this in your TrainingArguments():

ddp_find_unused_parameters=False

The error goes away.
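
For completeness, a minimal sketch with the Hugging Face Trainer; output_dir and everything else here are placeholders.

    from transformers import TrainingArguments

    # Passed straight through to DistributedDataParallel: skips the
    # unused-parameter search, as suggested above.
    args = TrainingArguments(
        output_dir="./out",
        ddp_find_unused_parameters=False,
    )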

@iris0329

I met this issue with cuda11.6+torch11.3; it happens only when the dataset is too large.
