ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685) #1969

JayQine opened this issue May 18, 2022 · 44 comments

@JayQine

JayQine commented May 18, 2022

I wrote my own dataset class and dataloader, and while training with mmcv.runner I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)".
I cannot locate the underlying problem from this error report. How can I resolve this issue?

@imabackstabber
Contributor

Hi, could you paste the full error report?

@JayQine
Author

JayQine commented May 18, 2022

sys.platform: linux
Python: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.10.0
TorchVision: 0.11.1+cu113
OpenCV: 4.5.5
MMCV: 1.5.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.3
MMSegmentation: 0.21.1+6585937

error_log:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024640 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024641 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024652 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024661 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2024662 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 7 (pid: 2024663) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED

@imabackstabber
Contributor

Hmm, actually the stack trace doesn't give any information about mmcv itself.
First, make sure you installed mmcv_full correctly and that its version matches your CUDA version; see this issue for more detail.
If you're sure CUDA is not to blame, could you please paste your:

  1. running command
  2. training config

That would help us locate the problem.

@JayQine
Author

JayQine commented May 20, 2022

1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
2. the training config is too complex to paste here.

@imabackstabber
Contributor

1. running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py 2. the training config is too complex to paste here.

OK, but pasting the training config could help us locate potential bugs and solve the problem. Did checking the CUDA version help?

@FuNian788

I met the same problem; it happens only when the dataset is very large, e.g. Objects365 or BigDetection. Small datasets such as COCO do not trigger it. Hope this helps with debugging.

@ywdong

ywdong commented Jun 27, 2022

ERROR:torch.distributed.elastic.multiprocessing.api:failed

I also met the same issue. Any help?

@imabackstabber
Contributor

@FuNian788 @ywdong hi, can you paste your training config?

@alaa-shubbak

alaa-shubbak commented Jul 6, 2022

I faced the same error when trying to train on a small dataset. I really don't know what exactly the issue is; any help, please?

The error occurred with SoftTeacher.

@jianlong-yuan

I have met the same problem when the dataset is too large.

@CarloSaccardi

same problem here

@AmrElsayed14

I have the same issue.

@RenyunLi0116

Same problem.

@ataey

ataey commented Aug 6, 2022

Same issue when dataset is too large.

@aissak21

aissak21 commented Aug 9, 2022

I had a similar issue; training worked when I used 1% of my data, which is 11 GB. How would one go about using the full, larger dataset?

@zhouzaida
Member

Hi, thanks for your report. We are trying to reproduce the error.

@piglaker

same problem here

@yiyexy

yiyexy commented Aug 19, 2022

Hello everybody, I met the same problem and finally found its root cause.
If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will also get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'.
Both are caused by OOM. To verify this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, you can reduce your batch size and run 'top' in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.
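
For anyone who wants to try the batch-size suggestion above, here is a minimal sketch of the relevant settings in an mmcv 1.x / MMSegmentation-style config; the values are placeholders, not recommendations.

    # Hypothetical excerpt of a config: samples_per_gpu is the per-GPU batch
    # size, so lowering it directly reduces memory pressure; workers_per_gpu=0
    # only changes the symptom described above, it does not prevent the OOM kill.
    data = dict(
        samples_per_gpu=2,
        workers_per_gpu=2,
    )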

@Nomi-Q

Nomi-Q commented Aug 20, 2022

You need to set the launcher in init_dist(launcher, backend) according to how you start your program; I set it to 'pytorch'.
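
For context, a minimal sketch of what that looks like with mmcv 1.x, assuming the job is started via torch.distributed.launch or torchrun (which export RANK, LOCAL_RANK and the master address):

    # 'pytorch' tells init_dist to read the rank/world-size variables that the
    # PyTorch launcher already set, instead of expecting SLURM or MPI.
    from mmcv.runner import init_dist

    init_dist('pytorch', backend='nccl')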

@haixiongli

I had a similar issue

@lh4027

lh4027 commented Nov 6, 2022

Maybe you can try this:
dist.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(seconds=5400))
i.e. increase the timeout.
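
Spelled out, that suggestion looks roughly like the following; the 5400-second value is just the one quoted above, not a recommendation.

    import datetime

    import torch.distributed as dist

    # A longer timeout keeps other ranks from being torn down while one rank is
    # stuck in a slow phase, e.g. evaluating a very large validation set.
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        timeout=datetime.timedelta(seconds=5400),
    )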

@gihwan-kim

I have the same issue. Did you resolve it?

@allanj

allanj commented Dec 26, 2022

Same issue here when loading a very large model.

@yitianlian

I also met this failure, and I think I found the solution!
It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.
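
A quick way to check whether torch and torchvision agree on the CUDA build is to print their build info; this is only a diagnostic sketch.

    import torch
    import torchvision

    # If these CUDA versions disagree with each other (or with the installed
    # driver/toolkit), the crash often surfaces only as the opaque
    # elastic-launch failure above rather than a clear import error.
    print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
    print("torchvision:", torchvision.__version__)
    print("cuDNN:", torch.backends.cudnn.version())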

@stdcoutzrh

Hi, thanks for your report. We are trying to reproduce the error.

Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches 6000-7000. It seems that the memory occupied by the predicted images is not released after prediction completes.

@walsvid

walsvid commented Jan 17, 2023

I reproduced the error and found that it is related to OOM. An intuitive workaround is to lower the batch_size on each GPU. During distributed training, one process gets killed because of OOM, so the overall training exits and raises the ERROR mentioned above.

@walsvid

walsvid commented Jan 18, 2023

I think this phenomenon is highly related to pytorch/pytorch#13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new mmengine fixes the problem; please see the doc here.
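
The workaround discussed in that issue is to keep per-sample metadata in one flat buffer instead of millions of small Python objects, so forked dataloader workers do not trigger copy-on-write on the whole annotation list. Below is a minimal sketch of the idea; the class name is made up and this is not mmengine's actual implementation.

    import pickle

    import numpy as np
    from torch.utils.data import Dataset

    class PackedAnnotations(Dataset):
        """Serialize each record into one shared numpy byte buffer so workers
        read it without copying a huge list of Python dicts."""

        def __init__(self, records):
            blobs = [pickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL) for r in records]
            self._ends = np.cumsum([len(b) for b in blobs])  # end offset of each record
            self._data = np.frombuffer(b"".join(blobs), dtype=np.uint8)

        def __len__(self):
            return len(self._ends)

        def __getitem__(self, idx):
            start = 0 if idx == 0 else int(self._ends[idx - 1])
            return pickle.loads(self._data[start:int(self._ends[idx])].tobytes())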

@xu19971109

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.
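
For mmcv 1.x-style configs this is a single top-level line, which is forwarded to DistributedDataParallel; it tolerates parameters that receive no gradient, at the cost of some speed.

    # Hypothetical one-line addition to the training config.
    find_unused_parameters = True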

@ggjy

ggjy commented Mar 23, 2023

Same problem. I find that a specific model (parameter count) combined with a specific batch size triggers the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

@aqppe

aqppe commented Jul 13, 2023

I got the same error, but then I noticed empty records within my JSON annotations. Fixing them solved the problem for me.

@annahambi

I got the same error, but then I noticed empty records within my JSON annotations. Fixing them solved the problem for me.

Can you give more details @aqppe ?

@Ignite616

Hello everybody, I met the same problem and finally found its root cause. If you set workers_per_gpu to 0, you will get the same error log as @JayQine; otherwise you will also get an additional CUDA error, 'cuDNN error: CUDNN_STATUS_NOT_INITIALIZED'. Both are caused by OOM. To verify this, run dmesg -T | grep -E -i -B100 'killed process' in your terminal and you will see why the process was terminated.

If you want to avoid the issue, you can reduce your batch size and run 'top' in your terminal to monitor memory usage.

To solve the problem completely, you should change the way your data is preprocessed and loaded.

Hello, I would like to ask you for advice. When training my model, I hit this problem when epoch 1 reached 330/880. What might be the reason? Is it related to batch_size, the PyTorch version, or an abnormal loss?

@happybear1015

Same problem. I find that a specific model (parameter count) combined with a specific batch size triggers the error (ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40349)); changing the batch size fixes it.

The error still appears with batch_size=1; how should I deal with it?

@arazd

arazd commented Aug 11, 2023

Check out this thread tatsu-lab/stanford_alpaca#81

@zParquet

zParquet commented Aug 24, 2023

Thanks. I reduced the size of the training set and that solved it.

@morning12138

I also met this failure, and I think I found the solution! It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.

It's really useful!

@rbareja25

Has anyone been able to fix this? I have increased RAM, reduced the batch size, and downgraded the torchvision version as mentioned above, but nothing works.

@amundra15

I also met this failure, and I think I found the solution! It was a mismatch between torchvision and torch: with torchvision 0.11.2+cu10.2 and torch 1.10.1+cu11.1 I got the error, but after installing torchvision 0.11.2+cu11.1 the problem was fixed. Hope my advice helps you.

This pointed me in the right direction. I had cudatoolkit-dev in my env (whose version did not match cudatoolkit). Uninstalling it resolved the issue for me.

@husnejahan

Hi,
I had the same problem; my dataset was too big.
First lower the batch size, then kill all processes to free GPU/CUDA memory. I hope that solves the issue.

@blablabLACK

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.

This works, thanks!

@puppystellar

I have reduced the batch size to 2 (1 leads to another error), but how can I use all the training data? Would more GPU memory help? Thanks!

@tanatomoe

Add find_unused_parameters=True to your config file.

The actual error is buried in dense warnings.

Hi!
How exactly do you change the batch size?

@badrinath89

If you are using Hugging Face, set this in your TrainingArguments():

ddp_find_unused_parameters=False

The error goes away.
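
For completeness, a minimal sketch with the Hugging Face Trainer; output_dir and everything else here are placeholders.

    from transformers import TrainingArguments

    # Passed straight through to DistributedDataParallel: skips the
    # unused-parameter search, as suggested above.
    args = TrainingArguments(
        output_dir="./out",
        ddp_find_unused_parameters=False,
    )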

@iris0329

I met this issue with cuda11.6+torch11.3; it happens only when the dataset is too large.
