ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685) #1969
Comments
Hi, could you paste the full error report?
sys.platform: linux
error_log:
Hmm, actually it seems that the fault trace stack doesn't give any information for
I believe that could help us solve the problem.
1. Running command: bash scripts/dist_train.sh 0,1,3,4,5,6,7 configs/base_nir/deeplabv3plus_r101.py
OK, but pasting the training config may help us locate the potential bugs and solve the problem. Did the CUDA version check help?
I met the same problem; it happens only when the dataset is too large, e.g. Objects365 or BigDetection. Small datasets such as COCO will not cause this problem. Hope this helps with debugging.
I also met the same problem. Any help?
@FuNian788 @ywdong Hi, can you paste your training config?
I have met the same problem when the dataset is too large.
Same problem here.
I have the same issue.
Same issue when the dataset is too large.
I had a similar issue - it worked when I used 1% of my data, which is 11 GB. How would one go about training on the full, larger dataset?
Hi, thanks for your report. We are trying to reproduce the error.
Same problem here.
Hello everybody, I met the same problem, and finally I found the key to it. If you want to avoid this issue, you can reduce your batch size and run the command 'top' in your terminal to monitor memory usage. But if you want to solve this problem completely, you should change the way your data is preprocessed and loaded.
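As a rough illustration of those first two knobs, here is what the memory-related dataloader settings look like in an MMDetection 2.x-style config; the field names samples_per_gpu and workers_per_gpu are the usual ones in that config format and may differ in your setup:

```python
# Sketch of an MMDetection 2.x-style config fragment (field names assumed,
# adjust to your own framework); lowering these is the quickest OOM mitigation.
data = dict(
    samples_per_gpu=2,   # per-GPU batch size; lower this first when memory fills up
    workers_per_gpu=2,   # dataloader worker processes; each worker holds its own
                         # reference to the dataset, so fewer workers means less host RAM
    train=dict(...),     # existing dataset settings stay unchanged
    val=dict(...),
    test=dict(...),
)
```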
You need to set the launcher in init_dist(launcher, backend) according to your program. I set it to PyTorch.
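For reference, a minimal sketch of that call using the mmcv (pre-2.0) runner API; 'pytorch' matches jobs started with torch.distributed.launch or torchrun, and the backend shown is an assumption:

```python
from mmcv.runner import init_dist

# Pick the launcher that matches how the job is started:
# 'pytorch' for torch.distributed.launch / torchrun, 'slurm' or 'mpi' otherwise.
init_dist('pytorch', backend='nccl')
```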
I had a similar issue.
Maybe you can try this,
I have the same issue. Did you resolve it?
Same issue here when loading a very large model.
I also met the failure, and I think I found the solution!
Hi, I encountered this problem when training on the iSAID dataset. Since there are more than 10,000 images in the validation set, memory fills up when the number of predictions reaches 6,000-7,000. It seems that the memory occupied by the predicted images is not released after prediction is completed.
I reproduced the error and found that it is related to OOM. An intuitive solution is to lower the
I think this phenomenon is highly related to pytorch/pytorch#13246, especially the discussion of copy-on-read overhead in that issue. For mmcv users, I think the new MMEngine fixes the problem; please see the doc here.
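For readers who cannot move to MMEngine yet, this is a minimal sketch of the usual workaround discussed in that PyTorch issue: serialize the per-sample annotations into one numpy byte buffer plus an offset array, so forked dataloader workers read shared memory instead of bumping refcounts on millions of Python objects. `load_annotations` here is a hypothetical placeholder for however you build your records:

```python
import pickle

import numpy as np
from torch.utils.data import Dataset


class PackedAnnotationDataset(Dataset):
    """Stores annotations as one uint8 buffer instead of a Python list of dicts,
    so fork-based dataloader workers do not trigger copy-on-read of the list."""

    def __init__(self, ann_file):
        records = load_annotations(ann_file)  # hypothetical: returns a list of dicts
        blobs = [pickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL) for r in records]
        self._addr = np.cumsum([len(b) for b in blobs])      # end offset of each record
        self._data = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        record = pickle.loads(self._data[start:end].tobytes())
        # ...load the image and build the training sample from `record` here...
        return record
```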
Same problem. I find that some specific models (number of parameters) with some specific batch sizes will encounter the error (
I got the same error, but I noticed empty records within my JSON annotations; fixing those solved the problem for me.
Can you give more details, @aqppe?
With batch_size=1 the error still appears; how should I deal with it?
Check out this thread: tatsu-lab/stanford_alpaca#81
Thanks. I reduced the size of the training set and that solved it.
It's really useful!
Has anyone been able to fix this? I have increased RAM, reduced the batch size, and downgraded the torchvision version as mentioned above, but nothing works.
This pointed me in the right direction. I had cudatoolkit-dev in my env (whose version did not match cudatoolkit). Uninstalling it resolved the issue for me.
Hi,
I have reduced the batch size to 2 (1 leads to another error), but how could I use all of the training data? Would more GPU memory help? Thanks!
If you are using Hugging Face, set this in your
The error goes away.
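The specific setting was not captured in the comment above, so as an assumption, these are the memory-related TrainingArguments fields that are most often adjusted when this failure is an OOM in disguise:

```python
from transformers import TrainingArguments

# All field names below are real TrainingArguments parameters; the values are
# only a starting point for trading throughput against memory.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # smallest per-GPU batch
    gradient_accumulation_steps=8,   # keeps the effective batch size up
    gradient_checkpointing=True,     # trades compute for activation memory
    dataloader_num_workers=0,        # avoids duplicating a large dataset per worker
)
```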
Met this issue with cuda11.6 + torch11.3. It happens only when the dataset is too large.
I wrote my own dataset class and dataloader, and while training with mmcv.runner, I get the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2762685)".
I cannot locate the root problem from this error report. How can I resolve this issue?
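One way to get past the uninformative elastic message is to attach PyTorch's elastic error handler to the training entry point, which surfaces the failing worker's own traceback; this is a minimal sketch, where `main` stands for whatever function your script actually runs:

```python
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    ...  # build the dataset, dataloader and mmcv runner, then start training


if __name__ == "__main__":
    main()
```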