torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #945

Closed
Winnie202 opened this issue Jul 5, 2022 · 7 comments

Comments

@Winnie202

How can I solve this error?
[Screenshot: 2022-07-05 11-49-08]

[Screenshot: 2022-07-05 11-50-48]

[Screenshot: 2022-07-05 11-50-07]

@wangruohui
Member

Please try cloning the repo with git clone instead of downloading the zip and extracting it.
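
A zip archive downloaded from GitHub does not contain the .git metadata that a git clone creates, so one quick way to tell the two apart is to look for a .git entry in the source tree. The following is a minimal sketch of that check; the helper name find_git_root is hypothetical and is not part of mmediting.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: str = ".") -> Optional[Path]:
    """Return the nearest directory at or above `start` that holds a .git entry.

    Hypothetical helper for illustration only. Returns None when no .git
    entry is found, which is what an extracted zip archive looks like.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # .git is a directory in a normal clone and a file in worktrees/submodules.
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    root = find_git_root()
    if root is None:
        print("No .git found: this looks like an extracted zip, not a clone.")
    else:
        print(f"Git repository root: {root}")
```

Run it from inside the mmediting source directory. If it reports no .git entry, re-fetching the code with git clone is the likely fix for the "not a git repository" message.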

@wty555

wty555 commented Jul 5, 2022

Please try git clone the repo instead of downloading the zip and extract.

I followed your suggestion and ran git clone https://github.com/open-mmlab/mmediting.git, but the same error still occurs. I don't know how to solve it.

@wangruohui
Member

wangruohui commented Jul 5, 2022

Is the error still "fatal: not a git repository"? Like

@wangruohui
Member

Does everything work well with a single GPU?

@wty555

wty555 commented Jul 5, 2022

Does everything works well with single GPU?

Yes, it shows an error:
[Screenshot: 2022-07-05 15-07-22]
[Screenshot: 2022-07-05 15-14-27]
[Screenshot: 2022-07-05 15-14-40]

@wangruohui
Member

wangruohui commented Jul 6, 2022

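From the screenshots you posted, the dataloader was killed, and the other errors should all be consequences of that. First try a single GPU run with python tools/train.py or python tools/test.py to confirm that reading the files works.
Also, how many GPUs does your single machine have, and how many dataloader workers are enabled? Check whether there were so many processes that memory ran out and they were killed.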
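
To make the memory concern concrete: in multi-GPU training each GPU gets its own training process, and each of those starts its own dataloader workers, so the process count grows with both numbers. The sketch below is a rough back-of-the-envelope estimate, not a measurement. The function names and the per-process memory figure are assumptions for illustration, and it simplifies by treating every process as using the same amount of host memory.

```python
def total_loader_processes(num_gpus: int, workers_per_gpu: int) -> int:
    """Count processes: one trainer per GPU plus that trainer's dataloader workers."""
    return num_gpus * (1 + workers_per_gpu)


def estimated_host_memory_gib(
    num_gpus: int, workers_per_gpu: int, gib_per_process: float
) -> float:
    """Rough host RAM estimate, assuming every process uses `gib_per_process`."""
    return total_loader_processes(num_gpus, workers_per_gpu) * gib_per_process


def max_workers_per_gpu(
    num_gpus: int, host_memory_gib: float, gib_per_process: float
) -> int:
    """Largest worker count per GPU that stays within `host_memory_gib`."""
    per_gpu_budget = host_memory_gib / num_gpus
    # Reserve one process slot per GPU for the trainer itself.
    return max(0, int(per_gpu_budget // gib_per_process) - 1)


if __name__ == "__main__":
    # Hypothetical numbers: 8 GPUs, 8 workers each, about 2 GiB per process.
    gpus, workers, gib_each = 8, 8, 2.0
    print("processes:", total_loader_processes(gpus, workers))
    print("estimated GiB:", estimated_host_memory_gib(gpus, workers, gib_each))
    print("max workers per GPU with 64 GiB:", max_workers_per_gpu(gpus, 64.0, gib_each))
```

If the estimate exceeds the machine's RAM, lowering the number of dataloader workers in the config is a reasonable first thing to try.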

@zengyh1900
Collaborator

Closing due to inactivity. Please reopen if there are any further problems.
