
All tensors must be on devices[0]: 0 #177

Closed
Ice-cool opened this issue Jan 21, 2022 · 6 comments · Fixed by #194

Comments

@Ice-cool

🐛 Describe the bug

For https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet, when running `python -m torch.distributed.launch --nproc_per_node 2 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py`, there is an error: All tensors must be on devices[0]: 0

Environment

torch=1.8.1

@FrankLeeeee
Contributor

Hi, did you build colossalai from source?

@Ice-cool
Author

Yes, I did. Using nproc_per_node 1 is OK, but nproc_per_node 2 hits this error.

@FrankLeeeee
Contributor

Ok, let me try to reproduce this error. May I know which GPU you are using and how many GPUs are available on your machine?

@Ice-cool
Author

A Tesla V100, and 2 GPUs are available on my machine.

@FrankLeeeee
Contributor

Got it, let me try to reproduce this issue. I will get back to you soon!

@FrankLeeeee
Contributor

Hi, sorry for my late reply. We only have A100 machines, so it took a while for me to find a V100 machine. The bug can be reproduced on torch 1.8 but not on torch 1.10. It is caused by an optional argument to PyTorch's DistributedDataParallel, and will be fixed in #194.
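For context, a minimal sketch of the kind of setup involved. This assumes the optional argument in question is DDP's `device_ids` (the thread does not name it, and the actual fix is in #194); the `ddp_device_ids` helper here is hypothetical. When `device_ids` is left unset on older torch versions, DDP can fall back to broadcasting across all visible GPUs, which triggers "All tensors must be on devices[0]" under multi-process launch.

```python
import os

# Hypothetical helper: each process spawned by torch.distributed.launch
# drives exactly one GPU, so DDP's device list must contain only that
# process's local rank (devices[0] must be the process's own GPU).
def ddp_device_ids(local_rank: int) -> list:
    return [local_rank]

# Sketch of the wrapper call (assumes torch is installed, a process group
# is initialized, and the local rank is known; on torch 1.8,
# torch.distributed.launch passes --local_rank rather than setting the
# LOCAL_RANK environment variable unless --use_env is given):
#
#   import torch
#   local_rank = int(os.environ.get("LOCAL_RANK", 0))
#   torch.cuda.set_device(local_rank)
#   model = model.to(local_rank)
#   model = torch.nn.parallel.DistributedDataParallel(
#       model,
#       device_ids=ddp_device_ids(local_rank),
#       output_device=local_rank,
#   )
```

With `device_ids` pinned per process like this, the behavior should match what the reporter saw working on torch 1.10.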
