Pytorchjob dist-mnist no training logs #1601

Closed
Findlazyfriend opened this issue May 30, 2022 · 4 comments

Comments

@Findlazyfriend

Hello, everyone.
As a novice, I have run into a seemingly simple problem. When I run examples/pytorch/mnist/mnist.py, no log output appears after the data is downloaded, yet the master and worker both stay in the Running status. When I debug locally, the training progress is displayed normally. This question may seem a bit naive, but I would appreciate any pointers. Thank you very much.
Full output:

Using CUDA
Using distributed PyTorch with nccl backend
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Processing...
Done!

@johnugeorge
Member

Do you have GPUs in your cluster?

Can you try Gloo backend?
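
The suggestion above can be sketched as a small backend check. This is only an illustration, not part of the example script: `choose_backend` and `detect_backend` are hypothetical helper names, and the idea is simply to fall back to Gloo whenever CUDA or NCCL is not usable in the container.

```python
def choose_backend(cuda_available, nccl_available):
    """Return "nccl" only when GPUs and NCCL are both usable, else "gloo"."""
    return "nccl" if (cuda_available and nccl_available) else "gloo"


def detect_backend():
    """Probe the local PyTorch install; fall back to "gloo" if anything is missing."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # Assumption for this sketch: no PyTorch means no NCCL either.
        return choose_backend(False, False)
    # is_nccl_available() is only queried when the distributed package is built in.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return choose_backend(torch.cuda.is_available(), nccl_ok)


if __name__ == "__main__":
    print("Suggested backend:", detect_backend(), flush=True)
```

If training progresses with Gloo but hangs with NCCL, the problem is more likely in the GPU or NCCL setup than in the job definition.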

@Findlazyfriend
Author

Findlazyfriend commented Jun 6, 2022

Do you have GPUs in your cluster?

Can you try Gloo backend?

Thank you very much for your reply. The problem has been solved: the job was always getting stuck in model.cuda(). The cause was a version problem in my base image.😊
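
For anyone hitting a similar hang, here is a minimal diagnostic sketch, assuming the job gets stuck in a call such as model.cuda(). The helper names `dump_stack_if_stuck` and `cuda_environment_report` are hypothetical. The first uses the standard-library `faulthandler` module to print the Python stacks if a step takes too long; the second prints the PyTorch and CUDA versions found in the image so they can be compared with the driver on the node.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def dump_stack_if_stuck(label, timeout_s=120.0):
    """Dump all thread stacks to stderr if the wrapped block exceeds timeout_s."""
    print(f"[start] {label}", flush=True)
    faulthandler.dump_traceback_later(timeout_s, exit=False)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()
    print(f"[done] {label}", flush=True)


def cuda_environment_report():
    """Collect the versions that must agree between image, driver, and GPU."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    report = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    print(cuda_environment_report(), flush=True)
    # Hypothetical usage inside the training script:
    # with dump_stack_if_stuck("model.cuda()"):
    #     model = model.cuda()
```

Printing with `flush=True` matters here, because buffered output can make a container look silent even when the script has already moved past the download step.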

@johnugeorge
Member

Thanks.
Closing this issue

@N-Kingsley

Hello, everyone. As a novice, I have run into a seemingly simple problem. When I run examples/pytorch/mnist/mnist.py, no log output appears after the data is downloaded, yet the master and worker both stay in the Running status. When I debug locally, the training progress is displayed normally. This question may seem a bit naive, but I would appreciate any pointers. Thank you very much. Full output:

Using CUDA
Using distributed PyTorch with nccl backend
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Processing...
Done!

I have run into the same problem. Could you tell me what happened and how you solved it?
