Hang when run on distributed mode #247
Comments
The |
Yes, I can run it on both machines, but it hangs in cluster mode. Maybe the network interface has a problem. |
Could you share the log from the other machine? You can also turn on debug logging: export KUNGFU_CONFIG_LOG_LEVEL=DEBUG |
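For anyone who prefers to script the launch, here is a minimal sketch of setting that variable from Python and assembling the same command line used in this issue. The helper name `build_debug_run` is hypothetical and not part of KungFu; only the `-np`, `-H`, and `-nic` flags and the `KUNGFU_CONFIG_LOG_LEVEL` variable come from this thread.

```python
import os
import shlex


def build_debug_run(hosts, nic, script_args, np=2):
    """Return (argv, env) for a kungfu-run launch with debug logging enabled.

    Hypothetical helper: it only assembles the command shown in this issue
    and copies the current environment with the log level set to DEBUG.
    """
    env = dict(os.environ)
    env["KUNGFU_CONFIG_LOG_LEVEL"] = "DEBUG"
    argv = ["kungfu-run", "-np", str(np), "-H", hosts, "-nic", nic, *script_args]
    return argv, env


if __name__ == "__main__":
    argv, env = build_debug_run(
        "10.208.209.163:1,10.208.209.171:1",
        "eno1",
        ["python3", "examples/tf1_mnist_session.py", "--data-dir=./mnist"],
    )
    # Print the command instead of running it, so the sketch is safe to try.
    print("KUNGFU_CONFIG_LOG_LEVEL=" + env["KUNGFU_CONFIG_LOG_LEVEL"], shlex.join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Running the same launch on each machine and comparing the two debug logs is probably the quickest way to see which side stops making progress.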
I fixed this issue. Log server 1: Log server 2: So it works well. Thanks. |
Following the README, I can run on each local node successfully:
xxx@master:/tmp/KungFu$ kungfu-run -np 2 python3 examples/tf1_mnist_session.py --data-dir=./mnist
...
[I] all 2/2 local peers finished, took 2.397370504s
But when I run on the cluster, it hangs without any error:
@master:/tmp/KungFu$ kungfu-run -np 2 -H 10.208.209.163:1,10.208.209.171:1 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=2
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:1
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[nic] [0] lo :: 127.0.0.1/8
[nic] [1] eno1 :: 10.208.209.163/24
[nic] [2] docker0 :: 192.168.99.1/24
[nic] [3] br-fefb2fb37d81 :: 172.18.0.1/16
[cuda-env]: CUDA_VISIBLE_DEVICES=1
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
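Since the run stops here without an error, one thing worth ruling out is basic reachability between the two hosts in the `-H` list. The following is a rough sketch under assumptions: `parse_host_list` and `can_connect` are hypothetical helpers, the default of 1 slot when none is given is a guess, and the port to probe is a stand-in (for example 22 for SSH), because the ports KungFu peers actually use are not stated in this thread.

```python
import socket


def parse_host_list(spec):
    """Split an -H value such as 'ip1:1,ip2:1' into (host, slots) pairs.

    Assumption: a host without ':slots' is treated as having 1 slot.
    """
    peers = []
    for item in spec.split(","):
        host, _, slots = item.strip().partition(":")
        peers.append((host, int(slots) if slots else 1))
    return peers


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling `can_connect(host, 22)` for each host returned by `parse_host_list("10.208.209.163:1,10.208.209.171:1")`, once from each machine, would show whether the two nodes can reach each other at all over the selected interface. A successful probe does not prove the KungFu ports themselves are open, so a firewall could still be the cause.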