
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0 #153

Closed
gangliao opened this issue Jan 26, 2018 · 6 comments


gangliao commented Jan 26, 2018

kubectl get pods
NAME                                READY     STATUS              RESTARTS   AGE
ks-test-cnn-ps-ozp4-0-xg755         1/1       Running             0          37m
ks-test-cnn-worker-ozp4-0-zc5pb     1/1       Running             0          37m
my-nginx-59b6bdfc4-7hvhk            1/1       Running             0          46m
my-nginx-59b6bdfc4-dbzrz            1/1       Unknown             0          1d
my-nginx-59b6bdfc4-rmf9v            1/1       Unknown             0          1d
my-nginx-59b6bdfc4-whjbx            1/1       Running             0          46m
nginx-ds-h89gg                      1/1       Running             1          8d
nginx-ds-hqc26                      1/1       NodeLost            0          8d
nginx-ds-njmwl                      1/1       Running             1          7d
nginx-ds-nnh66                      0/1       ContainerCreating   0          22h
nginx-ds-p68g9                      1/1       NodeLost            0          20h
tf-hub-0                            1/1       Running             0          23h
tf-job-dashboard-59fcb66998-6wrwz   1/1       Unknown             0          23h
tf-job-dashboard-59fcb66998-wjntk   1/1       Running             0          21h
tf-job-operator-55b9c748b8-cx8ml    1/1       Running             0          23h
root@bjzw_104_73 ~/my-kubeflow# kubectl logs pod/ks-test-cnn-worker-ozp4-0-zc5pb -f
INFO|2018-01-26T03:47:24|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T03:47:24|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --job_name=worker --ps_hosts=ks-test-cnn-ps-ozp4-0:2222 --worker_hosts=ks-test-cnn-worker-ozp4-0:2222 --task_index=0
INFO|2018-01-26T03:47:24|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --job_name=worker --ps_hosts=ks-test-cnn-ps-ozp4-0:2222 --worker_hosts=ks-test-cnn-worker-ozp4-0:2222 --task_index=0
INFO|2018-01-26T03:47:25|/opt/launcher.py|27| 2018-01-26 03:47:25.906439: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| 2018-01-26 03:47:26.372473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| pciBusID: 0000:0e:00.0
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| 2018-01-26 03:47:26.372523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:0e:00.0, compute capability: 6.1)
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.988880: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ks-test-cnn-ps-ozp4-0:2222}
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.988925: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.991142: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Batch size:  32 global
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0']
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| ==========
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Generating model
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| Instructions for updating:
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
INFO|2018-01-26T03:50:40|/opt/launcher.py|27| 2018-01-26 03:50:40.810339: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:50:50|/opt/launcher.py|27| 2018-01-26 03:50:50.810569: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:00|/opt/launcher.py|27| 2018-01-26 03:51:00.810765: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:10|/opt/launcher.py|27| 2018-01-26 03:51:10.810949: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:20|/opt/launcher.py|27| 2018-01-26 03:51:20.811174: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:30|/opt/launcher.py|27| 2018-01-26 03:51:30.811460: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:40|/opt/launcher.py|27| 2018-01-26 03:51:40.811757: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:50|/opt/launcher.py|27| 2018-01-26 03:51:50.811970: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:52:00|/opt/launcher.py|27| 2018-01-26 03:52:00.812158: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
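The repeated `CreateSession still waiting for response from worker: /job:ps/replica:0/task:0` message means the worker's gRPC channel to the parameter server at `ks-test-cnn-ps-ozp4-0:2222` never gets a reply; in practice this usually comes down to the Service name failing to resolve, or nothing listening on the port. A minimal, hypothetical check (not part of launcher.py) that could be run from inside the worker pod:

```python
import socket

def endpoint_reachable(host, port, timeout=3.0):
    """Return (resolves, connects): does `host` resolve via DNS,
    and does host:port accept a TCP connection?"""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return (False, False)   # DNS failure: Service missing or kube-dns down
    addr = infos[0][4]
    try:
        with socket.create_connection(addr[:2], timeout=timeout):
            return (True, True)
    except OSError:
        return (True, False)    # name resolves, but nothing answers on the port
```

Inside the worker pod, `endpoint_reachable("ks-test-cnn-ps-ozp4-0", 2222)` returning `(False, False)` would point at a DNS or missing-Service problem, while `(True, False)` would mean the PS pod is reachable by name but its gRPC server has not started.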
@gangliao
Contributor Author

root@bjzw_104_73 ~/my-kubeflow# kubectl get pods  -o wide
NAME                                READY     STATUS              RESTARTS   AGE       IP            NODE
ks-test-cnn-ps-ozp4-0-xg755         1/1       Running             0          42m       172.30.31.7   10.141.186.118
ks-test-cnn-worker-ozp4-0-zc5pb     1/1       Running             0          42m       172.30.92.6   10.141.186.119
my-nginx-59b6bdfc4-7hvhk            1/1       Running             0          51m       172.30.31.6   10.141.186.118
my-nginx-59b6bdfc4-dbzrz            1/1       Unknown             0          1d        172.30.47.3   10.142.104.73
my-nginx-59b6bdfc4-rmf9v            1/1       Unknown             0          1d        172.30.47.4   10.142.104.73
my-nginx-59b6bdfc4-whjbx            1/1       Running             0          51m       172.30.92.5   10.141.186.119
nginx-ds-h89gg                      1/1       Running             1          8d        172.30.31.3   10.141.186.118
nginx-ds-hqc26                      1/1       NodeLost            0          8d        172.30.47.6   10.142.104.73
nginx-ds-njmwl                      1/1       Running             1          7d        172.30.92.3   10.141.186.119
nginx-ds-nnh66                      0/1       ContainerCreating   0          22h       <none>        10.141.176.113
nginx-ds-p68g9                      1/1       NodeLost            0          20h       172.30.59.3   10.141.176.112
tf-hub-0                            1/1       Running             0          23h       172.30.31.5   10.141.186.118
tf-job-dashboard-59fcb66998-6wrwz   1/1       Unknown             0          23h       172.30.47.2   10.142.104.73
tf-job-dashboard-59fcb66998-wjntk   1/1       Running             0          21h       172.30.31.2   10.141.186.118
tf-job-operator-55b9c748b8-cx8ml    1/1       Running             0          23h       172.30.92.4   10.141.186.119


jlewi commented Jan 26, 2018

What do the logs from your parameter server show?

Check the Services (`kubectl get svc`) and make sure one was created for ks-test-cnn-ps-ozp4-0; the worker reaches the PS pod ks-test-cnn-ps-ozp4-0-xg755 through that Service name.

@gangliao
Contributor Author

@jlewi

Here it is! Thank you.

kubectl logs pod/ks-test-cnn-ps-zf75-0-dj84v -f
INFO|2018-01-26T13:23:42|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:23:42|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:42|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.015274: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| E0126 13:23:44.015798095       7 ev_epoll1_linux.c:1051]     grpc epoll fd: 3
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> ks-test-cnn-worker-zf75-0:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.025193: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Batch size:  64 global
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Running parameter server 0

@gaocegege
Member

Also, please make sure that kube-dns is up.
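A kube-dns outage (or DNS simply not being ready when the pods start) is consistent with the symptom above, since the worker reaches the parameter server through the Service name. One defensive pattern, sketched here as a hypothetical pre-launch step (the function and parameters are illustrative, not part of launcher.py), is to wait until every peer hostname from `--ps_hosts`/`--worker_hosts` resolves before starting the TensorFlow server:

```python
import socket
import time

def wait_for_dns(hosts, timeout=300.0, interval=10.0):
    """Block until every "host:port" entry's hostname resolves, or raise.

    `hosts` takes the same comma-split form as --ps_hosts / --worker_hosts.
    """
    deadline = time.monotonic() + timeout
    pending = {h.split(":")[0] for h in hosts}
    while pending:
        for name in sorted(pending):
            try:
                socket.getaddrinfo(name, None)
                pending.discard(name)
            except socket.gaierror:
                pass  # not resolvable yet; retry after the interval
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError("unresolved hosts: %s" % sorted(pending))
            time.sleep(interval)

# e.g. wait_for_dns(["ks-test-cnn-ps-zf75-0:2222"]) before creating the server
```

If the names never resolve, check that the kube-dns pods are Running and that a Service exists for each replica of the job.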

@gangliao
Contributor Author

Oh! Sorry, after I restarted it, it works now.

I don't know what happened...

kubectl logs pod/ks-test-cnn-worker-zf75-0-nw72s -f
INFO|2018-01-26T13:26:02|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:26:02|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:02|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.232227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.911165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| pciBusID: 0000:04:00.0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.316989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 1 with properties:
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| pciBusID: 0000:0f:00.0
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1079] Device peer to peer matrix
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] DMA: 0 1
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 0:   Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 1:   Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: TITAN Xp, pci bus id: 0000:0f:00.0, compute capability: 6.1)
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967613: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ks-test-cnn-ps-zf75-0:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967650: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.969934: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Batch size:  64 global
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Generating model
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| Instructions for updating:
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
INFO|2018-01-26T13:29:01|/opt/launcher.py|27| 2018-01-26 13:29:01.058751: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session 91d640aa1bdc9570 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2018-01-26T13:29:03|/opt/launcher.py|27| Running warm up


INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Done warm up
INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Step	Img/sec	loss
INFO|2018-01-26T13:29:27|/opt/launcher.py|27| 1	images/sec: 34.6 +/- 0.0 (jitter = 0.0)	9.714


jlewi commented Jan 29, 2018

Glad it's working.

@jlewi jlewi closed this as completed Jan 29, 2018
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021