
CreateSession still waiting for response from worker: /job:ps/replica:0/task:0 #153

Closed
gangliao opened this issue Jan 26, 2018 · 6 comments


gangliao commented Jan 26, 2018

kubectl get pods
NAME                                READY     STATUS              RESTARTS   AGE
ks-test-cnn-ps-ozp4-0-xg755         1/1       Running             0          37m
ks-test-cnn-worker-ozp4-0-zc5pb     1/1       Running             0          37m
my-nginx-59b6bdfc4-7hvhk            1/1       Running             0          46m
my-nginx-59b6bdfc4-dbzrz            1/1       Unknown             0          1d
my-nginx-59b6bdfc4-rmf9v            1/1       Unknown             0          1d
my-nginx-59b6bdfc4-whjbx            1/1       Running             0          46m
nginx-ds-h89gg                      1/1       Running             1          8d
nginx-ds-hqc26                      1/1       NodeLost            0          8d
nginx-ds-njmwl                      1/1       Running             1          7d
nginx-ds-nnh66                      0/1       ContainerCreating   0          22h
nginx-ds-p68g9                      1/1       NodeLost            0          20h
tf-hub-0                            1/1       Running             0          23h
tf-job-dashboard-59fcb66998-6wrwz   1/1       Unknown             0          23h
tf-job-dashboard-59fcb66998-wjntk   1/1       Running             0          21h
tf-job-operator-55b9c748b8-cx8ml    1/1       Running             0          23h
root@bjzw_104_73 ~/my-kubeflow# kubectl logs pod/ks-test-cnn-worker-ozp4-0-zc5pb -f
INFO|2018-01-26T03:47:24|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T03:47:24|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --job_name=worker --ps_hosts=ks-test-cnn-ps-ozp4-0:2222 --worker_hosts=ks-test-cnn-worker-ozp4-0:2222 --task_index=0
INFO|2018-01-26T03:47:24|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --job_name=worker --ps_hosts=ks-test-cnn-ps-ozp4-0:2222 --worker_hosts=ks-test-cnn-worker-ozp4-0:2222 --task_index=0
INFO|2018-01-26T03:47:25|/opt/launcher.py|27| 2018-01-26 03:47:25.906439: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| 2018-01-26 03:47:26.372473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| pciBusID: 0000:0e:00.0
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T03:47:26|/opt/launcher.py|27| 2018-01-26 03:47:26.372523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:0e:00.0, compute capability: 6.1)
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.988880: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ks-test-cnn-ps-ozp4-0:2222}
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.988925: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 2018-01-26 03:50:25.991142: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Batch size:  32 global
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0']
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| ==========
INFO|2018-01-26T03:50:25|/opt/launcher.py|27| Generating model
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| Instructions for updating:
INFO|2018-01-26T03:50:27|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
INFO|2018-01-26T03:50:40|/opt/launcher.py|27| 2018-01-26 03:50:40.810339: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:50:50|/opt/launcher.py|27| 2018-01-26 03:50:50.810569: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:00|/opt/launcher.py|27| 2018-01-26 03:51:00.810765: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:10|/opt/launcher.py|27| 2018-01-26 03:51:10.810949: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:20|/opt/launcher.py|27| 2018-01-26 03:51:20.811174: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:30|/opt/launcher.py|27| 2018-01-26 03:51:30.811460: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:40|/opt/launcher.py|27| 2018-01-26 03:51:40.811757: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:51:50|/opt/launcher.py|27| 2018-01-26 03:51:50.811970: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
INFO|2018-01-26T03:52:00|/opt/launcher.py|27| 2018-01-26 03:52:00.812158: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
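The repeated `CreateSession still waiting for response from worker: /job:ps/replica:0/task:0` message means the worker's gRPC channel to the parameter server at `ks-test-cnn-ps-ozp4-0:2222` never gets a reply; in practice this usually comes down to the Service name failing to resolve, or nothing listening on the port. A minimal, hypothetical check (not part of launcher.py) that could be run from inside the worker pod:

```python
import socket

def endpoint_reachable(host, port, timeout=3.0):
    """Return (resolves, connects): does `host` resolve via DNS,
    and does host:port accept a TCP connection?"""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return (False, False)   # DNS failure: Service missing or kube-dns down
    addr = infos[0][4]
    try:
        with socket.create_connection(addr[:2], timeout=timeout):
            return (True, True)
    except OSError:
        return (True, False)    # name resolves, but nothing answers on the port
```

Inside the worker pod, `endpoint_reachable("ks-test-cnn-ps-ozp4-0", 2222)` returning `(False, False)` would point at a DNS or missing-Service problem, while `(True, False)` would mean the PS pod is reachable by name but its gRPC server has not started.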
@gangliao
Contributor Author

root@bjzw_104_73 ~/my-kubeflow# kubectl get pods  -o wide
NAME                                READY     STATUS              RESTARTS   AGE       IP            NODE
ks-test-cnn-ps-ozp4-0-xg755         1/1       Running             0          42m       172.30.31.7   10.141.186.118
ks-test-cnn-worker-ozp4-0-zc5pb     1/1       Running             0          42m       172.30.92.6   10.141.186.119
my-nginx-59b6bdfc4-7hvhk            1/1       Running             0          51m       172.30.31.6   10.141.186.118
my-nginx-59b6bdfc4-dbzrz            1/1       Unknown             0          1d        172.30.47.3   10.142.104.73
my-nginx-59b6bdfc4-rmf9v            1/1       Unknown             0          1d        172.30.47.4   10.142.104.73
my-nginx-59b6bdfc4-whjbx            1/1       Running             0          51m       172.30.92.5   10.141.186.119
nginx-ds-h89gg                      1/1       Running             1          8d        172.30.31.3   10.141.186.118
nginx-ds-hqc26                      1/1       NodeLost            0          8d        172.30.47.6   10.142.104.73
nginx-ds-njmwl                      1/1       Running             1          7d        172.30.92.3   10.141.186.119
nginx-ds-nnh66                      0/1       ContainerCreating   0          22h       <none>        10.141.176.113
nginx-ds-p68g9                      1/1       NodeLost            0          20h       172.30.59.3   10.141.176.112
tf-hub-0                            1/1       Running             0          23h       172.30.31.5   10.141.186.118
tf-job-dashboard-59fcb66998-6wrwz   1/1       Unknown             0          23h       172.30.47.2   10.142.104.73
tf-job-dashboard-59fcb66998-wjntk   1/1       Running             0          21h       172.30.31.2   10.141.186.118
tf-job-operator-55b9c748b8-cx8ml    1/1       Running             0          23h       172.30.92.4   10.141.186.119


jlewi commented Jan 26, 2018

What do the logs from your parameter server show?

Check the Services (`kubectl get svc`) and make sure one was created for ks-test-cnn-ps-ozp4-0; the worker reaches the PS pod ks-test-cnn-ps-ozp4-0-xg755 through that Service name.

@gangliao
Contributor Author

@jlewi

Here it is! Thank you.

kubectl logs pod/ks-test-cnn-ps-zf75-0-dj84v -f
INFO|2018-01-26T13:23:42|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:23:42|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:42|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.015274: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| E0126 13:23:44.015798095       7 ev_epoll1_linux.c:1051]     grpc epoll fd: 3
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> ks-test-cnn-worker-zf75-0:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.025193: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Batch size:  64 global
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Running parameter server 0

@gaocegege
Member

Also, please make sure that kube-dns is up.
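A kube-dns outage (or DNS simply not being ready when the pods start) is consistent with the symptom above, since the worker reaches the parameter server through the Service name. One defensive pattern, sketched here as a hypothetical pre-launch step (the function and parameters are illustrative, not part of launcher.py), is to wait until every peer hostname from `--ps_hosts`/`--worker_hosts` resolves before starting the TensorFlow server:

```python
import socket
import time

def wait_for_dns(hosts, timeout=300.0, interval=10.0):
    """Block until every "host:port" entry's hostname resolves, or raise.

    `hosts` takes the same comma-split form as --ps_hosts / --worker_hosts.
    """
    deadline = time.monotonic() + timeout
    pending = {h.split(":")[0] for h in hosts}
    while pending:
        for name in sorted(pending):
            try:
                socket.getaddrinfo(name, None)
                pending.discard(name)
            except socket.gaierror:
                pass  # not resolvable yet; retry after the interval
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError("unresolved hosts: %s" % sorted(pending))
            time.sleep(interval)

# e.g. wait_for_dns(["ks-test-cnn-ps-zf75-0:2222"]) before creating the server
```

If the names never resolve, check that the kube-dns pods are Running and that a Service exists for each replica of the job.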

@gangliao
Contributor Author

Oh! Sorry, after I restarted it, it works now.

I don't know what happened...

kubectl logs pod/ks-test-cnn-worker-zf75-0-nw72s -f
INFO|2018-01-26T13:26:02|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:26:02|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:02|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.232227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.911165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| pciBusID: 0000:04:00.0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.316989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 1 with properties:
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| pciBusID: 0000:0f:00.0
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1079] Device peer to peer matrix
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] DMA: 0 1
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 0:   Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 1:   Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: TITAN Xp, pci bus id: 0000:0f:00.0, compute capability: 6.1)
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967613: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ks-test-cnn-ps-zf75-0:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967650: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.969934: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| TensorFlow:  1.5
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Model:       resnet50
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Mode:        training
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| SingleSess:  False
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Batch size:  64 global
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Devices:     ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Optimizer:   sgd
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Variables:   parameter_server
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Sync:        True
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Generating model
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| Instructions for updating:
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
INFO|2018-01-26T13:29:01|/opt/launcher.py|27| 2018-01-26 13:29:01.058751: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session 91d640aa1bdc9570 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2018-01-26T13:29:03|/opt/launcher.py|27| Running warm up


INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Done warm up
INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Step	Img/sec	loss
INFO|2018-01-26T13:29:27|/opt/launcher.py|27| 1	images/sec: 34.6 +/- 0.0 (jitter = 0.0)	9.714


jlewi commented Jan 29, 2018

Glad it's working.

@jlewi jlewi closed this as completed Jan 29, 2018
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021