CreateSession still waiting for response from worker: /job:ps/replica:0/task:0 #153
root@bjzw_104_73 ~/my-kubeflow# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
ks-test-cnn-ps-ozp4-0-xg755 1/1 Running 0 42m 172.30.31.7 10.141.186.118
ks-test-cnn-worker-ozp4-0-zc5pb 1/1 Running 0 42m 172.30.92.6 10.141.186.119
my-nginx-59b6bdfc4-7hvhk 1/1 Running 0 51m 172.30.31.6 10.141.186.118
my-nginx-59b6bdfc4-dbzrz 1/1 Unknown 0 1d 172.30.47.3 10.142.104.73
my-nginx-59b6bdfc4-rmf9v 1/1 Unknown 0 1d 172.30.47.4 10.142.104.73
my-nginx-59b6bdfc4-whjbx 1/1 Running 0 51m 172.30.92.5 10.141.186.119
nginx-ds-h89gg 1/1 Running 1 8d 172.30.31.3 10.141.186.118
nginx-ds-hqc26 1/1 NodeLost 0 8d 172.30.47.6 10.142.104.73
nginx-ds-njmwl 1/1 Running 1 7d 172.30.92.3 10.141.186.119
nginx-ds-nnh66 0/1 ContainerCreating 0 22h <none> 10.141.176.113
nginx-ds-p68g9 1/1 NodeLost 0 20h 172.30.59.3 10.141.176.112
tf-hub-0 1/1 Running 0 23h 172.30.31.5 10.141.186.118
tf-job-dashboard-59fcb66998-6wrwz 1/1 Unknown 0 23h 172.30.47.2 10.142.104.73
tf-job-dashboard-59fcb66998-wjntk 1/1 Running 0 21h 172.30.31.2 10.141.186.118
tf-job-operator-55b9c748b8-cx8ml 1/1 Running 0 23h 172.30.92.4 10.141.186.119
What do the logs from your parameter server show? Check the services and make sure ks-test-cnn-ps-ozp4-0-xg755 is created.
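Beyond listing the services, one way to confirm from inside a worker pod that the parameter-server service is actually reachable on its gRPC port is a plain TCP probe. A minimal sketch (the helper name is hypothetical, not part of tf_cnn_benchmarks):

```python
import socket

def ps_reachable(host, port=2222, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    False here usually means the Service for the PS pod was never
    created, or the PS process is not listening yet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. ps_reachable("ks-test-cnn-ps-ozp4-0") from the worker pod
```

If this returns False while the PS pod shows Running, the Service (or its endpoints) is the likely culprit rather than the PS process itself.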
Here it is! Thank you.
kubectl logs pod/ks-test-cnn-ps-zf75-0-dj84v -f
INFO|2018-01-26T13:23:42|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:23:42|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:42|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=ps --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.015274: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| E0126 13:23:44.015798095 7 ev_epoll1_linux.c:1051] grpc epoll fd: 3
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023075: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.023112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> ks-test-cnn-worker-zf75-0:2222}
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 2018-01-26 13:23:44.025193: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| TensorFlow: 1.5
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Model: resnet50
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Mode: training
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| SingleSess: False
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Batch size: 64 global
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Optimizer: sgd
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Variables: parameter_server
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Sync: True
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:23:44|/opt/launcher.py|27| Running parameter server 0
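The `--ps_hosts`/`--worker_hosts` flags in the command above define cluster membership on both sides; roughly, they get split into a TF-style cluster spec dict. A simplified sketch (`build_cluster_spec` is a hypothetical name, not the benchmark's actual helper):

```python
def build_cluster_spec(ps_hosts, worker_hosts):
    """Split the comma-separated host flags into a TF-style cluster dict."""
    return {
        "ps": ps_hosts.split(","),
        "worker": worker_hosts.split(","),
    }

spec = build_cluster_spec("ks-test-cnn-ps-zf75-0:2222",
                          "ks-test-cnn-worker-zf75-0:2222")
print(spec["ps"])  # ['ks-test-cnn-ps-zf75-0:2222']
```

Both sides must agree on this spec; as the GrpcChannelCache lines show, each process maps its own job to localhost, so if the worker cannot resolve the PS hostname, CreateSession blocks exactly as in this issue.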
Also, please make sure that kube-dns is up.
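The kube-dns check can also be done in-process: if the worker's resolver cannot resolve the PS service name, CreateSession will keep waiting. A minimal sketch (hypothetical helper):

```python
import socket

def resolves(host):
    """Return True if the pod's resolver (kube-dns) can resolve host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# e.g. resolves("ks-test-cnn-ps-zf75-0") should be True once kube-dns is up
```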
Oh, sorry! After I restarted it, it works now. I don't know what happened...
kubectl logs pod/ks-test-cnn-worker-zf75-0-nw72s -f
INFO|2018-01-26T13:26:02|/opt/launcher.py|48| Launcher started.
INFO|2018-01-26T13:26:02|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:02|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --job_name=worker --ps_hosts=ks-test-cnn-ps-zf75-0:2222 --worker_hosts=ks-test-cnn-worker-zf75-0:2222 --task_index=0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.232227: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| 2018-01-26 13:26:04.911165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| pciBusID: 0000:04:00.0
INFO|2018-01-26T13:26:04|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.316989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 1 with properties:
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| pciBusID: 0000:0f:00.0
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| totalMemory: 11.90GiB freeMemory: 11.74GiB
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1079] Device peer to peer matrix
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] DMA: 0 1
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 0: Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 1: Y Y
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO|2018-01-26T13:26:05|/opt/launcher.py|27| 2018-01-26 13:26:05.318669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: TITAN Xp, pci bus id: 0000:0f:00.0, compute capability: 6.1)
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967613: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ks-test-cnn-ps-zf75-0:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.967650: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 2018-01-26 13:28:53.969934: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| TensorFlow: 1.5
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Model: resnet50
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Mode: training
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| SingleSess: False
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Batch size: 64 global
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| 32 per device
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Data format: NCHW
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Optimizer: sgd
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Variables: parameter_server
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Sync: True
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| ==========
INFO|2018-01-26T13:28:53|/opt/launcher.py|27| Generating model
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| Instructions for updating:
INFO|2018-01-26T13:28:55|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
INFO|2018-01-26T13:29:01|/opt/launcher.py|27| 2018-01-26 13:29:01.058751: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session 91d640aa1bdc9570 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
INFO|2018-01-26T13:29:03|/opt/launcher.py|27| Running warm up
INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Done warm up
INFO|2018-01-26T13:29:26|/opt/launcher.py|27| Step Img/sec loss
INFO|2018-01-26T13:29:27|/opt/launcher.py|27| 1 images/sec: 34.6 +/- 0.0 (jitter = 0.0) 9.714
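The "Batch size: 64 global / 32 per device" lines in the log are consistent with the flags: `--batch_size=32` is per device, and with `--num_gpus=2` the global batch is simply per-device times device count:

```python
def global_batch(per_device, num_devices):
    """Global batch size across all worker devices."""
    return per_device * num_devices

print(global_batch(32, 2))  # 64, matching the log above
```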
Glad it's working.