Hang when run on distributed mode #247

sondv2 · 2020-01-16T02:12:26Z

Following the README, I can run successfully when all peers are on the local node:
xxx@master:/tmp/KungFu$ kungfu-run -np 2 python3 examples/tf1_mnist_session.py --data-dir=./mnist

...
[I] all 2/2 local peers finished, took 2.397370504s

But when I run it on the cluster, it hangs without any error:
@master:/tmp/KungFu$ kungfu-run -np 2 -H 10.208.209.163:1,10.208.209.171:1 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=2
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:1
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[nic] [0] lo :: 127.0.0.1/8
[nic] [1] eno1 :: 10.208.209.163/24
[nic] [2] docker0 :: 192.168.99.1/24
[nic] [3] br-fefb2fb37d81 :: 172.18.0.1/16
[cuda-env]: CUDA_VISIBLE_DEVICES=1
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]

lgarithm · 2020-01-16T10:12:24Z

The kungfu-run command should be executed on both machines.
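
The log above reports "will parallel run 1 instances" although -np is 2, which suggests that each kungfu-run invocation starts only the workers assigned to its own host in the -H list. A minimal sketch of that bookkeeping, assuming the host list is a comma-separated list of ip:slots entries as in the command above; parse_hosts and local_instances are hypothetical helpers for illustration, not part of KungFu:

```python
def parse_hosts(spec):
    """Parse a host list such as '10.208.209.163:1,10.208.209.171:1' into (ip, slots) pairs."""
    hosts = []
    for item in spec.split(","):
        ip, slots = item.rsplit(":", 1)
        hosts.append((ip, int(slots)))
    return hosts


def local_instances(spec, self_ip):
    """Number of workers assigned to `self_ip` (0 if the host is not in the list)."""
    return sum(slots for ip, slots in parse_hosts(spec) if ip == self_ip)


# Host list taken from the command in this thread.
spec = "10.208.209.163:1,10.208.209.171:1"
for ip, _ in parse_hosts(spec):
    print(ip, "is assigned", local_instances(spec, ip), "instance(s)")
```

Under this reading, the second worker exists only once the same command has also been started on 10.208.209.171.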

sondv2 · 2020-01-17T09:04:18Z

> The kungfu-run command should be executed on both machines.

Yes, I ran it on both machines, but it still hangs in cluster mode. Maybe there is a problem with the network interface.

lgarithm · 2020-01-17T12:18:00Z

Could you share the log from the other machine?

You can also turn on debug log:

export KUNGFU_CONFIG_LOG_LEVEL=DEBUG
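
With debug logging on, the output can get long, so it may help to pull out only the lines that show which peers connected to each other. A rough sketch, assuming the debug output has been saved to a file and that connection lines look like the ones in the logs posted later in this thread; established_connections is a hypothetical helper, not part of KungFu:

```python
import re
from collections import defaultdict

# Matches lines such as:
# [10.208.209.171.10000::stdout] [D] connection to #<10.208.209.163:10000> established after 1 trials, ...
CONNECTED = re.compile(
    r"^\[(\d+\.\d+\.\d+\.\d+)\.(\d+)::stdout\] \[D\] connection to #<([^>]+)> established"
)


def established_connections(log_lines):
    """Map each local peer 'ip:port' to the set of remote peers it reports connecting to."""
    conns = defaultdict(set)
    for line in log_lines:
        m = CONNECTED.match(line)
        if m:
            ip, port, remote = m.groups()
            conns[f"{ip}:{port}"].add(remote)
    return dict(conns)


sample = [
    "[10.208.209.171.10000::stdout] [D] listening: 0.0.0.0:10000",
    "[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.163:10000> established after 1 trials, took 210.27µs",
]
print(established_connections(sample))
```

If a peer from the -H list never shows up as a connection target, that machine is a good place to look next.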

sondv2 · 2020-01-18T02:52:22Z

I fixed this issue.
The two servers have different NIC names, so my configuration was wrong.
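
One way to check the interface name on each machine is to read the [nic] lines that kungfu-run prints at startup and find the one that carries the host's IP. A minimal sketch, assuming the [nic] line format seen in the logs in this thread; nic_for_ip is a hypothetical helper, not part of KungFu:

```python
import re

# Matches lines such as: [nic] [1] eno1 :: 10.208.209.163/24
NIC_LINE = re.compile(r"^\[nic\] \[\d+\] (\S+) ::\s*(.*)$")


def nic_for_ip(log_lines, ip):
    """Return the name of the interface whose address list contains `ip`, or None."""
    for line in log_lines:
        m = NIC_LINE.match(line.strip())
        if not m:
            continue
        name, addrs = m.groups()
        # Addresses look like '10.208.209.163/24, fe80::.../64'; drop the prefix length.
        ips = [a.strip().split("/")[0] for a in addrs.split(",") if a.strip()]
        if ip in ips:
            return name
    return None


# [nic] lines copied from the first log in this thread.
lines = [
    "[nic] [0] lo :: 127.0.0.1/8",
    "[nic] [1] eno1 :: 10.208.209.163/24",
    "[nic] [2] docker0 :: 192.168.99.1/24",
]
print(nic_for_ip(lines, "10.208.209.163"))
```

Running this against the startup output of each machine, with that machine's own IP from the -H list, gives the name to compare with the value passed to -nic.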

Log from server 1:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.163/24, fe80::b9b2:6891:c63:5d72/64
[D] Using self=10.208.209.163
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.163.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.163.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.163.10000::stdout] [D] using name based hash
[10.208.209.163.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.163.10000::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 227.647µs
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint16 = np.dtype([("qint16", np.int16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint32 = np.dtype([("qint32", np.int32, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] np_resource = np.dtype([("resource", np.ubyte, 1)])
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Colocations handled automatically by placer.
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Use tf.cast instead.
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.696298: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.793487: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796763: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23bdbf0 executing computations on platform CUDA. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796773: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.817165: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818181: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23de620 executing computations on platform Host. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818192: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
[10.208.209.163.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.163.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.163.10000::stderr] totalMemory: 10.91GiB freeMemory: 10.35GiB
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
[10.208.209.163.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.163.10000::stdout] training
[10.208.209.163.10000::stderr] 2020-01-18 09:49:09.305317: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] training accuracy: 0.360000
[10.208.209.163.10000::stdout] validation accuracy: 0.329400
[10.208.209.163.10000::stdout] training accuracy: 0.900000
[10.208.209.163.10000::stdout] validation accuracy: 0.885000
[10.208.209.163.10000::stdout] training accuracy: 0.940000
[10.208.209.163.10000::stdout] validation accuracy: 0.896000
[10.208.209.163.10000::stdout] training accuracy: 0.920000
[10.208.209.163.10000::stdout] validation accuracy: 0.901700
[10.208.209.163.10000::stdout] test accuracy: 0.902400
[10.208.209.163.10000::stdout] [D] Server Closed
[D] #<10.208.209.163.10000> finished successfully
[I] all 1/3 local peers finished, took 21.875099576s
[D] kungfu-run finished, took 21.875377819s

Log from server 2:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.171/24, fe80::7af5:7968:e59f:de55/64
[nic] [2] docker0 :: 192.168.99.1/24, fe80::42:77ff:fe22:c7fd/64
[nic] [3] vetha6b3b40 :: fe80::f09b:a4ff:fe5f:fce0/64
[nic] [4] virbr0 :: 192.168.122.1/24
[nic] [5] virbr0-nic ::
[nic] [6] veth1598cdf :: fe80::28d0:b6ff:fe46:d0d1/64
[D] Using self=10.208.209.171
[I] will parallel run 2 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.171.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.171.10001::stdout] [D] listening: 0.0.0.0:10001
[10.208.209.171.10001::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10001::stdout] [D] using name based hash
[10.208.209.171.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10000::stdout] [D] using name based hash
[10.208.209.171.10001::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 69.678µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10001
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.163:10000> established after 1 trials, took 210.27µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.163:10000
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.171:10001> established after 1 trials, took 53.139µs
[10.208.209.171.10001::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10000::stderr] Instructions for updating:
[10.208.209.171.10000::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10001::stderr] Instructions for updating:
[10.208.209.171.10001::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.083367: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.094570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[10.208.209.171.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.171.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095188: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095222: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095250: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095277: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095332: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097481: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097507: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
[10.208.209.171.10000::stderr] Skipping registering GPU devices...
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097763: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.113296: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115491: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115507: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115511: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115549: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115724: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.118929: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119398: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a7f3e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119408: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.121817: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122286: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47ecba0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122296: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179944: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47fb3b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179954: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
[10.208.209.171.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stdout] training
[10.208.209.171.10001::stdout] training
[10.208.209.171.10000::stdout] training accuracy: 0.460000
[10.208.209.171.10001::stdout] training accuracy: 0.480000
[10.208.209.171.10001::stdout] validation accuracy: 0.329400
[10.208.209.171.10000::stdout] validation accuracy: 0.329400
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] training accuracy: 0.920000
[10.208.209.171.10001::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] training accuracy: 0.900000
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] validation accuracy: 0.896000
[10.208.209.171.10001::stdout] validation accuracy: 0.896000
[10.208.209.171.10000::stdout] training accuracy: 0.940000
[10.208.209.171.10001::stdout] training accuracy: 0.960000
[10.208.209.171.10000::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] test accuracy: 0.902400
[10.208.209.171.10000::stdout] test accuracy: 0.902400
[10.208.209.171.10001::stdout] [D] Server Closed
[10.208.209.171.10000::stdout] [D] Server Closed
[D] #<10.208.209.171.10001> finished successfully
[D] #<10.208.209.171.10000> finished successfully
[I] all 2/3 local peers finished, took 2.948737336s
[D] kungfu-run finished, took 2.949251703s

So it works well now.

Thanks

sondv2 changed the title ~~Hang when run~~ Hang when run on distributed mode Jan 16, 2020

sondv2 closed this as completed Jan 18, 2020
