[doc] request parameters doc when the -init-version=-1 #302

Closed
zrss opened this issue Jun 24, 2020 · 5 comments

@zrss

zrss commented Jun 24, 2020

I'd like KungFu to provide brief documentation of its parameters. I tried setting -init-version=-1 and omitting the -H parameter, but kungfu-run does not seem to handle that well:

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=64
[arg] [3]=-w
[arg] [4]=-config-server
[arg] [5]=file:///home/ma-user/user-job-dir/config.json
[arg] [6]=-nic
[arg] [7]=ib0
[arg] [8]=-init-version
[arg] [9]=-1
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification_xk.py
[arg] [12]=--num_clases=1001
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
exit on error: 169.254.143.141:38080 not in 127.0.0.1:38080 at 7037287:/home/work/KungFu/srcs/go/cmd/kungfu-run/kungfu-run.go:62
@lgarithm
Collaborator

This looks like a bug in kungfu-run. Could you try this workaround until we fix it?

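# Print the IPv4 address of the given NIC.
get_self_ip() {
    local nic=$1
    ifconfig "$nic" | grep inet | grep -v inet6 | awk '{print $2}'
}

# Construct the kungfu-run flags for the given NIC.
kungfu_run_flags() {
    local nic=$1
    local ip
    ip=$(get_self_ip "$nic")

    echo -H "$ip"  # workaround: pass this host's IP explicitly
    echo -init-version -1
    echo -w
    echo -nic "$nic"
}

kungfu_run_with_nic() {
    local nic=$1
    shift  # drop the NIC name so "$@" holds only the training command
    # The flags are left unquoted on purpose so they split into separate words.
    kungfu-run $(kungfu_run_flags "$nic") "$@"
}

kungfu_run_with_nic ib0 python3 train-xxx.py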
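
The workaround relies on `ifconfig`, whose output format varies: older net-tools versions print `inet addr:<ip>`, so `awk '{print $2}'` would return the address with an `addr:` prefix. A minimal sketch of an alternative lookup with iproute2 follows, assuming the `ip` command is available in the container. The `parse_ipv4` helper is hypothetical and not part of KungFu.

```shell
# Hypothetical helper (not part of KungFu): read the one-line-per-address
# output of `ip -4 -o addr show dev <nic>` on stdin and print the first
# IPv4 address without its prefix length.
parse_ipv4() {
    # Field 4 looks like "169.254.143.141/20".
    awk '{ split($4, a, "/"); print a[1]; exit }'
}

# Print the IPv4 address of the given NIC using iproute2 instead of ifconfig.
get_self_ip() {
    local nic=$1
    ip -4 -o addr show dev "$nic" | parse_ipv4
}
```

This `get_self_ip` has the same name and argument as the one in the workaround, so the rest of the script can stay unchanged.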

@zrss
Author

zrss commented Jun 28, 2020

Thanks for the reply. I tested kungfu-run with -H ${current_ip_of_nic} -nic ${current_nic} and without the -np parameter, but gpuPool was initialized with the wrong number of slots in the newly added kungfu-run container.

Should I also set -np to the correct value (the number of GPUs in the newly added kungfu-run container)?

To clarify: each of our machines currently has 8 V100 GPUs, so the newly added kungfu-run would be started with

kungfu-run
-np 8
-H ${current_ip_of_nic}:8
-nic ${current_nic}
-init-version -1

@zrss
Author

zrss commented Jun 28, 2020

This is my test case: start a job with 1 container (A.1), then scale up to 2 containers (A.2 is added). Both containers hang after the "sync to offset" message.

init A.1 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.131.180:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification.py
...
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007}@{169.254.131.180:38080}
...
...
...
[169.254.131.180.10000::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 2.084 top1: 0.000     top5: 0.000     ent_loss: 13.765        reg_loss: nan   total_loss: nan
I0628 17:07:26.790192       1 ma_fmk_kungfu.go:314] generated host file
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0
[169.254.131.180.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 178

newly added A.2 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.138.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-17/image_classification.py
...
[I] waiting to be initialized
[I] watching config server
W0628 17:07:26.790124       1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007,169.254.138.208:10000,169.254.138.208:10001,169.254.138.208:10002,169.254.138.208:10003,169.254.138.208:10004,169.254.138.208:10005,169.254.138.208:10006,169.254.138.208:10007}@{169.254.131.180:38080,169.254.138.208:38080}
[169.254.138.208.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10004::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10002::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10001::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10005::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10006::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10007::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10000::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10003::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10003::stderr] INFO:tensorflow:Done warm up

export KUNGFU_CONFIG_LOG_LEVEL=0

init A.1 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.143.141:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-10/bert_classifier.py
[arg] [12]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [13]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [14]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [15]=--variable_update=kungfu_ssgd
[arg] [16]=--train_batch_size=20
[arg] [17]=--num_train_epochs=30
[arg] [18]=save_summaries_steps=10000
[arg] [19]=--eval_batch_size=20
[arg] [20]=--learning_rate=2e-5
[arg] [21]=--max_seq_length=128
[arg] [22]=--save_model_steps=20
[arg] [23]=--save_interval_secs=40
[arg] [24]=--kungfu_elastic=True
[arg] [25]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[nic] [18] vethf7542b2 :: fe80::9cce:b4ff:fe37:e823/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.143.141
[D] listening: 0.0.0.0:38080
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007}@{169.254.143.141:38080}
[D] 8 peers created: 0 - 0 + 8 = 8

...

[169.254.143.141.10007::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.274     top-1: 0.150
[169.254.143.141.10005::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.329     top-1: 0.100
[169.254.143.141.10003::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.246     top-1: 0.100
[169.254.143.141.10002::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.332     top-1: 0.050
[169.254.143.141.10000::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.059 loss: 2.259     top-1: 0.150
[169.254.143.141.10006::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.393     top-1: 0.050
[169.254.143.141.10004::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.341     top-1: 0.100
[169.254.143.141.10001::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.370     top-1: 0.000
[169.254.143.141.10004::stdout] sync to offset 160 on step 1
[169.254.143.141.10007::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] sync to offset 160 on step 1
[169.254.143.141.10003::stdout] sync to offset 160 on step 1
[169.254.143.141.10002::stdout] sync to offset 160 on step 1
[169.254.143.141.10006::stdout] sync to offset 160 on step 1
[169.254.143.141.10001::stdout] sync to offset 160 on step 1
[169.254.143.141.10000::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10005::stdout] [D] ingore unchanged proposal
[169.254.143.141.10005::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10001::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ingore unchanged proposal
[169.254.143.141.10002::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ignore update
[169.254.143.141.10004::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ignore update
[169.254.143.141.10003::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] ignore update
[169.254.143.141.10000::stdout] [D] ingore unchanged proposal
[169.254.143.141.10000::stdout] [D] ignore update

The hang happens here.

newly added A.2 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.135.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-10/bert_classifier.py
[arg] [14]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [15]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [16]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [17]=--variable_update=kungfu_ssgd
[arg] [18]=--train_batch_size=20
[arg] [19]=--num_train_epochs=30
[arg] [20]=save_summaries_steps=10000
[arg] [21]=--eval_batch_size=20
[arg] [22]=--learning_rate=2e-5
[arg] [23]=--max_seq_length=128
[arg] [24]=--save_model_steps=20
[arg] [25]=--save_interval_secs=40
[arg] [26]=--kungfu_elastic=True
[arg] [27]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.135.208/20
[nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
[nic] [9] docker0 :: 169.254.30.1/28
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
[nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
[nic] [16] br_plc_949f84f2 ::
[nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.135.208
[I] waiting to be initialized
[D] listening: 0.0.0.0:38080
[I] watching config server
W0629 09:19:07.999957       8 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[D] got control message from 169.254.143.141:10000, name: update, length: 644
[D] got control message from 169.254.143.141:10002, name: update, length: 644
[D] got control message from 169.254.143.141:10006, name: update, length: 644
[D] got control message from 169.254.143.141:10003, name: update, length: 644
[D] got control message from 169.254.143.141:10007, name: update, length: 644
[D] got control message from 169.254.143.141:10005, name: update, length: 644
[D] update to v1 with [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[D] got control message from 169.254.143.141:10001, name: update, length: 644
[D] got control message from 169.254.143.141:10004, name: update, length: 644
[D] 8 peers created: 0 - 0 + 8 = 16

...

[169.254.135.208.10000::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10006::stdout] start with 0 trained samples.
[169.254.135.208.10005::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10007::stdout] start with 0 trained samples.
[169.254.135.208.10004::stdout] start with 0 trained samples.
[169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10001::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10002::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10006::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stdout] start with 0 trained samples.
[169.254.135.208.10007::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10005::stdout] sync to offset 160 on step 0
[169.254.135.208.10004::stdout] sync to offset 160 on step 0
[169.254.135.208.10001::stdout] sync to offset 160 on step 0
[169.254.135.208.10002::stdout] sync to offset 160 on step 0
[169.254.135.208.10006::stdout] sync to offset 160 on step 0
[169.254.135.208.10007::stdout] sync to offset 160 on step 0
[169.254.135.208.10003::stdout] sync to offset 160 on step 0
[169.254.135.208.10000::stdout] sync to offset 160 on step 0
[169.254.135.208.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10004::stdout] [D] New peer list is consistent after ONE attempt!

The hang happens here.
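
When comparing the logs of the two containers, it can help to pull out the cluster version and worker count that each kungfu-run reported. A minimal sketch follows, assuming the `arrived at vN, new np=M` line format shown in the logs above. The `arrived_versions` helper is hypothetical and not part of KungFu.

```shell
# Hypothetical helper (not part of KungFu): read a kungfu-run log on stdin
# and print "<version> <np>" for every "arrived at vN, new np=M" line.
arrived_versions() {
    sed -n 's/.*arrived at v\([0-9][0-9]*\), new np=\([0-9][0-9]*\),.*/\1 \2/p'
}
```

For example, `arrived_versions < a2.log` (with `a2.log` standing in for a saved log file) prints one line per version change, which makes it easy to check whether both containers reached the same version and np.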

@zrss
Author

zrss commented Jun 28, 2020

Start a job with 1 container (A.1), then scale up to 2 containers (A.2 is added).

By the way, it works well with the following setup (-np and -H differ from the previous case):

init A.1 container

     22 [arg] [0]=kungfu-run
     23 [arg] [1]=-np
     24 [arg] [2]=8
     25 [arg] [3]=-H
     26 [arg] [4]=169.254.135.208:8
     27 [arg] [5]=-w
     28 [arg] [6]=-config-server
     29 [arg] [7]=file:///home/ma-user/user-job-dir/config.json
     30 [arg] [8]=-nic
     31 [arg] [9]=ib0
     32 [arg] [10]=/home/work/anaconda/bin/python
     33 [arg] [11]=kungfu-demo-6-10/bert_classifier.py
...
     48 [nic] [0] lo :: 127.0.0.1/8, ::1/128
     49 [nic] [1] eth0 ::
     50 [nic] [2] eth1 ::
     51 [nic] [3] enp220s0 ::
     52 [nic] [4] enp221s0 ::
     53 [nic] [5] enp222s0 ::
     54 [nic] [6] enp223s0 ::
     55 [nic] [7] ib0 :: 169.254.135.208/20
     56 [nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
     57 [nic] [9] docker0 :: 169.254.30.1/28
     58 [nic] [10] ovs-system ::
     59 [nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
     60 [nic] [12] overlay_br_int ::
     61 [nic] [13] br_tun_b0345198 ::
     62 [nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
     63 [nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
     64 [nic] [16] br_plc_949f84f2 ::
     65 [nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64

...

     66 [cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
     67 [cuda-env]: CUDA_VERSION=10.0.130
     68 [nccl-env]: NCCL_VERSION=2.4.2
     69 [I] watching config server
     70 [I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
     71 [E] full update detected: [0@0]{}@{} -> [8@1]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.135.208:38080}

...

    516 [169.254.135.208.10005::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.309             top-1: 0.250
    517 [169.254.135.208.10006::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.353             top-1: 0.100
    518 [169.254.135.208.10003::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.351             top-1: 0.200
    519 [169.254.135.208.10007::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.246             top-1: 0.200
    520 [169.254.135.208.10002::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.331 loss: 2.203             top-1: 0.100
    521 [169.254.135.208.10001::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.176             top-1: 0.300
    522 [169.254.135.208.10000::stderr] INFO:tensorflow:step: 40(global step: 40)   step/sec: 1.884 loss: 1.823             top-1: 0.450

...

    650 I0628 20:16:27.978056       1 ma_fmk_kungfu.go:226] generated host file
    651 [I] arrived at v1, new np=16, local: +0/-0, global: +8/-0

    652 [169.254.135.208.10006::stdout] sync to offset 26240 on step 164
    653 [169.254.135.208.10005::stdout] sync to offset 26240 on step 164
    654 [169.254.135.208.10007::stdout] sync to offset 26240 on step 164
    655 [169.254.135.208.10003::stdout] sync to offset 26240 on step 164
    656 [169.254.135.208.10004::stdout] sync to offset 26240 on step 164
    657 [169.254.135.208.10002::stdout] sync to offset 26240 on step 164
    658 [169.254.135.208.10001::stdout] sync to offset 26240 on step 164
    659 [169.254.135.208.10000::stdout] sync to offset 26240 on step 164
    660 [169.254.135.208.10006::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    661 [169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    662 [169.254.135.208.10000::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    663 [169.254.135.208.10001::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    664 [169.254.135.208.10005::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    665 [169.254.135.208.10007::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    666 [169.254.135.208.10002::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    667 [169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    668 [169.254.135.208.10004::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.038             top-1: 1.000
    669 [169.254.135.208.10007::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.073             top-1: 1.000
    670 [169.254.135.208.10006::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.126             top-1: 0.950

newly added A.2 container

     23 [arg] [0]=kungfu-run
     24 [arg] [1]=-np
     25 [arg] [2]=16
     26 [arg] [3]=-H
     27 [arg] [4]=169.254.135.208:8,169.254.138.11:8
     28 [arg] [5]=-w
     29 [arg] [6]=-config-server
     30 [arg] [7]=file:///home/ma-user/user-job-dir/config.json
     31 [arg] [8]=-nic
     32 [arg] [9]=ib0
     33 [arg] [10]=-init-version
     34 [arg] [11]=-1
     35 [arg] [12]=/home/work/anaconda/bin/python
     36 [arg] [13]=kungfu-demo-6-10/bert_classifier.py
...
     51 [nic] [0] lo :: 127.0.0.1/8, ::1/128
     52 [nic] [1] eth0 ::
     53 [nic] [2] eth1 ::
     54 [nic] [3] enp220s0 ::
     55 [nic] [4] enp221s0 ::
     56 [nic] [5] enp222s0 ::
     57 [nic] [6] enp223s0 ::
     58 [nic] [7] ib0 :: 169.254.138.11/20
     59 [nic] [8] bond0 :: 192.168.6.190/22, fe80::f816:3eff:fe75:a4a/64
     60 [nic] [9] docker0 :: 169.254.30.1/28
     61 [nic] [10] ovs-system ::
     62 [nic] [11] br_monitor :: fe80::206e:5aff:fe16:107/64
     63 [nic] [12] br_tun_b0345198 ::
     64 [nic] [13] overlay_br_int ::
     65 [nic] [14] vxlan_sys_4789 :: fe80::bc59:a2ff:fe10:65ab/64
     66 [nic] [15] gw_11cbf51a :: 172.16.1.1/16, fe80::3451:daff:fe7d:68c1/64
     67 [nic] [16] br_plc_8ea52e57 ::
     68 [nic] [17] veth_8ea52e57 :: fe80::e84f:f8ff:fe22:6d7a/64
     69 [cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
     70 [cuda-env]: CUDA_VERSION=10.0.130
     71 [nccl-env]: NCCL_VERSION=2.4.2
     72 [I] waiting to be initialized
     73 [I] watching config server
     74 W0628 20:16:27.976171       1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
     75 [I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
     76 [E] full update detected: [0@0]{}@{} -> [16@2]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007,169.254.138.11:10000,169.254.138.11:10001,169.254.138.11:10002,169.254.138.11:10003,169.254.138.11:10004,169.254.138.11:10005,169.254.138.11:10006,169.254.138.11:10007}@{169.254.135.208:38080,169.254.138.11:38080}

...

    459 [169.254.138.11.10002::stderr] INFO:tensorflow:Running will end at step: 375
    460 [169.254.138.11.10007::stderr] INFO:tensorflow:Running will end at step: 375
    461 [169.254.138.11.10001::stdout] sync to offset 26240 on step 0
    462 [169.254.138.11.10004::stdout] sync to offset 26240 on step 0
    463 [169.254.138.11.10005::stdout] sync to offset 26240 on step 0
    464 [169.254.138.11.10003::stdout] sync to offset 26240 on step 0
    465 [169.254.138.11.10002::stdout] sync to offset 26240 on step 0
    466 [169.254.138.11.10006::stdout] sync to offset 26240 on step 0
    467 [169.254.138.11.10007::stdout] sync to offset 26240 on step 0
    468 [169.254.138.11.10000::stdout] sync to offset 26240 on step 0
    469 [169.254.138.11.10001::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.061 loss: 0.037             top-1: 1.000
    470 [169.254.138.11.10007::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.074 loss: 0.073             top-1: 1.000
    471 [169.254.138.11.10002::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.072 loss: 0.121             top-1: 0.950
    472 [169.254.138.11.10000::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.062 loss: 0.038             top-1: 1.000
    473 [169.254.138.11.10004::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.061 loss: 0.065             top-1: 1.000
    474 [169.254.138.11.10005::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.063 loss: 0.118             top-1: 0.950
    475 [169.254.138.11.10006::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.069 loss: 0.036             top-1: 1.000
    476 [169.254.138.11.10003::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.062 loss: 0.039             top-1: 1.000
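
In the run that worked, the joining container was given the whole cluster (-np 16 and -H 169.254.135.208:8,169.254.138.11:8) rather than only its own host. A minimal sketch of deriving those two values from a list of host IPs follows, assuming every host has the same number of slots. The function names are hypothetical and not part of KungFu.

```shell
# Hypothetical helpers, not part of KungFu.

# Join host IPs into the "ip:slots,ip:slots" form passed to -H above.
# Usage: build_host_list <slots-per-host> <ip>...
build_host_list() {
    local slots=$1
    shift
    local out="" ip
    for ip in "$@"; do
        # Add a comma only when the list already has an entry.
        out="${out}${out:+,}${ip}:${slots}"
    done
    echo "$out"
}

# Total worker count for -np: slots per host times the number of hosts.
# Usage: total_np <slots-per-host> <ip>...
total_np() {
    local slots=$1
    shift
    echo $(( slots * $# ))
}
```

With the two hosts from this run, `build_host_list 8 169.254.135.208 169.254.138.11` yields the -H value shown in the A.2 arguments, and `total_np` with the same arguments yields 16.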

@kevin0525

We tried again but found another problem.
We started a job with one container, and job 0 ran well; after scaling up to 3 containers, errors arose.
Container 0:
[image]
Containers 1 and 2 (barrier failed...):
[image]
[image]

lgarithm added a commit that referenced this issue Jun 29, 2020
lgarithm added a commit that referenced this issue Jul 1, 2020
* kungfu-notify-start (#302)

* fix elastic init parameters

* remove unused function

* InstallStallDetector for Peer::propose
@zrss zrss closed this as completed Jul 4, 2020