
KungFu job hangs at an inconsistent version when I scale down/up multiple times #297

Closed
zrss opened this issue Jun 8, 2020 · 14 comments

zrss commented Jun 8, 2020

scale up from 1 instance to 2 instances

A: v0 -> v1

A container log

[10.0.0.230.10003::stdout] sync to offset 0 on step 0
[10.0.0.230.10004::stdout] sync to offset 0 on step 0
[10.0.0.230.10007::stdout] sync to offset 0 on step 0
[10.0.0.230.10002::stdout] sync to offset 0 on step 0
[10.0.0.230.10006::stdout] sync to offset 0 on step 0
[10.0.0.230.10001::stdout] sync to offset 0 on step 0
[10.0.0.230.10005::stdout] sync to offset 0 on step 0
[10.0.0.230.10000::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0

B container log

[10.0.1.29.10004::stdout] sync to offset 0 on step 0
[10.0.1.29.10005::stdout] sync to offset 0 on step 0
[10.0.1.29.10001::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stdout] sync to offset 0 on step 0
[10.0.1.29.10003::stdout] sync to offset 0 on step 0
[10.0.1.29.10000::stdout] sync to offset 0 on step 0
[10.0.1.29.10006::stdout] sync to offset 0 on step 0
[10.0.1.29.10007::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10003::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10000::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10001::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10007::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10004::stderr] [E] New root can't not be a new worker! State will be lost.

A and B are running well

scale down from 2 instances to 1 instance

A: v1 -> v2

A container log

[10.0.0.230.10006::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.777     top-1: 0.800
[10.0.0.230.10003::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.687     top-1: 0.800
I0608 09:10:39.150780       1 watch_host_file.go:65] update host file
I0608 09:10:39.150810       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.151063       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v2, new np=8, local: +0/-0, global: +0/-8
[10.0.0.230.10006::stdout] sync to offset 20320 on step 64
[10.0.0.230.10005::stdout] sync to offset 20320 on step 64
[10.0.0.230.10001::stdout] sync to offset 20320 on step 64
[10.0.0.230.10007::stdout] sync to offset 20320 on step 64
[10.0.0.230.10004::stdout] sync to offset 20320 on step 64
[10.0.0.230.10002::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stdout] sync to offset 20320 on step 64
[10.0.0.230.10003::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.683     top-1: 0.800
[10.0.0.230.10004::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.629     top-1: 0.800

B container log

[10.0.1.29.10007::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.926     top-1: 0.750
[10.0.1.29.10003::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.757     top-1: 0.800
I0608 09:10:39.152199       1 watch_host_file.go:65] update host file
I0608 09:10:39.152222       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.152408       1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.29.10002::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10004::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10005::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10003::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10006::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10002::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10004::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10006::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10005::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10003::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[W] terminated trapped
[E] canceled: context canceled
[I] stop watching

A is running well, B is closed

scale up to 2 instances again

A: v2 -> v3

A container log

[10.0.0.230.10002::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.011     top-1: 1.000
[10.0.0.230.10001::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.020     top-1: 1.000
I0608 09:12:30.528867       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v3, new np=16, local: +0/-0, global: +8/-0
[10.0.0.230.10003::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10005::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10004::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10000::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10007::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10001::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10002::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10006::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

I found that the runner of A exited with: exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

B container log

[10.0.1.30.10000::stdout] start with 0 trained samples.
[10.0.1.30.10000::stderr] INFO:tensorflow:Running will end at step: 750
[10.0.1.30.10006::stdout] sync to offset 0 on step 0
[10.0.1.30.10002::stdout] sync to offset 0 on step 0
[10.0.1.30.10004::stdout] sync to offset 0 on step 0
[10.0.1.30.10003::stdout] sync to offset 0 on step 0
[10.0.1.30.10005::stdout] sync to offset 0 on step 0
[10.0.1.30.10007::stdout] sync to offset 0 on step 0
[10.0.1.30.10000::stdout] sync to offset 0 on step 0
[10.0.1.30.10001::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +0/-0
[10.0.1.30.10004::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140503338179417:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 139700864633689:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 139746162006873:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139711679425369:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140323524553561:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[I] 10.0.1.30.10000 finished with error: exit status 1
exit on error: exit status 1 at 7030827:/home/work/KungFu/srcs/go/kungfu/runner/watch.go:147

now both A and B are hanging ...

Currently, can KungFu support my test case? Or how should I handle this case?

lgarithm (Collaborator) commented Jun 8, 2020

@zrss could you also share the config.json that you applied for each scaling step?

lgarithm (Collaborator) commented Jun 8, 2020

What are the flags that you passed to kungfu-run? They can be found at the very beginning of the log, e.g.

$ kungfu-run -w -np 4 echo 2
[arg] [0]=kungfu-run
[arg] [1]=-w
[arg] [2]=-np
[arg] [3]=4
[arg] [4]=echo
[arg] [5]=2
[kf-env]: KUNGFU_GIT_URL=/Users/lg/code/mirrors/github.com/lsds/KungFu
[nic] [0] lo0 :: 127.0.0.1/8, ::1/128, fe80::1/64
[nic] [1] gif0 :: 
[nic] [2] stf0 :: 
[nic] [3] en0 :: 192.168.1.85/24
[nic] [4] en3 :: 
[nic] [5] en4 :: 
[nic] [6] en1 :: 
[nic] [7] en2 :: 
[nic] [8] bridge0 :: 
[nic] [9] p2p0 :: 
[nic] [10] awdl0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [11] llw0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [12] utun0 :: fe80::9d17:9272:6aa8:a8e8/64
[nic] [13] utun1 :: fe80::e340:7496:425b:77ee/64
[nic] [14] en5 :: fe80::aede:48ff:fe00:1122/64
[I] watching config server

I suspect the -init-version flag is not set correctly for the second scale-up.
According to the example

https://github.com/lsds/KungFu/blob/master/tests/go/cmd/kungfu-cluster-manager-example/kungfu-cluster-manager-example.go#L89

it should be set to -1 if the kungfu-run is not the first generation.
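
In other words, a cluster manager could choose the flag per generation roughly like this. This is only a minimal sketch, not the code from the example above; the launchGeneration helper and the Go wrapper are assumptions, only the -w/-H/-np/-init-version flags come from this thread:

package clustermanager

import (
	"fmt"
	"os/exec"
	"strconv"
)

// launchGeneration is a hypothetical helper that starts kungfu-run for one
// generation of the job. The first generation keeps the default init version
// (the flag is simply not passed); every later generation passes
// -init-version -1 so the runner joins the existing cluster.
func launchGeneration(generation int, hostList string, np int, prog string, progArgs ...string) (*exec.Cmd, error) {
	args := []string{"-w", "-H", hostList, "-np", strconv.Itoa(np)}
	if generation > 0 {
		args = append(args, "-init-version", "-1")
	}
	args = append(args, prog)
	args = append(args, progArgs...)
	cmd := exec.Command("kungfu-run", args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start generation %d: %w", generation, err)
	}
	return cmd, nil
}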

zrss (Author) commented Jun 8, 2020

thanks for the reply ~

it should be set to -1 if the kungfu-run is not the first generation.

the kungfu-run params are all the same across my scale up/down cases ...

zrss (Author) commented Jun 8, 2020

@lgarithm, can I draw this conclusion:

  1. bootstrap a new KungFu job with the default init-version (i.e. do not set it)
  2. always set init-version to -1 for any newly added kungfu-run

rankeey (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm, can I draw this conclusion:

  1. bootstrap a new KungFu job with the default init-version (i.e. do not set it)
  2. always set init-version to -1 for any newly added kungfu-run

Yes, this is correct.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

We can consider this as a future improvement, but currently I can't think of a clean way to do it.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

If you can manually initialize the first generation kungfu-runs, then you can always set init-version to -1.

lgarithm (Collaborator) commented Jun 8, 2020

i.e. start the first generation kungfu-run with -init-version -1, then run this

// notify waits until the runner at ctrl is reachable, then sends the encoded
// stage to it as an "update" control message.
var notify execution.PeerFunc = func(ctrl plan.PeerID) error {
	ctx, cancel := context.WithTimeout(context.TODO(), config.WaitRunnerTimeout)
	defer cancel()
	n, err := p.router.Wait(ctx, ctrl)
	if err != nil {
		return err
	}
	if n > 0 {
		log.Warnf("%s is up after pinged %d times", ctrl, n+1)
	}
	return p.router.Send(ctrl.WithName("update"), stage.Encode(), connection.ConnControl, 0)
}
// Notify every runner in parallel; exit if any of them cannot be reached.
if err := notify.Par(cluster.Runners); err != nil {
	utils.ExitErr(err)
}

in your cluster manager.

zrss (Author) commented Jun 8, 2020

@lgarithm thanks for the reply, I'd like to try that; currently it seems to be the only way for us.

To clarify:

In our current architecture, a host file (which only records the IPs of the containers) is generated by the cluster manager, and we introduce a kungfu-mng process that converts the host file into KungFu's config.json.

kungfu-mng runs inside the container, and every container has the same meta info (including the bootstrap command, which is why we want to set init-version to a fixed value), because we can only change the number of containers through the cluster manager (i.e. the elastic feature of Volcano on K8s).

The cluster manager updates the host file and bootstraps (or shuts down) containers when we scale the KungFu job up/down.

So right now I can't think of a way to tell, from inside a container, that it is a newly added container, unless the cluster manager tags the new container with some label (for example, setting a SCALE_OUT env in the newly added container).

Alternatively, kungfu-mng can compare the number of containers (and IPs) in the host file with the -H value of the kungfu-run bootstrap command (see the sketch below):

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1
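
A rough sketch of that check, assuming a Go kungfu-mng, a host file with one IP per line, and a -H value of comma-separated ip:slots entries; the function name and both formats are assumptions, not existing KungFu code:

package kungfumng

import (
	"os"
	"strings"
)

// firstGeneration is a hypothetical kungfu-mng check: it compares the IPs
// currently listed in the host file with the IPs that the bootstrap command's
// -H flag was rendered with. If they match exactly, this container belongs to
// the first generation of the job.
func firstGeneration(hostFilePath, bootstrapH string) (bool, error) {
	data, err := os.ReadFile(hostFilePath)
	if err != nil {
		return false, err
	}
	// Assumed host file format: one container IP per line.
	hostIPs := map[string]bool{}
	for _, line := range strings.Split(string(data), "\n") {
		if ip := strings.TrimSpace(line); ip != "" {
			hostIPs[ip] = true
		}
	}
	// Assumed -H format: comma-separated "ip:slots" entries, as passed to kungfu-run.
	bootIPs := map[string]bool{}
	for _, entry := range strings.Split(bootstrapH, ",") {
		if ip := strings.TrimSpace(strings.SplitN(entry, ":", 2)[0]); ip != "" {
			bootIPs[ip] = true
		}
	}
	if len(hostIPs) != len(bootIPs) {
		return false, nil
	}
	for ip := range hostIPs {
		if !bootIPs[ip] {
			return false, nil
		}
	}
	return true, nil
}

If firstGeneration returns true, kungfu-mng would bootstrap kungfu-run with init-version=0 (or just leave the flag unset); otherwise it would pass -init-version -1. As pointed out in the next comment, this check breaks when the host file returns to its original contents after two scaling operations.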

lgarithm (Collaborator) commented Jun 8, 2020

What if the config.json is restored to its original contents after two scaling operations?

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1

lgarithm (Collaborator) commented Jun 8, 2020

How about adding a version field to the config.json object?

zrss (Author) commented Jun 9, 2020

What if the config.json is restored to its original contents after two scaling operations?

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1

We (the platform) should restrict scale-down so that the number of instances cannot drop below the default value; this simplifies the scenario.

How about adding a version field to the config.json object?

Good idea, we can file a feature request with the cluster manager to add a version field to the host file. Generally speaking, version = version + 1 on every scale up/down.
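
For illustration, kungfu-mng could then derive the flag directly from that field. The JSON layout, struct, and helper below are only a sketch of the proposal, not an existing host file or KungFu format:

package kungfumng

import (
	"encoding/json"
	"os"
)

// HostFile is a hypothetical host file layout carrying the proposed version
// field; the cluster manager would bump Version on every scale up/down.
type HostFile struct {
	Version int      `json:"version"`
	Hosts   []string `json:"hosts"`
}

// initVersionArgs returns the extra kungfu-run arguments derived from the
// host file version: version 0 means first generation (keep the default,
// i.e. pass nothing), any later version gets -init-version -1.
func initVersionArgs(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var hf HostFile
	if err := json.Unmarshal(data, &hf); err != nil {
		return nil, err
	}
	if hf.Version == 0 {
		return nil, nil // first generation: default init-version
	}
	return []string{"-init-version", "-1"}, nil // later generations
}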

zrss closed this as completed Jun 9, 2020