
KungFu job hangs at an inconsistent version when I scale down/up multiple times #297

Closed
zrss opened this issue Jun 8, 2020 · 14 comments

zrss commented Jun 8, 2020

scale up from 1 instance to 2 instances

A: v0 -> v1

A container log

[10.0.0.230.10003::stdout] sync to offset 0 on step 0
[10.0.0.230.10004::stdout] sync to offset 0 on step 0
[10.0.0.230.10007::stdout] sync to offset 0 on step 0
[10.0.0.230.10002::stdout] sync to offset 0 on step 0
[10.0.0.230.10006::stdout] sync to offset 0 on step 0
[10.0.0.230.10001::stdout] sync to offset 0 on step 0
[10.0.0.230.10005::stdout] sync to offset 0 on step 0
[10.0.0.230.10000::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0

B container log

[10.0.1.29.10004::stdout] sync to offset 0 on step 0
[10.0.1.29.10005::stdout] sync to offset 0 on step 0
[10.0.1.29.10001::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stdout] sync to offset 0 on step 0
[10.0.1.29.10003::stdout] sync to offset 0 on step 0
[10.0.1.29.10000::stdout] sync to offset 0 on step 0
[10.0.1.29.10006::stdout] sync to offset 0 on step 0
[10.0.1.29.10007::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10003::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10000::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10001::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10007::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10004::stderr] [E] New root can't not be a new worker! State will be lost.

A and B are running well

scale down from 2 instances to 1 instance

A: v1 -> v2

A container log

[10.0.0.230.10006::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.777     top-1: 0.800
[10.0.0.230.10003::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.687     top-1: 0.800
I0608 09:10:39.150780       1 watch_host_file.go:65] update host file
I0608 09:10:39.150810       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.151063       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v2, new np=8, local: +0/-0, global: +0/-8
[10.0.0.230.10006::stdout] sync to offset 20320 on step 64
[10.0.0.230.10005::stdout] sync to offset 20320 on step 64
[10.0.0.230.10001::stdout] sync to offset 20320 on step 64
[10.0.0.230.10007::stdout] sync to offset 20320 on step 64
[10.0.0.230.10004::stdout] sync to offset 20320 on step 64
[10.0.0.230.10002::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stdout] sync to offset 20320 on step 64
[10.0.0.230.10003::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.683     top-1: 0.800
[10.0.0.230.10004::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.629     top-1: 0.800

B container log

[10.0.1.29.10007::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.926     top-1: 0.750
[10.0.1.29.10003::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.757     top-1: 0.800
I0608 09:10:39.152199       1 watch_host_file.go:65] update host file
I0608 09:10:39.152222       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.152408       1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.29.10002::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10004::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10005::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10003::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10006::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10002::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10004::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10006::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10005::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10003::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[W] terminated trapped
[E] canceled: context canceled
[I] stop watching

A is running well, B is closed

scale up to 2 instances again

A: v2 -> v3

A container log

[10.0.0.230.10002::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.011     top-1: 1.000
[10.0.0.230.10001::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.020     top-1: 1.000
I0608 09:12:30.528867       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v3, new np=16, local: +0/-0, global: +8/-0
[10.0.0.230.10003::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10005::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10004::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10000::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10007::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10001::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10002::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10006::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

I found that the runner of A exited with: exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

B container log

[10.0.1.30.10000::stdout] start with 0 trained samples.
[10.0.1.30.10000::stderr] INFO:tensorflow:Running will end at step: 750
[10.0.1.30.10006::stdout] sync to offset 0 on step 0
[10.0.1.30.10002::stdout] sync to offset 0 on step 0
[10.0.1.30.10004::stdout] sync to offset 0 on step 0
[10.0.1.30.10003::stdout] sync to offset 0 on step 0
[10.0.1.30.10005::stdout] sync to offset 0 on step 0
[10.0.1.30.10007::stdout] sync to offset 0 on step 0
[10.0.1.30.10000::stdout] sync to offset 0 on step 0
[10.0.1.30.10001::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +0/-0
[10.0.1.30.10004::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140503338179417:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 139700864633689:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 139746162006873:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139711679425369:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140323524553561:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[I] 10.0.1.30.10000 finished with error: exit status 1
exit on error: exit status 1 at 7030827:/home/work/KungFu/srcs/go/kungfu/runner/watch.go:147

now both A and B are hanging ...

Currently, can KungFu support my test case? Or how should I handle this case?

lgarithm (Collaborator) commented Jun 8, 2020

@zrss could you also share the config.json that you applied for each scaling step?

lgarithm (Collaborator) commented Jun 8, 2020

What are the flags that you passed to kungfu-run? They can be found at the very beginning of the log, e.g.

$ kungfu-run -w -np 4 echo 2
[arg] [0]=kungfu-run
[arg] [1]=-w
[arg] [2]=-np
[arg] [3]=4
[arg] [4]=echo
[arg] [5]=2
[kf-env]: KUNGFU_GIT_URL=/Users/lg/code/mirrors/github.com/lsds/KungFu
[nic] [0] lo0 :: 127.0.0.1/8, ::1/128, fe80::1/64
[nic] [1] gif0 :: 
[nic] [2] stf0 :: 
[nic] [3] en0 :: 192.168.1.85/24
[nic] [4] en3 :: 
[nic] [5] en4 :: 
[nic] [6] en1 :: 
[nic] [7] en2 :: 
[nic] [8] bridge0 :: 
[nic] [9] p2p0 :: 
[nic] [10] awdl0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [11] llw0 :: fe80::4c97:11ff:feab:ca51/64
[nic] [12] utun0 :: fe80::9d17:9272:6aa8:a8e8/64
[nic] [13] utun1 :: fe80::e340:7496:425b:77ee/64
[nic] [14] en5 :: fe80::aede:48ff:fe00:1122/64
[I] watching config server

I suspect the -init-version flag is not set correctly for the second scale-up.
According to the example

https://github.com/lsds/KungFu/blob/master/tests/go/cmd/kungfu-cluster-manager-example/kungfu-cluster-manager-example.go#L89

it should be set to -1 if the kungfu-run is not the first generation.
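
In other words, a cluster manager could choose the flag per generation roughly like this. This is only a minimal sketch, not the code from the example above; the launchGeneration helper and the Go wrapper are assumptions, only the -w/-H/-np/-init-version flags come from this thread:

package clustermanager

import (
	"fmt"
	"os/exec"
	"strconv"
)

// launchGeneration is a hypothetical helper that starts kungfu-run for one
// generation of the job. The first generation keeps the default init version
// (the flag is simply not passed); every later generation passes
// -init-version -1 so the runner joins the existing cluster.
func launchGeneration(generation int, hostList string, np int, prog string, progArgs ...string) (*exec.Cmd, error) {
	args := []string{"-w", "-H", hostList, "-np", strconv.Itoa(np)}
	if generation > 0 {
		args = append(args, "-init-version", "-1")
	}
	args = append(args, prog)
	args = append(args, progArgs...)
	cmd := exec.Command("kungfu-run", args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start generation %d: %w", generation, err)
	}
	return cmd, nil
}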

zrss (Author) commented Jun 8, 2020

thanks for the reply ~

it should be set to -1 if the kungfu-run is not the first generation.

the kungfu-run params are all the same across my scale up/down cases ...

zrss (Author) commented Jun 8, 2020

@lgarithm, can I draw this conclusion:

  1. bootstrap a new KungFu job with the default init-version (i.e. do not set it)
  2. always set init-version to -1 for any newly added kungfu-run

rankeey (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm, can I draw this conclusion:

  1. bootstrap a new KungFu job with the default init-version (i.e. do not set it)
  2. always set init-version to -1 for any newly added kungfu-run

Yes, this is correct.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

We can consider this as a future improvement, but currently I can't think of a clean way to do it.

lgarithm (Collaborator) commented Jun 8, 2020

@lgarithm can we just set the init-version to -1, not only for the newly added kungfu-run, but also for the first-generation kungfu-run? It is hard for the cluster manager to distinguish whether it is the first generation.

If you can manually initialize the first generation kungfu-runs, then you can always set init-version to -1.

lgarithm (Collaborator) commented Jun 8, 2020

i.e. start the first generation kungfu-run with -init-version -1, then run this

// notify waits until the runner at ctrl is reachable, then sends the encoded
// stage to it as an "update" control message.
var notify execution.PeerFunc = func(ctrl plan.PeerID) error {
	ctx, cancel := context.WithTimeout(context.TODO(), config.WaitRunnerTimeout)
	defer cancel()
	n, err := p.router.Wait(ctx, ctrl)
	if err != nil {
		return err
	}
	if n > 0 {
		log.Warnf("%s is up after pinged %d times", ctrl, n+1)
	}
	return p.router.Send(ctrl.WithName("update"), stage.Encode(), connection.ConnControl, 0)
}
// Notify every runner in parallel; exit if any of them cannot be reached.
if err := notify.Par(cluster.Runners); err != nil {
	utils.ExitErr(err)
}

in your cluster manager.

zrss (Author) commented Jun 8, 2020

@lgarithm thanks for the reply, I'd like to try that; currently it seems to be the only way for us.

To clarify:

In our current architecture, a host file (which only records the IPs of the containers) is generated by the cluster manager, and we introduce a kungfu-mng process that converts the host file into KungFu's config.json.

kungfu-mng runs inside the container, and every container has the same meta info (including the bootstrap command, which is why we want to set init-version to a fixed value), because we can only change the number of containers through the cluster manager (i.e. the elastic feature of Volcano on K8s).

The cluster manager updates the host file and bootstraps (or shuts down) containers when we scale the KungFu job up/down.

So right now I can't think of a way to tell, from inside a container, that it is a newly added container, unless the cluster manager tags the new container with some label (for example, setting a SCALE_OUT env in the newly added container).

Alternatively, kungfu-mng can compare the number of containers (and IPs) in the host file with the -H value of the kungfu-run bootstrap command (see the sketch below):

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1
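
A rough sketch of that check, assuming a Go kungfu-mng, a host file with one IP per line, and a -H value of comma-separated ip:slots entries; the function name and both formats are assumptions, not existing KungFu code:

package kungfumng

import (
	"os"
	"strings"
)

// firstGeneration is a hypothetical kungfu-mng check: it compares the IPs
// currently listed in the host file with the IPs that the bootstrap command's
// -H flag was rendered with. If they match exactly, this container belongs to
// the first generation of the job.
func firstGeneration(hostFilePath, bootstrapH string) (bool, error) {
	data, err := os.ReadFile(hostFilePath)
	if err != nil {
		return false, err
	}
	// Assumed host file format: one container IP per line.
	hostIPs := map[string]bool{}
	for _, line := range strings.Split(string(data), "\n") {
		if ip := strings.TrimSpace(line); ip != "" {
			hostIPs[ip] = true
		}
	}
	// Assumed -H format: comma-separated "ip:slots" entries, as passed to kungfu-run.
	bootIPs := map[string]bool{}
	for _, entry := range strings.Split(bootstrapH, ",") {
		if ip := strings.TrimSpace(strings.SplitN(entry, ":", 2)[0]); ip != "" {
			bootIPs[ip] = true
		}
	}
	if len(hostIPs) != len(bootIPs) {
		return false, nil
	}
	for ip := range hostIPs {
		if !bootIPs[ip] {
			return false, nil
		}
	}
	return true, nil
}

If firstGeneration returns true, kungfu-mng would bootstrap kungfu-run with init-version=0 (or just leave the flag unset); otherwise it would pass -init-version -1. As pointed out in the next comment, this check breaks when the host file returns to its original contents after two scaling operations.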

lgarithm (Collaborator) commented Jun 8, 2020

What if the config.json is restored to its original contents after two scaling operations?

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1

lgarithm (Collaborator) commented Jun 8, 2020

How about adding a version field to the config.json object?

zrss (Author) commented Jun 9, 2020

What if the config.json is restored to its original contents after two scaling operations?

  1. the number of containers (and IPs) == -H: the first generation
  2. the number of containers (and IPs) != -H: not the first generation

then

  1. first generation: bootstrap kungfu-run with init-version=0
  2. not the first generation: bootstrap kungfu-run with init-version=-1

We (the platform) should restrict scale-down so that the number of instances cannot drop below the default value; this simplifies the scenario.

How about adding a version field to the config.json object?

Good idea, we can file a feature request with the cluster manager to add a version field to the host file. Generally speaking, version = version + 1 on every scale up/down.
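
For illustration, kungfu-mng could then derive the flag directly from that field. The JSON layout, struct, and helper below are only a sketch of the proposal, not an existing host file or KungFu format:

package kungfumng

import (
	"encoding/json"
	"os"
)

// HostFile is a hypothetical host file layout carrying the proposed version
// field; the cluster manager would bump Version on every scale up/down.
type HostFile struct {
	Version int      `json:"version"`
	Hosts   []string `json:"hosts"`
}

// initVersionArgs returns the extra kungfu-run arguments derived from the
// host file version: version 0 means first generation (keep the default,
// i.e. pass nothing), any later version gets -init-version -1.
func initVersionArgs(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var hf HostFile
	if err := json.Unmarshal(data, &hf); err != nil {
		return nil, err
	}
	if hf.Version == 0 {
		return nil, nil // first generation: default init-version
	}
	return []string{"-init-version", "-1"}, nil // later generations
}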

zrss closed this as completed Jun 9, 2020