
allow using WORKER:0 as chief #221

Merged (19 commits, Dec 20, 2017)

Conversation

@lluunn (Contributor) commented Dec 13, 2017

Second part for this issue.



@k8s-ci-robot commented

Hi @lluunn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi (Contributor) commented Dec 13, 2017

/ok-to-test

@jlewi (Contributor) commented Dec 15, 2017

Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved.



@jlewi (Contributor) commented Dec 15, 2017

This is awesome.
Thanks

@lluunn could you add


Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/trainer/training.go, line 216 at r1 (raw file):

	chief := j.job.Spec.TerminationPolicy.Chief
	if v, ok := replicaSetStates[spec.TfReplicaType(chief.ReplicaName)]; ok && v == spec.ReplicaStateSucceeded {

Can we check the specific state of the replica that is the chief rather than the overall state?



@lluunn (Contributor, Author) commented Dec 15, 2017

What do you want to add?


Review status: all files reviewed at latest revision, 1 unresolved discussion.


pkg/trainer/training.go, line 216 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Can we check the specific state of the replica that is the chief rather than the overall state?

It is checking the replica state of the chief, right? Maybe I misunderstood; could you elaborate?



@jlewi (Contributor) commented Dec 16, 2017

The policy specifies the index of a single replica; e.g. if we have 10 workers, only worker 0 is the chief.
So we should only take the status of worker 0 into account when deciding what to do.

replicaSetStates is some aggregation of all the workers, but that's not quite what we want.
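A minimal, self-contained sketch of the kind of check being suggested here: decide success from the one replica named by the termination policy (e.g. WORKER index 0) rather than from an aggregate over all workers of that type. All names below (replicaKey, ReplicaState, the map layout) are illustrative assumptions, not the actual tf-operator API.

```go
package main

import "fmt"

// ReplicaState is a stand-in for the controller's per-replica state.
type ReplicaState string

const ReplicaStateSucceeded ReplicaState = "Succeeded"

// replicaKey identifies one replica by type and index, so states can be
// tracked per replica instead of aggregated per replica type.
type replicaKey struct {
	Type  string
	Index int32
}

func main() {
	// Hypothetical per-replica states as the controller might track them.
	replicaStates := map[replicaKey]ReplicaState{
		{Type: "WORKER", Index: 0}: ReplicaStateSucceeded,
		{Type: "WORKER", Index: 1}: "Running",
	}

	// The termination policy names a single replica as chief, e.g. WORKER:0.
	chief := replicaKey{Type: "WORKER", Index: 0}

	// Only the chief's state decides whether the job finished successfully;
	// the other workers' states are ignored for this decision.
	if state, ok := replicaStates[chief]; ok && state == ReplicaStateSucceeded {
		fmt.Println("job succeeded: chief replica finished")
	}
}
```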


Review status: all files reviewed at latest revision, 1 unresolved discussion.



@googlebot commented

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@lluunn (Contributor, Author) commented Dec 19, 2017

Done.


Review status: 3 of 5 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/trainer/training.go, line 216 at r1 (raw file):

Previously, lluunn wrote…

It is checking the replica state of the chief, right? Maybe I misunderstood; could you elaborate?

Done.



@jlewi (Contributor) commented Dec 20, 2017

:lgtm:


Review status: 3 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@jlewi (Contributor) commented Dec 20, 2017

Looks good, but there are merge conflicts now. Can you fix them, please?

Also, can you open an issue to add an E2E test for the case where the chief isn't the master?

@lluunn (Contributor, Author) commented Dec 20, 2017

Thanks for the review. I think it's fixed now. PTAL


Review status: 0 of 5 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@coveralls commented Dec 20, 2017


Coverage increased (+0.3%) to 37.782% when pulling 91d8a23 on lluunn:kai into 6a2fc9c on tensorflow:master.

@lluunn (Contributor, Author) commented Dec 20, 2017

Filed this issue for the e2e test. I am interested in working on that after this one.

jlewi merged commit cb1e053 into kubeflow:master on Dec 20, 2017
jlewi added a commit to jlewi/k8s that referenced this pull request Jan 14, 2018
…orker.

  * This was added in kubeflow#221 and accidentally removed in the refactor in kubeflow#234.
jlewi added a commit that referenced this pull request Jan 16, 2018
…roken (#308)

* In syncTfJob, when checking whether a work queue item corresponds to a TrainingJob already in the map, we need to check the UID; otherwise we will not properly handle the case where a training job is deleted and a new job is then recreated with the same name (a sketch of this check follows below).

* We need to make sure that the Replicas field in TrainingJob is always properly set.

* We were only initializing replicas in setup, which was problematic when the TfJob controller gets restarted: on restart, setup won't be invoked because the job is already past that phase, so the replicas won't be reinitialized.

* test_runner needs to ignore case when checking whether the job succeeded; otherwise we conclude that successful jobs failed.

* The controller should only forget about a job after the job has been cleaned up, not when it is marked as succeeded or failed.

* Add back code to support termination policies that use the worker, and not the master, as the chief.
    * This was added in #221 and accidentally removed in the refactor in #234.
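The first bullet above describes matching a work queue item to a cached TrainingJob by UID rather than by name alone. A minimal, self-contained sketch of that idea follows; the Controller type, jobs map, and syncTfJob signature are illustrative assumptions, not the actual tf-operator controller code.

```go
package main

import "fmt"

// TrainingJob stands in for the controller's cached per-job state.
type TrainingJob struct {
	Name string
	UID  string
}

// Controller keeps one cached TrainingJob per namespace/name key.
type Controller struct {
	jobs map[string]*TrainingJob
}

// syncTfJob looks up the cached job for key and compares UIDs, so that a job
// deleted and recreated under the same name is treated as a brand-new job
// rather than being matched to the stale cached entry.
func (c *Controller) syncTfJob(key, uid string) {
	cached, ok := c.jobs[key]
	if ok && cached.UID != uid {
		// Same name, different UID: the old job was deleted and a new one was
		// created, so drop the stale entry and start fresh.
		delete(c.jobs, key)
		ok = false
	}
	if !ok {
		c.jobs[key] = &TrainingJob{Name: key, UID: uid}
		fmt.Printf("created new TrainingJob for %s (uid %s)\n", key, uid)
		return
	}
	fmt.Printf("reusing cached TrainingJob for %s (uid %s)\n", key, uid)
}

func main() {
	c := &Controller{jobs: map[string]*TrainingJob{}}
	c.syncTfJob("default/mnist", "uid-1") // first creation
	c.syncTfJob("default/mnist", "uid-1") // same job, cache hit
	c.syncTfJob("default/mnist", "uid-2") // recreated with same name, new UID
}
```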