
v1alpha2: Implementation #492

Closed
wants to merge 25 commits

Conversation

gaocegege
Member

@gaocegege gaocegege commented Mar 22, 2018

@ScorpioCPH and @gaocegege implemented the main logic and basic unit test cases (73% coverage).

I think it is time to open the PR, and we have decided to add basic e2e test cases for v1alpha2 in the coming days.

We have created a branch for v1alpha1, and once v1alpha2 is merged into master we will remove the v1alpha2 branch.

PS: I suggest rebasing the PR onto master rather than squashing it, since the commit messages may be helpful for new contributors.

WDYT? @jlewi



ScorpioCPH and others added 20 commits March 22, 2018 11:22
Update API to v1alpha2 per this design proposal kubeflow/community#30.
* linter: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* linter: Keep v1alpha2

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* client: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* utils: Add FakeServiceControl

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add basic test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* travis: Ignore vendored code

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* travis: Ignore vendored code

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Fix the status outdate problem

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add check for status update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove call for KeyFunc

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* pod: Add comment and remove debug statements

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* *: Fix some errors in Travis CI

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Add internal state test

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove useless log

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Separate ps and worker pods

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Fix travis

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Separate ps and worker pods

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* defaulter: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* default: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* default: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* defaulter: Remove image default value

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* example: Add minimal example

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* utils.go: Add copyright holder

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* defaulter: Remove restartpolicy

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* test: Add run test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add enqueue test

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* pod_test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* service_test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@coveralls

Coverage Status

Coverage increased (+27.08%) to 72.503% when pulling 70f2a8d on gaocegege:v1alpha2-master into 771c4f9 on kubeflow:master.

@coveralls

coveralls commented Mar 22, 2018

Coverage Status

Coverage increased (+3.07%) to 48.487% when pulling 0947506 on gaocegege:v1alpha2-master into eec56b5 on kubeflow:master.


@jlewi
Contributor

jlewi commented Mar 22, 2018

Fantastic job on the test coverage.

@jlewi
Contributor

jlewi commented Mar 22, 2018

/assign jlewi

@gaocegege
Member Author

/retest

1 similar comment
@gaocegege
Member Author

/retest

@gaocegege
Member Author

@jlewi Could we merge it ASAP, since it blocks continued development of v1alpha2?

@jlewi jlewi mentioned this pull request Mar 23, 2018
@k8s-ci-robot

@gaocegege: The following test failed, say /retest to rerun them all:

Test name: kubeflow-tf-operator-presubmit — Commit: 0947506 — Rerun command: /test kubeflow-tf-operator-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jlewi
Contributor

jlewi commented Mar 23, 2018

The disk quota issue is kubeflow/testing#81 and should be fixed.

@jlewi
Contributor

jlewi commented Mar 23, 2018

Can we split the PR please and commit the code in v1alpha2 and code gen first?

Massive PRs slow down code review and block other work. Splitting a PR into multiple PRs is a great way to speed up the review process.

@jlewi
Contributor

jlewi commented Mar 24, 2018

Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


cmd/tf-operator.v2/app/server.go, line 53 at r1 (raw file):

const RecommendedKubeConfigPathEnv = "KUBECONFIG"

func Run(opt *options.ServerOption) error {

This looks very similar to
https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator/app/server.go

The only difference I see is that we get rid of reading readControllerConfig and, more importantly, that we use the v1alpha2 packages.

Why not refactor the code a bit to promote code reuse?

What if we had a RunForVersionFunction that could be called by Run?

I think this would be useful because this won't be the only time we change the API. For example, at a minimum, at some time in the future we will have a v1beta1 and a v1 API.
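To make the suggestion concrete, here is a minimal sketch of what such a refactor could look like. All names here (ServerOption, runFunc, registry) are hypothetical stand-ins for illustration, not the operator's actual API: the idea is that Run does the version-independent setup and dispatches to a per-version entry point.

```go
package main

import (
	"errors"
	"fmt"
)

// ServerOption is a stand-in for the real options.ServerOption struct.
type ServerOption struct {
	APIVersion string // e.g. "v1alpha1" or "v1alpha2"
}

// runFunc is a version-specific entry point: it would wire up the
// informers and controller for one API version.
type runFunc func(opt *ServerOption) error

// registry maps an API version to its run function, so adding v1beta1
// or v1 later only requires registering one more entry.
var registry = map[string]runFunc{
	"v1alpha1": func(opt *ServerOption) error {
		fmt.Println("starting v1alpha1 controller")
		return nil
	},
	"v1alpha2": func(opt *ServerOption) error {
		fmt.Println("starting v1alpha2 controller")
		return nil
	},
}

// Run performs the version-independent setup (flag parsing, kubeconfig,
// signal handling in the real operator) and then dispatches.
func Run(opt *ServerOption) error {
	rf, ok := registry[opt.APIVersion]
	if !ok {
		return errors.New("unsupported API version: " + opt.APIVersion)
	}
	return rf(opt)
}

func main() {
	if err := Run(&ServerOption{APIVersion: "v1alpha2"}); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, the two cmd/ binaries (or a single binary with a flag) share all of the common setup and differ only in the registered run function.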


cmd/tf-operator.v2/app/options/options.go, line 15 at r1 (raw file):

// limitations under the License.

package options

Could we just reuse the existing options package?
Is there anything different about this file?


examples/simple_tf_job.yaml, line 1 at r1 (raw file):

apiVersion: "kubeflow.org/v1alpha2"

Why do we need this?


examples/tf_job_v1alpha2.yaml, line 1 at r1 (raw file):

apiVersion: "kubeflow.org/v1alpha2"

Do we need this?


pkg/controller.v2/controller.go, line 142 at r1 (raw file):

	// - "tf-operator/tfjob-abc/worker/services", expects 4 adds.
	// - "tf-operator/tfjob-abc/worker/pods", expects 4 adds.
	expectations ControllerExpectationsInterface

Can you explain this design? Why do we need to check the number of events? Why don't we just check that if we expect workers 1...N, there are 1...N pods as expected?

Are there other K8s controllers that are designed this way?


pkg/controller.v2/controller_pod.go, line 67 at r1 (raw file):

		expectationPodsKey := genExpectationPodsKey(tfjobKey, rt)
		err := tc.expectations.ExpectCreations(expectationPodsKey, diff)

Why are you only checking the number of creations? Why don't you check that the actual pods you expect exist?

Suppose we expect 4 worker replicas; 1...N.

But suppose we get the following create events
Worker 0
Worker 0
Worker 1
Worker 2

So worker 0 is created twice but worker 3 is never created.

Won't the code think all 4 workers have been created even though it's not true?
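The concern above can be illustrated with a toy count-based expectations tracker. This is a simplified sketch loosely modeled on the count-based expectations pattern used by upstream Kubernetes controllers such as the ReplicaSet controller (ControllerExpectations in kube-controller-manager); the names here (countExpectations, ExpectCreations, CreationObserved, Satisfied) are illustrative, not the PR's actual code.

```go
package main

import "fmt"

// countExpectations only records how many creations are still
// outstanding for a key, not which pods were created.
type countExpectations struct {
	pending map[string]int
}

func newCountExpectations() *countExpectations {
	return &countExpectations{pending: map[string]int{}}
}

// ExpectCreations records that we plan to create n pods for the key.
func (e *countExpectations) ExpectCreations(key string, n int) {
	e.pending[key] += n
}

// CreationObserved decrements the outstanding count when any create
// event arrives -- the pod's identity is not checked.
func (e *countExpectations) CreationObserved(key string) {
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether all expected creations have been observed.
func (e *countExpectations) Satisfied(key string) bool {
	return e.pending[key] == 0
}

func main() {
	e := newCountExpectations()
	key := "tf-operator/tfjob-abc/worker/pods"
	e.ExpectCreations(key, 4)

	// The event stream from the example above: worker-0 twice,
	// worker-1, worker-2 -- and worker-3 never created.
	for _, pod := range []string{"worker-0", "worker-0", "worker-1", "worker-2"} {
		_ = pod // identity is ignored by a purely count-based tracker
		e.CreationObserved(key)
	}

	// The tracker is satisfied even though worker-3 is missing.
	fmt.Println("satisfied:", e.Satisfied(key))
}
```

In the upstream pattern this is tolerable because expectations only gate *when* the next reconcile may act; the reconcile itself re-lists the actual pods and repairs any diff, so a duplicate event delays correction rather than losing it permanently. The sketch shows why counting alone cannot detect the missing worker.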


pkg/controller.v2/controller_utils.go, line 398 at r1 (raw file):

	// CreatePodsOnNode creates a new pod according to the spec on the specified node,
	// and sets the ControllerRef.
	CreatePodsOnNode(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error

Why do we need a function to create pods on specific nodes?


pkg/util/signals/signal.go, line 2 at r1 (raw file):

/*
Copyright 2017 The Kubernetes Authors.

Can you move this into a separate PR? It will speed up the review process if we can break this up into small PRs, so it's easier to focus on the important bits in each PR.


pkg/util/signals/signal_windows.go, line 17 at r1 (raw file):

*/

package signals

Why do we have a package for windows? When would we ever run this code on windows?


test/e2e/sleep-and-random-exit/Dockerfile, line 3 at r1 (raw file):

FROM ubuntu:16.04

COPY ./sleep_and_random_exit.sh ~/

If we need this, please move this into a separate PR.


Comments from Reviewable

@gaocegege
Member Author

@jlewi API and code-gen is 0d61bf9

V1alpha2 core logic is 4e1dd8d

These two commits are the most important; the other commits mainly focus on tests, and I think we can just look through them.

I'm sorry, but I think it is hard to split the PR, for the following reason:

We removed v1alpha1 during development and added it back in a single commit, since you recommended keeping both implementations in master. If we split the PR, there is a lot of work to keep the two implementations side by side from the beginning, instead of from commit caf21fa.

Alternatively, we could keep one implementation and just remove v1alpha1; then I think I could split the earlier commits, since the commits on the v1alpha2 branch are already split that way.

Another option is to keep development on the v1alpha2 branch and wait until the downstream is ready; then we could split the commits in v1alpha2 and overwrite master.

/cc @ScorpioCPH

@jlewi
Contributor

jlewi commented Mar 24, 2018

I really think this PR needs to be split up. There's a bunch of code in here that should be relatively straightforward to approve, e.g. the signals package, so its inclusion in this PR is just a distraction.

I don't understand why splitting up this PR is difficult:

1. Create a new branch based off master HEAD.
2. Copy the files
   pkg/apis/tensorflow/v1alpha2
   pkg/clientset/.../v1alpha2
   into the new branch.
3. Create a PR.

That PR would be just the files for the API which should be relatively straightforward to review.

Follow the same process to create a PR for the signals package.

Once those PRs are committed, which should be relatively straightforward, you can sync with master. At that point the only thing left is the controller package, which is where I think the most attention needs to be paid.

@ScorpioCPH
Member

ScorpioCPH commented Mar 26, 2018

@jlewi You can check out branch v1alpha2; the whole commit history is there.

@jlewi
Contributor

jlewi commented Mar 26, 2018

@ScorpioCPH I'm not sure I follow. The point of splitting up this PR is that the vast majority of it is straightforward to review. So if we split up the PR we can get most of the files committed pretty quickly.

This would leave us with the controller logic and the controller logic is where I'd like to focus during the review process because there appear to be substantial changes to how we track and reconcile resources as part of the TFJob.

@jlewi
Contributor

jlewi commented Mar 27, 2018

Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/controller.v2/controller.go, line 76 at r1 (raw file):

	// wait between two reconciler sync.
	// It is set to 15 sec by default.
	// TODO(cph): maybe we can let it grows by multiple in the future

Do we really need exponential backoff?

If we correctly configure and listen to all appropriate events for a TFJob, then we should be able to set the periodic reconcile interval to a large value, e.g. 1-2 minutes.


pkg/controller.v2/controller.go, line 197 at r1 (raw file):

	// Create tfjob informer.
	tfJobInformer := tfJobInformerFactory.Kubeflow().V1alpha2().TFJobs()

Do we configure this informer to periodically generate update events to be used for reconciliation, like we currently do?
If we do, where does that happen?
If not, why not?
https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator/app/server.go#L82


pkg/controller.v2/controller.go, line 289 at r1 (raw file):

// attempt to process it, by calling the syncHandler.
func (tc *TFJobController) processNextWorkItem() bool {
	key, quit := tc.workQueue.Get()

Can I suggest making processNextWorkItem a utility function that can be called from the v1 and v2 controller?

I just fixed some issues in #501. I don't think this code is actually different, so it would be useful to avoid code duplication.


pkg/controller.v2/controller.go, line 325 at r1 (raw file):

	startTime := time.Now()
	defer func() {
		log.Infof("Finished syncing tfjob %q (%v)", key, time.Since(startTime))

It would be good to use logrus and add fields to tag log entries with job name.
Like
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller/controller.go#L221

This will make it a lot easier to debug issues processing jobs.


pkg/controller.v2/controller.go, line 344 at r1 (raw file):

	tfjob := sharedtfjob.DeepCopy()
	tfjobNeedsSync := tc.satisfiedExpectations(tfjob)

What is satisfiedExpectations? Why don't we just call tc.reconcileTFJobs and let the reconcile function figure out what, if anything, needs to be done?


pkg/controller.v2/controller.go, line 362 at r1 (raw file):

func (tc *TFJobController) reconcileTFJobs(tfjob *tfv1alpha2.TFJob) error {

	pods, err := tc.getPodsForTFJob(tfjob)

Why do we need to get the pods and service here?
Can't we just rely on reconcilePods and reconcileServices to do the correct thing?


pkg/controller.v2/controller_pod.go, line 37 at r1 (raw file):

// It will requeue the tfjob in case of an error while creating/deleting pods.
func (tc *TFJobController) reconcilePods(
	tfjob *tfv1alpha2.TFJob,

When a job is finished but not deleted, what prevents pods from being recreated?


pkg/controller.v2/controller_pod.go, line 242 at r1 (raw file):

}

// When a pod is created, enqueue the tfjob that manages it and update its expectations.

Why do we need to update the expectations? What are the expectations used for?


pkg/controller.v2/controller_pod.go, line 293 at r1 (raw file):

// and new replica set. old and cur must be *v1.Pod types.
func (tc *TFJobController) updatePod(old, cur interface{}) {
	// TODO(CPH): handle this gracefully.

Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job?

We can rely on the workQueue to aggregate events and rate limit things.


pkg/controller.v2/controller_service.go, line 172 at r1 (raw file):

	}

	// If any adoptions are attempted, we should first recheck for deletion

What are adoptions?


pkg/controller.v2/controller_service.go, line 255 at r1 (raw file):

// and new replica set. old and cur must be *v1.Service types.
func (tc *TFJobController) updateService(old, cur interface{}) {
	// TODO(CPH): handle this gracefully.

Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job?

We can rely on the workQueue to aggregate events and rate limit things.


Comments from Reviewable

@jlewi
Contributor

jlewi commented Apr 4, 2018

@gaocegege How's it going with v1alpha2?

@gaocegege
Member Author

@jlewi

We planned to discuss the integration, but the community meeting had no time for this topic. Maybe we could schedule a separate meeting for it.

/cc @DjangoPeng @ScorpioCPH @ddysher

@DjangoPeng
Member

SGTM.

We could schedule an online meeting to discuss and make a decision after Tomb-Sweeping Day (the traditional Qingming Festival). How about Apr 8th?

@gaocegege @ddysher @ScorpioCPH WDYT

@gaocegege gaocegege closed this Apr 10, 2018
@ddysher
Member

ddysher commented Apr 12, 2018

For anyone who happens to read this PR, we will split the PR into smaller ones to ease review.
