
v1alpha2: Implementation #492

Closed
wants to merge 25 commits

Conversation

gaocegege
Member

@gaocegege gaocegege commented Mar 22, 2018

@ScorpioCPH and @gaocegege implemented the main logic and basic unit test cases (73% coverage).

I think it is time to open the PR, and we have decided to add basic e2e test cases for v1alpha2 in the coming days.

We have created a branch for v1alpha1, and once v1alpha2 is merged into master we will remove the v1alpha2 branch.

PS: I suggest rebasing the PR onto master rather than squashing it, since the commit messages may be helpful for new contributors.

WDYT? @jlewi



ScorpioCPH and others added 20 commits March 22, 2018 11:22
Update API to v1alpha2 per this design proposal kubeflow/community#30.
* linter: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* linter: Keep v1alpha2

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* client: Update

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* utils: Add FakeServiceControl

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add basic test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* travis: Ignore vendored code

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* travis: Ignore vendored code

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Fix the status outdate problem

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add check for status update

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove call for KeyFunc

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* pod: Add comment and remove debug statements

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* *: Fix some errors in Travis CI

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* controller: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Add internal state test

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove useless log

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Separate ps and worker pods

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Fix travis

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* controller: Separate ps and worker pods

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Remove log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* defaulter: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* default: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* default: Fix

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* defaulter: Remove image default value

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* example: Add minimal example

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* utils.go: Add copyright holder

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* defaulter: Remove restartpolicy

Signed-off-by: Ce Gao <gaoce@caicloud.io>
* test: Add run test case

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* test: Add enqueue test

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* pod_test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* service_test: Add

Signed-off-by: Ce Gao <gaoce@caicloud.io>
@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@coveralls

Coverage Status

Coverage increased (+27.08%) to 72.503% when pulling 70f2a8d on gaocegege:v1alpha2-master into 771c4f9 on kubeflow:master.

@coveralls

coveralls commented Mar 22, 2018

Coverage Status

Coverage increased (+3.07%) to 48.487% when pulling 0947506 on gaocegege:v1alpha2-master into eec56b5 on kubeflow:master.


@jlewi
Contributor

jlewi commented Mar 22, 2018

Fantastic job on the test coverage.

@jlewi
Contributor

jlewi commented Mar 22, 2018

/assign jlewi

@gaocegege
Member Author

/retest

1 similar comment
@gaocegege
Member Author

/retest

@gaocegege
Member Author

@jlewi Could we merge it ASAP, since it blocks continued development of v1alpha2?

@jlewi jlewi mentioned this pull request Mar 23, 2018
@k8s-ci-robot

@gaocegege: The following test failed, say /retest to rerun them all:

Test name: kubeflow-tf-operator-presubmit — Commit: 0947506 — Rerun command: /test kubeflow-tf-operator-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jlewi
Contributor

jlewi commented Mar 23, 2018

The disk quota issue is kubeflow/testing#81 and should be fixed.

@jlewi
Contributor

jlewi commented Mar 23, 2018

Can we split the PR please and commit the code in v1alpha2 and code gen first?

Massive PRs slow down code review and block other work. Splitting a PR into multiple PRs is a great way to speed up the review process.

@jlewi
Contributor

jlewi commented Mar 24, 2018

Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


cmd/tf-operator.v2/app/server.go, line 53 at r1 (raw file):

const RecommendedKubeConfigPathEnv = "KUBECONFIG"

func Run(opt *options.ServerOption) error {

This looks very similar to
https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator/app/server.go

The only difference I see is that we get rid of reading readControllerConfig and, more importantly, that we use the v1alpha2 packages.

Why not refactor the code a bit to promote code reuse?

What if we had a RunForVersionFunction that could be called by Run?

I think this would be useful because this won't be the only time we change the API. For example, at a minimum, at some time in the future we will have a v1beta1 and a v1 API.
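To make the suggestion concrete, here is a minimal sketch of what such a refactor could look like. All names here (ServerOption, runFunc, registry) are hypothetical stand-ins for illustration, not the operator's actual API: the idea is that Run does the version-independent setup and dispatches to a per-version entry point.

```go
package main

import (
	"errors"
	"fmt"
)

// ServerOption is a stand-in for the real options.ServerOption struct.
type ServerOption struct {
	APIVersion string // e.g. "v1alpha1" or "v1alpha2"
}

// runFunc is a version-specific entry point: it would wire up the
// informers and controller for one API version.
type runFunc func(opt *ServerOption) error

// registry maps an API version to its run function, so adding v1beta1
// or v1 later only requires registering one more entry.
var registry = map[string]runFunc{
	"v1alpha1": func(opt *ServerOption) error {
		fmt.Println("starting v1alpha1 controller")
		return nil
	},
	"v1alpha2": func(opt *ServerOption) error {
		fmt.Println("starting v1alpha2 controller")
		return nil
	},
}

// Run performs the version-independent setup (flag parsing, kubeconfig,
// signal handling in the real operator) and then dispatches.
func Run(opt *ServerOption) error {
	rf, ok := registry[opt.APIVersion]
	if !ok {
		return errors.New("unsupported API version: " + opt.APIVersion)
	}
	return rf(opt)
}

func main() {
	if err := Run(&ServerOption{APIVersion: "v1alpha2"}); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, the two cmd/ binaries (or a single binary with a flag) share all of the common setup and differ only in the registered run function.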


cmd/tf-operator.v2/app/options/options.go, line 15 at r1 (raw file):

// limitations under the License.

package options

Could we just reuse the existing options package?
Is there anything different about this file?


examples/simple_tf_job.yaml, line 1 at r1 (raw file):

apiVersion: "kubeflow.org/v1alpha2"

Why do we need this?


examples/tf_job_v1alpha2.yaml, line 1 at r1 (raw file):

apiVersion: "kubeflow.org/v1alpha2"

Do we need this?


pkg/controller.v2/controller.go, line 142 at r1 (raw file):

	// - "tf-operator/tfjob-abc/worker/services", expects 4 adds.
	// - "tf-operator/tfjob-abc/worker/pods", expects 4 adds.
	expectations ControllerExpectationsInterface

Can you explain this design? Why do we need to check the number of events? Why don't we just check that if we expect workers 1...N, there are 1...N pods as expected?

Are there other K8s controllers that are designed this way?


pkg/controller.v2/controller_pod.go, line 67 at r1 (raw file):

		expectationPodsKey := genExpectationPodsKey(tfjobKey, rt)
		err := tc.expectations.ExpectCreations(expectationPodsKey, diff)

Why are you only checking the number of creations? Why don't you check that the actual pods you expect exist?

Suppose we expect 4 worker replicas; 1...N.

But suppose we get the following create events
Worker 0
Worker 0
Worker 1
Worker 2

So worker 0 is created twice but worker 3 is never created.

Won't the code think all 4 workers have been created even though it's not true?
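The concern above can be illustrated with a toy count-based expectations tracker. This is a simplified sketch loosely modeled on the count-based expectations pattern used by upstream Kubernetes controllers such as the ReplicaSet controller (ControllerExpectations in kube-controller-manager); the names here (countExpectations, ExpectCreations, CreationObserved, Satisfied) are illustrative, not the PR's actual code.

```go
package main

import "fmt"

// countExpectations only records how many creations are still
// outstanding for a key, not which pods were created.
type countExpectations struct {
	pending map[string]int
}

func newCountExpectations() *countExpectations {
	return &countExpectations{pending: map[string]int{}}
}

// ExpectCreations records that we plan to create n pods for the key.
func (e *countExpectations) ExpectCreations(key string, n int) {
	e.pending[key] += n
}

// CreationObserved decrements the outstanding count when any create
// event arrives -- the pod's identity is not checked.
func (e *countExpectations) CreationObserved(key string) {
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether all expected creations have been observed.
func (e *countExpectations) Satisfied(key string) bool {
	return e.pending[key] == 0
}

func main() {
	e := newCountExpectations()
	key := "tf-operator/tfjob-abc/worker/pods"
	e.ExpectCreations(key, 4)

	// The event stream from the example above: worker-0 twice,
	// worker-1, worker-2 -- and worker-3 never created.
	for _, pod := range []string{"worker-0", "worker-0", "worker-1", "worker-2"} {
		_ = pod // identity is ignored by a purely count-based tracker
		e.CreationObserved(key)
	}

	// The tracker is satisfied even though worker-3 is missing.
	fmt.Println("satisfied:", e.Satisfied(key))
}
```

In the upstream pattern this is tolerable because expectations only gate *when* the next reconcile may act; the reconcile itself re-lists the actual pods and repairs any diff, so a duplicate event delays correction rather than losing it permanently. The sketch shows why counting alone cannot detect the missing worker.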


pkg/controller.v2/controller_utils.go, line 398 at r1 (raw file):

	// CreatePodsOnNode creates a new pod according to the spec on the specified node,
	// and sets the ControllerRef.
	CreatePodsOnNode(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error

Why do we need a function to create pods on specific nodes?


pkg/util/signals/signal.go, line 2 at r1 (raw file):

/*
Copyright 2017 The Kubernetes Authors.

Can you move this into a separate PR? It will speed up the review process if we can break this up into small PRs, so it's easier to focus on the important bits in each PR.


pkg/util/signals/signal_windows.go, line 17 at r1 (raw file):

*/

package signals

Why do we have a package for windows? When would we ever run this code on windows?


test/e2e/sleep-and-random-exit/Dockerfile, line 3 at r1 (raw file):

FROM ubuntu:16.04

COPY ./sleep_and_random_exit.sh ~/

If we need this, please move this into a separate PR.


Comments from Reviewable

@gaocegege
Member Author

@jlewi API and code-gen is 0d61bf9

V1alpha2 core logic is 4e1dd8d

These two commits are the most important; the other commits mainly focus on tests, and I think we can just look through them.

I'm sorry, but I think it is hard to split the PR, for the following reason:

We removed v1alpha1 during development and added it back in a single commit, since you recommended keeping both implementations in master. If we split the PR, there is a lot of work to keep the two implementations side by side from the beginning, instead of from commit caf21fa.

Alternatively, we could keep one implementation and just remove v1alpha1; then I think I could split the earlier commits, since the commits on the v1alpha2 branch are already split that way.

Another option is to keep development on the v1alpha2 branch and wait until the downstream is ready; then we could split the commits in v1alpha2 and overwrite master.

/cc @ScorpioCPH

@jlewi
Contributor

jlewi commented Mar 24, 2018

I really think this PR needs to be split up. There's a bunch of code in here that should be relatively straightforward to approve, e.g. the signals package, so its inclusion in this PR is just a distraction.

I don't understand why splitting up this PR is difficult:

1. Create a new branch based off master HEAD.
2. Copy the files
   pkg/apis/tensorflow/v1alpha2
   pkg/clientset/.../v1alpha2
   into the new branch.
3. Create a PR.

That PR would be just the files for the API which should be relatively straightforward to review.

Follow the same process to create a PR for the signals package.

Once those PRs are committed, which should be relatively straightforward, you can sync with master. At that point the only thing left is the controller package, which is where I think the most attention needs to be paid.

@ScorpioCPH
Member

ScorpioCPH commented Mar 26, 2018

@jlewi You can check out branch v1alpha2; the whole commit history is there.

@jlewi
Contributor

jlewi commented Mar 26, 2018

@ScorpioCPH I'm not sure I follow. The point of splitting up this PR is that the vast majority of it is straightforward to review. So if we split up the PR we can get most of the files committed pretty quickly.

This would leave us with the controller logic and the controller logic is where I'd like to focus during the review process because there appear to be substantial changes to how we track and reconcile resources as part of the TFJob.

@jlewi
Contributor

jlewi commented Mar 27, 2018

Review status: 0 of 49 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/controller.v2/controller.go, line 76 at r1 (raw file):

	// wait between two reconciler sync.
	// It is set to 15 sec by default.
	// TODO(cph): maybe we can let it grows by multiple in the future

Do we really need exponential backoff?

If we correctly configure and listen to all appropriate events for a TFJob, then we should be able to set the periodic reconcile interval to a large value, e.g. 1-2 minutes.


pkg/controller.v2/controller.go, line 197 at r1 (raw file):

	// Create tfjob informer.
	tfJobInformer := tfJobInformerFactory.Kubeflow().V1alpha2().TFJobs()

Do we configure this informer to periodically generate update events to be used for reconciliation, like we currently do?
If we do, where does that happen?
If not, why not?
https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator/app/server.go#L82


pkg/controller.v2/controller.go, line 289 at r1 (raw file):

// attempt to process it, by calling the syncHandler.
func (tc *TFJobController) processNextWorkItem() bool {
	key, quit := tc.workQueue.Get()

Can I suggest making processNextWorkItem a utility function that can be called from the v1 and v2 controller?

I just fixed some issues in #501. I don't think this code is actually different, so it would be useful to avoid code duplication.


pkg/controller.v2/controller.go, line 325 at r1 (raw file):

	startTime := time.Now()
	defer func() {
		log.Infof("Finished syncing tfjob %q (%v)", key, time.Since(startTime))

It would be good to use logrus and add fields to tag log entries with job name.
Like
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller/controller.go#L221

This will make it a lot easier to debug issues processing jobs.


pkg/controller.v2/controller.go, line 344 at r1 (raw file):

	tfjob := sharedtfjob.DeepCopy()
	tfjobNeedsSync := tc.satisfiedExpectations(tfjob)

What is satisfiedExpectations? Why don't we just call tc.reconcileTFJobs and let the reconcile function figure out what, if anything, needs to be done?


pkg/controller.v2/controller.go, line 362 at r1 (raw file):

func (tc *TFJobController) reconcileTFJobs(tfjob *tfv1alpha2.TFJob) error {

	pods, err := tc.getPodsForTFJob(tfjob)

Why do we need to get the pods and service here?
Can't we just rely on reconcilePods and reconcileServices to do the correct thing?


pkg/controller.v2/controller_pod.go, line 37 at r1 (raw file):

// It will requeue the tfjob in case of an error while creating/deleting pods.
func (tc *TFJobController) reconcilePods(
	tfjob *tfv1alpha2.TFJob,

When a job is finished but not deleted, what prevents pods from being recreated?


pkg/controller.v2/controller_pod.go, line 242 at r1 (raw file):

}

// When a pod is created, enqueue the tfjob that manages it and update its expectations.

Why do we need to update the expectations? What are the expectations used for?


pkg/controller.v2/controller_pod.go, line 293 at r1 (raw file):

// and new replica set. old and cur must be *v1.Pod types.
func (tc *TFJobController) updatePod(old, cur interface{}) {
	// TODO(CPH): handle this gracefully.

Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job?

We can rely on the workQueue to aggregate events and rate limit things.


pkg/controller.v2/controller_service.go, line 172 at r1 (raw file):

	}

	// If any adoptions are attempted, we should first recheck for deletion

What are adoptions?


pkg/controller.v2/controller_service.go, line 255 at r1 (raw file):

// and new replica set. old and cur must be *v1.Service types.
func (tc *TFJobController) updateService(old, cur interface{}) {
	// TODO(CPH): handle this gracefully.

Can we just enqueue a work item for the associated job which will cause us to call reconcile for the job?

We can rely on the workQueue to aggregate events and rate limit things.


Comments from Reviewable

@jlewi
Contributor

jlewi commented Apr 4, 2018

@gaocegege How's it going with v1alpha2?

@gaocegege
Member Author

@jlewi

We planned to discuss the integration, but the community meeting had no time for this topic. Maybe we could schedule a separate meeting for it.

/cc @DjangoPeng @ScorpioCPH @ddysher

@DjangoPeng
Member

SGTM.

We could schedule an online meeting to discuss and make a decision after Tomb-Sweeping Day (the traditional Qingming Festival). How about Apr 8th?

@gaocegege @ddysher @ScorpioCPH WDYT

@gaocegege gaocegege closed this Apr 10, 2018
@ddysher
Member

ddysher commented Apr 12, 2018

For anyone who happens to read this PR, we will split the PR into smaller ones to ease review.
