
fix(controller): calculate satisfied with && instead of || #1120

Merged (1 commit, Mar 11, 2020)

Conversation

@GuoHaiqing (Contributor) commented Dec 24, 2019

If satisfied is calculated with '||', it only guarantees that some replicaTypes are satisfied, not all of them, which can cause problems. It should be changed to '&&'.
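
To show the intent, here is a minimal sketch (illustrative names only, not the operator's actual code) of aggregating the per-replicaType expectation checks with '&&':

package main

import "fmt"

// satisfiedExpectations aggregates per-replica-type checks. With '&&' it is
// true only when every replica type's expectations are satisfied; OR-ing the
// checks lets a single satisfied replica type make the whole job look satisfied.
func satisfiedExpectations(byType map[string]bool) bool {
	satisfied := true
	for _, ok := range byType {
		satisfied = satisfied && ok
	}
	return satisfied
}

func main() {
	// Master pod creation already observed, PS creation still pending.
	byType := map[string]bool{"Master": true, "PS": false}
	fmt.Println(satisfiedExpectations(byType)) // false with '&&'; '||' would report true
}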


@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@k8s-ci-robot

Hi @GuoHaiqing. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls commented Dec 24, 2019

Coverage Status

Coverage remained the same at 96.512% when pulling d7855fd on GuoHaiqing:fix/controller-satisfied into 4b67180 on kubeflow:master.

@TravisBuddy

Hey @GuoHaiqing,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: f80653a0-261e-11ea-8748-5b98fb6aa2de

@GuoHaiqing (Contributor, Author)

@googlebot I signed it!

@GuoHaiqing force-pushed the fix/controller-satisfied branch 2 times, most recently from 0d981a8 to 2b9ab3d on December 24, 2019 at 07:34
@TravisBuddy

Hey @GuoHaiqing,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 1c08a900-2620-11ea-8748-5b98fb6aa2de

@TravisBuddy

Hey @GuoHaiqing,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 6b1245b0-2620-11ea-8748-5b98fb6aa2de

@googlebot

CLAs look good, thanks!


@GuoHaiqing (Contributor, Author)

/assign @richardsliu

@TravisBuddy

Hey @GuoHaiqing,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: fd6cb3f0-2620-11ea-8748-5b98fb6aa2de

@gaocegege (Member)

> If satisfied is calculated with '||', it only guarantees that some replicaTypes are satisfied, not all of them, which can cause problems. It should be changed to '&&'.

Could you please explain more about why it may cause problems? Thanks

@GuoHaiqing (Contributor, Author)

reconcileTFJobs reconciles all replicaTypes, and satisfiedExpectations determines whether a new reconciliation should begin. I think a new reconciliation should begin only when all replicaTypes are satisfied.

tfjobNeedsSync := tc.satisfiedExpectations(tfjob)
// Set default for the new tfjob.
scheme.Scheme.Default(tfjob)

var reconcileTFJobsErr error
if tfjobNeedsSync && tfjob.DeletionTimestamp == nil {
	reconcileTFJobsErr = tc.reconcileTFJobs(tfjob)
}
// Diff current active pods/services with replicas.
for rtype, spec := range tfjob.Spec.TFReplicaSpecs {
	err = tc.reconcilePods(tfjob, pods, rtype, spec, replicasStatus)
	if err != nil {
		logger.Warnf("reconcilePods error %v", err)
		return err
	}

	err = tc.reconcileServices(tfjob, services, rtype, spec)

	if err != nil {
		logger.Warnf("reconcileServices error %v", err)
		return err
	}
}

@gaocegege (Member)

Then, the problem is about performance?

@GuoHaiqing (Contributor, Author)

In one reconciliation, the controller creates one master and one PS. Then the controller observes the master pod creation. At that moment satisfiedExpectations returns true, so a new reconciliation begins. In that reconciliation the PS pods returned by the client-go cache are still empty, so the controller creates another PS pod.

ControllerExpectations exists to handle the stale client-go cache, but in this situation I think it does not play its intended role.
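
A toy illustration of that scenario (made-up helper names, not the real ControllerExpectations API): only the master creation has been observed, so OR-ing the per-type checks starts a resync against a stale cache, while AND-ing them waits.

package main

import "fmt"

// expectations is a toy stand-in for ControllerExpectations: the number of
// pod creations still waiting to be observed, keyed per replica type.
type expectations map[string]int

func (e expectations) expectCreations(key string, n int) { e[key] += n }
func (e expectations) creationObserved(key string)       { e[key]-- }
func (e expectations) satisfied(key string) bool         { return e[key] <= 0 }

func main() {
	e := expectations{}

	// One reconciliation creates one master pod and one PS pod.
	e.expectCreations("tfjob/master", 1)
	e.expectCreations("tfjob/ps", 1)

	// So far the informer has only delivered the master pod creation.
	e.creationObserved("tfjob/master")

	masterOK := e.satisfied("tfjob/master") // true
	psOK := e.satisfied("tfjob/ps")         // false

	fmt.Println("with '||':", masterOK || psOK) // true: resync now, stale cache has no PS pod, a duplicate PS pod gets created
	fmt.Println("with '&&':", masterOK && psOK) // false: wait until the PS creation is observed
}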

@gaocegege (Member)

Gotcha. Reasonable. Let me have a look.

@gaocegege (Member)

/lgtm

@gaocegege (Member)

/ok-to-test

@richardsliu (Contributor)

/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@richardsliu (Contributor)

/retest

@richardsliu (Contributor)

@GuoHaiqing Can you rebase your PR? Sorry it has been many months and I forgot about this.

@gaocegege (Member) left a comment


/lgtm

@gaocegege (Member)

I think this PR introduces a new problem when the controller deletes pods after the job has succeeded and cleanPodPolicy is Running.

When a pod is deleted, we get one observation here: jc.Expectations.DeletionObserved(expectationPodsKey). But because of this PR we do not sync the TFJob, so we cannot reach here: https://github.com/GuoHaiqing/tf-operator/blob/d7855fdee562594830f3b0fa4d3bbb60be0c014a/pkg/controller.v1/tensorflow/controller.go#L373
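
A rough sketch of the worry (names are made up, not the controller's code): the deletion is observed, but while any replica type's expectations are still unmet the '&&' aggregate stays false, so this event does not trigger the sync that would reach the code linked above.

package main

import "fmt"

func main() {
	// Per-replica-type "expectations satisfied" state after the controller
	// deleted the still-running pods of a succeeded job and the informer has
	// only observed the worker deletions so far.
	satisfiedByType := map[string]bool{"Worker": true, "PS": false}

	needsSync := true
	for _, ok := range satisfiedByType {
		needsSync = needsSync && ok // the aggregation introduced by this PR
	}

	if needsSync {
		fmt.Println("sync runs: reconcileTFJobs reaches the cleanup/status code")
	} else {
		// The deletion was observed and recorded, but this event does not
		// trigger a sync; the cleanup path has to wait for a later event.
		fmt.Println("sync skipped for this event")
	}
}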

WDYT @GuoHaiqing @SimonCqk
