-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use owner references instead of timestamps to determine build/build pod ownership #18735
use owner references instead of timestamps to determine build/build pod ownership #18735
Conversation
@csrwng if you have a moment to comment on this PR it'd be appreciated. @bparees and @gabemontero too. I think we might have discussed this in the past, but I don't recall what the outcome was. Perhaps back then it was "don't touch it unless something comes up in the future?" In https://bugzilla.redhat.com/show_bug.cgi?id=1547551 a problem was encountered where the build controller rejected its own build pod because of timestamp differences, ultimately caused by faulty time synchronisation between the masters in the cluster in question (ntp not switched on). I'm wondering if we can tighten up the code in this area. So far this PR is a naive change to use the pre-existing OwnerReference (including UID) instead of timestamps as the factor that links build pods and builds. However, I'm a bit in doubt about a couple of wider aspects of the code - added three questions as code comments. Feedback would be appreciated. I'm wondering whether this falls under "we added extra code for an unlikely corner case, but it doesn't actually quite work correctly and there are no tests" ? |
update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists) | ||
return update, nil | ||
} | ||
// QUESTION 1: isn't it a mistake to be returning nil, nil here in | ||
// the case that err != nil? Shouldn't we return nil, err in that | ||
// case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that yes it's a mistake to return nil, nil in case err != nil ... we want to retry handling this build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
// again fails to Update() the Build correctly. Don't we then lose | ||
// information like the push secret, etc. that we try to stash | ||
// below? Isn't that wrong? Should we / can we effectively jump to | ||
// line 913 in this case (refactoring instead of goto)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -425,12 +425,14 @@ func (bc *BuildController) cancelBuild(build *buildapi.Build) (*buildUpdate, err | |||
func (bc *BuildController) handleNewBuild(build *buildapi.Build, pod *v1.Pod) (*buildUpdate, error) { | |||
// If a pod was found, and it was created after the build was created, it | |||
// means that the build is active and its status should be updated | |||
// QUESTION 3: if we fix question 2 below, can we remove this entire stanza, | |||
// and wouldn't that be the right thing to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you already know that a pod exists, why not handle it here instead of checking policy, and then trying to create it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next. The implementation related to handling and advancing the state should be in one place for legibility and maintainability. And precisely one thing only should determine whether the work in the given state is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next.
But the work is to move to the next state :)
The implementation related to handling and advancing the state should be in one place for legibility and maintainability
This I think I agree with. It does allow simpler code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to remember the rules for the workqueue.RateLimitingInterface
... can we assume we will not get duplicate events? I thought the answer was we could not assume that. If so, we would want this stanza to handle it.
Even if we can assume no duplicate events, this still seems like safe code / protection for something unexpected, including a potential bug in the lister/queue/cache stuff that lead to duplicate events.
Thanks for looking this over @csrwng! |
|
||
// HasOwnerReference returns true if the build pod has an OwnerReference to the | ||
// build. | ||
func HasOwnerReference(pod *v1.Pod, build *buildapi.Build) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming the method to HaveACommonOwnerReference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not so much a common owner ref. it's one owning the other. I think the name is fine but if we want something else, "Owns()" would be the name i would suggest. Or "BuildOwnsPod" is perhaps clearest about the relationship being tested.
// again fails to Update() the Build correctly. Don't we then lose | ||
// information like the push secret, etc. that we try to stash | ||
// below? Isn't that wrong? Should we / can we effectively jump to | ||
// line 913 in this case (refactoring instead of goto)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists) | ||
return update, nil | ||
} | ||
// QUESTION 1: isn't it a mistake to be returning nil, nil here in | ||
// the case that err != nil? Shouldn't we return nil, err in that | ||
// case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -425,12 +425,14 @@ func (bc *BuildController) cancelBuild(build *buildapi.Build) (*buildUpdate, err | |||
func (bc *BuildController) handleNewBuild(build *buildapi.Build, pod *v1.Pod) (*buildUpdate, error) { | |||
// If a pod was found, and it was created after the build was created, it | |||
// means that the build is active and its status should be updated | |||
// QUESTION 3: if we fix question 2 below, can we remove this entire stanza, | |||
// and wouldn't that be the right thing to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to remember the rules for the workqueue.RateLimitingInterface
... can we assume we will not get duplicate events? I thought the answer was we could not assume that. If so, we would want this stanza to handle it.
Even if we can assume no duplicate events, this still seems like safe code / protection for something unexpected, including a potential bug in the lister/queue/cache stuff that lead to duplicate events.
Hmm, I'm not sure if I'm feeling courageous enough to remove the "question 3" stanza. I fear too much happens under handleNewBuild - e.g. side-effects in different run policies; non-local state in run policies, image and secret resolution. Maybe that's why this hack exists. The risk is that some time after a pod is kicked off but the build update fails, we get back into handleNewBuild and make different decisions to those made previously. The pod is already running but we update the build, or other objects, in a divergent way. I think we're one state short, and/or we have got other architectural shortcomings. I'm not convinced it's worth opening this all up. I think the maximum this PR should do is change the timestamp logic, fix questions 1 & 2, and add documentation around question 3 (assuming others are in agreement with my viewpoint). Thoughts? |
+1 @jim-minter with some sort of //TODO for your number 3 ... I'm glad I don't have to come up with the wording for that one |
Sgtm |
@jim-minter be aware that we didn't use to set the ownerrefs. Have you thought through how this impacts someone doing an upgrade who has a build/build-pod that have no owner relationship? It seems unlikely such a cluster would have a build in the "new" state, but it's not impossible. You might consider making this check more thorough to detect/handle "older" build pods that have a build annotation, but no ownerref. |
FWIW, I think we've set OwnerRefs since 3091d6a, which was in 3.6. |
9e0b0b9
to
c0bc246
Compare
/lgtm |
c0bc246
to
71c1059
Compare
repushed, updated tests only |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bparees, jim-minter The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
the extended build failure was
/hold cancel |
/retest |
/retest |
1 similar comment
/retest |
/hold |
3.9 has been branched |
/hold |
/hold cancel |
/retest Please review the full test history for this PR and help us cut down flakes. |
/test all [submit-queue is verifying that this PR is safe to merge] |
@jim-minter: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 18735, 18746). |
Follow up from https://bugzilla.redhat.com/show_bug.cgi?id=1547551