
use owner references instead of timestamps to determine build/build pod ownership #18735

Merged: 1 commit merged into openshift:master on Mar 10, 2018

Conversation

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2018
@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 23, 2018
@jim-minter (Contributor, Author):

@csrwng if you have a moment to comment on this PR it'd be appreciated. @bparees and @gabemontero too. I think we might have discussed this in the past, but I don't recall what the outcome was. Perhaps back then it was "don't touch it unless something comes up in the future?"

In https://bugzilla.redhat.com/show_bug.cgi?id=1547551 a problem was encountered where the build controller rejected its own build pod because of timestamp differences, ultimately caused by faulty time synchronisation between the masters in the cluster in question (ntp not switched on).

I'm wondering if we can tighten up the code in this area.

So far this PR is a naive change to use the pre-existing OwnerReference (including UID) instead of timestamps as the factor that links build pods and builds.

However, I'm a bit unsure about a couple of wider aspects of the code; I've added three questions as code comments. Feedback would be appreciated.

I'm wondering whether this falls under "we added extra code for an unlikely corner case, but it doesn't actually quite work correctly and there are no tests" ?

update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
return update, nil
}
// QUESTION 1: isn't it a mistake to be returning nil, nil here in
// the case that err != nil? Shouldn't we return nil, err in that
// case?
Contributor:

I would say that yes it's a mistake to return nil, nil in case err != nil ... we want to retry handling this build.

Contributor:

+1
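For reference, here is a minimal sketch of the fix the thread converges on, shown as a fragment in the style of the diff excerpts above. The Get call, podName, kerrors (k8s.io/apimachinery/pkg/api/errors), metav1, and the strategy/transitionToPhase helpers are assumptions about the surrounding controller code, not lines from the merged PR:

```go
// Sketch only, not the merged code: propagate unexpected errors so the
// workqueue's rate limiter requeues the build instead of dropping it.
pod, err := bc.podClient.Pods(build.Namespace).Get(podName, metav1.GetOptions{})
switch {
case kerrors.IsNotFound(err):
	// No pod yet: fall through to the normal pod-creation path.
case err != nil:
	// Previously this path effectively returned nil, nil; returning the
	// error means the build is retried, as agreed above.
	return nil, err
case !strategy.HasOwnerReference(pod, build):
	// A pod with the expected name that this build does not own.
	update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
	return update, nil
}
```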

// again fails to Update() the Build correctly. Don't we then lose
// information like the push secret, etc. that we try to stash
// below? Isn't that wrong? Should we / can we effectively jump to
// line 913 in this case (refactoring instead of goto)?
Contributor:

I would say yes

Contributor:

+1

@@ -425,12 +425,14 @@ func (bc *BuildController) cancelBuild(build *buildapi.Build) (*buildUpdate, err
func (bc *BuildController) handleNewBuild(build *buildapi.Build, pod *v1.Pod) (*buildUpdate, error) {
// If a pod was found, and it was created after the build was created, it
// means that the build is active and its status should be updated
// QUESTION 3: if we fix question 2 below, can we remove this entire stanza,
// and wouldn't that be the right thing to do?
Contributor:

So if you already know that a pod exists, why not handle it here instead of checking policy, and then trying to create it?

Contributor (Author):

Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next. The implementation related to handling and advancing the state should be in one place for legibility and maintainability. And precisely one thing only should determine whether the work in the given state is complete.

Contributor:

Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next.

But the work is to move to the next state :)

The implementation related to handling and advancing the state should be in one place for legibility and maintainability

This I think I agree with. It does allow simpler code.

Contributor:

trying to remember the rules for the workqueue.RateLimitingInterface ... can we assume we will not get duplicate events? I thought the answer was we could not assume that. If so, we would want this stanza to handle it.

Even if we can assume no duplicate events, this still seems like safe code / protection for something unexpected, including a potential bug in the lister/queue/cache stuff that could lead to duplicate events.
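
To make the stanza under discussion concrete, here is a rough sketch of how the pod-already-exists check at the top of handleNewBuild could look once it keys off the owner reference instead of creation timestamps. handleActiveBuild and the other helper names are assumptions about neighbouring controller code, not the exact diff:

```go
// Sketch only: inside handleNewBuild, before the run-policy check and pod creation.
if pod != nil {
	// The lister already returned a pod for this build. If the build owns it,
	// the build is really active and only its status needs syncing; this also
	// guards against duplicate or stale events from the workqueue.
	if strategy.HasOwnerReference(pod, build) {
		return bc.handleActiveBuild(build, pod)
	}
	// A pod with the expected name that this build does not own: error out
	// rather than adopt it.
	update := transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
	return update, nil
}
```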

@jim-minter (Contributor, Author):

Thanks for looking this over @csrwng!


// HasOwnerReference returns true if the build pod has an OwnerReference to the
// build.
func HasOwnerReference(pod *v1.Pod, build *buildapi.Build) bool {
Contributor:

How about renaming the method to HaveACommonOwnerReference

Contributor:

it's not so much a common owner ref. it's one owning the other. I think the name is fine but if we want something else, "Owns()" would be the name i would suggest. Or "BuildOwnsPod" is perhaps clearest about the relationship being tested.
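
For illustration, a minimal sketch of what a helper with this signature could look like, matching the owner-reference approach described in the PR body; the exact matching rules (the Kind string, and whether the controller flag or APIVersion is also compared) are assumptions, not the merged implementation:

```go
// HasOwnerReference reports whether the build pod carries an OwnerReference
// pointing back at this build. The Kind literal and the fields compared here
// are assumptions for the sketch.
func HasOwnerReference(pod *v1.Pod, build *buildapi.Build) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "Build" && ref.Name == build.Name && ref.UID == build.UID {
			return true
		}
	}
	return false
}
```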


@bparees bparees self-assigned this Feb 23, 2018
@jim-minter (Contributor, Author):

Hmm, I'm not sure I'm feeling courageous enough to remove the "question 3" stanza. I fear too much happens under handleNewBuild, e.g. side effects in the different run policies, non-local state in run policies, and image and secret resolution. Maybe that's why this hack exists.

The risk is that if a pod is kicked off but the subsequent build update fails, we later get back into handleNewBuild and make different decisions from those made previously: the pod is already running, but we update the build, or other objects, in a divergent way.

I think we're one state short, and/or we have got other architectural shortcomings. I'm not convinced it's worth opening this all up.

I think the maximum this PR should do is change the timestamp logic, fix questions 1 & 2, and add documentation around question 3 (assuming others are in agreement with my viewpoint).

Thoughts?

@gabemontero (Contributor):

+1 @jim-minter with some sort of //TODO for your number 3 ... I'm glad I don't have to come up with the wording for that one

@csrwng (Contributor) commented Feb 23, 2018:

Sgtm

@bparees (Contributor) commented Feb 24, 2018:

@jim-minter be aware that we didn't always set the ownerrefs. Have you thought through how this impacts someone doing an upgrade who has a build/build-pod pair with no owner relationship?

It seems unlikely such a cluster would have a build in the "new" state, but it's not impossible. You might consider making this check more thorough to detect/handle "older" build pods that have a build annotation, but no ownerref.

@jim-minter (Contributor, Author):

FWIW, I think we've set OwnerRefs since 3091d6a, which was in 3.6.

@jim-minter jim-minter changed the title [WIP] use owner references instead of timestamps to determine build/build pod ownership use owner references instead of timestamps to determine build/build pod ownership Feb 26, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2018
@bparees (Contributor) commented Feb 26, 2018:

/lgtm
/hold
(remove hold when extended builds pass)

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 26, 2018
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2018
@jim-minter (Contributor, Author):

repushed, updated tests only

@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2018
@bparees (Contributor) commented Feb 26, 2018:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2018
@openshift-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bparees (Contributor) commented Feb 26, 2018:

the extended build failure was

2018-02-26T19:50:48.083795313Z pulling image error : failed to register layer: devmapper: Error activating devmapper device for 'c4aca1f2288eb8a556d03823c3d8b9d99855e9eec7a9b9448623efadcae03de3': devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed
2018-02-26T19:50:48.139010328Z error: build error: unable to get openshift/nodejs-010-centos7@sha256:bd971b467b08b8dbbbfee26bad80dcaa0110b184e0a8dd6c1b0460a6d6f5d332

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 26, 2018
@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees bparees added this to the 3.10.0 milestone Feb 27, 2018
@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees (Contributor) commented Feb 27, 2018:

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2018
@bparees (Contributor) commented Feb 27, 2018:

3.9 has been branched
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2018
@bparees (Contributor) commented Feb 28, 2018:

/hold
master is locked

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2018
@bparees (Contributor) commented Mar 1, 2018:

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2018
@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot (Contributor):

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot commented Mar 10, 2018:

@jim-minter: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| ci/openshift-jenkins/unit | 71c1059 | /test unit |
| ci/openshift-jenkins/extended_networking_minimal | 71c1059 | /test extended_networking_minimal |

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot (Contributor):

Automatic merge from submit-queue (batch tested with PRs 18735, 18746).

@openshift-merge-robot openshift-merge-robot merged commit b21077f into openshift:master Mar 10, 2018