
use owner references instead of timestamps to determine build/build pod ownership #18735

Merged: 1 commit merged into openshift:master on Mar 10, 2018

Conversation

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2018
@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 23, 2018
@jim-minter (Contributor, Author):

@csrwng if you have a moment to comment on this PR it'd be appreciated. @bparees and @gabemontero too. I think we might have discussed this in the past, but I don't recall what the outcome was. Perhaps back then it was "don't touch it unless something comes up in the future?"

In https://bugzilla.redhat.com/show_bug.cgi?id=1547551 a problem was encountered where the build controller rejected its own build pod because of timestamp differences, ultimately caused by faulty time synchronisation between the masters in the cluster in question (ntp not switched on).

I'm wondering if we can tighten up the code in this area.

So far this PR is a naive change to use the pre-existing OwnerReference (including UID) instead of timestamps as the factor that links build pods and builds.

However, I'm a bit unsure about a couple of wider aspects of the code; I've added three questions as code comments. Feedback would be appreciated.

I'm wondering whether this falls under "we added extra code for an unlikely corner case, but it doesn't actually quite work correctly and there are no tests" ?

update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
return update, nil
}
// QUESTION 1: isn't it a mistake to be returning nil, nil here in
// the case that err != nil? Shouldn't we return nil, err in that
// case?
Contributor:

I would say that yes it's a mistake to return nil, nil in case err != nil ... we want to retry handling this build.

Contributor:

+1
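For reference, here is a minimal sketch of the fix the thread converges on, shown as a fragment in the style of the diff excerpts above. The Get call, podName, kerrors (k8s.io/apimachinery/pkg/api/errors), metav1, and the strategy/transitionToPhase helpers are assumptions about the surrounding controller code, not lines from the merged PR:

```go
// Sketch only, not the merged code: propagate unexpected errors so the
// workqueue's rate limiter requeues the build instead of dropping it.
pod, err := bc.podClient.Pods(build.Namespace).Get(podName, metav1.GetOptions{})
switch {
case kerrors.IsNotFound(err):
	// No pod yet: fall through to the normal pod-creation path.
case err != nil:
	// Previously this path effectively returned nil, nil; returning the
	// error means the build is retried, as agreed above.
	return nil, err
case !strategy.HasOwnerReference(pod, build):
	// A pod with the expected name that this build does not own.
	update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
	return update, nil
}
```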

// again fails to Update() the Build correctly. Don't we then lose
// information like the push secret, etc. that we try to stash
// below? Isn't that wrong? Should we / can we effectively jump to
// line 913 in this case (refactoring instead of goto)?
Contributor:

I would say yes

Contributor:

+1

@@ -425,12 +425,14 @@ func (bc *BuildController) cancelBuild(build *buildapi.Build) (*buildUpdate, err
func (bc *BuildController) handleNewBuild(build *buildapi.Build, pod *v1.Pod) (*buildUpdate, error) {
// If a pod was found, and it was created after the build was created, it
// means that the build is active and its status should be updated
// QUESTION 3: if we fix question 2 below, can we remove this entire stanza,
// and wouldn't that be the right thing to do?
Contributor:

So if you already know that a pod exists, why not handle it here instead of checking policy, and then trying to create it?

Contributor (Author):

Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next. The implementation related to handling and advancing the state should be in one place for legibility and maintainability. And precisely one thing only should determine whether the work in the given state is complete.

Contributor:

Because you know that you didn't successfully complete the work in the current state, therefore you shouldn't advance to the next.

But the work is to move to the next state :)

The implementation related to handling and advancing the state should be in one place for legibility and maintainability

This I think I agree with. It does allow simpler code.

Contributor:

trying to remember the rules for the workqueue.RateLimitingInterface ... can we assume we will not get duplicate events? I thought the answer was we could not assume that. If so, we would want this stanza to handle it.

Even if we can assume no duplicate events, this still seems like safe code / protection for something unexpected, including a potential bug in the lister/queue/cache stuff that could lead to duplicate events.
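
To make the stanza under discussion concrete, here is a rough sketch of how the pod-already-exists check at the top of handleNewBuild could look once it keys off the owner reference instead of creation timestamps. handleActiveBuild and the other helper names are assumptions about neighbouring controller code, not the exact diff:

```go
// Sketch only: inside handleNewBuild, before the run-policy check and pod creation.
if pod != nil {
	// The lister already returned a pod for this build. If the build owns it,
	// the build is really active and only its status needs syncing; this also
	// guards against duplicate or stale events from the workqueue.
	if strategy.HasOwnerReference(pod, build) {
		return bc.handleActiveBuild(build, pod)
	}
	// A pod with the expected name that this build does not own: error out
	// rather than adopt it.
	update := transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodExists, buildapi.StatusMessageBuildPodExists)
	return update, nil
}
```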

@jim-minter (Contributor, Author):

Thanks for looking this over @csrwng!


// HasOwnerReference returns true if the build pod has an OwnerReference to the
// build.
func HasOwnerReference(pod *v1.Pod, build *buildapi.Build) bool {
Contributor:

How about renaming the method to HaveACommonOwnerReference

Contributor:

it's not so much a common owner ref. it's one owning the other. I think the name is fine but if we want something else, "Owns()" would be the name i would suggest. Or "BuildOwnsPod" is perhaps clearest about the relationship being tested.
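
For illustration, a minimal sketch of what a helper with this signature could look like, matching the owner-reference approach described in the PR body; the exact matching rules (the Kind string, and whether the controller flag or APIVersion is also compared) are assumptions, not the merged implementation:

```go
// HasOwnerReference reports whether the build pod carries an OwnerReference
// pointing back at this build. The Kind literal and the fields compared here
// are assumptions for the sketch.
func HasOwnerReference(pod *v1.Pod, build *buildapi.Build) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "Build" && ref.Name == build.Name && ref.UID == build.UID {
			return true
		}
	}
	return false
}
```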


@bparees bparees self-assigned this Feb 23, 2018
@jim-minter (Contributor, Author):

Hmm, I'm not sure I'm feeling courageous enough to remove the "question 3" stanza. I fear too much happens under handleNewBuild, e.g. side effects in the different run policies, non-local state in run policies, and image and secret resolution. Maybe that's why this hack exists.

The risk is that if a pod is kicked off but the subsequent build update fails, we later get back into handleNewBuild and make different decisions from those made previously: the pod is already running, but we update the build, or other objects, in a divergent way.

I think we're one state short, and/or we have got other architectural shortcomings. I'm not convinced it's worth opening this all up.

I think the maximum this PR should do is change the timestamp logic, fix questions 1 & 2, and add documentation around question 3 (assuming others are in agreement with my viewpoint).

Thoughts?

@gabemontero (Contributor):

+1 @jim-minter with some sort of //TODO for your number 3 ... I'm glad I don't have to come up with the wording for that one

@csrwng (Contributor) commented Feb 23, 2018:

Sgtm

@bparees (Contributor) commented Feb 24, 2018:

@jim-minter be aware that we didn't always set the ownerrefs. Have you thought through how this impacts someone doing an upgrade who has a build/build-pod pair with no owner relationship?

It seems unlikely such a cluster would have a build in the "new" state, but it's not impossible. You might consider making this check more thorough to detect/handle "older" build pods that have a build annotation, but no ownerref.

@jim-minter (Contributor, Author):

FWIW, I think we've set OwnerRefs since 3091d6a, which was in 3.6.

@jim-minter jim-minter changed the title [WIP] use owner references instead of timestamps to determine build/build pod ownership use owner references instead of timestamps to determine build/build pod ownership Feb 26, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2018
@bparees (Contributor) commented Feb 26, 2018:

/lgtm
/hold
(remove hold when extended builds pass)

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 26, 2018
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2018
@jim-minter (Contributor, Author):

repushed, updated tests only

@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2018
@bparees (Contributor) commented Feb 26, 2018:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2018
@openshift-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bparees (Contributor) commented Feb 26, 2018:

the extended build failure was

2018-02-26T19:50:48.083795313Z pulling image error : failed to register layer: devmapper: Error activating devmapper device for 'c4aca1f2288eb8a556d03823c3d8b9d99855e9eec7a9b9448623efadcae03de3': devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed
2018-02-26T19:50:48.139010328Z error: build error: unable to get openshift/nodejs-010-centos7@sha256:bd971b467b08b8dbbbfee26bad80dcaa0110b184e0a8dd6c1b0460a6d6f5d332

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 26, 2018
@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees bparees added this to the 3.10.0 milestone Feb 27, 2018
@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees (Contributor) commented Feb 27, 2018:

/retest

@bparees (Contributor) commented Feb 27, 2018:

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2018
@bparees (Contributor) commented Feb 27, 2018:

3.9 has been branched
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2018
@bparees (Contributor) commented Feb 28, 2018:

/hold
master is locked

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2018
@bparees (Contributor) commented Mar 1, 2018:

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2018
@openshift-bot (Contributor):

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot (Contributor):

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot commented Mar 10, 2018:

@jim-minter: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| ci/openshift-jenkins/unit | 71c1059 | /test unit |
| ci/openshift-jenkins/extended_networking_minimal | 71c1059 | /test extended_networking_minimal |

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot (Contributor):

Automatic merge from submit-queue (batch tested with PRs 18735, 18746).

@openshift-merge-robot openshift-merge-robot merged commit b21077f into openshift:master Mar 10, 2018