Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #378

Conversation

openshift-cherrypick-robot

This is an automated cherry-pick of #372

/assign wking

wking added 2 commits June 3, 2020 13:15
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

[1]: https://golang.org/pkg/context/#Context
… as success

For [1].  Before this commit, you could have a flow like:

1. SyncWorker.Start()
2. External code calls Update(), e.g. after noticing a ClusterVersion
   spec.desiredUpdate change.
3. Update sets w.work to point at the desired payload.
4. Start's Until loop is triggered via w.notify.
5. Start calls calculateNext, which notices the change and sets the
   state to UpdatingPayload.
6. Start calculates a syncTimeout and calls syncOnce.
7. syncOnce notices the new payload and loads it.
8. For whatever reason, payload retrieval takes a while.  Blackholed
   signature-fetch attempts in a restricted network [2] are one
   example of something that could be slow here.  Eventually the
   syncTimeout kicks in and signature fetching or other verification
   is canceled (which counts as failed verification).
9. Force is true, so syncOnce logs "Forcing past precondition
   failures..." but carries on to call apply.
10. apply computes the manifest graph, runs the ClusterOperator
    precreation (whose handlers return ContextError() right after
    spawning, because the context is already expired), and runs the
    main RunGraph (whose handlers do the same).
11. The main RunGraph returns at a slice with a handful of context
    errors and nothing else.  apply passes this on to
    consistentReporter.Errors.
12. consistentReporter.Errors calls summarizeTaskGraphErrors, which
    logs "All errors were context errors..." and returns nil to avoid
    alarming consistentReporter.Errors (we don't want to put this in
    our ClusterVersion status and alarm users).
13. apply returns the summarized nil to syncOnce.
14. syncOnce returns the summarized nil to Start.
15. Start logs "Sync succeeded..." and flops into ReconcilingPayload
    for the next round.
16. Start comes into the next round in reconciling mode despite never
    having attempted to apply any manifests in its Updating mode.  The
    manifest graph gets flattened and shuffled and all kinds of
    terrible things could happen like the machine-config trying to
    roll out the newer machine-os-content and its 4.4 hyperkube binary
    before rolling out prerequisites like the 4.4 kube-apiserver
    operator.

With this commit, the process is the same through 12, but ends with:

13. apply returns the first context error to syncOnce.
14. syncOnce returns that error to Start.
15. Start backs off and comes in again with a second attempt at
    UpdatingPayload.
16. Manifests get pushed in the intended order, and nothing explodes.

The race fixed in this commit could also have come up without timing
out the payload pull/verification, e.g. by having a perfectly slow
ClusterOperator preconditions.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343
@openshift-ci-robot
Copy link
Contributor

@openshift-cherrypick-robot: An error was encountered creating a cherry-pick bug in Bugzilla: could not get list of clones for bug 1838497 on the Bugzilla server at https://bugzilla.redhat.com:

Failed to get dependent bug #1723620: code 102: You are not authorized to access bug #1723620. Most likely the bug has been restricted for internal development processes and we cannot grant access. If you are a Red Hat customer with an active subscription, please visit the Red Hat Customer Portal for assistance with your issue If you are a Fedora Project user and require assistance, please consider using one of the mailing lists we host for the Fedora Project.
Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

[release-4.5] Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking changed the title [release-4.5] Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Jun 3, 2020
@wking wking changed the title Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Jun 3, 2020
@openshift-ci-robot
Copy link
Contributor

@openshift-cherrypick-robot: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking changed the title Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success Jun 3, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jun 3, 2020
@openshift-ci-robot
Copy link
Contributor

@openshift-cherrypick-robot: This pull request references Bugzilla bug 1843526, which is invalid:

  • expected dependent Bugzilla bug 1838497 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member

wking commented Jun 3, 2020

/cherrypick release-4.4

@openshift-cherrypick-robot
Copy link
Author

@wking: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Copy link
Member

wking commented Jun 3, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-cherrypick-robot, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2020
@wking
Copy link
Member

wking commented Jun 3, 2020

@sdodson , can you drop group-lead approval and twiddle Bugzilla? Once I get on Slack, I'll round with the test folks about 4.5 only needing MODIFIED 4.6 bugs until ART fixes the issues with building 4.6 nightlies.

@sdodson sdodson added staff-eng-approved Indicates a release branch PR has been approved by a staff engineer (formerly group/pillar lead). bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jun 3, 2020
@sdodson
Copy link
Member

sdodson commented Jun 3, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 3, 2020
@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Bugzilla bug 1843526, which is invalid:

  • expected dependent Bugzilla bug 1838497 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson sdodson added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jun 3, 2020
@sdodson
Copy link
Member

sdodson commented Jun 3, 2020

openshift/release#9456 to provide temporary relaxation of BZ status requirements of 4.6 bugs.

@sdodson
Copy link
Member

sdodson commented Jun 3, 2020

/bugzilla refresh

@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Bugzilla bug 1843526, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1838497 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Bugzilla bug 1838497 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Copy link
Member

wking commented Jun 3, 2020

e2e-aws:

failed to acquire lease: resources not found

But I want this landed, so try to elbow our way back into the queue instead of waiting on the retest bot:

/test e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit 99f6ba5 into openshift:release-4.5 Jun 3, 2020
@openshift-ci-robot
Copy link
Contributor

@openshift-cherrypick-robot: All pull requests linked via external trackers have merged: openshift/cluster-version-operator#378. Bugzilla bug 1843526 has been moved to the MODIFIED state.

In response to this:

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Author

@wking: #378 failed to apply on top of branch "release-4.4":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/cvo/sync_worker.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/sync_worker.go
CONFLICT (content): Merge conflict in pkg/cvo/sync_worker.go
Patch failed at 0001 pkg/cvo/sync_worker: Generalize CancelError to ContextError

In response to this:

/cherrypick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 3, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jun 4, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jun 11, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. staff-eng-approved Indicates a release branch PR has been approved by a staff engineer (formerly group/pillar lead).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants