
Bug 1916384: pkg/cvo: Use shutdownContext for final status synchronization #501

Conversation

@wking (Member) commented Jan 20, 2021

Address a bug introduced by cc1921d (#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync:

E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).
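As a rough illustration of that grace-period relationship, here is a minimal sketch (not the actual pkg/start code; the names follow the description above, and the wiring is an assumption):

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // Hypothetical wiring: runContext governs normal operation, while
    // shutdownContext survives it by a two-minute grace period so that
    // shutdown work, like the final status sync, can still make API calls.
    runContext, cancelRun := context.WithCancel(context.Background())
    shutdownContext, cancelShutdown := context.WithCancel(context.Background())
    go func() {
        <-runContext.Done()         // run phase over, e.g. SIGTERM received
        time.Sleep(2 * time.Minute) // grace period for final cleanup
        cancelShutdown()            // now tear everything down
    }()

    cancelRun() // simulate a graceful-shutdown request
    if shutdownContext.Err() == nil {
        fmt.Println("shutdownContext still live; the final sync can proceed")
    }
}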

@openshift-ci-robot added the bugzilla/severity-high and bugzilla/valid-bug labels on Jan 20, 2021
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1916384, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1916384: pkg/cvo: Use shutdownContext for final status synchronization

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the approved label on Jan 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
@wking force-pushed the shutdownContext-for-final-status-sync branch from 947d4eb to a0ce6f7 on January 20, 2021 05:08
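To illustrate the shape of the change described in the commit message above, a hedged sketch (the real Operator.Run in pkg/cvo has a different signature and does more work; Operator, sync, and queueKey here are simplified stand-ins):

package main

import (
    "context"
    "log"
    "time"
)

// Operator is a simplified stand-in for the real pkg/cvo Operator.
type Operator struct{}

func (optr *Operator) queueKey() string { return "clusterversion/version" }

func (optr *Operator) sync(ctx context.Context, key string) error {
    return ctx.Err() // the real sync pushes ClusterVersion status here
}

// Run sketches the fix: ordinary work is bound by runContext, while the
// final status sync uses shutdownContext, which outlives runContext by a
// two-minute grace period.
func (optr *Operator) Run(runContext, shutdownContext context.Context) {
    <-runContext.Done() // reconcile loop elided; runs until shutdown begins

    if err := optr.sync(shutdownContext, optr.queueKey()); err != nil {
        log.Printf("unable to perform final sync: %v", err)
    }
}

func main() {
    runContext, cancelRun := context.WithCancel(context.Background())
    shutdownContext, cancelShutdown := context.WithCancel(context.Background())
    defer cancelShutdown()

    time.AfterFunc(10*time.Millisecond, cancelRun) // simulate a shutdown signal
    (&Operator{}).Run(runContext, shutdownContext)
}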
@wking (Member, Author) commented Jan 20, 2021

The upgrade job hit the GCP race Error setting IAM policy for project ... Service account ... does not exist.

/test e2e-agnostic-upgrade

@wking (Member, Author) commented Jan 20, 2021

e2e had a 172.30.0.10:53: i/o timeout flake in a skip check, and unfortunately the "failed/skipped" combination is considered fatal when it should not be.

/test e2e-agnostic

@wking (Member, Author) commented Jan 20, 2021

e2e had another flake, this time matching "user doesnt received UID for the testing priority-level and flow-schema" [sic].

/test e2e-agnostic

@wking (Member, Author) commented Jan 20, 2021

From the update job, we only have two CVO logs in Loki:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-loki/loki-container-logs/loki-container-logs.tar | tar -tvz | grep 'cluster-version.*.log'
-rw-r--r-- 1001840000/root 1441652 2021-01-19 23:44 loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-p7x8b/cluster-version-operator_0.log
-rw-r--r-- 1001840000/root   27333 2021-01-19 23:44 loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-sjdbq/cluster-version-operator_0.log

The unable to perform final sync entry is gone:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-loki/loki-container-logs/loki-container-logs.tar | tar -xOz loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-sjdbq/cluster-version-operator_0.log | grep 'unable to perform final sync\|Graceful shutdown complete'
I0120 07:31:04.302829       1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v1.0.0-434-g2b453544-dirty.

But 2b45354 is master, not my commit :p. And, in the A->B->A rollback testing, the B->A rollback is not verified:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)' 
2021-01-20T06:41:15Z 2021-01-20T07:38:11Z Completed 4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88 false
2021-01-20T06:07:10Z 2021-01-20T06:41:15Z Partial 4.7.0-0.ci.test-2021-01-20-053428-ci-op-sib1tp88 false
2021-01-20T05:40:29Z 2021-01-20T06:04:52Z Completed 4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88 false

Checking for events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-extra/events.json | jq -r '.items[] | select(.reason // "" | contains("VerifyPayload")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
2021-01-20T05:40:29Z 1 VerifyPayload: verifying payload version="4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T05:53:04Z 1 VerifyPayload: verifying payload version="4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:07:16Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:2dd90e014ac200b6cb7a50c859fcdb7ebf7906f1814a351187c6cb573103430e"
2021-01-20T06:07:22Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:2dd90e014ac200b6cb7a50c859fcdb7ebf7906f1814a351187c6cb573103430e"
2021-01-20T06:42:04Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:42:16Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:46:27Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T07:31:17Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"

We really need to avoid those events for the "but we don't actually verify, because we're loading from the local image" case... [edit: maybe we are re-verifying the sig each time, in which case it's probably fine to event each time]. But no failed verification events, so I'm not sure why we had verified: false for B->A.

@wking (Member, Author) commented Jan 20, 2021

I got better logs via a cluster-bot install (launch openshift/cluster-version-operator#501). Then I triggered an update to fc.3:

$ while sleep 1; do POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"; done &
$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537

Waited for that to stick. And then:

$ fg
while sleep 1; do
    POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log";
done
^C
$ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .version + " " + (.verified | tostring)'
2021-01-20T09:54:50Z - 4.7.0-fc.3 true
2021-01-20T09:19:15Z 2021-01-20T09:45:02Z 4.7.0-0.ci.test-2021-01-20-091428-ci-ln-7bw1wmb false

So hooray, actually verified: true. No unable to perform final sync in the outgoing CVO's logs. I'm a bit concerned that I still see:

$ grep status.go 2021-01-20T09:53:42+00:00-cluster-version-operator-675b5f96cf-blgl7.log | tail -n4
I0120 09:55:00.440068       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"ApplyResources", Failure:error(nil), Fraction:0.012084592, Completed:0, Reconciling:false, Initial:false, VersionHash:"yOjXo9Ndk0U=", LastProgress:time.Time{wall:0xbffa1d5d19c3d943, ext:1338782608313, loc:(*time.Location)(0x276a900)}, Actual:v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)}, Verified:true}
I0120 09:55:00.440133       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63746733290, loc:(*time.Location)(0x276a900)}}, CompletionTime:(*v1.Time)(nil), Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", Verified:false}
I0120 09:55:00.442528       1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"ApplyResources", Failure:error(nil), Fraction:0.012084592, Completed:0, Reconciling:false, Initial:false, VersionHash:"yOjXo9Ndk0U=", LastProgress:time.Time{wall:0xbffa1d5d19c3d943, ext:1338782608313, loc:(*time.Location)(0x276a900)}, Actual:v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)}, Verified:true}
I0120 09:55:00.442572       1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63746733290, loc:(*time.Location)(0x276a900)}}, CompletionTime:(*v1.Time)(nil), Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", Verified:false}

I'd expect that Verified:true Synchronizing ... line to get the ClusterVersion updated, so the final merge into existing history ... line would have Verified:true under last=.... But all the merge into existing history ... lines have Verified:false. Maybe we need more debugging...

@wking (Member, Author) commented Jan 20, 2021

/hold

I just pushed some WIP debugging (--v=6 to get status.go's Apply config ... diff).

@openshift-ci-robot added the do-not-merge/hold label on Jan 20, 2021
-	optr.worker(ctx, optr.queue, func(ctx context.Context, key string) error { return optr.sync(ctx, key) })
-	if err := optr.sync(ctx, optr.queueKey()); err != nil {
+	optr.worker(runContext, optr.queue, func(runContext context.Context, key string) error { return optr.sync(runContext, key) })
+	if err := optr.sync(shutdownContext, optr.queueKey()); err != nil {
A reviewer (Member) commented on the lines above:

Can we add a unit test for this change?

@wking (Member, Author) replied:

> Can we add a unit test for this change?

I'm not sure... We can probably add an integration (e2e-agnostic-operator) test for it. Do we want to hold up this PR while I work that out, or can I defer it to follow-up work?

The reviewer (Member) replied:

It depends how much time we need to add the test. If it is just one day then we should wait for the test. The issue with merging it without tests is that I am not confident whether this fixes the issue, and having a test for it will make me more confident to merge it.
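For illustration, a minimal sketch of the kind of unit test being discussed, with hypothetical names (finalSync stands in for the real optr.sync, which does far more than check its context):

package cvo

import (
    "context"
    "testing"
)

// finalSync is a hypothetical stand-in for the operator's final status
// sync; it fails if its context is already done.
func finalSync(ctx context.Context) error {
    return ctx.Err()
}

func TestFinalSyncSurvivesRunContextCancel(t *testing.T) {
    runContext, cancelRun := context.WithCancel(context.Background())
    shutdownContext, cancelShutdown := context.WithCancel(context.Background())
    defer cancelShutdown()

    cancelRun() // graceful shutdown: the run phase is over

    // Before the fix, the final sync used runContext and failed here
    // with "context canceled".
    if err := finalSync(runContext); err == nil {
        t.Error("expected final sync on canceled runContext to fail")
    }
    // With the fix, it uses shutdownContext, which stays live through
    // the two-minute grace period.
    if err := finalSync(shutdownContext); err != nil {
        t.Errorf("final sync on shutdownContext failed: %v", err)
    }
}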


@wking force-pushed the shutdownContext-for-final-status-sync branch from aac0361 to a0ce6f7 on January 20, 2021 21:34
@wking (Member, Author) commented Jan 20, 2021

I've pushed aac0361 -> a0ce6f7, dropping some debug commits. Local testing updating out of aac0361 to fc.3 included the following outgoing CVO log entries:

  • 19:38:46.880513 "merge into existing history" last Version:"" Image:...sha256:ce05f... Verified:false
  • 19:38:46.882537 "remembering last update" same^
  • 19:38:50.341422 "sync got ClusterVersion" same^
  • 19:38:50.341734 "sync into" same^
  • 19:38:50.341824 "merge into existing history" bringing in a Version, last grows Version:"4.7.0-fc.3" but still Verified:false
  • 19:38:50.372943 "pushed to cluster" Version:"4.7.0-fc.3" and Verified:true
  • 19:38:50.372992 "remembering last update" same^
  • 19:38:50.373064 "sync got ClusterVersion" same^
  • 19:38:50.373139 "sync into" same^
  • 19:38:50.373253 "merge into existing history" last is same^
  • 19:38:50.668535 "unable to synchronize image (waiting 2m50.956499648s): Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 662)"
  • ...
  • 19:38:50.677596 "pushed to cluster" 4.7.0-fc.3 Verified:true

Which sounds like what we want.

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label on Jan 20, 2021
@jottofar (Contributor) commented Feb 3, 2021

/test e2e-agnostic-operator

@jottofar (Contributor) commented Feb 3, 2021

LGTM

This makes sense to me and appears safe, since shutdownContext is already being used to reap other things during the two-minute shutdown grace period.

@LalatenduMohanty

@jottofar (Contributor) commented Feb 3, 2021

/lgtm

@openshift-ci-robot added the lgtm label on Feb 3, 2021
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking (Member, Author) commented Feb 4, 2021

e2e-agnostic had storage test-cases explode. Seems unlikely to be related to this change.

/test e2e-agnostic

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit ff0f705 into openshift:master on Feb 4, 2021
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1916384 has been moved to the MODIFIED state.

In response to this:

Bug 1916384: pkg/cvo: Use shutdownContext for final status synchronization

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the shutdownContext-for-final-status-sync branch on February 9, 2021 05:51
@sferich888 commented:
/cherry-pick release-4.6
/cherry-pick release-4.5

per- https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c11

@openshift-cherrypick-robot

@sferich888: new pull request created: #517

In response to this:

/cherry-pick release-4.6
/cherry-pick release-4.5

per- https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member, Author) commented Feb 17, 2021

Cherry-picking back straight from master/4.7 to 4.5 drags a bug and PR around while we wait for QE to verify 4.6. Easier to just create the cherry-picks one minor at a time as QE gets through verifying the master-ward bug.
