Bug 1916384: pkg/cvo: Use shutdownContext for final status synchronization #501
Conversation
@wking: This pull request references Bugzilla bug 1916384, which is valid. The bug has been moved to the POST state and has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Address a bug introduced by cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync [1]:

E0119 22:24:15.924216 1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
Force-push: 947d4eb to a0ce6f7
GCP race

/test e2e-agnostic-upgrade

e2e had an

/test e2e-agnostic

e2e had another flake, this time

/test e2e-agnostic
From the update job, we only have two CVO logs in Loki:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-loki/loki-container-logs/loki-container-logs.tar | tar -tvz | grep 'cluster-version.*.log'
-rw-r--r-- 1001840000/root 1441652 2021-01-19 23:44 loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-p7x8b/cluster-version-operator_0.log
-rw-r--r-- 1001840000/root   27333 2021-01-19 23:44 loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-sjdbq/cluster-version-operator_0.log

The

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-loki/loki-container-logs/loki-container-logs.tar | tar -xOz loki-container-logs/openshift-cluster-version/cluster-version-operator-85bfd58d8-sjdbq/cluster-version-operator_0.log | grep 'unable to perform final sync\|Graceful shutdown complete'
I0120 07:31:04.302829 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v1.0.0-434-g2b453544-dirty.

But 2b45354 is master, not my commit :p. And, in the A->B->A rollback testing, the B->A rollback is not verified:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)'
2021-01-20T06:41:15Z 2021-01-20T07:38:11Z Completed 4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88 false
2021-01-20T06:07:10Z 2021-01-20T06:41:15Z Partial 4.7.0-0.ci.test-2021-01-20-053428-ci-op-sib1tp88 false
2021-01-20T05:40:29Z 2021-01-20T06:04:52Z Completed 4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88 false

Checking for events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351764964469116928/artifacts/e2e-agnostic-upgrade/gather-extra/events.json | jq -r '.items[] | select(.reason // "" | contains("VerifyPayload")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
2021-01-20T05:40:29Z 1 VerifyPayload: verifying payload version="4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T05:53:04Z 1 VerifyPayload: verifying payload version="4.7.0-0.ci.test-2021-01-20-053427-ci-op-sib1tp88" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:07:16Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:2dd90e014ac200b6cb7a50c859fcdb7ebf7906f1814a351187c6cb573103430e"
2021-01-20T06:07:22Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:2dd90e014ac200b6cb7a50c859fcdb7ebf7906f1814a351187c6cb573103430e"
2021-01-20T06:42:04Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:42:16Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T06:46:27Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
2021-01-20T07:31:17Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-sib1tp88/release@sha256:1026873c2f2d348bd3d2e612d3f52faec42c5fc242dd2acdf56153a844a948ee"
I got better logs via a cluster-bot install:

$ while sleep 1; do POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log"; done &
$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537

Waited for that to stick. And then:

$ fg
while sleep 1; do
  POD="$(oc -n openshift-cluster-version get -o jsonpath='{.items[].metadata.name}{"\n"}' pods)" && oc -n openshift-cluster-version logs -f "${POD}" > "$(date --utc --iso=s)-${POD}.log";
done
^C
$ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .version + " " + (.verified | tostring)'
2021-01-20T09:54:50Z - 4.7.0-fc.3 true
2021-01-20T09:19:15Z 2021-01-20T09:45:02Z 4.7.0-0.ci.test-2021-01-20-091428-ci-ln-7bw1wmb false

So hooray, actually

$ grep status.go 2021-01-20T09:53:42+00:00-cluster-version-operator-675b5f96cf-blgl7.log | tail -n4
I0120 09:55:00.440068 1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"ApplyResources", Failure:error(nil), Fraction:0.012084592, Completed:0, Reconciling:false, Initial:false, VersionHash:"yOjXo9Ndk0U=", LastProgress:time.Time{wall:0xbffa1d5d19c3d943, ext:1338782608313, loc:(*time.Location)(0x276a900)}, Actual:v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)}, Verified:true}
I0120 09:55:00.440133 1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63746733290, loc:(*time.Location)(0x276a900)}}, CompletionTime:(*v1.Time)(nil), Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", Verified:false}
I0120 09:55:00.442528 1 status.go:159] Synchronizing errs=field.ErrorList{} status=&cvo.SyncWorkerStatus{Generation:2, Step:"ApplyResources", Failure:error(nil), Fraction:0.012084592, Completed:0, Reconciling:false, Initial:false, VersionHash:"yOjXo9Ndk0U=", LastProgress:time.Time{wall:0xbffa1d5d19c3d943, ext:1338782608313, loc:(*time.Location)(0x276a900)}, Actual:v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)}, Verified:true}
I0120 09:55:00.442572 1 status.go:79] merge into existing history completed=false desired=v1.Release{Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", URL:"", Channels:[]string(nil)} last=&v1.UpdateHistory{State:"Partial", StartedTime:v1.Time{Time:time.Time{wall:0x0, ext:63746733290, loc:(*time.Location)(0x276a900)}}, CompletionTime:(*v1.Time)(nil), Version:"4.7.0-fc.3", Image:"quay.io/openshift-release-dev/ocp-release@sha256:ce05fd235f0e59ea0114c73ef25abf838b9c3af103d9cb66aa7ba10eeacd8537", Verified:false}

I'd expect that
/hold

I just pushed some WIP debugging (
-	optr.worker(ctx, optr.queue, func(ctx context.Context, key string) error { return optr.sync(ctx, key) })
-	if err := optr.sync(ctx, optr.queueKey()); err != nil {
+	optr.worker(runContext, optr.queue, func(runContext context.Context, key string) error { return optr.sync(runContext, key) })
+	if err := optr.sync(shutdownContext, optr.queueKey()); err != nil {
Can we add a unit test for this change?
I'm not sure... We can probably add an integration (e2e-agnostic-operator) test for it. Do we want to hold up this PR while I work that out, or can I defer it to follow-up work?
It depends on how much time we need to add the test. If it is just one day, then we should wait for the test. The issue with merging without tests is that I am not confident whether this fixes the issue, and having a test for it would make me more confident to merge it.
Tried checking for the

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/501/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1351834608156545024/artifacts/e2e-agnostic-upgrade/gather-loki/loki-container-logs/loki-container-logs.tar | tar -tvz | grep 'cluster-version.*.log'

Attempting a cluster-bot run of the PR...
Force-push: aac0361 to a0ce6f7
I've pushed aac0361 -> a0ce6f7, dropping some debug commits. Local testing updating out of aac0361 to fc.3 included the following outgoing CVO log entries:

Which sounds like what we want.

/hold cancel
/test e2e-agnostic-operator
LGTM. This makes sense to me and appears safe, since shutdownContext is already being used to reap other things during the two-minute shutdown grace period.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
e2e-agnostic had storage test-cases explode. Seems unlikely to be related to this change.

/test e2e-agnostic
/retest

Please review the full test history for this PR and help us cut down flakes.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1916384 has been moved to the MODIFIED state.
/cherry-pick release-4.6

per https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c11
@sferich888: new pull request created: #517
Cherry-picking back straight from master/4.7 to 4.5 drags a bug and PR around while we wait for QE to verify 4.6. Easier to just create the cherry-picks one minor at a time as QE gets through verifying the master-ward bug.
Address a bug introduced by cc1921d (#424), where canceling the Operator.Run context would leave the operator with no time to attempt the final sync.

With this commit, I'm piping through shutdownContext, which gets a two-minute grace period beyond runContext, to give the operator time to push out that final status (which may include important information like the fact that the incoming release image has completed verification).