Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown #262

Conversation

@wking (Member) commented Oct 24, 2020

The alert was born with a 5m 'for' as CCOperatorDown in 63af2de (#132). But folks are unlikely to be churning their creds so quickly that the occasional longer operator outage is worth waking an admin with a midnight alarm. This is true in general, although folks have been revisiting this alert in the context of 4.5->4.6 updates, where a shift in leader leasing has led to a risk of an 8-minute delay as a 4.6 operator waits patiently before picking up a lease abandoned by a 4.5 operator [1]. The new 20m threshold allows for two such delays with room to spare, and also ensures we aren't waking folks up if there's a brief network or registry outage while pods are being rescheduled, or anything minor like that. Some things are worth more aggressive thresholds, but I don't think the cred operator is one of them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1889540#c4
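
For reference, a minimal sketch of the manifest shape this PR adjusts. The expr, labels, and annotations below are illustrative assumptions, not copied from this repo's PrometheusRule manifest; the only thing the change itself touches is the 'for:' value.

# Sketch only -- expr, severity, and message are assumed for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloud-credential-operator-alerts   # assumed name
  namespace: openshift-cloud-credential-operator
spec:
  groups:
  - name: CloudCredentialOperator
    rules:
    - alert: CloudCredentialOperatorDown
      expr: absent(up{job="cloud-credential-operator"} == 1)  # assumed expression
      for: 20m  # raised from 5m; covers two ~8m lease-handoff delays with room to spare
      labels:
        severity: critical  # assumed
      annotations:
        message: cloud-credential-operator pod not running  # assumed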
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 24, 2020
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1889540, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member Author) commented Oct 24, 2020

/bugzilla refresh

@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1889540, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 24, 2020
@dgoodwin (Contributor)

I am in agreement and I really wish this was in 4.6.0 now. :( Thanks Trevor!

/lgtm

@dgoodwin (Contributor)

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@wking (Member Author) commented Oct 26, 2020

/retest

@wking (Member Author) commented Oct 27, 2020

Hah, I seem to have broken the operator :p. Upgrade:

 error: some steps failed:
  * could not run steps: step e2e-upgrade failed: "e2e-upgrade" test steps failed: "e2e-upgrade" pod "e2e-upgrade-openshift-e2e-test" exceeded the configured timeout activeDeadlineSeconds=7200: the pod ci-op-h7f9j7x7/e2e-upgrade-openshift-e2e-test failed after 2h0m1s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-10-27T00:22:38Z  4.7.0-0.ci.test-2020-10-26-234654-ci-op-h7f9j7x7 Partial false
2020-10-26T23:54:37Z 2020-10-27T00:19:53Z 4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7 Completed false
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-10-26T23:54:37Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2020-10-27T00:19:53Z Available=True : Done applying 4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7
2020-10-27T00:22:38Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.7.0-0.ci.test-2020-10-26-234654-ci-op-h7f9j7x7: the cluster operator cloud-credential has not yet successfully rolled out
2020-10-27T00:57:11Z Failing=True ClusterOperatorNotAvailable: Cluster operator cloud-credential is still updating
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-10-26T23:55:39Z Available=True : 
2020-10-26T23:55:39Z Upgradeable=True : 
2020-10-26T23:55:42Z Degraded=False : 
2020-10-27T00:09:31Z Progressing=False : 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.versions'
[
  {
    "name": "operator",
    "version": "4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7"
  }
]

Ah, the operator pod is crash-looping:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/pods/openshift-cloud-credential-operator_cloud-credential-operator-649f878d55-6fp8r_cloud-credential-operator.log
Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@twiest twiest removed their request for review October 27, 2020 13:13
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@dgoodwin (Contributor) commented Nov 2, 2020

/hold

@akhil-rane is working on the problems introduced into the build cluster.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2020
@wking (Member Author) commented Nov 9, 2020

openshift/release#13491 should fix the failures.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking (Member Author) commented Nov 9, 2020

/hold

Turns out openshift/release#13491 only moved postsubmit builds, not presubmit builds.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@wking (Member Author) commented Nov 9, 2020

Trying again after openshift/release#13499:

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking (Member Author) commented Nov 9, 2020

e2e-aws:

[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]
...
[github.com/openshift/origin/test/extended/operators/operators.go:94]: Nov  9 19:17:03.289: Some cluster operators are not ready: marketplace (missing: Degraded)

And indeed:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/1325872495944798208/artifacts/e2e-aws/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "marketplace").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-11-09T19:02:47Z Progressing=False OperatorAvailable: Successfully progressed to release version: 4.7.0-0.ci.test-2020-11-09-184839-ci-op-zbzw3qs6
2020-11-09T19:02:47Z Available=True OperatorAvailable: Available release version: 4.7.0-0.ci.test-2020-11-09-184839-ci-op-zbzw3qs6

Dunno why it isn't setting Degraded=False. Seems uncommon.
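Tangentially, one way to see which operators omit a condition type is to list each operator with its sorted condition types, so the one missing Degraded stands out (an illustrative check against the same artifact, not output captured from this run):

# Print every operator with the condition types it reports.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/1325872495944798208/artifacts/e2e-aws/gather-extra/clusteroperators.json | jq -r '.items[] | .metadata.name + ": " + ([.status.conditions[].type] | sort | join(", "))'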

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@wking (Member Author) commented Nov 9, 2020

e2e-aws:

[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]
...
+ oc image info docker.io/library/mysql:latest
error: unable to read image docker.io/library/mysql:latest: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
...
Expected success, but got an error:
    <*errors.errorString | 0xc0027a7270>: {
        s: "pod \"append-test\" failed with reason: \"\", message: \"\"",
    }
    pod "append-test" failed with reason: "", message: ""

So that is Docker's new throttling, rhbz#1895107.

@wking (Member Author) commented Nov 10, 2020

e2e-aws:

[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]
...
fail [github.com/onsi/ginkgo@v4.5.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal.1645fb7153fdf173", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal.1645fb7153fdf173", UID:"0c417d8c-52b2-4375-a1c1-397561efc6ec", ResourceVersion:"23900", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0011042e0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001104340)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal started at 2020-11-09 23:40:18.897294264 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"ip-10-0-248-27"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}. Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.

We will eventually break through all the flakes ;).

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@wking (Member Author) commented Nov 10, 2020

/cherrypick release-4.6

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit ae4f77e into openshift:master Nov 10, 2020
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1889540 has been moved to the MODIFIED state.

In response to this:

Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #267

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
