Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown #262

Conversation

@wking (Member) commented Oct 24, 2020

The alert was born with a 5m 'for' as CCOperatorDown in 63af2de (#132). But folks are unlikely to be churning their creds so quickly that the occasional longer operator outage is worth waking an admin with a midnight alarm. This is true in general, although folks have been revisiting this alert in the context of 4.5->4.6 updates, where a shift in leader leasing has led to a risk of an 8-minute delay as a 4.6 operator waits patiently before picking up a lease abandoned by a 4.5 operator [1]. The new 20m threshold allows for two such delays with room to spare, and also ensures we aren't waking folks up if there's a brief network or registry outage while pods are being rescheduled, or anything minor like that. Some things are worth more aggressive thresholds, but I don't think the cred operator is one of them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1889540#c4
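
For reference, a minimal sketch of the manifest shape this PR adjusts. The expr, labels, and annotations below are illustrative assumptions, not copied from this repo's PrometheusRule manifest; the only thing the change itself touches is the 'for:' value.

# Sketch only -- expr, severity, and message are assumed for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloud-credential-operator-alerts   # assumed name
  namespace: openshift-cloud-credential-operator
spec:
  groups:
  - name: CloudCredentialOperator
    rules:
    - alert: CloudCredentialOperatorDown
      expr: absent(up{job="cloud-credential-operator"} == 1)  # assumed expression
      for: 20m  # raised from 5m; covers two ~8m lease-handoff delays with room to spare
      labels:
        severity: critical  # assumed
      annotations:
        message: cloud-credential-operator pod not running  # assumed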
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 24, 2020
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1889540, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member Author) commented Oct 24, 2020

/bugzilla refresh

@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1889540, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 24, 2020
@dgoodwin (Contributor)

I am in agreement and I really wish this was in 4.6.0 now. :( Thanks Trevor!

/lgtm

@dgoodwin (Contributor)

/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@wking (Member Author) commented Oct 26, 2020

/retest

@wking (Member Author) commented Oct 27, 2020

Hah, I seem to have broken the operator :p. Upgrade:

 error: some steps failed:
  * could not run steps: step e2e-upgrade failed: "e2e-upgrade" test steps failed: "e2e-upgrade" pod "e2e-upgrade-openshift-e2e-test" exceeded the configured timeout activeDeadlineSeconds=7200: the pod ci-op-h7f9j7x7/e2e-upgrade-openshift-e2e-test failed after 2h0m1s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-10-27T00:22:38Z  4.7.0-0.ci.test-2020-10-26-234654-ci-op-h7f9j7x7 Partial false
2020-10-26T23:54:37Z 2020-10-27T00:19:53Z 4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7 Completed false
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-10-26T23:54:37Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2020-10-27T00:19:53Z Available=True : Done applying 4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7
2020-10-27T00:22:38Z Progressing=True ClusterOperatorNotAvailable: Unable to apply 4.7.0-0.ci.test-2020-10-26-234654-ci-op-h7f9j7x7: the cluster operator cloud-credential has not yet successfully rolled out
2020-10-27T00:57:11Z Failing=True ClusterOperatorNotAvailable: Cluster operator cloud-credential is still updating
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-10-26T23:55:39Z Available=True : 
2020-10-26T23:55:39Z Upgradeable=True : 
2020-10-26T23:55:42Z Degraded=False : 
2020-10-27T00:09:31Z Progressing=False : 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "cloud-credential").status.versions'
[
  {
    "name": "operator",
    "version": "4.7.0-0.ci.test-2020-10-26-234143-ci-op-h7f9j7x7"
  }
]

Ah, the operator pod is crash-looping:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade/1320873201856679936/artifacts/e2e-upgrade/gather-extra/pods/openshift-cloud-credential-operator_cloud-credential-operator-649f878d55-6fp8r_cloud-credential-operator.log
Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@twiest twiest removed their request for review October 27, 2020 13:13
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@dgoodwin (Contributor) commented Nov 2, 2020

/hold

@akhil-rane is working on the problems introduced into the build cluster.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2020
@wking (Member Author) commented Nov 9, 2020

openshift/release#13491 should fix the failures.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking (Member Author) commented Nov 9, 2020

/hold

Turns out openshift/release#13491 only moved postsubmit builds, not presubmit builds.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@wking (Member Author) commented Nov 9, 2020

Trying again after openshift/release#13499:

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking (Member Author) commented Nov 9, 2020

e2e-aws:

[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]
...
[github.com/openshift/origin/test/extended/operators/operators.go:94]: Nov  9 19:17:03.289: Some cluster operators are not ready: marketplace (missing: Degraded)

And indeed:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/1325872495944798208/artifacts/e2e-aws/gather-extra/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "marketplace").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-11-09T19:02:47Z Progressing=False OperatorAvailable: Successfully progressed to release version: 4.7.0-0.ci.test-2020-11-09-184839-ci-op-zbzw3qs6
2020-11-09T19:02:47Z Available=True OperatorAvailable: Available release version: 4.7.0-0.ci.test-2020-11-09-184839-ci-op-zbzw3qs6

Dunno why it isn't setting Degraded=False. Seems uncommon.
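Tangentially, one way to see which operators omit a condition type is to list each operator with its sorted condition types, so the one missing Degraded stands out (an illustrative check against the same artifact, not output captured from this run):

# Print every operator with the condition types it reports.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cloud-credential-operator/262/pull-ci-openshift-cloud-credential-operator-master-e2e-aws/1325872495944798208/artifacts/e2e-aws/gather-extra/clusteroperators.json | jq -r '.items[] | .metadata.name + ": " + ([.status.conditions[].type] | sort | join(", "))'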

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@wking (Member Author) commented Nov 9, 2020

e2e-aws:

[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]
...
+ oc image info docker.io/library/mysql:latest
error: unable to read image docker.io/library/mysql:latest: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
...
Expected success, but got an error:
    <*errors.errorString | 0xc0027a7270>: {
        s: "pod \"append-test\" failed with reason: \"\", message: \"\"",
    }
    pod "append-test" failed with reason: "", message: ""

So that is Docker's new throttling, rhbz#1895107.

@wking (Member Author) commented Nov 10, 2020

e2e-aws:

[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]
...
fail [github.com/onsi/ginkgo@v4.5.0-origin.1+incompatible/internal/leafnodes/runner.go:64]: kube-apiserver reports a non-graceful termination: v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal.1645fb7153fdf173", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"/api/v1/namespaces/openshift-kube-apiserver/events/kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal.1645fb7153fdf173", UID:"0c417d8c-52b2-4375-a1c1-397561efc6ec", ResourceVersion:"23900", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"watch-termination", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0011042e0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001104340)}}}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}, Reason:"NonGracefulTermination", Message:"Previous pod kube-apiserver-ip-10-0-248-27.us-west-1.compute.internal started at 2020-11-09 23:40:18.897294264 +0000 UTC did not terminate gracefully", Source:v1.EventSource{Component:"apiserver", Host:"ip-10-0-248-27"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63740562106, loc:(*time.Location)(0x8fe06e0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}. Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.

We will eventually break through all the flakes ;).

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@wking (Member Author) commented Nov 10, 2020

/cherrypick release-4.6

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit ae4f77e into openshift:master Nov 10, 2020
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1889540 has been moved to the MODIFIED state.

In response to this:

Bug 1889540: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #267

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
