
Revert "Configure CoreDNS to shut down gracefully" #213

Merged
1 commit merged on Nov 18, 2020

Conversation

@cgwalters (Member)

This reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer, which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129

@kikisdeliveryservice

We really need this revert because our CI is currently blocked.

@Miciah (Contributor) commented Nov 13, 2020

> It didn't actually help, and causes system shutdown to take
> noticeably longer, which makes the MCO tests time out.

It seems like #205 is doing the right thing by making the readiness probe actually report readiness and adding a grace period before the pod is terminated. Is #205 technically wrong, or is it the case that #205 is technically correct, and the broken behavior that kubernetes/kubernetes#96129 fixes means that doing the correct thing in the operator makes the overall upgrade behavior worse?

> The real fix will involve backporting
> kubernetes/kubernetes#96129

If we revert #205 and subsequently backport kubernetes/kubernetes#96129, do we then need to restore #205 (i.e., revert the reversion)?

Note that reverting #205 might improve the situation for node reboots, but I think it will make the situation worse for rolling updates of the pod where node reboots are not involved. Maybe the tradeoff is appropriate though.
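
For reference, a minimal sketch of the shape of change being discussed, assuming the usual corev1 types; this is not the literal diff from #205, and the port, path, probe timings, and 120-second grace period are illustrative assumptions.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleCoreDNSPodSpec is an illustrative pod spec fragment, not the
// operator's actual manifest. All values are assumptions for discussion.
func exampleCoreDNSPodSpec() corev1.PodSpec {
	gracePeriod := int64(120) // #205 reportedly used roughly 2 minutes
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &gracePeriod,
		Containers: []corev1.Container{{
			Name: "dns",
			ReadinessProbe: &corev1.Probe{
				// In client-go of this era the handler field is the embedded
				// Handler struct (renamed ProbeHandler in newer releases).
				Handler: corev1.Handler{
					HTTPGet: &corev1.HTTPGetAction{
						Path: "/ready",            // CoreDNS ready endpoint (assumed)
						Port: intstr.FromInt(8181), // conventional ready port (assumed)
					},
				},
				PeriodSeconds:    10,
				FailureThreshold: 3,
			},
		}},
	}
}
```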

@Miciah (Contributor) commented Nov 13, 2020

> We really need this revert because our CI is currently blocked.

All right, if #205 broke CI, we need to revert it and figure out a way to re-introduce it (if appropriate) in a way that doesn't break CI.

/approve
/lgtm

@openshift-ci-robot added the lgtm and approved labels on Nov 13, 2020
@cgwalters (Member, Author)

To clarify, it's the MCO's CI that is hit hardest by this. The OpenShift e2e tests don't apply changes to nodes, and e2e-upgrade only does so once.

> It seems like #205 is doing the right thing by making the readiness probe actually report readiness and adding a grace period before the pod is terminated.

Having a readiness probe makes total sense, but... the problem is the grace period. If we're going to reboot the node, then what we want to do is the following (a sketch of the kubelet-side mechanism follows the list):

  • Drop it out of the list of endpoints
  • Reboot and kill the pod without any "grace"
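
Assuming kubernetes/kubernetes#96129 is the kubelet graceful node shutdown feature that landed around Kubernetes 1.20, the node-side mechanism would look roughly like this: the kubelet (via a systemd inhibitor lock) gets a bounded window to terminate pods before the node goes down, so DNS endpoints drop out without relying on a long pod-level grace period. Field names are from k8s.io/kubelet/config/v1beta1; the durations, and the assumption about what that PR contains, are illustrative rather than confirmed by this thread.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// exampleKubeletShutdownConfig sketches the graceful-node-shutdown knobs.
// With these set (and the GracefulNodeShutdown feature gate enabled), the
// kubelet delays node shutdown and terminates pods first, so endpoints are
// removed before the node actually goes down.
func exampleKubeletShutdownConfig() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		// Total window the kubelet holds shutdown to terminate pods (assumed value).
		ShutdownGracePeriod: metav1.Duration{Duration: 30 * time.Second},
		// Portion of that window reserved for critical pods such as DNS (assumed value).
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 10 * time.Second},
	}
}
```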

@Miciah (Contributor) commented Nov 13, 2020

> Having a readiness probe makes total sense, but... the problem is the grace period.

Would a shorter grace period (say, 20 seconds) be an acceptable compromise? I used 2 minutes in #205 to give kubelet's readiness probes time to fail several times and thus mark the pod as unready. Really though, the endpoints controller should immediately take note that the pod's deletion timestamp is set. So if the endpoints controller and kube-proxy are operating properly, they should stop using a given DNS pod's endpoint pretty quickly once deletion is requested (I'd guess under 10 seconds, possibly a bit longer on a stressed cluster), so the grace period can probably be quite short and still prevent any blips.
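
As a back-of-the-envelope check on that reasoning, here is a small sketch; the probe values are assumptions for illustration, not necessarily what #205 used.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative probe settings, not necessarily those from #205.
	probe := corev1.Probe{
		PeriodSeconds:    10, // kubelet probes every 10s
		FailureThreshold: 3,  // three consecutive failures mark the pod unready
	}
	// Worst case for the readiness-probe path alone: roughly
	// periodSeconds * failureThreshold before the pod is reported unready.
	worstCase := time.Duration(probe.PeriodSeconds*probe.FailureThreshold) * time.Second
	fmt.Println(worstCase) // 30s with these values

	// Endpoint removal on deletion does not wait for that: once the pod's
	// deletionTimestamp is set, the endpoints controller drops the endpoint,
	// so the grace period mainly needs to cover propagation to kube-proxy,
	// which argues for something closer to 10-20s than 2 minutes.
}
```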

@Miciah (Contributor) commented Nov 14, 2020

/test e2e-aws-operator

@Miciah (Contributor) commented Nov 16, 2020

/retest

1 similar comment
@cgwalters (Member, Author)

/retest

@rphillips (Contributor)

/retest

@kikisdeliveryservice

This seems like it's getting hit by https://bugzilla.redhat.com/show_bug.cgi?id=1897604.

See the incident-kcm-auth channel for current status.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters (Member, Author)

/retest

@cgwalters (Member, Author)

/test e2e-aws

1 similar comment
@cgwalters (Member, Author)

/test e2e-aws

@cgwalters (Member, Author)

Whee, timed out on lease.
/test e2e-aws

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters (Member, Author)

/retest

@rphillips (Contributor)

/test e2e-upgrade

1 similar comment
@sinnykumari

/test e2e-upgrade

@runcom (Member) commented Nov 18, 2020

Hmm, it looks like e2e-aws-upgrade has never passed on this PR, still with some weird timeout error:

error: some steps failed:
  * could not run steps: step e2e-upgrade failed: ["e2e-upgrade" test steps failed: "e2e-upgrade" pod "e2e-upgrade-openshift-e2e-test" exceeded the configured timeout activeDeadlineSeconds=7200: the pod ci-op-v1rqnq5z/e2e-upgrade-openshift-e2e-test failed after 2h0m0s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
Link to step on registry info site: https://steps.ci.openshift.org/reference/openshift-e2e-test
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=cluster-dns-operator&branch=master&test=e2e-upgrade, "e2e-upgrade" post steps failed: "e2e-upgrade" pod "e2e-upgrade-gather-loki" exceeded the configured timeout activeDeadlineSeconds=600: the pod ci-op-v1rqnq5z/e2e-upgrade-gather-loki failed after 10m0s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
Link to step on registry info site: https://steps.ci.openshift.org/reference/gather-loki
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=cluster-dns-operator&branch=master&test=e2e-upgrade]
time="2020-11-17T09:35:46Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'"

It seems the test passed, though.

@runcom (Member) commented Nov 18, 2020

/test e2e-upgrade

This mostly reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer, which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129

We do continue to carry the changes to update the daemonset
if the readiness probe changes, though, because we're reverting
that on upgrades in 4.7 now.
@openshift-ci-robot removed the lgtm label on Nov 18, 2020
@cgwalters (Member, Author)

Right, hmm, DNS is degraded... I think I see the problem. We still need to carry the diff logic to roll out a new daemonset if the readiness probe changes across upgrades, since that's what we're doing now.
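
A hypothetical sketch of the kind of carry logic described here: treat a changed readiness probe (or grace period) as a difference that forces a daemonset rollout, so clusters upgrading from a release that shipped #205 get moved to the reverted settings. The function name and structure are invented for illustration and may not match the operator's actual comparison code.

```go
package main

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetNeedsUpdate is a hypothetical comparison helper, not the
// operator's real code.
func daemonSetNeedsUpdate(current, desired *appsv1.DaemonSet) bool {
	cur := current.Spec.Template.Spec
	des := desired.Spec.Template.Spec
	if !reflect.DeepEqual(cur.TerminationGracePeriodSeconds, des.TerminationGracePeriodSeconds) {
		return true
	}
	if len(cur.Containers) != len(des.Containers) {
		return true
	}
	for i := range cur.Containers {
		// Rolling the daemonset when the readiness probe differs is what
		// reverts the #205 settings on upgraded clusters.
		if !reflect.DeepEqual(cur.Containers[i].ReadinessProbe, des.Containers[i].ReadinessProbe) {
			return true
		}
	}
	return false
}
```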

@cgwalters (Member, Author)

And yep, the upgrade test is green now. This was clearly a case of our CI tests doing their job correctly. Events like CI going red across the board are bad because they teach us not to look at failures in depth and to just hope things go through on retries.

@Miciah (Contributor) commented Nov 18, 2020

/lgtm

@openshift-ci-robot added the lgtm label on Nov 18, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuqi-zhang

/retest

The e2e-aws errors in that failed test seem to happen occasionally on PRs from this repo (see #212 and #210), so maybe it's a flake?

@cgwalters (Member, Author)

Hmm, in that e2e-aws run it looks like one node failed to fully join the cluster. We didn't get journal logs. Nothing obvious on the console (though, man, we really need to teach the MCD to dump status to the console).

Labels: approved, lgtm
10 participants