Revert "Configure CoreDNS to shut down gracefully" #213
Revert "Configure CoreDNS to shut down gracefully" #213
Conversation
We really need this revert because our CI is currently blocked.
It seems like #205 is doing the right thing by making the readiness probe actually report readiness and adding a grace period before the pod is terminated. Is #205 technically wrong, or is it the case that #205 is technically correct, and the broken behavior that kubernetes/kubernetes#96129 fixes means that doing the correct thing in the operator makes the overall upgrade behavior worse?
If we revert #205 and subsequently backport kubernetes/kubernetes#96129, do we then need to restore #205 (i.e., revert the reversion)? Note that reverting #205 might improve the situation for node reboots, but I think it will make the situation worse for rolling updates of the pod where node reboots are not involved. Maybe the tradeoff is appropriate though.
All right, if #205 broke CI, we need to revert it and figure out a way to re-introduce it (if appropriate) in a way that doesn't break CI. /approve
To clarify, it's the MCO's CI which is hit the most by this. The openshift e2e tests don't apply changes to nodes, and the e2e-upgrade only does it once.
Having a readiness probe makes total sense, but the problem is the grace period. If we're going to reboot the node, then what we want to do is:
Would a shorter grace period (say 20 seconds) be an acceptable compromise? I used 2 minutes in #205 to give kubelet's readiness probes time to fail several times and thus mark the pod as unready. Really though, the endpoints controller should immediately take note that the pod's deletion timestamp is set. So if the endpoints controller and kube-proxy are operating properly, they should stop using a given DNS pod's endpoint pretty quickly (I'd guess <10s, possibly a bit longer on a stressed cluster?) once deletion is requested, so the grace period probably can be pretty short to prevent any blips.
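For concreteness, here is a minimal sketch (not the operator's actual code; the package name, image reference, probe period, and 20-second value are illustrative assumptions) of what a shorter grace period alongside the readiness probe could look like when building the CoreDNS pod spec. Note that in newer k8s.io/api versions the probe's embedded field is named ProbeHandler rather than Handler.

```go
package dns

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// desiredCoreDNSPodSpec sketches a pod spec with a short termination grace
// period and a readiness probe against CoreDNS's ready plugin (port 8181).
func desiredCoreDNSPodSpec() corev1.PodSpec {
	// Hypothetical 20-second grace period: long enough for the endpoints
	// controller and kube-proxy to stop routing to the pod after deletion
	// is requested, short enough not to stall node reboots.
	grace := int64(20)
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name:  "dns",
			Image: "coredns-image", // placeholder
			ReadinessProbe: &corev1.Probe{
				Handler: corev1.Handler{
					HTTPGet: &corev1.HTTPGetAction{
						Path: "/ready",
						Port: intstr.FromInt(8181),
					},
				},
				PeriodSeconds: 3,
			},
		}},
	}
}
```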
/test e2e-aws-operator |
/retest |
/retest |
/retest |
This seems like it's getting hit by https://bugzilla.redhat.com/show_bug.cgi?id=1897604; see the incident-kcm-auth channel for current status.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest |
/test e2e-aws |
/test e2e-aws |
Whee, timed out on lease.
/retest Please review the full test history for this PR and help us cut down flakes.
/retest |
/test e2e-upgrade |
/test e2e-upgrade |
Uhm, looks like e2e-aws-upgrade never passed on this PR, with some weird timeout error still:
It seems the test passed though.
/test e2e-upgrade |
This mostly reverts commit f094ddf. It didn't actually help, and it causes system shutdown to take noticeably longer, which makes the MCO tests time out. The real fix will involve backporting kubernetes/kubernetes#96129. We do continue to carry the changes to update the daemonset if the readiness probe changes, though, because we're reverting that on upgrades in 4.7 now.
Force-pushed from a51c562 to a96c45e
Right, hmm, DNS is degraded... I think I may see the problem. We still need to carry the diff logic to roll out a new daemonset if the readiness probe changes across upgrades, since that's what we're doing now.
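As a rough illustration of the diff logic being described (the shape and the readinessProbeChanged helper are assumptions, not the operator's actual implementation), the operator needs to compare the deployed daemonset's readiness probes against the desired ones so that reverting the probe actually triggers a rollout on upgrade:

```go
package dns

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// readinessProbeChanged reports whether any container's readiness probe
// differs between the currently deployed and the desired daemonset, which
// would mean the daemonset needs to be updated on upgrade.
func readinessProbeChanged(current, desired *appsv1.DaemonSet) bool {
	cur := current.Spec.Template.Spec.Containers
	des := desired.Spec.Template.Spec.Containers
	if len(cur) != len(des) {
		return true
	}
	for i := range des {
		if !reflect.DeepEqual(cur[i].ReadinessProbe, des[i].ReadinessProbe) {
			return true
		}
	}
	return false
}
```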
And yep, the upgrade test is green now. This case was clearly our CI tests doing their job correctly. Events like CI going red across the board are bad because they teach us not to look in depth at failures and to just hope things go through on retries.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, Miciah. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest The e2e-aws errors in that failed test seem to happen occasionally on PRs from this repo (see #212 and #210), so maybe it's a flake?
Hmm, in that e2e-aws run it looks like one node failed to fully join the cluster. We didn't get journal logs. Nothing obvious on the console (though man, we really need to teach the MCD to dump status to the console).
This reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer which makes the MCO tests time out.
The real fix will involve backporting
kubernetes/kubernetes#96129