
Revert "Configure CoreDNS to shut down gracefully" #213

Merged
1 commit merged on Nov 18, 2020

Conversation

@cgwalters (Member)

This reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer, which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129

@kikisdeliveryservice

We really need this revert because our CI is currently blocked.

@Miciah (Contributor) commented Nov 13, 2020

> It didn't actually help, and causes system shutdown to take
> noticeably longer, which makes the MCO tests time out.

It seems like #205 is doing the right thing by making the readiness probe actually report readiness and adding a grace period before the pod is terminated. Is #205 technically wrong, or is it the case that #205 is technically correct, and the broken behavior that kubernetes/kubernetes#96129 fixes means that doing the correct thing in the operator makes the overall upgrade behavior worse?

> The real fix will involve backporting
> kubernetes/kubernetes#96129

If we revert #205 and subsequently backport kubernetes/kubernetes#96129, do we then need to restore #205 (i.e., revert the reversion)?

Note that reverting #205 might improve the situation for node reboots, but I think it will make the situation worse for rolling updates of the pod where node reboots are not involved. Maybe the tradeoff is appropriate though.
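
For reference, a minimal sketch of the shape of change being discussed, assuming the usual corev1 types; this is not the literal diff from #205, and the port, path, probe timings, and 120-second grace period are illustrative assumptions.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// exampleCoreDNSPodSpec is an illustrative pod spec fragment, not the
// operator's actual manifest. All values are assumptions for discussion.
func exampleCoreDNSPodSpec() corev1.PodSpec {
	gracePeriod := int64(120) // #205 reportedly used roughly 2 minutes
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &gracePeriod,
		Containers: []corev1.Container{{
			Name: "dns",
			ReadinessProbe: &corev1.Probe{
				// In client-go of this era the handler field is the embedded
				// Handler struct (renamed ProbeHandler in newer releases).
				Handler: corev1.Handler{
					HTTPGet: &corev1.HTTPGetAction{
						Path: "/ready",            // CoreDNS ready endpoint (assumed)
						Port: intstr.FromInt(8181), // conventional ready port (assumed)
					},
				},
				PeriodSeconds:    10,
				FailureThreshold: 3,
			},
		}},
	}
}
```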

@Miciah (Contributor) commented Nov 13, 2020

> We really need this revert because our CI is currently blocked.

All right, if #205 broke CI, we need to revert it and figure out a way to re-introduce it (if appropriate) in a way that doesn't break CI.

/approve
/lgtm

@openshift-ci-robot added the lgtm and approved labels on Nov 13, 2020
@cgwalters (Member, Author)

To clarify, it's the MCO's CI that is hit hardest by this. The OpenShift e2e tests don't apply changes to nodes, and e2e-upgrade only does so once.

> It seems like #205 is doing the right thing by making the readiness probe actually report readiness and adding a grace period before the pod is terminated.

Having a readiness probe makes total sense, but... the problem is the grace period. If we're going to reboot the node, then what we want to do is the following (a sketch of the kubelet-side mechanism follows the list):

  • Drop it out of the list of endpoints
  • Reboot and kill the pod without any "grace"
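
Assuming kubernetes/kubernetes#96129 is the kubelet graceful node shutdown feature that landed around Kubernetes 1.20, the node-side mechanism would look roughly like this: the kubelet (via a systemd inhibitor lock) gets a bounded window to terminate pods before the node goes down, so DNS endpoints drop out without relying on a long pod-level grace period. Field names are from k8s.io/kubelet/config/v1beta1; the durations, and the assumption about what that PR contains, are illustrative rather than confirmed by this thread.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// exampleKubeletShutdownConfig sketches the graceful-node-shutdown knobs.
// With these set (and the GracefulNodeShutdown feature gate enabled), the
// kubelet delays node shutdown and terminates pods first, so endpoints are
// removed before the node actually goes down.
func exampleKubeletShutdownConfig() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		// Total window the kubelet holds shutdown to terminate pods (assumed value).
		ShutdownGracePeriod: metav1.Duration{Duration: 30 * time.Second},
		// Portion of that window reserved for critical pods such as DNS (assumed value).
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 10 * time.Second},
	}
}
```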

@Miciah (Contributor) commented Nov 13, 2020

> Having a readiness probe makes total sense, but... the problem is the grace period.

Would a shorter grace period (say, 20 seconds) be an acceptable compromise? I used 2 minutes in #205 to give kubelet's readiness probes time to fail several times and thus mark the pod as unready. Really though, the endpoints controller should immediately take note that the pod's deletion timestamp is set. So if the endpoints controller and kube-proxy are operating properly, they should stop using a given DNS pod's endpoint pretty quickly once deletion is requested (I'd guess under 10 seconds, possibly a bit longer on a stressed cluster), so the grace period can probably be quite short and still prevent any blips.
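
As a back-of-the-envelope check on that reasoning, here is a small sketch; the probe values are assumptions for illustration, not necessarily what #205 used.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative probe settings, not necessarily those from #205.
	probe := corev1.Probe{
		PeriodSeconds:    10, // kubelet probes every 10s
		FailureThreshold: 3,  // three consecutive failures mark the pod unready
	}
	// Worst case for the readiness-probe path alone: roughly
	// periodSeconds * failureThreshold before the pod is reported unready.
	worstCase := time.Duration(probe.PeriodSeconds*probe.FailureThreshold) * time.Second
	fmt.Println(worstCase) // 30s with these values

	// Endpoint removal on deletion does not wait for that: once the pod's
	// deletionTimestamp is set, the endpoints controller drops the endpoint,
	// so the grace period mainly needs to cover propagation to kube-proxy,
	// which argues for something closer to 10-20s than 2 minutes.
}
```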

@Miciah (Contributor) commented Nov 14, 2020

/test e2e-aws-operator

@Miciah (Contributor) commented Nov 16, 2020

/retest

1 similar comment
@cgwalters (Member, Author)

/retest

@rphillips (Contributor)

/retest

@kikisdeliveryservice

This seems like it's getting hit by https://bugzilla.redhat.com/show_bug.cgi?id=1897604.

See the incident-kcm-auth channel for current status.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters (Member, Author)

/retest

@cgwalters (Member, Author)

/test e2e-aws

1 similar comment
@cgwalters (Member, Author)

/test e2e-aws

@cgwalters (Member, Author)

Whee, timed out on lease.
/test e2e-aws

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters (Member, Author)

/retest

@rphillips (Contributor)

/test e2e-upgrade

1 similar comment
@sinnykumari

/test e2e-upgrade

@runcom (Member) commented Nov 18, 2020

Hmm, it looks like e2e-aws-upgrade has never passed on this PR, still with some weird timeout error:

error: some steps failed:
  * could not run steps: step e2e-upgrade failed: ["e2e-upgrade" test steps failed: "e2e-upgrade" pod "e2e-upgrade-openshift-e2e-test" exceeded the configured timeout activeDeadlineSeconds=7200: the pod ci-op-v1rqnq5z/e2e-upgrade-openshift-e2e-test failed after 2h0m0s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
Link to step on registry info site: https://steps.ci.openshift.org/reference/openshift-e2e-test
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=cluster-dns-operator&branch=master&test=e2e-upgrade, "e2e-upgrade" post steps failed: "e2e-upgrade" pod "e2e-upgrade-gather-loki" exceeded the configured timeout activeDeadlineSeconds=600: the pod ci-op-v1rqnq5z/e2e-upgrade-gather-loki failed after 10m0s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
Link to step on registry info site: https://steps.ci.openshift.org/reference/gather-loki
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift&repo=cluster-dns-operator&branch=master&test=e2e-upgrade]
time="2020-11-17T09:35:46Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'"

It seems the test passed, though.

@runcom (Member) commented Nov 18, 2020

/test e2e-upgrade

This mostly reverts commit f094ddf.
It didn't actually help, and causes system shutdown to take
noticeably longer, which makes the MCO tests time out.

The real fix will involve backporting
kubernetes/kubernetes#96129

We do continue to carry the changes to update the daemonset
if the readiness probe changes, though, because we're reverting
that on upgrades in 4.7 now.
@openshift-ci-robot removed the lgtm label on Nov 18, 2020
@cgwalters (Member, Author)

Right, hmm, DNS is degraded... I think I see the problem. We still need to carry the diff logic to roll out a new daemonset if the readiness probe changes across upgrades, since that's what we're doing now.
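
A hypothetical sketch of the kind of carry logic described here: treat a changed readiness probe (or grace period) as a difference that forces a daemonset rollout, so clusters upgrading from a release that shipped #205 get moved to the reverted settings. The function name and structure are invented for illustration and may not match the operator's actual comparison code.

```go
package main

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetNeedsUpdate is a hypothetical comparison helper, not the
// operator's real code.
func daemonSetNeedsUpdate(current, desired *appsv1.DaemonSet) bool {
	cur := current.Spec.Template.Spec
	des := desired.Spec.Template.Spec
	if !reflect.DeepEqual(cur.TerminationGracePeriodSeconds, des.TerminationGracePeriodSeconds) {
		return true
	}
	if len(cur.Containers) != len(des.Containers) {
		return true
	}
	for i := range cur.Containers {
		// Rolling the daemonset when the readiness probe differs is what
		// reverts the #205 settings on upgraded clusters.
		if !reflect.DeepEqual(cur.Containers[i].ReadinessProbe, des.Containers[i].ReadinessProbe) {
			return true
		}
	}
	return false
}
```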

@cgwalters (Member, Author)

And yep, the upgrade test is green now. This was clearly a case of our CI tests doing their job correctly. Events like CI going red across the board are bad because they teach us not to look at failures in depth and to just hope things go through on retries.

@Miciah (Contributor) commented Nov 18, 2020

/lgtm

@openshift-ci-robot added the lgtm label on Nov 18, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuqi-zhang

/retest

The e2e-aws errors in that failed test seem to happen occasionally on PRs from this repo (see #212 and #210), so maybe it's a flake?

@cgwalters (Member, Author)

Hmm, in that e2e-aws run it looks like one node failed to fully join the cluster. We didn't get journal logs. Nothing obvious on the console (though, man, we really need to teach the MCD to dump status to the console).

Labels: approved, lgtm
10 participants