Bug 1837103: Revert "remove dead host-etcd-2 service" #351
This reverts commit 7f142f6.

kubernetes/kubernetes#6877
kubernetes/kubernetes#7821
kubernetes/kubernetes#7821 (comment)

Deleting the service subjects the host-etcd-2 endpoint to deletion by the Kube endpoints controller at random times now or in the future. The service must remain to protect the existence of the endpoint given the current design of Kube.

By deleting the service, we risk bootstrap failure through information loss when the endpoint is deleted during bootstrapping, and churning unnecessary endpoint revisions by recreating an endpoint Kube wants to delete.
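To make the concern concrete, here is a minimal sketch of the orphan check as this description characterizes it: an endpoint with no service of the same namespace and name is a deletion candidate. This is a toy model, not the Kube endpoints controller's code. `ObjectRef`, `orphaned_endpoints`, and the `openshift-etcd` namespace are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    """Hypothetical stand-in for a namespaced Kube object (Service or Endpoints)."""
    namespace: str
    name: str


def orphaned_endpoints(services, endpoints):
    """Return the endpoints that have no service with the same namespace and name."""
    service_keys = {(s.namespace, s.name) for s in services}
    return [e for e in endpoints if (e.namespace, e.name) not in service_keys]


# Illustrative state only; the namespace is an assumption, not taken from this PR.
endpoints = [ObjectRef("openshift-etcd", "host-etcd-2")]
with_service = [ObjectRef("openshift-etcd", "host-etcd-2")]
without_service = []

# With the service present the endpoint is protected; without it, it is orphaned.
print(orphaned_endpoints(with_service, endpoints))
print(orphaned_endpoints(without_service, endpoints))
```

In this model, restoring the service is what takes host-etcd-2 off the orphan list, which is the argument for the revert.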
You can't upgrade a cluster that hasn't completed bootstrapping. Thus the information is not lost.
This may be true, but how long does it take to settle? If it settles eventually (and it should settle pretty quickly), then I think we can tolerate jitter. We just update the openshift-apiserver-operator to have no toleration for empty.
If Kube says "all endpoints must have a service and we'll try to delete any orphan endpoints", why would we rely on the coincidence that the endpoint controller hasn't yet "caught us"? Keeping the service is what Kube says we should do, and I'm not sure what problem the service's existence represents. The problems with deleting it seem clear. What is the purpose of deleting the service?
Avoiding management of a thing we don't want to manage. We're safe this release. Leave the service out and transition to a configmap if you like, but don't bring the service back. It brings in races and potential dual management of endpoints we don't need.
/retest
Agree this patch is desired, but right now it makes the problem far worse: We must revert this pending a better fix. The periodics and CI are unstable enough without this compounding the problems.
/retest
/test e2e-metal-ipi
/lgtm Holding to make sure we can't get the fix merged to prove clean signal in CI. Sorry to all affected, but if this isn't fixed we have clusters that will randomly fail install, among other issues.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hexfusion, ironcladlou
/skip
@ironcladlou: This pull request references Bugzilla bug 1837103, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/hold cancel
/retest Please review the full test history for this PR and help us cut down flakes.
@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#351. Bugzilla bug 1837103 has been moved to the MODIFIED state.
This reverts commit 7f142f6 (#346).
cc @hexfusion @retroflexer @deads2k @smarterclayton please fact check (see discussion in https://github.com/openshift/cluster-etcd-operator/pull/350/files#r425419897)