
Bug 1837103: Revert "remove dead host-etcd-2 service" #351

Merged

Conversation

@ironcladlou (Contributor)

This reverts commit 7f142f6 (#346).

kubernetes/kubernetes#6877
kubernetes/kubernetes#7821
kubernetes/kubernetes#7821 (comment)

Deleting the service subjects the host-etcd-2 endpoint to deletion by
the Kube endpoints controller at random times, now or in the future.

The service must remain in order to protect the existence of the
endpoint, given the current design of Kube.

By deleting the service, we risk bootstrap failure through information
loss when the endpoint is deleted during bootstrapping, and we risk
churning unnecessary endpoint revisions by recreating an endpoint Kube
wants to delete.
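The deletion risk described above can be shown with a minimal sketch. This is illustrative only, not actual Kubernetes controller code, and the data shapes are invented: the point is simply that the endpoints controller garbage-collects any Endpoints object whose namespace/name pair has no matching Service.

```python
# Illustrative sketch of the orphan-cleanup behavior described above.
# Not actual Kubernetes controller code; the dict shapes are hypothetical.

def orphaned_endpoints(services, endpoints):
    """Return the Endpoints objects the controller would delete:
    those with no Service of the same namespace/name."""
    service_keys = {(s["namespace"], s["name"]) for s in services}
    return [e for e in endpoints
            if (e["namespace"], e["name"]) not in service_keys]

services = [{"namespace": "openshift-etcd", "name": "etcd"}]
endpoints = [
    {"namespace": "openshift-etcd", "name": "etcd"},         # has a Service
    {"namespace": "openshift-etcd", "name": "host-etcd-2"},  # orphaned
]

for e in orphaned_endpoints(services, endpoints):
    # host-etcd-2 is subject to deletion at some unpredictable time
    print(e["name"])
```

Under this model, keeping the host-etcd-2 Service in place removes the endpoint from the orphan set entirely, which is the argument the commit message makes.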

cc @hexfusion @retroflexer @deads2k @smarterclayton please fact check (see discussion in https://github.com/openshift/cluster-etcd-operator/pull/350/files#r425419897)

@deads2k (Contributor) commented May 15, 2020

> we risk bootstrap failure through information loss

You can't upgrade a cluster that hasn't completed bootstrapping. Thus the information is not lost.

@deads2k (Contributor) commented May 15, 2020

> churning unnecessary endpoint revisions by recreating an endpoint
> Kube wants to delete.

This may be true, but how long does it take to settle? If it settles eventually (it should settle pretty quickly), then I think we can tolerate the jitter. We just update the openshift-apiserver-operator to have no toleration for empty.

@ironcladlou (Contributor, Author)

If Kube says "all endpoints must have a service and we'll try to delete any orphan endpoints", why would we rely on the coincidence that the endpoint controller hasn't yet "caught us"?

Keeping the service is what Kube says we should do, and I'm not sure what problem the service's existence represents.

The problems with deleting it seem clear. What is the purpose of deleting the service?

@deads2k (Contributor) commented May 15, 2020

> If Kube says "all endpoints must have a service and we'll try to delete any orphan endpoints", why would we rely on the coincidence that the endpoint controller hasn't yet "caught us"?
>
> Keeping the service is what Kube says we should do, and I'm not sure what problem the service's existence represents.
>
> The problems with deleting it seem clear. What is the purpose of deleting the service?

Avoiding management of a thing we don't want to manage. We're safe this release. Leave the service out and transition to a configmap if you like, but don't bring the service back. It brings in races and potential dual management of endpoints that we don't need.
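One way to read the "transition to a configmap" suggestion above is sketched below. This is a hedged illustration only: the resource name and data keys are invented for this example and are not the actual cluster-etcd-operator implementation. The idea is that a ConfigMap can carry the same bootstrap member information without any built-in controller garbage-collecting it, since only Endpoints objects are paired with Services.

```python
# Hypothetical sketch of the "transition to a configmap" idea above.
# The name "host-etcd-2" is reused for illustration; the "members" data
# key is invented. Nothing here is the operator's actual implementation.

def host_etcd_configmap(member_addresses):
    """Build a ConfigMap-shaped object holding bootstrap etcd addresses.

    Unlike an Endpoints object without a backing Service, a ConfigMap
    is not subject to endpoints-controller garbage collection."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"namespace": "openshift-etcd",
                     "name": "host-etcd-2"},  # hypothetical name
        "data": {"members": ",".join(member_addresses)},
    }

cm = host_etcd_configmap(["10.0.0.5:2379"])
print(cm["data"]["members"])
```

The trade-off debated in this thread is that the ConfigMap approach avoids dual management of an endpoint, while the revert keeps the Service so the existing endpoint remains protected under the current design.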

@hexfusion (Contributor)

/retest

2 similar comments
@ironcladlou (Contributor, Author)

/retest

@ironcladlou (Contributor, Author)

/retest

@ironcladlou (Contributor, Author)

I agree this patch is desired, but right now it makes the problem far worse:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-ocp-installer-e2e-aws-4.5

https://search.apps.build01.ci.devcluster.openshift.com/chart?search=BootstrapTeardown_Error&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

We must revert this pending a better fix. The periodics and CI are unstable enough without this compounding the problems.

@ironcladlou (Contributor, Author)

/retest

@hexfusion (Contributor)

/test e2e-metal-ipi

@hexfusion (Contributor)

/lgtm
/approve
/hold

Holding so this doesn't merge before we can prove clean signal in CI. Sorry to all affected, but if this isn't fixed we have clusters that will randomly fail install, among other issues.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 18, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, ironcladlou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2020
@hexfusion (Contributor)

/skip

@openshift-ci-robot

@ironcladlou: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | f168ec1 | link | /test e2e-metal-ipi |
| ci/prow/e2e-aws-disruptive | f168ec1 | link | /test e2e-aws-disruptive |
| ci/prow/e2e-azure | f168ec1 | link | /test e2e-azure |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ironcladlou ironcladlou changed the title Revert "remove dead host-etcd-2 service" Bug 1837103: Revert "remove dead host-etcd-2 service" May 18, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 18, 2020
@openshift-ci-robot

@ironcladlou: This pull request references Bugzilla bug 1837103, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1837103: Revert "remove dead host-etcd-2 service"


@ironcladlou (Contributor, Author)

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 18, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit ec550f4 into openshift:master May 18, 2020
@openshift-ci-robot

@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#351. Bugzilla bug 1837103 has been moved to the MODIFIED state.

In response to this:

Bug 1837103: Revert "remove dead host-etcd-2 service"

