
Bug 1837103: Revert "remove dead host-etcd-2 service" #351

Merged

Conversation

@ironcladlou (Contributor)

This reverts commit 7f142f6 (#346).

kubernetes/kubernetes#6877
kubernetes/kubernetes#7821
kubernetes/kubernetes#7821 (comment)

Deleting the service subjects the host-etcd-2 endpoint to deletion by
the Kube endpoints controller at random times, now or in the future.

The service must remain in order to protect the existence of the
endpoint, given the current design of Kube.

By deleting the service, we risk bootstrap failure through information
loss when the endpoint is deleted during bootstrapping, and we risk
churning unnecessary endpoint revisions by recreating an endpoint Kube
wants to delete.
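The deletion risk described above can be shown with a minimal sketch. This is illustrative only, not actual Kubernetes controller code, and the data shapes are invented: the point is simply that the endpoints controller garbage-collects any Endpoints object whose namespace/name pair has no matching Service.

```python
# Illustrative sketch of the orphan-cleanup behavior described above.
# Not actual Kubernetes controller code; the dict shapes are hypothetical.

def orphaned_endpoints(services, endpoints):
    """Return the Endpoints objects the controller would delete:
    those with no Service of the same namespace/name."""
    service_keys = {(s["namespace"], s["name"]) for s in services}
    return [e for e in endpoints
            if (e["namespace"], e["name"]) not in service_keys]

services = [{"namespace": "openshift-etcd", "name": "etcd"}]
endpoints = [
    {"namespace": "openshift-etcd", "name": "etcd"},         # has a Service
    {"namespace": "openshift-etcd", "name": "host-etcd-2"},  # orphaned
]

for e in orphaned_endpoints(services, endpoints):
    # host-etcd-2 is subject to deletion at some unpredictable time
    print(e["name"])
```

Under this model, keeping the host-etcd-2 Service in place removes the endpoint from the orphan set entirely, which is the argument the commit message makes.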

cc @hexfusion @retroflexer @deads2k @smarterclayton please fact check (see discussion in https://github.com/openshift/cluster-etcd-operator/pull/350/files#r425419897)

@deads2k (Contributor) commented May 15, 2020

> we risk bootstrap failure through information loss

You can't upgrade a cluster that hasn't completed bootstrapping. Thus the information is not lost.

@deads2k (Contributor) commented May 15, 2020

> churning unnecessary endpoint revisions by recreating an endpoint
> Kube wants to delete.

This may be true, but how long does it take to settle? If it settles eventually (it should settle pretty quickly), then I think we can tolerate the jitter. We just update the openshift-apiserver-operator to have no toleration for empty.

@ironcladlou (Contributor, Author)

If Kube says "all endpoints must have a service and we'll try to delete any orphan endpoints", why would we rely on the coincidence that the endpoint controller hasn't yet "caught us"?

Keeping the service is what Kube says we should do, and I'm not sure what problem the service's existence represents.

The problems with deleting it seem clear. What is the purpose of deleting the service?

@deads2k (Contributor) commented May 15, 2020

> If Kube says "all endpoints must have a service and we'll try to delete any orphan endpoints", why would we rely on the coincidence that the endpoint controller hasn't yet "caught us"?
>
> Keeping the service is what Kube says we should do, and I'm not sure what problem the service's existence represents.
>
> The problems with deleting it seem clear. What is the purpose of deleting the service?

Avoiding management of a thing we don't want to manage. We're safe this release. Leave the service out and transition to a configmap if you like, but don't bring the service back. It brings in races and potential dual management of endpoints that we don't need.
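One way to read the "transition to a configmap" suggestion above is sketched below. This is a hedged illustration only: the resource name and data keys are invented for this example and are not the actual cluster-etcd-operator implementation. The idea is that a ConfigMap can carry the same bootstrap member information without any built-in controller garbage-collecting it, since only Endpoints objects are paired with Services.

```python
# Hypothetical sketch of the "transition to a configmap" idea above.
# The name "host-etcd-2" is reused for illustration; the "members" data
# key is invented. Nothing here is the operator's actual implementation.

def host_etcd_configmap(member_addresses):
    """Build a ConfigMap-shaped object holding bootstrap etcd addresses.

    Unlike an Endpoints object without a backing Service, a ConfigMap
    is not subject to endpoints-controller garbage collection."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"namespace": "openshift-etcd",
                     "name": "host-etcd-2"},  # hypothetical name
        "data": {"members": ",".join(member_addresses)},
    }

cm = host_etcd_configmap(["10.0.0.5:2379"])
print(cm["data"]["members"])
```

The trade-off debated in this thread is that the ConfigMap approach avoids dual management of an endpoint, while the revert keeps the Service so the existing endpoint remains protected under the current design.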

@hexfusion (Contributor)

/retest

2 similar comments
@ironcladlou (Contributor, Author)

/retest

@ironcladlou (Contributor, Author)

/retest

@ironcladlou (Contributor, Author)

I agree this patch is desired, but right now it makes the problem far worse:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-ocp-installer-e2e-aws-4.5

https://search.apps.build01.ci.devcluster.openshift.com/chart?search=BootstrapTeardown_Error&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

We must revert this pending a better fix. The periodics and CI are unstable enough without this compounding the problems.

@ironcladlou (Contributor, Author)

/retest

@hexfusion (Contributor)

/test e2e-metal-ipi

@hexfusion (Contributor)

/lgtm
/approve
/hold

Holding so this doesn't merge before we can prove clean signal in CI. Sorry to all affected, but if this isn't fixed we have clusters that will randomly fail install, among other issues.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 18, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, ironcladlou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2020
@hexfusion (Contributor)

/skip

@openshift-ci-robot

@ironcladlou: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | f168ec1 | link | /test e2e-metal-ipi |
| ci/prow/e2e-aws-disruptive | f168ec1 | link | /test e2e-aws-disruptive |
| ci/prow/e2e-azure | f168ec1 | link | /test e2e-azure |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ironcladlou ironcladlou changed the title Revert "remove dead host-etcd-2 service" Bug 1837103: Revert "remove dead host-etcd-2 service" May 18, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 18, 2020
@openshift-ci-robot

@ironcladlou: This pull request references Bugzilla bug 1837103, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1837103: Revert "remove dead host-etcd-2 service"


@ironcladlou (Contributor, Author)

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 18, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit ec550f4 into openshift:master May 18, 2020
@openshift-ci-robot

@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#351. Bugzilla bug 1837103 has been moved to the MODIFIED state.

In response to this:

Bug 1837103: Revert "remove dead host-etcd-2 service"

