
Migrate etcd-quorum-guard from MCO. #394

Merged

Conversation

@retroflexer (Contributor)

Migrating the quorum guard from machine-config-operator.

Note that the TLS certs come from secrets and configmaps, instead of files from the host (MCO was using files on disk).
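
To picture the difference, here is a minimal, hypothetical sketch of the two mount styles; the volume name, secret name, and host path below are made up, and the real quorum-guard manifest in this repo is authoritative:

```go
package quorumguard

import corev1 "k8s.io/api/core/v1"

// quorumGuardCertVolume sketches the two mount styles. With the MCO, the
// quorum guard read certificates from files already on the host; after this
// migration the certificates are projected from a Secret (CA bundles would
// come from a ConfigMap the same way) in the operator's namespace.
// All names and paths here are illustrative.
func quorumGuardCertVolume(fromHost bool) corev1.Volume {
	if fromHost {
		// MCO-style: host files mounted via hostPath.
		return corev1.Volume{
			Name: "etcd-certs",
			VolumeSource: corev1.VolumeSource{
				HostPath: &corev1.HostPathVolumeSource{Path: "/etc/ssl/etcd"},
			},
		}
	}
	// New style: certificates come from a Secret in the cluster.
	return corev1.Volume{
		Name: "etcd-certs",
		VolumeSource: corev1.VolumeSource{
			Secret: &corev1.SecretVolumeSource{SecretName: "etcd-quorum-guard-certs"},
		},
	}
}
```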

@retroflexer retroflexer changed the title Migrating etcd-quorum-guard from MCO. Migrate etcd-quorum-guard from MCO. Jul 15, 2020
@ironcladlou (Contributor)

From an offline discussion worth bringing out here:

Instead of retaining CVO management of QG, this is an opportunity to bring the component under direct management of the operator. A static CVO-managed QG is simple but doesn't leave any room for us to enhance the component with dynamic behavior going forward. Is it low effort to bring QG under operator management?

@retroflexer (Contributor, Author) commented Jul 15, 2020

> From an offline discussion worth bringing out here:
>
> Instead of retaining CVO management of QG, this is an opportunity to bring the component under direct management of the operator. A static CVO-managed QG is simple but doesn't leave any room for us to enhance the component with dynamic behavior going forward. Is it low effort to bring QG under operator management?

As we discussed on slack, I too think it is a worthy idea to pursue.

But I would let this PR go through so that the migration from MCO happens without glitches, and then, once we have the quorum guard in the etcd-operator namespace, we can move it under the direct management of the operator.

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@ironcladlou (Contributor)

> But I would let this PR go through so that the migration from MCO happens without glitches, and then, once we have the quorum guard in the etcd-operator namespace, we can move it under the direct management of the operator.

As long as we're committed to doing the operator transition within the same release, I guess I don't see the harm

@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

> As long as we're committed to doing the operator transition within the same release, I guess I don't see the harm

My understanding is that moving a CVO-managed resource is easy but removing one is not. If we are really going to use the operator, why remove 2 CVO resources?

@hexfusion (Contributor)

/hold

I would like a detailed plan to address this.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2020
```yaml
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
spec:
  replicas: 3
```
Member:

Could the operator control the number of replicas? In OKD's fork we used openshift/machine-config-operator@ddc89ca in the MCO: when useUnsupportedUnsafeNonHANonProductionUnstableEtcd is enabled, etcd-quorum-guard is scaled back to 1 replica.
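
Speculating on what operator-controlled replicas could look like: the sketch below is hypothetical (it is not the MCO commit referenced above, and the helper name is made up); it assumes the flag would be read from the Etcd CR's unsupportedConfigOverrides as a top-level boolean key.

```go
package quorumguard

import (
	"encoding/json"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// desiredReplicas is a hypothetical sketch of how an operator could scale the
// quorum guard down to a single replica when
// useUnsupportedUnsafeNonHANonProductionUnstableEtcd is set in the Etcd CR's
// unsupportedConfigOverrides, and keep three replicas otherwise.
func desiredReplicas(spec *operatorv1.StaticPodOperatorSpec) int32 {
	const haReplicas, nonHAReplicas = int32(3), int32(1)

	if len(spec.UnsupportedConfigOverrides.Raw) == 0 {
		return haReplicas
	}
	overrides := map[string]interface{}{}
	if err := json.Unmarshal(spec.UnsupportedConfigOverrides.Raw, &overrides); err != nil {
		return haReplicas
	}
	// Assumption: the override is stored as a top-level boolean key.
	if enabled, ok := overrides["useUnsupportedUnsafeNonHANonProductionUnstableEtcd"].(bool); ok && enabled {
		return nonHAReplicas
	}
	return haReplicas
}
```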

Member:

Operator controlling this Deployment so it can do intelligent things makes sense to me. But can that happen after the pivot from machine-config operator to etcd operator? We certainly don't want the etcd operator trying to do intelligent Deployment management while the machine-config operator is still trying to hand this Deployment off to the cluster-version operator, because the CVO and etcd operator would be fighting each other.

@vrutkovs (Member) commented Jul 16, 2020:

There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.

Member:

> There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.

Oh. Why is the quorum guard not living in the same namespace as etcd? Regardless, I'd still like the cross-repo pivot to be as boring and conservative as possible. That way, if we flub the transition to intelligent etcd-operator management, we can easily revert the change without having to involve the MCO. I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive. But is there much upside to moving to etcd-operator management in a single PR?

Member:

> Why is the quorum guard not living in the same namespace as etcd?

That would make the MCO watch events in a different namespace (currently etcd-quorum-guard lives in the MCO namespace).

> I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive

Sure, I don't mind splitting replica control into a separate PR.

@retroflexer (Contributor, Author) commented Jul 16, 2020

@hexfusion said:

> My understanding is that moving a CVO-managed resource is easy but removing one is not. If we are really going to use the operator, why remove 2 CVO resources?

@wking Could you help me here? Isn't removing the PDB easy? I was hoping that just removing the PDB manifest file would be enough to get rid of it. What other implications and impacts do we need to consider?

@wking (Member) commented Jul 16, 2020

openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

@wking (Member) commented Jul 16, 2020

> I was hoping that just removing the PDB manifest file would be enough to get rid of it.

That's true for new clusters. But existing clusters will still have the resource in-cluster. If you're just transferring ownership, that's fine, because there's no need to actually remove the in-cluster resource. You'd want the hand-off to look something like:

  1. CVO manages the resource.
  2. New release, operator begins managing the resource, but with a fixed target to match the CVO. That way the operator and CVO are both controlling the resource, but they always agree on the intended state.
  3. Drop the manifest, so the CVO stops trying to manage the resource. The operator can now do whatever it likes for intelligent management, including removing the resource entirely.

The point of the second step is that during an update, the CVO will begin reconciling the new manifests. If you jumped straight from 1 to 3, there would be a window where the CVO had relinquished control but the etcd operator had not yet been bumped to code that picked up control. If the manifests for the etcd operator deployment and the managed resource are close together in the update graph, maybe that's fine. If they are far enough apart that a stuck update could leave the resource unmanaged for a significant length of time, you probably want 2's careful handoff.
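
To make step 2 a little more concrete, here is a hedged sketch (plain client-go rather than whatever helper the operator would really use; the function is hypothetical) of the operator applying a Deployment spec identical to the CVO-managed manifest, so both controllers reconcile toward the same state during the handoff window:

```go
package quorumguard

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncFixedQuorumGuard applies a Deployment whose spec matches the CVO
// manifest exactly (step 2 above). While both the CVO and the etcd operator
// manage the object, they always agree on the intended state; once the CVO
// manifest is dropped (step 3), the operator is free to diverge.
func syncFixedQuorumGuard(ctx context.Context, client kubernetes.Interface, desired *appsv1.Deployment) error {
	existing, err := client.AppsV1().Deployments(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = client.AppsV1().Deployments(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Overwrite only the spec; leave metadata the CVO also manages alone.
	updated := existing.DeepCopy()
	updated.Spec = desired.Spec
	_, err = client.AppsV1().Deployments(desired.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```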

@kikisdeliveryservice (Contributor)

> openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

Seconding that moving https://github.com/openshift/machine-config-operator/blob/master/test/e2e/etcdquorumguard_test.go is necessary, and I think it's necessary to have it run on this PR/change.

@retroflexer (Contributor, Author)

> openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

I will work on adding e2e tests on our side once we have moved to more intelligent control by the operator.
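
For reference, a hypothetical minimal version of such a check (not the test that was ultimately added; the namespace, Deployment name, and replica count are assumptions here) could look like:

```go
package e2e

import (
	"context"
	"os"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestEtcdQuorumGuardReady is a hypothetical smoke test: it only checks that
// the etcd-quorum-guard Deployment (namespace and name assumed here) reports
// three ready replicas within a timeout.
func TestEtcdQuorumGuardReady(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}

	err = wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		d, err := client.AppsV1().Deployments("openshift-etcd").Get(context.TODO(), "etcd-quorum-guard", metav1.GetOptions{})
		if err != nil {
			// Tolerate transient errors while the Deployment is being created.
			return false, nil
		}
		return d.Status.ReadyReplicas == 3, nil
	})
	if err != nil {
		t.Fatalf("etcd-quorum-guard never reported 3 ready replicas: %v", err)
	}
}
```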

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest


```go
type podinfo map[string]podstatus

// TestEtcdQuorumGuard tests the etcd Quorum Guard. It assumes there
```
Contributor:

HOORAY!

@retroflexer (Contributor, Author)

Update: I added an e2e test framework and a test for EtcdQuorumGuard.

I hope we can proceed with this PR to finish the cross-repo transfer.

@wking @kikisdeliveryservice

@retroflexer (Contributor, Author)

@kikisdeliveryservice I need your help again. The verify-deps job is failing. I ran `go mod tidy; go mod verify; go mod vendor` and it was clean.

```
go: extracting github.com/xlab/handysort v0.0.0-20150421192137-fb3537ed64a1
github.com/openshift/cluster-etcd-operator/test/e2e tested by
	github.com/openshift/cluster-etcd-operator/test/e2e.test imports
	github.com/openshift/cluster-etcd-operator/test/e2e/framework: no matching versions for query "latest"
```

I appreciate any help.

@retroflexer (Contributor, Author)

/retest

@kikisdeliveryservice (Contributor)

> @kikisdeliveryservice I need your help again. The verify-deps job is failing. I ran `go mod tidy; go mod verify; go mod vendor` and it was clean.
>
> ```
> go: extracting github.com/xlab/handysort v0.0.0-20150421192137-fb3537ed64a1
> github.com/openshift/cluster-etcd-operator/test/e2e tested by
> 	github.com/openshift/cluster-etcd-operator/test/e2e.test imports
> 	github.com/openshift/cluster-etcd-operator/test/e2e/framework: no matching versions for query "latest"
> ```
>
> I appreciate any help.

You import `github.com/openshift/cluster-etcd-operator/test/e2e/framework` in test/e2e/etcquorumguard_test.go but you don't actually have that dir and its contents anywhere :)
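
In other words, go mod can only resolve that import once the package exists on disk. A minimal, hypothetical test/e2e/framework (the one actually added in this PR may look quite different) would be enough to unblock verify-deps:

```go
// Package framework is a hypothetical minimal sketch of what
// test/e2e/framework needs to provide so the import in the quorum guard
// e2e test resolves; the framework added in this PR may differ.
package framework

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// NewClientSet builds a Kubernetes clientset from the given kubeconfig path,
// falling back to the standard KUBECONFIG/home-directory discovery when the
// path is empty.
func NewClientSet(kubeconfig string) (kubernetes.Interface, error) {
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	if kubeconfig != "" {
		rules.ExplicitPath = kubeconfig
	}
	config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, &clientcmd.ConfigOverrides{}).ClientConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config)
}
```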

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

/test e2e-gcp-upgrade

2 similar comments
@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 27, 2020
@hexfusion (Contributor)

@retroflexer has manually tested upgrades with success.
/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2020
@retroflexer (Contributor, Author)

Verified that the PR works well with y-stream upgrades using cluster-bot on AWS and GCP (tested with MCO's #1928 along with this PR).

aws:
job test upgrade 4.5 4.6,openshift/machine-config-operator#1928,#394 aws succeeded
(https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1286728540858880000)

gcp:
job test upgrade 4.5 4.6,openshift/machine-config-operator#1928,#394 gcp succeeded
(https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286844014271664128)

@wking (Member) commented Jul 27, 2020

disruptive:

```
Jul 25 07:41:36.324: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc0023a0e60>: {
        ErrStatus: {
            TypeMeta: {Kind: "Status", APIVersion: "v1"},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "rpc error: code = Unknown desc = context deadline exceeded",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    rpc error: code = Unknown desc = context deadline exceeded
occurred

failed: (38m17s) 2020-07-25T07:41:36 "[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Disabled:Broken] [Serial] [Suite:openshift]"
```

Any idea what's going on there?

@retroflexer (Contributor, Author) commented Jul 27, 2020

```
fail [github.com/openshift/origin@/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc0023a0e60>: {
```

@wking This quorum_restore test concerns restoring etcd from a backup and has nothing to do with the quorum guard per se. These tests haven't passed in a while, and I agree we need to fix them so they run reliably. But I believe none of the changes in this PR affect the results.
https://github.com/openshift/origin/blob/master/test/extended/dr/quorum_restore.go#L195

@hexfusion (Contributor)

/lgtm

Merging with the understanding that having two quorum guards for a short period of time will not cause failure.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, retroflexer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@retroflexer (Contributor, Author)

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot commented Jul 28, 2020

@retroflexer: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | a0c8359 | link | /test e2e-metal-ipi |
| ci/prow/e2e-aws-disruptive | a0c8359 | link | /test e2e-aws-disruptive |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
