
Migrate etcd-quorum-guard from MCO. #394

Merged

Conversation

@retroflexer (Contributor)

Migrating the quorum guard from machine-config-operator.

Note that the TLS certs come from secrets and configmaps, instead of files from the host (MCO was using files on disk).
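
To picture the difference, here is a minimal, hypothetical sketch of the two mount styles; the volume name, secret name, and host path below are made up, and the real quorum-guard manifest in this repo is authoritative:

```go
package quorumguard

import corev1 "k8s.io/api/core/v1"

// quorumGuardCertVolume sketches the two mount styles. With the MCO, the
// quorum guard read certificates from files already on the host; after this
// migration the certificates are projected from a Secret (CA bundles would
// come from a ConfigMap the same way) in the operator's namespace.
// All names and paths here are illustrative.
func quorumGuardCertVolume(fromHost bool) corev1.Volume {
	if fromHost {
		// MCO-style: host files mounted via hostPath.
		return corev1.Volume{
			Name: "etcd-certs",
			VolumeSource: corev1.VolumeSource{
				HostPath: &corev1.HostPathVolumeSource{Path: "/etc/ssl/etcd"},
			},
		}
	}
	// New style: certificates come from a Secret in the cluster.
	return corev1.Volume{
		Name: "etcd-certs",
		VolumeSource: corev1.VolumeSource{
			Secret: &corev1.SecretVolumeSource{SecretName: "etcd-quorum-guard-certs"},
		},
	}
}
```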

@retroflexer retroflexer changed the title Migrating etcd-quorum-guard from MCO. Migrate etcd-quorum-guard from MCO. Jul 15, 2020
@ironcladlou (Contributor)

From an offline discussion worth bringing out here:

Instead of retaining CVO management of QG, this is an opportunity to bring the component under direct management of the operator. A static CVO-managed QG is simple but doesn't leave any room for us to enhance the component with dynamic behavior going forward. Is it low effort to bring QG under operator management?

@retroflexer (Contributor, Author) commented Jul 15, 2020

> From an offline discussion worth bringing out here:
>
> Instead of retaining CVO management of QG, this is an opportunity to bring the component under direct management of the operator. A static CVO-managed QG is simple but doesn't leave any room for us to enhance the component with dynamic behavior going forward. Is it low effort to bring QG under operator management?

As we discussed on slack, I too think it is a worthy idea to pursue.

But I would let this PR go through so that the migration from MCO happens without glitches, and then, once we have the quorum guard in the etcd-operator namespace, we can move it under the direct management of the operator.

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@ironcladlou (Contributor)

> But I would let this PR go through so that the migration from MCO happens without glitches, and then, once we have the quorum guard in the etcd-operator namespace, we can move it under the direct management of the operator.

As long as we're committed to doing the operator transition within the same release, I guess I don't see the harm

@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

> As long as we're committed to doing the operator transition within the same release, I guess I don't see the harm

My understanding is that moving a CVO-managed resource is easy but removing one is not. If we are really going to use the operator, why remove 2 CVO resources?

@hexfusion (Contributor)

/hold

I would like a detailed plan to address this.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2020
```yaml
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
spec:
  replicas: 3
```
Member:

Could the operator control the number of replicas? In OKD's fork we used openshift/machine-config-operator@ddc89ca in the MCO: when useUnsupportedUnsafeNonHANonProductionUnstableEtcd is enabled, etcd-quorum-guard is scaled back to 1 replica.
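
Speculating on what operator-controlled replicas could look like: the sketch below is hypothetical (it is not the MCO commit referenced above, and the helper name is made up); it assumes the flag would be read from the Etcd CR's unsupportedConfigOverrides as a top-level boolean key.

```go
package quorumguard

import (
	"encoding/json"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// desiredReplicas is a hypothetical sketch of how an operator could scale the
// quorum guard down to a single replica when
// useUnsupportedUnsafeNonHANonProductionUnstableEtcd is set in the Etcd CR's
// unsupportedConfigOverrides, and keep three replicas otherwise.
func desiredReplicas(spec *operatorv1.StaticPodOperatorSpec) int32 {
	const haReplicas, nonHAReplicas = int32(3), int32(1)

	if len(spec.UnsupportedConfigOverrides.Raw) == 0 {
		return haReplicas
	}
	overrides := map[string]interface{}{}
	if err := json.Unmarshal(spec.UnsupportedConfigOverrides.Raw, &overrides); err != nil {
		return haReplicas
	}
	// Assumption: the override is stored as a top-level boolean key.
	if enabled, ok := overrides["useUnsupportedUnsafeNonHANonProductionUnstableEtcd"].(bool); ok && enabled {
		return nonHAReplicas
	}
	return haReplicas
}
```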

Member:

Operator controlling this Deployment so it can do intelligent things makes sense to me. But can that happen after the pivot from machine-config operator to etcd operator? We certainly don't want the etcd operator trying to do intelligent Deployment management while the machine-config operator is still trying to hand this Deployment off to the cluster-version operator, because the CVO and etcd operator would be fighting each other.

@vrutkovs (Member) commented Jul 16, 2020:

There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.

Member:

> There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.

Oh. Why is the quorum guard not living in the same namespace as etcd? Regardless, I'd still like the cross-repo pivot to be as boring and conservative as possible. That way, if we flub the transition to intelligent etcd-operator management, we can easily revert the change without having to involve the MCO. I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive. But is there much upside to moving to etcd-operator management in a single PR?

Member:

> Why is the quorum guard not living in the same namespace as etcd?

That would make the MCO watch events in a different namespace (currently etcd-quorum-guard lives in the MCO namespace).

> I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive

Sure, I don't mind splitting replica control into a separate PR.

@retroflexer (Contributor, Author) commented Jul 16, 2020

@hexfusion said:

> My understanding is that moving a CVO-managed resource is easy but removing one is not. If we are really going to use the operator, why remove 2 CVO resources?

@wking Could you help me here? Isn't removing the PDB easy? I was hoping that just removing the PDB manifest file would be enough to get rid of it. What other implications and impacts do we need to consider?

@wking (Member) commented Jul 16, 2020

openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

@wking (Member) commented Jul 16, 2020

> I was hoping that just removing the PDB manifest file would be enough to get rid of it.

That's true for new clusters. But existing clusters will still have the resource in-cluster. If you're just transferring ownership, that's fine, because there's no need to actually remove the in-cluster resource. You'd want the hand-off to look something like:

  1. CVO manages the resource.
  2. New release, operator begins managing the resource, but with a fixed target to match the CVO. That way the operator and CVO are both controlling the resource, but they always agree on the intended state.
  3. Drop the manifest, so the CVO stops trying to manage the resource. The operator can now do whatever it likes for intelligent management, including removing the resource entirely.

The point of the second step is that during an update, the CVO will begin reconciling the new manifests. If you jumped straight from 1 to 3, there would be a window where the CVO had relinquished control but the etcd operator had not yet been bumped to code that picked up control. If the manifests for the etcd operator deployment and the managed resource are close together in the update graph, maybe that's fine. If they are far enough apart that a stuck update could leave the resource unmanaged for a significant length of time, you probably want 2's careful handoff.
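
To make step 2 a little more concrete, here is a hedged sketch (plain client-go rather than whatever helper the operator would really use; the function is hypothetical) of the operator applying a Deployment spec identical to the CVO-managed manifest, so both controllers reconcile toward the same state during the handoff window:

```go
package quorumguard

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncFixedQuorumGuard applies a Deployment whose spec matches the CVO
// manifest exactly (step 2 above). While both the CVO and the etcd operator
// manage the object, they always agree on the intended state; once the CVO
// manifest is dropped (step 3), the operator is free to diverge.
func syncFixedQuorumGuard(ctx context.Context, client kubernetes.Interface, desired *appsv1.Deployment) error {
	existing, err := client.AppsV1().Deployments(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = client.AppsV1().Deployments(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Overwrite only the spec; leave metadata the CVO also manages alone.
	updated := existing.DeepCopy()
	updated.Spec = desired.Spec
	_, err = client.AppsV1().Deployments(desired.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```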

@kikisdeliveryservice (Contributor)

> openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

Seconding that moving https://github.com/openshift/machine-config-operator/blob/master/test/e2e/etcdquorumguard_test.go is necessary, and I think it's necessary to have it run on this PR/change.

@retroflexer (Contributor, Author)

> openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?

I will work on adding e2e tests on our side once we have moved to more intelligent control by the operator.
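
For reference, a hypothetical minimal version of such a check (not the test that was ultimately added; the namespace, Deployment name, and replica count are assumptions here) could look like:

```go
package e2e

import (
	"context"
	"os"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestEtcdQuorumGuardReady is a hypothetical smoke test: it only checks that
// the etcd-quorum-guard Deployment (namespace and name assumed here) reports
// three ready replicas within a timeout.
func TestEtcdQuorumGuardReady(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		t.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatal(err)
	}

	err = wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		d, err := client.AppsV1().Deployments("openshift-etcd").Get(context.TODO(), "etcd-quorum-guard", metav1.GetOptions{})
		if err != nil {
			// Tolerate transient errors while the Deployment is being created.
			return false, nil
		}
		return d.Status.ReadyReplicas == 3, nil
	})
	if err != nil {
		t.Fatalf("etcd-quorum-guard never reported 3 ready replicas: %v", err)
	}
}
```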

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest


```go
type podinfo map[string]podstatus

// TestEtcdQuorumGuard tests the etcd Quorum Guard. It assumes there
```
Contributor:

HOORAY!

@retroflexer (Contributor, Author)

Update: I added an e2e test framework and a test for EtcdQuorumGuard.

I hope we can proceed with this PR to finish the cross-repo transfer.

@wking @kikisdeliveryservice

@retroflexer (Contributor, Author)

@kikisdeliveryservice I need your help again. The verify-deps job is failing. I ran `go mod tidy; go mod verify; go mod vendor` and it was clean.

```
go: extracting github.com/xlab/handysort v0.0.0-20150421192137-fb3537ed64a1
github.com/openshift/cluster-etcd-operator/test/e2e tested by
	github.com/openshift/cluster-etcd-operator/test/e2e.test imports
	github.com/openshift/cluster-etcd-operator/test/e2e/framework: no matching versions for query "latest"
```

I appreciate any help.

@retroflexer (Contributor, Author)

/retest

@kikisdeliveryservice (Contributor)

> @kikisdeliveryservice I need your help again. The verify-deps job is failing. I ran `go mod tidy; go mod verify; go mod vendor` and it was clean.
>
> ```
> go: extracting github.com/xlab/handysort v0.0.0-20150421192137-fb3537ed64a1
> github.com/openshift/cluster-etcd-operator/test/e2e tested by
> 	github.com/openshift/cluster-etcd-operator/test/e2e.test imports
> 	github.com/openshift/cluster-etcd-operator/test/e2e/framework: no matching versions for query "latest"
> ```
>
> I appreciate any help.

You import `github.com/openshift/cluster-etcd-operator/test/e2e/framework` in test/e2e/etcquorumguard_test.go but you don't actually have that dir and its contents anywhere :)
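
In other words, go mod can only resolve that import once the package exists on disk. A minimal, hypothetical test/e2e/framework (the one actually added in this PR may look quite different) would be enough to unblock verify-deps:

```go
// Package framework is a hypothetical minimal sketch of what
// test/e2e/framework needs to provide so the import in the quorum guard
// e2e test resolves; the framework added in this PR may differ.
package framework

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// NewClientSet builds a Kubernetes clientset from the given kubeconfig path,
// falling back to the standard KUBECONFIG/home-directory discovery when the
// path is empty.
func NewClientSet(kubeconfig string) (kubernetes.Interface, error) {
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	if kubeconfig != "" {
		rules.ExplicitPath = kubeconfig
	}
	config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(rules, &clientcmd.ConfigOverrides{}).ClientConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config)
}
```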

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

/test e2e-gcp-upgrade

2 similar comments
@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@retroflexer (Contributor, Author)

/retest

1 similar comment
@retroflexer (Contributor, Author)

/retest

@hexfusion (Contributor)

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 27, 2020
@hexfusion (Contributor)

@retroflexer has manually tested upgrades with success.
/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2020
@retroflexer (Contributor, Author)

Verified that the PR works well with y-stream upgrades using cluster-bot on AWS and GCP (tested with MCO's #1928 along with this PR).

aws:
job test upgrade 4.5 4.6,openshift/machine-config-operator#1928,#394 aws succeeded
(https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1286728540858880000)

gcp:
job test upgrade 4.5 4.6,openshift/machine-config-operator#1928,#394 gcp succeeded
(https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286844014271664128)

@wking (Member) commented Jul 27, 2020

disruptive:

```
Jul 25 07:41:36.324: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin@/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc0023a0e60>: {
        ErrStatus: {
            TypeMeta: {Kind: "Status", APIVersion: "v1"},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "rpc error: code = Unknown desc = context deadline exceeded",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    rpc error: code = Unknown desc = context deadline exceeded
occurred

failed: (38m17s) 2020-07-25T07:41:36 "[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Disabled:Broken] [Serial] [Suite:openshift]"
```

Any idea what's going on there?

@retroflexer (Contributor, Author) commented Jul 27, 2020

```
fail [github.com/openshift/origin@/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc0023a0e60>: {
```

@wking This quorum_restore test concerns restoring etcd from a backup and has nothing to do with the quorum guard per se. These tests haven't passed in a while, and I agree we need to fix them so they run reliably. But I believe none of the changes in this PR affect the results.
https://github.com/openshift/origin/blob/master/test/extended/dr/quorum_restore.go#L195

@hexfusion (Contributor)

/lgtm

Merging with the understanding that having two quorum guards for a short period of time will not cause failure.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, retroflexer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@retroflexer (Contributor, Author)

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@retroflexer (Contributor, Author)

/test e2e-gcp-upgrade

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot commented Jul 28, 2020

@retroflexer: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | a0c8359 | link | /test e2e-metal-ipi |
| ci/prow/e2e-aws-disruptive | a0c8359 | link | /test e2e-aws-disruptive |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
