Migrate etcd-quorum-guard from MCO. #394
Conversation
From an offline discussion worth bringing out here: Instead of retaining CVO management of QG, this is an opportunity to bring the component under direct management of the operator. A static CVO-managed QG is simple but doesn't leave any room for us to enhance the component with dynamic behavior going forward. Is it low effort to bring QG under operator management?
As we discussed on Slack, I too think this is a worthy idea to pursue. But I would let this PR go through so that the migration from MCO happens without glitches; then, once we have the quorum guard in the etcd-operator namespace, we can move it under the direct management of the operator.
/retest
1 similar comment
/retest
As long as we're committed to doing the operator transition within the same release, I guess I don't see the harm.
/retest
My understanding is that moving a CVO-managed resource is easy, but removing one is not. If we are really going to use the operator, why remove two CVO resources?
/hold I would like a detailed plan to address this.
annotations:
  exclude.release.openshift.io/internal-openshift-hosted: "true"
spec:
  replicas: 3
Could the operator control the number of replicas? In OKD's fork we used openshift/machine-config-operator@ddc89ca in MCO: when useUnsupportedUnsafeNonHANonProductionUnstableEtcd is enabled, etcd-quorum-guard is scaled back to 1.
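For illustration, here is a minimal Go sketch (not this PR's or OKD's actual code) of how an operator might do that scaling; the namespace, deployment name, and flag plumbing are all assumptions:

```go
package quorumguard

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncQuorumGuardReplicas scales etcd-quorum-guard to 1 when the cluster has
// opted into the unsupported non-HA etcd mode, and back to 3 otherwise.
// Namespace and deployment name are assumed for this sketch.
func syncQuorumGuardReplicas(ctx context.Context, client kubernetes.Interface, nonHA bool) error {
	desired := int32(3)
	if nonHA {
		desired = 1
	}
	scale, err := client.AppsV1().Deployments("openshift-etcd-operator").
		GetScale(ctx, "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting scale: %w", err)
	}
	if scale.Spec.Replicas == desired {
		return nil // already at the desired count, nothing to do
	}
	scale.Spec.Replicas = desired
	_, err = client.AppsV1().Deployments("openshift-etcd-operator").
		UpdateScale(ctx, "etcd-quorum-guard", scale, metav1.UpdateOptions{})
	return err
}
```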
Operator controlling this Deployment so it can do intelligent things makes sense to me. But can that happen after the pivot from machine-config operator to etcd operator? We certainly don't want the etcd operator trying to do intelligent Deployment management while the machine-config operator is still trying to hand this Deployment off to the cluster-version operator, because the CVO and etcd operator would be fighting each other.
There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.
> There shouldn't be any issue during the transition from MCO to CEO, as these are different deployments in different namespaces.
Oh. Why is the quorum guard not living in the same namespace as etcd? Regardless, I'd still like the cross-repo pivot to be as boring and conservative as possible. That way, if we flub the transition to intelligent etcd-operator management, we can easily revert the change without having to involve the MCO. I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive. But is there much upside to moving to etcd-operator management in a single PR?
> Why is the quorum guard not living in the same namespace as etcd?

That would make MCO watch events in a different namespace (currently etcd-quorum-guard lives in the MCO namespace).

> I don't think it's wrong to attempt the bigger pivot in a single PR, if folks are feeling more aggressive

Sure, I don't mind splitting replica control into a separate PR.
@hexfusion said:
@wking Could you help me here? Is removing the PDB not easy? I was hoping that just removing the pdb file would be enough to get rid of it. What other implications and impacts do we need to consider?
openshift/machine-config-operator#1928 is dropping some operator-specific e2e tests around this. Does the etcd operator have a similar operator-specific e2e suite? Or some other mechanism for ensuring that the quorum guard is operating as expected?
That's true for new clusters. But existing clusters will still have the resource in-cluster. If you're just transferring ownership, that's fine, because there's no need to actually remove the in-cluster resource. You'd want the hand-off to look something like:

1. The CVO manages the resource via its release manifest (the status quo).
2. The etcd operator is bumped to code that also reconciles the resource, while the CVO manifest is still in place.
3. The CVO manifest is dropped, leaving the etcd operator in sole control.

The point of the second step is that during an update, the CVO will begin reconciling the new manifests. If you jumped straight from 1 to 3, there would be a window where the CVO had relinquished control but the etcd operator had not yet been bumped to code that picked up control. If the manifests for the etcd operator deployment and the managed resource are close together in the update graph, maybe that's fine. If they are far enough apart that a stuck update could leave the resource unmanaged for a significant length of time, you probably want 2's careful handoff.
Seconding that moving https://github.com/openshift/machine-config-operator/blob/master/test/e2e/etcdquorumguard_test.go is necessary, and I think it needs to run on this PR/change.
Force-pushed b64cb41 to a0ed908.
I will work on adding e2e tests on our side once we have moved to more intelligent control by the operator.
/retest
1 similar comment
/retest
type podinfo map[string]podstatus

// TestEtcdQuorumGuard tests the etcd Quorum Guard. It assumes there
HOORAY!
Update: I added the e2e test framework and a test for EtcdQuorumGuard. I hope we can proceed with this PR to finish the cross-repo transfer.
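For a rough sense of the kind of assertion such a test can make, here is a minimal sketch; the namespace and PDB name are assumptions, and the real test in test/e2e is more involved:

```go
package e2e

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestQuorumGuardPDBSketch checks that the quorum-guard PodDisruptionBudget
// never allows more than one pod to be disrupted at a time, which is what
// protects etcd quorum during node drains.
func TestQuorumGuardPDBSketch(t *testing.T) {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatalf("loading kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatalf("building client: %v", err)
	}
	pdb, err := client.PolicyV1().PodDisruptionBudgets("openshift-etcd-operator").
		Get(context.TODO(), "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		t.Fatalf("getting PDB: %v", err)
	}
	// With replicas: 3 and maxUnavailable: 1, the budget should never allow
	// more than one disruption at a time.
	if pdb.Status.DisruptionsAllowed > 1 {
		t.Errorf("expected at most 1 allowed disruption, got %d", pdb.Status.DisruptionsAllowed)
	}
}
```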
Force-pushed f99bd4c to 7e59b08.
@kikisdeliveryservice I need your help again. The verify-deps job is failing. I ran
I appreciate any help.
Force-pushed 7e59b08 to 9624635.
/retest
You import "github.com/openshift/cluster-etcd-operator/test/e2e/framework" in test/e2e/etcquorumguard_test.go but you don't actually have that dir and contents anywhere :)
Force-pushed f611f34 to a0c8359.
/retest
1 similar comment
/retest
/test e2e-gcp-upgrade
2 similar comments
/test e2e-gcp-upgrade
/test e2e-gcp-upgrade
/retest
1 similar comment
/retest
/hold cancel
@retroflexer has manually tested upgrades with success.
Verified the PR to be working well with y-upgrades using cluster-bot on AWS and GCP (tested with MCO's 1928 along with this PR).
Any idea what's going on there?
@wking This quorum-restore failure is about restoring from a backup and has nothing to do with the quorum guard per se. These tests haven't passed in a while, and I agree we need to fix them so they work reliably. But none of the changes in this PR, I believe, affect the results.
/lgtm merging with the understanding that having two quorum guards for a short period of time will not cause failure.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hexfusion, retroflexer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
/test e2e-gcp-upgrade
/retest Please review the full test history for this PR and help us cut down flakes.
@retroflexer: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Migrating the quorum guard from machine-config-operator.
Note that the TLS certs now come from secrets and configmaps instead of files on the host (MCO was reading files on disk).
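To illustrate the difference, here is a hedged Go sketch of the two mounting styles; the volume, secret, and path names are illustrative assumptions, not necessarily what this PR uses:

```go
package quorumguard

import corev1 "k8s.io/api/core/v1"

// certsFromSecret mounts the etcd client certs from a Secret managed
// alongside the operator, as this PR's approach does.
func certsFromSecret() corev1.Volume {
	return corev1.Volume{
		Name: "etcd-certs",
		VolumeSource: corev1.VolumeSource{
			Secret: &corev1.SecretVolumeSource{
				SecretName: "etcd-quorum-guard-certs", // assumed name
			},
		},
	}
}

// certsFromHost is the old MCO-style approach: read cert files laid down on
// the control-plane host's disk.
func certsFromHost() corev1.Volume {
	return corev1.Volume{
		Name: "etcd-certs",
		VolumeSource: corev1.VolumeSource{
			HostPath: &corev1.HostPathVolumeSource{
				Path: "/etc/kubernetes/static-pod-resources/etcd-member", // assumed path
			},
		},
	}
}
```

The practical upside of the Secret-based mount is that the operator can rotate certs through the API without touching host filesystems.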