STOR-962: Update storage operators for HyperShift #1202
Conversation
Force-pushed from fff2831 to 6f255f9
/priority important-soon
Force-pushed from d9c65d3 to 2aa2c3e
Updated with CSI driver operator + added missing chapters. Commits:

* Add alternative with storage operators watching HostedCluster. Add risk about insufficient design.
* Review updates
* Review updates 2
* Review updates 3
* Review updates after storage team meeting
* Review updates 4
* Review updates 5
* Review updates 6
#### Failure Modes

Same as in standalone OCP: the status of all storage operators is reported to ClusterOperator CRs in the guest cluster.
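For illustration, the ClusterOperator object written in the guest cluster looks roughly like the sketch below; the condition set shown is the standard trio and is illustrative, not exhaustive, and the exact messages vary per operator:

```yaml
# Minimal sketch of a ClusterOperator status reported in the guest cluster.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: storage
status:
  conditions:
  - type: Available
    status: "True"
  - type: Progressing
    status: "False"
  - type: Degraded
    status: "False"
```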
One new failure mode that we haven't really discussed yet: what happens if the management cluster loses connection or access to the guest cluster?
This isn't specific to storage, but it's a new failure mode with hypershift in general. Maybe that's documented elsewhere (or should be).
The components currently are:

* cluster-csi-snapshot-controller-operator and its operand, csi-snapshot-controller.
nit: can we always follow the same dash convention for naming? "cluster-csi-snapshot-controller-operator" vs "cluster-storage operator" vs "AWS EBS CSI driver"?
The AWS EBS CSI driver is actually a collection of many things (its Deployment has 6 sidecars); it's not a monolith like our operators, which each live in one repo / Deployment.
Thank you @jsafrane
Awesome stuff
@jsafrane can we maybe put this enhancement here: https://github.com/openshift/enhancements/tree/master/enhancements/hypershift
| Manifest | Applied by | Cluster | Notes |
| --- | --- | --- | --- |
| manifests/05_operand_rbac.yaml | CVO | G | A few RBAC objects for the operand. We can run the operand with kubeadmin for now. |
| manifests/05_operator_rbac.yaml | CVO | G | Operator RBAC, basically a cluster-admin binding. No special RBAC is needed in HyperShift. |
| manifests/05_user_rbac.yaml | CVO | G | Aggregated RBAC roles to add to storage-admin and basic-user to manage snapshots. |
| manifests/07_deployment.yaml | CPO | MC | The operator's Deployment. |
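As a rough illustration of the last row, the Deployment that the CPO renders into the management cluster might look like the sketch below. The namespace, image, secret name, and mount path are assumptions for illustration only, not the actual HyperShift layout:

```yaml
# Hypothetical sketch: the operator Deployment as the CPO could render it
# into the hosted control plane namespace on the management cluster (MC).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-snapshot-controller-operator
  namespace: clusters-example            # hosted control plane namespace (illustrative)
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-snapshot-controller-operator
  template:
    metadata:
      labels:
        app: csi-snapshot-controller-operator
    spec:
      containers:
      - name: operator
        image: quay.io/example/csi-snapshot-controller-operator:latest  # placeholder image
        volumeMounts:
        - name: guest-kubeconfig
          mountPath: /etc/guest-kubeconfig      # path is an assumption
      volumes:
      - name: guest-kubeconfig
        secret:
          secretName: guest-admin-kubeconfig    # illustrative secret name
```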
How are we going to keep these manifests synchronized between the snapshot operator (which is managed by CVO in other environments) and the CPO in a HyperShift environment?
We have to do something similar for MicroShift, and there we have automation to extract the manifests from the release payload and add them to our repo to be customized. We're working toward a model where the operators that own the assets can be run in a one-time mode that produces the same content, so the customization can be known and tested "upstream" of MicroShift. https://github.com/openshift/microshift/blob/main/scripts/rebase.sh#L277
IMO this calls for taking a step back and redesigning how standalone OCP (CVO), HyperShift, and MicroShift manifests and code are kept consistent. Notice there is already OLM with its YAML manifests in the HyperShift repo, synced ad hoc (?) from OCP.
That OLM directory looks something like what we're doing in MicroShift.
I guess we will sync the yamls manually, at least in the beginning.
* Complexity of cluster-storage-operator, cluster-csi-snapshot-controller-operator, and aws-ebs-csi-driver-operator will grow significantly; they need to handle two kubeconfigs / kube clients where one was needed before. We can expect regressions.
  * We will mitigate this risk by reworking the unit tests to cover primarily the HyperShift case. Standalone OCP is a variation of HyperShift where both kube clients are actually the same and both lead to the standalone OCP cluster.
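The "standalone is a degenerate HyperShift" idea above can be sketched in a few lines. This is a minimal illustration with stand-in types, not the real client-go wiring: the operator holds one client per cluster, and the standalone case simply passes the same client twice, so both code paths exercise the same logic.

```go
package main

import "fmt"

// Client is a stand-in for a Kubernetes client; a real operator would hold
// client-go clientsets built from two separate kubeconfigs.
type Client struct{ cluster string }

func (c Client) Apply(obj string) string {
	return fmt.Sprintf("[%s] apply %s", c.cluster, obj)
}

// Operator keeps separate clients for the management and guest clusters.
type Operator struct {
	mgmt  Client // hosted control plane namespace in the management cluster
	guest Client // workload (guest) cluster
}

// Sync reconciles management-side and guest-side objects with their
// respective clients (object names here are illustrative).
func (o Operator) Sync() []string {
	return []string{
		o.mgmt.Apply("Deployment/csi-snapshot-controller"),
		o.guest.Apply("ClusterOperator/csi-snapshot-controller"),
	}
}

func main() {
	// HyperShift: two distinct clusters, two distinct clients.
	hypershift := Operator{mgmt: Client{"mgmt"}, guest: Client{"guest"}}
	fmt.Println(hypershift.Sync())

	// Standalone OCP: both clients point at the same cluster,
	// so the same Sync() code path serves both deployments.
	same := Client{"standalone"}
	standalone := Operator{mgmt: same, guest: same}
	fmt.Println(standalone.Sync())
}
```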
This is actually a big risk. Storage is complicated already and this design adds yet another dimension to our support / maintenance matrix. We have 9 CSI driver operators, each of them is a bit different (because the clouds are different!), and keeping track of what operator watches / syncs which object in mgmt/guest cluster is crazy. This is error prone.
Is this the first time we've encountered this sort of need for HyperShift? How did we deal with it in other situations?
I think storage is the first one to encounter this issue. Other control-plane components are started by code hardcoded in the control-plane-operator, and IMO they don't try to reuse code from their standalone OCP operators.
(I did not research OLM deployment thoroughly).
@csrwng, is this comment correct?
No, the cluster network operator uses two kubeconfigs as well in order to work in a HyperShift environment.
in the guest cluster, a Cluster Instance Admin can DOS the whole CSI driver / CSI snapshots by editing these CRs in an endless loop. The operators will redeploy the operands with each CR change, and the new operands will need to wait for leader election to expire while the old operands are already terminated.
* We mitigate this issue a bit by overwriting user changes to the operator CRs with the Hosted Cluster Config Operator that
We could use a validating webhook to prevent this. But in general, there are lots of ways for a cluster admin to break their cluster, if they explicitly attempt to.
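The validating-webhook idea could be sketched roughly as below. This is a simplified admission check, not a real webhook server; the privileged service account name and the function shape are assumptions for illustration:

```go
package main

import "fmt"

// Assumption for illustration: only this identity (the management-side
// reconciler) may change the spec of the protected operator CRs.
const managedBy = "system:serviceaccount:kube-system:hosted-cluster-config-operator"

// validateUpdate is a hypothetical, simplified admission check: spec changes
// by anyone other than the management side are rejected; status-only updates
// (old and new spec equal) are always allowed.
func validateUpdate(username, oldSpec, newSpec string) error {
	if oldSpec == newSpec {
		return nil // no spec change, always allowed
	}
	if username == managedBy {
		return nil // the management side may reconcile the spec
	}
	return fmt.Errorf("spec is managed by the control plane; change by %q rejected", username)
}

func main() {
	// A guest-cluster admin trying to edit the spec: rejected (non-nil error).
	fmt.Println(validateUpdate("system:admin", "replicas: 1", "replicas: 0"))
	// The management-side reconciler making the same change: allowed.
	fmt.Println(validateUpdate(managedBy, "replicas: 1", "replicas: 0"))
}
```

As noted above, this only raises the bar: a cluster admin determined to break their own cluster has many other ways to do so.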
of them when fixing a bug.
* Esp. the CSI driver operators are complex; each of them is a bit different and we have ~9 of them (and the number is growing). We would need to reimplement them in the control-plane-operator, and the implementation would easily get out of sync with the standalone OCP one.
Could the code to do this be shared between the 2 operators, or would it have to be very different implementations?
IMO the implementations would be quite different, because the inputs of the operators would be different. In standalone OCP, a CSI driver operator needs to watch Infrastructure, the cluster-wide + cloud CA bundles (ConfigMaps), Proxy, and the ClusterCSIDriver CR, and act on their changes. In HyperShift, the input would be only the HostedControlPlane CR.
They could share some templates of the CSI driver objects, but that's not much.
If there's not a lot of code to share and the inputs and outputs are so different, would it make sense to build a special-purpose thing to meet the new requirements? It might prove simpler to maintain than weaving if/then checks throughout the existing code to deal with both cases in one place.
be a good place to talk about core concepts and how they relate.
-->

#### csi-snapshot-controller
If you combined the diagrams in this section, it would be clearer that the cluster-csi-snapshot-controller-operator is reconciling resources in both clusters at the same time (Deployments in the management cluster and a bunch of other things in the guest cluster).
Could we run 2 copies of that operator separately, with command-line flags to enable/disable certain reconciliation steps, so that each binary only had to reconcile resources in 1 cluster (and therefore only deal with one kube client)?
How would these two operators agree on ClusterOperator.Status? IMO it would be more complicated / confusing than running 1 operator with 2 kubeconfigs.
Why would they need to agree on status? Why not report status separately?
Do you mean two OperatorStatus CRs for csi-snapshot-controller? Yes, we could have them, but I don't see the benefit. One operator instance would have just one kubeconfig, but the source code would still have code paths for different kubeconfigs.
#### Support Procedures

TODO: link to HyperShift support procedures; we don't plan anything special for storage.
@derekwaynecarr @csrwng Do you have any "HyperShift Support Procedures" that I can link from here?
Nothing we have documented yet, but this proposal can be updated when we do have them in place.
Force-pushed from 36074f7 to 961c6fa
/lgtm
/lgtm
lgtm, @jsafrane can you please squash commits before merging?
squashed
/assign @derekwaynecarr
re-
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: derekwaynecarr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@jsafrane: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Describe deployment for storage components in HyperShift control plane + necessary changes in existing components.