Hypershift changes for STS support #1427
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -169,7 +169,16 @@ CCO should consider adding the above, three-part test as a utility function expo

**HyperShift changes**: Include Cloud Credential Operator with token-aware mode (see above). Allows for processing of
CredentialsRequest objects added by operators.

The CCO is expected to run on the control plane NS, not on the customer's worker nodes, and not be visible to the customer. However if CredentialsRequests are created on the worker nodes, the CCO will process them and yield Secrets as described elsewhere in this document.
The CCO is expected to run on the control plane NS, not on the customer's worker nodes, and not be visible to the customer. We propose that multiple
instances of CCO be installed on the control plane each consuming the
this phrasing is confusing. You're installing multiple instances on the management cluster, but it's still just one instance per hosted control plane (i.e. one per hosted cluster)
How about:
instances of CCO be installed on the control plane each consuming the
The CCO is expected to run in the management cluster, not on the customer's worker nodes, and not be visible to the customer. We propose that multiple
instances of CCO be installed on the management cluster, each consuming the kubeconfig of the hosted control plane it is intended to be watching.
you might want to mention that it will run in the hosted control plane's NS on the management cluster, but otherwise yes, better.
CCO will be modified so that it can consume a given kubeconfig when specified.
Because the intention is to make this feature available with HyperShift GA it
will need to be backported all the way to 4.12 (contingent on CCO changes
moving out of the feature gate).
not sure this backport is necessary (SD has not explicitly required it). At a minimum it's contingent on us backporting the rest of the STS-enablement work (e.g. console) to 4.12.
if we didn't backport it, it would just mean that you only get this behavior on 4.14+ guest clusters and not 4.12/4.13 guest clusters.
it also might mean the hypershift operator would need to understand the version difference, depending on whether it needs to do anything special to provide the hosted cluster's kubeconfig to the CCO.
The kubeconfig of the worker node is mounted at a specified path on the CCO pod.
CCO can consume that kubeconfig when it starts the go client.
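Under the approach sketched above, the mount might look like the following Deployment fragment. This is an illustrative assumption only: the secret name, mount path, and flag name are placeholders, not settled decisions from this PR.

```yaml
# Hypothetical fragment of the CCO Deployment running in the hosted control
# plane's namespace on the management cluster. Names and paths are made up.
spec:
  containers:
  - name: cloud-credential-operator
    args:
    - --kubeconfig=/etc/hosted-kubeconfig/kubeconfig  # point the go client at the hosted cluster
    volumeMounts:
    - name: hosted-kubeconfig
      mountPath: /etc/hosted-kubeconfig
      readOnly: true
  volumes:
  - name: hosted-kubeconfig
    secret:
      secretName: hosted-cluster-kubeconfig  # assumed to be provided by the HyperShift operator
```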
### Risks and Mitigations
need to add some risks+mitigations in which we look at the resource consumption of the CCO, determine if it's acceptable to the hypershift/rosa team, and if not, potentially pursue how we can reduce the footprint.
are you thinking comparing the utilization of "CCO can watch multiple clusters" vs "multiple CCOs each watching one cluster" ? or more broadly even comparing options beyond the scope of this EP like the pod identity webhook?
no, i wasn't trying to go that far, just calling it out as:
- we need to measure it
- hypershift/rosa team needs some threshold defined that is acceptable/unacceptable
- if we are exceeding that threshold, work will be required. That work may be as simple as just going into the CCO code and finding optimizations we can make to reduce its footprint. (possibly optimizations we can make that are specific to manual/STS mode). If we find ourselves having to do more sophisticated things like a shared CCO across clusters, or switching back to using webhooks, that's going to significantly derail things. So i guess you can also mention those options, but i do not expect us to be going down those paths. (Shared CCO probably isn't even an option, from a tenancy/isolation requirements perspective)
The CCO on build farms is taking up enormous amounts of memory footprint. It looks like the CCO keeps the following in cache:
- secrets
- configmaps
- namespaces
This is cluster-wide, always. We should be able to make the following optimizations:
- watch the two configmaps that need to be watched - the legacy cco config one and the one for leader-election (should be doable in c-r cache lib)
- do not cache namespaces, we just want to watch for creation to start filling in credentials (not sure if possible)
- label secrets created by the CCO, restrict all secret LIST+WATCH to that label selector (doable in c-r)
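To make the intended effect of the last two optimizations concrete, here is a small self-contained sketch (plain Python, deliberately not the controller-runtime API) of the difference between caching every secret cluster-wide and caching a label-filtered, metadata-only view. The label key `cco.openshift.io/managed` is a made-up placeholder, not the label the CCO actually uses.

```python
# Toy model of an informer cache: full objects vs. a narrowed view.
# The label key "cco.openshift.io/managed" is a hypothetical placeholder.

secrets = [
    {"name": "pull-secret", "labels": {}, "data": "x" * 4096},
    {"name": "cco-creds-1", "labels": {"cco.openshift.io/managed": "true"}, "data": "y" * 256},
    {"name": "user-tls", "labels": {}, "data": "z" * 8192},
]

def full_cache(objs):
    """Cluster-wide cache: every secret, payload and all."""
    return list(objs)

def narrowed_cache(objs, label_key):
    """Label-filtered, metadata-only cache: keep only labeled secrets
    and drop the payload, as a metadata-only watch would."""
    return [
        {"name": o["name"], "labels": o["labels"]}
        for o in objs
        if label_key in o["labels"]
    ]

full = full_cache(secrets)
narrow = narrowed_cache(secrets, "cco.openshift.io/managed")
print(len(full), len(narrow))  # 3 1
print(sum(len(o.get("data", "")) for o in narrow))  # 0: no payload cached
```

In controller-runtime terms, the filtering corresponds to a label selector on the Secret LIST+WATCH and the payload drop corresponds to a metadata-only watch; the toy just shows why both shrink the cache.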
openshift/cloud-credential-operator#544 handles the configmap situation
Handling secrets will require a couple releases for labeling existing data and ratcheting validation to kick in - if we backport this, could be fast. LMK what we want.
Namespaces, I think we will need to just use a metadata-only watch if we want that functionality, but namespace objects are small so I'm not sure how much of a difference it will make
@stevekuznetsov Appreciate the proposed partial fix.
Since you have ready data can you better help us understand the current scale of the problem? How many secrets, configmaps, and namespaces does a build farm cluster have and what's the resource consumption of CCO currently?
openshift/cloud-credential-operator#545 adds a controller to start labelling things
@sdodson two answers to your question:
- Regardless of what's going on in the build farms, I think it's irresponsible and wasteful to have this component holding the entire state of the clusters' ConfigMaps, Secrets and Namespaces in memory at all times - especially the former two objects are used to hold large volumes of data and there's no need for holding them, it's just a missed opportunity for optimization. Supportability of a HCP solution that places memory burden on the service cluster which scales linearly with customer dataset size is poor.
- Build farms are huge, as you know, so on e.g. build02 the CCO is holding on to 600MiB :)
Yup, agreed. Just trying to understand broadly what things look like today.
openshift/cloud-credential-operator#546 adds a metadata-only watch for namespaces
@gallettilance: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
#### HyperShift Stories
CCO without the pod identity webhook is enabled on the HyperShift management
openshift/cloud-credential-operator#547 disables starting the pod identity webhook controller when the infrastructure ControlPlaneTopology is External.
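As a rough illustration of the gate that #547 describes (the actual check lives in the CCO's Go code; this is a hedged sketch, with topology strings mirroring the Infrastructure API's ControlPlaneTopology values):

```python
# Sketch of the gating decision described above: skip the pod identity
# webhook controller when the control plane is externally hosted.
def should_start_pod_identity_webhook(control_plane_topology: str) -> bool:
    # "External" is the ControlPlaneTopology value for hosted control planes.
    return control_plane_topology != "External"

print(should_start_pod_identity_webhook("HighlyAvailable"))  # True
print(should_start_pod_identity_webhook("External"))         # False
```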
### Risks and Mitigations

Enabling CCO on HyperShift incurs additional infrastructure cost. The exact
It'd be good to have concrete data for:
- Compute resources impact
- RBAC reqs management and guest cluster side
- Networking requirements
> Compute resources impact

We've done some aggressive pruning and the final numbers are low. The CCO container and the rbac-proxy together are something like 80MiB RAM and 20m vCPU. I'm going to work on the RAM a bit more after this, I think we should be able to get that down to 20MiB or so.

> RBAC reqs management and guest cluster side

Not sure exactly what you mean by this but the CCO running in the HyperShift management plane will simply use the ServiceAccount credentials as it does in standalone.

> Networking requirements

CCO simply needs to connect to the API server for the tenant.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(automated message) This pull request is closed with lifecycle/rotten. The associated Jira ticket, OCPBU-4, has status "In Progress". Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.