
Hypershift changes for STS support #1427

Closed
wants to merge 2 commits into from

Conversation

gallettilance
Contributor

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 20, 2023
@openshift-ci openshift-ci bot requested review from mandre and sdodson June 20, 2023 16:16
@openshift-ci
Contributor

openshift-ci bot commented Jun 20, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jerpeter1 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -169,7 +169,16 @@ CCO should consider adding the above, three-part test as a utility function expo
**HyperShift changes**: Include Cloud Credential Operator with token-aware mode (see above). Allows for processing of
CredentialsRequest objects added by operators.

The CCO is expected to run on the control plane NS, not on the customer's worker nodes, and not be visible to the customer. However if CredentialsRequests are created on the worker nodes, the CCO will process them and yield Secrets as described elsewhere in this document.
The CCO is expected to run on the control plane NS, not on the customer's worker nodes, and not be visible to the customer. We propose that multiple
instances of CCO be installed on the control plane each consuming the
Contributor

this phrasing is confusing. You're installing multiple instances on the management cluster, but it's still just one instance per hosted control plane (i.e. one per hosted cluster)

Contributor Author

How about:

Suggested change
instances of CCO be installed on the control plane each consuming the
The CCO is expected to run in the management cluster, not on the customer's worker nodes, and not be visible to the customer. We propose that multiple
instances of CCO be installed on the management cluster, each consuming the kubeconfig of the hosted control plane it is intended to be watching.

Contributor

you might want to mention that it will run in the hosted control plane's NS on the management cluster, but otherwise yes, better.

CCO will be modified so that it can consume a given kubeconfig when specified.
Because the intention is to make this feature available with HyperShift GA it
will need to be backported all the way to 4.12 (contingent on CCO changes
moving out of the feature gate).
Contributor

not sure this backport is necessary (SD has not explicitly required it). At a minimum it's contingent on us backporting the rest of the STS-enablement work (e.g. console) to 4.12.

if we didn't backport it, it would just mean that you only get this behavior on 4.14+ guest clusters and not 4.12/4.13 guest clusters.

it also might mean the hypershift operator would need to understand the version difference, depending on whether it needs to do anything special to provide the hosted cluster's kubeconfig to the CCO.


The kubeconfig of the hosted cluster is mounted at a specified path on the CCO pod.
CCO can consume that kubeconfig when it starts the Go client.
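
As a rough sketch of that consumption path (not the actual CCO wiring), building a client from a mounted kubeconfig with client-go could look like the following; the mount path below is a placeholder.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical path where HyperShift would mount the hosted cluster's
	// kubeconfig into the CCO pod; the real path is not specified here.
	const hostedKubeconfigPath = "/etc/hosted-cluster/kubeconfig"

	// Build a REST config from the mounted kubeconfig instead of the
	// in-cluster service account, so the operator's watches and writes go
	// to the hosted (guest) API server rather than the management cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", hostedKubeconfigPath)
	if err != nil {
		log.Fatalf("loading hosted cluster kubeconfig: %v", err)
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("building client: %v", err)
	}

	// The controllers would be started with this client/config.
	_ = clientset
}
```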

### Risks and Mitigations
Contributor

need to add some risks+mitigations in which we look at the resource consumption of the CCO, determine if it's acceptable to the hypershift/rosa team, and if not, potentially pursue how we can reduce the footprint.

Contributor Author

are you thinking of comparing the utilization of "CCO can watch multiple clusters" vs "multiple CCOs each watching one cluster"? Or, more broadly, of comparing options beyond the scope of this EP, like the pod identity webhook?

Contributor

no, I wasn't trying to go that far, just calling it out as:

  1. we need to measure it
  2. the hypershift/rosa team needs some acceptable/unacceptable threshold defined
  3. if we are exceeding that threshold, work will be required. That work may be as simple as going into the CCO code and finding optimizations that reduce its footprint (possibly optimizations specific to manual/STS mode). If we find ourselves having to do more sophisticated things, like a shared CCO across clusters or switching back to using webhooks, that's going to significantly derail things. So I guess you can also mention those options, but I do not expect us to go down those paths. (A shared CCO probably isn't even an option, from a tenancy/isolation requirements perspective.)


The CCO on build farms is consuming an enormous amount of memory. It looks like the CCO keeps the following in cache:

  • secrets
  • configmaps
  • namespaces

This is cluster-wide, always. We should be able to make the following optimizations (see the sketch after this list):

  • watch the two configmaps that need to be watched - the legacy cco config one and the one for leader-election (should be doable in c-r cache lib)
  • do not cache namespaces, we just want to watch for creation to start filling in credentials (not sure if possible)
  • label secrets created by the CCO, restrict all secret LIST+WATCH to that label selector (doable in c-r)
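
A minimal sketch of those cache restrictions, assuming a recent controller-runtime (v0.15+) cache.Options API; the label key and ConfigMap name are illustrative placeholders, not the operator's real identifiers.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// managerOptions restricts what the shared informer cache holds in memory.
func managerOptions() ctrl.Options {
	return ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Cache only Secrets carrying a (hypothetical) operator-owned
				// label instead of every Secret in the cluster.
				&corev1.Secret{}: {
					Label: labels.SelectorFromSet(labels.Set{
						"cloudcredential.openshift.io/owned": "true", // placeholder label key
					}),
				},
				// Cache only the ConfigMap(s) the operator actually reads,
				// selected by name (placeholder name shown).
				&corev1.ConfigMap{}: {
					Field: fields.OneTermEqualSelector("metadata.name", "cloud-credential-operator-config"),
				},
			},
		},
	}
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), managerOptions())
	if err != nil {
		panic(err)
	}
	_ = mgr // controllers would be registered here before mgr.Start(...)
}
```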


openshift/cloud-credential-operator#544 handles the configmap situation

Handling secrets will require a couple of releases for labeling existing data and for ratcheting validation to kick in; if we backport this, it could be fast. LMK what we want.

For namespaces, I think we will just need to use a metadata-only watch if we want that functionality, but namespace objects are small, so I'm not sure how much of a difference it will make.
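
For the namespace case, controller-runtime can project a watch down to metadata only. A minimal sketch assuming the builder.OnlyMetadata projection, with a placeholder reconciler:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type namespaceReconciler struct{}

// Reconcile reacts to namespace events; with OnlyMetadata the informer caches
// metav1.PartialObjectMetadata instead of full Namespace objects.
func (r *namespaceReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Placeholder: kick off credential syncing for the new/changed namespace.
	return reconcile.Result{}, nil
}

func addNamespaceWatch(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Watch Namespaces as metadata-only to keep the cache footprint small.
		For(&corev1.Namespace{}, builder.OnlyMetadata).
		Complete(&namespaceReconciler{})
}
```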

Member

@stevekuznetsov Appreciate the proposed partial fix.

Since you have data at hand, can you help us better understand the current scale of the problem? How many secrets, configmaps, and namespaces does a build farm cluster have, and what is CCO's current resource consumption?


openshift/cloud-credential-operator#545 adds a controller to start labelling things
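
For illustration, the labelling step itself can be a simple merge patch; the label key below is a placeholder, and how the operator decides which Secrets it owns is out of scope here.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// labelOwnedSecret adds an ownership label to a Secret the operator created,
// so that list/watch calls can later be restricted by label selector.
func labelOwnedSecret(ctx context.Context, c client.Client, key client.ObjectKey) error {
	secret := &corev1.Secret{}
	if err := c.Get(ctx, key, secret); err != nil {
		return err
	}

	// Take the patch base before mutating, then apply the label.
	patch := client.MergeFrom(secret.DeepCopy())
	if secret.Labels == nil {
		secret.Labels = map[string]string{}
	}
	// Placeholder label key; the real key would be defined by the operator.
	secret.Labels["cloudcredential.openshift.io/owned"] = "true"

	return c.Patch(ctx, secret, patch)
}
```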


@sdodson two answers to your question:

  1. Regardless of what's going on in the build farms, I think it's irresponsible and wasteful to have this component holding the entire state of the cluster's ConfigMaps, Secrets, and Namespaces in memory at all times; the former two in particular can hold large volumes of data, and there's no need to hold them, so it's a missed opportunity for optimization. Supportability of an HCP solution that places a memory burden on the service cluster that scales linearly with customer dataset size is poor.
  2. Build farms are huge, as you know, so on e.g. build02 the CCO is holding on to 600MiB :)

Member

Yup, agreed. Just trying to understand broadly what things look like today.


openshift/cloud-credential-operator#546 adds a metadata-only watch for namespaces

@openshift-ci
Contributor

openshift-ci bot commented Jun 21, 2023

@gallettilance: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


#### HyperShift Stories

CCO without the pod identity webhook is enabled on the HyperShift management
Member

openshift/cloud-credential-operator#547 disables starting the pod identity webhook controller when the infrastructure ControlPlaneTopology is External.
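
A sketch of that gating, assuming the openshift/api config types; the start function is passed in as a stand-in for the real webhook controller start-up, which may be wired differently in #547.

```go
package controllers

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// maybeStartPodIdentityWebhook skips the pod identity webhook controller when
// the cluster's control plane is external, as on HyperShift hosted clusters.
func maybeStartPodIdentityWebhook(ctx context.Context, c client.Client, start func(context.Context) error) error {
	infra := &configv1.Infrastructure{}
	if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, infra); err != nil {
		return fmt.Errorf("reading infrastructure config: %w", err)
	}

	if infra.Status.ControlPlaneTopology == configv1.ExternalTopologyMode {
		// Hosted control plane: the webhook controller is not started at all.
		return nil
	}

	// start is a hypothetical stand-in for the real controller start-up.
	return start(ctx)
}
```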

@gallettilance gallettilance changed the title WIP: Hypershift changes for STS support Hypershift changes for STS support Jul 7, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 7, 2023
### Risks and Mitigations

Enabling CCO on HyperShift incurs additional infrastructure cost. The exact
Member

It'd be good to have concrete data for:

  • Compute resources impact
  • RBAC reqs management and guest cluster side
  • Networking requirements


Compute resources impact

We've done some aggressive pruning and the final numbers are low. The CCO container and the rbac-proxy together are something like 80MiB RAM and 20m vCPU. I'm going to work on the RAM a bit more after this, I think we should be able to get that down to 20MiB or so.

RBAC reqs management and guest cluster side

Not sure exactly what you mean by this but the CCO running in the HyperShift management plane will simply use the ServiceAccount credentials as it does in standalone.

Networking requirements

CCO simply needs to connect to the API server for the tenant.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 17, 2023
@sdodson
Member

sdodson commented Aug 17, 2023

/remove-lifecycle stale
Shouldn't close this, but also not going to get back to it within the next week.

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 17, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2023
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 22, 2023
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Sep 30, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 30, 2023

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Contributor

(automated message) This pull request is closed with lifecycle/rotten. The associated Jira ticket, OCPBU-4, has status "In Progress". Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.


@dhellmann dhellmann removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 13, 2023