
OCPBUGS-15365: *: use a filtered LIST + WATCH on Secrets for AWS STS #545

Merged

Conversation

@stevekuznetsov (Contributor) commented Jun 22, 2023

The status quo for this controller is to LIST + WATCH all Secrets on the cluster. This consumes
more resources than necessary on clusters where users put other data in Secrets themselves, as we
hold that data in our cache and never do anything with it. The reconcilers mainly need to react to
changes in Secrets created for CredentialRequests, which they control and can label, allowing us
to filter the LIST + WATCH down and hold the minimal set of data in memory. However, there are two caveats:

  • in passthrough and mint mode, admin credentials are also provided by the user through Secrets,
    and we need to watch those, but we can't label them
  • on existing clusters, Secrets exist from previous work these reconcilers did, and those Secrets
    will not be labelled

We could solve the second issue with an interim release of this controller that labels all previous
Secrets, but does not restrict the watch stream.

The first issue, though, is difficult to solve due to the way that controller-runtime closes over the
client/cache concepts: we'd need two sets of clients and caches, both for Secrets, and we'd have to
ensure that we use one for client access to Secrets we're creating or mutating and the other when
we're interacting with admin credentials. Not impossible to do, but tricky to implement and complex.

Until we undertake that effort, we apply a simplification: we only filter the LIST + WATCH when AWS
STS mode is enabled. First, this mode is brand new, so we can be reasonably sure that there are no
previous Secrets on the cluster, and we make the filtering best-effort in order to check that this
assumption holds. Second, AWS STS mode only runs in clusters without admin credentials, so if we
apply the filter, we should not see failures downstream from clients that expect to see those
objects but cannot.
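
As a rough illustration of the intended end state, here is a minimal sketch of a label-filtered Secret cache, assuming controller-runtime v0.15's cache.Options and an illustrative label key; the actual key and wiring in this PR may differ:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// managedSecretLabel is an illustrative label key, not necessarily the one this PR uses.
const managedSecretLabel = "cloudcredential.openshift.io/managed"

// newManager builds a manager whose cache only LISTs + WATCHes Secrets carrying our
// label when STS mode is enabled; all other types are cached as usual.
func newManager(cfg *rest.Config, stsMode bool) (ctrl.Manager, error) {
	opts := ctrl.Options{}
	if stsMode {
		opts.Cache = cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {
					Label: labels.SelectorFromSet(labels.Set{managedSecretLabel: "true"}),
				},
			},
		}
	}
	return ctrl.NewManager(cfg, opts)
}
```

With a filter like this in place, reconcilers reading Secrets through the manager's client only ever see labelled Secrets, which is exactly why the unlabelled admin credential is a problem in passthrough and mint mode.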

@stevekuznetsov (Contributor, Author)

Would like some guidance for what kinds of tests we want to add to this :)

@@ -441,7 +441,7 @@ func (a *AWSActuator) syncPassthrough(ctx context.Context, cr *minterv1.Credenti
}

// userPolicy param empty because in passthrough mode this doesn't really have any meaning
- err = a.syncAccessKeySecret(cr, accessKeyID, secretAccessKey, existingSecret, "", logger)
+ err = a.syncAccessKeySecret(ctx, cr, accessKeyID, secretAccessKey, existingSecret, "", logger)
stevekuznetsov (Contributor, Author):

These changes were not strictly necessary, but they looked like latent tech debt, and since I was touching the client calls anyway, I figured we could fix them.


// AddToManager adds all Controllers to the Manager
func AddToManager(m manager.Manager, explicitKubeconfig string) error {
rules := clientcmd.NewDefaultClientConfigLoadingRules()
stevekuznetsov (Contributor, Author):

Controller-runtime does not support server-side apply yet, and by far the best way for us to add labels to existing objects surgically, without conflict problems, is server-side apply. I create a client-go client and thread it through in order to allow that.

})

if _, err := r.mutatingClient.Secrets(secret.Namespace).Apply(ctx, applyConfig, metav1.ApplyOptions{
Force: true, // we're the authoritative owner of this field and should not allow anyone to stomp it
stevekuznetsov (Contributor, Author):

@deads2k would love a gut-check that this does what I think it does, given that the applyConfig is set up the way it is - is there a single owner for all of metadata.labels?

Contributor:

Labels are divided by name, so you only own the labels that you're trying to set, not all the labels available. At least that's my memory of it. Is the reality different?
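
For reference, a minimal sketch of the kind of server-side-apply call being discussed, using client-go apply configurations; the field manager name and label key here are assumptions, not necessarily what the PR uses:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// labelSecret applies a single label via server-side apply. Because SSA tracks
// ownership of labels per key, this field manager only owns the key it sets,
// and a forced apply does not stomp labels owned by other managers.
func labelSecret(ctx context.Context, cs kubernetes.Interface, namespace, name, key, value string) error {
	applyConfig := corev1ac.Secret(name, namespace).
		WithLabels(map[string]string{key: value})
	_, err := cs.CoreV1().Secrets(namespace).Apply(ctx, applyConfig, metav1.ApplyOptions{
		FieldManager: "cloud-credential-operator", // assumed field manager name
		Force:        true,
	})
	return err
}
```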

openshift-ci bot requested review from jstuever and suhanime on June 22, 2023
@stevekuznetsov (Contributor, Author)

/hold

It is not clear to me if labeling the root credentials in kube-system is a tenable approach. I need to look more into how controller-runtime works with a filtered cache to see if we can use a different approach here.

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 22, 2023

codecov bot commented Jun 22, 2023

Codecov Report

Merging #545 (c0b4a13) into master (0c629a5) will increase coverage by 0.59%.
The diff coverage is 53.33%.

Additional details and impacted files


@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
+ Coverage   47.84%   48.43%   +0.59%     
==========================================
  Files          93       93              
  Lines       11488    11958     +470     
==========================================
+ Hits         5496     5792     +296     
- Misses       5359     5538     +179     
+ Partials      633      628       -5     
Impacted Files Coverage Δ
pkg/openstack/actuator.go 0.00% <0.00%> (ø)
...awspodidentity/awspodidentitywebhook_controller.go 31.78% <0.00%> (-0.43%) ⬇️
pkg/operator/cleanup/cleanup_controller.go 52.32% <0.00%> (ø)
pkg/operator/loglevel/controller.go 49.15% <0.00%> (ø)
pkg/operator/metrics/metrics.go 45.40% <0.00%> (ø)
pkg/operator/platform/platform.go 10.95% <0.00%> (-0.16%) ⬇️
pkg/operator/status/status_controller.go 61.24% <0.00%> (+6.74%) ⬆️
pkg/ovirt/actuator.go 9.75% <0.00%> (+1.71%) ⬆️
...redentialsrequest/credentialsrequest_controller.go 50.47% <38.83%> (-1.29%) ⬇️
pkg/azure/client.go 73.91% <66.66%> (-7.04%) ⬇️
... and 5 more

... and 4 files with indirect coverage changes

@stevekuznetsov (Contributor, Author) commented Jun 22, 2023

/retitle PORTENABLE-526: *: label the secrets we interact with

stevekuznetsov changed the title from "*: label the secrets we interact with" to "PORTENABLE-526: *: label the secrets we interact with" on Jun 22, 2023
openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jun 22, 2023
@stevekuznetsov (Contributor, Author)

After some chats with @abutcher and @deads2k :

  • we cannot mutate the admin credentials in kube-system
  • using multiple informers (one for labelled secrets, N for user-provided ones in kube-system) is really nasty in controller-runtime right now and the cost of an implementation is high
  • STS mode, the specific mode whose footprint we care about, will have no root credential
  • I will refactor this to piggy-back on CCO-366 Add ability to detect AWS STS and behave accordingly #542 and, as part of that feature flag:
    • run a startup check to ensure that no credential-request-created secrets exist without the label (sketched below)
    • run the controller using a label selector on secrets for the LIST+WATCH

Since the STS feature flag does not exist before that PR merges, we don't have to worry about pulling in old data, upgrades, etc.

@sdodson @bparees WDYT?
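
A minimal sketch of the startup check from the plan above; the annotation and label keys are hypothetical placeholders, and the real check may key off different metadata:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Hypothetical markers, for illustration only.
	credentialsRequestAnnotation = "cloudcredential.openshift.io/credentials-request"
	managedSecretLabel           = "cloudcredential.openshift.io/managed"
)

// verifyNoUnlabeledSecrets lists Secrets with an uncached, unfiltered reader
// (so it can see objects a filtered cache would hide) and fails if any Secret
// that was created for a CredentialsRequest is missing our label.
func verifyNoUnlabeledSecrets(ctx context.Context, reader client.Reader) error {
	var secrets corev1.SecretList
	if err := reader.List(ctx, &secrets); err != nil {
		return fmt.Errorf("failed to list secrets: %w", err)
	}
	for _, s := range secrets.Items {
		if _, fromCredReq := s.Annotations[credentialsRequestAnnotation]; !fromCredReq {
			continue
		}
		if _, labeled := s.Labels[managedSecretLabel]; !labeled {
			return fmt.Errorf("secret %s/%s was created for a CredentialsRequest but is not labelled", s.Namespace, s.Name)
		}
	}
	return nil
}
```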


bparees commented Jun 22, 2023

sgtm but why can't we mutate the kube-system cred (just to label it)? because it may be user provided/user-managed?

alternatively, if it's not associated w/ credreq, why do we need to label it?

@stevekuznetsov (Contributor, Author)

It's user-provided, so David says we can't touch it. And we would want to label it because controller-runtime makes it exceedingly difficult to have a consistent experience when objects you want to GET are not in your cache, and the factoring today would require us to have two caches.

stevekuznetsov changed the title from "PORTENABLE-526: *: label the secrets we interact with" to "OCPBUGS-15365: *: label the secrets we interact with" on Jun 23, 2023
openshift-ci-robot added the jira/severity-moderate label (referenced Jira bug's severity is moderate for the branch this PR is targeting) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Jun 23, 2023
@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

In order to reduce the cache footprint we have, we need to label the secrets we interact with so we can use a label selector on the LIST + WATCH requests we make. This commit does two things:

  1. adds labels to all newly-created objects, so we have them moving forward
  2. adds a controller that will react to the (unrestricted) set of secrets on the cluster today and puts labels where they are needed, so we can bring clusters without correct labels into conformance

A future release of the CCO will remove the second controller and restrict the cache using our new label selector. We may need to do some extra work like setting up new controllers to ensure users don't remove labels from the admin secrets (or that we add labels to those when they're new).


@stevekuznetsov (Contributor, Author)

/jira refresh

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label on Jun 23, 2023
@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianping-shu

In response to this:

/jira refresh


@stevekuznetsov (Contributor, Author)

/jira refresh

@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianping-shu

In response to this:

/jira refresh


@jianping-shu

/test e2e-aws-ovn

@stevekuznetsov (Contributor, Author)

The E2E is failing with:

2023-07-05T13:30:01.659332732Z time="2023-07-05T13:30:01Z" level=error msg="unable to fetch root cloud cred secret" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="Secret \"aws-creds\" not found"

Which is definitely broken because of my changes, looking into it :)

If we're using a filtered LIST + WATCH for Secrets to do our work on
CredentialRequests, we also need to open a second LIST + WATCH for the
root credential, as this won't have the labels we add to our own
secrets.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

Was sending a label selector instead of a field selector, oops. Should be good now.
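
For context, a dedicated field-selector watch on the root credential could look roughly like the following client-go sketch; the function name and resync interval are illustrative, not the PR's exact wiring:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newRootCredentialInformerFactory opens a LIST + WATCH that only sees the
// named Secret. It must use a *field* selector on metadata.name: the root
// credential is user-provided, so there is no label on it we could select by.
func newRootCredentialInformerFactory(cs kubernetes.Interface, namespace, name string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		cs,
		10*time.Minute, // illustrative resync interval
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", name).String()
		}),
	)
}
```

For the AWS root credential referenced by the failing E2E above, this would be pointed at the aws-creds Secret in kube-system.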

@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

@abutcher (Member)

/lgtm

@stevekuznetsov (Contributor, Author)

/hold cancel

@stevekuznetsov (Contributor, Author)

@abutcher will need an approval as well!

@abutcher (Member)

/approve

openshift-ci bot removed the do-not-merge/hold label on Jul 10, 2023

openshift-ci bot commented Jul 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abutcher, stevekuznetsov


openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 10, 2023
abutcher added the lgtm label (indicates that a PR is ready to be merged) on Jul 10, 2023
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 34cb2ab and 2 for PR HEAD c0b4a13 in total

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 558df03 and 1 for PR HEAD c0b4a13 in total

openshift-ci bot commented Jul 11, 2023

@stevekuznetsov: all tests passed!



openshift-merge-robot merged commit acfee1f into openshift:master on Jul 11, 2023
9 checks passed
@openshift-ci-robot (Contributor)

@stevekuznetsov: Jira Issue OCPBUGS-15365: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-15365 has been moved to the MODIFIED state.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.