
OCPBUGS-15365: *: use a filtered LIST + WATCH on Secrets for AWS STS #545

Merged

Conversation

@stevekuznetsov (Contributor) commented Jun 22, 2023

The status quo for this controller is to LIST + WATCH all Secrets on the cluster. This consumes
more resources than necessary on clusters where users put other data in Secrets themselves, as we
hold that data in our cache and never do anything with it. The reconcilers mainly need to react to
changes in Secrets created for CredentialRequests, which they control and can label, allowing us
to filter the LIST + WATCH down and hold the minimal set of data in memory. However, there are two caveats:

  • in passthrough and mint mode, admin credentials are also provided by the user through Secrets,
    and we need to watch those, but we can't label them
  • on existing clusters, Secrets exist from previous work these reconcilers did, and those Secrets
    will not be labelled

We could solve the second issue with an interim release of this controller that labels all previous
Secrets, but does not restrict the watch stream.

The first issue, though, is difficult to solve due to the way that controller-runtime closes over the
client/cache concepts: we'd need two sets of clients and caches, both for Secrets, and we'd have to
ensure that we use one for client access to Secrets we're creating or mutating and the other when
we're interacting with admin credentials. Not impossible to do, but tricky to implement and complex.

Until we undertake that effort, we apply a simplification: we only filter the LIST + WATCH when AWS
STS mode is enabled. First, this mode is brand new, so we can be reasonably sure that there are no
previous Secrets on the cluster, and we make the filtering best-effort in order to check that this
assumption holds. Second, AWS STS mode only runs in clusters without admin credentials, so if we
apply the filter, we should not see failures downstream from clients that expect to see those
objects but cannot.
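
As a rough illustration of the intended end state, here is a minimal sketch of a label-filtered Secret cache, assuming controller-runtime v0.15's cache.Options and an illustrative label key; the actual key and wiring in this PR may differ:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// managedSecretLabel is an illustrative label key, not necessarily the one this PR uses.
const managedSecretLabel = "cloudcredential.openshift.io/managed"

// newManager builds a manager whose cache only LISTs + WATCHes Secrets carrying our
// label when STS mode is enabled; all other types are cached as usual.
func newManager(cfg *rest.Config, stsMode bool) (ctrl.Manager, error) {
	opts := ctrl.Options{}
	if stsMode {
		opts.Cache = cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {
					Label: labels.SelectorFromSet(labels.Set{managedSecretLabel: "true"}),
				},
			},
		}
	}
	return ctrl.NewManager(cfg, opts)
}
```

With a filter like this in place, reconcilers reading Secrets through the manager's client only ever see labelled Secrets, which is exactly why the unlabelled admin credential is a problem in passthrough and mint mode.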

@stevekuznetsov (Contributor, Author)

Would like some guidance for what kinds of tests we want to add to this :)

@@ -441,7 +441,7 @@ func (a *AWSActuator) syncPassthrough(ctx context.Context, cr *minterv1.Credenti
}

// userPolicy param empty because in passthrough mode this doesn't really have any meaning
- err = a.syncAccessKeySecret(cr, accessKeyID, secretAccessKey, existingSecret, "", logger)
+ err = a.syncAccessKeySecret(ctx, cr, accessKeyID, secretAccessKey, existingSecret, "", logger)
stevekuznetsov (Contributor, Author):

These changes were not strictly necessary, but they looked like latent tech debt, and since I was touching the client calls anyway, I figured we could fix them.


// AddToManager adds all Controllers to the Manager
func AddToManager(m manager.Manager, explicitKubeconfig string) error {
rules := clientcmd.NewDefaultClientConfigLoadingRules()
stevekuznetsov (Contributor, Author):

Controller-runtime does not support server-side apply yet, and by far the best way for us to add labels to existing objects surgically, without conflict problems, is server-side apply. I create a client-go client and thread it through in order to allow that.

})

if _, err := r.mutatingClient.Secrets(secret.Namespace).Apply(ctx, applyConfig, metav1.ApplyOptions{
Force: true, // we're the authoritative owner of this field and should not allow anyone to stomp it
stevekuznetsov (Contributor, Author):

@deads2k would love a gut-check that this does what I think it does, given that the applyConfig is set up the way it is - is there a single owner for all of metadata.labels?

Contributor:

Labels are divided by name, so you only own the labels that you're trying to set, not all the labels available. At least that's my memory of it. Is the reality different?
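
For reference, a minimal sketch of the kind of server-side-apply call being discussed, using client-go apply configurations; the field manager name and label key here are assumptions, not necessarily what the PR uses:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// labelSecret applies a single label via server-side apply. Because SSA tracks
// ownership of labels per key, this field manager only owns the key it sets,
// and a forced apply does not stomp labels owned by other managers.
func labelSecret(ctx context.Context, cs kubernetes.Interface, namespace, name, key, value string) error {
	applyConfig := corev1ac.Secret(name, namespace).
		WithLabels(map[string]string{key: value})
	_, err := cs.CoreV1().Secrets(namespace).Apply(ctx, applyConfig, metav1.ApplyOptions{
		FieldManager: "cloud-credential-operator", // assumed field manager name
		Force:        true,
	})
	return err
}
```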

openshift-ci bot requested review from jstuever and suhanime on June 22, 2023
@stevekuznetsov (Contributor, Author)

/hold

It is not clear to me if labeling the root credentials in kube-system is a tenable approach. I need to look more into how controller-runtime works with a filtered cache to see if we can use a different approach here.

openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 22, 2023

codecov bot commented Jun 22, 2023

Codecov Report

Merging #545 (c0b4a13) into master (0c629a5) will increase coverage by 0.59%.
The diff coverage is 53.33%.

Additional details and impacted files


@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
+ Coverage   47.84%   48.43%   +0.59%     
==========================================
  Files          93       93              
  Lines       11488    11958     +470     
==========================================
+ Hits         5496     5792     +296     
- Misses       5359     5538     +179     
+ Partials      633      628       -5     
Impacted Files Coverage Δ
pkg/openstack/actuator.go 0.00% <0.00%> (ø)
...awspodidentity/awspodidentitywebhook_controller.go 31.78% <0.00%> (-0.43%) ⬇️
pkg/operator/cleanup/cleanup_controller.go 52.32% <0.00%> (ø)
pkg/operator/loglevel/controller.go 49.15% <0.00%> (ø)
pkg/operator/metrics/metrics.go 45.40% <0.00%> (ø)
pkg/operator/platform/platform.go 10.95% <0.00%> (-0.16%) ⬇️
pkg/operator/status/status_controller.go 61.24% <0.00%> (+6.74%) ⬆️
pkg/ovirt/actuator.go 9.75% <0.00%> (+1.71%) ⬆️
...redentialsrequest/credentialsrequest_controller.go 50.47% <38.83%> (-1.29%) ⬇️
pkg/azure/client.go 73.91% <66.66%> (-7.04%) ⬇️
... and 5 more

... and 4 files with indirect coverage changes

@stevekuznetsov (Contributor, Author) commented Jun 22, 2023

/retitle PORTENABLE-526: *: label the secrets we interact with

stevekuznetsov changed the title from "*: label the secrets we interact with" to "PORTENABLE-526: *: label the secrets we interact with" on Jun 22, 2023
openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jun 22, 2023
@stevekuznetsov (Contributor, Author)

After some chats with @abutcher and @deads2k :

  • we cannot mutate the admin credentials in kube-system
  • using multiple informers (one for labelled secrets, N for user-provided ones in kube-system) is really nasty in controller-runtime right now and the cost of an implementation is high
  • STS mode, the specific mode whose footprint we care about, will have no root credential
  • I will refactor this to piggy-back on CCO-366 Add ability to detect AWS STS and behave accordingly #542 and, as part of that feature flag:
    • run a startup check to ensure that no credential-request-created secrets exist without the label (sketched below)
    • run the controller using a label selector on secrets for the LIST+WATCH

Since the STS feature flag does not exist before that PR merges, we don't have to worry about pulling in old data, upgrades, etc.

@sdodson @bparees WDYT?
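
A minimal sketch of the startup check from the plan above; the annotation and label keys are hypothetical placeholders, and the real check may key off different metadata:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Hypothetical markers, for illustration only.
	credentialsRequestAnnotation = "cloudcredential.openshift.io/credentials-request"
	managedSecretLabel           = "cloudcredential.openshift.io/managed"
)

// verifyNoUnlabeledSecrets lists Secrets with an uncached, unfiltered reader
// (so it can see objects a filtered cache would hide) and fails if any Secret
// that was created for a CredentialsRequest is missing our label.
func verifyNoUnlabeledSecrets(ctx context.Context, reader client.Reader) error {
	var secrets corev1.SecretList
	if err := reader.List(ctx, &secrets); err != nil {
		return fmt.Errorf("failed to list secrets: %w", err)
	}
	for _, s := range secrets.Items {
		if _, fromCredReq := s.Annotations[credentialsRequestAnnotation]; !fromCredReq {
			continue
		}
		if _, labeled := s.Labels[managedSecretLabel]; !labeled {
			return fmt.Errorf("secret %s/%s was created for a CredentialsRequest but is not labelled", s.Namespace, s.Name)
		}
	}
	return nil
}
```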


bparees commented Jun 22, 2023

sgtm but why can't we mutate the kube-system cred (just to label it)? because it may be user provided/user-managed?

alternatively, if it's not associated w/ credreq, why do we need to label it?

@stevekuznetsov (Contributor, Author)

It's user-provided, so David says we can't touch it. And we would want to label it because controller-runtime makes it exceedingly difficult to have a consistent experience when objects you want to GET are not in your cache, and the factoring today would require us to have two caches.

stevekuznetsov changed the title from "PORTENABLE-526: *: label the secrets we interact with" to "OCPBUGS-15365: *: label the secrets we interact with" on Jun 23, 2023
openshift-ci-robot added the jira/severity-moderate label (referenced Jira bug's severity is moderate for the branch this PR is targeting) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Jun 23, 2023
@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

In order to reduce the cache footprint we have, we need to label the secrets we interact with so we can use a label selector on the LIST + WATCH requests we make. This commit does two things:

  1. adds labels to all newly-created objects, so we have them moving forward
  2. adds a controller that will react to the (unrestricted) set of secrets on the cluster today and puts labels where they are needed, so we can bring clusters without correct labels into conformance

A future release of the CCO will remove the second controller and restrict the cache using our new label selector. We may need to do some extra work like setting up new controllers to ensure users don't remove labels from the admin secrets (or that we add labels to those when they're new).


@stevekuznetsov (Contributor, Author)

/jira refresh

openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label on Jun 23, 2023
@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianping-shu

In response to this:

/jira refresh


@stevekuznetsov (Contributor, Author)

/jira refresh

@openshift-ci-robot (Contributor)

@stevekuznetsov: This pull request references Jira Issue OCPBUGS-15365, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianping-shu

In response to this:

/jira refresh


@jianping-shu

/test e2e-aws-ovn

@stevekuznetsov (Contributor, Author)

The E2E is failing with:

2023-07-05T13:30:01.659332732Z time="2023-07-05T13:30:01Z" level=error msg="unable to fetch root cloud cred secret" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="Secret \"aws-creds\" not found"

Which is definitely broken because of my changes, looking into it :)

If we're using a filtered LIST + WATCH for Secrets to do our work on
CredentialRequests, we also need to open a second LIST + WATCH for the
root credential, as this won't have the labels we add to our own
secrets.

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

Was sending a label selector instead of a field selector, oops. Should be good now.
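
For context, a dedicated field-selector watch on the root credential could look roughly like the following client-go sketch; the function name and resync interval are illustrative, not the PR's exact wiring:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newRootCredentialInformerFactory opens a LIST + WATCH that only sees the
// named Secret. It must use a *field* selector on metadata.name: the root
// credential is user-provided, so there is no label on it we could select by.
func newRootCredentialInformerFactory(cs kubernetes.Interface, namespace, name string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		cs,
		10*time.Minute, // illustrative resync interval
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", name).String()
		}),
	)
}
```

For the AWS root credential referenced by the failing E2E above, this would be pointed at the aws-creds Secret in kube-system.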

@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

Signed-off-by: Steve Kuznetsov <skuznets@redhat.com>
@stevekuznetsov (Contributor, Author)

/test e2e-aws-manual-oidc

@abutcher (Member)

/lgtm

@stevekuznetsov (Contributor, Author)

/hold cancel

@stevekuznetsov (Contributor, Author)

@abutcher will need an approval as well!

@abutcher (Member)

/approve

openshift-ci bot removed the do-not-merge/hold label on Jul 10, 2023

openshift-ci bot commented Jul 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abutcher, stevekuznetsov


openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 10, 2023
abutcher added the lgtm label (indicates that a PR is ready to be merged) on Jul 10, 2023
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 34cb2ab and 2 for PR HEAD c0b4a13 in total

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 558df03 and 1 for PR HEAD c0b4a13 in total

openshift-ci bot commented Jul 11, 2023

@stevekuznetsov: all tests passed!



openshift-merge-robot merged commit acfee1f into openshift:master on Jul 11, 2023
9 checks passed
@openshift-ci-robot (Contributor)

@stevekuznetsov: Jira Issue OCPBUGS-15365: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-15365 has been moved to the MODIFIED state.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.