fix: new ca-filter causing expontentially more api-calls #3608

the-technat · 2024-03-11T12:16:47Z

Issue

Sort of related to #3565

Description

The recently merged #3591 contained a bug: on large clusters (those with many ingresses/certificates), the initial reconcile loop after startup took hours instead of minutes. The cause is a missing domains-cache entry for certificates that were filtered out by the newly introduced flag, so every lookup for such a certificate went back to the API.

This PR fixes this behaviour by setting an empty cache value for these filtered certificates. At least in our (@swisspost) testing environment, this reduced the duration of the reconcile loop significantly.

Checklist

Added tests that cover your change (if possible)
Added/modified documentation as required (such as the README.md, or the docs directory)
Manually tested
Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

Backfilled missing tests for code in same general area 🎉
Refactored something and made the world a better place 🌟

due to missing cache

k8s-ci-robot · 2024-03-11T12:16:58Z

Hi @the-technat. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2024-03-11T12:17:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: the-technat
Once this PR has been reviewed and has the lgtm label, please assign oliviassss for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dims · 2024-03-11T12:19:49Z

/ok-to-test
/assign @shraddhabang @oliviassss @M00nF1sh

mkilchhofer · 2024-03-11T12:38:57Z

pkg/ingress/cert_discovery.go

+	case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
+		d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
+	}
+	return domains, nil
 }


The actual resulting diff for this feature (#3565) can be viewed here:
v2.7.1...the-technat:aws-load-balancer-controller:main


M00nF1sh · 2024-03-12T20:42:56Z

Taking a look. I'm wondering how the original PR would cause such a regression when the new flag is not used, given it had the len(d.allowedCAARNs) == 0 check.

M00nF1sh · 2024-03-12T20:53:12Z

pkg/ingress/cert_discovery.go

@@ -153,18 +153,18 @@ func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certAR
 	certDetail := resp.Certificate

 	// check if cert is issued from an allowed CA
+	// otherwise empty-out the list of domains
+	domains := sets.String{}
 	if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {


Just for my understanding: the original PR didn't introduce any regression, right?
And you faced this issue because you used the new "allowedCAARNs" feature?
If so, this fix looks good to me. However, I'd like to change this code to look like this:

	domains := sets.NewString(aws.StringValueSlice(certDetail.SubjectAlternativeNames)...)
	switch aws.StringValue(certDetail.Type) {
	case acm.CertificateTypeImported:
		d.certDomainsCache.Set(certARN, domains, d.importedCertDomainsCacheTTL)
	case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
		d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
	}
	if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {
		return domains, nil
	}
	return sets.String{}, nil

Technically there is no functional difference, since allowedCAARNs is a controller-level flag that is immutable for the controller's lifetime. However, from a coding perspective, the cache should hold the domains before the CA filter logic is applied, which makes the code more robust (e.g. it keeps working even if allowedCAARNs can somehow be updated dynamically).

mkilchhofer · 2024-03-12

Our PR introduces this behavior only if both conditions are met:

the --allowed-certificate-authority-arns=... parameter is used

there exist certificates in AWS ACM that were not issued by an authority in the allowed list.

I/we can test your proposal tomorrow in our staging environment. But I assume that your proposed code change does not work properly as we then also cache the domain(s) of a certificate which is not issued from an allowed CA.

And the function first tries to load the domains from the cache before it reaches this filtering code block:

func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certARN string) (sets.String, error) {
	if rawCacheItem, ok := d.certDomainsCache.Get(certARN); ok {
		return rawCacheItem.(sets.String), nil
	}
	// only continues in case we didn't find it inside the cache

M00nF1sh · 2024-03-12

@mkilchhofer
You are right, the cache should hold the domains after the filter.
Ideally we should refactor the code so that the cache stores the certificate details from AWS. For now it only caches domains and your filter logic works on the cert's CA, so the filter logic has to run before the cache is populated.

I'll approve this, and maybe refactor this in the future if we ever need to make the ca list mutable.

M00nF1sh · 2024-03-12T21:50:42Z

/ok-to-test

the-technat · 2024-03-13T08:10:24Z

/retest

oliviassss · 2024-03-13T18:05:25Z

govulncheck failing is unrelated, will merge it.

…#3608) due to missing cache

* fix log level in listener manager and tagging manager (#3573)
* bump up controller-gen version and update manifests (#3580)
* docs: ingress subnets annotation - clarify locale differences (#3579)
* feat: allowed ACM cert discovery to filter on CA ARNs (#3565) (#3591)
* Add example for NLB target-group-attributes to enable unhealthy target connection draining (#3577)
* Add example annotation for NLB unhealthy target connection draining
* Add emtpyline back in
* fix: ca-filter causing expontentially more api-calls (#3608) due to missing cache
* Repo controlled build go version (#3598)
* update go version to mitigate CVE (#3615)
* Adding support for Availability Zone Affinity (#3470) Fixes #3431 Signed-off-by: Alex Berger <alex-berger@gmx.ch>
* Update golang.org/protobuf version to fix CVE-2024-24786 (#3618)
* Add a note to recommend to use compatible chart and image versions
* Update golang.org/protobuf version to fix CVE-2024-24786
---------
Signed-off-by: Alex Berger <alex-berger@gmx.ch>
Co-authored-by: Olivia Song <sonyingy@amazon.com>
Co-authored-by: Andrey Lebedev <alebedev87@gmail.com>
Co-authored-by: Nathanael Liechti <technat@technat.ch>
Co-authored-by: Isaac Wilson <10012479+jukie@users.noreply.github.com>
Co-authored-by: Nathanael Liechti <nathanael.liechti@post.ch>
Co-authored-by: Jason Du <jasonxdu@amazon.com>
Co-authored-by: Hao Zhou <haouc@users.noreply.github.com>
Co-authored-by: Alexander Berger <alex-berger@users.noreply.github.com>

fix: ca-filter causing expontentially more api-calls

5cfcd19

due to missing cache

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 11, 2024

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 11, 2024

k8s-ci-robot requested review from kishorj and M00nF1sh March 11, 2024 12:17

k8s-ci-robot assigned M00nF1sh Mar 11, 2024

k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 11, 2024

k8s-ci-robot assigned oliviassss and shraddhabang Mar 11, 2024

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 11, 2024

mkilchhofer reviewed Mar 11, 2024


M00nF1sh reviewed Mar 12, 2024


oliviassss added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 13, 2024

oliviassss merged commit 20e667d into kubernetes-sigs:main Mar 13, 2024
7 of 9 checks passed

shraddhabang pushed a commit to shraddhabang/aws-load-balancer-controller that referenced this pull request Mar 20, 2024

fix: ca-filter causing expontentially more api-calls (kubernetes-sigs…

33d2d8c

…#3608) due to missing cache
