fix: new ca-filter causing exponentially more api-calls #3608
Conversation
due to missing cache
Hi @the-technat. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: the-technat. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
    d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
}
return domains, nil
}
The actual resulting diff for this feature (#3565) can be viewed here:
v2.7.1...the-technat:aws-load-balancer-controller:main
Or in this screenshot:
taking a look, wondering how the original PR would cause such a regression when the new flag is not used, given it had the …
@@ -153,18 +153,18 @@ func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certAR
certDetail := resp.Certificate

// check if cert is issued from an allowed CA
// otherwise empty-out the list of domains
domains := sets.String{}
if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {
Just for my understanding, the original PR didn't introduce any regression, right?
And you faced this issue because you used the new "allowedCAARNs" feature?
If so, this fix looks good to me. However, I'd like to change this code to be like:
domains := sets.NewString(aws.StringValueSlice(certDetail.SubjectAlternativeNames)...)
switch aws.StringValue(certDetail.Type) {
case acm.CertificateTypeImported:
d.certDomainsCache.Set(certARN, domains, d.importedCertDomainsCacheTTL)
case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
}
if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {
return domains, nil
}
return sets.String{}, nil
Technically there is no functional difference, since allowedCAARNs is a controller-level flag that is immutable for the controller's lifetime. However, from a coding perspective, the cache should hold the "domains" before the "CA filter logic" is applied; this makes the code more robust (e.g. it would keep working even if allowedCAARNs could somehow be updated dynamically).
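The ordering idea above can be sketched as a standalone toy (not the controller's actual code; the `discovery` and `certInfo` types and the plain-map cache are invented stand-ins for `acmCertDiscovery` and its TTL cache): cache the pre-filter domain set per certificate ARN, and apply the CA allow-list only when reading, so cached entries stay valid even if the allow-list were ever to change.

```go
package main

import "fmt"

// certInfo is a hypothetical cached record: the full (pre-filter) domain
// set plus the issuing CA of the certificate.
type certInfo struct {
	domains map[string]bool
	caARN   string
}

// discovery is a toy stand-in for the controller's cert-discovery component.
type discovery struct {
	allowedCAARNs []string
	cache         map[string]certInfo // keyed by certificate ARN
}

// caAllowed mirrors the allow-list check: an empty list allows everything.
func (d *discovery) caAllowed(caARN string) bool {
	if len(d.allowedCAARNs) == 0 {
		return true
	}
	for _, arn := range d.allowedCAARNs {
		if arn == caARN {
			return true
		}
	}
	return false
}

// domainsFor applies the CA filter on read; a disallowed CA is still a
// cache hit, it just yields an empty set instead of an API call.
func (d *discovery) domainsFor(certARN string) map[string]bool {
	info, ok := d.cache[certARN]
	if !ok {
		return nil // in the real controller, an ACM API call would happen here
	}
	if !d.caAllowed(info.caARN) {
		return map[string]bool{}
	}
	return info.domains
}

func main() {
	d := &discovery{
		allowedCAARNs: []string{"arn:aws:acm-pca:ca-allowed"},
		cache: map[string]certInfo{
			"cert-a": {domains: map[string]bool{"a.example.com": true}, caARN: "arn:aws:acm-pca:ca-allowed"},
			"cert-b": {domains: map[string]bool{"b.example.com": true}, caARN: "arn:aws:acm-pca:ca-other"},
		},
	}
	fmt.Println(len(d.domainsFor("cert-a"))) // allowed CA: domains returned
	fmt.Println(len(d.domainsFor("cert-b"))) // disallowed CA: empty, but no API call
}
```

With filter-on-read, flipping the allow-list never invalidates the cache, at the cost of caching domains for certificates the controller will never use.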
Our PR introduces this behavior only if both conditions are met:
- the --allowed-certificate-authority-arns=... parameter is used
- there exist certificates inside AWS ACM which are not issued by an authority in the allowed list

I/we can test your proposal tomorrow in our staging environment. But I assume your proposed code change does not work properly, as we would then also cache the domain(s) of a certificate which is not issued by an allowed CA.
And the function first tries to load the domains from the cache before it reaches this filtering code block:
func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certARN string) (sets.String, error) {
if rawCacheItem, ok := d.certDomainsCache.Get(certARN); ok {
return rawCacheItem.(sets.String), nil
}
// only continues in case we didn't find it inside the cache
@mkilchhofer
You are right, the cache should hold the domains after the filter.
Ideally we should refactor the code so that the cache stores the certificate details from AWS. (But for now it only caches domains, and your filter logic is on the cert CA, thus the filter logic has to run before the cache.)
I'll approve this, and maybe refactor it in the future if we ever need to make the CA list mutable.
/ok-to-test
/retest
govulncheck failing is unrelated, will merge it.
…#3608) due to missing cache
* fix log level in listener manager and tagging manager (#3573)
* bump up controller-gen version and update manifests (#3580)
* docs: ingress subnets annotation - clarify locale differences (#3579)
* feat: allowed ACM cert discovery to filter on CA ARNs (#3565) (#3591)
* Add example for NLB target-group-attributes to enable unhealthy target connection draining (#3577)
* Add example annotation for NLB unhealthy target connection draining
* Add emtpyline back in
* fix: ca-filter causing expontentially more api-calls (#3608) due to missing cache
* Repo controlled build go version (#3598)
* update go version to mitigate CVE (#3615)
* Adding support for Availability Zone Affinity (#3470) Fixes #3431
* Update golang.org/protobuf version to fix CVE-2024-24786 (#3618)
* Add a note to recommend to use compatible chart and image versions
* Update golang.org/protobuf version to fix CVE-2024-24786

---------

Signed-off-by: Alex Berger <alex-berger@gmx.ch>
Co-authored-by: Olivia Song <sonyingy@amazon.com>
Co-authored-by: Andrey Lebedev <alebedev87@gmail.com>
Co-authored-by: Nathanael Liechti <technat@technat.ch>
Co-authored-by: Isaac Wilson <10012479+jukie@users.noreply.github.com>
Co-authored-by: Nathanael Liechti <nathanael.liechti@post.ch>
Co-authored-by: Jason Du <jasonxdu@amazon.com>
Co-authored-by: Hao Zhou <haouc@users.noreply.github.com>
Co-authored-by: Alexander Berger <alex-berger@users.noreply.github.com>
Issue
Sort of related to #3565
Description
The recently merged #3591 contained a bug where, on large clusters (clusters with a lot of ingresses/certificates), the initial reconcile loop after startup would take hours instead of minutes. This is due to the missing domains-cache entry for certificates that were filtered out by the newly introduced flag.
This PR fixes this behaviour by setting an empty cache value for these filtered certificates. At least in our (@swisspost) testing environment this reduced the duration of the reconcile loop significantly.
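The bug and the fix can be illustrated with a minimal, self-contained sketch (all names here are invented for illustration, not the controller's real types). Before the fix, a certificate rejected by the CA allow-list was never written to the domains cache, so every reconcile loop called the ACM API for it again; caching an empty domain set makes only the first lookup hit the API.

```go
package main

import "fmt"

// fakeACM stands in for the AWS ACM client and counts API calls.
type fakeACM struct{ calls int }

func (a *fakeACM) describeDomains(certARN string) []string {
	a.calls++
	return []string{"app." + certARN + ".example.com"}
}

type discovery struct {
	acm           *fakeACM
	cache         map[string][]string
	caAllowed     func(certARN string) bool
	cacheFiltered bool // true = behaviour with the fix applied
}

// loadDomains mirrors the cache-then-API flow: cache lookup first,
// API call on miss, then the CA allow-list filter.
func (d *discovery) loadDomains(certARN string) []string {
	if cached, ok := d.cache[certARN]; ok {
		return cached
	}
	domains := d.acm.describeDomains(certARN)
	if !d.caAllowed(certARN) {
		if d.cacheFiltered {
			d.cache[certARN] = []string{} // the fix: cache the empty result
		}
		return nil
	}
	d.cache[certARN] = domains
	return domains
}

// reconcileCalls runs `loops` reconcile iterations against a single
// certificate from a disallowed CA and reports how often the API was hit.
func reconcileCalls(cacheFiltered bool, loops int) int {
	d := &discovery{
		acm:           &fakeACM{},
		cache:         map[string][]string{},
		caAllowed:     func(string) bool { return false },
		cacheFiltered: cacheFiltered,
	}
	for i := 0; i < loops; i++ {
		d.loadDomains("cert-x")
	}
	return d.acm.calls
}

func main() {
	fmt.Println(reconcileCalls(false, 100)) // without the fix: 100 API calls
	fmt.Println(reconcileCalls(true, 100))  // with the fix: 1 API call
}
```

On a cluster with many filtered certificates, this per-reconcile API cost multiplies across certificates and loops, which matches the hours-long startup reconcile described above.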
Checklist
- Added/modified documentation as required (such as the README.md, or the docs directory)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯