fix: new ca-filter causing expontentially more api-calls #3608

Merged
merged 1 commit into from Mar 13, 2024

Conversation

the-technat
Contributor

@the-technat the-technat commented Mar 11, 2024

Issue

Sort of related to #3565

Description

The recently merged #3591 contained a bug: on large clusters (those with many ingresses and certificates), the initial reconcile loop after startup would take hours instead of minutes. The cause is a missing domains-cache entry for certificates that are filtered out by the newly introduced flag.

This PR fixes the behaviour by setting an empty cache value for these filtered certificates. At least in our (@swisspost) testing environment, this significantly reduced the duration of the reconcile loop.
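To illustrate the effect described above, here is a minimal, self-contained sketch. It is not the controller's code: the `discovery` type, the `cacheFiltered` switch, the fake in-memory backend and the `countCalls` helper are all hypothetical, and the "API call" is just a counter. It assumes the pre-fix behaviour was that filtered certificates never got a cache entry, as the description states.

```go
package main

import (
	"fmt"
	"slices"
)

// cert is a hypothetical stand-in for an ACM certificate description.
type cert struct {
	caARN   string
	domains []string
}

// discovery is a simplified, hypothetical model of the cert discovery;
// it is not the controller's real type.
type discovery struct {
	allowedCAARNs []string
	certs         map[string]cert     // fake ACM backend, keyed by cert ARN
	cache         map[string][]string // cert ARN -> domains after the CA filter
	describeCalls int                 // simulated DescribeCertificate calls
	cacheFiltered bool                // true models the behaviour with this fix
}

func (d *discovery) loadDomains(certARN string) []string {
	if domains, ok := d.cache[certARN]; ok {
		return domains
	}
	d.describeCalls++ // cache miss: one simulated API call
	c := d.certs[certARN]
	domains := []string{}
	if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, c.caARN) {
		domains = c.domains
	} else if !d.cacheFiltered {
		return domains // assumed pre-fix behaviour: filtered certs are never cached
	}
	d.cache[certARN] = domains
	return domains
}

// countCalls runs `lookups` discovery passes over three certificates
// (one allowed, two filtered) and returns the number of simulated API calls.
func countCalls(cacheFiltered bool, lookups int) int {
	d := &discovery{
		allowedCAARNs: []string{"ca-allowed"},
		certs: map[string]cert{
			"cert-a": {caARN: "ca-allowed", domains: []string{"a.example.com"}},
			"cert-b": {caARN: "ca-other", domains: []string{"b.example.com"}},
			"cert-c": {caARN: "ca-other", domains: []string{"c.example.com"}},
		},
		cache:         map[string][]string{},
		cacheFiltered: cacheFiltered,
	}
	for i := 0; i < lookups; i++ {
		for arn := range d.certs {
			d.loadDomains(arn)
		}
	}
	return d.describeCalls
}

func main() {
	fmt.Println("without fix:", countCalls(false, 5)) // 1 + 2*5 = 11 calls
	fmt.Println("with fix:", countCalls(true, 5))     // 3 calls, one per cert
}
```

In this toy model the number of calls without the empty cache entry grows with (filtered certificates × lookups), while with it each certificate is described once until its cache entry expires.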

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 11, 2024
@k8s-ci-robot
Contributor

Hi @the-technat. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 11, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: the-technat
Once this PR has been reviewed and has the lgtm label, please assign oliviassss for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dims
Member

dims commented Mar 11, 2024

/ok-to-test
/assign @shraddhabang @oliviassss @M00nF1sh

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 11, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 11, 2024
case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
}
return domains, nil
}

The actual resulting diff for this feature (#3565) can be viewed here:
v2.7.1...the-technat:aws-load-balancer-controller:main

Or on this screenshot:

image

@M00nF1sh
Collaborator

Taking a look. I'm wondering how the original PR would cause such a regression when the new flag is not used, given it had the len(d.allowedCAARNs) == 0 check.

@@ -153,18 +153,18 @@ func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certAR
certDetail := resp.Certificate

// check if cert is issued from an allowed CA
// otherwise empty-out the list of domains
domains := sets.String{}
if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {
Collaborator


Just for my understanding: the original PR didn't introduce any regression by itself, right?
You hit this issue because you used the new "allowedCAARNs" feature?
If so, this fix looks good to me. However, I'd like to change this code to something like:

domains := sets.NewString(awssdk.StringValueSlice(certDetail.SubjectAlternativeNames)...)
switch awssdk.StringValue(certDetail.Type) {
case acm.CertificateTypeImported:
	d.certDomainsCache.Set(certARN, domains, d.importedCertDomainsCacheTTL)
case acm.CertificateTypeAmazonIssued, acm.CertificateTypePrivate:
	d.certDomainsCache.Set(certARN, domains, d.privateCertDomainsCacheTTL)
}
if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, awssdk.StringValue(certDetail.CertificateAuthorityArn)) {
	return domains, nil
}
return sets.String{}, nil

Technically there is no functional difference, since allowedCAARNs is a controller-level flag that is immutable for the controller's lifetime. However, from a coding perspective, the cache should hold the "domains" before the "CA filter logic" is applied; this makes the code more robust (e.g. it still works if allowedCAARNs can somehow be updated dynamically).


Our PR introduces this behavior only if both conditions are met:

  1. the --allowed-certificate-authority-arns=... parameter is used
  2. there are certificates in AWS ACM that were not issued by an authority in the allowed list.

I/we can test your proposal tomorrow in our staging environment. But I assume that your proposed code change will not work properly, because we would then also cache the domain(s) of a certificate that was not issued by an allowed CA.

And the function first tries to load the domains from the cache before it reaches this filtering code block:

func (d *acmCertDiscovery) loadDomainsForCertificate(ctx context.Context, certARN string) (sets.String, error) {
	if rawCacheItem, ok := d.certDomainsCache.Get(certARN); ok {
		return rawCacheItem.(sets.String), nil
	}
	// only continues in case we didn't find it inside the cache
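The concern can be reproduced with a small stand-alone sketch. Everything here is hypothetical (the `lookup` type, `domainsFor`, and the single `allowedCA` string are simplifications, not the controller's API); it only models the ordering "cache the raw domains first, filter afterwards" combined with a cache lookup at the top of the function.

```go
package main

import "fmt"

// certInfo is a hypothetical stand-in for the certificate details from ACM.
type certInfo struct {
	caARN   string
	domains []string
}

// lookup models the ordering discussed above: raw domains are cached first,
// and the CA filter is applied only on a cache miss.
type lookup struct {
	allowedCA string
	cache     map[string][]string
}

func (l *lookup) domainsFor(arn string, c certInfo) []string {
	if cached, ok := l.cache[arn]; ok {
		return cached // cache hit: the CA filter below is never consulted
	}
	l.cache[arn] = c.domains // unfiltered domains go into the cache
	if c.caARN == l.allowedCA {
		return c.domains
	}
	return nil
}

func main() {
	l := &lookup{allowedCA: "ca-allowed", cache: map[string][]string{}}
	other := certInfo{caARN: "ca-other", domains: []string{"b.example.com"}}

	fmt.Println(len(l.domainsFor("cert-b", other))) // 0: first call is filtered
	fmt.Println(len(l.domainsFor("cert-b", other))) // 1: second call leaks the cached domain
}
```

Under these assumptions the first lookup is filtered correctly, but every later lookup returns the cached domains of a certificate from a CA that is not allowed.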

Collaborator


@mkilchhofer
You are right, the cache should hold the domains after the filter.
Ideally we should refactor the code so that the cache stores the certificate details from AWS. For now, though, it only caches domains and your filter logic operates on the cert's CA, so the filter logic has to run before the cache is populated.

I'll approve this, and maybe refactor it in the future if we ever need to make the CA list mutable.
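For reference, one possible shape of such a refactor is sketched below. This is only an illustration under stated assumptions: `detailDiscovery`, `certDetail`, `newDetailDiscovery` and the in-memory `backend` are hypothetical names, and no TTL handling is modelled. The cache stores the raw certificate details (including the CA ARN), and the CA filter runs on every call after the cache lookup.

```go
package main

import (
	"fmt"
	"slices"
)

// certDetail is a hypothetical, reduced view of what ACM returns for a cert.
type certDetail struct {
	caARN   string
	domains []string
}

// detailDiscovery caches raw certificate details instead of filtered domains.
type detailDiscovery struct {
	allowedCAARNs []string
	details       map[string]certDetail // cache: cert ARN -> raw details
	backend       map[string]certDetail // fake ACM backend
	fetchCalls    int                   // simulated API calls
}

func newDetailDiscovery(allowed ...string) *detailDiscovery {
	return &detailDiscovery{
		allowedCAARNs: allowed,
		details:       map[string]certDetail{},
		backend: map[string]certDetail{
			"cert-b": {caARN: "ca-other", domains: []string{"b.example.com"}},
		},
	}
}

func (d *detailDiscovery) domainsFor(certARN string) []string {
	detail, ok := d.details[certARN]
	if !ok {
		d.fetchCalls++ // cache miss: one simulated API call
		detail = d.backend[certARN]
		d.details[certARN] = detail
	}
	// The filter runs after the cache, so it always sees the current allow list.
	if len(d.allowedCAARNs) == 0 || slices.Contains(d.allowedCAARNs, detail.caARN) {
		return detail.domains
	}
	return nil
}

func main() {
	d := newDetailDiscovery("ca-allowed")
	fmt.Println(len(d.domainsFor("cert-b"))) // 0: CA not allowed yet

	d.allowedCAARNs = append(d.allowedCAARNs, "ca-other")
	fmt.Println(len(d.domainsFor("cert-b"))) // 1: allowed after the list changed
	fmt.Println(d.fetchCalls)                // 1: details were fetched only once
}
```

With this layout, a change to the allow list takes effect immediately without invalidating the cache, at the cost of storing slightly more per certificate.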

@M00nF1sh
Collaborator

/ok-to-test

@the-technat
Contributor Author

/retest

@oliviassss
Collaborator

govulncheck failing is unrelated, will merge it.

@oliviassss oliviassss added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Mar 13, 2024
@oliviassss oliviassss merged commit 20e667d into kubernetes-sigs:main Mar 13, 2024
7 of 9 checks passed
shraddhabang pushed a commit to shraddhabang/aws-load-balancer-controller that referenced this pull request Mar 20, 2024
M00nF1sh pushed a commit that referenced this pull request Mar 22, 2024
* fix log level in listener manager and tagging manager (#3573)

* bump up controller-gen version and update manifests (#3580)

* docs: ingress subnets annotation - clarify locale differences (#3579)

* feat: allowed ACM cert discovery to filter on CA ARNs (#3565) (#3591)

* Add example for NLB target-group-attributes to enable unhealthy target connection draining (#3577)

* Add example annotation for NLB unhealthy target connection draining

* Add emtpyline back in

* fix: ca-filter causing expontentially more api-calls (#3608)

due to missing cache

* Repo controlled build go version (#3598)

* update go version to mitigate CVE (#3615)

* Adding support for Availability Zone Affinity (#3470)

Fixes #3431

Signed-off-by: Alex Berger <alex-berger@gmx.ch>

* Update golang.org/protobuf version to fix CVE-2024-24786 (#3618)

* Add a note to recommend to use compatible chart and image versions

* Update golang.org/protobuf version to fix CVE-2024-24786

---------

Signed-off-by: Alex Berger <alex-berger@gmx.ch>
Co-authored-by: Olivia Song <sonyingy@amazon.com>
Co-authored-by: Andrey Lebedev <alebedev87@gmail.com>
Co-authored-by: Nathanael Liechti <technat@technat.ch>
Co-authored-by: Isaac Wilson <10012479+jukie@users.noreply.github.com>
Co-authored-by: Nathanael Liechti <nathanael.liechti@post.ch>
Co-authored-by: Jason Du <jasonxdu@amazon.com>
Co-authored-by: Hao Zhou <haouc@users.noreply.github.com>
Co-authored-by: Alexander Berger <alex-berger@users.noreply.github.com>