external-dns calls aws api "ListResourceRecordSets" too frequently #905

Closed
yuanlinios opened this issue Feb 19, 2019 · 18 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@yuanlinios

I am piloting AWS EKS with external-DNS. There is 1 private hosted zone with around 700 pre-existing records.

It works OK. However, in CloudTrail I can see that external-dns calls "ListHostedZones" once every minute, which I can understand, but it also issues 8 or 9 "ListResourceRecordSets" calls every minute.

Is this expected? This API call frequency is too high for me. Is it possible to increase the interval?

@yuanlinios yuanlinios changed the title external-dns calsl aws api "ListResourceRecordSets" too frequently external-dns calls aws api "ListResourceRecordSets" too frequently Feb 19, 2019
@spender0

spender0 commented Feb 20, 2019

Same thing with v0.5.11.
With interval=1m it issues a sequence of API requests every minute. The sequence consists mostly of ListResourceRecordSets requests.
The problem is that it sends them without any pause in between, which exceeds the AWS global Route 53 API rate limit of 5 requests per second.
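
To illustrate the pacing that is missing, here is a minimal Go sketch that spaces calls so they stay under a per-second limit. It is not external-dns code: `intervalFor` and `paced` are hypothetical names, and the closures stand in for real Route 53 requests.

```go
package main

import (
	"fmt"
	"time"
)

// intervalFor returns the minimum spacing between requests that keeps the
// request rate at or below limitPerSecond. Hypothetical helper.
func intervalFor(limitPerSecond int) time.Duration {
	if limitPerSecond < 1 {
		limitPerSecond = 1
	}
	return time.Second / time.Duration(limitPerSecond)
}

// paced runs the calls in order. The first call runs immediately; every
// later call waits for the next ticker tick before it starts.
func paced(limitPerSecond int, calls []func()) {
	ticker := time.NewTicker(intervalFor(limitPerSecond))
	defer ticker.Stop()
	for i, call := range calls {
		if i > 0 {
			<-ticker.C
		}
		call()
	}
}

func main() {
	start := time.Now()
	calls := make([]func(), 6)
	for i := range calls {
		n := i
		// Stand-in for one ListResourceRecordSets page request.
		calls[i] = func() {
			fmt.Printf("call %d at ~%v\n", n, time.Since(start).Round(10*time.Millisecond))
		}
	}
	paced(5, calls) // at most 5 calls per second
}
```

Pacing a single process only helps that process. With several controllers sharing one AWS account, they still draw on the same limit.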

@tewing-riffyn

I am experiencing a similar problem. Cloudtrail shows a high volume of external-DNS upserts. This is causing "rate exceeded" messages when I use other tools like Terraform.

I am running 8 kubernetes clusters within the same AWS account. Each cluster is running a separate instance of external-dns and is updating a private zone and a public zone.

@njuettner
Member

As long as you don't use any ProviderSpecific property other than target-health, it shouldn't behave differently than it did in versions < v0.5.9.

see: https://github.com/kubernetes-incubator/external-dns/blob/master/plan/plan.go#L188

We introduced a bug in v0.5.10 when we merged ProviderSpecific, however this should be fixed in the latest version.

One thing you could try out is to test if v0.5.9 has the same problems.

@wallentx

Not seeing any issues with v0.5.9, but had problems with v0.5.11 and v0.5.10. Running with:
args:
- '--source=service'
- '--source=ingress'
- '--provider=aws'
- '--registry=txt'
- '--txt-owner-id=k8s-external-dns'

@xanonid

xanonid commented Mar 5, 2019

Duplicate issue: #891

@fraenkel
Contributor

fraenkel commented Apr 5, 2019

So I don't believe this is a duplicate issue but more an issue of how the code is currently structured and the interaction between Controller and Registry.
Let me walk through what I am talking about.

I am ignoring the cache because in the worst case it won't matter.

The Controller calls Registry.Records(), which calls provider.Records().
p.Records() calls p.Zones() and route53.ListResourceRecordSetsPages().
p.Zones() calls route53.ListHostedZonesPages.

So if we count thus far, we have made z (# of zone pages) + r (# of resourcerecord pages) calls.

Eventually the Controller will call Registry.ApplyChanges() which calls provider.ApplyChanges()
Now the problem ensues.
p.ApplyChanges does 3 calls to p.newChanges() and then calls p.submitChanges()
p.newChanges() calls p.Records() and calls p.newChange()
p.newChange() will call p.Zones() for Alias records.
p.submitChanges() will call p.Zones() and route53.ChangeResourceRecordSets()

For newChanges(), we have z + r + z calls.
For submitChanges(), we have z calls.

For a single pass, Records + ApplyChanges, we have (z + r) + 3 * (2z + r) + z = 8z + 4r calls.

Just to prove I wasn't crazy, I ran the simple TestAWSApplyChanges and saw 4 zone calls and 3 records calls. It's not the worst case, but it's not good.

The ideal would be to do 1 call for both zones and resources. With minimal effort it should be possible to do 1 zone and 1 resource call just within ApplyChanges.
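
The count above can be checked with a small Go sketch. It only models the arithmetic of the walkthrough and does not call AWS; `passCost` and `costPerPass` are hypothetical names, and the model assumes the worst case in which every newChanges() call also pays for p.Zones() via alias records.

```go
package main

import "fmt"

// passCost breaks one Records + ApplyChanges pass into its list-call counts.
// z = number of hosted zone pages, r = number of resource record set pages.
type passCost struct {
	Records       int // Controller -> Registry.Records() -> p.Records()
	NewChanges    int // three p.newChanges() calls inside p.ApplyChanges()
	SubmitChanges int // p.submitChanges() -> p.Zones()
}

// costPerPass follows the call chain described above, worst case.
func costPerPass(z, r int) passCost {
	records := z + r          // p.Zones() plus the record listing
	newChanges := records + z // p.Records() plus p.Zones() for alias records
	return passCost{
		Records:       records,
		NewChanges:    3 * newChanges,
		SubmitChanges: z,
	}
}

// Total is the number of list calls in a single pass.
func (c passCost) Total() int {
	return c.Records + c.NewChanges + c.SubmitChanges
}

func main() {
	// Illustrative page counts only, not measurements.
	for _, in := range [][2]int{{1, 1}, {1, 7}, {2, 7}} {
		z, r := in[0], in[1]
		c := costPerPass(z, r)
		fmt.Printf("z=%d r=%d -> total=%d (8z+4r=%d)\n", z, r, c.Total(), 8*z+4*r)
	}
}
```

Under the ideal described above, one zone listing and one record listing per pass, the same pass would cost z + r calls.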

@tewing-riffyn

@fraenkel - good write-up. Yes, refactoring some of the duplicate calls would help greatly.

I solved my issue by setting zoneIdFilters in the helm values. External-dns was spending a lot of work evaluating other subdomain zones only to determine that they weren't authoritative for the record it was manipulating. The problem was compounded because I have 8 clusters running in the same AWS account, each with its own external-dns controller.

If external-dns were more efficient about evaluating the zones I wouldn't need to do this.

@fraenkel
Contributor

fraenkel commented Apr 5, 2019

I will put together a PR which I believe can reduce this to the bare minimum. Shouldn't take long.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2019
@so0k

so0k commented Jul 18, 2019

I think this can be closed?

@DTTerastar

I am seeing rate throttling errors as well. It'd be nice if it retried in a loop with a delayed backoff to work around this.
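
A retry loop of the kind suggested here could look like the following Go sketch. It is not taken from external-dns or the AWS SDK: `retryWithBackoff`, `errThrottled`, and `noSleep` are hypothetical names, and a real client would have to detect the provider's own throttling error instead of this stand-in.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errThrottled is a hypothetical stand-in for a "rate exceeded" API error.
var errThrottled = errors.New("rate exceeded")

// noSleep can replace time.Sleep in dry runs so that nothing actually waits.
var noSleep = func(time.Duration) {}

// retryWithBackoff calls op up to attempts times. After each throttling
// error it waits, doubling the delay each time. Any other error is
// returned immediately without a retry.
func retryWithBackoff(attempts int, base time.Duration, sleep func(time.Duration), op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !errors.Is(err, errThrottled) {
			return err
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated API call: throttled twice, then succeeds.
	err := retryWithBackoff(5, 10*time.Millisecond, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return errThrottled
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Backoff makes throttling survivable but does not reduce the number of calls, so it complements rather than replaces removing the duplicate listings.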

@joeharrison714

@njuettner I had this issue so I tested out v0.5.9 and it instantly worked. I don't think the bug you mentioned was fixed.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@helgi
Contributor

helgi commented Nov 7, 2019

/reopen

@k8s-ci-robot
Contributor

@helgi: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@burdzwastaken

This fix in the latest release resolved this issue for us.
