
Tries to create records in Route53 that already exist (v1.12.2) #3007

Closed
2ZZ opened this issue Sep 7, 2022 · 9 comments · Fixed by #3724
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@2ZZ

2ZZ commented Sep 7, 2022

What happened: external-dns is frequently failing to create DNS records in Route53 as it is trying to create a record that already exists. For example these records are in Route53:

{"Name":"xyz.myexampledomain.com.","Type":"A","AliasTarget":{"HostedZoneId":"Z35SXDOTRQ7X7K","DNSName":"internal-xxx-xxx.us-east-1.elb.amazonaws.com.","EvaluateTargetHealth":true}}
{"Name":"myprefixcname-xyz.myexampledomain.com.","Type":"TXT","TTL":300,"ResourceRecords":[{"Value":"\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""}]}
{"Name":"myprefixxyz.myexampledomain.com.","Type":"TXT","TTL":300,"ResourceRecords":[{"Value":"\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""}]}

And external-dns is trying to create them again, resulting in an error

time="2022-09-07T06:38:44Z" level=debug msg="Endpoints generated from gateway: mynamespace/myapp: [xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [] xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com []]"
time="2022-09-07T06:38:45Z" level=debug msg="Removing duplicate endpoint xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com []"
time="2022-09-07T06:38:45Z" level=debug msg="Adding myprefixcname-xyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Adding myprefixxyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Adding xyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Modifying endpoint: xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [{alias true}], setting aws/evaluate-target-health=true"
time="2022-09-07T06:38:45Z" level=debug msg="Modifying endpoint: xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [], setting alias=true"
time="2022-09-07T06:38:45Z" level=info msg="Desired change: CREATE myprefixcname-xyz.myexampledomain.com TXT [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=info msg="Desired change: CREATE xyz.myexampledomain.com A [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='myprefixcname-xyz.myexampledomain.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 28ce0839-65f0-4937-897b-d368c34928bf"
time="2022-09-07T06:38:46Z" level=info msg="Desired change: CREATE myprefixxyz.myexampledomain.com TXT [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:46Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='myprefixcname-us-vorax.myexampledomain.com.', type='TXT'] but it already exists, Tried to create resource record set [name='myprefixxyz.myexampledomain.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4d0116d8-4bb2-44d1-a7e8-827564856577"

What you expected to happen: Records that already exist are left alone

How to reproduce it (as minimally and precisely as possible): Unknown so far

Environment: AWS

  • External-DNS version (use external-dns --version): v1.12.2
  • DNS provider: Route53
@2ZZ 2ZZ added the kind/bug Categorizes issue or PR as related to a bug. label Sep 7, 2022
@s-matyukevich

We got the same error. After some investigation, I figured out that it happens if you set the --txt-cache-interval parameter.

Looking at the code, I think the bug is here

After the first iteration, recordsCache won't contain the TXT records in the new format (those records are added to missingTXTRecords and later created in the cloud provider). So during the second iteration, if the registry cache is in use, the cache still won't contain the new TXT records and the registry will once again try to create them.
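The failure mode described above can be sketched with a toy registry cache. All names here (Registry, Sync, recordsCache, fixBug) are illustrative stand-ins, not the real external-dns types:

```go
package main

import "fmt"

// Toy model of the cache bug: newly created TXT records are written to the
// provider but never added to recordsCache, so every sync within the cache
// interval believes they are still missing and issues the CREATE again.
type Registry struct {
	recordsCache map[string]bool // TXT records the registry believes exist
	provider     map[string]bool // records actually present in Route 53
	fixBug       bool            // if true, update the cache after creating
}

// Sync returns the TXT records it attempted to CREATE this iteration.
func (r *Registry) Sync(desired []string) []string {
	var created []string
	for _, name := range desired {
		if r.recordsCache[name] {
			continue // cache says it exists; nothing to do
		}
		// The registry thinks the record is missing and issues a CREATE.
		// When it already exists in the provider, Route 53 rejects the
		// whole change batch with InvalidChangeBatch.
		created = append(created, name)
		r.provider[name] = true
		if r.fixBug {
			r.recordsCache[name] = true // the missing step
		}
	}
	return created
}

func main() {
	desired := []string{"myprefixxyz.myexampledomain.com."}

	buggy := &Registry{recordsCache: map[string]bool{}, provider: map[string]bool{}}
	fmt.Println("buggy 1st sync:", buggy.Sync(desired)) // creates the record
	fmt.Println("buggy 2nd sync:", buggy.Sync(desired)) // tries to create it again

	fixed := &Registry{recordsCache: map[string]bool{}, provider: map[string]bool{}, fixBug: true}
	fmt.Println("fixed 1st sync:", fixed.Sync(desired))
	fmt.Println("fixed 2nd sync:", fixed.Sync(desired)) // no-op, cache is warm
}
```

With the cache update in place, the second iteration sees the record in recordsCache and skips it, which matches why disabling --txt-cache-interval (forcing a fresh provider read each loop) also avoids the error.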

@benjimin

This issue seems very similar to #2421 and #2793 (possibly duplicates?).

Same error. Tried upgrading 0.11.0 → 0.13.1 (and bitnami chart 6.2.7 → 6.12.1), to no avail.

  • In our config we were already not setting --txt-cache-interval (only --interval), so that suggestion didn't help.
  • I noticed that when the latest container first starts, the logs show one successful batch of TXT updates (or re-"CREATE"s), whereas the error occurs on a different batch that happens to include one new entry. (Also, the InvalidChangeBatch message usually referred to only the first couple of Desired change entries in the batch.)

I instead tried modifying --aws-batch-change-size, and this did fix the problem. Originally it was set to 1000 (not sure why it was actually doing batches of 6-14?), and out of curiosity I changed it to 5 (rather than 1). The outcome (pasted below) is curious, because batches seem to partially succeed (complaining only about old records), or to succeed when they do not mix old and new records. I presume there was a recent external-dns format change; there now seem to be two TXT records for each substantive record?

time="2022-11-22T22:55:43Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-a-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-cname-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-cname-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:44Z" level=error msg="Failure in zone dev.example. [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:44Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-old1.dev.example.', type='TXT'] but it already exists, Tried to create resource record set [name='cluster-old2.dev.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4a4b433f-ce5e-49c8-af91-b0e2bb2ea748"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE cluster-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE old1.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE old2.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE new.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:46Z" level=info msg="4 record(s) in zone dev.example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-22T22:55:46Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-a-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-cname-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-cname-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="3 record(s) in zone dev.example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-22T22:56:41Z" level=info msg="All missing records are created"
time="2022-11-22T22:56:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:56:42Z" level=info msg="All records are already up to date"
time="2022-11-22T22:57:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:57:42Z" level=info msg="All records are already up to date"
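One reason shrinking --aws-batch-change-size helps: Route 53 applies each ChangeResourceRecordSets batch atomically, so a single duplicate CREATE rejects every other change submitted with it. A rough sketch of that interaction (the chunk/submit helpers are made up for illustration, not external-dns code):

```go
package main

import "fmt"

// A Route 53 change batch is atomic: one duplicate CREATE poisons every
// other change submitted with it. Smaller batches limit the blast radius
// of a single stale record.
type change struct {
	action string // "CREATE", "UPSERT", "DELETE"
	name   string
}

// chunk splits changes into batches of at most size elements.
func chunk(changes []change, size int) [][]change {
	var batches [][]change
	for len(changes) > 0 {
		n := size
		if len(changes) < n {
			n = len(changes)
		}
		batches = append(batches, changes[:n])
		changes = changes[n:]
	}
	return batches
}

// submit simulates Route 53: the whole batch fails if any CREATE targets
// a record that already exists; otherwise all changes are applied.
func submit(batch []change, existing map[string]bool) error {
	for _, c := range batch {
		if c.action == "CREATE" && existing[c.name] {
			return fmt.Errorf("InvalidChangeBatch: %s already exists", c.name)
		}
	}
	for _, c := range batch {
		existing[c.name] = true
	}
	return nil
}

func main() {
	existing := map[string]bool{"cluster-old1.dev.example.": true}
	changes := []change{
		{"CREATE", "cluster-old1.dev.example."}, // duplicate: poisons its batch
		{"CREATE", "new.dev.example."},          // innocent bystander
	}
	// One big batch: the innocent record is never created.
	fmt.Println(submit(changes, map[string]bool{"cluster-old1.dev.example.": true}))
	// Batch size 1: only the duplicate fails; the new record goes through.
	for _, b := range chunk(changes, 1) {
		fmt.Println(submit(b, existing))
	}
}
```

This matches the logs above: with smaller batches, the failures complain only about the old records, while batches containing only new records succeed.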

@benjimin

benjimin commented Nov 30, 2022

I have another (hopefully clearer) example of this problem.

Initially, it was attempting to create the 3 records (old-style TXT, new-style cname-TXT, and a substantive A record) for one ingress. This batch was consistently failing (with a message that the two TXT records already exist). I checked in Route 53 and confirmed that those TXT records already existed, and no substantive record for this ingress existed. The following errors were repeating ad infinitum:

time="2022-11-30T08:43:27Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE cluster-cname-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE cluster-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE thisingress.example A [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-cname-thisingress.example.', type='TXT'] but it already exists, Tried to create resource record set [name='cluster-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: f822a731-deb0-4293-bba0-7c520780edf0"
time="2022-11-30T08:43:27Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"

I then set --aws-batch-change-size=1 (to disable batching). This still resulted in some transient error messages, but then the substantive record was created and the errors ceased thereafter:

time="2022-11-30T08:49:53Z" level=info msg="Created Kubernetes client https://x.x.x.x:443"
time="2022-11-30T08:49:57Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:49:58Z" level=info msg="Desired change: CREATE cluster-cname-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:49:58Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:49:58Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-cname-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 03748069-8235-422c-b460-9c62d8cfae5e"
time="2022-11-30T08:49:59Z" level=info msg="Desired change: CREATE cluster-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:00Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:00Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 2aaa247b-2b52-44ad-a46b-c1d55c94d772"
time="2022-11-30T08:50:01Z" level=info msg="Desired change: CREATE thisingress.example A [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:02Z" level=info msg="1 record(s) in zone example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-30T08:50:02Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"
time="2022-11-30T08:50:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:50:56Z" level=info msg="All records are already up to date"
time="2022-11-30T08:51:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:51:56Z" level=info msg="All records are already up to date"
time="2022-11-30T08:52:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:52:56Z" level=info msg="All records are already up to date"

I then also tried deleting just the new-style TXT record, or all three records. In both cases external-dns promptly replaced them with no errors. (Deleting only the substantive A record would replicate the transient error. Deleting only the old-style TXT record would be tolerated without any effect.)

As a separate experiment, I tried deleting and immediately recreating an annotated ClusterIP service (so that the target address would change), and I found that all three records were successfully UPSERT'ed, rather than getting re-CREATE'd. (Modifying the hostname annotation caused 3 deletions and 3 creations.)

Multiple aspects seem weird to me:

  • The substantive record is type A, whereas the new-style TXT specifies cname, not a. Checking in Route 53 (where we have dozens of subdomains managed by external-dns), I see that there is one case where the new-style TXT contains a (and this is for our most recently added subdomain, which is also unique in that its resource is an internal ClusterIP service rather than some form of ingress), but all the other new-style TXT records (for older subdomains) contain cname, even though they actually correspond to type A records. (We only have a handful of actual CNAME records, e.g. for CloudFront or ACM DNS validation, i.e. they are not managed by external-dns.) Presumably this is a problem with how aliased records (letting Route 53 resolve load balancer IPs, to boost performance and reduce cost) are handled by external-dns? #3164.

  • When a batch does fail, subsequent elements in the batch get skipped (if they are of a different type than the element or two which caused the failure). This becomes a critical problem when external-dns later keeps retrying the same batch (with the elements in the same order).

  • With batch size set to one, external-dns still causes errors by attempting to create TXT records where the substantive (type A) record is absent, even if that TXT record is already present. However, this error has no consequence because even after a failed batch, external-dns proceeds to attempt any other pending batches (without waiting for the next interval). This means that it will continue to create the substantive A records, and once those exist it will relent from trying to recreate the already-existing TXT records.

  • I'm unsure how it had spontaneously gotten into a state with a substantive record missing but corresponding TXT records present. (Is aliasing related? Similar to #3186)
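The batch-skipping behaviour described above suggests one mitigation: before submitting, compare the desired CREATEs against the records already in the zone and downgrade duplicates to UPSERTs, which Route 53 accepts idempotently. A hedged sketch under that assumption (reconcile is a made-up helper, not the fix that eventually landed):

```go
package main

import "fmt"

type change struct {
	action string
	name   string
}

// reconcile downgrades CREATEs for records that already exist in the zone
// to UPSERTs, so a stale TXT record can no longer poison a whole batch.
// Route 53's UPSERT action creates the record if absent and overwrites it
// if present.
func reconcile(desired []change, existing map[string]bool) []change {
	out := make([]change, 0, len(desired))
	for _, c := range desired {
		if c.action == "CREATE" && existing[c.name] {
			c.action = "UPSERT"
		}
		out = append(out, c)
	}
	return out
}

func main() {
	existing := map[string]bool{"cluster-thisingress.example.": true}
	desired := []change{
		{"CREATE", "cluster-thisingress.example."}, // already in Route 53
		{"CREATE", "thisingress.example."},         // genuinely new
	}
	for _, c := range reconcile(desired, existing) {
		fmt.Println(c.action, c.name)
	}
	// prints:
	// UPSERT cluster-thisingress.example.
	// CREATE thisingress.example.
}
```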

@jbilliau-rcd

jbilliau-rcd commented Feb 13, 2023

I'm having the exact same problem on multiple clusters: external-dns keeps trying to create records that already exist (it created them), and thus errors out nonstop. The worst part is that some of these unnecessary retries are batched in with real records I do need created due to new ingresses, so one app's DNS issues are now preventing other apps from having their DNS records created. Very annoying, and I can't figure out why it's happening.

@farazoman

I'm having the exact same issue on 0.12, and I'm in a similar boat as @jbilliau-rcd where the batch size is causing issues for the actual records I need. External-dns wants to create a record that prefixes "cname" to the actual DNS name for some reason, which is the same as the TXT record that was created before upgrading.

@nipr-jdoenges

This issue is blocking our upgrade from v0.11.0... the migration from the old-style to new-style TXT registry records seems fundamentally broken.

@jbilliau-rcd

Can we just get some sort of "overwrite" functionality added? If the record already exists, just overwrite it; why would I ever care about a record being overwritten with the same value?
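For what it's worth, Route 53 already exposes exactly this semantics via the UPSERT action in ChangeResourceRecordSets. A minimal change batch for one of the TXT records from the examples above would look like:

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "myprefixxyz.myexampledomain.com.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
          {
            "Value": "\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""
          }
        ]
      }
    }
  ]
}
```

UPSERT creates the record set if it is absent and replaces it if it exists, so a change batch using UPSERT instead of CREATE cannot fail with "but it already exists".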

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2023
@jbilliau-rcd

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2023

8 participants