
Tries to create records in Route53 that already exist (v1.12.2) #3007

Closed
2ZZ opened this issue Sep 7, 2022 · 9 comments · Fixed by #3724
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@2ZZ

2ZZ commented Sep 7, 2022

What happened: external-dns is frequently failing to create DNS records in Route53 as it is trying to create a record that already exists. For example these records are in Route53:

{"Name":"xyz.myexampledomain.com.","Type":"A","AliasTarget":{"HostedZoneId":"Z35SXDOTRQ7X7K","DNSName":"internal-xxx-xxx.us-east-1.elb.amazonaws.com.","EvaluateTargetHealth":true}}
{"Name":"myprefixcname-xyz.myexampledomain.com.","Type":"TXT","TTL":300,"ResourceRecords":[{"Value":"\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""}]}
{"Name":"myprefixxyz.myexampledomain.com.","Type":"TXT","TTL":300,"ResourceRecords":[{"Value":"\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""}]}

And external-dns is trying to create them again, resulting in an error

time="2022-09-07T06:38:44Z" level=debug msg="Endpoints generated from gateway: mynamespace/myapp: [xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [] xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com []]"
time="2022-09-07T06:38:45Z" level=debug msg="Removing duplicate endpoint xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com []"
time="2022-09-07T06:38:45Z" level=debug msg="Adding myprefixcname-xyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Adding myprefixxyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Adding xyz.myexampledomain.com. to zone myexampledomain.com. [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=debug msg="Modifying endpoint: xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [{alias true}], setting aws/evaluate-target-health=true"
time="2022-09-07T06:38:45Z" level=debug msg="Modifying endpoint: xyz.myexampledomain.com 0 IN CNAME  internal-xxx-xxx.us-east-1.elb.amazonaws.com [], setting alias=true"
time="2022-09-07T06:38:45Z" level=info msg="Desired change: CREATE myprefixcname-xyz.myexampledomain.com TXT [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=info msg="Desired change: CREATE xyz.myexampledomain.com A [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:45Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='myprefixcname-xyz.myexampledomain.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 28ce0839-65f0-4937-897b-d368c34928bf"
time="2022-09-07T06:38:46Z" level=info msg="Desired change: CREATE myprefixxyz.myexampledomain.com TXT [Id: /hostedzone/Z31YJ4LEYXXXX]"
time="2022-09-07T06:38:46Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='myprefixcname-us-vorax.myexampledomain.com.', type='TXT'] but it already exists, Tried to create resource record set [name='myprefixxyz.myexampledomain.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4d0116d8-4bb2-44d1-a7e8-827564856577"

What you expected to happen: Records that already exist are left alone

How to reproduce it (as minimally and precisely as possible): Unknown so far

Environment: AWS

  • External-DNS version (use external-dns --version): v1.12.2
  • DNS provider: Route53
@2ZZ 2ZZ added the kind/bug Categorizes issue or PR as related to a bug. label Sep 7, 2022
@s-matyukevich

We got the same error. After some investigation, I figured out that it happens if you set the --txt-cache-interval parameter.

Looking at the code, I think the bug is here

After the first iteration, recordsCache won't contain the TXT records in the new format (those records are added to missingTXTRecords and later created in the cloud provider). So during the second iteration, if the registry cache is in use, the cache still won't contain the new TXT records and the registry will once again try to create them.
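The failure mode described above can be sketched with a toy registry cache. All names here (Registry, Sync, recordsCache, fixBug) are illustrative stand-ins, not the real external-dns types:

```go
package main

import "fmt"

// Toy model of the cache bug: newly created TXT records are written to the
// provider but never added to recordsCache, so every sync within the cache
// interval believes they are still missing and issues the CREATE again.
type Registry struct {
	recordsCache map[string]bool // TXT records the registry believes exist
	provider     map[string]bool // records actually present in Route 53
	fixBug       bool            // if true, update the cache after creating
}

// Sync returns the TXT records it attempted to CREATE this iteration.
func (r *Registry) Sync(desired []string) []string {
	var created []string
	for _, name := range desired {
		if r.recordsCache[name] {
			continue // cache says it exists; nothing to do
		}
		// The registry thinks the record is missing and issues a CREATE.
		// When it already exists in the provider, Route 53 rejects the
		// whole change batch with InvalidChangeBatch.
		created = append(created, name)
		r.provider[name] = true
		if r.fixBug {
			r.recordsCache[name] = true // the missing step
		}
	}
	return created
}

func main() {
	desired := []string{"myprefixxyz.myexampledomain.com."}

	buggy := &Registry{recordsCache: map[string]bool{}, provider: map[string]bool{}}
	fmt.Println("buggy 1st sync:", buggy.Sync(desired)) // creates the record
	fmt.Println("buggy 2nd sync:", buggy.Sync(desired)) // tries to create it again

	fixed := &Registry{recordsCache: map[string]bool{}, provider: map[string]bool{}, fixBug: true}
	fmt.Println("fixed 1st sync:", fixed.Sync(desired))
	fmt.Println("fixed 2nd sync:", fixed.Sync(desired)) // no-op, cache is warm
}
```

With the cache update in place, the second iteration sees the record in recordsCache and skips it, which matches why disabling --txt-cache-interval (forcing a fresh provider read each loop) also avoids the error.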

@benjimin

This issue seems very similar to #2421 and #2793 (possibly duplicates?).

Same error. Tried upgrading 0.11.0 → 0.13.1 (and bitnami chart 6.2.7 → 6.12.1), to no avail.

  • In our config we were already not setting --txt-cache-interval (only --interval), so that suggestion didn't help.
  • I noticed that when the latest container first starts, the logs show one successful batch of TXT updates (or re-"CREATE"s), whereas the error occurs on a different batch that happens to include one new entry. (Also, the InvalidChangeBatch message usually referred to only the first couple of Desired change entries in the batch.)

I instead tried modifying --aws-batch-change-size, and this did fix the problem. Originally it was set to 1000 (not sure why it was actually doing batches of 6-14?), and out of curiosity I changed it to 5 (rather than 1). The outcome (pasted below) is curious, because batches seem to partially succeed (complaining only about old records), or to succeed when they do not mix old and new records. I presume there was a recent external-dns format change; there now seem to be two TXT records for each substantive record?

time="2022-11-22T22:55:43Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-a-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-cname-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-cname-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:43Z" level=info msg="Desired change: CREATE cluster-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:44Z" level=error msg="Failure in zone dev.example. [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:44Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-old1.dev.example.', type='TXT'] but it already exists, Tried to create resource record set [name='cluster-old2.dev.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4a4b433f-ce5e-49c8-af91-b0e2bb2ea748"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE cluster-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE old1.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE old2.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:45Z" level=info msg="Desired change: CREATE new.dev.example A [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:55:46Z" level=info msg="4 record(s) in zone dev.example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-22T22:55:46Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-a-new.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-cname-old1.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="Desired change: CREATE cluster-cname-old2.dev.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-22T22:56:41Z" level=info msg="3 record(s) in zone dev.example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-22T22:56:41Z" level=info msg="All missing records are created"
time="2022-11-22T22:56:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:56:42Z" level=info msg="All records are already up to date"
time="2022-11-22T22:57:41Z" level=info msg="Applying provider record filter for domains: [dev.example. .dev.example.]"
time="2022-11-22T22:57:42Z" level=info msg="All records are already up to date"
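One reason shrinking --aws-batch-change-size helps: Route 53 applies each ChangeResourceRecordSets batch atomically, so a single duplicate CREATE rejects every other change submitted with it. A rough sketch of that interaction (the chunk/submit helpers are made up for illustration, not external-dns code):

```go
package main

import "fmt"

// A Route 53 change batch is atomic: one duplicate CREATE poisons every
// other change submitted with it. Smaller batches limit the blast radius
// of a single stale record.
type change struct {
	action string // "CREATE", "UPSERT", "DELETE"
	name   string
}

// chunk splits changes into batches of at most size elements.
func chunk(changes []change, size int) [][]change {
	var batches [][]change
	for len(changes) > 0 {
		n := size
		if len(changes) < n {
			n = len(changes)
		}
		batches = append(batches, changes[:n])
		changes = changes[n:]
	}
	return batches
}

// submit simulates Route 53: the whole batch fails if any CREATE targets
// a record that already exists; otherwise all changes are applied.
func submit(batch []change, existing map[string]bool) error {
	for _, c := range batch {
		if c.action == "CREATE" && existing[c.name] {
			return fmt.Errorf("InvalidChangeBatch: %s already exists", c.name)
		}
	}
	for _, c := range batch {
		existing[c.name] = true
	}
	return nil
}

func main() {
	existing := map[string]bool{"cluster-old1.dev.example.": true}
	changes := []change{
		{"CREATE", "cluster-old1.dev.example."}, // duplicate: poisons its batch
		{"CREATE", "new.dev.example."},          // innocent bystander
	}
	// One big batch: the innocent record is never created.
	fmt.Println(submit(changes, map[string]bool{"cluster-old1.dev.example.": true}))
	// Batch size 1: only the duplicate fails; the new record goes through.
	for _, b := range chunk(changes, 1) {
		fmt.Println(submit(b, existing))
	}
}
```

This matches the logs above: with smaller batches, the failures complain only about the old records, while batches containing only new records succeed.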

@benjimin

benjimin commented Nov 30, 2022

I have another (hopefully clearer) example of this problem.

Initially, it was attempting to create the 3 records (old-style TXT, new-style cname-TXT, and a substantive A record) for one ingress. This batch was consistently failing (with a message that the two TXT records already exist). I checked in Route 53 and confirmed that those TXT records already existed, and no substantive record for this ingress existed. The following errors were repeating ad infinitum:

time="2022-11-30T08:43:27Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE cluster-cname-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE cluster-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=info msg="Desired change: CREATE thisingress.example A [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:43:27Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-cname-thisingress.example.', type='TXT'] but it already exists, Tried to create resource record set [name='cluster-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: f822a731-deb0-4293-bba0-7c520780edf0"
time="2022-11-30T08:43:27Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"

I then set --aws-batch-change-size=1 (to disable batching). This still resulted in some transient error messages, but then the substantive record was created and the errors ceased thereafter:

time="2022-11-30T08:49:53Z" level=info msg="Created Kubernetes client https://x.x.x.x:443"
time="2022-11-30T08:49:57Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:49:58Z" level=info msg="Desired change: CREATE cluster-cname-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:49:58Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:49:58Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-cname-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 03748069-8235-422c-b460-9c62d8cfae5e"
time="2022-11-30T08:49:59Z" level=info msg="Desired change: CREATE cluster-thisingress.example TXT [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:00Z" level=error msg="Failure in zone example. [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:00Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cluster-thisingress.example.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 2aaa247b-2b52-44ad-a46b-c1d55c94d772"
time="2022-11-30T08:50:01Z" level=info msg="Desired change: CREATE thisingress.example A [Id: /hostedzone/ZZZ]"
time="2022-11-30T08:50:02Z" level=info msg="1 record(s) in zone example. [Id: /hostedzone/ZZZ] were successfully updated"
time="2022-11-30T08:50:02Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/ZZZ]"
time="2022-11-30T08:50:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:50:56Z" level=info msg="All records are already up to date"
time="2022-11-30T08:51:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:51:56Z" level=info msg="All records are already up to date"
time="2022-11-30T08:52:55Z" level=info msg="Applying provider record filter for domains: [example. .example.]"
time="2022-11-30T08:52:56Z" level=info msg="All records are already up to date"

I then also tried deleting just the new-style TXT record, or all three records. In both cases external-dns promptly replaced them with no errors. (Deleting only the substantive A record would replicate the transient error. Deleting only the old-style TXT record would be tolerated without any effect.)

As a separate experiment, I tried deleting and immediately recreating an annotated ClusterIP service (so that the target address would change), and I found that all three records were successfully UPSERT'ed, rather than getting re-CREATE'd. (Modifying the hostname annotation caused 3 deletions and 3 creations.)

Multiple aspects seem weird to me:

  • The substantive record is type A, whereas the new-style TXT specifies cname, not a. Checking in Route 53 (where we have dozens of subdomains managed by external-dns), I see that there is one case where the new-style TXT contains a (and this is for our most recently added subdomain, which is also unique in that its resource is an internal ClusterIP service rather than some form of ingress), but all the other new-style TXT records (for older subdomains) contain cname, even though they actually correspond to type A records. (We only have a handful of actual CNAME records, e.g. for CloudFront or ACM DNS validation, i.e. they are not managed by external-dns.) Presumably this is a problem with how aliased records (letting Route 53 resolve load balancer IPs, to boost performance and reduce cost) are handled by external-dns? #3164.

  • When a batch does fail, subsequent elements in the batch get skipped (if they are of a different type than the element or two which caused the failure). This becomes a critical problem when external-dns later keeps retrying the same batch (with the elements in the same order).

  • With batch size set to one, external-dns still causes errors by attempting to create TXT records where the substantive (type A) record is absent, even if that TXT record is already present. However, this error has no consequence because even after a failed batch, external-dns proceeds to attempt any other pending batches (without waiting for the next interval). This means that it will continue to create the substantive A records, and once those exist it will relent from trying to recreate the already-existing TXT records.

  • I'm unsure how it had spontaneously gotten into a state with a substantive record missing but corresponding TXT records present. (Is aliasing related? Similar to #3186)
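The batch-skipping behaviour described above suggests one mitigation: before submitting, compare the desired CREATEs against the records already in the zone and downgrade duplicates to UPSERTs, which Route 53 accepts idempotently. A hedged sketch under that assumption (reconcile is a made-up helper, not the fix that eventually landed):

```go
package main

import "fmt"

type change struct {
	action string
	name   string
}

// reconcile downgrades CREATEs for records that already exist in the zone
// to UPSERTs, so a stale TXT record can no longer poison a whole batch.
// Route 53's UPSERT action creates the record if absent and overwrites it
// if present.
func reconcile(desired []change, existing map[string]bool) []change {
	out := make([]change, 0, len(desired))
	for _, c := range desired {
		if c.action == "CREATE" && existing[c.name] {
			c.action = "UPSERT"
		}
		out = append(out, c)
	}
	return out
}

func main() {
	existing := map[string]bool{"cluster-thisingress.example.": true}
	desired := []change{
		{"CREATE", "cluster-thisingress.example."}, // already in Route 53
		{"CREATE", "thisingress.example."},         // genuinely new
	}
	for _, c := range reconcile(desired, existing) {
		fmt.Println(c.action, c.name)
	}
	// prints:
	// UPSERT cluster-thisingress.example.
	// CREATE thisingress.example.
}
```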

@jbilliau-rcd

jbilliau-rcd commented Feb 13, 2023

I'm having the exact same problem on multiple clusters: external-dns keeps trying to create records that already exist (it created them), and thus errors out nonstop. The worst part is that some of these unnecessary retries are batched in with real records I do need created due to new ingresses, so one app's DNS issues are now preventing other apps from having their DNS records created. Very annoying, and I can't figure out why it's happening.

@farazoman

I'm having the exact same issue on 0.12, and I'm in a similar boat as @jbilliau-rcd where the batch size is causing issues for the actual records I need. External-dns wants to create a record that prefixes "cname" to the actual DNS name for some reason, which is the same as the TXT record that was created before upgrading.

@nipr-jdoenges

This issue is blocking our upgrade from v0.11.0... the migration from the old-style to new-style TXT registry records seems fundamentally broken.

@jbilliau-rcd

Can we just get some sort of "overwrite" functionality added? If the record already exists, just overwrite it; why would I ever care about a record being overwritten with the same value?
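For what it's worth, Route 53 already exposes exactly this semantics via the UPSERT action in ChangeResourceRecordSets. A minimal change batch for one of the TXT records from the examples above would look like:

```json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "myprefixxyz.myexampledomain.com.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
          {
            "Value": "\"heritage=external-dns,external-dns/owner=myprefix,external-dns/resource=gateway/mynamespace/myapp\""
          }
        ]
      }
    }
  ]
}
```

UPSERT creates the record set if it is absent and replaces it if it exists, so a change batch using UPSERT instead of CREATE cannot fail with "but it already exists".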

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2023
@jbilliau-rcd

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2023

8 participants