
Some new TXT records are not being cleaned up, causing an "InvalidChangeBatch" error #3186

Open
born4new opened this issue Nov 23, 2022 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@born4new

What happened:

After deleting some ingress resources, it seems that the new TXT record is not cleaned up, while the other two DNS entries (the A record and the legacy TXT record) are. When searching for DNS records in AWS Route 53, this is what we see:

Searching for <our-dns-name>.

[]

Searching for a-<our-dns-name>.

    {
        "Name": "a-<our-dns-name>.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""
            }
        ]
    }
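For reference, the ownership value in that leftover TXT record is a comma-separated list of key=value labels. A minimal sketch of parsing it for inspection (a hypothetical helper; not part of external-dns itself):

```python
def parse_ownership_txt(value):
    """Parse an external-dns ownership TXT value into a dict of labels.

    The record value is a quoted, comma-separated list of key=value pairs,
    e.g. "heritage=external-dns,external-dns/owner=...".
    """
    pairs = value.strip('"').split(",")
    return dict(pair.split("=", 1) for pair in pairs)


labels = parse_ownership_txt(
    '"heritage=external-dns,external-dns/owner=my-owner,'
    'external-dns/resource=ingress/my-ingress"'
)
# labels["external-dns/owner"] is the owner ID that external-dns matches
# against its --txt-owner-id flag when deciding whether it may touch
# the record.
```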

This later causes an issue when we redeploy the application, as external-dns tries to create all three DNS entries (the A record, the legacy TXT, and the new TXT):

time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE a-<our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> A [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="Failure in zone <our-dns-zone>. [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='a-<our-dns-name>.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: d65bc8e2-4055-4d9f-8412-4653debd76ff"

What you expected to happen:

The new TXT record should be cleaned up in the first place; alternatively, external-dns could replace the TXT record if it already exists, or offer an option to do so.
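Route 53's ChangeResourceRecordSets API already supports an UPSERT action, which creates the record if absent and replaces it otherwise. A sketch of what the "replace if it already exists" idea could look like as a change entry (hand-built here for illustration; the record name is a placeholder, and this is not external-dns's actual code path):

```python
def build_txt_upsert(name, value, ttl=300):
    """Build a Route 53 change entry that creates-or-replaces a TXT record.

    Using UPSERT instead of CREATE would avoid the InvalidChangeBatch
    error that CREATE raises when a leftover record already exists.
    """
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "TXT",
            "TTL": ttl,
            "ResourceRecords": [{"Value": value}],
        },
    }


# The resulting dict has the shape boto3's route53 client accepts, e.g.:
#   client.change_resource_record_sets(
#       HostedZoneId="...",
#       ChangeBatch={"Changes": [build_txt_upsert(...)]},
#   )
change = build_txt_upsert("a-example.com.", '"heritage=external-dns"')
```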

How to reproduce it (as minimally and precisely as possible):

I do not know how to reproduce this issue easily, but I'm more than happy to provide as much debugging info as needed.

Anything else we need to know?:

N/A

Environment:

  • External-DNS version (use external-dns --version): 0.13.1
  • DNS provider: AWS
  • Others:
@born4new born4new added the kind/bug Categorizes issue or PR as related to a bug. label Nov 23, 2022
@rymai

rymai commented Nov 23, 2022

This definitely looks similar to #3007, #2421, and #2793.

@benjimin

@born4new does setting --aws-batch-change-size=1 resolve your problem? (i.e., is it purely the batching that is broken?)

@born4new
Author

born4new commented Dec 6, 2022

does setting --aws-batch-change-size=1 resolve your problem?

We haven't specifically tried a size of 1, but we have tried a few values (e.g. 20, 200, 1000), none of them helped.

The fix for us was to go back to an external-dns version below 0.12.0, so that external-dns wouldn't be aware of the newly introduced TXT record. This seems to indicate a problem in the way the new TXT records are cleaned up...

@JonathanLachapelle

We are facing the exact same issue.

@JonathanLachapelle

Does it happen on all records, or just sometimes?

@born4new
Author

Does it happen on all records, or just sometimes?

@JonathanLachapelle It was happening on some records only.

@xavidop

xavidop commented Dec 14, 2022

We faced the same issue today.
We are using AWS Route 53 and our external-dns version is 0.12.2:

{"level":"error","msg":"InvalidChangeBatch: [Tried to create resource record set [name='cname-runtime-api-dev-amy.development.voiceflow.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 4db33c47-f34f-4a36-8d60-b2cb0750578d","time":"2022-12-14T11:11:20Z"}

@ArturChe

I have faced the same issue after updating external-dns from version 0.12.0 to 0.13.1. Instead of syncing with the previously created TXT record graylog.<domain>, it tries to create cname-graylog.<domain>, which fails with the output below:

time="2022-12-14T11:49:54Z" level=error msg="InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-graylog.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-mongodb.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-tcp.graylog.<domain>.']\n\tstatus code: 400, request id: <Id>"
time="2022-12-14T11:49:54Z" level=info msg="Desired change: CREATE cname-graylog.<domain> TXT [Id: /hostedzone/<hostedzone>]"
...

@IKohli09

IKohli09 commented Jan 26, 2023

I have faced the same issue.
I brought up a new cluster with external-dns chart version 6.12.1, which uses image 0.13.1, but it errors out with InvalidChangeBatch when trying to create the cname-<domain> entry.

Also, when I switch back to version 0.11.0, it keeps deleting and recreating the Route 53 records instead of updating them, even though I am using --upsert-policy.

Desired change: CREATE 123.dev.cloud A
Desired change: CREATE 123.dev.cloud TXT
Applying provider record filter for domains
Desired change: CREATE 123.dev.cloud A
Desired change: CREATE 123.dev.cloud TXT

It's a huge blocker.

@liad5h

liad5h commented Feb 1, 2023

We are experiencing the same issue with version 0.13.1 and Kubernetes 1.21 or higher.
In our case, when the issue happens, external-dns stops processing requests until we go to AWS and manually remove the leftovers.

logs:

time="2023-02-01T12:55:01Z" level=error msg="Failure in zone qa.controlup.com. [Id: /hostedzone/XXXXXXXXXX]"
time="2023-02-01T12:55:01Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='cname-x.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: 8b8e55e1-efe0-452d-96da-af65ff122fca"
time="2023-02-01T12:55:01Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/XXXXXXXXXX]"

@msvticket

I'm having the same problem with version 0.13.1 and --aws-batch-change-size=100. I tried --aws-batch-change-size=1 and started to get warnings like

time="2023-02-08T15:32:34Z" level=warning msg="Total changes for xxx.yyy.zzz exceeds max batch size of 1, total changes: 2"

and the errors as described above kept coming.

So I tried --aws-batch-change-size=2 and that has actually resolved the problem for me.

@jbilliau-rcd

Same problem as well. I wish there were a "force-overwrite" option to tell external-dns to simply overwrite records; we have multiple clusters that hit this error and are seemingly stuck. The worst part is that valid, new ingresses never have their DNS records created, since they get batched up with these bogus retries.

@martinohmann

martinohmann commented Feb 16, 2023

We're facing the same issue with v0.13.2 and the suggested batch size changes do not work:

  • With --aws-batch-change-size=1: it tries to create the already existing TXT record, which fails. It does not even attempt to create the A record, presumably because the first batch change within the sync interval failed. This does not resolve itself eventually and continues like this in every sync interval.
  • With --aws-batch-change-size=2: it tries to create the A record and the already existing TXT record in a batch and this fails. Same behaviour as above, it's stuck.

The only option we have is to either manually create the A record, or to delete the existing TXT records so that external-dns can properly recreate everything.
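The manual cleanup described above can be approximated by scanning the zone for new-format ownership TXT records whose target record is gone. A sketch operating on record sets already fetched from Route 53 (the function name and the "cname-" prefix are illustrative assumptions; the actual prefix depends on the record type and any --txt-prefix setting):

```python
def find_orphaned_txt(records, prefix="cname-"):
    """Return prefixed ownership TXT records whose target record is gone.

    `records` is a list of Route 53 ResourceRecordSet-style dicts. A TXT
    record named `<prefix><name>` is considered orphaned when no non-TXT
    record exists for `<name>` itself, so it is a candidate for deletion
    before letting external-dns recreate everything.
    """
    live_names = {r["Name"] for r in records if r["Type"] != "TXT"}
    return [
        r
        for r in records
        if r["Type"] == "TXT"
        and r["Name"].startswith(prefix)
        and r["Name"][len(prefix):] not in live_names
    ]


zone = [
    {"Name": "cname-host.example.com.", "Type": "TXT"},  # leftover ownership record
    {"Name": "host.example.com.", "Type": "TXT"},        # legacy ownership record
]
# No A/CNAME record for host.example.com. remains, so the new-format
# TXT record is reported as orphaned.
orphans = find_orphaned_txt(zone)
```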

The expected behaviour would be to not attempt to create the TXT records again (if anything, it should upsert existing records).

Update: from what I can see, there's already a change in master which might partially fix this (7dd84a5), but it's still unreleased.

@cyril94440

Same problem here...

@Kulagin-G

Kulagin-G commented Jun 8, 2023

The same problem occurs after updating external-dns from 0.10.2 to 0.13.4.

Some details about the environment:

  1. Provider: aws
  2. EKS: 1.24.0

Details about the issue:

At the start we have 3 records:

  • (A) - alias for the LB, host.example.com
  • (TXT) - old-style TXT for backward compatibility, host.example.com
  • (TXT) - new-style TXT, cname-host.example.com
  1. Test - Removing the new-style TXT cname-host.example.com
    Result: looks OK, the record was restored.
    time="2023-06-08T13:05:04Z" level=info msg="Desired change: CREATE cname-host.example.com. TXT [Id: /hostedzone/xxx]"

  2. Test - Removing the old-style TXT host.example.com
    Result: looks OK, the record was restored.
    time="2023-06-08T13:07:05Z" level=debug msg="Adding host.example.com. [Id: /hostedzone/xxx]"

  3. Test - Removing both the old-style TXT and the new-style TXT
    Result: the records were not restored, and there were no errors or attempts in the logs.

  4. Test - Removing the alias host.example.com and both TXT records
    Result: OK, all 3 records were restored.
    time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]"
    time="2023-06-08T13:18:18Z" level=debug msg="Adding host.example.com. to zone xxx. [Id: /hostedzone/xxx]"
    time="2023-06-08T13:18:18Z" level=debug msg="Adding cname-host.example.com. to zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB]"
    time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com TXT [Id: /hostedzone/xxx]"
    time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE host.example.com A [Id: /hostedzone/xxx]"
    time="2023-06-08T13:18:19Z" level=info msg="Desired change: CREATE cname-host.example.com TXT [Id: /hostedzone/xxx]"

  5. Test - Removing the alias host.example.com only
    Result: failure, the alias was not restored.
    time="2023-06-08T13:22:23Z" level=error msg="Failure in zone xxx. [Id: /hostedzone/Z010946512D3RO332W8MB] when submitting change batch: InvalidChangeBatch: [Tried to create resource record set [name='cname-host.example.com.', type='TXT'] but it already exists, Tried to create resource record set [name='host.example.com.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: xxx"
    time="2023-06-08T13:22:24Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/xxx]"

I guess a force override wouldn't lead to "Rate exceeded" issues with the AWS API, because losing an alias record is a very rare case, for us at least.
Still, the current behavior is quite uncomfortable and unexpected; I want to be 100% sure that all our records will be restored automatically if anything goes wrong.

Additionally, it's odd that I don't see any logs at all for test 3.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@ddieulivol

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot


/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2024
@CameronMackenzie99

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2024
@rookiehelm

I'm seeing this issue when installing v0.14.1 on a brand new EKS 1.25.

@sileyang-sf

Same issue happened in our EKS cluster in version 1.26.

@rookiehelm

rookiehelm commented May 14, 2024

Hi guys, I was able to resolve my errors. A couple of pointers that helped:

  • First, the external-dns repo has various branches tagged with release versions, but the release versions don't correspond directly to the image versions hosted on GCR.
  • My issue was resolved after I used the following image: registry.k8s.io/external-dns/external-dns:v0.14.1. Also follow the instructions from the branch tagged v0.14.1 (not master or some other branch).
  • In my case, my cluster was set up using Terraform scripts, as I needed to deploy Kubeflow. I had accidentally configured the IRSA using the 'eksctl' command, which was incorrect. The docs suggest creating the service account directly via 'kubectl'. Be careful here: I had to manually delete the previous SA and re-create it using the right commands. After that, everything worked fine.
  • I also needed to configure the 'ingress-nginx' controller first (not later), as 'external-dns' needs to work with the load balancer when creating the records (correct me if I'm wrong here).
