
designate: fix deletion of TXT records #1255

Conversation

@mcayland commented Nov 4, 2019

This patchset contains a fix for the designate provider not deleting TXT records (see #1122), and along the way implements ProviderSpecificProperty support in the txt registry.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 4, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mcayland
To complete the pull request process, please assign raffo
You can assign the PR to them by writing /assign @raffo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Welcome @mcayland!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 4, 2019
@mcayland (Author) commented Nov 4, 2019

##[error]Resource not accessible by integration
##[error]Node run failed with exit code 1

I'm not sure whether this pull request labeler failure is caused by the changes in this PR?

@njuettner (Member)

I thought I disabled it, sorry for confusing you. This has nothing to do with your PR.
Forks currently don't work, see actions/labeler#12.
I'll create a new PR to drop it completely. My hope was to add labels automatically so that reviewers could identify the context of PRs sooner 😞.

@njuettner (Member)

I saw you also updated the docs; there's already a PR for that: #1254. Maybe have a look there first and see if anything is missing (please comment); if not, I would suggest dropping the doc change from your PR and leaving the rest.

I will have a look ASAP.

Thanks again 👍.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 0224647 to 6d3ab36 Compare November 4, 2019 16:15
@mcayland (Author) commented Nov 4, 2019

Yes indeed - looks like I missed the documentation change by just a few hours :) #1254 looks good enough to me, so I've just repushed the branch with the documentation changes dropped.

@njuettner (Member)

If I got this right, this will add provider specifics inside the TXT records; if that's the case, won't there be a cap for TXT records? At least on AWS we had this issue once: there is a limit of only 255 characters.

@mcayland (Author) commented Nov 5, 2019

Well, that's not the case here; at least when testing this PR locally I get 2 records similar to the following generated in designate DNS:

foo.my.domain A 1.2.3.4
foo.my.domain TXT "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/my-namespace/my-service"

Maybe part of this could be due to me misunderstanding the exact purpose of a Label: my understanding from reading the source is that Endpoint.Labels is a map of data that comes from the provider, and the txt registry implementation simply serialises the contents of the map into the TXT record.

For this reason I looked at the code and noticed that the designate ids for the zone and recordsets are retrieved from designate directly based upon externalName; these are then used to populate the Endpoint. Given that these ids are only used internally, it seemed that ProviderSpecific was the right way to store this information within the Endpoint without exposing it, which is why I implemented it.

However, just to confuse things more, the designate provider apparently (ab)uses Endpoint.Labels to store the internal zone and recordset ids, even though they are not visible as part of the TXT record. I guess this is because ProviderSpecific didn't exist at the time the provider was written? I also see there is still a "designateOriginalRecords" Label which should be switched to ProviderSpecific if this interpretation is correct.

If you can clarify whether Label or ProviderSpecific is the correct mechanism to attach the internal ids to Endpoints then that will help decide what the correct approach should be.
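
In the meantime, to make the distinction concrete as I currently understand it, here is a minimal, self-contained sketch (hypothetical types, not the actual external-dns structs): Labels end up serialised into the ownership TXT record, while provider-specific data only rides along on the Endpoint in memory.

```go
// Minimal sketch (hypothetical types, not the real external-dns structs) of
// the two mechanisms: Labels are serialised into the ownership TXT record,
// ProviderSpecific data stays on the Endpoint and never reaches DNS.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Endpoint struct {
	DNSName          string
	Target           string
	RecordType       string
	Labels           map[string]string // persisted via the TXT registry
	ProviderSpecific map[string]string // internal only, never written to DNS
}

// serializeLabels builds the TXT payload the registry would write, roughly
// in the "heritage=...,external-dns/..." format shown above.
func serializeLabels(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{"heritage=external-dns"}
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("external-dns/%s=%s", k, labels[k]))
	}
	return strings.Join(parts, ",")
}

func main() {
	ep := Endpoint{
		DNSName:    "foo.my.domain",
		Target:     "1.2.3.4",
		RecordType: "A",
		Labels: map[string]string{
			"owner":    "default",
			"resource": "service/my-namespace/my-service",
		},
		// Hypothetical designate-internal ids; kept off the wire.
		ProviderSpecific: map[string]string{
			"designate-zone-id":      "aaaa-1111",
			"designate-recordset-id": "bbbb-2222",
		},
	}
	fmt.Printf("%s TXT %q\n", ep.DNSName, serializeLabels(ep.Labels))
	fmt.Println("provider-specific (internal only):", ep.ProviderSpecific)
}
```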

@njuettner (Member) commented Nov 6, 2019

@mcayland your explanation is correct.
Labels are persistent, e.g. to set the OWNER. ProviderSpecific should be ephemeral.

After reviewing again it seems to be fine, but we want to run it in a test cluster again to check whether this also works with other providers. In the past we had some issues when we enabled cache sync and it tried to recreate records again and again due to provider-specific properties.

@mcayland (Author) commented Nov 6, 2019

Thanks @njuettner! In that case the one remaining designateOriginalRecords property should also be converted from a Label to a ProviderSpecific. Let me do that first so that everything is consistent, and then repush.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 6d3ab36 to 0ae6375 Compare November 7, 2019 21:03
@mcayland (Author) commented Nov 7, 2019

@njuettner I've now pushed an updated version of this branch. I made a few minor tweaks from the previous version:

  • Simplify the txt registry ProviderSpecific code (makes it more similar to how the existing code for Labels works)

  • Convert designateOriginalRecords from a Label to a ProviderSpecific with prefix

  • Update the commit messages to better explain the changes

Let me know how the testing goes, and if something doesn't quite work I can try my best to help.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 0ae6375 to 6e81ee7 Compare November 8, 2019 11:01
@mcayland (Author) commented Nov 8, 2019

@njuettner after sleeping on this, I've realised that adding a prefix to namespace the ProviderSpecific properties just masks the real problem, which is that ProviderSpecific properties are not preserved for TXT records (re)generated by the txt registry implementation of ApplyChanges().

Fortunately it is possible to fix this using a similar method to the one already used for Labels, and once that is working all that remains is to switch the relevant internal properties from Labels to ProviderSpecific properties; then everything works as intended.

I think this is a much better solution, since all the provider has to do is attach ProviderSpecific properties to the original records, which are then visible end-to-end all the way through to the provider's ApplyChanges() implementation; there is no need for the provider to have any preconceived ideas about overlapping property namespaces.
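
As a rough illustration of what I mean (a simplified sketch with hypothetical types, not the real txt registry code), the idea is to copy the provider-specific properties from the matching current endpoint onto the endpoints the registry regenerates, just as labels are already carried across:

```go
// Simplified sketch of the idea (hypothetical types, not the real registry
// code): when the registry rebuilds the change set, copy provider-specific
// properties from the matching current endpoint onto the regenerated one.
package main

import "fmt"

type Endpoint struct {
	DNSName          string
	RecordType       string
	Labels           map[string]string
	ProviderSpecific map[string]string
}

// carryOverProviderSpecific copies provider-specific properties from the
// endpoints we currently know about onto the freshly generated ones.
func carryOverProviderSpecific(current, generated []*Endpoint) {
	byKey := make(map[string]*Endpoint, len(current))
	for _, ep := range current {
		byKey[ep.DNSName+"/"+ep.RecordType] = ep
	}
	for _, ep := range generated {
		if cur, ok := byKey[ep.DNSName+"/"+ep.RecordType]; ok {
			if ep.ProviderSpecific == nil {
				ep.ProviderSpecific = map[string]string{}
			}
			for k, v := range cur.ProviderSpecific {
				ep.ProviderSpecific[k] = v
			}
		}
	}
}

func main() {
	current := []*Endpoint{{
		DNSName:          "foo.my.domain",
		RecordType:       "TXT",
		ProviderSpecific: map[string]string{"designate-recordset-id": "bbbb-2222"},
	}}
	generated := []*Endpoint{{DNSName: "foo.my.domain", RecordType: "TXT"}}
	carryOverProviderSpecific(current, generated)
	fmt.Println(generated[0].ProviderSpecific) // map[designate-recordset-id:bbbb-2222]
}
```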

@mcayland (Author)

@njuettner have you had a chance to put this latest version through its paces on your test cluster? Or is there anything else you're waiting for from me?

@njuettner (Member)

@mcayland could you do a rebase again?

@mcayland (Author) commented Nov 19, 2019

@njuettner I've tried a rebase locally, but something appears to be broken elsewhere in external-dns. The build completes fine and all regression tests pass, but in the logs I keep seeing repeated messages like this:

E1119 17:19:15.170276       1 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)

... and the DNS records seem to be ping-ponging out of existence. Any idea what is going on here?

@mcayland (Author) commented Nov 20, 2019

@njuettner there seem to be 2 separate issues here. The first is the constant stream of error messages of the form "round_trippers.go:174] CancelRequest not implemented by *instrumented_http.Transport" and "streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)" in the log files, which I've traced down to this PR:

Merge: 6e1fae67 b0e1688b
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date:   Thu Nov 7 02:32:43 2019 -0800

    Merge pull request #1195 from timja/azure-private-dns
    
    Upgrade client-go + azure sdk

Unfortunately bisecting this exactly is proving to be difficult since PR review doesn't appear to require squashing of build fixes into the original commit before merge :( My guess is that this may be something internal to the Azure SDK but I don't really know enough about this to diagnose it further.

@mcayland (Author) commented Nov 20, 2019

@njuettner the issue with the "ping-pong" creation and destruction of records I managed to bisect down to this commit:

Author: Alfred Krohmer <devkid@devkid.net>
Date:   Sat Jun 8 21:53:13 2019 +0200

    Fixes and cleanup

This is part of PR #1008 and it seems to change the way ProviderSpecific properties are handled, but the commit message does not really give any detail as to why these changes are required or what they were trying to fix.

@devkid can you clarify further the intention of the ProviderSpecific changes here?

(Edit: one additional point here: the ping-pong occurs only with this PR applied on top of the above commit, which makes me believe that something in this commit has subtly changed the ProviderSpecific semantics.)

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 6e81ee7 to 516dbbd Compare November 20, 2019 12:13
@mcayland (Author)

@njuettner I've just repushed the rebased branch for this PR, which, although it builds and passes regression, is stuck in a "ping-pong" cycle of creating and then destroying the records on alternate runs.

For reference the previous working branch is here: https://github.com/mcayland/external-dns/tree/designate/fix-delete-TXT-records.good

@alfredkrohmer (Contributor) commented Nov 20, 2019

@mcayland The intention of the fix was: previously shouldUpdateProviderSpecific only returned true when a provider-specific attribute was updated; the change causes it to also return true when a provider-specific attribute is added or removed.
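
Roughly speaking (a simplified sketch of the idea, not the exact external-dns code), the comparison now has to look in both directions, so a property that exists on only one side also counts as a change:

```go
// Simplified sketch of the new behaviour (not the exact external-dns code):
// a difference is reported when a provider-specific property changes, is
// added, or is removed between the desired and current endpoint.
package main

import "fmt"

func shouldUpdateProviderSpecific(desired, current map[string]string) bool {
	for k, v := range current {
		if dv, ok := desired[k]; !ok || dv != v {
			return true // property removed or value changed
		}
	}
	for k := range desired {
		if _, ok := current[k]; !ok {
			return true // property added
		}
	}
	return false
}

func main() {
	current := map[string]string{}
	desired := map[string]string{"aws/weight": "100"} // hypothetical property
	// Previously this case (attribute added, nothing updated) was not flagged;
	// with the fix it is, and the record gets updated.
	fmt.Println(shouldUpdateProviderSpecific(desired, current)) // true
}
```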

@mcayland (Author)

@devkid Thanks for the info, I can see why that change is needed. I think the designate provider has some similar logic internally to try and track changes vs. update/remove:

https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L45
https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L326
https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L349

Since this PR also switches the per-Endpoint data from Labels to ProviderSpecific, I wonder if these 2 things are in conflict with each other?

@alfredkrohmer (Contributor)

Could you describe in more detail what behavior you are seeing? Do you mean records are created in one iteration, then deleted in the next one, then created again etc.?

@mcayland (Author) commented Nov 20, 2019

@devkid Yes, that's exactly it. I've spent a few more hours this afternoon digging into this and I think I understand why this is happening:

  • The problem is specifically caused by the changes to shouldUpdateProviderSpecific() in commit b2a3e88 "Fixes and cleanup". If I revert that part of the diff then everything works again.

  • Comparing the good and bad versions, the difference is that the Plan now sees a difference between the desired state "UpdateNew" and the existing state "UpdateOld":

INFO[0001] plan input current ep: foo.com 0 IN A  172.29.12.171 [{designate-recordset-id cd01b341-f644-43be-86b1-91ca6c2ed165} {designate-record-id 7242c121-4acd-4509-a98f-d370704a8186} {designate-original-records 172.29.12.171}]

vs.

INFO[0001] plan desired current ep: foo.com 0 IN A  172.29.12.171 []

Note that the desired state does not contain the ProviderSpecific information: my guess is that this is because the desired Endpoint is generated from service.go's Endpoints() function.

And in this case, as discussed above with @njuettner, the ProviderSpecific entries for designate are ephemeral: with your change to shouldUpdateProviderSpecific(), the current and desired state will never match because the service Endpoint will never contain the ephemeral properties, which is why we see this continual loop of record creation and deletion.
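
A tiny sketch of the failure mode (hypothetical code, not the real plan logic): the desired endpoint built from the Service never carries the ephemeral designate ids, while the endpoint read back from designate always does, so a comparison that counts added/removed properties as changes fires on every reconcile.

```go
// Hypothetical sketch of the ping-pong loop (not the real plan code):
// the two sides can never converge, so an update is scheduled every cycle.
package main

import "fmt"

// differs treats any added, removed or changed property as a difference,
// mirroring the new shouldUpdateProviderSpecific behaviour.
func differs(desired, current map[string]string) bool {
	if len(desired) != len(current) {
		return true
	}
	for k, v := range current {
		if desired[k] != v {
			return true
		}
	}
	return false
}

func main() {
	// From service.go Endpoints(): no internal ids are ever attached.
	desired := map[string]string{}
	for cycle := 1; cycle <= 3; cycle++ {
		// Reading the record back from designate re-attaches the ephemeral ids.
		current := map[string]string{
			"designate-recordset-id": "cd01b341-f644-43be-86b1-91ca6c2ed165",
			"designate-record-id":    "7242c121-4acd-4509-a98f-d370704a8186",
		}
		fmt.Printf("cycle %d: update needed = %v\n", cycle, differs(desired, current))
	}
}
```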

@mcayland (Author)

I've just updated the related issue at #1122 to keep this alive.

@njuettner @devkid thank you for your comments so far, but do we have any answers to #1255 (comment) yet?

@kenperkins

The fact that designate doesn't work with policy=sync is exceptionally frustrating and troublesome for those of us with Designate backends. How can I help here?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2020
@k8s-ci-robot (Contributor)

@mcayland: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 30, 2020
@kenperkins

Bumping this back up as it is still an active problem with the designate provider.

@seanmalloy (Member)

/remove-lifecycle rotten

@mcayland are you planning on finishing this pull request? Looks like the PR needs a rebase and there are two files with conflicts.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 11, 2020
@seanmalloy (Member)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2020
@mcayland (Author)

Sorry about the late reply, and thanks for all the pings. Due to COVID my contract working on OpenStack was extended, but it is now due to terminate in the next month, after which I will lose access to the cluster. It may be that I get a few odd days towards the end of next month to look at this again, although any time and testing will be quite limited.

What's the current state of upstream regarding ProviderSpecific and also adding ModifyEndpoints to the Provider interface as suggested by @devkid?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2020
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 18, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stefanandres commented Mar 15, 2022

@mcayland thank you for your PR from 2019. Alas this issue is still a problem in 2022.
Do you know of any updates regarding this issue?

@mcayland (Author)

> @mcayland thank you for your PR from 2019. Alas this issue is still a problem in 2022. Do you know of any updates regarding this issue?

@stefanandres no updates from this end I'm afraid. As documented in the original issue #1122 above, the work was done under contract for a client OpenStack/k8s installation, and I was given permission to submit the work upstream.

Unfortunately due to issues with the changes to ProviderSpecific I wasn't able to get the PR merged before the contract finished, so I no longer have access to the installation in question.

I still get emails from people asking about this feature, so there is clearly still interest. If someone were willing to sponsor the work and provide access to a test OpenStack/k8s installation then I would certainly be interested to revisit this with the aim of getting a fix merged upstream.
