
designate: fix deletion of TXT records #1255

Conversation

@mcayland commented Nov 4, 2019

This patchset contains a fix for the designate provider not deleting TXT records (see #1122), and along the way implements ProviderSpecificProperty support in the txt registry.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 4, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mcayland
To complete the pull request process, please assign raffo
You can assign the PR to them by writing /assign @raffo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

Welcome @mcayland!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 4, 2019
@mcayland (Author) commented Nov 4, 2019

##[error]Resource not accessible by integration
##[error]Node run failed with exit code 1

I'm not sure whether this pull request labeler failure is caused by the changes in this PR?

@njuettner (Member)

I thought I disabled it, sorry for confusing you. This has nothing to do with your PR.
Forks currently don't work, see actions/labeler#12.
I'll create a new PR to drop it completely. My hope was to add labels automatically so that reviewers could identify the context of PRs sooner 😞.

@njuettner (Member)

I saw you also updated the docs; there's already a PR for that: #1254. Maybe have a look there first and see if anything is missing (please comment); if not, I would suggest dropping the doc change from your PR and leaving the rest.

I will have a look ASAP.

Thanks again 👍.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 0224647 to 6d3ab36 Compare November 4, 2019 16:15
@mcayland (Author) commented Nov 4, 2019

Yes indeed - looks like I missed the documentation change by just a few hours :) #1254 looks good enough to me, so I've just repushed the branch with the documentation changes dropped.

@njuettner (Member)

If I got this right, this will add provider specifics inside the TXT records; if that's the case, won't there be a cap for TXT records? At least on AWS we had this issue once: there is a limit of only 255 characters.

@mcayland (Author) commented Nov 5, 2019

Well, that's not the case here; at least when testing this PR locally I get 2 records similar to the following generated in designate DNS:

foo.my.domain A 1.2.3.4
foo.my.domain TXT "heritage=external-dns,external-dns/owner=default,external-dns/resource=service/my-namespace/my-service"

Maybe part of this could be due to me misunderstanding the exact purpose of a Label: my understanding from reading the source is that Endpoint.Labels is a map of data that comes from the provider, and the txt registry implementation simply serialises the contents of the map into the TXT record.

For this reason I looked at the code and noticed that the designate ids for the zone and recordsets are retrieved from designate directly based upon externalName; these are then used to populate the Endpoint. Given that these ids are only used internally, it seemed that ProviderSpecific was the right way to store this information within the Endpoint without exposing it, which is why I implemented it.

However, just to confuse things more, the designate provider apparently (ab)uses Endpoint.Labels to store the internal zone and recordset ids, even though they are not visible as part of the TXT record. I guess this is because ProviderSpecific didn't exist at the time the provider was written? I also see there is still a "designateOriginalRecords" Label which should be switched to ProviderSpecific if this interpretation is correct.

If you can clarify whether Label or ProviderSpecific is the correct mechanism to attach the internal ids to Endpoints then that will help decide what the correct approach should be.
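
In the meantime, to make the distinction concrete as I currently understand it, here is a minimal, self-contained sketch (hypothetical types, not the actual external-dns structs): Labels end up serialised into the ownership TXT record, while provider-specific data only rides along on the Endpoint in memory.

```go
// Minimal sketch (hypothetical types, not the real external-dns structs) of
// the two mechanisms: Labels are serialised into the ownership TXT record,
// ProviderSpecific data stays on the Endpoint and never reaches DNS.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Endpoint struct {
	DNSName          string
	Target           string
	RecordType       string
	Labels           map[string]string // persisted via the TXT registry
	ProviderSpecific map[string]string // internal only, never written to DNS
}

// serializeLabels builds the TXT payload the registry would write, roughly
// in the "heritage=...,external-dns/..." format shown above.
func serializeLabels(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{"heritage=external-dns"}
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("external-dns/%s=%s", k, labels[k]))
	}
	return strings.Join(parts, ",")
}

func main() {
	ep := Endpoint{
		DNSName:    "foo.my.domain",
		Target:     "1.2.3.4",
		RecordType: "A",
		Labels: map[string]string{
			"owner":    "default",
			"resource": "service/my-namespace/my-service",
		},
		// Hypothetical designate-internal ids; kept off the wire.
		ProviderSpecific: map[string]string{
			"designate-zone-id":      "aaaa-1111",
			"designate-recordset-id": "bbbb-2222",
		},
	}
	fmt.Printf("%s TXT %q\n", ep.DNSName, serializeLabels(ep.Labels))
	fmt.Println("provider-specific (internal only):", ep.ProviderSpecific)
}
```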

@njuettner (Member) commented Nov 6, 2019

@mcayland your explanation is correct.
Labels are persistent, e.g. to set the OWNER. ProviderSpecific should be ephemeral.

After reviewing again it seems to be fine, but we want to run it in a test cluster again to check whether this also works with other providers. In the past we had some issues when we enabled cache sync and it tried to recreate records again and again due to provider-specific properties.

@mcayland (Author) commented Nov 6, 2019

Thanks @njuettner! In that case the one remaining designateOriginalRecords property should also be converted from a Label to a ProviderSpecific. Let me do that first so that everything is consistent, and then repush.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 6d3ab36 to 0ae6375 Compare November 7, 2019 21:03
@mcayland (Author) commented Nov 7, 2019

@njuettner I've now pushed an updated version of this branch. I made a few minor tweaks from the previous version:

  • Simplify the txt registry ProviderSpecific code (makes it more similar to how the existing code for Labels works)

  • Convert designateOriginalRecords from a Label to a ProviderSpecific with prefix

  • Update the commit messages to better explain the changes

Let me know how the testing goes, and if something doesn't quite work I can try my best to help.

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 0ae6375 to 6e81ee7 Compare November 8, 2019 11:01
@mcayland (Author) commented Nov 8, 2019

@njuettner after sleeping on this, I've realised that adding a prefix to namespace the ProviderSpecific properties just masks the real problem, which is that ProviderSpecific properties are not preserved for TXT records (re)generated by the txt registry implementation of ApplyChanges().

Fortunately it is possible to fix this using a similar method to the one already used for Labels, and once that is working all that remains is to switch the relevant internal properties from Labels to ProviderSpecific properties; then everything works as intended.

I think this is a much better solution, since all the provider has to do is attach ProviderSpecific properties to the original records, which are then visible end-to-end all the way through to the provider's ApplyChanges() implementation; there is no need for the provider to have any preconceived ideas about overlapping property namespaces.
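
As a rough illustration of what I mean (a simplified sketch with hypothetical types, not the real txt registry code), the idea is to copy the provider-specific properties from the matching current endpoint onto the endpoints the registry regenerates, just as labels are already carried across:

```go
// Simplified sketch of the idea (hypothetical types, not the real registry
// code): when the registry rebuilds the change set, copy provider-specific
// properties from the matching current endpoint onto the regenerated one.
package main

import "fmt"

type Endpoint struct {
	DNSName          string
	RecordType       string
	Labels           map[string]string
	ProviderSpecific map[string]string
}

// carryOverProviderSpecific copies provider-specific properties from the
// endpoints we currently know about onto the freshly generated ones.
func carryOverProviderSpecific(current, generated []*Endpoint) {
	byKey := make(map[string]*Endpoint, len(current))
	for _, ep := range current {
		byKey[ep.DNSName+"/"+ep.RecordType] = ep
	}
	for _, ep := range generated {
		if cur, ok := byKey[ep.DNSName+"/"+ep.RecordType]; ok {
			if ep.ProviderSpecific == nil {
				ep.ProviderSpecific = map[string]string{}
			}
			for k, v := range cur.ProviderSpecific {
				ep.ProviderSpecific[k] = v
			}
		}
	}
}

func main() {
	current := []*Endpoint{{
		DNSName:          "foo.my.domain",
		RecordType:       "TXT",
		ProviderSpecific: map[string]string{"designate-recordset-id": "bbbb-2222"},
	}}
	generated := []*Endpoint{{DNSName: "foo.my.domain", RecordType: "TXT"}}
	carryOverProviderSpecific(current, generated)
	fmt.Println(generated[0].ProviderSpecific) // map[designate-recordset-id:bbbb-2222]
}
```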

@mcayland (Author)

@njuettner have you had a chance to put this latest version through its paces on your test cluster? Or is there anything else you're waiting for from me?

@njuettner (Member)

@mcayland could you do a rebase again?

@mcayland (Author) commented Nov 19, 2019

@njuettner I've tried a rebase locally, but something appears to be broken elsewhere in external-dns. The build completes fine and all regression tests pass, but in the logs I keep seeing repeated messages like this:

E1119 17:19:15.170276       1 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)

... and the DNS records seem to be ping-ponging out of existence. Any idea what is going on here?

@mcayland (Author) commented Nov 20, 2019

@njuettner there seem to be 2 separate issues here. The first is the constant stream of error messages of the form "round_trippers.go:174] CancelRequest not implemented by *instrumented_http.Transport" and "streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)" in the log files, which I've traced down to this PR:

Merge: 6e1fae67 b0e1688b
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date:   Thu Nov 7 02:32:43 2019 -0800

    Merge pull request #1195 from timja/azure-private-dns
    
    Upgrade client-go + azure sdk

Unfortunately bisecting this exactly is proving to be difficult since PR review doesn't appear to require squashing of build fixes into the original commit before merge :( My guess is that this may be something internal to the Azure SDK but I don't really know enough about this to diagnose it further.

@mcayland (Author) commented Nov 20, 2019

@njuettner the issue with the "ping-pong" creation and destruction of records I managed to bisect down to this commit:

Author: Alfred Krohmer <devkid@devkid.net>
Date:   Sat Jun 8 21:53:13 2019 +0200

    Fixes and cleanup

This is part of PR #1008 and it seems to change the way ProviderSpecific properties are handled, but the commit message does not really give any detail as to why these changes are required or what they were trying to fix.

@devkid can you clarify further the intention of the ProviderSpecific changes here?

(Edit: one additional point here: the ping-pong occurs only with this PR applied on top of the above commit, which makes me believe that something in this commit has subtly changed the ProviderSpecific semantics.)

@mcayland mcayland force-pushed the designate/fix-delete-TXT-records branch from 6e81ee7 to 516dbbd Compare November 20, 2019 12:13
@mcayland (Author)

@njuettner I've just repushed the rebased branch for this PR, which, although it builds and passes regression, is stuck in a "ping-pong" cycle of creating and then destroying the records on alternate runs.

For reference the previous working branch is here: https://github.com/mcayland/external-dns/tree/designate/fix-delete-TXT-records.good

@alfredkrohmer (Contributor) commented Nov 20, 2019

@mcayland The intention of the fix was: previously shouldUpdateProviderSpecific only returned true when a provider-specific attribute was updated; the change causes it to also return true when a provider-specific attribute is added or removed.
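
Roughly speaking (a simplified sketch of the idea, not the exact external-dns code), the comparison now has to look in both directions, so a property that exists on only one side also counts as a change:

```go
// Simplified sketch of the new behaviour (not the exact external-dns code):
// a difference is reported when a provider-specific property changes, is
// added, or is removed between the desired and current endpoint.
package main

import "fmt"

func shouldUpdateProviderSpecific(desired, current map[string]string) bool {
	for k, v := range current {
		if dv, ok := desired[k]; !ok || dv != v {
			return true // property removed or value changed
		}
	}
	for k := range desired {
		if _, ok := current[k]; !ok {
			return true // property added
		}
	}
	return false
}

func main() {
	current := map[string]string{}
	desired := map[string]string{"aws/weight": "100"} // hypothetical property
	// Previously this case (attribute added, nothing updated) was not flagged;
	// with the fix it is, and the record gets updated.
	fmt.Println(shouldUpdateProviderSpecific(desired, current)) // true
}
```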

@mcayland (Author)

@devkid Thanks for the info, I can see why that change is needed. I think the designate provider has some similar logic internally to try and track changes vs. update/remove:

https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L45
https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L326
https://github.com/kubernetes-sigs/external-dns/blob/master/provider/designate.go#L349

Since this PR also switches the per-Endpoint data from Labels to ProviderSpecific, I wonder if these 2 things are in conflict with each other?

@alfredkrohmer (Contributor)

Could you describe in more detail what behavior you are seeing? Do you mean records are created in one iteration, then deleted in the next one, then created again etc.?

@mcayland (Author) commented Nov 20, 2019

@devkid Yes, that's exactly it. I've spent a few more hours this afternoon digging into this and I think I understand why this is happening:

  • The problem is specifically caused by the changes to shouldUpdateProviderSpecific() in commit b2a3e88 "Fixes and cleanup". If I revert that part of the diff then everything works again.

  • Comparing the good and bad versions, the difference is that the Plan now sees a difference between the desired state "UpdateNew" and the existing state "UpdateOld":

INFO[0001] plan input current ep: foo.com 0 IN A  172.29.12.171 [{designate-recordset-id cd01b341-f644-43be-86b1-91ca6c2ed165} {designate-record-id 7242c121-4acd-4509-a98f-d370704a8186} {designate-original-records 172.29.12.171}]

vs.

INFO[0001] plan desired current ep: foo.com 0 IN A  172.29.12.171 []

Note that the desired state does not contain the ProviderSpecific information: my guess is that this is because the desired Endpoint is generated from service.go's Endpoints() function.

And in this case, as discussed above with @njuettner, the ProviderSpecific entries for designate are ephemeral: with your change to shouldUpdateProviderSpecific(), the current and desired state will never match because the service Endpoint will never contain the ephemeral properties, which is why we see this continual loop of record creation and deletion.
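
A tiny sketch of the failure mode (hypothetical code, not the real plan logic): the desired endpoint built from the Service never carries the ephemeral designate ids, while the endpoint read back from designate always does, so a comparison that counts added/removed properties as changes fires on every reconcile.

```go
// Hypothetical sketch of the ping-pong loop (not the real plan code):
// the two sides can never converge, so an update is scheduled every cycle.
package main

import "fmt"

// differs treats any added, removed or changed property as a difference,
// mirroring the new shouldUpdateProviderSpecific behaviour.
func differs(desired, current map[string]string) bool {
	if len(desired) != len(current) {
		return true
	}
	for k, v := range current {
		if desired[k] != v {
			return true
		}
	}
	return false
}

func main() {
	// From service.go Endpoints(): no internal ids are ever attached.
	desired := map[string]string{}
	for cycle := 1; cycle <= 3; cycle++ {
		// Reading the record back from designate re-attaches the ephemeral ids.
		current := map[string]string{
			"designate-recordset-id": "cd01b341-f644-43be-86b1-91ca6c2ed165",
			"designate-record-id":    "7242c121-4acd-4509-a98f-d370704a8186",
		}
		fmt.Printf("cycle %d: update needed = %v\n", cycle, differs(desired, current))
	}
}
```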

@mcayland (Author)

I've just updated the related issue at #1122 to keep this alive.

@njuettner @devkid thank you for your comments so far, but do we have any answers to #1255 (comment) yet?

@kenperkins

The fact that designate doesn't work with policy=sync is exceptionally frustrating and troublesome for those of us with Designate backends. How can I help here?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2020
@k8s-ci-robot (Contributor)

@mcayland: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 30, 2020
@kenperkins

Bumping this back up as it is still an active problem with the designate provider.

@seanmalloy (Member)

/remove-lifecycle rotten

@mcayland are you planning on finishing this pull request? Looks like the PR needs a rebase and there are two files with conflicts.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 11, 2020
@seanmalloy (Member)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2020
@mcayland (Author)

Sorry about the late reply, and thanks for all the pings. Due to COVID my contract working on OpenStack was extended, but it is now due to terminate in the next month, after which I will lose access to the cluster. It may be that I get a few odd days towards the end of next month to look at this again, although any time and testing will be quite limited.

What's the current state of upstream regarding ProviderSpecific and also adding ModifyEndpoints to the Provider interface as suggested by @devkid?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2020
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 18, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stefanandres commented Mar 15, 2022

@mcayland thank you for your PR from 2019. Alas this issue is still a problem in 2022.
Do you know of any updates regarding this issue?

@mcayland (Author)

> @mcayland thank you for your PR from 2019. Alas this issue is still a problem in 2022. Do you know of any updates regarding this issue?

@stefanandres no updates from this end I'm afraid. As documented in the original issue #1122 above, the work was done under contract for a client OpenStack/k8s installation, and I was given permission to submit the work upstream.

Unfortunately due to issues with the changes to ProviderSpecific I wasn't able to get the PR merged before the contract finished, so I no longer have access to the installation in question.

I still get emails from people asking about this feature, so there is clearly still interest. If someone were willing to sponsor the work and provide access to a test OpenStack/k8s installation then I would certainly be interested to revisit this with the aim of getting a fix merged upstream.
