Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding hostedZoneID and alias to endpoints to assist deletion of alias entries on AWS #1356

Closed
wants to merge 16 commits into from

Conversation

spohner
Copy link
Contributor

@spohner spohner commented Jan 9, 2020

This PR tries to fix #1105 by adding hostedZoneID and alias to endpoints via ProviderSpecific attributes. This allows ALIAS entries targeting record sets in an hosted zone to be deleted. Previously this would fail if the target was not an AWS address.

Unsure if this allows to remove the hardcoded values for hosted zones ids, but it might. I can update this PR with this change if wanted.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jan 9, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @spohner!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 9, 2020
@spohner spohner force-pushed the del_alias_entries branch 2 times, most recently from 9114d0b to 9b5584b Compare January 9, 2020 14:08
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 9, 2020
@spohner
Copy link
Contributor Author

spohner commented Jan 10, 2020

/assign @Raffo

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 12, 2020
@spohner
Copy link
Contributor Author

spohner commented Jan 12, 2020

Great work, @devkid ! I'll test this later today and resolve any conflicts.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2020
@spohner
Copy link
Contributor Author

spohner commented Jan 13, 2020

Tested in our cluster and seems to work great for us.

Copy link
Member

@njuettner njuettner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason for this change is to differentiate between "ALIAS" and "CNAME"?
Could you explain it to me what the problem is? I probably don't get the full context here.

plan/plan.go Outdated
@@ -181,12 +181,6 @@ func shouldUpdateProviderSpecific(desired, current *endpoint.Endpoint) bool {
return false
}
for _, c := range current.ProviderSpecific {
// don't consider target health when detecting changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be dropped, #877
/cc @linki

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spohner can you please comment on this part?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See point 7 from my lengthly explanation above.

@@ -122,6 +122,12 @@ func (c *Controller) RunOnce() error {
}
sourceEndpointsTotal.Set(float64(len(endpoints)))

if prov := c.Registry.Provider(); prov != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this inside the controller logic?

@alfredkrohmer
Copy link
Contributor

@njuettner This is a "multi-fold" problem.

  1. Without these changes, CNAME endpoints that point to an ALB/ELB/NLB could be created and deleted just fine. This is because the provider always converted CNAME endpoints that point to an ALB/ELB/NLB to an alias record here.
  2. The problem is when you wanted to work with alias records that don't point to an ALB/ELB/NLB. For these, you manually need to add the external-dns.alpha.kubernetes.io/alias: true annotation on your source. This code was responsible for the creation; this kind-of worked if you only had one zone configured but was broken if you used multiple zones (Creation of Route53 alias records is broken for targets that are not load balancers if using more than one zone #1175 - this is also fixed now by this PR). However, these records would never be deleted because external-dns retrieved the endpoints from the zone as CNAME endpoints and didn't "tag" them as alias records, so it always issued "delete this CNAME record" operations (instead of "delete this alias A record") - this is External DNS does not delete AWS aliases #1105.
  3. With this PR, all alias records retrieved from the zone will be CNAME records with a provider-specific property to tag them (see here). The line above adds the another provider-specific property for the evaluate-target-health setting.
  4. For calculating changes, external-dns compares the desired cluster state (endpoints retrieved from sources) and the actual state in the zones (endpoints retrieved from the Endpoints method of the provider) and issues update operations to the provider if the state doesn't match, which also happens when provider-specific properties differ.
  5. In order for the provier-specific properties to match, the alias and evaluate-target-health properties need to be set in the endpoints retrieved by the sources and by the provider. The properties are set by the source here and by the provider as mentioned in point 3.
  6. While adding the provider-specific properties like described in point 3 fixes the problem for alias records not pointing to an ALB/ELB/NLB, it breaks for the records that are automatically converted from CNAME to alias (i.e. pointing to ALB/ELB/NLB and not having the external-dns.alpha.kubernetes.io/alias annotation) because now the provider sets the provider-specific property property when it retrieves the record from the zone but the source doesn't add it because it doesn't know it will be converted to an alias record. This results in external-dns issuing update operations because the provider-specific properties differ; these updates fail because they don't actually change anything.
  7. The previous point is fixed by the introduction of the ModifyEndpoints method. It allows the provider to add provider-specific properties (or do other changes) to the endpoints that are retrieved from the sources. This way, the AWS provider can check in this method already if there are CNAME records that will be converted to alias records later automatically and add the corresponding provider-specific property; this makes the provider-specific properties in endpoints from the source and the provider match and point 6 is fixed. The same mechanism now applies for the evaluate-target-health property, which is why this check can be removed from plan.go.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2020
@spohner
Copy link
Contributor Author

spohner commented Jan 28, 2020

Great write up @devkid. Any comments or suggestions @njuettner?

@spohner
Copy link
Contributor Author

spohner commented Mar 6, 2020

Is this PR rejected? Let me know if I can assist to get this through.

@wwentland
Copy link

It would be much appreciated if this PR could get merged.

I deployed 0.6.0 with this PR merged in without any problems which was necessary to address #1175 encountered in our setup.

Copy link
Contributor

@Raffo Raffo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR @spohner and most of all for your patience. I reviewed it as promised on slack and added a few comments.

While I appreciate the explanation in #1356 (comment) , I am not sure of the following:

  • I'm not convinced that we need a new interface and I'd love to understand more about that. I guess it is done not to change things everywhere, but it seems like we are breaking the contract with the other provider. Isn't there a way to fix that without modifying the interface?
  • Are you sure we strictly need to do the fix in the controller code?
  • Where is the part of code that requires the context to be passed and that justifies changing a total of 60 files? If the addition of context is justified, I would propose something to ease the review: let's split this PR in two: one with all the context, and leave this one with all the core parts of the change.

plan/plan.go Outdated
@@ -181,12 +181,6 @@ func shouldUpdateProviderSpecific(desired, current *endpoint.Endpoint) bool {
return false
}
for _, c := range current.ProviderSpecific {
// don't consider target health when detecting changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spohner can you please comment on this part?

provider/aws.go Outdated
@@ -471,6 +477,30 @@ func (p *AWSProvider) newChanges(action string, endpoints []*endpoint.Endpoint,
return changes
}

func (p *AWSProvider) ModifyEndpoints(endpoints []*endpoint.Endpoint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to this function?

provider/aws.go Outdated
@@ -667,6 +686,18 @@ func changesByZone(zones map[string]*route53.HostedZone, changeSet []*route53.Ch
continue
}
for _, z := range zones {
if c.ResourceRecordSet.AliasTarget != nil && aws.StringValue(c.ResourceRecordSet.AliasTarget.HostedZoneId) == "same-zone" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid using magic strings like "same-zone"? If this is the only way to signal such a case, we should at least have a constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment. I agree that having a constant is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted "same-zone" to a constant with 8d63ec6

provider/aws.go Outdated
}

// if not, target needs to be in the same zone
return "same-zone"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using this to not break the signature of this method, I'd rather break it rather than have magic strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only to not break the signature, it's the only way of passing this around in the existing route53.Change structure (that comes from the AWS SDK). Of course, we could create another struct wrapping this one but this seems a bit overkill.

ApplyChanges(ctx context.Context, changes *plan.Changes) error
}

type EndpointModifyingProvider interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this needs to be a different interface? Do we have other options?

Also: this should have a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just introduced this so that I wouldn't need to implement this as an empty method for all other providers and because I would consider this an "optional" interface of a provider. However, I wouldn't mind at all if you prefer to have this in the Provider interface.

@spohner
Copy link
Contributor Author

spohner commented May 8, 2020

Thanks, @Raffo! @devkid has done most of the work here and deserves the credit. He is probably more qualified to answer the questions, but I will give it a shot pretty soon.

alfredkrohmer and others added 7 commits July 28, 2020 10:56
…erface

This avoids that providers can only modify endpoints in their
ApplyChanges method, possibly leading to no-op upsert operations because
external-dns can't foresee these automatic changes.

The prime example for this is automatic conversion of CNAME records to
alias records in the Route 53 provider if their target is an ELB/ALB.
When retrieving these records from the zone, the provider adds the
provider-specific properties for "alias" and "evaluate-target-health"
but when extracting the corresponding endpoints from the cluster state,
external-dns will not know that the provider will set these attributes
automatically and will therefor issue a (no-op) update operation.

This change calls a "ModifyEndpoints" method in the provider to let it
do these automatic modifications before external-dns calculates the
change plan, allowing it do skip the no-op update. As a consequence, the
Route 53 provider can move the automatic CNAME to alias conversion in
this method (where it only needs to the add the correct
provider-specific properties) and in ApplyChanges only needs to act upon
these. This also gets rid of the special-case handling of
evaluate-target-health in the plan calculation.
…erface

This avoids that providers can only modify endpoints in their
ApplyChanges method, possibly leading to no-op upsert operations because
external-dns can't foresee these automatic changes.

The prime example for this is automatic conversion of CNAME records to
alias records in the Route 53 provider if their target is an ELB/ALB.
When retrieving these records from the zone, the provider adds the
provider-specific properties for "alias" and "evaluate-target-health"
but when extracting the corresponding endpoints from the cluster state,
external-dns will not know that the provider will set these attributes
automatically and will therefor issue a (no-op) update operation.

This change calls a "ModifyEndpoints" method in the provider to let it
do these automatic modifications before external-dns calculates the
change plan, allowing it do skip the no-op update. As a consequence, the
Route 53 provider can move the automatic CNAME to alias conversion in
this method (where it only needs to the add the correct
provider-specific properties) and in ApplyChanges only needs to act upon
these. This also gets rid of the special-case handling of
evaluate-target-health in the plan calculation.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 28, 2020
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2020
@alfredkrohmer
Copy link
Contributor

@Raffo this is rebased now.

@seanmalloy
Copy link
Member

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2020
@@ -500,22 +529,12 @@ func (p *AWSProvider) newChange(action string, ep *endpoint.Endpoint, recordsCac
if val, ok := ep.Labels[endpoint.DualstackLabelKey]; ok {
dualstack = val == "true"
}

change.ResourceRecordSet.Type = aws.String(route53.RRTypeA)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about CNAME aliases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm this was part wasn't changed in this pull request. I assume CNAME alias records don't work (and didn't work before either). I guess that would be a new feature request.

Copy link

@adrienjt adrienjt Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This PR's scope is already quite large, and has been pending for a long time. Would love to see it merged so we can add CNAME alias support in a subsequent issue/PR.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@k8s-ci-robot
Copy link
Contributor

@spohner: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 5, 2021
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 4, 2021
@Raffo
Copy link
Contributor

Raffo commented Mar 31, 2021

Well this is embarassing, the kubernetes bot failed at merging this PR, then it conficted multiple times... can you resolve the conflicts and we try again merging this @spohner ?

@alfredkrohmer
Copy link
Contributor

@Raffo this PR is actually superseded by the one I raised: #1860

@spohner and I kinda worked together on this and we agreed that I raise a new PR and keep rebasing it until it's merged.

@seanmalloy
Copy link
Member

@Raffo this PR is actually superseded by the one I raised: #1860

@spohner and I kinda worked together on this and we agreed that I raise a new PR and keep rebasing it until it's merged.

@spohner and @devkid can we close this PR then? We will instead review #1860, and try to get it merged.

@spohner
Copy link
Contributor Author

spohner commented Apr 6, 2021

I agree. Let´s close this in favor of #1860 .

@spohner spohner closed this Apr 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

External DNS does not delete AWS aliases
9 participants