New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CFE-748: Improve CoreDNS Integration with EgressFirewall #1335
CFE-748: Improve CoreDNS Integration with EgressFirewall #1335
Conversation
…and EgressFirewall by adding new plugin to CoreDNS and a new CR for tracking the IPs and TTLs of DNS names used in EgressFirewall rules
Skipping CI for Draft Pull Request. |
@arkadeepsen: This pull request references CFE-748 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…tions. Also made some modifications in the User Stroies section.
} | ||
```` | ||
|
||
The following ``DNSName`` CRD will be added to the ``k8s.ovn.org`` api-group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are using the CoreDNS plugin with ovn-kubernetes, it's not logically bound to ovn-kubernetes. I think this should be something like dns.openshift.io
/assign @Miciah |
} | ||
```` | ||
|
||
The following ``DNSName`` CRD will be added to the ``k8s.ovn.org`` api-group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since this will only apply to Egress Firewall DNSNames, the objects should have names that reflect that, like EgressFirewallDNSName or similar. No one should be able to confuse DNSNames with traditional DNS names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the group name makes it clear that this is for OVN or for EgressFirewall, then the CRD name doesn't necessarily need to indicate it. So, for example, if the group name is "k8s.ovn.org" or "egressfirewall.operators.openshift.io" or whatever, then the qualified resource name ends up as "dnsnames.k8s.ovn.org" or "dnsnames.egressfirewall.operators.openshift.io", which makes it clear that the resource is related to the OVN-CoreDNS integration for EgressFirewall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wanted to discuss the naming. I don't think egress firewall or ovn is necessary in the name, because the functionality of resolving DNS names to ips is quite generic. As an example, even in ovn we can use this functionality for features different from egressfirewall, e.g. network policy.
So maybe we can figure out the name that would explain what this object is used, e.g. DNSNameResolver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the EP with the following changes that were discussed:
- The CoreDNS plugin will not wait for OVN-K masters to update the underlying ACL rules
- OVN-K cluster manager will create and delete the new CR.
- A new controller will watch the CR and perform the re-resolution of the DNS names added to the
.status
of the CRs. - The api definition of the new CRD is updated.
@Miciah @danwinship @npinaeva @JoelSpeed PTAL and let me know if any other changes are required.
Some of the points which are still kind of unresolved:
- What should be the name of the new CRD?
As of now I have usedEgressFirewallDNSName
. - Which apigroup should it be added to?
As of now I have useddns.openshift.io
as per the suggestion here: CFE-748: Improve CoreDNS Integration with EgressFirewall #1335 (comment) - Should the new CRD be cluster-scoped or namespace-scoped?
As of now it is a namespace-scoped CR. There was a previous suggestion to keep it namespace-scoped itself: CFE-748: Improve CoreDNS Integration with EgressFirewall #1335 (comment) - What should be the name of the new CoreDNS plugin?
As of now I have usedegressfirewall
. However, if we are changing the CRD name and apigroup then we probably have to change the name of the plugin as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per today's discussions, the name of the CRD is decided to be DNSNameResolver
and it'll belong to network.openshift.io
apigroup. The name of the new plugin will be ocp_dnsnameresolver
.
* Additionally, the exact matching of the next lookup time may never be successful. If the existing next lookup time of a DNS name lies | ||
within a threshold value (say 1 second) of the new lookup time, specifically lies in the range defined by | ||
`[new lookup time - threshold duration, new lookup time + threshold duration]`, then they are considered as same. | ||
* The `Degraded` condition of the DNS name will be set to `false` along with the success reason and message. The `resolutionFailures` field for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting a Degraded
condition on a DNS name seems like overkill, and beyond the scope and nature of this EP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was suggested by @JoelSpeed here: #1335 (comment).
A DNS name will be removed from the status of the CR if it fails for 5 times. The Degraded
condition and the related reason and message will provide clarity why the resolution failed. This will also provide information when we would need to debug any future issue related to EgressFirewall.
The details of the `DNSNameResolver` CRD can be found in the [Proposal](#proposal) section. | ||
|
||
|
||
### Implementation Details/Notes/Constraints [optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
### Implementation Details/Notes/Constraints [optional] | |
### Implementation Details/Notes/Constraints |
Cluster DNS Operator will deploy CoreDNS with the `ocp_dnsnameresolver` plugin enabled by adding it to the corefile. For the EgressFirewall rules to | ||
apply consistently, even for DNS names that are resolved by custom upstream nameservers, it will be added to all server blocks in the corefile. As the | ||
plugin will watch and update the `DNSNameResolver` CRs in the `network.openshift.io` api-group, proper RBAC permissions will be needed | ||
to be added to the `ClusterRole` for CoreDNS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not design a way for the EgressFirewall to configure this if it is activated? If a customer doesn't even use the EgressFirewall, why would we need to make any changes in the plugin chain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a customer is not using EgressFirewall, the plugin will mostly be no-op apart from checking the DNS names in the maps. As the maps will be empty the plugin will immediately return.
If we want to design it in such a way so that the plugin chain is changed only when EgressFirewall is used, it will end up restarting all the CoreDNS pods to enable the new plugin. In such a scenario, a CoreDNS pod may get a request from a pod before being restarted and this will result in missing out adding the IP addresses to the status of the corresponding DNSNameResolver CR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, the feature will be behind a feature gate. If the feature gate is enabled, it means that the customer intends to use the feature and all the related components will be enabled including the new plugin.
The new `DNSNameResolver` controller will be added to the Cluster DNS Operator. The controller will watch the `DNSNameResolver` CRs, | ||
and will send DNS lookup requests for the `spec.name` field. It will also re-resolve the `status.resolvedNames[*].dnsName` fields based on the | ||
corresponding next lookup time (TTL + last lookup time). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to enable/install this controller only when EgressFirewall is in use?
Will the status of the DNSNameResolver
impact the status of the Cluster DNS Operator? Do we have to mark c-d-o as degraded if the DNSNameResolver
controller is degraded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to enable/install this controller only when EgressFirewall is in use?
The feature will be behind a feature gate. If the feature gate is enabled, it means that the customer intends to use the feature and all the related components will be enabled including the new controller.
Will the status of the DNSNameResolver impact the status of the Cluster DNS Operator? Do we have to mark c-d-o as degraded if the DNSNameResolver controller is degraded?
The status of Cluster DNS Operator won't be impacted by the DNSNameResolver controller. The controller will mainly do lookups for the DNS names. It will also remove IP addresses whos TTL have expired. These operations won't impact the CDO status.
* Whenever there's a change in the IP addresses or the next lookup time (TTL + last lookup time) for a DNS name, the additional step of updating the | ||
related `DNSNameResolver` CRs will be executed. This will add some delay to the DNS lookup process. However, this will only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this feature is required because the EgressFirewall wasn't designed to keep up with the changes in DNS information, I think it is misleading to imply that this doesn't happen very often.
We frequently get people asking how to troubleshoot why a DNS lookup takes so long, so I do think we need to find out precisely how much this could add for a wildcard domain with a popular suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this feature is required because the EgressFirewall wasn't designed to keep up with the changes in DNS information, I think it is misleading to imply that this doesn't happen very often.
Can you please let me know which part seems to be misleading. I will change the wording to make it correct.
The last line is added to imply that the updates will only take place when there's a change in the DNS information and only then will we see a delay in the CoreDNS response.
We frequently get people asking how to troubleshoot why a DNS lookup takes so long, so I do think we need to find out precisely how much this could add for a wildcard domain with a popular suffix.
The additional step which may introduce some delay will be the update of the corresponding DNSNameResolver CRs. For a regular DNS name, a maximum of 2 CRs (a regular and a wildcard) will be updated. For a wildcard DNS name, only the corresponding wildcard CR will be updated. Without implementing and running some tests it will be difficult to predict the precise delay that would be added. As this feature is planned as a TechPreview feature, we can use the customer feedback to understand the impact visible to customers as well.
|
||
## Design Details | ||
|
||
### Open Questions [optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to find out how this impacts DNS lookups, and generally how this works, when there are multiple Egress Firewalls, possibly with overlapping wildcard DNS names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a note in this section regarding the impact to DNS name lookups?
and the `ocp_dnsnameresolver` plugin will also start updating the `.status` fields of the `DNSNameResolver` CRs. The scenarios arising | ||
out of the order of the update of the various components are discussed in [Version Skew Strategy](#version-skew-strategy) | ||
|
||
Downgrade expectations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you update the regex for EgressFirewall DNS names, you may have some issues with downgrade. Need to cover both regexes.
#### Removing a deprecated feature | ||
|
||
|
||
### Upgrade / Downgrade Strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding/removing the plugin from CoreDNS will be interesting - need tests to make sure it doesn't greatly impact the upgrade/downgrade times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will cover the scenario in e2e tests.
} | ||
```` | ||
|
||
### Workflow Description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see the the steps from the point of view of the DNS lookup - which steps are taken when there is an Egress FW, when there isn't, when there is a wildcard, when there isn't, when a DNS name is in multiple Egress FW, and when there is an address not even in Egress FW, what's the flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DNS lookup details is added in the workflow of the plugin: https://github.com/openshift/enhancements/pull/1335/files#diff-b27f04c5e2673c47f922695835e746ecb2205a9f7800def32cc15c25af521758R333-R500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to explain each scenario here as well.
when there is an Egress FW
For each unique DNS name used in all the Egress Firewalls, a corresponding DNSNameResolver CR will be created by the OVN-K cluster manager. The CoreDNS plugin will watch for the DNSNameResolver CRs and store regular and wildcard DNS names in 2 separate maps.
When a DNS lookup for a regular DNS name is intercepted and the DNS name matches either a regular or wildcard DNS name or both, only then the status of the corresponding CR is updated. For wildcard DNS name lookup it will be only checked with the wildcard DNS name map and if there's a match only then the corresponding wildcard DNS name CR will be updated.
when there isn't
If there are no EgressFirewall objects then there won't be any DNSNameResolver CRs created by the OVN-K cluster manager. Thus the 2 maps will be empty. Whenever a DNS lookup is intercepted as the maps are empty the plugin immediately sends the response to the next plugin in the chain.
when there is a wildcard, when there isn't
Already mentioned above. Copying it here again for completion.
When a DNS lookup for a regular DNS name is intercepted and the DNS name matches either a regular or wildcard DNS name or both, only then the status of the corresponding CR is updated. For wildcard DNS name lookup it will be only checked with the wildcard DNS name map and if there's a match only then the corresponding wildcard DNS name CR will be updated.
when a DNS name is in multiple Egress FW
As already mentioned, only a single DNSNameResolver CR will be created for a unique DNS name used in the EgressFirewall objects, even if the same DNS name is used in multiple EgressFirewall objects. The DNS lookup process will be the same as explained above.
when there is an address not even in Egress FW
I actually didn't get the question. Do you mean when there are no DNS names mentioned in any of the EgressFirewall objects? In that case, the OVN-K cluster manager won't create any DNSNameResolver CR. The DNS lookup process will be similar to the scenario where there are no EgressFirewall objects mentioned above.
@arkadeepsen this is a big effort, so it has alot of pieces, and alot of comments. Please address my latest comments and I will lgtm as soon as I can. |
…S name in EgressFirewall.
…uest to 5 random CoreDNS pods only if DNS name has more than one associtated ip. Change grace period from 1s to 5s.
…oving the corresponding DNS Name
@candita Thanks for your review comments. I have responded to your review comments and also made the appropriate changes in the EP. PTAL. |
/approve This enhancement looks good. Thank you! I do want to make sure that we continue to re-resolve the domains as the ttl gets short, but I assume since we already do that we will keep that behavior. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@arkadeepsen: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Thanks for approving the EP. The behavior of re-resolution will continue to be there. |
This enhancement proposes to improve the integration between CoreDNS and EgressFirewall by adding new plugin to CoreDNS and a new CR for tracking the IPs and TTLs of DNS names used in EgressFirewall rules.