
DNS validation fails but it's wrong #4477

Closed
r00ta opened this issue Oct 9, 2022 · 7 comments
r00ta commented Oct 9, 2022

I've tried to install a cluster with

Screenshot from 2022-10-09 23-12-33

with the following DNS settings

Screenshot from 2022-10-09 23-10-35

but the validation fails due to the DNS wildcard record (which does not exist at all!)

Screenshot from 2022-10-09 23-09-03

Is this a known issue?

omertuc added a commit to omertuc/assisted-service that referenced this issue Oct 10, 2022
…sses

This commit improves the bad DNS wildcard validation message to include
the list of IP addresses that the bad wildcard domain resolves to. This
might help users understand where the bad wildcard record is coming
from.

See openshift#4477 - user was
stuck on this validation for a while but was able to figure out the
issue almost immediately when I told them the IP address that gets
resolved is 127.0.0.1 (turns out it's some weird known router
issue/behavior)
omertuc commented Oct 10, 2022

/close

User figured out it's some weird router behavior. Created #4482 to give users more information in this situation.

r00ta commented Oct 11, 2022

Thank you so much @omertuc for the support!

openshift-merge-robot pushed a commit that referenced this issue Oct 12, 2022
…sses (#4482)

This commit improves the bad DNS wildcard validation message to include
the list of IP addresses that the bad wildcard domain resolves to.  This
might help users understand where the bad wildcard record is coming
from.

See #4477 - user was
stuck on this validation for a while but was able to figure out the
issue almost immediately when I told them the IP address that gets
resolved is 127.0.0.1 (turns out it's some weird known router
issue/behavior)

Also fixed a (theoretical) bug where when the agent omitted the wildcard
entry from its DNS resolution response, the validation would fail
(instead of error) and would tell the user that the domain is resolving
while in reality we don't really know whether it's resolving or not.
r00ta closed this as completed Oct 24, 2022
jklare commented May 4, 2023

Just in case anybody else stumbles across this thread:

  1. Congratulations that you made it this far!
  2. I encountered a very similar issue and could solve it by reconfiguring my DHCP server to stop announcing apps.<clustername>.<basedomain> as the domain-search option (dhcpd.conf).
  3. The issue here seems to be that the validator, as well as the CoreDNS running on the fully configured cluster later on, respects the domain-search option announced by the DHCP server for the network, appending those domains to any DNS request whose initial lookup returns NXDOMAIN. The validator's first request goes to the local DNS resolver on the cluster node itself, which honors the domain-search parameters announced by the DHCP server earlier. So the local resolver will first ask the upstream DNS for validateNoWildcardDNS.<clustername>.<basedomain> and get the expected NXDOMAIN. It will then append the search domain (e.g. apps.<clustername>.<basedomain>) and ask for validateNoWildcardDNS.<clustername>.<basedomain>.apps.<clustername>.<basedomain>, getting back the IP address configured for the wildcard A record *.apps.<clustername>.<basedomain>.
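The resolution chain described in item 3 can be sketched with a tiny simulation (a simplified model of stub-resolver search-list expansion, not a real resolver; the cluster names, the 192.0.2.10 address, and the function names are all made up for illustration):

```python
# Sketch of how a stub resolver expands queries using a DHCP-announced
# search list (simplified; real resolvers follow resolv.conf ndots rules).
# All domain names and addresses here are illustrative.

NXDOMAIN = None

def upstream_lookup(fqdn: str):
    """Fake upstream zone: only a wildcard A record under apps.<cluster>.<base>."""
    wildcard_suffix = ".apps.mycluster.example.com"
    if fqdn.endswith(wildcard_suffix):
        return "192.0.2.10"  # matches *.apps.mycluster.example.com
    return NXDOMAIN

def resolve(name: str, search_domains: list[str]):
    """Try the name as-is, then append each search domain on NXDOMAIN."""
    candidates = [name] + [f"{name}.{d}" for d in search_domains]
    for candidate in candidates:
        answer = upstream_lookup(candidate)
        if answer is not NXDOMAIN:
            return candidate, answer
    return None, NXDOMAIN

# The validator probes a name that should never resolve:
probe = "validateNoWildcardDNS.mycluster.example.com"
search = ["apps.mycluster.example.com"]  # announced via DHCP domain-search

matched, ip = resolve(probe, search)
print(matched)  # validateNoWildcardDNS.mycluster.example.com.apps.mycluster.example.com
print(ip)       # 192.0.2.10 -> validation wrongly concludes a wildcard exists
```

The probe itself correctly returns NXDOMAIN upstream; it is the search-list expansion that drags the name under the wildcard.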

omertuc commented May 4, 2023

Seems like an issue in how our validation works. Thanks for the explanation and for looking into it. I will open an internal ticket to see how we can address it.

jklare commented May 4, 2023

The strange thing for me here is that, although I agree the validator should probably output a different error message, it is still somehow correct when saying "the cluster will not work with the current DNS/DHCP configuration". I initially ran into this issue because I saw the CoreDNS on my running cluster resolving service names (e.g. myservice.mynamespace.svc.cluster.local) to the address of my wildcard record (for *.apps.<clustername>.<basedomain>). While debugging I found out that CoreDNS was also using the underlying local resolver and, instead of resolving myservice.mynamespace.svc.cluster.local, it was appending the aforementioned domain-search parameter announced by my external dhcpd and was actually getting an answer for myservice.mynamespace.svc.cluster.local.apps.<clustername>.<basedomain>, which was obviously wrong. I am still investigating how to fix the full issue...

omertuc commented May 4, 2023

Good to know, because our proposed solution on internal Slack was to append a . at the end to signal to the resolver that the domain ends where we say it ends, so no further appending should be done by the resolver.

But if the presence of your DHCP config breaks OCP itself... maybe we shouldn't do that. Perhaps we should first wait for OCP to be tolerant of such a DHCP config and just improve the error message until then.
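The trailing-dot idea can be illustrated with a small sketch of the candidate names a stub resolver would try (a simplified model with made-up names; glibc-style resolvers treat a name ending in "." as fully qualified and skip search-list expansion):

```python
# Sketch: a name with a trailing dot is treated as fully qualified,
# so the resolver skips search-list expansion (simplified model;
# all names are illustrative).

def expand_with_search_list(name: str, search_domains: list[str]) -> list[str]:
    """Return the candidate FQDNs a stub resolver would try, in order."""
    if name.endswith("."):
        # Trailing dot: the name is already absolute, no appending.
        return [name.rstrip(".")]
    return [name] + [f"{name}.{d}" for d in search_domains]

search = ["apps.mycluster.example.com"]  # DHCP-announced domain-search

# Without the trailing dot the probe picks up the wildcard-covered suffix:
print(expand_with_search_list("validateNoWildcardDNS.mycluster.example.com", search))
# ['validateNoWildcardDNS.mycluster.example.com',
#  'validateNoWildcardDNS.mycluster.example.com.apps.mycluster.example.com']

# With a trailing dot only the absolute name is queried:
print(expand_with_search_list("validateNoWildcardDNS.mycluster.example.com.", search))
# ['validateNoWildcardDNS.mycluster.example.com']
```

As the comment above notes, though, suppressing expansion only in the validator would hide a DHCP configuration that still breaks CoreDNS on the running cluster.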

jklare commented May 4, 2023

Yeah, I think improving the error message would be a good first step. I will keep you posted once I figure out how to configure oc for this scenario :).

danielerez pushed a commit to danielerez/assisted-service that referenced this issue Oct 15, 2023
…sses (openshift#4482)