CoreDNS is misconfigured leading to unexpected healthcheck behaviour #64
Attaching a packet capture of the first 17 seconds after startup. If the firewall is set to drop rather than reject, then I think this misconfig is exacerbated further.
What's interesting is that once triggered, the rate spikes and then remains fairly stable, dumping 3000 packets/sec onto the network. Left to run, the rate fluctuates a bit but stays around that level. Experimenting a bit, it seems the issue is not in fact the use of `127.0.0.1:5553` in the forwards but the `fallback` statements themselves - if those are removed, the packet storm never occurs, even with the firewall rules still in place. So actually, it seems the correct fix here is not to have that fallback statement at all. It also brings the behaviour in line with that expected of DNS servers - if you don't trust responses from your upstream then why query it in the first place?
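For reference, the statements in question are the `fallback` lines in the generated corefile - something like the below (paraphrased from memory, so the exact rcodes and addresses are assumptions rather than a verbatim copy of the shipped config):

```
# if the main forward answers with one of these rcodes, re-ask the
# Cloudflare-backed block on :5553
fallback REFUSED . dns://127.0.0.1:5553
fallback SERVFAIL . dns://127.0.0.1:5553
fallback NXDOMAIN . dns://127.0.0.1:5553
```

Commenting those out and reloading is all that's needed for the experiment above.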
The reason this happens is that CoreDNS will loop over the configured upstreams until it reaches its own timeout - https://github.com/coredns/proxy/blob/master/proxy.go#L78. Because the RSTs can come back fast from the firewall, there's plenty of time to fling packets onto the network. Once CoreDNS reaches that timeout, it'll consider the backend down, but only for 2 seconds. In practice this doesn't matter much, because if all backends are considered down (which they will be here) then the default behaviour is to spray randomly against one of the hosts. TL;DR: forcing a hardcoded fallback means every failed query turns into a burst of retries against Cloudflare.
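You can watch a single query turn into a stream of attempts by capturing while querying the `:5553` block directly - run the `dig` from wherever that listener is reachable (possibly only from inside the DNS container, which is an assumption about how it's bound):

```
# Watch the outbound DoT attempts to the Cloudflare upstreams
tcpdump -ni any 'tcp port 853 and (host 1.1.1.1 or host 1.0.0.1)'

# In another shell, send a single query at the fallback server block
dig example.org @127.0.0.1 -p 5553
```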
The storm can also trigger sometime after startup - if the local DNS server becomes unreachable. We can verify this by adding a firewall rule to drop traffic from HomeAssistant on the local DNS service. What we get is intermittent storms, with brief "recoveries" in between. At this point, you could theorise that if we change our rule to REJECT rather than DROP, we should cause a storm against the local DNS server. However, that's not the case (at least where the local DNS is contacted using UDP). In fact, if we rewrite the config so that the local server is queried the same way, we also do not get a storm. Manually querying against the server block doesn't cause a storm either. This suggests that the underlying issue is in how the `:5553` (DoT) upstreams are handled when they fail, rather than in forwarding to a local more generally. This means there's an alternative fix/mitigation available here - if the devs are overly attached to the existence of the fallback, it could at least be pointed at an upstream that isn't reached over DoT.
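For anyone wanting to repeat that check, the drop rule would be something like the below (a sketch only - the addresses are placeholders for the HA host and the local DNS server, and I'm assuming an iptables-based firewall):

```
# Drop DNS from the HomeAssistant host (192.168.1.50) to the local
# DNS server (192.168.1.1) - both addresses are placeholders
iptables -I FORWARD -s 192.168.1.50 -d 192.168.1.1 -p udp --dport 53 -j DROP
iptables -I FORWARD -s 192.168.1.50 -d 192.168.1.1 -p tcp --dport 53 -j DROP
```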
OK, so pulling this all together, this is how to repro and verify the above.

**Metrics collection**

Optional - you could also just run a packet capture if you don't want graphs.
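If you go the capture route, something like this is enough to see the storm (the interface name is an assumption - use whichever interface carries your Cloudflare-bound traffic):

```
# Capture DoT traffic heading for the Cloudflare resolvers into a pcap
tcpdump -ni eth0 -w dns-storm.pcap 'tcp port 853 and (host 1.1.1.1 or host 1.0.0.1)'
```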
**Repro**

On your network firewall, add two rules rejecting DNS-over-TLS to the Cloudflare resolvers (see the sketch below).
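On an iptables-based firewall (an assumption - use your firewall's equivalent) the rules would look something like this; REJECT with tcp-reset is what produces the fast RSTs described earlier:

```
# Reject DoT (TCP 853) to the Cloudflare resolvers, answering with a RST
iptables -I FORWARD -d 1.1.1.1 -p tcp --dport 853 -j REJECT --reject-with tcp-reset
iptables -I FORWARD -d 1.0.0.1 -p tcp --dport 853 -j REJECT --reject-with tcp-reset
```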
Exec into the DNS container and restart coredns so the startup checks fire (a sketch of the commands follows below).

You should see thousands of packets hit the network. If you're using some other metrics+graphing solution, be aware that you may not see them in graphs straight away. Now exec into the container again, remove the `fallback` statements and reload - the storm should stop. Restore the original config afterwards. On your firewall, add two more rules, this time set to DROP rather than REJECT, to reproduce the intermittent storms described above.
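For the exec/edit steps, something along these lines should work - the container name, corefile path and reload mechanism are all assumptions about a standard install, so check your own setup:

```
# Open a shell in the Home Assistant DNS plugin container
docker exec -it hassio_dns sh

# Locate the generated corefile and comment out the fallback lines
grep -rl fallback /etc 2>/dev/null
vi /etc/corefile              # path is a guess - use whatever grep finds

# Ask coredns to reload its config in place (SIGUSR1 triggers a reload)
kill -USR1 "$(pidof coredns)"
```

Restarting the plugin from the host (e.g. `ha dns restart`) also works, but may regenerate the corefile from the supervisor's template, so an in-place edit and reload is the cleaner test.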
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Nothing changed since the report, dear stale bot. Dear developers, please add an option to disable the Cloudflare servers, or just disable them by default if any other server is configured. If a server was configured by the user, the user had a reason to do so. Please respect that and behave like a good netizen.
works for me
for long uptimes it doesn't work
Fixed by #82
Also note that there is a new option to disable the fallback DNS, added here: home-assistant/supervisor#3586 - I would guess a number of users on here would be interested in that.


CoreDNS is configured to healthcheck the Cloudflare fallback every 5 minutes; however, in practice a check is performed once a minute (and retries are generated when it fails).
The `fallback` directive also causes healthchecks at startup, which can create substantial query rates. This is also why users have reported seeing small packet storms when Cloudflare is not reachable (reported by @lialosiu here and @tescophil here).
The intended behaviour appears to be to check once every 5 minutes:
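That interval comes from the `health_check` on the Cloudflare-facing server block, which is roughly this shape (reconstructed from memory rather than copied verbatim, so treat it as indicative only):

```
.:5553 {
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        health_check 5m
    }
    cache 600
}
```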
However, that is not the only check being performed, because the encompassing server block is referenced elsewhere:
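The main `.:53` block forwards to that same `:5553` listener (alongside any locals), and it carries its own, more frequent, health check - again a sketch rather than a verbatim copy, with `192.168.1.1` standing in for whatever locals are configured:

```
.:53 {
    forward . dns://192.168.1.1 dns://127.0.0.1:5553 {
        health_check 1m
        policy sequential
    }
    # the fallback statements point at the same listener again
    fallback REFUSED . dns://127.0.0.1:5553
    cache 600
}
```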
A check will be run once a minute against `127.0.0.1:5553` as well as the locals (if present) - from further testing it appears those will only begin once you've had an initial failure which leads `coredns` to move onto the next forward host. We can see this is the case by enabling coredns' prometheus endpoints and pointing telegraf at them:
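The scrape setup is nothing special - the port and the use of the `prometheus` directive are assumptions about a default build, and `<coredns-host>` is a placeholder:

```
# Corefile (inside the .:53 block): expose metrics
prometheus 0.0.0.0:9153
```

```toml
# telegraf.conf: scrape coredns and graph coredns_forward_healthcheck_failures_total
[[inputs.prometheus]]
  urls = ["http://<coredns-host>:9153/metrics"]
  metric_version = 2
```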
What the resulting graph shows is failures per minute - each failure represents a single query (where healthchecks are concerned, for `example.org`) sent to `127.0.0.1:5553`.

However, to `127.0.0.1:5553` (and any other upstream for that matter) it's just another query, so when its query to its upstream (one of `1.1.1.1` or `1.0.0.1`) fails, it retries and we end up with new packets hitting the wire, one after the other.

In terms of the fix, it's not clear why the fallback behaviour is implemented/hardcoded in the first place (I couldn't find any architecture discussions on it in that repo, perhaps I missed them), but the correct way to have implemented this would have been one of the following options:
Option 1: don't include `127.0.0.1:5553` in the forwards statement at all (as it's handled by the fallback). It'd need some logic to handle empty locals.
Option 2: don't use a separate server block (perhaps there's some other reason an entire separate server block was stood up, but I don't see any reference to it).

There's also the option of answering `example.org` locally (so the healthcheck against `:5553` isn't passed upstream), but that's more horrid than the current setup. Another approach would run up against `coredns`'s timeouts if a local does go down, so I've not included that.

The reason this isn't a PR is that it's blocked by a decision on approach.
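For what it's worth, Option 1 would end up looking something like the below (a sketch only - it glosses over the empty-locals case mentioned above, and `192.168.1.1` is a placeholder for the configured locals):

```
.:53 {
    # only the locals are forwarded to directly...
    forward . dns://192.168.1.1 {
        health_check 1m
    }
    # ...with the Cloudflare-backed block reached solely via the fallback
    fallback REFUSED . dns://127.0.0.1:5553
    cache 600
}
```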
Correction: the much bigger issue is actually the `fallback` statement, see #64 (comment).

**Additional Observations**
Whilst capturing telemetry there were a few things I noticed which might help inform a decision on the above.
When in use, the Cloudflare fallback introduces a significant level of latency:
At the network level, Cloudflare is only 10-15ms away, but the average query duration for the CF upstreams is half a second. The presumption is that's due to DoT overheads, but unfortunately `coredns` doesn't currently expose metrics that can help verify this. I'd posit therefore that, as well as fixing the healthcheck issue, the fallback should be made optional (if not removed entirely).
But, realistically, if this issue is fixed so that healthchecks aren't amplified onto the network, then users who want to block CF DoT at the perimeter will be able to do so without HA gradually attempting to flood the network.
I like `coredns`, but it does feel rather out of place in an appliance - its approach to dynamic timeouts isn't really very well tuned to the foibles of domestic connections/networks.