performance impact after dns update to 2021.06.0 #50

Closed
milutt opened this issue Jun 30, 2021 · 8 comments

Comments

@milutt

milutt commented Jun 30, 2021

Since the dns upgrade to 2021.06.0, my whole hassio setup has had performance issues.
I am running HAOS on a Raspberry Pi 1B. It's an old Pi, but before dns 2021.06.0 everything ran without issues and I had no reason to upgrade the hardware.
Since the dns upgrade, coredns eventually gets stuck at more than 60% CPU usage, and everything else slows down to the point that the system is unusable.
Even 'ha dns restart' fails with a timeout.
It also happens on a clean image install without any integrations configured.
When I downgrade to dns 2021.04.0 using 'ha dns update --version 2021.04.0', CPU usage returns to normal and the whole system is responsive.
Downgrading dns is not a permanent fix, because it is automatically updated back to the latest version and the CPU load increases again.

Is there an option to permanently downgrade to dns 2021.04.0, or to disable DoT completely (in case TLS is causing too much load on the RPi 1)?

dns logs using 2021.06.0:
[INFO] 127.0.0.1:45539 - 6781 "NS IN . udp 17 false 512" NOERROR - 0 30.016215226s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:47240 - 31621 "NS IN . udp 17 false 512" NOERROR - 0 30.014927277s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:39746 - 20863 "NS IN . udp 17 false 512" NOERROR - 0 30.018269213s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:40733 - 54343 "NS IN . udp 17 false 512" NOERROR - 0 30.006049544s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:47464 - 56063 "NS IN . udp 17 false 512" NOERROR - 0 35.378188925s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:33661 - 11713 "NS IN . udp 17 false 512" NOERROR - 0 30.888139468s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:54301 - 4260 "NS IN . udp 17 false 512" NOERROR - 0 30.002718645s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:33394 - 6453 "NS IN . udp 17 false 512" NOERROR - 0 30.032896855s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out
[INFO] 127.0.0.1:35429 - 10631 "NS IN . udp 17 false 512" NOERROR - 0 30.012585403s
[ERROR] plugin/errors: 2 . NS: tls: DialWithDialer timed out

dns logs using 2021.04.0 (everything else the same, just downgraded dns):
[INFO] 172.30.32.1:39653 - 7139 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.142744402s
[INFO] 172.30.32.1:47120 - 56986 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.04007399s
[INFO] 172.30.32.1:42934 - 43274 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.074434124s
[INFO] 172.30.32.1:60416 - 61087 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.035992093s
[INFO] 172.30.32.1:48272 - 61848 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.024411385s
[INFO] 172.30.32.1:52543 - 46777 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.107825254s
[INFO] 172.30.32.1:44524 - 31012 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.047429792s
[INFO] 172.30.32.1:49698 - 36720 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.032395175s
[INFO] 172.30.32.1:35650 - 10995 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.043622889s
[INFO] 172.30.32.1:39872 - 23006 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.046617813s
[INFO] 172.30.32.1:60757 - 62788 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.10125965s
[INFO] 172.30.32.1:43480 - 13081 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.050008839s
[INFO] 172.30.32.1:39897 - 45303 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.091132884s
[INFO] 172.30.32.1:54320 - 64334 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.116535295s
[INFO] 172.30.32.1:59641 - 63561 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.049265856s
[INFO] 172.30.32.1:54196 - 61682 "PTR IN 43.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.029521276s
[INFO] 172.30.32.1:33093 - 37580 "PTR IN 1.178.168.192.in-addr.arpa. udp 44 false 512" NXDOMAIN qr,aa,rd,ra 44 0.038741049s
[INFO] 172.30.32.1:44545 - 42001 "PTR IN 2.0.17.172.in-addr.arpa. udp 41 false 512" NXDOMAIN qr,aa,rd,ra 41 0.032553201s
[INFO] 172.30.32.1:37139 - 44637 "PTR IN 80.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 2.032197136s
[INFO] 172.30.32.1:43724 - 10416 "PTR IN 84.178.168.192.in-addr.arpa. udp 45 false 512" NXDOMAIN qr,aa,rd,ra 45 0.104368439s

core-2021.6.6
supervisor-2021.06.6
Home Assistant OS 6.1
CPU armv6l
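
To compare the two coredns log excerpts above more systematically, a small sketch along these lines could summarize a log: it counts the `[INFO]` query lines, counts the `DialWithDialer timed out` errors, and averages the duration in the last field of each query line. The function name `summarize_coredns_log` is hypothetical and not part of the `ha` CLI, and the sketch assumes the log format is exactly as pasted above.

```shell
# summarize_coredns_log: hypothetical helper, not part of the ha CLI.
# Reads coredns log lines on stdin and prints the number of queries,
# the number of TLS dial timeouts, and the mean query duration.
# Assumes the duration is the last field of each [INFO] line (e.g. 30.016215226s).
summarize_coredns_log() {
  awk '
    /^\[INFO\]/ {
      d = $NF              # last field, e.g. "30.016215226s"
      sub(/s$/, "", d)     # strip the trailing unit
      total += d
      n++
    }
    /DialWithDialer timed out/ { timeouts++ }
    END {
      mean = (n > 0) ? total / n : 0
      printf "queries=%d timeouts=%d mean_duration=%.3fs\n", n, timeouts, mean
    }
  '
}
```

With a log saved to a file (the name `dns.log` is only an example), it would be called as `summarize_coredns_log < dns.log`. On the excerpts above, the 2021.06.0 log would be expected to show a mean duration of roughly 30 seconds with one timeout per query, while the 2021.04.0 log stays well under a second for most queries.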

@Tallyrald

I'm experiencing the same and can't really add much else: the Pi gets bogged down by the coredns process.
Maybe this has something to do with the fallback plugin fix (#47)?

I suspect something is stuck in a retry loop without any backoff/stop instruction.

@Tallyrald

After observing the behaviour (sadly I found no logs that would back me up here), I found that most of the dns resolution / connection errors actually happen because of the CPU overload caused by the DNS plugin.
Errors start within a couple of minutes after boot. CPU usage of the coreDNS process spikes to 45% and never goes below that. Several hours later this percentage starts to climb as more and more connection errors happen. Finally HA locks up and the RPi (1B) needs a reboot.

All in all this makes HA completely unusable. Reverting to 2021.04.0 solves the problem until the supervisor decides to auto-update the plugin-dns again, so all my observations are in line with what OP described. (We seriously need manual control over updates, to be honest, especially since HA supports so many different platforms.)

I tried the same setup on a Windows machine using Hyper-V, but the problem never came up. I suspect something (fallback?) does not respect when a query takes 'too long' and initiates a new query again and again. That is undesirable, since it locks up the whole service. It also makes HA partially dead whenever the network is offline, which is not uncommon given that HA is supposed to be privacy-first and cloud-free (if the user wants that).

Is there a way for me to help debug the problem? If someone could write instructions on how to develop & test the dns plugin (locally using docker I guess), I would gladly try to help. Unfortunately I'm not a golang expert although I am a software developer (mostly familiar with js/ts).
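
To illustrate the suspicion above about queries being re-issued without any backoff or stop condition, here is a minimal sketch of what a bounded retry with exponential backoff looks like. It is not taken from coredns or the dns plugin and makes no claim about how either works internally; the function name `retry_with_backoff` and its parameters are hypothetical.

```shell
# retry_with_backoff: hypothetical helper, unrelated to coredns internals.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure,
# and gives up after MAX_ATTEMPTS attempts.
# Note: plain POSIX sh has no local variables, so these names are global.
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # exponential backoff
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 1 nslookup example.com` would wait 1, 2, 4 and 8 seconds between attempts and then stop. The key difference from the behaviour described above is that both the delay growth and the attempt limit keep a failing upstream from consuming CPU indefinitely.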

@dMopp

dMopp commented Oct 22, 2021

It's a problem with the fallback DNS, which the developers "don't want a discussion" about, for whatever reason.

If you have a WORKING DNS setup, you can do the following:

1.) Make sure HA dns is using your own DNS server:
ha dns info

host: 172.30.32.3
locals: []
servers:
- dns://<YOUDNSSERVERIP>
update_available: false
version: 2021.06.0
version_latest: 2021.06.0

If that's not the case, run
ha dns options --servers dns://<YOUDNSSERVERIP>

2.) Comment out the unused fallback DNS:

docker exec -it hassio_dns bash
vi /usr/share/tempio/corefile

Comment out the line
fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553
Save, exit the container, and run
ha dns restart

This should solve the CPU issues until the next release, as long as you have a working DNS setup. (I don't know how often the template file gets overwritten, though.)
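
The manual `vi` step above could also be scripted. The following is a minimal sketch, assuming that `#` starts a comment in the Corefile and that the `sed` available in the container supports `-E` and `-i` with a backup suffix. The function name `comment_out_fallback` is hypothetical; the file path and the `fallback` line are the ones quoted in the steps above.

```shell
# comment_out_fallback: hypothetical helper for the workaround described above.
# Prefixes every not-yet-commented line whose first word is "fallback" with "#",
# editing the file in place and keeping a copy with a .bak suffix.
# Assumes "#" starts a comment in the Corefile.
comment_out_fallback() {
  corefile="$1"
  sed -i.bak -E 's/^([[:space:]]*)(fallback[[:space:]])/\1#\2/' "$corefile"
}
```

Inside the container it would be called as `comment_out_fallback /usr/share/tempio/corefile`, followed by `ha dns restart` from the host. Running it twice is harmless, because an already commented line no longer matches the pattern. The same caveat applies as for the manual edit: the change is lost whenever the template file is overwritten.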

The discussion about the hardcoded Cloudflare DNS servers is completely useless, because the devs do not want to discuss that.

And no, the fallback is NOT required for HA to work. I am blocking DoH and DoT for the known public servers in my firewall and have no issues at all.

@stale

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 9, 2022
@Strohhutpat

Still present in 2021.12.7.

@stale stale bot removed the stale label Jan 10, 2022
@FileGo

FileGo commented Jan 13, 2022

I can confirm this. My syslog and user.log files just grew to over 20GB each because of it.

@redgryphon

Could it be fixed by #82?

@mdegat01
Contributor

@redgryphon I don't believe so. I noticed this on a dev system recently, even after that PR, because I forgot to unblock Cloudflare DoT for it. However, I'm closing this because the new option to disable the fallback DNS does fix it: home-assistant/supervisor#3586

7 participants