High CPU on EdgeOS #586
Comments
Please try reducing the max-inflight-requests by half.
Thank you, done - I will report back.
I have the exact same problem on the exact same hardware/software as nbrewster. max-inflight-requests is already set to 128.
Can you please try to reproduce without client identification enabled?
Just set "report-client-info false". Will test and report back.
Still the same problem after setting "report-client-info false".
Same problem (though it took a few days) with "max-inflight-requests 128". Here's a trace: https://gist.github.com/nbrewster/6e65e7ad780aca6464afe555132360b7 I will try "report-client-info false" next (but it seems that didn't help @dk90103).
@nbrewster I had to uninstall nextdns on my Edgemax for now because I hit this CPU bug after a few hours of run-time. But I am happy to try out anything that @rs suggests.
Could this be a backend issue? I understand this is a CLI discussion, but my old router also goes insane when I use nextdns via the dnscrypt-proxy2 package from Entware. Of course this could be a firmware/Entware issue, but it is weird that it happens only with nextdns, and I am seeing similar complaints here about the CLI on different hardware/firmware.
Since setting
Recently had this problem with EdgeRouter X v2.0.9-hotfix.2 and nextdns CLI 1.37.2: the CPU jumps to more than 200%. When this happens, it stops responding to DNS queries. Trying to restart nextdns via the CLI gets no response and hangs the SSH session. If I open another SSH session, the nextdns service status is stopped; starting it again makes DNS work again. I think disabling report-client-info and log-queries helps with this issue. This is the log from /var/log/messages around the time it happens.
Do you reproduce with an older revision?
I'm trying with the older version 1.36.0. Will report back if I can reproduce the problem. I think we can use the NEXTDNS_VERSION environment variable to set the installation version. To install the older version, I first removed nextdns via the CLI, then reinstalled it.
Update for version 1.36.0: still getting high CPU usage after running for about 2 hours 30 minutes. DNS is not responding when that happens. I managed to get DNS working again with the nextdns restart command. I also noticed that every time that happens, the log is flooded with "doh resolve: context deadline exceeded" and "roudtrip: context deadline exceeded" messages. Now I'm trying version 1.35.0. Will report if this issue happens again.
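To put a number on how badly the log is flooded, the timeout messages can be counted. This is only a minimal sketch, assuming the log is plain text with one message per line. The function name `count_deadline_errors` and the sample input are hypothetical, and the real log format on EdgeOS may differ.

```python
from collections import Counter

# Substring shared by both messages quoted in the comment above.
DEADLINE_MARKER = "context deadline exceeded"


def count_deadline_errors(log_text, prefixes=("doh resolve", "roudtrip")):
    """Count timeout lines in log_text, grouped by the prefix before the marker.

    Lines that contain the marker but none of the given prefixes are
    counted under the key "other".
    """
    counts = Counter()
    for line in log_text.splitlines():
        if DEADLINE_MARKER not in line:
            continue
        for prefix in prefixes:
            if f"{prefix}: {DEADLINE_MARKER}" in line:
                counts[prefix] += 1
                break
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real nextdns output.
    sample = (
        "x doh resolve: context deadline exceeded\n"
        "x roudtrip: context deadline exceeded\n"
        "x unrelated line\n"
    )
    print(count_deadline_errors(sample))
```

Running it over the text of /var/log/messages before and after a version change would give a rough way to compare how often the timeouts occur.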
I have not rolled back nextdns versions yet. The symptom occurred again overnight. I was able to restart nextdns before the router became unresponsive. I've attached some logs from around the timeframe I think the symptom occurred. It's essentially all non-query logs between restarts of nextdns. My non-client/network/config settings are:
I have been running nextdns CLI 1.35.0 for almost 36 hours and haven't seen this issue occur again. CPU usage is pretty low and I don't see any spikes caused by nextdns. The router replied to DNS requests that whole time. I also configured PRTG when I installed 1.35.0 to monitor the CPU usage (the empty CPU load in the graph is caused by my PRTG server going to sleep). Besides that, I also noticed a possible memory leak (?) when "doh resolve: context deadline exceeded" happened. The memory usage of the nextdns CLI increased by about 35 MB when that happened and did not come back to the initial memory usage. That is double what it was when the nextdns CLI started, so I don't think it should be necessary to use that much memory to run a DoH client. I think this issue was already discussed in #505 but sadly still hasn't been fixed.
I've continued to run 1.37.2, and have not observed this symptom since October 9th. In my case, I might have ~65 clients/8 configs. Not sure if others experiencing this symptom have similar device and query volumes.
If for any reason (like a network hiccup) the HTTPSSVC steering fails and falls back on the non-bootstrapped HTTPSSVC steering while /etc/resolv.conf is set to localhost, the DNS resolution will create a circular dependency. This change disables this fallback while waiting for a better solution. Fixes #587 Fixes #591 May be related to #586
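The precondition for that circular dependency, /etc/resolv.conf pointing only at localhost, can be checked with a few lines of code. This is a minimal sketch for diagnosis, not code from nextdns. The function names are hypothetical, and it only looks at `nameserver` lines.

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines, ignoring comments."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        # Drop anything after a comment character, then split on whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def only_loopback_resolvers(resolv_conf_text):
    """True if every listed nameserver is a loopback address.

    An empty list is treated as False here for simplicity, and an
    address that cannot be parsed is treated as non-loopback.
    """
    servers = nameservers(resolv_conf_text)
    if not servers:
        return False
    for server in servers:
        try:
            if not ipaddress.ip_address(server).is_loopback:
                return False
        except ValueError:
            return False
    return True


if __name__ == "__main__":
    with open("/etc/resolv.conf") as f:
        print(only_loopback_resolvers(f.read()))
```

If this prints `True` on the router, any lookup the local resolver makes through the system resolver comes straight back to itself, which is the loop the change above avoids.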
The symptom occurred again overnight:
Is this with the snapshot version?
Please try with this version:
Just installed this version now. Will monitor.
I experienced my crash symptom overnight on the pre-release snapshot. I'm trying to pull logs, but might have lost them, as the loss of DNS disconnected my logging server. I've installed 1.35 and will continue to monitor.
@ralban I'm not sure if you have done this before. However, I recently fixed my recurring nextdns EdgeRouter issue by performing a full hardware reset on my device and reinstalling the nextdns CLI. I backed up my router config and uploaded it once the hardware reset was done. Previously, I had performed "software" factory resets but never did a "hardware" factory reset until now. The hardware reset clears the config but also formats the file system to default.
After trying the fix @nbrewster suggested, NextDNS-cli on my EdgeRouter-X has been totally stable. My log still has sparse "context deadline exceeded" errors, but significantly fewer than before, and all users on the network report that stability has resumed. Previous outages were once or twice a day; no service disruption for nearly a week now.
I'm also using nextdns 1.37.0 with ER-X and was experiencing DNS drops about twice a day. I just downgraded to 1.35.0 and will comment again if I experience more of the same issue.
@jaYINGLING please try with 1.37.4 not 1.37.0.
@rs Okay, I'll give that a try and report back.
@rs still having the issue on 1.37.4
Reporting in that 1.35.0 has remained stable and uptime hasn't been affected since my previous comment.
@rs I confirm, version 1.37.4 also failed me.
Currently trying pr-618/SNAPSHOT-9483d8b, will report back. Installation command:
Do we know which minor version of Go 1.17 is used? There are some additional fixes in 1.17.3. Minor revisions: go1.17.2 (released 2021-10-07) includes a security fix to the linker and misc/wasm directory, as well as bug fixes to the compiler, the runtime, the go command, and to the time and text/template packages. See the Go 1.17.2 milestone on our issue tracker for details. go1.17.3 (released 2021-11-04) includes security fixes to the archive/zip and debug/macho packages, as well as bug fixes to the compiler, linker, runtime, the go command, the misc/wasm directory, and to the net/http and syscall packages. See the Go 1.17.3 milestone on our issue tracker for details.
1.17.3 was used for 1.37.4.
Same issue on Mi Wifi Mini with latest OpenWRT. Still not fixed.
Same issue on ERX with latest OpenWRT/nextdns opkg.
nextdns 1.37.7 is working well on my EdgeRouter X without caching or client identification, using dnsmasq for caching. My nextdns.conf:
dnsmasq is configured similar to:
@anastyalien said:
Yes, I found the same when using any DoH via dnscrypt-proxy2 on my EdgeRouter X. I had no problem with dnscrypt-proxy2 when using dnscrypt (rather than DoH) to talk to OpenDNS.
It is probably due to the cipher suite negotiated with TLS and the lack of acceleration on this hardware. We attempted several optimizations to reduce TLS handshake frequency, but it is apparently not always enough.
@rs said:
How would that turn nextdns cli (and it seems, also dnscrypt-proxy2) into useless CPU eating processes?
My theory is that the handshake gets so slow that it times out and retries in a loop. I can't reproduce this issue on the same hardware, so it is hard to understand what is happening.
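The theory above describes a retry loop with no pause between attempts. As an illustration only (this is not taken from the nextdns source, and it is not a claim about how nextdns retries), a capped exponential backoff spaces retries out so that a slow handshake does not turn into a busy loop. The function name and the parameter values are hypothetical.

```python
def backoff_delays(base, cap, attempts):
    """Return the wait in seconds before each retry.

    The delay starts at `base`, doubles after every attempt, and is
    never allowed to exceed `cap`.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    # Hypothetical values: start at 0.5 s, never wait longer than 4 s.
    print(backoff_delays(0.5, 4.0, 5))
```

Real implementations usually add random jitter to each delay so that many clients do not retry at the same instant.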
People are saying nextdns cli v1.35 does not have this problem. Has something changed since then?
Are you able to introduce some synthetic network delay and/or packet loss into your test environment?
Does this problem only occur on mipsle based EdgeRouters?
I wanted to update my experience with this issue to help if possible. I have been running version v1.35 since Nov 10th and haven't experienced any excessive errors, CPU usage, or lockups until today. Today may have been a fluke, considering it has been running for the last 3-4 months straight. I went ahead and loaded #618 as suggested above and will keep an eye on it. I am optimistic, considering that I've only seen one person complain of an issue.
I updated to the latest version on my EdgeRouter ER-X and it's been stable and happy for a couple of weeks now.
I have also been running 1.37.7 for a couple of weeks with no issues - even with
Yes. 1.37.7 and 1.37.10 seem good on my ER-X too.
I noticed this issue on 1.37.11 (not on EdgeOS) after a few hours:
The same issue occurred on 1.37.10; on 1.37.7 everything seems stable for now. Edit: When checking the commits between 1.37.7 and 1.37.10 I don't see anything related that could cause this issue. I guess it's probably just that the problem happens randomly. How to downgrade to custom version 1.37.10 on OpenWrt:
Context
NextDNS starts consuming all the CPU on my EdgeRouter, at least once a day.
Example: top output
Config:
Trace here: https://gist.github.com/nbrewster/fd250ddc3cad791073756ddb2007bba5