Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU on EdgeOS #586

Closed
nbrewster opened this issue Sep 20, 2021 · 66 comments
Closed

High CPU on EdgeOS #586

nbrewster opened this issue Sep 20, 2021 · 66 comments
Labels
🐞 bug Something isn't working

Comments

@nbrewster
Copy link

nbrewster commented Sep 20, 2021

Context

  • CLI Version: nextdns version 1.37.2
  • Platform: EdgeRouter ER-X-SFP / EdgeOSv2.0.9-hotfix.2

NextDNS starts consuming all the CPU on my EdgeRouter, at least once a day.

Example: top output

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
 3889 root      20   0  670268  87888   5148 R 272.4 34.7  49:20.80 nextdns         

Config:

cache-size 0
cache-max-age 0s
bogus-priv true
timeout 5s
listen localhost:53
control /var/run/nextdns.sock
use-hosts true
max-inflight-requests 256
detect-captive-portals false
hardened-privacy false
log-queries false
max-ttl 0s
report-client-info true
discovery-dns 
setup-router true
config ****
auto-activate false

Trace here: https://gist.github.com/nbrewster/fd250ddc3cad791073756ddb2007bba5

@nbrewster nbrewster added the 🐞 bug Something isn't working label Sep 20, 2021
@rs
Copy link
Contributor

rs commented Sep 20, 2021

Please try reducing the max-inflight-requests by half.

@nbrewster
Copy link
Author

nbrewster commented Sep 20, 2021

Thank you, done - I will report back.

@dk90103
Copy link

dk90103 commented Sep 21, 2021

I have the exact same problem on the exact same hardware/software as nbrewster.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1216 root 20 0 669756 79868 4816 S 390.7 31.5 29:29.20 nextdns

max-inflight-requests aleady set to 128.

@rs
Copy link
Contributor

rs commented Sep 21, 2021

Can you please try to reproduce without client identification enabled?

@dk90103
Copy link

dk90103 commented Sep 21, 2021

Just set: "report-client-info false"

Will test and report back

@dk90103
Copy link

dk90103 commented Sep 23, 2021

Still the same problem after setting "report-client-info false""
Config:
listen localhost:53
bogus-priv true
config XXXXXX
timeout 5s
max-inflight-requests 128
auto-activate false
control /var/run/nextdns.sock
log-queries false
cache-size 10MB
cache-max-age 0s
detect-captive-portals false
max-ttl 5s
report-client-info false
discovery-dns
hardened-privacy false
use-hosts true
setup-router true

Top output:
image
And worse:
image

@nbrewster
Copy link
Author

nbrewster commented Sep 23, 2021

Same problem (though it took a few days) with "max-inflight-requests 128".

Here's a trace: https://gist.github.com/nbrewster/6e65e7ad780aca6464afe555132360b7

I will try "report-client-info false" next (but it seems that didn't help @dk90103).

@dk90103
Copy link

dk90103 commented Sep 23, 2021

@nbrewster I currently had to un-install nextdns in my Edgemax because I hit this CPU bug after a few hours of run-time. But I am happy to try out anything that @rs suggests.

@anastyalien
Copy link

Could this be a backend issue? I understand this is a CLI discussion but my old router also goes insane when I use nextdns via dnscrypt-proxy2 package from Entware. Of course this could be a firmware/entware issue but weird that it happens only with nextdns and seeing similar complaints here on CLI and different hardware/firmware.
After a while, nextdns makes dnscrypt-proxy service hang and router goes into 100% CPU usage with all DNS requests failing either until I reboot my router or manually restart dnscrypt-proxy service. This seems to only happen with nextdns and no other doh/dnscrypt provider. Happens even when I use my personal clientID instead of the default public "nextdns" one from dnscrypt relays list.
No setting in dnscrypt-proxy.toml affects this behavior and dnscrypt main log also has no errors, only dnscrypt domain resolve log gets flooded with "network error" errors and response times of 5000ms+ despite having no connection issues to nextdns servers when looking at traceroutes and diag.exe.

@nbrewster
Copy link
Author

Since setting max-inflight-requests 128 and report-client-info false, I have not seen a recurrence of the problem on my router. Of course, I also no longer see client names in the logs (everything shows as the public IP of my router), which is not acceptable for my use case. @rs any other suggestions?

@andrew-susanto
Copy link

andrew-susanto commented Oct 8, 2021

Recently had this problem with EdgeRouter X v2.0.9-hotfix.2 and nextdns CLI 1.37.2, the cpu jumps more than 200%. When this happens, it's stop responding to dns queries, try to restart nextdns via CLI got no response, hang the ssh, try create another ssh session, the nextdns service status is stopped, try to start again and the dns works again. I think disabling report-client-info and log-queries help this issue.

This is the log from /var/log/messages around the time it happens.
log.txt

@ralban
Copy link

ralban commented Oct 8, 2021 via email

@rs
Copy link
Contributor

rs commented Oct 8, 2021

do you reproduce with older revision?

@ralban
Copy link

ralban commented Oct 8, 2021 via email

@andrew-susanto
Copy link

I'm trying with older version 1.36.0. Will report back if i can reproduce the problem.

I think to set installation version, we can use NEXTDNS_VERSION environment variable
Run this command first before sh the installation script.
export NEXTDNS_VERSION=1.36.0

To install the older version, i first remove nextdns via CLI, then reinstall it.

@andrew-susanto
Copy link

andrew-susanto commented Oct 9, 2021

Update for version 1.36.0. Still get high cpu usage after running about 2 hours 30 minutes. DNS is not responding when that happens. I managed to get the dns work again with nextdns restart command. I also notice that everytime that happens, the log is flooded with doh resolve: context deadline exceeded and roudtrip: context deadline exceeded message.

image

Now i'm trying version 1.35.0. Will report if this issue happens again.

@ralban
Copy link

ralban commented Oct 9, 2021

I have not rolled back nextdns versions yet. The symptom occurred again overnight. I was able restart nextdns before the router became un-response. I've attached some logs from around the timeframe I think the symptom occurred. It's essentially all non-query logs between restarts of nextdns.

my non client/network/config settings are:
log-queries true max-ttl 5s detect-captive-portals false timeout 5s max-inflight-requests 256 setup-router true listen localhost cache-size 10MB report-client-info true discovery-dns cache-max-age 0s bogus-priv false use-hosts true auto-activate true control /var/run/nextdns.sock hardened-privacy false

nextdns-logs.csv

@andrew-susanto
Copy link

andrew-susanto commented Oct 10, 2021

Have been running nextdns cli 1.35.0 over almost 36 hours and haven't seen this issue occured again. The CPU usage is pretty low and not see any spike caused by nextdns. The router gives reply on dns request in all of those time. I also configured prtg since i installed 1.35.0 to monitor the cpu usage (empty cpu load in graph caused by my prtg server goes to sleep).

image

Beside of that, i also noticed posibility of memory leak (?) when doh resolve: context deadline exceeded happened. The memory usage for nextdns cli increased about 35 MB since that happened and not coming back to initial memory usage. it is double from when the nextdns cli started, so i think it shouldn't be necessarly to use that much of memory space for running doh client. I think this issue already ever discussed in #505 but sadly still haven't fixed yet.

image

@ralban
Copy link

ralban commented Oct 13, 2021

I've continued to run 1.37.2, and have not observed this symptom since October 9th.
Perhaps this symptom is not in the CLI client. Maybe it is something cloud side that becomes evident in the CLI on EdgeOS b/c it is a lower-powered device with a higher volume of requests served. The symptom might not be noticeable on single smartphone (faster proc) with only 1 device querying it.

In my case, I might have ~65 clients/8 configs. Not sure if others experiencing this symptom have similar device and query volumes.

rs added a commit that referenced this issue Oct 14, 2021
If for any reason (like a network hiccup) the HTTPSSVC steering fails and
falls back on the non bootstraped HTTPSSVC steering while the
/etc/resolv.conf is set to localhost, the DNS resolution will create a
circular dependency.

This change disables this fallback waiting for a better solution.

Fixes #587
Fixes #591

May be related to #586
@ralban
Copy link

ralban commented Oct 15, 2021

the symptom occurred again over-night:
I ran sudo nextdns restart which failed (that is my 07:44:55 log entry) restarting the router fixed the symptom:

Oct 15 11:44:56 10.10.2.1 Oct 15 07:44:55 ubnt nextdns[11944]: Endpoint failed: https://dns.nextdns.io#170.39.224.134,2a0e:6902:2002:12b:5054:ff:fed7:6b78: roundtrip: context deadline exceeded
Oct 15 04:21:23 10.10.2.1 Oct 15 00:20:57 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#170.39.224.134,2a0e:6902:2002:12b:5054:ff:fed7:6b78: roundtrip: context deadline exceeded
Oct 15 04:21:23 10.10.2.1 Oct 15 00:20:57 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#199.119.65.94,2a0b:4342:1a32:f:5054:ff:fe48:d17f: roundtrip: context deadline exceeded
Oct 15 04:04:02 10.10.2.1 Oct 15 00:04:02 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#199.119.65.94,2a0b:4342:1a32:f:5054:ff:fe48:d17f: roundtrip: context deadline exceeded
Oct 15 04:03:57 10.10.2.1 Oct 15 00:03:57 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#170.39.224.134,2a0e:6902:2002:12b:5054:ff:fed7:6b78: roundtrip: context deadline exceeded
Oct 15 04:03:51 10.10.2.1 Oct 15 00:03:51 ubnt nextdns[17778]: Endpoint failed: https://dns2.nextdns.io#45.90.30.0,2a07:a8c1::: roundtrip: context deadline exceeded
Oct 15 04:03:45 10.10.2.1 Oct 15 00:03:45 ubnt nextdns[17778]: Endpoint failed: https://dns1.nextdns.io#45.90.28.0,2a07:a8c0::: roundtrip: context deadline exceeded
Oct 15 04:03:19 10.10.2.1 Oct 15 00:03:19 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#199.119.65.94,2a0b:4342:1a32:f:5054:ff:fe48:d17f: roundtrip: context deadline exceeded
Oct 15 04:03:14 10.10.2.1 Oct 15 00:03:14 ubnt nextdns[17778]: Endpoint failed: https://dns.nextdns.io#170.39.224.134,2a0e:6902:2002:12b:5054:ff:fed7:6b78: roundtrip: context deadline exceeded

@rs
Copy link
Contributor

rs commented Oct 15, 2021

Is this with the snapshot version?

@ralban
Copy link

ralban commented Oct 15, 2021 via email

@rs
Copy link
Contributor

rs commented Oct 15, 2021

Please try with this verison: NEXTDNS_VERSION=master/SNAPSHOT-ff4185e sh -c 'sh -c "$(curl -sL https://nextdns.io/install)"'

@dk90103
Copy link

dk90103 commented Oct 16, 2021

Please try with this verison: NEXTDNS_VERSION=master/SNAPSHOT-ff4185e sh -c 'sh -c "$(curl -sL https://nextdns.io/install)"'

Just installed this version now. Will monitor.
Executed "export NEXTDNS_VERSION=master/SNAPSHOT-ff4185e" and then "sh -c 'sh -c "$(curl -sL https://nextdns.io/install)"'"

@dk90103
Copy link

dk90103 commented Oct 16, 2021

After 45 mins I see this with the special build:
image

If I revert to 1.35 it works just fine. So something was introduced after 1.35 that the edgeOS does not like.

@ralban
Copy link

ralban commented Oct 17, 2021

I experienced my crash symptom overnight on the pre-release snapshot. I'm trying to pull logs, but might have lost them as loss DNS disconnected by logging server. I'm installed 1.35 and will continue to monitor.

@jrm523
Copy link

jrm523 commented Oct 19, 2021

@ralban I'm not sure if you have done this before. However, I recently fixed my recurring nextdns edgerouter issue by performing a full hardware reset on my device and reinstalling Nextdns cli. I backed up my router config and uploaded it once the hardware reset was done. Previously, I had performed "software" factory resets but never did a "hardware" factory reset until now. The hardware reset clears the config but also formats the file system to default.

@ybkshaw
Copy link

ybkshaw commented Nov 12, 2021

After trying the fix @nbrewster suggested NextDNS-cli on my EdgeRouter-X has been totally stable. My log still has sparse "context deadline exceeded" errors but significantly less than previously seen and all users on the network report stability has resumed.

Previous outages were once or twice a day, no service disruption for nearly a week now.

@jaYINGLING
Copy link

I'm also using nextdns 1.37.0 with ER-X and was experiencing DNS drops about twice a day. I just downgraded to 1.35.0 and will comment again if I experience more of the same issue.

@rs
Copy link
Contributor

rs commented Nov 22, 2021

@jaYINGLING please try with 1.37.4 not 1.37.0.

@jaYINGLING
Copy link

@rs Okay, I'll give that a try and report back.

@jaYINGLING
Copy link

@rs still having the issue on 1.37.4

@ybkshaw
Copy link

ybkshaw commented Nov 23, 2021

Reporting in that 1.3.5.0 has remained stable and uptime hasn't been effected since my previous comment.

@Hugo-LA
Copy link

Hugo-LA commented Nov 23, 2021

@rs I confirm, version 1.37.4 also failed me.

@vincentbernat
Copy link
Member

Could you try with the snapshot in #618? It switches back to Go 1.16 instead of Go 1.17 which was one of the main change between 1.35 and 1.36. Comment back in #618 if it works or not. Thanks!

@andrew-susanto
Copy link

Currently trying pr-618/SNAPSHOT-9483d8b, will report back.

Installation command:
NEXTDNS_VERSION=pr-618/SNAPSHOT-9483d8b sh -c 'sh -c "$(curl -sL https://nextdns.io/install)"'

@dk90103
Copy link

dk90103 commented Nov 26, 2021

Do we know which minor version of 1.17 that is used ? There are some additional fixes in 1.17.3
go1.17 (released 2021-08-16)
Go 1.17 is a major release of Go. Read the Go 1.17 Release Notes for more information.

Minor revisions
go1.17.1 (released 2021-09-09) includes a security fix to the archive/zip package, as well as bug fixes to the compiler, linker, the go command, and to the crypto/rand, embed, go/types, html/template, and net/http packages. See the Go 1.17.1 milestone on our issue tracker for details.

go1.17.2 (released 2021-10-07) includes a security fix to the linker and misc/wasm directory, as well as bug fixes to the compiler, the runtime, the go command, and to the time and text/template packages. See the Go 1.17.2 milestone on our issue tracker for details.

go1.17.3 (released 2021-11-04) includes security fixes to the archive/zip and debug/macho packages, as well as bug fixes to the compiler, linker, runtime, the go command, the misc/wasm directory, and to the net/http and syscall packages. See the Go 1.17.3 milestone on our issue tracker for details.

@vincentbernat
Copy link
Member

1.17.3 was used for 1.37.4.

@shivv3r
Copy link

shivv3r commented Dec 14, 2021

Same issue on Mi Wifi Mini with latest OpenWRT. Still not fixed.

@ahuj9
Copy link

ahuj9 commented Dec 19, 2021

Same issue on ERX with latest OpenWRT/nextdns opkg

@BrianG61UK
Copy link

BrianG61UK commented Jan 5, 2022

nextdns 1.37.7 is working well on my EdgeRouter X without caching or client identification using dnsmasq for caching.

My nextdns.conf:

auto-activate false
control /var/run/nextdns.sock
config ******
max-ttl 0s
report-client-info false
max-inflight-requests 256
discovery-dns
mdns disabled
hardened-privacy false
setup-router false
listen 127.0.0.1:53053
cache-size 0
cache-max-age 0s
detect-captive-portals false
bogus-priv true
log-queries false
use-hosts false
timeout 5s

dnsmasq is configured similar to:

    dns {
        forwarding {
            cache-size 1000
            listen-on switch0.10
            listen-on switch0.20
            options no-resolv
            options query-port=53054
            options server=127.0.0.1#53053
            options bogus-priv
            options domain-needed
            options stop-dns-rebind
            options dhcp-option=252,\n
        }
    }

@BrianG61UK
Copy link

BrianG61UK commented Jan 5, 2022

@anastyalien said:

but my old router also goes insane when I use nextdns via dnscrypt-proxy2 package from Entware.

Yes, I found the same when using any DoH via dnscrypt-proxy2 on my EdgeRouter X.
After a few hours no dns, and dnscrypt-proxy2 using 100++% CPU.

I had no problem with dnscrypt-proxy2 when using dnscrypt (rather than DoH) to talk to OpenDNS.

@rs
Copy link
Contributor

rs commented Jan 5, 2022

It is probably due to the cipher suite negociate with TLS and the lack of acceleration on this hardware. We attempted several optimizations to reduce TLS handshake frequency, but it is apparently not always enough.

Cc @vincentbernat

@BrianG61UK
Copy link

@rs said:

It is probably due to the cipher suite negociate with TLS and the lack of acceleration on this hardware.
We attempted several optimizations to reduce TLS handshake frequency, but it is apparently not always enough.

Cc @vincentbernat

How would that turn nextdns cli (and it seems, also dnscrypt-proxy2) into useless CPU eating processes?

@rs
Copy link
Contributor

rs commented Jan 5, 2022

My theory is that handshake get so slow that it times out and retry in loop. I can’t reproduce this issue on the same hardware so it is hard to understand what is happening.

@BrianG61UK
Copy link

BrianG61UK commented Jan 5, 2022

People are saying nextdns cli v1.35 does not have this problem. Has something changed since then?
Is that something perhaps something that is the same on dnscache-proxy2?
Also why does nextdns cli 1.37.7 seem to work well as I have chosen to configure it on my EdgeRouter X shown here

@jeamland
Copy link

jeamland commented Jan 5, 2022

My theory is that handshake get so slow that jt times out and retry in loop. I can’t reproduce this issue on the same hardware so it is hard to understand what is happening.

Are you able to introduce some synthetic network delay and/or packet loss into your test environment?

@BrianG61UK
Copy link

BrianG61UK commented Jan 6, 2022

Does this problem only occur on mipsle based EdgeRouters?
Has anyone seen this on the mips64 based EdgeRouters?

@BrianG61UK
Copy link

People are saying nextdns cli v1.35 does not have this problem.
If that's right and v1.35 is the most recent version without this problem, then that represents the last version released before updating the go compiler to v1.17. Could there be a bug in 1.17 with mipsle code?

@vincentbernat
Copy link
Member

1.37.7 is using Go 1.16. See #618. It has been confirmed by several people this was solving the problem for them on ER-X. Maybe you can try the snapshot in #620. This is using the exact same version Go version than for 1.35.

@jrm523
Copy link

jrm523 commented Feb 23, 2022

I wanted to update my experience with this issue to help if possible.. I have been running version v1.35 since Nov 10th and havent experienced any excessive errors, cpu usage, or lockups until today. Today may have been a fluke considering it has been running for the last 3-4 months straight. I went ahead and loaded #618 as suggested above and will keep an eye on it. I am optimistic considering that ive only seen one person complain of an issue.

@ybkshaw
Copy link

ybkshaw commented Mar 17, 2022

I updated the latest version on my EdgeRouter ER-X and it's been stable and happy for a couple of weeks now.

@nbrewster
Copy link
Author

I have also been running 1.37.7 for a couple of weeks with no issues - even with report-client-info true.

@BrianG61UK
Copy link

Yes. 1,37.7 and 1.37.10 seem good on my ER-X too.

@j-jacobi
Copy link

j-jacobi commented Apr 10, 2022

I noticed this issue on 1.37.11 (not on EdgeOS) after a few hours:

INFO: OS: openwrt
INFO: GOARCH: mips_softfloat
INFO: GOOS: linux
INFO: NEXTDNS_BIN: /usr/sbin/nextdns
INFO: INSTALL_RELEASE: 1.37.11

For the moment I downgraded to 1.37.10 and didn't notice the issue 🤞

The same issue occurred on 1.37.10, on 1.37.7 everything seems stable for now.

Edit: When checking the commits between 1.37.10 and 1.37.7 I don't see anything related that could cause this issue, I guess it's probably just that the problem happens randomly.
When the bug occurs (DNS no longer working at all) I notice a high CPU usage and the only way to recover is to restart nextdns.

How to downgrade to custom version 1.37.10 on OperWRT:
  1. First uninstall luci-app-nextdns if installed and then nextdns (Your configuration file will not be removed)
    opkg remove luci-app-nextdns
    opkg remove nextdns
  2. Download and install the old version
    wget https://github.com/nextdns/nextdns/releases/download/v1.37.10/nextdns_1.37.10_linux_mips_softfloat.tar.gz -O nextdns.tar.gz
    tar xvzf nextdns.tar.gz
    ./nextdns install
  3. (Optional/hacky): Install luci-app-nextdns, this will reinstall the new version of nextdns so I manually replaced it with:
    rm /usr/sbin/nextdns && mv ./nextdns /user/sbin/nextdns

@rs rs closed this as not planned Won't fix, can't repro, duplicate, stale Sep 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug Something isn't working
Projects
None yet
Development

No branches or pull requests