
Netdata stops working after some time: lots of sockets in CLOSE_WAIT from prometheus remote host #7440

Closed
cpw opened this issue Dec 5, 2019 · 22 comments · Fixed by #7478, #7533 or #7565

cpw commented Dec 5, 2019

Bug report summary

My netdata stops working after some time and I have to restart the process for it to resume. The problem occurs on an intermittent basis - there seems to be little pattern to when the process stops working.

When this happens I notice a lot of CLOSE_WAIT sockets from my remote prometheus host (set up as a scrape in prometheus). Is this a cause, or a symptom? I do not know.

OS / Environment

Debian Linux buster, using the kickstart install following the nightly channel (it seems to update reasonably regularly). I'm running on the bare metal of the server - there are plenty of containers, but netdata itself is not in one.

Netdata version (output of netdata -V)

netdata v1.19.0-37-nightly (this problem has been happening intermittently since around the v1.18 release or so)

Component Name

Everything, but probably related to prometheus somehow?

Steps To Reproduce

Wait. Though I can't guarantee there's not something specific to my environment.

Expected behavior

That netdata runs for more than around 24 hours without dying and needing to restart.

Observations

There is nothing in the logs - I tried compiling with debugging enabled as described in the documentation, but it revealed nothing. The logs seem to sort of, but not really, continue: I can see the scraper in the logs up until the explosion of connections, after which netdata becomes unresponsive to all network activity.

The process definitely seems hung: when it is working normally, the command systemctl restart netdata returns in a second, but when it is hung the same command takes around 60-90 seconds to complete (I'm guessing systemd is sending SIGKILL after the graceful stop timeout).

Most of the time, the "netdata" process is still around and doesn't indicate zombie status, but on a couple of occasions I've noticed it's completely vanished.

Attached is an example netstat output from when it was in such a wedged state. I have anonymized public IP addresses in this trace. The "prometheus" host is 10.0.0.241 - the host running netdata is 10.10.0.254.

Also attached is the output from lsof for the netdata pid at the time of failure. There seem to be quite a lot of file handles open on the dbengine files. I don't know if this is relevant.

I have streaming set up from approximately 5 other environments, updating every few seconds.
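Roughly, each of those children streams with the standard stream.conf shape - the destination and API key below are placeholders, not my real values:

# stream.conf on each child (illustrative values only)
[stream]
    enabled = yes
    destination = netdatahost:19999
    api key = 00000000-0000-0000-0000-000000000000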

The prometheus scrape config is:

global:
  scrape_interval:     60s
  evaluation_interval: 60s
  scrape_timeout: 60s

scrape_configs:
...
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [ prometheus_all_hosts ]
    honor_labels: true
    static_configs:
      - targets: ['netdatahost:19999']
...

This was working very well for many months, until around 2 months ago, which I'd guess is when netdata v1.18 or so landed.

This has become so frequent that I have implemented a simple cron task to restart netdata when the "wc -l" count of CLOSE_WAIT connections on port 19999 exceeds 50.
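The check itself is just a shell sketch along these lines (port 19999 and the threshold of 50 as mentioned above; the exact netstat/awk filtering is illustrative):

#!/bin/sh
# Restart netdata when CLOSE_WAIT sockets pile up on the API port.
PORT=19999
LIMIT=50
count=$(netstat -tn 2>/dev/null | awk -v p=":$PORT" '$4 ~ p && $6 == "CLOSE_WAIT"' | wc -l)
if [ "$count" -gt "$LIMIT" ]; then
    systemctl restart netdata
fi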

I would like advice on what diagnostics I should try next to see if I can figure out what's causing the runaway connections. I'm not 100% convinced this isn't a problem with something else causing netdata to lock up, but I have nothing to go on but the dead connections at present.

netstat.txt
lsof.txt

cpw added the bug and needs triage (Issues which need to be manually labelled) labels Dec 5, 2019
cpw changed the title from "Netdata stops working after some time: lots of sockets in TIME_WAIT from prometheus remote host" to "Netdata stops working after some time: lots of sockets in CLOSE_WAIT from prometheus remote host" Dec 5, 2019
cakrit (Contributor) commented Dec 6, 2019

Ok, this looks like way too many open database engine files.

  • We have some charts to monitor the dbengine under "Netdata Monitoring" -> dbengine (right navbar, all the way to the end). Can you share some screenshots, or even better, do an export when you start noticing the delay? You can email the export to support@netdata.cloud, because it will contain information that can't be shared here.
  • Do you do any very large queries, e.g. attempting to retrieve many days' worth of data at high resolution (see the example right after this list)? We have an open bug about very large queries that can make netdata hang when using memory mode = dbengine.
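For example, something like this against the data API would count as a very large query - the chart name and time range here are just illustrative:

# ~1 week of per-second points for a single chart
curl 'http://netdatahost:19999/api/v1/data?chart=system.cpu&after=-604800&before=0&points=604800&format=json' -o /dev/null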

ilyam8 (Member) commented Dec 6, 2019

I see the request is /api/v1/allmetrics?format=prometheus_all_hosts (master); the interval is 60 secs.
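For reference, the same request can be reproduced by hand to check whether the endpoint still answers (hostname as in the scrape config):

curl -sv 'http://netdatahost:19999/api/v1/allmetrics?format=prometheus_all_hosts' -o /dev/null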

cakrit (Contributor) commented Dec 6, 2019

60 sec is nothing. lsof.txt shows many much older dbengine files. @mfundul will take a look; maybe he has a different idea.

mfundul (Contributor) commented Dec 6, 2019

Hello @cpw, I'd still like to see the logs to understand the exact configuration of netdata and help us debug the problem. You can send the error.log to markos@netdata.cloud.

Feel free to remove any sensitive information that there may be in the log if you like.

cpw (Author) commented Dec 6, 2019

I'll email that over (sent).
I'll keep monitoring to try and generate an export when it's becoming wobbly.

mfundul (Contributor) commented Dec 6, 2019

> I'll email that over (sent).
> I'll keep monitoring to try and generate an export when it's becoming wobbly.

Did you send it to me or to support? I don't see it in my inbox.

cpw (Author) commented Dec 6, 2019

I sent both documents requested to both email addresses. (support and markos).

mfundul (Contributor) commented Dec 6, 2019

In the logs I see, every day at about 7:38, at least the SIGUSR1 signal being received like clockwork, usually accompanied by SIGTERM. This is probably some scheduled task.

I also see some kind of memory corruption at more random points in time on some days.

mfundul (Contributor) commented Dec 6, 2019

@cpw, can you disable log flood protection by setting errors flood protection period to 0 in the configuration? I'm missing a lot of logs because of it.
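Something like this in netdata.conf should do it (assuming the option sits in its usual [global] section in your build):

[global]
    # 0 disables log flood protection, so no error lines get suppressed
    errors flood protection period = 0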

cpw (Author) commented Dec 6, 2019

The daily activity is your netdata-updater? It's also doing SIGUSR1 to cycle the logs. Memory corruption? You have me slightly concerned now... I would like to know where you see that.

I'll make the config change.

mfundul (Contributor) commented Dec 6, 2019

One instance in the logs is:

malloc(): unsorted double linked list corrupted

Let's see if we can see something more without log flood protection.

mfundul (Contributor) commented Dec 6, 2019

Can you tell me what your page cache size is in the configuration file?

cpw (Author) commented Dec 6, 2019

Whatever the default is - the config shows 32 as the commented-out default:

        # page cache size = 32
        # dbengine disk space = 256

cpw (Author) commented Dec 8, 2019

So it died again overnight. I wasn't able to grab an export - the browser UI was unresponsive.

Anyway, I can email you the data if that's something you want. I also captured pmap and strace output for the primary pid. I tried to capture ltrace but that caused the process to crash completely and thus restart.
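Roughly what I ran to capture them (commands are illustrative; the pid is the main netdata process):

pid=$(pidof -s netdata)
pmap -x "$pid" > pmap.txt                       # memory map of the hung process
timeout 30 strace -f -p "$pid" -o strace.txt    # ~30s of syscalls, then detach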

mfundul (Contributor) commented Dec 8, 2019

Sure, please e-mail the logs and whatever else you got to markos@netdata.cloud .

mfundul (Contributor) commented Dec 8, 2019

@cpw can you test #7478 to see if this fixes your problem?

mfundul added the area/database and priority/high (Super important issue) labels and removed the needs triage (Issues which need to be manually labelled) label Dec 9, 2019
mfundul added this to the v1.20-Sprint1 milestone Dec 9, 2019
mfundul (Contributor) commented Dec 9, 2019

@cpw please reopen this if you see the problem again.

cpw (Author) commented Dec 12, 2019

Reopening this problem. The issue occurred again; however, I see very interesting new errors in the log files:

2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47142] : RRDSET: chart name 'web_log_nginx.detailed_response_codes' on host 'im' already exists.
2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47142] : web_log_nginx.detailed_response_codes: too old data (last updated at 1576181223.0, last collected at 1576188566.832046). Resetting it. Will not store the next entry.
2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47144] : RRDSET: chart name 'web_log_nginx.detailed_response_codes' on host 'im' already exists.
2019-12-12 13:04:47: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2019-12-12 13:04:47: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000880.ndf".
2019-12-12 13:04:47: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000880.njf".
2019-12-12 13:05:37: netdata FATAL : WEB_SERVER[static4] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static4]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : EXIT: cleaning up the database...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'katie'...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'im'...
2019-12-12 13:06:37: netdata FATAL : WEB_SERVER[static3] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static3]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : EXIT: cleaning up the database...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'katie'...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'im'...
2019-12-12 13:07:36: netdata FATAL : WEB_SERVER[static1] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:07:36: netdata INFO  : WEB_SERVER[static1] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static1]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : EXIT: cleaning up the database...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'katie'...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'im'...
2019-12-12 13:08:30: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
2019-12-12 13:08:30: alarm-notify.sh: INFO: sent email notification for: im system.cpu.10min_cpu_iowait is CLEAR to 'root'
2019-12-12 13:08:36: netdata FATAL : WEB_SERVER[static2] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:08:36: netdata INFO  : WEB_SERVER[static2] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static2]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : EXIT: cleaning up the database...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'katie'...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'im'...
2019-12-12 13:19:30: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
2019-12-12 13:19:30: alarm-notify.sh: INFO: sent email notification for: im system.swapio.30min_ram_swapped_out is CLEAR to 'root'

The last time the server responded to prometheus was at roughly 13:08. I hope this helps a bit.

ilyam8 reopened this Dec 12, 2019
ilyam8 closed this as completed Dec 12, 2019
thiagoftsm reopened this Dec 12, 2019
thiagoftsm (Contributor) commented Dec 12, 2019

Hi @cpw,

It was reopened; I will close the other one as you requested.

cpw (Author) commented Dec 12, 2019

Thank you. I think the new error might be enlightening for the dbengine. I have it in a broken state at present, so if someone can suggest diagnostic data that might help, I'll happily gather it.

cpw (Author) commented Dec 13, 2019

I've generated exports for a second crash. You can see that the web server threads stop working, though the rest of the engine seems to carry on.

(Screenshot attached: Netdata dashboard export, 2019-12-13.)

mfundul (Contributor) commented Dec 14, 2019

@cpw can you send me your logs again to markos@netdata.cloud ? Let's hope I'll never have to ask again!
