
Netdata stops working after some time: lots of sockets in CLOSE_WAIT from prometheus remote host #7440

Closed
cpw opened this issue Dec 5, 2019 · 22 comments · Fixed by #7478, #7533 or #7565

cpw commented Dec 5, 2019

Bug report summary

My netdata stops working after some time and I have to restart the process for it to resume. The problem occurs on an intermittent basis - there seems to be little pattern to when the process stops working.

When this happens I notice a lot of CLOSE_WAIT sockets from my remote prometheus host (set up as a scrape in prometheus). Is this a cause, or a symptom? I do not know.

OS / Environment

Debian Linux buster, using the kickstart install following the nightly channel (it seems to update reasonably regularly). I'm running on the bare metal of the server - there are plenty of containers, but netdata itself is not in one.

Netdata version (output of netdata -V)

netdata v1.19.0-37-nightly (this problem has been happening intermittently since around the v1.18 release or so)

Component Name

Everything, but probably related to prometheus somehow?

Steps To Reproduce

Wait. Though I can't guarantee there's not something specific to my environment.

Expected behavior

That netdata runs for more than around 24 hours without dying and needing to restart.

Observations

There is nothing in the logs - I tried compiling with debugging enabled as described in the documentation, but it revealed nothing. The logs seem to sort of, but not really, continue: I can see the scraper in the logs up until the explosion of connections, after which netdata becomes unresponsive to all network activity.

The process definitely seems hung: when it is working normally, the command systemctl restart netdata returns in a second, but when it is hung the same command takes around 60-90 seconds to complete (I'm guessing systemd is sending SIGKILL after the graceful stop timeout).

Most of the time, the "netdata" process is still around and doesn't indicate zombie status, but on a couple of occasions I've noticed it's completely vanished.

Attached is an example netstat output from when it was in such a wedged state. I have anonymized public IP addresses in this trace. The "prometheus" host is 10.0.0.241 - the host running netdata is 10.10.0.254.

Also attached is the output from lsof for the netdata pid at the time of failure. There seem to be quite a lot of file handles open on the dbengine files. I don't know if this is relevant.

I have streaming set up from approximately 5 other environments, updating every few seconds.
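Roughly, each of those children streams with the standard stream.conf shape - the destination and API key below are placeholders, not my real values:

# stream.conf on each child (illustrative values only)
[stream]
    enabled = yes
    destination = netdatahost:19999
    api key = 00000000-0000-0000-0000-000000000000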

The prometheus scrape config is:

global:
  scrape_interval:     60s
  evaluation_interval: 60s
  scrape_timeout: 60s

scrape_configs:
...
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [ prometheus_all_hosts ]
    honor_labels: true
    static_configs:
      - targets: ['netdatahost:19999']
...

This was working very well for many months, until around 2 months ago, which I'd guess is when netdata v1.18 or so landed.

This has become so frequent that I have implemented a simple cron task to restart netdata when the "wc -l" count of CLOSE_WAIT connections on port 19999 exceeds 50.
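The check itself is just a shell sketch along these lines (port 19999 and the threshold of 50 as mentioned above; the exact netstat/awk filtering is illustrative):

#!/bin/sh
# Restart netdata when CLOSE_WAIT sockets pile up on the API port.
PORT=19999
LIMIT=50
count=$(netstat -tn 2>/dev/null | awk -v p=":$PORT" '$4 ~ p && $6 == "CLOSE_WAIT"' | wc -l)
if [ "$count" -gt "$LIMIT" ]; then
    systemctl restart netdata
fi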

I would like advice on what diagnostics I should try next to see if I can figure out what's causing the runaway connections. I'm not 100% convinced this isn't a problem with something else causing netdata to lock up, but I have nothing to go on but the dead connections at present.

netstat.txt
lsof.txt

cpw added the bug and needs triage (Issues which need to be manually labelled) labels Dec 5, 2019
cpw changed the title from "Netdata stops working after some time: lots of sockets in TIME_WAIT from prometheus remote host" to "Netdata stops working after some time: lots of sockets in CLOSE_WAIT from prometheus remote host" Dec 5, 2019
cakrit (Contributor) commented Dec 6, 2019

Ok, this looks like way too many open database engine files.

  • We have some charts to monitor the dbengine under "Netdata Monitoring" -> dbengine (right navbar, all the way to the end). Can you share some screenshots, or even better, do an export when you start noticing the delay? You can email the export to support@netdata.cloud, because it will contain information that can't be shared here.
  • Do you do any very large queries, e.g. attempting to retrieve many days' worth of data at high resolution (see the example right after this list)? We have an open bug about very large queries that can make netdata hang when using memory mode = dbengine.
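For example, something like this against the data API would count as a very large query - the chart name and time range here are just illustrative:

# ~1 week of per-second points for a single chart
curl 'http://netdatahost:19999/api/v1/data?chart=system.cpu&after=-604800&before=0&points=604800&format=json' -o /dev/null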

ilyam8 (Member) commented Dec 6, 2019

I see the request is /api/v1/allmetrics?format=prometheus_all_hosts (master); the interval is 60 secs.
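For reference, the same request can be reproduced by hand to check whether the endpoint still answers (hostname as in the scrape config):

curl -sv 'http://netdatahost:19999/api/v1/allmetrics?format=prometheus_all_hosts' -o /dev/null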

cakrit (Contributor) commented Dec 6, 2019

60 sec is nothing. lsof.txt shows many much older dbengine files. @mfundul will take a look; maybe he has a different idea.

mfundul (Contributor) commented Dec 6, 2019

Hello @cpw, I'd still like to see the logs to understand the exact configuration of netdata and help us debug the problem. You can send the error.log to markos@netdata.cloud.

Feel free to remove any sensitive information that there may be in the log if you like.

cpw (Author) commented Dec 6, 2019

I'll email that over (sent).
I'll keep monitoring to try and generate an export when it's becoming wobbly.

mfundul (Contributor) commented Dec 6, 2019

> I'll email that over (sent).
> I'll keep monitoring to try and generate an export when it's becoming wobbly.

Did you send it to me or to support? I don't see it in my inbox.

cpw (Author) commented Dec 6, 2019

I sent both documents requested to both email addresses. (support and markos).

mfundul (Contributor) commented Dec 6, 2019

In the logs I see, every day at about 7:38, at least the SIGUSR1 signal being received like clockwork, usually accompanied by SIGTERM. This is probably some scheduled task.

I also see some kind of memory corruption at more random points in time on some days.

mfundul (Contributor) commented Dec 6, 2019

@cpw, can you disable log flood protection by setting errors flood protection period to 0 in the configuration? I'm missing a lot of logs because of it.
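Something like this in netdata.conf should do it (assuming the option sits in its usual [global] section in your build):

[global]
    # 0 disables log flood protection, so no error lines get suppressed
    errors flood protection period = 0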

cpw (Author) commented Dec 6, 2019

The daily activity is your netdata-updater? It's also doing SIGUSR1 to cycle the logs. Memory corruption? You have me slightly concerned now... I would like to know where you see that.

I'll make the config change.

mfundul (Contributor) commented Dec 6, 2019

One instance in the logs is:

malloc(): unsorted double linked list corrupted

Let's see if we can see something more without log flood protection.

mfundul (Contributor) commented Dec 6, 2019

Can you tell me what your page cache size is in the configuration file?

cpw (Author) commented Dec 6, 2019

Whatever the default is - the config shows 32 as the commented-out default:

        # page cache size = 32
        # dbengine disk space = 256

cpw (Author) commented Dec 8, 2019

So it died again overnight. I wasn't able to grab an export - the browser UI was unresponsive.

Anyway, I can email you the data if that's something you want. I also captured pmap and strace output for the primary pid. I tried to capture ltrace but that caused the process to crash completely and thus restart.
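Roughly what I ran to capture them (commands are illustrative; the pid is the main netdata process):

pid=$(pidof -s netdata)
pmap -x "$pid" > pmap.txt                       # memory map of the hung process
timeout 30 strace -f -p "$pid" -o strace.txt    # ~30s of syscalls, then detach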

mfundul (Contributor) commented Dec 8, 2019

Sure, please e-mail the logs and whatever else you got to markos@netdata.cloud .

mfundul (Contributor) commented Dec 8, 2019

@cpw can you test #7478 to see if this fixes your problem?

mfundul added the area/database and priority/high (Super important issue) labels and removed the needs triage (Issues which need to be manually labelled) label Dec 9, 2019
mfundul added this to the v1.20-Sprint1 milestone Dec 9, 2019
mfundul (Contributor) commented Dec 9, 2019

@cpw please reopen this if you see the problem again.

cpw (Author) commented Dec 12, 2019

Reopening this problem. The issue occurred again; however, I see very interesting new errors in the log files:

2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47142] : RRDSET: chart name 'web_log_nginx.detailed_response_codes' on host 'im' already exists.
2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47142] : web_log_nginx.detailed_response_codes: too old data (last updated at 1576181223.0, last collected at 1576188566.832046). Resetting it. Will not store the next entry.
2019-12-12 13:04:40: netdata INFO  : STREAM_RECEIVER[im,[76.10.144.162]:47144] : RRDSET: chart name 'web_log_nginx.detailed_response_codes' on host 'im' already exists.
2019-12-12 13:04:47: netdata INFO  : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2019-12-12 13:04:47: netdata INFO  : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000880.ndf".
2019-12-12 13:04:47: netdata INFO  : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000880.njf".
2019-12-12 13:05:37: netdata FATAL : WEB_SERVER[static4] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static4]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : EXIT: cleaning up the database...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'katie'...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:05:37: netdata INFO  : WEB_SERVER[static4] : Cleaning up database of host 'im'...
2019-12-12 13:06:37: netdata FATAL : WEB_SERVER[static3] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static3]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : EXIT: cleaning up the database...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'katie'...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:06:37: netdata INFO  : WEB_SERVER[static3] : Cleaning up database of host 'im'...
2019-12-12 13:07:36: netdata FATAL : WEB_SERVER[static1] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:07:36: netdata INFO  : WEB_SERVER[static1] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static1]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : EXIT: cleaning up the database...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'katie'...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:07:37: netdata INFO  : WEB_SERVER[static1] : Cleaning up database of host 'im'...
2019-12-12 13:08:30: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
2019-12-12 13:08:30: alarm-notify.sh: INFO: sent email notification for: im system.cpu.10min_cpu_iowait is CLEAR to 'root'
2019-12-12 13:08:36: netdata FATAL : WEB_SERVER[static2] :Invalid page cache descriptor locking state:0X5647EE8E4A40 # : Resource temporarily unavailable

2019-12-12 13:08:36: netdata INFO  : WEB_SERVER[static2] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'FATAL' 'netdata:WEB_SERVER[static2]' '0080@database/e:rrdeng_page_des/11'
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : EXIT: netdata prepares to exit with code 1...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'ERROR' '-'
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : EXIT: cleaning up the database...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database [6 hosts(s)]...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'katie'...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'wlanac.weeksfamily.ca'...
2019-12-12 13:08:37: netdata INFO  : WEB_SERVER[static2] : Cleaning up database of host 'im'...
2019-12-12 13:19:30: alarm-notify.sh: WARNING: Cannot find file '/etc/netdata/health_alarm_notify.conf'.
2019-12-12 13:19:30: alarm-notify.sh: INFO: sent email notification for: im system.swapio.30min_ram_swapped_out is CLEAR to 'root'

The last time the server responded to prometheus was at roughly 13:08. I hope this helps a bit.

ilyam8 reopened this Dec 12, 2019
ilyam8 closed this as completed Dec 12, 2019
thiagoftsm reopened this Dec 12, 2019
thiagoftsm (Contributor) commented Dec 12, 2019

Hi @cpw,

It was reopened; I will close the other one as you requested.

cpw (Author) commented Dec 12, 2019

Thank you. I think the new error might be enlightening for the dbengine. I have it in a broken state at present, so if someone can suggest diagnostic data that might help, I'll happily gather it.

cpw (Author) commented Dec 13, 2019

I've generated exports for a second crash. You can see that the web server threads stop working, though the rest of the engine seems to carry on.

(Screenshot attached: Netdata dashboard export, 2019-12-13.)

mfundul (Contributor) commented Dec 14, 2019

@cpw can you send me your logs again to markos@netdata.cloud ? Let's hope I'll never have to ask again!
