Netdata does not work with servers with many processors. #1323

Closed
candiao opened this issue Dec 6, 2016 · 13 comments

@candiao candiao commented Dec 6, 2016

Hi everyone,
Could you help me with this problem?

I acquired two Lenovo/IBM x3950 X6 servers with 144 cores and 288 threads (8 sockets / 16 cores), 2 TB of memory, and Red Hat 7.3.

But netdata doesn't work on them.

Thanks in advance.

error.log:

2016-12-06 19:24:50: netdata: INFO: NetData started on pid 114581
2016-12-06 19:24:50: netdata: ERROR: Health: cannot open health file: /var/lib/netdata/health/health-log.db.old (errno 2, No such file or directory)
2016-12-06 19:24:50: netdata: INFO: TC thread created with task id 114582
2016-12-06 19:24:50: netdata: INFO: 2016-12-06 19:24:50: netdata: INFO: CPU Idle Jitter thread created with task id 114583
PROC Plugin thread created with task id 114584
2016-12-06 19:24:50: netdata: INFO: 2016-12-06 19:24:50: netdata: INFO: CGROUP Plugin thread created with task id 114585
HEALTH thread created with task id 114586
2016-12-06 19:24:50: netdata: INFO: PLUGINS.D thread created with task id 114588
2016-12-06 19:24:50: netdata: INFO: Multi-threaded WEB SERVER thread created with task id 114589
2016-12-06 19:24:50: netdata: INFO: Listening on '[0.0.0.0]:19999'
2016-12-06 19:24:50: netdata: INFO: Listening on '[::]:19999'
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/charts.d.plugin' running on pid 114591
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/fping.plugin' running on pid 114593
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/node.d.plugin' running on pid 114595
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/python.d.plugin' running on pid 114597
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/apps.plugin' running on pid 114599
2016-12-06 19:24:50: apps.plugin: INFO: started on pid 114599
2016-12-06 19:24:50: fping.plugin: FATAL: no hosts configued in '/etc/netdata/fping.conf' - nothing to do.
2016-12-06 19:24:50: netdata: ERROR: PLUGINSD: '/usr/libexec/netdata/plugins.d/fping.plugin' called DISABLE. Disabling it.
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/fping.plugin' on pid 114593 stopped after 0 successful data collections (ENDs).
2016-12-06 19:24:50: netdata: ERROR: child pid 114593 exited with code 1. (errno 9, Bad file descriptor)
2016-12-06 19:24:50: netdata: ERROR: PLUGINSD: '/usr/libexec/netdata/plugins.d/fping.plugin' exited with error code 1. Disabling it.
2016-12-06 19:24:50: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/fping.plugin' thread exiting
2016-12-06 19:24:50: charts.d: INFO: main: started from '/usr/libexec/netdata/plugins.d/charts.d.plugin' with options: 1
2016-12-06 19:24:50: INFO: Using python v2
2016-12-06 19:24:50: charts.d: INFO: apache: is disabled. Add a line with apache=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: cpu_apps: is disabled. Add a line with cpu_apps=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: cpufreq: is disabled. Add a line with cpufreq=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: example: is disabled. Add a line with example=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: exim: is disabled. Add a line with exim=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: hddtemp: is disabled. Add a line with hddtemp=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: load_average: is disabled. Add a line with load_average=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: mem_apps: is disabled. Add a line with mem_apps=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
2016-12-06 19:24:50: charts.d: INFO: mysql: is disabled. Add a line with mysql=force in /etc/netdata/charts.d.conf to enable it (or remove the line that disables it).
@
"error.log" 399L, 39713C
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/122.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/123.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/124.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/125.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/126.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/127.db.
2016-12-06 19:24:56: netdata: INFO: Initializing file /var/cache/netdata/cpu.cpu210_interrupts/128.db.
2016-12-06 19:24:56: netdata: ERROR: Cannot allocate PRIVATE ANONYMOUS memory for KSM for file '/var/cache/netdata/cpu.cpu210_interrupts/129.db'. (errno 12, Cannot allocate memory)
2016-12-06 19:24:56: netdata: ERROR: Cannot allocate PRIVATE ANONYMOUS memory for KSM for file '/var/cache/netdata/cpu.cpu210_interrupts/130.db'. (errno 12, Cannot allocate memory)
2016-12-06 19:24:56: netdata: ERROR: Cannot allocate PRIVATE ANONYMOUS memory for KSM for file '/var/cache/netdata/cpu.cpu210_interrupts/131.db'. (errno 12, Cannot allocate memory)
2016-12-06 19:24:56: netdata: ERROR: Cannot allocate PRIVATE ANONYMOUS memory for KSM for file '/var/cache/netdata/cpu.cpu210_interrupts/132.db'. (errno 12, Cannot allocate memory)
2016-12-06 19:24:56: netdata: ERROR: Cannot allocate PRIVATE ANONYMOUS memory for KSM for file '/var/cache/netdata/cpu.cpu210_interrupts/133.db'. (errno 12, Cannot allocate memory)
2016-12-06 19:24:56: netdata: FATAL: Cannot allocate 56 bytes of memory. # : Cannot allocate memory

2016-12-06 19:24:56: netdata: INFO: Saving database...
2016-12-06 19:24:56: netdata: INFO: CGROUP thread exiting
2016-12-06 19:24:57: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/python.d.plugin' on pid 114597 stopped after 36 successful data collections (ENDs).
2016-12-06 19:24:57: netdata: ERROR: child pid 114587 killed by signal 13. (errno 9, Bad file descriptor)
2016-12-06 19:24:57: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/apps.plugin' on pid 114599 stopped after 312 successful data collections (ENDs).
2016-12-06 19:24:58: python.d ERROR: sensors Something wrong: [Errno 32] Broken pipe
2016-12-06 19:24:58: netdata: ERROR: child pid 114599 killed by signal 13. (errno 9, Bad file descriptor)
2016-12-06 19:24:58: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/apps.plugin' thread exiting
2016-12-06 19:25:00: python.d ERROR: postfix_local Something wrong: [Errno 32] Broken pipe
2016-12-06 19:25:00: netdata: INFO: HEALTH thread exiting
2016-12-06 19:25:00: python.d FATAL: no more jobs
Traceback (most recent call last):
  File "/usr/libexec/netdata/plugins.d/python.d.plugin", line 533, in <module>
    run()
  File "/usr/libexec/netdata/plugins.d/python.d.plugin", line 528, in run
    charts.update()
  File "/usr/libexec/netdata/plugins.d/python.d.plugin", line 409, in update
    msg.fatal("no more jobs")
  File "/usr/libexec/netdata/python.d/python_modules/msg.py", line 72, in fatal
    print('DISABLE')
IOError: [Errno 32] Broken pipe
2016-12-06 19:25:00: netdata: ERROR: child pid 114597 exited with code 1. (errno 9, Bad file descriptor)
2016-12-06 19:25:00: netdata: INFO: PLUGINSD: '/usr/libexec/netdata/plugins.d/python.d.plugin' thread exiting
2016-12-06 19:25:03: netdata: INFO: NetData exiting. Bye bye...

@ktsaou ktsaou commented Dec 6, 2016

This seems to be a limit on the number of memory-mapped files a process can have.
Could you please switch the memory mode to ram to verify that this is the case?

Edit /etc/netdata/netdata.conf and set memory mode = ram in the [global] section.

@ktsaou ktsaou commented Dec 6, 2016

Check this: http://ask.systutorials.com/1969/maximum-number-of-mmap-ed-ranges-and-how-to-set-it-on-linux

Could you please run sysctl vm.max_map_count to see the limit on your kernel?
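
For readers who prefer to check the same limit programmatically, here is a minimal Python sketch. It assumes a Linux system that exposes /proc/sys/vm/max_map_count and /proc/<pid>/maps. The function names are illustrative and are not part of netdata.

```python
from pathlib import Path


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the kernel's per-process limit on memory-mapped regions."""
    return int(Path(path).read_text().strip())


def count_mappings(maps_text):
    """Count mapped regions in the text of a /proc/<pid>/maps file.

    Each non-empty line of that file describes one mapped region.
    """
    return sum(1 for line in maps_text.splitlines() if line.strip())


if __name__ == "__main__":
    limit = read_max_map_count()
    # /proc/self/maps describes the current process; substitute the
    # pid of the netdata daemon to inspect it instead.
    used = count_mappings(Path("/proc/self/maps").read_text())
    print(f"vm.max_map_count = {limit}, mappings in use = {used}")
```

If the number of mappings in use by a process is close to the limit, further mmap calls can fail with errno 12 (Cannot allocate memory), as in the log above.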

@ktsaou ktsaou commented Dec 6, 2016

btw, nice machines...

@candiao candiao commented Dec 7, 2016

Hi ktsaou,
Thank you for your attention.
After editing /etc/netdata/netdata.conf and setting memory mode = ram in the global section, everything works fine.

# sysctl vm.max_map_count
vm.max_map_count = 65530

@ktsaou ktsaou commented Dec 7, 2016

OK, with memory mode = ram, data are never saved back to disk. So, if you restart netdata, it will be empty again.

I guess that if you increase vm.max_map_count you should be able to go back to memory mode = save. Since you have plenty of RAM, increasing this number to 1 million or so will do no harm. This number is the maximum number of mmapped regions your kernel allows a process to have. Netdata needs one for each metric collected and one for each chart. I guess you have 300 charts and a few thousand dimensions (actually, how many are there on such a machine?).

So, I suggest increasing it, a lot.

@candiao candiao commented Dec 7, 2016

Hi,
I followed your guidance.

Changes:
mmap limit: vm.max_map_count = 65530 changed to vm.max_map_count = 655300
memory mode = ram changed to memory mode = save

Everything works fine. ;-)

collects every second 95,764 metrics, presented as 1,225 charts and monitored by 73 alarms, using 1,759 MB of memory for 1 hour of real-time history.
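
As a rough sanity check of the explanation given earlier in this thread (one mmapped region per metric and one per chart), the dashboard figures can be compared with the old and new limits. This is only a sketch: it ignores the mappings every process needs for its binary, libraries, heap and stacks, and the helper name is hypothetical.

```python
def mappings_needed(metrics, charts):
    """Estimate mmapped regions: one per metric plus one per chart."""
    return metrics + charts


OLD_LIMIT = 65_530    # original vm.max_map_count
NEW_LIMIT = 655_300   # value after the change

needed = mappings_needed(metrics=95_764, charts=1_225)
print(f"estimated mappings: {needed}")
print(f"fits under old limit: {needed <= OLD_LIMIT}")
print(f"fits under new limit: {needed <= NEW_LIMIT}")
```

The estimate comes to roughly 97,000 regions, which is above the old limit of 65,530 and well below the new one.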

@ktsaou ktsaou commented Dec 7, 2016

collects every second 95,764 metrics, presented as 1,225 charts and monitored by 73 alarms, using 1,759 MB of memory for 1 hour of real-time history.

Wow! This is a first for me. 100,000 metrics? I can't believe it.

Can you give me the CPU utilisation of the main netdata process? It is this chart on one of your servers (note that I have selected only the netdata dimension):

[image]

@candiao candiao commented Dec 7, 2016

Cool! I'm happy about that!
Please, if you need more information, feel free to ask me.
I am very grateful for the netdata project!

--> These servers will go into production in 15 days as database servers.

[screenshot: snap 2016-12-07 at 14 43 30]

@ktsaou ktsaou commented Dec 7, 2016

Thanks!

OK, netdata is fast... 100,000 metrics per second with 12% CPU of a single core!

@rlefevre have a look at this. This machine, with 144 cores and 288 threads, is probably the best test of whether your work on interrupts and softirqs in PR #1304 is as fast as it should be.

@ktsaou ktsaou added the question label Dec 8, 2016

@ktsaou ktsaou commented Dec 8, 2016

I am closing this now.

As a side note, I have noticed that with 100,000 metrics netdata needs 1.7 GB of RAM for just one hour of data when the metrics are collected every second.

This is a problem (not in this case, since this server is equipped with 2 TB of RAM) that should be addressed in future versions.
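
To put the 1.7 GB figure in perspective, a back-of-the-envelope calculation gives the memory cost per stored sample. The sketch below uses the figures reported earlier in this thread (95,764 metrics, 1,759 MB, one hour at one sample per second) and assumes that MB means 2^20 bytes, which is an assumption. The result includes all per-metric and per-chart overhead, not only the stored values, and the function name is hypothetical.

```python
def bytes_per_sample(total_mb, metrics, seconds, bytes_per_mb=1024 * 1024):
    """Average memory used per stored sample, overhead included."""
    samples = metrics * seconds
    return total_mb * bytes_per_mb / samples


avg = bytes_per_sample(total_mb=1_759, metrics=95_764, seconds=3_600)
print(f"about {avg:.2f} bytes per sample")
```

This works out to a little over 5 bytes per sample, so the total grows linearly with both the number of metrics and the length of history kept.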

@candiao thank you for sharing this with us. We don't see this kind of machine very often...

@ktsaou ktsaou closed this Dec 8, 2016

@ktsaou ktsaou commented Dec 11, 2016

@candiao we have merged an update to the interrupts and softirqs processing in netdata.
Could you please update netdata on one of these machines and post its CPU usage again? We would like to verify that, on your system, we did not introduce any overhead (or even that we are now faster).
Thanks!

@candiao candiao commented Dec 12, 2016

Hi

[screenshot: snap 2016-12-12 at 11 09 43]
[screenshot: snap 2016-12-12 at 11 11 49]
[screenshot: snap 2016-12-12 at 11 15 33]
[screenshot: snap 2016-12-12 at 11 22 49]

@ktsaou ktsaou commented Dec 12, 2016

Wow! @rlefevre you made it faster! It is now 9% for 100,000 metrics!
