pmwebd impossibly slow when using grafana with 300 archives #117

Open
mkevac opened this Issue Sep 29, 2016 · 8 comments

mkevac commented Sep 29, 2016

Hello.

We are collecting metrics from ~300 servers with 1s granularity using pmlogger. It works fine: the data goes into the appropriate PCP archives on disk.

But visually browsing these archives is impossible using pmwebd and grafana.

After clicking on the host list in grafana:

[screenshot: Imgur]

Nothing happens.

I can see in the pmwebd log that it got the request for /graphite/render:

[Thu Sep 29 14:06:55] pmwebd(22926): [192.168.3.129:33220] HTTP/1.1 GET /grafana/config.js
[Thu Sep 29 14:06:55] pmwebd(22926): [192.168.3.129:33220] HTTP/1.1 GET /grafana/app/dashboards/hostselect.js
[Thu Sep 29 14:06:55] pmwebd(22926): [192.168.3.129:33220] HTTP/1.1 GET /graphite/render

The pmwebd process is stuck in D state reading files. I've waited for 10 minutes, but nothing changed.
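
In case it helps to reproduce: the process state and the file behind fd 60 from the strace below can be checked via /proc, e.g. (PID 22926 and fd 60 are from this particular run, adjust for yours):

grep State /proc/22926/status     # D = uninterruptible sleep (waiting on disk)
ls -l /proc/22926/fd/60           # which file the descriptor seen in the strace points at
cat /proc/22926/io                # read_bytes shows how much it has read so far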

strace shows something like this:

[pid 22926] lseek(60, 666161152, SEEK_SET) = 666161152
[pid 22926] read(60, "\3\0\0\f\0\0\0\0\310\326\375\262\3\0\0\f\0\0\0\0\316\7\242Z\3\0\0\f\0\0\0\0"..., 4096) = 4096
[pid 22926] lseek(60, 666165248, SEEK_SET) = 666165248
[pid 22926] lseek(60, 666165248, SEEK_SET) = 666165248
[pid 22926] lseek(60, 666136576, SEEK_SET) = 666136576
[pid 22926] read(60, "\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0"..., 4096) = 4096
[pid 22926] read(60, "\0\0\0\20\0\0\v\265\0\0\0\21\0\0\v\270\0\0\0\22\0\0\v\273\0\0\0\23\0\0\v\276"..., 20480) = 20480
[pid 22926] read(60, "\3\0\0\f\0\0\0\0\310\326\375\262\3\0\0\f\0\0\0\0\316\7\242Z\3\0\0\f\0\0\0\0"..., 4096) = 4096
[pid 22926] lseek(60, 666136576, SEEK_SET) = 666136576
[pid 22926] read(60, "\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0"..., 4096) = 4096
[pid 22926] lseek(60, 666140672, SEEK_SET) = 666140672
[pid 22926] lseek(60, 666140672, SEEK_SET) = 666140672
[pid 22926] lseek(60, 666116096, SEEK_SET) = 666116096
[pid 22926] read(60, "\0\0\0\0\22\240\357\267\3\0\0\f\0\0\0\0\0\23\222G\3\0\0\f\0\0\0\33\204\245\3714"..., 4096) = 4096
[pid 22926] read(60, "\0\0\0\1\0\0\0\1\377\377\377\377\0\0\r\324\17\0pD\0\0\0\1\0\0\0\1\377\377\377\377"..., 16384) = 16384
[pid 22926] read(60, "\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0"..., 4096) = 4096
[pid 22926] lseek(60, 666116096, SEEK_SET) = 666116096
[pid 22926] read(60, "\0\0\0\0\22\240\357\267\3\0\0\f\0\0\0\0\0\23\222G\3\0\0\f\0\0\0\33\204\245\3714"..., 4096) = 4096
[pid 22926] lseek(60, 666120192, SEEK_SET) = 666120192
[pid 22926] lseek(60, 666120192, SEEK_SET) = 666120192
[pid 22926] lseek(60, 666091520, SEEK_SET) = 666091520
[pid 22926] read(60, "\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 22926] read(60, "\0\0\v[\0\0\0\23\0\0\v^\0\0\0\24\0\0\va\0\0\0\25\0\0\vd\0\0\0\26"..., 20480) = 20480
[pid 22926] read(60, "\0\0\0\0\22\240\357\267\3\0\0\f\0\0\0\0\0\23\222G\3\0\0\f\0\0\0\33\204\245\3714"..., 4096) = 4096
[pid 22926] lseek(60, 666091520, SEEK_SET) = 666091520
[pid 22926] read(60, "\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0\3\0\0\f\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 22926] lseek(60, 666095616, SEEK_SET) = 666095616
[pid 22926] lseek(60, 666095616, SEEK_SET) = 666095616
[pid 22926] lseek(60, 666071040, SEEK_SET) = 666071040
[pid 22926] read(60, "\0\21\302T\3\0\0\f\0\0\0\0\0\7\233D\3\0\0\f\0\0\0\0\0\31O\210\3\0\0\f"..., 4096) = 4096

I will try to provide some more info, but maybe you, the PCP developers, already know what is going on...

This is the I/O wait time for the server after requesting /graphite/render:

[screenshot: Imgur]

The server is barely working :-)

fche (Contributor) commented Oct 4, 2016

300 servers are stretching the practical limits of pmwebd's current approach to searching archives, especially if the archives are large enough not to fit into RAM. In fact, if the active set of archives (those that pmwebd needs to read, plus those that something else (pmmgr/pmlogger?) is writing) is too large to fit into RAM, then I/O will start dominating everything, as you are noticing.

Can you offer some stats about your archives? How far back do they go? How large are the currently-written-to ones? How much RAM do you have? How many separate archive files exist? Are any of them compressed (via service-pmlogger's pmlogger_daily, as in *YYYYMMDD.0.xz)?
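
A rough way to collect most of those numbers, assuming your pmloggers all write under one directory (substitute the real path):

du -sh /path/to/archives/*                        # on-disk size per host
find /path/to/archives -name '*.meta' | wc -l     # number of separate archives (one .meta per archive)
find /path/to/archives -name '*.xz' | wc -l       # how many have been compressed
free -g                                           # RAM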

see also
https://bugzilla.redhat.com/show_bug.cgi?id=1376861
https://bugzilla.redhat.com/show_bug.cgi?id=1293471

mkevac commented Oct 4, 2016 edited

These are the archives for one of the servers, one archive per day.

marko@monitoring11.mlan:/local/pcp/cppbig81.mlan> ll -h
total 11G
-rw-r--r-- 1 pcp pcp 807M Sep 29 01:12 20160928.0
-rw-r--r-- 1 pcp pcp 153K Sep 29 01:12 20160928.index
-rw-r--r-- 1 pcp pcp  24K Sep 28 14:22 20160928.meta
-rw-r--r-- 1 pcp pcp 1.8G Sep 30 01:09 20160929.0
-rw-r--r-- 1 pcp pcp 335K Sep 30 01:09 20160929.index
-rw-r--r-- 1 pcp pcp  24K Sep 29 01:12 20160929.meta
-rw-r--r-- 1 pcp pcp 1.8G Oct  1 01:09 20160930.0
-rw-r--r-- 1 pcp pcp 338K Oct  1 01:09 20160930.index
-rw-r--r-- 1 pcp pcp  24K Sep 30 01:09 20160930.meta
-rw-r--r-- 1 pcp pcp 1.8G Oct  2 00:58 20161001.0
-rw-r--r-- 1 pcp pcp 335K Oct  2 00:58 20161001.index
-rw-r--r-- 1 pcp pcp  24K Oct  1 01:09 20161001.meta
-rw-r--r-- 1 pcp pcp 1.8G Oct  3 00:48 20161002.0
-rw-r--r-- 1 pcp pcp 336K Oct  3 00:48 20161002.index
-rw-r--r-- 1 pcp pcp  24K Oct  2 00:58 20161002.meta
-rw-r--r-- 1 pcp pcp 1.4G Oct  3 18:52 20161003.00.10.0
-rw-r--r-- 1 pcp pcp 254K Oct  3 18:52 20161003.00.10.index
-rw-r--r-- 1 pcp pcp  24K Oct  3 00:48 20161003.00.10.meta
-rw-r--r-- 1 pcp pcp 992M Oct  4 08:16 20161003.18.57.0
-rw-r--r-- 1 pcp pcp 188K Oct  4 08:16 20161003.18.57.index
-rw-r--r-- 1 pcp pcp  24K Oct  3 18:57 20161003.18.57.meta
-rw-r--r-- 1 pcp pcp 291M Oct  4 12:22 20161004.08.28.0
-rw-r--r-- 1 pcp pcp  56K Oct  4 12:22 20161004.08.28.index
-rw-r--r-- 1 pcp pcp  24K Oct  4 08:28 20161004.08.28.meta
-rw-r--r-- 1 pcp pcp  222 Oct  4 08:28 Latest
-rw-r--r-- 1 pcp pcp 8.6K Oct  4 08:28 pmlogger.log
-rw-r--r-- 1 pcp pcp 8.9K Oct  4 08:16 pmlogger.log.prior

The server where collection happens and where pmwebd runs has 64 GiB of RAM and 24 CPUs.
We have disk space for approximately 7 days when collecting once a second, and we didn't plan to keep more than one week of data.
The archives are not compressed.

Is there anything we could do to make pmwebd usable with this number of servers/days? Reduce the size of a single archive maybe? Rotate it not once a day, but once an hour?

fche (Contributor) commented Oct 4, 2016

So about 2 GB of data per server per day, times seven days, times 300 servers, so 4200 GB of data on disk. Wow. Even the current day's data won't fit into your machine's RAM, so any scanning would have to rely on libpcp/archive optimally using .index files to seek to just the parts being requested by the client (pmwebd/grafana). I don't know if PCP developers have much experience with such RAM-starved configurations.

This is not to say it's hopeless. I'd start with a highly constrained grafana query (substituting PMWEBD and HOSTNAME). It represents something like the best case: one archive file, a small time slice from the end. If that works, try additional &target= clauses, or gradually relax the host wildcard (so as to select more hosts).

http://PMWEBD:44323/grafana/index.html#/dashboard/script/multichart.js?from=now-5m&to=now&target=*HOSTNAME*.kernel.all.load.1%20minute&title=kernel.all.load.1%20minute&style=json
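
To take the browser and grafana rendering out of the measurement, the same render call can also be timed directly with curl, e.g. (the target string is copied from the dashboard URL above, with the same PMWEBD/HOSTNAME substitutions):

time curl -s 'http://PMWEBD:44323/graphite/render?format=json&target=*HOSTNAME*.kernel.all.load.1%20minute&from=-1m&until=now' > /dev/null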

mkevac commented Oct 4, 2016

Why does the host list query have to read each and every archive in full? It's very strange. Shouldn't the host name be somewhere at the beginning of the file?

After the initial "get me the host list" query, when I choose one host, pmwebd would have to read only 7 archives for that host (14 GB).

And if only one day or one hour is selected in Grafana, then only one file (2 GB).

But right now the problem is at the first stage: getting the host list. And IMHO (without knowing the PCP internals), this should not require reading 4 TB.

fche (Contributor) commented Oct 4, 2016 edited

You're roughly right. The hostselect.js dashboard's query is:

/graphite/render?format=json&target=*.pmcd.pmlogger.port.*&from=-1m&until=now

... which asks pmwebd to iterate through all archives (300*7), to pull out one metric value recorded in the last minute. Its goal is to enumerate those archive files that are currently being written to, so it can reverse-engineer host names etc. from them.

Those archives whose end-of-records timestamp doesn't include this moment will be rejected pretty quickly (after one pmGetArchiveEnd call). This involves reading little bits of the beginning and the end of the .0 / .index / .meta files. (Compare strace pmloglabel -l $ARCHIVEFILE.) ... though there are some thousands of them, and this is not cached within pmwebd (see that rhbz link above). There is some wasted time but not that much I/O.

Those archives whose time interval does include the last minute are probably the 300 that are currently being written to by a running pmlogger. In this case, pmwebd seeks to near the end of the archive with pmSetMode and tries to fetch that metric value. Unfortunately, things go badly here, because the metric in question turns out to be stored only once, at the beginning of the archive file/timeline, and libpcp decides to go searching for it. During this search, libpcp reads, backwards, essentially the whole archive data file. For the sad tale, see your own strace, or the analogous:

strace pmval -t 60 -S "@`date -d '-1min' +'%y-%m-%d %H:%M'`" -T "@`date +'%y-%m-%d %H:%M'`"  -a $ANY_ACTIVE_ARCHIVE_FILE  pmcd.pmlogger.port
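
To see how sparsely that metric is actually recorded in an archive, pmdumplog restricted to the one metric should do it (if I have the invocation right):

pmdumplog $ANY_ACTIVE_ARCHIVE_FILE pmcd.pmlogger.port | head -50

You should see it logged only once, near the start of the archive, which is exactly why the backwards search ends up walking the whole data volume.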

Sorry about that. That is pretty abysmal.

As a hack (and I bet you'll figure out why it works if it works), try changing the hostselect.js file thusly, and clear those browser caches:

-       url: pmwebd + "/graphite/render?format=json&target=*.pmcd.pmlogger.port.*&from=-1m&until=now"
+       url: pmwebd + "/graphite/render?format=json&target=*.proc.nprocs&from=-1m&until=now"

mkevac commented Oct 4, 2016

It should work because proc.nprocs is recorded constantly, not only once at the beginning. And it does work: if I wait only about 30s, I get the host list :-)
30s is still bad though.

Contributor

fche commented Oct 4, 2016

Yeah, 30s should be improved if possible. Once you get past that, presumably going to a per-host view, how is performance for you?

mkevac commented Oct 7, 2016

We've changed the pmlogger arguments to create a new file every 50 megabytes:

pmlogger -r -v 50m -m pmlogger_daily -l pmlogger.log -h cppbig75.ulan -c /etc/pcp/pmlogger/config.autogen 20161007.02.00

The host list dashboard loads in about 1s if the data is in memory, or about 30s if it is not.
The standard per-host dashboard takes about 30s to load fully.
