pmwebd impossibly slow when using grafana with 300 archives #117
300 servers are stretching the practical limits of pmwebd's current approach to searching archives, especially if the archives are large enough not to fit into RAM. In fact, if the active set of archives (those that pmwebd needs to read, plus those that something else (pmmgr/pmlogger?) is writing) is too large to fit into RAM, then I/O will start dominating everything, as you are noticing. Can you offer some stats about your archives?

- How far back do they go?
- How large are the currently-written-to ones?
- How much RAM do you have?
- How many separate archive files exist?
- Are any of them compressed (via the pmlogger service's pmlogger_daily, as in *YYYYMMDD.0.xz)?

See also:
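If it helps, here is a rough sketch of how to gather most of those numbers from the shell, assuming the default archive location /var/log/pcp/pmlogger and one directory per host (HOSTNAME is a placeholder; adjust paths to your layout):

```sh
# total on-disk footprint of all archives
du -sh /var/log/pcp/pmlogger

# how many separate archives exist, and how many are xz-compressed
find /var/log/pcp/pmlogger -name '*.meta*' | wc -l
find /var/log/pcp/pmlogger -name '*.xz' | wc -l

# how far back they go (oldest files first)
find /var/log/pcp/pmlogger -name '*.meta*' -printf '%T+ %p\n' | sort | head

# size of the currently-written-to archives for one host
ls -lh /var/log/pcp/pmlogger/HOSTNAME/

# available RAM
free -h
```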
mkevac commented on Oct 4, 2016:
These are archives for one of the servers. One archive per day.
The server where the collecting happens and where pmwebd runs has 64 GiB of RAM and 24 CPUs. Is there anything we could do to make pmwebd usable for this number of servers/days? Reduce the size of a single archive, maybe? Rotate it not once a day, but once an hour?
So about 2 GB of data per server per day, times seven days, times 300 servers: roughly 4200 GB of data on disk. Wow. Even the current day's data won't fit into your machine's RAM, so any scanning would have to rely on libpcp optimally using the archives' .index files to seek to just the parts being requested by the client (pmwebd/grafana). I don't know if the pcp developers have much experience with such RAM-starved configurations. This is not to say it's hopeless. I'd start with a highly constrained grafana query (substituting PMWEBD and HOSTNAME). It represents roughly the best case: one archive file, a small time slice from the end. If that works, try additional &target= clauses, or gradually relax the host wildcard (so as to select more hosts).
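To make the placeholders concrete, a constrained request of that shape might look roughly like the following, assuming pmwebd's default port 44323 and its graphite-compatible render endpoint; the target (host component plus metric name) shown here is illustrative, not necessarily what your dashboard emits:

```
http://PMWEBD:44323/graphite/render?target=HOSTNAME.kernel.all.load&from=-5min&format=json
```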
mkevac commented on Oct 4, 2016:
Why does the query for the host list have to read each and every archive in full? It seems very strange. Shouldn't the host name be somewhere near the beginning of the file? After the initial "get me the host list" query, once I choose one host, pmwebd would only have to read the 7 archives for that host (14 GB). And if only one day or one hour is selected in Grafana, then only one file (2 GB). But right now the problem is at the first stage: getting the host list. And IMHO (without knowing about pcp internals), this should not require reading 4 TB.
You're roughly right. The hostselect.js dashboard's query is:
... which asks pmwebd to iterate through all archives (300*7) to pull out one metric value recorded in the last minute. Its goal is to enumerate those archive files that are currently being written to, so it can reverse-engineer host names etc. from them. Those archives whose end-of-records timestamp doesn't include this moment will be rejected pretty quickly. Those archives whose time intervals do include the last minute are probably the 300 that are currently being written to by a running pmlogger.
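(As an aside on the "shouldn't the host be near the beginning of the file" point: the host name is indeed stored in each archive's label record and can be read without scanning the data volumes. A quick check, with a hypothetical archive path:)

```sh
# print the archive label: format version, start/end times, and the host
# the metrics were collected from
pmdumplog -l /var/log/pcp/pmlogger/HOSTNAME/20161004
```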
Sorry about that. That is pretty abysmal. As a hack (and I bet you'll figure out why it works if it works), try changing the hostselect.js file thusly, and clear those browser caches:
mkevac commented on Oct 4, 2016:
It should work because proc.nprocs is recorded constantly, not only once at the beginning. And it does work: if I wait only about 30s, I get the host list :-)
Yeah, 30s should be improved if possible. Once you get past that, presumably to a per-host view, how is performance for you?
mkevac commented on Oct 7, 2016:
We've changed the pmlogger arguments to create a new file every 50 megabytes:
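(The exact command line wasn't quoted above; this is presumably pmlogger's volume-switch option, sketched here with illustrative host and paths:)

```sh
# -v closes the current data volume and starts a new one once it reaches
# the given size, so no single on-disk file grows unbounded
pmlogger -h HOSTNAME -v 50mb -c /etc/pcp/pmlogger/config.HOSTNAME \
    /var/log/pcp/pmlogger/HOSTNAME/archive
```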
The host list dashboard loads in about 1s if the data is in memory, or 30s if it is not.
mkevac commented on Sep 29, 2016:
Hello.
We are collecting metrics from ~300 servers with 1s granularity using pmlogger. It works fine. Data is going into the appropriate PCP archives on disk.
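(For context, a minimal sketch of the kind of per-host pmlogger configuration this implies; the metric list below is illustrative, not our actual config:)

```
# pmlogger(1) configuration: sample a few metric subtrees once per second
log mandatory on every 1 second {
    kernel.all.cpu
    kernel.all.load
    mem.util
    network.interface
    proc.nprocs
}
```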
But visually browsing these archives is impossible using pmwebd and grafana.
After clicking on the host list in grafana, nothing happens.
I can see in the pmwebd log that it got a request for /graphite/render:
The pmwebd process is stuck in the D (uninterruptible sleep) state reading files. I've waited for 10 minutes, but nothing changed.
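(For the record, a simple way to see this, assuming the process name is pmwebd:)

```sh
# STAT shows 'D' for uninterruptible (disk) sleep; WCHAN hints at where it is blocked
ps -o pid,stat,wchan:32,cmd -C pmwebd
```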
strace shows something like this:
I will try to provide some more info, but maybe you, the pcp developers, already know what is going on...
This is the I/O wait time for the server after requesting /graphite/render:
The server is barely working :-)