
pmdaprometheus performance improvement ideas #434

Open
fche opened this issue Jan 31, 2018 · 7 comments

@fche
Contributor

fche commented Jan 31, 2018

Some quick performance tests today indicate that we hit limits well short of the PMID-limited 1000 prometheus sources. Some ideas:

  • confirm whether the newly added fstat-over-entire-config.d is indeed harmless, even when metrics are being sampled at busy rates. For pmwebd we found the opposite type of change (pmwebd: improve graphite archive-cache performance w.r.t. syscalls #429) made a big difference: thousands of fstats at every request were noticeable (seconds of elapsed time).
  • make it possible to suppress the debugging output
  • better pmcd<->pmda timeout processing; we can easily get pmcd giving up on pmdaprometheus, so:
    • the 1s timeout on http requests is not effective, as the total processing time can exceed the remaining 4s of the pmcd<->pmda timeout
    • multithreaded parsing of the http documents may be counterproductive (due to the global interpreter lock). Should investigate concurrency in fetching only (and depositing the document in the Source object), then parse_lines them all in the main thread (see the sketch after this list)
    • the PM_ERR_PMDANOTREADY facility should be used if necessary
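
For the record, a minimal sketch of that fetch-in-parallel, parse-in-the-main-thread split, assuming each Source keeps its requests.Session in source.requests as the pmda does today; fetch_document() and refresh_all() are illustrative names, not the actual pmdaprometheus code:

from concurrent.futures import ThreadPoolExecutor

def fetch_document(source):
    # network I/O releases the GIL, so a thread pool genuinely overlaps fetches
    r = source.requests.get(source.url, headers=source.headers, timeout=1)
    r.raise_for_status()
    return source, r.text

def refresh_all(sources):
    if not sources:
        return
    # cap the pool so O(1000) sources don't mean O(1000) threads and sockets
    with ThreadPoolExecutor(max_workers=min(100, len(sources))) as pool:
        for source, document in pool.map(fetch_document, sources):
            source.parse_lines(document)  # CPU-bound: stays in the main thread
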
@goodwinos
Contributor

  • fstat-over-entire-config.d won't be harmless. Just checking ctime on the config dir would be, though. How about a cmdline option to only stat the directory, with documented instructions to touch the config.d directory to force a rescan if any existing URL files or scripts have been edited? (see the sketch after this list)
  • debugging output needs to be optional - will add another option for that too
  • and yes, we need PM_ERR_PMDANOTREADY when more than a few new Sources have been discovered, to avoid the initial timeout when the PMDA is still busy creating all those class instances.
  • there might be something that isn't scaling in the Source class init or somewhere - it shouldn't take 5 seconds to set up even a few hundred sources. Lukas said he hit it with only 5 end-points (with 100 metrics per end-point)
  • saving fetched documents and then serially processing parse_lines in the main thread is probably a good plan. The main latency is in the http get (which should be parallel), but the parsing is all grunt work and probably not very friendly to the scheduler (and the global interpreter lock)
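
A sketch of the directory-only stat idea; ConfigWatcher is an illustrative name, not existing pmdaprometheus code. One stat() per fetch request replaces one per config file:

import os

class ConfigWatcher(object):
    """Rescan config.d only when the directory's own timestamp moves.
    Caveat: editing an existing file in place does not change the
    directory, hence the documented 'touch the config.d dir' step."""
    def __init__(self, path):
        self.path = path
        self.last_stamp = None

    def changed(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime, st.st_ctime)
        if stamp != self.last_stamp:
            self.last_stamp = stamp
            return True
        return False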

@fche
Contributor Author

fche commented Mar 20, 2018

> and yes, we need PM_ERR_PMDANOTREADY when more than a few new Sources have been discovered,

It turns out we need this badly. When targeting a prometheus server, we exceed the 5s timeout at somewhere between 40 and 60 URLs during the initial startup. That triggers a pmda restart and another initial startup, so no progress is ever made.

@goodwinos goodwinos self-assigned this Mar 20, 2018
@goodwinos
Contributor

OK, will add PM_ERR_PMDANOTREADY / PM_ERR_PMDAREADY notifications whenever there are more than a few new sources to process (not just during startup, though that is obviously when they're most needed).

@goodwinos
Contributor

I expect to have PM_ERR_PMDANOTREADY / PM_ERR_PMDAREADY working in pmdaprometheus later today

goodwinos added a commit to goodwinos/pcp that referenced this issue Mar 22, 2018
GH performancecopilot#434

pmdaprometheus now starts in the "notready" state, meaning it can
take as long as it likes to initialize and start up before PMCD will
send it any requests - i.e. very large initial configs (thousands of
URLs) can be set up without timing out on startup.

Once startup has completed and all the source class instances have
been created and initialized, pmdaprometheus now tells PMCD that it
is ready using the python binding for the new pmdaSendError() function.

Exercised by QA 1287 with 500 URLs and log checks to confirm that the
initialization is now done in the pmda notready state. This feature
requires the changes to the Install script and the previous commits
for pmdaSendError() and the python bindings.

modified:   src/pmdas/prometheus/Install
modified:   src/pmdas/prometheus/pmdaprometheus.1
modified:   src/pmdas/prometheus/pmdaprometheus.python
@goodwinos
Contributor

@fche, I have been experimenting with pmdaprometheus - your patch (as fpasted) is substantially slower than the current master branch, but I think it's heading in the right direction, because we need to limit the number of threads, sockets and file descriptors. With O(1000) URLs we run out of FDs, etc., and 1000 threads is just plain silly, so limiting that to min(100,len(clusters)) is the way to go for sure. I'm still trying to figure out why it's so much slower - it might be the queuing or something - and will come up with a refined patch. Similar queuing is needed to stagger the processing of large config changes too, as already discussed.

Another simple experiment I did was to replace the requests module with pycurl. The patch is almost a drop-in replacement, and approximately doubles the performance of the fetch (http get), i.e. halves the fetch latency. The parse time is of course about the same (no changes were made there).

Using requests module (as per current master branch):

NURLS   NMETRICS   WALLTIME    FETCHTIME    PARSETIME    TOTALTIME
1       2          12.22ms     0.35ms     + 0.00ms     = 0.35ms
10      2          47.94ms     12.80ms    + 0.09ms     = 12.89ms
50      2          284.78ms    86.08ms    + 0.41ms     = 86.49ms
100     2          762.37ms    178.75ms   + 1.24ms     = 179.99ms

Using pycurl:

NURLS   NMETRICS   WALLTIME    FETCHTIME    PARSETIME    TOTALTIME
1       2          8.26ms      0.11ms     + 0.00ms     = 0.11ms
10      2          22.74ms     2.06ms     + 0.06ms     = 2.12ms
50      2          135.02ms    10.28ms    + 0.76ms     = 11.04ms
100     2          468.68ms    23.02ms    + 1.15ms     = 24.17ms

Above, each URL end-point serves two metrics and points at a remote host running apache over gig-E. WALLTIME is milliseconds to fetch all the URLs (averaged over 10 fetches of all metrics). FETCHTIME is from the control.fetch_time[0] metric (averaged over 10 repeats), PARSETIME is the parse_time metric, and TOTALTIME is the sum of fetch+parse.

diff --git a/src/pmdas/prometheus/pmdaprometheus.python b/src/pmdas/prometheus/pmdaprometheus.python
index e66f7dbac..045912644 100755
--- a/src/pmdas/prometheus/pmdaprometheus.python
+++ b/src/pmdas/prometheus/pmdaprometheus.python
@@ -25,7 +25,8 @@ import subprocess
 from ctypes import c_int
 from socket import gethostname
 from stat import ST_MODE, S_IXUSR
-import requests
+import pycurl
+import io
 
 import cpmapi as c_api
 import cpmda
@@ -456,7 +457,7 @@ class Source(object):
         self.refresh_time = 0 # "never"
         if not is_scripted:
             # source is a URL.  Create a session for it.
-            self.requests = requests.Session() # allow persistent connections etc.
+            self.requests = pycurl.Curl() # allow persistent connections etc.
             self.headers = {} # dict of headers for the http get
             self.filterlist = [] # list of filters
 
@@ -659,6 +660,7 @@ class Source(object):
             return
 
         # fetch the document
+        buf = io.BytesIO()
         try:
             if self.is_scripted:
                 # Execute file, expecting prometheus metric data on stdout.
@@ -669,9 +671,10 @@ class Source(object):
                 if self.url.startswith('file://'):
                     document = open(self.url[7:],'r').read()
                 else:
-                    r = self.requests.get(self.url,headers=self.headers,timeout=timeout)
-                    r.raise_for_status() # non-200?  ERROR
-                    document = r.text
+                    self.requests.setopt(self.requests.URL,self.url)
+                    self.requests.setopt(self.requests.WRITEFUNCTION,buf.write)
+                    self.requests.perform() # GET 
+                    document = buf.getvalue().decode()
 
             # update fetch time counter stats, in ms
             self.parse_time = time.time()
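
One caveat with the drop-in above: it loses the per-request timeout and the raise_for_status() check that the requests version had. Both have direct pycurl equivalents, roughly (untested):

self.requests.setopt(pycurl.TIMEOUT_MS, int(timeout * 1000))
self.requests.perform()
status = self.requests.getinfo(pycurl.RESPONSE_CODE)
if status != 200:
    raise RuntimeError('%s returned HTTP %d' % (self.url, status))

perform() raises pycurl.error on connection failures and timeouts, which the existing exception handling around the fetch should already catch.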

@goodwinos
Contributor

pycurl also has pycurl.CurlMulti(), which can launch a swag of async http gets and then harvest them in a select loop. This could fit in pretty well with our existing code - i.e. launch all the fetches needed for a refresh in parallel, then process/parse them sequentially as the responses come back (see the sketch below). I think it's worth pursuing.
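
A rough sketch of what that could look like (untested; per-handle error handling via multi.info_read() is omitted, and stashing the buffer as an attribute on each Curl handle is just the trick used in the pycurl examples):

import io
import pycurl

def fetch_all(urls):
    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        c = pycurl.Curl()
        c.buf = io.BytesIO()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
        multi.add_handle(c)
        handles.append(c)

    # one select loop drives all the transfers concurrently
    active = len(handles)
    while active:
        while True:
            ret, active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)

    # harvest sequentially; this is where parse_lines would run
    documents = []
    for c in handles:
        documents.append(c.buf.getvalue().decode())
        multi.remove_handle(c)
        c.close()
    return documents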

@fche
Contributor Author

fche commented Apr 9, 2018

> your patch (as fpasted) is substantially slower than the current master branch

Can you please elaborate?
