Support sharing registry across multiple workers (where possible) #30
Thinking on this, we'd want to implement the sharing mechanism ourselves so it could be used across any system. It will be limited to counter-based metrics.

https://github.com/brian-brazil/client_python/tree/multi-process is a first pass at this, which seems to work. It only really handles the counter-style types at the moment, and comes with a lot of caveats.

Have you come up with a solution to this yet, by any chance? We're just starting to look at using Prometheus in some of our Python apps, and most of them are using uwsgi with multiple workers/processes.

You can see how this will work in my previous update; it works but needs cleaning up to handle all the edge cases.
I'm still looking through the code to see how everything works, but I haven't been able to get the multi-process branch you linked to working yet. I tried something like this:

and get this error:

Am I attempting to do this correctly, by creating a new registry from MultiProcessCollector and passing that to the individual metrics? Or is MultiProcessCollector supposed to be used automatically? I could be doing something else wrong, but I tried this originally:

When I tried the above, I saw the *.db shelve files being created, but when I went to port 9080, only the ProcessCollector metrics showed up.
Thanks for the help, I was able to get it working in one app as a test. The only caveat so far is that I had to add a command to clean out

Yeah, that's the intended way to use it.

I've done some additional work, which is up at https://github.com/prometheus/client_python/tree/multiproc. In terms of metrics, everything except gauges is supported.

I've completed gauge support and added docs. Please try out the multiproc branch.
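For readers arriving at this thread today: the multiproc branch discussed here was eventually merged, and a minimal sketch of how the multiprocess mode is used in current prometheus_client releases looks like the following. The temporary directory and metric name are illustrative assumptions; the env-var spelling has varied across versions, so both forms are set here.

```python
# Sketch of prometheus_client's multiprocess mode as shipped in current
# releases. Any empty, writable directory shared by the workers will do.
import os
import tempfile

# The env var must be set before any prometheus_client metrics are created.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()
os.environ["prometheus_multiproc_dir"] = os.environ["PROMETHEUS_MULTIPROC_DIR"]  # older spelling

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

REQUESTS = Counter("app_requests_total", "Total requests handled")
REQUESTS.inc()  # each worker process increments its own per-pid file

# At scrape time, aggregate the per-process files into a fresh registry.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```

The key point, matching the usage discussed above, is that the per-metric state lives in files keyed by pid, and a fresh registry plus a MultiProcessCollector merges them at scrape time.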
Thanks for the update @brian-brazil! I'll test it out when I get a chance.

Is this specific to gunicorn? We are using uwsgi.

It shouldn't be; I've only tested it with gunicorn.
What are the performance implications? Have you done any benchmarks?

I haven't done any benchmarks; there's an fdatasync in there that I need to eliminate, though.
Hi, we are trying to use your multi-process implementation with a Flask application behind uwsgi. As I understand your example for Gunicorn, you define two functions:

I have discarded the worker_exit hook for my PoC; I use only basic metrics. I implemented a route in my application to expose the metrics:

```python
@main.route('/metrics')
def metrics():
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    data = generate_latest(registry)
    response_headers = [
        ('Content-type', CONTENT_TYPE_LATEST),
        ('Content-Length', str(len(data)))
    ]
    return Response(response=data, status=200, headers=response_headers)
```

There is an issue in generate_latest: I get the exception `_gdbm.error: [Errno 11] Resource temporarily unavailable`. This happens because the .db files are still open. I put a try/except around that and tried to open the db read-only; that returns metrics, but only those of dead processes. As long as a db stays open I get the exception, and thus cannot get metrics of the currently running processes. It cannot work; the following snippet shows the same behaviour:

```python
>>> import shelve as s
>>> fw = s.open("./test.db")
>>> fw["test"] = "test"
>>> fw.sync()
>>> fr = s.open("./test.db")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/shelve.py", line 243, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/usr/lib/python3.5/shelve.py", line 227, in __init__
    Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
  File "/usr/lib/python3.5/dbm/__init__.py", line 94, in open
    return mod.open(file, flag, mode)
_gdbm.error: [Errno 11] Resource temporarily unavailable
>>> fw.close()
>>> fr = s.open("./test.db")
>>> fr
<shelve.DbfilenameShelf object at 0x7fe65de82d30>
```

Do you have ideas how to work around that? Am I doing something wrong? Best,
It looks like you're using Python 3.5 and gdbm, while I tested on Python 2.7 using hashdb. It looks like this change will need some work to also run on Python 3.5.

My last snippet does indeed work with Python 2.7, but not with Python 3.4.3. The documentation of shelve, in both Python 2 and 3, states that:

Thus I think you should use another DB. I have also tried replacing the default shelve with https://www.jcea.es/programacion/pybsddb.htm (after a pip install bsddb3, since it is not in the standard library anymore). It seems to work, but I do not know whether the documentation's statement about concurrency holds. I may provide a patch if required.

That bit of documentation only applied to shelve. I was thinking we could use bsddb directly, which would also give us a bit more control performance-wise.
Excuse me, I am just looking at Prometheus for the first time, but my first impression is that this whole "sharing" design you are attempting is over-engineered. In my opinion, there should be no sharing. Instead, each gunicorn (or whatever) worker should start listening on a random port, and the Prometheus server would (somehow) discover those ports and start scraping individual workers. Aggregation would then be done at a higher level.

I wish the client libraries were allowed to register themselves with a Prometheus server, to tell it that they are alive and which port they can be scraped on. Hm, maybe that is what a service like Consul is all about, but I haven't looked in detail yet...
That's how Prometheus generally works, but it isn't compatible with gunicorn-style designs, which tend to bring processes up and down too often for that to work well.

If Prometheus isn't compatible with this, then I think I'm better off with statsd. But I'm not convinced! Gunicorn doesn't bring processes up and down too often; worker processes tend to remain up for a very long time. I really think this (using Consul for discovering worker processes) has a fair chance of working, and I'm willing to invest some time to see if I can make it work well.

From my research, support for dynamically changing the number of processes, such as in response to load, is not uncommon. The approach we use here will be applied to other languages such as Ruby.
From the Prometheus docs:

So maybe you have a point. My approach of having per-process metrics would require having the pid or some such as a label, which seems to be against the recommendation above. OTOH, processes sharing data via a db file is a terrible solution, as is any WSGI-specific solution (we have not only gunicorn workers, but also celery workers, and even plain multiprocessing-module subprocesses). Unfortunately I can't think of any good, efficient, and simple solution. Here's an idea I just had, but take it more as a thought process than a finished idea:

I think this solution scales pretty well, since aggregation of data is done only at each scrape interval, not for every sample. The downside is that it is a bit complicated. :(
This is where things fall down. How do you aggregate data from processes that no longer exist? A file per process avoids this problem, as the files survive the process.
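The file-per-process point can be sketched with plain JSON files: each worker only ever writes its own per-pid file, and the scrape-time collector sums across all files, including those left behind by dead workers, so counters never lose increments. The directory layout and file naming here are illustrative assumptions; the actual branch used shelve files, not JSON.

```python
# Minimal sketch of the file-per-process idea using plain JSON files.
# File layout is a made-up example, not the real branch's format.
import json
import os

METRICS_DIR = "metrics_db"  # hypothetical directory shared by all workers

def inc_counter(name, amount=1, pid=None):
    """Each process only ever touches its own counter_<pid>.json file."""
    pid = pid if pid is not None else os.getpid()
    path = os.path.join(METRICS_DIR, "counter_%d.json" % pid)
    counts = {}
    if os.path.exists(path):
        with open(path) as f:
            counts = json.load(f)
    counts[name] = counts.get(name, 0) + amount
    with open(path, "w") as f:
        json.dump(counts, f)

def collect():
    """Scrape-time aggregation: sum every per-pid file, dead or alive."""
    totals = {}
    for fname in os.listdir(METRICS_DIR):
        with open(os.path.join(METRICS_DIR, fname)) as f:
            for name, value in json.load(f).items():
                totals[name] = totals.get(name, 0) + value
    return totals

os.makedirs(METRICS_DIR, exist_ok=True)
inc_counter("http_requests_total", pid=1001)            # a worker that has since died
inc_counter("http_requests_total", amount=2, pid=1002)  # a live worker
print(collect())
```

Because the dead worker's file persists on disk, its increments still appear in the aggregate, which is exactly why this scheme keeps counters correct where a scrape-each-live-process scheme would not.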
Well, if a process dies, the HTTP socket to it will fail with connection refused, and we ignore it. We will not aggregate data from it for the last scrape interval. So you lose a few seconds of that worker's metric data; in this case, not a big deal. Again, I am not convinced worker processes start and die very often; they tend to remain rather stable.

That would break all counters; you need that data.

So what if in a web app you don't count the last few HTTP requests that come through one of the workers; it's not a big deal if it happens rarely. It's not like Prometheus will be used for banking transactions or anything... If it is still a problem, an atexit handler that proactively sends the last data to the elected worker could solve it. It's still better than saving a DB file every time you update a metric; that is terrible for performance...

Anyway, I should probably stop spamming this thread, I don't want to be annoying. A poor solution is better than no solution at all. If I find time to prototype something, we can reopen the discussion. It's usually easy to think of solutions that then break down when you try to code them, I know...
Any updates or solutions to this problem?

@oliver006 After abandoning #70, moving to the statsd exporter is exactly what we did; it works just fine!

Thanks Ben. I went the statsd_exporter route, deploying one locally with each of my Python apps, and so far that's been working well.
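The statsd_exporter route mentioned above sidesteps the multi-process problem entirely: every worker fires fire-and-forget UDP packets at a local exporter, and the exporter holds the aggregated state that Prometheus scrapes. A sketch using only the statsd line protocol; the metric name is illustrative, and 9125 is statsd_exporter's default statsd port.

```python
# Sketch of the statsd_exporter approach: workers emit statsd counter
# increments over UDP; a single local exporter aggregates across processes.
import socket

STATSD_ADDR = ("127.0.0.1", 9125)  # statsd_exporter's default statsd port

def statsd_incr(name, value=1, addr=STATSD_ADDR):
    """Send a statsd counter increment; fire-and-forget, safe from any worker."""
    payload = ("%s:%d|c" % (name, value)).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, addr)
    finally:
        sock.close()

statsd_incr("myapp.http_requests")  # each gunicorn/uwsgi worker calls this independently
```

Since UDP sends never block on a missing receiver, this adds essentially no latency to request handling, which is why it works well even with many short-lived workers.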
https://github.com/korfuri/django-prometheus/blob/master/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes is a design for letting each worker process listen on a distinct Prometheus port. It does assume the collector will scrape all the allowed reporting ports to catch 'em all.
Cool, I quite like the django-prometheus solution!
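The core of the django-prometheus design linked above is simple: each worker binds the first free port in an agreed range, and Prometheus is configured to scrape every port in that range. A sketch of just the port-picking logic; the range is an illustrative assumption, and a real worker would serve its metrics endpoint on the returned socket.

```python
# Sketch of the per-worker port-range idea: each worker grabs the first
# free port in a range the scraper already knows about.
import socket

PORT_RANGE = range(8001, 8011)  # assumed range the scraper is told to try

def bind_first_free_port(ports=PORT_RANGE):
    """Return a listening socket on the first free port in the range."""
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(("127.0.0.1", port))
        except OSError:
            sock.close()  # another worker already owns this port; try the next
            continue
        sock.listen(1)
        return sock
    raise RuntimeError("no free port in range")

server = bind_first_free_port()
print("this worker exports metrics on port", server.getsockname()[1])
```

The trade-off, as noted above, is that the scraper must probe the whole range and that counters from workers that have exited disappear from the scrape, which is the same objection raised earlier in this thread.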
Similar to prometheus/client_ruby#9, when using the Python client library on workers that get load-balanced by something like gunicorn or uwsgi, each scrape hits only one worker, since workers can't share state with each other.

At least uwsgi supports sharing memory: http://uwsgi-docs.readthedocs.org/en/latest/SharedArea.html

This should be used to share the registry across all workers. Maybe gunicorn supports something similar.