WIP: uWSGI Multiprocess Support #70

Closed
wants to merge 8 commits

Conversation

bendemaree

We've been looking into Prometheus and were almost immediately bitten by #30. We brainstormed some solutions, and I've made some fairly large changes on top of @brian-brazil's work in the multiproc branch, so I'm opening this now, unfinished, to gather feedback and (hopefully) finish it out.

High-level list of changes:

  • The MultiProcess naming has been swapped for Partitioned (PartitionedCollector, for example).
  • Formalized Metric and Submetric as classes that know how to recreate themselves from otherwise mushy data (the key JSON, for example). I'm still unhappy with these: they overlap confusingly with the Metric class in the main part of the library, and I think the two should be merged into a single kind, but I held off in order to limit scope.
  • Broke out the heavily loop-based logic for samples into a bit more object-based structure, where PartitionedMetrics hold the logic for building their list of samples, rather than having branches and further inner loops to handle each case. I think there is more that can be done here.
  • Perhaps most importantly, this breaks out the logic for merging samples across "partitions" (separated by process boundaries) from the code that collects that data from shelve, so it is more generic (see the sketch after this list). This opens the door to...
  • uWSGI cache subsystem-backed values, which are still divided by process, but should be significantly faster than shelve. There are numerous sticky things about this implementation still, and I think this is more of a POC than anything, but I'm now convinced it will work. See the uWSGI notes at the bottom for more...
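
A rough sketch of that partition-merge idea (the names here, like merge_partitions, are hypothetical and not the PR's actual API): each live process owns a partition of values keyed by metric, and collection folds the partitions together.

```python
from collections import defaultdict

def merge_partitions(partitions):
    """Hypothetical sketch: partitions maps pid -> {metric_key: value}.

    Counter-style values simply sum across processes; gauges would need a
    separate merge rule (min/max/sum/last), which is glossed over here.
    """
    merged = defaultdict(float)
    for pid, values in partitions.items():
        for key, value in values.items():
            merged[key] += value
    return dict(merged)
```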

Minor changes:

  • The shelve files are explicitly opened and closed now. Tests were failing under OS X as the files were in use.
  • Split out similar constants into enum-like classes, which should be less fragile in the long term. I haven't propagated this to the other modules yet, but I think that should happen outside this PR.

Still left to do:

  • Revise docs
  • Further test coverage
  • Clean up and refine semantics
  • Review/rework key format (see UWSGIValue constructor)

So, again, I'm hoping for feedback on the direction this is going, other ideas, critiques, and nits. I do apologize for the huge PR, but I hope it's welcome!

uWSGI notes:

  • The cache subsystem deals in integers, so I used a terrible trick (multiply by a sufficiently large number, truncate the decimal places) to smash a float into it; see the sketch after this list. I'd love ideas on a more rigorous way to do this. It's also pretty space-expensive, but the cache block sizes are fixed anyway.
  • The cache subsystem can atomically increment and decrement but "set" is unimplemented. If it were available, it might be that a partitioned strategy isn't needed at all as the cache is global and IPC-safe.
  • The metrics subsystem would be an ideal fit but the metrics one collects must be defined up front, and cannot be specified at runtime, which I think is against the philosophy of Prometheus and would also just not work with the library as-is.
  • I'm still not happy with the uWSGI test "harness": it involves installing uWSGI, pytest, and the client in a virtualenv, then running pytest, which will at some point spawn an app running the test file under uWSGI, make an HTTP request to that app, and either succeed or fail based on the response code. There's no insight into which tests failed at that point, and it's very hacky, but uWSGI's Python API seems to only be available when uWSGI is running. Still brainstorming; it might be best to mock out the uWSGI cache interface, for example.
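
For concreteness, a minimal sketch of the scaling trick mentioned above; the RESOLUTION constant is illustrative, precision is lost below 1/RESOLUTION, and very large values can overflow the cache's 64-bit integers.

```python
RESOLUTION = 10 ** 6  # illustrative; the PR's actual multiplier may differ

def float_to_cache_int(value):
    # Scale and truncate so the float fits the cache's integer-only slots.
    return int(value * RESOLUTION)

def cache_int_to_float(raw):
    return raw / float(RESOLUTION)
```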

Ben Demaree added 7 commits December 15, 2015 22:04
Files may be in contention without this depending on the platform and DBM implementation.
Still a large number of issues to build out, but several new concepts:
- Store floats as ints in the uWSGI cache framework with a fixed
  "resolution" multiplier so floats can be crammed into integer slots.
- Use the same "partition" idea to divide uWSGI cache values by PID as
  well. This is largely because there is no numeric, atomic uWSGI cache set op.
- Rename things that were "multiprocess" to "partitioned".
- Stores a list of uWSGI workers in a magic cache key; this will likely
  be unused as the entire cache is crawled for relevant keys.
- Dead or alive PIDs are checked against the uWSGI worker status dict.
- An extremely awkward test harness was added for uWSGI that currently
  must be run _under uWSGI_ as an app...

Notes:
- uWSGI cache operations in the Python interface are largely
  undocumented. There is no way to "set" a number in the cache.
- The metrics framework is unable to declare new metrics at runtime.
Extends the uWSGI test suite to spawn uWSGI, run that same test file as
a WSGI app, then make an HTTP request against that running instance and
assert on the response code, which is 204 if the uWSGI tests pass and
500 if they fail.

This is done because there does not seem to be a way to mock the uWSGI
Python module beyond running uWSGI itself.

The uWSGI tests must be run within a virtual env, which is used to
determine the location of pytest and uwsgi.
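
For reference, a hedged sketch of the harness shape this commit describes; the uWSGI flags, port, and file path are illustrative, not the PR's actual invocation.

```python
import subprocess
import time
import urllib.request

# Boot uWSGI serving the test file as a WSGI app, then assert on the response
# code: the app is expected to return 204 if its in-process tests pass and 500
# if they fail (a 500 makes urlopen raise, which also fails the check).
proc = subprocess.Popen(['uwsgi', '--http', ':9090', '--wsgi-file', 'tests/test_uwsgi.py'])
try:
    time.sleep(2)  # crude wait for the server to come up
    status = urllib.request.urlopen('http://127.0.0.1:9090/').getcode()
    assert status == 204
finally:
    proc.terminate()
```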
@bendemaree
Author

Also, #66 should go in before this.

Also fixes the locking around the value so the lock is instantiated
outside of the class.

Additionally, bumped up the resolution multiplier so it gives
significantly greater precision.
@brian-brazil
Contributor

Thanks for the change, this is an important area for the python client.

The goal here is to have something that'll work for all the major uses of multi-process instrumentation in Python, with semantics that are as close as we can get them to the threaded equivalent.

I'm not too familiar with uWSGI's cache. My initial research indicates that it's not a good match. Firstly, it's a cache not a datastore so we need to worry about expiry. Secondly it appears to support modes meant for multi-machine synchronisation including UDP and Django's database. If a user is using one of these then the semantics and performance will not be as desired. Finally it only works for those using uWSGI so even if we can handle the previous issues, we're left with all the other use cases.

We do need to support floats. As it can only store a 64bit integer and we're using 64bit floats, one approach is to put the float's machine representation in the cache.
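
A minimal sketch of that bit-reinterpretation idea, assuming Python's standard struct module (note that it gives up the cache's atomic increment, since the stored integer no longer behaves numerically):

```python
import struct

def float_to_int64(value):
    # Reinterpret the float's IEEE-754 bits as a signed 64-bit integer.
    return struct.unpack('<q', struct.pack('<d', value))[0]

def int64_to_float(raw):
    # Reverse the reinterpretation to recover the original float exactly.
    return struct.unpack('<d', struct.pack('<q', raw))[0]
```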

At a high level I think this could be an option for some of our users, but it's not a full solution for the multi-process problem. As it stands, the only thing preventing #66 from being a full solution is getting a shelve equivalent that supports concurrency and doesn't call fsync on every write to disk.

@bendemaree
Author

The goal here is to have something that'll work for all the major uses of multi-process instrumentation in Python, with semantics that are as close as we can get them to the threaded equivalent.

I don't feel that this moves away from that goal. It offers an adapter that brings a "built-in" speed and efficiency increase when users are running their Python WSGI app under one of the main WSGI servers used in production environments; were Gunicorn to have similar features, it would make sense to optimize for them as well.

That said, I see now that the uWSGI "adapter" should be opt-in, rather than automatic if uwsgi is importable. I can make that change.

The thesis of this PR is just that the trade-off shelve makes, universality in exchange for a large increase in disk hits, is unacceptable for my environment. I should run some performance tests to verify that shelve will be slow in this context, but I'm not sure that's necessary unless you want me to do it: shelve is slow, and disk is slow.

Firstly, it's a cache not a datastore so we need to worry about expiry.

The uWSGI cache's default settings have expiration disabled. Additionally, I think the number of cache entries, cache block size, etc. are deployment concerns the user takes on; they should consider them just as they consider the implications of writing shelve files to disk.

Secondly it appears to support modes meant for multi-machine synchronisation including UDP and Django's database. If a user is using one of these then the semantics and performance will not be as desired.

It does, but there's no reason to use these features in a Prometheus setup. If you have multiple nodes exposing metrics, it's better to have your Prometheus server scrape each node than to side-aggregate all of the metrics with a uWSGI feature and expose them from a single node. These features are not on by default, and caches can be named and kept distinct; the user could opt to have one named cache use UDP sync and not another. I will update the PR to support a named cache; that's an oversight.

Finally it only works for those using uWSGI so even if we can handle the previous issues, we're left with all the other use cases.

Again, I think this should be viewed as an optimization, not a catch-all. For those using uWSGI (I am assuming that's not a small number), it should be much faster and won't hit disk.

We do need to support floats. As it can only store a 64bit integer and we're using 64bit floats, one approach is to put the float's machine representation in the cache.

This is what I'd like to do. The multiplication thing is a hack.

At a high level I think this could be an option for some of our users, but it's not a full solution for the multi-process problem.

I spent some time thinking this through before I embarked on the PR. What are the ways other programs cross the process divide? More or less: a standalone database, shared disk, shared memory, or a shared network resource. Databases and shared network resources are no good because they impose a significant dependency on the end user for what should be a lightweight library. Shared disk is well-known, but slow. Shared memory is fast but requires management and allocation.

Arguably, shared disk is one of the few things that is somewhat universally available in a Python implementation. You don't want to assume you have shared memory across the user's processes, because that would be managed "above" Python. So I think the disk-based route is your baseline, and it serves your broadest audience, unless a shared-memory solution is available (and uWSGI provides this). Even disk has caveats, though: Google App Engine's Python runtime doesn't allow writing to disk, for example.

My opinion is that in this scenario, it makes sense to offer additional "adapters" to optimize for a certain environment so that some users can benefit from better performance, but nearly everyone can use it one way or another.

As it stands, the only thing preventing #66 from being a full solution is getting a shelve equivalent that supports concurrency and doesn't call fsync on every write to disk.

Everything I know of that fits that description is a full database rather than a simple persistence utility. You essentially need to buffer before disk, and you want concurrent access to that buffer; there are options here, like locking the shelve database and flushing periodically, but at that point you're looking like a database again, and it still has to flush periodically. And between flushes, the exported metrics will be stale.

@brian-brazil
Contributor

You essentially need to buffer before disk, and you want concurrent access to that buffer; there are options here, like locking the shelve database and flushing periodically,

What's needed is a database that will write to the disk via mmap, but not call fsync. Worst case we can implement it on top of mmap ourselves. I used shelve initially to avoid having to come up with a binary file format for the proof of concept.
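
A hedged sketch of that approach (the file layout and sizes are invented for illustration): values are written through an mmap'd region, the OS flushes dirty pages in the background, and neither fsync nor msync is ever called explicitly.

```python
import mmap
import os
import struct

def open_value_file(path, size=1 << 20):
    # Pre-size the file so the whole mapping is valid, then map it read/write.
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    os.ftruncate(fd, size)
    return mmap.mmap(fd, size)

def write_double(mm, offset, value):
    # A plain memory write; the kernel writes the dirty page back eventually.
    mm[offset:offset + 8] = struct.pack('<d', value)
```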

@bendemaree
Author

mmap + pickle (or cPickle) will certainly be faster than shelve. I think you would need to create a custom format, as shelve will just use whatever DBM is available. You'll also have to track the files you're writing to, right? I don't know how the globbing trick will work on mmap'd files (not saying it won't, I just don't know). If you use file descriptors from files on disk, that could work, I suppose.

I still think those are optimizations to a file-based "metric backend." Why not a uWSGI metric backend? If a user would prefer memcached, that could be a metric backend too, right? I don't see why these are less valid approaches.

@brian-brazil
Contributor

You'll also have to track the files you're writing to, right? I don't know how the globbing trick will work on mmap'd files (not saying it won't, I just don't know).

It'll work roughly the same as now; they're files on disk. The only new problem is dealing with growing the file.
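
For what it's worth, a sketch of what that discovery could look like (the directory and per-PID naming scheme here are hypothetical): the per-process files remain ordinary files on disk, so globbing works the same whether or not they are mmap'd.

```python
import glob
import os

def discover_partitions(directory):
    # Hypothetical naming scheme: one "values_<pid>.db" file per worker process.
    partitions = {}
    for path in glob.glob(os.path.join(directory, 'values_*.db')):
        name = os.path.basename(path)
        pid = int(name[len('values_'):-len('.db')])
        partitions[pid] = path
    return partitions
```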

Why not a uWSGI metric backend?

It's a possibility, but it only covers some users. This PR is also far bigger than adding a uWSGI backend. I'm very hesitant to accept a significant re-architecture that provides only a partial solution. If it worked with the existing code I'd be willing to accept it (though we'll need to figure out something for the tests).

If a user would prefer memcached, that could be a metric backend too, right?

That should never be a backend. Talking to the network on each instrumentation call would be very limiting performance wise.

@bendemaree
Author

This PR is also far bigger than adding a uWSGI backend.

I won't argue there; said it myself at the outset. 😉 Would you be open to a different PR that removes any uWSGI concerns? The bulk of the work is to open up the logic of combining metrics from different PIDs to be more accessible and not shelve-dependent. I can defer other stylistic changes too, like the enums.

If you're open to an architectural PR, shaped by this critique, that aims at making what comes with the base client extensible, that might help others out too. For example, if that can get in, I'd be able to make my own subclass that uses the uWSGI cache as storage, and it wouldn't need to be in this project.

That should never be a backend. Talking to the network on each instrumentation call would be very limiting performance wise.

I don't know if that's true. A local UNIX socket would be very fast. I won't assume that's right until I can test it, though.

@brian-brazil
Contributor

The bulk of the work is to open up the logic of combining metrics from different PIDs to be more accessible and not shelve-dependent. I can defer other stylistic changes too, like the enums.

Your code is just under 4 times longer than what it's replacing. I think a straight factoring out of the merging code in collect would give the same benefits in a simpler way.

I'd be able to make my own subclass that uses the uWSGI cache as storage and it wouldn't need to be in this project.

It's intended to be extensible, though there's currently no way to hook in _ValueClass.

A local UNIX socket would be very fast. I won't assume that's right until I can test it, though.

That's a syscall for every event versus a mutex for mmap. There'll be orders of magnitude difference in latency.

@dmclain mentioned this pull request on Jan 12, 2016
@bendemaree
Author

I'm closing this out and will work on it using the strategy discussed here and with @dmclain in #66.

Your code is just under 4 times longer than what it's replacing.

I don't think that's a fair assessment, but I'll try to make later revisions more compact if that's important to this library.
