Support pre-fork servers #9

Open
atombender opened this Issue Feb 8, 2015 · 75 comments

atombender commented Feb 8, 2015

If you use this gem with a multi-process Rack server such as Unicorn, surely each worker will be returning just a fraction of the correct results (e.g., number of requests served, total time), thus making the exposed metrics fairly meaningless?

To solve this, the easiest solution is to create a block of shared memory in the master process that all workers share, instead of using instance variables.

brian-brazil commented Feb 8, 2015

I believe this problem came up with some users of the (experimental) Python client, and some proposed work on the HAProxy exporter.

What you really want to do is separately scrape each worker, and then aggregate it up in the Prometheus server. This gives the most useful/correct results, and also gives you per-process statistics for things like memory and CPU. I'm not familiar with Unicorn or Rack; do they offer anything that'd allow for this?

Shared memory is an option, though it has disadvantages: it'd be a source of contention, and you can't do per-process analysis.

juliusv commented Feb 8, 2015

From the Unicorn Design doc:

"Unicorn is a traditional UNIX prefork web server. [...] Load balancing between worker processes is done by the OS kernel. All workers share a common set of listener sockets and does non-blocking accept() on them. The kernel will decide which worker process to give a socket to and workers will sleep if there is nothing to accept()."

Given that, I'm not sure how doable it is to scrape individual workers. Even if it was possible, the benefit of having individual insight and avoiding contention has to be weighed against potentially having an order of magnitude more time series.

I'd expect that on average, the workers on a single machine will behave similarly, except admittedly in pathological situations like this one pointed out in the Unicorn docs: "When your application goes awry, a BOFH can just "kill -9" the runaway worker process without worrying about tearing all clients down, just one".

It also states that "Unicorn deployments are typically only 2-4 processes per-core", which would mean multiplying the number of exposed time series on a typical large machine with e.g. 32 cores by 64x to 128x.

@grobie Did you have any plans wrt. supporting preforked servers already?

atombender commented Feb 8, 2015

@brian-brazil: It's possible to collect per-process information with shared memory: Just organize the memory by PID. You'll have to reap dead PIDs, though. If you're clever you can avoid locking altogether, by having each process write to its own dedicated segment of the shared memory.

Scraping individual workers, as @juliusv points out, would mostly be useful to the extent that it would capture misbehaving workers that consume too much memory or CPU.

Having run Unicorn in production for years, I would agree that this doesn't happen often enough to justify the overhead of monitoring individual workers. Unicorn's master process will "kill -9" processes that don't respond within a per-process request timeout. In my experience, this tends to cover the pathological cases.

In terms of scaling, one of our clusters has 300 Unicorn workers on each box (anything from 4-25 per app). I wouldn't want to have Prometheus poll that many workers every few seconds.
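
A rough sketch of the shared-memory idea being discussed, hedged: this is not this gem's API, it uses an ordinary file written with IO#pwrite as a stand-in for a real shared-memory region, and it assumes a fixed number of workers whose slot index (a PID-to-slot mapping or a stable worker number) is known inside each worker.

# Sketch only: one fixed-size slot per worker in a file created by the master
# before forking, so every worker inherits a handle to the same data. Each
# worker writes only its own slot (Unicorn workers are single-threaded), so no
# locking is needed; whichever worker serves /metrics reads every slot.
class SharedSlots
  SLOT_SIZE = 8 # one float64 per worker

  def initialize(path, workers)
    @file = File.open(path, File::RDWR | File::CREAT)
    @file.truncate(workers * SLOT_SIZE)
    @workers = workers
  end

  # Called from worker `id` only -- a single writer per slot.
  def increment(id, by = 1.0)
    @file.pwrite([read_slot(id) + by].pack("d"), id * SLOT_SIZE)
  end

  # Called at scrape time: one value per worker, ready to label or sum.
  def all
    Array.new(@workers) { |id| read_slot(id) }
  end

  private

  def read_slot(id)
    @file.pread(SLOT_SIZE, id * SLOT_SIZE).unpack1("d")
  end
end

The master would create the instance before forking so that all workers inherit it; because each worker is the only writer of its own slot, the write path needs no lock.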

juliusv commented Feb 8, 2015

👍 Sounds like we need a shared-memory solution or similar.

brian-brazil commented Feb 8, 2015

It's possible to collect per-process information with shared memory: Just organize the memory by PID. You'll have to reap dead PIDs, though. If you're clever you can avoid locking altogether, by having each process write to its own dedicated segment of the shared memory.

You need to be a little careful here, as you can't simply throw away data from old PIDs. You'd need to fold their counters back in to ensure the totals stay monotonically increasing. Organizing on >128-byte boundaries (the size of a cache line) should avoid locking issues during collection.

How would we deal with gauges? Both the set and inc/dec styles are problematic. The set style doesn't make sense when there are multiple processes (you really need it per-process), and with the inc/dec pattern used to count in-progress requests you probably want to reset to 0 for a PID when that PID dies.

I wouldn't want to have Prometheus poll that many workers every few seconds.

60s is a more than sufficient monitoring interval for most purposes. Benchmarking will determine the actual impact.

This seems like it'll get very complicated to implement and use, as most uses of instrumentation would need to take the process model into account to get useful and correct results out the other end.
How do other instrumentation systems solve this problem?
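
To make the monotonicity point above concrete, here is a minimal sketch of just the accounting; it ignores the shared-memory mechanics entirely and is not any existing API. When a PID is reaped, its counter value is folded into an archive bucket so the exposed total never decreases.

# Hypothetical reaper logic: fold a dead worker's counter into an archived
# value so the aggregate stays monotonically increasing when slots are reused.
class CounterRegistry
  def initialize
    @live = Hash.new(0.0) # pid => current counter value
    @archived = 0.0       # sum of counters from reaped pids
  end

  def increment(pid, by = 1.0)
    @live[pid] += by
  end

  def reap(pid)
    @archived += @live.delete(pid).to_f
  end

  # The value exposed on /metrics: never goes down when a worker dies.
  def total
    @archived + @live.values.sum
  end
end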

atombender commented Feb 8, 2015

You need to be a little careful here, as you can't simply throw away data from old PIDs.

How would this differ from the hypothetical non-shared case where each worker exposes a metrics port? If the worker dies and starts again, all its counters will restart at zero. In a hypothetical shared-memory implementation where each worker has a dedicated segment that is reported separately, it would be the same thing, no?

Anyway, the problem is moot because I just realized that Unicorn uniquely identifies workers by a sequential number, which is stable. So if worker 3 dies, the next worker spawned is also given the number 3. That means no reaping is needed, nor is it necessary to do any kind of PID mapping, which I was worried about.

I did a little digging into the Unicorn sources and realized that Unicorn uses a gem called raindrops to accomplish exactly what I have been describing so far. Raindrops uses shared memory to collect and aggregate worker metrics (Unicorn also seems to use Raindrops directly to detect idle workers), but it looks like it only cares about socket metrics, not anything related to HTTP. Worst case, if it cannot be used as-is, I could look into extracting the relevant code from Raindrops into this gem.

brian-brazil commented Feb 8, 2015

How would this differ from the hypothetical non-shared case where each worker exposes a metrics port?

When the data comes from each worker, there's no persistence of data beyond a worker's lifetime, and the rate() function works as expected. With shared memory where the client is smart and doing aggregation, you need to not lose increments from dead workers, as counters need to be kept monotonic.

Prometheus has the paradigm that instrumentation is dumb and all the logic is done in the Prometheus server. When adding logic to clients, you need to be careful that it maintains the same semantics as a dumb client.

So if worker 3 dies, the next worker spawned is also given the number 3.

That makes things easier alright. And as I presume everything is single-threaded, we don't need locking for metric updates.

Thinking on this over the past bit, I think we can make this work, but only for Counter and Counter-like metrics (Summary without percentiles, and Histogram).

juliusv commented Feb 8, 2015

@brian-brazil Gauges are indeed a problem. I see four possibilities in principle for them:

  1. Allow the user to configure a merging strategy (averaging, summing, ...) for each gauge. The actual merging work wouldn't need to happen continuously, but only upon scrape.
  2. Make an exception for gauges and distinguish them by a "worker" label.
  3. Use "last write wins". But that's not useful at all for any common gauge use cases (queue lengths, etc.).
  4. Don't support gauges in this kind of environment, expecting that the bulk of relevant metrics for this use case would be counters and summaries anyways.

1 is probably what we'd want if we don't split metrics by worker otherwise.

@atombender If indeed one Unicorn server is always expected to have the same number of stably identified worker processes (at least between complete Unicorn restarts, which would also wipe the shared memory), your approach sounds sensible to me. If I understand the raindrops approach correctly, it uses a single counter for all workers ("counters are shared across all forked children and lock-free"), so there wouldn't even be any counter resets anywhere in that approach if only a single worker is killed?
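
Option 1 above, a per-gauge merge strategy applied only at scrape time, could look roughly like the sketch below; the aggregation setting and helper are hypothetical, not part of the current client. Summing suits gauges like in-progress requests, while max or min fit high- or low-water marks.

# Hypothetical per-gauge aggregation, resolved only when /metrics is scraped.
MERGERS = {
  sum: ->(values) { values.sum },
  max: ->(values) { values.max },
  min: ->(values) { values.min },
  average: ->(values) { values.sum / values.size.to_f }
}.freeze

def merge_gauge(per_worker_values, aggregation: :sum)
  MERGERS.fetch(aggregation).call(per_worker_values)
end

# merge_gauge([3, 0, 5, 1], aggregation: :max) # => 5
# merge_gauge([3, 0, 5, 1])                    # => 9 (summed across workers)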

juliusv commented Feb 8, 2015

(if you're only reading by mail - I edited my last comment because Markdown made a 5 out of my 1...)

atombender commented Feb 8, 2015

Note that I am not suggesting that the client should aggregate anything. I am suggesting that each worker has a slot in the shared memory segment that it is responsible for updating. When a worker gets a /metrics request, it simply returns the contents of all slots, separately labeled by worker ID.

Shouldn't that completely work around the gauge problem? I was initially thinking that the client would aggregate workers, because, as I said earlier, I think per-worker metrics aren't very useful. But if it poses a problem for gauges, I think it's better to "over-report" and do aggregation on the server.

(Pardon my obtuseness, but I'm not too familiar with how Prometheus works yet.)
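
For illustration, that exposition could look like the sketch below; the metric name and helper are made up, and aggregation (e.g. summing away the worker label) is then left entirely to the Prometheus server.

# Render every worker's slot as its own sample, labeled by worker id.
def render(metric_name, slots)
  slots.each_with_index.map do |value, worker_id|
    %(#{metric_name}{worker="#{worker_id}"} #{value})
  end.join("\n")
end

puts render("http_requests_total", [120.0, 98.0, 131.0])
# http_requests_total{worker="0"} 120.0
# http_requests_total{worker="1"} 98.0
# http_requests_total{worker="2"} 131.0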

brian-brazil commented Feb 8, 2015

Don't support gauges in this kind of environment, expecting that the bulk of relevant metrics for this use case would be counters and summaries anyways.

This is my expectation. It sounds like an entirely online-serving use case, with no thread pools or the like where you'd usually want gauges. I can see some use cases for gauges (e.g. if you were batching up requests to send on elsewhere), but the counter uses are the primary ones.

juliusv commented Feb 8, 2015

@atombender Ah, ok. That'd definitely work and require fewer scrapes, though you'll still end up with a lot of dimensionality in your time series that might not prove to be of much value vs. their cost in transfer, storage, and queries.

@brian-brazil Yep.

brian-brazil commented Feb 8, 2015

When a worker gets a /metrics request, it simply returns the contents of all slots, separately labeled by worker ID.

That'd work; I thought we were trying to avoid that due to all the additional time series.

Shouldn't that completely work around the gauge problem? I was initially thinking that the client would aggregate workers, because, as I said earlier, I think per-worker metrics aren't very useful. But if it poses a problem for gauges, I think it's better to "over-report" and do aggregation on the server.

Reporting everything is what I generally favour, assuming it works out resource-wise.

That resolves the counter problem, but it doesn't solve the gauge problem, as gauge use cases tend to work under the assumption that the gauge is reset when the process dies/starts (think of a gauge representing in-progress requests). If Unicorn can clear them when a process dies, that should handle it.

brian-brazil commented Feb 8, 2015

The other limitation is that callback-like collectors won't work with a shared-memory model. This would affect things like runtime and process statistics that you want to gather at scrape time.

grobie commented Feb 9, 2015

@grobie Did you have any plans wrt. supporting preforked servers already?

When we had the use cases at SoundCloud, we discussed three options: a) scraping each worker independently, b) letting the master process handle metrics (the current dumb lock implementation in the Ruby client probably needs some improvement), or c) letting each worker push metrics to a Pushgateway.

I believe we run all our Ruby services on JRuby by now, using the Java client. Going forward with b) sounds good to me.

jeffutter commented Nov 6, 2015

I am very interested in this issue. I am wondering whether it is also seen with threaded servers (like Puma)? I think they may work in a way where memory isn't shared, depending on where/when things get initialized?

If there is a demo app in the repo, I'll play around with it today.

Is anyone aware of any other Rack middlewares that share global state that we could look to for solutions?

jeffutter commented Nov 6, 2015

If raindrops is the right path to look at for this, they have a sample middleware. Their code isn't on GitHub, but here is a clone of that middleware: https://github.com/tmm1/raindrops/blob/master/lib/raindrops/middleware.rb

grobie commented Nov 6, 2015

Thanks, I'll have a look.

jeffutter commented Nov 7, 2015

I have been playing around with raindrops, but can't seem to get the processes writing to the same counters. I'll keep at it. In the meantime I found this: https://github.com/TheClimateCorporation/unicorn-metrics which might show a similar use case.

jeffutter commented Nov 8, 2015

For those interested in this, I have made a little progress. It seems that when using raindrops, the Raindrops counters need to be declared before the workers fork. Otherwise each worker will end up instantiating separate instances of the counters. If you declare them beforehand, it actually works as expected.

This makes the Prometheus middleware a little inconvenient, as you would need to know all possible endpoints and other counters beforehand. This is the approach that the unicorn-metrics project above takes.

I have begun looking at a different approach, where each worker exposes its own metrics at /_metrics and a middleware at /metrics queries the others and combines the results. I currently can't find a way to figure out what the other workers' TCP/Unix sockets are from inside a worker. There is a Unicorn::HttpServer::LISTENERS constant, but it seems to only contain a reference to the current worker.

Hopefully this information can be helpful to someone; I would really love to see Prometheus being useful on Ruby servers.
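
For reference, the pre-fork declaration described above looks roughly like the following sketch against Raindrops' documented counter API (Raindrops.new and incr); the constant names, indices, and middleware are made up for illustration.

require "raindrops"

# Counters are created in the master (e.g. at the top of config.ru or the
# Unicorn config) so that every forked worker shares the same lock-free slots.
# Indices have to be fixed before forking, which is the inconvenience noted above.
REQUESTS   = 0
EXCEPTIONS = 1
COUNTERS   = Raindrops.new(2)

class CountingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    COUNTERS.incr(REQUESTS)
    @app.call(env)
  rescue StandardError
    COUNTERS.incr(EXCEPTIONS)
    raise
  end
end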

brian-brazil commented Nov 8, 2015

The approach I'm taking over in Python is to have each process keep a store of its metrics in a per-process file on disk, and then read them all in at scrape time.

I can see a challenge in distributing the scrape to workers if a worker is single-threaded. You want scrapes to be independent of other processing.
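
Transliterated into Ruby, that approach is roughly the following sketch; the directory layout, file names, and helpers are assumptions, not the Python client's actual on-disk format.

require "json"

# Each worker dumps its counters to its own counters_<pid>.json file;
# the scrape handler simply folds all files together.
def write_own_counters(dir, counters)
  File.write(File.join(dir, "counters_#{Process.pid}.json"), counters.to_json)
end

def aggregate_counters(dir)
  totals = Hash.new(0.0)
  Dir.glob(File.join(dir, "counters_*.json")).each do |path|
    JSON.parse(File.read(path)).each { |name, value| totals[name] += value }
  end
  totals
end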

jeffutter commented Nov 9, 2015

I implemented backing the metrics with PStores. This seems to have the desired effect, albeit causing a slowdown since it writes stats to disk on every request. My branch is here: https://github.com/jeffutter/prometheus-client_ruby/tree/multi-worker I tested it out with Apache Bench doing 2000 requests with 10 concurrent requests and 4 Unicorn workers. After running that, the /metrics endpoint reported the correct number of hits; I'm not sure about the other types of stats.

On this branch, with the above setup, I'm getting ~350 requests/sec, while on the master branch I'm getting over 1,700. While this sounds like a big difference, the duration of the request that returns OK/200 is about 13ms. I'm guessing that under a real-world load, with real request times, this difference will be far less significant.

That being said, there may be better backing stores to use than PStore, or better ways to store it. Also, it should probably be an option to back with a hash when not running in a prefork environment.

Let me know what you think.
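
For readers unfamiliar with PStore (Ruby's stdlib transactional file store), an increment in that model looks roughly like the sketch below; the path and key are illustrative rather than the branch's actual schema, and the per-request transaction is where the slowdown comes from.

require "pstore"

# One store per worker process. Every increment is a full PStore transaction
# (read the file, mutate, rewrite it), hence the throughput drop.
store = PStore.new("/tmp/prometheus_#{Process.pid}.pstore")

def increment(store, key, by = 1)
  store.transaction do
    store[key] = (store[key] || 0) + by
  end
end

increment(store, "http_requests_total")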

brian-brazil commented Nov 9, 2015

That's pretty much the same approach as I'm using in Python; however, I'm only letting this behaviour kick in when explicitly asked for, as it has a number of limitations, including performance (see prometheus/client_python#66).

Scrape-time performance doesn't matter too much. What's really important is that metric increments etc. be fast. Optimally that'd mean something where writes hit the kernel, but not the disk.

jeffutter commented Nov 9, 2015

On my branch (https://github.com/jeffutter/prometheus-client_ruby/tree/multi-worker) I have updated it so there are backing "stores". I have included one with the original behavior (Hash) and one with the new behavior (PStore). If you pass the prefork: true argument to the middlewares it will use the PStore backend; otherwise it uses the original. I imagine this could be extended in the future to support other on-disk or even DB (Redis?) backends.

Also (I'm not sure why), the benchmarks have improved a little. The Hash one is giving me 2,100 requests/sec while the PStore one is up to 1,400. Perhaps it's just a system resource difference from before; I need a more scientific benchmark. However, at 1,400 requests/sec I doubt this will cause a bottleneck under any real web load.

Currently there are a bunch of broken tests. If this solution seems like something that might be considered for merging, I will fix up the tests, update the docs, and clean everything up.

jeffutter commented Nov 12, 2015

For continuity's sake: discussion of this topic has continued on the mailing list here: https://groups.google.com/d/topic/prometheus-developers/oI3qPhaRY7I/discussion

mortenlj commented Aug 12, 2016

Hi!

It seems discussion about this issue died out on the mailing list sometime before Christmas last year. Is there any movement towards a solution to this problem?

grobie commented Aug 12, 2016

I'm not aware of anyone working on a solution for this problem at the moment. Ideas and contributions are very welcome.

christoph-buente commented Jan 4, 2017

Thanks @brian-brazil, will have a look into that.

mcfilib commented Feb 8, 2017

@SamSaffron did you ever get round to looking into writing that exporter based off raindrops?

SamSaffron commented Feb 8, 2017

@filib I did, but it's running on internal code. It's a tiny amount of code actually; something along the lines of:

net_stats = Raindrops::Linux::tcp_listener_stats("0.0.0.0:3000")["0.0.0.0:3000"]
if net_stats
  # active = connections currently being handled, queued = connections
  # waiting in the listen backlog
  puts "web_active:#{net_stats.active}"
  puts "web_queued:#{net_stats.queued}"
end

Then we also have monitors that alert us if we are sitting at sustained queueing for longer than a minute or so.

lyda commented Mar 29, 2017

Note that @juliusv has pushed a branch that solves this. I think it needs more work to support all metric types.

grobie commented Mar 30, 2017

Another note: I'll work on a complete rewrite of client_ruby, likely starting in the next few weeks. It will address several of the library's architectural shortcomings and include the option to change the value handling (lock-free for single-process programs, multi-threaded, multi-process). client_ruby is the oldest client library still in its original state; we've learned a lot in the meantime, and similar rewrites have been done in client_golang and client_java.

lyda commented Mar 30, 2017

When doing this, note that the exchangeable-value-types branch does multiprocess but has some flaws:

  • Not all metric types are supported. Specifically the summary type isn't supported.
  • This means that the Collector for rack isn't supported either.

It'd be nice to have fixes for those as well.

grobie commented Mar 30, 2017

brian-brazil commented Mar 30, 2017

What doesn't work about Summaries? From a stats standpoint, it should be possible to get something out of them.

grobie commented Mar 30, 2017

juliusv commented Mar 30, 2017

This is the branch where I put up (really hacky and experimental!) multiprocess support. It's in large part a transliteration of the Python client's multiprocess support, with tests hacked up to "pass". It's alpha-usable, but I'm hoping for @grobie's rewrite to do this all properly :) See https://github.com/prometheus/client_ruby/blob/multiprocess/example.rb

SamSaffron commented Mar 30, 2017

Wanted to share some pics so you can see what you get with raindrops:

[image]

Reqs are very evenly distributed between the 5 boxes (which is to be expected with HAProxy). We run about 8 workers per machine; everything is counted.

We see some queueing when there are spikes, which indicates we may need an extra process or two here.

[image]

brian-brazil commented Mar 31, 2017

How do you aggregate per-process quantiles?

You can't. But you can report them all.

The _sum and _count work fine in any case.

lyda commented May 30, 2017

We've released this, which is what we're using for GitLab. Hopefully it will be of some use to @grobie. It's multiprocess mode only, so not ideal for the general case, but hopefully it will be of use for future work.

grobie commented May 30, 2017

mrgordon commented Oct 2, 2017

I'm guessing there is no progress on this, which is sad, as it basically makes metrics collection useless in any Rails environment with a forking web server.

I'm trying to switch to the commits that @juliusv and @lyda linked but the examples don't show how to use their changes with the Rack collector / exporter.

Is there anything needed other than:

Prometheus::Client::MmapedValue.set_pid(1)
use Prometheus::Middleware::Collector
use Prometheus::Middleware::Exporter

I tried doing that but hit errors where it was trying to call total on a hash at lib/prometheus/client/formats/text.rb:171:in 'histogram'

Full trace:

app error: undefined method `total' for #<Hash:0x0055d849b5d520> (NoMethodError)
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:171:in `histogram'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:150:in `representation'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:33:in `block (2 levels) in marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:32:in `each'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:32:in `block in marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:28:in `each'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:28:in `marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/exporter.rb:69:in `respond_with'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/exporter.rb:30:in `call'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:37:in `block in call'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:77:in `trace'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:37:in `call'
/usr/local/bundle/gems/rack-1.4.7/lib/rack/content_length.rb:14:in `call'
/usr/local/bundle/gems/railties-3.2.22.5/lib/rails/rack/log_tailer.rb:17:in `call'
/usr/local/bundle/gems/unicorn-4.8.2/lib/unicorn/http_server.rb:572:in `process_client'

I reverted some value.get.total calls to value.total and I can get the app to load now, but it looks like the metrics are still per-fork: if I hit /metrics repeatedly I get different values.

mrgordon@68ca030

I'm guessing due to my changes I also get some metrics back like this:

http_server_request_duration_seconds_count{method="get",path="/metrics"} #<Prometheus::Client::SimpleValue:0x005555afd28278>

I'll look more into that issue but I'm guessing the SimpleValue indicates the mmap stuff isn't working.

mrgordon commented Oct 2, 2017

Quick update: I added prometheus_multiproc_dir=/tmp to the environment, as I see the code checks for this even though none of the examples set it. My output from /metrics now shows mmap stuff, but http_server_requests_total still varies depending on which Unicorn worker I talk to.

e.g.

$ curl http://myapp/metrics
http_server_requests_total{code="200",method="get",path="/metrics"} 4.0
...
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="10"} 4.0
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="+Inf"} #<Prometheus::Client::MmapedValue:0x0055ba313dfc70>

a few seconds later

$ curl http://myapp/metrics
http_server_requests_total{code="200",method="get",path="/metrics"} 3.0
...
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="10"} 3.0
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="+Inf"} #<Prometheus::Client::MmapedValue:0x0055ba313da720>

lzap commented Nov 8, 2017

What would be the best workaround for a Rails app running in Passenger? I am considering statsd_exporter and the pushgateway. I expect the latter to be easier to integrate with, but slower.

SamSaffron commented Nov 9, 2017

kvitajakub commented Dec 14, 2017

Hi, is there any progress on this please?

Looks like it should have an issue or feature label in addition to the question label.

SamSaffron commented Jan 10, 2018

I started my Discourse extraction here:

https://github.com/discourse/prometheus_exporter

Supports multiple processes with optional custom transport, recovery, and chunked encoding. Should be released properly in a few weeks.

Works fine with Passenger, Unicorn and Puma in clustered mode. You can also aggregate Sidekiq.

lzap commented Jan 11, 2018

Thanks. How many metrics do you send for each request in production? I am a little bit concerned about the HTTP/JSON transport.

SamSaffron commented Jan 12, 2018

@lzap I will update the readme. I can send about 10k strings in 200ms over a single request; if you add a round trip of to_json and JSON.parse, it's around 600ms for 10k messages.

I persist the HTTP connection for 30 seconds (configurable).

SamSaffron commented Feb 5, 2018

Just putting this out here, in case anyone is still facing the problem:

https://samsaffron.com/archive/2018/02/02/instrumenting-rails-with-prometheus

@grobie grobie added the Help wanted label May 4, 2018

@grobie grobie changed the title from Doesn't work with multiple Rack processes? to Support pre-fork servers May 4, 2018

@grobie grobie added enhancement and removed question labels May 4, 2018

fajpunk commented Oct 14, 2018

Hi! I notice the Help wanted tag on this issue, but it's unclear from reading the comments exactly what help is wanted; all I see is a few proposed solutions and alternate libraries. Has a solution been chosen, and if so, is there a branch where work is being done? I'd like to help if I can.
