Create histograms from per-thread counters #62
@ncabatoff this is an interesting idea. I can't say I have seen a requirement for a thread histogram like this; maybe there are some use cases, but I can't think of any. In my case (outlined in #98), I don't think this would work. Regarding cardinality, I agree this is a concern. In our case we run everything on Kubernetes, and our Nginx PIDs have the same lifespan as pods, which we already have labels for. So for me, there are two use cases here:
Can you elaborate on your use case, please? Specifically, how would you monitor and alert on resource-limited threads? What would some example alert conditions look like?
Each Nginx worker is a single thread, and performance degrades severely if that thread is fully utilised. Nginx has a master process and N workers, so the intention is to alert if any worker consumes 100% CPU for, say, 10 seconds.
In the context of the histogram proposal, assuming we had CPU buckets of, say, {0.1, 0.25, 0.5, 0.75, 1.0}, could you achieve what you want by alerting on
? In other words, alert if there's at least one thread in the nginx group that's consuming >75% of one core. As I understand it, you don't really need the alert to know which thread is misbehaving, just that one exists.
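For illustration only, the kind of alert expression being proposed here might look something like the following PromQL. The metric name and bucket layout are hypothetical (this histogram was never implemented); the idea is just that the count of all observations minus the count of observations at or below the 0.75 bucket gives the number of threads that landed above 0.75 cores in the window.

```
# Hypothetical metric: a histogram of per-scrape-cycle CPU deltas per thread,
# with buckets {0.1, 0.25, 0.5, 0.75, 1.0}. Fire if any thread in the nginx
# group landed above the 0.75 bucket recently.
  increase(namedprocess_namegroup_thread_cpu_histogram_bucket{groupname="nginx",le="+Inf"}[1m])
- ignoring(le)
  increase(namedprocess_namegroup_thread_cpu_histogram_bucket{groupname="nginx",le="0.75"}[1m])
> 0
```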
How would that work with a duration? If the alert had a span or `for` clause to fire when the CPU was high for X seconds, I don't think it would be possible to know it's the same thread with a histogram. I would also like to alert and graph the utilisation per worker process to check that the balancing is OK. We use cAdvisor for container metrics: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md. It's a simple counter for CPU, e.g.
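As a sketch of the counter-based approach being described: `container_cpu_usage_seconds_total` is a real cAdvisor metric (a cumulative counter of CPU seconds, so `rate()` yields cores consumed), but the label name and values below are assumptions that vary by cAdvisor version and deployment.

```
# Alerting rule sketch: fire when any matching container has been pinned
# near 100% of one core for 10 seconds. The 'for' clause gives the duration
# semantics a histogram can't, because the counter keeps per-target identity.
- alert: NginxWorkerPegged
  expr: rate(container_cpu_usage_seconds_total{container="nginx"}[30s]) > 0.95
  for: 10s
```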
Ah, sorry, I didn't catch that aspect of your use case. So what I proposed would yield false positives, though it's an open question whether the volume of false positives would be excessive. Even if it's not always the same thread pegged at >75% (or whatever; let's assume the bins can be user-configurable), it's possible that a usable alert could still be constructed from these metrics.

I agree histograms have plenty of limitations and problems, but the only alternative I see is doing something I've always resisted, namely exposing the user to potentially unbounded cardinality. I realize that in your particular situation that won't normally happen, but there are enough footguns available here that I'd prefer not to provide this functionality.

I recently learned about https://github.com/zwopir/osquery_exporter; I wonder whether it could work for what you have in mind?

Finally, are you sure you want to be alerting on an artificial metric that's a proxy for the real issue, rather than on the actual symptom? If performance degrades severely when this happens, why not instrument nginx performance and alert on that, e.g. via https://github.com/hnlq715/nginx-vts-exporter?
Ok, fair enough. Maybe I'll write an exporter instead. I'll take a look at
With -threads enabled, we already have CPU usage for each thread group. But using threadid as a label is problematic from a cardinality perspective, so we break it down by thread name instead. This works great for some apps, like Chromium, that name their threads; most apps don't, so there it's completely unhelpful.
As a result, I'm considering dropping the existing per-thread metrics in favour of a new approach. Since I can't usefully name groups of threads or use ids, I'll settle for characterizing the distribution of threads in a process namegroup. The deltas of the per-thread counters (CPU, I/O bytes, page faults, context switches) each scrape cycle become histogram observations, e.g.
says that there were two threads consuming between 0 and 0.5s of user cpu time in the group named 'bash'.
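A bucket line matching that description might look like the following in the Prometheus exposition format; the metric name and bucket boundary are hypothetical, not the exporter's actual output.

```
# Two threads in the 'bash' namegroup observed <= 0.5s of user CPU time
# this cycle. 'le' buckets are cumulative, as usual for Prometheus histograms.
namedprocess_namegroup_thread_cpu_seconds_bucket{groupname="bash",mode="user",le="0.5"} 2
```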
Figuring out good bucket sizes for each of these that will apply to all or even most workloads may be challenging.