RFE/Question: Using unique worker ID inplace of `pidhash` label #76

amitsaha · 2020-03-11T00:25:50Z

Hi, thank you for the project. We can see that in #8, we started using pid_hash derived from PID and start time of the process. This will lead to a prometheus metric explosion (as per the caution in https://prometheus.io/docs/practices/naming/#labels). Could we instead use something like a "worker id" as a label instead? That way, we only have a finite number of label values. Happy to contribute a PR as well.

Additional labels can be used to distinguish between multiple different instances of an application or across applications.

The text was updated successfully, but these errors were encountered:

estahn · 2020-03-11T01:14:20Z

@amitsaha I agree that could, in theory, lead to a metric explosion. In practice, we don't have enough churn for this to occur.

For this discussion, I assume you refer to the process ID when you say "worker id".

I tried to recall the rationale behind pid_hash and I believe this was to avoid worker ID duplicates, e.g. if a pool runs long enough or (more likely) the container being restarted then you will have overlapping process IDs reported to Prometheus. In that case, you're not able to distinguish request count for workers anymore, e.g. worker X1 (PID 33) has 50k requests and 1 day later worker X2 (PID 33) has 3k requests.

Suggestions for other solutions to this problem are welcome!

amitsaha · 2020-03-11T01:57:33Z

Hi @estahn thanks for getting back on this.

Regarding the worker ID, I meant something like this. Each worker process irrespective of it's process ID is one of the N workers at any given point of time. So, if we assign a worker ID (like, 1, 2...N) to each worker, we can use that as an identifier. Then, we have a stable and a finite set of unique label values. However, as you mention that idea breaks for counter metrics. It would be fine for non-counter metrics of course since we are dealing with observations. However, may be we don't need counter metrics since prometheus gives us the count of each summary/histogram metrics for example automatically. What do you think?

estahn · 2020-03-11T05:11:28Z

@amitsaha I see what you mean. In this case, how would you determine the number of killed/restarted processes over time? Is there a way to count the counter reset?

estahn · 2020-03-13T08:02:28Z

@amitsaha After thinking about this a little another question came to mind. Assuming we implement this solution, how would you assign a worker ID (or index)? I see quite a couple of issues with keeping track of these indices, e.g.

Index 1: Process ID 55
Index 2: Process ID 70
Index 3: Process ID 99

If the process "Process ID 70" gets killed and a new process is forked (e.g. Process ID 130) would that be Index 2 or something else?

stale · 2020-05-13T03:41:13Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

* fix: high cardanality of pid_hash fix (#110) issue, by removing pid_hash and replace by slot label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels * fix: change slot by child label fix (#110) issue, by removing pid_hash and replace by child label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels * fix (#110) issue, remove "child" prefix from label "child" value. fixes #110 #76 @all-contributors please add itcsoft54 for code

fixes #110 #76 BREAKING CHANGE: `pid_hash` is being removed in favour of `child` to avoid high cardinality explosion. In turn this means processes and their state changes can't be identified anymore. If you're using this behaviour please open an issue.

triage-new-issues bot added the triage label Mar 11, 2020

estahn self-assigned this Mar 11, 2020

estahn added the question Further information is requested label Mar 11, 2020

triage-new-issues bot removed the triage label Mar 11, 2020

stale bot added the wontfix This will not be worked on label May 13, 2020

stale bot closed this as completed May 20, 2020

estahn mentioned this issue Sep 28, 2020

High cardinality from pid_hash label #110

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFE/Question: Using unique worker ID inplace of `pidhash` label #76

RFE/Question: Using unique worker ID inplace of `pidhash` label #76

amitsaha commented Mar 11, 2020

estahn commented Mar 11, 2020

amitsaha commented Mar 11, 2020

estahn commented Mar 11, 2020

estahn commented Mar 13, 2020

stale bot commented May 13, 2020

RFE/Question: Using unique worker ID inplace of pidhash label #76

RFE/Question: Using unique worker ID inplace of pidhash label #76

Comments

amitsaha commented Mar 11, 2020

estahn commented Mar 11, 2020

amitsaha commented Mar 11, 2020

estahn commented Mar 11, 2020

estahn commented Mar 13, 2020

stale bot commented May 13, 2020

RFE/Question: Using unique worker ID inplace of `pidhash` label #76

RFE/Question: Using unique worker ID inplace of `pidhash` label #76