Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE/Question: Using unique worker ID inplace of pidhash label #76

Closed
amitsaha opened this issue Mar 11, 2020 · 5 comments
Closed

RFE/Question: Using unique worker ID inplace of pidhash label #76

amitsaha opened this issue Mar 11, 2020 · 5 comments
Assignees
Labels
question Further information is requested wontfix This will not be worked on

Comments

@amitsaha
Copy link

Hi, thank you for the project. We can see that in #8, we started using pid_hash derived from PID and start time of the process. This will lead to a prometheus metric explosion (as per the caution in https://prometheus.io/docs/practices/naming/#labels). Could we instead use something like a "worker id" as a label instead? That way, we only have a finite number of label values. Happy to contribute a PR as well.

Additional labels can be used to distinguish between multiple different instances of an application or across applications.

@estahn
Copy link
Contributor

estahn commented Mar 11, 2020

@amitsaha I agree that could, in theory, lead to a metric explosion. In practice, we don't have enough churn for this to occur.

For this discussion, I assume you refer to the process ID when you say "worker id".

I tried to recall the rationale behind pid_hash and I believe this was to avoid worker ID duplicates, e.g. if a pool runs long enough or (more likely) the container being restarted then you will have overlapping process IDs reported to Prometheus. In that case, you're not able to distinguish request count for workers anymore, e.g. worker X1 (PID 33) has 50k requests and 1 day later worker X2 (PID 33) has 3k requests.

Suggestions for other solutions to this problem are welcome!

@estahn estahn self-assigned this Mar 11, 2020
@estahn estahn added the question Further information is requested label Mar 11, 2020
@amitsaha
Copy link
Author

Hi @estahn thanks for getting back on this.

Regarding the worker ID, I meant something like this. Each worker process irrespective of it's process ID is one of the N workers at any given point of time. So, if we assign a worker ID (like, 1, 2...N) to each worker, we can use that as an identifier. Then, we have a stable and a finite set of unique label values. However, as you mention that idea breaks for counter metrics. It would be fine for non-counter metrics of course since we are dealing with observations. However, may be we don't need counter metrics since prometheus gives us the count of each summary/histogram metrics for example automatically. What do you think?

@estahn
Copy link
Contributor

estahn commented Mar 11, 2020

@amitsaha I see what you mean. In this case, how would you determine the number of killed/restarted processes over time? Is there a way to count the counter reset?

@estahn
Copy link
Contributor

estahn commented Mar 13, 2020

@amitsaha After thinking about this a little another question came to mind. Assuming we implement this solution, how would you assign a worker ID (or index)? I see quite a couple of issues with keeping track of these indices, e.g.

  • Index 1: Process ID 55
  • Index 2: Process ID 70
  • Index 3: Process ID 99

If the process "Process ID 70" gets killed and a new process is forked (e.g. Process ID 130) would that be Index 2 or something else?

@stale
Copy link

stale bot commented May 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label May 13, 2020
@stale stale bot closed this as completed May 20, 2020
estahn pushed a commit that referenced this issue Nov 21, 2020
* fix: high cardanality of pid_hash

fix (#110) issue, by removing pid_hash and replace by slot label instead.

fixed like prometheus best practices want
see https://prometheus.io/docs/practices/naming/#labels

* fix: change slot by child label

fix (#110) issue, by removing pid_hash and replace by child label instead.

fixed like prometheus best practices want
see https://prometheus.io/docs/practices/naming/#labels

* fix (#110) issue, remove "child" prefix from label "child" value.

fixes #110 #76 

@all-contributors please add itcsoft54 for code
estahn pushed a commit that referenced this issue Nov 21, 2020
* fix: high cardanality of pid_hash

fix (#110) issue, by removing pid_hash and replace by slot label instead.

fixed like prometheus best practices want
see https://prometheus.io/docs/practices/naming/#labels

* fix: change slot by child label

fix (#110) issue, by removing pid_hash and replace by child label instead.

fixed like prometheus best practices want
see https://prometheus.io/docs/practices/naming/#labels

* fix (#110) issue, remove "child" prefix from label "child" value.

fixes #110 #76

@all-contributors please add itcsoft54 for code
estahn pushed a commit that referenced this issue Nov 22, 2020
fixes #110 #76

BREAKING CHANGE: `pid_hash` is being removed in favour of `child` to
avoid high cardinality explosion. In turn this means processes and their
state changes can't be identified anymore. If you're using this behaviour
please open an issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants