High cardinality from `pid_hash` label #110

absolutejam · 2020-09-28T20:41:29Z

Hi 👋

Thanks for sharing this exporter, I really appreciate the insight it's given us.

I just wanted to create an issue because I've noticed that the default setup will add the pid_hash label. In our setup where we have 5 replicas of an application with a fixed 128 process count, that leads to 640 series. And when we redeploy every few hours... that's a cardinality explosion waiting to happen!

I just wondered if this is something that should possibly be documented or turned off by default, and added with a flag. I could look at adding this if you think there's merit?

The text was updated successfully, but these errors were encountered:

estahn · 2020-09-28T23:40:32Z

@absolutejam Glad the exporter is working for you. In regards to pid_hash you can find a discussion here #76.

If you have a solution in mind I'm open to contributions.

absolutejam · 2020-09-29T13:03:23Z

Personally, I'm not sure how much we value the visibility into seeing each process individually (at least, via. some kind of name). Maybe process lifetime (probably too hard at an aggregated level) or process restart rate would be a better indicator, but I can't attest to what others would use this for.

Really, I'd just advocate for either having this enabled via. a flag or make it abundantly clear in the docs about the potential risks in certain setups (Or better yet, a flag and some docs to cover it).

estahn · 2020-09-29T14:30:46Z

@absolutejam I believe I have used that to identify issues due to process churn, but I don't 100% recall anymore.

re docs; I think it makes sense to have this documented and warn for cardinality explosion in environments with processes that have high churn.
Are you advocating to remove the pid_hash entirely or to replace it with a different indicator (e.g. index)? If removed entirely I assume something like "# of processes by state" is not possible anymore:

matejzero · 2020-10-12T11:59:15Z

Another similar issue was opened here: #85.

In my case I have 100k unique label values for pid_hash label. Multiply that with 5 metrics that have pid_hash label and a few state values for phpfpm_process_state adds up to a very high value.

matejzero · 2020-10-13T17:02:39Z

I'm trying to come out with a solution for current pid_hash, but I don't think there is a good solution that would not explode in labels names.

One solution is to use array id. I know that has it's down sides as all threads change IDs if ie. first thread gets killed / restarted.

Another option could be to build PID:ID mapping, but that is more work as we need to figure out which PIDs still exists and which don't and map new PIDs to IDs. But if you need to map php-fpm status page output (if you want to manually debug) to an ID, that is hard to do.

itcsoft54 · 2020-11-19T10:09:22Z

* Are you advocating to remove the `pid_hash` entirely or to replace it with a different indicator (e.g. index)? If removed entirely I assume something like "_# of processes by state_" is not possible anymore:

Replace pid_hash by slot. slot means process number of total process :

Exemple slot 5 process, is the 5th process of a total of 10.

I look to php_fpm status, it seem to display process in pool always in same order, replacing old process by new, filling the holes :

pool:                 www
process manager:      dynamic
start time:           03/Nov/2020:03:52:34 +0100
start since:          1407704
accepted conn:        734089
listen queue:         0
max listen queue:     0
listen queue len:     0
idle processes:       5
active processes:     1
total processes:      78
max active processes: 84
max children reached: 0
slow requests:        0

************************
pid:                  19138
state:                Idle
[..]
************************
pid:                  20017
state:                Idle
[..]
************************
pid:                  24874
state:                Idle
[..]
************************
pid:                  19550
state:                Running
[..]
************************
pid:                  3622
state:                Idle
[..]
************************
pid:                  29377
state:                Idle
[..]
************************

process 19550 was termined and replace by new one :

pool:                 www
process manager:      dynamic
start time:           03/Nov/2020:03:52:34 +0100
start since:          1407704
accepted conn:        734089
listen queue:         0
max listen queue:     0
listen queue len:     0
idle processes:       5
active processes:     1
total processes:      78
max active processes: 84
max children reached: 0
slow requests:        0

************************
pid:                  19138
state:                Idle
[..]
************************
pid:                  20017
state:                Idle
[..]
************************
pid:                  24874
state:                Idle
[..]
************************
pid:                  17583
state:                Idle
[..]
************************
pid:                  3622
state:                Running
[..]
************************
pid:                  29377
state:                Idle
[..]
************************

Order was preserved.

So if php_fpm page was scraped "as is", you don't need to do anything but prefix pool index by "slot" keyword.

my php_fpm version (from debian php sury ) :
# php-fpm7.2 --version
PHP 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b (fpm-fcgi) (built: May 14 2020 08:33:26)
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
    with Zend OPcache v7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b, Copyright (c) 1999-2018, by Zend Technologies
    with Xdebug v2.7.2, Copyright (c) 2002-2019, by Derick Rethans

edit : formating

matejzero · 2020-11-19T13:35:10Z

Slot is probably better label than id.

For slot number, we could use array id from https://github.com/hipages/php-fpm_exporter/blob/master/phpfpm/exporter.go#L214, right? Order should be the same as on status pade.

Maybe add a +1, so it starts with 1 and not 0 and better match with pm.max_processes value.

olivierHa · 2020-11-19T14:29:17Z

Nice idea @itcsoft54
Slot number should solve the cardinality explosion with "pid_hash"

fix (hipages#110) issue, by removing pid_hash and replace by slot label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels

estahn · 2020-11-20T09:30:30Z

@itcsoft54 Thanks for your contribution.

I just revisited the various discussions and arguments. I also reviewed fpm_children.c to verify your findings.

I suggest renaming slot to child as that is the wording used by php-fpm (pool -> child -> process).
How would you determine the number of killed/restarted processes over time? Is there a way to count the counter reset? RFE/Question: Using unique worker ID inplace of pidhash label #76 (comment)

Another option could be to provide process aggregates as suggested here or potentially both with an option to disable per-process metrics.

Thoughts?

itcsoft54 · 2020-11-20T11:40:05Z

@estahn

I add a new commit to rename slot to child
I don't know, but using "label" to this is a bad idea too. Label was associed to a metric, so if processes was spawn between to "scrap" we miss it.
Exemple : if process was respawn 5 time between two scraps, you registred 1 label value, and miss 4 "restart/kill" process.

So metrics based on label has low reliability. If php-fpm doesn't expose such metric (or metrics to compute it), we can't have it reliability.

In add, using "sum of phpfpm_process_requests" to compute "total requests" have same issue. It's better to use the php-fpm exposed "accepted conn" metric even if it not exactly what we want.

Using avg of phpfpm_process_requests can still use because "avg" of lot of value are less impacted if missing some values.

fix (hipages#110) issue, by removing pid_hash and replace by child label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels

matejzero · 2020-11-20T12:35:21Z

Do we need child prefix in child label? Wouldn't it be enough to only have a number ie. child="0" or maybe child_num?

Valid point about sum of phpfpm_process_requests to calculate number of requests. It works in a lot of cases, but it might not be 100% correct in cases you pointed out.

Regarding process aggregates, that would be useful. Some users might not need extra details such as last process memory usage, cpu usage,... But for process states, it would be good to export all label values, like I pointed out in #123.

estahn · 2020-11-20T13:13:40Z

@itcsoft54

I add a new commit to rename slot to child

Thanks, I agree with @matejzero, use the index, or is there a need for a prefix?

I don't know, but using "label" to this is a bad idea too. Label was associed to a metric, so if processes was spawn between to "scrap" we miss it.
Exemple : if process was respawn 5 time between two scraps, you registred 1 label value, and miss 4 "restart/kill" process.

So metrics based on label has low reliability. If php-fpm doesn't expose such metric (or metrics to compute it), we can't have it reliability.

I understand where you are coming from but I don't fully agree with your argument. I assume if you say "reliability" you mean "accuracy". We don't require 100% accuracy as long as there is statistical relevance.

https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit

In add, using "sum of phpfpm_process_requests" to compute "total requests" have same issue. It's better to use the php-fpm exposed "accepted conn" metric even if it not exactly what we want.

Using avg of phpfpm_process_requests can still use because "avg" of lot of value are less impacted if missing some values.

phpfpm_process_requests is fairly accurate for us, we don't use "dynamic mode" and therefore have relatively low churn. In that sense, it will also depend on your configuration how accurate your metrics are.

We can implement process aggregates in a separate PR and I'm happy to merge your PR once 1. has been worked out.

Thanks again for tackling this.

itcsoft54 · 2020-11-20T13:59:35Z

I understand where you are coming from but I don't fully agree with your argument. I assume if you say "reliability" you mean "accuracy". We don't require 100% accuracy as long as there is statistical relevance.

Ok, "reliability" may be a pretty loaded word for this case.

I re-read, some exemples of label in prometheus documentation's :

For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.

So we don't need to prefix label.

I add commit to use index as label "child" value.

* fix: high cardanality of pid_hash fix (#110) issue, by removing pid_hash and replace by slot label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels * fix: change slot by child label fix (#110) issue, by removing pid_hash and replace by child label instead. fixed like prometheus best practices want see https://prometheus.io/docs/practices/naming/#labels * fix (#110) issue, remove "child" prefix from label "child" value. fixes #110 #76 @all-contributors please add itcsoft54 for code

fixes #110 #76 BREAKING CHANGE: `pid_hash` is being removed in favour of `child` to avoid high cardinality explosion. In turn this means processes and their state changes can't be identified anymore. If you're using this behaviour please open an issue.

github-actions · 2020-11-22T00:43:00Z

🎉 This issue has been resolved in version 2.0.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

matejzero mentioned this issue Nov 17, 2020

Return all combinations for phpfpm_process_state metric #123

Open

itcsoft54 mentioned this issue Nov 19, 2020

fix: high cardanality of pid_hash #124

Merged

itcsoft54 added a commit to itcsoft54/php-fpm_exporter that referenced this issue Nov 20, 2020

fix (hipages#110) issue, remove "child" prefix from label "child" value.

55202ce

estahn closed this as completed in f8746c9 Nov 21, 2020

github-actions bot added the released label Nov 22, 2020

itcsoft54 mentioned this issue Feb 18, 2023

Proposal to deprecate the project for PHP 8.1+ #284

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High cardinality from `pid_hash` label #110

High cardinality from `pid_hash` label #110

absolutejam commented Sep 28, 2020

estahn commented Sep 28, 2020

absolutejam commented Sep 29, 2020 •

edited

estahn commented Sep 29, 2020

matejzero commented Oct 12, 2020

matejzero commented Oct 13, 2020

itcsoft54 commented Nov 19, 2020 •

edited

matejzero commented Nov 19, 2020 •

edited

olivierHa commented Nov 19, 2020

estahn commented Nov 20, 2020 •

edited

itcsoft54 commented Nov 20, 2020 •

edited

matejzero commented Nov 20, 2020

estahn commented Nov 20, 2020

itcsoft54 commented Nov 20, 2020

github-actions bot commented Nov 22, 2020

High cardinality from pid_hash label #110

High cardinality from pid_hash label #110

Comments

absolutejam commented Sep 28, 2020

estahn commented Sep 28, 2020

absolutejam commented Sep 29, 2020 • edited

estahn commented Sep 29, 2020

matejzero commented Oct 12, 2020

matejzero commented Oct 13, 2020

itcsoft54 commented Nov 19, 2020 • edited

matejzero commented Nov 19, 2020 • edited

olivierHa commented Nov 19, 2020

estahn commented Nov 20, 2020 • edited

itcsoft54 commented Nov 20, 2020 • edited

matejzero commented Nov 20, 2020

estahn commented Nov 20, 2020

itcsoft54 commented Nov 20, 2020

github-actions bot commented Nov 22, 2020

High cardinality from `pid_hash` label #110

High cardinality from `pid_hash` label #110

absolutejam commented Sep 29, 2020 •

edited

itcsoft54 commented Nov 19, 2020 •

edited

matejzero commented Nov 19, 2020 •

edited

estahn commented Nov 20, 2020 •

edited

itcsoft54 commented Nov 20, 2020 •

edited