Allow for customizing process identification #214

liaden · 2021-03-19T21:24:00Z

My primary use case is to use the puma index label and forked worker's index to set the identity of the process, but it can also be useful with the process name ($0) since various rake/rails commands would be identified based on what task was used.

With the puma index label, that is nice on long running containers since a new deploy (or restart of the service) does not cause new processes to generate a new label value. With docker containers, it is less relevant since the puma parent is going to be 1 and the child processes are going to be a semi-consistent set of pids.

coveralls · 2021-03-19T21:29:33Z

Coverage remained the same at 100.0% when pulling c37b69e on liaden:custom-process-identification into 1c8e407 on prometheus:master.

coveralls · 2021-03-19T21:29:33Z

Coverage remained the same at 100.0% when pulling c37b69e on liaden:custom-process-identification into 1c8e407 on prometheus:master.

dmagliola · 2021-03-20T14:17:04Z

Hello Joel, thank you for your PR.

Could you explain a little bit more of your use case, please?
I'm not entirely sure I understand what this is for.

The way I understand this code, this would only affect Gauges with their aggregation mode set to ALL, and you are overriding the pid label with another name and value...

But if you wanted to expose something else in the labels (in your example, process_name: $0), couldn't you add that label when defining the metric? (and then the aggregation mode doesn't matter, as you're doing your own effectively)
Why would we need these extra parameters in that case?

Or am I misunderstanding what you are trying to do here?

liaden · 2021-03-20T15:03:59Z

Hey @dmagliola!

Pretty sure this is affecting all of the metrics when the aggregation is set to ALL. That said, the default aggregation for counters and histograms is SUM. I didn't write tests for other metric types though, so would that be preferred?

Initially, I wanted to be able to disable the pid to reduce cardinality concerns with influxdb/grafana, but that comes with a pitfall for end users if the application does not configure a different per process identifying label. The metrics with the an aggregation of ALL would behave as expected: it would end up be like MOST_RECENT.

My thought was that if an individual does not want the pid approach, they should specify their new identifier to avoid the above pitfall. I chose to allow the label value to be dynamic since the pid value is dynamic. If the application forks a new process (puma after getting a TTIN signal) the dynamic value makes sure the new child process gets a different label than the parent process.

With this approach, there still is the possibility that two concurrently running processes collide, but that can be intentional and beneficial.

In my use case, I am using yabeda with prometheus where I can nicely define a default_tag; however, yabeda does not currently support dynamic values for default tags. That is a separate concern, but it does make it more difficult to apply the label with a dynamic value everywhere, including in the various yabeda plugin gems.

dmagliola · 2021-03-20T16:23:59Z

Pretty sure this is affecting all of the metrics when the aggregation is set to ALL.

Ok, fair, but realistically, it'd be pretty unusual to set anything but Gauges to ALL

Initially, I wanted to be able to disable the pid to reduce cardinality concerns with influxdb/grafana

My point was that by default, you'd only get pid for Gauges, and if you want to disable that, you'll just need to pick another aggregation that's not ALL. And if you want a label other than pid to still have one gauge per process, you can do that when setting your labels.

This seems easier to understand for me than the parameters you added, and I think it'd be a pretty unusual use case to begin with, so i'm a bit unsure about adding what seems like a reasonably complex interface for a situation that probably won't arise much.

Sinjo · 2021-03-21T04:11:23Z

I can see the value in not using system PIDs as a label in specifically the situation where:

Your server is running in a process namespace where PIDs will grow large (i.e. a non-containerised setup or in a containerised setup where the namespace isn't thrown away with every deployment of the app)
You have another meaningful way to distinguish between individual child processes, such as the Puma or Unicorn worker ID

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it. I assume the situation you've run into hits both of those bullet points?

The caveat of an override like that would be that you'd need to think harder about your other labels. If you were running a single webserver per host, then host would be fine. If you were running multliple, you'd need some other way of distinguishing between them, as you'd no longer have the server-unique PIDs to rely on.

This was something that crossed my mind when implementing #127 and you can see it mentioned in #131. I didn't do anything about it at the time as @dmagliola and I talked through it and decided that it was more complexity than it was worth. I'm not totally opposed to revising that decision, but I definitely want to think carefully about adding more options to the library that come with big new caveats for users.

liaden · 2021-03-21T13:54:54Z

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it. I assume the situation you've run into hits both of those bullet points?

Yup!

The caveat of an override like that would be that you'd need to think harder about your other labels. If you were running a single webserver per host, then host would be fine. If you were running multliple, you'd need some other way of distinguishing between them, as you'd no longer have the server-unique PIDs to rely on.

We aren't doing this to ourselves, but we still inject the a app_name: Rails.application.class.parent_name label so that we can create dashboards and then switch over the app to get reusable dashboards.

I'm not totally opposed to revising that decision, but I definitely want to think carefully about adding more options to the library that come with big new caveats for users.

Would a disable_pid option, with documentation, be preferable? That leaves it up to the user then to address the concerns. This removes complexity from this gem, but that still leaves roughly the same complexity for the end user, I think.

dmagliola · 2021-03-24T11:46:27Z

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it.

Right. What i'm proposing is you can get around that by not choosing :all, and then passing in your identifier as a label anyway. That still gives you one gauge per process (that label is unique per process), with your identifier in it.

I propose documenting that, instead of adding complexity to the interface when declaring your metrics.

The reason for this is that we're adding parameters that are quite hard to explain correctly to the majority of users, whereas the minority that'll have this problem, already comes in pre-loaded with the context they'd need to understand the solution I'm proposing above.

Would a disable_pid option, with documentation, be preferable? That leaves it up to the user then to address the concerns. This removes complexity from this gem, but that still leaves roughly the same complexity for the end user, I think.

I like this in principle, as a way of making the interface much, much simpler than passing in that lambda.
However, the way the :all aggregation works is it assumes there'll only be one value per labelset (and as such, takes just the "first" one. And if there is more than one, "first" is kinda random).
So this works, if the user definitely specifies a label that makes these labelsets unique. But if they disable_pid and they don't do the label part correctly, then they're going to get incorrect, and potentially semi-random results.

To me, that's a pretty large footgun. 😬
It's WAY more elegant than telling the user "just pick any aggregation other than :all", for sure, which makes it very appealing.
But i'm worried it can be easily misunderstood. I can imagine someone going: I want to get rid of that pid label, this seems to do it, without fully understanding the consequences.

I'm not adamant on this one, though, happy to go that route if you think i'm wrong. But this is why i'm reluctant to do this.

dmagliola · 2022-02-05T13:14:58Z

Closing this because the complexity this adds in using the store doesn't justify the value, given that there's a fairly straightforward workaround in using any aggregation that is not all and controlling the labels as necessary.

Allow for customizing process identification

c37b69e

Sinjo mentioned this pull request Mar 25, 2021

One file per process instead of per metric/process for DirectFileStore [DOESNT WORK] #161

Open

dmagliola closed this Feb 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow for customizing process identification #214

Allow for customizing process identification #214

liaden commented Mar 19, 2021

coveralls commented Mar 19, 2021

coveralls commented Mar 19, 2021

dmagliola commented Mar 20, 2021

liaden commented Mar 20, 2021

dmagliola commented Mar 20, 2021

Sinjo commented Mar 21, 2021

liaden commented Mar 21, 2021 •

edited

Loading

dmagliola commented Mar 24, 2021

dmagliola commented Feb 5, 2022

Allow for customizing process identification #214

Allow for customizing process identification #214

Conversation

liaden commented Mar 19, 2021

coveralls commented Mar 19, 2021

coveralls commented Mar 19, 2021

dmagliola commented Mar 20, 2021

liaden commented Mar 20, 2021

dmagliola commented Mar 20, 2021

Sinjo commented Mar 21, 2021

liaden commented Mar 21, 2021 • edited Loading

dmagliola commented Mar 24, 2021

dmagliola commented Feb 5, 2022

liaden commented Mar 21, 2021 •

edited

Loading