Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for customizing process identification #214

Closed

Conversation

liaden
Copy link

@liaden liaden commented Mar 19, 2021

My primary use case is to use the puma index label and forked worker's index to set the identity of the process, but it can also be useful with the process name ($0) since various rake/rails commands would be identified based on what task was used.

With the puma index label, that is nice on long running containers since a new deploy (or restart of the service) does not cause new processes to generate a new label value. With docker containers, it is less relevant since the puma parent is going to be 1 and the child processes are going to be a semi-consistent set of pids.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling c37b69e on liaden:custom-process-identification into 1c8e407 on prometheus:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling c37b69e on liaden:custom-process-identification into 1c8e407 on prometheus:master.

@dmagliola
Copy link
Collaborator

Hello Joel, thank you for your PR.

Could you explain a little bit more of your use case, please?
I'm not entirely sure I understand what this is for.

The way I understand this code, this would only affect Gauges with their aggregation mode set to ALL, and you are overriding the pid label with another name and value...

But if you wanted to expose something else in the labels (in your example, process_name: $0), couldn't you add that label when defining the metric? (and then the aggregation mode doesn't matter, as you're doing your own effectively)
Why would we need these extra parameters in that case?

Or am I misunderstanding what you are trying to do here?

@liaden
Copy link
Author

liaden commented Mar 20, 2021

Hey @dmagliola!

Pretty sure this is affecting all of the metrics when the aggregation is set to ALL. That said, the default aggregation for counters and histograms is SUM. I didn't write tests for other metric types though, so would that be preferred?

Initially, I wanted to be able to disable the pid to reduce cardinality concerns with influxdb/grafana, but that comes with a pitfall for end users if the application does not configure a different per process identifying label. The metrics with the an aggregation of ALL would behave as expected: it would end up be like MOST_RECENT.

My thought was that if an individual does not want the pid approach, they should specify their new identifier to avoid the above pitfall. I chose to allow the label value to be dynamic since the pid value is dynamic. If the application forks a new process (puma after getting a TTIN signal) the dynamic value makes sure the new child process gets a different label than the parent process.

With this approach, there still is the possibility that two concurrently running processes collide, but that can be intentional and beneficial.

In my use case, I am using yabeda with prometheus where I can nicely define a default_tag; however, yabeda does not currently support dynamic values for default tags. That is a separate concern, but it does make it more difficult to apply the label with a dynamic value everywhere, including in the various yabeda plugin gems.

@dmagliola
Copy link
Collaborator

Pretty sure this is affecting all of the metrics when the aggregation is set to ALL.

Ok, fair, but realistically, it'd be pretty unusual to set anything but Gauges to ALL

Initially, I wanted to be able to disable the pid to reduce cardinality concerns with influxdb/grafana

My point was that by default, you'd only get pid for Gauges, and if you want to disable that, you'll just need to pick another aggregation that's not ALL. And if you want a label other than pid to still have one gauge per process, you can do that when setting your labels.

This seems easier to understand for me than the parameters you added, and I think it'd be a pretty unusual use case to begin with, so i'm a bit unsure about adding what seems like a reasonably complex interface for a situation that probably won't arise much.

@Sinjo
Copy link
Member

Sinjo commented Mar 21, 2021

I can see the value in not using system PIDs as a label in specifically the situation where:

  • Your server is running in a process namespace where PIDs will grow large (i.e. a non-containerised setup or in a containerised setup where the namespace isn't thrown away with every deployment of the app)
  • You have another meaningful way to distinguish between individual child processes, such as the Puma or Unicorn worker ID

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it. I assume the situation you've run into hits both of those bullet points?

The caveat of an override like that would be that you'd need to think harder about your other labels. If you were running a single webserver per host, then host would be fine. If you were running multliple, you'd need some other way of distinguishing between them, as you'd no longer have the server-unique PIDs to rely on.

This was something that crossed my mind when implementing #127 and you can see it mentioned in #131. I didn't do anything about it at the time as @dmagliola and I talked through it and decided that it was more complexity than it was worth. I'm not totally opposed to revising that decision, but I definitely want to think carefully about adding more options to the library that come with big new caveats for users.

@liaden
Copy link
Author

liaden commented Mar 21, 2021

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it. I assume the situation you've run into hits both of those bullet points?

Yup!

The caveat of an override like that would be that you'd need to think harder about your other labels. If you were running a single webserver per host, then host would be fine. If you were running multliple, you'd need some other way of distinguishing between them, as you'd no longer have the server-unique PIDs to rely on.

We aren't doing this to ourselves, but we still inject the a app_name: Rails.application.class.parent_name label so that we can create dashboards and then switch over the app to get reusable dashboards.

I'm not totally opposed to revising that decision, but I definitely want to think carefully about adding more options to the library that come with big new caveats for users.

Would a disable_pid option, with documentation, be preferable? That leaves it up to the user then to address the concerns. This removes complexity from this gem, but that still leaves roughly the same complexity for the end user, I think.

@dmagliola
Copy link
Collaborator

In that situation, it is a rather pointless source of cardinality and it would be nice to have a way of avoiding it.

Right. What i'm proposing is you can get around that by not choosing :all, and then passing in your identifier as a label anyway. That still gives you one gauge per process (that label is unique per process), with your identifier in it.

I propose documenting that, instead of adding complexity to the interface when declaring your metrics.

The reason for this is that we're adding parameters that are quite hard to explain correctly to the majority of users, whereas the minority that'll have this problem, already comes in pre-loaded with the context they'd need to understand the solution I'm proposing above.

Would a disable_pid option, with documentation, be preferable? That leaves it up to the user then to address the concerns. This removes complexity from this gem, but that still leaves roughly the same complexity for the end user, I think.

I like this in principle, as a way of making the interface much, much simpler than passing in that lambda.
However, the way the :all aggregation works is it assumes there'll only be one value per labelset (and as such, takes just the "first" one. And if there is more than one, "first" is kinda random).
So this works, if the user definitely specifies a label that makes these labelsets unique. But if they disable_pid and they don't do the label part correctly, then they're going to get incorrect, and potentially semi-random results.

To me, that's a pretty large footgun. 😬
It's WAY more elegant than telling the user "just pick any aggregation other than :all", for sure, which makes it very appealing.
But i'm worried it can be easily misunderstood. I can imagine someone going: I want to get rid of that pid label, this seems to do it, without fully understanding the consequences.

I'm not adamant on this one, though, happy to go that route if you think i'm wrong. But this is why i'm reluctant to do this.

@dmagliola
Copy link
Collaborator

Closing this because the complexity this adds in using the store doesn't justify the value, given that there's a fairly straightforward workaround in using any aggregation that is not all and controlling the labels as necessary.

@dmagliola dmagliola closed this Feb 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants