Use only ip:port or host:port in instance label #493

Closed
juliusv opened this Issue Jan 30, 2015 · 11 comments

juliusv commented Jan 30, 2015

This would make instance labels smaller, including only the parts that are (usually) used to identify an instance. The exact items to include might also vary depending on the type of static configuration or service discovery. TBD.
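For illustration (the port and path are hypothetical, borrowed from an example later in this thread), a series is currently labelled with the full scrape path, whereas the proposal would reduce the instance label to just host:port or ip:port:

# current:
process_state{instance="host42:9132/metrics"} 1
# proposed (host:port or ip:port only):
process_state{instance="host42:9132"} 1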

atombender commented Feb 22, 2015

Would it not also make sense to have a label that is always the FQDN of the host?

My reasoning: Alerts don't want to know what the endpoint is, they want the host. For example, say I have an endpoint reporting whether a specific process is in an operational state: you end up with metrics stored as something like process_state{instance="host42:9132/metrics"}. Now you'd like to monitor whether the state is okay on host42, but in order to do that, you need to know its instance name, which is an unnecessary bit of information.

In our case, we're using Puppet exported resources to collect alert rules, and in our situation every rule we create must pertain to that specific host, rather than all instances of a specific metric. Concrete example: We need to monitor the number of running worker processes of a certain daemon, which happens to be 4 on one host, but 8 on another. So we need two separate alert rules, one for each host.

Actually, adding the host label may have to go into whatever is collecting metrics, e.g. node_exporter.
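A sketch of what such a series might look like (the host label name and FQDN are assumptions here, not an existing convention):

process_state{instance="host42:9132/metrics", host="host42.example.com"} 1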

atombender commented Feb 22, 2015

Follow-up question: Until this is resolved, would it make sense for us to devote one job per host, and rely on the job name as the host name?

brian-brazil commented Feb 22, 2015

On 22 February 2015 at 03:09, Alexander Staubo notifications@github.com wrote:

> Would it not also make sense to have a label that is always the FQDN of the host?

It very much depends on your local setup. For me, for example, the FQDN is of little use (and I'd presume similar for anyone else on AWS, as it ends in .compute.internal); however, I have something similar that's specific to our setup that it would be useful to have linked in alerts etc.

What I've been thinking is to have a template function that can go from the instance label to something you can get to in a browser, configured by a regex in flags.

> My reasoning: Alerts don't want to know what the endpoint is, they want the host.

Alerts in general have no notion of endpoint or host.

> For example, say I have an endpoint reporting whether a specific process is in an operational state: you end up with metrics stored as something like process_state{instance="host42:9132/metrics"}. Now you'd like to monitor whether the state is okay on host42, but in order to do that, you need to know its instance name, which is an unnecessary bit of information.

A big win of Prometheus-style monitoring is that you don't care about individual hosts. You can ask questions like "Is the state okay everywhere?", rather than having to ask about each individual host in turn. This is the power labels give you.

> In our case, we're using Puppet exported resources to collect alert rules, and in our situation every rule we create must pertain to that specific host, rather than all instances of a specific metric. Concrete example: We need to monitor the number of running worker processes of a certain daemon, which happens to be 4 on one host, but 8 on another. So we need two separate alert rules, one for each host.

How I'd do that is have the exporter include how many processes are meant to be running. Then you can use the same single rule everywhere. If you ever find yourself writing rules per-instance or per-labelset, there's usually a better way.

It's best to think about services, not machines. Don't think "host42 runs the daemon"; think "we have many daemons, one of which currently happens to run on host42".
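As a concrete sketch of asking "is the state okay everywhere?" with a single rule, in the same rule syntax used earlier in this thread (the alert name is made up, and it assumes process_state is 1 when the process is operational):

ALERT ProcessNotOperational
  IF process_state == 0
  [etc.]

This fires once per unhealthy instance, without any per-host rules.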

brian-brazil commented Feb 22, 2015

> Follow-up question: Until this is resolved, would it make sense for us to devote one job per host, and rely on the job name as the host name?

I suspect you may be trying to directly translate from another monitoring system to Prometheus, which means you might miss many of the benefits of Prometheus's power. Why don't you pop into IRC on #prometheus on Freenode so we can help see if there's a better way to solve your problems?

atombender commented Feb 22, 2015

I am not trying to translate from an existing monitoring system. But for as long as we're using Puppet, we are forced to follow the Puppet model where everything is host-oriented; exported resources are always host-based, and it would be complicated to de-duplicate them.

To explain — I don't know how well you know Puppet — at the moment every host "exports" an alert file for each thing (such as processes or network ports) to monitor. The alert file is something as simple as:

ALERT nginx_process_count_on_myhost
  IF process_count{name="nginx", host="myhost"} < 1
  [etc.]

This alert file, being exported, is something that can be read by the config manifest for any other host in the Puppet system.

Thus, the Puppet manifest for the Prometheus server "collects" all those alert files into a directory; there is then a template that generates prometheus.conf, which among other things then inserts a rule_file: entry for each such file (it also uses the exported resource system to register an instance for each host).

This way, everything is automated: If we add a service on any host, the correct rule appears on the Prometheus server.


What you're suggesting is that I don't include a host at all, but check the processes in the aggregate. That means that when a host exports its alerts, I will end up with duplicate alert files, because we want one alert per process, not per-process-per-host.

I can work around Puppet's limitations in a number of ways. For example, instead of exporting alert files, I can export a simple file named ${fqdn}-processcount-nginx which contains the count 1, or which is simply empty if I can embed the target process count in each metric. Each host would export such a file. Then, on the Prometheus server, I could generate rule files by reading all these files and generating an alert for each of them. It's painful, but doable, I guess.


But I'm not sure how to write an alert if the host name is not part of the alert, and the target metric is. I must be missing something. For example, given a reported metric:

process_count{name="unicorn", minimum="8"} 7

...how do I write an IF check? I believe I can't reference a metric label as part of the expression, so I can't do this:

process_count{name="unicorn"} < minimum

Or did you mean that I write an alert like this:

process_count{name="unicorn", minimum="8"} < 8

If you meant the latter, that means I still need to emit one alert entry per host, since the host is the only one that knows what rvalue to insert here. But you're right in that I don't have to include the host name. I do still have to de-duplicate: if I have N hosts that all have a minimum of 8, I wouldn't want N alert entries with the same check.

brian-brazil commented Feb 22, 2015

I really think you should join us over in IRC; there are better ways to monitor this, but it's more difficult to explain in a less-interactive medium such as this.

In general you want to autogenerate your list of targets, and your rules should almost always be completely static.

What you want to export is something like:
process_count{name="unicorn"} 7
process_minimum{name="unicorn"} 8

and then have a single alert rule that's:
process_count < process_minimum
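Spelled out in the same rule syntax used earlier in this thread (the alert name here is made up), that single rule might look like:

ALERT NotEnoughProcesses
  IF process_count < process_minimum
  [etc.]

Because binary operators match series on their labels, each instance's process_count is compared against that same instance's process_minimum, so one static rule covers every host.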

atombender commented Feb 22, 2015

Ah, the fact that expressions are able to join was the missing part I was looking for. I assume here that alert rule expressions match on all labels except the instance name. The fact that I can do this means I can simplify the rules to be non-host-specific.

I'm not on IRC, and don't want to go through the hassle of setting up a client just for this, sorry.

brian-brazil commented Feb 22, 2015

> alert rule expressions match on all labels except the instance name.

Binary operators match on everything except the time series name; it wouldn't work for this use case otherwise. Future extensions to this are covered in #488 and #393.
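For instance (the label values here are made up), the two series below match because all of their labels agree once the metric name is ignored, so process_count < process_minimum produces a result for this instance:

process_count{name="unicorn", job="app", instance="host42:9100"} 7
process_minimum{name="unicorn", job="app", instance="host42:9100"} 8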

atombender commented Feb 22, 2015

Sorry, I meant the time series name. Got it. Thanks.

juliusv commented Feb 22, 2015

Yeah, you can do all kinds of fancy expression language stuff to select, aggregate, and match your time series to get useful alerts. Since Prometheus is quite unique here, I think we'll need more intro-level guides to this stuff... (but time!)

beorn7 closed this May 11, 2015
