
Adding running parameter to stats #2687

Closed · wants to merge 1 commit
Conversation


@szuki szuki commented Apr 19, 2017

This change is intended to provide information in the metrics about whether our process is not running, or we are just missing metrics from it.

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

@danielnelson danielnelson modified the milestones: 1.3.0, 1.4.0 Apr 19, 2017
@danielnelson
Contributor

Can you explain briefly why you need a positive assertion that the process is not running? I think most people simply alert when points stop being created.

@szuki
Author

szuki commented Jun 7, 2017

The briefest explanation I can give is to point you to this link: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes

@danielnelson
Contributor

Thanks for the link; I really like that Prometheus provides this guidance.

One issue I see with doing this in procstat is that we may not be able to produce the same tagset unless the process is found. The example I'm thinking of involves searching by process name (via the exe option) with pid_tag set to true; normally you would get something like:

procstat,exe=dnsmasq,pid=44979 cpu_user=0.14,cpu_system=0.07

But if the process is not running you can't fill out the pid. This would mean we would create a new series.
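For context, a minimal procstat configuration sketch that would produce a series like the one above (exe and pid_tag are the options named in this thread; the values and layout are only illustrative):

[[inputs.procstat]]
  # match the process by executable name
  exe = "dnsmasq"
  # add the matched process id as a tag; this is what makes the tagset
  # depend on the process actually being found
  pid_tag = true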

I'm also not sure how well this works if the query normally selects multiple processes.

@desa What do you think about this?

@desa
Contributor

desa commented Jun 7, 2017

I understand the desire to positively assert that data is no longer being reported, but the reasoning from the Prometheus docs

...
The second is to have a myexporter_up (e.g. haproxy_up) variable that's 0/1 depending on whether the scrape worked.

The latter is better where there’s still some useful metrics you can get even with a failed scrape, such as the haproxy exporter providing process stats.
...

doesn't quite apply (unless I'm missing something). What stats could be provided that previously would not have been?

Additionally, it's now possible with subqueries to write a query that returns the list of all series that are not currently reporting data.

Something to the effect of:

SELECT * FROM (
    SELECT last(*) FROM (
        SELECT count(<field>) FROM <measurement>
        WHERE time > now() - <duration>
        GROUP BY time(<collection interval>), * FILL(0)
    )
    GROUP BY *
)
WHERE last_count = 0

By playing with WHERE last_count ... and the various time intervals, the types of deadman assertions you can build are fairly sophisticated. I struggle to see why this isn't sufficient.

@szuki
Author

szuki commented Jun 8, 2017

We are talking about "pull-based" metrics instead of your Influx model of "push-based" metrics.
That means you don't see missing metrics right away when something fails; you only see them go missing after expiration occurs. In the meantime you are scraping "old" metrics from the endpoint.

@desa
Contributor

desa commented Jun 8, 2017

Ah, that makes sense. I had forgotten about the possibility of using pull-based metrics; in that context it makes sense.

The issue that @danielnelson brings up about the PID is still valid though. Setting the PID tag and this feature should be mutually exclusive so that you don't create an additional series.

@szuki
Author

szuki commented Jun 8, 2017

Not exactly, it informs about a pretty different thing. "*_up" metrics are there to say whether everything was OK or not from that host. We don't care about the process id or anything else; since we failed to gather metrics from that process, we want information that it failed. The PID is something extra in that case.

@desa
Contributor

desa commented Jun 8, 2017

The issue is that it creates another series, kind of unnecessarily. I'm not saying that it shouldn't be implemented, but rather that it rubs me the wrong way that data pertaining to one thing gets written to two different series. e.g. it just doesn't feel right that

procstat,host=myhost,exe=dnsmasq,pid=44979 cpu_user=0.14,cpu_system=0.07,running=0

and

procstat,host=myhost,exe=dnsmasq running=1

can be generated from the same plugin. That said, the more I think about it, the less it bothers me.

@danielnelson
Contributor

If you are using the prometheus output, what about lowering the expiration_interval to a level that you are comfortable with?
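For illustration, a sketch of what that might look like in a Telegraf config (expiration_interval is the option discussed here; the surrounding stanza and the concrete durations are only example values I'm assuming):

[agent]
  # how often inputs are collected
  interval = "10s"

[[outputs.prometheus_client]]
  listen = ":9273"
  # drop metrics from the scrape endpoint if they have not been
  # updated within this window, so stale points age out sooner
  expiration_interval = "30s"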

The general point about stale points is true. Does that mean we should always send something every interval for all measurements? This is something we recently did for http_response after a little persuasion: #2784 (comment)

The multiple series issue is probably not a hard rule, just an indicator that something might not be right. In some respects the fact that pid is added as a tag is the real mistake, but something was needed to prevent collisions in the case of multiple process matches.

Also, changing the tagset is going to cause problems for the prometheus output until #2822 is properly fixed.

@szuki
Author

szuki commented Jun 9, 2017

We encountered #2822, so we are still using 1.2 ;)

expiration_interval only partly resolves the problem. You actually get a lag of at most expiration_interval*2 + interval. You need expiration_interval to be bigger than the longest interval you have, and something will be pulling the metrics according to some interval, which is the lowest setting for expiration_interval. So yes and no: it doesn't resolve the problem, it just gives you a workaround which may work in some cases and not in others, but it is still far from perfect.

Does that mean we should always send something every interval for all measurements?

Actually yes, but rather for each series/plugin.

Multiple series exist in every big plugin... you cannot avoid them. It's just a matter of how big the plugin is. Think about an OpenStack plugin.

@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 14, 2017
@danielnelson danielnelson added the feat label Aug 24, 2017
@danielnelson danielnelson modified the milestones: 1.5.0, 1.6.0 Nov 30, 2017
@russorat russorat modified the milestones: 1.6.0, 1.7.0 Jan 26, 2018
@tzz
Contributor

tzz commented Mar 7, 2018

@danielnelson I think a use case we have is related: using the latest Telegraf with Prometheus, we can't tell if a host is down without using absent(), which is an awkward workaround for a single host and really hard to do across a big group of hosts. This affects the http_response and net_response inputs.

(I know this PR is not about those input plugins, but it definitely covers the absence of metrics.)

To explain a little more: if http_response fails due to a string mismatch, we get all the stats. But if the host is not reachable, the only stat is the result_code, which is a string, and Prometheus doesn't collect string metrics.

So in Prometheus, checking absent(...one host...) works. But we can't check all our hosts at once, because even one of them will make the check pass. I don't know of a workaround.

I want to suggest (although I don't know if these are technically the best solutions) that it would be very helpful to either have an aggregator to map fields to numbers (maybe the histogram aggregator can auto-bucket strings based on hash code), or to have a way to map the result_code from http_response and net_response to a number in the plugin itself. Or a field rewriter plugin. Let me know the best way to proceed so I can open an issue on the right components.

@danielnelson
Contributor

@tzz I think these pull requests might be of interest.

@Anderen2

@russorat @danielnelson Adding another use case for this. We see that it is hard (or impossible) to differentiate between a process having stopped running and the Telegraf agent having stopped sending metrics (for whatever reason).

As an example, with Kapacitor deadman alerting this causes one alarm to be sent for each process that was monitored on a host when the Telegraf agent stops reporting. Also, we cannot really confirm whether the lack of data is due to the process not running or simply due to missing data in the graphs.

@danielnelson danielnelson modified the milestones: 1.7.0, 1.8.0 Jun 3, 2018
@danielnelson
Contributor

We see that it is hard (or impossible) to differentiate between a process having stopped running and the Telegraf agent having stopped sending metrics (for whatever reason).

The best way to monitor if Telegraf is operational is by using the internal plugin. The internal_agent measurement is probably the best place to start:

internal_agent gather_errors=0i,metrics_dropped=0i,metrics_gathered=1i,metrics_written=0i 

You can also use this plugin to check if procstat collected any processes:

internal_gather,input=procstat metrics_gathered=0i
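For reference, a minimal sketch of enabling that plugin in the config (collect_memstats is an optional setting which, as far as I know, only adds Go runtime stats for the agent itself):

[[inputs.internal]]
  # optional: also gather Go runtime memory stats from the agent
  collect_memstats = true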

@danielnelson
Contributor

I'm still uneasy with how this change would work when matching multiple processes. Here is a proposal for a slight modification which I would feel more comfortable with: #4237

If you are watching this issue, please take a look and let me know if it would work for your needs.

@Anderen2

Anderen2 commented Jun 5, 2018

Hi @danielnelson, thanks for the feedback.

Yes, checking whether Telegraf is operational can be done with the "internal" plugin. We currently do it by checking "uptime" in "system", as Chronograf does.

I cannot see how using the "internal_gather" "metrics_gathered" metric would help in our situation, at least. In our case we would like to use Kapacitor to send alarms when a process on a host has stopped running, meaning we require one alarm for each stopped process on each server.

Using "deadman" for this works in theory yes but is quite iffy in practice. The fact is that "deadman" alerting sends alarms due to lack of metrics within a certain time, which might be due to a multitude of other reasons than that the process has stopped running.

We do currently run with "deadman" for this; however, we experience storms of alarms due to sudden latency fluctuations in the platform and other reasons why the metrics might not come through within the interval. We are likely to stop using "deadman" for this, as the sheer number of false alarms overwhelms the real ones.

There are several other inputs that do expose a "status" field, such as "result_type" / "result_code" for http_response/net_response or "exists" for filestat; it seems fitting that procstat also follows the same pattern, in my opinion at least.

@danielnelson
Contributor

There are several other inputs that do expose a "status" field, such as "result_type" / "result_code" for http_response/net_response or "exists" for filestat; it seems fitting that procstat also follows the same pattern, in my opinion at least.

I would use this pattern if procstat only matched a single process, but I think the idea in #4237 is fairly similar and mostly just drops the notion that this information is part of the procstat series.

@danielnelson
Contributor

Superseded by #4307
