Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent when querying cluster API fails. #212

sarfarazahmad89 · 2019-01-02T11:47:20Z

elasticsearch_cluster_health_timed_out is a gauge metric.
I was under the impression that its value would oscillate between 0 and 1 depending on whether it can query cluster health API or not.
So when the Elasticsearch service goes down, I was expecting the metric to turn to 1; instead it just disappears.

I could configure alert rules using something like absent(elasticsearch_cluster_health_timed_out) but I think that isn't the right way to do this.

Even the official Prometheus docs recommend avoiding missing metrics:
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

sarfarazahmad89 · 2019-01-02T11:56:51Z

Actually, absent() won't work. An unfiltered absent() only fires when querying the API fails for all members of an Elasticsearch cluster, because the series has to be missing for every node.

How is this metric supposed to be used? I want to use it to verify that the exporter can actually query the search engine, and to alert when it cannot.

UPDATE:
If that metric returned 0 or 1, the Prometheus alert rule would become really simple, like

expr: elasticsearch_cluster_health_timed_out{cluster="myelkcluster"} != 0

This would raise an alert if the metric turned to 1 for any node in the cluster.
Currently I think I will have to add rules along the lines of absent(elasticsearch_cluster_health_timed_out{instance="myinstance"})
for all instances in the Elasticsearch cluster.
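
For illustration, the per-instance workaround described above might look roughly like the following Prometheus rules file. This is only a sketch: the group name, alert names, `for` duration, and the second instance value `myotherinstance` are hypothetical and would need to match the real scrape targets.

```yaml
groups:
  - name: elasticsearch-absent-workaround   # hypothetical group name
    rules:
      # One rule per instance, because an unfiltered absent() only fires
      # once the series is missing for every instance at the same time.
      - alert: ElasticsearchHealthMetricAbsentMyInstance
        expr: absent(elasticsearch_cluster_health_timed_out{instance="myinstance"})
        for: 5m                              # assumed grace period
      - alert: ElasticsearchHealthMetricAbsentMyOtherInstance
        expr: absent(elasticsearch_cluster_health_timed_out{instance="myotherinstance"})  # hypothetical instance
        for: 5m
```

The rule list grows with every node, which is the maintenance burden this comment is pointing at.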

zwopir · 2019-01-02T14:33:48Z

Hi @sarfarazahmad89,
the timed_out metric has a different meaning and has little to do with the availability of the cluster. You can look it up at https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html.

So in fact, exporting this field from the _cluster/health endpoint as a metric doesn't make sense. I'll delete it.

You can use the following metrics as an indicator of whether the cluster responds to HTTP requests:

# HELP elasticsearch_cluster_health_up Was the last scrape of the ElasticSearch cluster health endpoint successful.
# HELP elasticsearch_node_stats_up Was the last scrape of the ElasticSearch nodes endpoint successful.

These metrics are set to 1 if the endpoint was reachable and 0 otherwise.
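
Since these metrics are always present and flip between 1 and 0, an alert on them can be a plain comparison with no absent() needed. A minimal sketch of such a rules file follows; the group name, alert names, `for` duration, severity label, and annotation text are assumptions, not something the exporter ships.

```yaml
groups:
  - name: elasticsearch-exporter-reachability   # hypothetical group name
    rules:
      # Fires for any exporter instance whose last cluster health scrape failed.
      - alert: ElasticsearchClusterHealthScrapeFailed
        expr: elasticsearch_cluster_health_up == 0
        for: 5m                                  # assumed grace period
        labels:
          severity: warning                      # assumed severity label
        annotations:
          summary: "Exporter {{ $labels.instance }} could not query the cluster health endpoint"
      # Same idea for the nodes endpoint.
      - alert: ElasticsearchNodeStatsScrapeFailed
        expr: elasticsearch_node_stats_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter {{ $labels.instance }} could not query the nodes endpoint"
```

Each expression returns one series per failing instance, so a single rule covers every node without listing them individually.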

The metric elasticsearch_cluster_health_timed_out was removed in prometheus-community@320d8b3 per prometheus-community#212 Signed-off-by: Frank Ritchie <12985912+fritchie@users.noreply.github.com>

sarfarazahmad89 changed the title ~~Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent went querying cluster API fails.~~ Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent when querying cluster API fails. Jan 2, 2019

zwopir closed this as completed in 320d8b3 Feb 27, 2019

fritchie mentioned this issue Jul 9, 2024

Update README.md #911

Merged

sysadmind pushed a commit that referenced this issue Jul 11, 2024

Update README.md (#911)

bf89cef

The metric elasticsearch_cluster_health_timed_out was removed in 320d8b3 per #212 Signed-off-by: Frank Ritchie <12985912+fritchie@users.noreply.github.com>
