Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent when querying cluster API fails. #212

Closed
sarfarazahmad89 opened this issue Jan 2, 2019 · 2 comments

Comments


elasticsearch_cluster_health_timed_out is a gauge metric.
I was under the impression that its value would switch between 0 and 1 depending on whether the exporter can query the cluster health API.
So when the Elasticsearch service goes down, I expected the metric to become 1; instead it just disappears.

I could configure alert rules using something like absent(elasticsearch_cluster_health_timed_out), but I don't think that is the right way to do this.

Even the official Prometheus docs recommend avoiding missing metrics:
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

sarfarazahmad89 changed the title on Jan 2, 2019, from: Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent went querying cluster API fails. to: Metric "elasticsearch_cluster_health_timed_out" should be 1 and not absent when querying cluster API fails.
sarfarazahmad89 (Author) commented Jan 2, 2019

Actually, absent() won't work. The absent() function only returns a result when querying the API fails for all members of an Elasticsearch cluster, because as long as any node still exposes the metric, the selector matches something.

How is this metric supposed to be used? I want to use it to check that the exporter can actually query the search engine, and to alert when it cannot.

UPDATE:
If that metric returned 0 or 1, the Prometheus alert rule would become really simple, like

expr: elasticsearch_cluster_health_timed_out{cluster="myelkcluster"} != 0

This would raise an alert if the metric turned to 1 for any node in the cluster.
Currently, I think I will have to add rules along the lines of absent(elasticsearch_cluster_health_timed_out{instance="myinstance"})
for every instance in the Elasticsearch cluster.
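To make the difference between a cluster-wide absent() and per-instance absent() rules concrete, here is a small Python sketch. It is a simplified model of the selector semantics described above, not Prometheus code: the `select` and `absent` helpers and the instance names `es1`, `es2`, `es3` are hypothetical, and only equality matchers are modelled.

```python
from typing import Dict, List

# Label set of one time series that currently has a sample.
Series = Dict[str, str]


def select(series: List[Series], matchers: Dict[str, str]) -> List[Series]:
    """Return the series whose labels equal every matcher (equality only)."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


def absent(series: List[Series], matchers: Dict[str, str]) -> bool:
    """Model of absent(): true only when the selector matches no series at all."""
    return not select(series, matchers)


# Hypothetical cluster with three targets; es2 has stopped exposing the metric.
present = [
    {"cluster": "myelkcluster", "instance": "es1"},
    {"cluster": "myelkcluster", "instance": "es3"},
]

# Cluster-wide selector: es1 and es3 still match, so es2 going missing is not noticed.
cluster_wide = absent(present, {"cluster": "myelkcluster"})

# Per-instance selectors: only es2 is reported as absent.
per_instance = {
    name: absent(present, {"instance": name}) for name in ("es1", "es2", "es3")
}
```

In this model the cluster-wide check stays silent until every series is gone, which is why one rule per instance would be needed if the metric simply disappears on failure.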

zwopir (Member) commented Jan 2, 2019

Hi @sarfarazahmad89,
the timed_out metric has a different meaning and has little to do with the availability of the cluster. You can look it up at https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html.

So in fact, exporting this field from the _cluster/health endpoint as a metric doesn't make sense. I'll delete it.

You can use the following metrics to check whether the cluster responds to HTTP requests:

# HELP elasticsearch_cluster_health_up Was the last scrape of the ElasticSearch cluster health endpoint successful.
# HELP elasticsearch_node_stats_up Was the last scrape of the ElasticSearch nodes endpoint successful.

These metrics are set to 1 if the endpoint was reachable, and 0 otherwise.
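As an illustration of the "1 if reachable, 0 otherwise" behaviour, here is a minimal Python sketch of how such a gauge could be derived so that it is always emitted rather than going missing. This is not the exporter's actual implementation; the helper names `fetch_cluster_health`, `cluster_health_up`, and `render_up_metric` are assumptions made for this example, and the fetch step is passed in as a callable so the logic can be exercised without a running cluster.

```python
import json
import urllib.request
from typing import Callable


def fetch_cluster_health(url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the JSON body of a cluster health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def cluster_health_up(fetch: Callable[[], dict]) -> int:
    """Return 1 if the fetch succeeded and 0 if it raised; never propagate the error."""
    try:
        fetch()
    except Exception:
        return 0
    return 1


def render_up_metric(value: int) -> str:
    """Render the gauge in the Prometheus text exposition format."""
    name = "elasticsearch_cluster_health_up"
    return (
        f"# HELP {name} Was the last scrape of the ElasticSearch "
        "cluster health endpoint successful.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
```

Because a failed request yields a sample with value 0 instead of no sample, an alert expression can compare the metric against 0 directly and does not need absent().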

@zwopir zwopir closed this as completed in 320d8b3 Feb 27, 2019
fritchie added a commit to fritchie/elasticsearch_exporter that referenced this issue Jul 9, 2024
The metric

elasticsearch_cluster_health_timed_out

was removed in

prometheus-community@320d8b3

per

prometheus-community#212

Signed-off-by: Frank Ritchie <12985912+fritchie@users.noreply.github.com>
sysadmind pushed a commit that referenced this issue Jul 11, 2024
The metric

elasticsearch_cluster_health_timed_out

was removed in

320d8b3

per

#212

Signed-off-by: Frank Ritchie <12985912+fritchie@users.noreply.github.com>