Weird error at startup with Consul 1.9.1: error gathering metrics: 13 error(s) occurred #9471

Closed
pierresouchay opened this issue Dec 28, 2020 · 2 comments · Fixed by hashicorp/go-metrics#122

Comments

@pierresouchay
Contributor

Overview of the Issue

At startup, after upgrading from 1.8.6 to 1.9.1, the agent logs the following error:

agent: error gathering metrics: 13 error(s) occurred:
 * collected metric consul_version gauge:<value:nan >  has help "Represents the Consul version." but should have "consul_version"
 * collected metric consul_api_http summary:<sample_count:9 sample_sum:nan quantile:<quantile:0.5 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.99 value:nan > >  has help "Samples how long it takes to service the given HTTP request for the given verb and path." but should have "consul_api_http"
 * collected metric consul_client_api_catalog_service_nodes counter:<value:0 >  has help "Increments whenever a Consul agent receives a request to list nodes offering a service." but should have "consul_client_api_catalog_service_nodes"
 * collected metric consul_client_api_success_catalog_datacenters counter:<value:0 >  has help "Increments whenever a Consul agent successfully responds to a request to list datacenters." but should have "consul_client_api_success_catalog_datacenters"
 * collected metric consul_client_api_catalog_datacenters counter:<value:0 >  has help "Increments whenever a Consul agent receives a request to list datacenters in the catalog." but should have "consul_client_api_catalog_datacenters"
 * collected metric consul_consul_cache_fetch_success counter:<value:0 >  has help "Counts the number of successful fetches by the cache." but should have "consul_consul_cache_fetch_success"
 * collected metric consul_client_api_success_catalog_nodes label:<name:"node" value:"CZ3711L0SH" > counter:<value:40 >  has help "consul_client_api_success_catalog_nodes" but should have "Increments whenever a Consul agent successfully responds to a request to list nodes."
 * collected metric consul_client_api_catalog_services counter:<value:0 >  has help "Increments whenever a Consul agent receives a request to list services from the catalog." but should have "consul_client_api_catalog_services"
 * collected metric consul_client_api_success_catalog_services counter:<value:0 >  has help "Increments whenever a Consul agent successfully responds to a request to list services." but should have "consul_client_api_success_catalog_services"
 * collected metric consul_client_api_success_catalog_node_services counter:<value:0 >  has help "Increments whenever a Consul agent successfully responds to a request to list services in a node." but should have "consul_client_api_success_catalog_node_services"
 * collected metric consul_client_api_catalog_nodes counter:<value:0 >  has help "Increments whenever a Consul agent receives a request to list nodes from the catalog." but should have "consul_client_api_catalog_nodes"
 * collected metric consul_client_api_success_catalog_service_nodes label:<name:"node" value:"CZ3711L0SH" > counter:<value:1085 >  has help "consul_client_api_success_catalog_service_nodes" but should have "Increments whenever a Consul agent successfully responds to a request to list nodes offering a service."
 * collected metric consul_client_api_catalog_node_services label:<name:"node" value:"CZ3711L0SH" > counter:<value:3 >  has help "consul_client_api_catalog_node_services" but should have "Increments whenever a Consul agent successfully responds to a request to list nodes offering a service."

This also happens in dev mode on master:

error gathering metrics: 4 error(s) occurred:
* collected metric consul_fsm_ca summary:<sample_count:1 sample_sum:nan quantile:<quantile:0.5 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.99 value:nan > >  has help "Measures the time it takes to apply CA configuration operations to the FSM." but should have "consul_fsm_ca"
* collected metric consul_consul_fsm_intention summary:<sample_count:1 sample_sum:nan quantile:<quantile:0.5 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.99 value:nan > >  has help "Deprecated - use fsm_intention instead" but should have "consul_consul_fsm_intention"
* collected metric consul_fsm_intention label:<name:"op" value:"delete-all" > summary:<sample_count:1 sample_sum:0.04840100184082985 quantile:<quantile:0.5 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.99 value:nan > >  has help "consul_fsm_intention" but should have "Measures the time it takes to apply an intention operation to the FSM."
* collected metric consul_consul_fsm_ca summary:<sample_count:1 sample_sum:nan quantile:<quantile:0.5 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.99 value:nan > >  has help "Deprecated - use fsm_ca instead" but should have "consul_consul_fsm_ca"
    2020-12-28T18:01:03.363+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/metrics?format=prometheus from=127.0.0.1:63455 latency=15.124411ms
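
For readers unfamiliar with the message format: the Prometheus Go client rejects a scrape when two series that share a metric name carry different help strings. The sketch below is a simplified model of that consistency rule, not the client's actual code; the `sample` type and `checkHelp` function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// sample is a hypothetical, stripped-down view of a collected metric:
// only the name and the help string matter for this check.
type sample struct {
	name string
	help string
}

// checkHelp reports every sample whose help text differs from the help
// first seen for the same metric name.
func checkHelp(samples []sample) []error {
	seen := make(map[string]string)
	var errs []error
	for _, s := range samples {
		want, ok := seen[s.name]
		if !ok {
			seen[s.name] = s.help
			continue
		}
		if s.help != want {
			errs = append(errs, fmt.Errorf(
				"collected metric %s has help %q but should have %q",
				s.name, s.help, want))
		}
	}
	return errs
}

func main() {
	samples := []sample{
		// Pre-registered without labels: carries the real help text.
		{"consul_fsm_intention", "Measures the time it takes to apply an intention operation to the FSM."},
		// Emitted later with labels: help was filled with the key name.
		{"consul_fsm_intention", "consul_fsm_intention"},
	}
	for _, err := range checkHelp(samples) {
		fmt.Println(err)
	}
}
```

In this model, which help string counts as the expected one depends on which series is seen first, which is consistent with the log above reporting the mismatch in both directions.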

Reproduction Steps

Steps to reproduce this issue:

  1. Create an agent with telemetry{prometheus_retention_time = "192h"}
  2. Fetch metrics from http://localhost:8500/v1/agent/metrics?format=prometheus
  3. See error in logs
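
Step 2 can also be scripted. The following is a minimal sketch that assumes an agent listening on `localhost:8500` as in the steps above; `metricsURL` and `fetchMetrics` are hypothetical helper names, not part of any Consul client library.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// metricsURL builds the Prometheus-format metrics URL for an agent address.
func metricsURL(agentAddr string) string {
	u := url.URL{
		Scheme:   "http",
		Host:     agentAddr,
		Path:     "/v1/agent/metrics",
		RawQuery: url.Values{"format": {"prometheus"}}.Encode(),
	}
	return u.String()
}

// fetchMetrics performs the GET request and returns the response body.
func fetchMetrics(agentAddr string) (string, error) {
	resp, err := http.Get(metricsURL(agentAddr))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s: %s", resp.Status, body)
	}
	return string(body), nil
}

func main() {
	// Assumed address; adjust to wherever the agent's HTTP API listens.
	out, err := fetchMetrics("localhost:8500")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	fmt.Print(out)
}
```

Each request should trigger a metrics gather on the agent, so the error is expected to show up in the agent logs after the call.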
pierresouchay added a commit to pierresouchay/go-metrics that referenced this issue Dec 29, 2020
When a metric is pre-registered with help text but without labels
and is later emitted with labels, the help was filled with
the key name.

But the Prometheus client requires help to be consistent across
all series of a metric, so this can break at runtime when a metric
was declared without labels and is later emitted
with labels.

To fix this, we store the help declared at startup for
each key, so that when a metric is emitted with labels, we try
to reuse the pre-declared help for that metric.

This should fix the issue that appeared with Consul 1.9.x and
fixes hashicorp/consul#9471
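
The approach described in this commit message can be sketched roughly as follows. This is an illustration of the idea only, not the code from the go-metrics change; `helpRegistry`, `declare`, and `helpFor` are hypothetical names.

```go
package main

import "fmt"

// helpRegistry is a hypothetical stand-in for the sink's bookkeeping:
// it remembers the help text declared at startup for each metric key.
type helpRegistry struct {
	help map[string]string
}

func newHelpRegistry() *helpRegistry {
	return &helpRegistry{help: make(map[string]string)}
}

// declare records the help text of a metric pre-registered without labels.
func (r *helpRegistry) declare(key, help string) {
	r.help[key] = help
}

// helpFor returns the pre-declared help for key when one exists.
// Falling back to the key name mirrors the behaviour described in the
// commit message for metrics that were never declared.
func (r *helpRegistry) helpFor(key string) string {
	if h, ok := r.help[key]; ok {
		return h
	}
	return key
}

func main() {
	r := newHelpRegistry()
	r.declare("consul_fsm_intention",
		"Measures the time it takes to apply an intention operation to the FSM.")

	// A labelled series now reuses the declared help instead of the key name.
	fmt.Println(r.helpFor("consul_fsm_intention"))
	// An undeclared metric still falls back to its key name.
	fmt.Println(r.helpFor("consul_undeclared_metric"))
}
```

Looking up help by key rather than by key plus labels means every series of a metric gets the same help string, which is what the Prometheus client checks for.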
@jardleex

jardleex commented Jan 4, 2021

@mkcp has created issue #9303 related to this error.
We're still on 1.8.3 and I'm hesitant to update to 1.9.0/1 because of these issues.

@pierresouchay
Contributor Author

@jardleex Thank you for the report. Closing this issue since #9303 already existed; let's continue the discussion there.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Jan 6, 2021
go-metrics is updated to 0.3.6 to properly handle help in prometheus metrics

This fixes hashicorp#9303 and
hashicorp#9471
pierresouchay added a commit to criteo-forks/consul that referenced this issue Jan 7, 2021
go-metrics is updated to 0.3.6 to properly handle help in prometheus metrics

This fixes hashicorp#9303 and
hashicorp#9471
criteoconsul pushed a commit to criteo-forks/consul that referenced this issue Feb 4, 2021
go-metrics is updated to 0.3.6 to properly handle help in prometheus metrics

This fixes hashicorp#9303 and
hashicorp#9471