Consul Node Metadata missing #5044

Closed
Zenebatos opened this Issue Dec 27, 2018 · 11 comments

Zenebatos commented Dec 27, 2018

Bug Report

What did you do?
I configured some nodes in Consul to have metadata attached to them. Below is a sample node config (this node also happens to be a server):

{
    "advertise_addr": "123.123.123.123",
    "bind_addr": "0.0.0.0",
    "bootstrap_expect": 3,
    "client_addr": "0.0.0.0",
    "data_dir": "/localdata/consul",
    "datacenter": "somelocation",
    "log_level": "INFO",
    "node_meta": {
        "env": "prod",
        "host": "somehost001.somelocation.mydomain.com",
        "host_name": "somehost001",
        "host_type": "somehost",
        "pop": "somelocation"
    },
    "node_name": "somehost001",
    "retry_join": [
        "server1",
        "server2",
        "server3"
    ],
    "retry_join_wan": [
        "server1.otherlocation.mydomain.com",
        "server2.otherlocation.mydomain.com",
    ],
    "server": true,
    "ui": true
}

I was able to properly query Consul for the hosts' metadata (a programmatic sketch of the same query follows the output):

docker exec consul consul catalog nodes --service=myservice --detailed
Node         ID                                    Address     DC          TaggedAddresses                 Meta
somehost001  d8dd170c-620c-f255-c090-d58cb7974583  <redacted>  <redacted>  lan=<redacted>, wan=<redacted>  consul-network-segment=, env=prod, host=somehost001.somelocation.mydomain.com, host_name=somehost001, host_type=somehost, pop=somelocation
somehost002  7150e149-c480-5f6d-94c5-d93d60f015e2  <redacted>  <redacted>  lan=<redacted>, wan=<redacted>  consul-network-segment=, env=prod, host=somehost002.somelocation.mydomain.com, host_name=somehost002, host_type=somehost, pop=somelocation
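
For reference, the same catalog data can be fetched programmatically. Below is a minimal sketch, assuming the github.com/hashicorp/consul/api client and a local agent on 127.0.0.1:8500; "myservice" is the service name used above. It prints the NodeMeta attached to each node backing the service.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    // DefaultConfig points at the local agent on 127.0.0.1:8500.
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }
    // Catalog().Service returns one entry per node providing the service,
    // including that node's metadata in the NodeMeta field.
    entries, _, err := client.Catalog().Service("myservice", "", &api.QueryOptions{AllowStale: true})
    if err != nil {
        log.Fatal(err)
    }
    for _, e := range entries {
        fmt.Printf("%s -> %v\n", e.Node, e.NodeMeta)
    }
}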

I then configured Prometheus with some service discovery and relabel configs for myservice (a sketch of what the replace rules do follows the config):

- job_name: myservice
  scrape_interval: 1m
  scrape_timeout: 55s
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
  - server: localhost:8500
    datacenter: somelocation
    tag_separator: ','
    scheme: http
    allow_stale: true
    refresh_interval: 30s
    services:
    - myservice
  relabel_configs:
  - source_labels: [__meta_consul_service]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_metadata_env]
    separator: ;
    regex: (.*)
    target_label: env
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_metadata_pop]
    separator: ;
    regex: (.*)
    target_label: pop
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_metadata_host_name]
    separator: ;
    regex: (.*)
    target_label: host_name
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_metadata_host_type]
    separator: ;
    regex: (.*)
    target_label: host_type
    replacement: $1
    action: replace
  - source_labels: [__meta_consul_metadata_host]
    separator: ;
    regex: (.*)
    target_label: host
    replacement: $1
    action: replace
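
As a side note on the relabel rules above: each action: replace rule with regex (.*) and replacement $1 simply copies the source label's value into the target label. Here is a minimal sketch of that behavior (not the actual Prometheus relabel package, and simplified to a single source label per rule); the relabelReplace helper and the sample label set are made up for illustration.

package main

import (
    "fmt"
    "regexp"
)

// relabelReplace mimics a single `action: replace` rule: if the anchored regex
// matches the source label's value, the expanded replacement is written to the
// target label. Prometheus anchors the configured regex as ^(?:...)$.
func relabelReplace(labels map[string]string, source, target, pattern, replacement string) {
    re := regexp.MustCompile("^(?:" + pattern + ")$")
    val := labels[source] // a missing source label behaves like an empty value
    if !re.MatchString(val) {
        return // no match: the rule leaves the labels untouched
    }
    labels[target] = re.ReplaceAllString(val, replacement)
}

func main() {
    labels := map[string]string{
        "__meta_consul_service":      "myservice",
        "__meta_consul_metadata_env": "prod",
    }
    relabelReplace(labels, "__meta_consul_service", "job", "(.*)", "$1")
    relabelReplace(labels, "__meta_consul_metadata_env", "env", "(.*)", "$1")
    fmt.Println(labels) // job and env are now copies of the discovered labels
}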

What did you expect to see?
I expected that, under /service-discovery, my job would include a discovered label for each node metadata key that I exposed via Consul.

What did you see instead? Under which circumstances?
The node metadata was missing from my job. I expected to see labels like __meta_consul_metadata_env and __meta_consul_metadata_pop. Other discovered labels like __meta_consul_service, __meta_consul_service_port, and __meta_consul_dc were present.

Environment
I have both Prometheus and Consul running as Docker containers, each using host networking mode. Both containers have their config and data dirs mounted as volumes from the main host.

  • System information:

uname -srm: Linux 3.10.0-693.2.2.el7.x86_64 x86_64
centos-release: centos-release-7-4.1708.el7.centos.x86_64
consul version: Consul v1.0.7

  • Prometheus version:
prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e29e815795e9dd14742ee7725682fa14b7b)
  build user:       root@5258e0bd9cc1
  build date:       20180712-14:02:52
  go version:       go1.10.3

The issue was also reproduced on the same server with the Prometheus version below, running in a different container with the same config as the above Prometheus instance:

prometheus, version 2.6.0 (branch: HEAD, revision: dbd1d58c894775c0788470944b818cc724f550fb)
  build user:       root@bf5760470f13
  build date:       20181217-15:14:46
  go version:       go1.11.3
  • Prometheus configuration file:
---
  global:
    scrape_interval: "10s"
    scrape_timeout: "8s"
    evaluation_interval: "10s"
  rule_files:
    - /etc/prometheus/alert.rules
    - /etc/prometheus/recording.rules
  scrape_configs:
    - job_name: someservice
      scrape_interval: "60s"
      scrape_timeout: "55s"
      consul_sd_configs:
        - datacenter: somelocation
          services:
            - someservice
      relabel_configs:
        - source_labels:
            - "__meta_consul_service"
          target_label: job
        - source_labels:
            - "__meta_consul_metadata_env"
          target_label: env
        - source_labels:
            - "__meta_consul_metadata_pop"
          target_label: pop
        - source_labels:
            - "__meta_consul_metadata_host_name"
          target_label: host_name
        - source_labels:
            - "__meta_consul_metadata_host_type"
          target_label: host_type
        - source_labels:
            - "__meta_consul_metadata_host"
          target_label: host
  alerting:
    alert_relabel_configs: []
    alertmanagers:
      - static_configs:
          - targets:
              - "prm001:9093"
              - "prm002:9093"
              - "prm003:9093"
  remote_read: []
  remote_write: []
  • Logs:
    I ran this in debug mode but no relevant discovery logs were output.

simonpasquier (Member) commented Jan 2, 2019

You need to update to a more recent version of Prometheus. Metadata is available since v2.4.0.

Zenebatos (Author) commented Jan 3, 2019

@simonpasquier 2.4.0 introduced Service metadata support. The metadata I'm attempting to access is node metadata, which has been around since 1.8.0:

https://github.com/prometheus/prometheus/blob/master/CHANGELOG.md#180--2017-10-06

In addition, this works on the other Prometheus servers I have in other DCs that are all running 2.3.2. This particular issue is only happening on this host, and it seems to be a Prometheus issue because I'm able to obtain the node meta from Consul directly.

simonpasquier (Member) commented Jan 4, 2019

Can you share a screenshot of the /service-discovery page? Anything in the logs?

Zenebatos (Author) commented Jan 7, 2019

Here's a screenshot:

[Screenshot: screen shot 2019-01-07 at 15 15 38]

I put it in debug mode and didn't see anything from discovery in the logs that stood out.

simonpasquier (Member) commented Jan 8, 2019

I tried to reproduce this on my local machine and the node metadata shows up as expected. Maybe you can try to capture the traffic between Prometheus and Consul with tcpdump (or a similar tool)?

jacksontj (Contributor) commented Jan 22, 2019

I'm seeing this same behavior (NodeMeta sometimes goes missing). From my debugging, I found that the issue was Consul sometimes returning null for the NodeMeta field (observed in tcpdump). There is an upstream issue, but unfortunately not a lot of traction there yet.
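
For anyone else debugging this, here is a minimal sketch of checking for the null without tcpdump: fetch the raw catalog response and look at the NodeMeta field as raw JSON. It assumes a local agent on localhost:8500 and the service name myservice used earlier in this issue; /v1/catalog/service/<name> should show the same data the service discovery consumes.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    resp, err := http.Get("http://localhost:8500/v1/catalog/service/myservice")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Keep NodeMeta as raw JSON so that null, {} and a populated object
    // can be told apart.
    var nodes []struct {
        Node     string          `json:"Node"`
        NodeMeta json.RawMessage `json:"NodeMeta"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
        panic(err)
    }
    for _, n := range nodes {
        fmt.Printf("%s NodeMeta=%s\n", n.Node, n.NodeMeta)
    }
}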

cstyan (Contributor) commented Jan 22, 2019

Maybe we should treat null in the NodeMeta field as an invalid response and not update the TargetGroups? In normal operation, should Consul always return {} if there is no NodeMeta to return, or is null valid?

simonpasquier (Member) commented Jan 22, 2019

nil or {} would be treated in the same way by the service discovery. Not sure what we can do until upstream fixes it.
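
To illustrate why nil and {} come out the same, here is a minimal sketch (not the actual Prometheus code) of the Go semantics involved: JSON null decodes into a nil map, and ranging over a nil map, like ranging over an empty one, yields no __meta_consul_metadata_* labels.

package main

import (
    "encoding/json"
    "fmt"
)

type catalogEntry struct {
    Node     string            `json:"Node"`
    NodeMeta map[string]string `json:"NodeMeta"`
}

func main() {
    payloads := []string{
        `{"Node":"somehost001","NodeMeta":{"env":"prod","pop":"somelocation"}}`,
        `{"Node":"somehost001","NodeMeta":{}}`,   // empty metadata
        `{"Node":"somehost001","NodeMeta":null}`, // what Consul sometimes returned
    }
    for _, p := range payloads {
        var e catalogEntry
        if err := json.Unmarshal([]byte(p), &e); err != nil {
            panic(err)
        }
        labels := map[string]string{}
        // null decodes to a nil map, {} to an empty map; iterating either
        // produces zero metadata labels.
        for k, v := range e.NodeMeta {
            labels["__meta_consul_metadata_"+k] = v
        }
        fmt.Printf("NodeMeta=%v labels=%v\n", e.NodeMeta, labels)
    }
}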

cstyan (Contributor) commented Jan 22, 2019

Right at the moment they're the same. I'm suggesting we (or @jacksontj if he's interested) could look into whether we get null/nil in any case other than this bug.

jacksontj (Contributor) commented Jan 31, 2019

We just finished an upgrade to Consul 1.4 (to pick up the fix in the upstream issue). After the upgrade we are no longer seeing the issue! So if you are running into this, I suggest upgrading Consul. Assuming that works for you as well, it's probably not worth adding workarounds in the code, since it is an upstream bug in Consul.

simonpasquier (Member) commented Feb 1, 2019

Thanks @jacksontj! I agree with you that we shouldn't try to hack around a Consul bug that's been fixed.
