ZooKeeper serverset discovery can become out-of-sync #2758

Closed
StephanErb opened this Issue May 23, 2017 · 5 comments

StephanErb commented May 23, 2017

What did you do?
Run the ZooKeeper service discovery on a cluster with a high churn rate.

What did you expect to see?
The target set in Prometheus should stay up to date and always know about the service instances currently running on the cluster.

What did you see instead? Under which circumstances?
Over a timespan of several hours, instances discovered by Prometheus become unhealthy due to natural churn, and the following alert fires. A restart of the Prometheus server corrects the issue: afterwards, Prometheus again knows the correct endpoints of all instances running on the cluster.

ALERT PrometheusServerSetsUnhealthy
    IF avg(up{job='aurora-services'}) < 0.85
    FOR 5m
    LABELS { severity="page" }
    ANNOTATIONS {
        summary="A significant portion of the instances discovered via ZooKeeper serversets are unhealthy.",
        description="The Prometheus serverset integration could be hanging."
    }

Our current mean time between failures is about 1-3 days, depending on how often we restart/reload the Prometheus server due to configuration changes. The ZK server in question is stable and shows nothing suspicious in its logs.

Environment

  • System information:

    3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

  • Prometheus version: 1.6.2

  • ZooKeeper version: 3.4.5

  • Prometheus configuration file:

  - job_name: 'aurora-services'
    serverset_sd_configs:
    - servers:
      - '10.x.x.3:2181'
      - '10.x.x.4:2181'
      - '10.x.x.5:2181'
      paths:
      - '/aurora/roleA'
      - '/aurora/roleB'
    relabel_configs:
      # Only scrape services that subscribe to metrics
      - source_labels: ['__meta_serverset_endpoint_port_metrics']
        regex: '.+'
        action: 'keep'
      - source_labels: [__address__]
        regex: '(ip-\d+-\d+-\d+-\d+)\..*:(\d+)'
        target_label: __address__
        replacement: "${1}:${2}"
      # Extract `service` label from zookeeper serverset path
      - source_labels: [__meta_serverset_path]
        regex: '^/aurora/([^/]+)/([^/]+)/([^/]+)/.*'
        target_label: service
        replacement: ${1}.${2}.${3}
      # Extract labels 'role' and 'environment' from service 'role.env.aurora_job'
      - source_labels: ['service']
        regex: '^([^.]+).[^.]+.[^.]+.[^.]+$'
        target_label: 'role'
        replacement: '${1}'
      - source_labels: ['service']
        regex: '^[^.]+.([^.]+).[^.]+.[^.]+$'
        target_label: 'environment'
        replacement: '${1}'
      # Concatenate 'instance' label to contain unique task source 'role.env.aurora_job'
      - source_labels: ['service', '__meta_serverset_shard']
        separator: '.'
        target_label: 'instance'
  • Logs:
    Once errors like the following have been emitted to the log, Prometheus stops receiving ZK watcher updates for certain instances (see the sketch after the log excerpt):
time="2017-05-23T12:57:57Z" level=info msg="read tcp 10.x.x.60:39170->10.1.x.4:2181: i/o timeout" source="treecache.go:52"
time="2017-05-23T12:57:57Z" level=info msg="Failed to set previous watches: zk: buffer too small" source="treecache.go:52"
time="2017-05-23T12:57:57Z" level=info msg="read tcp 10.x.x.60:39172->10.x.x.4:2181: i/o timeout" source="treecache.go:52"
time="2017-05-23T12:57:57Z" level=info msg="read tcp 10.x.x.60:47393->10.x.x.3:2181: i/o timeout" source="treecache.go:52"
time="2017-05-23T12:57:57Z" level=info msg="read tcp 10.x.x.60:39171->10.x.x.4:2181: i/o timeout" source="treecache.go:52"
time="2017-05-23T12:57:57Z" level=info msg="Failed to set previous watches: zk: buffer too small" source="treecache.go:52"

StephanErb commented May 23, 2017

@jjneely have you seen these problems as well?


StephanErb commented May 23, 2017

Plotting ZooKeeper's watch count indicates that watches are dropped completely from time to time. This could explain the behaviour observed above. The relevant Go client code seems to point in the same direction: https://github.com/samuel/go-zookeeper/blob/master/zk/conn.go#L538

[Screenshot, 2017-05-23 17:09: plot of the ZooKeeper watch count over time, showing it dropping sharply from time to time]
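For reference, this is roughly how such a watch count series can be collected, assuming the ZooKeeper four-letter command "mntr" (available in 3.4.x, reporting zk_watch_count) is reachable on the ensemble; the address below is a placeholder.

package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
	"time"
)

// fetchWatchCount sends ZooKeeper's "mntr" four-letter command and returns
// the zk_watch_count value reported by the server.
func fetchWatchCount(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("mntr")); err != nil {
		return "", err
	}

	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == "zk_watch_count" {
			return fields[1], nil
		}
	}
	return "", fmt.Errorf("zk_watch_count not found in mntr output")
}

func main() {
	count, err := fetchWatchCount("10.x.x.4:2181")
	if err != nil {
		fmt.Println("mntr failed:", err)
		return
	}
	fmt.Println("zk_watch_count =", count)
}

Polling this once a minute yields a series comparable to the one plotted above.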


StephanErb commented May 27, 2017

This upstream change, which we have not vendored yet, seems to be related: samuel/go-zookeeper@03a78d2.

StephanErb added a commit to StephanErb/prometheus that referenced this issue May 27, 2017


StephanErb commented May 27, 2017

I have submitted pull request #2778 to bump the vendored Go client. Comments on the upstream library indicate this should help resolve certain errors around connection and watcher handling.

juliusv closed this in 14eee34 on May 29, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 23, 2019
