ZooKeeper serverset discovery can become out-of-sync #2758
Comments
@jjneely have you seen these problems as well?
Plotting the watch count of ZK indicates that watchers are dropped completely from time to time. This could explain the behaviour observed above. The relevant Go client code seems to point in the same direction as well: https://github.com/samuel/go-zookeeper/blob/master/zk/conn.go#L538
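For illustration only (this is not Prometheus's discovery code), here is a minimal sketch of how a child watch is usually consumed with the samuel/go-zookeeper client; the server address and path are placeholders. ZooKeeper watches are one-shot, so the consumer has to re-register them after every event; if the client library drops a watch without ever delivering an event, the consumer blocks indefinitely and its view of the serverset silently goes stale, which is consistent with the behaviour reported here.

```go
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Connect to a (placeholder) ZooKeeper server with a 10s session timeout.
	conn, _, err := zk.Connect([]string{"zk1.example.org:2181"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Illustrative serverset path, not taken from the issue.
	const path = "/serverset/example"

	for {
		// Register (or re-register) the one-shot watch and read the current members.
		children, _, eventCh, err := conn.ChildrenW(path)
		if err != nil {
			log.Printf("watching %s: %v", path, err)
			time.Sleep(time.Second)
			continue
		}
		log.Printf("current members of %s: %v", path, children)

		// Block until the watch fires. If the client library ever drops the
		// watch without delivering an event here, this receive blocks forever
		// and the member list above silently goes stale.
		ev := <-eventCh
		log.Printf("watch event: type=%v state=%v err=%v", ev.Type, ev.State, ev.Err)
	}
}
```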
This upstream change, which we have not ingested yet, seems to be related: samuel/go-zookeeper@03a78d2.
StephanErb added a commit to StephanErb/prometheus that referenced this issue on May 27, 2017.
I have submitted pull request #2778 to bump the vendored Go client. Comments on the upstream library indicate this should help resolve certain errors around handling connections and watchers.
juliusv closed this issue in 14eee34 on May 29, 2017.
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

StephanErb commented May 23, 2017 (original issue report)
What did you do?
Run the ZooKeeper service discovery on a cluster with a high churn rate.
What did you expect to see?
The target set in Prometheus should stay up to date and always reflect the service instances currently running on the cluster.
What did you see instead? Under which circumstances?
Over a timespan of several hours, instances discovered by Prometheus become unhealthy due to natural churn, and an alert fires for these stale targets. A restart of the Prometheus server corrects the issue: afterwards, Prometheus is aware of the correct endpoints of all instances running on the cluster.
Our current mean time between failures is about 1-3 days, depending on how often we restart or reload the Prometheus server due to configuration changes. The ZK server in question is stable and shows nothing suspicious in its logs.
Environment
System information:
3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
Prometheus version: 1.6.2
ZooKeeper version: 3.4.5
Prometheus configuration file:
Once such an event has been emitted to the log, Prometheus fails to receive ZK watcher updates for certain instances.
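For context, a ZooKeeper serverset discovery section in a Prometheus 1.x configuration generally has the following shape; the job name, ZooKeeper addresses, and paths below are illustrative placeholders rather than the reporter's actual settings:

```yaml
scrape_configs:
  - job_name: 'serverset-example'        # placeholder job name
    serverset_sd_configs:
      - servers:
          - 'zk1.example.org:2181'       # placeholder ZooKeeper ensemble members
          - 'zk2.example.org:2181'
        paths:
          - '/path/to/serverset'         # placeholder serverset znode path
        timeout: 10s
```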