Zookeeper treecache may end up in a state where it no longer receives updates #1895
Comments
@StephanErb I think one problem is that when the treecache implementation can't talk to ZooKeeper, it enters a failure mode and loops until it can connect again. Upon reconnecting it rebuilds the current target state, but nodes deleted during this window are not reported to the target manager. Is this what you are seeing? I have a fix for the above that we are trying out internally.
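To illustrate the missed-deletion case described above, here is a minimal sketch in Go. It is not the patch mentioned in this thread and does not use the Prometheus treecache or go-zookeeper APIs. The function name `diffDeleted`, the map-based state, and the example paths are all hypothetical. The idea is only that after a reconnect, comparing the previously known node set against the rebuilt one would reveal nodes that disappeared while the connection was down, so that a deletion could be reported for each.

```go
package main

import (
	"fmt"
	"sort"
)

// diffDeleted returns the paths that were present in the state known before
// the connection loss but are absent from the state rebuilt after
// reconnecting. The result is sorted for deterministic output.
// Hypothetical helper: the state is modelled as a plain map from znode path
// to node data, which is an assumption made for this sketch.
func diffDeleted(before, after map[string][]byte) []string {
	var deleted []string
	for path := range before {
		if _, ok := after[path]; !ok {
			deleted = append(deleted, path)
		}
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	// State as it was known before ZooKeeper became unreachable.
	before := map[string][]byte{
		"/services/api/member_0": []byte("a"),
		"/services/api/member_1": []byte("b"),
	}
	// State rebuilt after reconnecting: member_1 was removed during the
	// outage and member_2 was added.
	after := map[string][]byte{
		"/services/api/member_0": []byte("a"),
		"/services/api/member_2": []byte("c"),
	}
	for _, p := range diffDeleted(before, after) {
		fmt.Println("deleted:", p) // prints "deleted: /services/api/member_1"
	}
}
```

A real fix would also have to handle changed node data and re-establish watches, which this sketch leaves out.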
@tommyulfsparre I have been seeing the problem you describe as well. However, when I filed the ticket I was hit by an even more severe problem: Prometheus was no longer receiving any changes at all, even though Zookeeper was working properly again. If a new serverset was submitted to Zookeeper, Prometheus wasn't notified.
Did you have any success with the patch you have tried internally?
@StephanErb Yes! I will submit the patch; I just wanted to write some tests that don't require Zookeeper.
I am not able to reproduce this in any recent version.
StephanErb closed this Nov 27, 2016
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
StephanErb commented Aug 15, 2016
What did you do?
We have set up our Prometheus server to perform service discovery using Zookeeper serversets.
What did you expect to see?
Prometheus' view of the available targets is in sync with what is written in Zookeeper. This has been working as expected for months without any issues.
What did you see instead? Under which circumstances?
During a network problem or short Zookeeper downtime, Prometheus ended up in a state where it was no longer notified of changes made in Zookeeper. This resulted in Prometheus not scraping all intended targets.
The Prometheus log contained the following message:
Restarting the Prometheus server 'fixed' the issue. After the restart, Prometheus correctly discovered the current cluster state and has been properly receiving all Zookeeper updates since then.
The root cause seems to be related to the Zookeeper Go library in use. Other users of the same library report similar issues (wvanbergen/kafka#76). In particular, they report that this merged pull request (samuel/go-zookeeper#87) could help to prevent the issue. Bumping the vendored library version within Prometheus might therefore be a good idea.
Environment