Zookeeper treecache may end up in a state where it no longer receives updates #1895

Closed
StephanErb opened this Issue Aug 15, 2016 · 6 comments


StephanErb commented Aug 15, 2016

What did you do?
We have set up our Prometheus server to perform service discovery using ZooKeeper serversets.

What did you expect to see?
Prometheus' view of the available targets stays in sync with what is written in ZooKeeper. This has been working as expected for months without any issues.

What did you see instead? Under which circumstances?
During a network problem or a short ZooKeeper downtime, Prometheus ended up in a state where it was no longer notified of changes performed in ZooKeeper. As a result, Prometheus did not scrape all intended targets.

The Prometheus log contained the following message:

time="2016-07-26T04:19:33Z" level=info msg="Failed to set previous watches: write tcp 10.x.x.60:50016->10.x.x.1:2182: write: broken pipe" source="treecache.go:31"

Restarting the Prometheus server 'fixed' the issue. After the restart, Prometheus correctly discovered the current cluster state and has been receiving all ZooKeeper updates since then.

The root cause seems to be related to the ZooKeeper Go library in use. Other users of the same library report similar issues (wvanbergen/kafka#76). In particular, they report that this merged pull request (samuel/go-zookeeper#87) could help prevent the issue. Bumping the vendored library version within Prometheus might therefore be a good idea.
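
For illustration only (not the actual treecache code; the server address and path are placeholders), a rough sketch of reconnect handling with the samuel/go-zookeeper client: it listens on the session event channel and, once the session is re-established, re-reads the tree and re-installs watches explicitly instead of relying solely on the automatic replay of watches, the step whose failure produces the "Failed to set previous watches" message above.

package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Placeholder server address; the second return value is the session
	// event channel on which connection state changes are delivered.
	conn, events, err := zk.Connect([]string{"10.x.x.1:2182"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for evt := range events {
		if evt.Type != zk.EventSession {
			continue
		}
		switch evt.State {
		case zk.StateDisconnected:
			log.Println("lost connection to ZooKeeper")
		case zk.StateHasSession:
			// Rather than trusting the library's automatic watch replay,
			// re-read the node and re-install the watch explicitly.
			children, _, childCh, err := conn.ChildrenW("/aurora")
			if err != nil {
				log.Println("re-installing watch failed:", err)
				continue
			}
			log.Println("re-synced children:", children)
			_ = childCh // in a real cache this channel feeds the update loop
		}
	}
}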

Environment

  • Zookeeper: 3.3.5
  • Prometheus version: 0.20.0
  • Prometheus configuration file:
scrape_configs:  
  - job_name: 'aurora'
    metrics_path: /metrics
    serverset_sd_configs:
    - servers:
      - '10.x.x.1:2182'
      - '10.x.x.2:2182'
      - '10.x.x.3:2182'
      paths:
      - '/aurora'

tommyulfsparre commented Aug 17, 2016

@StephanErb I think one problem is that when the treecache implementation can't talk to ZooKeeper, it enters a failure mode and loops until it can connect again. Upon reconnecting it rebuilds the current target state, but nodes deleted during this window will not be sent to the target manager.

Is this what you are seeing? I have a fix for the above that we are trying out internally.
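
For illustration only (not Prometheus' actual treecache code; the rebuild function and cache layout are made up), a minimal sketch of the missing step described above: after reconnecting, the cache has to diff the children ZooKeeper now reports against its previous view, otherwise nodes deleted during the outage linger as targets.

package main

import "fmt"

// rebuild reconciles the cache's previous view (`known`) with the children
// ZooKeeper reports after reconnecting (`current`). Without the second loop,
// nodes deleted while disconnected silently stay in the cache.
func rebuild(known map[string]bool, current []string) {
	seen := make(map[string]bool, len(current))
	for _, path := range current {
		seen[path] = true
		known[path] = true // new or still-existing node
	}
	for path := range known {
		if !seen[path] {
			delete(known, path) // deleted during the outage -> must be removed
		}
	}
}

func main() {
	known := map[string]bool{"/aurora/job-a": true, "/aurora/job-b": true}
	// job-b was deleted from ZooKeeper while we were disconnected.
	rebuild(known, []string{"/aurora/job-a", "/aurora/job-c"})
	fmt.Println(known) // map[/aurora/job-a:true /aurora/job-c:true]
}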

StephanErb commented Aug 18, 2016

@tommyulfsparre I have been seeing the problem you describe as well.

However, when I filed the ticket I was hit by an even more severe problem: Prometheus was no longer receiving any changes at all, even though Zookeeper was working properly again. If a new serverset was submitted to Zookeeper, Prometheus wasn't notified.

StephanErb commented Aug 31, 2016

Did you have any success with the patch you have been trying out internally?

tommyulfsparre commented Sep 2, 2016

@StephanErb Yes! I will submit the patch; I just wanted to write some tests that don't require ZooKeeper.

StephanErb commented Nov 27, 2016

I am not able to reproduce this in any recent version.

StephanErb closed this Nov 27, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
