Zookeeper Serverset SD broken on 1.4.1 #2254
Comments
cristifalcas commented Dec 6, 2016
We have the same issue. After 1.4.0, the Zookeeper scrape_config doesn't work anymore.
This obviously affects the Nerve Service Discovery as well, as it's based on Zookeeper.
JustinVenus commented Jan 4, 2017
Ran into this issue as well. Prometheus 1.3.1 is the last release that isn't affected by this bug.
jacobrichard commented Jan 4, 2017
Looks like there were a lot of ZK service discovery changes in 1.4.0. I'd have to pick through what is being done here, but it looks like nerve and serverset got merged into one file. It seems a lot of the locking logic was rewritten or removed, which could contribute to a deadlock. See the changes in: a1eec44
Jackey2015 commented Jan 11, 2017
In this file: util/treecache/treecache.go. After changing the code and rebuilding Prometheus, it works OK.
cristifalcas commented Mar 3, 2017
Any updates on this?
I am not familiar with the Zookeeper SD at all and don't have a Zookeeper around to try anything out myself, but just from having a quick look at the code I suspect the problem to be this deadlock:
Perhaps it is sufficient to remove the initial synchronous sync by just removing these lines: prometheus/util/treecache/treecache.go, lines 89 to 92 in 3730255. If someone could try that out, that would be great.
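To illustrate the kind of failure suspected here, the following Go sketch shows one way a synchronous initial sync inside a constructor can deadlock, and how deferring that sync avoids it. This is not the Prometheus TreeCache code: `Event`, `Cache`, `NewCacheSync`, `NewCacheDeferred`, and `returnsWithin` are hypothetical names, and the unbuffered event channel with a consumer that only starts after construction is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical stand-in for a cache update notification.
type Event struct{ Path string }

// Cache is a hypothetical, simplified cache. It is NOT Prometheus's TreeCache.
type Cache struct {
	events chan Event
}

// NewCacheSync runs the initial sync inside the constructor. Each send blocks
// on the unbuffered channel until someone receives, so if the consumer only
// starts after the constructor returns, the constructor never returns.
func NewCacheSync(events chan Event, paths []string) *Cache {
	c := &Cache{events: events}
	c.sync(paths)
	return c
}

// NewCacheDeferred moves the initial sync into a goroutine, so construction
// returns immediately and the consumer can start receiving afterwards.
func NewCacheDeferred(events chan Event, paths []string) *Cache {
	c := &Cache{events: events}
	go c.sync(paths)
	return c
}

// sync emits one event per path, in order.
func (c *Cache) sync(paths []string) {
	for _, p := range paths {
		c.events <- Event{Path: p}
	}
}

// returnsWithin reports whether f returns before the timeout d elapses.
func returnsWithin(d time.Duration, f func()) bool {
	done := make(chan struct{})
	go func() {
		f()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	paths := []string{"/services/a", "/services/b"}

	// Nobody is receiving yet, so the synchronous constructor blocks.
	blocked := !returnsWithin(100*time.Millisecond, func() {
		NewCacheSync(make(chan Event), paths)
	})
	fmt.Println("sync constructor blocked:", blocked)

	// The deferred variant returns at once; events arrive once we receive.
	events := make(chan Event)
	returned := returnsWithin(100*time.Millisecond, func() {
		NewCacheDeferred(events, paths)
	})
	fmt.Println("deferred constructor returned:", returned)
	for range paths {
		fmt.Println("event:", (<-events).Path)
	}
}
```

In this sketch the only difference between the two constructors is the `go` keyword in front of the initial sync; whether removing the sync entirely or deferring it is the right fix for the real code depends on details not shown in this thread.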
@juliusv thanks for doing the code analysis. I will test it out and report back here.
StephanErb referenced this issue Mar 3, 2017
Merged: Prevent deadlock in ZK TreeCache constructor by deferring the initial sync. #2470
Following Julius's excellent lead, I was able to assemble a fix. See #2470 for details.
fabxc closed this in 3038d0e Mar 6, 2017
Any chance to get this into a bugfix release? (I can understand if this is too much effort, but maybe you have good automation in place to make this feasible :-). Thanks!
I filed #2481 regarding this patch. There seem to be issues after a SIGHUP reload where the Zookeeper code begins consuming a large amount of CPU time.
The 1.6 release is not too far away. (If only I weren't drawn into production incidents all the time, or didn't have to prepare talks for CloudNativeCon ;).)
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
jjneely commented Dec 5, 2016
What did you do?
Upgraded from 1.2.1 to 1.4.1.
What did you expect to see?
Known working serverset configuration should continue to function.
What did you see instead? Under which circumstances?
The /target page refuses to load. Scraping of all targets does not happen.
Environment
System information:
Linux 3.13.0-85-generic x86_64
Ubuntu Trusty
Prometheus version:
Prometheus configuration file:
Complete configuration can be made available privately. The following is what my server set looks like. (Yes, JSON.)
Removing the serverset_sd_configs section from the configuration makes the Prometheus 1.4.1 instance function normally. Issues return when the serverset section is re-added.
No additional logs are produced, even with -log.level=debug present. I believe the target manager is deadlocked.