panic: sync: negative WaitGroup counter #5408
Comments
This crash happens with the Debian package. Have you tried to reproduce it with the vanilla version? If it fails systematically after the initial crash, maybe try running the vanilla version on the same data directory...
I'm somewhat skeptical that it's due to the Debian package, since the stack trace shows the panic occurring in the github.com/prometheus/tsdb package, and the Debian package is built with the same version (0.4.0) as specified in go.mod for the v2.7.1 tag of Prometheus. Since Go binaries are statically linked, I don't think one can so readily blame the distro (i.e. outdated shared libraries), unless the package maintainer was negligent and built the Go binary with outdated dev packages. I mostly opened this issue because we've seen it happen quite a few times now, and I'm surprised that nobody else has experienced it. I'll keep a vanilla build handy for the next time it occurs.
If you can send us WAL files that cause this, that'd be handy.
@brian-brazil I no longer have the WALs from the most recent crash, so I would have to try to provoke it on a test instance. However, since the panic still occurs after removing the WAL completely (but not touching the most recent chunk), do you think the WAL will actually hold the answer?
Just the most recent chunk would help in that case.
I didn't want to imply that downstream was doing a bad job (quite the opposite, in fact). I simply want to rule out any possible cause.
Looking at the tsdb code in question (https://github.com/prometheus/tsdb/blob/master/block.go#L476):

```go
func (r blockIndexReader) Close() error {
	r.b.pendingReaders.Done()
	return nil
}
```

This strikes me as suspect because Close() unconditionally calls pendingReaders.Done(), so any path that closes the reader more than once would drive the WaitGroup counter negative. This is speculation, but perhaps someone with more knowledge of the area like @gouthamve can comment.
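To make that failure mode concrete, here is a minimal, self-contained sketch. The block, leakyReader, and safeReader types below are hypothetical stand-ins, not the actual tsdb code: a Close method that unconditionally calls Done() produces exactly this panic the second time it is called, while a sync.Once guard makes Close idempotent.

```go
package main

import (
	"fmt"
	"sync"
)

// block mimics a tsdb block that tracks open readers with a WaitGroup.
type block struct {
	pendingReaders sync.WaitGroup
}

// leakyReader mirrors the shape of blockIndexReader.Close above: Done()
// is called unconditionally, so a second Close over-decrements the counter.
type leakyReader struct {
	b *block
}

func (r *leakyReader) Close() error {
	r.b.pendingReaders.Done()
	return nil
}

// safeReader guards Done() with sync.Once, making Close safe to call
// any number of times.
type safeReader struct {
	b    *block
	once sync.Once
}

func (r *safeReader) Close() error {
	r.once.Do(r.b.pendingReaders.Done)
	return nil
}

func main() {
	b := &block{}

	// Idempotent version: the second Close is a no-op.
	b.pendingReaders.Add(1)
	s := &safeReader{b: b}
	s.Close()
	s.Close()
	fmt.Println("safeReader survived a double Close")

	// Leaky version: the second Close panics with
	// "panic: sync: negative WaitGroup counter".
	b.pendingReaders.Add(1)
	l := &leakyReader{b: b}
	l.Close()
	l.Close() // panics here
}
```

The sync.Once guard is just one way to make Close idempotent; prometheus/tsdb#581, referenced below, tracks exactly that kind of change.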
krasi-georgiev added the component/local storage and kind/bug labels on Apr 6, 2019
@mdlayher I will dig in and see if I can find the culprit.
@dswarbrick can you ping me on IRC (#prometheus-dev)? That way, when it happens, you can send me some corrupted data and I will try to replicate it locally to find the culprit. If the panic clears when deleting a chunk, then I think the easiest way to replicate is by sending that entire corrupted block.
@krasi-georgiev I just hit an occurrence of this after a voluntary (and clean) shutdown, a few minutes after a checkpoint was written. Upon starting Prometheus again, it hit the negative WaitGroup counter moments after reaching the "ready for web requests" stage. I'll ping you tomorrow during business hours (CEST) about sending you a copy of the head chunk and WAL.
Thanks, I am on IRC all day.
krasi-georgiev referenced this issue on Apr 10, 2019: make Close methods for the querier safe to call more than once. #581 (open)
A brief update for anybody following this - I have managed to reproduce the panic reliably with a single TSDB block (the one that I had to move out of the way on the crashed instance) and a single alerting rule (any rule, it doesn't matter which). Thanks to @krasi-georgiev we seem to be narrowing down the culprit.
dswarbrick commented Mar 26, 2019
Bug Report
What did you do?
Restart Prometheus after oom-kill / crash.
What did you expect to see?
Resumption of normal operation.
What did you see instead? Under which circumstances?
Prometheus panicked with "negative WaitGroup counter" shortly after reaching "Server is ready to receive web requests" state. Once this occurs, the symptoms are repeatable.
This condition occurs most of the time after a hard crash, which somewhat defeats the purpose of having a WAL.
To resolve the issue, I usually have to remove the WAL, and sometimes also the most recent chunk. After doing this, I am able to restart without the panic, but of course we lose the most recent ~2h of metrics.
Environment
System information:
Debian stretch, Linux 4.19.0-0.bpo.2-amd64 x86_64
Prometheus version:
v2.7.1 (Debian package); the issue was also observed on v2.6.0, possibly also earlier versions.
(large, can paste if devs consider relevant to issue)