
panic: sync: negative WaitGroup counter #5408

dswarbrick opened this Issue Mar 26, 2019 · 12 comments


dswarbrick commented Mar 26, 2019

Bug Report

What did you do?
Restart Prometheus after oom-kill / crash.

What did you expect to see?
Resumption of normal operation.

What did you see instead? Under which circumstances?
Prometheus panicked with "negative WaitGroup counter" shortly after reaching "Server is ready to receive web requests" state. Once this occurs, the symptoms are repeatable.

This condition occurs most of the time after a hard crash, which largely defeats the purpose of having a WAL.

To resolve the issue, I usually have to remove the WAL and sometimes also the most recent chunk. After doing this, I am able to restart without the panic, but of course we lose the most recent ~2h of metrics.

Environment

  • System information:
    Debian stretch, Linux 4.19.0-0.bpo.2-amd64 x86_64

  • Prometheus version:

prometheus, version 2.7.1+ds (branch: debian/sid, revision: 2.7.1+ds-3)
  build user:       pkg-go-maintainers@lists.alioth.debian.org
  build date:       20190210-18:10:02
  go version:       go1.11.5

Issue was also observed on v2.6.0, possibly also earlier versions.

  • Prometheus configuration file:

(large; can paste if the devs consider it relevant to the issue)

  • Logs:
Mar 26 10:58:19 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:58:19.149959065Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553547600000 maxt=1553580000000 ulid=01D6WET9316VK1153QBPEZ3S34
Mar 26 10:58:19 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:58:19.150052223Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553590800000 maxt=1553594400000 ulid=01D6WRSD2AD5TZTM6R2Y7GBK2G
Mar 26 10:58:19 fra-prom1 prometheus[3431]: level=warn ts=2019-03-26T10:58:19.150549146Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/var/lib/prometheus/metrics2/wal/00000001
Mar 26 10:59:03 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:59:03.851930449Z caller=main.go:633 msg="TSDB started"
Mar 26 10:59:03 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:59:03.852751558Z caller=main.go:693 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 26 10:59:03 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:59:03.910828212Z caller=main.go:720 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
Mar 26 10:59:03 fra-prom1 prometheus[3431]: level=info ts=2019-03-26T10:59:03.910941827Z caller=main.go:587 msg="Server is ready to receive web requests."
Mar 26 10:59:23 fra-prom1 prometheus[3431]: panic: sync: negative WaitGroup counter
Mar 26 10:59:23 fra-prom1 prometheus[3431]: goroutine 898 [running]:
Mar 26 10:59:23 fra-prom1 prometheus[3431]: sync.(*WaitGroup).Add(0xc0a713b8dc, 0xffffffffffffffff)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011/usr/lib/go-1.11/src/sync/waitgroup.go:74 +0x137
Mar 26 10:59:23 fra-prom1 prometheus[3431]: sync.(*WaitGroup).Done(0xc0a713b8dc)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011/usr/lib/go-1.11/src/sync/waitgroup.go:99 +0x34
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/tsdb.blockIndexReader.Close(0x17a2700, 0xc0001a7ce0, 0xc0a713b8c0, 0xc104faec20, 0xc0d859ba80)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/tsdb/block.go:445 +0x31
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/tsdb.(*blockQuerier).Close(0xc1073983c0, 0x10, 0x1493660)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/tsdb/querier.go:227 +0x5f
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/tsdb.(*querier).Close(0xc104faeb80, 0x2, 0xc0d859bc80)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/tsdb/querier.go:141 +0x97
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/prometheus/storage/tsdb.querier.Close(0x179cae0, 0xc104faeb80, 0x6, 0xc00004e588)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/storage/tsdb/tsdb.go:253 +0x31
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/prometheus/rules.(*Group).RestoreForState.func1(0x179a920, 0xc10681aaf0, 0xc0efb2b2c0)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/rules/manager.go:553 +0x35
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/prometheus/rules.(*Group).RestoreForState(0xc0efb2b2c0, 0xbf1e9f62c112f3d3, 0xee1a8fe2d, 0x234afe0)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/rules/manager.go:667 +0x265
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/prometheus/rules.(*Group).run(0xc0efb2b2c0, 0x1799960, 0xc000034038)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/rules/manager.go:325 +0x621
Mar 26 10:59:23 fra-prom1 prometheus[3431]: github.com/prometheus/prometheus/rules.(*Manager).Update.func1.1(0xc0001b8b40, 0xc0efb2b2c0)
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/rules/manager.go:782 +0x60
Mar 26 10:59:23 fra-prom1 prometheus[3431]: created by github.com/prometheus/prometheus/rules.(*Manager).Update.func1
Mar 26 10:59:23 fra-prom1 prometheus[3431]: #011github.com/prometheus/prometheus/rules/manager.go:777 +0x56
Mar 26 10:59:23 fra-prom1 systemd[1]: prometheus.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 26 10:59:23 fra-prom1 systemd[1]: prometheus.service: Unit entered failed state.
Mar 26 10:59:23 fra-prom1 systemd[1]: prometheus.service: Failed with result 'exit-code'.

simonpasquier commented Mar 26, 2019

This crash happens using the Debian package. Have you tried to reproduce it with the vanilla version? If it fails systematically after the initial crash, maybe try running the vanilla version on the same data directory...


dswarbrick commented Mar 26, 2019

I'm somewhat skeptical that it's due to the Debian package, since the stack trace shows the panic occurring in the github.com/prometheus/tsdb package, and the Debian package is built with the same version of it (0.4.0) as the one specified in go.mod for the v2.7.1 tag of Prometheus.

Since Go binaries are statically linked, I don't think one can so readily blame the distro (i.e. due to outdated shared libraries), unless the package maintainer has been negligent in building the Go binary with outdated dev packages.

I mostly opened this issue because we've seen it happen quite a few times now, and I'm surprised that nobody else has experienced it. I'll keep a vanilla build handy for next time it occurs.


brian-brazil commented Mar 26, 2019

If you can send us WAL files that cause this, that'd be handy.


dswarbrick commented Mar 26, 2019

@brian-brazil I no longer have the WALs from the most recent crash, so I would have to try to provoke it on a test instance. However, since the panic still occurs after removing the WAL completely (but not touching the most recent chunk), do you think the WAL will actually hold the answer?


brian-brazil commented Mar 26, 2019

Just the most recent chunk would help in that case.


simonpasquier commented Mar 26, 2019

Since Go binaries are statically linked, I don't think one can so readily blame the distro (i.e. due to outdated shared libraries), unless the package maintainer has been negligent in building the Go binary with outdated dev packages.

I didn't want to imply that downstream was doing a bad job (quite the opposite, in fact). I simply want to rule out any possible cause.


mdlayher commented Apr 5, 2019

Looking at the tsdb code in question (https://github.com/prometheus/tsdb/blob/master/block.go#L476):

func (r blockIndexReader) Close() error {
	r.b.pendingReaders.Done()
	return nil
}

This strikes me as suspect because:

  • we're using a sync.WaitGroup as a reference counter of sorts, which feels weird
  • since r is a value of blockIndexReader and not a pointer, we're copying the values within r, including the pointer to the Block (which could change out from under our copy and invalidate this Done call)

This is speculation, but perhaps someone with more knowledge of the area like @gouthamve can comment.
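
For reference, here is a minimal standalone sketch (toy types, not the actual tsdb code) showing how a second Close() on the same reader, i.e. one Done() more than the matching Add(), produces exactly the panic message from the logs above:

package main

import "sync"

// toy stand-ins for tsdb.Block and blockIndexReader, just to show the mechanism
type block struct {
    pendingReaders sync.WaitGroup
}

type blockIndexReader struct {
    b *block
}

func (r blockIndexReader) Close() error {
    r.b.pendingReaders.Done()
    return nil
}

func main() {
    b := &block{}
    b.pendingReaders.Add(1) // one reader registered against the block
    r := blockIndexReader{b: b}

    r.Close() // counter drops back to zero
    r.Close() // second Close: panic: sync: negative WaitGroup counter
}

So whether it is the value receiver or something else, the symptom points at Done() being called more times than Add() for the same block.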


krasi-georgiev commented Apr 6, 2019

@mdlayher r is a copy, but r.b is a pointer, so even the copy should hold the same pointer. Of course, you could be right that some weird dereferencing happens somewhere.

I will dig to see if I can find the culprit.


krasi-georgiev commented Apr 6, 2019

It seems that Close() is called twice somewhere, but I can't figure out where.

@dswarbrick can you ping me on IRC (#prometheus-dev)? That way, when it happens again, you can send me some of the corrupted data and I will try to replicate it locally to find the culprit.

If the panic clears after deleting a chunk, then I think the easiest way to replicate it is by sending that entire corrupted block.
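
As an aside, and purely as a hypothetical sketch rather than the actual tsdb fix: if a double Close() of the same reader does turn out to be the cause, one way to make Close idempotent is a sync.Once guard (toy types again, names illustrative only):

package main

import "sync"

type block struct {
    pendingReaders sync.WaitGroup
}

// guardedReader is hypothetical; a pointer receiver plus sync.Once ensures
// repeated Close calls only decrement the counter once
type guardedReader struct {
    b         *block
    closeOnce sync.Once
}

func (r *guardedReader) Close() error {
    r.closeOnce.Do(r.b.pendingReaders.Done)
    return nil
}

func main() {
    b := &block{}
    b.pendingReaders.Add(1)
    r := &guardedReader{b: b}
    r.Close()
    r.Close() // no-op instead of a panic
}

That would of course only mask the underlying double close, so finding where the second Close() comes from is still the interesting part.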


dswarbrick commented Apr 9, 2019

@krasi-georgiev I just hit another occurrence of this, after a voluntary (and clean) shutdown, a few minutes after a checkpoint was written. Upon starting Prometheus again, it hit the negative WaitGroup counter panic moments after reaching the "ready for web requests" stage.

I'll ping you tomorrow during business hours (CEST) about sending you a copy of the head chunk and WAL.


krasi-georgiev commented Apr 9, 2019

Thanks, I am on IRC all day.


dswarbrick commented Apr 16, 2019

A brief update for anybody following this: I have managed to reproduce the panic reliably with a single TSDB block (the one that I had to move out of the way on the crashed instance) and a single alerting rule (any rule, it doesn't matter which). Thanks to @krasi-georgiev, we seem to be narrowing down the culprit.
