
Situational trigger of a memory leak in 2.0.0-rc.1 #3316

Closed
beorn7 opened this Issue Oct 19, 2017 · 42 comments

beorn7 commented Oct 19, 2017

What did you do?

Run 2.0.0-rc.1 on three different servers with very different high load (high churn, high steady series count, high retention).

What did you expect to see?

Stable memory consumption, with the usual compaction oscillation.

What did you see instead? Under which circumstances?

Sometimes, memory starts to grow without bound until the server OOMs. It also appears that no compaction cycles are happening anymore, and no clean shutdown is possible: shutdown hangs.

This needs more investigation. I'll try to provide goroutine dumps, memory profiles, and such. I'm very busy at the moment, so I thought I'd just leave this here as a gathering point in case others run into the same issue.

Environment

  • System information:

Linux 4.4.10+soundcloud x86_64

  • Prometheus version:
$ prometheus -version
Error parsing commandline arguments: unknown short flag '-v'
prometheus: error: unknown short flag '-v'

I guess the issue template needs an update. ;o)

$ prometheus --version
prometheus, version 2.0.0-rc.1 (branch: HEAD, revision: 5ab8834befbd92241a88976c790ace7543edcd59)
  build user:       root@1f56dd8b6f7b
  build date:       20171017-12:34:15
  go version:       go1.9.1

I'll add more once I find time. (Or @grobie might.)

beorn7 commented Oct 19, 2017

screenshot_20171019_132120

Memory usage pattern. On the left, the server with many time series. It works fine, but at some point the oscillation stops and memory just grows forever. (It hasn't crashed yet; this happened for the first time.)

The middle panel is the high-series-churn server. It is very relaxed (but no deploys have happened in the two days shown, so no churn after all).

The right panel is a server with a mid-to-high series count, but very steady (it scrapes many node exporters), with decent retention. This one hits the issue quite regularly and is essentially in a slow crash loop. Restarts take long, which again hints at a giant WAL.

More to follow...

beorn7 commented Oct 19, 2017

See the attached goroutine dump of a server stuck in shutdown (NB: this is not rc.1 but slightly earlier, without relevant changes I believe; just in case, it is at commit 721050c).

prom2-721050c6cbbcb7061cb57c0783886e78b78710f3-deadlocked-on-shutdown.20171118.txt

beorn7 commented Oct 19, 2017

At the time the problem started, the server to the left had a (non-crashing) panic. Log excerpt:

2017-10-18_15:02:19.30867 level=error ts=2017-10-18T15:02:19.308519991Z caller=engine.go:592 component="query engine" msg="runtime panic in parser" err="runtime error: invalid memory address or nil pointer dereference" stacktrace="goroutine 2772 [running]:\ngithub.com/prometheus/prometheus/promql.(*evaluator).recover(0xc9dc5466c0, 0xc914d093b0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:590 +0xcd\npanic(0x1925ac0, 0x293e830)\n\t/usr/local/go/src/runtime/panic.go:491 +0x283\ngithub.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memSeries).iterator(0xc5287ed550, 0x0, 0x0, 0xc914d088c0)\n\t/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:1275 +0x138\ngithub.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*safeChunk).Iterator(0xc49d7b76a0, 0x0, 0x18316a0)\n\t/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:787 +0x50\ngithub.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeriesIterator).Seek(0xc85284ee40, 0x15f2f93d5f6, 0xc4204a2e00)\n\t/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:734 +0xa0\ngithub.com/prometheus/prometheus/storage.(*mergeIterator).Seek(0xc53fc5e390, 0x15f2f93d5f6, 0xc6cdf91e00)\n\t/go/src/github.com/prometheus/prometheus/storage/fanout.go:372 +0xcb\ngithub.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Seek(0xc53fc5e3c0, 0x15f3001b2f6, 0x6)\n\t/go/src/github.com/prometheus/prometheus/storage/buffer.go:63 +0xca\ngithub.com/prometheus/prometheus/promql.(*evaluator).matrixSelector(0xc9dc5466c0, 0xc59e89f730, 0x1bb7536, 0x15, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:842 +0x5b2\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x2881400, 0xc59e89f730, 0x7f56e4dfc568, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:717 +0x536\ngithub.com/prometheus/prometheus/promql.(*evaluator).evalMatrix(0xc9dc5466c0, 0x2881400, 0xc59e89f730, 0xc9ada82a10, 0x197c040, 0x198f380)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:638 +0x49\ngithub.com/prometheus/prometheus/promql.funcDeriv(0xc9dc5466c0, 0xc8c19bdff0, 0x1, 0x1, 0x0, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/functions.go:644 +0x6d\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x28813c0, 0xc9e64f6080, 0x0, 0xc74a8fac60)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:714 +0x11c4\ngithub.com/prometheus/prometheus/promql.(*evaluator).evalVector(0xc9dc5466c0, 0x28813c0, 0xc9e64f6080, 0x15, 0x0, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:611 +0x49\ngithub.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x2881340, 0xc67eaad950, 0xc914d093b0, 0xc9dc5466c0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:681 +0x422\ngithub.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc9dc5466c0, 0x2881340, 0xc67eaad950, 0x0, 0x0, 0x0, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:668 +0xc1\ngithub.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc420378ab0, 0x288bcc0, 0xc614c676e0, 0xc899de0c80, 0xc67eaad9a0, 0x0, 0x0, 0x0, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:385 +0x3ae\ngithub.com/prometheus/prometheus/promql.(*Engine).exec(0xc420378ab0, 0x288bcc0, 0xc614c676e0, 0xc899de0c80, 0x0, 0x0, 0x0, 0x0)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:341 
+0x317\ngithub.com/prometheus/prometheus/promql.(*query).Exec(0xc899de0c80, 0x288bc40, 0xc42041ee40, 0x288bc40)\n\t/go/src/github.com/prometheus/prometheus/promql/engine.go:179 +0xb8\ngithub.com/prometheus/prometheus/rules.RecordingRule.Eval(0xc81dd88360, 0x1a, 0x2881340, 0xc81dcc0690, 0x297a078, 0x0, 0x0, 0x7fa1e0167408, 0xc42041ee40, 0xbe71f91ea4a26be6, ...)\n\t/go/src/github.com/prometheus/prometheus/rules/recording.go:61 +0x13b\ngithub.com/prometheus/prometheus/rules.(*RecordingRule).Eval(0xc5bdfe3f40, 0x7fa1e0167408, 0xc42041ee40, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60, 0xc420378ab0, 0xc4202f6a00, 0x0, 0x17, ...)\n\t<autogenerated>:1 +0xd0\ngithub.com/prometheus/prometheus/rules.(*Group).Eval.func1(0x1ba7e62, 0x9, 0xc4c5eedd80, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60, 0x1, 0x288bfc0, 0xc5bdfe3f40)\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:307 +0x194\ngithub.com/prometheus/prometheus/rules.(*Group).Eval(0xc4c5eedd80, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60)\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:375 +0xbf\ngithub.com/prometheus/prometheus/rules.(*Group).run.func1()\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:182 +0x81\ngithub.com/prometheus/prometheus/rules.(*Group).run(0xc4c5eedd80)\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:207 +0x23c\ngithub.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1.1(0xc42041ee80, 0xc4c5eedd80)\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:510 +0x46\ncreated by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1\n\t/go/src/github.com/prometheus/prometheus/rules/manager.go:505 +0x56\n"
2017-10-18_15:02:27.73794 level=info ts=2017-10-18T15:02:27.737823426Z caller=head.go:335 component=tsdb msg="head GC completed" duration=8.436346592s
2017-10-18_15:02:53.77456 level=info ts=2017-10-18T15:02:53.774448467Z caller=head.go:344 component=tsdb msg="WAL truncation completed" duration=26.036539329s

Sorry for the line breaks in the stack dump being represented as \n. That's apparently a feature of the new logging library.

fabxc commented Oct 19, 2017

One line per thing that happened sounds right to me. Expanded stack trace:

goroutine 2772 [running]:
github.com/prometheus/prometheus/promql.(*evaluator).recover(0xc9dc5466c0, 0xc914d093b0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:590 +0xcd
panic(0x1925ac0, 0x293e830)
	/usr/local/go/src/runtime/panic.go:491 +0x283
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memSeries).iterator(0xc5287ed550, 0x0, 0x0, 0xc914d088c0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:1275 +0x138
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*safeChunk).Iterator(0xc49d7b76a0, 0x0, 0x18316a0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:787 +0x50
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeriesIterator).Seek(0xc85284ee40, 0x15f2f93d5f6, 0xc4204a2e00)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:734 +0xa0
github.com/prometheus/prometheus/storage.(*mergeIterator).Seek(0xc53fc5e390, 0x15f2f93d5f6, 0xc6cdf91e00)
	/go/src/github.com/prometheus/prometheus/storage/fanout.go:372 +0xcb
github.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Seek(0xc53fc5e3c0, 0x15f3001b2f6, 0x6)
	/go/src/github.com/prometheus/prometheus/storage/buffer.go:63 +0xca
github.com/prometheus/prometheus/promql.(*evaluator).matrixSelector(0xc9dc5466c0, 0xc59e89f730, 0x1bb7536, 0x15, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:842 +0x5b2
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x2881400, 0xc59e89f730, 0x7f56e4dfc568, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:717 +0x536
github.com/prometheus/prometheus/promql.(*evaluator).evalMatrix(0xc9dc5466c0, 0x2881400, 0xc59e89f730, 0xc9ada82a10, 0x197c040, 0x198f380)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:638 +0x49
github.com/prometheus/prometheus/promql.funcDeriv(0xc9dc5466c0, 0xc8c19bdff0, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/functions.go:644 +0x6d
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x28813c0, 0xc9e64f6080, 0x0, 0xc74a8fac60)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:714 +0x11c4
github.com/prometheus/prometheus/promql.(*evaluator).evalVector(0xc9dc5466c0, 0x28813c0, 0xc9e64f6080, 0x15, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:611 +0x49
github.com/prometheus/prometheus/promql.(*evaluator).eval(0xc9dc5466c0, 0x2881340, 0xc67eaad950, 0xc914d093b0, 0xc9dc5466c0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:681 +0x422
github.com/prometheus/prometheus/promql.(*evaluator).Eval(0xc9dc5466c0, 0x2881340, 0xc67eaad950, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:668 +0xc1
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc420378ab0, 0x288bcc0, 0xc614c676e0, 0xc899de0c80, 0xc67eaad9a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:385 +0x3ae
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc420378ab0, 0x288bcc0, 0xc614c676e0, 0xc899de0c80, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:341 +0x317
github.com/prometheus/prometheus/promql.(*query).Exec(0xc899de0c80, 0x288bc40, 0xc42041ee40, 0x288bc40)
	/go/src/github.com/prometheus/prometheus/promql/engine.go:179 +0xb8
github.com/prometheus/prometheus/rules.RecordingRule.Eval(0xc81dd88360, 0x1a, 0x2881340, 0xc81dcc0690, 0x297a078, 0x0, 0x0, 0x7fa1e0167408, 0xc42041ee40, 0xbe71f91ea4a26be6, ...)
	/go/src/github.com/prometheus/prometheus/rules/recording.go:61 +0x13b
github.com/prometheus/prometheus/rules.(*RecordingRule).Eval(0xc5bdfe3f40, 0x7fa1e0167408, 0xc42041ee40, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60, 0xc420378ab0, 0xc4202f6a00, 0x0, 0x17, ...)
	<autogenerated>:1 +0xd0
github.com/prometheus/prometheus/rules.(*Group).Eval.func1(0x1ba7e62, 0x9, 0xc4c5eedd80, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60, 0x1, 0x288bfc0, 0xc5bdfe3f40)
	/go/src/github.com/prometheus/prometheus/rules/manager.go:307 +0x194
github.com/prometheus/prometheus/rules.(*Group).Eval(0xc4c5eedd80, 0xbe71f91ea4a26be6, 0x60543b053bd, 0x2956e60)
	/go/src/github.com/prometheus/prometheus/rules/manager.go:375 +0xbf
github.com/prometheus/prometheus/rules.(*Group).run.func1()
	/go/src/github.com/prometheus/prometheus/rules/manager.go:182 +0x81
github.com/prometheus/prometheus/rules.(*Group).run(0xc4c5eedd80)
	/go/src/github.com/prometheus/prometheus/rules/manager.go:207 +0x23c
github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1.1(0xc42041ee80, 0xc4c5eedd80)
	/go/src/github.com/prometheus/prometheus/rules/manager.go:510 +0x46
created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1
	/go/src/github.com/prometheus/prometheus/rules/manager.go:505 +0x56
fabxc commented Oct 19, 2017

Is the problem persistent? I.e., after it OOMs and restarts, does it OOM again right away and still not compact, etc.?

And that panic only occurred once, on one server, but the problem still occurred on other servers where nothing like that happened?

fabxc commented Oct 19, 2017

Goroutine dumps which collapse repeated stack traces would be super helpful. These are a bit hard to get an overview from.

beorn7 commented Oct 19, 2017

We have only gone through multiple iterations on the node-exporter-scraping Prometheus. There, the panic reported above seems to happen just seconds after a restart.

Will play with the other stuck one soon.

grobie commented Oct 19, 2017

Here are all the panics of that node @beorn7 mentioned: https://gist.github.com/03f825b7684aa0c9a413f63597e37d70

beorn7 commented Oct 19, 2017

Just stopped the one that never OOM'd (left panel); it got stuck during shutdown. Collapsed goroutine dump below.

prom2-collapsed-5ab8834befbd92241a88976c790ace7543edcd59-stuck-on-shutdown.20171019.txt

beorn7 commented Oct 19, 2017

Clarification about “just seconds after a restart” from above: That's seconds after the server started to do real work again, i.e. after the WAL got replayed (if I'm saying that right).

fabxc commented Oct 19, 2017

Intermediate status: it appears that the panic is the common trigger and causes a lock not to be released. This subsequently blocks several other code paths, including compaction and shutdown (illustrative sketch below).
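
To make that failure mode concrete, a minimal, hypothetical Go sketch (not the actual tsdb code): if the unlock is not deferred, a panic that is recovered further up the stack leaves the lock held, so readers keep working while writers such as compaction and shutdown block forever.

package main

import "sync"

type store struct {
	mtx sync.RWMutex
}

func doWork() {} // stand-in for the code that may panic

// queryUnsafe releases the read lock only on the happy path. If doWork
// panics and the panic is recovered by a caller, the RUnlock is never
// reached, and every later writer (compaction, shutdown) blocks forever.
func (s *store) queryUnsafe() {
	s.mtx.RLock()
	doWork()
	s.mtx.RUnlock()
}

// querySafe releases the lock even if doWork panics.
func (s *store) querySafe() {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	doWork()
}

func main() {
	s := &store{}
	s.querySafe()
}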

fabxc commented Oct 20, 2017

Presumed intermediate fix in prometheus/tsdb#181

smarterclayton commented Oct 22, 2017

Not sure if this is related: on a server recently updated to rc.1 with high series count, high churn, and high volume, Prometheus is now in a crash loop on startup where it:

  1. starts, goes through normal (~1 core) startup for a few seconds
  2. goes into a longer period of using max machine CPU (8 cores) where virtual memory steadily grows
  3. process is OOM killed and then restarts

Size on disk is 97GB, which is only slightly larger than it was before, when Prometheus was using around 10-20GB of memory. Total machine memory is 64GB; it looks like it was OOM-killed at around 57GB.

prom.profile.tar.gz

go tool pprof /tmp/prometheus /tmp/cpu
File: prometheus
Type: cpu
Time: Oct 22, 2017 at 1:35pm (EDT)
Duration: 30.01s, Total samples = 3.85mins (769.30%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 221.17s, 95.81% of 230.84s total
Dropped 142 nodes (cum <= 1.15s)
Showing top 30 nodes out of 57
      flat  flat%   sum%        cum   cum%
    64.43s 27.91% 27.91%    102.74s 44.51%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memSeries).append /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go
    52.79s 22.87% 50.78%    203.97s 88.36%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*Head).processWALSamples /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go
    31.85s 13.80% 64.58%     33.20s 14.38%  runtime.mapaccess1_fast64 /usr/lib/golang/src/runtime/hashmap_fast.go
    11.85s  5.13% 69.71%     12.24s  5.30%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*bstream).writeBits /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/bstream.go
    10.67s  4.62% 74.33%     10.67s  4.62%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*XORChunk).NumSamples /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/xor.go
     9.10s  3.94% 78.27%      9.10s  3.94%  sync.(*RWMutex).RLock /usr/lib/golang/src/sync/rwmutex.go
     5.48s  2.37% 80.65%     20.70s  8.97%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*xorAppender).Append /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/xor.go
     3.90s  1.69% 82.34%     47.71s 20.67%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*stripeSeries).getByID /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go
     3.74s  1.62% 83.96%     12.55s  5.44%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*walReader).decodeSamples /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go
     3.54s  1.53% 85.49%      3.54s  1.53%  encoding/binary.Uvarint /usr/lib/golang/src/encoding/binary/varint.go
     3.44s  1.49% 86.98%      8.77s  3.80%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*decbuf).varint64 /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/encoding_helpers.go
     2.85s  1.23% 88.22%      6.86s  2.97%  runtime.mallocgc /usr/lib/golang/src/runtime/malloc.go
     2.50s  1.08% 89.30%      2.50s  1.08%  runtime.memmove /usr/lib/golang/src/runtime/memmove_amd64.s
     2.28s  0.99% 90.29%      5.56s  2.41%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*xorAppender).writeVDelta /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/xor.go
        2s  0.87% 91.15%      5.33s  2.31%  encoding/binary.Varint /usr/lib/golang/src/encoding/binary/varint.go
     1.80s  0.78% 91.93%      1.80s  0.78%  sync.(*RWMutex).RUnlock /usr/lib/golang/src/sync/rwmutex.go
     1.69s  0.73% 92.67%      1.69s  0.73%  runtime.memclrNoHeapPointers /usr/lib/golang/src/runtime/memclr_amd64.s
     1.55s  0.67% 93.34%      3.57s  1.55%  runtime.mapassign /usr/lib/golang/src/runtime/hashmap.go
     1.35s  0.58% 93.92%      1.35s  0.58%  runtime.aeshash64 /usr/lib/golang/src/runtime/asm_amd64.s
     1.25s  0.54% 94.46%      1.32s  0.57%  syscall.Syscall /usr/lib/golang/src/syscall/asm_linux_amd64.s
     0.63s  0.27% 94.74%      6.94s  3.01%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memSeries).cut /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go
     0.63s  0.27% 95.01%      2.06s  0.89%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*XORChunk).Appender /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/xor.go
     0.50s  0.22% 95.23%     17.93s  7.77%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*walReader).Read /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go
     0.27s  0.12% 95.34%      6.55s  2.84%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*Head).getOrCreateWithID /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go
     0.27s  0.12% 95.46%      2.47s  1.07%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*memPostings).addFor /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/postings.go
     0.26s  0.11% 95.57%      3.29s  1.43%  runtime.newobject /usr/lib/golang/src/runtime/malloc.go
     0.17s 0.074% 95.65%      2.09s  0.91%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*decbuf).uvarintStr /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/encoding_helpers.go
     0.14s 0.061% 95.71%      1.35s  0.58%  runtime.slicebytetostring /usr/lib/golang/src/runtime/string.go
     0.13s 0.056% 95.76%      3.21s  1.39%  github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*walReader).decodeSeries /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.go
     0.11s 0.048% 95.81%      1.82s  0.79%  runtime.(*mcentral).cacheSpan /usr/lib/golang/src/runtime/mcentral.go
fabxc commented Oct 22, 2017

A few questions @smarterclayton:

  • Was it running fine before attempting the upgrade or did it OOM before that already?
  • Did you encounter the deadlock on shutdown?
  • Does it OOM while still in the phase of using all cores?
  • How many files are in the wal directory?

Memory profiles shortly before OOMing and logs would be helpful as well.

smarterclayton commented Oct 23, 2017

  • The previous version was beta.5, where it was crashing due to the label panics.
  • No, it was in a continuous OOM-kill loop and I did not get a chance to stop it myself.
  • At least once I observed it return to normal CPU use and the web frontend start, but it was killed almost immediately afterwards (10-20 seconds of normal CPU load after 5-10 minutes of continuous CPU use).
  • I did not capture that; I had to clear the WAL files.

Will try to capture again.

fabxc commented Oct 23, 2017

If this is more of a test instance and some downtime is okay, it would be great to test the restart cycle a few times. Once it has been running for 3-4h that should be long enough to reproduce the issue.

From the CPU usage it wasn't doing anything it shouldn't be doing AFAICS.

beorn7 commented Oct 23, 2017

We have been running three servers with commit be5422a plus commit ea817e169b7f7bc17206285e6bac27fea69f2f2d from TSDB un-cleanly compiled into it (as per @grobie). I don't know whether the issues we are seeing are still the same as the original one reported here, whether they have been created by the fix, or whether they are completely independent.

One server (the one with fairly high load, many series, and a medium churn rate, monitoring mostly kubelets and such) seems to have gone directly into "I don't do compaction" mode until it OOM'd after about 3d. Then it went into a much faster crash loop, about once every 35m. I assume it is now in a state where it tries to replay the WAL but cannot succeed with the RAM available (128GiB). I guess taking any forensic information from that server is fairly useless. (The original state, where it was just not doing any compactions, was probably interesting, but we are not even getting there now.)

The 2nd server (scraping many node_exporters) behaved normally for about 2.5d, then spat out the mysterious log line 2017-10-22_22:52:48.76060 fatal: morestack on g0 and has logged nothing since. The process is still around, taking RAM (RSS 24GiB of the 64GiB available) but using zero CPU. It doesn't talk to me via HTTP (it still has the server socket open, just not serving anything), so I cannot get a goroutine dump. If I attach an strace -f, I get 63 OS threads, all in something like futex(0x295df70, FUTEX_WAIT, 0, NULL. Smells like a deadlock.

The 3rd server (high series churn but otherwise moderate load) just took more and more memory. It stopped logging after a few hours of uptime. It is not reachable via HTTP now, but its memory usage still grows slowly but steadily. It shows occasional spikes in CPU usage, but more like every 90m rather than the typical 120m. strace -f shows plenty of activity, 52 threads, but some are also in FUTEX_WAIT state.

A weirdness of the logs:

2017-10-20_12:34:26.50510 level=info ts=2017-10-20T12:34:26.505003455Z caller=main.go:216 msg="Starting prometheus" version="(version=2.0.0-rc.1.sc0, branch=master, revision=be5422a166c4484d6c3d9c3fefbd370068a4f786)"
2017-10-20_12:34:26.50513 level=info ts=2017-10-20T12:34:26.50506967Z caller=main.go:217 build_context="(go=go1.9.1, user=grobie@grobook, date=20171020-12:19:47)"
2017-10-20_12:34:26.50515 level=info ts=2017-10-20T12:34:26.505093659Z caller=main.go:218 host_details="(Linux 4.4.10+soundcloud #1 SMP Thu Jun 16 15:17:20 UTC 2016 x86_64 ip-10-33-62-43.k16.ams5.s-cloud.net (none))"
2017-10-20_12:34:26.50813 level=info ts=2017-10-20T12:34:26.508023894Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
2017-10-20_12:34:26.50843 level=info ts=2017-10-20T12:34:26.507959854Z caller=main.go:315 msg="Starting TSDB"
2017-10-20_12:34:26.50847 level=info ts=2017-10-20T12:34:26.508000879Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
2017-10-20_12:34:57.06135 level=info ts=2017-10-20T12:34:57.061206002Z caller=main.go:327 msg="TSDB started"
2017-10-20_12:34:57.06140 level=info ts=2017-10-20T12:34:57.061347514Z caller=main.go:395 msg="Loading configuration file" filename=/srv/prometheus/prometheus.yml
2017-10-20_12:34:57.10900 level=info ts=2017-10-20T12:34:57.108960963Z caller=main.go:372 msg="Server is ready to receive requests."
2017-10-20_12:49:56.37354 level=warn ts=2017-10-20T12:49:56.373402319Z caller=main.go:378 msg="Received SIGTERM, exiting gracefully..."
2017-10-20_12:49:56.37357 level=info ts=2017-10-20T12:49:56.373513461Z caller=main.go:385 msg="See you next time!"
2017-10-20_16:17:13.38348 level=error ts=2017-10-20T16:17:13.383353045Z caller=main.go:212 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:205: Failed to watch *v1.Pod: Get https://api.k2.k8s.s-cloud.net/api/v1/pods?resourceVersion=1148455858&timeoutSeconds=379&watch=true: net/http: TLS handshake timeout"
2017-10-20_16:17:13.38355 level=error ts=2017-10-20T16:17:13.38345761Z caller=main.go:212 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:205: Failed to watch *v1.Pod: Get https://api.k2.k8s.s-cloud.net/api/v1/pods?resourceVersion=1148455858&timeoutSeconds=455&watch=true: net/http: TLS handshake timeout"

Thus, that 3rd server might have received a graceful shutdown signal from Chef or something and then simply got stuck in shutdown, perhaps.

However, the pattern is different for each of the three.

I assume I cannot really get anything useful out of the servers as they are. If I don't get any other requests, I'll wipe the storage of the 1st server (which currently runs out of RAM while replaying the WAL, presumably) and simply restart the other two so that we can perhaps get a goroutine dump while something weird is happening.

beorn7 commented Oct 23, 2017

Attached goroutine dump of server 1, the one presumably crashing during WAL replay.
prom2-collapsed-crashlooping-while-replaying.20171023.txt

beorn7 commented Oct 23, 2017

Attached a heap profile of server 1 just before OOMing again.
prom2-heapprofile-crashlooping-while-replaying.20171023.gz

beorn7 commented Oct 23, 2017

WAL on server 1 was 421GiB in size.

fabxc commented Oct 23, 2017

Thanks, overall the heap profile doesn't show any memory in any unexpected code path. Though ~40% of space was allocated here.

This can practically only come from the buffers we are using for the samples read from the WAL. It shouldn't amount to THAT much memory though, even if the queue of capacity 300 was filled for every core.
The diagram shows a lot of allocs being 8 to 16MB in size. Given the slices have 32-byte elements, this would mean there are some scrapes in the WAL with ~500k series. I might be interpreting something wrongly here.
We could probably do something to not OOM nearly that quickly, but ultimately that's not the fundamental issue here.
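
A back-of-the-envelope check of that interpretation (a hypothetical sketch, not taken from the profile itself): a 16 MiB slice of 32-byte elements holds roughly half a million entries.

package main

import "fmt"

func main() {
	const (
		allocBytes   = 16 << 20 // 16 MiB, upper end of the observed allocation sizes
		bytesPerElem = 32       // assumed size of one buffered WAL sample record
	)
	fmt.Println(allocBytes / bytesPerElem) // 524288, i.e. ~500k samples per batch
}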

fabxc commented Oct 23, 2017

@beorn7 Not sure how sample batches of that size ended up in the WAL to begin with, but this should mitigate the problem: prometheus/tsdb#185

(This is of course still unrelated to the underlying issue of why compactions didn't run.)

beorn7 commented Oct 23, 2017

All three servers restarted. As of now (three hours in) they are happily ingesting and serving queries, but none of them seems to run compactions. Logs are unsuspicious (in particular: no panics anymore), with the exception of the following repeated error message on the kubelet-scraping Prometheus server:

2017-10-23_16:50:14.79633 level=error ts=2017-10-23T16:50:14.796205069Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: iterate compaction set: get series 2239667: not found"
2017-10-23_16:51:14.79775 level=info ts=2017-10-23T16:51:14.797608653Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508565600000 maxt=1508572800000
2017-10-23_16:52:13.09933 level=error ts=2017-10-23T16:52:13.0992296Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: iterate compaction set: get series 2239667: not found"
2017-10-23_16:53:13.10052 level=info ts=2017-10-23T16:53:13.10039082Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508565600000 maxt=1508572800000
2017-10-23_16:54:09.76711 level=error ts=2017-10-23T16:54:09.766984648Z caller=db.go:260 component=tsdb msg="compaction failed" err="persist head block: write compaction: iterate compaction set: get series 2239667: not found"
2017-10-23_16:55:09.76848 level=info ts=2017-10-23T16:55:09.768339934Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508565600000 maxt=1508572800000

Collapsed goroutine dumps (let me know what else you need):
prom2-collapsed-no-compaction-1.20171023.txt
prom2-collapsed-no-compaction-2.20171023.txt
prom2-collapsed-no-compaction-3.20171023.txt

fabxc commented Oct 23, 2017

The first goroutine dump actually shows a compaction running. Blocks are 2h long, but we only compact a 2h slice once we are 50% into the next one. The 50% is our "appendable" window, i.e. the maximum delta between the most recently added timestamp and the minimum timestamp we could still write.
Thus, at the 3h mark, compactions should be starting just about now (see the sketch below).
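
A rough Go sketch of that rule (a hypothetical helper, not the tsdb implementation): the head only becomes compactable once its time span exceeds 1.5 block lengths, which for 2h blocks means roughly 3h after a fresh start.

package main

import (
	"fmt"
	"time"
)

const blockDuration = 2 * time.Hour

// headCompactable reports whether the in-memory head covering [minT, maxT]
// is ready to be persisted: the newest sample must be 50% of a block length
// past the end of the oldest 2h slice, so that slice is no longer appendable.
func headCompactable(minT, maxT time.Time) bool {
	return maxT.Sub(minT) >= blockDuration+blockDuration/2
}

func main() {
	start := time.Now()
	fmt.Println(headCompactable(start, start.Add(2*time.Hour))) // false: still appendable
	fmt.Println(headCompactable(start, start.Add(3*time.Hour))) // true: first compaction due
}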

As suggested on IRC, setting --storage.tsdb.min-block-duration to a value smaller than 2h can be helpful here for debugging purposes.

beorn7 commented Oct 23, 2017

OK, got it.

WRT the flag change: I am doing this in parallel with my normal work, so I'd better leave the config as is and just wait longer instead of starting to reconfigure those servers.

fabxc commented Oct 23, 2017

The first one is obviously failing to compact and keeps retrying. It will thus, in all likelihood, eventually OOM.
The others should have shown log lines for a successful compaction by now.

beorn7 commented Oct 23, 2017

The 1st keeps showing the same error message and doesn't seem to compact (judging by RAM growth).

The 2nd hasn't logged anything in almost 3 hours and doesn't seem to compact either. Last log lines:

2017-10-23_14:56:38.75393 level=info ts=2017-10-23T14:56:38.753868588Z caller=compact.go:361 component=tsdb msg="compact blocks" count=3 mint=1508673600000 maxt=1508738400000
2017-10-23_15:00:00.15822 level=info ts=2017-10-23T15:00:00.158085002Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508760000000 maxt=1508767200000
2017-10-23_15:00:25.83620 level=info ts=2017-10-23T15:00:25.836074725Z caller=head.go:335 component=tsdb msg="head GC completed" duration=1.330877357s
2017-10-23_15:00:32.29593 level=info ts=2017-10-23T15:00:32.295819717Z caller=head.go:344 component=tsdb msg="WAL truncation completed" duration=6.459654474s

The 3rd has logged the following, and its RAM usage dropped:

2017-10-23_17:00:00.14952 level=info ts=2017-10-23T17:00:00.149389384Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1508767200000 maxt=1508774400000
2017-10-23_17:01:16.04761 level=info ts=2017-10-23T17:01:16.047501937Z caller=head.go:335 component=tsdb msg="head GC completed" duration=3.453583699s
2017-10-23_17:01:32.00674 level=info ts=2017-10-23T17:01:32.00660289Z caller=head.go:344 component=tsdb msg="WAL truncation completed" duration=15.959019996s

Goroutine dumps for 2nd and 3rd:
prom2-collapsed-no-compaction-2.20171023-2.txt
prom2-collapsed-no-compaction-3.20171023-2.txt

fabxc commented Oct 23, 2017

Thanks, it looks like no. 2 got stuck when reloading after compaction. Do I see it correctly that it's still processing queries, though?

It is waiting on a WaitGroup here. This was added somewhat recently to ensure we do not unmap a block that pending readers still have access to. Otherwise a segfault may happen.

There was nothing suspicious in the log otherwise that could indicate a query being improperly terminated? I guess this boils down to carefully walking the code paths and checking where we don't call pendingReaders.Done() for a previous pendingReaders.Add(1); a sketch of the pattern follows below.
Extra eyes appreciated.
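
For reference, a minimal sketch of the invariant being audited, with hypothetical names rather than the actual tsdb types: every Add(1) taken when a reader is handed out must be matched by exactly one Done() on its Close path, otherwise the Wait() that guards unmapping the block never returns.

package main

import "sync"

type block struct {
	pendingReaders sync.WaitGroup
}

type indexReader struct {
	b    *block
	once sync.Once
}

// Index registers a pending reader before handing it out.
func (b *block) Index() *indexReader {
	b.pendingReaders.Add(1)
	return &indexReader{b: b}
}

// Close releases the registration exactly once. Any code path that obtains
// a reader but never calls Close leaks an Add(1) and deadlocks the Wait below.
func (r *indexReader) Close() {
	r.once.Do(r.b.pendingReaders.Done)
}

func main() {
	b := &block{}
	r := b.Index()
	// ... use the reader ...
	r.Close()
	b.pendingReaders.Wait() // returns only once every reader has been closed
}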

Looks like the third one is doing fine for now then?

fabxc commented Oct 23, 2017

Okay, so the pendingReader stuff confused me as I added it quite carefully and tested it.
Things are getting hairy now.

@beorn7 you were running my patch that was supposed to fix the panic which caused the deadlock. That patch seems to be working as intended.
But it was built against a commit on master that does not exist in rc.1. Specifically, it contains this section here:

// StartTime implements the Storage interface.
func (a adapter) StartTime() (int64, error) {
	startTime := int64(model.Latest)
	var indexr tsdb.IndexReader
	if len(a.db.Blocks()) > 0 {
		var err error
		indexr, err = a.db.Blocks()[0].Index()
		if err != nil {
			return startTime, err
		}
	} else {
		var err error
		indexr, err = a.db.Head().Index()
		if err != nil {
			return startTime, err
		}
	}
	joblabel := "job"
	tpls, err := indexr.LabelValues(joblabel)
	if err != nil {
		return startTime, err
	}
	for i := 0; i < tpls.Len(); i++ {
		vals, err := tpls.At(i)
		if err != nil {
			continue
		}
		for _, v := range vals {
			p, err := indexr.Postings(joblabel, v)
			if err != nil {
				continue
			}
			if p.Next() {
				var lset tsdbLabels.Labels
				var chks []tsdb.ChunkMeta
				indexr.Series(p.At(), &lset, &chks)
				if startTime > chks[0].MinTime {
					startTime = chks[0].MinTime
				}
			}
		}
	}
	// Add a safety margin as it may take a few minutes for everything to spin up.
	return startTime + a.startTimeMargin, nil
}
This code does not follow the rule of closing any open index readers.
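
For illustration, a hedged sketch of the shape of such a fix, reusing only identifiers visible in the snippet above and assuming the index reader exposes Close() per the convention described below (this is not the actual patch; that lives in the closeidxr branch linked further down):

	// ... after the if/else above has assigned indexr ...
	defer indexr.Close() // release the reader on every return path, including errors

	joblabel := "job"
	// ... rest of StartTime unchanged ...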

This is my fault because a) I didn't review the related PR, and b) the consistent requirement to Close() all instantiated {Index,Chunk,Tombstone}Readers was only introduced with the pendingReaders additions, which happened ~1 week before the code causing trouble here.

However, this again raises concerns for me about all the remote-read code being run even if no remote read is configured. Not only does it cause notably increased query latency, but, as in this case, it also widens the impact of potential regressions.
@tomwilkie @juliusv @brian-brazil

The good thing is that this is easy to fix, my original patch seems to work, and we got a few more improvements/fixes in prometheus/tsdb out of it.

@beorn7 @grobie I pushed a branch for debug purposes that contains current fixes from prometheus/tsdb and a fix for the ultimate issue found here: https://github.com/prometheus/prometheus/tree/closeidxr

beorn7 commented Oct 24, 2017

Quick update: We are now running a binary built from commit 48b303b on our three test servers. Sorry for the delay. Plenty of distractions… I'll report back here once we have some results.

beorn7 commented Oct 25, 2017

Update: All three servers have been running without any issues for the last ~24hrs.

fabxc commented Oct 25, 2017

Thanks. Should we include those fixes plus prometheus/tsdb#187 and do an rc2?
Feedback on the latter would be good, but that might take another 48h+ of RTT. I think there's enough reason to believe it's the right fix.

beorn7 commented Oct 25, 2017

I think an rc2 is great to get more testers without risking the backlash we'd get from a premature final release.

fabxc commented Oct 25, 2017

Yes, absolutely. The question to me was more whether we just trust ourselves and do an rc2, or whether we should validate that particular fix first, given the bug's severity.

beorn7 commented Oct 25, 2017

"that particular fix" == prometheus/tsdb#187 ?
I can run a binary with that PR merged in. Is that what you would like?

fabxc commented Oct 25, 2017

Well, you didn't hit that problem before, so that might be of limited use as confirmation.
I guess just letting rc.2 loose makes sense in the end. A basic review of prometheus/tsdb#187 would be great, though.

beorn7 commented Oct 25, 2017

I really don't feel qualified at this moment to review tsdb code. But that shouldn't discourage others…

beorn7 commented Oct 26, 2017

Some things got conflated here. But I'm reasonably confident the original issue is fixed.

@fabxc close this or not, whatever helps your tracking more.

fabxc commented Oct 26, 2017

Yeah, looks like the fixes were generally successful. As you said, it got quite conflated, so simply filing new issues if something comes up again seems best.

Thanks everyone for providing all the info!

fabxc closed this Oct 26, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
