Situational trigger of a memory leak in 2.0.0-rc.1 #3316
Comments
beorn7 added the component/local storage and kind/bug labels on Oct 19, 2017
Memory usage pattern. To the left, the server with many time series. It works fine, but at some point the oscillation stops and memory just grows forever. (Not crashed yet; this has happened for the first time.) The middle panel is the high-series-churn server. It is very relaxed (but no deploys have happened in the two days shown, so no churn after all). The right panel is a mid-to-high series count, but very steady (scrapes many node exporters), with decent retention. This one hits the issue quite regularly and is essentially in a slow crash loop. Restarts take long, which again hints towards a giant WAL. More to follow...
See goroutine dump of a server stuck in shutdown (NB: this is not rc.1 but slightly before; no relevant changes, I believe, but just in case: it is at commit 721050c). prom2-721050c6cbbcb7061cb57c0783886e78b78710f3-deadlocked-on-shutdown.20171118.txt
Goroutine dump of the server to the left. prom2-5ab8834befbd92241a88976c790ace7543edcd59-maybe-compaction-stuck.20171119.txt
At the time the problem started, the server to the left had a (non-crashing) panic. Log excerpt (sorry for the line breaks in the stack dump being represented as |):
One line per thing that happened sounds right to me. Expanded stack trace:
Is this problem persistent, i.e. after it OOMs and restarts, does it OOM right away again and still not compact, etc.? The panic only occurred once on one server, but the problem still occurred on other servers where nothing like that happened?
Goroutine dumps which collapse repeated stack traces would be super helpful. These ones are a bit hard to get an overview from.
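For reference, the goroutine profile served by Go's net/http/pprof handler already aggregates identical stacks when queried with debug=1, prefixing each group with a count. A minimal sketch of fetching it from a running Prometheus (the localhost:9090 address is an assumption):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=1 groups goroutines that share the same stack and prints a
	// count per group, which is much easier to skim than the full
	// per-goroutine dump produced by debug=2.
	resp, err := http.Get("http://localhost:9090/debug/pprof/goroutine?debug=1")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // write the collapsed dump to stdout
}
```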
Collapsed stack traces:
So far we have only gone through multiple iterations on the node-exporter-scraping Prometheus. There, the panic reported above seems to happen just seconds after a restart. Will play with the other stuck one soon.
Here are all the panics of that node @beorn7 mentioned: https://gist.github.com/03f825b7684aa0c9a413f63597e37d70
Just stopped the one that never OOM'd (left panel); it got stuck during shutdown. Goroutine dump (collapsed) below. prom2-collapsed-5ab8834befbd92241a88976c790ace7543edcd59-stuck-on-shutdown.20171019.txt
Clarification about “just seconds after a restart” from above: that's seconds after the server started to do real work again, i.e. after the WAL got replayed (if I'm saying that right).
Intermediate status: it appears that the panic is the common trigger and causes a lock not to be released. This in turn blocks several other code paths, including compaction and shutdown.
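A minimal sketch of that failure mode (the names here are hypothetical, not the actual tsdb code): if a panic unwinds past a mutex that is only unlocked at the end of the function, the lock is never released, and every later caller that needs it blocks forever.

```go
package tsdbsketch

import "sync"

type sample struct {
	t int64
	v float64
}

func process(s sample) { /* may panic on unexpected input */ }

type head struct {
	mu sync.Mutex
}

// Buggy: if process panics, mu is never unlocked. Every subsequent
// caller that needs mu (appends, compaction, shutdown) then blocks
// forever, even if the panic itself is recovered higher up the stack.
func (h *head) appendBuggy(s sample) {
	h.mu.Lock()
	process(s)
	h.mu.Unlock()
}

// Safer: the deferred Unlock runs even while a panic unwinds, so a
// recovered panic does not wedge the rest of the process.
func (h *head) appendSafe(s sample) {
	h.mu.Lock()
	defer h.mu.Unlock()
	process(s)
}
```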
Presumed intermediate fix in prometheus/tsdb#181 |
smarterclayton commented on Oct 22, 2017:
Not sure if this is related: on a server recently updated to rc1 with high series count, high churn, and high volume, Prometheus is now in a crash loop on load where it:
Size on disk is 97GB, which is only slightly larger than it was before, when Prometheus was using around 10-20GB of memory. Total machine memory was 64GB; it looks like it was OOM-killed at around 57GB.
A few questions, @smarterclayton:
Memory profiles shortly before OOMing and logs would be helpful as well.
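If catching a profile right before the OOM is hard to time by hand, a small poller along these lines can help (a sketch; the URL, interval, and file naming are assumptions): it keeps writing timestamped heap profiles, so the newest file left on disk after the crash is from shortly before the OOM.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// Periodically snapshot the heap profile of a running Prometheus so the
// most recent file on disk approximates "shortly before OOMing".
func main() {
	const url = "http://localhost:9090/debug/pprof/heap" // assumed address
	for {
		if err := snapshot(url); err != nil {
			fmt.Fprintln(os.Stderr, "snapshot failed:", err)
		}
		time.Sleep(5 * time.Minute)
	}
}

func snapshot(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create("heap-" + time.Now().Format("20060102-150405") + ".pprof")
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```

The resulting files can then be inspected with go tool pprof.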
smarterclayton commented on Oct 23, 2017:
Will try to capture again.
If this is more of a test instance and some downtime is okay, it would be great to test the restart cycle a few times. Once it has been running for 3-4h, that should be long enough to reproduce the issue. From the CPU usage, it wasn't doing anything it shouldn't be doing, AFAICS.
We have been running three servers with commit be5422a plus commit ea817e169b7f7bc17206285e6bac27fea69f2f2d from TSDB un-cleanly compiled into it (as per @grobie). I don't know if the issues we are seeing are still the same issue as the one originally reported here, or if they have been created by the fix, or if they are completely independent.

The 1st server (fairly high load, many series, medium churn rate, monitoring mostly kubelets and such) seems to have gone directly into "I don't do compaction" mode until it OOM'd after about 3d. Then it went into a much faster crash loop, about once every 35m. I assume it is now in a state where it tries to replay the WAL but cannot succeed with the RAM available (128GiB). I guess taking any forensic information from that server is fairly useless. (The original state where it was just not doing any compactions was probably interesting, but we are not even getting there now.)

The 2nd server (scraping many node_exporters) behaved normally for about 2.5d and then spat out that mysterious log line:

The 3rd server (high series churn but otherwise moderate load) just took more and more memory. It stopped logging after a few hours of uptime. It is not reachable via HTTP now, but it still grows its memory usage slowly but steadily. It shows occasional spikes in CPU usage, though more like every 90m than the typical 120m. A weirdness of the logs:

Thus, that 3rd server might have received a graceful shutdown signal from Chef or something and then simply got stuck in shutdown, perhaps. However, the pattern is different for each of the three. I assume I cannot really get anything useful out of the servers as they are. If I don't get any other requests, I'll wipe the storage of the 1st server (which currently runs out of RAM when replaying the WAL, presumably) and simply restart the other two so that we can perhaps get a goroutine dump while something weird is happening.
Attached goroutine dump of server 1, the one presumably crashing during WAL replay.
Attached a heap profile of server 1 just before OOMing again.
WAL on server 1 was 421GiB in size.
Thanks. Overall, the heap profile doesn't show memory in any unexpected code path, though ~40% of the space was allocated here. That can practically only come from the buffers we use for the samples read from the WAL. It shouldn't amount to THAT much memory, though, even if the queue of capacity 300 were filled for every core.
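For intuition, a back-of-the-envelope bound on what those buffers should hold, assuming the per-core queue of capacity 300 mentioned above (everything else here, names and per-sample size included, is an assumption rather than the actual tsdb code):

```go
package walsketch

import "runtime"

// Rough upper bound on memory held by in-flight WAL-replay sample
// batches: one queue of capacity 300 per core, each slot holding a
// batch of decoded samples at roughly 24 bytes each (series ref +
// timestamp + value).
func replayBufferBytes(avgBatchLen int) int {
	const (
		queueCap       = 300
		bytesPerSample = 24
	)
	return runtime.GOMAXPROCS(0) * queueCap * avgBatchLen * bytesPerSample
}
```

With, say, 32 cores and scrape-sized batches of around 1,000 samples this comes to roughly 230MB; reaching tens of gigabytes would need average batches on the order of 100,000 samples, which is what the next comment points at.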
@beorn7 Not sure how it came to have scrape sample batches of that size in the WAL to begin with, but this should mitigate the problem: prometheus/tsdb#185. (This is still unrelated to the underlying issue of why compactions didn't run, of course.)
All three servers restarted. As of now (three hours in) they are happily ingesting and serving queries, but none of them seems to run compactions. Logs are unsuspicious (in particular: no panics anymore), with the exception of the following repeated error message on the kubelet-scraping Prometheus server:
Collapsed goroutine dumps (let me know what else you need):
The first goroutine dump actually shows a compaction running. Blocks are 2h long, but we only compact a 2h slice once we are 50% into the next one. The 50% is our "appendable" window, i.e. the maximum delta between the most recent added timestamp and the minimum timestamp we could still write. As suggested on IRC, setting
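A small worked example of that timing rule, under the stated 2h block range and 50% appendable window (a sketch with hypothetical names, not the actual tsdb code):

```go
package compactsketch

import "time"

// headCompactable reports whether the oldest block-range slice of the
// head can be persisted: only once the newest ingested timestamp is at
// least 1.5 block ranges past the head's minimum timestamp, the extra
// 0.5 being the "appendable" window described above.
func headCompactable(headMinT, headMaxT time.Time, blockRange time.Duration) bool {
	return headMaxT.Sub(headMinT) >= blockRange*3/2
}

// Example: with blockRange = 2h and headMinT at 00:00, the [00:00, 02:00)
// slice is only compacted once samples for 03:00 or later have arrived.
```

This is why a freshly started server can legitimately go a couple of hours without cutting a new block.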
OK, got it. WRT the flag change: I am doing this in parallel to my normal work, so I'd better leave the config as is and just wait longer instead of starting to reconfigure those servers.
The first one is obviously failing to compact and keeps retrying. It will thus in all likelihood OOM eventually.
The 1st keeps showing the same error message and doesn't seem to compact (judging by RAM growth). The 2nd hasn't logged anything in almost 3 hours and doesn't seem to compact either. Last log lines:
The 3rd has logged the following, and RAM usage dropped:
Goroutine dumps for 2nd and 3rd:
Thanks, looks like no. 2 got stuck when reloading after compaction. Do I see it correctly that it's still processing queries, though? It is waiting on a WaitGroup here. This was added somewhat recently to ensure we do not unmap a block that pending readers still have access to; otherwise a segfault may happen. There was nothing suspicious in the log otherwise that could indicate a query being improperly terminated? I guess this boils down to carefully walking the code paths and checking where we don't call Close().
Looks like the third one is doing fine for now then?
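A minimal sketch of the pattern being described (hypothetical names, not the actual tsdb types): each reader registers with the block's WaitGroup, and the mmapped data is only released once all pending readers are done, so a reader that never signals completion blocks the reload forever.

```go
package blocksketch

import "sync"

// block guards its mmapped data with a WaitGroup so that a reload or
// compaction never unmaps memory a reader might still dereference.
type block struct {
	pending sync.WaitGroup
	data    []byte // stands in for the mmapped index/chunk data
}

// Reader registers a pending reader; the returned done func must be
// called when the reader is finished. Forgetting to call it (e.g. a
// querier that is never closed) makes Close below wait forever.
func (b *block) Reader() (data []byte, done func()) {
	b.pending.Add(1)
	return b.data, func() { b.pending.Done() }
}

// Close waits for all pending readers before releasing the data; the
// real code would munmap the files at this point.
func (b *block) Close() {
	b.pending.Wait()
	b.data = nil
}
```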
Okay, so @beorn7, you were running my patch that was supposed to fix the panic, which caused a deadlock. That patch seems to be working as intended.

prometheus/storage/tsdb/tsdb.go, lines 152 to 201 in be5422a

This is my fault because a) I didn't review that related PR and b) the consistent behavior of having to Close() all instantiated queriers

However, this again raises some concerns with me about all the remote-read code being run even if no remote read is configured. Not only does it cause notably increased query latency, but, as in this case, it also widens the impact of potential regressions.

The good thing is that this is easy to fix, my original patch seems to work, and we got a few more improvements/fixes in prometheus/tsdb out of it.

@beorn7 @grobie I pushed a branch for debug purposes that contains the current fixes from prometheus/tsdb and a fix for the ultimate issue found here: https://github.com/prometheus/prometheus/tree/closeidxr
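A minimal sketch of the contract at issue (the interface and function names here are hypothetical stand-ins): every querier handed out over mmapped blocks must be closed on every code path, or the blocks' pending-reader counts never reach zero and the next reload deadlocks as seen above.

```go
package closesketch

// Querier is a stand-in for a storage querier over mmapped blocks.
type Querier interface {
	// Select, LabelValues, ... elided.
	Close() error
}

// Queryable is a stand-in for anything that hands out queriers.
type Queryable interface {
	Querier(mint, maxt int64) (Querier, error)
}

// Leaky: the querier is dropped without Close, so the blocks it pinned
// stay pinned and a later reload blocks on their WaitGroups.
func queryLeaky(db Queryable, mint, maxt int64) error {
	_, err := db.Querier(mint, maxt)
	return err
}

// Correct: Close is deferred right after the error check, so it runs on
// every path, including ones (like the remote-read path discussed
// above) that are exercised even when the feature is not configured.
func queryOK(db Queryable, mint, maxt int64) error {
	q, err := db.Querier(mint, maxt)
	if err != nil {
		return err
	}
	defer q.Close()
	// ... run selects against q ...
	return nil
}
```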
Quick update: We are now running a binary built from commit 48b303b on our three test servers. Sorry for the delay. Plenty of distractions… I'll report back here once we have some results. |
smarterclayton referenced this issue on Oct 24, 2017: Some rate calculations on rc1 are incorrect #3345 (closed)
Update: All three servers have been running without any issues for the last ~24hrs. |
Thanks. Should we include those fixes plus prometheus/tsdb#187 and do an rc2? |
I think rc2 is great to get more testers without risking the backlash we'd get from a premature final release.
Yes, absolutely. The question to me was more whether we just trust ourselves and do an RC2, or whether we should validate that particular fix, given the bug's severity.
"that particular fix" == prometheus/tsdb#187 ? |
Well, you didn't hit the problem before, so that might be of limited use to confirm. |
I really don't feel qualified at this moment to review tsdb code. But that shouldn't discourage others… |
Some things got conflated here. But I'm reasonably confident the original issue is fixed. @fabxc close this or not, whatever helps your tracking more. |
Yea, looks like the fixes were generally successful. As you said, it got quite conflated, so simply filing new issues if something comes up again seems easier. Thanks everyone for providing all the info!
fabxc closed this on Oct 26, 2017
beorn7 commented on Oct 19, 2017
What did you do?
Run 2.0.0-rc.1 on three servers with different flavors of high load (high churn, high steady series count, high retention).
What did you expect to see?
Stable memory consumption, with the usual compaction oscillation.
What did you see instead? Under which circumstances?
Sometimes the memory starts to grow without bound until the server OOMs. It also appears that no compaction cycles happen anymore, and no clean shutdown is possible: shutdown hangs.
This needs more investigation. I'll try to provide goroutine dumps, memory profiles, and such. Very busy at the moment, so I thought I'd just leave this here as a gathering point in case others run into the same.
Environment
Linux 4.4.10+soundcloud x86_64
I guess the issue template needs an update. ;o)
I'll add more once I find time. (Or @grobie might.)