Apparent memory leak & WAL file accumulation after unclean shutdown #4842
krasi-georgiev added the component/local storage label on Nov 8, 2018
Thanks, it would really help to get the output of the promtool debug command (inside the same release folder as the Prometheus binary), run at the time this happens. This will give us a clue as to what starts consuming so much memory.
When it reaches this state, are there no errors in the logs? I suspect some WAL checkpointing loop.
The only log output is what I've already posted in this ticket. IOW, there is no indication from the logs that anything weird is happening, until it goes boom. I have not had time yet to provoke the issue on a test system, and have just been extremely cautious about ensuring that instances are cleanly shut down in the meantime. This seems to avoid the issue.
That is weird. Any corruptions should be detected and repaired at start-up. OK, I will try to replicate with many dirty shutdowns.
krasi-georgiev referenced this issue on Nov 13, 2018: return an error when the last wal segment record is torn. #451 (merged)
@krasi-georgiev I just upgraded a test instance to v2.5.0, which recently landed as a package in debian/sid. Within a couple of hours it was exhibiting the familiar symptoms I've described above. I grabbed the debug info you requested from promtool.
Additionally, trying to do a clean shutdown of this "bugged" instance is failing. It is ignoring further sigterms, and the web UI is still reachable. Logs below, and another debug pprof taken whilst it was in that state:
Thanks, that should help. I will look into these today.
Unfortunately the block and mutex profiles are empty, as these are enabled only in debug mode (prometheus/cmd/prometheus/main.go, lines 79 to 82 in ca93fd5). I will continue looking into the other profiles, but if you can start another instance and gather the profiles with "DEBUG" enabled, that will also help.
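For context, the gating being referred to typically looks like the sketch below (a minimal illustration of the pattern, assuming a DEBUG environment variable; not the exact Prometheus source). Go only collects block and mutex contention data once the corresponding rates are set, so a binary started without the debug switch serves empty profiles at /debug/pprof/block and /debug/pprof/mutex.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	"os"
	"runtime"
)

func main() {
	// Block and mutex profiling are opt-in: unless these rates are set,
	// the /debug/pprof/block and /debug/pprof/mutex endpoints stay empty,
	// which is why the earlier profiles contained no data.
	if os.Getenv("DEBUG") != "" {
		runtime.SetBlockProfileRate(20)     // aim for one sampled blocking event per 20ns spent blocked
		runtime.SetMutexProfileFraction(20) // report roughly 1 in 20 mutex contention events
	}

	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```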
AndorCS commented on Nov 19, 2018:
The same issue happens to us. There is a screenshot from Grafana: https://i.imgur.com/207elUK.png
Looking at the profile before the shutdown, it seems that there is some silent panic caused somewhere in the k8s service discovery.
That's odd. I don't use k8s SD, although I am using Consul SD.
That is strange indeed. Can you also post the configs so we can look for some clues there?
@AndorCS can you help with steps to reproduce?
Chunks are inside the actual blocks. Are you sure that you actually removed chunks, as these are not related to the WAL? Could you post the full logs so we can check if there is anything unusual? It would also help if you could run
@krasi-georgiev Here is the config from the instance where I ran
I find the collapsible block a good way to add long text directly in the comment.
Your Config
AndorCS commented on Nov 19, 2018:
It happens again right now, trying to profile.
Yes, I'm sure, I removed some chunks, not WAL.
Will try right now.
AndorCS commented on Nov 19, 2018.
@AndorCS it is good that it happened again, this should help track down the culprit.
Hm, that is strange, WAL checkpointing involves only the files in the wal directory.
AndorCS commented on Nov 19, 2018:
Logs do not exist anymore. That happened ~1 week ago last time. Sorry.
As you can see, the checkpoint happens twice and after that it is stuck again.
@AndorCS I am unable to open the archive.
AndorCS commented on Nov 19, 2018:
Sorry, that archive is broken.
@AndorCS any way to save the logs for next time? Seeing some errors in the logs would be the easiest way to troubleshoot this. The WAL graph shows that something happened between 13:30-14:00. Do you know what it was? Do you know if this container was OOM killed at some point? I already fixed a bug caused by this in prometheus/tsdb#451. Can you share the actual file paths you deleted to make Prometheus work again?
@AndorCS your profiles look normal, so these won't help to troubleshoot your issue. Anyway, if these were taken at the time when the WAL started misbehaving, please open a new issue and ping me there, as this is most likely caused by a different issue than the one we are troubleshooting here.
@dswarbrick can you ping me on the #prometheus-dev IRC channel, it might be quicker that way.
@krasi-georgiev tried to, but I seem to be read-only in that channel.
I think it requires a registration.
AndorCS commented on Nov 19, 2018:
I checked, there was nothing suspicious in the logs.
It was some restarts related to configuration changes, not related to the current issue.
Unfortunately, I didn't save it. There is fresh data:
AndorCS commented on Nov 20, 2018.
It looks like the head is not truncated.
Could you also ping me on IRC for a quick chat?
@krasi-georgiev I upgraded another instance from 2.4.3 to 2.5.0, which appeared to run OK for a while, but alas ultimately showed the same deadlock-like symptom, and the head chunk kept growing in the absence of WAL checkpoints. The only thing that stuck out in the logs was 40 unknown series references. After another clean shutdown / restart, it has been running fine for about 12 hours now. I have just now upgraded another, slightly larger instance. When attempting to shut down the 2.4.3 process, it seemed to hang.
The stuck shutdown is most likely a side effect of the WAL issue, so I wouldn't focus too much on that yet. I am still waiting for the other maintainers to have a look at some PRs for WAL fixes; once we merge these we can make another release and test it again.
@krasi-georgiev I finally managed to get a debug stack pprof from an instance with DEBUG=1. This instance was only set up for the first time yesterday - it has only ever run v2.5.0. I nuked the DB earlier today, so it started with a fresh DB, and within a few hours it was showing the deadlock-like symptoms. This is also a very low volume instance (17K series head chunk, ~1350 samples per second), so it does not appear to only affect high-traffic instances.
A couple more details about the aforementioned instance: despite several attempts, I was not able to get it to shut down cleanly. Every time I attempted this, it would hang during the shutdown. After each restart, it would compact any loose blocks, but it appeared that it was simply not going to do a WAL checkpoint, and in the end I just left it running for several hours. When I checked back later, it had indeed performed a checkpoint, but they are not occurring regularly or predictably (for the most part) like on the other instances. It seems that on a fresh DB with sufficiently low traffic, there is a much longer gap between checkpoints. The last checkpoint for this new instance was nearly six hours ago. In any case, the shutdown hangs are pretty concerning, and if they are as easy to provoke on a new instance as this one was, it's surprising that this got through automated testing (or even manual testing by developers).
Logs from aforementioned hanging instance:
I will most likely have to defer to others who know the tsdb code much better, but I have so far traced the bug as far as the pendingReaders waitgroup in block.go:

```go
// Close closes the on-disk block. It blocks as long as there are readers reading from the block.
func (pb *Block) Close() error {
	pb.mtx.Lock()
	pb.closing = true
	pb.mtx.Unlock()

	pb.pendingReaders.Wait()

	var merr MultiError

	merr.Add(pb.chunkr.Close())
	merr.Add(pb.indexr.Close())
	merr.Add(pb.tombstones.Close())

	log.Println("block closed")

	return merr.Err()
}
```

I have seen goroutines hang there indefinitely. It seems to be more likely to occur at shutdown, possibly due to blocks being closed whilst reads are still pending. It can however happen spontaneously during normal operation, if the race conditions (?) are met.
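A minimal, self-contained sketch of how that kind of hang can arise (hypothetical, simplified types for illustration; not the actual tsdb code): if every reader is expected to signal completion on the WaitGroup but one code path never does — for example a querier that is opened and never closed, which is what #4922 turned out to fix — the pendingReaders counter never drops back to zero and Close() blocks forever, consistent with goroutines seen parked in pendingReaders.Wait().

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// block mimics the pendingReaders bookkeeping described above
// (hypothetical, simplified type for illustration only).
type block struct {
	mtx            sync.Mutex
	closing        bool
	pendingReaders sync.WaitGroup
}

// startRead registers a reader; every caller must pair it with doneRead.
func (b *block) startRead() error {
	b.mtx.Lock()
	defer b.mtx.Unlock()
	if b.closing {
		return fmt.Errorf("block is closing")
	}
	b.pendingReaders.Add(1)
	return nil
}

func (b *block) doneRead() { b.pendingReaders.Done() }

// close blocks until every registered reader has called doneRead.
func (b *block) close() {
	b.mtx.Lock()
	b.closing = true
	b.mtx.Unlock()
	b.pendingReaders.Wait() // hangs forever if any reader leaked
}

func main() {
	b := &block{}

	_ = b.startRead() // a "querier" is opened...
	// ...but never closed: doneRead is never called, so the WaitGroup
	// counter never returns to zero.

	done := make(chan struct{})
	go func() { b.close(); close(done) }()

	select {
	case <-done:
		fmt.Println("block closed")
	case <-time.After(2 * time.Second):
		fmt.Println("close() is stuck in pendingReaders.Wait() — the shutdown hang")
	}
}
```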
krasi-georgiev referenced this issue on Nov 27, 2018: querier for RestoreForState not closed. #4922 (merged)
Fixed in #4922. @dswarbrick, @AndorCS thanks for the teamwork!
krasi-georgiev closed this on Nov 28, 2018
simonpasquier modified the milestone: v2.6.0 on Nov 28, 2018
spjspjspj commented on Dec 7, 2018:
We faced the same issue after a probably unclean restart. It seems Next GC is way too high and GC never took place for days until OOM (followed by a crash loop). Head Time Series / Chunks kept growing as well. Using 2.4.3 - 167a4b4. Wiped it for now, but I wonder if there is a way to fix this without losing the data in the WAL.
The 2.6.0 RC is out; upgrade to it and run with the same data folder, and this should clear the problem without losing any data.
spjspjspj referenced this issue on Dec 11, 2018: Compaction memory requirements are too high #4110 (closed)
Andor commented on Dec 17, 2018:
@krasi-georgiev I accidentally deployed the "stable" version. Log: http://dpaste.com/27594H5
Andor commented on Dec 17, 2018:
OK, after some "latest" WAL-data cleanup my instance started to work.
Yeah, if the WAL grows very big you can't avoid an OOM even after a restart, as it needs to read the whole thing into memory.
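To illustrate the point (a rough sketch with hypothetical names; this is not the real tsdb WAL API): on startup, every record that has not yet been compacted away by a checkpoint is replayed into the in-memory head before the server can serve queries, so peak startup memory grows with the size of the accumulated WAL.

```go
package main

import "fmt"

// record stands in for a decoded WAL entry (illustrative only).
type record struct {
	seriesRef uint64
	timestamp int64
	value     float64
}

// head stands in for the in-memory head block (illustrative only).
type head struct {
	samples map[uint64][]record
}

// replayWAL loads every decoded WAL record into the head. Nothing can be
// skipped: until a checkpoint truncates the log, the whole backlog has to
// fit in memory, which is why restarting with an enormous WAL still OOMs.
func replayWAL(records []record) *head {
	h := &head{samples: make(map[uint64][]record)}
	for _, r := range records {
		h.samples[r.seriesRef] = append(h.samples[r.seriesRef], r)
	}
	return h
}

func main() {
	// Simulate a WAL that accumulated far more samples than usual because
	// checkpoints stopped happening.
	backlog := make([]record, 0, 5000000)
	for i := 0; i < cap(backlog); i++ {
		backlog = append(backlog, record{seriesRef: uint64(i % 100000), timestamp: int64(i), value: 1})
	}
	h := replayWAL(backlog)
	fmt.Printf("replayed %d series into the head before serving\n", len(h.samples))
}
```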
FANLONGFANLONG commented on Dec 18, 2018:
I ran into the same issue. I am using the 2.6.0-rc1 version but the issue happened again; any suggestion? Prometheus keeps eating memory.
You are most probably hitting some other issue, please open a new one with more details.
dswarbrick commented on Nov 8, 2018 (edited):
Bug Report
What did you do?
Reboot host running Prometheus.
What did you expect to see?
Normal operation.
What did you see instead? Under which circumstances?
Apparent memory leak, WAL files accumulation.
I suspect that we had a few unclean shutdowns, and after the crash recovery (which incidentally eats a LOT of memory), things seemed to be back to normal, until the first WAL checkpoint.
We run with retention=30d, min-block-duration=1h, max-block-duration=24h. The WAL checkpoint appeared to become somehow deadlocked. Metrics were still being ingested and could still be queried, however RAM was being steadily consumed and the WAL files were accumulating. Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first.
The only workaround I found was, most importantly, to catch Prometheus in such a state before it crashed, and perform a clean shutdown / restart. Once restarted, it began to process the backlog of WAL files, disk usage receded, and eventually everything got back to normal.
Environment
System information:
Linux 4.18.0-0.bpo.1-amd64 x86_64
Prometheus version:
prometheus, version 2.4.3+ds (branch: debian/sid, revision: 2.4.3+ds-2)
build user: pkg-go-maintainers@lists.alioth.debian.org
build date: 20181022-04:09:52
go version: go1.10.4
Alertmanager version:
n/a
Prometheus configuration file:
n/a (but can be added if really necessary)
Alertmanager configuration file:
n/a
Logs:
(originally responding to @krasi-georgiev in #4695)
This was partly motivated by the article written by @PeterZaitsev (https://www.percona.com/blog/2018/09/20/prometheus-2-times-series-storage-performance-analyses/). We found that the RAM usage was about 10-20% lower using the fixed, lower max-block-duration=24h, instead of the 10% calculated from our retention period. The cold startup time also appeared to be lower.
The systemd unit file in the Debian Prometheus package has TimeoutStopSec=20s, which I suspect is too short for a graceful shutdown of a large instance. It also has SendSIGKILL=false, so at least it will allow Prometheus to continue its shutdown despite overshooting the timeout. However, if this is occurring as part of a reboot, it is likely that the host will already be trying to unmount filesystems and reboot. I think this is where we are getting a corrupt DB / WAL.
Upon restarting, we see the usual log messages, but nothing too scary:
Even the next WAL commit and compactions look normal:
But by this time, it is already starting to "leak" memory and the WAL files are piling up. Several hours later, Prometheus runs out of memory:
I think you should be able to reproduce it if you SIGKILL Prometheus whilst it is still flushing / shutting down. It may take a while after restarting before you actually see the issue however.