WAL files are not truncated after upgrade to 2.4.2, caused by `unexpected gap to last checkpoint` #4695
Comments
This looks serious. Could you share the logs for the EBS setup from before the log line you shared? There could have been an earlier error that triggered this, and sharing all the logs helps us debug the issue faster. Also, is it the same case with NFS, i.e. are we not truncating the WAL there either? That would be surprising, as we should be logging all the errors in that case.
gouthamve added the kind/bug and component/local storage labels on Oct 4, 2018
@gouthamve, I've pasted the full output since the last restart.
@gouthamve, yes, nothing in the logs of the NFS one. We just started it at 10:30, but I'm not sure how often compaction happens.
What we also see on NFS: the amount of
After some time. Check the length of the head block, which is calculated like:
We've seen only one compaction in that time. Since the logs above, only config reloads are visible in the logs.
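For reference, the head length and the usual compaction trigger can be sketched roughly as follows; this is an approximation for illustration (assuming timestamps in milliseconds and a trigger at about 1.5x the block range), not the exact tsdb code.

```go
package main

import "fmt"

// headSpan returns the time range currently covered by the head block,
// assuming min/max timestamps in milliseconds.
func headSpan(minTime, maxTime int64) int64 {
	return maxTime - minTime
}

// shouldCompact approximates the trigger: compact the head into a persisted
// block once it spans roughly 1.5x the configured block range.
func shouldCompact(minTime, maxTime, blockRangeMs int64) bool {
	return headSpan(minTime, maxTime) > blockRangeMs*3/2
}

func main() {
	const blockRangeMs = 2 * 60 * 60 * 1000 // default 2h block range
	// A head spanning ~4h of samples would be due for compaction.
	fmt.Println(shouldCompact(0, 4*60*60*1000, blockRangeMs)) // true
}
```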
From your logs, it seems that the WAL directory was corrupted and had been repaired:
It might be why the head truncation subsequently failed with `unexpected gap to last checkpoint`.
@simonpasquier, please see below:
From the timestamps it looks like the first checkpoint was created at the expected time; the second set of logs you pasted shows that. Do you happen to have the WAL directory for the instance using EBS? As @simonpasquier pointed out, the corruption of segment 396 and the deletion of the segments after it is probably what happened. EDIT: never mind, I misunderstood what was happening; to get that error there would have had to be missing segments between the checkpoint and the current oldest segment in your WAL, regardless of whether or not this repair takes place.
@trnl Looks like the WAL directory you shared belongs to the NFS one; could you share the contents for the EBS one?
This error is triggered when there is a gap between the last checkpoint and the first WAL file. @trnl, maybe some more details on how to replicate it?
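To make that condition concrete, here is a minimal sketch of such a gap check, assuming segments are numbered sequentially and the last checkpoint records the highest segment index it covers; it illustrates the idea, not the actual tsdb source.

```go
package main

import (
	"fmt"
	"log"
)

// checkCheckpointGap illustrates how a gap to the last checkpoint can be
// detected. lastCheckpointIdx is the highest WAL segment index covered by the
// most recent checkpoint; firstSegmentIdx is the oldest segment still on disk.
func checkCheckpointGap(lastCheckpointIdx, firstSegmentIdx int) error {
	// After checkpointing through segment N, the oldest remaining segment
	// should be at most N+1. If segments in between were deleted or lost,
	// WAL truncation cannot proceed safely.
	if firstSegmentIdx > lastCheckpointIdx+1 {
		return fmt.Errorf("unexpected gap to last checkpoint, checkpoint covers up to segment %d but oldest segment on disk is %d",
			lastCheckpointIdx, firstSegmentIdx)
	}
	return nil
}

func main() {
	// Checkpoint covers through segment 396, but the oldest remaining segment
	// is 400: segments 397-399 are missing, so truncation would be blocked.
	if err := checkCheckpointGap(396, 400); err != nil {
		log.Println(err)
	}
}
```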
Btw, I will also check if there could be any way to handle this more gracefully.
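One possible shape of that more graceful handling, sketched here under the assumption that truncation can simply proceed after logging the gap (in the spirit of the "log an error instead of blocking a wal truncation" change referenced later in this thread), would be:

```go
package main

import "log"

// truncateWAL sketches the more lenient behaviour: when there is a gap
// between the last checkpoint and the oldest WAL segment, log an error and
// continue truncating instead of aborting and letting segments pile up.
// The function signature and the truncate callback are hypothetical.
func truncateWAL(lastCheckpointIdx, firstSegmentIdx int, truncate func() error) error {
	if firstSegmentIdx > lastCheckpointIdx+1 {
		// Previously this condition aborted truncation; logging it keeps the
		// WAL from growing without bound while still surfacing the problem.
		log.Printf("unexpected gap to last checkpoint: checkpoint covers up to segment %d, oldest segment is %d",
			lastCheckpointIdx, firstSegmentIdx)
	}
	return truncate()
}

func main() {
	_ = truncateWAL(396, 400, func() error {
		log.Println("truncating WAL segments up to the head's min time")
		return nil
	})
}
```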
@krasi-georgiev yeah, I spent most of Thursday looking into this but I couldn't figure out what would cause us to have missing WAL segments between the most recent checkpoint and the current oldest segment. One idea @gouthamve had was to rebuild the segments from the Head if they're missing when we do the checkpoint vs. oldest-segment check. I could work on this after I get remote write via WAL closer to done.
krasi-georgiev referenced this issue on Oct 10, 2018: more descriptive var names and some more logging. #405 (merged)
That is a good idea.
tml commented on Oct 11, 2018
Damn, autocomplete :) Thanks.
@krasi-georgiev, thank you. We'll try with 2.4.3 later and get back to you.
krasi-georgiev referenced this issue on Oct 22, 2018: log an error instead of blocking a wal truncation on checkpoint gaps. #407 (closed)
Can anyone give any more clues on how to replicate this or what might have caused it?
duhang commented on Oct 31, 2018
Hi @krasi-georgiev, we hit a similar WAL truncation error here, and the WAL files kept piling up at 1 GB/hour.
We always bump into this issue when enabling a very slow scrape target (we scrape it every 3 minutes). But we can't confirm which metric caused the problem, since we have 4+ million head series in our Prometheus, making it almost impossible to isolate. We can't do anything about it right now besides restarting Prometheus periodically to clean up the dangling WAL files. Here is what the current WAL directory looks like:
I can ship the first couple of suspected WAL files to you (~3 GB uncompressed). Maybe you could help us figure out what the Prometheus TSDB was complaining about. I had sent WAL files to you before, so this time we can do the same. Just let me know. Thanks.
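As a quick way to watch that growth from outside Prometheus, here is a small stand-alone sketch (illustrative only; the `data/wal` path is an assumption and should point at the wal directory under your --storage.tsdb.path) that counts WAL files and reports their total size:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumed location of the WAL; adjust to <--storage.tsdb.path>/wal.
	walDir := "data/wal"

	var total int64
	var count int
	err := filepath.Walk(walDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
			count++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walking WAL dir:", err)
		os.Exit(1)
	}
	fmt.Printf("%d WAL files, %.2f GiB total\n", count, float64(total)/(1<<30))
}
```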
@duhang that is a separate issue; I am trying to investigate the cause of `unexpected gap to last checkpoint`. I want to look into the error message you encountered as well, but can you please copy and paste your comment into a new issue and ping me there so we can track these separately?
duhang commented on Oct 31, 2018
Sure. Will enter a new issue about our case once I get a chance. Thanks.
chlunde commented on Nov 7, 2018
We're hitting this on a Prometheus in a test environment with a low retention time (2h) and prometheus_tsdb_head_series around 800k-1000k. First it got OOM-killed, then the disk filled up with WAL files, and finally we got crashes because of the full disk. I will try to reproduce it on 2.5.0. The data disk is an emptyDir volume on XFS with a 20 GB quota; the memory limit is set to 10 GiB. @trnl did you also see an OOM or unclean shutdown before the WAL issue?
dswarbrick commented on Nov 7, 2018
I've also seen v2.4.3 appear to exhibit this issue. I suspect that we had a few unclean shutdowns, and after the crash recovery (which incidentally eats a LOT of memory), things seemed to be back to normal until the first WAL checkpoint. We run retention=30d, min-block-duration=1h, max-block-duration=24h. For some strange reason, the default min-block-duration=2h triggers a checkpoint every 2h on the hour, but setting it to 1h results in checkpoints every hour on the half-hour. In any case, the WAL checkpoint appeared to become somehow deadlocked. Metrics were still being ingested and could still be queried, but RAM was being steadily consumed and the WAL files were accumulating. Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first. The only workaround I found was to catch Prometheus in such a state before it crashed and perform a clean shutdown/restart. Once restarted, it began to process the backlog of WAL files, disk usage receded, and eventually everything got back to normal.
@chlunde unfortunately I don't see anything in the logs to help troubleshoot the issue.
I didn't quite understand why you needed to lower the default block duration.
Any unusual log messages when this happens? Can you try to describe a minimal setup and the steps to reproduce this, so I can spend some time finding the culprit for this weird behaviour?
krasi-georgiev changed the title from "WAL files are not truncated after upgrade to 2.4.2" to "WAL files are not truncated after upgrade to 2.4.2, caused by `unexpected gap to last checkpoint`" on Nov 8, 2018
@dswarbrick thanks for the update. Would you mind moving your comments to a new issue and pinging me there if you don't have the error logs reported in the first comment? (Btw, attaching the output of the
@trnl any more pointers on how to replicate?
dswarbrick referenced this issue on Nov 8, 2018: Apparent memory leak & WAL file accumulation after unclean shutdown #4842 (closed)
krasi-georgiev referenced this issue on Nov 13, 2018: return an error when the last wal segment record is torn. #451 (merged)
@trnl have you experienced this again? I couldn't find any clues in the code of what might have caused it.
ping @trnl
Looks stale, so closing. Feel free to reopen if this is still an issue.
trnl commented on Oct 4, 2018 (edited)
Bug Report
Updated from v2.2.1 to v2.4.2.
We have 2 setups: one backed by EBS and one backed by NFS.
Both running inside Docker/Kubernetes.
We see that with 2.4.2 Prometheus stopped truncating WAL files, so disk usage grows and at some point the process stops. Repair is problematic.
`prometheus_tsdb_wal_truncate_duration_seconds_count` is 0, `prometheus_tsdb_wal_truncate_duration_seconds` is NaN, and `prometheus_tsdb_wal_corruptions_total` is not available (probably because of 2.4.2). In the logs on the EBS one:
Nothing in the logs of the NFS-based one.
On-disk layout:
On another cluster with Prometheus 2.2.1 we have similar setups (EBS and NFS) and there the situation is fine.
Environment
Docker version 1.12.6, build 78d1802
Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016