WAL truncation fails in weird ways under load #4578

Closed
beorn7 opened this Issue Sep 5, 2018 · 6 comments

@beorn7
Member

beorn7 commented Sep 5, 2018

Bug Report

What did you do?

Ran a beefy Prometheus (128GiB RAM) under high load (15M head series, and in particular two rules that aggregated 5.5M time series into a still quite large number of new series).

What did you expect to see?

Overload/degradation is probably expected in that case, but I would like to see some metrics indicating it, and I would not like to see data corruption and the like.

What did you see instead? Under which circumstances?

WAL truncation failed with error messages like this:

level=error ts=2018-09-05T17:08:05.642050268Z caller=head.go:359 component=tsdb msg="WAL truncation failed" err="read candidate WAL files: invalid entry body size 1073741824 <file: 21, lastOffset: 209422616>" duration=31.138027662s

I have never seen this error message reported in any issue.

Blocks were still written, but the WAL grew forever until the disk filled.

Restarting the server cleared the WAL (leaving a gap in the time series) and presumably also got rid of the corruption.

I could not find any metric that gets incremented when WAL truncation fails. Such a metric would at least allow alerting on a non-truncating WAL before the disk runs full.
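
Until such a metric exists, the best workaround I can think of is an indirect alert on free space of the data filesystem. A rough sketch, assuming node_exporter >= 0.16 metric names and a /prometheus mountpoint (both assumptions that will differ per setup):

    groups:
      - name: prometheus-disk
        rules:
          # Workaround sketch: fire if the data filesystem is predicted to fill
          # within 24h, which is how a non-truncating WAL eventually shows up.
          - alert: PrometheusDataDiskFillingUp
            expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 24 * 3600) < 0
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "Prometheus data disk predicted to fill within 24h (possibly a non-truncating WAL)"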

Environment

  • System information:

    Linux 4.15.3+soundcloud2 x86_64

  • Prometheus version:

    build user: root@5258e0bd9cc1
    build date: 20180712-14:02:52
    go version: go1.10.3

  • Logs:

2018-09-05_13:12:59.68015 level=error ts=2018-09-05T13:12:59.680012942Z caller=head.go:359 component=tsdb msg="WAL truncation failed" err="read candidate WAL files: invalid entry body size 1073741824 <file: 12, lastOffset: 96370441>" duration=17.455949261s
@beorn7

Member Author

beorn7 commented Sep 6, 2018

WRT the metrics about this: prometheus_tsdb_wal_truncate_duration_seconds still gets observations even if the truncation fails, i.e. prometheus_tsdb_wal_truncate_duration_seconds_count is still incremented every 2h even if truncation fails. That means I cannot create a "truncation is stuck" alert.
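
To illustrate, a naive "stuck" rule on the histogram count, as sketched below, would never fire on failed truncations, precisely because the _count keeps going up regardless of the outcome:

    groups:
      - name: tsdb-wal
        rules:
          # Sketch only: does NOT catch failed truncations, since the duration
          # histogram is observed (and its _count incremented) even on failure.
          - alert: WALTruncationStuck
            expr: increase(prometheus_tsdb_wal_truncate_duration_seconds_count[3h]) == 0
            for: 1h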

@simonpasquier

Member

simonpasquier commented Oct 4, 2018

It should be fixed by #4692 (once merged), which includes prometheus/tsdb#396 and adds the following metrics (a sketch of a possible alert on them follows the list):

  • prometheus_tsdb_head_truncations_failed_total
  • prometheus_tsdb_head_truncations_total
  • prometheus_tsdb_checkpoint_creations_failed_total
  • prometheus_tsdb_checkpoint_creations_total
  • prometheus_tsdb_checkpoint_deletions_failed_total
  • prometheus_tsdb_checkpoint_deletions_total
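
Once those land, an alert along these lines should become possible (a sketch only, assuming the metric names ship exactly as listed above):

    groups:
      - name: tsdb-truncation
        rules:
          # Sketch: fire if any head truncation failed in the last 3h, before
          # a non-truncating WAL can fill the disk.
          - alert: TSDBHeadTruncationsFailing
            expr: increase(prometheus_tsdb_head_truncations_failed_total[3h]) > 0
            labels:
              severity: warning
            annotations:
              summary: "TSDB head truncations are failing; the WAL may grow until the disk fills"
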
@krasi-georgiev

Member

krasi-georgiev commented Oct 10, 2018

@beorn7 the new WAL metrics have been added.
Shall I try to replicate this to find the culprit, or shall we revisit if we get another report?

@beorn7

Member Author

beorn7 commented Oct 10, 2018

I'm happy to report if I run into the issue again. Once I get to it, I'll add alerts for the new metrics. (That's yet another "ASAP" on my growing list...)
I don't think I'll find time soon to deliberately create the same overload situation again.

@krasi-georgiev

Member

krasi-georgiev commented Oct 10, 2018

Shall we close and revisit if needed?

@beorn7

Member Author

beorn7 commented Oct 10, 2018

Sure. Whatever helps bookkeeping most...

@beorn7 beorn7 closed this Oct 10, 2018

@lock lock bot locked and limited conversation to collaborators Apr 8, 2019
