
WAL files are not truncated after upgrade to 2.4.2, caused by `unexpected gap to last checkpoint` #4695

Closed
trnl opened this Issue Oct 4, 2018 · 28 comments

trnl commented Oct 4, 2018

Bug Report

Updated from v2.2.1 to v2.4.2

We have 2 setups:

  • Prometheus writing to AWS EBS mount
  • Prometheus writing to NFS mount

Both running inside Docker/Kubernetes.

We see that with 2.4.2 Prometheus stopped truncating WAL files, so disk usage grows until at some point the process stops. Repair is problematic. `prometheus_tsdb_wal_truncate_duration_seconds_count` is 0, `prometheus_tsdb_wal_truncate_duration_seconds` is NaN, and `prometheus_tsdb_wal_corruptions_total` is not available (probably because of 2.4.2).

In the logs of the EBS one:

> k -n monitor logs prometheus2-1016978423-5ctn9 prometheus2-server
level=info ts=2018-10-03T10:08:38.215387081Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.2, branch=HEAD, revision=c305ffaa092e94e9d2dbbddf8226c4813b1190a0)"
level=info ts=2018-10-03T10:08:38.215442292Z caller=main.go:239 build_context="(go=go1.10.3, user=root@dcde2b74c858, date=20180921-07:22:29)"
level=info ts=2018-10-03T10:08:38.215470526Z caller=main.go:240 host_details="(Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016 x86_64 prometheus2-1016978423-5ctn9 (none))"
level=info ts=2018-10-03T10:08:38.215495632Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-10-03T10:08:38.215517843Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-10-03T10:08:38.21619011Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-10-03T10:08:38.216180869Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-10-03T10:08:38.216552343Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537930800000 maxt=1537963200000 ulid=01CRB15J94YM1BRYJ7BYWPN5C5
level=info ts=2018-10-03T10:08:38.216638156Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537963200000 maxt=1537995600000 ulid=01CRC02HZS61TK0R0YX5HV90GB
level=info ts=2018-10-03T10:08:38.216695171Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537995600000 maxt=1538028000000 ulid=01CRCYZ0Q792Q2M89F30K8G0VT
level=info ts=2018-10-03T10:08:38.217246309Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538028000000 maxt=1538060400000 ulid=01CRR1FX3C7R0SVKHTHVGMMVAD
level=info ts=2018-10-03T10:08:38.219129334Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538060400000 maxt=1538092800000 ulid=01CRR1GAC0KR4FC6CJ1R7XNFCB
level=info ts=2018-10-03T10:08:38.21963145Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538092800000 maxt=1538125200000 ulid=01CRR1GFJ75MMW0WER488HT1GW
level=info ts=2018-10-03T10:08:38.220083155Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538125200000 maxt=1538157600000 ulid=01CRR1GGYKQFZQ3ZAT5AW77V09
level=info ts=2018-10-03T10:08:38.220556735Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538157600000 maxt=1538190000000 ulid=01CRR1GJB8S636ZNKPB9PBXF4Z
level=info ts=2018-10-03T10:08:38.221042056Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538190000000 maxt=1538222400000 ulid=01CRR1GKWX8AW1NR8HVTYDPFX3
level=info ts=2018-10-03T10:08:38.221495319Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538222400000 maxt=1538254800000 ulid=01CRR1GXSYNH1Z4SW42Q07BGE9
level=info ts=2018-10-03T10:08:38.221926609Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538254800000 maxt=1538287200000 ulid=01CRR1H74FNFZF3JXZTBX5VJVK
level=info ts=2018-10-03T10:08:38.222378092Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538287200000 maxt=1538319600000 ulid=01CRR1H8EQ0WS9KT0GGQS6XHWY
level=info ts=2018-10-03T10:08:38.222803748Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538319600000 maxt=1538352000000 ulid=01CRR1H9VGVR9P7J175MG9APMJ
level=info ts=2018-10-03T10:08:38.22285932Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538352000000 maxt=1538384400000 ulid=01CRR1HB71ZKF0PSVFWRGM38A0
level=info ts=2018-10-03T10:08:38.223351595Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538384400000 maxt=1538416800000 ulid=01CRRHR5Y0Z6D9WEK9V4DVQ1YC
level=info ts=2018-10-03T10:08:38.223830051Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538416800000 maxt=1538449200000 ulid=01CRSGMXWQC15N18D2G0AP45R5
level=info ts=2018-10-03T10:08:38.224295058Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538449200000 maxt=1538481600000 ulid=01CRTFHR2FX84Z8QGJTVRC4Z8V
level=info ts=2018-10-03T10:08:38.22474151Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538481600000 maxt=1538514000000 ulid=01CRVEEJYVC8WRCSBMTQMGZ0FJ
level=info ts=2018-10-03T10:08:38.225145504Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538546400000 maxt=1538550000000 ulid=01CRWDAR6BF7XC7WPAKZV6SYA2
level=info ts=2018-10-03T10:08:38.225579422Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538514000000 maxt=1538546400000 ulid=01CRWDB87VEG5A5FRTXVGFX37N
level=info ts=2018-10-03T10:08:38.225999299Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538550000000 maxt=1538553600000 ulid=01CRWGRKTB6YVN9X0R34DPD8HZ
level=warn ts=2018-10-03T10:17:44.434195279Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 396 at 31197204: unexpected full record"
level=warn ts=2018-10-03T10:17:44.434267258Z caller=wal.go:274 component=tsdb msg="starting corruption repair" segment=396 offset=31197204
level=warn ts=2018-10-03T10:17:44.434560097Z caller=wal.go:282 component=tsdb msg="deleting all segments behind corruption"
level=warn ts=2018-10-03T10:17:44.434581496Z caller=wal.go:304 component=tsdb msg="rewrite corrupted segment"
level=info ts=2018-10-03T10:17:44.677390143Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-10-03T10:17:44.678233927Z caller=main.go:624 msg="Loading configuration file" filename=/etc/config/prometheus.yml
level=info ts=2018-10-03T10:17:44.700554556Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-03T10:17:44.705841765Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-03T10:17:44.70631673Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-03T10:17:44.70682067Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-03T10:17:44.713801809Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
level=info ts=2018-10-03T10:17:44.713831797Z caller=main.go:523 msg="Server is ready to receive web requests."
level=info ts=2018-10-03T10:17:55.888548528Z caller=compact.go:398 component=tsdb msg="write block" mint=1538553600000 maxt=1538557200000 ulid=01CRWPXXPYEY4AD9E1K203CH6G
level=info ts=2018-10-03T10:17:58.113813035Z caller=head.go:446 component=tsdb msg="head GC completed" duration=1.410857299s
level=error ts=2018-10-03T10:17:58.114662648Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T10:30:06.337341204Z caller=compact.go:398 component=tsdb msg="write block" mint=1538557200000 maxt=1538560800000 ulid=01CRWQMB60X81PQMA8KH6GTG43
level=info ts=2018-10-03T10:30:07.400370196Z caller=head.go:446 component=tsdb msg="head GC completed" duration=451.655629ms
level=error ts=2018-10-03T10:30:07.401206256Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T10:30:15.7591088Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538546400000 maxt=1538557200000 ulid=01CRWQMKM8F21TSJXW63EH0B29 sources="[01CRWDAR6BF7XC7WPAKZV6SYA2 01CRWGRKTB6YVN9X0R34DPD8HZ 01CRWPXXPYEY4AD9E1K203CH6G]"
level=info ts=2018-10-03T11:30:06.246000778Z caller=compact.go:398 component=tsdb msg="write block" mint=1538560800000 maxt=1538564400000 ulid=01CRWV26T0EM44WS5SP8AX3FCM
level=info ts=2018-10-03T11:30:07.367381088Z caller=head.go:446 component=tsdb msg="head GC completed" duration=463.486103ms
level=error ts=2018-10-03T11:30:07.368299206Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T12:30:06.711789848Z caller=compact.go:398 component=tsdb msg="write block" mint=1538564400000 maxt=1538568000000 ulid=01CRWYG2E0YSY5F1X0MWDM9VG4
level=info ts=2018-10-03T12:30:07.84785015Z caller=head.go:446 component=tsdb msg="head GC completed" duration=480.53574ms
level=error ts=2018-10-03T12:30:07.848700643Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T13:30:06.668314178Z caller=compact.go:398 component=tsdb msg="write block" mint=1538568000000 maxt=1538571600000 ulid=01CRX1XY208V0H2SXGEXCCC50K
level=info ts=2018-10-03T13:30:07.800913233Z caller=head.go:446 component=tsdb msg="head GC completed" duration=467.662806ms
level=error ts=2018-10-03T13:30:07.801768789Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T13:30:15.32975036Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538557200000 maxt=1538568000000 ulid=01CRX1Y6QN9A5BVZX3DAZ343GD sources="[01CRWQMB60X81PQMA8KH6GTG43 01CRWV26T0EM44WS5SP8AX3FCM 01CRWYG2E0YSY5F1X0MWDM9VG4]"
level=info ts=2018-10-03T14:30:06.724733812Z caller=compact.go:398 component=tsdb msg="write block" mint=1538571600000 maxt=1538575200000 ulid=01CRX5BSP0CD3AF1B86FBGVVDQ
level=info ts=2018-10-03T14:30:07.996606627Z caller=head.go:446 component=tsdb msg="head GC completed" duration=515.838557ms
level=error ts=2018-10-03T14:30:07.997501582Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T15:30:07.082840021Z caller=compact.go:398 component=tsdb msg="write block" mint=1538575200000 maxt=1538578800000 ulid=01CRX8SN8G26WZNT4N0YGZ71FX
level=info ts=2018-10-03T15:30:08.438870058Z caller=head.go:446 component=tsdb msg="head GC completed" duration=572.819436ms
level=error ts=2018-10-03T15:30:08.439785634Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T16:30:06.995215713Z caller=compact.go:398 component=tsdb msg="write block" mint=1538578800000 maxt=1538582400000 ulid=01CRXC7GWBT80925E1BJHBBZ50
level=info ts=2018-10-03T16:30:08.209404284Z caller=head.go:446 component=tsdb msg="head GC completed" duration=519.779316ms
level=error ts=2018-10-03T16:30:08.210323364Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T16:30:16.312571047Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538568000000 maxt=1538578800000 ulid=01CRXC7T0JYTBBQF5D81TCXZQF sources="[01CRX1XY208V0H2SXGEXCCC50K 01CRX5BSP0CD3AF1B86FBGVVDQ 01CRX8SN8G26WZNT4N0YGZ71FX]"
level=info ts=2018-10-03T16:30:29.798040824Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538546400000 maxt=1538578800000 ulid=01CRXC82DYH99K092XW5V9868Z sources="[01CRWQMKM8F21TSJXW63EH0B29 01CRX1Y6QN9A5BVZX3DAZ343GD 01CRXC7T0JYTBBQF5D81TCXZQF]"
level=info ts=2018-10-03T17:30:06.513940099Z caller=compact.go:398 component=tsdb msg="write block" mint=1538582400000 maxt=1538586000000 ulid=01CRXFNCGCVFVFR11R6F8G517E
level=info ts=2018-10-03T17:30:07.746319541Z caller=head.go:446 component=tsdb msg="head GC completed" duration=468.478243ms
level=error ts=2018-10-03T17:30:07.747212241Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T18:30:06.445768336Z caller=compact.go:398 component=tsdb msg="write block" mint=1538586000000 maxt=1538589600000 ulid=01CRXK384FC6CK0ZQ73A2F83A2
level=info ts=2018-10-03T18:30:07.659768464Z caller=head.go:446 component=tsdb msg="head GC completed" duration=525.800255ms
level=error ts=2018-10-03T18:30:07.66115912Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T19:30:06.529991202Z caller=compact.go:398 component=tsdb msg="write block" mint=1538589600000 maxt=1538593200000 ulid=01CRXPH3RD32RKAGJ5VBS8X354
level=info ts=2018-10-03T19:30:07.835992712Z caller=head.go:446 component=tsdb msg="head GC completed" duration=596.011148ms
level=error ts=2018-10-03T19:30:07.836964601Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T19:30:15.443161518Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538578800000 maxt=1538589600000 ulid=01CRXPHCGVH4TBT7RH1PBF7RG5 sources="[01CRXC7GWBT80925E1BJHBBZ50 01CRXFNCGCVFVFR11R6F8G517E 01CRXK384FC6CK0ZQ73A2F83A2]"
level=info ts=2018-10-03T20:30:06.091342034Z caller=compact.go:398 component=tsdb msg="write block" mint=1538593200000 maxt=1538596800000 ulid=01CRXSYZCBY7H3VMP1MWDQ702Q
level=info ts=2018-10-03T20:30:07.230485715Z caller=head.go:446 component=tsdb msg="head GC completed" duration=458.883631ms
level=error ts=2018-10-03T20:30:07.231398762Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T21:30:06.322771925Z caller=compact.go:398 component=tsdb msg="write block" mint=1538596800000 maxt=1538600400000 ulid=01CRXXCV0C9Z9GRF5AMGEMG8XK
level=info ts=2018-10-03T21:30:07.533493874Z caller=head.go:446 component=tsdb msg="head GC completed" duration=489.828369ms
level=error ts=2018-10-03T21:30:07.534519305Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T22:30:09.162199091Z caller=compact.go:398 component=tsdb msg="write block" mint=1538600400000 maxt=1538604000000 ulid=01CRY0TPMBKMC1X31VZ55ATB2X
level=info ts=2018-10-03T22:30:10.91690412Z caller=head.go:446 component=tsdb msg="head GC completed" duration=608.104824ms
level=error ts=2018-10-03T22:30:10.918060286Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-03T22:30:21.09554546Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538589600000 maxt=1538600400000 ulid=01CRY0V2F7SN3EPMZYAVGS942P sources="[01CRXPH3RD32RKAGJ5VBS8X354 01CRXSYZCBY7H3VMP1MWDQ702Q 01CRXXCV0C9Z9GRF5AMGEMG8XK]"
level=info ts=2018-10-03T23:30:06.613369141Z caller=compact.go:398 component=tsdb msg="write block" mint=1538604000000 maxt=1538607600000 ulid=01CRY48J8CAXV8DVR9NAA1JW8N
level=info ts=2018-10-03T23:30:07.811735823Z caller=head.go:446 component=tsdb msg="head GC completed" duration=497.767606ms
level=error ts=2018-10-03T23:30:07.812796345Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T00:30:06.203225563Z caller=compact.go:398 component=tsdb msg="write block" mint=1538607600000 maxt=1538611200000 ulid=01CRY7PDWBWF1PZYY392MEB2K8
level=info ts=2018-10-04T00:30:07.309066177Z caller=head.go:446 component=tsdb msg="head GC completed" duration=463.924948ms
level=error ts=2018-10-04T00:30:07.310079976Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T01:30:06.462733717Z caller=compact.go:398 component=tsdb msg="write block" mint=1538611200000 maxt=1538614800000 ulid=01CRYB49GD9EPQ6BQ6TKDBM93C
level=info ts=2018-10-04T01:30:07.702122301Z caller=head.go:446 component=tsdb msg="head GC completed" duration=474.898269ms
level=error ts=2018-10-04T01:30:07.703176744Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T01:30:15.375861684Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538600400000 maxt=1538611200000 ulid=01CRYB4J4YK9HVW3E0TKX6CQ6F sources="[01CRY0TPMBKMC1X31VZ55ATB2X 01CRY48J8CAXV8DVR9NAA1JW8N 01CRY7PDWBWF1PZYY392MEB2K8]"
level=info ts=2018-10-04T01:30:27.903915874Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538578800000 maxt=1538611200000 ulid=01CRYB4SYTFQ6M5RWTDTCWHKGC sources="[01CRXPHCGVH4TBT7RH1PBF7RG5 01CRY0V2F7SN3EPMZYAVGS942P 01CRYB4J4YK9HVW3E0TKX6CQ6F]"
level=info ts=2018-10-04T02:30:06.46729758Z caller=compact.go:398 component=tsdb msg="write block" mint=1538614800000 maxt=1538618400000 ulid=01CRYEJ54D2RBK1SW57QYQ6V6W
level=info ts=2018-10-04T02:30:07.608028395Z caller=head.go:446 component=tsdb msg="head GC completed" duration=461.984454ms
level=error ts=2018-10-04T02:30:07.609083433Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T03:30:06.44583931Z caller=compact.go:398 component=tsdb msg="write block" mint=1538618400000 maxt=1538622000000 ulid=01CRYJ00RCP45Y51WND8J8ASNJ
level=info ts=2018-10-04T03:30:07.640977584Z caller=head.go:446 component=tsdb msg="head GC completed" duration=475.34158ms
level=error ts=2018-10-04T03:30:07.642106532Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T04:30:06.305762278Z caller=compact.go:398 component=tsdb msg="write block" mint=1538622000000 maxt=1538625600000 ulid=01CRYNDWCE7JMF3TRGGZ4TCZ48
level=info ts=2018-10-04T04:30:07.520159831Z caller=head.go:446 component=tsdb msg="head GC completed" duration=478.175797ms
level=error ts=2018-10-04T04:30:07.521241774Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T04:30:14.833922583Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538611200000 maxt=1538622000000 ulid=01CRYNE4TT4P27S6W69ADGWCT2 sources="[01CRYB49GD9EPQ6BQ6TKDBM93C 01CRYEJ54D2RBK1SW57QYQ6V6W 01CRYJ00RCP45Y51WND8J8ASNJ]"
level=info ts=2018-10-04T05:30:06.342597547Z caller=compact.go:398 component=tsdb msg="write block" mint=1538625600000 maxt=1538629200000 ulid=01CRYRVR0B74TVWW7FA7HE0KM5
level=info ts=2018-10-04T05:30:07.461366342Z caller=head.go:446 component=tsdb msg="head GC completed" duration=446.936244ms
level=error ts=2018-10-04T05:30:07.462502848Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T06:30:06.335313793Z caller=compact.go:398 component=tsdb msg="write block" mint=1538629200000 maxt=1538632800000 ulid=01CRYW9KMBWT9RQRQ0TR5Y8A17
level=info ts=2018-10-04T06:30:07.534534357Z caller=head.go:446 component=tsdb msg="head GC completed" duration=479.605618ms
level=error ts=2018-10-04T06:30:07.535697315Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T07:30:06.578863289Z caller=compact.go:398 component=tsdb msg="write block" mint=1538632800000 maxt=1538636400000 ulid=01CRYZQF8BCE5DV90R23SE6SGM
level=info ts=2018-10-04T07:30:07.883494301Z caller=head.go:446 component=tsdb msg="head GC completed" duration=585.830203ms
level=error ts=2018-10-04T07:30:07.884590226Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T07:30:15.185991407Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538622000000 maxt=1538632800000 ulid=01CRYZQR3Q14N5S405F9QA1RMT sources="[01CRYNDWCE7JMF3TRGGZ4TCZ48 01CRYRVR0B74TVWW7FA7HE0KM5 01CRYW9KMBWT9RQRQ0TR5Y8A17]"
level=info ts=2018-10-04T08:30:06.721983155Z caller=compact.go:398 component=tsdb msg="write block" mint=1538636400000 maxt=1538640000000 ulid=01CRZ35AWB5TVH3VYKE1AW57GM
level=info ts=2018-10-04T08:30:07.811065334Z caller=head.go:446 component=tsdb msg="head GC completed" duration=510.430822ms
level=error ts=2018-10-04T08:30:07.812140743Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"
level=info ts=2018-10-04T09:30:06.664353787Z caller=compact.go:398 component=tsdb msg="write block" mint=1538640000000 maxt=1538643600000 ulid=01CRZ6K6GBBT8B273QR82K97N0
level=info ts=2018-10-04T09:30:07.916256835Z caller=head.go:446 component=tsdb msg="head GC completed" duration=530.902026ms
level=error ts=2018-10-04T09:30:07.917396786Z caller=db.go:305 component=tsdb msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint"

Nothing in the logs of the NFS-based one.

On-disk layout:

/var/lib/kubelet/pods/09807702-c57a-11e8-be3b-0231dedefc0a/volumes/kubernetes.io~aws-ebs/pvc-d86dde3c-9975-11e8-be3b-0231dedefc0a# du -hs *
494M    01CRR1FX3C7R0SVKHTHVGMMVAD
105M    01CRR1GAC0KR4FC6CJ1R7XNFCB
5.2M    01CRR1GFJ75MMW0WER488HT1GW
5.2M    01CRR1GGYKQFZQ3ZAT5AW77V09
5.2M    01CRR1GJB8S636ZNKPB9PBXF4Z
368M    01CRR1GKWX8AW1NR8HVTYDPFX3
365M    01CRR1GXSYNH1Z4SW42Q07BGE9
5.0M    01CRR1H74FNFZF3JXZTBX5VJVK
5.0M    01CRR1H8EQ0WS9KT0GGQS6XHWY
5.0M    01CRR1H9VGVR9P7J175MG9APMJ
5.0M    01CRR1HB71ZKF0PSVFWRGM38A0
287M    01CRRHR5Y0Z6D9WEK9V4DVQ1YC
468M    01CRSGMXWQC15N18D2G0AP45R5
473M    01CRTFHR2FX84Z8QGJTVRC4Z8V
480M    01CRVEEJYVC8WRCSBMTQMGZ0FJ
460M    01CRWDB87VEG5A5FRTXVGFX37N
457M    01CRXC82DYH99K092XW5V9868Z
461M    01CRYB4SYTFQ6M5RWTDTCWHKGC
182M    01CRYNE4TT4P27S6W69ADGWCT2
101M    01CRYZQF8BCE5DV90R23SE6SGM
181M    01CRYZQR3Q14N5S405F9QA1RMT
99M    01CRZ35AWB5TVH3VYKE1AW57GM
0    lock
16K    lost+found
4.0K    snapshots
43G    wal

On another cluster with Prometheus 2.2.1 we have similar setups (EBS and NFS) and there the situation is fine.

Environment
Docker version 1.12.6, build 78d1802
Linux 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016

gouthamve commented Oct 4, 2018

This looks serious. Could you share the logs for the EBS setup from before the log lines you shared?

There may have been an earlier error that triggered this. Sharing all the logs helps us debug the issue faster.

Also, is it the same with NFS, i.e. the WAL isn't being truncated there either? That would be surprising, as we should be logging all the errors in that case.

trnl commented Oct 4, 2018

@gouthamve, I've pasted the full output since the last restart.

trnl commented Oct 4, 2018

@gouthamve, yes, nothing in the logs of the NFS one. We just started it at 10:30, but I'm not sure how often compaction happens.

level=info ts=2018-10-04T08:41:22.778321732Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.2, branch=HEAD, revision=c305ffaa092e94e9d2dbbddf8226c4813b1190a0)"
level=info ts=2018-10-04T08:41:22.778483204Z caller=main.go:239 build_context="(go=go1.10.3, user=root@dcde2b74c858, date=20180921-07:22:29)"
level=info ts=2018-10-04T08:41:22.778527689Z caller=main.go:240 host_details="(Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Fri Oct 27 05:39:05 EDT 2017 x86_64 prometheus2-64998b78c5-x8vfl (none))"
level=info ts=2018-10-04T08:41:22.778564923Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-10-04T08:41:22.778599119Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-10-04T08:41:22.780095396Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-10-04T08:41:22.780645165Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-10-04T08:41:22.784941889Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1535954400000 maxt=1536051600000 ulid=01CPJ24ZG7Q55Z0CY72HX07QZ0
level=info ts=2018-10-04T08:41:22.785666662Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536051600000 maxt=1536148800000 ulid=01CPMYVM4558GVDMRG07D77Q17
level=info ts=2018-10-04T08:41:22.786247714Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536148800000 maxt=1536246000000 ulid=01CPQVHEX117R6JY15H676Q5SC
level=info ts=2018-10-04T08:41:22.786922389Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536246000000 maxt=1536343200000 ulid=01CPTR7YV4829W32WPDX2QAYJ3
level=info ts=2018-10-04T08:41:22.787510195Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536343200000 maxt=1536440400000 ulid=01CPXMY6CHX2SP630DDW30CRF0
level=info ts=2018-10-04T08:41:22.788071728Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536440400000 maxt=1536537600000 ulid=01CQ0HMFZR8W4DN8ACDN72MY4C
level=info ts=2018-10-04T08:41:22.788771892Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536634800000 ulid=01CQ3EAVVCWQFYGQ4EECJA981M
level=info ts=2018-10-04T08:41:22.789470765Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536634800000 maxt=1536732000000 ulid=01CQ6B16T890YG6JXQM7QEX788
level=info ts=2018-10-04T08:41:22.790107618Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536829200000 ulid=01CQ97QT3G5Q1NS9A7W1DCK83J
level=info ts=2018-10-04T08:41:22.790700257Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536829200000 maxt=1536926400000 ulid=01CQC4EENRB3PDSJ339BTHBCTJ
level=info ts=2018-10-04T08:41:22.79125085Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1536926400000 maxt=1537023600000 ulid=01CQF149BTG3GN3DJ902TWZ21G
level=info ts=2018-10-04T08:41:22.791844163Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537023600000 maxt=1537120800000 ulid=01CQHXTPTFW9HPSCEXNH1SKRA1
level=info ts=2018-10-04T08:41:22.792358927Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537120800000 maxt=1537218000000 ulid=01CQMTGNYR47JDPAZMB94EE9TN
level=info ts=2018-10-04T08:41:22.792905739Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537218000000 maxt=1537315200000 ulid=01CQQQ6M1XAM8N61SQ97614HFQ
level=info ts=2018-10-04T08:41:22.793411216Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537315200000 maxt=1537412400000 ulid=01CQTKWZQGAPW12CX5GG1RH7K2
level=info ts=2018-10-04T08:41:22.793955682Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537412400000 maxt=1537509600000 ulid=01CQXGKCQWGXPHH00YKNGRC2V0
level=info ts=2018-10-04T08:41:22.794487189Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537509600000 maxt=1537606800000 ulid=01CR0D9GH4SCZ8EM32WPN78PNV
level=info ts=2018-10-04T08:41:22.795027463Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537606800000 maxt=1537704000000 ulid=01CR39ZV2W0B94SN0MEYM26M5B
level=info ts=2018-10-04T08:41:22.796290255Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537704000000 maxt=1537801200000 ulid=01CR66P4SXF0YNZWR8BNV81GNT
level=info ts=2018-10-04T08:41:22.798726133Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537801200000 maxt=1537898400000 ulid=01CR93CJ5DE31V6PZ5DHTMA1BM
level=info ts=2018-10-04T08:41:22.799384201Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537898400000 maxt=1537995600000 ulid=01CRC02SN4499ADEFGHTK4Q9G4
level=info ts=2018-10-04T08:41:22.800014874Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537995600000 maxt=1538092800000 ulid=01CRQFWTCH8G1ZSH9K2QA92SWS
level=info ts=2018-10-04T08:41:22.800641493Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538092800000 maxt=1538190000000 ulid=01CRQFXA2B7T31HC6ZEWTTXJ25
level=info ts=2018-10-04T08:41:22.801253975Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538190000000 maxt=1538287200000 ulid=01CRQFXAPJ3PG3T141Q145ATQZ
level=info ts=2018-10-04T08:41:22.801783346Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538632800000 maxt=1538636400000 ulid=01CRZ2JHMZ7SW461MF7D4CAV2F
level=info ts=2018-10-04T08:41:22.803190439Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538611200000 maxt=1538622000000 ulid=01CRZ2K5EN12SKT53JQW46JF01
level=info ts=2018-10-04T08:41:22.80380489Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538622000000 maxt=1538632800000 ulid=01CRZ2K67G5R61XSDTFDJB791A
level=info ts=2018-10-04T08:41:22.804396017Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538578800000 maxt=1538611200000 ulid=01CRZ2KBFKS2MAE7PF0MY7MA05
level=info ts=2018-10-04T08:41:22.804946213Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538287200000 maxt=1538384400000 ulid=01CRZ2KC3REFKK23ZHHXHCXG29
level=info ts=2018-10-04T08:41:22.80548316Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538481600000 maxt=1538578800000 ulid=01CRZ2KDBCYZ6KNTST2D2GJHA6
level=info ts=2018-10-04T08:41:22.80594309Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538636400000 maxt=1538640000000 ulid=01CRZ35ATWV63T4GEDT8AFHT0X
level=info ts=2018-10-04T08:41:22.806895606Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1538384400000 maxt=1538388000000 ulid=01CRZ3RTZWY1B4SF2YS4Z50YTC
level=warn ts=2018-10-04T08:41:33.434146401Z caller=head.go:371 component=tsdb msg="unknown series references" count=185551
level=info ts=2018-10-04T08:41:33.512538446Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-10-04T08:41:33.51278962Z caller=main.go:624 msg="Loading configuration file" filename=/etc/config/prometheus.yml
level=info ts=2018-10-04T08:41:33.523434797Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-04T08:41:33.526448543Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-04T08:41:33.527651768Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-10-04T08:41:33.539770804Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
level=info ts=2018-10-04T08:41:33.539879067Z caller=main.go:523 msg="Server is ready to receive web requests."
level=info ts=2018-10-04T09:30:07.984969436Z caller=compact.go:398 component=tsdb msg="write block" mint=1538640000000 maxt=1538643600000 ulid=01CRZ6K6FAH74RJPA0FAZNGGQ8
level=info ts=2018-10-04T09:30:08.949560654Z caller=head.go:446 component=tsdb msg="head GC completed" duration=438.897931ms
level=info ts=2018-10-04T10:30:07.51171227Z caller=compact.go:398 component=tsdb msg="write block" mint=1538643600000 maxt=1538647200000 ulid=01CRZA1238WMZGYE5H9ZQQJTRV
level=info ts=2018-10-04T10:30:08.522257598Z caller=head.go:446 component=tsdb msg="head GC completed" duration=554.31193ms
level=info ts=2018-10-04T10:30:12.09627372Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=0 high=1 duration=3.573862897s
level=info ts=2018-10-04T10:30:17.125481178Z caller=compact.go:352 component=tsdb msg="compact blocks" count=3 mint=1538632800000 maxt=1538643600000 ulid=01CRZA1ECEGDNMW2NWFW02THK7 sources="[01CRZ2JHMZ7SW461MF7D4CAV2F 01CRZ35ATWV63T4GEDT8AFHT0X 01CRZ6K6FAH74RJPA0FAZNGGQ8]"

What we also see on NFS: the amount of prometheus_tsdb_head_series is constantly growing, without any leaking metrics on the back end.
[screenshot omitted]

trnl commented Oct 4, 2018

After some time, here is the length of the head block, which is calculated as:

max(prometheus_tsdb_head_max_time{instance="$instance"}) by (instance) - min(prometheus_tsdb_head_min_time{instance="$instance"}) by (instance)

[screenshot omitted]

We've seen only one compaction in that time. Since the logs above, only a config reload is visible in the logs.

simonpasquier commented Oct 4, 2018

From your logs, it seems that the WAL directory was corrupted and had been repaired:

level=warn ts=2018-10-03T10:17:44.434195279Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 396 at 31197204: unexpected full record"
level=warn ts=2018-10-03T10:17:44.434267258Z caller=wal.go:274 component=tsdb msg="starting corruption repair" segment=396 offset=31197204
level=warn ts=2018-10-03T10:17:44.434560097Z caller=wal.go:282 component=tsdb msg="deleting all segments behind corruption"
level=warn ts=2018-10-03T10:17:44.434581496Z caller=wal.go:304 component=tsdb msg="rewrite corrupted segment"

This might be why the head truncation subsequently failed with `reload blocks: head truncate failed: create checkpoint: unexpected gap to last checkpoint`. Can you list the files in the wal/ directory?

trnl commented Oct 4, 2018

@simonpasquier, please see below:

# ls -lh
total 2.3G
-rw-r--r--. 1 root root 128M Oct  4 11:14 00000002
-rw-r--r--. 1 root root 128M Oct  4 11:38 00000003
-rw-r--r--. 1 root root 128M Oct  4 12:01 00000004
-rw-r--r--. 1 root root 128M Oct  4 12:24 00000005
-rw-r--r--. 1 root root 128M Oct  4 12:47 00000006
-rw-r--r--. 1 root root 128M Oct  4 13:10 00000007
-rw-r--r--. 1 root root 128M Oct  4 13:33 00000008
-rw-r--r--. 1 root root 128M Oct  4 13:57 00000009
-rw-r--r--. 1 root root 128M Oct  4 14:20 00000010
-rw-r--r--. 1 root root 128M Oct  4 14:43 00000011
-rw-r--r--. 1 root root 128M Oct  4 15:06 00000012
-rw-r--r--. 1 root root 128M Oct  4 15:29 00000013
-rw-r--r--. 1 root root 128M Oct  4 15:52 00000014
-rw-r--r--. 1 root root 128M Oct  4 16:15 00000015
-rw-r--r--. 1 root root 128M Oct  4 16:38 00000016
-rw-r--r--. 1 root root 128M Oct  4 17:01 00000017
-rw-r--r--. 1 root root 128M Oct  4 17:25 00000018
-rw-r--r--. 1 root root  74M Oct  4 17:38 00000019
drwxr-xr-x. 2 root root 4.0K Oct  4 12:30 checkpoint.000001

cstyan commented Oct 4, 2018

From the timestamps it looks like the first checkpoint was created at the expected time; the second set of logs you pasted shows that.

Do you happen to have the WAL directory for the instance using EBS? As @simonpasquier pointed out, the corruption of segment 396 and the deletion of segments after it is probably what happened.

EDIT: never mind, I misunderstood what was happening. To get that error, there would have had to be missing segments between the checkpoint and the current oldest segment in your WAL, regardless of whether or not this repair takes place.

gouthamve commented Oct 5, 2018

@trnl Looks like the WAL directory you shared belongs to the NFS one, could you share the contents for the EBS one?

krasi-georgiev commented Oct 10, 2018

This error is triggered when there is a gap between the last checkpoint and the first WAL file:
for example, checkpoint.000001 exists but the first WAL segment is 00000003 (00000002 is missing). That doesn't seem to be the case with the file list posted by @trnl.
I looked at the code and can't see the problem yet, but will keep digging.
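
For reference, a minimal sketch of that invariant, assuming a simplified model of segment numbering (illustrative only, not the actual tsdb source):

package main

import (
	"errors"
	"fmt"
)

// checkGap reports the "unexpected gap to last checkpoint" condition:
// the oldest WAL segment still on disk must directly follow the segment
// index covered by the most recent checkpoint directory.
func checkGap(lastCheckpointIdx, firstSegmentIdx int) error {
	if firstSegmentIdx > lastCheckpointIdx+1 {
		return errors.New("unexpected gap to last checkpoint")
	}
	return nil
}

func main() {
	fmt.Println(checkGap(1, 3)) // checkpoint.000001 but first segment is 00000003 -> gap
	fmt.Println(checkGap(1, 2)) // checkpoint.000001 and first segment is 00000002 -> <nil>
}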

@trnl could you give some more details on how to replicate this?
Can you please confirm that the failing Prometheus does indeed have a gap between the checkpoint and the first WAL file?

krasi-georgiev commented Oct 10, 2018

BTW, I will also check whether there is any way to handle this more gracefully.

cstyan commented Oct 10, 2018

@krasi-georgiev yeah, the ls output seems to be from a different Prometheus' WAL.

I spent most of Thursday looking into this, but I couldn't figure out what would cause us to have missing WAL segments between the most recent checkpoint and the current oldest segment. One idea @gouthamve had was to rebuild the segments from the Head if they're missing when we do the checkpoint-vs-oldest-segment check. I could work on this after I get remote write via WAL closer to done.
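
Purely as an illustration of that idea (hypothetical names, not tsdb APIs or the eventual implementation): when the gap check fails, write the checkpoint contents from the in-memory head instead of replaying segments that no longer exist on disk.

package main

import "fmt"

// memSeries is a stand-in for an in-memory head series; purely illustrative.
type memSeries struct {
	ref    uint64
	labels string
}

// writeCheckpointFromHead sketches the fallback: emit checkpoint series
// records from the head itself rather than from missing WAL segments.
func writeCheckpointFromHead(series []memSeries) {
	for _, s := range series {
		fmt.Printf("checkpoint record: series ref=%d %s\n", s.ref, s.labels)
	}
}

func main() {
	head := []memSeries{
		{ref: 1, labels: `{__name__="up"}`},
		{ref: 2, labels: `{__name__="go_goroutines"}`},
	}
	lastCheckpointIdx, firstSegmentIdx := 1, 3
	if firstSegmentIdx > lastCheckpointIdx+1 {
		// Segments between the checkpoint and the first on-disk segment are
		// missing: rebuild the checkpoint from the head instead.
		writeCheckpointFromHead(head)
	}
}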

krasi-georgiev commented Oct 10, 2018

That is a good idea.
I will test it, and if it takes the same time and resources we can do this by default and not only when there are missing segments.

krasi-georgiev commented Oct 11, 2018

@tml although I am unable to replicate this locally and can't see what might cause it, I opened a PR to unblock the WAL truncation.
This way, if a WAL segment is missing you might lose some data, but at least the WAL will not grow indefinitely or cause a crash.
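
Roughly the intended behaviour, sketched under my reading of the comment (not the actual PR): detect the gap, log a warning about possible data loss, and continue truncation from the segments that do exist instead of aborting.

package main

import "log"

// truncateFrom sketches a tolerant checkpoint/truncation path: instead of
// returning "unexpected gap to last checkpoint" and leaving the WAL to grow,
// warn about the missing segments and carry on from the first one on disk.
func truncateFrom(lastCheckpointIdx, firstSegmentIdx int) int {
	if firstSegmentIdx > lastCheckpointIdx+1 {
		log.Printf("warning: gap between checkpoint %d and first segment %d; data in the missing segments is lost",
			lastCheckpointIdx, firstSegmentIdx)
	}
	return firstSegmentIdx // resume from whatever actually exists
}

func main() {
	log.Println("truncating WAL from segment", truncateFrom(1, 3))
}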

tml commented Oct 11, 2018

I appreciate the heads-up, but you wanted @trnl not @tml. :)

krasi-georgiev commented Oct 11, 2018

Damn, autocomplete :) Thanks.

trnl commented Oct 16, 2018

@krasi-georgiev, thank you. We'll try with 2.4.3 later and get back to you.

krasi-georgiev commented Oct 29, 2018

Can anyone give any more clues on how to replicate this or what might have caused it?

duhang commented Oct 31, 2018

Hi, @krasi-georgiev

We hit a similar WAL truncation error here, and the WAL files keep piling up at 1 GB/hour.
Everything else is fine though; the blocks still compact normally. BTW, we are using Prometheus 2.3.0~2.3.2. We tried 2.4.2 but hit the same issue.

Oct 31 05:02:48 monitoring-1001 prometheus[30290]: level=error ts=2018-10-31T05:02:48.138094737Z caller=head.go:359 component=tsdb msg="WAL truncation failed" err="read candidate WAL files: invalid entry body size 1073741824 <file: 2, lastOffset: 10975973>" duration=8.195689383s

We always bump into this issue when enabling a very slow scrape target (we scrape it every 3min).

But we can't confirm which metric caused the problem, since we have 4+ million head series in our Prometheus, making it almost impossible to isolate.

We can't do anything about it right now besides restarting Prometheus periodically to clean up the dangling WAL files.

Here is what the current WAL directory looks like:

-rw-r--r--. 1 root root 1.9M Oct  7 17:00 011348
-rw-r--r--. 1 root root 1.1G Oct  7 17:02 011352
-rw-r--r--. 1 root root 1.1G Oct  7 17:02 011353
-rw-r--r--. 1 root root 623M Oct  7 17:03 011354
-rw-r--r--. 1 root root 246M Oct  7 17:16 011355
-rw-r--r--. 1 root root 243M Oct  7 17:29 011356
-rw-r--r--. 1 root root 253M Oct  7 17:43 011357
+ 2000+ more and growing WAL files

I can ship the first couple of suspected WAL files to you (~3 GB uncompressed). Maybe you could help us figure out what the Prometheus TSDB was complaining about with `read candidate WAL files: invalid entry body size 1073741824 <file: 2, lastOffset: 10975973>`.

I had sent WAL files to you before, so this time we can do the same. Just let me know.

Thanks,

krasi-georgiev commented Oct 31, 2018

@duhang that is a separate issue; here I am trying to investigate the cause of `unexpected gap to last checkpoint`.

I want to look into the error message you encountered as well, but can you please copy and paste your comment into a new issue and ping me there so we can track these separately.

chlunde commented Nov 7, 2018

We're hitting this on a Prometheus in a test environment with a low retention time (2h) and prometheus_tsdb_head_series around 800k-1000k. First it got OOM-killed, then the disk filled up with WAL files, and finally we got crashes because of the full disk. I will try to reproduce it on 2.5.0. The data disk is an emptyDir volume on XFS with a 20 GB quota. The memory limit is set to 10 GiB. @trnl did you also see an OOM kill or unclean shutdown before the WAL issue?

Graphs are in TZ+1.
[screenshots omitted]

level=info ts=2018-11-06T15:00:21.094117763Z caller=compact.go:398 component=tsdb msg="write block" mint=1541505600000 maxt=1541512800000 ulid=01CVMRK5E6F8HQ5PBDGR5VNHD9
level=info ts=2018-11-06T15:00:25.192424279Z caller=head.go:446 component=tsdb msg="head GC completed" duration=1.484368166s
level=info ts=2018-11-06T15:01:08.855327128Z caller=head.go:477 component=tsdb msg="WAL checkpoint complete" low=494 high=536 duration=43.662836909s

oom killed at 16:49Z:

[Tue Nov  6 17:49:30 2018] prometheus invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=986
[Tue Nov  6 17:49:30 2018] prometheus cpuset=docker-90aeb5c06462cd723db717bb78e63073e127459b12576446828749f07c20ea33.scope mems_allowed=0
...
[Tue Nov  6 17:49:30 2018] memory: usage 9765624kB, limit 9765624kB, failcnt 31783484
[Tue Nov  6 17:49:30 2018] memory+swap: usage 9765624kB, limit 9007199254740988kB, failcnt 0
[Tue Nov  6 17:49:30 2018] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Tue Nov  6 17:49:30 2018] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podef895ea3_e0bd_11e8_abb3_005056a5d7b0.slice/docker-90aeb5c06462cd723db717bb78e63073e127459b12576446828749f07c20ea33.scope: cache:9440KB rss:9755820KB rss_huge:1191936KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:9755820KB inactive_file:4744KB active_file:4696KB unevictable:0KB


level=info ts=2018-11-06T16:49:58.428380618Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2018-11-06T16:49:58.42845151Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2018-11-06T16:49:58.428468346Z caller=main.go:240 host_details="(Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 prometheus-core-f6d6449b5-t7w98 (none))"
level=info ts=2018-11-06T16:49:58.428483015Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-11-06T16:49:58.428495374Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-11-06T16:49:58.430167033Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-11-06T16:49:58.430252338Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-11-06T16:49:58.448327166Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541491200000 maxt=1541498400000 ulid=01CVMAVPY69FGPQ1CD336M8RQY
level=info ts=2018-11-06T16:49:58.453824985Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541498400000 maxt=1541505600000 ulid=01CVMHQE66BM5R513KZS6ZR2M0
level=info ts=2018-11-06T16:49:58.455364333Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541505600000 maxt=1541512800000 ulid=01CVMRK5E6F8HQ5PBDGR5VNHD9
level=warn ts=2018-11-06T16:54:30.89934958Z caller=head.go:371 component=tsdb msg="unknown series references" count=126
level=info ts=2018-11-06T16:54:31.125332101Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-11-06T16:54:31.125408459Z caller=main.go:624 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T16:54:31.126691966Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T16:54:31.127615186Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T16:54:31.12841872Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T16:54:31.129292469Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T16:54:31.12998879Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T16:54:31.13000705Z caller=main.go:523 msg="Server is ready to receive web requests."
level=info ts=2018-11-06T16:56:06.516654304Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2018-11-06T16:56:06.516725065Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2018-11-06T16:56:06.516742088Z caller=main.go:240 host_details="(Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 prometheus-core-f6d6449b5-t7w98 (none))"
level=info ts=2018-11-06T16:56:06.516756971Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-11-06T16:56:06.516769444Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-11-06T16:56:06.517723815Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-11-06T16:56:06.517753111Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-11-06T16:56:06.518764779Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541491200000 maxt=1541498400000 ulid=01CVMAVPY69FGPQ1CD336M8RQY
level=info ts=2018-11-06T16:56:06.520517189Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541498400000 maxt=1541505600000 ulid=01CVMHQE66BM5R513KZS6ZR2M0
level=info ts=2018-11-06T16:56:06.521285876Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541505600000 maxt=1541512800000 ulid=01CVMRK5E6F8HQ5PBDGR5VNHD9
level=warn ts=2018-11-06T17:00:46.682633299Z caller=head.go:371 component=tsdb msg="unknown series references" count=126
level=info ts=2018-11-06T17:00:46.960451756Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-11-06T17:00:46.960533164Z caller=main.go:624 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T17:00:46.961838617Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:00:46.962710925Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:00:46.963329451Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:00:46.963969491Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:00:46.964564627Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T17:00:46.964592042Z caller=main.go:523 msg="Server is ready to receive web requests."
level=info ts=2018-11-06T17:01:17.044983421Z caller=compact.go:398 component=tsdb msg="write block" mint=1541512800000 maxt=1541520000000 ulid=01CVMZGFHFEDBWCZ22HG9H6FN8
level=info ts=2018-11-06T17:01:22.874136569Z caller=head.go:446 component=tsdb msg="head GC completed" duration=2.0217453s
level=info ts=2018-11-06T17:02:19.520619861Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2018-11-06T17:02:19.520693944Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2018-11-06T17:02:19.520714014Z caller=main.go:240 host_details="(Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 prometheus-core-f6d6449b5-t7w98 (none))"
level=info ts=2018-11-06T17:02:19.520730892Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-11-06T17:02:19.520745199Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-11-06T17:02:19.522026821Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-11-06T17:02:19.521945116Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-11-06T17:02:19.523351983Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541498400000 maxt=1541505600000 ulid=01CVMHQE66BM5R513KZS6ZR2M0
level=info ts=2018-11-06T17:02:19.524716029Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541505600000 maxt=1541512800000 ulid=01CVMRK5E6F8HQ5PBDGR5VNHD9
level=info ts=2018-11-06T17:02:19.525652288Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1541512800000 maxt=1541520000000 ulid=01CVMZGFHFEDBWCZ22HG9H6FN8
level=warn ts=2018-11-06T17:06:29.81404573Z caller=head.go:371 component=tsdb msg="unknown series references" count=126
level=info ts=2018-11-06T17:06:30.06512187Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-11-06T17:06:30.065199822Z caller=main.go:624 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T17:06:30.066454938Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:06:30.067293968Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:06:30.067981342Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:06:30.068633238Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-11-06T17:06:30.069231546Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yaml
level=info ts=2018-11-06T17:06:30.069248835Z caller=main.go:523 msg="Server is ready to receive web requests."
level=warn ts=2018-11-06T17:25:57.532017287Z caller=scrape.go:804 component="scrape manager" scrape_pool=kubernetes-cadvisor target=https://....:10250/metrics/cadvisor msg="append failed" err="write to WAL: log samples: write data/wal/00000675: disk quota exceeded"
panic: runtime error: slice bounds out of range
goroutine 391 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.(*WAL).log(0xc0000934a0, 0xc169c8c000, 0x39, 0x78000, 0xc1b45f6001, 0xc169c8c000, 0x39)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal/wal.go:480 +0x3fd
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal.(*WAL).Log(0xc0000934a0, 0xc0e03a9b28, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/wal/wal.go:450 +0xd4
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*headAppender).log(0xc01741c5f0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:678 +0x1c0
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*headAppender).Commit(0xc01741c5f0, 0x0, 0x0)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/head.go:689 +0xd4
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.dbAppender.Commit(0x1d3d480, 0xc01741c5f0, 0xc0006800c0, 0xc0e03a9c88, 0x14e89fc)
	/go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:349 +0x35
github.com/prometheus/prometheus/storage/tsdb.appender.Commit(0x1d3d900, 0xc18840a260, 0xc0e03a9d50, 0x14e6e70)
	/go/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:258 +0x31
github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit(0xc1cce7a800, 0xc18840a280, 0x0)
	/go/src/github.com/prometheus/prometheus/storage/fanout.go:161 +0x49
github.com/prometheus/prometheus/scrape.(*scrapeLoop).report(0xc10d54d400, 0xbef091891a1facbb, 0x149e8a5ba84, 0x2d77a60, 0x417d91fd, 0x9475, 0x9475, 0x1d1e2c0, 0xc18840a120, 0x0, ...)
	/go/src/github.com/prometheus/prometheus/scrape/scrape.go:1127 +0x397
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xc10d54d400, 0x2540be400, 0x2540be400, 0x0)
	/go/src/github.com/prometheus/prometheus/scrape/scrape.go:818 +0x6f9
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
	/go/src/github.com/prometheus/prometheus/scrape/scrape.go:329 +0x45d
level=info ts=2018-11-06T17:26:00.054555057Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2018-11-06T17:26:00.054643901Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2018-11-06T17:26:00.054662548Z caller=main.go:240 host_details="(Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 prometheus-core-f6d6449b5-t7w98 (none))"
level=info ts=2018-11-06T17:26:00.05467989Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-11-06T17:26:00.054694587Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-11-06T17:26:00.055231815Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-11-06T17:26:00.055323628Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2018-11-06T17:26:00.055346932Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2018-11-06T17:26:00.055360861Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2018-11-06T17:26:00.055372338Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2018-11-06T17:26:00.055388165Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2018-11-06T17:26:00.055257417Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-11-06T17:26:00.055405662Z caller=main.go:453 msg="Scrape manager stopped"
level=info ts=2018-11-06T17:26:00.055448384Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-11-06T17:26:00.055465402Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-11-06T17:26:00.055482618Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-11-06T17:26:00.055494183Z caller=main.go:608 msg="Notifier manager stopped"
level=error ts=2018-11-06T17:26:00.055996603Z caller=main.go:617 err="opening storage failed: mkdir data/: permission denied"

dswarbrick commented Nov 7, 2018

I've also seen v2.4.3 appear to exhibit this issue. I suspect that we had a few unclean shutdowns, and after the crash recovery (which incidentally eats a LOT of memory), things seemed to be back to normal, until the first WAL checkpoint.

We run retention=30d, min-block-duration=1h, max-block-duration=24h. For some strange reason, the default min-block-duration=2h triggers a checkpoint every 2h on the hour, but setting it to 1h results in checkpoints every hour on the half-hour. In any case, the WAL checkpoint appeared to become somehow deadlocked. Metrics were still being ingested and could still be queried, however RAM was being steadily consumed, and the WAL files were accumulating. Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first.

The only workaround I found was, most importantly, to catch Prometheus in such a state before it crashed and perform a clean shutdown / restart. Once restarted, it began to process the backlog of WAL files, disk usage receded, and eventually everything got back to normal.

krasi-georgiev commented Nov 8, 2018

@chlunde unfortunately I don't see anything in the logs to help troubleshoot the issue.

@dswarbrick

We run retention=30d, min-block-duration=1h, max-block-duration=24h. For some strange reason, the default min-block-duration=2h triggers a checkpoint every 2h on the hour, but setting it to 1h results in checkpoints every hour on the half-hour.

I didn't quite understand why you needed to lower the default block duration.

In any case, the WAL checkpoint appeared to become somehow deadlocked. Metrics were still being ingested and could still be queried, however RAM was being steadily consumed, and the WAL files were accumulating. Prometheus inevitably crashed when it ran out of memory or disk space, whichever happened first.

Any unusual log messages when this happens?

Can you try to describe a minimal setup and steps to reproduce this, so I can spend some time finding the culprit for this weird behaviour?

@krasi-georgiev changed the title from "WAL files are not truncated after upgrade to 2.4.2" to "WAL files are not truncated after upgrade to 2.4.2, caused by `unexpected gap to last checkpoint`" on Nov 8, 2018

krasi-georgiev commented Nov 8, 2018

@dswarbrick thanks for the update. Would you mind moving your comments to a new issue and pinging me there? If you don't have the error reported in the first comment (unexpected gap to last checkpoint), this is likely a separate issue.

BTW, attaching the output of the promtool debug command (shipped in the same release folder as the Prometheus binary) when this happens will give us a clue about what starts eating up all the memory.

krasi-georgiev commented Nov 8, 2018

@trnl any more pointers on how to replicate this?

krasi-georgiev commented Dec 5, 2018

@trnl have you experienced this again? I couldn't find any clues in the code of what might have caused it.

krasi-georgiev commented Dec 12, 2018

ping @trnl

krasi-georgiev commented Dec 17, 2018

Looks stale, so closing. Feel free to reopen if this is still an issue.
