Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recovery from snapshots fails on compaction #4442

Closed
karlhungus opened this Issue Aug 1, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@karlhungus
Copy link

karlhungus commented Aug 1, 2018

Bug Report

What did you do?

  • created a snapshot
  • stopped prometheus
  • created a tar of the snapshot dir
  • created a new directory
  • extracted the tar to that directroy
  • started pointing to the untar'ed directory

Prometheus runs for a bit then errors out with compaction errors

Steps to reproduce:

  1. create tmp and cfg
  2. create `cfg/prometheus.yml
  3. docker run -p 9090:9090 -v $(pwd)/tmp:/data -v $(pwd)/cfg:/cfg docker.io/prom/prometheus:v2.3.1 --storage.tsdb.path=/data/ --web.enable-admin-api --config.file=/cfg/prometheus.yaml
  4. let it run for a bit
  5. take a snapshot: curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
  6. stop prometheus
  7. tar up the snapshot dir: tar cf foo.tar *
  8. remove contents of tmp
  9. tar the contents back to tmp tar xf foo.tar
  10. start prometheus: docker run -p 9090:9090 -v $(pwd)/tmp/:/data/ -v $(pwd)/cfg:/cfg docker.io/prom/prometheus:v2.3.1 --storage.tsdb.path=/data/ --web.enable-admin-api --config.file=/cfg/prometheus.yaml
  11. Eventually (this takes some time ) : component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap

Note:

this step took almost 4 hours for me to reproduce:

level=info ts=2018-07-31T20:49:01.3456933Z caller=main.go:524 msg="TSDB started"
level=info ts=2018-07-31T20:49:01.3458056Z caller=main.go:603 msg="Loading configuration file" filename=/cfg/prometheus.yaml
level=info ts=2018-07-31T20:49:01.349905Z caller=main.go:500 msg="Server is ready to receive web requests."
level=info ts=2018-07-31T23:47:49.8883852Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBV6XGBPPY2P1HPS17B3XZ
level=error ts=2018-07-31T23:47:49.9125301Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 2]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B
3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 2]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"
level=info ts=2018-07-31T23:47:51.4487278Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBV8FVY0MHG3CDMQT5SMK0

What did you expect to see?

prometheus running with data from the snapshot

What did you see instead? Under which circumstances?

compaction seemed to detect overlapping blocks

Environment

  • System information:

    insert output of uname -srm here

uname -srm
Darwin 17.7.0 x86_64

I've also seen this multiple times with containers on:

 uname -srm
Linux 4.16.6-1.el7.elrepo.x86_64 x86_64
  • Prometheus version:

    insert output of prometheus --version here

  • 2.3.1, also reproduced in 2.3.2
  • Prometheus configuration file:
# https://prometheus.io/docs/prometheus/latest/configuration/configuration/
global:
  scrape_interval: 10s
  scrape_timeout: 7s

scrape_configs:
# Scrape config for Prometheus itself and the sidecar container.
- job_name: 'prometheus'
  static_configs:
  - targets:
    - localhost:9090
- job_name: 'prometheus-manager'
  static_configs:
  - targets:
    - localhost:9091
  • Logs:
level=info ts=2018-07-31T23:47:49.8883852Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBV6XGBPPY2P1HPS17B3XZ                                                                                                                                                                                                     level=error ts=2018-07-31T23:47:49.9125301Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 2]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B
3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 2]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"
level=info ts=2018-07-31T23:47:51.4487278Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBV8FVY0MHG3CDMQT5SMK0                                                                                                                                                                                                     level=error ts=2018-07-31T23:47:51.4731134Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 3]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 3]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"                                                                                                                                                                                                                                  level=info ts=2018-07-31T23:47:53.5064036Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBVAGA0SADBTN793V2W0P1                                                                                                                                                                                                     level=error ts=2018-07-31T23:47:53.5380963Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 4]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 4]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"                                        level=info ts=2018-07-31T23:47:59.8282593Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBVGNRYGQ4WHEEW9Z9K3Z7                                                                                                                                                                                                     level=error ts=2018-07-31T23:47:59.8651846Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 5]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVGNRYGQ4WHEEW9Z9K3Z7, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 5]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVGNRYGQ4WHEEW9Z9K3Z7, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"                                                                                                                                                                                                                          level=info ts=2018-07-31T23:48:07.9052446Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBVRJ7QDSVTJV0SX17A32N                                                                                                                                                                                                     level=error ts=2018-07-31T23:48:07.9591026Z caller=db.go:277 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap: [mint: 1533067200000, maxt: 1533069894057, range: 44m54s, blocks: 6]: <ulid: 01CKS1CAG08B2228N605TEY9NW, mint: 1533067200000, maxt: 1533069894057, range: 44m54s>, <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVGNRYGQ4WHEEW9Z9K3Z7, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVRJ7QDSVTJV0SX17A32N, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>\n[mint: 1533069894057, maxt: 1533070067482, range: 2m53s, blocks: 6]: <ulid: 01CKSBV6XGBPPY2P1HPS17B3XZ, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBV8FVY0MHG3CDMQT5SMK0, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVAGA0SADBTN793V2W0P1, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVGNRYGQ4WHEEW9Z9K3Z7, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKSBVRJ7QDSVTJV0SX17A32N, mint: 1533067200000, maxt: 1533074400000, range: 2h0m0s>, <ulid: 01CKS1HKF77GWY69ZBGWMQ73A9, mint: 1533069894057, maxt: 1533070067482, range: 2m53s>"                                level=info ts=2018-07-31T23:48:23.9655972Z caller=compact.go:387 component=tsdb msg="write block" mint=1533067200000 maxt=1533074400000 ulid=01CKSBW885RP874VTSET7P5M5T

Other things

  • I haven't tried simplifying the reproduction (i.e. removing the tar bit, and just pointing at the snapshotdir) steps (because at first i didn't think i could reproduce it)
  • I haven't tried manually triggering a compaction
  • I'm attempting now to reproduce this in 2.3.2 and will post my findings.
@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Aug 1, 2018

Thanks for the report. This is a known issue, which was fixed with 2.3.2.

@karlhungus

This comment has been minimized.

Copy link
Author

karlhungus commented Aug 1, 2018

No problem! Weird, I swore I just reproduced it in 2.3.2. Is there a way to force a compaction (or have it occur sooner)?

@karlhungus

This comment has been minimized.

Copy link
Author

karlhungus commented Aug 1, 2018

Should i expect a 2.3.1 snapshot to to still exhibit this behavior on 2.3.2 (i.e. do i need to take a new snapshot on 2.3.2)?

I tried this in 2.3.2, both using the docker run method above and haven't seen this error. Thanks!

@karlhungus karlhungus closed this Aug 1, 2018

@karlhungus

This comment has been minimized.

Copy link
Author

karlhungus commented Aug 2, 2018

A little more info for people who come across this problem:

  1. We couldn't reproduce this with data just from 2.3.2
  2. We couldn't get snapshots taken from 2.3.2 that had some 2.3.1 data in them to come up clean
  • they'd come up initially, but after restart would throw the errors above
  • Removing the complained about directories and restarting would make them restart clean. We didn't see too much data loss - but we weren't terribly scientific about how much we did lose. subjectively it didn't look like much if any at all
  1. copying data over without taking a snapshot seemed to work, and survive restarts without any problem at all! (however ymmv)
@lock

This comment has been minimized.

Copy link

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.