
Snapshot creation error: found chunk outside of compacted #5105

Open · zuzzas opened this Issue Jan 17, 2019 · 8 comments · 5 participants

zuzzas commented Jan 17, 2019

Bug Report

What did you do?
Updated Prometheus v2.5.0 -> v2.6.0.
What did you expect to see?
A working snapshotting feature.
What did you see instead? Under which circumstances?
Snapshots stopped working, and backups recovered from v2.5.0 aren't working either.

{"status":"error","errorType":"internal","error":"create snapshot: snapshot head block: write compaction: found chunk with minTime: 1547726468168 maxTime: 1547726528168 outside of compacted minTime: 1547719200000 maxTime: 1547726527867"}
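For reference, the timestamps in the error show the head chunk overrunning the compacted range by only a fraction of a second. A quick check (plain arithmetic on the reported values, not Prometheus code; the helper name is made up for illustration):

```go
package main

import "fmt"

// overshootMillis returns how far a chunk's maxTime extends past the
// compacted block's maxTime, in milliseconds.
func overshootMillis(chunkMax, blockMax int64) int64 {
	return chunkMax - blockMax
}

func main() {
	// Values taken verbatim from the error message.
	chunkMaxTime := int64(1547726528168)
	compactedMaxTime := int64(1547726527867)

	// The chunk ends 301 ms after the block boundary, which is what
	// triggers "found chunk outside of compacted".
	fmt.Printf("chunk overruns the block by %d ms\n",
		overshootMillis(chunkMaxTime, compactedMaxTime)) // 301 ms
}
```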

Environment

  • System information:

    Linux 4.15.0-43-generic x86_64

  • Prometheus version:

    prometheus, version 2.6.0 (branch: HEAD, revision: dbd1d58)
    build user: root@bf5760470f13
    build date: 20181217-15:14:46
    go version: go1.11.3

  • Logs:

level=info ts=2019-01-17T12:00:00.045764719Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D0ZBFH0DZABWG7VYNNVTX9V7
level=info ts=2019-01-17T12:00:00.046032635Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D154WDRC1RJ4Z8BJPYYHCP4C
level=info ts=2019-01-17T12:00:00.046270873Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1AYA05AFGWN8AP8QCPBXP7Q
level=info ts=2019-01-17T12:00:00.046494941Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1CW28GMGNRRQDGN3DEWF0H9
level=info ts=2019-01-17T12:00:00.046644839Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DGMT7Q571MBSV66MG2M55C
level=info ts=2019-01-17T12:00:00.046755629Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DGKQWN3Q4K870EQTGCJ0CD
level=info ts=2019-01-17T12:00:00.046845693Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DQFF5Q5EXDTT5GRKB470SD
level=info ts=2019-01-17T12:00:27.62712416Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D0ZBFH0DZABWG7VYNNVTX9V7
level=info ts=2019-01-17T12:00:27.627376645Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D154WDRC1RJ4Z8BJPYYHCP4C
level=info ts=2019-01-17T12:00:27.627594562Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1AYA05AFGWN8AP8QCPBXP7Q
level=info ts=2019-01-17T12:00:27.627812374Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1CW28GMGNRRQDGN3DEWF0H9
level=info ts=2019-01-17T12:00:27.627949074Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DGMT7Q571MBSV66MG2M55C
level=info ts=2019-01-17T12:00:27.628038373Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DGKQWN3Q4K870EQTGCJ0CD
level=info ts=2019-01-17T12:00:27.628112874Z caller=db.go:787 component=tsdb msg="snapshotting block" block=01D1DQFF5Q5EXDTT5GRKB470SD
simonpasquier (Member) commented Jan 17, 2019

What flags do you pass to Prometheus? Can you try using the tsdb CLI to check the data directory?

zuzzas (Author) commented Jan 17, 2019

I'll try the tsdb tool. Here are the flags:

/bin/prometheus --storage.tsdb.retention=1008h --storage.tsdb.path=/var/prometheus/data --config.file=/etc/prometheus/config/prometheus.yaml --web.route-prefix=/prometheus/ --web.external-url=https://test.flant.com/prometheus/ --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.enable-lifecycle --web.enable-admin-api --rules.alert.resend-delay=30s
gouthamve (Member) commented Jan 17, 2019

Hi @zuzzas, can you elaborate on "Snapshots stopped working"? I am able to snapshot and recover just fine on 2.6.1, which as far as I can see has the same TSDB code as 2.6.

I'm looking into the restore from 2.5.

gouthamve (Member) commented Jan 17, 2019

Ah, I see the error now, sorry for missing it earlier.

zuzzas (Author) commented Jan 21, 2019

@gouthamve
I've rm -rf'ed the entire Prometheus data directory, and it still fails with the same error. I've tried turning on the debug log level, but it shows nothing new. Could there be some strange metric that causes TSDB to misbehave?

zuzzas (Author) commented Jan 21, 2019

Here's the interesting part: the snapshot works for all blocks but the last one. Below is a graph from an attempt to restore such a snapshot to another Prometheus instance.

The snapshot was taken at 03:06.

[image: graph from the restore attempt]

zuzzas (Author) commented Jan 21, 2019

I've resolved it by passing the ?skip_head=true URL parameter to the snapshot endpoint. I lose a couple of hours of data in the backup, but it works. Still, far from ideal.
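For anyone scripting this workaround, a minimal sketch in Go of building the snapshot request URL. The endpoint path /api/v1/admin/tsdb/snapshot and the skip_head parameter come from the Prometheus admin API; the helper name and base URL are illustrative, and with --web.route-prefix set (as in this setup) the path would need that prefix prepended:

```go
package main

import (
	"fmt"
	"net/url"
)

// snapshotURL builds the admin-API snapshot URL. With skipHead set, the
// head block (the most recent, still-open data) is excluded, avoiding the
// "found chunk outside of compacted" error at the cost of recent samples.
func snapshotURL(base string, skipHead bool) string {
	u, _ := url.Parse(base)
	u.Path = "/api/v1/admin/tsdb/snapshot"
	if skipHead {
		u.RawQuery = url.Values{"skip_head": {"true"}}.Encode()
	}
	return u.String()
}

func main() {
	// POST this URL (the server must run with --web.enable-admin-api).
	fmt.Println(snapshotURL("http://localhost:9090", true))
	// -> http://localhost:9090/api/v1/admin/tsdb/snapshot?skip_head=true
}
```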

bwplotka (Contributor) commented Feb 4, 2019

Looks like prometheus/tsdb#514 should fix this.

bwplotka added a commit to bwplotka/tsdb that referenced this issue Feb 4, 2019

compact: Fixed populateBlock to fix out-of-boundary chunks automatically.

Fixes prometheus/prometheus#5105

This seems to happen when you take a snapshot: it invokes `write.Compact` for the head block with MaxTime set to time.Now at the moment of the call. During compaction, chunks can still be appended to the head block, resulting in chunks beyond that max time.

Changes:
* For block write:
    * If a chunk is partially outside max time, trim the part that sticks out.
    * If a chunk is completely outside, skip it.
    * If a chunk is before min time, error out; we have no use case for this.
* Improved comments for the compactor write API. We aim to guarantee that block chunks stay within the requested min and max time.
* Added test cases.
* Avoided returning via argument where not needed (meta stats).
* Handled errors from reader closers in populateBlock. For now they are logged at error level. (We should not block the write if we cannot close a reader, I guess, but we should at least be aware of it.)

Just to double-check: blocks hold chunks in (minTime, maxTime], right? Where is the documentation for this? I remember we wanted to note this down somewhere (:

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
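The three bullet rules for block writes amount to clamping each chunk's time range against the block's boundaries. A hedged sketch of that decision logic (type and function names are illustrative, not the actual prometheus/tsdb code):

```go
package main

import (
	"errors"
	"fmt"
)

// chunkMeta mirrors the idea of a chunk's time span (illustrative, not the
// real tsdb type).
type chunkMeta struct {
	MinTime, MaxTime int64
}

var errChunkBeforeMin = errors.New("chunk starts before block min time")

// clampChunk applies the rules from the commit message:
//   - chunk starting before minT     -> error, no use case for it
//   - chunk entirely after maxT      -> skip (keep == false)
//   - chunk partially outside maxT   -> trim its MaxTime down to maxT
func clampChunk(c chunkMeta, minT, maxT int64) (chunkMeta, bool, error) {
	if c.MinTime < minT {
		return c, false, errChunkBeforeMin
	}
	if c.MinTime > maxT {
		return c, false, nil // completely outside: skip it
	}
	if c.MaxTime > maxT {
		c.MaxTime = maxT // partially outside: drop the leftover tail
	}
	return c, true, nil
}

func main() {
	// The chunk from the reported error, clamped to the compacted range.
	c, keep, err := clampChunk(
		chunkMeta{MinTime: 1547726468168, MaxTime: 1547726528168},
		1547719200000, 1547726527867,
	)
	fmt.Println(c, keep, err) // {1547726468168 1547726527867} true <nil>
}
```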

bwplotka added the same commit to bwplotka/tsdb five more times on Feb 5, 2019 (rebases referencing this issue; commit message identical to the above).