Crash recovery should deal better with corrupt data for individual series in checkpoint file #2475

Closed
korovkin opened this Issue Mar 6, 2017 · 5 comments

korovkin commented Mar 6, 2017

What did you do?

Upgraded to a newer version of Prometheus.

What did you expect to see?

clean start

What did you see instead? Under which circumstances?

Prometheus crashed on startup with a panic during crash recovery.

Environment

WARN[0007] Lost at least 9 chunks for fingerprint de012972ecd6f680, metric tunein_taps{country="DE", instance="localhost:9301", item="s236774", job="prometheus", source="browse"}.  source=crashrecovery.go:116
panic: runtime error: makeslice: cap out of range

goroutine 1 [running]:
panic(0x18b2120, 0xc446690260)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/prometheus/prometheus/storage/local.(*persistence).recoverFromCrash(0xc420362c60, 0xc4200a90b0, 0x1, 0x17bdd20)
        /go/src/github.com/prometheus/prometheus/storage/local/crashrecovery.go:119 +0xb82
github.com/prometheus/prometheus/storage/local.(*persistence).loadSeriesMapAndHeads.func1(0xc420362c60, 0xc445e859e8, 0xc4200a90b0, 0xc445e859d8)
        /go/src/github.com/prometheus/prometheus/storage/local/persistence.go:817 +0xf6
github.com/prometheus/prometheus/storage/local.(*persistence).loadSeriesMapAndHeads(0xc420362c60, 0xc42000e500, 0x0, 0x267d540, 0xc420010190)
        /go/src/github.com/prometheus/prometheus/storage/local/persistence.go:840 +0x3f5
github.com/prometheus/prometheus/storage/local.(*MemorySeriesStorage).Start(0xc420238b00, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/storage.go:374 +0x1f5
main.Main(0x0)
        /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:181 +0x114f
main.main()
        /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:43 +0x22
  • System information:

uname -srm
Linux 4.4.0-62-generic x86_64

  • Prometheus version:

./prometheus --version

prometheus, version 1.5.2 (branch: master, revision: bd1182d)
build user: root@a8af9200f95d
build date: 20170210-14:41:22
go version: go1.7.5


korovkin commented Mar 6, 2017

See #1873 (comment) for reference.


beorn7 commented Mar 6, 2017

The code should handle this case more gracefully (by discarding this series and moving on), but the root cause is data corruption in the checkpoint file. Most likely you will have more corruptions, so even with the more graceful behavior in place, you might not be able to recover much.


korovkin commented Mar 7, 2017


beorn7 commented Mar 7, 2017

I'll take it as a reminder to fix the code as described above. Will change title.

beorn7 changed the title from "unable to recover after an upgrade" to "Crash recovery should deal better with corrupt data for individual series in checkpoint file" on Mar 7, 2017

beorn7 added a commit that referenced this issue Apr 6, 2017

beorn7 closed this in #2594 on Apr 6, 2017


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 23, 2019
