
Prometheus fails to recover after getting KILLed #435

Closed
discordianfish opened this Issue Dec 26, 2014 · 9 comments

discordianfish commented Dec 26, 2014

Hi,

I accidentally KILLed Prometheus. After restarting, it apparently recovered a bunch of samples but then panicked due to an index being out of range: https://gist.github.com/discordianfish/18b5c2eaea2e2d66858f

  • Affected: prometheus, version 0.9.0rc2 (branch: master, revision: 401decc)
discordianfish commented Dec 29, 2014

Unfortunately I can't reproduce the full situation. The first time I started Prometheus, I got tons of "Truncating file... Recovered ..." messages. This took around an hour with prometheus_local_storage_memory_series = 83490.

Now if I start it again, it's not truncating things anymore but just says:

prometheus, version 0.9.0rc2 (branch: master, revision: 401decc)
  build user:       @347a3d6eaa91
  build date:       20141218-17:01:18
  go version:       1.4
I1229 17:47:25.680919       1 targetmanager.go:61] Pool for job prometheus does not exist; creating and starting...
I1229 17:47:25.681010       1 targetmanager.go:61] Pool for job node_exporter does not exist; creating and starting...
I1229 17:47:25.681061       1 targetmanager.go:61] Pool for job container-exporter does not exist; creating and starting...
I1229 17:47:25.681150       1 targetmanager.go:61] Pool for job haproxy-exporter does not exist; creating and starting...
I1229 17:47:25.681219       1 targetmanager.go:61] Pool for job push-gateway does not exist; creating and starting...
I1229 17:47:25.681247       1 targetmanager.go:61] Pool for job byte-piper does not exist; creating and starting...
I1229 17:47:25.790101       1 storage.go:107] Loading series map and head chunks...
W1229 17:47:27.439218       1 persistence.go:964] Persistence layer appears dirty.
W1229 17:47:27.439250       1 persistence.go:267] Starting crash recovery. Prometheus is inoperational until complete.
I1229 17:47:27.439263       1 persistence.go:273] Scanning files.
W1229 17:47:27.944221       1 persistence.go:438] Treating recovered metric container_memory_stats{...}, fingerprint 009ddc0b004fc531, as freshly unarchived, with 99 chunks in series file.
W1229 17:47:27.971665       1 persistence.go:472] Recovered metric container_blkio_io_service_bytes_recursive_total{...}, fingerprint 00108bfa1c9f8287: head chunk found among the 102 recovered chunks in series file.

It might be a bit quicker, but it's still really slow. strace shows that Prometheus opens the DB files, lseeks, reads, and writes a bit, which is generally what I would expect it to do. But it's slow. This is strace limited to open() calls, with timestamps:

[pid 20766] 17:58:31.805745 open("/prometheus/metrics/2f/1ee8d3e709a9df.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:31.858377 open("/prometheus/metrics/2f/59e9b526a4e8c5.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20779] 17:58:31.915915 open("/prometheus/metrics/2f/b9b91e45274f6c.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:31.958809 open("/prometheus/metrics/2f/5859fb548f1635.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:31.974165 open("/prometheus/metrics/2f/5af0d3d7cf3f77.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:32.014185 open("/prometheus/metrics/2f/e52a86733633ea.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:32.053942 open("/prometheus/metrics/2f/6e6f6aaa2d6e31.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:32.109511 open("/prometheus/metrics/2f/35f2d31a4e5c4a.db", O_RDONLY|O_CLOEXEC) = 96
[pid 20766] 17:58:32.130977 open("/prometheus/metrics/2f/f0d302f5af5075.db", O_RDONLY|O_CLOEXEC) = 96

So I'm not sure what's wrong here. Either it's unexpected that Prometheus touches all those files, or it's just too slow at doing that. The system isn't doing much otherwise; it's running an Elasticsearch node, but that's barely under load.
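
(Editorial illustration, not part of the original thread's tooling: a rough Go sketch that walks a Prometheus metrics directory and times an open-plus-read of each series file, to gauge whether raw per-file I/O alone could explain the roughly 50 ms gaps between the open() calls above. The root path matches the strace output; the 512-byte read size is an assumption.)

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func main() {
	root := "/prometheus/metrics" // storage path as seen in the strace output
	var files int
	var total time.Duration
	buf := make([]byte, 512) // assumed size; roughly one chunk header's worth

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".db") {
			return err
		}
		start := time.Now()
		f, openErr := os.Open(path)
		if openErr != nil {
			return openErr
		}
		f.Read(buf) // read the beginning of the series file
		f.Close()
		total += time.Since(start)
		files++
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if files == 0 {
		fmt.Println("no series files found under", root)
		return
	}
	fmt.Printf("%d series files, average open+read time: %v\n", files, total/time.Duration(files))
}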

juliusv commented Jan 6, 2015

@brian-brazil encountered the same issue in #437 - see there for additional info.

beorn7 commented Jan 6, 2015

If the length of a series file is consistent with the information in the checkpoint, the file should only be stat'd during crash recovery, not open'd. When you see "Recovered metric [...]: head chunk found among the [...] recovered chunks in series file.", the file size was inconsistent and the file had to be inspected. That happens if a chunk was written to the file after the last checkpoint was taken. Series with a high ingestion rate will be in that state more often.

What was your ingestion rate? What was your checkpointing interval?
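
(A minimal Go sketch of the stat-versus-scan decision described above. The names recoverSeries and checkpointedLength, the scanning step, and the example values are illustrative assumptions, not the actual persistence.go API.)

package main

import (
	"fmt"
	"os"
)

// recoverSeries decides whether a series file needs a full scan during crash
// recovery. If the on-disk length matches what the last checkpoint recorded,
// a cheap os.Stat is enough; otherwise the file is opened and its chunks
// scanned to find the real head chunk (the expensive path that shows up as
// open/lseek/read in the strace output above).
func recoverSeries(path string, checkpointedLength int64) error {
	fi, err := os.Stat(path) // metadata only, no reading of chunk data
	if err != nil {
		return err
	}
	if fi.Size() == checkpointedLength {
		// Consistent with the checkpoint: nothing more to do for this series.
		return nil
	}
	// Inconsistent: chunks were appended after the last checkpoint, so the
	// file has to be opened and inspected.
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	fmt.Printf("series file %s grew from %d to %d bytes; scanning chunks for the head chunk\n",
		path, checkpointedLength, fi.Size())
	// ... read chunk headers from f to recover the head chunk ...
	return nil
}

func main() {
	// Example with one of the fingerprints from the strace output; the
	// checkpointed length of 4096 bytes is made up.
	if err := recoverSeries("/prometheus/metrics/2f/1ee8d3e709a9df.db", 4096); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}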

brian-brazil commented Jan 7, 2015

We're using the defaults, and have about 23k series scraped once a minute.

beorn7 commented Jan 7, 2015

My question was related to the slowness, which (in my current theory) can happen if you have relatively few series but write chunks quite frequently (more often than checkpoints are written). @brian-brazil You had no trouble with slow crash recovery, but @discordianfish did, so I was particularly interested in his number of samples per series.

discordianfish commented Jan 7, 2015

@beorn7 My scrape interval is 60s, except for Prometheus scraping itself (still 15s).

beorn7 commented Jan 7, 2015

@discordianfish Thanks. So there goes my theory.

I have now tried several scenarios, and crash recovery behaved as expected. In your case, it had to seek into many series files, which can only happen if most series have a head chunk that differs from the one in the most recent checkpoint. I plan to implement a mechanism that triggers a checkpoint early if that happens.

Perhaps something delayed the checkpointing in your case.

All of that is still only about the slowness, not about the panic.

Still investigating...

beorn7 commented Jan 12, 2015

A modified version of my theory: if many metrics have the same collection pattern (i.e. they are scraped at the same interval and produce similarly structured data with similar compression behavior), then all those metrics will start a new chunk at approximately the same time. Just after that has happened, all the corresponding time series are "dirty", i.e. the head chunk in the last checkpoint is not the current head chunk. If a crash happens at that time, all those series have to be checked during recovery, requiring a disk seek for each.

Three fixes related to this issue have now gone into the code:

  • A fix for the post-recovery crash also reported by Brian.
  • A fix for a deadlock during shutdown (which might have caused the unclean shutdown in your case in the first place).
  • A new heuristic to checkpoint earlier if many series are "dirty" (sketched below).

I assume that addresses all the issues, but in general, the whole crash recovery desperately needs proper automated testing, for which I'll open a separate issue.
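
(A minimal Go sketch of the "checkpoint early when many series are dirty" heuristic, under assumed names: storage, markDirty, dirtySeriesThreshold, and checkpointRequests are illustrative, not the actual Prometheus implementation.)

package main

import (
	"fmt"
	"sync"
	"time"
)

type storage struct {
	mtx                  sync.Mutex
	dirtySeries          int           // series whose head chunk is newer than the last checkpoint
	dirtySeriesThreshold int           // ask for an early checkpoint above this
	checkpointInterval   time.Duration // regular checkpoint period
	checkpointRequests   chan struct{} // nudges the checkpoint loop to run early
}

// markDirty is called whenever a series starts a new head chunk. If too many
// series are dirty at once (as happens when many similarly scraped series
// roll over to a new chunk at the same time), it requests a checkpoint ahead
// of schedule, which bounds the number of disk seeks a later crash recovery
// would need.
func (s *storage) markDirty() {
	s.mtx.Lock()
	s.dirtySeries++
	tooDirty := s.dirtySeries > s.dirtySeriesThreshold
	s.mtx.Unlock()
	if tooDirty {
		select {
		case s.checkpointRequests <- struct{}{}:
		default: // an early checkpoint is already pending
		}
	}
}

func (s *storage) checkpointLoop() {
	ticker := time.NewTicker(s.checkpointInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
		case <-s.checkpointRequests:
		}
		s.mtx.Lock()
		fmt.Printf("checkpointing with %d dirty series\n", s.dirtySeries)
		// ... write head chunks and series metadata to the checkpoint file ...
		s.dirtySeries = 0
		s.mtx.Unlock()
	}
}

func main() {
	s := &storage{
		dirtySeriesThreshold: 1000,
		checkpointInterval:   5 * time.Minute,
		checkpointRequests:   make(chan struct{}, 1),
	}
	go s.checkpointLoop()
	s.markDirty() // in a real system this is called from the ingestion path
	time.Sleep(10 * time.Millisecond)
}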

beorn7 closed this Jan 12, 2015

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
