Ingestion stops, probably due to deadlocked series maintenance #1459

guanglinlv commented Mar 4, 2016

Hi folks,

I updated Prometheus from 0.16.2 to 0.17.0. I tried to reuse the old Prometheus configuration and the data, but I got an error on the status page and I can't get any samples. My configuration is very simple: […]

There are no useful debug logs or hints. What am I missing? Thanks.

Comments

Can you paste the error message you're getting about going into rushed mode?

brian-brazil added the question label on Mar 4, 2016

@brian-brazil Any more debug information? BTW, all of this is OK when I roll back to 0.16.2. Thanks.

You should be seeing this info log:

prometheus/storage/local/storage.go, line 1309, at a8c79f0

Also interesting would be the command-line flags you started your server with, and the output of `curl http://localhost:9090/metrics | grep '^prometheus_local_storage'`.

Alfred-concur commented Mar 4, 2016

Experiencing the same errors. I will issue the commands as suggested above and post the output here for review.

Alfred-concur commented Mar 4, 2016

Observed that when refreshing the status page, the last scrape time for at least one target jumps between 150-800 seconds, and it jumps around between targets every time I refresh.

Alfred-concur commented Mar 4, 2016

FYI: my configuration YAML file uses the default syntax for everything, with only targets added.

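For context, a minimal sketch of such a file in the 0.17-era format (job name and target are placeholders, not the reporter's actual setup):

```yaml
# Defaults everywhere, only static scrape targets added.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    target_groups:            # later versions renamed this to static_configs
      - targets: ['localhost:9100']
```
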
Alfred-concur commented Mar 4, 2016

```
[root@seapr1zenaa001 prometheus-0.17.0rc2.linux-amd64]# curl http://localhost:9090/metrics | grep '^prometheus_local_storage'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
```

As said, command-line flags are also needed. But I can see most of what I need from the above. There is definitely something wrong, because you have […]

It actually looks as if the series maintenance goroutine is deadlocked. Then neither are chunks persisted, nor is the urgency score updated. That would also explain why the prometheus_local_storage_maintain_series_duration_milliseconds values are NaN although maintenance has happened in the past; the last runs were probably more than 10m ago.

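For anyone hitting this, one way to watch for the symptom is to re-run the metrics grep from earlier in the thread, focused on the metrics named above (metric names as used in the 0.17 local storage; they may differ in other versions):

```sh
# If series maintenance is deadlocked, the maintenance duration stays
# NaN while the number of memory chunks keeps growing.
curl -s http://localhost:9090/metrics \
  | grep -E 'prometheus_local_storage_(maintain_series_duration_milliseconds|memory_chunks)'
```
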
beorn7 added the bug label and removed the question label on Mar 4, 2016

beorn7 changed the title from "scrape skipped due to throttled ingestion" to "Ingestion stops, probably due to deadlocked series maintenance" on Mar 4, 2016

Alfred-concur commented Mar 4, 2016

I have restarted the Prometheus server, and started Prometheus and node-exporter for the local host. Currently everything is working fine; I will monitor for recurrence of the issue. When Prometheus came up, there were a lot of WARN messages regarding data recovery. Small sample:

```
WARN[0009] Recovered metric prometheus_notifications_latency_seconds{instance="localhost:9090", job="prometheus", quantile="0.9"}, fingerprint ff83bffa6f97fd9e: all 345 chunks recovered from series file.  source=crashrecovery.go:319
```

Yeah, that's the crash recovery, and it's normal if you don't have a clean shutdown. It looks like the series maintenance stops under certain circumstances. Unfortunately, I couldn't reproduce the problem on any of our 48 production Prometheus servers that run 0.17.0. Every single one of them is maintaining its series just fine...

beorn7 self-assigned this on Mar 4, 2016

Alfred-concur commented Mar 4, 2016

Understood. I will keep an eye on Prometheus and report any further problems related to this issue. Thanks for your help.

In case it gets into that state again, a goroutine dump would be great. Then we could see which goroutine is deadlocked, if any. You get it with […]

Another explanation would be if your server is stuck writing a checkpoint file, e.g. because the underlying disk is very slow or blocked. (Perhaps that could happen on Amazon or other cloud providers if you are running out of IOPS quota?)

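The dump command itself was lost above; a plausible way to get one, assuming Prometheus exposes Go's standard `net/http/pprof` handlers on its listen address:

```sh
# Full stacks of all goroutines via the pprof HTTP endpoint.
curl 'http://localhost:9090/debug/pprof/goroutine?debug=2'

# Alternative for a local process: the Go runtime prints all goroutine
# stacks to stderr on SIGQUIT (note: the process exits afterwards).
kill -QUIT "$(pidof prometheus)"
```
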
Alfred-concur commented Mar 4, 2016

This is my first time working with Prometheus, and the first time my company has looked into using it, so it is all new to me. I am using a CentOS 6.5 server running on VMware; the guest is hosted on a dedicated server with only 6 other guests, and the storage is local SSD. I would hope the SSDs are fast enough :)

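One way to test the slow-disk hypothesis from the previous comment is to watch device utilization while Prometheus is writing. A sketch, assuming the sysstat package is installed on the CentOS guest:

```sh
# Extended per-device stats every 5 seconds; sustained ~100% utilization
# or long await times would implicate the disk rather than a deadlock.
iostat -x 5
```
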
Hi @beorn7, sorry for the late reply. Here is my situation:

It is reproduced on my side. I can post any debug information you need here when I'm back in my lab later. Thanks.

Thanks.

@guanglinlv You have configured a maximum of 10000 memory chunks, but in reality, you have 88624. 0.16.x would just keep going in that case (and most likely OOM eventually, assuming that people tailor their `-storage.local.memory-chunks` setting to the memory they actually have). As your machine is apparently happy with 88624 memory chunks (and since you have 10586 active time series, i.e. more than you would allow chunks in memory), set `-storage.local.memory-chunks` to a significantly higher value.

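For illustration, raising the limit at startup could look like this (0.17-era single-dash flag syntax; the value 200000 is an arbitrary example, not a recommendation):

```sh
# Allow more chunks in memory so ingestion is not throttled; size this
# to the RAM actually available on the machine.
./prometheus \
  -config.file=prometheus.yml \
  -storage.local.memory-chunks=200000
```
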
@beorn7 Yeah, another problem is that instant vector selectors cannot work with regex-match expressions. For example: […]

Are there any changes about this in 0.17.0? Thanks.

That's all right, I have checked the changelog for 0.17.0.

This change makes me very confused; the awesome QL is a very popular feature for me. Why did you need to anchor it? Thanks.

@guanglinlv See #996. You'll also find some other issues about the anchoring behavior. It is believed to help people do the right thing.

@grobie OK, I agree that exact matching should help people do the right thing, but I think an optional choice would be better than full anchoring. This will break some use cases, for example: […]

Thanks.

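To make the behavioral change concrete, a hypothetical selector (metric and job names invented for illustration):

```
# Before 0.17.0, regex matchers were unanchored, so this also matched
# jobs merely containing "node", e.g. "node-exporter":
up{job=~"node"}

# From 0.17.0 on, the matcher is evaluated as "^node$"; substring
# matching now needs explicit wildcards:
up{job=~".*node.*"}
```
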
@guanglinlv We went way off-topic here; I extracted your question into a new issue, #1470. Let's move the discussion there.

To not lose track: the issue still remaining here is the one reported by @Alfred-concur. @Alfred-concur, if it happens again, it would be extremely useful to get the goroutine dump as described.

Alfred-concur commented Mar 8, 2016

Cool, thank you.

ArtemChekunov commented Mar 24, 2016

I have a similar problem.

@sc0rp1us Please send the storage metrics, too.

ArtemChekunov commented Mar 24, 2016

@sc0rp1us This doesn't look like a deadlock. It looks more like your disk cannot keep up with the ingestion speed. On what kind of disk are you running?

@beorn7 @sc0rp1us Any updates? If this is a bug, I would be keen to resolve it for 0.18.0.

ArtemChekunov commented Apr 6, 2016

Nope. Because I have enough RAM, it currently works OK.

Okay, seems that it was what @beorn7 suspected then.

fabxc closed this on Apr 6, 2016

FTR: The deadlock as seen by @Alfred-concur was in fact a real issue, not just a configuration snafu. But without the goroutine dump, we have no idea where to start looking. So as long as it doesn't reoccur, we can leave this closed.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.