Scraping of metrics stops #4736
Comments
Hi, could you share a screenshot of the targets page (with the URLs redacted)? This is a serious issue if reproducible, but you're the only one to report it, which makes me wonder if it's a config issue. You could send the profiles to gouthamve [at] gmail.com and I'll make sure to forward them to the other maintainers.
Hi, …
simonpasquier added the kind/more-info-needed and component/scraping labels on Oct 15, 2018
@smelchior I'd be interested to look at the debugging info. You can send me the output of the …
@simonpasquier I sent you an email with the details I have; this includes the pprof debug info I retrieved via the /debug/.. endpoints. Thanks!
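For reference, a minimal sketch of how such goroutine profiles can be collected from the standard /debug/pprof handlers, assuming Prometheus listens on localhost:9090 (adjust host and port to your setup):

```sh
# Human-readable goroutine dump with full stacks:
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutine.txt

# Binary profile that can be opened later with `go tool pprof`:
curl -s 'http://localhost:9090/debug/pprof/goroutine' > goroutine.pb.gz
```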
I've had a quick look at the pprof data shared by @smelchior and I suspect that one scrape appender is stuck and is blocking the other appenders. @krasi-georgiev thoughts? Complete graph of goroutines: [image attached]. Graph of goroutines matching on …: [image attached].
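A hedged sketch of how goroutine graphs like these can be rendered from a captured profile; the file name goroutine.pb.gz follows the sketch above, and the SVG output requires Graphviz:

```sh
# Render the complete goroutine graph:
go tool pprof -svg goroutine.pb.gz > goroutines.svg

# Or explore it interactively in a browser and focus on appender-related frames:
go tool pprof -http=:8081 goroutine.pb.gz
```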
krasi-georgiev added the component/local storage label on Nov 12, 2018
The issue occurred again, also after the upgrade to 2.5.0. The log messages are the last ones; some time after that the scraping started to hang. I emailed the debug details to @simonpasquier.
Hi @smelchior, I can now start looking into this. Can you ping me on the prometheus-dev channel to see if you can help me replicate this?
ping @smelchior
Sorry, I was busy the last few days. I am not sure I can help you in this regard as I have no way to really reproduce this; it just happens in our environment from time to time. We have now been without an issue for over two weeks, but I guess it might happen again anytime :-(
@smelchior thanks for the update. Would you mind still pinging me on IRC? I might ask for some additional details and clues to try to replicate it myself. The profiles show that it locks when writing to the WAL file, so the first logical question is what storage type is used for the WAL files.
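A hedged way to answer that question from inside the container or on the node; the path /prometheus is the default data directory and may differ in your deployment, and since the Prometheus image is minimal these tools may need to run from the host or a debug container:

```sh
# Filesystem type and usage backing the WAL directory:
df -hT /prometheus/wal

# Mount options of the underlying volume:
mount | grep -w /prometheus

# Current WAL segments and their sizes:
ls -l /prometheus/wal
```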
ping @smelchior
I tried to find you on Tuesday in IRC but had no luck; what is your username there?
The same as here, @krasi-georgiev, on #prometheus-dev. Btw, what is the storage type used for the WAL files?
OK, I will get back to you tomorrow on IRC. The instances are running on K8s on AWS and the PV for the Prometheus data is an SSD EBS volume.
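For reference, a sketch of how the backing volume can be confirmed from the cluster side; the namespace, PVC name, and volume ID below are hypothetical placeholders:

```sh
# Find the PV bound to the Prometheus PVC (names depend on the Helm release):
kubectl -n monitoring get pvc prometheus-server -o jsonpath='{.spec.volumeName}'

# For an in-tree EBS PV, read the EBS volume ID:
kubectl get pv <pv-name> -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'

# Confirm the EBS volume type (gp2, io1, ...):
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].VolumeType'
```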
Also, if possible, try master as there have been quite a few fixes in the recent weeks.
@smelchior can you still replicate it with the master branch? 2.6 will be out in a few days, btw.
@smelchior 2.6 is out, could you try it?
I have updated now. I will close this for now and reopen should it happen again.
smelchior closed this on Dec 20, 2018
Thanks, appreciated!
Unfortunately the issue happened again with this version. This time I was also not able to access the /targets page anymore; it just never responded. The other pages did still respond. Unfortunately I was not able to get the debug info this time either. The volume was still writable though; I checked in the container.
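When the /targets page hangs, the same information can sometimes still be read from the HTTP API; host, port, and the use of jq are assumptions here, and if the scrape manager itself is blocked this endpoint may hang as well, in which case a goroutine dump is the more useful artifact:

```sh
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {scrapeUrl, health, lastScrape, lastError}'
```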
smelchior reopened this on Jan 10, 2019
Does this happen on a single machine only or in different setups?
We haven't had any other reports of such behaviour, which makes me think it is something specific to your setup.
As discussed via IRC, the issue occurred again and I sent you the debug info via email. One other note: the first startup after Prometheus has been restarted looks like this:
Waiting for the profiles with the block/mutex data included.
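For completeness, a minimal sketch of how those profiles can be pulled from the standard pprof handlers; note that Go only populates the block and mutex profiles if the corresponding profile rates are enabled in the running binary:

```sh
curl -s 'http://localhost:9090/debug/pprof/block?debug=1' > block.txt
curl -s 'http://localhost:9090/debug/pprof/mutex?debug=1' > mutex.txt
```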
I am waiting for the next crash to happen :)
No need, there is no rush.
nickbp commented on Feb 10, 2019:
Edit: Never mind, turns out the issue I had been seeing was 100% user error: the two bad instances had an incorrect list of namespaces to query (in …
vears91 commented on Feb 18, 2019:
Maybe related to #4249? I also saw targets not being scraped while the web UI was still accessible.
@vears91 I doubt it; when we checked the profiles, they indicated some blocking when writing to the database. @smelchior, any more info since we last chatted?
No, it did not happen again; maybe the …
So weird, maybe it was some glitch in the storage. In that case I will close it for now. Feel free to reopen if it happens again or if you have more info.



smelchior commented on Oct 13, 2018:
Proposal
The Prometheus process silently stops collecting metrics after a while.
Bug Report
What did you do?
Run Prometheus in our Kubernetes cluster; scraping stops after some days of runtime.
What did you expect to see?
Prometheus to collect my metrics :-)
What did you see instead? Under which circumstances?
Prometheus stops collecting metrics after a while. The process still runs normally and I can access the web UI. There are no errors in the logfile (scraping stopped around 2018-09-10T17:18).
The /targets page shows a time of about 65h ago for all targets, although they should be scraped every minute.
I already posted this on the mailing list (since then we updated to 2.4.2, but this still happens the same way):
https://groups.google.com/d/topic/prometheus-users/bUFa24XKup8/discussion
I collected the Go profiles the last time it happened. Whom can I send these to?
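To confirm staleness like the 65h-old timestamps described above, one hedged option is to range-query the `up` series over the recent past and see where the samples stop; host, port, and the GNU date invocation are assumptions:

```sh
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '6 hours ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=300'
```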
Environment
System information:
Linux 4.4.0-1065-aws x86_64
Prometheus version:
prometheus, version 2.4.2 (branch: HEAD, revision: c305ffaa092e94e9d2dbbddf8226c4813b1190a0)
build user: root@dcde2b74c858
build date: 20180921-07:22:29
go version: go1.10.3
(We use the prom/prometheus:v2.4.2 container image in our Kubernetes cluster, installed via the Helm chart.)
Prometheus configuration file: