Prometheus not scraping metrics `adding stale sample failed` #4249
Comments
Is the time on your machine correct? If it's turning on and off for an hour at a time, I'm smelling something weird happening on your machine with timezones.
@brian-brazil I checked; the time seems correct on all nodes of the cluster. However, the cluster's timezone is EDT while my pods use UTC. Could that be an issue? It has been working fine for a while, and I'm not sure why only one of the two Prometheus instances suddenly started exhibiting this.
No, Prometheus itself ignores timezones completely. If only one of them has the issue, it's likely something wrong with that machine.
I verified the time on all the machines; it doesn't seem to be wrong.
Can you share your full configuration?
It's quite large. I'm also attaching a graph showing how Prometheus was responding; the straight lines indicate when it was up. Link here
What does `up` show for that target? As an aside, the Pushgateway should have honor_labels: true, but nothing else should.
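For reference, a minimal sketch of what that looks like in prometheus.yml; the job names and target addresses below are illustrative, not taken from the reporter's configuration:

```yaml
scrape_configs:
  # Only the Pushgateway job keeps the labels that were pushed with the metrics;
  # for every other job Prometheus should attach its own job/instance labels.
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.monitoring.svc:9091']

  # All other jobs leave honor_labels at its default of false.
  - job_name: node-exporter
    static_configs:
      - targets: ['node1.example.com:9100']
```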
It is always 1.
That doesn't make sense; stale markers won't be produced if up=1 all the time. Are you sure it's always 1?
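One way to verify that is to check whether `up` ever dropped to 0 over the affected window, either ad hoc in the expression browser or via a rule. A sketch, assuming the exporter's job label is `node-exporter` (adjust to the actual job name):

```yaml
# Sketch of a rules file that fires if the node-exporter target was reported
# down at any point in the preceding hour; the job label is an assumption.
groups:
  - name: scrape-health
    rules:
      - alert: NodeExporterWasDown
        expr: min_over_time(up{job="node-exporter"}[1h]) == 0
        annotations:
          summary: "up dropped to 0 for node-exporter within the last hour"
```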
You mean `up` from one instance (the Prometheus that is scraping) to another (the Prometheus that is not working), right? Graph here
I mean the `up` of the node exporter mentioned in the log.
It's not just my node_exporter metrics going on and off; it is the Prometheus itself which is either scraping all targets or none. The second Prometheus is able to scrape the node_exporter, so that target is indeed up.
At this point I'm suspecting a dodgy CPU in the Prometheus machine that's messing up timestamps. Does this still happen if the Prometheus is moved to another machine?
I believe the problem will stop, but we will still not be able to pinpoint the issue. Why can't Prometheus just wait in the queue like the other processes if the CPU is the bottleneck (which I don't think it is, as we have enough cores)?
The information you have provided does not support that conclusion. The gaps would have to be much larger for that to make sense.
brian-brazil added the kind/more-info-needed label on Jun 13, 2018
Also, another observation: when killing the pod to restart it somewhere else, it took about 40 minutes to die.
vears91 commented on Feb 8, 2019
I'm also running into this, running Prometheus 2.7.1 on Kubernetes 1.13.2 in AWS. Prometheus stops scraping all targets, I get a heartbeat-lost alert from my alerting system, and the container cannot be killed during this time. When this happens, I see in the logs that the configuration begins to reload and takes much longer than usual to complete, and I see the same `adding stale sample failed` errors.
@vears91 did you find the culprit for your issue?
vears91 commented on Feb 28, 2019
@krasi-georgiev I still see the issue with scraping stopping. It was happening very often, up to once a day, when we upgraded our Kubernetes networking component (kube-router). We downgraded to a previous version, but it still happens from time to time, maybe once a week. As described in #4736, scraping stops but the web UI is still accessible. The container can't be killed when this is happening.
Would you mind starting the container with the env var DEBUG=1? The profile doesn't include any sensitive data, so you can attach it here. It might also be worth starting Prometheus with the extra flag
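For anyone on Kubernetes wanting to try this, setting that environment variable is just a matter of adding it to the Prometheus container spec. A minimal sketch; the names and image tag are illustrative, and an operator-managed deployment would need the equivalent change through its CRD rather than a hand-written Deployment:

```yaml
# Sketch: adding DEBUG=1 to the Prometheus container of a plain Deployment.
# Names and image tag are illustrative, not taken from this issue.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.7.1
          env:
            - name: DEBUG
              value: "1"
```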
anandsinghkunwar commented on Jun 11, 2018
Bug Report
What did you do?
I have set up Prometheus v2.2.1 using the Prometheus Operator on my Kubernetes cluster. I have deployed node-exporter and kube-state-metrics for cluster monitoring. Sample ingestion has been stopping, on and off, for hours at a time. I have a 30s evaluation interval. I feel it is somehow related to #2894, but that seems to be fixed. I have 2 instances of Prometheus for HA, and only one of them is facing this issue. This doesn't seem to be a memory, CPU, or network issue, as other pods on that node seem to be working fine.
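For context, the global settings described above would look roughly like this in prometheus.yml; the external_labels entry is an assumption, shown only as the usual way to tell two HA replicas apart:

```yaml
global:
  scrape_interval: 30s      # targets are expected to be scraped every 30s
  evaluation_interval: 30s  # the 30s evaluation interval mentioned above
  external_labels:
    replica: prometheus-0   # assumed label distinguishing the two HA instances
```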
What did you expect to see?
I expected prometheus to scrape every 30s as usual.
What did you see instead? Under which circumstances?
It is going on and off every few hours: off for hours, then on for an hour, and so on, all weekend.
System information:
Linux 3.10.0-693.el7.x86_64 x86_64
Prometheus version:
v2.2.1
Logs: