Kubernetes discovery not refreshed after intermittent Kubernetes master failure #1603

rvrignaud commented Apr 28, 2016

Hello,
I'm using Kubernetes SD on Google Container Engine. For a yet-unknown reason, it seems that my Kubernetes master stopped or failed. It is back again, but Prometheus does not refresh the discovery.
Here is an extract of the logs:
prometheus.log.txt
My Prometheus server is running inside the cluster as a pod.
I'm running the 0.18.0 release on Ubuntu.
Here is the configuration:
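(The configuration attachment isn't preserved in this extract. For context, a minimal in-cluster `kubernetes_sd_configs` block of the 0.18 era might have looked roughly like the sketch below; the job name and exact fields are illustrative assumptions, not the reporter's actual settings.)

```yaml
# Illustrative sketch only -- not the reporter's attached configuration.
# Assumes the legacy (pre-1.3) Kubernetes SD schema used by Prometheus 0.18.
scrape_configs:
  - job_name: 'kubernetes'            # hypothetical job name
    kubernetes_sd_configs:
      - api_servers:
          - 'https://kubernetes.default.svc'   # in-cluster API server address
        in_cluster: true              # use the pod's service account credentials
```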
Comments
Prometheus did succeed in refreshing the Kubernetes discovery a few minutes after the Kubernetes API recovered.
It looks like that was the amount of time routing took to come back up. I assume the anonymized IP was correct for the Kubernetes master service?
Yes, the IP is correct, and Prometheus started to work fine after it recovered.
The retry interval is 1 second by default (hence the repeated logs). As soon as the API is available and routable again, Prometheus should reconnect.
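For reference, a sketch of where that interval lives, assuming the legacy (pre-1.3) `kubernetes_sd_config` schema:

```yaml
kubernetes_sd_configs:
  - api_servers:
      - 'https://kubernetes.default.svc'
    in_cluster: true
    retry_interval: 1s   # default; how often the SD retries the API connection
```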
The API was available well before Prometheus refreshed the pods.
This could be something to do with the Kubernetes service routing syncing up again, but I'm not sure. I can't see how this could be related to the Prometheus reconnection logic.
Sounds like what caused my demo yesterday to fail. After my laptop went …
macb commented Jun 6, 2016

I'm getting similarly stale targets, but my kube apiserver was consistently available while the results remained stale. I'm seeing a fair number of EOF errors in the Prometheus logs. This doesn't seem consistent with other applications running within the cluster that talk to the API server (they aren't reporting EOFs).
macb commented Jun 30, 2016

This just happened again. This status page shows attempts to scrape 2 instances that are no longer listed in the Kubernetes API (the service and pods were all deleted previously). I'd be happy to provide any additional information that would be helpful in tracking this down. As it stands, we use metrics for quite a few things, so I don't want to leave the instance broken, but I can capture that data when it occurs again.
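One way to capture that data the next time it occurs, sketched with hypothetical placeholders (`<namespace>`, `<prometheus-host>`, and `<stale-ip>` are assumptions, not values from this report):

```shell
# What Kubernetes currently knows about, with pod IPs:
kubectl get pods -n <namespace> -o wide

# What Prometheus is still trying to scrape; the status page listing
# targets lives at /status in older releases (/targets in later ones).
curl -s http://<prometheus-host>:9090/status | grep '<stale-ip>'
```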
Can I ask which Prometheus version you're on?
macb commented Jul 1, 2016 (edited)

I see a few Kubernetes-related messages in the changelog since 0.18, though nothing that calls this out explicitly. Is there any reason we couldn't upgrade from 0.18 to 0.20?
brian-brazil added the kind/bug label Jul 13, 2016
I'd be very interested in how this works for you in 0.20 / 1.0. I'm running into a similar thing where the first API contact at startup fails (due to everything in our integration test starting up at once, I suppose) and then Prometheus never recovers from that (at least not in the timeframes I've waited). It does recover when I SIGHUP it. This particular instance is a regression from 0.19.2; in that version the same setup worked fine.
… and with 1.0.0 it seems to work again.
treed commented Aug 26, 2016

FWIW, I just ran into this running a version of master I pulled on the 22nd. Hitting the process with HUP caused it to update; thanks for the workaround.
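For anyone else landing here, a sketch of that workaround; the pod name is a placeholder, and the `kill -HUP 1` form assumes Prometheus runs as PID 1 in its container:

```shell
# SIGHUP makes Prometheus reload its configuration, which also
# re-initializes service discovery -- the workaround described above.
kill -HUP "$(pidof prometheus)"

# If Prometheus runs as a pod, the signal can be sent via kubectl instead:
kubectl exec <prometheus-pod> -- kill -HUP 1
```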
Hi,
We've recently merged a brand-new Kubernetes integration. While it's still in beta, it will be replacing the existing one, so it's unlikely we'll be digging further into this issue with the old version.
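For comparison, the rewritten integration is configured per role rather than per API server list; a minimal sketch, assuming the new (1.3+) schema:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod   # other roles: node, service, endpoints
        # with no api_server set, the in-cluster configuration is used
```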
brian-brazil closed this Oct 19, 2016
macb commented Oct 19, 2016

@brian-brazil which version of Prometheus has the new integration, and/or when did it go in?
The 1.3.0-beta.0 release will have it in. It should be cut tomorrow.
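Once cut, it should be available through the usual release channels; the image tag below follows the prom/prometheus release naming convention and is an assumption, not a confirmed tag:

```shell
# Pull the beta image (tag name assumed from the release naming scheme):
docker pull prom/prometheus:v1.3.0-beta.0
```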
macb commented Oct 19, 2016

Thanks! Looking forward to giving it a try in some of our k8s clusters.
lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.