Using k8s SD for alertmanager: doesn't cache address when API server goes away #4019
Comments
That sounds like a general SD issue. All SDs are meant to be resilient to this.
brian-brazil added the kind/bug and component/service discovery labels Mar 27, 2018
We're using k8s SD for the targets, and they didn't disappear.
Hmm, it should all be the same code.
brian-brazil added the component/notify label Mar 27, 2018
They are two separate instances of the same code, so it's conceivable one got into a bad state and the other didn't.
Here are the goroutines: https://gist.github.com/tomwilkie/054376c3559dd24281a333ab2ec82d01
Another interesting aside: the config was reloaded at 14:46, and a bunch of the k8s client goroutines have been stuck since then (waiting for HasSynced). So a new theory: the API server went away, but everything was fine. Then our config got reloaded, and our alertmanagers got forgotten about. Our alerts started clearing around 14:51, indicating the alertmanagers were forgotten at ~14:46.
We've also experienced multiple issues with the K8s discovery ceasing to work in 2.x releases. See the original description in #3810. It can be unblocked by sending a SIGHUP. The latest occurrence was today on a v2.2.1 server.
Can you try the patch in #4013? It could explain some of this.
brian-brazil removed the component/notify label Mar 27, 2018
Is 2.3.1 better?
@tomwilkie are you still seeing this issue? v2.4.x included lots of |

tomwilkie commented Mar 27, 2018
We've had this a couple of times now: when the API servers go away or Prometheus has problems speaking to them, Prometheus "forgets" where the alertmanagers are and we no longer get alerts.
I'm investigating.
This is on 2.2.1