
Using k8s SD for alertmanager: doesn't cache address when API server goes away #4019

Open
tomwilkie opened this Issue Mar 27, 2018 · 11 comments

tomwilkie commented Mar 27, 2018

We've had this a couple of times now: when the API servers go away or Prometheus has problems speaking to them, Prometheus "forgets" where the alertmanagers are and we no longer get alerts.

I'm investigating.

This is on 2.2.1

brian-brazil commented Mar 27, 2018

That sounds like a general SD issue. All SDs are meant to be resilient to this.

tomwilkie commented Mar 27, 2018

We're using k8s SD for the targets, and they didn't disappear.

brian-brazil commented Mar 27, 2018

Hmm, it should all be the same code.

tomwilkie commented Mar 27, 2018

They are two separate instances of the same code, so it's conceivable one got into a bad state and the other didn't.

tomwilkie commented Mar 27, 2018

Yeah, looks like an SD issue (rather than an alerts issue) - here are the events, and they all just stopped at about 14:11, which is about when the master went down:

[screenshot: "screen shot 2018-03-27 at 16 54 24" - SD event log showing events stopping around 14:11]

tomwilkie commented Mar 27, 2018

Another interesting aside: the config was reloaded at 14:46, and a bunch of the k8s client goroutines have been stuck since then (waiting for HasSynced).

So a new theory: the API server went away, but everything was fine. Then our config got reloaded, and our alertmanagers got forgotten - our alerts started clearing around 14:51, indicating the alertmanagers were forgotten at ~14:46.

grobie commented Mar 27, 2018

We've experienced multiple issues with the K8s discovery stopping working in the 2.x releases as well. See the original description in #3810. It can be unblocked by sending a SIGHUP. The latest occurrence was today on a v2.2.1 server.
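For reference, the SIGHUP workaround mentioned above triggers a config reload, which restarts the discovery. A sketch of the two standard ways to do this (assuming a single local `prometheus` process; adjust for your process manager or container setup):

```shell
# Send SIGHUP to trigger a config reload:
kill -HUP "$(pgrep -x prometheus)"

# Or, if Prometheus was started with --web.enable-lifecycle,
# use the HTTP reload endpoint:
curl -X POST http://localhost:9090/-/reload
```

Note this is a workaround, not a fix: the reload itself is what re-runs discovery, and (per the comments above) a reload while the API server is unreachable may be what gets discovery stuck in the first place.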

brian-brazil commented Mar 27, 2018

Can you try the patch in #4013? It could explain some of this.

brian-brazil commented Jun 22, 2018

Is 2.3.1 better?

simonpasquier commented Oct 1, 2018

@tomwilkie are you still seeing this issue? v2.4.x included lots of improvements and fixes on the SD side, and in particular the annoying #4124 should ™️ be fixed.
