While ingestion is suspended, neither the status page is accessible, nor is a clean shutdown possible. #1319

Closed
grobie opened this Issue Jan 15, 2016 · 7 comments

@grobie
Member

grobie commented Jan 15, 2016

We just found one of our servers struggling to keep up with persistence, to the point that ingestion was suspended. The exact circumstances are still unclear, and @beorn7 is investigating.

Unexpectedly (to me at least), the status page would not load for that entire time, while the query interface remained available. The target manager seems to have gotten into an undefined state as well.

@grobie grobie added the bug label Jan 15, 2016

@beorn7

Member

beorn7 commented Jan 15, 2016

And the server would not even shut down properly.

It seems to me that some parts of the server were completely hosed, among them either signal handling or the parts that would shut down targets and query them for the status page... Very difficult to debug.

In general, I think we have to organize the throttling / suspension of sample ingestion more cleanly.

@beorn7

Member

beorn7 commented Jan 15, 2016

I just saw the "no clean shutdown" issue again now after suspension of ingestion. Something breaks if that happens.

@beorn7 beorn7 changed the title from "Status page unavailable while server is in a degraded state" to "While ingestion is suspended, neither the status page is accessible, nor is a clean shutdown possible." Jan 16, 2016

@beorn7

Member

beorn7 commented Jan 16, 2016

The theory is now: as long as any target has suspended ingestion, neither shutdown nor the status page works.

We probably want one big red "suspension" switch that is flipped once we reach the limit of chunks to be persisted, and then flipped back once we are at 90% of that.
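A minimal sketch of such a switch with hysteresis, to make the two thresholds concrete. The type and field names are hypothetical and not taken from the Prometheus code base, and the sketch assumes a single caller (it is not safe for concurrent use as written).

```go
package main

import "fmt"

// suspensionSwitch is a hypothetical name for the single "suspension" switch
// proposed in this thread. It is an illustration, not Prometheus code.
type suspensionSwitch struct {
	limit     int  // number of chunks waiting for persistence at which ingestion is suspended
	suspended bool // current state of the switch
}

// update flips the switch based on the current number of chunks waiting to be
// persisted and returns the resulting state.
func (s *suspensionSwitch) update(chunksToPersist int) bool {
	resumeAt := s.limit * 9 / 10 // 90% of the limit
	switch {
	case !s.suspended && chunksToPersist >= s.limit:
		s.suspended = true // limit reached: suspend ingestion
	case s.suspended && chunksToPersist <= resumeAt:
		s.suspended = false // backlog drained to 90%: resume ingestion
	}
	return s.suspended
}

func main() {
	s := &suspensionSwitch{limit: 1000}
	for _, n := range []int{500, 1000, 950, 900, 950} {
		fmt.Printf("chunks=%d suspended=%v\n", n, s.update(n))
	}
}
```

The gap between the two thresholds keeps the switch from flapping when the backlog hovers around the limit: once suspended, ingestion stays suspended until the backlog has actually shrunk.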

@beorn7

Member

beorn7 commented Jan 16, 2016

This could go into (or needs to be coordinated with) #1064

@fabxc

Member

fabxc commented Jan 18, 2016

The status page issues probably come from the target manager being stuck while holding a lock, e.g. waiting for old scrapers to terminate, which in turn are stuck because their samples aren't ingested.

At which step does shutdown get stuck? Or does sending SIGTERM not even trigger anything?

@beorn7

Member

beorn7 commented Jan 19, 2016

The theory is that SIGTERM triggers as usual, but the target manager can only shut down once all targets are out of suspended ingestion (which might never happen). This theory is not completely proven, though.

But I think we should have a central "stop ingestion" switch anyway, which would make many things much cleaner. It can be gated by an atomic variable to avoid causing too much lock contention.
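A minimal sketch of what an atomically gated central switch could look like. All names here are hypothetical, and returning an error instead of blocking the caller is an assumption made for illustration. The thread does not settle how scrapers should react while ingestion is stopped.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errIngestionStopped is a hypothetical error returned while the gate is closed.
var errIngestionStopped = errors.New("ingestion stopped")

// ingestionGate is a hypothetical central "stop ingestion" switch.
// The flag is only ever accessed through sync/atomic, so the hot path
// (checking the gate for each sample) takes no lock.
type ingestionGate struct {
	stopped int32 // 1 while ingestion is stopped, 0 otherwise
}

func (g *ingestionGate) Stop()         { atomic.StoreInt32(&g.stopped, 1) }
func (g *ingestionGate) Resume()       { atomic.StoreInt32(&g.stopped, 0) }
func (g *ingestionGate) Stopped() bool { return atomic.LoadInt32(&g.stopped) == 1 }

// appendSample stands in for a scraper handing a sample to storage.
// Assumption: it fails fast instead of blocking, so the caller can still
// react to shutdown or status requests while ingestion is stopped.
func appendSample(g *ingestionGate, store *[]float64, v float64) error {
	if g.Stopped() {
		return errIngestionStopped
	}
	*store = append(*store, v)
	return nil
}

func main() {
	gate := &ingestionGate{}
	var store []float64

	fmt.Println(appendSample(gate, &store, 1.0)) // accepted
	gate.Stop()
	fmt.Println(appendSample(gate, &store, 2.0)) // rejected
	gate.Resume()
	fmt.Println(appendSample(gate, &store, 3.0)) // accepted again
	fmt.Println("stored samples:", len(store))
}
```

Because no goroutine ever waits on the gate in this sketch, a component such as the target manager would not have to wait for ingestion to resume before shutting down.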

@lock

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
