Add graceful shutdown timeout for Apache #66


zerebubuth
Contributor

@tomhughes humour me and my worrying? 😉

…s to not-present, so default behaviour should not change.
…wn any idle connections. This is the same timeout used for client connections, so should have no user-visible effect.
@tomhughes
Member

What problem is this solving? I mean five minutes is much longer than we would normally want to wait...

@tomhughes
Member

I mean I don't ever remember waiting that long for apache to restart so either the default is less than that, or we just don't have connections that need that long, or the master process restarts to take new connections and the old one hangs around harmlessly sending data back to the outstanding clients.

@zerebubuth
Contributor Author

I think this may help mitigate circumstances where we would otherwise get "scoreboard is full" errors. Under normal conditions I don't think it would have any effect, and I've never waited that long for Apache, which is partly why I chose a long timeout: to avoid impacting normal operations.

@tomhughes
Member

Well "scoreboard is full" is a secondary problem normally, not a root cause. It just means something else is screwed and connections aren't making progress so eventually apache runs out of slots.

I'm not sure why changing the behaviour of shutdown would even help that? Presumably nobody was shutting apache down when they happened given we were all asleep...

An actual connection timeout would help but that's hard because a small proportion of our connections are legitimately long running.

@zerebubuth
Contributor Author

The suggestion in this Apache bug report is that MPM workers stopping due to reduced load will enter a "shutting down" state which consumes "scoreboard" slots while waiting for the connections to finish. The default timeout is infinity, so they can wait until the TCP connection resets if the other side has vanished. When the server takes more load, the "shutting down" state isn't reversed; instead new MPM workers are started, which leads to eventual resource starvation.

Unfortunately, I wasn't around to capture /server-status when it was happening, so I can't confirm whether all the slots were really in the G state. But I think it's worth adding the timeout, just to be on the safe side. If 300s is too short, then 600s or more would still be better than an outage.
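
For next time, a minimal sketch of how the G-state check could be scripted. It assumes the text of mod_status's machine-readable `/server-status?auto` page has already been fetched and contains the usual `Scoreboard:` line, in which `G` marks a gracefully finishing worker and `.` an open slot. The function names `scoreboard_counts` and `graceful_fraction` are made up for illustration and are not part of any existing tooling.

```python
from collections import Counter


def scoreboard_counts(status_text):
    """Count each worker-state character on the 'Scoreboard:' line.

    status_text is assumed to be the body of /server-status?auto.
    """
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            # Everything after the first colon is one character per slot.
            return Counter(line.split(":", 1)[1].strip())
    raise ValueError("no Scoreboard line found")


def graceful_fraction(status_text):
    """Fraction of all slots currently in the 'G' (gracefully finishing) state."""
    counts = scoreboard_counts(status_text)
    total = sum(counts.values())
    return counts["G"] / total if total else 0.0
```

A fraction close to 1.0 during an incident would support the "all slots stuck in G" theory; a low value would point to some other cause.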

@tomhughes
Member

So the thing is I read http://httpd.apache.org/docs/2.4/mod/mpm_common.html#gracefulshutdowntimeout as only applying when apache is shut down with apachectl graceful but you're reading it as applying when an individual process is recycled because it has hit its connection limit.

Now you may well be right - the documentation could easily be read either way.

I'm not sure it helps that much though if you're the person whose large diff upload happened to be one of the last requests sent to a server and you don't get the reply because the five minute timeout was hit...

@tomhughes
Member

One data point - the longest request on thorn-04 today was 18.5 minutes.

@zerebubuth
Contributor Author

Good point. I see we already have the timeout for proxied connections set very high. In which case, adding a shutdown timeout on top of it probably isn't going to make much difference.

The real fix is clearly an API change, but while our 99.99% upload response time is 351s (for a very large changeset, it has to be said) a timeout of 300s isn't going to work.

@zerebubuth zerebubuth closed this Jun 1, 2016
@zerebubuth zerebubuth deleted the add-graceful-shutdown-timeout branch June 1, 2016 11:25
@tomhughes
Member

Oh I was still looking at this... But I need to unbreak logstash first so I can see the distribution of run times and I had to give up on that and go to bed in the end last night.

@zerebubuth
Contributor Author

zerebubuth commented Jun 1, 2016

I just did some changeset upload statistics from thorn-04's access.log.1, so probably the same you were looking at:

  • 99% of requests finished in <42s
  • 99.9% of requests finished in <161s
  • 99.99% of requests finished in <399s.

0.04% (35 / 87975) finished at or after 300s.
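
For anyone wanting to repeat this on another log, here is a rough sketch of the kind of calculation involved. It assumes the request durations have already been pulled out of the access log as a list of seconds (the log format isn't shown here, so parsing is left out), and it uses the nearest-rank percentile definition, which may not be exactly how the figures above were produced. The helper names are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list.

    Returns the smallest sample such that at least pct% of samples
    are less than or equal to it.
    """
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def fraction_at_or_after(values, threshold):
    """Share of samples that took at least `threshold` seconds."""
    if not values:
        raise ValueError("no samples")
    return sum(1 for v in values if v >= threshold) / len(values)


def summarize(durations, threshold=300.0):
    """Percentiles quoted in the thread plus the share at or over the timeout."""
    ordered = sorted(durations)
    return {
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
        "p99.99": percentile(ordered, 99.99),
        "slow_fraction": fraction_at_or_after(ordered, threshold),
    }
```

The `slow_fraction` value is the share of uploads that a shutdown timeout equal to `threshold` could cut off, assuming the timeout applies to recycled workers at all.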

@zerebubuth
Contributor Author

The figure in the above comment should have been 399s not 351s - I wasn't paying enough attention to the rounding, so it's the 99.985th percentile rather than the 99.99th!
