Remove TimeoutFilter and default timeouts #73

obeattie · 2017-04-22T20:25:44Z

The same behaviour is achieved by using context.WithTimeout.
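
For anyone reading along, here is a rough sketch of what bounding a call from the caller's side with `context.WithTimeout` can look like. It uses only the standard library; `slowCall` and `callWithTimeout` are hypothetical stand-ins for a downstream request and are not part of Typhon's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for a downstream request that needs
// `work` to complete. It returns early with ctx.Err() if the context is
// cancelled or its deadline passes first.
func slowCall(ctx context.Context, work time.Duration) error {
	timer := time.NewTimer(work)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout bounds a single call from the caller's side instead of
// relying on a server-side filter. The name and signature are illustrative.
func callWithTimeout(parent context.Context, work, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // release the timer even when the call finishes early
	return slowCall(ctx, work)
}

// slowCallTimesOut reports whether a one-second call is cut off by a 20ms timeout.
func slowCallTimesOut() bool {
	err := callWithTimeout(context.Background(), time.Second, 20*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

// fastCallSucceeds reports whether a quick call completes within a generous timeout.
func fastCallSucceeds() bool {
	return callWithTimeout(context.Background(), time.Millisecond, time.Second) == nil
}

func main() {
	fmt.Println(slowCallTimesOut(), fastCallSucceeds())
}
```

The caller picks the timeout per call, which is the configurability a fixed default in a filter does not give.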

enginoid · 2017-04-24T16:34:16Z

I agree that we need to make timeouts configurable, but I'm skeptical of removing timeouts altogether. Even though there are a few places where we'd like to have longer timeouts, default timeouts still serve a purpose.

In day-to-day operation, a timeout like this is a catch-all that prevents resource leaks (file descriptors, outgoing connections, incoming connections for the downstream) from the occasional request that for some reason never terminates.

A request timeout also serves an important purpose as a recovery mechanism during abnormal operation. Without it, if a particular downstream starts hanging on every request, the server's request pool will fill up (or, in the absence of a request pool, a resource like file descriptors will run out) and the server will be unable to serve new requests. Someone then has to go in and restart it, but identifying hosts that have run into this condition might be tricky, and it's not necessarily obvious to anyone that this is happening.
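
To make the recovery argument concrete, here is a toy model of it (a sketch, not how Typhon or any real server manages its pool). The `pool` type, `hang`, and the other names are made up for illustration: a fixed-size pool is filled with requests to a downstream that never answers, and the only thing that frees the slots again is the per-request timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hang simulates a downstream that never answers; only the context ends the call.
func hang(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// pool is a hypothetical fixed-size request pool, modelled as a buffered
// channel where each buffered element is an occupied slot.
type pool chan struct{}

// tryAcquire takes a slot if one is free and reports whether it succeeded.
func (p pool) tryAcquire() bool {
	select {
	case p <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees one occupied slot.
func (p pool) release() { <-p }

// poolRecovers fills every slot with a hanging request bounded by timeout.
// It reports whether the pool was full while the requests were hanging and
// whether a slot was free again after they had all timed out.
func poolRecovers(size int, timeout time.Duration) (fullWhileHanging, freeAfterTimeout bool) {
	p := make(pool, size)
	done := make(chan struct{}, size)
	for i := 0; i < size; i++ {
		p.tryAcquire() // the pool starts empty, so this succeeds
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		go func() {
			defer cancel()
			_ = hang(ctx) // returns only when the timeout fires
			p.release()
			done <- struct{}{}
		}()
	}
	fullWhileHanging = !p.tryAcquire() // new requests are rejected here
	for i := 0; i < size; i++ {
		<-done
	}
	freeAfterTimeout = p.tryAcquire()
	return fullWhileHanging, freeAfterTimeout
}

// hangingPoolRecovers runs the model with three slots and a 100ms timeout.
func hangingPoolRecovers() bool {
	_, free := poolRecovers(3, 100*time.Millisecond)
	return free
}

func main() {
	full, free := poolRecovers(3, 100*time.Millisecond)
	fmt.Println("full while hanging:", full, "free after timeout:", free)
}
```

With `context.Background()` in place of the timeout context, the goroutines would never return and the pool would stay full until a restart.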

Would it make sense to adapt this change to make the request timeout configurable? If no one is specifically calling for this at the moment, we could also put this off and instead prepare an RFC on timeouts so that the approach is ready when we need the implementation.

enginoid · 2017-04-24T16:42:16Z

It's also worth keeping in mind that the downstream's server won't necessarily trigger eventual cancellation of the request. An example of that is when a server with request pooling accepts a connection and queues it, but makes no progress on its queue and therefore never returns. (This is surprisingly common server behavior.)

Even though we felt fairly confident that linkerd's current behavior is desirable in this scenario, I think it's helpful to be able to think about communication over networks in terms of guarantees enforced at every level so that we can basically say "if anything hangs, it will automatically stop hanging after its downstream starts hanging." Of course, this is where you start wanting timeouts much lower than a minute (or ideally deadlines), because in a request chain involving five services that all hang on a failed leaf, the worst-case recovery time is five minutes.
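
A small sketch of the deadline idea, under the assumption that the deadline set at the edge is visible to every hop. Here the five "services" are just nested function calls in one process, so the context carries the deadline for free; between real services it would have to travel with the request. `hop` and `chainRecoveryTime` are hypothetical names. Each hop still applies its own one-minute timeout, but `context.WithTimeout` never extends a deadline the parent already has, so the whole chain gives up when the edge deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hop models one service in the chain. It applies its own timeout and then
// calls the next hop, or hangs if it is the failed leaf.
func hop(ctx context.Context, remaining int, perHop time.Duration) error {
	// The derived deadline is the earlier of the parent's and now+perHop.
	ctx, cancel := context.WithTimeout(ctx, perHop)
	defer cancel()
	if remaining == 0 {
		<-ctx.Done() // failed leaf: never answers on its own
		return ctx.Err()
	}
	return hop(ctx, remaining-1, perHop)
}

// chainRecoveryTime measures how long a chain of `hops` services takes to
// give up when the leaf hangs and the edge sets `edgeDeadline`.
func chainRecoveryTime(hops int, perHop, edgeDeadline time.Duration) time.Duration {
	ctx, cancel := context.WithTimeout(context.Background(), edgeDeadline)
	defer cancel()
	start := time.Now()
	_ = hop(ctx, hops-1, perHop)
	return time.Since(start)
}

// boundedByEdgeDeadline reports whether five hops with one-minute timeouts
// recover in well under a second when the edge deadline is 50ms.
func boundedByEdgeDeadline() bool {
	return chainRecoveryTime(5, time.Minute, 50*time.Millisecond) < time.Second
}

func main() {
	fmt.Println(boundedByEdgeDeadline())
}
```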

obeattie · 2017-04-24T16:47:38Z

I completely agree with you, but I don't think Typhon is the place to enforce this. See discussion in Slack 😄

enginoid

Ah, I follow -- use the context to time out instead of a filter 👍. LGTM.

Remove TimeoutFilter

2a48535

The same behaviour is achieved by using context.WithTimeout

obeattie requested a review from enginoid April 22, 2017 20:25

enginoid approved these changes Apr 27, 2017

obeattie merged commit f8c1a59 into master May 2, 2017

obeattie deleted the remove-timeout branch May 2, 2017 21:20
