Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TimeoutFilter and default timeouts #73

Merged
merged 1 commit into from
May 2, 2017
Merged

Conversation

obeattie
Copy link
Contributor

The same behaviour is achieved by using context.WithTimeout.

The same behaviour is achieved by using context.WithTimeout
@obeattie obeattie requested a review from enginoid April 22, 2017 20:25
@enginoid
Copy link

I agree that we need to make timeouts configurable, but I'm skeptical of removing timeouts altogether. Even though there are a few places where we'd like to have longer timeouts, default timeouts still serve a purpose.

In day to day operation, a timeout like this is a catch-all approach to preventing resource leaks for the occasional request that for some reason never terminates (file descriptors, outgoing connections, incoming connections for the downstream).

A request timeout also serves an important purpose as a recovery mechanism during abnormal operation. Without it — if a particular downstream starts hanging on every request — the server's request pool will fill up (or in the absence of a request pool, a resource like file descriptors) and it will be unable to serve new requests. This requires someone to go in and restart, but identifying hosts that have run into this condition might be tricky, and it's not necessarily obvious to anyone that this might be happening.

Would it make sense to adapt this change to make the request timeout configurable? If no-one is specifically calling for this at the moment, we could also put this off and instead prepare an RFC on timeouts so that the approach is prepared when we need the implementation.

@enginoid
Copy link

It's also worth keeping in mind that the downstream's server won't necessarily trigger eventual cancellation of the request. An example of that is when a server with request pooling accepts a connection but queues it, but does not make progress on its queue and therefore never returns. (This is surprisingly common server behavior.)

Even though we felt fairly confident that linkerd's current behavior is desirable in this scenario, I think it's helpful to be able to think about communication over networks in terms of guarantees enforced at every level so that we can basically say "if anything hangs, it will automatically stop hanging after its downstream starts hanging." Of course, this is where you start wanting timeouts much lower than a minute (or ideally deadlines), because in a request chain involving five services that all hang on a failed leaf, the worst-case recovery time is five minutes.

@obeattie
Copy link
Contributor Author

I completely agree with you, but I don't think Typhon is the place to enforce this. See discussion in Slack 😄

Copy link

@enginoid enginoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I follow -- use the context to time out instead of a filter 👍. LGTM.

@obeattie obeattie merged commit f8c1a59 into master May 2, 2017
@obeattie obeattie deleted the remove-timeout branch May 2, 2017 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants