session: fail stuck session RPCs on health timeout #6649
tonistiigi merged 2 commits into moby:master
Bind session RPC contexts to caller lifetime so ongoing RPCs fail when the session is canceled. Add an integration test that blocks the session tunnel and verifies the health monitor releases the hung build after timeout. Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Reduce the default session health timeout and reset the failure state after one successful probe so recovery is immediate after transient session tunnel issues. Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
41c6012 to 77f1a03
    context.AfterFunc(callerCtx, func() {
        cause := context.Cause(callerCtx)
        if cause == nil {
            cause = context.Canceled
        }
        cancel(cause)
    })
Could this leak one AfterFunc registration per derived RPC until the whole session is canceled? Since the stop function is ignored, completed request contexts stay attached to the long-lived callerCtx for the rest of the session lifetime.
Maybe this should unregister the callback when the derived context finishes first.
    - context.AfterFunc(callerCtx, func() {
    -     cause := context.Cause(callerCtx)
    -     if cause == nil {
    -         cause = context.Canceled
    -     }
    -     cancel(cause)
    - })
    + stopCaller := context.AfterFunc(callerCtx, func() {
    +     cause := context.Cause(callerCtx)
    +     if cause == nil {
    +         cause = context.Canceled
    +     }
    +     cancel(cause)
    + })
    + context.AfterFunc(ctx, func() {
    +     stopCaller()
    + })
I think it should be fine to keep this until the session goes away? It's similar to things like the storage ref counting, which becomes active and is released when the build ends.
I don't understand your example. Looks like you are canceling ctx after it has already been stopped.
Ah makes sense yeah, I was worried about callback accumulation on busy long-lived sessions, but I agree that without evidence this is probably not worth complicating the helper for.
And right, the example was sloppy. What I meant was only that if the derived request context finishes before callerCtx, we could unregister the callerCtx callback at that point instead of keeping it until session teardown.