Skip to content

session: fail stuck session RPCs on health timeout#6649

Merged
tonistiigi merged 2 commits intomoby:masterfrom
tonistiigi:session-monitor-cancel
Apr 15, 2026
Merged

session: fail stuck session RPCs on health timeout#6649
tonistiigi merged 2 commits intomoby:masterfrom
tonistiigi:session-monitor-cancel

Conversation

@tonistiigi
Copy link
Copy Markdown
Member

Bind session RPC contexts to caller lifetime so ongoing RPCs fail when the session is canceled. Add an integration test that blocks the session tunnel and verifies the health monitor releases the hung build after timeout.

@glightfoot

Bind session RPC contexts to caller lifetime so ongoing RPCs fail when the
session is canceled. Add an integration test that blocks the session tunnel
and verifies the health monitor releases the hung build after timeout.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Reduce the default session health timeout and reset the failure state
after one successful probe so recovery is immediate after transient
session tunnel issues.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
@tonistiigi tonistiigi force-pushed the session-monitor-cancel branch from 41c6012 to 77f1a03 Compare April 3, 2026 06:32
@tonistiigi tonistiigi marked this pull request as ready for review April 3, 2026 15:39
Comment thread session/context.go
Comment on lines +9 to +15
context.AfterFunc(callerCtx, func() {
cause := context.Cause(callerCtx)
if cause == nil {
cause = context.Canceled
}
cancel(cause)
})
Copy link
Copy Markdown
Member

@crazy-max crazy-max Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it leaks one AfterFunc registration per derived RPC until the whole session is canceled? Since the stop function is ignored, completed request contexts stay attached to the long-lived callerCtx for the rest of the session lifetime.

Maybe this should unregister the callback when the derived context finishes first.

Suggested change
context.AfterFunc(callerCtx, func() {
cause := context.Cause(callerCtx)
if cause == nil {
cause = context.Canceled
}
cancel(cause)
})
stopCaller := context.AfterFunc(callerCtx, func() {
cause := context.Cause(callerCtx)
if cause == nil {
cause = context.Canceled
}
cancel(cause)
})
context.AfterFunc(ctx, func() {
stopCaller()
})

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine to keep this until the session goes away? It's similar things like the storage ref counting when it becomes active and is released when build ends.

I don't understand your example. Looks like you are canceling ctx after it has already been stopped.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense yeah, I was worried about callback accumulation on busy long-lived sessions, but I agree that without evidence this is probably not worth complicating the helper for.

And right, the example was sloppy. What I meant was only that if the derived request context finishes before callerCtx, we could unregister the callerCtx callback at that point instead of keeping it until session teardown.

@tonistiigi tonistiigi merged commit a8a4b0f into moby:master Apr 15, 2026
210 of 211 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants