Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

progress: fix leak of pipe goroutine from MultiReader #4902

Merged
merged 1 commit into from
May 7, 2024

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented May 4, 2024

In the case where the context provided to MultiReader.Reader is cancelled, the writer will be deleted.

Before this, closeWriter was not called in that codepath, which meant that MultiReader.handle would never see the writer and never close it.

  • I don't have an isolated repro, but when running a few hundred of Dagger's tests I was seeing several thousand goroutines blocked on waiting for the progress pipe's context to be done.

Now, we just call closeWriter when it gets deleted so that the pipe's goroutine doesn't leak.

  • After this change, those thousands of leaked goroutines are entirely gone.

In the case where the context provided to MultiReader.Reader is
cancelled, the writer will be deleted.

Before this, closeWriter was not called in that codepath, when meant
that MultiReader.handle would never see the writer and never close it.
* I don't have an isolated repro, but when running a few hundred of
  Dagger's tests I was seeing several thousand goroutines blocked on
  waiting for the progress pipe's context to be done.

Now, we just call closeWriter when it gets deleted so that the pipe's
goroutine doesn't leak.
* After this change, those thousands of leaked goroutines are entirely
  gone.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@tonistiigi
Copy link
Member

I don't have an isolated repro, but when running a few hundred of Dagger's tests I was seeing several thousand goroutines blocked on waiting for the progress pipe's context to be done.

This looks correct but I don't understand what is the actual case triggering this if it doesn't easily repro.

@sipsma
Copy link
Collaborator Author

sipsma commented May 6, 2024

This looks correct but I don't understand what is the actual case triggering this if it doesn't easily repro.

Basically the conditions that trigger it are:

  1. context provided to Job.Status (which ends up passed to MultiReader.Reader) gets cancelled
  2. MultiReader.handle is still running because io.EOF hasn't been hit on the underlying main reader yet

So I believe in the case where the context is cancelled the writer there will be missing progress, but if the context is cancelled that feels like the right behavior. Just need to make sure we don't leave around goroutines blocked forever.

It's pretty easy to repro in dagger, but we use buildkit as a library and are directly calling Job.Status with a context that can get cancelled.

  • We'll obviously deal with that ourselves, the fact that the context is being cancelled before the reader is flushed is not a buildkit issue

However, even for vanilla buildkitd, it looks possible for this to be hit if the controller Status API stream is abruptly cancelled/closed: https://github.com/sipsma/buildkit/blob/04745a7cc38eca09ff9e9e3ccd91044038e0841d/control/control.go#L493-L493

I'm guessing there's some client-side protections around not closing that stream prematurely (i.e. before the main reader is fully drained), but still feels like a possibility in the case where a client ungracefully crashes or similar. So I think the fix is worth making for overall correctness even if harder to hit in vanilla buildkitd/buildx.

@tonistiigi tonistiigi merged commit b3cee61 into moby:master May 7, 2024
73 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants