Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: fix resource sampler goroutine leak #4081

Merged
merged 1 commit into from
Aug 1, 2023

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented Jul 30, 2023

Before this, the runc executor did not close the cgroupRecord when the container exited non-zero, which resulted in goroutines leaking.


Verified the fix manually by getting goroutine stack dumps after builds that included failed execs finished, which previously showed a bunch of goroutines like this:

goroutine 748062 [select]:
github.com/moby/buildkit/executor/resources.(*Sampler[...]).run(0x1905f80)
  /go/pkg/mod/github.com/moby/buildkit@v0.12.1-0.20230723195917-40aabfe76809/executor/resources/sampler.go:92 +0x70
created by github.com/moby/buildkit/executor/resources.(*Sampler[...]).Record
  /go/pkg/mod/github.com/moby/buildkit@v0.12.1-0.20230723195917-40aabfe76809/executor/resources/sampler.go:83 +0x184

and now after the fix they are gone. Not sure if there's reasonable ways of adding an integ test for this at this time.

@sipsma sipsma requested review from tonistiigi and jedevc July 30, 2023 18:49
executor/runcexecutor/executor.go Show resolved Hide resolved
@@ -335,7 +335,11 @@ func (w *runcExecutor) Run(ctx context.Context, id string, root executor.Mount,

err = exitError(ctx, err)
if err != nil {
releaseContainer(context.TODO())
if rec != nil {
rec.CloseAsync(releaseContainer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change the release behavior depending on if recorder is enabled.

CloseAsync is an optimization to make sure the additional resource gathering and deleting cgroup by buildkit does not make container exec slower. But that case is not needed in case of error and I think the recorder should be just discarded synchronously instead of trying the sample it again async.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I added a synchronous Close to the interface now and updated to call that in the error case.

Before this, the runc executor did not close the cgroupRecord when the
container exited non-zero, which resulted in goroutines leaking.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the fix-resource-goroutine-leak branch from 052740d to fa11bf9 Compare July 31, 2023 14:53
@sipsma sipsma requested a review from tonistiigi July 31, 2023 14:56
@sipsma sipsma merged commit 2267f00 into moby:master Aug 1, 2023
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants