Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

return an error instead of panicking when failing to get edge #2382

Merged
merged 1 commit into from
Sep 30, 2021

Conversation

maxlaverse
Copy link
Contributor

@maxlaverse maxlaverse commented Sep 28, 2021

Currently, buildkitd crashes with a panic when it can't get an edge in solver.NewInputRequest(). This is unfortunate as it affects all the other builds running on the daemon at the same time.

As suggested by a comment left in the code, returning an errored pipe is much more elegant and leaves only the problematic build with a return leaving incoming open, instead of crashing the daemon.

See #2303

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can reproduce these errors (or incoming open) then please also try to figure out in what conditions they appear. Reaching these errors should not be possible without another bug.

@@ -351,7 +352,7 @@ type pipeFactory struct {
func (pf *pipeFactory) NewInputRequest(ee Edge, req *edgeRequest) pipe.Receiver {
target := pf.s.ef.getEdge(ee)
if target == nil {
panic("failed to get edge") // TODO: return errored pipe
return pipe.NewErroredPipe(req, fmt.Errorf("failed to get edge"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be easier to call pf.NewFuncRequest(func() {return errors.Errorf()}) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I got it right, this would only fail the sender pipe and not the receiver, which we are looking for in NewInputRequest()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Said differently, pf.NewFuncRequest(func() {return errors.Errorf()}) doesn't return an errored pipe.Receiver. If you call .Status().Err on it, you get nil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call .Status().Err on it, you get nil

After you call Receive you still get nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Maybe I'm not getting something right :)

func TestPanic(t *testing.T) {
	s := NewSolver(SolverOpt{
		ResolveOpFunc: testOpResolver,
	})
	defer s.Close()

	e := edge{}
	pf := &pipeFactory{s: s.s, e: &e}
	receivedPipe := pf.NewFuncRequest(func(_ context.Context) (interface{}, error) { return nil, errors.Errorf("failed to get edge") })
	receivedPipe.Receive()
	require.Error(t, receivedPipe.Status().Err)
}

We only need an errored pipe in this one place and therefore I have nothing against dropping the special purpose errored pipe structure I introduce. On the other hand, it's quite simple to understand.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The event is already captured by the solver internally.


func TestPanic(t *testing.T) {
	f := func(_ context.Context) (interface{}, error) {
		return nil, errors.Errorf("failed to get edge")
	}
	p, start := pipe.NewWithFunction(f)
	go start()

	for {
		if !p.Receiver.Receive() {
			time.Sleep(time.Millisecond) // in real code this is achieved with signal
			continue
		}
		require.Error(t, p.Receiver.Status().Err)
		break
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example you can replace receivedPipe.Receive() with time.Sleep(time.Second). Function runs async (and Receive() does not block). You can't get the actual function completion time or wait for Receive() as these are already captured internally in schedulers loop thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation !

@jamesalucas
Copy link

Would either of you be able to review my comments in #2303 to help me pinpoint the root cause of this error when it occurs in our CI env? Many thanks!

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DCO failing + please squash commits

solver/scheduler.go Outdated Show resolved Hide resolved
Signed-off-by: Maxime Lagresle <maxime@angel.co>
@tonistiigi tonistiigi merged commit b8e4ed1 into moby:master Sep 30, 2021
@jamesalucas
Copy link

Would this change work if patched to a slightly older version of buildkit? I have patched it in to the version I'm running in our CI and now get the following panic:

panic: interface conversion: interface {} is nil, not *solver.edgeState

goroutine 14 [running]:
github.com/moby/buildkit/solver.(*edge).processUpdate(0xc000392b40, 0x180c4d0, 0xc00220ab40, 0x451025)
        /src/solver/edge.go:461 +0x21ac
github.com/moby/buildkit/solver.(*edge).unpark(0xc000392b40, 0xc006413d10, 0x1, 0x1, 0xc006413d50, 0x1, 0x1, 0xc006413d20, 0x1, 0x1, ...)
        /src/solver/edge.go:326 +0x7d
github.com/moby/buildkit/solver.(*scheduler).dispatch(0xc00033d810, 0xc000392b40)
        /src/solver/scheduler.go:136 +0x425
github.com/moby/buildkit/solver.(*scheduler).loop(0xc00033d810)
        /src/solver/scheduler.go:104 +0x168
created by github.com/moby/buildkit/solver.newScheduler
        /src/solver/scheduler.go:35 +0x1b3
Error: buildkit process has exited

@maxlaverse maxlaverse deleted the panic_failed_to_get_edge branch October 1, 2021 08:25
@maxlaverse
Copy link
Contributor Author

@jamesalucas can you try this instead ? Maybe the handling around the pipe was different before:, or you're hitting another bug ? ae8af5c

I'll keep my eyes on crashes and the stack trace you posted.

@tonistiigi
Copy link
Member

@maxlaverse looks like this needs a follow-up fix. In https://github.com/moby/buildkit/blob/master/solver/edge.go#L461 the value needs to be edgeState. Returning &req.edgeState might do it or changing the processing to avoid converting interface after error.

Not quite sure why the error says "interface {} is nil," though. You might want to test this by adding a code that manually produces getedge() error x% of the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants