Skip to content
This repository was archived by the owner on Feb 8, 2021. It is now read-only.

Conversation

@wrouesnel
Copy link
Contributor

@wrouesnel wrouesnel commented Oct 26, 2016

Running a command like the following:

cid=$(./docker run -tid busybox sh) && \
    echo $cid && ./docker exec -ti $cid echo hello

would block indefinitely because we would return success to docker, but fail because the exec is attempted before the container is started up.

This leads to the docker daemon blocking indefinitely waiting for an exec command it believes was successful via containerd.

This patch solves the problem by blocking the containerd responses until the VM startup has returned and moved to process wait (or else returning an error if it fails at this stage). This resolves deadlocks into docker and allows the above command to succeed.

Closes #279

@wrouesnel wrouesnel force-pushed the make-rapid-exec-sane branch from bc54110 to 36938ee Compare October 27, 2016 13:51
@wrouesnel wrouesnel mentioned this pull request Oct 27, 2016
@wrouesnel wrouesnel force-pushed the make-rapid-exec-sane branch 3 times, most recently from 8f8fa52 to 03142f7 Compare October 27, 2016 14:52

go func() {
c.ownerPod.podStatus.AddExec(c.Id, processId, "", spec.Terminal)
err := c.ownerPod.vm.AddProcess(c.Id, processId, spec.Terminal, spec.Args, spec.Env, spec.Cwd, p.stdio)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vm.AddProcess will block until process is finished, this change will block AddProcess. And seems like we double call the WaitForFinish, this is unexpected.

IMO, EventProcessStart & EventExit seems in the right order & place. even vm.AddProcess failed, we will send out EventExit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was motivated by it being unclear what the expectation higher up containerd is if AddProcess fails. If we fail to ever start the process, then an attempt at recovering containers later isn't going to find it since it was probably a runv/hyperstart error. So better not to notify it was added until we're sure it was.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately vm.AddProcess exits after process is exited in current codes. the EventProcessStart in this pr is sent out too late

Copy link
Contributor Author

@wrouesnel wrouesnel Oct 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can concoct a test to catch this situation so I can avoid it (my point testing was based on the noted ticket, which of course exits quite quickly). Once #385 is merged that'll bring AddContainer and AddProcess inline in behavior, which should make the fix here more obvious.

func (c *Container) run(p *Process) error {
// Receives the early result of the attempt to start the container,
// which ensures that we do not return before EventContainerStart
// is emitted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, this part looks good to me, one tips, we can move c.start out of goroutine and leave c.wait inside goroutine to get rid of channel.

bad news is vm.NewContainer in c.start doesn't guarantee container is started in vm, it just passes out the new container message, we still need another patch to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that - this fix is still racey in that regard, but it seems to mostly work. In general the problem seems to be there's a missing "container started" message that hyperstart needs to be returning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we has the "container started" message in hyperstart as "ack to new container" message, but runv doesn't wait for it.

Copy link
Contributor Author

@wrouesnel wrouesnel Oct 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you fix that in #385 :)

@wrouesnel wrouesnel force-pushed the make-rapid-exec-sane branch from 03142f7 to ed42f25 Compare January 6, 2017 03:53
@wrouesnel
Copy link
Contributor Author

This is a far simpler patch that takes advantage of the new hyperstart / runv changes. I've confirmed without it #279 is still an issue, with it, #279 is fixed.

@laijs
Copy link
Contributor

laijs commented Jan 12, 2017

c.run(p) is called in the supervisor RWMutext, I would mind if it is blocked too long, otherwise I would not use go routine to create container. Is there any other solution?

@wrouesnel
Copy link
Contributor Author

Seems unavoidable to me - we can't return from the supervisor until we know the process has actually launched. Under all normal circumstances this is quick - and certainly much better behavior then now where you block indefinitely if it fails.

@laijs
Copy link
Contributor

laijs commented Jan 13, 2017

you can add a 'started chan error' argument to hp.createContainer() and c.run() and "doing receive from the chan" after the lock is unlocked.

WDYT?

@wrouesnel
Copy link
Contributor Author

I looked into doing this, and it feels like a very leaky abstraction at the Supervisor level. I'm going to look into splitting the global supervisor lock into a finer-grained pod/container lock so concurrent requests can start/addProcesses to containers.

@laijs
Copy link
Contributor

laijs commented Aug 3, 2017

@wrouesnel runv cli is being improved continuously and had just been refactored #537 . As a result, it seams that the changes in this pr is outdated.

Close this PR. If the problem mentioned still exists, a new PR could be created for it.

Thank you for your contribution.

@laijs laijs closed this Aug 3, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runv-containerd hangs

3 participants