Handle exec being called before a container is initialized correctly. #379
Conversation
Force-pushed from bc54110 to 36938ee
Force-pushed from 8f8fa52 to 03142f7
go func() {
    c.ownerPod.podStatus.AddExec(c.Id, processId, "", spec.Terminal)
    err := c.ownerPod.vm.AddProcess(c.Id, processId, spec.Terminal, spec.Args, spec.Env, spec.Cwd, p.stdio)
vm.AddProcess will block until the process is finished, so this change will block on AddProcess. And it seems we double-call WaitForFinish, which is unexpected.
IMO, EventProcessStart & EventExit are already in the right order & place. Even if vm.AddProcess fails, we will still send out EventExit.
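For reference, here is a minimal, self-contained sketch of the ordering being described. The event names, the addProcess stand-in, and the channel plumbing are illustrative only, not runv's actual API; the point is just that the start event precedes the blocking call and the exit event is always sent, even on failure.

```go
package main

import (
	"errors"
	"fmt"
)

// addProcess stands in for vm.AddProcess: it only returns once the process
// has exited (or has failed to start).
func addProcess() error {
	return errors.New("failed to start process in the VM")
}

func main() {
	events := make(chan string, 2)
	done := make(chan struct{})

	go func() {
		defer close(done)
		// Announce the process before the blocking call...
		events <- "EventProcessStart"
		err := addProcess()
		// ...and always announce the exit afterwards, even on error,
		// so nothing upstream waits forever on a process that never ran.
		events <- "EventExit"
		if err != nil {
			fmt.Println("AddProcess error:", err)
		}
	}()

	fmt.Println(<-events)
	fmt.Println(<-events)
	<-done
}
```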
This was motivated by it being unclear what containerd higher up expects if AddProcess fails. If we fail to ever start the process, then a later attempt at recovering containers isn't going to find it, since it was probably a runv/hyperstart error. So better not to notify that it was added until we're sure it was.
Unfortunately, vm.AddProcess only returns after the process has exited in the current code, so the EventProcessStart in this PR is sent out too late.
I'll see if I can concoct a test to catch this situation so I can avoid it (my testing was based on the noted ticket, which of course exits quite quickly). Once #385 is merged, that will bring AddContainer and AddProcess in line with each other in behavior, which should make the fix here more obvious.
supervisor/container.go
Outdated
func (c *Container) run(p *Process) error {
    // Receives the early result of the attempt to start the container,
    // which ensures that we do not return before EventContainerStart
    // is emitted.
Yes, you are right, this part looks good to me. One tip: we can move c.start out of the goroutine and leave c.wait inside the goroutine to get rid of the channel.
The bad news is that vm.NewContainer in c.start doesn't guarantee the container is started in the VM; it just passes out the new-container message, so we still need another patch to fix this.
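A rough, self-contained sketch of that tip follows. Only the start/wait/run names echo the discussion above; the container type and everything else are stand-ins, not the real supervisor code.

```go
package main

import (
	"fmt"
	"time"
)

type container struct{ id string }

// start stands in for the synchronous part: it returns once the create
// request has been issued (or has failed), so no extra channel is needed
// to hand the early result back to run's caller.
func (c *container) start() error {
	fmt.Println("created container", c.id)
	return nil
}

// wait stands in for the long-running part that only returns once the
// container exits; it is the only piece that needs its own goroutine.
func (c *container) wait() {
	time.Sleep(10 * time.Millisecond) // pretend the container runs for a while
	fmt.Println("container exited:", c.id)
}

func (c *container) run() error {
	// Call start synchronously so the caller sees failures immediately...
	if err := c.start(); err != nil {
		return err
	}
	// ...and push only the blocking wait into a goroutine.
	go c.wait()
	return nil
}

func main() {
	c := &container{id: "demo"}
	if err := c.run(); err != nil {
		fmt.Println("run failed:", err)
	}
	time.Sleep(50 * time.Millisecond) // let the demo's wait goroutine finish
}
```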
I've noticed that - this fix is still racy in that regard, but it seems to mostly work. In general the problem seems to be that there's a missing "container started" message that hyperstart needs to return.
We have the "container started" message in hyperstart as the "ack to new container" message, but runv doesn't wait for it.
Looks like you fixed that in #385 :)
Force-pushed from 03142f7 to ed42f25
c.run(p) is called under the supervisor RWMutex; I would mind if it is blocked for too long, otherwise I would not have used a goroutine to create the container. Is there any other solution?
Seems unavoidable to me - we can't return from the supervisor until we know the process has actually launched. Under all normal circumstances this is quick - and certainly much better behavior than now, where you block indefinitely if it fails.
You can add a 'started chan error' argument to hp.createContainer() and c.run() and receive from the chan after the lock is unlocked. WDYT?
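A hedged sketch of that idea, assuming the handler currently holds the supervisor lock while creating the container: only the 'started chan error' parameter comes from the suggestion above; the supervisor type and function names are illustrative stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

type supervisor struct{ mu sync.RWMutex }

// createContainer stands in for hp.createContainer()/c.run(): it kicks off
// the container start in the background and reports the early result on
// started, instead of blocking a caller that still holds the supervisor lock.
func (s *supervisor) createContainer(id string, started chan<- error) {
	go func() {
		// ... issue the create/start request to the VM here ...
		started <- nil // or the start error
		// ... then keep waiting for the container to exit ...
	}()
}

func (s *supervisor) handleCreate(id string) error {
	started := make(chan error, 1)

	s.mu.Lock()
	s.createContainer(id, started)
	s.mu.Unlock()

	// Only receive the start result after the lock has been released, so
	// other requests are not serialized behind a slow VM start.
	return <-started
}

func main() {
	s := &supervisor{}
	fmt.Println("create:", s.handleCreate("demo"))
}
```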
I looked into doing this, and it feels like a very leaky abstraction at the Supervisor level. I'm going to look into splitting the global supervisor lock into a finer-grained pod/container lock so concurrent requests can start containers and add processes to them.
@wrouesnel The runv CLI is being improved continuously and has just been refactored in #537. As a result, it seems the changes in this PR are outdated. Closing this PR. If the problem mentioned still exists, a new PR could be created for it. Thank you for your contribution.
Running a command like the following:
would block indefinitely because we would return success to docker, but fail because the exec is attempted before the container is started up.
This leads to the docker daemon blocking indefinitely, waiting on an exec command that containerd reported as successful.
This patch solves the problem by blocking the containerd responses until the VM startup has returned and moved to waiting on the process (or else returning an error if it fails at this stage). This resolves the deadlock in docker and allows the above command to succeed.
Closes #279