
container: add a timeout for deletion #47713

Open · wants to merge 1 commit into base: master

Conversation

@tych0 (Contributor) commented Apr 12, 2024

On a heavily loaded host, we were experiencing long container(d) deletion times:

containerd[8907]: time="2024-03-25T13:47:45.938479195Z" level=debug msg="event forwarded" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
# our control plane logic deletes the successfully exited container via
# the docker API, and...
containerd[8907]: time="2024-03-25T13:47:47.202055216Z" level=debug msg="failed to delete task" error="context deadline exceeded" id=a618057629b35e3bfea82d5ce4cbb057ba979498496428dfe6935a1322b94add

Before 4bafaa0 ("Refactor libcontainerd to minimize c8d RPCs"), when this happened, the docker API reported a 255 exit code and no error:

0a7ddd027c0497d5a titus-executor-[900884]: Processing msg from a docker: main container exited with code 255

which is especially confusing. After 4bafaa0, the behavior changed to report the container's real exit code, although there is still a hard-coded timeout after which containerd will (try to) stop cleaning up. We would like to wait for this cleanup, so let's add a user-configurable DeleteTimeout here.

Reported-by: Hechao Li <hli@netflix.com>
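
For illustration only, here is a minimal sketch of how a per-container delete timeout (in seconds, matching the proposed DeleteTimeout field) could bound the wait on the containerd task delete. The helper, default value, and callback below are hypothetical, not the actual change in this PR:

    // Hypothetical sketch only: deleteTimeout is the per-container setting in
    // seconds, and del stands in for the containerd task Delete RPC.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    const defaultDeleteTimeout = 60 * time.Second // assumed daemon-wide default

    func deleteTask(ctx context.Context, deleteTimeout *int, del func(context.Context) error) error {
        timeout := defaultDeleteTimeout
        if deleteTimeout != nil {
            timeout = time.Duration(*deleteTimeout) * time.Second
        }
        ctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        return del(ctx)
    }

    func main() {
        t := 300 // e.g. a 300-second DeleteTimeout for a heavily loaded host
        err := deleteTask(context.Background(), &t, func(ctx context.Context) error {
            select {
            case <-time.After(100 * time.Millisecond): // simulate a slow delete
                return nil
            case <-ctx.Done():
                return ctx.Err()
            }
        })
        fmt.Println("delete result:", err)
    }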

@tych0 (Contributor, Author) commented Apr 12, 2024

Hmm, I'm not sure I understand the failure here:

=== Failed
=== FAIL: amd64.integration-cli TestDockerAPISuite/TestExecStateCleanup (10.43s)
    docker_api_exec_test.go:225: assertion failed: expected an error, got nil
    --- FAIL: TestDockerAPISuite/TestExecStateCleanup (10.43s)

It looks like the test submits and kills an invalid exec request; I don't think this code should be invoked on that path?

@thaJeztah (Member)

Commenting from my phone, but 👋 👋👋 good seeing you! Hope you're doing well ☺️

@tych0 (Contributor, Author) commented Apr 12, 2024

Thanks, I am! This PR is not urgent, but it would be nice to understand the failures here.

Review threads: daemon/monitor.go (outdated, resolved); api/types/container/config.go (resolved)
@vvoland added labels: area/runtime, kind/feature, status/2-code-review (Apr 17, 2024)

@vvoland (Contributor) left a comment

LGTM; would also appreciate input from other maintainers, though.

@cpuguy83 (Member)

This doesn't seem like the right place to me.
Deletion issues are not really part of the makeup of a container.
The terminology is also somewhat ambiguous at the API level: DeleteTimeout suggests deleting the container, but this is not about deleting the container, just the underlying containerd task.

You mention you want to wait for the delete: what exactly do you mean by that, and how are you approaching it today?

Labels map[string]string // List of labels set to this container
StopSignal string `json:",omitempty"` // Signal to stop a container
StopTimeout *int `json:",omitempty"` // Timeout (in seconds) to stop a container
DeleteTimeout *int `json:",omitempty"` // Timeout (in seconds) to wait for a container to be deleted

A Member left a review comment on the api/types/container/config.go snippet above:

I need to have a closer look at the full proposal (sorry; really behind on my backlog); mostly I was curious whether this is persisted to disk, whether it needs to be, or whether it only needs to be passed when performing a delete request.

For StopTimeout and StopSignal, the time-out is related to what's running in the container (e.g. a database that you want to give time to shut down gracefully).

For DeleteTimeout, I wonder whether it's a property of the container or not.

@cpuguy83 (Member)

Discussed this with other maintainers yesterday, looking at where the existing behavior can be problematic:

Right now we have a timeout on containerd task deletion, which on a loaded node could be hit.
That's OK, but only because when the same container is restarted, the task is deleted before it is started again; it must be, otherwise we can't start the container. I think that delete is even done without a timeout.

The real problem is that if the timeout is hit and you then remove the (dockerd) container, the task is leaked and we effectively have a zombie task, shim process, and containerd container.

Some observations in the existing code:

  • We don't actually need to collect the exit code from the task delete; we get that from the containerd event.
  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.
  • Container removal (in dockerd) should block until the task is deleted (which, as mentioned before, is not happening today).
  • Having a timeout on task delete seems (mostly) counter-productive, since the task must always be deleted eventually, especially when the system is under high load (see the sketch below).
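
A rough sketch of that last observation, using a placeholder task interface rather than the real moby or containerd APIs: delete the task with no deadline and retry until it succeeds, so nothing is leaked even when the node is under load:

    // Placeholder types for illustration; not the real containerd client API.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    type task interface {
        // Delete reaps the task and returns its exit code.
        Delete(ctx context.Context) (uint32, error)
    }

    // deleteUntilDone deletes the task with no deadline: on a loaded node this
    // may take a while, but giving up would leak the task and its shim.
    func deleteUntilDone(t task) (uint32, error) {
        for {
            code, err := t.Delete(context.Background())
            if err == nil {
                return code, nil
            }
            time.Sleep(time.Second) // back off and retry instead of timing out
        }
    }

    // slowTask simulates a deletion that only succeeds on the second attempt.
    type slowTask struct{ attempts int }

    func (s *slowTask) Delete(ctx context.Context) (uint32, error) {
        s.attempts++
        if s.attempts < 2 {
            return 0, errors.New("transient deletion failure")
        }
        return 0, nil
    }

    func main() {
        code, err := deleteUntilDone(&slowTask{})
        fmt.Println("exit code:", code, "err:", err)
    }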

@corhere (Contributor) commented Apr 26, 2024

Do you want the daemon to wait indefinitely until the task has been deleted so as not to leak c8d tasks, or do you want the delete API call to block until the task has been deleted so that your orchestrator knows when cleanup is complete? If it's the latter, this PR is insufficient, as there are timeouts on the waits for the container to leave the Running state in the API handler code paths. I'm going to assume you want the former. (Note that the daemon only transitions a container from the Running state to Stopped after the task has been deleted, or after the delete times out.)

Leaking c8d tasks on heavily loaded systems (or with runtimes that are slow to delete tasks) is bad for everyone. Ideally I would want a solution that does not require any configuration, by actually waiting indefinitely. The daemon has a separate event queue for each (c8d) container, so blocking indefinitely while handling one event will not block events for other containers from being handled.

The trouble is, the Docker container's state is only transitioned from Running to Stopped after the delete operation returns, and until then the container continues to be reported as Running in the Engine API. And a container can't be restarted until after the old c8d task has been deleted, so we can't just transition the container to Stopped immediately upon receiving the c8d TaskExit event, before deleting the task.

I think that for waiting indefinitely to be workable in practice, the Docker container state machine will need a new "Exited" (Stopping?) state for the time when the task has exited but has not yet been deleted. Exited containers could then be reported as such when inspected. More importantly, we could allow users to leak the c8d tasks on demand (releasing the Docker container resources) by making that the behaviour when force-deleting a container in the Exited state. Maybe it would also be useful for a non-force delete API request to block until the container has transitioned out of the Exited state?
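
A toy sketch of the state progression described above, with the hypothetical intermediate state spelled out; this is not moby's actual container state machine:

    // Hypothetical states for illustration only.
    package main

    import "fmt"

    type containerState int

    const (
        Running containerState = iota
        Exited  // task has exited but has not yet been deleted in containerd
        Stopped // task deleted; the container can now be restarted or removed
    )

    func (s containerState) String() string {
        return [...]string{"Running", "Exited", "Stopped"}[s]
    }

    func main() {
        for _, s := range []containerState{Running, Exited, Stopped} {
            fmt.Println(s)
        }
    }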

@corhere (Contributor) commented Apr 26, 2024

  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.

@cpuguy83 I checked into that: only other events for the same container ID are blocked. Each container ID gets its own separate event queue inside libcontainerd, and each event callback is invoked from a new goroutine. The existing implementation, where the event handler callback blocks until handleContainerExit returns, is correct.
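
A toy sketch of that per-container queuing behaviour: a slow handler for one container does not delay events for another. The dispatcher below is illustrative only and differs from libcontainerd's real implementation (which invokes each callback from a new goroutine):

    // Illustrative only; not libcontainerd's actual event dispatch code.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type event struct {
        containerID string
        topic       string
    }

    type dispatcher struct {
        mu     sync.Mutex
        queues map[string]chan event
    }

    // dispatch routes an event to a per-container queue, creating the queue and
    // its worker on first use, so a slow handler for container A never delays
    // events for container B.
    func (d *dispatcher) dispatch(ev event, handle func(event)) {
        d.mu.Lock()
        q, ok := d.queues[ev.containerID]
        if !ok {
            q = make(chan event, 16)
            d.queues[ev.containerID] = q
            go func() {
                for e := range q {
                    handle(e) // may block, but only this container's queue waits
                }
            }()
        }
        d.mu.Unlock()
        q <- ev
    }

    func main() {
        d := &dispatcher{queues: make(map[string]chan event)}
        handle := func(e event) {
            if e.containerID == "slow" {
                time.Sleep(200 * time.Millisecond) // e.g. a long task delete
            }
            fmt.Println("handled", e.containerID, e.topic)
        }
        d.dispatch(event{"slow", "/tasks/exit"}, handle)
        d.dispatch(event{"fast", "/tasks/exit"}, handle)
        time.Sleep(500 * time.Millisecond) // give the demo workers time to run
    }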
