
container: add a timeout for deletion #47713

Open · wants to merge 1 commit into base: master

Conversation

@tych0 (Contributor) commented Apr 12, 2024

On a heavily loaded host, we were experiencing long container(d) deletion times:

containerd[8907]: time="2024-03-25T13:47:45.938479195Z" level=debug msg="event forwarded" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
# our control plane logic deletes the successfully exited container via
# the docker API, and...
containerd[8907]: time="2024-03-25T13:47:47.202055216Z" level=debug msg="failed to delete task" error="context deadline exceeded" id=a618057629b35e3bfea82d5ce4cbb057ba979498496428dfe6935a1322b94add

Before 4bafaa0 ("Refactor libcontainerd to minimize c8d RPCs"), when this happened, the docker API reported a 255 exit code and no error:

0a7ddd027c0497d5a titus-executor-[900884]: Processing msg from a docker: main container exited with code 255

which is especially confusing. After 4bafaa0, the behavior changed to report the container's real exit code, although there is still a hard-coded timeout after which containerd will (try to) stop cleaning up. We would like to wait for this cleanup, so let's add a user-configurable DeleteTimeout here.

Reported-by: Hechao Li <hli@netflix.com>
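
For illustration only, here is a minimal sketch of how a per-container delete timeout (in seconds, matching the proposed DeleteTimeout field) could bound the wait on the containerd task delete. The helper, default value, and callback below are hypothetical, not the actual change in this PR:

    // Hypothetical sketch only: deleteTimeout is the per-container setting in
    // seconds, and del stands in for the containerd task Delete RPC.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    const defaultDeleteTimeout = 60 * time.Second // assumed daemon-wide default

    func deleteTask(ctx context.Context, deleteTimeout *int, del func(context.Context) error) error {
        timeout := defaultDeleteTimeout
        if deleteTimeout != nil {
            timeout = time.Duration(*deleteTimeout) * time.Second
        }
        ctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        return del(ctx)
    }

    func main() {
        t := 300 // e.g. a 300-second DeleteTimeout for a heavily loaded host
        err := deleteTask(context.Background(), &t, func(ctx context.Context) error {
            select {
            case <-time.After(100 * time.Millisecond): // simulate a slow delete
                return nil
            case <-ctx.Done():
                return ctx.Err()
            }
        })
        fmt.Println("delete result:", err)
    }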

@tych0 (Contributor, Author) commented Apr 12, 2024

Hmm, I'm not sure I understand the failure here:

=== Failed
=== FAIL: amd64.integration-cli TestDockerAPISuite/TestExecStateCleanup (10.43s)
    docker_api_exec_test.go:225: assertion failed: expected an error, got nil
    --- FAIL: TestDockerAPISuite/TestExecStateCleanup (10.43s)

It looks like the test submits and kills an invalid exec request; I don't think this code should be invoked on that path?

@thaJeztah (Member)

Commenting from my phone, but 👋 👋👋 good seeing you! Hope you're doing well ☺️

@tych0 (Contributor, Author) commented Apr 12, 2024

Thanks, I am! This PR is not urgent, but it would be nice to understand the failures here.

Review threads: daemon/monitor.go (outdated, resolved); api/types/container/config.go (resolved)
@vvoland added labels: area/runtime, kind/feature, status/2-code-review (Apr 17, 2024)

@vvoland (Contributor) left a comment

LGTM; would also appreciate input from other maintainers, though.

@cpuguy83 (Member)

This doesn't seem like the right place to me.
Deletion issues are not really part of the makeup of a container.
The terminology is also somewhat ambiguous at the API level: DeleteTimeout suggests deleting the container, but this is not about deleting the container, just the underlying containerd task.

You mention you want to wait for the delete: what exactly do you mean by that, and how are you approaching it today?

Labels map[string]string // List of labels set to this container
StopSignal string `json:",omitempty"` // Signal to stop a container
StopTimeout *int `json:",omitempty"` // Timeout (in seconds) to stop a container
DeleteTimeout *int `json:",omitempty"` // Timeout (in seconds) to wait for a container to be deleted

A Member left a review comment on the api/types/container/config.go snippet above:

I need to have a closer look at the full proposal (sorry; really behind on my backlog); mostly I was curious whether this is persisted to disk, whether it needs to be, or whether it only needs to be passed when performing a delete request.

For StopTimeout and StopSignal, the time-out is related to what's running in the container (e.g. a database that you want to give time to shut down gracefully).

For DeleteTimeout, I wonder whether it's a property of the container or not.

@cpuguy83 (Member)

Discussed this with other maintainers yesterday, looking at where the existing behavior can be problematic:

Right now we have a timeout on containerd task deletion, which on a loaded node could be hit.
That's OK, but only because when the same container is restarted, the task is deleted before it is started again; it must be, otherwise we can't start the container. I think that delete is even done without a timeout.

The real problem is that if the timeout is hit and you then remove the (dockerd) container, the task is leaked and we effectively have a zombie task, shim process, and containerd container.

Some observations in the existing code:

  • We don't actually need to collect the exit code from the task delete; we get that from the containerd event.
  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.
  • Container removal (in dockerd) should block until the task is deleted (which, as mentioned before, is not happening today).
  • Having a timeout on task delete seems (mostly) counter-productive, since the task must always be deleted eventually, especially when the system is under high load (see the sketch below).
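
A rough sketch of that last observation, using a placeholder task interface rather than the real moby or containerd APIs: delete the task with no deadline and retry until it succeeds, so nothing is leaked even when the node is under load:

    // Placeholder types for illustration; not the real containerd client API.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    type task interface {
        // Delete reaps the task and returns its exit code.
        Delete(ctx context.Context) (uint32, error)
    }

    // deleteUntilDone deletes the task with no deadline: on a loaded node this
    // may take a while, but giving up would leak the task and its shim.
    func deleteUntilDone(t task) (uint32, error) {
        for {
            code, err := t.Delete(context.Background())
            if err == nil {
                return code, nil
            }
            time.Sleep(time.Second) // back off and retry instead of timing out
        }
    }

    // slowTask simulates a deletion that only succeeds on the second attempt.
    type slowTask struct{ attempts int }

    func (s *slowTask) Delete(ctx context.Context) (uint32, error) {
        s.attempts++
        if s.attempts < 2 {
            return 0, errors.New("transient deletion failure")
        }
        return 0, nil
    }

    func main() {
        code, err := deleteUntilDone(&slowTask{})
        fmt.Println("exit code:", code, "err:", err)
    }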

@corhere (Contributor) commented Apr 26, 2024

Do you want the daemon to wait indefinitely until the task has been deleted so as not to leak c8d tasks, or do you want the delete API call to block until the task has been deleted so that your orchestrator knows when cleanup is complete? If it's the latter, this PR is insufficient, as there are timeouts on the waits for the container to leave the Running state in the API handler code paths. I'm going to assume you want the former. (Note that the daemon only transitions a container from the Running state to Stopped after the task has been deleted, or after the delete times out.)

Leaking c8d tasks on heavily loaded systems (or with runtimes that are slow to delete tasks) is bad for everyone. Ideally I would want a solution that does not require any configuration, by actually waiting indefinitely. The daemon has a separate event queue for each (c8d) container, so blocking indefinitely while handling one event will not block events for other containers from being handled.

The trouble is, the Docker container's state is only transitioned from Running to Stopped after the delete operation returns, and until then the container continues to be reported as Running in the Engine API. And a container can't be restarted until after the old c8d task has been deleted, so we can't just transition the container to Stopped immediately upon receiving the c8d TaskExit event, before deleting the task.

I think that for waiting indefinitely to be workable in practice, the Docker container state machine will need a new "Exited" (Stopping?) state for the time when the task has exited but has not yet been deleted. Exited containers could then be reported as such when inspected. More importantly, we could allow users to leak the c8d tasks on demand (releasing the Docker container resources) by making that the behaviour when force-deleting a container in the Exited state. Maybe it would also be useful for a non-force delete API request to block until the container has transitioned out of the Exited state?
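
A toy sketch of the state progression described above, with the hypothetical intermediate state spelled out; this is not moby's actual container state machine:

    // Hypothetical states for illustration only.
    package main

    import "fmt"

    type containerState int

    const (
        Running containerState = iota
        Exited  // task has exited but has not yet been deleted in containerd
        Stopped // task deleted; the container can now be restarted or removed
    )

    func (s containerState) String() string {
        return [...]string{"Running", "Exited", "Stopped"}[s]
    }

    func main() {
        for _, s := range []containerState{Running, Exited, Stopped} {
            fmt.Println(s)
        }
    }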

@corhere (Contributor) commented Apr 26, 2024

  • We also should not be processing that delete in the event loop, as this blocks other events from being processed.

@cpuguy83 I checked into that: only other events for the same container ID are blocked. Each container ID gets its own separate event queue inside libcontainerd, and each event callback is invoked from a new goroutine. The existing implementation, where the event handler callback blocks until handleContainerExit returns, is correct.
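
A toy sketch of that per-container queuing behaviour: a slow handler for one container does not delay events for another. The dispatcher below is illustrative only and differs from libcontainerd's real implementation (which invokes each callback from a new goroutine):

    // Illustrative only; not libcontainerd's actual event dispatch code.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type event struct {
        containerID string
        topic       string
    }

    type dispatcher struct {
        mu     sync.Mutex
        queues map[string]chan event
    }

    // dispatch routes an event to a per-container queue, creating the queue and
    // its worker on first use, so a slow handler for container A never delays
    // events for container B.
    func (d *dispatcher) dispatch(ev event, handle func(event)) {
        d.mu.Lock()
        q, ok := d.queues[ev.containerID]
        if !ok {
            q = make(chan event, 16)
            d.queues[ev.containerID] = q
            go func() {
                for e := range q {
                    handle(e) // may block, but only this container's queue waits
                }
            }()
        }
        d.mu.Unlock()
        q <- ev
    }

    func main() {
        d := &dispatcher{queues: make(map[string]chan event)}
        handle := func(e event) {
            if e.containerID == "slow" {
                time.Sleep(200 * time.Millisecond) // e.g. a long task delete
            }
            fmt.Println("handled", e.containerID, e.topic)
        }
        d.dispatch(event{"slow", "/tasks/exit"}, handle)
        d.dispatch(event{"fast", "/tasks/exit"}, handle)
        time.Sleep(500 * time.Millisecond) // give the demo workers time to run
    }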
