
Fix runc kill and runc delete for containers with no init and no private PID namespace #4102

Merged (8 commits, Nov 28, 2023)

Conversation

@kolyshkin (Contributor) commented Oct 31, 2023:

This fixes a couple of regressions introduced in #3825, mostly related to
killing and destroying containers that do not have their own private PID
namespace and whose initial container process is gone. In this scenario, the
container is considered stopped, but there may be some leftover processes in
the container's cgroup.

  1. Commit f8ad20f made it impossible to kill those leftover processes.
    Fix this by moving the check for whether container init exists to after
    the special case of handling a container without its own PID namespace.

  2. runc delete -f was not killing leftover processes either; it ignored them.

  3. runc delete in runc 1.1 and earlier used to kill leftover processes.
    I think this is a bug, as delete (without -f) is not supposed to kill
    anything. With this PR, an error is reported instead.

  4. Fix the minor issue introduced by commit 9583b3d:
    if signalAllProcesses is used, there is no need to thaw the
    container (as freeze/thaw is either done in signalAllProcesses already,
    or not needed at all).
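The reordering in item 1 can be sketched as follows. This is a minimal illustration only; `container`, `kill`, and the fields below are hypothetical stand-ins, not runc's real libcontainer API:

```go
package main

import "fmt"

// container is a toy stand-in for libcontainer's container state; all
// names here are hypothetical, not runc's real API.
type container struct {
	hasInit      bool  // is the container's init process still alive?
	privatePidns bool  // does the container own its PID namespace?
	cgroupPids   []int // processes currently in the container's cgroup
}

// kill mirrors the fixed ordering from item 1: handle the shared PID
// namespace case (signal everything in the cgroup) BEFORE bailing out
// because init is gone. The pre-fix code checked hasInit first, which
// made leftover processes unkillable.
func (c *container) kill() error {
	if !c.privatePidns {
		// Shared PID namespace: signal every process in the cgroup,
		// whether or not init is still around. Here we just pretend
		// SIGKILL was delivered to all of them.
		c.cgroupPids = nil
		return nil
	}
	if !c.hasInit {
		return fmt.Errorf("container not running")
	}
	// Private PID namespace with a live init: killing init is enough,
	// the kernel takes down the rest of the namespace.
	c.hasInit = false
	return nil
}

func main() {
	// Stopped container (init gone) in a shared PID namespace, with
	// leftover processes: the fixed ordering can still kill them.
	c := &container{privatePidns: false, cgroupPids: []int{101, 102}}
	fmt.Println(c.kill(), len(c.cgroupPids)) // <nil> 0
}
```
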

Also, a few fixes that are not regression fixes:

  1. make signalAllProcesses return an error early if the container
    cgroup does not exist (as it relies on it to do its job). This way, the
    error message returned is more generic and easier to understand
    ("container not running" instead of "can't open file").

  2. Make runc delete return a non-zero exit code if the container
    is not removed.

  3. runc delete: do not remove the container state if the container's cgroup
    can't be removed (as otherwise we'd lose the container's cgroup information
    forever). Suggested by @lifubang.

Appropriate test cases were added to avoid future regressions.

Fixes #4047.
Fixes #4040.

@kolyshkin (Contributor, Author):

@lifubang this is an alternative to #4048, PTAL 🙏🏻

@kolyshkin kolyshkin force-pushed the fix-kill branch 3 times, most recently from 6bc2a31 to ec62f74 Compare November 3, 2023 23:39
@kolyshkin kolyshkin marked this pull request as ready for review November 3, 2023 23:39
@kolyshkin (Contributor, Author):

@lifubang PTAL

@kolyshkin (Contributor, Author):

Too early; need to implement support for runc delete -f first (and the test case).

@kolyshkin kolyshkin marked this pull request as draft November 4, 2023 00:19
@kolyshkin kolyshkin force-pushed the fix-kill branch 3 times, most recently from 70a96a0 to 5fa0188 Compare November 4, 2023 01:46

```bash
# Start a few more processes.
for _ in 1 2 3 4 5; do
	__runc exec -d test_busybox sleep 1h
	[ "$status" -eq 0 ]
done
```
@kolyshkin (Contributor, Author):

There was a bug in the original test case code: since we're not using bats' `run`, there is no `$status` to check (in effect, this checks the status from the last `run` call, which is the `runc run ...` above).

@kolyshkin (Contributor, Author):

If needed, I can separate this (and other fixes to the test case) into a separate commit, for clarity.

Member:

I think it's okay to keep it in the same commit. It's a no-op change anyway (the deleted check could never fail, for the reason you described).

Comment on lines +46 to +47

```bash
# Can't mount real /proc when rootless + no pidns,
# so change it to a bind-mounted one from the host.
```

@kolyshkin (Contributor, Author):

Added this comment to explain why it is needed, because I had forgotten.

@kolyshkin kolyshkin changed the title Fix runc kill for containers with no init and no private PID namespace Fix runc kill and runc delete for containers with no init and no private PID namespace Nov 4, 2023
@kolyshkin kolyshkin marked this pull request as ready for review November 4, 2023 01:57
@cyphar (Member) left a comment:

I think we should drop the last patch. Otherwise, seems okay.

Comment on lines 91 to 99

```go
// Normally, when a container init is gone, all other processes in its
// cgroup are killed by the kernel. This is not the case for a shared
// PID namespace container, which may have some leftover processes.
if !b.c.config.Namespaces.IsPrivate(configs.NEWPID) {
	// Get container pids on a best-effort basis.
	if pids, _ := b.c.cgroupManager.GetAllPids(); len(pids) > 0 {
		return fmt.Errorf("container is in stopped state (init is gone), but its cgroup still has %d processes; use runc delete -f or runc kill", len(pids))
	}
}
```
@cyphar (Member):

So, I understand the reasoning behind this -- by some definition the container is still "alive", because pid1 is not special for non-pidns containers. And you cannot runc delete a container that is still alive; you need to runc kill it or use runc delete -f. So there is an argument that this is more consistent.

However, runc considers the container to be stopped because we bind the lifecycle of the container to pid1 (as evidenced by runc state and the fact that this code was added for stoppedState). From that perspective, it's more consistent that any container which is "stopped" (according to runc state) should be deletable with runc delete (no -f) except in weird circumstances.

IMHO, the latter is more consistent, primarily because (in my mind) killing the init process of a container means you've conceptually killed the container by definition (regardless of whether you have a separate pidns), and thus any remaining processes in the container are leftover garbage that runc delete should be happy to remove. This also has the benefit of not affecting compatibility with pre-1.2 runc. I agree the compatibility concern isn't enormous, given what an edge case host-pidns containers are, but the difference between this and the shared cgroup stuff is that the shared cgroup stuff was categorically broken and thus unusable -- this is a straightforward change of behaviour, not the removal of a broken mis-feature.

I think we should drop this part.

Member:

You mean that we should drop this part, and change runc delete (without -f) to kill the leftover processes in the stopped container?

@cyphar (Member):

Yes, I think runc delete on a container with a dead init process should work without -f because the container is considered "stopped".

Member:

I can accept changing it back to the same as before.

But there is another issue, discussed in #4040 (comment) (I can accept fixing it in another PR; this PR is only meant to fix the regression introduced by #3825). Maybe we have to retry, and still check whether there are some processes left in the container, even after we have tried our best to kill them. In that case, we should error out and not delete anything about this container.
Please see: 120dad4

@cyphar (Member) commented Nov 8, 2023:

Ah, I see. I wonder if @kolyshkin's suggestion in #4040 (comment) is workable. @kolyshkin, is there a reason you didn't go the route of turning !c.hasInit() into !c.hasProcesses(), so that the container is still in the running state despite not having a pid1?

My issue with returning an error here is that it makes stopped containers act strangely in this one scenario. If we treated containers that contain any processes at all as running, then we could require people to use runc kill, which would solve the consistency issue.

I would prefer that over killing the processes in runc delete (assuming no other issues come up with making a container without the init process act as a "running" container -- I suspect we have a fair amount of code that makes assumptions about that behaviour that aren't ideal), but I would prefer either option over returning an error here.

@kolyshkin (Contributor, Author):

I will change the last commit to kill the remaining processes rather than error out.

@kolyshkin (Contributor, Author):

Done.

@kolyshkin kolyshkin force-pushed the fix-kill branch 2 times, most recently from 0c14686 to 7e571fd Compare November 10, 2023 00:38
@kolyshkin (Contributor, Author):

@lifubang I've changed the cgroup removal logic, hopefully fixing the related issues. If you have your own repro, can you try it on this branch?

```go
// to kill those processes here.
if !c.config.Namespaces.IsPrivate(configs.NEWPID) {
	_ = signalAllProcesses(c.cgroupManager, unix.SIGKILL)
}
err := c.cgroupManager.Destroy()
```
Member:

If we say we should not ignore the error from Destroy, it means we have an opportunity to destroy it again once we get an error. But if we ignore this cgroup destroy error, the state directory is removed in the next several lines, so the container disappears from runc, while some cgroup paths or some of the container's processes are left on the host, and we can't delete them from runc anymore.

@kolyshkin (Contributor, Author):

You are right, but I did not modify this code here. I surely can.

Are you suggesting something like this?

```go
if err := c.cgroupManager.Destroy(); err != nil {
	return fmt.Errorf("can't remove container cgroup: %w", err)
}
```

@kolyshkin (Contributor, Author):

@lifubang added, PTAL

Member:

> Are you suggesting something like this?

Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find the reason, if there is one.

@kolyshkin (Contributor, Author):

> Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find the reason, if there is one.

Apparently you don't read commit messages 😄 I already looked into it and wrote about it.

@lifubang (Member):

@Burning1020, PTAL -- is there a chance you could check whether this PR fixes your issue #4040?

@Burning1020 (Contributor):

> @Burning1020 , PTAL, is there a chance to check this PR has fixed your issue #4040 or not?

OK. Will check, thanks!

@lifubang (Member):

> OK. Will check, thanks!

I think #4047 has been fixed by this PR, but I can't confirm that #4040 is fixed too. 🙏🏻 @Burning1020

@Burning1020 (Contributor) left a comment:

LGTM. #4040 is fixed, thanks.

@lifubang (Member) left a comment:

LGTM
@cyphar @AkihiroSuda PTAL

kolyshkin and others added 8 commits November 27, 2023 09:15
The semantics of runType are slightly complicated, and the only place
where we need to distinguish between Created and Running is
refreshState.

Replace runType with the simpler hasInit, simplifying its users (except
refreshState, which now figures out on its own whether the container is
Created or Running).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
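The idea behind this simplification can be sketched as follows. This is purely illustrative (a hypothetical `status` helper, not runc's real code): with hasInit and isPaused, most callers can derive the state they care about directly, while only refreshState still needs to tell Created from Running:

```go
package main

import "fmt"

// status derives the coarse container state from two booleans, roughly
// what the simplified callers can now do instead of going through
// runType. Purely illustrative: real runc also distinguishes Created
// from Running (that part remains in refreshState).
func status(hasInit, isPaused bool) string {
	switch {
	case !hasInit:
		return "stopped"
	case isPaused:
		return "paused"
	default:
		return "running"
	}
}

func main() {
	fmt.Println(status(true, false))  // running
	fmt.Println(status(true, true))   // paused
	fmt.Println(status(false, false)) // stopped
}
```
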
Let's use c.hasInit and c.isPaused where needed instead of
c.currentStatus, for simplicity.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f made it impossible to kill leftover processes in a
stopped container that does not have its own PID namespace. In other
words, if a container init is gone, it is no longer possible to use
`runc kill` to kill the leftover processes.

Fix this by moving the check for whether container init exists to after
the special case of handling a container without its own PID namespace.

While at it, fix the minor issue introduced by commit 9583b3d:
if signalAllProcesses is used, there is no need to thaw the
container (as freeze/thaw is either done in signalAllProcesses already,
or not needed at all).

Also, make signalAllProcesses return an error early if the container
cgroup does not exist (as it relies on it to do its job). This way, the
error message returned is more generic and easier to understand
("container not running" instead of "can't open file").

Finally, add a test case.

Fixes: f8ad20f
Fixes: 9583b3d
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f moved the kill logic from container destroy to container
kill (which is the right thing to do).

Alas, it broke the use case of doing "runc delete -f" for a container
which does not have its own private PID namespace, when its init process
is gone. In this case, some processes may still be running, and runc
delete -f should kill them (the same way as "runc kill" does).

It does not do that because the container status is "stopped" (as runc
considers the container with no init process as stopped), and so we only
call "destroy" (which was doing the killing before).

The fix is easy: if --force is set, call killContainer no matter what.

Add a test case, similar to the one in the previous commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
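The fixed dispatch can be sketched like this. All names here are hypothetical stand-ins (the real code lives in delete.go and calls killContainer); the point is only the ordering: with --force, kill first regardless of the reported status:

```go
package main

import (
	"fmt"
	"strings"
)

// deleteActions sketches the fixed "runc delete" dispatch described
// above: with --force, kill first no matter what the reported status
// is, because a "stopped" container without its own PID namespace may
// still have processes in its cgroup.
func deleteActions(status string, force bool) []string {
	if force {
		// Kill unconditionally; for a private-pidns container this is
		// a no-op if everything is already dead.
		return []string{"kill", "destroy"}
	}
	if status == "stopped" {
		return []string{"destroy"}
	}
	return []string{"error: cannot delete a running container"}
}

func main() {
	// A "stopped" container deleted with -f still gets the kill pass.
	fmt.Println(strings.Join(deleteActions("stopped", true), ",")) // kill,destroy
	fmt.Println(strings.Join(deleteActions("running", false), ","))
}
```
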
The current code is only doing retries in RemovePaths, which is only
used for cgroup v1 (cgroup v2 uses RemovePath, which makes no retries).

Let's remove all retry logic and logging from RemovePaths, together
with:

 - os.Stat check from RemovePaths (its usage probably made sense before
   commit 19be8e5 but not after);

 - error/warning logging from RemovePaths (this was added by commit
   19be8e5 in 2020 and so far we've seen no errors other
   than EBUSY, so reporting the actual error proved to be useless).

Add the retry logic to rmdir, together with a second bool argument,
retry. Decrease the initial delay and increase the number of retries
relative to the old implementation, so it can take up to ~1 sec before
returning EBUSY (it was about 0.3 sec).

Hopefully, as a result, we'll see fewer "failed to remove cgroup paths"
errors.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If container.Destroy() has failed, runc delete still returns 0, which is
wrong and can result in other issues down the line.

Let's always return the error from Destroy in runc delete.

For runc checkpoint and runc run, we still treat it as a warning.

Co-authored-by: Zhang Tianyang <burning9699@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(For a container with no private PID namespace, that is).

When runc delete (or container.Destroy) is called on a stopped
container without private PID namespace and there are processes
in its cgroup, kill those.

Add a test case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, the container destroy operation removes the container's
state directory even if cgroup removal fails (and then still returns an
error). It has been that way since commit 5c246d0, which added
cgroup removal.

This is problematic because once the container state dir is removed, we
no longer know the container's cgroup and thus can't remove it.

Let's return the error early and fail if cgroup can't be removed.

Same for other operations: do not proceed if we fail.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@cyphar (Member) left a comment:

LGTM, thanks for wrangling this one. On reflection, I agree that continuing to treat containers without their init process as "stopped" is the simpler solution.

4 participants