
Fix runc kill and runc delete for containers with no init and no private PID namespace #4102

Merged (8 commits, Nov 28, 2023)

Conversation

@kolyshkin (Contributor) commented Oct 31, 2023:

This fixes a couple of regressions introduced in #3825, mostly related to
killing and destroying containers that do not have their own private PID
namespace and whose initial container process is gone. In this scenario, the
container is considered stopped, but there may be some leftover processes in
the container's cgroup.

  1. Commit f8ad20f made it impossible to kill those leftover processes.
    Fix this by moving the check for whether container init exists to after
    the special case of handling a container without its own PID namespace.

  2. runc delete -f was not killing leftover processes either; it ignored them.

  3. runc delete in runc 1.1 and earlier used to kill leftover processes.
    I think this is a bug, as delete (without -f) is not supposed to kill
    anything. With this PR, an error is reported instead.

  4. Fix the minor issue introduced by commit 9583b3d:
    if signalAllProcesses is used, there is no need to thaw the
    container (as freeze/thaw is either done in signalAllProcesses already,
    or not needed at all).
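The reordering in item 1 can be sketched as follows. This is a minimal illustration only; `container`, `kill`, and the fields below are hypothetical stand-ins, not runc's real libcontainer API:

```go
package main

import "fmt"

// container is a toy stand-in for libcontainer's container state; all
// names here are hypothetical, not runc's real API.
type container struct {
	hasInit      bool  // is the container's init process still alive?
	privatePidns bool  // does the container own its PID namespace?
	cgroupPids   []int // processes currently in the container's cgroup
}

// kill mirrors the fixed ordering from item 1: handle the shared PID
// namespace case (signal everything in the cgroup) BEFORE bailing out
// because init is gone. The pre-fix code checked hasInit first, which
// made leftover processes unkillable.
func (c *container) kill() error {
	if !c.privatePidns {
		// Shared PID namespace: signal every process in the cgroup,
		// whether or not init is still around. Here we just pretend
		// SIGKILL was delivered to all of them.
		c.cgroupPids = nil
		return nil
	}
	if !c.hasInit {
		return fmt.Errorf("container not running")
	}
	// Private PID namespace with a live init: killing init is enough,
	// the kernel takes down the rest of the namespace.
	c.hasInit = false
	return nil
}

func main() {
	// Stopped container (init gone) in a shared PID namespace, with
	// leftover processes: the fixed ordering can still kill them.
	c := &container{privatePidns: false, cgroupPids: []int{101, 102}}
	fmt.Println(c.kill(), len(c.cgroupPids)) // <nil> 0
}
```
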

Also, a few fixes that are not regression fixes:

  1. make signalAllProcesses return an error early if the container
    cgroup does not exist (as it relies on it to do its job). This way, the
    error message returned is more generic and easier to understand
    ("container not running" instead of "can't open file").

  2. Make runc delete return a non-zero exit code if the container
    is not removed.

  3. runc delete: do not remove the container state if the container's cgroup
    can't be removed (as otherwise we'd lose the container's cgroup information
    forever). Suggested by @lifubang.

Appropriate test cases were added to avoid future regressions.

Fixes #4047.
Fixes #4040.

@kolyshkin (Contributor, Author):

@lifubang this is an alternative to #4048, PTAL 🙏🏻

@kolyshkin kolyshkin force-pushed the fix-kill branch 3 times, most recently from 6bc2a31 to ec62f74 Compare November 3, 2023 23:39
@kolyshkin kolyshkin marked this pull request as ready for review November 3, 2023 23:39
@kolyshkin (Contributor, Author):

@lifubang PTAL

@kolyshkin (Contributor, Author):

Too early; need to implement support for runc delete -f first (and the test case).

@kolyshkin kolyshkin marked this pull request as draft November 4, 2023 00:19
@kolyshkin kolyshkin force-pushed the fix-kill branch 3 times, most recently from 70a96a0 to 5fa0188 Compare November 4, 2023 01:46

```bash
# Start a few more processes.
for _ in 1 2 3 4 5; do
	__runc exec -d test_busybox sleep 1h
	[ "$status" -eq 0 ]
done
```
@kolyshkin (Contributor, Author):

There was a bug in the original test case code: since we're not using bats' `run`, there is no `$status` to check (in effect, this checks the status from the last `run` call, which is the `runc run ...` above).

@kolyshkin (Contributor, Author):

If needed, I can separate this (and other fixes to the test case) into a separate commit, for clarity.

Member:

I think it's okay to keep it in the same commit. It's a no-op change anyway (the deleted check could never fail, for the reason you described).

Comment on lines +46 to +47

```bash
# Can't mount real /proc when rootless + no pidns,
# so change it to a bind-mounted one from the host.
```

@kolyshkin (Contributor, Author):

Added this comment to explain why it is needed, because I had forgotten.

@kolyshkin kolyshkin changed the title Fix runc kill for containers with no init and no private PID namespace Fix runc kill and runc delete for containers with no init and no private PID namespace Nov 4, 2023
@kolyshkin kolyshkin marked this pull request as ready for review November 4, 2023 01:57
@cyphar (Member) left a comment:

I think we should drop the last patch. Otherwise, seems okay.

Comment on lines 91 to 99

```go
// Normally, when a container init is gone, all other processes in its
// cgroup are killed by the kernel. This is not the case for a shared
// PID namespace container, which may have some leftover processes.
if !b.c.config.Namespaces.IsPrivate(configs.NEWPID) {
	// Get container pids on a best-effort basis.
	if pids, _ := b.c.cgroupManager.GetAllPids(); len(pids) > 0 {
		return fmt.Errorf("container is in stopped state (init is gone), but its cgroup still has %d processes; use runc delete -f or runc kill", len(pids))
	}
}
```
@cyphar (Member):

So, I understand the reasoning behind this -- by some definition the container is still "alive", because pid1 is not special for non-pidns containers. And you cannot runc delete a container that is still alive; you need to runc kill it or use runc delete -f. So there is an argument that this is more consistent.

However, runc considers the container to be stopped because we bind the lifecycle of the container to pid1 (as evidenced by runc state and the fact that this code was added for stoppedState). From that perspective, it's more consistent that any container which is "stopped" (according to runc state) should be deletable with runc delete (no -f) except in weird circumstances.

IMHO, the latter is more consistent, primarily because (in my mind) killing the init process of a container means you've conceptually killed the container by definition (regardless of whether you have a separate pidns), and thus any remaining processes in the container are leftover garbage that runc delete should be happy to remove. This also has the benefit of not affecting compatibility with pre-1.2 runc. I agree the compatibility concern isn't enormous, given what an edge case host-pidns containers are, but the difference between this and the shared cgroup stuff is that the shared cgroup stuff was categorically broken and thus unusable -- this is a straightforward change of behaviour, not the removal of a broken mis-feature.

I think we should drop this part.

Member:

You mean that we should drop this part, and change runc delete (without -f) to kill the leftover processes in the stopped container?

@cyphar (Member):

Yes, I think runc delete on a container with a dead init process should work without -f because the container is considered "stopped".

Member:

I can accept changing it back to the same as before.

But there is another issue, discussed in #4040 (comment) (I can accept fixing it in another PR; this PR is only meant to fix the regression introduced by #3825). Maybe we have to retry, and still check whether there are some processes left in the container, even after we have tried our best to kill them. In that case, we should error out and not delete anything about this container.
Please see: 120dad4

@cyphar (Member) commented Nov 8, 2023:

Ah, I see. I wonder if @kolyshkin's suggestion in #4040 (comment) is workable. @kolyshkin, is there a reason you didn't go the route of turning !c.hasInit() into !c.hasProcesses(), so that the container is still in the running state despite not having a pid1?

My issue with returning an error here is that it makes stopped containers act strangely in this one scenario. If we treated containers that contain any processes at all as running, then we could require people to use runc kill, which would solve the consistency issue.

I would prefer that over killing the processes in runc delete (assuming no other issues come up with making a container without the init process act as a "running" container -- I suspect we have a fair amount of code that makes assumptions about that behaviour that aren't ideal), but I would prefer either option over returning an error here.

@kolyshkin (Contributor, Author):

I will change the last commit to kill the remaining processes rather than error out.

@kolyshkin (Contributor, Author):

Done.

@kolyshkin kolyshkin force-pushed the fix-kill branch 2 times, most recently from 0c14686 to 7e571fd Compare November 10, 2023 00:38
@kolyshkin (Contributor, Author):

@lifubang I've changed the cgroup removal logic, hopefully fixing the related issues. If you have your own repro, can you try it on this branch?

```go
// to kill those processes here.
if !c.config.Namespaces.IsPrivate(configs.NEWPID) {
	_ = signalAllProcesses(c.cgroupManager, unix.SIGKILL)
}
err := c.cgroupManager.Destroy()
```
Member:

If we say we should not ignore the error from Destroy, it means we have an opportunity to destroy it again once we get an error. But if we ignore this cgroup destroy error, the state directory is removed in the next several lines, so the container disappears from runc, while some cgroup paths or some of the container's processes are left on the host, and we can't delete them from runc anymore.

@kolyshkin (Contributor, Author):

You are right, but I did not modify this code here. I surely can.

Are you suggesting something like this?

```go
if err := c.cgroupManager.Destroy(); err != nil {
	return fmt.Errorf("can't remove container cgroup: %w", err)
}
```

@kolyshkin (Contributor, Author):

@lifubang added, PTAL

Member:

> Are you suggesting something like this?

Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find the reason, if there is one.

@kolyshkin (Contributor, Author):

> Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find the reason, if there is one.

Apparently you don't read commit messages 😄 I already looked into it and wrote about it.

@lifubang (Member):

@Burning1020, PTAL -- is there a chance you could check whether this PR fixes your issue #4040?

@Burning1020 (Contributor):

> @Burning1020 , PTAL, is there a chance to check this PR has fixed your issue #4040 or not?

OK. Will check, thanks!

@lifubang (Member):

> OK. Will check, thanks!

I think #4047 has been fixed by this PR, but I can't confirm that #4040 is fixed too. 🙏🏻 @Burning1020

@Burning1020 (Contributor) left a comment:

LGTM. #4040 is fixed, thanks.

@lifubang (Member) left a comment:

LGTM
@cyphar @AkihiroSuda PTAL

kolyshkin and others added 8 commits November 27, 2023 09:15
The semantics of runType are slightly complicated, and the only place
where we need to distinguish between Created and Running is
refreshState.

Replace runType with the simpler hasInit, simplifying its users (except
refreshState, which now figures out on its own whether the container is
Created or Running).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
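The idea behind this simplification can be sketched as follows. This is purely illustrative (a hypothetical `status` helper, not runc's real code): with hasInit and isPaused, most callers can derive the state they care about directly, while only refreshState still needs to tell Created from Running:

```go
package main

import "fmt"

// status derives the coarse container state from two booleans, roughly
// what the simplified callers can now do instead of going through
// runType. Purely illustrative: real runc also distinguishes Created
// from Running (that part remains in refreshState).
func status(hasInit, isPaused bool) string {
	switch {
	case !hasInit:
		return "stopped"
	case isPaused:
		return "paused"
	default:
		return "running"
	}
}

func main() {
	fmt.Println(status(true, false))  // running
	fmt.Println(status(true, true))   // paused
	fmt.Println(status(false, false)) // stopped
}
```
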
Let's use c.hasInit and c.isPaused where needed instead of
c.currentStatus, for simplicity.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f made it impossible to kill leftover processes in a
stopped container that does not have its own PID namespace. In other
words, if a container init is gone, it is no longer possible to use
`runc kill` to kill the leftover processes.

Fix this by moving the check for whether container init exists to after
the special case of handling a container without its own PID namespace.

While at it, fix the minor issue introduced by commit 9583b3d:
if signalAllProcesses is used, there is no need to thaw the
container (as freeze/thaw is either done in signalAllProcesses already,
or not needed at all).

Also, make signalAllProcesses return an error early if the container
cgroup does not exist (as it relies on it to do its job). This way, the
error message returned is more generic and easier to understand
("container not running" instead of "can't open file").

Finally, add a test case.

Fixes: f8ad20f
Fixes: 9583b3d
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f moved the kill logic from container destroy to container
kill (which is the right thing to do).

Alas, it broke the use case of doing "runc delete -f" for a container
which does not have its own private PID namespace, when its init process
is gone. In this case, some processes may still be running, and runc
delete -f should kill them (the same way as "runc kill" does).

It does not do that because the container status is "stopped" (as runc
considers the container with no init process as stopped), and so we only
call "destroy" (which was doing the killing before).

The fix is easy: if --force is set, call killContainer no matter what.

Add a test case, similar to the one in the previous commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
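The fixed dispatch can be sketched like this. All names here are hypothetical stand-ins (the real code lives in delete.go and calls killContainer); the point is only the ordering: with --force, kill first regardless of the reported status:

```go
package main

import (
	"fmt"
	"strings"
)

// deleteActions sketches the fixed "runc delete" dispatch described
// above: with --force, kill first no matter what the reported status
// is, because a "stopped" container without its own PID namespace may
// still have processes in its cgroup.
func deleteActions(status string, force bool) []string {
	if force {
		// Kill unconditionally; for a private-pidns container this is
		// a no-op if everything is already dead.
		return []string{"kill", "destroy"}
	}
	if status == "stopped" {
		return []string{"destroy"}
	}
	return []string{"error: cannot delete a running container"}
}

func main() {
	// A "stopped" container deleted with -f still gets the kill pass.
	fmt.Println(strings.Join(deleteActions("stopped", true), ",")) // kill,destroy
	fmt.Println(strings.Join(deleteActions("running", false), ","))
}
```
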
The current code is only doing retries in RemovePaths, which is only
used for cgroup v1 (cgroup v2 uses RemovePath, which makes no retries).

Let's remove all retry logic and logging from RemovePaths, together
with:

 - os.Stat check from RemovePaths (its usage probably made sense before
   commit 19be8e5 but not after);

 - error/warning logging from RemovePaths (this was added by commit
   19be8e5 in 2020 and so far we've seen no errors other
   than EBUSY, so reporting the actual error proved to be useless).

Add the retry logic to rmdir, together with a second bool argument,
retry. Decrease the initial delay and increase the number of retries
relative to the old implementation, so it can take up to ~1 sec before
returning EBUSY (it was about 0.3 sec).

Hopefully, as a result, we'll see fewer "failed to remove cgroup paths"
errors.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If container.Destroy() has failed, runc delete still returns 0, which is
wrong and can result in other issues down the line.

Let's always return the error from Destroy in runc delete.

For runc checkpoint and runc run, we still treat it as a warning.

Co-authored-by: Zhang Tianyang <burning9699@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(For a container with no private PID namespace, that is).

When runc delete (or container.Destroy) is called on a stopped
container without private PID namespace and there are processes
in its cgroup, kill those.

Add a test case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, the container destroy operation removes the container's
state directory even if cgroup removal fails (and then still returns an
error). It has been that way since commit 5c246d0, which added
cgroup removal.

This is problematic because once the container state dir is removed, we
no longer know the container's cgroup and thus can't remove it.

Let's return the error early and fail if cgroup can't be removed.

Same for other operations: do not proceed if we fail.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@cyphar (Member) left a comment:

LGTM, thanks for wrangling this one. On reflection, I agree that continuing to treat containers without their init process as "stopped" is the simpler solution.

4 participants