Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky test: kill KILL [host pidns] #4163

Closed
lifubang opened this issue Jan 7, 2024 · 8 comments · Fixed by #4179
Closed

flaky test: kill KILL [host pidns] #4163

lifubang opened this issue Jan 7, 2024 · 8 comments · Fixed by #4179

Comments

@lifubang
Copy link
Member

lifubang commented Jan 7, 2024

When I run integration tests in my local machine, the tests about kill container with host pidns will alway fail.

bats tests/integration/kill.bats 
kill.bats
 ✓ kill detached busybox
 ✗ kill KILL [host pidns]
   (from function `test_host_pidns_kill' in file tests/integration/kill.bats, line 66,
    in test file tests/integration/kill.bats, line 103)
     `test_host_pidns_kill' failed
   runc spec (status=0):
   
   runc run -d --console-socket /tmp/bats-run-AwVtsc/runc.7MpMpS/tty/sock test_busybox (status=0):
   
   pids: 700996
   701019
   701035
   701050
   701066
   701081
   runc kill test_busybox KILL (status=0):
   
   pids: 701050
   rmdir: failed to remove '/sys/fs/cgroup/cpuacct//runc-cgroups-integration-test': No such file or directory
   rmdir: failed to remove '/sys/fs/cgroup/net_prio//runc-cgroups-integration-test': No such file or directory
 ✗ kill KILL [host pidns + init gone]
   (from function `test_host_pidns_kill' in file tests/integration/kill.bats, line 66,
    in test file tests/integration/kill.bats, line 124)
     `test_host_pidns_kill' failed
   runc spec (status=0):
   
   runc run -d --console-socket /tmp/bats-run-AwVtsc/runc.SkmAZh/tty/sock test_busybox (status=0):
   
   pids: 701272
   701289
   701304
   701320
   701337
   runc kill test_busybox KILL (status=0):
   
   pids: 701337
   rmdir: failed to remove '/sys/fs/cgroup/cpuacct//runc-cgroups-integration-test': No such file or directory
   rmdir: failed to remove '/sys/fs/cgroup/net_prio//runc-cgroups-integration-test': No such file or directory

3 tests, 2 failures
@kolyshkin
Copy link
Contributor

Interestingly, I've never seen it, either in CI or locally. I see that you're using cgroupv1 -- any other details that might be relevant for a repro?

@kolyshkin
Copy link
Contributor

e.g. kernel version and distro?

@lifubang
Copy link
Member Author

lifubang commented Jan 9, 2024

root@iZj6cdnzj9sp96xf38htrnZ:~/go/src/github.com/opencontainers/runc# git branch
  fix-ci-local-cgroup-awk
  fix-test-kill-hostpidns
* main
root@iZj6cdnzj9sp96xf38htrnZ:~/go/src/github.com/opencontainers/runc# uname -a
Linux iZj6cdnzj9sp96xf38htrnZ 5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@iZj6cdnzj9sp96xf38htrnZ:~/go/src/github.com/opencontainers/runc# bats tests/integration/kill.bats
kill.bats
 ✓ kill detached busybox
 ✓ kill KILL [host pidns]
 ✗ kill KILL [host pidns + init gone]
   (from function `test_host_pidns_kill' in file tests/integration/kill.bats, line 66,
    in test file tests/integration/kill.bats, line 124)
     `test_host_pidns_kill' failed
   runc spec (status=0):

   runc run -d --console-socket /tmp/bats-run-hZmZJC/runc.rT6rgh/tty/sock test_busybox (status=0):

   pids: 19066
   19082
   19098
   19114
   19130
   runc kill test_busybox KILL (status=0):

   pids: 19114
   rmdir: failed to remove '/sys/fs/cgroup/cpuacct//runc-cgroups-integration-test': No such file or directory
   rmdir: failed to remove '/sys/fs/cgroup/net_prio//runc-cgroups-integration-test': No such file or directory

3 tests, 1 failure

root@iZj6cdnzj9sp96xf38htrnZ:~/go/src/github.com/opencontainers/runc#

@lifubang
Copy link
Member Author

Another flaky I saw in github action:

not ok 120 kill KILL [host pidns + init gone]
# (from function `test_host_pidns_kill' in file tests/integration/kill.bats, line 56,
#  in test file tests/integration/kill.bats, line 124)
#   `test_host_pidns_kill' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-WwMIyU/runc.coexOx/tty/sock test_busybox (status=0):
#
# pids: 76212
# 76236
# 76253
# 76270
# 76288
# 76305
# /home/runner/work/runc/runc/tests/integration/kill.bats: line 56: kill: (76212) - No such process

Maybe we should wait some time to see cgroup.procs after we kill a pid.

@kolyshkin
Copy link
Contributor

This last one is different from the previous few (it was line 66 before). Now it is this:

(from function `test_host_pidns_kill' in file tests/integration/kill.bats, line 56,

Here is it:

kill -0 "$p"

What's happening here is:

  • we start a container
  • exec a few sleep processes in it
  • kill container's init
  • check that these processes are still there

Ah! I think what happens is we kill init, but its pid is still listed in cgroup.procs.


Now, the older failures (line 66) are happening because this code:

wait_for_container 10 1 test_busybox stopped

does not do anything if container's init is already killed. Again, apparently we read cgroup.procs too fast for the kernel to update the list of processes.


For both cases, I can't think of anything but to add a sleep after kill.

@kolyshkin
Copy link
Contributor

Should be fixed by #4179. As much as I don't like adduing kludges like this, I can't think of any other way to solve this.

@kolyshkin
Copy link
Contributor

For both cases, I can't think of anything but to add a sleep after kill.

I was wrong. If the pid is known, one can wait in loop doing kill -0 (if the pid is known), or, for newer linux kernels, use pidfd event reporting. Let me see what I can do.

@kolyshkin
Copy link
Contributor

For both cases, I can't think of anything but to add a sleep after kill.

I was wrong. If the pid is known, one can wait in loop doing kill -0 (if the pid is known), or, for newer linux kernels, use pidfd event reporting. Let me see what I can do.

Implemented via kill -0 in a loop to wait for the process to be gone.

I think that maybe it's better to implement it in runc (rather than the test case). IOW make runc kill (and runc delete -f) synchronous. This should fix some corner cases, e.g. killing the container and creating a new one using the same cgroup, but the cgroup is not (yet) empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants