CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

wainersm · 2024-02-05T20:36:31Z

The kata-monitor tests for QEMU hypervisor and CRI-O has failed for a while now.

For example, on https://github.com/kata-containers/kata-containers/actions/runs/7754650968/job/21149196848?pr=9003, it failed on the CreateContainer operation:

[gha-run.sh:62] INFO: Running cri-containerd tests using qemu hypervisor

* STEP: pre-checks
OK: connect to the container engine

* STEP: pull the image to be used
Image is up to date for docker.io/library/busybox@sha256:538721340ded10875f4710cad688c70e5d0ecb4dcd5e7d0c161f301f36f79414

* STEP: create workloads
OK: start workload (runc) - POD ID:c062fff4593cebb93a4a7a45f77d922d46e827058bef6c0bb0744a2720aa7c81, CID:312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050
E0202 10:58:52.305103    5339 remote_runtime.go:302] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = CreateContainer failed: the file top was not found: unknown" podSandboxID="9cba4c73f4e9386a37bd6756b1eca1fcba7433b82048cd104182e9981642a19b"
time="2024-02-02T10:58:52Z" level=fatal msg="creating container: rpc error: code = Unknown desc = CreateContainer failed: the file top was not found: unknown"
E0202 10:58:53.463181    5451 remote_runtime.go:349] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist" containerID="312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050"
time="2024-02-02T10:58:53Z" level=fatal msg="stopping the container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist"
E0202 10:58:53.484174    5457 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist" containerID="312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050"
time="2024-02-02T10:58:53Z" level=error msg="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist"
time="2024-02-02T10:58:53Z" level=fatal msg="unable to remove container(s)"

ERROR: cannot start workload (kata)

kata-monitor testing: FAILED!
Error: Process completed with exit code 1

The sandbox/pod is created apparently fine. When it tries to create the container with the following yaml:

{
	"metadata": {
		"name": "busybox",
		"namespace": "default",
		"uid": "busybox-container-uid"
	},
	"image":{
		"image": "busybox"
	},
	"command": [
		"top"
	],
	"log_path":"busybox.log",
	"linux": {
	}
}

I was able to reproduce the error locally. I tried to solve by passing the full path to top (/bin/top) on the above yaml, but it failed with same file not found error.

An snippet of journalctl -t crio:

Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907164447Z" level=info msg="Creating container: //" id=51ce9f6b-fb8c-4e80-b853-04f8a7041983 name=/runtime.v1.RuntimeService/CreateContainer
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907394064Z" level=warning msg="Allowed annotations are specified for workload []"
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907602840Z" level=info msg="Resolved \"busybox\" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)"
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05T20:20:35.951010461Z" level=error msg="createContainer failed" error="rpc error: code = Internal desc = the file /bin/top was not found" name=containerd-shim-v2 pid=13617 sandbox=b43af2a6f47a3825a298eaabc6dc598b25c657aa962a8ad6e2f846fca24fa47f source=virtcontainers subsystem=kata_agent
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05T20:20:35.951243569Z" level=error msg="container create failed" container=b05b522691ee847911a80352023648922aa68210a30c7702e40e4ad8a310fcc8 error="rpc error: code = Internal desc = the file /bin/top was not found" name=containerd-shim-v2 pid=13617 sandbox=b43af2a6f47a3825a298eaabc6dc598b25c657aa962a8ad6e2f846fca24fa47f source=virtcontainers subsystem=container

The text was updated successfully, but these errors were encountered:

wainersm · 2024-02-05T20:37:50Z

Hi @littlejawa ! Is this a known issue? I search on the issues list but I found anything related.

Any idea of what might be causing the failure?

littlejawa · 2024-02-07T10:35:14Z

Not a known issue, and I really don't know what's wrong :-/
I've just tried on my side, and the container runs (with or without kata).
I'm not sure I'm getting the exact same image though - the digest I have is different, even though it says it takes it from docker.io. But it's probably a red herring (not sure I'm even looking at the right thing).

BbolroC · 2024-02-07T23:36:32Z

Hi, it seems that something has gone wrong since the 1ts of Feb.

For run-monitor, https://github.com/kata-containers/kata-containers/actions/runs/7731996768/job/21119861538 was successful on the 1st of Feb.

The nightly test for s390x was able to run a container (at least) on that day: https://github.com/kata-containers/kata-containers/actions/runs/7736480131/job/21093903151, but no success on running a container since then.

(Sorry, the job is wrongly named as k8s-cri-containerd-rhel9-e2e-tests even if it is carried out with qemu-crio combination.)

A pod is stuck in a ContainerCreating state after the following event:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  75s   default-scheduler  Successfully assigned default/handlers to kata-gha-runner-builder-04
  Normal  Pulling    65s   kubelet            Pulling image "quay.io/sjenning/nginx:1.15-alpine"
  Normal  Pulled     62s   kubelet            Successfully pulled image "quay.io/sjenning/nginx:1.15-alpine" in 2.723232665s

I will dig into this more and share something looking informative soon. Thanks.

BbolroC · 2024-02-08T07:04:16Z

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:

9317e23

@fadecoder Could you have a look at this?

CC: @lifupan @gkurz @WenyuanLau

gkurz · 2024-02-08T09:18:25Z

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:

9317e23

Hi @BbolroC ! Good catch ! Did you git bisect ? 😉

This comment in the corresponding PR #8760 clearly asks to add SYS_ADMIN to the default capability set for the cri-o case.

This got addressed by @fadecoder with 091cf1b but then the PR had to be rebased and the change disappeared in the final 9317e23 commit.

It is an example of how the forced rebases because of renamed CI checks can be harmful (Cc @fidencio @stevenhorsman ).

I don't think this is the cause of the error though as explained below.

Another interesting fact is that the kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio) check of #8760 failed with exact same the file top was not found error. This was missed, likely because the monitor check isn't required.

All runs of this check failed according to https://github.com/kata-containers/kata-containers/actions/workflows/ci-on-push.yaml?query=actor%3Afadecoder+. Even the run for 091cf1b . This seems to indicate that the lacking SYS_ADMIN isn't the culprit.

@fadecoder Could you have a look at this?

CC: @lifupan @gkurz @WenyuanLau

gkurz · 2024-02-09T13:48:35Z

I could run the monitor test successfully with an experimental hack in #9063 .

wainersm · 2024-02-09T14:42:41Z

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:
9317e23

Hi @BbolroC ! Good catch ! Did you git bisect ? 😉

@BbolroC ++

This comment in the corresponding PR #8760 clearly asks to add SYS_ADMIN to the default capability set for the cri-o case.

This got addressed by @fadecoder with 091cf1b but then the PR had to be rebased and the change disappeared in the final 9317e23 commit.

It is an example of how the forced rebases because of renamed CI checks can be harmful (Cc @fidencio @stevenhorsman ).

I don't think this is the cause of the error though as explained below.

Another interesting fact is that the kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio) check of #8760 failed with exact same the file top was not found error. This was missed, likely because the monitor check isn't required.

There are a bunch of checkers non-required. I've been monitoring them since last week to see those that could be required, that's the reason why I opened this issue :)

It seems a topic for the AC meeting: how/when/what switch to current stable checkers to required.

All runs of this check failed according to https://github.com/kata-containers/kata-containers/actions/workflows/ci-on-push.yaml?query=actor%3Afadecoder+. Even the run for 091cf1b . This seems to indicate that the lacking SYS_ADMIN isn't the culprit.

@fadecoder Could you have a look at this?
CC: @lifupan @gkurz @WenyuanLau

PR kata-containers#8760 tentatively tried to have the shim to run in its own mount namespace for the sake of improving isolation between the sandbox and the host. This was inspired by the runtime-rs implementation. Unfortunately, the kata-monitor checks in CI have been failing since then with CRI-O. The precise path of code that got broken by kata-containers#8760 isn't known at this time. Looking closer at runtime-rs, it appears that only the shim daemon unshares its mount namespace. Do the same with the go runtime. Similarly to what runtime-rs does, look at the command line arguments to detect that the current process is the daemon. Since the go shim command line parsing belongs to containerd vendored code, it isn't really practical to do exactly the same though. Go for a variant : introduce a function that heuristicaly validates that the current process was spawned by the NewCommand() function. A C-style assert is added in NewCommand() to ensure any code change that would break the assumptions of the validating function are detected by CI. The panic *cannot* happen in production. This is a bit hacky and could certainly be handled cleanlier with a change in the containerd shim package. The go shim is expected to sunset soon though. It thus doesn't seem worth the pain. Fixes kata-containers#9028 Signed-off-by: Greg Kurz <groug@kaod.org>

PR kata-containers#8760 tentatively tried to have the shim to run in its own mount namespace for the sake of improving isolation between the sandbox and the host. Thus crio storage drivers shouldn't create a PRIVATE bind mount on their home directory. Otherwise, the container's rootfs mount wouldn't be propagated to kata runtime's mount namespace, and kata runtime couldn't access the container's rootfs files. So, when kata cooperated with crio, crio should set skip_mount_home=true for its storage overlay. Fixes: kata-containers#9028 Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>

wainersm added bug Incorrect behaviour needs-review Needs to be assessed by the team. area/ci Issues affecting the continuous integration area/CRI-O Interaction with CRI-O labels Feb 5, 2024

katacontainersbot added this to To do in Issue backlog Feb 5, 2024

gkurz mentioned this issue Feb 22, 2024

runtime: Only unshare mount namespace in the shim daemon #9063

Closed

lifupan mentioned this issue Mar 3, 2024

CI: fix the issue of ci failure on crio #9206

Merged

katacontainersbot moved this from To do to In progress in Issue backlog Mar 3, 2024

gkurz closed this as completed in #9206 Mar 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

wainersm commented Feb 5, 2024

wainersm commented Feb 5, 2024

littlejawa commented Feb 7, 2024

BbolroC commented Feb 7, 2024 •

edited

BbolroC commented Feb 8, 2024 •

edited

gkurz commented Feb 8, 2024

gkurz commented Feb 9, 2024

wainersm commented Feb 9, 2024

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

Comments

wainersm commented Feb 5, 2024

wainersm commented Feb 5, 2024

littlejawa commented Feb 7, 2024

BbolroC commented Feb 7, 2024 • edited

BbolroC commented Feb 8, 2024 • edited

gkurz commented Feb 8, 2024

gkurz commented Feb 9, 2024

wainersm commented Feb 9, 2024

BbolroC commented Feb 7, 2024 •

edited

BbolroC commented Feb 8, 2024 •

edited