Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

Closed
wainersm opened this issue Feb 5, 2024 · 7 comments · Fixed by #9206
Closed

CI: "run-monitor (qemu, crio)" job failing on CreateContainer #9028

wainersm opened this issue Feb 5, 2024 · 7 comments · Fixed by #9206
Labels
area/ci Issues affecting the continuous integration area/CRI-O Interaction with CRI-O bug Incorrect behaviour needs-review Needs to be assessed by the team.

Comments

@wainersm
Copy link
Contributor

wainersm commented Feb 5, 2024

The kata-monitor tests for QEMU hypervisor and CRI-O has failed for a while now.

For example, on https://github.com/kata-containers/kata-containers/actions/runs/7754650968/job/21149196848?pr=9003, it failed on the CreateContainer operation:

[gha-run.sh:62] INFO: Running cri-containerd tests using qemu hypervisor

* STEP: pre-checks
OK: connect to the container engine

* STEP: pull the image to be used
Image is up to date for docker.io/library/busybox@sha256:538721340ded10875f4710cad688c70e5d0ecb4dcd5e7d0c161f301f36f79414

* STEP: create workloads
OK: start workload (runc) - POD ID:c062fff4593cebb93a4a7a45f77d922d46e827058bef6c0bb0744a2720aa7c81, CID:312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050
E0202 10:58:52.305103    5339 remote_runtime.go:302] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = CreateContainer failed: the file top was not found: unknown" podSandboxID="9cba4c73f4e9386a37bd6756b1eca1fcba7433b82048cd104182e9981642a19b"
time="2024-02-02T10:58:52Z" level=fatal msg="creating container: rpc error: code = Unknown desc = CreateContainer failed: the file top was not found: unknown"
E0202 10:58:53.463181    5451 remote_runtime.go:349] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist" containerID="312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050"
time="2024-02-02T10:58:53Z" level=fatal msg="stopping the container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist"
E0202 10:58:53.484174    5457 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist" containerID="312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050"
time="2024-02-02T10:58:53Z" level=error msg="rpc error: code = NotFound desc = could not find container \"312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050\": container with ID starting with 312ad5b30e2ea970e1f5426846bb6e6b36ed161fdbad781ac4fa050a7cb3c050 not found: ID does not exist"
time="2024-02-02T10:58:53Z" level=fatal msg="unable to remove container(s)"

ERROR: cannot start workload (kata)

kata-monitor testing: FAILED!
Error: Process completed with exit code 1

The sandbox/pod is created apparently fine. When it tries to create the container with the following yaml:

{
	"metadata": {
		"name": "busybox",
		"namespace": "default",
		"uid": "busybox-container-uid"
	},
	"image":{
		"image": "busybox"
	},
	"command": [
		"top"
	],
	"log_path":"busybox.log",
	"linux": {
	}
}

I was able to reproduce the error locally. I tried to solve by passing the full path to top (/bin/top) on the above yaml, but it failed with same file not found error.

An snippet of journalctl -t crio:

Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907164447Z" level=info msg="Creating container: //" id=51ce9f6b-fb8c-4e80-b853-04f8a7041983 name=/runtime.v1.RuntimeService/CreateContainer
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907394064Z" level=warning msg="Allowed annotations are specified for workload []"
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05 20:20:35.907602840Z" level=info msg="Resolved \"busybox\" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)"
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05T20:20:35.951010461Z" level=error msg="createContainer failed" error="rpc error: code = Internal desc = the file /bin/top was not found" name=containerd-shim-v2 pid=13617 sandbox=b43af2a6f47a3825a298eaabc6dc598b25c657aa962a8ad6e2f846fca24fa47f source=virtcontainers subsystem=kata_agent
Feb 05 20:20:35 kata-dev crio[9825]: time="2024-02-05T20:20:35.951243569Z" level=error msg="container create failed" container=b05b522691ee847911a80352023648922aa68210a30c7702e40e4ad8a310fcc8 error="rpc error: code = Internal desc = the file /bin/top was not found" name=containerd-shim-v2 pid=13617 sandbox=b43af2a6f47a3825a298eaabc6dc598b25c657aa962a8ad6e2f846fca24fa47f source=virtcontainers subsystem=container
@wainersm wainersm added bug Incorrect behaviour needs-review Needs to be assessed by the team. area/ci Issues affecting the continuous integration area/CRI-O Interaction with CRI-O labels Feb 5, 2024
@wainersm
Copy link
Contributor Author

wainersm commented Feb 5, 2024

Hi @littlejawa ! Is this a known issue? I search on the issues list but I found anything related.

Any idea of what might be causing the failure?

@littlejawa
Copy link
Contributor

Not a known issue, and I really don't know what's wrong :-/
I've just tried on my side, and the container runs (with or without kata).
I'm not sure I'm getting the exact same image though - the digest I have is different, even though it says it takes it from docker.io. But it's probably a red herring (not sure I'm even looking at the right thing).

@BbolroC
Copy link
Member

BbolroC commented Feb 7, 2024

Hi, it seems that something has gone wrong since the 1ts of Feb.

For run-monitor, https://github.com/kata-containers/kata-containers/actions/runs/7731996768/job/21119861538 was successful on the 1st of Feb.

The nightly test for s390x was able to run a container (at least) on that day: https://github.com/kata-containers/kata-containers/actions/runs/7736480131/job/21093903151, but no success on running a container since then.

(Sorry, the job is wrongly named as k8s-cri-containerd-rhel9-e2e-tests even if it is carried out with qemu-crio combination.)

A pod is stuck in a ContainerCreating state after the following event:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  75s   default-scheduler  Successfully assigned default/handlers to kata-gha-runner-builder-04
  Normal  Pulling    65s   kubelet            Pulling image "quay.io/sjenning/nginx:1.15-alpine"
  Normal  Pulled     62s   kubelet            Successfully pulled image "quay.io/sjenning/nginx:1.15-alpine" in 2.723232665s

I will dig into this more and share something looking informative soon. Thanks.

@BbolroC
Copy link
Member

BbolroC commented Feb 8, 2024

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:

9317e23

@fadecoder Could you have a look at this?

CC: @lifupan @gkurz @WenyuanLau

@gkurz
Copy link
Member

gkurz commented Feb 8, 2024

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:

9317e23

Hi @BbolroC ! Good catch ! Did you git bisect ? 😉

This comment in the corresponding PR #8760 clearly asks to add SYS_ADMIN to the default capability set for the cri-o case.

This got addressed by @fadecoder with 091cf1b but then the PR had to be rebased and the change disappeared in the final 9317e23 commit.

It is an example of how the forced rebases because of renamed CI checks can be harmful (Cc @fidencio @stevenhorsman ).

I don't think this is the cause of the error though as explained below.

Another interesting fact is that the kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio) check of #8760 failed with exact same the file top was not found error. This was missed, likely because the monitor check isn't required.

All runs of this check failed according to https://github.com/kata-containers/kata-containers/actions/workflows/ci-on-push.yaml?query=actor%3Afadecoder+. Even the run for 091cf1b . This seems to indicate that the lacking SYS_ADMIN isn't the culprit.

@fadecoder Could you have a look at this?

CC: @lifupan @gkurz @WenyuanLau

@gkurz
Copy link
Member

gkurz commented Feb 9, 2024

I could run the monitor test successfully with an experimental hack in #9063 .

@wainersm
Copy link
Contributor Author

wainersm commented Feb 9, 2024

Update: I think I've spotted a commit which causes the issue. The test has been passed when I reverted the following:
9317e23

Hi @BbolroC ! Good catch ! Did you git bisect ? 😉

@BbolroC ++

This comment in the corresponding PR #8760 clearly asks to add SYS_ADMIN to the default capability set for the cri-o case.

This got addressed by @fadecoder with 091cf1b but then the PR had to be rebased and the change disappeared in the final 9317e23 commit.

It is an example of how the forced rebases because of renamed CI checks can be harmful (Cc @fidencio @stevenhorsman ).

I don't think this is the cause of the error though as explained below.

Another interesting fact is that the kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio) check of #8760 failed with exact same the file top was not found error. This was missed, likely because the monitor check isn't required.

There are a bunch of checkers non-required. I've been monitoring them since last week to see those that could be required, that's the reason why I opened this issue :)

It seems a topic for the AC meeting: how/when/what switch to current stable checkers to required.

All runs of this check failed according to https://github.com/kata-containers/kata-containers/actions/workflows/ci-on-push.yaml?query=actor%3Afadecoder+. Even the run for 091cf1b . This seems to indicate that the lacking SYS_ADMIN isn't the culprit.

@fadecoder Could you have a look at this?
CC: @lifupan @gkurz @WenyuanLau

gkurz added a commit to gkurz/kata-containers that referenced this issue Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
gkurz added a commit to gkurz/kata-containers that referenced this issue Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
gkurz added a commit to gkurz/kata-containers that referenced this issue Feb 22, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this issue Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this issue Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this issue Feb 28, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this issue Feb 29, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
fidencio pushed a commit to fidencio/kata-containers that referenced this issue Feb 29, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its
own mount namespace for the sake of improving isolation
between the sandbox and the host. This was inspired by
the runtime-rs implementation.

Unfortunately, the kata-monitor checks in CI have been failing
since then with CRI-O. The precise path of code that got broken
by kata-containers#8760 isn't known at this time.

Looking closer at runtime-rs, it appears that only the shim
daemon unshares its mount namespace. Do the same with the go
runtime.

Similarly to what runtime-rs does, look at the command
line arguments to detect that the current process is the
daemon.

Since the go shim command line parsing belongs to containerd
vendored code, it isn't really practical to do exactly the
same though.

Go for a variant : introduce a function that heuristicaly validates
that the current process was spawned by the NewCommand() function.
A C-style assert is added in NewCommand() to ensure any code change
that would break the assumptions of the validating function are
detected by CI. The panic *cannot* happen in production.

This is a bit hacky and could certainly be handled cleanlier
with a change in the containerd shim package. The go shim
is expected to sunset soon though. It thus doesn't seem worth
the pain.

Fixes kata-containers#9028

Signed-off-by: Greg Kurz <groug@kaod.org>
lifupan pushed a commit to lifupan/kata-containers that referenced this issue Mar 3, 2024
PR kata-containers#8760 tentatively tried to have the shim to run in its own mount
namespace for the sake of improving isolation between the sandbox and
the host. Thus crio storage drivers shouldn't create a PRIVATE
bind mount on their home directory. Otherwise, the container's rootfs
mount wouldn't be propagated to kata runtime's mount namespace, and
kata runtime couldn't access the container's rootfs files.

So, when kata cooperated with crio, crio should set
skip_mount_home=true for its storage overlay.

Fixes: kata-containers#9028

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
@katacontainersbot katacontainersbot moved this from To do to In progress in Issue backlog Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ci Issues affecting the continuous integration area/CRI-O Interaction with CRI-O bug Incorrect behaviour needs-review Needs to be assessed by the team.
Projects
Issue backlog
  
In progress
4 participants