Handle containerd "CRIU not found" error message #123886

adrianreber · 2024-03-12T08:28:45Z

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

During the PR to get "Forensic Container Checkpointing" enabled in containerd the decision was made to not correctly report if containerd cannot find the CRIU binary. The reason was that the e2e_node checkpoint test did not understand the error message.

The e2e_node checkpoint test is skipped if the container runtime (CRI-O or containerd) does not enable checkpoint support of if checkpoint support is not implemented.

This commit adds another reason to skip a check. If the underlying OS which is used to test "Forensic Container Checkpointing" in combination with containerd or CRI-O is missing the CRIU binary.

This was encountered on Google's Container-Optimized OS (COS) based tests where CRIU was not installed.

With this change merged it is possible for containerd to return the correct error message without breaking Kubernetes e2e tests.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

During the PR to get "Forensic Container Checkpointing" enabled in containerd the decision was made to not correctly report if containerd cannot find the CRIU binary. The reason was that the e2e_node checkpoint test did not understand the error message. The e2e_node checkpoint test is skipped if the container runtime (CRI-O or containerd) does not enable checkpoint support of if checkpoint support is not implemented. This commit adds another reason to skip a check. If the underlying OS which is used to test "Forensic Container Checkpointing" in combination with containerd or CRI-O is missing the CRIU binary. This was encountered on Google's Container-Optimized OS (COS) based tests where CRIU was not installed. With this change merged it is possible for containerd to return the correct error message without breaking Kubernetes e2e tests. Signed-off-by: Adrian Reber <areber@redhat.com>

k8s-ci-robot · 2024-03-12T08:28:53Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2024-03-12T08:28:54Z

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

adrianreber · 2024-03-12T08:33:50Z

Corresponding containerd changes: containerd/containerd#9960

dims · 2024-03-12T13:01:23Z

/ok-to-test

dims · 2024-03-12T13:01:59Z

/assign @mrunalp

adrianreber · 2024-03-12T13:59:01Z

/test pull-kubernetes-e2e-gce

kannon92 · 2024-03-12T15:07:25Z

test/e2e_node/checkpoint_container.go

@@ -239,6 +242,13 @@ var _ = SIGDescribe("Checkpoint Container", nodefeature.CheckpointContainer, fun
 					ginkgo.Skip("Container engine does not implement 'CheckpointContainer'")
 					return
 				}
+				if strings.Contains(


Is there a reason why this should be in the test?

We can install criu into the VM for containerd and make sure that we have that binary.

The containerd tests I have seen are running on Google's Container Optimized OS (COS) which does not allow installing additional tools except in /home. So currently containerd returns the wrong error message to not break Kubernetes.

Installing CRIU on COS would require a lot of additional libraries as CRIU cannot easily be linked statically. It also uses dlopen for some parts. So at some point it seemed really difficult to install CRIU on COS. Unless Google decides to add CRIU to COS.

For CRI-O based tests which are running on Fedora or RHEL bases OSes it is not a problem as runc pulls in CRIU.

Additionally I think it would make sense to correctly handle a system without CRIU installed.

My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?

I agree that this is a case that should be handled but that should belong in application code not in test code.

Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?

My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?

I don't know. It probably is possible to filter out containerd based tests. I have no idea if CRIU will be in COS and who could help with it. Or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on other platforms than COS.

Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?

From what I saw, yes. The PR that changes the error message from a generic not implemented messages to a CRIU is too old or missing message currently fails with:

{ failed [FAILED] Unexpected status code (500) during 'CheckpointContainer': "an error on the server (\"checkpointing of checkpoint-container-test-8173/checkpoint-container-pod/test-container-1 failed (rpc error: code = Unknown desc = CRIU binary not found or too old (<31600). Failed to checkpoint container \\\"7aa3746cc64e67f2c4f4fba6d871b081bb56bbf300f590ff19a34aaa5fde973f\\\": failed to check for criu version: exec: \\\"criu\\\": executable file not found in $PATH)\") has prevented the request from succeeding (post nodes tmp-node-e2e-5c2a30c5-cos-beta-113-18244-1-14:10250)" In [It] at: k8s.io/kubernetes/test/e2e_node/checkpoint_container.go:243 @ 03/12/24 12:04:34.894 }

I don't know. It probably is possible to filter out containerd based tests. I have no idea if CRIU will be in COS and who could help with it. Or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on other platforms than COS.

For swap we used ubuntu and containerd. https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial.

So IMO I would maybe consider a separate job for checkpointing (for both crio/containerd) where you can control the image. And by default we don't include this test. You can always use ubuntu/fcos and install what you need for these tests. And then turn the config on for your test.

It may be worth adding this as it seems that we need more control over the image than the default (especially for cos).

I think this PR is fine now that we discussed it. I still think we should consider having more control of the image especially since this feature is not on by default for containerd/crio at the moment.

I guess we can skip on normal tests if checkpointing is not set up or criu does not exist.

dims · 2024-03-12T16:11:46Z

here's what i think ...

We cannot assume criu will be in every environment (end user)
We cannot assume criu will be in every testing environment
e2e_node.test can check if criu is available before running a test that needs it (as the e2e_node.test is running on the same node with the containerd/crio/kubelet etc) - for a positive test - things that need CRIU work correctly
We should make sure we have tests that fail "the right way" when criu is absent if we don't have already in our test suite - for a negative test (when the test env does not have criu)

adrianreber · 2024-03-12T16:17:11Z

here's what i think ...

* We cannot assume `criu` will be in every environment (end user)

* We cannot assume `criu` will be in every testing environment

* e2e_node.test can check if `criu` is available before running a test that needs it (as the e2e_node.test is running on the same node with the containerd/crio/kubelet etc) - for a positive test - things that need CRIU work correctly

I can do the same check I do in containerd and CRI-O also in e2e_node before running the test and skip it then.

* We should make sure we have tests that fail "the right way" when `criu` is absent if we don't have already in our test suite - for a negative test (when the test env does not have `criu`)

Fail "the right way". Is that skip or is there something better than skip? @kannon92 and I were already talking about how to fail better in a previous PR.

kannon92 · 2024-03-12T16:20:22Z

Fail "the right way". Is that skip or is there something better than skip? @kannon92 and I were already talking about how to fail better in a previous PR.

I don't like skip because we aren't really testing anything. You can always assert on the case you expect in the test rather than skipping. It may be a bit painful over time as if the runtime changes any error message or whatever, we would want to adjust the tests. But I think that is worth it over a skip and we never look at the test.

kannon92 · 2024-03-13T12:53:50Z

/lgtm
/assign @SergeyKanzhelev @mrunalp

k8s-ci-robot · 2024-03-13T12:53:56Z

LGTM label has been added.

Git tree hash: 538f5bfa9e1c499dd01b79ccf3141945b3ebabe9

dims · 2024-03-13T13:44:16Z

Fail "the right way". Is that skip or is there something better than skip?

was mostly thinking about the folks who are trying to use tools to trigger a checkpoint (via k8s) and they get a proper message that it is not supported or has failed...

dims · 2024-05-04T12:39:00Z

/approve
/lgtm

happy landing this test only change.

k8s-ci-robot · 2024-05-04T12:39:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e_node/OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from matthyx and SergeyKanzhelev March 12, 2024 08:28

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 12, 2024

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 12, 2024

adrianreber mentioned this pull request Mar 12, 2024

Return correct error if CRIU binary is missing containerd/containerd#9960

Merged

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 12, 2024

k8s-ci-robot assigned mrunalp Mar 12, 2024

kannon92 reviewed Mar 12, 2024

View reviewed changes

k8s-ci-robot assigned SergeyKanzhelev Mar 13, 2024

k8s-ci-robot assigned kannon92 Mar 13, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2024

k8s-ci-robot assigned dims May 4, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2024

k8s-ci-robot merged commit de16748 into kubernetes:master May 4, 2024

k8s-ci-robot added this to the v1.31 milestone May 4, 2024

Handle containerd "CRIU not found" error message #123886

Handle containerd "CRIU not found" error message #123886

Uh oh!

Conversation

adrianreber commented Mar 12, 2024

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

k8s-ci-robot commented Mar 12, 2024

Uh oh!

k8s-ci-robot commented Mar 12, 2024

Uh oh!

adrianreber commented Mar 12, 2024

Uh oh!

dims commented Mar 12, 2024

Uh oh!

dims commented Mar 12, 2024

Uh oh!

adrianreber commented Mar 12, 2024

Uh oh!

kannon92 Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

adrianreber Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

kannon92 Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianreber Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

kannon92 Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

kannon92 Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

dims commented Mar 12, 2024

Uh oh!

adrianreber commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kannon92 commented Mar 12, 2024

Uh oh!

kannon92 commented Mar 13, 2024

Uh oh!

k8s-ci-robot commented Mar 13, 2024

Uh oh!

dims commented Mar 13, 2024

Uh oh!

dims commented May 4, 2024

Uh oh!

k8s-ci-robot commented May 4, 2024

Uh oh!

Uh oh!

kannon92 Mar 12, 2024 •

edited

Loading

adrianreber commented Mar 12, 2024 •

edited

Loading