Handle containerd "CRIU not found" error message #123886
During the PR to get "Forensic Container Checkpointing" enabled in containerd, the decision was made not to report correctly when containerd cannot find the CRIU binary. The reason was that the e2e_node checkpoint test did not understand the error message.

The e2e_node checkpoint test is skipped if the container runtime (CRI-O or containerd) does not enable checkpoint support or if checkpoint support is not implemented.

This commit adds another reason to skip the test: the underlying OS used to test "Forensic Container Checkpointing" in combination with containerd or CRI-O is missing the CRIU binary.

This was encountered in tests based on Google's Container-Optimized OS (COS), where CRIU was not installed.

With this change merged, containerd can return the correct error message without breaking Kubernetes e2e tests.

Signed-off-by: Adrian Reber <areber@redhat.com>
Corresponding containerd changes: containerd/containerd#9960
/ok-to-test
/assign @mrunalp
/test pull-kubernetes-e2e-gce
@@ -239,6 +242,13 @@ var _ = SIGDescribe("Checkpoint Container", nodefeature.CheckpointContainer, fun
ginkgo.Skip("Container engine does not implement 'CheckpointContainer'")
return
}
if strings.Contains(
Is there a reason why this should be in the test?
We can install criu into the VM for containerd and make sure that we have that binary.
The containerd tests I have seen are running on Google's Container-Optimized OS (COS), which does not allow installing additional tools except in /home. So currently containerd returns the wrong error message in order not to break Kubernetes.
Installing CRIU on COS would require a lot of additional libraries, as CRIU cannot easily be linked statically. It also uses dlopen for some parts. So at some point it seemed really difficult to install CRIU on COS, unless Google decides to add CRIU to COS.
For CRI-O based tests, which are running on Fedora or RHEL-based OSes, it is not a problem because runc pulls in CRIU.
Additionally I think it would make sense to correctly handle a system without CRIU installed.
My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?
I agree that this is a case that should be handled but that should belong in application code not in test code.
Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?
> My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?
I don't know. It is probably possible to filter out containerd-based tests. I have no idea if CRIU will be in COS or who could help with it, or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on platforms other than COS.
> Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?
From what I saw, yes. The PR that changes the error message from a generic "not implemented" message to a "CRIU is too old or missing" message currently fails with:
{ failed [FAILED] Unexpected status code (500) during 'CheckpointContainer': "an error on the server (\"checkpointing of checkpoint-container-test-8173/checkpoint-container-pod/test-container-1 failed (rpc error: code = Unknown desc = CRIU binary not found or too old (<31600). Failed to checkpoint container \\\"7aa3746cc64e67f2c4f4fba6d871b081bb56bbf300f590ff19a34aaa5fde973f\\\": failed to check for criu version: exec: \\\"criu\\\": executable file not found in $PATH)\") has prevented the request from succeeding (post nodes tmp-node-e2e-5c2a30c5-cos-beta-113-18244-1-14:10250)"
In [It] at: k8s.io/kubernetes/test/e2e_node/checkpoint_container.go:243 @ 03/12/24 12:04:34.894
}
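To illustrate the kind of string matching discussed here, the following is a minimal sketch, not the code from this PR. The function name `skipReason` is hypothetical, and the "not implemented" substring is an assumed wording; only the CRIU substring is copied from the failure output above.

```go
package main

import (
	"fmt"
	"strings"
)

// skipReason returns a non-empty reason when a CheckpointContainer error
// message points at an environment limitation rather than a real failure.
// Hypothetical helper: only the CRIU substring comes from the log above.
func skipReason(errMsg string) string {
	switch {
	case strings.Contains(errMsg, "CRIU binary not found or too old"):
		return "CRIU is missing or too old on the test host"
	case strings.Contains(errMsg, "not implemented"): // assumed wording
		return "Container engine does not implement 'CheckpointContainer'"
	default:
		return ""
	}
}

func main() {
	msg := "rpc error: code = Unknown desc = CRIU binary not found or too old (<31600)"
	if reason := skipReason(msg); reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	fmt.Println("fail: unexpected checkpoint error")
}
```

In a real Ginkgo test the non-empty reason would presumably be passed to `ginkgo.Skip`, as the existing check in the diff does.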
> I don't know. It probably is possible to filter out containerd based tests. I have no idea if CRIU will be in COS and who could help with it. Or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on other platforms than COS.
For swap we used ubuntu and containerd. https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial.
So IMO I would maybe consider a separate job for checkpointing (for both crio/containerd) where you can control the image. And by default we don't include this test. You can always use ubuntu/fcos and install what you need for these tests. And then turn the config on for your test.
It may be worth adding this as it seems that we need more control over the image than the default (especially for cos).
I think this PR is fine now that we discussed it. I still think we should consider having more control of the image especially since this feature is not on by default for containerd/crio at the moment.
I guess we can skip on normal tests if checkpointing is not set up or criu does not exist.
here's what i think ...
I can do the same check I do in containerd and CRI-O also in e2e_node before running the test and skip it then.
Fail "the right way". Is that skip or is there something better than skip? @kannon92 and I were already talking about how to fail better in a previous PR.
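A pre-check in e2e_node along the lines mentioned above could look roughly like this. It is only a sketch under assumptions: the helper names are hypothetical, the version encoding (major*10000 + minor*100 + sublevel) is assumed from the `<31600` threshold in the error message, and how the version string is obtained from the binary is left out.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// minCRIUVersion is the threshold that appears in the error message
// quoted in this thread.
const minCRIUVersion = 31600

// encodeCRIUVersion converts "3.16" or "3.16.1" into a single integer,
// assuming the encoding major*10000 + minor*100 + sublevel.
func encodeCRIUVersion(v string) (int, error) {
	parts := strings.Split(strings.TrimSpace(v), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, fmt.Errorf("unexpected version %q", v)
	}
	weights := []int{10000, 100, 1}
	total := 0
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return 0, fmt.Errorf("unexpected version %q", v)
		}
		total += n * weights[i]
	}
	return total, nil
}

// criuInPath reports whether a criu binary can be found in $PATH.
func criuInPath() bool {
	_, err := exec.LookPath("criu")
	return err == nil
}

func main() {
	if !criuInPath() {
		fmt.Println("skip: criu not found in $PATH")
		return
	}
	// The version string is hard-coded here for illustration only.
	v, err := encodeCRIUVersion("3.16")
	if err != nil || v < minCRIUVersion {
		fmt.Println("skip: criu too old or version not parseable")
		return
	}
	fmt.Println("criu looks usable, run the checkpoint test")
}
```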
I don't like skip because then we aren't really testing anything. You can always assert on the case you expect in the test rather than skipping. It may be a bit painful over time: if the runtime changes an error message, we would need to adjust the tests. But I think that is worth it compared with a skipped test that nobody ever looks at.
/lgtm
LGTM label has been added. Git tree hash: 538f5bfa9e1c499dd01b79ccf3141945b3ebabe9
was mostly thinking about the folks who are trying to use tools to trigger a checkpoint (via k8s) and they get a proper message that it is not supported or has failed...
/approve happy landing this test-only change.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adrianreber, dims
What type of PR is this?
/kind bug
/kind failing-test