Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle containerd "CRIU not found" error message #123886

Merged

Conversation

adrianreber
Copy link
Contributor

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

During the PR to get "Forensic Container Checkpointing" enabled in containerd the decision was made to not correctly report if containerd cannot find the CRIU binary. The reason was that the e2e_node checkpoint test did not understand the error message.

The e2e_node checkpoint test is skipped if the container runtime (CRI-O or containerd) does not enable checkpoint support of if checkpoint support is not implemented.

This commit adds another reason to skip a check. If the underlying OS which is used to test "Forensic Container Checkpointing" in combination with containerd or CRI-O is missing the CRIU binary.

This was encountered on Google's Container-Optimized OS (COS) based tests where CRIU was not installed.

With this change merged it is possible for containerd to return the correct error message without breaking Kubernetes e2e tests.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

During the PR to get "Forensic Container Checkpointing" enabled in
containerd the decision was made to not correctly report if containerd
cannot find the CRIU binary. The reason was that the e2e_node checkpoint
test did not understand the error message.

The e2e_node checkpoint test is skipped if the container runtime (CRI-O
or containerd) does not enable checkpoint support of if checkpoint
support is not implemented.

This commit adds another reason to skip a check. If the underlying OS
which is used to test "Forensic Container Checkpointing" in combination
with containerd or CRI-O is missing the CRIU binary.

This was encountered on Google's Container-Optimized OS (COS) based
tests where CRIU was not installed.

With this change merged it is possible for containerd to return the
correct error message without breaking Kubernetes e2e tests.

Signed-off-by: Adrian Reber <areber@redhat.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Mar 12, 2024
@k8s-ci-robot k8s-ci-robot added area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 12, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 12, 2024
@adrianreber
Copy link
Contributor Author

Corresponding containerd changes: containerd/containerd#9960

@bart0sh bart0sh added this to Triage in SIG Node PR Triage Mar 12, 2024
@dims
Copy link
Member

dims commented Mar 12, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 12, 2024
@dims
Copy link
Member

dims commented Mar 12, 2024

/assign @mrunalp

@adrianreber
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce

@@ -239,6 +242,13 @@ var _ = SIGDescribe("Checkpoint Container", nodefeature.CheckpointContainer, fun
ginkgo.Skip("Container engine does not implement 'CheckpointContainer'")
return
}
if strings.Contains(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why this should be in the test?

We can install criu into the VM for containerd and make sure that we have that binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The containerd tests I have seen are running on Google's Container Optimized OS (COS) which does not allow installing additional tools except in /home. So currently containerd returns the wrong error message to not break Kubernetes.

Installing CRIU on COS would require a lot of additional libraries as CRIU cannot easily be linked statically. It also uses dlopen for some parts. So at some point it seemed really difficult to install CRIU on COS. Unless Google decides to add CRIU to COS.

For CRI-O based tests which are running on Fedora or RHEL bases OSes it is not a problem as runc pulls in CRIU.

Additionally I think it would make sense to correctly handle a system without CRIU installed.

Copy link
Contributor

@kannon92 kannon92 Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?

I agree that this is a case that should be handled but that should belong in application code not in test code.

Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with this is this seems temporary. How are we setting up checkpointing for containerd? Could we just filter out containerd for checkpointing until criu is in COS?

I don't know. It probably is possible to filter out containerd based tests. I have no idea if CRIU will be in COS and who could help with it. Or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on other platforms than COS.

Your PR also links to containerd and not Kubernetes. Is containerd using e2e test code from Kubernetes?

From what I saw, yes. The PR that changes the error message from a generic not implemented messages to a CRIU is too old or missing message currently fails with:

{ failed [FAILED] Unexpected status code (500) during 'CheckpointContainer': "an error on the server (\"checkpointing of checkpoint-container-test-8173/checkpoint-container-pod/test-container-1 failed (rpc error: code = Unknown desc = CRIU binary not found or too old (<31600). Failed to checkpoint container \\\"7aa3746cc64e67f2c4f4fba6d871b081bb56bbf300f590ff19a34aaa5fde973f\\\": failed to check for criu version: exec: \\\"criu\\\": executable file not found in $PATH)\") has prevented the request from succeeding (post nodes tmp-node-e2e-5c2a30c5-cos-beta-113-18244-1-14:10250)"
In [It] at: k8s.io/kubernetes/test/e2e_node/checkpoint_container.go:243 @ 03/12/24 12:04:34.894
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. It probably is possible to filter out containerd based tests. I have no idea if CRIU will be in COS and who could help with it. Or if there is another way to install CRIU on COS. I also don't know if containerd is running tests on other platforms than COS.

For swap we used ubuntu and containerd. https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial.

So IMO I would maybe consider a separate job for checkpointing (for both crio/containerd) where you can control the image. And by default we don't include this test. You can always use ubuntu/fcos and install what you need for these tests. And then turn the config on for your test.

It may be worth adding this as it seems that we need more control over the image than the default (especially for cos).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is fine now that we discussed it. I still think we should consider having more control of the image especially since this feature is not on by default for containerd/crio at the moment.

I guess we can skip on normal tests if checkpointing is not set up or criu does not exist.

@dims
Copy link
Member

dims commented Mar 12, 2024

here's what i think ...

  • We cannot assume criu will be in every environment (end user)
  • We cannot assume criu will be in every testing environment
  • e2e_node.test can check if criu is available before running a test that needs it (as the e2e_node.test is running on the same node with the containerd/crio/kubelet etc) - for a positive test - things that need CRIU work correctly
  • We should make sure we have tests that fail "the right way" when criu is absent if we don't have already in our test suite - for a negative test (when the test env does not have criu)

@adrianreber
Copy link
Contributor Author

adrianreber commented Mar 12, 2024

here's what i think ...

* We cannot assume `criu` will be in every environment (end user)

* We cannot assume `criu` will be in every testing environment

* e2e_node.test can check if `criu` is available before running a test that needs it (as the e2e_node.test is running on the same node with the containerd/crio/kubelet etc) - for a positive test - things that need CRIU work correctly

I can do the same check I do in containerd and CRI-O also in e2e_node before running the test and skip it then.

* We should make sure we have tests that fail "the right way" when `criu` is absent if we don't have already in our test suite - for a negative test (when the test env does not have `criu`)

Fail "the right way". Is that skip or is there something better than skip? @kannon92 and I were already talking about how to fail better in a previous PR.

@kannon92
Copy link
Contributor

Fail "the right way". Is that skip or is there something better than skip? @kannon92 and I were already talking about how to fail better in a previous PR.

I don't like skip because we aren't really testing anything. You can always assert on the case you expect in the test rather than skipping. It may be a bit painful over time as if the runtime changes any error message or whatever, we would want to adjust the tests. But I think that is worth it over a skip and we never look at the test.

@kannon92
Copy link
Contributor

/lgtm
/assign @SergeyKanzhelev @mrunalp

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 538f5bfa9e1c499dd01b79ccf3141945b3ebabe9

@kannon92 kannon92 moved this from Triage to Needs Approver in SIG Node PR Triage Mar 13, 2024
@dims
Copy link
Member

dims commented Mar 13, 2024

Fail "the right way". Is that skip or is there something better than skip?

was mostly thinking about the folks who are trying to use tools to trigger a checkpoint (via k8s) and they get a proper message that it is not supported or has failed...

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board Apr 3, 2024
@dims
Copy link
Member

dims commented May 4, 2024

/approve
/lgtm

happy landing this test only change.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit de16748 into kubernetes:master May 4, 2024
15 checks passed
SIG Node CI/Test Board automation moved this from PRs - Needs Reviewer to Done May 4, 2024
SIG Node PR Triage automation moved this from Needs Approver to Done May 4, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone May 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants