Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gha: kata-deploy: Revert containerd config break #8679

Merged

Conversation

stevenhorsman
Copy link
Member

After kata-deploy has installed, check containerd is still in active state, to ensure that we didn't corrupt the containerd config during kata-deploy's edits

Fixes: #8678

@stevenhorsman
Copy link
Member Author

@Amulyam24 - FYI the ppc64le checks are failing with:

Step 1/23 : FROM ubuntu:20.04
toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
make[1]: *** [tools/packaging/kata-deploy/local-build/Makefile:75: kernel-tarball-build] Error 1
make: *** [tools/packaging/kata-deploy/local-build/Makefile:111: kernel-tarball] Error 2

I'm guessing that we need to login to docker hub on the self-hosted node as the anonymous rate limit is pretty low?

Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @stevenhorsman. My only question is if this is indeed fixing #8678 as described at the commit message. Unless I'm missing something, this is just adding a test, not fixing the actual problem, right?

@stevenhorsman
Copy link
Member Author

LGTM, thanks @stevenhorsman. My only question is if this is indeed fixing #8678 as described at the commit message. Unless I'm missing something, this is just adding a test, not fixing the actual problem, right?

Correct - this is just the first stage of the PR, assuming it works (and the test fails), we'll have a way to test the issue and I plan to revert part of the tomlq PR and check that fixes things. I'll add do-not-merge to this to show it's not completed.

@stevenhorsman stevenhorsman added do-not-merge PR has problems or depends on another no-backport-needed labels Dec 15, 2023
@stevenhorsman
Copy link
Member Author

/test

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 71a6d02 to dddc747 Compare December 15, 2023 16:42
@katacontainersbot katacontainersbot added size/medium Average sized task and removed size/small Small and simple task labels Dec 15, 2023
@stevenhorsman
Copy link
Member Author

Ok, I think this containerd test approach works for baremetal systems, but not a other clusters as we can't necessarily get in an access their containerd. I can try using kata-debug for this process though...

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch 2 times, most recently from d575ff7 to aa8a902 Compare December 18, 2023 12:02
@katacontainersbot katacontainersbot added size/small Small and simple task and removed size/medium Average sized task labels Dec 18, 2023
@stevenhorsman
Copy link
Member Author

I didn't get kata-debug working reliably, so just using kubectl with custom columns to extract status and containerruntimeversion. I'll also temporarily dropped the fix, so I can check that this test is correctly failing.

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch 3 times, most recently from 361289e to 6b941d0 Compare December 18, 2023 15:44
@stevenhorsman
Copy link
Member Author

In the previous run, I got some of the deploy tests to fail as expected, but more of them were still passing, which doesn't really make sense, so I've upped the sleep to see if that gives enough time for the containerd config changes to take effect.

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 80f08c1 to c6b9f86 Compare December 18, 2023 17:50
@katacontainersbot katacontainersbot added size/medium Average sized task and removed size/small Small and simple task labels Dec 18, 2023
@stevenhorsman
Copy link
Member Author

/test

@stevenhorsman
Copy link
Member Author

As I shared in confidential-containers/operator#305 (comment) - I'm not super happy with this change and think the testing could do with some wider input, but the revert of using tomlq resolves the problem of kata-deploy corrupting containerd's config, so I think it's good enough to merge and unblock people until January when more people are back to discuss this.

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 7ed9e5c to 861ee1e Compare December 19, 2023 15:16
@stevenhorsman stevenhorsman changed the title gha: kata-deploy: Add containerd tests gha: kata-deploy: Revert containerd config break Dec 19, 2023
@stevenhorsman
Copy link
Member Author

/test

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 861ee1e to 8c772d5 Compare December 19, 2023 15:17
@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch 2 times, most recently from 4d1f863 to 2991a96 Compare December 19, 2023 17:21
@stevenhorsman
Copy link
Member Author

/test

@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 2991a96 to 46570e6 Compare December 19, 2023 18:56
@stevenhorsman
Copy link
Member Author

/test

After kata-deploy has installed, check that the worker nodes
are still in Ready state and don't have a containerd://Unknown
container runtime versions, identicating that container isn't working
to ensure that we didn't corrupt the containerd config during kata-deploy's edits

Fixes: kata-containers#8678
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
This reverts commit dd9f5b0.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
@stevenhorsman stevenhorsman force-pushed the kata-deploy-containerd-config-fix branch from 4418eb8 to ee5fa08 Compare December 20, 2023 09:10
@stevenhorsman
Copy link
Member Author

/test

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @stevenhorsman !

@gkurz gkurz merged commit ce094ec into kata-containers:main Dec 20, 2023
170 of 179 checks passed
@fidencio
Copy link
Member

fidencio commented Dec 20, 2023

Folks, I don't like the commit being reverted, but I must say it's what it's at this point.

As I'm working Today, I've decided to poke Steve and get to the root cause of the issue, and here's the full explanation.
tomlq relies on jq in order to do the parsing / editing of the file, and it's done in this specific code path

            elif input_format == "toml":
                import tomlkit

                for input_stream in input_streams:
                    json.dump(tomlkit.load(input_stream), jq.stdin, cls=JSONDateTimeEncoder)  # type: ignore
                    jq.stdin.write("\n")  # type: ignore

The interesting part to understand here is that jq is an invocation of the jq command, as shown here

        jq = subprocess.Popen(
            ["jq"] + list(jq_args),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE if converting_output else None,
            close_fds=False,
            universal_newlines=True,
        )

Okay, with this in mind, we can simplify a whole lot the test case.
Let's test with echo '{"foo": 1.0}' | jq .foo.

/ # jq --version
jq-1.6
/ # echo '{"foo": 1.0}' | jq .foo
1

Bingo, we can reproduce the issue we've faced and we know that the version of jq being used is 1.6 When taking a look at jq's GitHub page, I've noticed a newer release, 1.7, and decided to give it a try before hunting this issue down.

> Downloads ./jq-linux-amd64 --version
jq-1.7
⋊> Downloads echo '{"foo": 1.0}' | jq .foo
1.0

And, hey, it works as expected.

So, the actual solution for the issue should be:

  • Re-revert the reverted commit
  • Bump the jq version used in our daemonset
  • Profit

I will open a PR with this soon enough, and this will allow us to move on and get the snapshotter work merged, unblocking others depending on this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/medium Average sized task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kata-deploy: tomlq breaks containerd config
7 participants