
Initial provider implementation for nerdctl (+finch) #3429

Merged
2 commits merged into kubernetes-sigs:main on Feb 27, 2024

Conversation

Contributor

@estesp commented Nov 17, 2023

Adds an implementation for a provider based on nerdctl. There are several TODOs in the code, but the core functionality of creating/deleting clusters is working, and a simple deployed application works properly.

Fixes: #2317

I don't love that it's a bit messy to pass binaryName around to support finch alongside nerdctl. Users could alias finch -> nerdctl and this code could be simplified, but I would love to find a clean way to support any nerdctl wrapper/implementation without requiring user action.
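
For context, a minimal, self-contained sketch of what threading a binary name through a provider looks like. The detectBinaryName helper and its nerdctl-before-finch fallback order are illustrative assumptions, not the exact code in this PR:

// Illustrative sketch only: the helper name and the nerdctl-before-finch
// fallback order are assumptions, not the exact code in this PR.
package main

import (
    "fmt"
    "os/exec"
)

// provider carries the resolved CLI name ("nerdctl" or "finch") so every
// command it runs can be built as exec.Command(p.binaryName, args...).
type provider struct {
    binaryName string
}

// detectBinaryName returns the first nerdctl-compatible CLI found in PATH.
func detectBinaryName() (string, error) {
    for _, name := range []string{"nerdctl", "finch"} {
        if _, err := exec.LookPath(name); err == nil {
            return name, nil
        }
    }
    return "", fmt.Errorf("neither nerdctl nor finch found in PATH")
}

func (p *provider) run(args ...string) error {
    return exec.Command(p.binaryName, args...).Run()
}

func main() {
    name, err := detectBinaryName()
    if err != nil {
        fmt.Println(err)
        return
    }
    p := &provider{binaryName: name}
    fmt.Println("using", p.binaryName)
    _ = p.run("info")
}

With this shape, supporting another nerdctl wrapper would just mean adding another candidate to the detection list, which is the clean path being asked about above.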

(Updated the following on 14 Dec 2023)
Some TODOs remain in the code around IPv6 support (coming in nerdctl 1.7.0, which isn't released yet; it would also be good to support nerdctl < 1.7.0) and restart policy (supported, but containers are exiting with 137 on kind delete cluster, so the one-time restart is actually happening, causing a recreate of the same cluster to fail without user action).

This PR now relies on nerdctl 1.7.0 or above, and/or finch 1.0.1 or above. That solves the lack of IPv6 support, although I have not tested IPv6 yet. The delete cluster flow was updated to use stop/wait/rm and works properly now, and the restart policy on create now matches the Docker provider.

Update on 7 Feb 2024
Added the initial CI testing matrix for nerdctl from #3408; carried that PR's commit and modified it to use recent versions and, instead of aliasing to docker, to use the implementation from this PR with the experimental provider set.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Nov 17, 2023
@k8s-ci-robot
Contributor

Welcome @estesp!

It looks like this is your first PR to kubernetes-sigs/kind 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kind has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Nov 17, 2023
@k8s-ci-robot
Contributor

Hi @estesp. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) Nov 17, 2023

// ensure the pre-requisite network exists
networkName := fixedNetworkName
if n := os.Getenv("KIND_EXPERIMENTAL_DOCKER_NETWORK"); n != "" {
Member

Not sure if this should still be called "DOCKER"

Member

I would prefer that we not support this in the nerdctl provider; I consider it to have been a mistake in the docker provider that has caused no end of headaches (e.g. users trying to run in the host network (!)).

Contributor Author

Sounds good; happy to remove this piece

Contributor Author

Done in latest commit

// note: requires API v1.41+ from Dec 2020 in Docker 20.10.0
// this is the default with cgroups v2 but not with cgroups v1, unless
// overridden in the daemon --default-cgroupns-mode
// https://github.com/docker/cli/pull/3699#issuecomment-1191675788
Member

Doesn't make sense for nerdctl

Contributor Author

Removed in latest commit

@@ -18,6 +18,8 @@ func GetDefault(logger log.Logger) cluster.ProviderOption {
case "docker":
logger.Warn("using docker due to KIND_EXPERIMENTAL_PROVIDER")
return cluster.ProviderWithDocker()
case "nerdctl", "finch":
Member

Contributor Author

Yeah, I saw that PR... we should definitely collaborate on getting this into CI. I wanted to make sure there weren't any glaring misses here before getting CI set up.

Contributor

what is finch?

Member

what is finch?

Amazon's distribution of nerdctl wrapped in Lima
https://aws.amazon.com/jp/blogs/opensource/ready-for-flight-announcing-finch-1-0-ga/

Member

Are there compatibility issues such that we need to expand the test matrix as far as both upstream nerdctl and the finch distro?

Contributor Author

there should be no compat issues; Finch is a distribution that includes nerdctl. The only minor issue could be point-in-time discrepancies where nerdctl has released a version, but Finch has no release (yet) with the new version of nerdctl.

)

// NewProvider returns a new provider based on executing `nerdctl ...`
func NewProvider(logger log.Logger, binaryName string) providers.Provider {
Member

Wondering if the nerdctl package can be merged into the docker package, and the docker package could have something like NewProviderWithVariant() that takes a custom Docker-like binary path.

Contributor Author

They are close; I also see that some of the changes I had to make to a few capabilities around inspect in this PR are fixed in recent PRs to nerdctl (at least one around the Names JSON object in container inspect). There are also inconsistencies in a few other Go template responses that I worked around with index; those could probably be fixed in nerdctl, as they are essentially Docker compatibility mismatches at the moment.
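
As a self-contained illustration of the kind of index workaround being described (the JSON shape and template are examples, not actual nerdctl output or the exact templates in this PR): label keys contain dots, so a dotted field chain cannot reach them, and index looks the key up explicitly.

// Minimal illustration only; the JSON and template are examples, not the
// exact output or format strings used by the provider in this PR.
package main

import (
    "encoding/json"
    "fmt"
    "os"
    "text/template"
)

func main() {
    // Pretend this came back from an inspect-style call.
    raw := `{"Config":{"Labels":{"io.x-k8s.kind.cluster":"kind","io.x-k8s.kind.role":"control-plane"}}}`

    var obj map[string]interface{}
    if err := json.Unmarshal([]byte(raw), &obj); err != nil {
        panic(err)
    }

    // Label keys contain dots, so {{.Config.Labels.io.x-k8s.kind.cluster}}
    // cannot be parsed as a field chain; index looks the key up explicitly.
    tmpl := template.Must(template.New("labels").Parse(
        `{{ index .Config.Labels "io.x-k8s.kind.cluster" }}`))
    if err := tmpl.Execute(os.Stdout, obj); err != nil {
        panic(err)
    }
    fmt.Println()
}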

Member

Contributor Author

Very nice; so the options are 1) try and use the docker provider with "binary name customization" and require nerdctl 1.7.0 or greater, or 2) create this provider and assume that over time it could be deprecated when any/all differences are resolved. Interested to hear from the kind maintainers what they would prefer. I think both are potentially reasonable, but obviously there is code duplication in creating yet-another-provider.

Member

While I appreciate that nerdctl considers incompatibilities to be bugs and that there should be minimal skew, eventually we're bound to run into another incompatibility when developing the docker provider and need to work around these differences until a nerdctl update is released (and we don't know how long users will take to upgrade).

It's not a big deal to request that incoming changes be ported between essentially two copies of the docker provider, and in this PR we already have an example of something that should only be tested for docker and not nerdctl: #3429 (comment)

We already need to ensure that any new features target podman as well, often the code is nearly the same but not quite.

The node providers are intended to be very small shims that isolate most of the rest of the logic from which runtime we're using for nodes. We can afford another copy and it will give us more room to handle quirks even if nerdctl patches them in later releases.

Contributor Author

Agree; that makes sense to me.

@AkihiroSuda
Member

coming in nerdctl 1.7.0 but isn't released yet

released: https://github.com/containerd/nerdctl/releases/tag/v1.7.0

@estesp
Contributor Author

estesp commented Nov 17, 2023

coming in nerdctl 1.7.0 but isn't released yet

released: https://github.com/containerd/nerdctl/releases/tag/v1.7.0

Oops! Well, you can tell how well I'm paying attention! I'm testing using Finch 1.0, which hasn't updated to nerdctl 1.7.0 yet.

// filter for nodes with the cluster label
"--filter", fmt.Sprintf("label=%s=%s", clusterLabelKey, cluster),
// format to include the cluster name
"--format", `{{range .Names}}{{println .}}{{end}}`,
Member

Hi @estesp

It's an awesome PR :-)

The nerdctl ps -a --filter label=io.x-k8s.kind.cluster=kind --format '{{range .Names}}{{println .}}{{end}}' does not work in nerdctl, but nerdctl ps -a --filter label=io.x-k8s.kind.cluster=kind --format '{{.Names}}', the same as with docker, works correctly in v1.7.0 :-)

Contributor Author

Fixed to match behavior from nerdctl 1.7.0 and above
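
For reference, a rough sketch of what the corrected listing call looks like from the provider side. listNodes is an illustrative helper, not the exact code in this PR; the label key matches the io.x-k8s.kind.cluster key shown above:

// Illustrative only: lists kind node container names for one cluster using
// the docker-compatible '{{.Names}}' format that nerdctl >= 1.7.0 handles.
package main

import (
    "fmt"
    "os/exec"
    "strings"
)

const clusterLabelKey = "io.x-k8s.kind.cluster"

func listNodes(binaryName, cluster string) ([]string, error) {
    out, err := exec.Command(binaryName,
        "ps", "-a",
        // filter for nodes with the cluster label
        "--filter", fmt.Sprintf("label=%s=%s", clusterLabelKey, cluster),
        // one container name per line, matching the docker provider's format
        "--format", "{{.Names}}",
    ).Output()
    if err != nil {
        return nil, err
    }
    return strings.Fields(string(out)), nil
}

func main() {
    names, err := listNodes("nerdctl", "kind")
    fmt.Println(names, err)
}

With docker's ps --format, .Names renders as a plain string per container, and per the comment above nerdctl 1.7.0 accepts the same format, so the two providers can share it.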

@yankay
Member

yankay commented Nov 27, 2023

Hi @kubernetes-sigs/kind-maintainers, @estesp

Would you please review the PR and give some advice?
Thank you very much ╰(°▽°)╯

@BenTheElder
Member

Sorry, I'm back from conference / post-conference-cold / holidays but I'm still going to have to come back to this later:

https://kubernetes.slack.com/archives/C2C40FMNF/p1701112863247509
kubernetes/website#44109

At a high level: +1 to doing this and we've had some other conversations and PRs related to CI.
I will aim to get a review in by the end of the week, if @aojea doesn't get to it before me; sorry.

@aojea
Contributor

aojea commented Nov 27, 2023

/ok-to-test

@k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label Nov 27, 2023
@aojea
Contributor

aojea commented Nov 27, 2023

Naive question: how complex is it to add a GitHub Actions job to test this in CI and get some signal?

I see Akihiro already got there: https://github.com/kubernetes-sigs/kind/pull/3429/files#r1397788447

"sigs.k8s.io/kind/pkg/internal/integration"
)

func TestIntegrationEnsureNetworkConcurrent(t *testing.T) {
Contributor

This test fails on the CI because the job pod does not contain the nerdctl binary

Member

We probably don't need this test for nerdctl and eventually even docker won't need it as the docker bug/quirk we were working around was fixed recently.

Member

Contributor Author

removed the test in latest commit

@BenTheElder
Member

Some TODOs remain in the code around IPv6 support (coming in nerdctl 1.7.0, which isn't released yet; it would also be good to support nerdctl < 1.7.0) and restart policy (supported, but containers are exiting with 137 on kind delete cluster, so the one-time restart is actually happening, causing a recreate of the same cluster to fail without user action).

Is the user action to rm again?
If so, can we sleep-wait for the container to restart and then rm again within Delete, and later add a check to skip this if nerdctl is newer than some fixed release?

@arivera-xealth

+1 on this

@estesp
Contributor Author

estesp commented Dec 11, 2023

I'm thinking for simplicity to make nerdctl 1.7.0 the lowest supported version; given there has been no support until now, that seems reasonable to me, but curious if anyone has strong opinions. I also would like input on whether to simply allow the provider to fail if the user has an older version, or do version checks during provider initialization?

@estesp
Contributor Author

estesp commented Dec 14, 2023

Some TODOs remain in the code around IPv6 support (coming in nerdctl 1.7.0, which isn't released yet; it would also be good to support nerdctl < 1.7.0) and restart policy (supported, but containers are exiting with 137 on kind delete cluster, so the one-time restart is actually happening, causing a recreate of the same cluster to fail without user action).

Is the user action to rm again? If so, can we sleep-wait for the container to restart and then rm again within Delete, and later add a check to skip this if nerdctl is newer than some fixed release?

I really don't understand why the container is exiting with an error, although ending with rm -f goes straight to SIGKILL on the container. I've corrected this for now with a slightly slower but stable approach that I assume would work elsewhere: stop + wait + rm -f -v. I also prepended that with updating the restart policy to no, because the container still seems to exit with an error, and that (at least for nerdctl) means a restart attempt, which I haven't dug into yet.

At least for now, this is mitigated in the latest pushed code in the PR: several create/delete/create/delete attempts properly kill and remove nodes, and clusters can be created and deleted repeatedly.
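
For illustration, a rough, self-contained sketch of the delete sequence described above. deleteNodes is a hypothetical helper; in particular, showing the restart-policy change as update --restart=no is an assumption about the exact command, and error handling in the real provider differs:

// Illustrative sketch of the stop + wait + rm flow described above; not the
// exact code in this PR. binaryName is "nerdctl" or "finch", nodes are the
// kind node container names for the cluster being deleted.
package main

import "os/exec"

func deleteNodes(binaryName string, nodes []string) error {
    if len(nodes) == 0 {
        return nil
    }
    // 1. set the restart policy to "no" so an exit during teardown does not
    //    trigger a restart attempt (shown here as `update --restart=no`,
    //    which is an assumption about the exact command used)
    update := append([]string{"update", "--restart=no"}, nodes...)
    if err := exec.Command(binaryName, update...).Run(); err != nil {
        return err
    }
    // 2. stop the containers (SIGTERM, then SIGKILL after the stop timeout)
    stop := append([]string{"stop"}, nodes...)
    if err := exec.Command(binaryName, stop...).Run(); err != nil {
        return err
    }
    // 3. wait for the containers to actually exit
    wait := append([]string{"wait"}, nodes...)
    if err := exec.Command(binaryName, wait...).Run(); err != nil {
        return err
    }
    // 4. remove the containers and their anonymous volumes
    rm := append([]string{"rm", "-f", "-v"}, nodes...)
    return exec.Command(binaryName, rm...).Run()
}

func main() {
    // hypothetical node names for a cluster named "kind"
    _ = deleteNodes("nerdctl", []string{"kind-control-plane"})
}

The stop/wait pair is what avoids the exit-137-then-restart race described above; rm -f -v then removes the containers and their anonymous volumes.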

@estesp
Contributor Author

estesp commented Feb 9, 2024

Can anyone approve the workflows for this to see if the added nerdctl workflow tests pass in GH actions?

@estesp
Contributor Author

estesp commented Feb 12, 2024

/retest

@BenTheElder
Member

Can anyone approve the workflows for this to see if the added nerdctl workflow tests pass in GH actions?

They're already approved?

Sorry, last week was enhancements freeze so pretty much all my time went to PRR reviews https://www.kubernetes.dev/resources/release/#timeline

aojea is OOO currently.

I'm going to be working on a bug fix release related to #3510 and some other work but aim to come back to this.

I'm thinking for simplicity to make nerdctl 1.7.0 the lowest supported version; [...]

SGTM

[...] given there has been no support until now, that seems reasonable to me, but curious if anyone has strong opinions. I also would like input on whether to simply allow the provider to fail if the user has an older version, or do version checks during provider initialization?

The current behavior is to fall back to docker if we don't detect another working provider, unless the user explicitly sets KIND_EXPERIMENTAL_PROVIDER. If another provider is considered stable (podman is close-ish) then we might change that ...

Detecting providers is supposed to be fast and cheap because it happens on every invocation unless explicitly set; we're just looking for binaries in PATH.

In the podman provider we version-check during the create cluster implementation, for a few reasons:

  • we only do the expensive calls during the once-per-cluster invocation
  • this command inherently happens before any of the others
  • other commands like kind delete will work fine with older versions anyhow

I think we should keep that pattern for now.
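
Concretely, a self-contained sketch of that pattern applied to this provider; ensureMinVersion, the parsing of --version output, and the error text are all illustrative assumptions rather than the code in this PR:

// Illustrative sketch of a create-time version check, following the pattern
// the podman provider uses; parsing and messages here are assumptions.
package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// minimum supported release for this provider, per the PR description
var minVersion = [2]int{1, 7} // nerdctl >= 1.7.0

func ensureMinVersion(binaryName string) error {
    // `nerdctl --version` prints something like "nerdctl version 1.7.0"
    out, err := exec.Command(binaryName, "--version").Output()
    if err != nil {
        return fmt.Errorf("failed to run %s --version: %w", binaryName, err)
    }
    fields := strings.Fields(strings.TrimSpace(string(out)))
    if len(fields) == 0 {
        return fmt.Errorf("could not parse %s version output", binaryName)
    }
    version := strings.TrimPrefix(fields[len(fields)-1], "v")
    parts := strings.SplitN(version, ".", 3)
    if len(parts) < 2 {
        return fmt.Errorf("could not parse %s version %q", binaryName, version)
    }
    major, _ := strconv.Atoi(parts[0])
    minor, _ := strconv.Atoi(parts[1])
    if major < minVersion[0] || (major == minVersion[0] && minor < minVersion[1]) {
        return fmt.Errorf("%s %d.%d or newer is required, found %s",
            binaryName, minVersion[0], minVersion[1], version)
    }
    return nil
}

func main() {
    // Called once per `kind create cluster`, not on every provider detection.
    if err := ensureMinVersion("nerdctl"); err != nil {
        fmt.Println(err)
    }
}

Keeping the check inside create keeps provider detection itself down to a PATH lookup, matching the reasoning above.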

At least for now, this is mitigated in the latest pushed code in the PR: several create/delete/create/delete attempts properly kill and remove nodes, and clusters can be created and deleted repeatedly.

Sounds OK for now; long term I would love to get reboot working on all the providers, but it's been thorny for podman as well and not perfect under docker either.


Re: image loading, we've had an open issue to outline a better command that is multi-provider; would love some more input there. The intention is to leave kind load docker-image and kind load image-archive unchanged (confusing compat at this point, given kind load docker-image will actually do docker => podman if you have both on a system currently 🤯).

@estesp
Contributor Author

estesp commented Feb 13, 2024

Can anyone approve the workflows for this to see if the added nerdctl workflow tests pass in GH actions?

They're already approved?

Oh cool! When I first commented, I had made another small fix and re-pushed, and they were stuck waiting on approval; looks like someone approved them and they ran successfully, so yay!

Sorry, last week was enhancements freeze so pretty much all my time went to PRR reviews https://www.kubernetes.dev/resources/release/#timeline

aojea is OOO currently.

No worries at all; no rush on this. Hopefully it is in good enough shape for review and has basic CI now for nerdctl.

@@ -18,6 +18,8 @@ func GetDefault(logger log.Logger) cluster.ProviderOption {
case "docker":
logger.Warn("using docker due to KIND_EXPERIMENTAL_PROVIDER")
return cluster.ProviderWithDocker()
case "nerdctl", "finch", "nerdctl.lima":
Contributor

logger.Warn("using nerdctl due to KIND_EXPERIMENTAL_PROVIDER")

Contributor Author

Good catch; repushed. Thanks!

@aojea
Contributor

aojea commented Feb 14, 2024

One nit: a log is missing on the provider detection. The rest looks OK; the code is isolated and is almost the same apart from the binaryName parameter.

@estesp
Contributor Author

estesp commented Feb 16, 2024

@aojea fixed the missing logger; should be ready to go then!

@k8s-ci-robot
Contributor

k8s-ci-robot commented Feb 16, 2024

@estesp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kind-e2e-kubernetes-1-24
Commit: 5dc7c30
Details: link
Required: true
Rerun command: /test pull-kind-e2e-kubernetes-1-24

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

estesp and others added 2 commits February 16, 2024 12:56
Adds implementation for a provider based on nerdctl. Several todos
in the code but the core functionality of creating/deleting clusters
is working and a simple application deployed works properly

Signed-off-by: Phil Estes <estesp@gmail.com>
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Signed-off-by: Phil Estes <estesp@gmail.com>
@aojea
Contributor

aojea commented Feb 16, 2024

/lgtm
/approve

/hold

Just in case @BenTheElder wants to check something; we can unhold tomorrow if there is no answer.

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Feb 16, 2024
@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) Feb 16, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, estesp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Feb 16, 2024
@estesp
Contributor Author

estesp commented Feb 26, 2024

@BenTheElder any thoughts/concerns?

@aojea
Contributor

aojea commented Feb 27, 2024

/hold cancel

lazy consensus :)

@k8s-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Feb 27, 2024
@k8s-ci-robot merged commit 7c3e01f into kubernetes-sigs:main Feb 27, 2024
17 checks passed
@nimakaviani

great to see this @estesp!

@BenTheElder
Member

Thank you all!

@BenTheElder added this to the v0.23.0 milestone Feb 27, 2024
@yankay
Member

yankay commented Feb 28, 2024

Thanks @estesp 🎉🎉🎉
