Allow executors to define `containerd` and `cridockerd` behavior #9184

HarrisonWAffel · 2024-01-08T23:32:11Z

Proposed Changes

In order to properly address rancher/rke2#2204, the rke2 windows agent needs to be able to define how it reacts to containerd exiting. Currently, rke2 uses k3s' implementation for containerd, which will call os.Exit if the process stops running. On Windows nodes, this sudden exit prevents the top level context from fully canceling, resulting in supporting processes created with CommandContext (kubelet, kube-proxy, etc.) continuing to run. In the event that the rke2 service is restarted, but those processes are not cleaned up, the service will enter a crash loop which must be manually resolved.

k3s should provide a way for the rke2 windows agent (or any custom executor) to define how containerd is executed, in a similar manner to other components such as the kubelet or kube-proxy.

Enhance the Executor interface so that custom executors (i.e. rke2's pebinaryexecutor) can define how containerd and cridockerd are executed.
Export the utility functions containerd.PreloadImages and containerd.SetupContainerdConfig so that code duplication between k3s and rke2 is reduced as much as possible
Start containerd/cridockerd when agent.Agent is called, as opposed to doing so in the run function
Update the embedded executor to satisfy the updated interface, continue to use the existing logic for running containerd and cridockerd

Types of Changes

Enhancement, custom executors now have control over the behavior of containerd and cridockerd.

Verification

The k3s embedded executor should be tested to ensure that it properly starts the desired container runtime on startup, otherwise there are no functional changes which would need more significant verification.

Testing

Manual testing:

After building a custom k3s binary, I was able to confirm that a 2 node cluster (1 cp+etcd server, 1 agent) will properly start the desired container runtime (both containerd and cridockerd were tested), and that workloads can be deployed onto that cluster without issue using kubectl.

Automated testing:

The existing integration and e2e tests cover basic startup and functionality of the k3s agent and server, so no new tests have been added. Since this PR focuses on extending the executor interface, I could not think of a way to add meaningful unit tests to cover these changes

Linked Issues

rancher/rke2#2204

User-Facing Change

NONE

Further Comments

In order to start the runtime using an executor I had to move the call from agent.run to agent.Agent. This results in the agent being started before the following code is called

if cfg.AgentReady != nil {
	close(cfg.AgentReady)
}

Previously this call was made once the runtime was started, but before the agent was created. Looking at the code I don't think this has a negative impact on the server, however I wanted to call this out in case there is a subtlety that I'm missing in how that channel is expected to be closed

brandond · 2024-01-08T23:40:22Z

Have you tested this in RKE2 yet? The sequencing of the agent startup is much fussier on that side; I'm curious if the changes will have an impact on that side.

LGTM otherwise though, I am a fan of moving more stuff into the executor interface.

manuelbuil · 2024-01-09T14:49:24Z

I like this PR and what it tries to solve. Removing/restarting rke2 is a pain in Windows

HarrisonWAffel · 2024-01-09T20:14:22Z

I was able to test out the fix for 2204 today using a custom rke2 binary compiled against this k3s branch. Once I updated the pebinaryexecutor and the staticpod executors I didn't see any issues during startup for rke2. I tested both linux and windows nodes

I'm looking into the CI failure today, will update once that is resolved

brandond · 2024-01-09T20:16:26Z

It looks like the e2e tests have broken, glad to approve once those are green.

Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>

HarrisonWAffel · 2024-01-11T20:42:30Z

Rebased this pr off of master, it looks like CI is now passing

HarrisonWAffel · 2024-01-17T15:52:43Z

Any idea when we might be able to merge this? Also, should I go ahead and raise back port PR's for this change, or is that handled during the release process?

caroline-suse-rancher · 2024-01-17T18:09:00Z

Hey @HarrisonWAffel we entered code freeze for January releases on Friday (01/12). Is this crucial to get in for the January releases, or is a February timeline alright? cc @brandond @manuelbuil

HarrisonWAffel · 2024-01-17T18:32:12Z

@caroline-suse-rancher

Nope this is not critical for the January release, I'll double back on this when code freeze is over. If there's any other action items on my side needed let me know!

Thanks!

brandond · 2024-01-17T19:34:35Z

@HarrisonWAffel devs are responsible for their own backport issues and PRs, if you're up for it you might as well get those created now.

HarrisonWAffel requested a review from a team as a code owner January 8, 2024 23:32

HarrisonWAffel force-pushed the containerd-executor branch from 845aed9 to 5cb7c5d Compare January 8, 2024 23:34

allow executors to define containerd and docker behavior

509dcd2

Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>

HarrisonWAffel force-pushed the containerd-executor branch from 5cb7c5d to 509dcd2 Compare January 11, 2024 19:13

brandond approved these changes Jan 12, 2024

View reviewed changes

manuelbuil approved these changes Jan 12, 2024

View reviewed changes

This was referenced Jan 17, 2024

[Release-1.26] Allow executors to define containerd and docker behavior #9252

Merged

[Release-1.27] Allow executors to define containerd and docker behavior #9253

Merged

[Release-1.28] Allow executors to define containerd and docker behavior #9254

Merged

vitorsavian merged commit a36cc73 into k3s-io:master Feb 9, 2024
14 checks passed

This was referenced Feb 21, 2024

Restore original order of agent startup functions #9539

Merged

Unable to create split role cluster #9540

Closed

Unable to create split role cluster rancher/rke2#5504

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow executors to define `containerd` and `cridockerd` behavior #9184

Allow executors to define `containerd` and `cridockerd` behavior #9184

HarrisonWAffel commented Jan 8, 2024 •

edited

Loading

brandond commented Jan 8, 2024 •

edited

Loading

manuelbuil commented Jan 9, 2024

HarrisonWAffel commented Jan 9, 2024

brandond commented Jan 9, 2024

HarrisonWAffel commented Jan 11, 2024

HarrisonWAffel commented Jan 17, 2024

caroline-suse-rancher commented Jan 17, 2024

HarrisonWAffel commented Jan 17, 2024

brandond commented Jan 17, 2024

Allow executors to define containerd and cridockerd behavior #9184

Allow executors to define containerd and cridockerd behavior #9184

Conversation

HarrisonWAffel commented Jan 8, 2024 • edited Loading

Proposed Changes

Types of Changes

Verification

Testing

Linked Issues

User-Facing Change

Further Comments

brandond commented Jan 8, 2024 • edited Loading

manuelbuil commented Jan 9, 2024

HarrisonWAffel commented Jan 9, 2024

brandond commented Jan 9, 2024

HarrisonWAffel commented Jan 11, 2024

HarrisonWAffel commented Jan 17, 2024

caroline-suse-rancher commented Jan 17, 2024

HarrisonWAffel commented Jan 17, 2024

brandond commented Jan 17, 2024

Allow executors to define `containerd` and `cridockerd` behavior #9184

Allow executors to define `containerd` and `cridockerd` behavior #9184

HarrisonWAffel commented Jan 8, 2024 •

edited

Loading

brandond commented Jan 8, 2024 •

edited

Loading