Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow executors to define containerd and cridockerd behavior #9184

Merged
merged 1 commit into from
Feb 9, 2024

Conversation

HarrisonWAffel
Copy link
Contributor

@HarrisonWAffel HarrisonWAffel commented Jan 8, 2024

Proposed Changes

In order to properly address rancher/rke2#2204, the rke2 windows agent needs to be able to define how it reacts to containerd exiting. Currently, rke2 uses k3s' implementation for containerd, which will call os.Exit if the process stops running. On Windows nodes, this sudden exit prevents the top level context from fully canceling, resulting in supporting processes created with CommandContext (kubelet, kube-proxy, etc.) continuing to run. In the event that the rke2 service is restarted, but those processes are not cleaned up, the service will enter a crash loop which must be manually resolved.

k3s should provide a way for the rke2 windows agent (or any custom executor) to define how containerd is executed, in a similar manner to other components such as the kubelet or kube-proxy.

  • Enhance the Executor interface so that custom executors (i.e. rke2's pebinaryexecutor) can define how containerd and cridockerd are executed.
  • Export the utility functions containerd.PreloadImages and containerd.SetupContainerdConfig so that code duplication between k3s and rke2 is reduced as much as possible
  • Start containerd/cridockerd when agent.Agent is called, as opposed to doing so in the run function
  • Update the embedded executor to satisfy the updated interface, continue to use the existing logic for running containerd and cridockerd

Types of Changes

Enhancement, custom executors now have control over the behavior of containerd and cridockerd.

Verification

The k3s embedded executor should be tested to ensure that it properly starts the desired container runtime on startup, otherwise there are no functional changes which would need more significant verification.

Testing

Manual testing:

After building a custom k3s binary, I was able to confirm that a 2 node cluster (1 cp+etcd server, 1 agent) will properly start the desired container runtime (both containerd and cridockerd were tested), and that workloads can be deployed onto that cluster without issue using kubectl.

Automated testing:

The existing integration and e2e tests cover basic startup and functionality of the k3s agent and server, so no new tests have been added. Since this PR focuses on extending the executor interface, I could not think of a way to add meaningful unit tests to cover these changes

Linked Issues

rancher/rke2#2204

User-Facing Change

NONE

Further Comments

In order to start the runtime using an executor I had to move the call from agent.run to agent.Agent. This results in the agent being started before the following code is called

if cfg.AgentReady != nil {
	close(cfg.AgentReady)
} 

Previously this call was made once the runtime was started, but before the agent was created. Looking at the code I don't think this has a negative impact on the server, however I wanted to call this out in case there is a subtlety that I'm missing in how that channel is expected to be closed

@brandond
Copy link
Contributor

brandond commented Jan 8, 2024

Have you tested this in RKE2 yet? The sequencing of the agent startup is much fussier on that side; I'm curious if the changes will have an impact on that side.

LGTM otherwise though, I am a fan of moving more stuff into the executor interface.

@manuelbuil
Copy link
Contributor

I like this PR and what it tries to solve. Removing/restarting rke2 is a pain in Windows

@HarrisonWAffel
Copy link
Contributor Author

I was able to test out the fix for 2204 today using a custom rke2 binary compiled against this k3s branch. Once I updated the pebinaryexecutor and the staticpod executors I didn't see any issues during startup for rke2. I tested both linux and windows nodes

I'm looking into the CI failure today, will update once that is resolved

@brandond
Copy link
Contributor

brandond commented Jan 9, 2024

It looks like the e2e tests have broken, glad to approve once those are green.

Signed-off-by: Harrison Affel <harrisonaffel@gmail.com>
@HarrisonWAffel
Copy link
Contributor Author

Rebased this pr off of master, it looks like CI is now passing

@HarrisonWAffel
Copy link
Contributor Author

Any idea when we might be able to merge this? Also, should I go ahead and raise back port PR's for this change, or is that handled during the release process?

@caroline-suse-rancher
Copy link
Contributor

Hey @HarrisonWAffel we entered code freeze for January releases on Friday (01/12). Is this crucial to get in for the January releases, or is a February timeline alright? cc @brandond @manuelbuil

@HarrisonWAffel
Copy link
Contributor Author

@caroline-suse-rancher

Nope this is not critical for the January release, I'll double back on this when code freeze is over. If there's any other action items on my side needed let me know!

Thanks!

@brandond
Copy link
Contributor

@HarrisonWAffel devs are responsible for their own backport issues and PRs, if you're up for it you might as well get those created now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants