fix(alpine): fix Alpine 3.23 compatibility issues causing post-reboot crashes #132
Merged
stevensbkang merged 10 commits into portainer:develop on Mar 31, 2026
… crashes

Three root causes identified for failures on Alpine 3.23 that do not occur on 3.22:

1. cgroupDriver mismatch: kubelet hardcoded "systemd" as cgroup driver but Alpine uses OpenRC (no systemd), causing kubelet to fail immediately on boot. Similarly, containerd set SystemdCgroup=true whenever cgroupv2 was detected, regardless of whether systemd was running. Both now detect systemd presence via /run/systemd/private before selecting the driver.
2. CoreDNS OOM killed: Alpine 3.23's cgroupv2 kernel accounts more memory types (socket buffers, slab objects, page tables), causing CoreDNS to exceed its 20Mi limit. Increased limit to 64Mi and request to 32Mi.
3. kube-proxy nftables mode: /proc/net/ip_tables_names is absent on Alpine 3.23, causing kube-proxy to select nftables mode, which requires the nft binary. Added nftables as an Alpine-specific prerequisite check in install.sh, with a --install-prereqs flag for automatic installation.

Also fixed install.sh stop_running_processes self-kill: replaced pgrep -f "kubesolo" (which matches the install script's own cmdline when the --offline-install path contains "kubesolo") with a /proc/$pid/exe-based check that only matches processes whose actual executable is the kubesolo binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
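The systemd detection in item 1 can be sketched in shell as follows. This is only an illustration of the check, not the project's code (the actual change lives in the kubelet and containerd configuration logic); `select_cgroup_driver` is a hypothetical helper, and falling back to `cgroupfs` is an assumption based on the two driver names kubelet accepts.

```shell
#!/bin/sh
# Sketch: pick a cgroup driver depending on whether systemd is running.
# select_cgroup_driver is a hypothetical name, not a kubesolo function.
# The optional argument overrides the probed path, which makes it testable.
select_cgroup_driver() {
    systemd_path="${1:-/run/systemd/private}"
    # /run/systemd/private is the path the commit message uses as the
    # "systemd is present" signal.
    if [ -e "$systemd_path" ]; then
        echo "systemd"
    else
        # Assumption: the non-systemd fallback is kubelet's "cgroupfs" driver.
        echo "cgroupfs"
    fi
}
```

On an OpenRC-based Alpine host the path should be absent, so `select_cgroup_driver` would print `cgroupfs`.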
…elf-kill

stop_port_processes was detecting kubesolo processes by grepping the cmdline for "kubesolo", which could match the install script itself or its path argument. Switch to checking /proc/$pid/exe against the known binary path, consistent with the fix already applied to stop_running_processes.
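A minimal sketch of an exe-based match, assuming the binary path is passed in by the caller. `is_exe_match` and the `/usr/local/bin/kubesolo` path in the usage line are hypothetical, and stripping the ` (deleted)` suffix is an extra precaution that the commit does not mention.

```shell
#!/bin/sh
# Sketch: succeed only if PID's executable is exactly BINARY.
# Usage: is_exe_match PID BINARY [PROC_ROOT]
# is_exe_match is a hypothetical helper; PROC_ROOT defaults to /proc and
# exists only so the function can be exercised against a fake tree.
is_exe_match() {
    pid="$1"
    binary="$2"
    proc_root="${3:-/proc}"
    # /proc/PID/exe is a symlink to the running executable; readlink fails
    # if the process is gone or the link is not readable.
    exe=$(readlink "$proc_root/$pid/exe" 2>/dev/null) || return 1
    # Linux appends " (deleted)" when the file was unlinked or replaced.
    exe="${exe% (deleted)}"
    [ "$exe" = "$binary" ]
}
```

For example, `is_exe_match "$pid" /usr/local/bin/kubesolo` would not match a shell that is running `install.sh --offline-install=/tmp/kubesolo`, because that process's executable is the shell interpreter, whereas a cmdline grep for "kubesolo" would match it.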
Mirrors the matrix from release.yaml so musl artifacts (required for Alpine and other musl-based distros) are available from CI runs, not only from manual release triggers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lsof -t on Alpine returns PID 0/1 for files that were previously mapped by the init system. The previous check used cmdline grep which failed to identify them as non-kubesolo, and a logic bug caused all processes to be killed when not running under kubesolo.

Fix:
- Add explicit guard: never kill PID <= 1
- Replace cmdline grep with /proc/$pid/exe check (consistent with the other stop_* functions) so only the actual kubesolo binary is targeted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
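The PID guard can be illustrated with a small filter over `lsof -t` style output (one PID per line). `filter_killable_pids` is a hypothetical name, and this is a sketch of the guard only, not the script's actual implementation.

```shell
#!/bin/sh
# Sketch: read PIDs on stdin (one per line) and print only those that are
# safe to consider for termination, i.e. numeric and greater than 1.
# filter_killable_pids is a hypothetical helper.
filter_killable_pids() {
    while read -r pid; do
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # skip blank or non-numeric lines
        esac
        # Never pass PID 0 or PID 1 (init) on to kill.
        if [ "$pid" -gt 1 ]; then
            echo "$pid"
        fi
    done
    return 0
}
```

A caller could pipe `lsof -t` output through this filter and then apply the /proc/$pid/exe check to each remaining PID before killing anything.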
…es on reboot

containerd's state directory was under /var/lib/kubesolo (persistent), so after a reboot containerd recovered existing pod sandboxes from its database without re-running CNI ADD. This left the nftables masquerade rules (in-memory, wiped on reboot) empty, breaking pod-to-external routing until pods were manually deleted and rescheduled. Moving state to /run/kubesolo/containerd/state (tmpfs) forces containerd to treat all pods as new on each boot, re-running CNI ADD for every pod and re-establishing the masquerade rules. Persistent image/snapshot data stays under basePath/containerd/root as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On a default Alpine install the OpenRC cgroups service is not enabled, leaving /sys/fs/cgroup/cgroup.controllers empty. Without it kubesolo fails the cgroups pre-flight check. Added ensure_alpine_cgroups_service() which detects this condition on Alpine/OpenRC and either enables+starts the service automatically (--install-prereqs) or exits with clear instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
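A rough sketch of the detect-then-enable flow described above, under several assumptions: the function names and the error message are hypothetical, the boolean argument stands in for however install.sh tracks `--install-prereqs`, and the real `ensure_alpine_cgroups_service()` may differ. The file is read rather than size-tested because cgroup pseudo-files report a size of 0 even when they have content.

```shell
#!/bin/sh
# Sketch: succeed when the cgroup v2 controllers file is absent or empty.
# cgroup_controllers_missing is a hypothetical helper.
cgroup_controllers_missing() {
    controllers_file="${1:-/sys/fs/cgroup/cgroup.controllers}"
    # Read the content instead of using "test -s": pseudo-files report size 0.
    [ -z "$(cat "$controllers_file" 2>/dev/null)" ]
}

# Sketch: enable the OpenRC cgroups service, or explain how to.
# Usage: ensure_cgroups_sketch true|false [CONTROLLERS_FILE]
ensure_cgroups_sketch() {
    install_prereqs="$1"
    cgroup_controllers_missing "$2" || return 0   # controllers present
    if [ "$install_prereqs" = "true" ]; then
        # Enable at boot and start now (OpenRC commands; need root).
        rc-update add cgroups && rc-service cgroups start
    else
        # Hypothetical message; the script's real wording is not shown here.
        echo "cgroups service is not enabled; run:" >&2
        echo "  rc-update add cgroups && rc-service cgroups start" >&2
        return 1
    fi
}
```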
flushNftablesNat() was unconditionally flushing table ip nat before kube-proxy started. In nftables mode kube-proxy uses its own table ip kube-proxy and never writes to table ip nat, so the flush was wiping CNI masquerade rules set up by the bridge plugin during pod scheduling. This caused pod-to-external traffic to break after every reboot (rules were added during kubelet startup, then cleared when kube-proxy started moments later). The flush was originally added to avoid conflicts with Podman/netavark native nftables entries, which only applies to iptables proxy mode. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This change is being handled in a separate PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Only the limit needs to increase to prevent OOM kills on Alpine 3.23 cgroupv2. The request can stay at the original 20Mi. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
arm32 and riscv64 builds are not needed for CI validation. Full arch matrix is still built in the release pipeline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stevensbkang approved these changes on Mar 31, 2026
Summary
- kubelet had `cgroupDriver: systemd` hardcoded and containerd set `SystemdCgroup=true` whenever cgroupv2 was detected. Alpine 3.23 uses cgroupv2 but runs OpenRC (not systemd), so both kubelet and containerd now detect systemd presence via `/run/systemd/private` before selecting the systemd cgroup driver. Without this fix, kubesolo works interactively but crashes on every reboot because OpenRC starts the service before systemd (which doesn't exist) could be checked.
- `/proc/net/ip_tables_names` is absent on Alpine 3.23, so kube-proxy selects nftables mode and requires the `nft` binary, which isn't installed by default. Added an Alpine-specific nftables prerequisite check to `install.sh`, with a `--install-prereqs` flag for automatic installation.
- `stop_running_processes` used `pgrep -f "kubesolo"`, which matched the install script's own process when invoked with `--offline-install=/tmp/kubesolo`. Replaced with a `/proc/$pid/exe`-based lookup that only matches processes whose actual executable is the kubesolo binary.

Test plan
- … `--install-prereqs` flag and verify nftables is installed automatically
- … `dmesg`)
- … `install.sh --offline-install=/tmp/kubesolo-binary` and confirm the script does not kill itself
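As an illustration of the nftables prerequisite check from the summary, the sketch below looks for the `nft` binary and either installs the Alpine `nftables` package or reports what is missing. `check_nftables_prereq`, its boolean argument, and the message text are hypothetical; the actual check in `install.sh` may be structured differently.

```shell
#!/bin/sh
# Sketch: verify that nft is available, installing it on request.
# Usage: check_nftables_prereq true|false
# check_nftables_prereq is a hypothetical helper; the argument stands in
# for however the script records the --install-prereqs flag.
check_nftables_prereq() {
    install_prereqs="$1"
    # Nothing to do when nft is already on PATH.
    command -v nft >/dev/null 2>&1 && return 0
    if [ "$install_prereqs" = "true" ]; then
        # The Alpine package that ships the nft binary (needs root).
        apk add nftables
    else
        # Hypothetical message text.
        echo "nft not found; run 'apk add nftables' or re-run with --install-prereqs" >&2
        return 1
    fi
}
```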