
fix(alpine): fix Alpine 3.23 compatibility issues causing post-reboot crashes#132

Merged
stevensbkang merged 10 commits into portainer:develop from samdulam:fix/alpine-3.23-compatibility
Mar 31, 2026

Conversation

@samdulam
Contributor

Summary

  • cgroupDriver mismatch: kubelet had cgroupDriver: systemd hardcoded, and containerd set SystemdCgroup=true whenever cgroupv2 was detected. Alpine 3.23 uses cgroupv2 but runs OpenRC, not systemd, so both kubelet and containerd now detect systemd presence via /run/systemd/private before selecting the systemd cgroup driver. Without this fix, kubesolo works interactively but crashes on every reboot, because at boot OpenRC starts the service with the systemd cgroup driver selected even though systemd does not exist.
  • CoreDNS OOM killed: Alpine 3.23's cgroupv2 kernel accounts for more memory types (socket buffers, slab objects, page tables), causing CoreDNS to exceed its 20Mi limit and be OOM-killed silently (exit 255 / "Unknown" rather than 137 / "OOMKilled"). Increased limit to 64Mi and request to 32Mi.
  • kube-proxy nftables binary missing: /proc/net/ip_tables_names is absent on Alpine 3.23, so kube-proxy selects nftables mode and requires the nft binary which isn't installed by default. Added Alpine-specific nftables prerequisite check to install.sh with a --install-prereqs flag for automatic installation.
  • install.sh self-kill: stop_running_processes used pgrep -f "kubesolo" which matched the install script's own process when invoked with --offline-install=/tmp/kubesolo. Replaced with a /proc/$pid/exe-based lookup that only matches processes whose actual executable is the kubesolo binary.
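The /proc/$pid/exe-based lookup can be sketched as follows. This is a minimal illustration, not the actual install.sh code: the function name and the KUBESOLO_BIN path are assumptions.

```shell
#!/bin/sh
# Hypothetical install path; install.sh's real variable name may differ.
KUBESOLO_BIN="/usr/local/bin/kubesolo"

stop_kubesolo() {
  for proc in /proc/[0-9]*; do
    pid="${proc#/proc/}"
    # Guard: never touch init or this script itself.
    [ "$pid" -le 1 ] && continue
    [ "$pid" -eq "$$" ] && continue
    # /proc/$pid/exe resolves to the process's real executable, so a shell
    # running "install.sh --offline-install=/tmp/kubesolo" never matches,
    # unlike pgrep -f "kubesolo", which greps the full command line.
    exe="$(readlink -f "$proc/exe" 2>/dev/null)" || continue
    [ "$exe" = "$KUBESOLO_BIN" ] && kill "$pid"
  done
  return 0
}
```

Only processes whose resolved executable equals $KUBESOLO_BIN are signalled; when the binary is not installed, the sweep is a no-op.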

Test plan

  • Install kubesolo on Alpine 3.23 with --install-prereqs flag and verify nftables is installed automatically
  • Verify kubesolo starts successfully after a clean reboot on Alpine 3.23
  • Verify CoreDNS pod stays running (no OOM kills in dmesg)
  • Verify kube-proxy starts without "nft not found" error
  • Verify kubesolo still works correctly on systemd-based distros (cgroupDriver should still be "systemd")
  • Run install.sh --offline-install=/tmp/kubesolo-binary and confirm the script does not kill itself

… crashes

Three root causes identified on Alpine 3.23 that work fine on 3.22:

1. cgroupDriver mismatch: kubelet hardcoded "systemd" as cgroup driver but
   Alpine uses OpenRC (no systemd), causing kubelet to fail immediately on
   boot. Similarly, containerd set SystemdCgroup=true whenever cgroupv2 was
   detected, regardless of whether systemd was running. Both now detect
   systemd presence via /run/systemd/private before selecting the driver.

2. CoreDNS OOM killed: Alpine 3.23's cgroupv2 kernel accounts for more memory
   types (socket buffers, slab objects, page tables), causing CoreDNS to
   exceed its 20Mi limit. Increased limit to 64Mi and request to 32Mi.

3. kube-proxy nftables mode: /proc/net/ip_tables_names is absent on Alpine
   3.23, causing kube-proxy to select nftables mode which requires the nft
   binary. Added nftables as an Alpine-specific prerequisite check in
   install.sh, with --install-prereqs flag for automatic installation.
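The detection described in item 1 can be sketched like this. The real check lives in kubesolo's own code, so the function name here is illustrative only.

```shell
#!/bin/sh
# /run/systemd/private is the control socket of a running systemd manager.
# It exists only when systemd is actually the service manager, which makes
# it a better signal than "is cgroupv2 mounted" on OpenRC systems such as
# Alpine 3.23.
detect_cgroup_driver() {
  if [ -S /run/systemd/private ]; then
    echo "systemd"
  else
    echo "cgroupfs"
  fi
}
```

On a systemd-based distro this yields "systemd"; on Alpine/OpenRC it falls back to "cgroupfs" even though cgroupv2 is in use.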

Also fixed install.sh stop_running_processes self-kill: replaced pgrep -f
"kubesolo" (which matches the install script's own cmdline when
--offline-install path contains "kubesolo") with a /proc/$pid/exe-based
check that only matches processes whose actual executable is the kubesolo
binary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@samdulam samdulam requested a review from stevensbkang as a code owner March 30, 2026 11:33
Sam and others added 9 commits March 30, 2026 12:02
…elf-kill

stop_port_processes was detecting kubesolo processes by grepping the cmdline
for "kubesolo", which could match the install script itself or its path
argument. Switch to checking /proc/$pid/exe against the known binary path,
consistent with the fix already applied to stop_running_processes.
Mirrors the matrix from release.yaml so musl artifacts (required for
Alpine and other musl-based distros) are available from CI runs, not
only from manual release triggers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lsof -t on Alpine returns PID 0/1 for files that were previously mapped
by the init system. The previous check used a cmdline grep, which failed to
identify these as non-kubesolo processes, and a logic bug caused unrelated
processes to be killed when kubesolo was not running. Fix:

- Add explicit guard: never kill PID <= 1
- Replace cmdline grep with /proc/$pid/exe check (consistent with the
  other stop_* functions) so only the actual kubesolo binary is targeted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…es on reboot

containerd's state directory was under /var/lib/kubesolo (persistent),
so after a reboot containerd recovered existing pod sandboxes from its
database without re-running CNI ADD. This left the nftables masquerade
rules (in-memory, wiped on reboot) empty, breaking pod-to-external
routing until pods were manually deleted and rescheduled.

Moving state to /run/kubesolo/containerd/state (tmpfs) forces containerd
to treat all pods as new on each boot, re-running CNI ADD for every pod
and re-establishing the masquerade rules. Persistent image/snapshot data
stays under basePath/containerd/root as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
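The resulting directory split can be sketched as a containerd config fragment. `root` and `state` are containerd's standard top-level config keys; the basePath is assumed to be /var/lib/kubesolo, as in the description above.

```toml
# Persistent: image and snapshot data survives reboots.
root = "/var/lib/kubesolo/containerd/root"

# Ephemeral: on tmpfs, wiped at boot, so every pod sandbox is treated as
# new and CNI ADD re-runs, restoring the masquerade rules.
state = "/run/kubesolo/containerd/state"
```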
On a default Alpine install the OpenRC cgroups service is not enabled,
leaving /sys/fs/cgroup/cgroup.controllers empty. Without it kubesolo
fails the cgroups pre-flight check. Added ensure_alpine_cgroups_service()
which detects this condition on Alpine/OpenRC and either enables+starts
the service automatically (--install-prereqs) or exits with clear
instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
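A sketch of what ensure_alpine_cgroups_service() might look like. The controllers-file parameter is added here purely for illustration; the real function presumably hardcodes /sys/fs/cgroup/cgroup.controllers, and the messages are paraphrased.

```shell
#!/bin/sh
# INSTALL_PREREQS would be set by the --install-prereqs flag.
ensure_alpine_cgroups_service() {
  controllers="${1:-/sys/fs/cgroup/cgroup.controllers}"
  # A non-empty controllers file means the unified hierarchy is usable.
  if [ -s "$controllers" ]; then
    return 0
  fi
  if [ "$INSTALL_PREREQS" = "true" ]; then
    # Enable and start OpenRC's cgroups service automatically.
    rc-update add cgroups boot && rc-service cgroups start
  else
    echo "cgroups not mounted; enable with:" >&2
    echo "  rc-update add cgroups boot && rc-service cgroups start" >&2
    return 1
  fi
}
```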
flushNftablesNat() was unconditionally flushing table ip nat before
kube-proxy started. In nftables mode kube-proxy uses its own
table ip kube-proxy and never writes to table ip nat, so the flush
was wiping CNI masquerade rules set up by the bridge plugin during
pod scheduling. This caused pod-to-external traffic to break after
every reboot (rules were added during kubelet startup, then cleared
when kube-proxy started moments later).

The flush was originally added to avoid conflicts with Podman/netavark's
native nftables entries, a concern that only applies in iptables proxy
mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This change is being handled in a separate PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Only the limit needs to increase to prevent OOM kills on Alpine 3.23
cgroupv2. The request can stay at the original 20Mi.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
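The CoreDNS resource stanza implied by this commit would look roughly like the fragment below (standard Kubernetes resources fields; only the memory limit changes relative to the original manifest):

```yaml
resources:
  requests:
    memory: 20Mi   # kept at the original value
  limits:
    memory: 64Mi   # raised from 20Mi for cgroupv2's broader memory accounting
```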
arm32 and riscv64 builds are not needed for CI validation.
Full arch matrix is still built in the release pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@stevensbkang stevensbkang merged commit a5cf461 into portainer:develop Mar 31, 2026
4 checks passed