Issues running repeated updates with autopilot #4296

Open
3 of 4 tasks
laverya opened this issue Apr 12, 2024 · 2 comments
Labels
bug Something isn't working component/autopilot

Comments

@laverya

laverya commented Apr 12, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Version

v1.28.7+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "afebc983fd329da739962030512903dcb8d95d75363811f488798f5677c802ff" (from machine) (pass)
Total memory: 15.6 GiB (pass)
Disk space available for /var/lib/k0s: 173.3 GiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.1.0-18-cloud-amd64 (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: built-in (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

When an AirgapUpdate plan pulled a new images file for a host, the file had the same name as the one already on the host (images-amd64.tar). This appears to cause the AirgapUpdate to fail with a "bad content length" error.

Steps to reproduce

  1. Install k0s in airgap mode with an images file named X.tar
  2. Run an AirgapUpdate plan that references another file, also named X.tar
  3. Observe the plan fail (though it would succeed if the file were named differently)
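
The precondition in steps 1 and 2 can be checked before a plan is applied. The following is a minimal sketch, not part of k0s: it assumes the download lands in the images directory under the base name of the plan URL, and both the function name `colliding_bundle` and the directory passed in are hypothetical.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def colliding_bundle(url: str, images_dir: Path) -> Optional[Path]:
    """Return the existing file that shares the URL's base name, or None.

    Assumption: the downloaded bundle is stored in `images_dir` under the
    last path segment of `url` (for example images-amd64.tar).
    """
    name = Path(urlparse(url).path).name
    candidate = images_dir / name
    return candidate if candidate.is_file() else None
```

Called with the URL from the plan and the host's images directory (commonly `/var/lib/k0s/images`, which is an assumption about this install), a non-None result means the plan is in the situation described above.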

Expected behavior

The new image file is downloaded and replaces the current image file.

Actual behavior

No new image file is downloaded. Instead, the autopilot plan fails, and the proximate log line is:

Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=error msg="Unable to download 'http://127.0.0.1:50000/images/images-amd64.tar': bad content length" component=autopilot controller=Node

Screenshots and logs

From journalctl -u k0scontroller.service

Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Adding new status for plan 'AirgapUpdate' (index=0)" component=inithandler controller=plans leadermode=true
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg=Processing command=airgapupdate component=autopilot controller=plans leadermode=true state=newplan
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Adding new status for plan 'K0sUpdate' (index=1)" component=inithandler controller=plans leadermode=true
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg=Processing command=k0supdate component=autopilot controller=plans leadermode=true state=newplan
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg=Processing command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Reconciling controller/worker signal node statuses" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Workers can be scheduled (controllers done)" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Requesting plan command transition from 'SchedulableWait' --> 'Schedulable'" component=planstatehandler controller=plans leadermode=true
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg=Processing command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulable
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Sending signalling to node='laverya-ec-airgap-update.c.replicated-qa.internal'" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulable
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Requesting plan command transition from 'Schedulable' --> 'SchedulableWait'" component=planstatehandler controller=plans leadermode=true
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg=Processing command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Reconciling controller/worker signal node statuses" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Found available signaling update request" component=autopilot controller=signal object=Node signalnode=laverya-ec-airgap-update.c.replicated-qa.internal updatetype=airgap
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Updating signaling response to 'Downloading'" component=autopilot controller=signal object=Node signalnode=laverya-ec-airgap-update.c.replicated-qa.internal updatetype=airgap
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="No applicable transitions available, requesting retry" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Requeuing request due to explicit retry" component=autopilot controller=plans leadermode=true
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Starting download of 'http://127.0.0.1:50000/images/images-amd64.tar'" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=error msg="Unable to download 'http://127.0.0.1:50000/images/images-amd64.tar': bad content length" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Updating signaling response to 'FailedDownload'" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=error msg="Reconciler error" Node="{\"name\":\"laverya-ec-airgap-update.c.replicated-qa.internal\"}" component=controller-runtime controller=node controllerGroup= controllerKind=Node error="failed to update signal node to status 'FailedDownload': Operation cannot be fulfilled on nodes \"laverya-ec-airgap-update.c.replicated-qa.internal\": the object has been modified; please apply your changes to the latest version and try again" name=laverya-ec-airgap-update.c.replicated-qa.internal namespace= reconcileID="\"f506532b-89cd-452e-bcb1-97e7640131e0\""
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Starting download of 'http://127.0.0.1:50000/images/images-amd64.tar'" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=error msg="Unable to download 'http://127.0.0.1:50000/images/images-amd64.tar': bad content length" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="Updating signaling response to 'FailedDownload'" component=autopilot controller=Node object=Node reconciler=downloading signalnode=laverya-ec-airgap-update.c.replicated-qa.internal
Apr 12 14:39:26 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:26" level=info msg="current cfg matches existing, not gonna do anything" component=coredns
Apr 12 14:39:31 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:31" level=info msg=Processing command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:31 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:31" level=info msg="Reconciling controller/worker signal node statuses" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:31 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:31" level=info msg="Signal node 'laverya-ec-airgap-update.c.replicated-qa.internal' status changed from 'SignalSent' to 'SignalApplyFailed' (reason: FailedDownload)" command=airgapupdate component=autopilot controller=plans leadermode=true
Apr 12 14:39:31 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:31" level=info msg="Plan is non-recoverable due to apply failure" command=airgapupdate component=autopilot controller=plans leadermode=true state=schedulablewait
Apr 12 14:39:31 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:31" level=info msg="Requesting plan command transition from 'SchedulableWait' --> 'ApplyFailed'" component=planstatehandler controller=plans leadermode=true
Apr 12 14:39:36 laverya-ec-airgap-update.c.replicated-qa.internal k0s[3928]: time="2024-04-12 14:39:36" level=info msg="current cfg matches existing, not gonna do anything" component=coredns
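
To pick the failures out of a journal like this one, the key=value fields can be parsed per line. This is a rough sketch, assuming each log entry sits on a single line and uses the `key=value` / `key="quoted value"` layout seen above; `parse_fields` and `errors_only` are hypothetical helper names, not k0s tooling.

```python
import re
from typing import Dict, Iterable, List

# key=value where value is either a double-quoted string (with backslash
# escapes) or a run of non-space characters (possibly empty).
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_fields(line: str) -> Dict[str, str]:
    """Extract the key=value fields from one log line."""
    fields: Dict[str, str] = {}
    for key, raw in _FIELD.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def errors_only(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only the entries logged at level=error."""
    return [f for f in map(parse_fields, lines) if f.get("level") == "error"]
```

Applied to the journal above, `errors_only` would leave the "bad content length" download failures and the "Reconciler error" conflict, which are the entries relevant to this report.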

The plan yaml:

apiVersion: autopilot.k0sproject.io/v1beta2
kind: Plan
metadata:
  annotations:
    embedded-cluster.replicated.com/installation-name: "20240412114318"
  creationTimestamp: "2024-04-12T14:39:26Z"
  generation: 1
  name: autopilot
  resourceVersion: "47287"
  uid: 53fa20d7-d56f-49a0-a0f1-1229412c062f
spec:
  commands:
  - airgapupdate:
      platforms:
        linux-amd64:
          url: http://127.0.0.1:50000/images/images-amd64.tar
      version: v1.28.8+k0s.0
      workers:
        discovery:
          static:
            nodes:
            - laverya-ec-airgap-update.c.replicated-qa.internal
        limits:
          concurrent: 1
  - k0supdate:
      platforms:
        linux-amd64:
          sha256: 51c9482a558096d99028304fd56afd383e2d87a71963e8457e02210298f5be62
          url: http://127.0.0.1:50000/bin/k0s-upgrade
      targets:
        controllers:
          discovery:
            static:
              nodes:
              - laverya-ec-airgap-update.c.replicated-qa.internal
          limits:
            concurrent: 1
        workers:
          discovery:
            static: {}
          limits:
            concurrent: 1
      version: v1.28.8+k0s.0
  id: 34b10fbd-2973-4e38-87a4-2765cf454b92
  timestamp: now
status:
  commands:
  - airgapupdate:
      workers:
      - lastUpdatedTimestamp: "2024-04-12T14:39:26Z"
        name: laverya-ec-airgap-update.c.replicated-qa.internal
        state: SignalApplyFailed
    id: 0
    state: ApplyFailed
  - id: 1
    k0supdate:
      controllers:
      - lastUpdatedTimestamp: "2024-04-12T14:39:26Z"
        name: laverya-ec-airgap-update.c.replicated-qa.internal
        state: SignalPending
    state: SchedulableWait
  state: ApplyFailed

We run a server (outside of k0s) on localhost in order to serve these files.

Additional context

Our workaround will be to change the name of the images file on each update, but then we also need to handle cleanup. Is there something we should be doing instead? For the k0s binary we can always name it k0s-upgrade and it is renamed as part of the upgrade process, but that doesn't appear to be the case for this file.
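
The rename-per-update workaround could look roughly like the sketch below. It is only an illustration: the naming scheme, the `images-*.tar` pattern, and the helper names `bundle_name` and `stale_bundles` are assumptions, and the code only lists cleanup candidates rather than deleting them, since whether older bundles can be removed safely is the open question here.

```python
from pathlib import Path
from typing import List


def bundle_name(version: str, arch: str = "amd64") -> str:
    """Build a per-version file name so consecutive updates never collide."""
    return f"images-{arch}-{version}.tar"


def stale_bundles(images_dir: Path, keep: str) -> List[Path]:
    """List image bundles in `images_dir` other than the one named `keep`.

    Nothing is deleted; the caller decides what to do with the result.
    """
    return sorted(p for p in images_dir.glob("images-*.tar") if p.name != keep)
```

For the plan in this report, `bundle_name("v1.28.8+k0s.0")` gives `images-amd64-v1.28.8+k0s.0.tar`, and the server would need to publish the bundle under that name.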

(This is also not the latest version of k0s, since testing updates requires starting from an older release. I can trigger the AirgapUpdate plan component on its own if that would be helpful.)

@laverya laverya added the bug Something isn't working label Apr 12, 2024
@laverya
Author

laverya commented Apr 12, 2024

Also, incidentally, the autopilot docs are kind of lacking and don't actually say that AirgapUpdate is for updating the image file - I'm assuming that I'm using this for its intended purpose here? 😅

@jnummelin
Collaborator

That is a correct assumption.
