Reboot of the first control plane with dynamicConfig breaks Calico & Helm Charts #3304

CmdrSharp opened this issue Jul 20, 2023 · 27 comments · Fixed by k0sproject/k0sctl#523
Closed · Labels: bug (Something isn't working), Stale
CmdrSharp commented Jul 20, 2023

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 5.15.117-flatcar #1 SMP Tue Jul 4 14:43:38 -00 2023 x86_64 GNU/Linux
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3510.2.4
VERSION_ID=3510.2.4
BUILD_ID=2023-07-04-1508
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3510.2.4:*:*:*:*:*:*:*"

Version

1.27.3+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "fbb20c14cb7ccdef4c6b0bc754438a16a7410a2e5c418fc935574c5e2dbf6c8c" (from machine) (pass)
Total memory: 11.7 GiB (pass)
Disk space available for /var/lib/k0s: 40.5 GiB (pass)
Operating system: Linux (pass)
  Linux kernel release: 5.15.117-flatcar (pass)
  Max. file descriptors per process: current: 524288 / max: 524288 (pass)
  Executable in path: modprobe: /usr/sbin/modprobe (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": available (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: built-in (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: built-in (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: module (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

After installing a fresh k0s cluster, I decided to test resiliency by both gracefully and forcefully shutting down nodes. The goal was to verify that the cluster recovers when the nodes are brought back into the cluster.

Removing workers was no issue, so moving on to the control plane nodes, I decided to take out the third control plane node first (gracefully). Doing so was no issue; it became NotReady, and the konnectivity-agent on it immediately went to Terminating. I then powered the node back on and the workload recovered fine.

I then did the same to the second control plane node (cp02), and it also worked fine.

However, doing the same to the first control plane node (cp01) breaks things.
The moment it goes offline, Calico is (mostly) uninstalled and replaced by kube-router, seemingly out of nowhere.
To test this further, I re-created the cluster and this time took down cp01 first, leaving the other two. The issue still occurs immediately.

Something causes the CNI to be replaced (at least when it is Calico) whenever the primary control plane node goes offline. Bringing cp01 back does not recover the situation.

To be sure, I checked that the ClusterConfig object still contains the correct configuration, and it does.
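For reference, the check amounted to something like this (a sketch; the object name k0s and the kube-system namespace match the ClusterConfig output later in this thread):

# Dump the dynamic cluster configuration and confirm the CNI setting
kubectl -n kube-system get clusterconfig k0s -o yaml

# Or pull just the provider field; it should still read "calico"
kubectl -n kube-system get clusterconfig k0s -o jsonpath='{.spec.network.provider}'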

Steps to reproduce

  1. Install a k0s cluster with Calico as the CNI
  2. Either gracefully or forcefully shut down the primary control plane node
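A minimal sketch of step 2 and what to watch for (the host address and user are the ones from this cluster's k0sctl.yaml; the exact shutdown mechanism shouldn't matter):

# Gracefully shut down the primary control plane node (cp01)...
ssh devops@REDACTED.52 sudo shutdown -h now
# ...or power it off forcefully at the hypervisor instead.

# From a machine with the admin kubeconfig, wait for cp01 to go NotReady
kubectl get nodes -w

# Watch kube-system: Calico pods start terminating and kube-router appears
kubectl -n kube-system get pods -o wide -w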

Expected behavior

It should behave no differently than when the second or third control plane node is the one to disappear: Calico should remain functioning and kube-router should not be installed.

Actual behavior

Most (but not all) of Calico gets removed and replaced by kube-router, though this installation remains broken as well. It also seems metallb is removed automatically.

Remaining workloads after cp01 has powered off:

NAMESPACE          NAME                                                READY   STATUS        RESTARTS       AGE     IP               NODE                NOMINATED NODE   READINESS GATES
external-secrets   external-secrets-769df6c8cd-fbsn9                   0/1     Terminating   0              13m     10.244.252.130   calico-dev01-w04    <none>           <none>
external-secrets   external-secrets-769df6c8cd-lqf9m                   0/1     Terminating   0              13m     10.244.252.132   calico-dev01-w04    <none>           <none>
external-secrets   external-secrets-cert-controller-57b8c96ffb-glbf9   0/1     Terminating   0              13m     10.244.252.134   calico-dev01-w04    <none>           <none>
external-secrets   external-secrets-webhook-75c54b49d7-9wzcb           0/1     Terminating   0              13m     10.244.252.133   calico-dev01-w04    <none>           <none>
kube-system        calico-kube-controllers-6d48c8cf5c-264g7            1/1     Terminating   0              14m     10.244.210.1     calico-dev01-cp01   <none>           <none>
kube-system        coredns-878bb57ff-5clgh                             1/1     Running       0              14m     10.244.210.2     calico-dev01-cp01   <none>           <none>
kube-system        coredns-878bb57ff-ld5hx                             1/1     Running       0              13m     10.244.252.129   calico-dev01-w04    <none>           <none>
kube-system        konnectivity-agent-6bcv7                            1/1     Terminating   0              3m49s   REDACTED.52     calico-dev01-cp01   <none>           <none>
kube-system        konnectivity-agent-9fhbt                            1/1     Running       0              3m47s   REDACTED.35     calico-dev01-w02    <none>           <none>
kube-system        konnectivity-agent-d2vs5                            1/1     Running       0              3m46s   REDACTED.54     calico-dev01-cp03   <none>           <none>
kube-system        konnectivity-agent-gfnsk                            1/1     Running       0              8m8s    REDACTED.53     calico-dev01-cp02   <none>           <none>
kube-system        konnectivity-agent-kclj8                            1/1     Running       0              8m10s   REDACTED.36     calico-dev01-w03    <none>           <none>
kube-system        konnectivity-agent-mnqq7                            1/1     Running       0              3m50s   REDACTED.37     calico-dev01-w04    <none>           <none>
kube-system        konnectivity-agent-swwkj                            0/1     Terminating   0              38s     <none>           calico-dev01-w01    <none>           <none>
kube-system        kube-proxy-5jphq                                    1/1     Running       0              13m     REDACTED.37     calico-dev01-w04    <none>           <none>
kube-system        kube-proxy-7q6bs                                    1/1     Running       0              13m     REDACTED.36     calico-dev01-w03    <none>           <none>
kube-system        kube-proxy-8cvfq                                    1/1     Running       0              13m     REDACTED.34     calico-dev01-w01    <none>           <none>
kube-system        kube-proxy-cf97j                                    1/1     Running       2 (4m1s ago)   13m     REDACTED.54     calico-dev01-cp03   <none>           <none>
kube-system        kube-proxy-db5xs                                    1/1     Running       0              13m     REDACTED.52     calico-dev01-cp01   <none>           <none>
kube-system        kube-proxy-hwbr2                                    1/1     Running       0              13m     REDACTED.53     calico-dev01-cp02   <none>           <none>
kube-system        kube-proxy-lfcfh                                    1/1     Running       0              13m     REDACTED.35     calico-dev01-w02    <none>           <none>
kube-system        kube-router-4dcrs                                   1/1     Running       0              31s     REDACTED.54     calico-dev01-cp03   <none>           <none>
kube-system        kube-router-59gkx                                   0/1     Pending       0              31s     <none>           calico-dev01-cp01   <none>           <none>
kube-system        kube-router-67kfl                                   1/1     Running       0              31s     REDACTED.37     calico-dev01-w04    <none>           <none>
kube-system        kube-router-bn29z                                   1/1     Running       0              31s     REDACTED.35     calico-dev01-w02    <none>           <none>
kube-system        kube-router-stp5s                                   1/1     Running       0              31s     REDACTED.36     calico-dev01-w03    <none>           <none>
kube-system        kube-router-swzc4                                   1/1     Running       0              31s     REDACTED.53     calico-dev01-cp02   <none>           <none>
kube-system        kube-router-xtkjq                                   1/1     Running       0              31s     REDACTED.34     calico-dev01-w01    <none>           <none>
kube-system        metrics-server-7f86dff975-vxffm                     1/1     Running       0              14m     10.244.210.3     calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp01                              1/1     Running       0              12m     REDACTED.52     calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp02                              1/1     Running       0              12m     REDACTED.53     calico-dev01-cp02   <none>           <none>
kube-system        nllb-calico-dev01-cp03                              1/1     Running       0              3m54s   REDACTED.54     calico-dev01-cp03   <none>           <none>
kube-system        nllb-calico-dev01-w01                               1/1     Running       0              12m     REDACTED.34     calico-dev01-w01    <none>           <none>
kube-system        nllb-calico-dev01-w02                               1/1     Running       0              12m     REDACTED.35     calico-dev01-w02    <none>           <none>
kube-system        nllb-calico-dev01-w03                               1/1     Running       0              13m     REDACTED.36     calico-dev01-w03    <none>           <none>
kube-system        nllb-calico-dev01-w04                               1/1     Running       0              11m     REDACTED.37     calico-dev01-w04    <none>           <none>

We can also see that most of the Calico deployment has been removed, including its RBAC.
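A sketch of how that removal can be confirmed (resource names assumed from the stock k0s Calico manifests and the events quoted below):

# Calico RBAC and its service accounts are gone
kubectl get clusterrole,clusterrolebinding | grep -i calico
kubectl -n kube-system get serviceaccount | grep -i calico

# The calico-node DaemonSet is gone too; only kube-router remains
kubectl -n kube-system get daemonset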

Screenshots and logs

After cp01 gets powered back on:

k describe pod -n kube-system calico-kube-controllers-6d48c8cf5c-264g7

  Warning  FailedKillPod           12s                kubelet            error killing pod: failed to "KillPodSandbox" for "c9e98fbc-401c-46d8-b6a7-d39561a1533f" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"5b75ad197b786c49cf24076777c14b8a3b8fb827c87a05f948c7dc2c7b7067ac\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: Get \"https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: i/o timeout"
  Warning  FailedMount             11s (x7 over 43s)  kubelet            MountVolume.SetUp failed for volume "kube-api-access-h9qdl" : failed to fetch token: serviceaccounts "calico-kube-controllers" not found
  Warning  FailedKillPod           2s                 kubelet            error killing pod: failed to "KillPodSandbox" for "c9e98fbc-401c-46d8-b6a7-d39561a1533f" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"5b75ad197b786c49cf24076777c14b8a3b8fb827c87a05f948c7dc2c7b7067ac\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"

Additional context

k0sctl.yaml

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: calico-dev01
spec:
  hosts:
  - ssh:
      address: REDACTED.52
      user: devops
      port: 22
      keyPath: devops
    role: controller+worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.53
      user: devops
      port: 22
      keyPath: devops
    role: controller+worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.54
      user: devops
      port: 22
      keyPath: devops
    role: controller+worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.34
      user: devops
      port: 22
      keyPath: devops
    role: worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.35
      user: devops
      port: 22
      keyPath: devops
    role: worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.36
      user: devops
      port: 22
      keyPath: devops
    role: worker
    privateInterface: ens192
  - ssh:
      address: REDACTED.37
      user: devops
      port: 22
      keyPath: devops
    role: worker
    privateInterface: ens192
  k0s:
    version: 1.27.3+k0s.0
    dynamicConfig: true
    config:
      spec:
        extensions:
          helm:
            repositories:
            - name: metallb
              url: https://metallb.github.io/metallb
            - name: external-secrets
              url: https://charts.external-secrets.io
            charts:
            - name: metallb
              chartname: metallb/metallb
              namespace: metallb
              order: 0
              values: |
                speaker:
                  logLevel: warn
            - name: external-secrets
              chartname: external-secrets/external-secrets
              namespace: external-secrets
              version: 0.9.0
              order: 1
              values: |
                replicaCount: 2
                leaderElect: true
                podDisruptionBudget:
                  enabled: true
                  minAvailable: 1
                recreatePods: true
        network:
          nodeLocalLoadBalancing:
            enabled: true
          provider: calico
          calico:
            envVars:
              FELIX_FEATUREDETECTOVERRIDE: ChecksumOffloadBroken=true

List of pods in fresh cluster (prior to the issue)

NAMESPACE          NAME                                                READY   STATUS    RESTARTS   AGE
external-secrets   external-secrets-769df6c8cd-fbsn9                   1/1     Running   0          2m24s
external-secrets   external-secrets-769df6c8cd-lqf9m                   1/1     Running   0          2m24s
external-secrets   external-secrets-cert-controller-57b8c96ffb-glbf9   1/1     Running   0          2m24s
external-secrets   external-secrets-webhook-75c54b49d7-9wzcb           1/1     Running   0          2m24s
kube-system        calico-kube-controllers-6d48c8cf5c-264g7            1/1     Running   0          2m45s
kube-system        calico-node-6m8q2                                   1/1     Running   0          2m
kube-system        calico-node-6wd48                                   1/1     Running   0          2m12s
kube-system        calico-node-dcr8s                                   1/1     Running   0          2m
kube-system        calico-node-lx9k8                                   1/1     Running   0          2m
kube-system        calico-node-m2dvn                                   1/1     Running   0          118s
kube-system        calico-node-ps8tp                                   1/1     Running   0          119s
kube-system        calico-node-qwfg2                                   1/1     Running   0          2m28s
kube-system        coredns-878bb57ff-5clgh                             1/1     Running   0          2m45s
kube-system        coredns-878bb57ff-ld5hx                             1/1     Running   0          2m10s
kube-system        konnectivity-agent-6f8jm                            1/1     Running   0          2m
kube-system        konnectivity-agent-7dxx2                            1/1     Running   0          2m
kube-system        konnectivity-agent-8tt68                            1/1     Running   0          2m9s
kube-system        konnectivity-agent-hfnwg                            1/1     Running   0          119s
kube-system        konnectivity-agent-lt946                            1/1     Running   0          119s
kube-system        konnectivity-agent-rwc47                            1/1     Running   0          2m
kube-system        konnectivity-agent-sbmhf                            1/1     Running   0          118s
kube-system        kube-proxy-5jphq                                    1/1     Running   0          2m
kube-system        kube-proxy-7q6bs                                    1/1     Running   0          118s
kube-system        kube-proxy-8cvfq                                    1/1     Running   0          2m
kube-system        kube-proxy-cf97j                                    1/1     Running   0          119s
kube-system        kube-proxy-db5xs                                    1/1     Running   0          2m28s
kube-system        kube-proxy-hwbr2                                    1/1     Running   0          2m12s
kube-system        kube-proxy-lfcfh                                    1/1     Running   0          2m
kube-system        metrics-server-7f86dff975-vxffm                     1/1     Running   0          2m45s
kube-system        nllb-calico-dev01-cp01                              1/1     Running   0          70s
kube-system        nllb-calico-dev01-cp02                              1/1     Running   0          63s
kube-system        nllb-calico-dev01-cp03                              1/1     Running   0          34s
kube-system        nllb-calico-dev01-w01                               1/1     Running   0          45s
kube-system        nllb-calico-dev01-w02                               1/1     Running   0          55s
kube-system        nllb-calico-dev01-w03                               1/1     Running   0          117s
kube-system        nllb-calico-dev01-w04                               1/1     Running   0          37s
metallb            metallb-controller-5cd9b4944b-2j47z                 1/1     Running   0          2m25s
metallb            metallb-speaker-64tzg                               4/4     Running   0          77s
metallb            metallb-speaker-b6pt8                               4/4     Running   0          103s
metallb            metallb-speaker-f262d                               4/4     Running   0          71s
metallb            metallb-speaker-gwtjp                               4/4     Running   0          101s
metallb            metallb-speaker-h9c79                               4/4     Running   0          77s
metallb            metallb-speaker-ndpcs                               4/4     Running   0          88s
metallb            metallb-speaker-wqzsf                               4/4     Running   0          78s
CmdrSharp added the bug label on Jul 20, 2023
@juanluisvaladas

Hi @CmdrSharp, I tried to reproduce this but couldn't. I used 1.27.4, but at the end of the day it should be the same.

I used a similar configuration with both calico and dynamic config:

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
spec:
  hosts:
  - role: controller+worker
    ssh:
      address: controller-0.k0s.lab
      user: root
      keyPath: ~/.ssh/id_rsa
  - role: controller+worker
    ssh:
      address: controller-1.k0s.lab
      user: root
      keyPath: ~/.ssh/id_rsa
  k0s:
    dynamicConfig: true
    config:
      spec:
        network:
          nodeLocalLoadBalancing:
            enabled: true
            type: EnvoyProxy
          provider: calico
          calico:
            envVars:
              FELIX_FEATUREDETECTOVERRIDE: ChecksumOffloadBroken=true

I tried rebooting both nodes one by one, and also shutting both down in parallel, but this didn't happen for me.
Could you please provide the following from all three control plane nodes:
1. ls /var/lib/k0s/manifests/
2. /etc/k0s/k0s.yaml
I'm also going to need the output of: kubectl get clusterconfig -n kube-system k0s -o yaml

@CmdrSharp

That's odd; I've had a 100% success rate at reproducing it so far. I'll get you the output you asked for first thing tomorrow!


CmdrSharp commented Jul 27, 2023

@juanluisvaladas Here is the requested output

ls /var/lib/k0s/manifests/

api-config  autopilot  bootstraprbac  calico  calico_init  coredns  helm  konnectivity	kubelet  kubeproxy  kuberouter	metricserver

/etc/k0s/k0s.yaml

# generated-by-k0sctl 2023-07-20T15:37:03+02:00
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata: {}
spec:
  api:
    address: REDACTED.52
    sans:
    - REDACTED.52
    - REDACTED.53
    - REDACTED.54
    - 127.0.0.1
  extensions:
    helm:
      charts:
      - chartname: metallb/metallb
        name: metallb
        namespace: metallb
        order: 0
        values: |
          speaker:
            logLevel: warn
      - chartname: external-secrets/external-secrets
        name: external-secrets
        namespace: external-secrets
        order: 1
        values: |
          replicaCount: 2
          leaderElect: true
          podDisruptionBudget:
            enabled: true
            minAvailable: 1
          recreatePods: true
        version: 0.9.0
      repositories:
      - name: metallb
        url: https://metallb.github.io/metallb
      - name: external-secrets
        url: https://charts.external-secrets.io
  network:
    calico:
      envVars:
        FELIX_FEATUREDETECTOVERRIDE: ChecksumOffloadBroken=true
    nodeLocalLoadBalancing:
      enabled: true
    provider: calico
  storage: {}

k get clusterconfig -n kube-system k0s -o yaml

apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  creationTimestamp: "2023-07-20T13:37:21Z"
  generation: 1
  name: k0s
  namespace: kube-system
  resourceVersion: "201"
  uid: f69fcea5-4cf3-4cc8-966c-910aba70dd6b
spec:
  extensions:
    helm:
      charts:
      - chartname: metallb/metallb
        name: metallb
        namespace: metallb
        order: 0
        timeout: 0
        values: |
          speaker:
            logLevel: warn
        version: ""
      - chartname: external-secrets/external-secrets
        name: external-secrets
        namespace: external-secrets
        order: 1
        timeout: 0
        values: |
          replicaCount: 2
          leaderElect: true
          podDisruptionBudget:
            enabled: true
            minAvailable: 1
          recreatePods: true
        version: 0.9.0
      concurrencyLevel: 5
      repositories:
      - caFile: ""
        certFile: ""
        insecure: false
        keyfile: ""
        name: metallb
        password: ""
        url: https://metallb.github.io/metallb
        username: ""
      - caFile: ""
        certFile: ""
        insecure: false
        keyfile: ""
        name: external-secrets
        password: ""
        url: https://charts.external-secrets.io
        username: ""
    storage:
      create_default_storage_class: false
      type: external_storage
  network:
    calico:
      envVars:
        FELIX_FEATUREDETECTOVERRIDE: ChecksumOffloadBroken=true
      flexVolumeDriverPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
      mode: vxlan
      mtu: 1450
      overlay: Always
      vxlanPort: 4789
      vxlanVNI: 4096
      wireguard: false
      withWindowsNodes: false
    dualStack: {}
    kubeProxy:
      iptables:
        minSyncPeriod: 0s
        syncPeriod: 0s
      ipvs:
        minSyncPeriod: 0s
        syncPeriod: 0s
        tcpFinTimeout: 0s
        tcpTimeout: 0s
        udpTimeout: 0s
      metricsBindAddress: 0.0.0.0:10249
      mode: iptables
    kuberouter:
      autoMTU: true
      hairpin: Enabled
      ipMasq: false
      metricsPort: 8080
      mtu: 0
      peerRouterASNs: ""
      peerRouterIPs: ""
    nodeLocalLoadBalancing:
      enabled: true
      envoyProxy:
        apiServerBindPort: 7443
        image:
          image: quay.io/k0sproject/envoy-distroless
          version: v1.24.1
        konnectivityServerBindPort: 7132
      type: EnvoyProxy
    podCIDR: 10.244.0.0/16
    provider: calico

@juanluisvaladas

Hmm, I tried this with just two controller nodes, and when I shut one down the cluster becomes read-only. Since this is related to dynamic config, my test is probably invalid. I'll retry with three control plane nodes.
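(That read-only behavior is expected: with two controllers, losing one costs etcd its quorum. A quick sanity check, as a sketch:)

# On a surviving controller: with only 2 members, quorum requires both
sudo k0s etcd member-list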

@juanluisvaladas

Hi, I tried this again with 3 nodes and I can't reproduce it. I'll now try forcing 1.27.3.


CmdrSharp commented Jul 27, 2023

I'll retry this myself yet again just to verify, tearing down the current cluster where it's happening and re-building.
Here are the steps I will take:

  1. Bootstrap VMs (Flatcar)
  2. Run k0sctl install with the provided yaml config
  3. Shut down the first control plane node and wait until it becomes NotReady
  4. Boot the first control plane node

Will get back to you during the afternoon.
For reference, I'm using k0sctl 0.15.0 (not 0.15.2), since Flatcar instrumentation is bugged in the current release.

@juanluisvaladas

I will test this with Flatcar; it seems to be something specific to it.
I suspect what might be happening is that spec.network.kuberouter isn't empty in the ClusterConfig, and Flatcar may treat some of our directories as non-persistent. I will try this, but I guess the correct behavior is that with dynamicConfig set to true, the config sets kuberouter to nil if Calico is configured.
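A sketch of how that theory could be checked on a controller; the paths come from the manifests listing above, and whether they survive a Flatcar reboot is the assumption under test:

# Compare the manifest directories k0s reconciles, before and after a reboot
ls -la /var/lib/k0s/manifests/calico/ /var/lib/k0s/manifests/kuberouter/

# Confirm /var/lib/k0s actually sits on a persistent filesystem
findmnt -T /var/lib/k0s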

@CmdrSharp

Had no issues reproducing it. What I did notice is that the issue actually occurs before the control plane node comes back: it happens the second the first control plane node goes NotReady, behaving as if the state of the others never matched the first one, so when it leaves, the remaining control planes reconcile a different state.
In the test below I've simply left cp01 powered down, never booting it back up.
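For anyone retracing this, the swap can be caught live with something along these lines (a sketch):

# In one terminal: watch cp01 flip to NotReady
kubectl get nodes -w

# In another: the moment it flips, calico pods terminate and kube-router pods appear
kubectl -n kube-system get pods -w | grep -E 'calico|kube-router'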

k0sctl output during installation
⠀⣿⣿⡇⠀⠀⢀⣴⣾⣿⠟⠁⢸⣿⣿⣿⣿⣿⣿⣿⡿⠛⠁⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀█████████ █████████ ███
⠀⣿⣿⡇⣠⣶⣿⡿⠋⠀⠀⠀⢸⣿⡇⠀⠀⠀⣠⠀⠀⢀⣠⡆⢸⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀███          ███    ███
⠀⣿⣿⣿⣿⣟⠋⠀⠀⠀⠀⠀⢸⣿⡇⠀⢰⣾⣿⠀⠀⣿⣿⡇⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀███          ███    ███
⠀⣿⣿⡏⠻⣿⣷⣤⡀⠀⠀⠀⠸⠛⠁⠀⠸⠋⠁⠀⠀⣿⣿⡇⠈⠉⠉⠉⠉⠉⠉⠉⠉⢹⣿⣿⠀███          ███    ███
⠀⣿⣿⡇⠀⠀⠙⢿⣿⣦⣀⠀⠀⠀⣠⣶⣶⣶⣶⣶⣶⣿⣿⡇⢰⣶⣶⣶⣶⣶⣶⣶⣶⣾⣿⣿⠀█████████    ███    ██████████
k0sctl v0.15.0 Copyright 2022, k0sctl authors.
Anonymized telemetry of usage will be sent to the authors.
By continuing to use k0sctl you agree to these terms:
https://k0sproject.io/licenses/eula
INFO ==> Running phase: Connect to hosts
INFO [ssh] REDACTED.54:22: connected
INFO [ssh] REDACTED.35:22: connected
INFO [ssh] REDACTED.34:22: connected
INFO [ssh] REDACTED.37:22: connected
INFO [ssh] REDACTED.52:22: connected
INFO [ssh] REDACTED.36:22: connected
INFO [ssh] REDACTED.53:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] REDACTED.52:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.54:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.34:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.36:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.53:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.35:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO [ssh] REDACTED.37:22: is running Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] REDACTED.34:22: using calico-dev01-w01 as hostname
INFO [ssh] REDACTED.37:22: using calico-dev01-w04 as hostname
INFO [ssh] REDACTED.54:22: using calico-dev01-cp03 as hostname
INFO [ssh] REDACTED.52:22: using calico-dev01-cp01 as hostname
INFO [ssh] REDACTED.35:22: using calico-dev01-w02 as hostname
INFO [ssh] REDACTED.36:22: using calico-dev01-w03 as hostname
INFO [ssh] REDACTED.53:22: using calico-dev01-cp02 as hostname
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Gather k0s facts
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Download k0s on hosts
INFO [ssh] REDACTED.35:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.34:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.36:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.52:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.54:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.53:22: downloading k0s v1.27.3+k0s.0
INFO [ssh] REDACTED.37:22: downloading k0s v1.27.3+k0s.0
INFO ==> Running phase: Configure k0s
INFO [ssh] REDACTED.52:22: validating configuration
INFO [ssh] REDACTED.53:22: validating configuration
INFO [ssh] REDACTED.54:22: validating configuration
INFO [ssh] REDACTED.53:22: configuration was changed
INFO [ssh] REDACTED.52:22: configuration was changed
INFO [ssh] REDACTED.54:22: configuration was changed
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] REDACTED.52:22: installing k0s controller
INFO [ssh] REDACTED.52:22: waiting for the k0s service to start
INFO [ssh] REDACTED.52:22: waiting for kubernetes api to respond
INFO ==> Running phase: Install controllers
INFO [ssh] REDACTED.52:22: generating token
INFO [ssh] REDACTED.53:22: writing join token
INFO [ssh] REDACTED.53:22: installing k0s controller
INFO [ssh] REDACTED.53:22: starting service
INFO [ssh] REDACTED.53:22: waiting for the k0s service to start
INFO [ssh] REDACTED.53:22: waiting for kubernetes api to respond
INFO [ssh] REDACTED.52:22: generating token
INFO [ssh] REDACTED.54:22: writing join token
INFO [ssh] REDACTED.54:22: installing k0s controller
INFO [ssh] REDACTED.54:22: starting service
INFO [ssh] REDACTED.54:22: waiting for the k0s service to start
INFO [ssh] REDACTED.54:22: waiting for kubernetes api to respond
INFO ==> Running phase: Install workers
INFO [ssh] REDACTED.34:22: validating api connection to https://REDACTED.52:6443
INFO [ssh] REDACTED.35:22: validating api connection to https://REDACTED.52:6443
INFO [ssh] REDACTED.36:22: validating api connection to https://REDACTED.52:6443
INFO [ssh] REDACTED.37:22: validating api connection to https://REDACTED.52:6443
INFO [ssh] REDACTED.52:22: generating token
INFO [ssh] REDACTED.34:22: writing join token
INFO [ssh] REDACTED.35:22: writing join token
INFO [ssh] REDACTED.37:22: writing join token
INFO [ssh] REDACTED.36:22: writing join token
INFO [ssh] REDACTED.36:22: installing k0s worker
INFO [ssh] REDACTED.34:22: installing k0s worker
INFO [ssh] REDACTED.37:22: installing k0s worker
INFO [ssh] REDACTED.35:22: installing k0s worker
INFO [ssh] REDACTED.36:22: starting service
INFO [ssh] REDACTED.34:22: starting service
INFO [ssh] REDACTED.37:22: starting service
INFO [ssh] REDACTED.35:22: starting service
INFO [ssh] REDACTED.36:22: waiting for node to become ready
INFO [ssh] REDACTED.35:22: waiting for node to become ready
INFO [ssh] REDACTED.37:22: waiting for node to become ready
INFO [ssh] REDACTED.34:22: waiting for node to become ready
INFO ==> Running phase: Release exclusive host lock
INFO ==> Running phase: Disconnect from hosts
INFO ==> Finished in 2m19s
INFO k0s cluster version 1.27.3+k0s.0 is now installed
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using:
INFO      k0sctl kubeconfig
Pods prior to restart of control plane
NAMESPACE          NAME                                                READY   STATUS    RESTARTS   AGE
external-secrets   external-secrets-769df6c8cd-bktv5                   1/1     Running   0          3m40s
external-secrets   external-secrets-769df6c8cd-r72q6                   1/1     Running   0          3m40s
external-secrets   external-secrets-cert-controller-57b8c96ffb-r5r48   1/1     Running   0          3m40s
external-secrets   external-secrets-webhook-75c54b49d7-qpjtd           1/1     Running   0          3m40s
kube-system        calico-kube-controllers-6d48c8cf5c-rb7bh            1/1     Running   0          4m1s
kube-system        calico-node-9lkck                                   1/1     Running   0          3m32s
kube-system        calico-node-dfs2j                                   1/1     Running   0          3m43s
kube-system        calico-node-jfshz                                   1/1     Running   0          3m16s
kube-system        calico-node-l6d5n                                   1/1     Running   0          3m16s
kube-system        calico-node-mrknm                                   1/1     Running   0          3m16s
kube-system        calico-node-pqsmx                                   1/1     Running   0          3m19s
kube-system        calico-node-r9hj9                                   1/1     Running   0          3m16s
kube-system        coredns-878bb57ff-qkdk5                             1/1     Running   0          3m25s
kube-system        coredns-878bb57ff-wd58c                             1/1     Running   0          4m1s
kube-system        konnectivity-agent-8p2dm                            1/1     Running   0          3m16s
kube-system        konnectivity-agent-g76h6                            1/1     Running   0          3m19s
kube-system        konnectivity-agent-jrdnx                            1/1     Running   0          3m15s
kube-system        konnectivity-agent-kg8xx                            1/1     Running   0          3m16s
kube-system        konnectivity-agent-n8qr7                            1/1     Running   0          3m16s
kube-system        konnectivity-agent-szrc7                            1/1     Running   0          3m16s
kube-system        konnectivity-agent-x5n45                            1/1     Running   0          3m24s
kube-system        kube-proxy-5cn9q                                    1/1     Running   0          3m16s
kube-system        kube-proxy-8qlxt                                    1/1     Running   0          3m32s
kube-system        kube-proxy-c5b72                                    1/1     Running   0          3m16s
kube-system        kube-proxy-csftt                                    1/1     Running   0          3m16s
kube-system        kube-proxy-mjpk6                                    1/1     Running   0          3m43s
kube-system        kube-proxy-r86fx                                    1/1     Running   0          3m19s
kube-system        kube-proxy-v9vnd                                    1/1     Running   0          3m16s
kube-system        metrics-server-7f86dff975-zfw86                     1/1     Running   0          4m1s
kube-system        nllb-calico-dev01-cp01                              1/1     Running   0          2m19s
kube-system        nllb-calico-dev01-cp02                              1/1     Running   0          2m12s
kube-system        nllb-calico-dev01-cp03                              1/1     Running   0          2m6s
kube-system        nllb-calico-dev01-w01                               1/1     Running   0          2m3s
kube-system        nllb-calico-dev01-w02                               1/1     Running   0          113s
kube-system        nllb-calico-dev01-w03                               1/1     Running   0          106s
kube-system        nllb-calico-dev01-w04                               1/1     Running   0          2m10s
metallb            metallb-controller-5cd9b4944b-24tn8                 1/1     Running   0          3m41s
metallb            metallb-speaker-82qj7                               4/4     Running   0          2m32s
metallb            metallb-speaker-9kpwd                               4/4     Running   0          3m
metallb            metallb-speaker-bwdxd                               4/4     Running   0          2m32s
metallb            metallb-speaker-dxvdp                               4/4     Running   0          3m1s
metallb            metallb-speaker-tzwhd                               4/4     Running   0          2m58s
metallb            metallb-speaker-vvjtf                               4/4     Running   0          2m51s
metallb            metallb-speaker-xx2fd                               4/4     Running   0          2m59s
Nodes when cp01 is down
NAME                STATUS     ROLES           AGE     VERSION
calico-dev01-cp01   NotReady   control-plane   5m54s   v1.27.3+k0s
calico-dev01-cp02   Ready      control-plane   5m37s   v1.27.3+k0s
calico-dev01-cp03   Ready      control-plane   5m24s   v1.27.3+k0s
calico-dev01-w01    Ready      <none>          5m21s   v1.27.3+k0s
calico-dev01-w02    Ready      <none>          5m21s   v1.27.3+k0s
calico-dev01-w03    Ready      <none>          5m21s   v1.27.3+k0s
calico-dev01-w04    Ready      <none>          5m21s   v1.27.3+k0s
Note that this is of course queried by changing the API server URL in the kubeconfig to point at the second control plane node, since the first one is down.
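(The same can be done without editing the kubeconfig by overriding the server on the command line; a sketch, using cp02's address from the config above, which is covered by the API cert SANs:)

# Point kubectl at cp02's API server directly
kubectl --server=https://REDACTED.53:6443 get nodes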
Pods when cp01 is down
NAMESPACE          NAME                                                READY   STATUS              RESTARTS   AGE
external-secrets   external-secrets-769df6c8cd-4g2td                   0/1     Terminating         0          4m11s
external-secrets   external-secrets-769df6c8cd-dv2fx                   0/1     Terminating         1          4m11s
external-secrets   external-secrets-cert-controller-57b8c96ffb-8f6jz   0/1     Terminating         0          4m11s
external-secrets   external-secrets-webhook-75c54b49d7-nsnfj           0/1     Terminating         0          4m11s
kube-system        calico-kube-controllers-6d48c8cf5c-bl89b            1/1     Terminating         0          4m32s
kube-system        coredns-878bb57ff-74gq8                             1/1     Running             0          4m32s
kube-system        coredns-878bb57ff-xkh4k                             1/1     Running             0          3m56s
kube-system        konnectivity-agent-49v5n                            1/1     Running             0          3m47s
kube-system        konnectivity-agent-5rmcb                            0/1     ContainerCreating   0          51s
kube-system        konnectivity-agent-77ppm                            1/1     Running             0          3m47s
kube-system        konnectivity-agent-7xwqh                            1/1     Running             0          3m47s
kube-system        konnectivity-agent-h44mf                            1/1     Terminating         0          3m55s
kube-system        konnectivity-agent-ksmnd                            1/1     Running             0          3m47s
kube-system        konnectivity-agent-rd4t4                            1/1     Running             0          3m46s
kube-system        kube-proxy-28hck                                    1/1     Running             0          3m51s
kube-system        kube-proxy-2k7x5                                    1/1     Running             0          3m47s
kube-system        kube-proxy-9qbf7                                    1/1     Running             0          3m47s
kube-system        kube-proxy-d7476                                    1/1     Running             0          4m26s
kube-system        kube-proxy-j8np6                                    1/1     Running             0          4m4s
kube-system        kube-proxy-jhvvp                                    1/1     Running             0          3m47s
kube-system        kube-proxy-qlc4z                                    1/1     Running             0          3m47s
kube-system        kube-router-54pz9                                   0/1     Pending             0          51s
kube-system        kube-router-6rssr                                   1/1     Running             0          52s
kube-system        kube-router-d6jqt                                   1/1     Running             0          52s
kube-system        kube-router-gj5n4                                   1/1     Running             0          51s
kube-system        kube-router-kbmdw                                   1/1     Running             0          52s
kube-system        kube-router-qkhz2                                   1/1     Running             0          52s
kube-system        kube-router-qsvcw                                   1/1     Running             0          51s
kube-system        metrics-server-7f86dff975-9pjs8                     1/1     Running             0          4m32s
kube-system        nllb-calico-dev01-cp01                              1/1     Running             0          2m59s
kube-system        nllb-calico-dev01-cp02                              1/1     Running             0          2m40s
kube-system        nllb-calico-dev01-cp03                              1/1     Running             0          2m27s
kube-system        nllb-calico-dev01-w01                               1/1     Running             0          2m23s
kube-system        nllb-calico-dev01-w02                               1/1     Running             0          2m34s
kube-system        nllb-calico-dev01-w03                               1/1     Running             0          2m25s
kube-system        nllb-calico-dev01-w04                               1/1     Running             0          2m42s

@juanluisvaladas

Had no issues reproducing it. What I did notice is that the issue actually occurs before the control node comes back. It happens the second that the first control plane node goes NotReady

This breaks my theory. I'll reproduce this later today.

@juanluisvaladas

I tried reproducing this with Flatcar using both controller and controller+worker roles, and I couldn't reproduce it.

How exactly are you deploying Flatcar? I tried on AWS.

ip-172-31-4-243 ~ # cat /etc/os-release
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3510.2.5
VERSION_ID=3510.2.5
BUILD_ID=2023-07-14-1822
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3510.2.5 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3510.2.5:*:*:*:*:*:*:*"


juanluisvaladas commented Aug 2, 2023

I realized that in the Flatcar tests I accidentally forgot to enable dynamicConfig; it was not forgotten in the tests on other distributions. I'm retrying now.
Edit: Managed to reproduce this.


CmdrSharp commented Aug 2, 2023

I deploy Flatcar on VMware using Pulumi.

Ignition config looks something like this:

    const configuration = {
      "ignition": {
        "version": "3.3.0"
      },
      "passwd": {
        "users": [
          {
            "name": "redacted",
            "groups": [
              "sudo",
              "docker",
            ],
            "sshAuthorizedKeys": [
              "redacted"
            ]
          }
        ]
      },
      "storage": {
        "files": [
          {
            "filesystem": "root",
            "path": "/etc/hostname",
            "mode": 420,
            "contents": {
              "source": `data:,${this.hostname}`
            }
          },
          {
            "path": "/etc/flatcar/update.conf",
            "mode": 644,
            "contents": {
              "source": "data:,SERVER=disabled"
            }
          },
          {
            "path": "/etc/systemd/network/00-vmware.network",
            "contents": {
              "compression": "gzip",
              "source": `data:;base64,${networkConfiguration}`
            }
          }
        ]
      }
    };

NetworkConfig looks like this:

      [Match]
      Name=ens192
      [Network]
      DHCP=no
      DNS=1.1.1.1
      DNS=1.0.0.1
      [Address]
      Address=${this.ipv4}/${this.cidr}
      [Route]
      Destination=0.0.0.0/0
      Gateway=${this.gateway}

I'll go ahead and try this with dynamicConfig disabled, just to see if that changes anything!


CmdrSharp commented Aug 2, 2023

dynamicConfig is now disabled. Fresh installation:

NAME                STATUS   ROLES           AGE     VERSION       INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION     CONTAINER-RUNTIME
calico-dev01-cp01   Ready    control-plane   3m20s   v1.27.3+k0s   REDACTED.52   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-cp02   Ready    control-plane   2m54s   v1.27.3+k0s   REDACTED.53   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-cp03   Ready    control-plane   2m45s   v1.27.3+k0s   REDACTED.54   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-w01    Ready    <none>          2m41s   v1.27.3+k0s   REDACTED.34   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-w02    Ready    <none>          2m41s   v1.27.3+k0s   REDACTED.35   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-w03    Ready    <none>          2m41s   v1.27.3+k0s   REDACTED.36   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
calico-dev01-w04    Ready    <none>          2m41s   v1.27.3+k0s   REDACTED.37   <none>        Flatcar Container Linux by Kinvolk 3510.2.4 (Oklo)   5.15.117-flatcar   containerd://1.7.1
NAMESPACE          NAME                                                READY   STATUS    RESTARTS   AGE     IP              NODE                NOMINATED NODE   READINESS GATES
external-secrets   external-secrets-769df6c8cd-drfdg                   1/1     Running   0          5m2s    10.244.92.197   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-769df6c8cd-pv7td                   1/1     Running   0          5m2s    10.244.92.196   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-cert-controller-57b8c96ffb-wprrs   1/1     Running   0          5m2s    10.244.92.193   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-webhook-75c54b49d7-b5rxr           1/1     Running   0          5m2s    10.244.92.195   calico-dev01-w01    <none>           <none>
kube-system        calico-kube-controllers-6d48c8cf5c-2t5zs            1/1     Running   0          5m24s   10.244.210.3    calico-dev01-cp01   <none>           <none>
kube-system        calico-node-dcr5q                                   1/1     Running   0          4m38s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        calico-node-dqm9m                                   1/1     Running   0          4m38s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        calico-node-gmnnl                                   1/1     Running   0          5m17s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        calico-node-mbbc2                                   1/1     Running   0          4m38s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        calico-node-nbgf7                                   1/1     Running   0          4m42s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        calico-node-prwms                                   1/1     Running   0          4m51s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        calico-node-x4bk5                                   1/1     Running   0          4m38s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        coredns-878bb57ff-4zjhz                             1/1     Running   0          5m24s   10.244.210.1    calico-dev01-cp01   <none>           <none>
kube-system        coredns-878bb57ff-5c49s                             1/1     Running   0          4m48s   10.244.121.65   calico-dev01-cp02   <none>           <none>
kube-system        konnectivity-agent-59mcl                            1/1     Running   0          4m38s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        konnectivity-agent-6pj7j                            1/1     Running   0          4m42s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        konnectivity-agent-8756j                            1/1     Running   0          4m38s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        konnectivity-agent-dszls                            1/1     Running   0          4m36s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        konnectivity-agent-gbczp                            1/1     Running   0          4m38s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        konnectivity-agent-ks5z5                            1/1     Running   0          4m38s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        konnectivity-agent-xcpt5                            1/1     Running   0          4m46s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        kube-proxy-52l5g                                    1/1     Running   0          4m38s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        kube-proxy-c5scm                                    1/1     Running   0          4m38s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        kube-proxy-hkmqn                                    1/1     Running   0          4m42s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        kube-proxy-nnq9k                                    1/1     Running   0          5m17s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        kube-proxy-p62h7                                    1/1     Running   0          4m38s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        kube-proxy-rz6q8                                    1/1     Running   0          4m51s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        kube-proxy-v2qsb                                    1/1     Running   0          4m38s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        metrics-server-7f86dff975-xbwfg                     1/1     Running   0          5m24s   10.244.210.2    calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp01                              1/1     Running   0          3m59s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp02                              1/1     Running   0          4m50s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        nllb-calico-dev01-cp03                              1/1     Running   0          3m20s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        nllb-calico-dev01-w01                               1/1     Running   0          3m24s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        nllb-calico-dev01-w02                               1/1     Running   0          3m30s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        nllb-calico-dev01-w03                               1/1     Running   0          3m13s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        nllb-calico-dev01-w04                               1/1     Running   0          3m10s   REDACTED.37    calico-dev01-w04    <none>           <none>
metallb            metallb-controller-5cd9b4944b-79mwr                 1/1     Running   0          5m3s    10.244.92.194   calico-dev01-w01    <none>           <none>
metallb            metallb-speaker-87bgd                               4/4     Running   0          3m56s   REDACTED.35    calico-dev01-w02    <none>           <none>
metallb            metallb-speaker-9r8mt                               4/4     Running   0          4m11s   REDACTED.53    calico-dev01-cp02   <none>           <none>
metallb            metallb-speaker-h5nfj                               4/4     Running   0          3m57s   REDACTED.34    calico-dev01-w01    <none>           <none>
metallb            metallb-speaker-k2s7v                               4/4     Running   0          3m56s   REDACTED.37    calico-dev01-w04    <none>           <none>
metallb            metallb-speaker-l6ncz                               4/4     Running   0          4m      REDACTED.54    calico-dev01-cp03   <none>           <none>
metallb            metallb-speaker-lpws2                               4/4     Running   0          3m54s   REDACTED.36    calico-dev01-w03    <none>           <none>
metallb            metallb-speaker-n5qwj                               4/4     Running   0          4m50s   REDACTED.52    calico-dev01-cp01   <none>           <none>

Killing cp01 (guest OS shutdown) and waiting for it to become NotReady

calico-dev01-cp01   NotReady   control-plane

This time, the other workloads continue as expected with no issues. All is fine after booting the node back up and it becoming Ready again.

NAMESPACE          NAME                                                READY   STATUS        RESTARTS   AGE     IP              NODE                NOMINATED NODE   READINESS GATES
external-secrets   external-secrets-769df6c8cd-drfdg                   1/1     Running       0          9m      10.244.92.197   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-769df6c8cd-pv7td                   1/1     Running       0          9m      10.244.92.196   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-cert-controller-57b8c96ffb-wprrs   1/1     Running       0          9m      10.244.92.193   calico-dev01-w01    <none>           <none>
external-secrets   external-secrets-webhook-75c54b49d7-b5rxr           1/1     Running       0          9m      10.244.92.195   calico-dev01-w01    <none>           <none>
kube-system        calico-kube-controllers-6d48c8cf5c-2t5zs            1/1     Running       0          9m22s   10.244.210.3    calico-dev01-cp01   <none>           <none>
kube-system        calico-node-dcr5q                                   1/1     Running       0          8m36s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        calico-node-dqm9m                                   1/1     Running       0          8m36s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        calico-node-gmnnl                                   1/1     Running       0          9m15s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        calico-node-mbbc2                                   1/1     Running       0          8m36s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        calico-node-nbgf7                                   1/1     Running       0          8m40s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        calico-node-prwms                                   1/1     Running       0          8m49s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        calico-node-x4bk5                                   1/1     Running       0          8m36s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        coredns-878bb57ff-4zjhz                             1/1     Running       0          9m22s   10.244.210.1    calico-dev01-cp01   <none>           <none>
kube-system        coredns-878bb57ff-5c49s                             1/1     Running       0          8m46s   10.244.121.65   calico-dev01-cp02   <none>           <none>
kube-system        konnectivity-agent-59mcl                            1/1     Running       0          8m36s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        konnectivity-agent-6pj7j                            1/1     Running       0          8m40s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        konnectivity-agent-8756j                            1/1     Running       0          8m36s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        konnectivity-agent-dszls                            1/1     Running       0          8m34s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        konnectivity-agent-gbczp                            1/1     Running       0          8m36s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        konnectivity-agent-ks5z5                            1/1     Running       0          8m36s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        konnectivity-agent-xcpt5                            1/1     Terminating   0          8m44s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        kube-proxy-52l5g                                    1/1     Running       0          8m36s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        kube-proxy-c5scm                                    1/1     Running       0          8m36s   REDACTED.37    calico-dev01-w04    <none>           <none>
kube-system        kube-proxy-hkmqn                                    1/1     Running       0          8m40s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        kube-proxy-nnq9k                                    1/1     Running       0          9m15s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        kube-proxy-p62h7                                    1/1     Running       0          8m36s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        kube-proxy-rz6q8                                    1/1     Running       0          8m49s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        kube-proxy-v2qsb                                    1/1     Running       0          8m36s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        metrics-server-7f86dff975-xbwfg                     1/1     Running       0          9m22s   10.244.210.2    calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp01                              1/1     Running       0          7m57s   REDACTED.52    calico-dev01-cp01   <none>           <none>
kube-system        nllb-calico-dev01-cp02                              1/1     Running       0          8m48s   REDACTED.53    calico-dev01-cp02   <none>           <none>
kube-system        nllb-calico-dev01-cp03                              1/1     Running       0          7m18s   REDACTED.54    calico-dev01-cp03   <none>           <none>
kube-system        nllb-calico-dev01-w01                               1/1     Running       0          7m22s   REDACTED.34    calico-dev01-w01    <none>           <none>
kube-system        nllb-calico-dev01-w02                               1/1     Running       0          7m28s   REDACTED.35    calico-dev01-w02    <none>           <none>
kube-system        nllb-calico-dev01-w03                               1/1     Running       0          7m11s   REDACTED.36    calico-dev01-w03    <none>           <none>
kube-system        nllb-calico-dev01-w04                               1/1     Running       0          7m8s    REDACTED.37    calico-dev01-w04    <none>           <none>
metallb            metallb-controller-5cd9b4944b-79mwr                 1/1     Running       0          9m1s    10.244.92.194   calico-dev01-w01    <none>           <none>
metallb            metallb-speaker-87bgd                               4/4     Running       0          7m54s   REDACTED.35    calico-dev01-w02    <none>           <none>
metallb            metallb-speaker-9r8mt                               4/4     Running       0          8m9s    REDACTED.53    calico-dev01-cp02   <none>           <none>
metallb            metallb-speaker-h5nfj                               4/4     Running       0          7m55s   REDACTED.34    calico-dev01-w01    <none>           <none>
metallb            metallb-speaker-k2s7v                               4/4     Running       0          7m54s   REDACTED.37    calico-dev01-w04    <none>           <none>
metallb            metallb-speaker-l6ncz                               4/4     Running       0          7m58s   REDACTED.54    calico-dev01-cp03   <none>           <none>
metallb            metallb-speaker-lpws2                               4/4     Running       0          7m52s   REDACTED.36    calico-dev01-w03    <none>           <none>
metallb            metallb-speaker-n5qwj                               4/4     Running       0          8m48s   REDACTED.52    calico-dev01-cp01   <none>           <none>

The issue appears to be related specifically to dynamicConfig.
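For reference, the failing setup boils down to Calico as the CNI combined with dynamicConfig. A minimal k0sctl sketch (placeholder host address and cluster name, not our exact config):

```yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: calico-dev01
spec:
  hosts:
    - role: controller
      ssh:
        address: 10.0.0.52      # placeholder for the first control plane
  k0s:
    dynamicConfig: true         # the option this issue hinges on
    config:
      spec:
        network:
          provider: calico      # kuberouter manifests should never appear
```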

@juanluisvaladas
Contributor

Hi, I've fully identified the issue and worked on a fix. Also, now that I understand it, I managed to reproduce this on Ubuntu (it just takes longer than a simple reboot; a full poweroff and a few more seconds eventually make it happen as well).

@CmdrSharp
Contributor Author

Oh, good timing! Do you mind sharing a bit more about the issue for the curious? :)
Good job on the troubleshooting!

@CmdrSharp CmdrSharp changed the title from "Reboot of the first control plane node breaks Calico" to "Reboot of the first control plane with dynamicConfig breaks Calico & Helm Charts" Aug 2, 2023
@juanluisvaladas
Contributor

When using dynamicConfig, k0s always creates /var/lib/k0s/manifests/kuberouter

The reason a reboot doesn't trigger it on Ubuntu is, I guess, related to some lock stored in the filesystem which disappears on Flatcar. I literally rebooted it in a loop and went for lunch for about an hour, and didn't manage to reproduce it. A full poweroff and a couple of minutes gets it done.
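A quick way to see it on an affected controller, assuming the default data directory and that the dynamic config is stored as the ClusterConfig object named k0s in kube-system:

```sh
# Stray kube-router manifests showing up on a Calico cluster:
ls /var/lib/k0s/manifests/kuberouter/

# Check which provider the dynamically stored config actually recorded:
kubectl -n kube-system get clusterconfig k0s -o yaml | grep -A 3 'network:'
```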

There are at least three independent issues here:
1- The dynamic config is created with kuberouter even when it shouldn't be
2- The kuberouter component manager creates the manifests even when kuberouter is not in use
3- The manifests are not synchronized between masters. I guess k0sctl should do this on its first deployment.

I will solve 1 and 2, as these are very obviously wrong and undesired behavior, so I expect to fix them today and get the fixes merged this week or early next week. As for 3, I think we'll need some discussion.
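Roughly, the fix for 1 and 2 should amount to a guard like this (a simplified sketch; Network, ClusterConfig and reconcileKubeRouter are hypothetical stand-ins, not the actual k0s types):

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for k0s's cluster config types.
type Network struct{ Provider string }
type ClusterConfig struct{ Network Network }

// reconcileKubeRouter only emits kube-router manifests when kube-router
// is the active CNI, instead of writing them unconditionally.
func reconcileKubeRouter(cfg ClusterConfig) error {
	if cfg.Network.Provider != "kuberouter" {
		// Calico (or a custom CNI) is in use: leave the manifests alone.
		return nil
	}
	fmt.Println("writing manifests to /var/lib/k0s/manifests/kuberouter")
	return nil
}

func main() {
	// With provider "calico", nothing is written.
	_ = reconcileKubeRouter(ClusterConfig{Network: Network{Provider: "calico"}})
}
```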

Regarding the Helm charts issue: I don't know if it will be fixed immediately, as it's a part of the code I'm not very familiar with and I haven't looked into it deeply.

@CmdrSharp
Contributor Author

Interesting! When you did your tests with reboots, did you ever let the node become NotReady? That usually takes a little bit of time, and it's not until it happens that the issue pops up.

The manifests are not synchronized between masters. I guess k0sctl should do this on its first deployment.

This seems reasonable; either that, or a controller should generate them independently based on the ClusterConfig for each control plane.

Regarding the Helm charts issue: I don't know if it will be fixed immediately, as it's a part of the code I'm not very familiar with and I haven't looked into it deeply.

Understood. Is this something @mikhail-sakhnov might know more about?

@CmdrSharp
Contributor Author

@juanluisvaladas Should this really have been closed?

@juanluisvaladas
Contributor

No, I closed it accidentally

@juanluisvaladas
Contributor

juanluisvaladas commented Aug 2, 2023

OK, so status of things: the change in k0sctl ENTIRELY fixes the problem.

However, this still needs changes in k0s itself:
1- A node with dynamic configuration starts running component managers BEFORE having its configuration fully synchronized
2- A node in a cluster with dynamic configuration OVERRIDES the existing configuration if started incorrectly in some cases.

As for the Helm chart, I think the solution is not to start the component managers until the configuration is fully initialized.
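Something along these lines (a simplified sketch with hypothetical names; startWhenSynced and the configs channel are illustrations, not the actual k0s startup code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical stand-in for the dynamically synchronized cluster config.
type ClusterConfig struct{ Provider string }

// startWhenSynced blocks until the first dynamic config has been observed,
// instead of starting the component managers immediately with defaults.
func startWhenSynced(ctx context.Context, configs <-chan ClusterConfig) error {
	select {
	case cfg := <-configs: // first config seen: now it is safe to start
		fmt.Printf("starting component managers with provider %q\n", cfg.Provider)
		return nil
	case <-ctx.Done():
		return ctx.Err() // config never arrived: refuse to start with defaults
	}
}

func main() {
	configs := make(chan ClusterConfig, 1)
	configs <- ClusterConfig{Provider: "calico"} // simulate the sync completing
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = startWhenSynced(ctx, configs)
}
```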

In my test cluster, with the k0sctl from my PR, I see that the dynamicConfig doesn't have kuberouter and the manifests are generated the way anyone would expect.

@CmdrSharp
Contributor Author

Good stuff! I'll look forward to the changes being merged :)

kke pushed a commit to k0sproject/k0sctl that referenced this issue Aug 7, 2023
Every controller with dynamicConfig needs a properly configured
.spec.network; this is required because otherwise the component managers
for network components may start synchronizing before getting the
configuration dynamically.

Partially fixes k0sproject/k0s#3304

This doesn't negatively impact worker nodes.

Signed-off-by: Juan Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
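In other words, every controller is now fed a .spec.network section up front, roughly like this (an illustrative sketch; the CIDRs shown are the k0s defaults, not necessarily this cluster's values):

```yaml
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
metadata:
  name: k0s
  namespace: kube-system
spec:
  network:
    provider: calico
    podCIDR: 10.244.0.0/16     # k0s default
    serviceCIDR: 10.96.0.0/12  # k0s default
```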
@CmdrSharp
Contributor Author

Not sure this should've been closed @juanluisvaladas @kke

@kke
Contributor

kke commented Aug 9, 2023

GitHub triggered on "Partially fixes"

@kke kke reopened this Aug 9, 2023
@github-actions
Contributor

github-actions bot commented Sep 8, 2023

The issue is marked as stale since no activity has been recorded in 30 days

@github-actions github-actions bot added the Stale label Sep 8, 2023
@kke kke removed the Stale label Sep 11, 2023
@github-actions
Contributor

The issue is marked as stale since no activity has been recorded in 30 days

@github-actions github-actions bot added the Stale label Oct 11, 2023
@kke kke removed the Stale label Oct 12, 2023
@github-actions
Contributor

The issue is marked as stale since no activity has been recorded in 30 days

@github-actions github-actions bot added the Stale label Nov 11, 2023
@github-actions github-actions bot closed this as not planned Nov 18, 2023
@juanluisvaladas
Copy link
Contributor

Hi, I think this can be entirely closed now.
The issue is still theoretically possible to reproduce if all 3 k0s control planes are bootstrapped at the same time with dynamic config and different network/Helm configurations, but both k0sctl and k0smotron take care of this.

I guess it could theoretically happen if someone wanted to automate k0s deployments themselves, but I think it's an edge case that should be taken care of in the automation itself.
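For anyone rolling their own automation, the safe sequence mirrors what k0sctl does: bring up the first controller with the full config, let the dynamic config get created, and only then join the remaining controllers. A sketch with real k0s commands (the config path is illustrative):

```sh
# On the first controller: install with the complete cluster config
k0s install controller --enable-dynamic-config -c /etc/k0s/k0s.yaml
k0s start

# Only after the ClusterConfig object exists, create join tokens
# and bring up the remaining controllers with them.
k0s token create --role controller
```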
