
Cilium crash with Talos 1.8.1 after operator rollout #9

Closed

rducom opened this issue Oct 30, 2024 · 2 comments

rducom commented Oct 30, 2024

Bug Report

Cilium is crashing after the rollout with the kernel message [CRIT] Dead loop on virtual device cilium_vxlan, fix it urgently!
I didn't find a way to restore the system beyond this point (except flashing again).

Minimal repro:

  • Flash the 1.8.1 build on RK1 (f4bcf5c)
  • Usual Talos bootstrap sequence
  • cilium install --version 1.16.3
  • once installed, just kubectl rollout restart ds cilium -n kube-system

Then all nodes (control plane and workers) go into the dead loop.
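For reference, the "usual Talos bootstrap sequence" above typically looks like the commands below (a sketch only: the cluster name, endpoint URL and patch.yaml file name are placeholders; the node IP is the one from the environment section; patch.yaml stands for the machine config shown under Environment):

talosctl gen config turing-cluster https://192.168.1.198:6443 --config-patch @patch.yaml
talosctl apply-config --insecure -n 192.168.1.198 --file controlplane.yaml
talosctl bootstrap -n 192.168.1.198 -e 192.168.1.198
talosctl kubeconfig -n 192.168.1.198 -e 192.168.1.198
cilium install --version 1.16.3
# the restart below is the step that triggers the dead loop
kubectl rollout restart ds cilium -n kube-system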

Description

Logs

a loop of:

kern: crit: [2024-10-30T23:49:59.75932735Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:49:59.79143835Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:00.01505335Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:01.03912835Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: warning: [2024-10-30T23:50:03.08683535Z]: net_ratelimit: 7 callbacks suppressed
kern: crit: [2024-10-30T23:50:03.08685035Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:03.75894435Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
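(For reference, this kernel log loop can be streamed from a workstation with the standard Talos command below; the node IP is the one from the environment section.)

talosctl dmesg --follow -n 192.168.1.198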

Environment

  • machine config:
machine:
  install:
    disk: /dev/mmcblk0
  kernel:
    modules:
        - name: rockchip-cpufreq
cluster:
  allowSchedulingOnControlPlanes: true
  network:
    cni:
      name: none
  proxy:
    disabled: true
  • Talos version:
Server:
	NODE:        192.168.1.198
	Tag:         v1.8.1-f4bcf5cb27
	SHA:         477752fe
	Built:
	Go version:  go1.22.8
	OS/Arch:     linux/arm64
	Enabled:     RBAC
  • Platform: Turing PI 2.5 / RK1
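(The version block above matches the output of talosctl version; for completeness, the running machine config can also be read back from the node. Both are stock talosctl commands, node IP from this report.)

talosctl version -n 192.168.1.198
talosctl get machineconfig -n 192.168.1.198 -o yaml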

rducom commented Oct 31, 2024

Other users are having the same issue upstream: siderolabs#9102
Nico, since it’s not specific to your fork, I'm closing the issue here.

rducom closed this as completed Oct 31, 2024

rducom commented Nov 1, 2024

For others, here's a working machine config patch:

machine:
  install:
    disk: /dev/mmcblk0
  kernel:
    modules:
        - name: rockchip-cpufreq
cluster:
  etcd:
      advertisedSubnets:
          - 192.168.1.0/24
  allowSchedulingOnControlPlanes: true
  network:
    cni:
      name: none
    podSubnets:
      - 10.42.0.0/20
    serviceSubnets:
      - 10.42.16.0/20
  proxy:
    disabled: true
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          exemptions:
            namespaces:
              - cilium-test-1
              - rook-ceph
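If the cluster is already provisioned, the patch can also be merged into an existing generated config and pushed to the node (a sketch using standard talosctl subcommands; the file names are placeholders, and the subnet changes require a reboot):

talosctl machineconfig patch controlplane.yaml --patch @patch.yaml -o controlplane-patched.yaml
talosctl apply-config -n 192.168.1.198 --file controlplane-patched.yaml --mode reboot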

And install Cilium with:

helm repo add cilium https://helm.cilium.io/
helm repo update cilium
CILIUM_LATEST=$(helm search repo cilium --versions --output yaml | yq '.[0].version')
helm install cilium cilium/cilium \
    --version ${CILIUM_LATEST} \
    --namespace kube-system \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set l2announcements.enabled=true \
    --set kubeProxyReplacement=true \
    --set loadBalancer.acceleration=native \
    --set k8sServiceHost=127.0.0.1 \
    --set k8sServicePort=7445 \
    --set bpf.masquerade=true \
    --set ingressController.enabled=true \
    --set ingressController.default=true \
    --set ingressController.loadbalancerMode=dedicated \
    --set ipam.mode=cluster-pool \
    --set ipam.operator.clusterPoolIPv4PodCIDRList="10.42.32.0/20" \
    --set hubble.relay.enabled=true \
    --set hubble.ui.enabled=true \
    --set gatewayAPI.enabled=true \
    --set bgpControlPlane.enabled=true
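To verify that a rollout restart no longer kills the nodes, something along these lines should work (standard cilium-cli / kubectl / talosctl commands, not part of the original comment):

cilium status --wait
kubectl rollout restart ds cilium -n kube-system
kubectl rollout status ds cilium -n kube-system
# the kernel log should stay free of the dead loop message
talosctl dmesg -n 192.168.1.198 | grep -i "dead loop"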
