CrashLoopBackOff on calico pods caused by loopback interface being down #4257

@MridulS

Description

We run a Kubernetes cluster with Calico, and sporadically (roughly once a month) the cluster goes down. The calico-node pod on one of the nodes goes into a perpetual CrashLoopBackOff state, and that node's loopback interface is down (the node can't even ping itself).

One thing to note is that we run around 200 pods on that node.

Expected Behavior

The cluster should remain stable: the calico-node pods should keep running and the node's loopback interface should stay up.

Current Behavior

Output of ip address show lo on the node:

lo: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN group default qlen 1000

Output of calicoctl node status on the node:

$ sudo calicoctl node status
Calico process is not running.
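
For anyone who wants to confirm this condition on a node, a minimal check of our own (not part of Calico): note that a healthy loopback usually reports "state UNKNOWN" rather than "UP", so testing for DOWN explicitly is the reliable check.

$ # A healthy lo usually reports "state UNKNOWN", so test for DOWN explicitly
$ if ip link show lo | grep -q 'state DOWN'; then echo "loopback is down on $(hostname)"; fi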

Possible Solution

What we do to fix this is run ip link set dev lo up to manually bring the loopback interface back up, and then restart the pod; see the sketch below.
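
For reference, the manual recovery looks roughly like this. The pod name is a placeholder, and the namespace depends on how Calico was installed:

$ # Bring the loopback interface back up on the affected node
$ sudo ip link set dev lo up

$ # Restart the crashing calico-node pod so it starts with a working loopback
$ # (pod name is a placeholder; adjust the namespace to your install)
$ kubectl delete pod -n kube-system calico-node-xxxxx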

Steps to Reproduce (for bugs)

The issue pops up sporadically.

Context

Our guess is that this has something to do with us running more than the kubelet's default limit of 110 pods on the node; a way to check is sketched below.
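
To compare a node's advertised pod capacity with what is actually scheduled onto it, something like this should work (<node-name> is a placeholder):

$ # Pod capacity the kubelet advertises for the node (110 by default)
$ kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'

$ # Number of pods currently scheduled onto that node
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> --no-headers | wc -l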

Your Environment

  • Calico version: 3.17.0
  • Orchestrator version: Kubernetes 1.19.4
  • Operating System and version: Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-126-generic x86_64)
  • Link to your project (optional): https://github.com/gesiscss/orc
