Broken "On-Premise" Installation on a Fresh Kubernetes 1.22 Cluster #4875

Closed
pmyjavec opened this issue Aug 27, 2021 · 13 comments
Closed

Broken "On-Premise" Installation on a Fresh Kubernetes 1.22 Cluster #4875

pmyjavec opened this issue Aug 27, 2021 · 13 comments

Comments

@pmyjavec

When following the official on-premises documentation on a freshly provisioned Kubernetes cluster created with kubeadm, Calico has issues and the CoreDNS containers can no longer start.

When using the instructions in the quick start guide instead, Calico works fine and the cluster appears healthy.

Expected Behavior

I would expect the on-premises installation instructions to work on a new cluster.

Current Behavior

After installing Calico, none of the Calico containers can start, and the logs contain the following:

# less /var/log/calico/cni/cni.log  | head -n 1
2021-08-26 10:04:13.832 [ERROR][10155] plugin.go 120: Final result of CNI ADD was an error. error=stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
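For anyone else hitting this error, a quick way to check whether calico/node has actually written the file the CNI plugin is looking for (a sketch; run on the affected node, and the namespace/label assume the kube-system manifest install):

# ls -l /var/lib/calico/nodename
# cat /var/lib/calico/nodename
# kubectl get pods -n kube-system -l k8s-app=calico-node -o wide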

Logs from the pods in /var/log/pods:

/var/log/pods# find kube-system_calico-node-zvhg6_3dc62e8e-939a-4aab-b2ea-6a4b42f57437 -name "*log" -exec cat {} \;
{"log":"2021-08-26 10:04:09.286 [INFO][1] ipam_plugin.go 75: migrating from host-local to calico-ipam...\n","stream":"stderr","time":"2021-08-26T10:04:09.287157318Z"}
{"log":"2021-08-26 10:04:09.287 [INFO][1] migrate.go 66: checking host-local IPAM data dir dir existence...\n","stream":"stderr","time":"2021-08-26T10:04:09.288229304Z"}
{"log":"2021-08-26 10:04:09.288 [INFO][1] migrate.go 68: host-local IPAM data dir dir not found; no migration necessary, successfully exiting...\n","stream":"stderr","time":"2021-08-26T10:04:09.288242913Z"}
{"log":"2021-08-26 10:04:09.288 [INFO][1] ipam_plugin.go 105: migration from host-local to calico-ipam complete node=\"k8s-master-0\"\n","stream":"stderr","time":"2021-08-26T10:04:09.288269143Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Running as a Kubernetes pod\" source=\"install.go:140\"\n","stream":"stderr","time":"2021-08-26T10:04:10.014410311Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/bandwidth\"\n","stream":"stderr","time":"2021-08-26T10:04:10.035669025Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/calico\"\n","stream":"stderr","time":"2021-08-26T10:04:10.167915647Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/calico-ipam\"\n","stream":"stderr","time":"2021-08-26T10:04:10.313007299Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/flannel\"\n","stream":"stderr","time":"2021-08-26T10:04:10.325279607Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/host-local\"\n","stream":"stderr","time":"2021-08-26T10:04:10.334338052Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/install\"\n","stream":"stderr","time":"2021-08-26T10:04:10.467571637Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/loopback\"\n","stream":"stderr","time":"2021-08-26T10:04:10.478684753Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/portmap\"\n","stream":"stderr","time":"2021-08-26T10:04:10.489711181Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/tuning\"\n","stream":"stderr","time":"2021-08-26T10:04:10.507658006Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Wrote Calico CNI binaries to /host/opt/cni/bin\\n\"\n","stream":"stderr","time":"2021-08-26T10:04:10.507683598Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"CNI plugin version: v3.20.0\\n\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533135325Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"/host/secondary-bin-dir is not writeable, skipping\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533159785Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Using CNI config template from CNI_NETWORK_CONFIG environment variable.\" source=\"install.go:306\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533395549Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Created /host/etc/cni/net.d/10-calico.conflist\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533770069Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Done configuring CNI.  Sleep= false\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533780014Z"}
{"log":"{\n","stream":"stdout","time":"2021-08-26T10:04:10.533781139Z"}
{"log":"  \"name\": \"k8s-pod-network\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533791926Z"}
{"log":"  \"cniVersion\": \"0.3.1\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533794768Z"}
{"log":"  \"plugins\": [\n","stream":"stdout","time":"2021-08-26T10:04:10.533797355Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533799773Z"}
{"log":"      \"type\": \"calico\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533803278Z"}
{"log":"      \"log_level\": \"info\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533805816Z"}
{"log":"      \"log_file_path\": \"/var/log/calico/cni/cni.log\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533808381Z"}
{"log":"      \"datastore_type\": \"kubernetes\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533810978Z"}
{"log":"      \"nodename\": \"k8s-master-0\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533813453Z"}
{"log":"      \"mtu\": 0,\n","stream":"stdout","time":"2021-08-26T10:04:10.533815935Z"}
{"log":"      \"ipam\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533818377Z"}
{"log":"          \"type\": \"calico-ipam\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533820763Z"}
{"log":"      },\n","stream":"stdout","time":"2021-08-26T10:04:10.533823259Z"}
{"log":"      \"policy\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533825662Z"}
{"log":"          \"type\": \"k8s\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533828111Z"}
{"log":"      },\n","stream":"stdout","time":"2021-08-26T10:04:10.533837283Z"}
{"log":"      \"kubernetes\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533839839Z"}
{"log":"          \"kubeconfig\": \"/etc/cni/net.d/calico-kubeconfig\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533842282Z"}
{"log":"      }\n","stream":"stdout","time":"2021-08-26T10:04:10.533845077Z"}
{"log":"    },\n","stream":"stdout","time":"2021-08-26T10:04:10.533847502Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533849857Z"}
{"log":"      \"type\": \"portmap\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533852243Z"}
{"log":"      \"snat\": true,\n","stream":"stdout","time":"2021-08-26T10:04:10.533854819Z"}
{"log":"      \"capabilities\": {\"portMappings\": true}\n","stream":"stdout","time":"2021-08-26T10:04:10.533857293Z"}
{"log":"    },\n","stream":"stdout","time":"2021-08-26T10:04:10.533859816Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533862198Z"}
{"log":"      \"type\": \"bandwidth\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533864587Z"}
{"log":"      \"capabilities\": {\"bandwidth\": true}\n","stream":"stdout","time":"2021-08-26T10:04:10.533867171Z"}
{"log":"    }\n","stream":"stdout","time":"2021-08-26T10:04:10.533869734Z"}
{"log":"  ]\n","stream":"stdout","time":"2021-08-26T10:04:10.533872098Z"}
{"log":"}\n","stream":"stdout","time":"2021-08-26T10:04:10.533874563Z"}

The CoreDNS pods / containers also have trouble starting.

Possible Solution

Not sure yet.

Steps to Reproduce (for bugs)

  1. Provision an Ubuntu 18 host.
  2. Install Kubernetes and kubeadm.
  3. Create a Kubernetes 1.22.1 cluster with # kubeadm init --pod-network-cidr=192.168.0.0/16
  4. Install Calico with these instructions (see the sketch below).
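For step 4, a minimal sketch of what the on-premises install amounts to, assuming the manifest-based variant (the manifest URL is the one referenced later in this thread):

# kubeadm init --pod-network-cidr=192.168.0.0/16
# kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
# kubectl get pods -n kube-system -l k8s-app=calico-node -w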

Context

Your Environment

  • Calico version: 3.20
  • Orchestrator version: Kubernetes 1.22.1
  • Operating System and version: Ubuntu 18
@jeliseocd

Same problem upgrading from Kubernetes 1.21.2 to 1.22.1

CoreDNS no longer works, but that is because there is no longer communication between pods on different nodes. So a pod on node X can't reach a CoreDNS pod deployed on node Y.

It seems that the problem is in Calico 3.20.

Calico 3.19.1 works fine with the same configuration on Kubernetes 1.21.2.

I will try upgrading Kubernetes without updating Calico to see if it works.
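A quick way to reproduce the cross-node symptom is to start a throwaway pod and ping a CoreDNS pod scheduled on a different node (a sketch; the pod name and target IP are illustrative):

# kubectl run pinger --image=busybox --restart=Never -- sleep 3600
# kubectl get pods -A -o wide
# kubectl exec pinger -- ping -c 3 <coredns-pod-ip-on-another-node>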

@jeliseocd

Hi again!

I've upgraded my Kubernetes installation with kubeadm from version 1.21.2 to 1.22.1 without upgrading Calico (still version 3.19.1), and the pod networking works OK.

So I'm afraid the problem is in Calico version 3.20.

@pmyjavec
Author

@jeliseocd thanks for adding some extra info to my original bug report; I've been a bit low on time.

@caseydavenport
Member

@jeliseocd @pmyjavec could you share the logs from one of your calico/node pods? The logs in the OP appear to be from the init container and not from the calico-node container, which is where I would expect to see relevant diags.

e.g. kubectl logs -n kube-system calico-node-xxxx

I am guessing this is related to v1.22 - I tried those steps on a v1.21 cluster and it appears to be working fine. Will try again on v1.22 to see if I can repro.
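For reference, selecting the container explicitly avoids picking up init-container output (a sketch; container names assume the standard calico.yaml manifest):

# kubectl logs -n kube-system calico-node-xxxx -c calico-node
# kubectl logs -n kube-system calico-node-xxxx -c install-cni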

@caseydavenport
Member

Which manifest from that page are you guys using?

@jeliseocd

jeliseocd commented Sep 1, 2021

tigera-operator.yaml.old.zip
This is the Calico 3.19.1 manifest used in both installations, Kubernetes 1.21.2 and the 1.22.1 upgrade, together with the following custom-resources.yaml:

---
# This section includes base Calico installation configuration.
# For more information, see: https://docs.projectcalico.org/v3.18/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.96.0.0/12
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4:
      canReach: 172.26.0.10
--- 

This works fine.

Calico version 3.20 was downloaded from: https://docs.projectcalico.org/archive/v3.20/manifests/tigera-operator.yaml
With this manifest, "kubectl get nodes -o wide" shows STATUS Ready for all the nodes and "kubectl -n calico-system get pods -o wide" shows READY 1/1 for all the Calico pods, but after a few seconds there is no connectivity between pods on different nodes.

In a while I will update Calico to version 3.20 again to upload the logs.

@AlanHohn

AlanHohn commented Sep 1, 2021

I'm seeing what I think is a related problem and did some testing with various versions. Testing was primarily on Ubuntu 21.04 (Hirsute) on Vagrant, but I also tested the first case with 20.04 (Focal) on AWS for another data point, with the same results. All clusters are three-node HA clusters installed via kubeadm. Calico is installed using kubectl straight from the URL.

Calico 3.20 / k8s 1.22.1

Not working. This is where I started.

I first observed an issue with services not being able to find endpoints. When I ran ip r after initial install, I briefly saw the inter-node routes. During that interval, I could ping pod IP addresses on other nodes. However, after a minute or so, those routes were removed and ping of a pod on a different node stopped working. The routes then came and went every few minutes.

I captured this log message which was repeated in all of my calico-node pods:

bird: Mesh_172_31_1_11: Invalid NEXT_HOP attribute in route 192.168.239.192/26
bird: Mesh_172_31_1_13: Invalid NEXT_HOP attribute in route 192.168.25.192/26
2021-09-01 23:09:10.936 [INFO][88] felix/route_table.go 876: Remove old route dest=192.168.25.192/26 ifaceName="enp0s8" ifaceRegex="" ipVersion=0x4 routeProblems=[]string{"unexpected route", "incorrect gateway"}
2021-09-01 23:09:10.937 [INFO][88] felix/route_table.go 876: Remove old route dest=192.168.239.192/26 ifaceName="enp0s8" ifaceRegex="" ipVersion=0x4 routeProblems=[]string{"unexpected route", "incorrect gateway"}

While the routes existed, they appeared to be set up as I would expect (the next hop was the correct IP address for the other node).
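For anyone trying to observe the same flapping, watching the routes on a node makes it obvious (a sketch; Calico's BIRD-programmed routes are normally tagged with proto bird):

# watch -n 2 "ip route show proto bird"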

Calico 3.20 / k8s 1.21.4

Not working. Same route removal behavior as Calico 3.20 / k8s 1.22.1. Same log messages in calico-node as well.

Calico 3.19 / k8s 1.22.1

Not working, for a new reason. Calico never gets installed as the tigera-operator pod is stuck in CrashLoopBackOff:

{"level":"error","ts":1630539944.5364842,"logger":"setup","msg":"problem running manager","error":"no matches for kind \"APIService\" in version \"apiregistration.k8s.io/v1beta1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nmain.main\n\t/go/src/github.com/tigera/operator/main.go:228\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}

This is not surprising, as the v1beta1 apiregistration.k8s.io API was removed in Kubernetes 1.22.
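A quick way to confirm this on a given cluster is to list the apiregistration versions the API server still serves; on 1.22 only v1 should remain (a sketch):

# kubectl api-versions | grep apiregistration
apiregistration.k8s.io/v1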

Calico 3.19 / k8s 1.21.4

Success! Everything works as expected. I don't see the above messages in the calico-node log.

Hope this helps. Let me know if there's any additional data I can pull from any of these combinations.

@pmyjavec
Author

pmyjavec commented Sep 2, 2021

@caseydavenport,

# kubectl logs -n kube-system calico-node-wfsd9
failed to try resolving symlinks in path "/var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/7.log": lstat /var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/7.log: no such file or directory

The parent directory is created fine; I don't know why it's referring to issues resolving symlinks:

# file /var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/

/var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/: directory

@ivanovpavel1983

(quoting @AlanHohn's version-matrix comment above in full)

I see the same issue in a k8s cluster v1.20.4 with RHEL 7.9 nodes.
Calico 3.19.2 works fine; with 3.20 I get:
Invalid NEXT_HOP attribute in route....

@Flou21

Flou21 commented Sep 4, 2021

I have the same problem with Kubernetes 1.22.1, CentOS 8, and Calico 3.20.
Kubernetes 1.21.3 with Calico 3.19 works fine.

In the Calico pod logs I don't see errors, but it definitely seems like a networking issue related to Calico. The deployed pods can't communicate with each other.

I don't know if my problem is related to this, but it is also a fresh 1.22 Kubernetes cluster.
I already opened a Kubernetes issue because my initial thought was that it was a Kubernetes bug: kubernetes/kubernetes#104738.

@knaou

knaou commented Sep 9, 2021

I had a nearly identical problem with Kubernetes 1.22.1, Debian 11, and Calico 3.20.
I installed Calico into my on-premises cluster from https://docs.projectcalico.org/manifests/calico.yaml

There was an interesting event on one of the calico-kube-controllers-* pods:

 Warning  FailedCreatePodSandBox  0s (x5 over 54s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": failed to find plugin "loopback" in path [/usr/lib/cni]  

It seems to look for plugins in the wrong path, /usr/lib/cni; the loopback plugin is in the default path (/opt/cni/bin).
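If the runtime is containerd, the CNI binary directory it uses comes from its CRI plugin config, so one way to check where that /usr/lib/cni is coming from is (a sketch; assumes a default containerd install):

# containerd config dump | grep -A 3 'plugins."io.containerd.grpc.v1.cri".cni'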

@coutinhop
Contributor

(quoting @pmyjavec's earlier comment with the failing kubectl logs output)

@pmyjavec it seems like the "resolving symlinks" part of that message is a bit misleading; the real issue is that there is no "7.log" file in that dir. Did you by any chance change logging drivers? I ask because googling for this gave me this result: https://stackoverflow.com/questions/63028034/kubernetes-pod-logging-broken-with-journald-logging-driver

Maybe passing a --log-dir arg to kubectl logs would let you get the logs? Could you try it?

@pmyjavec
Author

We solved this issue, at least for our use case. The problem was that we needed to comment out the following mountPropagation: Bidirectional setting in the manifest:

              # Bidirectional means that, if we mount the BPF filesystem at /sys/fs/bpf it will propagate to the host.
              # If the host is known to mount that filesystem already then Bidirectional can be omitted.
              mountPropagation: Bidirectional

One thing I didn't make clear is that, while this was a fresh cluster, it was hosted inside an LXD container. It seems /sys/fs/bpf was already mounted for us.
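For anyone else running nodes inside LXD or a similar nested environment, a quick check before editing the manifest is whether the BPF filesystem is already mounted on the host (a sketch):

# mountpoint /sys/fs/bpf
# mount | grep bpf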

Thanks for the help, feel free to close this issue.
