Broken "On-Premise" Installation on a Fresh Kubernetes 1.22 Cluster #4875

Closed
pmyjavec opened this issue Aug 27, 2021 · 13 comments
Closed

Broken "On-Premise" Installation on a Fresh Kubernetes 1.22 Cluster #4875

pmyjavec opened this issue Aug 27, 2021 · 13 comments

Comments

@pmyjavec

When following the official on-premises documentation on a freshly provisioned Kubernetes cluster created with kubeadm, Calico has issues and the CoreDNS containers can no longer start.

When using the instructions in the quick start guide instead, Calico works fine and the cluster appears healthy.

Expected Behavior

I would expect the on-premises installation instructions to work on a new cluster.

Current Behavior

After installing Calico, none of the Calico containers can start, and the logs contain the following:

# less /var/log/calico/cni/cni.log  | head -n 1
2021-08-26 10:04:13.832 [ERROR][10155] plugin.go 120: Final result of CNI ADD was an error. error=stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
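For anyone else hitting this error, a quick way to check whether calico/node has actually written the file the CNI plugin is looking for (a sketch; run on the affected node, and the namespace/label assume the kube-system manifest install):

# ls -l /var/lib/calico/nodename
# cat /var/lib/calico/nodename
# kubectl get pods -n kube-system -l k8s-app=calico-node -o wide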

Logs from the pods in /var/log/pods:

/var/log/pods# find kube-system_calico-node-zvhg6_3dc62e8e-939a-4aab-b2ea-6a4b42f57437 -name "*log" -exec cat {} \;
{"log":"2021-08-26 10:04:09.286 [INFO][1] ipam_plugin.go 75: migrating from host-local to calico-ipam...\n","stream":"stderr","time":"2021-08-26T10:04:09.287157318Z"}
{"log":"2021-08-26 10:04:09.287 [INFO][1] migrate.go 66: checking host-local IPAM data dir dir existence...\n","stream":"stderr","time":"2021-08-26T10:04:09.288229304Z"}
{"log":"2021-08-26 10:04:09.288 [INFO][1] migrate.go 68: host-local IPAM data dir dir not found; no migration necessary, successfully exiting...\n","stream":"stderr","time":"2021-08-26T10:04:09.288242913Z"}
{"log":"2021-08-26 10:04:09.288 [INFO][1] ipam_plugin.go 105: migration from host-local to calico-ipam complete node=\"k8s-master-0\"\n","stream":"stderr","time":"2021-08-26T10:04:09.288269143Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Running as a Kubernetes pod\" source=\"install.go:140\"\n","stream":"stderr","time":"2021-08-26T10:04:10.014410311Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/bandwidth\"\n","stream":"stderr","time":"2021-08-26T10:04:10.035669025Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/calico\"\n","stream":"stderr","time":"2021-08-26T10:04:10.167915647Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/calico-ipam\"\n","stream":"stderr","time":"2021-08-26T10:04:10.313007299Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/flannel\"\n","stream":"stderr","time":"2021-08-26T10:04:10.325279607Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/host-local\"\n","stream":"stderr","time":"2021-08-26T10:04:10.334338052Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/install\"\n","stream":"stderr","time":"2021-08-26T10:04:10.467571637Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/loopback\"\n","stream":"stderr","time":"2021-08-26T10:04:10.478684753Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/portmap\"\n","stream":"stderr","time":"2021-08-26T10:04:10.489711181Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Installed /host/opt/cni/bin/tuning\"\n","stream":"stderr","time":"2021-08-26T10:04:10.507658006Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Wrote Calico CNI binaries to /host/opt/cni/bin\\n\"\n","stream":"stderr","time":"2021-08-26T10:04:10.507683598Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"CNI plugin version: v3.20.0\\n\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533135325Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"/host/secondary-bin-dir is not writeable, skipping\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533159785Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Using CNI config template from CNI_NETWORK_CONFIG environment variable.\" source=\"install.go:306\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533395549Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Created /host/etc/cni/net.d/10-calico.conflist\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533770069Z"}
{"log":"time=\"2021-08-26T10:04:10Z\" level=info msg=\"Done configuring CNI.  Sleep= false\"\n","stream":"stderr","time":"2021-08-26T10:04:10.533780014Z"}
{"log":"{\n","stream":"stdout","time":"2021-08-26T10:04:10.533781139Z"}
{"log":"  \"name\": \"k8s-pod-network\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533791926Z"}
{"log":"  \"cniVersion\": \"0.3.1\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533794768Z"}
{"log":"  \"plugins\": [\n","stream":"stdout","time":"2021-08-26T10:04:10.533797355Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533799773Z"}
{"log":"      \"type\": \"calico\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533803278Z"}
{"log":"      \"log_level\": \"info\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533805816Z"}
{"log":"      \"log_file_path\": \"/var/log/calico/cni/cni.log\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533808381Z"}
{"log":"      \"datastore_type\": \"kubernetes\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533810978Z"}
{"log":"      \"nodename\": \"k8s-master-0\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533813453Z"}
{"log":"      \"mtu\": 0,\n","stream":"stdout","time":"2021-08-26T10:04:10.533815935Z"}
{"log":"      \"ipam\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533818377Z"}
{"log":"          \"type\": \"calico-ipam\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533820763Z"}
{"log":"      },\n","stream":"stdout","time":"2021-08-26T10:04:10.533823259Z"}
{"log":"      \"policy\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533825662Z"}
{"log":"          \"type\": \"k8s\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533828111Z"}
{"log":"      },\n","stream":"stdout","time":"2021-08-26T10:04:10.533837283Z"}
{"log":"      \"kubernetes\": {\n","stream":"stdout","time":"2021-08-26T10:04:10.533839839Z"}
{"log":"          \"kubeconfig\": \"/etc/cni/net.d/calico-kubeconfig\"\n","stream":"stdout","time":"2021-08-26T10:04:10.533842282Z"}
{"log":"      }\n","stream":"stdout","time":"2021-08-26T10:04:10.533845077Z"}
{"log":"    },\n","stream":"stdout","time":"2021-08-26T10:04:10.533847502Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533849857Z"}
{"log":"      \"type\": \"portmap\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533852243Z"}
{"log":"      \"snat\": true,\n","stream":"stdout","time":"2021-08-26T10:04:10.533854819Z"}
{"log":"      \"capabilities\": {\"portMappings\": true}\n","stream":"stdout","time":"2021-08-26T10:04:10.533857293Z"}
{"log":"    },\n","stream":"stdout","time":"2021-08-26T10:04:10.533859816Z"}
{"log":"    {\n","stream":"stdout","time":"2021-08-26T10:04:10.533862198Z"}
{"log":"      \"type\": \"bandwidth\",\n","stream":"stdout","time":"2021-08-26T10:04:10.533864587Z"}
{"log":"      \"capabilities\": {\"bandwidth\": true}\n","stream":"stdout","time":"2021-08-26T10:04:10.533867171Z"}
{"log":"    }\n","stream":"stdout","time":"2021-08-26T10:04:10.533869734Z"}
{"log":"  ]\n","stream":"stdout","time":"2021-08-26T10:04:10.533872098Z"}
{"log":"}\n","stream":"stdout","time":"2021-08-26T10:04:10.533874563Z"}

The CoreDNS pods / containers also have trouble starting.

Possible Solution

Not sure yet.

Steps to Reproduce (for bugs)

  1. Provision an Ubuntu 18 host.
  2. Install Kubernetes and kubeadm.
  3. Create a Kubernetes 1.22.1 cluster with # kubeadm init --pod-network-cidr=192.168.0.0/16
  4. Install Calico with these instructions (see the sketch below).
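For step 4, a minimal sketch of what the on-premises install amounts to, assuming the manifest-based variant (the manifest URL is the one referenced later in this thread):

# kubeadm init --pod-network-cidr=192.168.0.0/16
# kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
# kubectl get pods -n kube-system -l k8s-app=calico-node -w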

Context

Your Environment

  • Calico version: 3.20
  • Orchestrator version: Kubernetes 1.22.1
  • Operating System and version: Ubuntu 18
@jeliseocd

Same problem upgrading from Kubernetes 1.21.2 to 1.22.1

CoreDNS no longer works, but that is because there is no longer communication between pods on different nodes. So a pod on node X can't reach a CoreDNS pod deployed on node Y.

It seems that the problem is in Calico 3.20.

Calico 3.19.1 works fine with the same configuration on Kubernetes 1.21.2.

I will try upgrading Kubernetes without updating Calico to see if it works.
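A quick way to reproduce the cross-node symptom is to start a throwaway pod and ping a CoreDNS pod scheduled on a different node (a sketch; the pod name and target IP are illustrative):

# kubectl run pinger --image=busybox --restart=Never -- sleep 3600
# kubectl get pods -A -o wide
# kubectl exec pinger -- ping -c 3 <coredns-pod-ip-on-another-node>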

@jeliseocd

Hi again!

I've upgraded my Kubernetes installation with kubeadm from version 1.21.2 to 1.22.1 without upgrading Calico (still version 3.19.1), and the pod networking works OK.

So I'm afraid the problem is in Calico version 3.20.

@pmyjavec
Author

@jeliseocd thanks for adding some extra info to my original bug report; I've been a bit low on time.

@caseydavenport
Member

@jeliseocd @pmyjavec could you share the logs from one of your calico/node pods? The logs in the OP appear to be from the init container and not from the calico-node container, which is where I would expect to see relevant diags.

e.g. kubectl logs -n kube-system calico-node-xxxx

I am guessing this is related to v1.22 - I tried those steps on a v1.21 cluster and it appears to be working fine. Will try again on v1.22 to see if I can repro.
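For reference, selecting the container explicitly avoids picking up init-container output (a sketch; container names assume the standard calico.yaml manifest):

# kubectl logs -n kube-system calico-node-xxxx -c calico-node
# kubectl logs -n kube-system calico-node-xxxx -c install-cni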

@caseydavenport
Member

Which manifest from that page are you guys using?

@jeliseocd

jeliseocd commented Sep 1, 2021

tigera-operator.yaml.old.zip
This is the Calico 3.19.1 manifest used in both installations, Kubernetes 1.21.2 and the 1.22.1 upgrade, together with the following custom-resources.yaml:

---
# This section includes base Calico installation configuration.
# For more information, see: https://docs.projectcalico.org/v3.18/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.96.0.0/12
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4:
      canReach: 172.26.0.10
--- 

This works fine.

Calico version 3.20 was downloaded from: https://docs.projectcalico.org/archive/v3.20/manifests/tigera-operator.yaml
With this manifest, "kubectl get nodes -o wide" shows STATUS Ready for all the nodes and "kubectl -n calico-system get pods -o wide" shows READY 1/1 for all the Calico pods, but after a few seconds there is no connectivity between pods on different nodes.

In a while I will update Calico to version 3.20 again to upload the logs.

@AlanHohn

AlanHohn commented Sep 1, 2021

I'm seeing what I think is a related problem and did some testing with various versions. Testing was primarily on Ubuntu 21.04 (Hirsute) on Vagrant, but I also tested the first case with 20.04 (Focal) on AWS for another data point, with the same results. All clusters are three-node HA clusters installed via kubeadm. Calico is installed using kubectl straight from the URL.

Calico 3.20 / k8s 1.22.1

Not working. This is where I started.

I first observed an issue with services not being able to find endpoints. When I ran ip r after initial install, I briefly saw the inter-node routes. During that interval, I could ping pod IP addresses on other nodes. However, after a minute or so, those routes were removed and ping of a pod on a different node stopped working. The routes then came and went every few minutes.

I captured this log message which was repeated in all of my calico-node pods:

bird: Mesh_172_31_1_11: Invalid NEXT_HOP attribute in route 192.168.239.192/26
bird: Mesh_172_31_1_13: Invalid NEXT_HOP attribute in route 192.168.25.192/26
2021-09-01 23:09:10.936 [INFO][88] felix/route_table.go 876: Remove old route dest=192.168.25.192/26 ifaceName="enp0s8" ifaceRegex="" ipVersion=0x4 routeProblems=[]string{"unexpected route", "incorrect gateway"}
2021-09-01 23:09:10.937 [INFO][88] felix/route_table.go 876: Remove old route dest=192.168.239.192/26 ifaceName="enp0s8" ifaceRegex="" ipVersion=0x4 routeProblems=[]string{"unexpected route", "incorrect gateway"}

While the routes existed, they appeared to be set up as I would expect (the next hop was the correct IP address for the other node).
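For anyone trying to observe the same flapping, watching the routes on a node makes it obvious (a sketch; Calico's BIRD-programmed routes are normally tagged with proto bird):

# watch -n 2 "ip route show proto bird"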

Calico 3.20 / k8s 1.21.4

Not working. Same route removal behavior as Calico 3.20 / k8s 1.22.1. Same log messages in calico-node as well.

Calico 3.19 / k8s 1.22.1

Not working, for a new reason. Calico never gets installed as the tigera-operator pod is stuck in CrashLoopBackOff:

{"level":"error","ts":1630539944.5364842,"logger":"setup","msg":"problem running manager","error":"no matches for kind \"APIService\" in version \"apiregistration.k8s.io/v1beta1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nmain.main\n\t/go/src/github.com/tigera/operator/main.go:228\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}

This is not surprising, as the v1beta1 apiregistration.k8s.io API was removed in Kubernetes 1.22.
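A quick way to confirm this on a given cluster is to list the apiregistration versions the API server still serves; on 1.22 only v1 should remain (a sketch):

# kubectl api-versions | grep apiregistration
apiregistration.k8s.io/v1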

Calico 3.19 / k8s 1.21.4

Success! Everything works as expected. I don't see the above messages in the calico-node log.

Hope this helps. Let me know if there's any additional data I can pull from any of these combinations.

@pmyjavec
Author

pmyjavec commented Sep 2, 2021

@caseydavenport,

# kubectl logs -n kube-system calico-node-wfsd9
failed to try resolving symlinks in path "/var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/7.log": lstat /var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/7.log: no such file or directory

The parent directory is created fine; I don't know why it's referring to issues resolving symlinks:

# file /var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/

/var/log/pods/kube-system_calico-node-wfsd9_c5adac27-2153-42ff-9590-76b2421ff94b/calico-node/: directory

@ivanovpavel1983

(quoting @AlanHohn's version-matrix comment above in full)

I see the same issue in a k8s cluster v1.20.4 with RHEL 7.9 nodes.
Calico 3.19.2 works fine; with 3.20 I get:
Invalid NEXT_HOP attribute in route....

@Flou21

Flou21 commented Sep 4, 2021

I have the same problem with Kubernetes 1.22.1, CentOS 8, and Calico 3.20.
Kubernetes 1.21.3 with Calico 3.19 works fine.

In the Calico pod logs I don't see errors, but it definitely seems like a networking issue related to Calico. The deployed pods can't communicate with each other.

I don't know if my problem is related to this, but it is also a fresh 1.22 Kubernetes cluster.
I already opened a Kubernetes issue because my initial thought was that it was a Kubernetes bug: kubernetes/kubernetes#104738.

@knaou

knaou commented Sep 9, 2021

I had a nearly identical problem with Kubernetes 1.22.1, Debian 11, and Calico 3.20.
I installed Calico into my on-premises cluster from https://docs.projectcalico.org/manifests/calico.yaml

There was an interesting event on one of the calico-kube-controllers-* pods:

 Warning  FailedCreatePodSandBox  0s (x5 over 54s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "...": failed to find plugin "loopback" in path [/usr/lib/cni]  

It seems to look for plugins in the wrong path, /usr/lib/cni; the loopback plugin is in the default path (/opt/cni/bin).
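If the runtime is containerd, the CNI binary directory it uses comes from its CRI plugin config, so one way to check where that /usr/lib/cni is coming from is (a sketch; assumes a default containerd install):

# containerd config dump | grep -A 3 'plugins."io.containerd.grpc.v1.cri".cni'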

@coutinhop
Contributor

(quoting @pmyjavec's earlier comment with the failing kubectl logs output)

@pmyjavec it seems like the "resolving symlinks" part of that message is a bit misleading; the real issue is that there is no "7.log" file in that dir. Did you by any chance change logging drivers? I ask because googling for this gave me this result: https://stackoverflow.com/questions/63028034/kubernetes-pod-logging-broken-with-journald-logging-driver

Maybe passing a --log-dir arg to kubectl logs would let you get the logs? Could you try it?

@pmyjavec
Author

We solved this issue, at least for our use case. The problem was that we needed to comment out the following mountPropagation: Bidirectional setting in the manifest:

              # Bidirectional means that, if we mount the BPF filesystem at /sys/fs/bpf it will propagate to the host.
              # If the host is known to mount that filesystem already then Bidirectional can be omitted.
              mountPropagation: Bidirectional

One thing I didn't make clear is that, while this was a fresh cluster, it was hosted inside an LXD container. It seems /sys/fs/bpf was already mounted for us.
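For anyone else running nodes inside LXD or a similar nested environment, a quick check before editing the manifest is whether the BPF filesystem is already mounted on the host (a sketch):

# mountpoint /sys/fs/bpf
# mount | grep bpf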

Thanks for the help, feel free to close this issue.
