k3s fails to start on network where Ceph cluster is present #464

Closed
bjwschaap opened this issue May 9, 2019 · 3 comments


bjwschaap commented May 9, 2019

Bug Description
I have an Alpine Linux VM (netbooted) with Docker and k3s 0.5.0. When I start k3s server --docker on any network where no Ceph cluster is present, k3s starts without any problems. If I connect to a network where a Ceph cluster is present, k3s fails with:

d08-00-27-65-3a-ec:~# k3s --debug server --docker
DEBU[0000] Asset dir /var/lib/rancher/k3s/data/4e1224c66a9dbb9b03daefff200f4f8eaf45590fb722b6fe2924a201d6de2e8d
DEBU[0000] Running /var/lib/rancher/k3s/data/4e1224c66a9dbb9b03daefff200f4f8eaf45590fb722b6fe2924a201d6de2e8d/bin/k3s-server [k3s --debug server --docker]
INFO[2019-05-09T13:13:40.138537779Z] Starting k3s v0.5.0 (8c0116dd)
INFO[2019-05-09T13:13:40.139599616Z] Running kube-apiserver --authorization-mode=Node,RBAC --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --advertise-port=6445 --insecure-port=0 --bind-address=127.0.0.1 --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --kubelet-client-key=/var/lib/rancher/k3s/server/tls/token-node.key --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --service-cluster-ip-range=10.43.0.0/16 --advertise-address=127.0.0.1 --service-account-issuer=k3s --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/token-node-1.crt --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-username-headers=X-Remote-User --watch-cache=false --tls-private-key-file=/var/lib/rancher/k3s/server/tls/localhost.key --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --api-audiences=unknown --requestheader-allowed-names=kubernetes-proxy --requestheader-group-headers=X-Remote-Group --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --allow-privileged=true --secure-port=6444 --tls-cert-file=/var/lib/rancher/k3s/server/tls/localhost.crt --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --requestheader-extra-headers-prefix=X-Remote-Extra-
I0509 13:13:40.139793    2852 server.go:517] external host was not specified, using 127.0.0.1
I0509 13:13:40.139965    2852 server.go:148] Version: v1.14.1-k3s.4
I0509 13:13:40.144493    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.144552    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
E0509 13:13:40.145272    2852 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145329    2852 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145363    2852 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145426    2852 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145438    2852 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145465    2852 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0509 13:13:40.145474    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.145478    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0509 13:13:40.148694    2852 master.go:218] Using reconciler: lease
W0509 13:13:40.168908    2852 genericapiserver.go:315] Skipping API batch/v2alpha1 because it has no resources.
W0509 13:13:40.183878    2852 genericapiserver.go:315] Skipping API node.k8s.io/v1alpha1 because it has no resources.
E0509 13:13:40.195678    2852 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195715    2852 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195737    2852 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195757    2852 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195772    2852 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195782    2852 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0509 13:13:40.195795    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.195800    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0509 13:13:40.198951    2852 secure_serving.go:116] Serving securely on 127.0.0.1:6444
I0509 13:13:40.200338    2852 apiservice_controller.go:94] Starting APIServiceRegistrationController
I0509 13:13:40.200360    2852 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
INFO[2019-05-09T13:13:40.205940227Z] Running kube-scheduler --bind-address=127.0.0.1 --secure-port=0 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --leader-elect=false --port=10251
E0509 13:13:40.206647    2852 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/127.0.0.1, ResourceVersion: 0, AdditionalErrorMsg:
I0509 13:13:40.207687    2852 server.go:142] Version: v1.14.1-k3s.4
I0509 13:13:40.207720    2852 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
I0509 13:13:40.208710    2852 crd_finalizer.go:242] Starting CRDFinalizer
I0509 13:13:40.209209    2852 available_controller.go:320] Starting AvailableConditionController
I0509 13:13:40.209304    2852 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
W0509 13:13:40.209829    2852 authorization.go:47] Authorization is disabled
W0509 13:13:40.209860    2852 authentication.go:55] Authentication is disabled
I0509 13:13:40.209871    2852 deprecated_insecure_serving.go:49] Serving healthz insecurely on [::]:10251
I0509 13:13:40.210143    2852 autoregister_controller.go:139] Starting autoregister controller
I0509 13:13:40.210260    2852 cache.go:32] Waiting for caches to sync for autoregister controller
I0509 13:13:40.211216    2852 customresource_discovery_controller.go:203] Starting DiscoveryController
I0509 13:13:40.211332    2852 naming_controller.go:284] Starting NamingConditionController
I0509 13:13:40.211739    2852 establishing_controller.go:73] Starting EstablishingController
INFO[2019-05-09T13:13:40.209246098Z] Running kube-controller-manager --cluster-cidr=10.42.0.0/16 --port=10252 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --root-ca-file=/var/lib/rancher/k3s/server/tls/token-ca.crt --leader-elect=false --allocate-node-cidrs=true --bind-address=127.0.0.1 --secure-port=0
panic: creating CRD store Get https://localhost:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.32.2.101:6444: connect: connection refused

goroutine 250 [running]:
github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd.(*Factory).BatchCreateCRDs.func1(0xc001afd820, 0xc0018b9920, 0x3, 0x3, 0xc001aa5f40, 0x5d36d20, 0x3bbcd20, 0xc0018b9ad0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd/init.go:65 +0x2c2
created by github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd.(*Factory).BatchCreateCRDs
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd/init.go:50 +0xce

In this log, the IP 10.32.2.101 is the address of my Ceph MON instance.
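
One way to see why the dial to localhost:6444 ends up at that address is to look at how the host resolves names while k3s starts. A rough diagnostic sketch (command availability depends on the netboot image; run the tcpdump in a second shell):

# Check whether DHCP pushed a search domain that gets appended to short names
cat /etc/resolv.conf

# Watch outgoing DNS queries while k3s starts
tcpdump -ni any port 53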

Steps To Reproduce
Steps to reproduce the behavior:

1. Netboot an Alpine Linux VM with Docker and k3s 0.5.0.
2. Connect it to a network where a Ceph cluster is present.
3. Run k3s server --docker.

Expected behavior
k3s should start without errors.

Additional context
We are PXE booting VMs with Alpine Linux netboot, and want to run k3s+docker on them to form a Kubernetes cluster. Then we want to use Ceph for PVCs.

Version Info

d08-00-27-65-3a-ec:~# uname -a
Linux d08-00-27-65-3a-ec 4.14.89-0-vanilla #1-Alpine SMP Tue Dec 18 16:10:10 UTC 2018 x86_64 Linux
d08-00-27-65-3a-ec:~# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.9.3
PRETTY_NAME="Alpine Linux v3.9"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
d08-00-27-65-3a-ec:~# k3s --version
k3s version v0.5.0 (8c0116dd)
d08-00-27-65-3a-ec:~# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d08-00-27-65-3a-ec:~# docker version
Client:
 Version:           18.09.1-ce
 API version:       1.39
 Go version:        go1.11.4
 Git commit:        4c52b901c6cb019f7552cd93055f9688c6538be4
 Built:             Fri Jan 11 15:41:33 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.1-ce
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.4
  Git commit:       4c52b901c6cb019f7552cd93055f9688c6538be4
  Built:            Fri Jan 11 15:40:52 2019
  OS/Arch:          linux/amd64
  Experimental:     false
bjwschaap (Author) commented:

Never mind. After capturing DNS requests with tcpdump, we observed that an actual DNS lookup is done for localhost<<.aaa.bb.org>> (redacted). We had an obsolete wildcard DNS entry pointing to the MON host, which caused localhost to resolve to the MON host instead of 127.0.0.1. The FQDN lookup is done because our DHCP server broadcasts the domain, which puts a search domain in resolv.conf.
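
For others who land here, a rough sketch of how to confirm the same misconfiguration and some possible workarounds. The domain below stands in for the redacted one, and whether /etc/hosts is consulted before DNS depends on the resolver in use, so treat these as suggestions rather than the fix k3s itself applies:

# A wildcard record makes any name under the domain resolve to the same host
nslookup localhost.aaa.bb.org
nslookup does-not-exist.aaa.bb.org

# Workarounds: remove the stale wildcard record at the DNS server (the fix used
# here), keep the DHCP client from writing the search domain into resolv.conf,
# or pin localhost to loopback in /etc/hosts (only helps if the resolver checks
# the hosts file before DNS):
echo "127.0.0.1 localhost" >> /etc/hosts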


cnf commented May 25, 2019

How did you resolve this @bjwschaap?


OyutianO commented Jul 3, 2019

I got this bug too.
