k3s fails to start on network where Ceph cluster is present #464

Closed
bjwschaap opened this issue May 9, 2019 · 3 comments


bjwschaap commented May 9, 2019

Bug Description
I have an Alpine Linux VM (netbooted) with Docker and k3s 0.5.0. When I start k3s server --docker on any network where no Ceph cluster is present, k3s starts without any problems. If I connect to a network where a Ceph cluster is present, k3s fails with:

d08-00-27-65-3a-ec:~# k3s --debug server --docker
DEBU[0000] Asset dir /var/lib/rancher/k3s/data/4e1224c66a9dbb9b03daefff200f4f8eaf45590fb722b6fe2924a201d6de2e8d
DEBU[0000] Running /var/lib/rancher/k3s/data/4e1224c66a9dbb9b03daefff200f4f8eaf45590fb722b6fe2924a201d6de2e8d/bin/k3s-server [k3s --debug server --docker]
INFO[2019-05-09T13:13:40.138537779Z] Starting k3s v0.5.0 (8c0116dd)
INFO[2019-05-09T13:13:40.139599616Z] Running kube-apiserver --authorization-mode=Node,RBAC --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --advertise-port=6445 --insecure-port=0 --bind-address=127.0.0.1 --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --kubelet-client-key=/var/lib/rancher/k3s/server/tls/token-node.key --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --service-cluster-ip-range=10.43.0.0/16 --advertise-address=127.0.0.1 --service-account-issuer=k3s --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/token-node-1.crt --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-username-headers=X-Remote-User --watch-cache=false --tls-private-key-file=/var/lib/rancher/k3s/server/tls/localhost.key --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --api-audiences=unknown --requestheader-allowed-names=kubernetes-proxy --requestheader-group-headers=X-Remote-Group --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --allow-privileged=true --secure-port=6444 --tls-cert-file=/var/lib/rancher/k3s/server/tls/localhost.crt --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --requestheader-extra-headers-prefix=X-Remote-Extra-
I0509 13:13:40.139793    2852 server.go:517] external host was not specified, using 127.0.0.1
I0509 13:13:40.139965    2852 server.go:148] Version: v1.14.1-k3s.4
I0509 13:13:40.144493    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.144552    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
E0509 13:13:40.145272    2852 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145329    2852 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145363    2852 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145426    2852 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145438    2852 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.145465    2852 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0509 13:13:40.145474    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.145478    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0509 13:13:40.148694    2852 master.go:218] Using reconciler: lease
W0509 13:13:40.168908    2852 genericapiserver.go:315] Skipping API batch/v2alpha1 because it has no resources.
W0509 13:13:40.183878    2852 genericapiserver.go:315] Skipping API node.k8s.io/v1alpha1 because it has no resources.
E0509 13:13:40.195678    2852 prometheus.go:138] failed to register depth metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195715    2852 prometheus.go:150] failed to register adds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195737    2852 prometheus.go:162] failed to register latency metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195757    2852 prometheus.go:174] failed to register work_duration metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195772    2852 prometheus.go:189] failed to register unfinished_work_seconds metric admission_quota_controller: duplicate metrics collector registration attempted
E0509 13:13:40.195782    2852 prometheus.go:202] failed to register longest_running_processor_microseconds metric admission_quota_controller: duplicate metrics collector registration attempted
I0509 13:13:40.195795    2852 plugins.go:158] Loaded 8 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,MutatingAdmissionWebhook.
I0509 13:13:40.195800    2852 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I0509 13:13:40.198951    2852 secure_serving.go:116] Serving securely on 127.0.0.1:6444
I0509 13:13:40.200338    2852 apiservice_controller.go:94] Starting APIServiceRegistrationController
I0509 13:13:40.200360    2852 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
INFO[2019-05-09T13:13:40.205940227Z] Running kube-scheduler --bind-address=127.0.0.1 --secure-port=0 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --leader-elect=false --port=10251
E0509 13:13:40.206647    2852 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/127.0.0.1, ResourceVersion: 0, AdditionalErrorMsg:
I0509 13:13:40.207687    2852 server.go:142] Version: v1.14.1-k3s.4
I0509 13:13:40.207720    2852 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
I0509 13:13:40.208710    2852 crd_finalizer.go:242] Starting CRDFinalizer
I0509 13:13:40.209209    2852 available_controller.go:320] Starting AvailableConditionController
I0509 13:13:40.209304    2852 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
W0509 13:13:40.209829    2852 authorization.go:47] Authorization is disabled
W0509 13:13:40.209860    2852 authentication.go:55] Authentication is disabled
I0509 13:13:40.209871    2852 deprecated_insecure_serving.go:49] Serving healthz insecurely on [::]:10251
I0509 13:13:40.210143    2852 autoregister_controller.go:139] Starting autoregister controller
I0509 13:13:40.210260    2852 cache.go:32] Waiting for caches to sync for autoregister controller
I0509 13:13:40.211216    2852 customresource_discovery_controller.go:203] Starting DiscoveryController
I0509 13:13:40.211332    2852 naming_controller.go:284] Starting NamingConditionController
I0509 13:13:40.211739    2852 establishing_controller.go:73] Starting EstablishingController
INFO[2019-05-09T13:13:40.209246098Z] Running kube-controller-manager --cluster-cidr=10.42.0.0/16 --port=10252 --kubeconfig=/var/lib/rancher/k3s/server/cred/kubeconfig-system.yaml --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --root-ca-file=/var/lib/rancher/k3s/server/tls/token-ca.crt --leader-elect=false --allocate-node-cidrs=true --bind-address=127.0.0.1 --secure-port=0
panic: creating CRD store Get https://localhost:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.32.2.101:6444: connect: connection refused

goroutine 250 [running]:
github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd.(*Factory).BatchCreateCRDs.func1(0xc001afd820, 0xc0018b9920, 0x3, 0x3, 0xc001aa5f40, 0x5d36d20, 0x3bbcd20, 0xc0018b9ad0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd/init.go:65 +0x2c2
created by github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd.(*Factory).BatchCreateCRDs
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/norman/store/crd/init.go:50 +0xce

In this log, the IP 10.32.2.101 is the address of my Ceph MON instance.
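
One way to see why the dial to localhost:6444 ends up at that address is to look at how the host resolves names while k3s starts. A rough diagnostic sketch (command availability depends on the netboot image; run the tcpdump in a second shell):

# Check whether DHCP pushed a search domain that gets appended to short names
cat /etc/resolv.conf

# Watch outgoing DNS queries while k3s starts
tcpdump -ni any port 53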

Steps To Reproduce
Steps to reproduce the behavior:

1. Netboot an Alpine Linux VM with Docker and k3s 0.5.0.
2. Connect it to a network where a Ceph cluster is present.
3. Run k3s server --docker.

Expected behavior
k3s should start without errors.

Additional context
We are PXE booting VMs with Alpine Linux netboot, and want to run k3s+docker on them to form a Kubernetes cluster. Then we want to use Ceph for PVCs.

Version Info

d08-00-27-65-3a-ec:~# uname -a
Linux d08-00-27-65-3a-ec 4.14.89-0-vanilla #1-Alpine SMP Tue Dec 18 16:10:10 UTC 2018 x86_64 Linux
d08-00-27-65-3a-ec:~# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.9.3
PRETTY_NAME="Alpine Linux v3.9"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
d08-00-27-65-3a-ec:~# k3s --version
k3s version v0.5.0 (8c0116dd)
d08-00-27-65-3a-ec:~# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
d08-00-27-65-3a-ec:~# docker version
Client:
 Version:           18.09.1-ce
 API version:       1.39
 Go version:        go1.11.4
 Git commit:        4c52b901c6cb019f7552cd93055f9688c6538be4
 Built:             Fri Jan 11 15:41:33 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.1-ce
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.4
  Git commit:       4c52b901c6cb019f7552cd93055f9688c6538be4
  Built:            Fri Jan 11 15:40:52 2019
  OS/Arch:          linux/amd64
  Experimental:     false
bjwschaap (Author) commented:

Never mind. After capturing DNS requests with tcpdump, we observed that an actual DNS lookup is done for localhost<<.aaa.bb.org>> (redacted). We had an obsolete wildcard DNS entry pointing to the MON host, which caused localhost to resolve to the MON host instead of 127.0.0.1. The FQDN lookup is done because our DHCP server broadcasts the domain, which puts a search domain in resolv.conf.
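
For others who land here, a rough sketch of how to confirm the same misconfiguration and some possible workarounds. The domain below stands in for the redacted one, and whether /etc/hosts is consulted before DNS depends on the resolver in use, so treat these as suggestions rather than the fix k3s itself applies:

# A wildcard record makes any name under the domain resolve to the same host
nslookup localhost.aaa.bb.org
nslookup does-not-exist.aaa.bb.org

# Workarounds: remove the stale wildcard record at the DNS server (the fix used
# here), keep the DHCP client from writing the search domain into resolv.conf,
# or pin localhost to loopback in /etc/hosts (only helps if the resolver checks
# the hosts file before DNS):
echo "127.0.0.1 localhost" >> /etc/hosts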


cnf commented May 25, 2019

How did you resolve this @bjwschaap?


OyutianO commented Jul 3, 2019

I got this bug too.
