create HA cluster is flaky #588

Closed
neolit123 opened this issue Jun 4, 2019 · 62 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@neolit123
Member

neolit123 commented Jun 4, 2019

What happened:
i started seeing odd failures in the kind-master and -1.14 kubeadm jobs:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-1.14

after switching to this HA config:

# a cluster with 3 control-planes and 3 workers
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
I0604 19:15:09.075770     760 join.go:480] [preflight] Retrieving KubeConfig objects
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 
 ✗ Joining more control-plane nodes 🎮
DEBU[22:15:10] Running: /usr/bin/docker [docker ps -q -a --no-trunc --filter label=io.k8s.sigs.kind.cluster --format {{.Names}}\t{{.Label "io.k8s.sigs.kind.cluster"}} --filter label=io.k8s.sigs.kind.cluster=kind] 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
DEBU[22:15:10] Running: /usr/bin/docker [docker rm -f -v kind-control-plane2 kind-control-plane kind-control-plane3 kind-worker kind-worker3 kind-worker2 kind-external-load-balancer] 
⠈⠁ Joining more control-plane nodes 🎮 Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1

What you expected to happen:
no errors.

How to reproduce it (as minimally and precisely as possible):

cd kind-src-path
GO111MODULE=on go build
# install the kind binary to PATH
cd kubernetes-src-path
kind build node-image --kube-root=$(pwd)
kind create cluster --config=<path-to-above-ha-config> --image kindest/node:latest

Anything else we need to know?:

  • i cannot reproduce the bug without --loglevel=debug.
  • sometimes it fails while joining the extra CP nodes, sometimes while joining the workers.

Environment:

  • kind version: (use kind version): master at 43bf0e2
  • Kubernetes version: master at 1409ff38e5828f55
  • Docker version: (use docker info):
Containers: 10
 Running: 7
 Paused: 0
 Stopped: 3
Images: 128
Server Version: 18.06.3-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-41-generic
Operating System: Ubuntu 17.10
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.66GiB
Name: luboitvbox
ID: K2H6:2I6N:FSBZ:S77V:R5CQ:X22B:VYTF:WZ4R:UIKC:HGOT:UCHD:GCR2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="17.10 (Artful Aardvark)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 17.10"
VERSION_ID="17.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=artful
UBUNTU_CODENAME=artful

/kind bug
/priority important-soon

@neolit123 neolit123 added the kind/bug Categorizes issue or PR as related to a bug. label Jun 4, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 4, 2019
@neolit123
Member Author

neolit123 commented Jun 4, 2019

seems to work with 3e5f299
possible regression in #585

EDIT: never mind, was able to reproduce it with v0.3.0 (flakes?)
EDIT2: v0.2.0 seems to work consistently.

/assign @BenTheElder

@BenTheElder i'm going to switch the kubeadm jobs to use a stable kind version (kubetest/kind needs update to pin a newer stable than 0.1.0):
kubernetes/test-infra#12883

possibly the sig-testing kind jobs can keep using kind from master.

@aojea
Contributor

aojea commented Jun 4, 2019

I can't reproduce it, nor find a similar error in the jobs; can you give me a link to some logs? I'm not familiar with testgrid and prow 😅

I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 

The error is 401 Unauthorized 🤔

@neolit123
Member Author

neolit123 commented Jun 4, 2019

this:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

happened here:

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1135916134365663234/build-log.txt

it's flaky and i'm able to reproduce it locally with --loglevel=debug. happens sometimes.
make sure that you use the HA config too.

some of the other failures here, however, are different:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

e.g.
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136007475389009922/build-log.txt

the e2e suite passes:

SUCCESS! -- 182 Passed | 0 Failed | 0 Pending | 4230 Skipped

but then kind export logs times out:

time="20:51:10" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-kubetest-external-load-balancer tar --hard-dereference -C /var/log -chf - .]"
2019/06/04 21:02:51 process.go:199: Abort after 30m0s timeout during kind export logs /logs/artifacts --loglevel=debug --name=kind-kubetest. Will terminate in another 15m
SIGABRT: abort

@neolit123
Member Author

@aojea
i think the kind export logs timeout is reproducible consistently with the HA config.
it just stays at:

DEBU[01:17:24] Running: /usr/bin/docker [docker exec --privileged kind-external-load-balancer tar --hard-dereference -C /var/log -chf - .]

@neolit123
Member Author

neolit123 commented Jun 4, 2019

also:
kind build node-image --kube-root=$(pwd)
for k/k commit 1409ff38e5828f55

gives me some CRI errors:

INFO[2019-06-04T22:36:46.255776190Z] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" 
ERRO[2019-06-04T22:36:46.259945113Z] Failed to load cni during init, please check CRI plugin status before setting up network for pods  error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
INFO[2019-06-04T22:36:46.263918747Z] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2019-06-04T22:36:46.264185013Z] Start subscribing containerd event           
INFO[2019-06-04T22:36:46.265648130Z] Start recovering state                       
INFO[2019-06-04T22:36:46.266610875Z] Start event monitor                          
INFO[2019-06-04T22:36:46.266651618Z] Start snapshots syncer                       
INFO[2019-06-04T22:36:46.266663684Z] Start streaming server                       
INFO[2019-06-04T22:36:46.266727343Z] serving...                                    address="/run/containerd/containerd.sock"
INFO[2019-06-04T22:36:46.266748482Z] containerd successfully booted in 0.099826s  
ERRO[2019-06-04T22:36:46.885194357Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.112182694Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.666346302Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:48.154099576Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.001964021Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.503738885Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:50.248597676Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:50.630062574Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:50.656049117Z] ImageCreate event &ImageCreate{Name:sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:50.657063170Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:51.166245090Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:51.618692113Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:52.051331802Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:52.100652939Z] ImageCreate event &ImageCreate{Name:sha256:ea22abc123a1c4162fdd6e37a6594b242d5b83284faf8b5dff1fcaaf6c11ae48,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:52.101579489Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:52.512814469Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:52.979902246Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:53.243875342Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:53.267360225Z] ImageCreate event &ImageCreate{Name:sha256:19bb968f77bba3a5b5f56b5c033d71f699c22bdc8bbe9412f0bfaf7f674a64cc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:53.268120372Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.324315128Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.328863044Z] ImageCreate event &ImageCreate{Name:sha256:25876c9e6eb0c50f882e6e586f1a41f98d19f44fb94677877d8c096c5b2249c2,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.329891024Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.412397894Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.421736948Z] ImageCreate event &ImageCreate{Name:sha256:eb516548c180f8a6e0235034ccee2428027896af16a509786da13022fe95fe8c,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.422692034Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:57.098307600Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-repositories locked: unavailable" ref=tar-repositories total=142
INFO[2019-06-04T22:36:57.229823963Z] ImageCreate event &ImageCreate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.280581565Z] ImageCreate event &ImageCreate{Name:sha256:f227066bdc5f9aa2f8a9bb54854e5b7a23c6db8fce0f927e5c4feef8a9e74d46,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.281617808Z] ImageUpdate event &ImageUpdate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.301074980Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.318367087Z] ImageCreate event &ImageCreate{Name:sha256:c155832fa95abf9b0c33d3f9c39ad43bf55002b853c44394a6dbb89dcc68f55a,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.319170150Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.320687117Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.326232183Z] ImageCreate event &ImageCreate{Name:sha256:7dd88ed8b36300ced4153dcc2d0e4995378da059be497d31271771b19976fedc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.327088405Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.575908437Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/etcd:3.3.10,Labels:map[string]string{},} 
INFO[2019-06-04T22:37:00.580934619Z] ImageCreate event &ImageCreate{Name:sha256:2c4adeb21b4ff8ed3309d0e42b6b4ae39872399f7b37e0856e673b13c4aba13d,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.581375948Z] Stop CRI service                           

EDIT: i think the cause for:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

is a bad node-image.

@BenTheElder
Member

BenTheElder commented Jun 4, 2019

"CRI errors" during kind build node-image are normal.

kind export logs and kind load ... appear to be broken on HA because they're trying to do that against the nginx node. this has nothing to do with CRI etc.

@neolit123
Member Author

a couple of problems:

looks like --loglevel=debug is not a factor after all.

lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✗ Joining more control-plane nodes 🎮 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1
lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining more control-plane nodes 🎮 
 ✓ Joining worker nodes 🚜 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info

@neolit123 neolit123 changed the title create HA cluster fails with loglevel=debug create HA cluster is flaky Jun 4, 2019
@BenTheElder
Member

#593 fixed the log export and image load issues; not sure about whatever the remaining flakiness is about.

@neolit123
Member Author

https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

the latest green run passes with kind from master.
but the previous run has:

Unauthorized in 3567 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

the one before that had the timeout problem that #593 fixed.

a run that has kubernetes/test-infra#12883 should be in 1 hour.

@aojea
Contributor

aojea commented Jun 5, 2019

I narrowed it down to this snippet. What could be the problem? What does "invalid bearer token" mean?

Starting autoregister controller
2019-06-05T18:52:11.653957701Z stderr F I0605 18:52:11.653932       1 cache.go:32] Waiting for caches to sync for autoregister controller
2019-06-05T18:52:11.657095624Z stderr F E0605 18:52:11.656920       1 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/172.17.0.7, ResourceVersion: 0, AdditionalErrorMsg: 
2019-06-05T18:52:11.682115562Z stderr F E0605 18:52:11.671678       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.682181186Z stderr F E0605 18:52:11.673921       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.754070454Z stderr F I0605 18:52:11.753862       1 cache.go:39] Caches are synced for AvailableConditionController controller

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_kube-apiserver-kind-kubetest-control-plane3_58215256c50a36017e211d17dfbfd67c/kube-apiserver/0.log

@neolit123
Member Author

neolit123 commented Jun 5, 2019

bootstrap tokens are technically bearer tokens:
https://kubernetes.io/docs/reference/access-authn-authz/bootstrap-tokens/
AFAIK, this error can happen if kubeadm join passed a bad JWT (the --token part of join).

not sure how a token can flake though...
were you able to reproduce it locally?
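
for illustration, a minimal sketch of where that token goes (the values are placeholders, and kind actually drives join through a kubeadm config file rather than flags):

kubeadm join 172.17.0.2:6443 \
    --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash>

an invalid or expired bootstrap token at this step would produce the kind of 401 Unauthorized seen above.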

on a side note, kubeadm does have concurrent CP join in k/k master now:
xref https://github.com/kubernetes-sigs/kind/blob/master/pkg/cluster/internal/create/actions/kubeadmjoin/join.go#L90-L95
(we can probably try it after this ticket is closed)

@neolit123
Member Author

@aojea
Contributor

aojea commented Jun 6, 2019

@neolit123
Member Author

neolit123 commented Jun 6, 2019

yes, but we switched it to kind v0.2.0, because it's in the release-informing dashboard and the release team wants it green.
kubernetes/test-infra#12883

@aojea
Contributor

aojea commented Jun 6, 2019

Seems that the etcd cluster wasn't ready; in this case, kind-control-plane3 is the one that wasn't able to join the cluster at the moment of the authentication.
A hypothesis is that the LB forwarded the worker3 join connection to it, and that's why it failed, since the etcd cluster wasn't assembled yet. Should kubeadm handle this scenario?

I think that kind is too fast at this moment :)

2019-06-05T18:52:01.871076762Z stderr F 2019-06-05 18:52:01.870776 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05T18:52:01.871232212Z stderr F 2019-06-05 18:52:01.871073 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05T18:52:01.877611555Z stderr F 2019-06-05 18:52:01.877310 I | embed: rejected connection from "172.17.0.7:50372" (error "remote error: tls: bad certificate", ServerName "")
2019-06-05T18:52:06.918508278Z stderr F 2

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_etcd-kind-kubetest-control-plane3_5c9595671afd08252eb2e03d05489ee5/etcd/0.log

@neolit123
Member Author

this is a possibility yes. but the kinder HA jobs are consistently green in this regard.
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-master-on-stable
i wonder what ends up being different.

i think we should bisect kind v0.2.0 -> latest and find what happened.

@aojea
Contributor

aojea commented Jun 6, 2019

That's a fantastic idea. I think I know what's happening; let me do some tests.

@aojea
Contributor

aojea commented Jun 6, 2019

/assign @aojea

@thomastaylor312

I am experiencing this flakiness as well. Let me know if I can help test any fixes!

@fabriziopandini
Member

fabriziopandini commented Jun 7, 2019

@aojea @neolit123

One additional question:

is it possible that the health checks on the load balancer are not working properly (the load balancer lets traffic pass before the API server is fully operative, with etcd)?

And some answers (hopefully helpful)

kinder HA jobs are consistently green in this regard.

Kinder has wait loops after each action and it is less aggressive than kind in terms of parallelism; e.g.
https://github.com/kubernetes/kubeadm/blob/30be64eaa19c663dd1f8e8d9826608e52d06607c/kinder/pkg/actions/util.go#L85

It is possible to pass --wait=0 to get a behaviour more similar to what is implemented in kind, but parallelism for join is not there at all.

Should kubeadm handle this scenario?

Afaik kubeadm join implements a wait loop, but as soon as one API server answers it proceeds with joining.
I will try to investigate if there is some regression there, even if it seems that the problem is not on the waiting side but on the answering side (the kube-apiserver answers before everything, etcd/RBAC, is properly set up).

@aojea
Contributor

aojea commented Jun 7, 2019

@fabriziopandini sorry for not updating the ticket, you are absolutely right: the problem is that nginx is only doing health checks at the TCP level, whereas haproxy was doing health checks at the TLS level with the ssl-hello-chk option.
The problem is that as soon as the TCP port is open nginx forwards traffic to it, but it can happen that the kube-apiserver is not yet working, and thus the join fails.
I was doing research yesterday on load balancers; basically, we need SSL passthrough (L4 load balancing) with SSL health checks (see the sketch after this list):

  • haproxy is the best solution because it has the ssl-hello-chk option; however, the docker container doesn't have a default configuration and dies unless we bind-mount the config file. We could change the entrypoint to a sleep forever, for example; is this an option?
  • nginx: active health checks seem to be only available in the commercial version or with custom modules
  • traefik: TCP services are only available in alpha in the latest version, 2.0.0, and the docs don't mention health checks. It's very powerful and flexible and could be used to solve the port forwarding too; something to check in the future.
  • envoy seems a good alternative; I want to test it today

Platforms/Architectures images: haproxy and nginx seem to have docker images for almost all platforms, while traefik and envoy are just starting to support more platforms
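
For reference, the kind of haproxy configuration being discussed would look roughly like this (a hedged sketch: node names, addresses, and timeouts are placeholders, not kind's actual generated config):

# L4 (TCP) passthrough with TLS-level health checks
defaults
    mode tcp
    timeout connect 10s
    timeout client 30s
    timeout server 30s

frontend control-plane
    bind *:6443
    default_backend kube-apiservers

backend kube-apiservers
    balance roundrobin
    # ssl-hello-chk marks a server UP only after it answers an SSL hello,
    # not merely when the TCP port accepts connections
    option ssl-hello-chk
    server kind-control-plane  172.17.0.3:6443 check
    server kind-control-plane2 172.17.0.4:6443 check
    server kind-control-plane3 172.17.0.5:6443 check

Note that ssl-hello-chk only verifies the TLS handshake; actually probing the /healthz endpoint would need an HTTPS check (e.g. option httpchk combined with check-ssl), which is the direction the later comments converge on.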

@BenTheElder
Member

Alternatively, we could simply not load-balance to nodes that have not joined yet.

@fabriziopandini
Member

@aojea thanks for your great work trying to address this issue.

Before answering your question, let me recap the options that in my opinion should be considered for solving this issue:

  1. act on the API server, implementing a change so it does not answer while it is not healthy
  2. act on kubeadm, implementing a wait cycle that detects when the new etcd member is fully synchronized, and then creates the new API server only after this condition is satisfied
  3. act on kubeadm, tuning probes
  4. act on the kind load balancer

IMO 1 is the best option, but it will take some time

I'm in contact with the etcd maintainers about implementing 2 in a smart way (I don't like the idea of adding an arbitrary sleep). ATM there is a promising feature still only on master; let's see if there is also something immediately actionable.

You are looking at 2/3

With regards to timers, they are the result of a painful trade-off between different types of kubeadm users. We should be really careful in touching those values unless we find something badly wrong or we can assume that the change doesn't hurt users with slower devices (raspberry is the corner case in this area).

AFAIK, the liveness probe isn't the best candidate for this use case (detecting that things have not started yet); I'm also trying to understand if readiness/liveness can actually block traffic for a static pod using host networking...
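
For context, the probe being discussed is the one kubeadm writes into the kube-apiserver static pod manifest; it looks roughly like this (values recalled from the generated manifest of that era, shown here as an assumption rather than an exact copy):

livenessProbe:
  failureThreshold: 8
  httpGet:
    host: 172.17.0.2      # the node's advertise address (placeholder)
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 15
  timeoutSeconds: 15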

@neolit123
Member Author

xref kubernetes/kubernetes#72102

@neolit123
Member Author

My understanding of Readiness is that it only works for Services, should it work here?
Can we use more aggressive timers for the liveness probe?

immediate answer - no, we cannot as these are well balanced. we have found cases where the values do not match a use case, but normally there is another issue at play there.

@aojea
Contributor

aojea commented Jun 11, 2019

I've looked at envoy and I'm afraid it cannot solve the problem; it seems it doesn't have (at least I couldn't find) an option to do HTTPS health checks against the API /healthz endpoint (it does HTTP only). If anybody wants to give it a shot, this is the configuration I've tried:

envoy config

docker run -d envoyproxy/envoy-alpine:v1.10.0
docker exec -it ENVOYCONTAINER sh
cat <<EOF > /etc/envoy/envoy.yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          stat_prefix: kind_tcp
          cluster: kind_cluster
  clusters:
  - name: kind_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: kind_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 172.16.0.4
                port_value: 6443
    health_checks:
      - timeout: 1s
        interval: 2s
        interval_jitter: 1s
        unhealthy_threshold: 1
        healthy_threshold: 3
        http_health_check:
          path: "/healthz"

EOF

exit

docker restart ENVOYCONTAINER 

I think we should go with options 1-3 mentioned by fabrizio and, if we want to work around the problem until we find the definitive solution, we can go for the active loop in #598.
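
As an illustration of what such an active loop amounts to (a minimal sketch assuming the load balancer answers on 172.17.0.2:6443; kind's actual implementation is in #598):

# wait until the apiserver behind the load balancer reports healthy before joining
until curl -k -sf https://172.17.0.2:6443/healthz > /dev/null; do
    sleep 1
done
# ...only then run kubeadm join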

@aojea
Contributor

aojea commented Jun 11, 2019

/unassign @aojea

@neolit123
Member Author

can't seem to reproduce the problem after the workaround in kubernetes/kubernetes#78915
cc @ereslibre

the issue of CM retries is now tracked in kubernetes/kubeadm#1606

@aojea
Contributor

aojea commented Jun 14, 2019

Thanks @neolit123 and @ereslibre , that's awesome
Should we close this if the jobs are stable for some days?

@ereslibre

Should we close this if the jobs are stable for some days?

I'll defer this decision to others. However, in my opinion, the workaround included in kubeadm was a temporary measure to deal with the flakiness, and is meant to be reverted, as far as we have discussed.

Ideally, during the 1.16 cycle we should be able to understand the source(s) of problems for concurrent joins and as a result of these investigations it could happen that they circle back to kind (e.g. the LB would need to perform health checks, so we only route traffic to initialized apiservers).

@BenTheElder
Member

BenTheElder commented Jun 14, 2019

NOTE: kubespray also has a mode with an internal nginx LB...
https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md#kube-apiserver

@aojea
Contributor

aojea commented Jun 14, 2019

/assign @aojea

I was talking about this today with @ereslibre. Despite all the issues, one fact is that the API server is advertising the correct status on the healthz endpoint, and that's information the LB should honor.

After trying different LBs, it seems that the only open source LBs that allow doing health checks on an HTTPS endpoint together with HTTPS passthrough load balancing are:

  • haproxy: we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command??
  • nginx: only the paid version, or installing extra modules, allows doing active health checks

@BenTheElder
Member

haproxy: we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command??

The standard docs show adding your config to the image. You can put a dummy config. If we boot with a sleep, then restart doesn't work properly.

  • add a dummy config to our custom image kindest/haproxy
  • rewrite the config at runtime by writing it into the container
  • use SIGHUP to reload the config

we actually had this working at one point already :-)
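
Roughly, that flow could look like this (the container name matches kind's load balancer; the config path is the official haproxy image default and is an assumption for kindest/haproxy):

# kindest/haproxy ships a dummy haproxy.cfg so the container starts cleanly;
# once the control-plane containers exist, write the real config into the running container
docker exec -i kind-external-load-balancer \
    sh -c 'cat > /usr/local/etc/haproxy/haproxy.cfg' < haproxy.cfg
# haproxy's master process re-executes and reloads the config on SIGHUP
docker kill -s HUP kind-external-load-balancer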

@BenTheElder
Member

#645 implemented haproxy with healthchecks per #588 (comment)

@BenTheElder BenTheElder added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jun 24, 2019
@BenTheElder
Member

so far I haven't gotten it to flake, still "soak testing" this locally 🙃

@neolit123
Member Author

#588 (comment)

if using k/k from master our workaround in kubernetes/kubernetes#78915 might be kicking in.

@neolit123
Member Author

given that #645 merged, we can locally revert kubernetes/kubernetes#78915 and test again for flakes.

but this issue can also be closed, as kind is now doing what is needed (e.g. using haproxy)?

@ereslibre

@neolit123 absolutely, I will start investigating in my spare time. Will ping you back with my findings.

@BenTheElder
Member

if using k/k from master our workaround in kubernetes/kubernetes#78915 might be kicking in.

I'm testing with v1.14.2 currently. (the default image until we bump it... about to do that)

@BenTheElder
Member

$ docker logs kind-external-load-balancer 
[WARNING] 174/214223 (1) : config : missing timeouts for frontend 'controlPlane'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] 174/214223 (1) : New worker #1 (6) forked
[WARNING] 174/214225 (1) : Reexecuting Master process
[ALERT] 174/214225 (1) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)
[WARNING] 174/214225 (6) : Stopping frontend controlPlane in 0 ms.
[WARNING] 174/214225 (6) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 174/214225 (6) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 174/214225 (6) : Proxy controlPlane stopped (FE: 0 conns, BE: 0 conns).
[NOTICE] 174/214225 (1) : New worker #1 (22) forked
[WARNING] 174/214225 (6) : Proxy kube-apiservers stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (6) : Proxy GLOBAL stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214225 (1) : Former worker #1 (6) exited with code 0 (Exit)
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane2 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane3 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214226 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214250 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214336 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer7 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 19 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214336 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214341 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214344 (22) : Server kube-apiservers/kind-control-plane2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214413 (22) : Server kube-apiservers/kind-control-plane3 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 4ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

@BenTheElder
Member

I think we can close now, please re-open or file a new issue if you see further problems!
Thanks everyone!
