
fails to start with a timeout with Kubernetes 1.11 #282

Open

alban opened this issue Jun 30, 2018 · 5 comments

alban (Member) commented Jun 30, 2018

To Reproduce:

  • Install Fedora 28 from https://cloud.fedoraproject.org/ (GP2 image) on AWS:
    • m4.large
    • Disk: at least 50GiB
    • ssh: ssh -i ~/.ssh/$KEY fedora@$IP
  • Start a kube-spawn Kubernetes cluster on the AWS EC2 instance:
export KUBERNETES_VERSION=v1.9.9 # or other version
export KUBERNETES_VERSION=v1.10.5 # or other version
export KUBERNETES_VERSION=v1.11.0 # or other version
export KUBE_SPAWN_VERSION=master # FIXME

## Workarounds
sudo setenforce 0

## Install dependencies
sudo dnf install -y btrfs-progs git go iptables libselinux-utils polkit qemu-img systemd-container make docker
mkdir go
export GOPATH=$HOME/go
curl -fsSL -O https://github.com/containernetworking/plugins/releases/download/v0.6.0/cni-plugins-amd64-v0.6.0.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xvf cni-plugins-amd64-v0.6.0.tgz
sudo curl -Lo /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/kubectl
sudo chmod +x /usr/local/bin/kubectl

## Compile and install
mkdir -p $GOPATH/src/github.com/kinvolk
cd $GOPATH/src/github.com/kinvolk
git clone https://github.com/kinvolk/kube-spawn.git
cd kube-spawn/
git checkout $KUBE_SPAWN_VERSION
make DOCKERIZED=n
sudo make install

## First attempt to use kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3
sudo -E kube-spawn destroy

## Workaround for "no space left on device": https://github.com/kinvolk/kube-spawn/issues/281
sudo umount /var/lib/machines
sudo qemu-img resize -f raw /var/lib/machines.raw $((10*1024*1024*1024))
sudo mount -t btrfs -o loop /var/lib/machines.raw /var/lib/machines
sudo btrfs filesystem resize max /var/lib/machines
sudo btrfs quota disable /var/lib/machines
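## Optionally, sanity-check that the resize took effect before retrying (plain btrfs-progs/coreutils commands)
df -h /var/lib/machines
sudo btrfs filesystem usage /var/lib/machines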

## Start kube-spawn
cd
sudo -E kube-spawn create --kubernetes-version $KUBERNETES_VERSION
sudo -E kube-spawn start --nodes=3

Then the error output:

Download of https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_developer_container.bin.bz2 complete.
Created new local image 'flatcar'.
Operation completed successfully.
Exiting.
nf_conntrack module is not loaded: stat /sys/module/nf_conntrack/parameters/hashsize: no such file or directory
Warning: nf_conntrack module is not loaded.
loading nf_conntrack module... 
making iptables FORWARD chain defaults to ACCEPT...
setting iptables rule to allow CNI traffic...
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-fjxan9 to start up ...
Waiting for machine kube-spawn-default-master-5y7clq to start up ...
Waiting for machine kube-spawn-default-worker-2ujr2f to start up ...
Started kube-spawn-default-worker-2ujr2f
Bootstrapping kube-spawn-default-worker-2ujr2f ...
Started kube-spawn-default-master-5y7clq
Bootstrapping kube-spawn-default-master-5y7clq ...
Cluster "default" started
Failed to start machine kube-spawn-default-worker-fjxan9: timeout waiting for "kube-spawn-default-worker-fjxan9" to start
Note: `kubeadm init` can take several minutes
master-5y7clq I0630 14:22:29.999557     380 feature_gate.go:230] feature gates: &{map[]}
              [init] using Kubernetes version: v1.11.0
              [preflight] running pre-flight checks
              [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
              [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
              [WARNING FileExisting-crictl]: crictl not found in system path
              I0630 14:22:30.050775     380 kernel_validator.go:81] Validating kernel version
              I0630 14:22:30.051083     380 kernel_validator.go:96] Validating kernel config
              [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
              [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" could not be reached
              [WARNING Hostname]: hostname "kube-spawn-default-master-5y7clq" lookup kube-spawn-default-master-5y7clq on 8.8.8.8:53: no such host
              [preflight/images] Pulling images required for setting up a Kubernetes cluster
              [preflight/images] This might take a minute or two, depending on the speed of your internet connection
              [preflight/images] You can also perform this action in beforehand using 'kubeadm config images pull'
              [kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
              [kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
              [preflight] Activating the kubelet service
              [certificates] Generated ca certificate and key.
              [certificates] Generated apiserver certificate and key.
              [certificates] apiserver serving cert is signed for DNS names [kube-spawn-default-master-5y7clq kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.22.0.3]
              [certificates] Generated apiserver-kubelet-client certificate and key.
              [certificates] Generated sa key and public key.
              [certificates] Generated front-proxy-ca certificate and key.
              [certificates] Generated front-proxy-client certificate and key.
              [certificates] Generated etcd/ca certificate and key.
              [certificates] Generated etcd/server certificate and key.
              [certificates] etcd/server serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [127.0.0.1 ::1]
              [certificates] Generated etcd/peer certificate and key.
              [certificates] etcd/peer serving cert is signed for DNS names [kube-spawn-default-master-5y7clq localhost] and IPs [10.22.0.3 127.0.0.1 ::1]
              [certificates] Generated etcd/healthcheck-client certificate and key.
              [certificates] Generated apiserver-etcd-client certificate and key.
              [certificates] valid certificates and keys now exist in "/etc/kubernetes/pki"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
              [kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
              [controlplane] wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
              [controlplane] wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
              [controlplane] wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
              [etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
              [init] waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests"
              [init] this might take a minute or longer if the control plane images have to be pulled
              [apiclient] All control plane components are healthy after 42.001677 seconds
              [uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
              [kubelet] Creating a ConfigMap "kubelet-config-1.11" in namespace kube-system with the configuration for the kubelets in the cluster
              [markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the label "node-role.kubernetes.io/master=''"
              [markmaster] Marking the node kube-spawn-default-master-5y7clq as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
              [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-master-5y7clq" as an annotation
              [bootstraptoken] using token: 1o71nu.v7s48wncryhbdmm7
              [bootstraptoken] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
              [bootstraptoken] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
              [bootstraptoken] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
              [bootstraptoken] creating the "cluster-info" ConfigMap in the "kube-public" namespace
              [addons] Applied essential addon: CoreDNS
              [addons] Applied essential addon: kube-proxy
              Your Kubernetes master has initialized successfully!
              To start using your cluster, you need to run the following as a regular user:
              mkdir -p $HOME/.kube
              sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
              sudo chown $(id -u):$(id -g) $HOME/.kube/config
              You should now deploy a pod network to the cluster.
              Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
              https://kubernetes.io/docs/concepts/cluster-administration/addons/
              You can now join any number of machines by running the following on each node
              as root:
              kubeadm join 10.22.0.3:6443 --token 1o71nu.v7s48wncryhbdmm7 --discovery-token-ca-cert-hash sha256:c8ac2337adc7ed01725bed7d78605661dc759257fce213838f1cb89486fe263c
              I0630 14:23:47.569329    1140 feature_gate.go:230] feature gates: &{map[]}
              aaaaaa.bbbbbbbbbbbbbbbb
              serviceaccount/weave-net created
              clusterrole.rbac.authorization.k8s.io/weave-net created
              clusterrolebinding.rbac.authorization.k8s.io/weave-net created
              daemonset.extensions/weave-net created
worker-2ujr2f [preflight] running pre-flight checks
              [WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh] or no builtin kernel ipvs support: map[ip_vs:{} ip_vs_rr:{} ip_vs_wrr:{} ip_vs_sh:{} nf_conntrack_ipv4:{}]
              you can solve this problem with following methods:
              1. Run 'modprobe -- ' to load missing kernel modules;
              2. Provide the missing builtin kernel ipvs support
              [WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
              [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
              [WARNING FileExisting-crictl]: crictl not found in system path
              I0630 14:23:49.919029     449 kernel_validator.go:81] Validating kernel version
              I0630 14:23:49.919338     449 kernel_validator.go:96] Validating kernel config
              [WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.05.0-ce. Max validated version: 17.03
              [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" could not be reached
              [WARNING Hostname]: hostname "kube-spawn-default-worker-2ujr2f" lookup kube-spawn-default-worker-2ujr2f on 8.8.8.8:53: no such host
              [discovery] Trying to connect to API Server "10.22.0.3:6443"
              [discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
              [discovery] Failed to connect to API Server "10.22.0.3:6443": token id "aaaaaa" is invalid for this cluster or it has expired. Use "kubeadm token create" on the master node to creating a new valid token
              [discovery] Trying to connect to API Server "10.22.0.3:6443"
              [discovery] Created cluster-info discovery client, requesting info from "https://10.22.0.3:6443"
              [discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "10.22.0.3:6443"
              [discovery] Successfully established connection with API Server "10.22.0.3:6443"
              [kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.11" ConfigMap in the kube-system namespace
              [kubelet] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
              [kubelet] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
              [preflight] Activating the kubelet service
              [tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
              [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "kube-spawn-default-worker-2ujr2f" as an annotation
              This node has joined the cluster:
              * Certificate signing request was sent to master and a response
              was received.
              * The Kubelet was informed of the new secure connection details.
              Run 'kubectl get nodes' on the master to see this node join the cluster.
Failed to start cluster: provisioning the worker nodes with kubeadm didn't succeed

More debug info:

$ kubectl get nodes
NAME                               STATUS    ROLES     AGE       VERSION
kube-spawn-default-master-5y7clq   Ready     master    1m        v1.11.0
kube-spawn-default-worker-2ujr2f   Ready     <none>    46s       v1.11.0
$ machinectl 
MACHINE                          CLASS     SERVICE        OS      VERSION  ADDRESSES
kube-spawn-default-master-5y7clq container systemd-nspawn flatcar 1814.0.0 10.22.0.3...
kube-spawn-default-worker-2ujr2f container systemd-nspawn flatcar 1814.0.0 10.22.0.2...

2 machines listed.
$ df -h /var/lib/machines
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       10G  1.7G  7.8G  18% /var/lib/machines

The third machine does not exist anymore?
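A few places that might still show traces of the vanished machine (just a guess at where to look; assuming the machine was registered with systemd-machined, these are plain systemd commands rather than anything kube-spawn-specific):

# list machines and machine images currently known to machined
machinectl list
machinectl list-images

# search the host journal for machined/nspawn messages about the failed worker
sudo journalctl -u systemd-machined --since "-1h"
sudo journalctl --since "-1h" | grep kube-spawn-default-worker-fjxan9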

alban (Member, Author) commented Jun 30, 2018

After a second attempt, it works.

arcolife commented Nov 12, 2018

I get this timeout just as @alban described, except it's reproducible every time.

$ kube-spawn start
Warning: kube-proxy could crash due to insufficient nf_conntrack hashsize.
setting nf_conntrack hashsize to 131072... 
making iptables FORWARD chain defaults to ACCEPT...
new poolSize to be : 5490739200
Starting 3 nodes in cluster default ...
Waiting for machine kube-spawn-default-worker-naz6fc to start up ...
Waiting for machine kube-spawn-default-master-yz3twq to start up ...
Waiting for machine kube-spawn-default-worker-u5fu6n to start up ...
Failed to start machine kube-spawn-default-master-yz3twq: timeout waiting for "kube-spawn-default-master-yz3twq" to start
Failed to start machine kube-spawn-default-worker-naz6fc: timeout waiting for "kube-spawn-default-worker-naz6fc" to start
Failed to start cluster: starting the cluster didn't succeed

Note:

  1. I face the same timeout, regardless of whether I destroy the cluster and start again, or mount a freshly formatted btrfs volume and redo the whole procedure.
  2. The first time I launched kube-spawn, it was with a manually formatted and mounted btrfs volume. That's when it complained that "machine.raw" was not found. So I unmounted the volume and re-ran; this time systemd-nspawn did its job and created a machine.raw. When I tried to re-spawn the cluster afterwards it obviously no longer complained about the .raw file, but it timed out regardless.
  3. Even though I've been through the troubleshooting.md guide, SELinux has been a pita: I've had to create about a dozen policies and semanage it all. Not the cake I was digging for. pfft

For debugging, is there any place this thing writes its logs to?


  • kube-spawn v0.3.0
  • FS:
/dev/loop2     btrfs      40G  1.7G   39G   5% /var/lib/machines

OR 

/dev/sda4      btrfs      56G  1.7G   54G   4% /var/lib/machines
  • systemd-container-238-10.git438ac26.fc28.x86_64
  • qemu-img-2.11.2-4.fc28.x86_64
  • machinectl limit set to 40G with a loopback mount (as also evident in the df output above):
# machinectl show
PoolPath=/var/lib/machines
PoolUsage=1866190848
PoolLimit=42949672960
  • OS: Linux 4.18.17-200.fc28.x86_64 GNU/Linux

arcolife commented Nov 12, 2018

ok nevermind.

all I had to do was:

  1. export KUBERNETES_VERSION=v1.12.0 (I hadn't set it before the create step earlier)
  2. kube-spawn destroy
  3. kube-spawn create (this time it populated /var/lib/kube-spawn/clusters; earlier that was just an empty trail of subdirs)
  4. kube-spawn start

and it works. jeez
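Condensed into shell, that's roughly the following (I'm not sure whether kube-spawn picks the version up from the KUBERNETES_VERSION environment variable alone, as step 1 suggests, or also wants the explicit --kubernetes-version flag from the original report):

export KUBERNETES_VERSION=v1.12.0   # this time set before the create step
kube-spawn destroy                  # drop the half-provisioned cluster
kube-spawn create                   # now populates /var/lib/kube-spawn/clusters; add --kubernetes-version $KUBERNETES_VERSION if the env var alone isn't enough
kube-spawn start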

krnowak (Member) commented Nov 12, 2018

Seems to be related to #325.

arcolife commented Nov 12, 2018

> Seems to be related to #325.

Sure, except I didn't destroy it first. I got the timeout from start as per #282 (comment), that is, right after creating the cluster, and then resolved the issue with #282 (comment).

Apologies if the ordering in step 2 of the resolution comment created any confusion.

Also, I can't reproduce it now. :/
