Etcd doesn't start on master after reboot on AWS host #417

Closed
cristic83 opened this issue Sep 1, 2017 · 4 comments
@cristic83

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:30:51Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:48:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
The connection to the server 10.0.19.21:6443 was refused - did you specify the right host or port?

  • Cloud provider or hardware configuration: cluster installed with kubeadm on AWS hosts, but without enabling the cloud-provider aws
  • OS (e.g. from /etc/os-release):
    PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)"
  • Kernel (e.g. uname -a):
    Linux ip-10-0-19-21.eu-west-1.compute.internal 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

What happened?

We've installed a Kubernetes cluster, both on a single Red Hat machine and across several Red Hat machines hosted on AWS, by following the guide: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/ . We used flannel for the pod network. Note that we did not make the cluster aware of the AWS environment, i.e. we did not follow the instructions in this doc: https://docs.google.com/document/d/17d4qinC_HnIwrK0GHnRlD1FKkTNdN__VO4TH9-EzbIY/edit to enable the AWS cloud provider, as we wanted to simulate a bare-metal installation.

What you expected to happen?

We expected that stopping the AWS master machine and starting it again (in both the 1-host and the 3-host setup) would bring the Kubernetes master services back up successfully. Instead, we noticed that the etcd pod on the master did not start and its log contained the following error:

2017-09-01 11:03:50.267821 I | etcdmain: etcd Version: 3.0.17
2017-09-01 11:03:50.268234 I | etcdmain: Git SHA: cc198e2
2017-09-01 11:03:50.268239 I | etcdmain: Go Version: go1.6.4
2017-09-01 11:03:50.268246 I | etcdmain: Go OS/Arch: linux/amd64
2017-09-01 11:03:50.268250 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-09-01 11:03:50.268327 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-09-01 11:03:50.269031 I | etcdmain: listening for peers on http://localhost:2380
2017-09-01 11:03:50.269075 I | etcdmain: listening for client requests on 127.0.0.1:2379
2017-09-01 11:03:50.269214 I | etcdmain: stopping listening for client requests on 127.0.0.1:2379
2017-09-01 11:03:50.269232 I | etcdmain: stopping listening for peers on http://localhost:2380
2017-09-01 11:03:50.269239 C | etcdmain: cannot access data directory: open /var/lib/etcd/.touch: permission denied

Moreover, the kubectl version command reported the client version but the connection to the API server was refused:
version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:48:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
The connection to the server 10.0.19.21:6443 was refused - did you specify the right host or port?

The kubelet command fails with the following error:

I0901 13:10:01.759421 4969 feature_gate.go:144] feature gates: map[]
W0901 13:10:01.759797 4969 server.go:496] No API client: no api servers specified
error: failed to run Kubelet: error reading /var/run/kubernetes/kubelet.key, certificate and key must be supplied as a pair

The kube-api-server pod also fails to start because etcd is not up.

How to reproduce it (as minimally and precisely as possible)?

Install the Kubernetes cluster on one or three machines as described above.
Stop and then start the AWS master machine.
SSH to the AWS master machine and:

sudo docker ps -a | grep "etcd"
sudo docker logs <etcd_id>

Anything else we need to know?

$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Vi 2017-09-01 11:05:28 CEST; 2h 6min ago
Docs: http://kubernetes.io/docs/
Main PID: 11149 (kubelet)
CGroup: /system.slice/kubelet.service
├─11149 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-...
└─11213 journalctl -k -f

sep 01 13:12:19 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:19.909863 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https...
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: I0901 13:12:20.200246 11149 kubelet_node_status.go:247] Setting node annotation to enable volume controller attach/detach
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: I0901 13:12:20.286776 11149 kubelet_node_status.go:82] Attempting to register node ip-10-0-19-21.eu-west-1.compute.internal
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:20.287104 11149 kubelet_node_status.go:106] Unable to register node "ip-10-0-19-21.eu-west-1.compute.in...ion refused
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:20.600222 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Serv...ion refused
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:20.601103 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Ge...
sep 01 13:12:20 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:20.910412 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https...
sep 01 13:12:21 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:21.600833 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Serv...ion refused
sep 01 13:12:21 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:21.601776 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Ge...
sep 01 13:12:21 ip-10-0-19-21.eu-west-1.compute.internal kubelet[11149]: E0901 13:12:21.910936 11149 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https...
Hint: Some lines were ellipsized, use -l to show in full.
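To see the lines that systemctl ellipsized, the standard systemd tooling works (generic commands, nothing specific to this cluster):

$ sudo systemctl status kubelet -l              # full status output without truncation
$ sudo journalctl -u kubelet --no-pager -n 50   # last 50 kubelet journal lines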

$ sudo ls -la /var/lib/etcd
total 4
drwxr-xr-x. 3 root root 19 sep 1 10:52 .
drwxr-xr-x. 34 root root 4096 sep 1 12:22 ..
drwx------. 4 root root 27 sep 1 10:52 member

The docker daemon runs with root privileges.
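For what it's worth, the listing above shows root owns the data directory, so plain Unix permissions look fine; a next step is to check whether SELinux is denying access (which turns out to be the cause, see the reply below). A quick check on RHEL might look like this (the ausearch call assumes auditd is running):

$ getenforce                         # Enforcing, Permissive or Disabled
$ sudo ls -laZ /var/lib/etcd         # show SELinux contexts on the etcd data directory
$ sudo ausearch -m avc -ts recent    # recent SELinux access denials (requires auditd)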

@tomciaaa

The problem is SELinux. As per https://kubernetes.io/docs/setup/independent/install-kubeadm/:

Note: Disabling SELinux by running setenforce 0 is required to allow containers to access the host filesystem, which is required by pod networks for example. You have to do this until SELinux support is improved in the kubelet.

After you restart the box, SELinux comes back up as 'Enforcing' (you can check whether that's the case by running getenforce).

You should be able to stop SELinux from enforcing permanently on the box by setting SELINUX=permissive in /etc/selinux/config.
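A minimal sketch of that change, assuming a stock RHEL/CentOS /etc/selinux/config:

$ sudo setenforce 0          # stop enforcing right now (lost on reboot)
$ sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config   # persist across reboots
$ getenforce                 # should now report Permissive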

@luxas

luxas commented Oct 27, 2017

Great @tomciaaa, it seems like the issue got solved

@novakg

novakg commented Apr 18, 2018

This is a very helpful discussion. I'd suggest adding these steps to the documentation: https://kubernetes.io/docs/setup/independent/install-kubeadm/#installing-kubeadm-kubelet-and-kubectl

@xjantoth

xjantoth commented Apr 23, 2018

My original issue turned out to be exactly the same problem as people are describing above:


I'm experiencing exactly the same problems on CentOS 7.
I have installed a simple two-member (master, node) k8s cluster on CentOS 7.4,
following https://www.linuxtechi.com/install-kubernetes-1-7-centos7-rhel7/

 kubectl.exe get nodes
NAME               STATUS    ROLES     AGE       VERSION
x042.x.int.com     Ready     <none>    36m       v1.10.1
x051.x.int.com     Ready     master    40m       v1.10.1

I have installed:

  • tiller
  • helm
  • weave (networking)
  • helmfile installation of prometheus and grafana

Everything looks perfect, and I have set up a NodePort for the grafana service.
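One way to get such a NodePort, assuming the chart installed graf-grafana as a plain ClusterIP service, is a patch along these lines (a sketch, not necessarily how it was done here):

$ kubectl patch svc graf-grafana -p '{"spec":{"type":"NodePort"}}'   # kube-proxy then allocates a port from the 30000-32767 range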

 kubectl.exe get svc,pods
NAME                                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
svc/graf-grafana                         NodePort    10.103.172.192   <none>        80:30598/TCP   33m
svc/kubernetes                           ClusterIP   10.96.0.1        <none>        443/TCP        42m
svc/prom-prometheus-alertmanager         ClusterIP   10.98.26.164     <none>        80/TCP         33m
svc/prom-prometheus-kube-state-metrics   ClusterIP   None             <none>        80/TCP         33m
svc/prom-prometheus-node-exporter        ClusterIP   None             <none>        9100/TCP       33m
svc/prom-prometheus-pushgateway          ClusterIP   10.103.22.73     <none>        9091/TCP       33m
svc/prom-prometheus-server               ClusterIP   10.103.210.21    <none>        80/TCP         33m

NAME                                                     READY     STATUS    RESTARTS   AGE
po/graf-grafana-8748b7bf-2pc8g                           1/1       Running   1          33m
po/prom-prometheus-alertmanager-688fb9dbbf-ptdpr         2/2       Running   2          33m
po/prom-prometheus-kube-state-metrics-5ccdb7cb7c-qf8xq   1/1       Running   1          33m
po/prom-prometheus-node-exporter-r56zh                   1/1       Running   1          33m
po/prom-prometheus-pushgateway-5b6f76698f-jhqhh          1/1       Running   1          33m
po/prom-prometheus-server-7d9564d579-lj4jj               2/2       Running   2          33m

So far so good, but there is a problem.

When I try to access the service from outside the cluster, the connection hangs:

$ curl x042.x.int.com:30598 -v
* STATE: INIT => CONNECT handle 0x600057990; line 1423 (connection #-5000)
* Rebuilt URL to: N.N.N.N:30598/
* Added connection 0. The cache now contains 1 members
*   Trying N.N.N.N...
* TCP_NODELAY set
* STATE: CONNECT => WAITCONNECT handle 0x600057990; line 1475 (connection #0)

but when I SSH to the actual node

ssh x042.x.int.com or ssh x051.x.int.com

I can easily access

curl x042.x.int.com:30598 -v

and get the response I am expecting:

<a href="/login">Found</a>.

HINT

lsof -i :30598
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
kube-prox 3094 root    9u  IPv6  29431      0t0  TCP *:30598 (LISTEN)

CentOS 7: after I rebooted the server and ran kubeadm reset/init a few times, I can't start the cluster anymore; it hangs on
[init] This might take a minute or longer if the control plane images have to be pulled.

tail /var/log/messages
5566 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://X.X.X.X:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dqa051.qa.zoomint.com&limit=500&resourceVersion=0: dial tcp X.X.X.X:6443: getsockopt: connection refused

I have spent about 4 days on it. As you can see in my post from about 4 days ago, I was at least able to start up the cluster; suddenly, it is not possible anymore.
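Given how this resolved for the others above, a quick sanity check after each reboot and before kubeadm init might look like this (assumes the standard RHEL/CentOS SELinux tooling):

$ getenforce                              # should not report Enforcing for a kubeadm-era kubelet
$ grep '^SELINUX=' /etc/selinux/config    # confirm the setting persists across reboots
$ systemctl is-enabled kubelet            # kubelet should be enabled so it comes back with the host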
