In an HA setup, rebooting one master node messes up the entire cluster. #52498

Closed
jeroenjacobs79 opened this issue Sep 14, 2017 · 7 comments
Labels
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

jeroenjacobs79 commented Sep 14, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

In an HA setup, rebooting one of the master nodes messes up the entire cluster

What you expected to happen:

Rebooting a master node should have no impact in an HA setup.

How to reproduce it (as minimally and precisely as possible):

I have 3 nodes running the control-plane processes (kube-apiserver, kube-scheduler, kube-controller-manager) as static pods. An nginx instance acts as a TCP load balancer for the apiserver; all kubelets connect to the load balancer's IP address, and that IP is also configured as the advertise address in kube-apiserver.
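
A quick sanity check for this path (a sketch; the IPs are the ones from my setup shared further down, and /healthz may need client certificates depending on your anonymous-auth/RBAC settings):

# Hit the apiserver through the nginx TCP load balancer:
curl -k https://192.168.60.150:6443/healthz

# Hit each master's apiserver directly to rule out a single broken backend:
for ip in 192.168.60.10 192.168.60.11 192.168.60.12; do
  echo -n "$ip: "; curl -sk "https://$ip:6443/healthz"; echo
done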

When I reboot one of the master nodes, multiple other worker nodes start experiencing issues and get stuck in a "NotReady" state.

In the logs of those machines I see the following:

Sep 14 20:47:18 master-03 kubelet: , diff2={"status":{"$setElementOrder/conditions":[{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}]}}
Sep 14 20:47:18 master-03 kubelet: E0914 20:47:18.395590   13436 kubelet_node_status.go:357] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2017-09-14T18:47:18Z\",\"lastTransitionTime\":\"2017-09-14T18:47:18Z\",\"message\":\"kubelet has sufficient disk space available\",\"reason\":\"KubeletHasSufficientDisk\",\"status\":\"False\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2017-09-14T18:47:18Z\",\"lastTransitionTime\":\"2017-09-14T18:47:18Z\",\"message\":\"kubelet has sufficient memory available\",\"reason\":\"KubeletHasSufficientMemory\",\"status\":\"False\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2017-09-14T18:47:18Z\",\"lastTransitionTime\":\"2017-09-14T18:47:18Z\",\"message\":\"kubelet has no disk pressure\",\"reason\":\"KubeletHasNoDiskPressure\",\"status\":\"False\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2017-09-14T18:47:18Z\",\"lastTransitionTime\":\"2017-09-14T18:47:18Z\",\"message\":\"kubelet is posting ready status\",\"reason\":\"KubeletReady\",\"status\":\"True\",\"type\":\"Ready\"}]}}" for node "master-03": Operation cannot be fulfilled on nodes "master-03": there is a meaningful conflict (firstResourceVersion: "3138", currentResourceVersion: "3292"):
Sep 14 20:47:18 master-03 kubelet: diff1={"metadata":{"resourceVersion":"3292"},"status":{"$setElementOrder/conditions":[{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2017-09-14T18:42:12Z","lastTransitionTime":"2017-09-14T18:43:23Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-09-14T18:42:12Z","lastTransitionTime":"2017-09-14T18:43:23Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-09-14T18:42:12Z","lastTransitionTime":"2017-09-14T18:43:23Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastHeartbeatTime":"2017-09-14T18:42:12Z","lastTransitionTime":"2017-09-14T18:43:23Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
Sep 14 20:47:18 master-03 kubelet: , diff2={"status":{"$setElementOrder/conditions":[{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2017-09-14T18:47:18Z","lastTransitionTime":"2017-09-14T18:47:18Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}]}}
Sep 14 20:47:18 master-03 kubelet: E0914 20:47:18.395608   13436 kubelet_node_status.go:349] Unable to update node status: update node status exceeds retry count

That's the really weird part: it's not always the rebooted master node that is affected. These errors start showing up in the logs of other worker and master nodes as well.

Rebooting the affected nodes doesn't solve the issue, so basically my entire cluster is broken, and the only workaround I have is rebuilding it from scratch.
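
For anyone reproducing this, the flapping is easy to watch from kubectl (master-03 is just an example of an affected node):

# Node readiness across the cluster:
kubectl get nodes -o wide

# Conditions of one affected node:
kubectl describe node master-03 | grep -A 12 'Conditions:'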

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.7.5
  • Cloud provider or hardware configuration: vSphere
  • OS (e.g. from /etc/os-release): CentOS7
  • Kernel (e.g. uname -a): 4.4.88-1.el7.elrepo.x86_64
  • Install tools: ansible
  • Others:

/sig scalability

jeroenjacobs79 commented Sep 14, 2017

I'm gonna share as much of my config as I can.

These are the master servers:

  • 192.168.60.10
  • 192.168.60.11
  • 192.168.60.12

The load balancer IP is 192.168.60.150.

This is the kubelet systemd unit file used on all nodes (both master and worker nodes); the copy below is from master-03:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Requires=docker.service

[Service]
ExecStart=/usr/local/bin/kubelet \
  --hostname-override=master-03 \
  --cgroup-driver=systemd \
  --allow-privileged=true \
  --pod-manifest-path=/etc/kubernetes/manifests \
  --cluster-dns=10.96.0.10 \
  --cluster-domain=cluster.local \
  --enable-custom-metrics \
  --kubeconfig=/etc/kubeconfig/kubelet \
  --network-plugin=cni \
  --pod-cidr=10.32.0.0/12 \
  --register-node=true \
  --require-kubeconfig \
  --runtime-request-timeout=10m \
  --tls-cert-file=/etc/ssl/kubernetes/master-03.pem \
  --tls-private-key-file=/etc/ssl/kubernetes/master-03-key.pem \
  --v=1
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Here is part of the kubeconfig file; as you can see, it connects to the load balancer:

apiVersion: v1
clusters:
- cluster:
    certificate-authority: /etc/ssl/kubernetes/root_ca.pem
    server: https://192.168.60.150:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    user: system:node:master-03
  name: default-system
current-context: default-system
kind: Config
preferences: {}
users:
- name: system:node:master-03
  user:
    client-certificate: /etc/ssl/kubernetes/master-03.pem
    client-key: /etc/ssl/kubernetes/master-03-key.pem

This is the kube-apiserver manifest file (installed on all master servers):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota
    - --advertise-address=192.168.60.150
    - --allow-privileged=true
    - --storage-backend=etcd3
    - --apiserver-count=3
    - --authorization-mode=RBAC
    - --requestheader-username-headers=X-Remote-User
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --bind-address=0.0.0.0
    - --client-ca-file=/etc/ssl/kubernetes/root_ca.pem
    - --etcd-servers=http://192.168.60.10:2379,http://192.168.60.11:2379,http://192.168.60.12:2379
    - --event-ttl=1h
    - --insecure-bind-address=127.0.0.1
    - --kubelet-certificate-authority=/etc/ssl/kubernetes/root_ca.pem
    - --kubelet-client-certificate=/etc/ssl/kubernetes/apiserver.pem
    - --kubelet-client-key=/etc/ssl/kubernetes/apiserver-key.pem
    - --kubelet-https=true
    - --service-account-key-file=/etc/ssl/kubernetes/apiserver-key.pem
    - --service-cluster-ip-range=10.96.0.0/12
    - --service-node-port-range=30000-32767
    - --tls-ca-file=/etc/ssl/kubernetes/root_ca.pem
    - --tls-cert-file=/etc/ssl/kubernetes/apiserver.pem
    - --tls-private-key-file=/etc/ssl/kubernetes/apiserver-key.pem
    image: gcr.io/google_containers/kube-apiserver-amd64:v1.7.5
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-apiserver
    resources:
      requests:
        cpu: 250m
    volumeMounts:
    - mountPath: /etc/ssl/kubernetes
      name: certs
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/ssl/kubernetes
    name: certs

kube-controller-manager manifest file (installed on all master servers):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --cluster-cidr=10.32.0.0/12
    - --allocate-node-cidrs=true
    - --cluster-name=kubernetes
    - --cluster-signing-cert-file=/etc/ssl/kubernetes/root_ca.pem
    - --cluster-signing-key-file=/etc/ssl/kubernetes/root_ca-key.pem
    - --leader-elect=true
    - --master=http://127.0.0.1:8080
    - --root-ca-file=/etc/ssl/kubernetes/root_ca.pem
    - --service-account-private-key-file=/etc/ssl/kubernetes/apiserver-key.pem
    - --service-cluster-ip-range=10.96.0.0/12
    image: gcr.io/google_containers/kube-controller-manager-amd64:v1.7.5
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10252
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    volumeMounts:
    - mountPath: /etc/ssl/kubernetes
      name: certs
      readOnly: true
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/ssl/kubernetes
    name: certs

This is the kube-scheduler manifest file (installed on all master servers):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --leader-elect=true
    - --master=http://127.0.0.1:8080
    - --address=127.0.0.1
    image: gcr.io/google_containers/kube-scheduler-amd64:v1.7.5
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
  hostNetwork: true

This is the etcd manifest file (installed on all master servers):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - image: quay.io/coreos/etcd:v3.2.7
    command:
      - /usr/local/bin/etcd
      - --name=192.168.60.12
      - --initial-advertise-peer-urls=http://192.168.60.12:2380
      - --listen-peer-urls=http://0.0.0.0:2380
      - --listen-client-urls=http://0.0.0.0:2379
      - --advertise-client-urls=http://192.168.60.12:2379
      - --initial-cluster-token=etcd-cluster-0
      - --initial-cluster=192.168.60.10=http://192.168.60.10:2380,192.168.60.11=http://192.168.60.11:2380,192.168.60.12=http://192.168.60.12:2380
      - --initial-cluster-state=new
      - --data-dir=/var/etcd
    # - --cert-file=/etc/ssl/master-03-bundle.pem
    # - --key-file=/etc/ssl/master-03-key.pem
    # - --peer-cert-file=/etc/ssl/master-03-bundle.pem
    # - --peer-key-file=/etc/ssl/master-03-key.pem
    # - --trusted-ca-file=/etc/ssl/root_ca.pem
    # - --peer-trusted-ca-file=/etc/ssl/root_ca.pem
    # - --client-cert-auth=true

    name: etcd
    volumeMounts:
    - mountPath: /var/etcd
      name: etcd-data
  hostNetwork: true
  volumes:
  - hostPath:
      path: /var/etcd
    name: etcd-data

This is the nginx config for the loadbalancer:

stream {
    upstream kubeapi {
        hash $remote_addr consistent;

        server 192.168.60.10:6443;
        server 192.168.60.11:6443;
        server 192.168.60.12:6443;
    }

    server {
        listen 6443;
        proxy_connect_timeout 3s;
        proxy_pass kubeapi;
    }
}

@k8s-github-robot

@jeroenjacobs1205
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in the method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 14, 2017
@jeroenjacobs79
Author

/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Sep 15, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 15, 2017
@jeroenjacobs79
Author

I added the following flag to kube-apiserver, but it made no difference:

--etcd-quorum-read=true

As long as the master node is down, everything keeps working. The moment it comes up again, the cluster breaks, and nothing can be done to solve it. I keep getting errors like these:

Sep 15 12:16:20 master-01 kubelet: E0915 12:16:20.786176    1220 kubelet_node_status.go:349] Unable to update node status: update node status exceeds retry count
Sep 15 12:16:27 master-01 kubelet: W0915 12:16:27.089152    1220 status_manager.go:448] Failed to update status for pod "kube-scheduler-master-01_kube-system(53bc2dd9-99f6-11e7-94f8-005056a1731b)": Operation cannot be fulfilled on pods "kube-scheduler-master-01": the object has been modified; please apply your changes to the latest version and try again
Sep 15 12:16:27 master-01 kubelet: W0915 12:16:27.095572    1220 status_manager.go:448] Failed to update status for pod "kube-controller-manager-master-01_kube-system(5295d60b-99f6-11e7-94f8-005056a1731b)": Operation cannot be fulfilled on pods "kube-controller-manager-master-01": the object has been modified; please apply your changes to the latest version and try again

A cluster in HA mode should be able to cope with 1 master node being down for a few minutes.
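
To help narrow this down, a sketch of how the etcd side can be checked (assumes etcdctl v3 is available on a master and the plain-HTTP client URLs from the manifests above):

# Cluster health as seen from any master:
ETCDCTL_API=3 etcdctl \
  --endpoints=http://192.168.60.10:2379,http://192.168.60.11:2379,http://192.168.60.12:2379 \
  endpoint health

# Per-member version, raft term and raft index; a member that is behind
# would show a lower raft index here:
ETCDCTL_API=3 etcdctl \
  --endpoints=http://192.168.60.10:2379,http://192.168.60.11:2379,http://192.168.60.12:2379 \
  -w table endpoint status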

@jeroenjacobs79
Author

I have done some more research, and the issue seems to be caused by etcd. I moved etcd to different nodes, so it is no longer co-located with the Kubernetes master processes.

I then started stopping hosts again and turning them back on after a minute.

As soon as I stop one of the etcd nodes and start it again after a minute, the same issues pop up again with the same error messages.

I'm getting the feeling that the rebooted host returns outdated information for a short time, despite the fact that --etcd-quorum-read=true is specified. I'm running 3 etcd nodes, so quorum reads should work, yes?

I'm running etcd 3.2.7, btw.
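
To illustrate what I mean by a stale read, a sketch (the endpoint address is from my earlier co-located setup, so substitute the rebooted etcd member; /registry/minions/... is where Kubernetes 1.7 stores Node objects under the default etcd prefix, and master-03 is just an example node):

# Linearizable (quorum) read -- goes through raft and should never be stale:
ETCDCTL_API=3 etcdctl --endpoints=http://192.168.60.12:2379 \
  get /registry/minions/master-03 --keys-only

# Serializable read against the same member -- this one is allowed to return stale data:
ETCDCTL_API=3 etcdctl --endpoints=http://192.168.60.12:2379 --consistency=s \
  get /registry/minions/master-03 --keys-only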

@jeroenjacobs79
Author

Guess what, using etcd 3.1.10 instead of 3.2.7 solves my issues :-)

Is this a known incompatibility between k8s and etcd v3.2.x?
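
In case anyone wants to compare, this is how I double-checked which etcd binary is actually serving (the docker name filter is just a guess that matches my static-pod setup, and the address should be one of your etcd nodes):

# Version of the etcd binary inside the running container:
docker exec "$(docker ps -qf name=etcd)" etcd --version

# Or ask a member over its client port:
curl -s http://192.168.60.12:2379/version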

@jeroenjacobs79
Author

Closing this issue as nobody will ever answer that last question.
