In a HA setup, rebooting one master node messes up the entire cluster. #52498
Comments
I'm gonna share as much of my config as I can. These are the master servers:
The loadbalancer IP is 192.168.60.150. This is the kubelet systemd unit file used by all nodes (both master and worker nodes):
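(The unit file itself didn't survive the paste; purely as an illustration, a kubelet unit of this shape might look like the following — all paths and flags here are placeholders, not my exact config:)

```ini
# /etc/systemd/system/kubelet.service -- illustrative sketch only
[Unit]
Description=Kubernetes Kubelet
After=network-online.target

[Service]
# kubeconfig points the kubelet at the apiserver loadbalancer
ExecStart=/usr/local/bin/kubelet \
  --kubeconfig=/etc/kubernetes/kubelet.conf \
  --pod-manifest-path=/etc/kubernetes/manifests \
  --cluster-dns=10.96.0.10 \
  --cluster-domain=cluster.local
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```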
Here is part of the kubeconfig file; as you can see, it connects to the loadbalancer:
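(The fragment got lost in the paste; roughly this shape, with the `server` field pointing at the LB VIP 192.168.60.150 — the CA path is a placeholder:)

```yaml
# Illustrative kubeconfig fragment -- server points at the loadbalancer
apiVersion: v1
kind: Config
clusters:
- name: kubernetes
  cluster:
    server: https://192.168.60.150:6443
    certificate-authority: /etc/kubernetes/pki/ca.crt
```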
This is the kube-apiserver manifest file (installed on all master servers):
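(The manifest was lost in the paste; an illustrative static-pod manifest of this shape is sketched below — the image tag, backend IPs, and flags are assumptions, not my actual file:)

```yaml
# Illustrative kube-apiserver static-pod manifest (placeholder values)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/kube-apiserver:v1.7.5
    command:
    - kube-apiserver
    # the LB VIP is configured as the advertise address
    - --advertise-address=192.168.60.150
    - --secure-port=6443
    # etcd endpoints are placeholders
    - --etcd-servers=https://192.168.60.151:2379,https://192.168.60.152:2379,https://192.168.60.153:2379
```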
kube-controller-manager manifest file (installed on all master servers):
This is the kube-scheduler manifest file (installed on all master servers):
This is the etcd manifest file (installed on all master servers):
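(Again lost in the paste; an illustrative etcd static-pod manifest might look like this — names, IPs, and the image tag are placeholders:)

```yaml
# Illustrative etcd static-pod manifest (placeholder values)
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: quay.io/coreos/etcd:v3.2.7
    command:
    - etcd
    - --name=master-1
    - --initial-cluster=master-1=https://192.168.60.151:2380,master-2=https://192.168.60.152:2380,master-3=https://192.168.60.153:2380
    - --listen-client-urls=https://0.0.0.0:2379
    - --advertise-client-urls=https://192.168.60.151:2379
```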
This is the nginx config for the loadbalancer:
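(The config was lost in the paste; a minimal nginx TCP loadbalancer for the apiservers uses the `stream` module, something like this — the backend IPs are placeholders, only the VIP 192.168.60.150 is from my setup:)

```nginx
# Illustrative nginx stream (TCP) loadbalancer for kube-apiserver
stream {
    upstream apiservers {
        server 192.168.60.151:6443;
        server 192.168.60.152:6443;
        server 192.168.60.153:6443;
    }
    server {
        # the VIP that all kubelets and the advertise-address point at
        listen 192.168.60.150:6443;
        proxy_pass apiservers;
    }
}
```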
@jeroenjacobs1205
Note: Method 1 will trigger an email to the group.
/sig scalability
I added the following to kube-apiserver, but no difference:
As long as the master node is down, everything keeps working. The moment it comes up again, the cluster breaks, and nothing can be done to solve it. I keep getting errors like these:
A cluster in HA mode should be able to cope with 1 master node being down for a few minutes.
I have done some more research, and the issue seems to be caused by etcd. I moved etcd to different nodes, so they are no longer shared with the Kubernetes master processes. I started stopping hosts again and turning them back on after a minute. As soon as I stop one of the etcd nodes and start it again after a minute, the issues pop up again with the same error message. I'm getting the feeling that the rebooted host returns outdated information for a short time, despite the fact that I'm running etcd 3.2.7, btw.
Guess what, using etcd 3.1.10 instead of 3.2.7 solves my issues :-) Is this a known incompatibility between k8s and etcd v3.2.x?
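(For anyone reproducing this: you can confirm which etcd version and leader each member reports after a reboot with `etcdctl` — the endpoint and cert paths below are placeholders for your own setup:)

```shell
# Check version, leader, and health of each etcd member (placeholder endpoint/certs)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.60.151:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key \
  endpoint status
```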
Closing this issue as nobody will ever answer that last question.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
In an HA setup, rebooting one of the master nodes messes up the entire cluster
What you expected to happen:
Rebooting a master node should have no impact in an HA setup
How to reproduce it (as minimally and precisely as possible):
I have 3 nodes running the control-plane processes (kube-apiserver, kube-scheduler, kube-controller-manager) as static pods. An nginx instance acts as a TCP loadbalancer for the apiservers. All kubelets connect to the loadbalancer IP address, and the LB IP address is also configured as the advertise address in kube-apiserver.
When I reboot one of the master nodes, multiple worker nodes start experiencing issues and get stuck in a "NotReady" state.
In the logs of those machines I see the following:
That's the really weird part: it's not always the rebooted master node that is affected. Those logs start to show up in other worker and master node logs.
Rebooting those nodes doesn't solve the issue, so basically, my entire cluster is messed up, and the only workaround I have is totally rebuilding it.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.7.5
- Kernel (e.g. `uname -a`): 4.4.88-1.el7.elrepo.x86_64

/sig scalability