Load-balance the apiserver endpoint #168
Okay, I've talked this over with a coworker (hi @maisem!), and I think I know how this needs to work. It deserves a proper design doc with graphs and stuff, to fully explain the bootstrapping problem and why this is hard, but here's a quick braindump before I forget.

Add a config option to MetalLB that tells it to treat the default/kubernetes service specially. Something like

When that service is converted to a LoadBalancer (by default it's a ClusterIP), controller does the normal IP allocation logic, but speaker does not announce it. Instead, controller writes a pod manifest to a ConfigMap (more on that pod later). It also forces

We add a new DaemonSet to MetalLB. It's configured to run only on master nodes, and mounts a hostPath volume for the static pod directory /etc/kubernetes/manifests. All this pod does is periodically (or on change notification) copy the ConfigMap pod manifest into /etc/kubernetes/manifests on the masters. This makes kubelet statically run that pod, even if the rest of k8s is down.

The pod manifest written by the controller is for a new MetalLB binary: a lightweight, statically configured, single-IP speaker (working name "Lithium", the lightest metal). The pod exec command looks something like

This pod does 2 things:
The specifics vary based on protocol:
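The "copy DaemonSet" part of this design could be sketched roughly as follows. This is not from the original proposal; the name, namespace, image, and ConfigMap name are all invented for illustration:

```yaml
# Hypothetical sketch of the "copy DaemonSet": runs only on masters and
# syncs the controller-written ConfigMap manifest into the static pod dir.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: static-manifest-copier   # invented name
  namespace: metallb-system
spec:
  selector:
    matchLabels:
      app: static-manifest-copier
  template:
    metadata:
      labels:
        app: static-manifest-copier
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: copier
        image: example/manifest-copier   # invented image
        volumeMounts:
        - name: static-manifests
          mountPath: /etc/kubernetes/manifests
        - name: lb-manifest
          mountPath: /config
      volumes:
      - name: static-manifests
        hostPath:
          path: /etc/kubernetes/manifests
      - name: lb-manifest
        configMap:
          name: apiserver-lb-manifest   # invented; written by the controller
```

The copier itself would only need to diff /config against the hostPath and copy on change; kubelet picks up anything dropped into the static pod directory on its own.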
It's quite a lot of work and extra indirection, but this achieves strong resiliency, even in the face of a global cluster poweroff and reboot. On reboot, k8s is broken because the LB is down... But the last-written static pod manifest is still there, and its commandline contains everything it needs to know in order to bring up the datapath for the apiserver LB IP. Once that is up, all kubelets are able to talk to apiserver, and the rest of Kubernetes comes up. Controller and the "copy DaemonSet" come back up, and they once again start syncing the Service object and static pod manifests.

Tentatively slating this design for 0.5.0 or 0.6.0. It also really depends on what kubeadm ends up doing with HA clusters, because I'm assuming that will define the "standard" for how HA clusters are supposed to function, and we may need to adapt this design to match that. |
@danderson, I think that some parts of the needed infrastructure have already been implemented by other projects. bootkube (and kubeadm, I think) use a temporary API server to inject the initial manifests to build a self-hosted Kubernetes control plane. MetalLB-related manifests could also be injected at that time. To solve the chicken-and-egg problem after a cluster reboot, they use a pod checkpointer which writes certain pods (and related configmaps and secrets) to disk and loads them again on reboot. See https://github.com/kubernetes-incubator/bootkube/tree/master/cmd/checkpoint. |
Any update on this issue? |
Note that you can not convert |
You can of course add a second service with a type of your choice, exposing the apiserver a second time. If your apiserver runs as a pod, you have a selector. If not, you need something else :-/. |
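The "second service" suggested above could look something like this. This is a hypothetical sketch, not from the comment: the selector assumes kubeadm-style static apiserver pods labeled `component: kube-apiserver`; without a pod to select, you would need to manage Endpoints manually:

```yaml
# Hypothetical second Service exposing the apiserver through MetalLB.
# Assumes apiserver pods carry the kubeadm label component=kube-apiserver.
apiVersion: v1
kind: Service
metadata:
  name: apiserver-external
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    component: kube-apiserver
  ports:
  - name: https
    port: 6443
    targetPort: 6443
```

Note this only sidesteps the restriction on the default/kubernetes service; the circular-dependency caveats discussed elsewhere in this thread still apply.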
Well, you only need |
@elemental-lf Is there an easier way to make it work with bootkube setup? have you got it working? |
Can we fix this behaviour? What is the downside to declaring it as LoadBalancer type? |
That's something to take up with upstream kubernetes, it's not something I control. Offhand I can't think of a reason why it couldn't become a LoadBalancer (other than circular dependency issues, but as long as you avoid that...). |
Wondering if there are new thoughts on this. I would love to get rid of HAProxy for the api server... |
CoreDNS has a Proxy plugin with which you can bypass the need to use a LB to create an HA master cluster by responding to DNS queries using programmatic logic with a different master given different situations. |
Cool, is there an example somewhere? Would that work with etcd as well?
|
I can't share an example publicly unfortunately. As for etcd. |
For external clients though, this means your clients need to be either pointed at coredns or your coredns is hung off of the greater DNS system? If thats the case, it isn't a solution for a lot of clusters. |
I've tried MetalLB in the past, but because of the lack of this feature, I've moved on to a hardware lb locally. Seeing this made me think about the problem some more.

One thing that comes to mind is that having a 'deployed' (aka stored in etcd) resource might never be the proper solution. Because in case of a full outage (or just a plain old shutdown, because of maintenance), everything would be down, and would never come up, since kubelets start by contacting their apiserver to see what should be running. I think the only proper solution would be to have manual manifests in the kubelet directory.

I think the easiest solution is having manual pods deployed via manifests, that form a cluster with the other apiserver lb pods, and only broadcast the single IP for the loadbalancer. Furthermore, they don't integrate any other MetalLB logic from the APIServer. These 'minimal' speaker pods require little maintenance, and only need to be updated in case of significant updates in the BGP support.

The downside is that you need to run twice as many MetalLB pods in your cluster. On the other hand, it's not like MetalLB is using much resources... It might be feasible to connect these pods afterwards to a running MetalLB cluster for monitoring purposes, or 'controller' integration. But I've not seen the ability to update the kubelet's pod manifests from inside the cluster, unfortunately.

PS: If you have any theories you want to test, I have a test cluster at home containing 3 nodes and a BGP-capable router, so I can test a few scenarios if required. |
I think it's probably fine to load balance the apiserver endpoint using metallb managed within the cluster. You just need to separate the use cases:
2 & 3 could benefit from metallb managing a load balancer for the apiserver.
|
@danderson Any update on this topic? I'm eagerly awaiting a solution! |
I recently set up a cluster using metallb to load balance api server endpoints, by creating a LoadBalancer-typed service pointing to apiserver pods. The real issue is not about bootstrapping. It's that the k8s client can't reset an existing TCP connection when it's dead. (issue) It requires a load balancer that can do healthchecks on the api server endpoints and kill any existing sessions if a backend becomes unhealthy. metallb uses the k8s data plane to do load balancing (e.g. iptables + conntrack), which doesn't satisfy that requirement. Until that issue is fixed, I don't think it's a good idea to use metallb as a load balancer for control plane HA. |
@liuyuan10 interesting issue. Wouldn't that happen for non k8s services too? Is this a bigger problem for Metallb in general, not just using it to front kube-apiserver? Separate but related question. I can see it being an issue for iptables+conntrack based kube-proxy. But is it still an issue with kube-proxy in ipvs mode? |
Yes, it's also an issue for other clients that can't recover from a traffic blackhole. This is more of an issue on the client side. Usually a client should detect the timeout and recreate the TCP connection. I never tried ipvs so I can't say. My guess is it applies to ipvs too, because ipvs needed the same machinery to support graceful pod termination. |
Is there something metallb could do to help? Its already arping the ip. There something reasonable it could do on failover to send fin packets out? Or is there not enough state on failover to be able to do that? If it cant, how can a non metallb lb solve this either? Wouldn't you get the same kind of problem if your ip moved to a different host with say, haproxy backing it, and kubectl's were in the half closed state? I think this may be the typical load balancer problem. You need a load balancer for your load balancer for your load balancer cause each one can't solve the problem without another load balancer doing the hard part and people keep pushing the problem up? Make it the other load balancer's problem and pretend they are solving it for you? I guess maybe I'm coming to the conclusion that this particular problem you mention can't be solved outside of metallb really either, so maybe the functionality of having metallb be able to load balance the apiserver endpoint still is a valid use case? It just might not be optimal for the use case until the client is fixed. But other options will be equally as broken? |
There is no lb failover happening; it's that the backend starts to drop traffic. For the control plane, it can be solved by running keepalived + haproxy in front of the apiservers with healthchecks. The healthcheck in haproxy can detect a dead apiserver and reset all existing connections to the dead one. |
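A minimal haproxy fragment along those lines might look like this (a sketch; the addresses are placeholders). The `on-marked-down shutdown-sessions` server option is what resets established connections once a check marks a backend as down:

```
# Hypothetical haproxy config: TCP checks on each apiserver, and
# shutdown-sessions to reset established connections to a dead backend.
frontend apiserver
    bind *:6443
    mode tcp
    default_backend apiservers

backend apiservers
    mode tcp
    option tcp-check
    default-server inter 2s fall 3 rise 2 on-marked-down shutdown-sessions
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check
```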
Yes, but I don't see how keepalived + haproxy failing over to another node doesn't have the same problem, with the kubectl client talking to the vip / haproxy when it moves to another host. I.e., moving the problem up to keepalived + haproxy just recreates the problem at the keepalived + haproxy level? You just change the single point of failover failure from kube-apiserver to haproxy? |
lb failover will reset all existing connections, so the same issue doesn't apply. The new lb node can't handle existing connections.
|
I don't understand how. Does keepalived keep track of connections so it can send out tcp fin's for the failed node? If kubectl connects to the vip and the node with the vip dies, no tcp fin is sent, and the vip fails over to a new node. I think in either case, you will get the failure you describe.

You could slide haproxy in the middle between the vip and kube-apiserver so that you can update the kube-apiserver without hitting the issue, but then haproxy can't be updated without dropping connections? Keepalived+haproxy doesn't solve the issue. It just pushes the problem to haproxy.

So, I think I'm still at the conclusion that being able to load balance the kube-apiserver ip address just like any other ip with metallb is not functionally any different than needing to install keepalived to do the same work, other than needing to manage a whole other piece of software? So we should still add the functionality to metallb, I think. It does not appear to be any less reliable than keepalived. Sliding in haproxy allows you to switch out the backend more reliably, but you can't just switch out haproxy.

Perhaps for those that want haproxy in the middle, they just launch a metallb-managed haproxy in the cluster and have haproxy point at the apiservers? Does that need any changes at all to work today? Is that functionally any different than using keepalived to manage the vip?

So the real issue is one of two things to actually make the load balancer stuff work, regardless of whether it's keepalived or metallb:
|
It's the kernel that is sending the resets after a lb failover. It sees TCP packets without any matching socket and will reset it immediately. In this case it works the same way as metallb.
I'm not asking the LB to keep existing connections going when the backend is turned down; that's impossible. But haproxy is able to reset the TCP connections going to a dead backend easily, and that's where the issue is. |
I think it's a passive thing though, right? The client must send a packet to the node that was failed over to before the kernel there notices that the connection isn't open and the fin gets sent. This doesn't happen currently with a kubectl doing a watch, as it's just waiting for a message that's never going to come. It would have to send something (a heartbeat) to the node to know it's been shut down. For backend-side reliability, could kube-proxy or metallb listen for k8s svc endpoint change events and force-close all existing tcp connections targeting those endpoints? That might trigger the fins, at least in the case where the node fronting the traffic is still alive. |
A really small tcp keepalive value (10s?) would do the trick IMO, but I'm not sure it's enabled |
The issue with watch is always there and metallb is not worse than other LB solutions.
You can always hack around it to make it work. But the same thing is needed for graceful pod termination, so you can't simply do that for all services. You'll end up with a special annotation to mark the service. The root cause is in the client and the fix should go there.
An aggressive keepalive interval definitely will help, but it's also a double-edged sword. |
You can put 10s intervals and 20 retries: it will cut the connection after 10s when you switch servers, but after 200s in the case of a temporary outage |
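For reference, this is the kind of client-side knob being discussed: a sketch of enabling aggressive TCP keepalive on a socket, using the numbers from the comments above. The option names are the Linux ones; they differ on other platforms:

```python
import socket

# Sketch: aggressive TCP keepalive on a client socket (Linux option names).
# A dead peer that silently drops traffic is detected after roughly
# TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT = 10 + 10*20 = 210 seconds.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)   # idle time before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 20)    # failed probes before reset
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT))
```

A server that failed over (so the new node has no matching socket) is reset on the first probe that reaches it, which is the "after 10s when you switch servers" case; a blackholed server only times out after all probes fail.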
Can't we utilise static pod manifests to have metallb running before the control plane is up? |
We get the configuration & status from the API, and we don't handle the data plane at all, so starting MetalLB is really the tip of the iceberg |
FYI, RKE doesn't support static pods (at least for now). In the meantime, I created a small, low-resource project that solves the problem, please take a look if you need an immediate solution: https://github.com/immanuelfodor/kube-karp |
FYI, just found this small tool that provides a HA control plane with a Virtual IP: https://github.com/kube-vip/kube-vip I am not sure how it exactly works. Just found that tool and this issue here via google. |
This is a feature we're interested in for OpenShift, and as we're separately considering integrating MetalLB it might be something we can work on. As mentioned in an earlier comment, I think static pods could form a key part of the solution here, specifically static speakers co-located with the api servers. The high level outline would be:
A little bit of detail from a few hours taking an initial look at the code: The speaker must be able to start before the API server is up. The only place I could see which would currently block this is in SpeakerList.Start() where we fetch a list of other speaker pods. I think we could safely move this into the updateSpeakerIPs() thread before the for loop, right? There are a bunch of ways we could potentially manage injection of local configuration, but one I'll offer as a straw man is a local unix socket which implements 2 gRPC calls:
These would both accept a complete configuration, so you would clear the configuration by calling them with an empty configuration. An independent process on the control plane host, possibly even a sidecar in the speaker pod, would be configured to poll the local API server for health. While the API server is healthy it would add the VIP locally and advertise it by pushing the relevant config to the local speaker directly. It would withdraw the VIP if the API server was not healthy. This is somewhat hand-wavy in many parts, but my current impression is that it could work in principle. Is this something the project would be interested in? If so I will try to find some time to develop a PoC and a more thorough design. |
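The straw-man local API could be sketched as an IDL fragment like the following. Every name here is invented for illustration; the comment above deliberately leaves the call details open:

```proto
// Straw-man config surface for a local unix socket; all names invented.
syntax = "proto3";
package speakerlocal;

message Empty {}

// A complete replacement for the locally-injected config; calling with
// an empty message clears it (per the "clear by empty config" semantics).
message LocalConfig {
  repeated string vips = 1;  // addresses to advertise locally
  // peer / pool details elided
}

service LocalSpeaker {
  // Replace the locally-injected advertisements.
  rpc SetAdvertisements(LocalConfig) returns (Empty);
  // Replace the locally-injected peer configuration.
  rpc SetPeers(LocalConfig) returns (Empty);
}
```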
unix socket would work, but maybe a /etc/kubernetes/manifest style watched dir with yaml might too and be a little more friendly for end sysadmins to play with as needed? The external app could write to this dir instead of a socket. |
This config is going to be dynamic, for example we'll have to withdraw it if the local API server endpoint isn't healthy, and a watched dir is finicky with atomicity. The input to the sidecar might be a regular static config directory, though. |
wouldn't just deleting the file work? Yeah, I guess if it could do a static dir as well as a socket, that would work too. |
Tried to wrap my head a bit around this. My take is, the feature is interesting (but I'd also like to hear @gclawes' and/or @oribon's opinions). As you said, a lot of details are still blurry. A couple of comments that come to mind right now:
As a side note, I had a look at what kube-vip does and I find it quite elegant. It offers a utility to generate the static pod manifests from the parameters required to bootstrap the API server, so the pod has all it needs to do the "apiserver" part, minus monitoring that the apiserver is alive. Probably it's something we can take inspiration from (see https://kube-vip.io/docs/installation/static/ ). In general, I think the static pods approach is viable; how to configure metallb is to be discussed. I also think it'd make sense to start a design proposal (even a high-level one, that can be shaped as we go) that we can talk over. |
Upstream alignment - 270324
Is this a bug report or a feature request?:
Feature request.
MetalLB cannot reliably provide a load-balancer for the Kubernetes apiserver, because of circular dependencies.
In a working HA cluster, the setup is: you have N machines with apiserver, and a load-balancer providing a single IP for all of them. Then, you configure all your kubelets to talk to the LB IP, and voila! Miracle.
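Concretely, "configure all your kubelets to talk to the LB IP" means every kubelet's kubeconfig points at the LB address instead of any single master. A sketch, with a placeholder VIP:

```yaml
# Fragment of a kubelet kubeconfig; 10.0.0.100 is a placeholder LB VIP.
apiVersion: v1
kind: Config
clusters:
- name: default
  cluster:
    server: https://10.0.0.100:6443
    certificate-authority: /etc/kubernetes/pki/ca.crt
```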
But how do you configure the LB? The Kubernetes documentation basically says "use a magic load-balancer in the sky, outside your cluster, and it will be fine." That doesn't work for bare metal clusters; we don't have magic load-balancers in the sky.
What about just configuring a LoadBalancer Service in k8s? MetalLB would create and advertise the LB IP, and everything works, right? No, because now you have a circular dependency:
So, MetalLB cannot be used to control the LB IP for the apiserver.
How can we solve this? Basically we need some way of breaking the circular dependency, so that kubelets can join the cluster and MetalLB can run, at the "same time."
There are a couple of options for this. Both require new code/config in MetalLB, but first we should try to agree on a general strategy for solving the problem. The options I can think of are:
In general, cluster bringup without circular dependencies is a can of worms, so it probably won't be trivial to fix. But MetalLB should offer a comprehensive solution for "how do I do LB in my cluster", and apiserver LB is part of that.