
Load-balance the apiserver endpoint #168

Open
danderson opened this issue Feb 21, 2018 · 40 comments

@danderson (Contributor)

danderson commented Feb 21, 2018

Is this a bug report or a feature request?:

Feature request.

MetalLB cannot reliably provide a load-balancer for the Kubernetes apiserver, because of circular dependencies.

In a working HA cluster, the setup is: you have N machines with apiserver, and a load-balancer providing a single IP for all of them. Then, you configure all your kubelets to talk to the LB IP, and voila! Miracle.

But how do you configure the LB? The Kubernetes documentation basically says "use a magic load-balancer in the sky, outside your cluster, and it will be fine." That doesn't work for bare-metal clusters; we don't have magic load-balancers in the sky.

What about just configuring a LoadBalancer Service in k8s? MetalLB would create and advertise the LB IP, and everything works, right? No, because now you have a circular dependency:

  1. Kubelet cannot talk to the control plane until MetalLB has started and configured the LB IP
  2. ... But MetalLB cannot start until kubelet can talk to the control plane and discover that it should be running the pod!

So, MetalLB cannot be used to control the LB IP for the apiserver.


How can we solve this? Basically we need some way of breaking the circular dependency, so that kubelets can join the cluster and MetalLB can run, at the "same time."

There are a couple of options for this. Both require new code/config in MetalLB, but first we should try to agree on a general strategy for solving the problem. The options I can think of are:

  • Reconfigure kubelets on apiserver nodes to talk only to 127.0.0.1. This way, MetalLB can schedule on the apiserver nodes, and it can set up the cluster LB. From there, all other kubelets can connect to the LB IP, and everything works.
    • One implication of this is that kubelets on the apiserver machines will be "less reliable", because they will drop out of the cluster if their local apiserver is unavailable, even if the apiserver LB IP is still working. This is probably OK, because if the apiserver is unhealthy, the machine is probably pretty broken anyway... But it's still forcing users to change the availability semantics of their cluster.
  • Run a special "apiserver LB only" version of the speaker as a static addon pod (via a manifest in /etc/kubernetes/addons/...) on apiservers. This pod only connects to the apiserver at localhost, and only configures the apiserver LB IP, nothing else. For sanity of management, the addon manifest would be managed by the MetalLB controller, via some intermediate "addon manager" pod that drops updated manifests on the machines.
    • This adds a lot of complexity to MetalLB (need to write an "addon manager", give it similar semantics to DaemonSets...), but requires ~no changes to the cluster configuration. You just install metallb using the "HA apiserver" manifests, and it takes care of plumbing everything together to make apiserver LB work.
    • The exact separation of which pod is responsible for what is unclear to me. The proposed setup I described is probably not quite right, but I'm trying to convey the general idea of what we want to do, not the exact implementation.
    • A fair question would be: how is this better than just using keepalived-vip? One answer might be that it allows exposing the apiserver IP via BGP, not just ARP. But, for this one special case, it might be fine to just document "here's how you make this work safely using keepalived, the apiserver LB IP is special so you need a special solution".
  • Look into whether Kubernetes has any plans for "autonomous pods", i.e. pods that automatically restart even if kubelet cannot talk to the apiserver: when kubelet starts up, it looks at its local checkpoint and goes "oh, this pod is autonomous, I'll start it now". Then, later, when it successfully connects to the apiserver, the apiserver might tell it "oh, actually, don't run that pod right now", and kubelet stops it.
    • This would let us simply run MetalLB as a set of "autonomous pods", with a tiny bit of config checkpointing so that they can bring up the apiserver LB IP with zero dependencies, which allows the rest of the system to startup and converge.
    • I haven't heard of any plans for a feature like this, so this is probably just a random brainstorming idea.
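For the first option, the only change on apiserver nodes would be the kubelet's kubeconfig server field. A minimal sketch, assuming kubeadm-style paths and the default secure port (both may differ per cluster):

```yaml
# Fragment of /etc/kubernetes/kubelet.conf on apiserver nodes only:
# the local kubelet talks to its co-located apiserver, not the LB IP.
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://127.0.0.1:6443
  name: kubernetes
```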

In general, cluster bringup without circular dependencies is a can of worms, so it probably won't be trivial to fix. But MetalLB should offer a comprehensive solution for "how do I do LB in my cluster", and apiserver LB is part of that.

@danderson danderson self-assigned this Feb 21, 2018
@danderson danderson added this to To do in HA Control Plane via automation Feb 21, 2018
@danderson (Contributor, Author)

Okay, I've talked this over with a coworker (hi @maisem!), and I think I know how this needs to work. It deserves a proper design doc with graphs and stuff, to fully explain the bootstrapping problem and why this is hard, but here's a quick braindump before I forget.


Add a config option to MetalLB that tells it to treat the default/kubernetes service specially. Something like ha-kubernetes-control-plane: true.

When that service is converted to a LoadBalancer (by default it's a ClusterIP), controller does the normal IP allocation logic, but speaker does not announce it. Instead, controller writes a pod manifest to a ConfigMap (more on that pod later). It also forces externalTrafficPolicy to Local, and probably does a few other things to "standardize" the LoadBalancer, because it's more restricted than "normal" MetalLB balancers.

We add a new DaemonSet to MetalLB. It's configured to run only on master nodes, and mounts a hostPath volume for the static pod directory /etc/kubernetes/manifests. All this pod does is periodically (or on change notification) copy the ConfigMap pod manifest into /etc/kubernetes/manifests on the masters. This makes kubelet statically run that pod, even if the rest of k8s is down.

The pod manifest written by the controller is for a new MetalLB binary, a lightweight, statically configured, single-IP speaker (working name "Lithium", the lightest metal). The pod exec command looks something like lithium --ip=<LB IP> --protocol=bgp --peer=1.2.3.4 --peer=2.3.4.5 --community=1234:2345.
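As a sketch, the controller-generated static pod manifest for such a pod might look like this (image name, namespace, IP, and capability are hypothetical illustrations; only the flags mirror the example command line above):

```yaml
# Hypothetical static pod manifest for the single-IP speaker ("lithium").
# Everything it needs survives on disk, so it can start while the
# apiserver is still down.
apiVersion: v1
kind: Pod
metadata:
  name: metallb-lithium
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: lithium
    image: metallb/lithium:latest   # hypothetical image
    args:
    - --ip=10.0.0.100               # the apiserver LB IP
    - --protocol=bgp
    - --peer=1.2.3.4
    - --peer=2.3.4.5
    - --community=1234:2345
    securityContext:
      capabilities:
        add: ["NET_ADMIN"]          # needed for the iptables REDIRECT rule
```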

This pod does 2 things:

  • Announces --ip using whatever protocol configuration it was given, unconditionally (or maybe based on a hardcoded apiserver healthcheck? Either way extremely simple).
  • Installs a single iptables rule, -t nat -A PREROUTING -d <LB IP> -j REDIRECT. This rule should not be interfered with by either kube-proxy or the network addon (to be verified), and is the tiny change necessary to make master connections work.

The specifics vary based on protocol:

  • BGP is trivial: connect to all specified peers, make 1 static announcement, and sit there forever.
  • ARP is less trivial, because we need to elect a leader to make announcements, and we can't do that until k8s works. The "silly" solution for that would be to just use an actual VRRP-ish network protocol for this single use case, so that we can do "loose" leader election directly over the network. We could also use probabilistic algorithms like sleep(RANDOM(1s,100s)) before announcing, combined with some logic to detect that "okay, apiserver is up" and transition to standard leader election... But that feels a bit brittle. TBD for the exact ARP mechanism, but there are workable solutions.
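The probabilistic variant could be as small as this sketch (the function names and the hand-off condition are assumptions for illustration, not MetalLB code):

```python
import random
import time

def jittered_arp_claim(announce, apiserver_reachable,
                       min_wait=1.0, max_wait=100.0):
    """Loose bootstrap-time election: every speaker sleeps a random
    interval, so usually one wakes first and claims the IP; once the
    apiserver is reachable again, defer to the standard leader election."""
    time.sleep(random.uniform(min_wait, max_wait))
    if apiserver_reachable():
        return False  # cluster is back; let normal election take over
    announce()
    return True  # we claimed the IP during bootstrap
```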

It's quite a lot of work and extra indirection, but this achieves strong resiliency, even in the face of a global cluster poweroff and reboot. On reboot, k8s is broken because the LB is down... But the last-written static pod manifest is still there, and its commandline contains everything it needs to know in order to bring up the datapath for the apiserver LB IP. Once that is up, all kubelets are able to talk to apiserver, and the rest of Kubernetes comes up. Controller and the "copy DaemonSet" come back up, and they once again start syncing the Service object and static pod manifests.


Tentatively, slating this design for 0.5.0 or 0.6.0. It also really depends on what kubeadm ends up doing with HA clusters, because I'm assuming that will define the "standard" for how HA clusters are supposed to function, and we may need to adapt this design to match that.

@elemental-lf

@danderson, I think that some parts of the needed infrastructure have already been implemented by other projects. bootkube (and kubeadm, I think) use a temporary API server to inject the initial manifests to build a self-hosted Kubernetes control plane. MetalLB-related manifests could also be injected at that time. To solve the chicken-and-egg problem after a cluster reboot, they use a pod checkpointer which writes certain pods (and related configmaps and secrets) to disk and loads them again on reboot. See https://github.com/kubernetes-incubator/bootkube/tree/master/cmd/checkpoint.
I've built an HA API endpoint on bare metal based on bootkube and the pod checkpointer by using https://github.com/aledbf/kube-keepalived-vip, and while there are some shortcomings it works adequately.

@kaoet

kaoet commented Oct 11, 2018

Any update on this issue?

@ikorolev93

Note that you can not convert kubernetes.default service to LoadBalancer, as kube-apiserver will ensure on start that it is either NodePort (when --kubernetes-service-node-port is passed) or ClusterIP.

@sfudeus (Contributor)

sfudeus commented Dec 20, 2018

You can of course add a second service with a type of your choice, exposing the apiserver a second time. If your apiserver runs as a pod, you have a selector. If not, you need something else :-/.
In our setup, the apiserver does not run as a pod, but to achieve HA, each node has a local haproxy which proxies the local kubelet's traffic to all 3 apiservers. These are deployed as static manifests, so I can point the selector of the additional apiserver service at them.
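For reference, a minimal sketch of such a node-local haproxy config (the addresses, port, and the assumption that each apiserver listens on its node IP are illustrative):

```
# Sketch of /etc/haproxy/haproxy.cfg on each node: the local kubelet
# connects to 127.0.0.1:6443, while the apiservers listen on their
# node addresses; haproxy health-checks each backend.
frontend apiserver
    bind 127.0.0.1:6443
    mode tcp
    default_backend apiservers

backend apiservers
    mode tcp
    option tcp-check
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check
```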

@ikorolev93

Well, you only need type: LoadBalancer to assign an IP from the controller, and assuming you want it static, you could just use some custom annotation to work around that. You can still get a list of usable nodes from the appropriate Endpoints object. It would need a kubeconfig to connect to the local apiserver to get all the info, but that can be handled with a hostPath mount from /etc/kubernetes somewhere (just like every other part of the control plane).
That way, only this small part of MetalLB depends on the local apiserver, every other part of the control plane still talks to the HA address.

@surajssd

@elemental-lf Is there an easier way to make it work with a bootkube setup? Have you got it working?

@rajatchopra

Note that you can not convert kubernetes.default service to LoadBalancer, as kube-apiserver will ensure on start that it is either NodePort (when --kubernetes-service-node-port is passed) or ClusterIP.

Can we fix this behaviour? What is the downside to declaring it as LoadBalancer type?

@danderson (Contributor, Author)

That's something to take up with upstream Kubernetes; it's not something I control. Offhand, I can't think of a reason why it couldn't become a LoadBalancer (other than circular dependency issues, but as long as you avoid that...).

@ggilley

ggilley commented Jan 14, 2020

Wondering if there are new thoughts on this. I would love to get rid of HAProxy for the api server...

@zimmertr

zimmertr commented Jan 14, 2020

CoreDNS has a proxy plugin with which you can bypass the need for an LB when creating an HA master cluster: it responds to DNS queries with programmatic logic, returning a different master in different situations.

@ggilley

ggilley commented Jan 14, 2020 via email

@zimmertr

I can't share an example publicly unfortunately.

As for etcd.

@kfox1111

For external clients though, this means your clients need to either be pointed at CoreDNS, or your CoreDNS is hung off the greater DNS system? If that's the case, it isn't a solution for a lot of clusters.

@FireDrunk (Contributor)

FireDrunk commented Apr 20, 2020

I've tried MetalLB in the past, but because of the lack of this feature, I've moved on to a local hardware LB. Seeing this made me think about the problem some more.

One thing that comes to mind is that having a 'deployed' (aka stored in etcd) resource may never be the proper solution. In case of a full outage (or just a plain old shutdown for maintenance), everything would be down and would never come up, since kubelets start by contacting their apiserver to see what should be running.

I think the only proper solution would be to have manual manifests in the kubelet directory.
If there were an easy solution for this problem, it would have been implemented by Kubernetes itself for etcd as well, since etcd suffers from the same problem.

I think the easiest solution is having manual pods deployed via manifests, that form a cluster with the other apiserver LB pods, and only broadcast the single IP for the load balancer. Furthermore, they don't integrate any other MetalLB logic from the apiserver.

These 'minimal' speaker pods require little maintenance, and only need to be updated in case of significant updates in the BGP support.

The downside is that you need to run twice as many MetalLB pods in your cluster. On the other hand, it's not like MetalLB is using much resources...

It might be feasible to connect these pods afterwards to a running MetalLB cluster for monitoring purposes, or 'controller' integration. But I've not seen the ability to update kubelet's pod manifests from inside the cluster, unfortunately.

PS: If you have any theories you want to test, I have a test cluster at home containing 3 nodes and a BGP-capable router, so I can test a few scenarios if required.

@kfox1111

kfox1111 commented Apr 20, 2020

I think it's probably fine to load-balance the apiserver endpoint using MetalLB managed within the cluster. You just need to separate the use cases:

  1. kubelet -> apiserver on control-plane
  2. kubelet -> apiserver not on control-plane
  3. user -> apiserver

Cases 2 and 3 could benefit from MetalLB managing a load balancer for the apiserver.

Case 1 can't rely on MetalLB if MetalLB is in the cluster, but I don't think it needs it: kubelet on the control plane can just point at localhost for the apiserver?

@FireDrunk (Contributor)

@danderson Any update on this topic? I'm eagerly awaiting a solution!

@liuyuan10

I recently set up a cluster using MetalLB to load-balance the apiserver endpoints, by creating a LoadBalancer-typed service pointing to the apiserver pods. The real issue is not bootstrapping. It's that the k8s client can't reset an existing TCP connection when it's dead. (issue) It requires a load balancer that can health-check the apiserver endpoint and kill any existing sessions if a backend becomes unhealthy. MetalLB uses the k8s data plane to do load balancing (e.g. iptables + conntrack), which doesn't satisfy that requirement. Until that issue is fixed, I don't see it as a good idea to use MetalLB as a load balancer for control plane HA.

@kfox1111

kfox1111 commented Jun 8, 2020

@liuyuan10 Interesting issue. Wouldn't that happen for non-k8s services too? Is this a bigger problem for MetalLB in general, not just for using it to front kube-apiserver?

Separate but related question. I can see it being an issue for iptables+conntrack based kube-proxy. But is it still an issue with kube-proxy in IPVS mode?

@liuyuan10

Yes, it's also an issue for other clients that can't recover from a traffic blackhole. This is more of an issue on the client side. Usually a client should detect the timeout and recreate the TCP connection.

I never tried IPVS so I can't say. My guess is it applies to IPVS too, because that was needed to support graceful pod termination.

@kfox1111

kfox1111 commented Jun 9, 2020

Is there something MetalLB could do to help? It's already ARPing the IP. Is there something reasonable it could do on failover to send FIN packets out? Or is there not enough state on failover to be able to do that?

If it can't, how can a non-MetalLB LB solve this either? Wouldn't you get the same kind of problem if your IP moved to a different host with, say, haproxy backing it, and kubectls were in the half-closed state?

I think this may be the typical load balancer problem. You need a load balancer for your load balancer for your load balancer, because each one can't solve the problem without another load balancer doing the hard part, and people keep pushing the problem up? Make it the other load balancer's problem and pretend they are solving it for you?

I guess maybe I'm coming to the conclusion that this particular problem you mention can't really be solved outside of MetalLB either, so maybe having MetalLB load-balance the apiserver endpoint is still a valid use case? It just might not be optimal until the client is fixed. But other options will be equally broken?

@liuyuan10

There is no LB failover happening; it's that the backend starts to drop traffic. For the control plane, it can be solved by running keepalived + haproxy in front of the apiserver with healthchecks. The healthcheck in haproxy can detect a dead apiserver and reset all existing connections to the dead one.

@kfox1111

Yes, but I don't see how keepalived + haproxy failing over to another node avoids the same problem, with the kubectl client talking to the VIP / haproxy when it moves to another host.

IE, moving the problem up to keepalived + haproxy just recreates it at the keepalived + haproxy level? You just change the single point of failure from kube-apiserver to haproxy?

@liuyuan10

liuyuan10 commented Jun 12, 2020 via email

@kfox1111

I don't understand how. Does keepalived keep track of connections so it can send out TCP FINs for the failed node?

If kubectl connects to the VIP and the node with the VIP dies, no TCP FIN is sent; the VIP fails over to a new node.
If kubectl doesn't notice and recover from this state, how is a VIP managed by keepalived any different from a VIP managed by MetalLB?

I think in either case, you will get the failure you describe.

You could slide haproxy in between the VIP and kube-apiserver so that you can update kube-apiserver without hitting the issue, but then haproxy can't be updated without dropping connections? keepalived + haproxy doesn't solve the issue; it just pushes the problem to haproxy.

So, I think I'm still at the conclusion that being able to load-balance the kube-apiserver IP address just like any other IP with MetalLB is not functionally any different from needing to install keepalived to do the same work, other than needing to manage a whole other piece of software? So we should still add the functionality to MetalLB, I think. It does not appear to be any less reliable than keepalived?

Sliding in haproxy allows you to switch out the backend more reliably, but you can't just switch out haproxy.

Perhaps those that want haproxy in the middle could just launch a MetalLB-managed haproxy in the cluster and have haproxy point at the apiservers? Does that need any changes at all to work today? Is that functionally any different from using keepalived to manage the VIP?

So the real issue is one of two things to actually make the load balancer work, regardless of whether it's keepalived or MetalLB:

  1. Upstream clients need to heartbeat if there is to be reliability.
  2. The software managing the VIP needs to keep track of the connections and issue the FINs on switchover.

@liuyuan10

liuyuan10 commented Jun 12, 2020

It's the kernel that sends the resets after an LB failover. It sees TCP packets without any matching socket and resets the connection immediately. In this case it works the same way as with MetalLB.

You could slide haproxy in the middle of the vip and kube-apiserver so that you can update the kube-apiserver without hitting the issue, but then haproxy can't be updated without dropping connections? Keepalived+haproxy doesn't solve the issue. it just pushes the problem to haproxy.

I'm not asking the LB to keep an existing connection going when the backend is taken down; that's impossible. But haproxy is able to easily reset the TCP connections going to a dead backend, and that's where the issue is.

@kfox1111

I think it's a passive thing though, right? The client must send a packet to the node that was failed over to before the kernel there notices that the connection isn't open and the reset gets sent. This doesn't happen currently with a kubectl doing a watch, as it's just waiting for a message that's never going to come. It would have to send something (a heartbeat) to the node to know it's been shut down.

For backend-side reliability, could kube-proxy or MetalLB listen for k8s svc endpoint change events and force-close all existing TCP connections targeting those endpoints? That might trigger the FINs, at least in the case where the node fronting the traffic is still alive.

@champtar (Contributor)

A really small TCP keepalive value (10s?) would do the trick IMO, but I'm not sure it's enabled.

@liuyuan10

liuyuan10 commented Jun 13, 2020

The issue with watch is always there, and MetalLB is no worse than other LB solutions.

For the backend side reliability, could kube-proxy or metallb listen for k8s svc endpoint change events and force close all existing tcp connections targeting those endpoints? That might trigger the fins, at least in the case where the node fronting the traffic is still alive.

You can always hack around it to make it work. But the same thing is needed for graceful pod termination, so you can't simply do that for all services; you'd end up with a special annotation to mark the service. The root cause is in the client and the fix should go there.

A really small TCP keepalive value (10s?) would do the trick IMO, but I'm not sure it's enabled.

An aggressive keepalive interval will definitely help, but it's also a double-edged sword.

@champtar (Contributor)

You can set 10s intervals and 20 retries; it will cut the connection after 10s when you switch server, but after 200s in case of a temporary outage.
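In socket terms those numbers look like this sketch (the TCP_KEEP* option names are Linux-specific; whether a given client actually lets you tune them is another matter):

```python
import socket

def aggressive_keepalive_socket(idle=10, interval=10, retries=20):
    """TCP socket with the keepalive tuning suggested above: first probe
    after `idle` seconds of silence, then one probe every `interval`
    seconds, giving up (and resetting the connection) after `retries`
    failed probes. A server switch is noticed on the first probe (the new
    owner answers with RST, so ~10s); a silent outage takes roughly
    idle + interval * retries before the connection is cut."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; other platforms expose different option names.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, retries)
    return s
```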

@champtar (Contributor)

golang/go@5bd7e9c

@arianvp

arianvp commented Oct 25, 2020

Can't we utilise static pod manifests to have metallb running before the control plane is up?

@champtar (Contributor)

We get the configuration & status from the API, and we don't handle the data plane at all, so starting MetalLB is really the tip of the iceberg

@immanuelfodor

FYI, RKE doesn't support static pods (at least for now).

In the meantime, I created a small, low-resource project that solves the problem, please take a look if you need an immediate solution: https://github.com/immanuelfodor/kube-karp

@Skaronator

Skaronator commented May 22, 2022

FYI, just found this small tool that provides a HA control plane with a Virtual IP: https://github.com/kube-vip/kube-vip

I am not sure exactly how it works. I just found that tool, and this issue here, via Google.

@mdbooth

mdbooth commented Jun 24, 2022

This is a feature we're interested in for OpenShift, and as we're separately considering integrating MetalLB it might be something we can work on.

As mentioned in an earlier comment, I think static pods could form a key part of the solution here, specifically static speakers co-located with the api servers. The high level outline would be:

  • Add the concept of 'local' configuration which does not come from the API server to the speaker
  • When either local or api configuration is modified, the two are merged before reconciliation
  • Start the speaker from a static pod and inject config for the API VIP as local configuration

A little bit of detail from a few hours taking an initial look at the code:

The speaker must be able to start before the API server is up. The only place I could see which would currently block this is in SpeakerList.Start() where we fetch a list of other speaker pods. I think we could safely move this into the updateSpeakerIPs() thread before the for loop, right?

There are a bunch of ways we could potentially manage injection of local configuration, but one I'll offer as a straw man is a local unix socket which implements 2 gRPC calls:

  • SetLocalConfig()
  • SetLocalBalancers()

These would both accept a complete configuration, so you would clear the configuration by calling them with an empty configuration.

An independent process on the control plane host, possibly even a sidecar in the speaker pod, would be configured to poll the local API server for health. While the API server is healthy it would add the VIP locally and advertise it by pushing the relevant config to the local speaker directly. It would withdraw the VIP if the API server was not healthy.

This is somewhat hand-wavy in many parts, but my current impression is that it could work in principle. Is this something the project would be interested in? If so I will try to find some time to develop a PoC and a more thorough design.

@kfox1111

A unix socket would work, but maybe an /etc/kubernetes/manifests-style watched dir with YAML might too, and be a little more friendly for end sysadmins to play with as needed? The external app could write to this dir instead of a socket.

@mdbooth

mdbooth commented Jun 28, 2022

A unix socket would work, but maybe an /etc/kubernetes/manifests-style watched dir with YAML might too, and be a little more friendly for end sysadmins to play with as needed? The external app could write to this dir instead of a socket.

This config is going to be dynamic; for example, we'll have to withdraw it if the local API server endpoint isn't healthy, and a watched dir is finicky with atomicity. The input to the sidecar might be a regular static config directory, though.
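The atomicity concern can be handled the usual POSIX way; a sketch of a writer such a sidecar could use (directory layout and file names are hypothetical):

```python
import os
import tempfile

def publish_config(dirpath, name, contents):
    """Atomically publish a config file for a directory watcher: the final
    rename is atomic on POSIX filesystems, so the watcher sees either the
    old file or the complete new one, never a torn write."""
    fd, tmp = tempfile.mkstemp(dir=dirpath, prefix="." + name + ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, os.path.join(dirpath, name))
    except BaseException:
        os.unlink(tmp)
        raise

def withdraw_config(dirpath, name):
    """Withdraw the advertised config (e.g. when the local apiserver
    goes unhealthy) by deleting the file."""
    path = os.path.join(dirpath, name)
    if os.path.exists(path):
        os.unlink(path)
```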

@kfox1111

wouldn't just deleting the file work?

Yeah, I guess if it could do a static dir as well as a socket, that would work too.

@fedepaol (Member)


Tried to wrap my head a bit around this. My take is, the feature is interesting (but I'd also like to hear @gclawes' and/or @oribon's opinion).

As you said, a lot of details are still blurry. A couple of comments that come to mind right now:

  • MetalLB's job is to drive the traffic to the node. The path between the node and the pods is not covered by MetalLB, so this last mile must be covered somehow (or maybe it already is, as the apiservers are listening on all addresses?)
  • if we block before knowing about the other speakers, there won't be any leader election and all the nodes will reply to ARP requests at the same time, so we'd likely need to find a better way to do leader election for masters
  • we need to take into account how those static pods live together with the "regular MetalLB" pods: whether they run only on the masters or on all the nodes, what their lifecycle model is, and how to avoid overlaps.

As a side note, I had a look at what kube-vip does and I find it quite elegant. It offers a utility to generate the static pod manifests from the parameters required to bootstrap the API server, so the pod has all it needs to do the "apiserver" part, minus monitoring that the apiserver is alive. It's probably something we can take inspiration from (see https://kube-vip.io/docs/installation/static/).

In general, I think the static pods approach is viable; how to configure MetalLB is to be discussed. I also think it'd make sense to start a design proposal (even a high-level one that can be shaped as we go) that we can talk over.

karampok pushed a commit to karampok/metallb that referenced this issue Jun 7, 2024