FRR mode not restarting BGP when a node becomes unready #1292
Is there any progress?
Hi @xhejtman, we are giving this priority; however, we are struggling to reproduce it.
metallb.zip — I rebooted node 4. nginx was restarted on node 1 (load balancer IP 10.16.48.119). I then forced nginx to move back to node 4. Nodes 1, 3, and 4 stopped sending BGP updates to the peer; only node 2 continued to send updates. I double-checked this with the following:
Thanks! Just one question: what do you mean by "force service move"? Normally the LB service is advertised from all the nodes.
Oh, and maybe a related thing: the ingress loadbalancer service uses
I was about to ask that :-)
Yes, that is how I noticed it stopped working, as the service is actually not announced on all nodes ;)
Would you mind jumping on the MetalLB Slack in the Kubernetes workspace so we don't pollute the issue?
Hello! Did you work out any solution? |
Hi @michalg91 , would you mind clarifying? Are you restarting the deployment OR the speakers? |
And, would you mind sharing your configuration and the details of the service? |
Hi @fedepaol, I also tried changing the FRR version, but no matter which one I chose, the problem still occurs.
FRR Config in speaker:
service:
A couple more questions:
I can confirm that in native mode it is working as expected, but the use of BFD is important to us.
I was trying to narrow down, I am not suggesting that reverting is a valid workaround :-) |
And, does this happen always? |
Yes, this happens always. |
Try building the FRR image yourself from git. At least this head, 403f312d5657d2d62775dc57c0a73bfb85118ea6, worked for me, but I need to use a custom FRR image.
or if you want, you can use our image: |
@xhejtman @fedepaol thanks for your help, I think I got it running (only done a few tests). Edit:
This issue has been automatically marked as stale because it has been open for 30 days.
I think this is worked around by #1732, where we verify FRR is healthy via a probe.
This issue was first discussed in a Slack thread. I have a Kubernetes cluster with 2 worker nodes and 1 master node. MetalLB is installed from the FRR YAML, but with the speaker image tag changed from "main" to "v0.12.1", since with the "main" tag the containers were unable to start and were reporting a CrashLoopBackOff error. The cluster is working with GRE tunneling and a router running the FRR routing daemonset.
Strict ARP is enabled from an init script:
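The init script itself is not preserved in this transcript. For reference, strict ARP for kube-proxy in IPVS mode is typically enabled by editing the kube-proxy configmap, as in the MetalLB docs; the snippet below demonstrates the substitution on a sample config fragment (the live-cluster command is shown in comments):

```shell
# On a live cluster, the usual approach is to pipe the kube-proxy
# configmap through sed and apply it back:
#   kubectl get configmap kube-proxy -n kube-system -o yaml | \
#     sed -e 's/strictARP: false/strictARP: true/' | \
#     kubectl apply -f - -n kube-system
#
# The same substitution, demonstrated on a sample config snippet:
printf 'mode: ipvs\nipvs:\n  strictARP: false\n' | \
  sed -e 's/strictARP: false/strictARP: true/'
# prints:
#   mode: ipvs
#   ipvs:
#     strictARP: true
```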
The MetalLB configmap I'm using is the following:
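The configmap contents did not survive in this transcript. For orientation only, this is a minimal hypothetical sketch of the v0.12-style configmap format combining a BGP peer with a BFD profile; the peer address, ASNs, timers, and address pool are all placeholders, not the reporter's actual values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    bfd-profiles:
    - name: bfd-fast            # placeholder profile name
      receive-interval: 300     # milliseconds
      transmit-interval: 300    # milliseconds
    peers:
    - peer-address: 10.0.0.1    # placeholder router address
      peer-asn: 64512           # placeholder ASNs
      my-asn: 64513
      bfd-profile: bfd-fast     # attaches BFD to this BGP session
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 192.0.2.10-192.0.2.50   # placeholder pool range
```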
And the configuration for the FRR router is this:
In vtysh, "show bfd peers" shows the following output when all nodes are up:
On other executions, after stopping and restarting two worker nodes, I get the following output from "show bfd peers":
When one worker node is down, it shows:
Version of MetalLB
v0.12.1
Version of Kubernetes
1.23.5
Name and version of network addon (e.g. Calico, Weave...)
Flannel from master.
Whether you've configured kube-proxy for iptables or ipvs mode
I'm not sure about this; I have enabled IPVS mode with a startup script.
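The configured proxy mode can be read back from the kube-proxy configmap rather than guessed. The check below is demonstrated on a sample KubeProxyConfiguration snippet (the live-cluster command is in the comment):

```shell
# On a live cluster:
#   kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
#
# Demonstrated here on a sample KubeProxyConfiguration snippet:
printf 'kind: KubeProxyConfiguration\nmode: ipvs\n' | grep '^mode:'
# prints: mode: ipvs
```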
FRR container logs:
Environment replication:
The issue can be replicated with this repository (branch FRR_BFD) on Vagrant. To check whether the cluster is running, simply enter 172.42.42.100 in a web browser.