
FRR mode not restarting BGP when a node becomes unready #1292

Closed
pantxisto opened this issue Mar 29, 2022 · 23 comments

Comments

@pantxisto

pantxisto commented Mar 29, 2022

  • The bug itself, as detailed as you can.

This issue was first discussed in a Slack thread. I have a Kubernetes cluster with 2 worker nodes and 1 master node. MetalLB is installed from the FRR yaml, but with the speaker image tag changed from "main" to "v0.12.1", since with the "main" tag the containers were unable to start and were reporting a "CrashLoopBackOff" error. The cluster is working with GRE tunneling and a router running with the FRR routing daemonset.
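
For reference, the tag can be pinned in place with kubectl; a minimal sketch, assuming the stock FRR manifest where both the DaemonSet and the container are named "speaker":

# pin the speaker image to v0.12.1 (DaemonSet/container names assumed)
kubectl set image daemonset/speaker -n metallb-system \
  speaker=quay.io/metallb/speaker:v0.12.1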

Strict ARP is enabled from an init script:

# see what changes would be made, returns nonzero returncode if different
kubectl get configmap kube-proxy -n kube-system -o yaml | \
sed -e "s/strictARP: false/strictARP: true/" | \
kubectl diff -f - -n kube-system

# actually apply the changes, returns nonzero returncode on errors only
kubectl get configmap kube-proxy -n kube-system -o yaml | \
sed -e "s/strictARP: false/strictARP: true/" | \
kubectl apply -f - -n kube-system
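
Note that kube-proxy only reads this ConfigMap at startup, so the change does not affect already-running kube-proxy pods; a sketch of the restart, assuming the standard DaemonSet name:

# restart kube-proxy so it picks up strictARP: true
kubectl rollout restart daemonset kube-proxy -n kube-system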

The ConfigMap I'm using for MetalLB is the following:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.0.0.1
      peer-asn: 64600
      my-asn: 64600
      bfd-profile: full1
    bfd-profiles:
    - name: full1
      receive-interval: 179
      transmit-interval: 180
      echo-interval: 62
      echo-mode: false
      passive-mode: false
      minimum-ttl: 254
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 128.28.28.1-128.28.28.20

And this is the configuration for the FRR router:

ip nht resolve-via-default

bfd
  profile full1
    receive-interval 179
    transmit-interval 150
    echo receive-interval 62
    echo transmit-interval 62
    minimum-ttl 254
  peer 10.0.0.2
    profile full1
    no shutdown
  peer 10.0.0.3
    profile full1
    no shutdown
  peer 10.0.0.4
    profile full1
    no shutdown
    
router bgp 64600
  bgp router-id 10.0.0.1
  neighbor 10.0.0.2 remote-as 64600
  neighbor 10.0.0.2 bfd
  neighbor 10.0.0.3 remote-as 64600
  neighbor 10.0.0.3 bfd
  neighbor 10.0.0.4 remote-as 64600
  neighbor 10.0.0.4 bfd
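
On both the router and inside the speaker's frr container, the resulting session and BFD state can be inspected from vtysh, for example:

# summarize BGP sessions and BFD peer state
vtysh -c "show bgp summary"
vtysh -c "show bfd peers brief"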

In vtysh, "show bfd peers" shows the following output when all nodes are up:

BFD Peers:
	peer 10.0.0.2 vrf default
		ID: 992899208
		Remote ID: 227744130
		Active mode
		Status: up
		Uptime: 10 minute(s), 8 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.3 vrf default
		ID: 1356740786
		Remote ID: 4063445766
		Active mode
		Status: up
		Uptime: 10 minute(s), 8 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.4 vrf default
		ID: 1921253385
		Remote ID: 185864328
		Active mode
		Status: up
		Uptime: 9 minute(s), 1 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

On other executions, after stopping and restarting two worker nodes, I get the following output from "show bfd peers":

BFD Peers:
	peer 10.0.0.4 vrf default
		ID: 2422357222
		Remote ID: 0
		Active mode
		Status: down
		Downtime: 11 minute(s), 35 second(s)
		Diagnostics: neighbor signaled session down
		Remote diagnostics: neighbor signaled session down
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.3 vrf default
		ID: 144310512
		Remote ID: 0
		Active mode
		Status: down
		Downtime: 24 minute(s), 31 second(s)
		Diagnostics: control detection time expired
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.2 vrf default
		ID: 2275962565
		Remote ID: 545881773
		Active mode
		Status: up
		Uptime: 38 minute(s), 35 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

When one worker node is down it shows:

BFD Peers:
	peer 10.0.0.2 vrf default
		ID: 992899208
		Remote ID: 227744130
		Active mode
		Status: up
		Uptime: 12 minute(s), 9 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.3 vrf default
		ID: 1356740786
		Remote ID: 0
		Active mode
		Status: down
		Downtime: 45 second(s)
		Diagnostics: control detection time expired
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms

	peer 10.0.0.4 vrf default
		ID: 1921253385
		Remote ID: 185864328
		Active mode
		Status: up
		Uptime: 11 minute(s), 2 second(s)
		Diagnostics: ok
		Remote diagnostics: ok
		Peer Type: configured
		Local timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 150ms
			Echo receive interval: 62ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 179ms
			Transmission interval: 180ms
			Echo receive interval: 62ms
  • Version of MetalLB
    v0.12.1

  • Version of Kubernetes
    1.23.5

  • Name and version of network addon (e.g. Calico, Weave...)
    Flannel, from the master branch.

  • Whether you've configured kube-proxy for iptables or ipvs mode
    I'm not sure about this; I have enabled ipvs mode with a startup script.
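
A quick way to check the active mode, assuming kube-proxy's default metrics port (10249), is:

# ask the node's kube-proxy which mode it is actually running in
curl -s http://localhost:10249/proxyMode
# or inspect the ConfigMap
kubectl get configmap kube-proxy -n kube-system -o yaml | grep "mode:"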

  • FRR container logs:

sh: 60: unknown operand
Started watchfrr
sh: 60: unknown operand
(the "sh: 60: unknown operand" line repeats for the remainder of the log)
  • Speaker container logs on master node (init-worker1 stop-worker1 restart):
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB speaker starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-03-29T14:31:45.735502377Z","version":"0.12.1"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:31:45.739125696Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth0","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:31:45.739374708Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:31:45.739511326Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:31:45.739572763Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"docker0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:31:45.739665697Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"flannel.1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:31:45.739880698Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"flannel.1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:31:45.739926128Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"cni0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:31:45.740009082Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"cni0","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:31:45.740067481Z"}
W0329 14:31:45.750872     472 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","level":"info","msg":"using endpoint slices","op":"New","ts":"2022-03-29T14:31:45.751333984Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.101","node event":"NodeJoin","node name":"master","ts":"2022-03-29T14:31:45.757788511Z"}
W0329 14:31:45.760706     472 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0329 14:31:45.762096     472 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.0.0.1","ts":"2022-03-29T14:31:45.859637573Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-03-29T14:31:45.860039772Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:31:45.860528856Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:31:45.860831129Z"}
{"caller":"level.go:63","event":"nodeLabelsChanged","level":"info","msg":"Node labels changed, resyncing BGP peers","ts":"2022-03-29T14:31:45.861262547Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:31:45.861529763Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:31:45.864111906Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:31:45.866951486Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:31:45.874298383Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:32:15.736851423Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeJoin","node name":"worker1","ts":"2022-03-29T14:32:28.875622265Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:32:28.888601207Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:32:28.888717863Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:32:45.736917729Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:36:10.253097537Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.103","node event":"NodeJoin","node name":"worker2","ts":"2022-03-29T14:37:03.323619433Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:03.329544545Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:03.329687498Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:37:15.736120195Z"}
W0329 14:40:00.798082     472 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect worker1 has failed, no acks received","ts":"2022-03-29T14:40:26.738056369Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking worker1 as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T14:40:28.194747965Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeLeave","node name":"worker1","ts":"2022-03-29T14:40:28.198093195Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:40:28.210388794Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:40:28.210430025Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect worker1 has failed, no acks received","ts":"2022-03-29T14:40:28.739353217Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:40:45.73891404Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: i/o timeout\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:40:55.759543497Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:41:03.453413246Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:41:03.45346262Z"}
{"caller":"level.go:63","event":"serviceWithdrawn","ip":null,"ips":["128.28.28.1"],"level":"info","msg":"withdrawing service announcement","pool":"default","protocol":"bgp","reason":"noEndpoints","service":"default/my-nginx","ts":"2022-03-29T14:41:03.475157054Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:41:15.744450669Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:41:17.314561541Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: i/o timeout\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:41:27.315977877Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:41:48.831928524Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:42:48.833649438Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:43:45.759981517Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeJoin","node name":"worker1","ts":"2022-03-29T14:43:47.96156176Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:50.946105448Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:50.946318092Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:53.01925933Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:53.01932921Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:44:15.738539496Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:46:23.953060622Z"}
  • Speaker container logs on worker1 node (init-worker1 stop-worker1 restart):
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB speaker starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-03-29T13:20:34.596405447Z","version":"0.12.1"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T13:20:34.600675127Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth0","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T13:20:34.601438142Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T13:20:34.601549706Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T13:20:34.601598283Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"docker0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T13:20:34.601897331Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"flannel.1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T13:20:34.602010286Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"flannel.1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T13:20:34.602065564Z"}
{"caller":"level.go:63","level":"info","msg":"using endpoint slices","op":"New","ts":"2022-03-29T13:20:34.609040429Z"}
W0329 13:20:34.608553     105 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.103","node event":"NodeJoin","node name":"worker2","ts":"2022-03-29T13:20:34.61722257Z"}
W0329 13:20:34.618984     105 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0329 13:20:34.620488     105 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.0.0.1","ts":"2022-03-29T13:20:34.719002559Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-03-29T13:20:34.719233665Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.720175561Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.720529325Z"}
{"caller":"level.go:63","event":"nodeLabelsChanged","level":"info","msg":"Node labels changed, resyncing BGP peers","ts":"2022-03-29T13:20:34.721263293Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T13:20:34.72144267Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.725600945Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.73142526Z"}
{"caller":"level.go:63","error":null,"expected":2,"joined":1,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:20:34.735097399Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeJoin","node name":"master","ts":"2022-03-29T13:20:34.731018203Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeJoin","node name":"worker1","ts":"2022-03-29T13:20:34.735818535Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.741857531Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:34.741916737Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:20:36.605708088Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking master as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T13:20:38.803347968Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeLeave","node name":"master","ts":"2022-03-29T13:20:38.803741044Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:38.816895808Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:20:38.817063023Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:20:39.599802998Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T13:21:00.725377901Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 10.0.2.15: dial tcp 10.0.2.15:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:21:00.72614516Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:21:04.597578139Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeJoin","node name":"master","ts":"2022-03-29T13:21:27.421856345Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:21:27.428026957Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:21:27.4283128Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:21:30.597815699Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:21:32.598932933Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking master as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T13:21:33.987539957Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeLeave","node name":"master","ts":"2022-03-29T13:21:33.987753164Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:21:33.993390336Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:21:33.993592124Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:21:34.612609019Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 10.0.2.15: dial tcp 10.0.2.15:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:21:34.618049281Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:22:04.59718083Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeJoin","node name":"master","ts":"2022-03-29T13:22:27.586091689Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:22:27.599594307Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:22:27.599653775Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:22:30.606196865Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:22:32.602574221Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking master as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T13:22:33.793660746Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeLeave","node name":"master","ts":"2022-03-29T13:22:33.794127557Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:22:33.800362605Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:22:33.800504599Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:22:34.596772149Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:22:34.597910104Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 10.0.2.15: dial tcp 10.0.2.15:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:22:34.61745698Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:23:04.617354957Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeJoin","node name":"master","ts":"2022-03-29T13:23:27.448485002Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:23:27.454276192Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:23:27.454422536Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:23:29.601589751Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:23:31.598725432Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking master as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T13:23:33.603494745Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeLeave","node name":"master","ts":"2022-03-29T13:23:33.603643524Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:23:33.60914393Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:23:33.609205337Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:23:34.600079331Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:23:34.600339132Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 10.0.2.15: dial tcp 10.0.2.15:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:23:34.621340181Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:24:04.706957407Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeJoin","node name":"master","ts":"2022-03-29T13:24:27.587475148Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:24:27.593425106Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:24:27.593778602Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:24:29.597864274Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:24:31.599141645Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking master as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T13:24:32.832911834Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.0.2.15","node event":"NodeLeave","node name":"master","ts":"2022-03-29T13:24:32.833037803Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:24:32.839758576Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T13:24:32.839835981Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect master has failed, no acks received","ts":"2022-03-29T13:24:33.600374831Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T13:24:34.605123235Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 10.0.2.15: dial tcp 10.0.2.15:7946: connect: connection refused\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T13:24:34.618008831Z"}
  • Speaker container logs on worker2 node (init-worker1 stop-worker1 restart):
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB speaker starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-03-29T14:37:02.044244041Z","version":"0.12.1"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:37:02.044903118Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth0","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:37:02.047000684Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"eth1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:37:02.047618893Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"eth1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:37:02.047692978Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"docker0","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:37:02.04779035Z"}
{"caller":"level.go:63","event":"createARPResponder","interface":"flannel.1","level":"info","msg":"created ARP responder for interface","ts":"2022-03-29T14:37:02.048047578Z"}
{"caller":"level.go:63","event":"createNDPResponder","interface":"flannel.1","level":"info","msg":"created NDP responder for interface","ts":"2022-03-29T14:37:02.048104355Z"}
{"caller":"level.go:63","level":"info","msg":"using endpoint slices","op":"New","ts":"2022-03-29T14:37:02.066539811Z"}
W0329 14:37:02.066068     102 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.103","node event":"NodeJoin","node name":"worker2","ts":"2022-03-29T14:37:02.073821375Z"}
W0329 14:37:02.075384     102 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0329 14:37:02.077491     102 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.0.0.1","ts":"2022-03-29T14:37:02.175896131Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-03-29T14:37:02.176105939Z"}
{"caller":"level.go:63","event":"nodeLabelsChanged","level":"info","msg":"Node labels changed, resyncing BGP peers","ts":"2022-03-29T14:37:02.176224696Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:37:02.17626329Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeJoin","node name":"worker1","ts":"2022-03-29T14:37:02.18486104Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.101","node event":"NodeJoin","node name":"master","ts":"2022-03-29T14:37:02.185410703Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:02.185542432Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:02.186116794Z"}
{"caller":"level.go:63","level":"info","msg":"memberlist join succesfully","number of other nodes":2,"op":"Member detection","ts":"2022-03-29T14:37:02.192170283Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:02.195986505Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:37:02.196703177Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:37:31.234814731Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:37:32.045073767Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect worker1 has failed, no acks received","ts":"2022-03-29T14:40:23.047887043Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect worker1 has failed, no acks received","ts":"2022-03-29T14:40:25.051992337Z"}
{"caller":"state.go:1157","component":"Memberlist","level":"info","msg":"memberlist: Marking worker1 as failed, suspect timeout reached (0 peer confirmations)","ts":"2022-03-29T14:40:27.048374201Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeLeave","node name":"worker1","ts":"2022-03-29T14:40:27.048697111Z"}
{"caller":"state.go:459","component":"Memberlist","level":"info","msg":"memberlist: Suspect worker1 has failed, no acks received","ts":"2022-03-29T14:40:27.048879516Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:40:27.054425204Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:40:27.054550778Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:40:32.045019585Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:41:02.292242879Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:41:02.292346013Z"}
{"caller":"level.go:63","event":"serviceWithdrawn","ip":null,"ips":["128.28.28.1"],"level":"info","msg":"withdrawing service announcement","pool":"default","protocol":"bgp","reason":"noEndpoints","service":"default/my-nginx","ts":"2022-03-29T14:41:02.331142165Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: i/o timeout\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:41:12.075017958Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:41:32.044753429Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:42:05.140521946Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:42:38.900022216Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:42:41.969281029Z"}
{"caller":"level.go:63","error":"1 error occurred:\n\t* Failed to join 172.42.42.102: dial tcp 172.42.42.102:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-03-29T14:43:05.136842044Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"172.42.42.102","node event":"NodeJoin","node name":"worker1","ts":"2022-03-29T14:43:46.819066189Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:49.798107151Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:49.798331253Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["128.28.28.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:51.873261069Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["128.28.28.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/my-nginx","ts":"2022-03-29T14:43:51.873464734Z"}
{"caller":"level.go:63","level":"info","op":"reload-validate","success":"reloaded config","ts":"2022-03-29T14:44:02.045142609Z"}
W0329 14:46:55.088407     102 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-03-29T14:47:46.757943443Z"}

Environment replication:
The issue can be replicated with this repository (branch FRR_BFD) on Vagrant. To check whether the cluster is running, simply enter 172.42.42.100 in a web browser.
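
Equivalently, from a shell:

curl -s http://172.42.42.100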

@xhejtman

xhejtman commented May 8, 2022

Is there any progress?

@fedepaol
Member

fedepaol commented May 9, 2022

Hi @xhejtman, we are giving this priority; however, we are struggling to reproduce it.
If this is happening often on your cluster, would you mind setting the log level to debug and attaching the logs of the frr container and of the speaker here?

@xhejtman

xhejtman commented May 9, 2022

Debug logs attached: metallb.zip

I rebooted node 4; nginx was restarted on node 1 (LoadBalancer IP 10.16.48.119). I then forced nginx to move back to node 4.

Nodes 1, 3, and 4 stopped sending BGP updates to the peer; only node 2 continued to send updates.

I double-checked this with the following:

  1. A node reboot by itself is not a problem.
  2. If a node is rebooted and comes back up, I can see all nodes sending BGP updates.
  3. The problem is with moving the LoadBalancer service from one node to another after the node is back online.
  4. Once I force the service to move, some nodes stop sending BGP updates.
  5. This can be seen in the logs: IP 10.16.48.119 was moved from node 4 to node 1 due to the node 4 reboot. I moved it back to node 4 and the updates stopped.

@fedepaol
Member

fedepaol commented May 9, 2022

Thanks! Just one question: what do you mean by "force service move"? Normally the LB service is advertised from all the nodes.

@xhejtman

xhejtman commented May 9, 2022

kubectl cordon node 1
kubectl delete pod -n kube-system rke2-ingress-nginx-controller-6b45cb6ddc-cdw26

This forces the pod to start on a node other than node 1.

@xhejtman

xhejtman commented May 9, 2022

Oh, and maybe a related thing: the ingress LoadBalancer service uses externalTrafficPolicy: Local.
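
For anyone comparing behavior, the policy can be flipped with a patch; a sketch, with the service name and namespace assumed from the pod name above:

# hypothetical service name; switch to Cluster to compare announcements
kubectl patch svc rke2-ingress-nginx-controller -n kube-system \
  -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'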

@fedepaol
Copy link
Member

fedepaol commented May 9, 2022

I was about to ask that :-)

@xhejtman
Copy link

xhejtman commented May 9, 2022

Yes, that is how I noticed it had stopped working, as the service is actually not announced on all nodes ;)
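
One way to see this from the router side, assuming vtysh access on the peer, is to compare the prefixes actually received from each node:

# NODE_IP is a placeholder for each node's peering address
vtysh -c "show ip bgp neighbors NODE_IP routes"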

@fedepaol
Copy link
Member

fedepaol commented May 9, 2022

Would you mind jumping on the MetalLB Slack channel in the Kubernetes workspace so we don't pollute the issue?

@michalg91

Hello!
Recently I faced the same issue with FRR mode. We have a configuration with the Canal CNI and a Service with externalTrafficPolicy set to Local. When I delete a pod or restart the whole deployment, the routes disappear from the neighbor and are not redistributed.

Did you work out any solution?

@fedepaol
Member

Hi @michalg91 , would you mind clarifying? Are you restarting the deployment OR the speakers?

@fedepaol
Member

And, would you mind sharing your configuration and the details of the service?

@michalg91

michalg91 commented Oct 28, 2022

Hi, @fedepaol
I am restarting the deployment behind the service (in this case nginx-ingress),
and until I restart the speakers the paths are not redistributed.

I also tried changing the FRR version, but no matter which one I chose the problem still occurs.

apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: bfd-stage
  namespace: metallb-system
spec:
  detectMultiplier: 10
  echoMode: true
  minimumTtl: 250
  passiveMode: false 
  receiveInterval: 700
  transmitInterval: 700
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: bgp-bird-am4
  namespace: metallb-system
spec:
  myASN: 65531
  peerASN: 65200
  peerAddress: 172.26.6.105
  bfdProfile: bfd-stage
  routerID: 10.50.64.0
  holdTime: "10s"
  ebgpMultiHop: true
  keepaliveTime: "3s"
  nodeSelectors:
  - matchLabels:
      topology.kubernetes.io/zone: am4
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: bgp-bird-am6
  namespace: metallb-system
spec:
  myASN: 65531
  peerASN: 65201
  peerAddress: 172.26.134.105
  bfdProfile: bfd-stage
  routerID: 10.50.64.0
  holdTime: "10s"
  keepaliveTime: "3s"
  ebgpMultiHop: true
  nodeSelectors:
  - matchLabels:
      topology.kubernetes.io/zone: am6
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: bgp-one-services
  namespace: metallb-system
spec:
  addresses:
  - 10.50.64.0/28
  autoAssign: true
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgp-bird-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
  - bgp-one-services

FRR Config in speaker:

!
Building configuration...

Current configuration:
!
frr version 8.3.1_git
frr defaults traditional
hostname am6-one-worklaod-dev-15-worker4
log file /etc/frr/frr.log informational
log timestamp precision 3
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 65531
 bgp router-id 10.50.64.0
 no bgp ebgp-requires-policy
 no bgp suppress-duplicates
 no bgp hard-administrative-reset
 no bgp default ipv4-unicast
 no bgp graceful-restart notification
 no bgp network import-check
 neighbor 172.26.134.105 remote-as 65201
 neighbor 172.26.134.105 bfd
 neighbor 172.26.134.105 bfd profile bfd-stage
 neighbor 172.26.134.105 ebgp-multihop 255
 neighbor 172.26.134.105 timers 3 10
 !
 address-family ipv4 unicast
  network 10.50.64.0/32
  network 10.50.64.1/32
  neighbor 172.26.134.105 activate
  neighbor 172.26.134.105 route-map 172.26.134.105-in in
  neighbor 172.26.134.105 route-map 172.26.134.105-out out
 exit-address-family
exit
!
ip prefix-list 172.26.134.105-pl-ipv4 seq 5 permit 10.50.64.1/32
ip prefix-list 172.26.134.105-pl-ipv4 seq 10 permit 10.50.64.0/32
ip prefix-list 172.26.134.105-pl-ipv4 seq 15 deny any
!
ipv6 prefix-list 172.26.134.105-pl-ipv4 seq 5 deny any
!
route-map 172.26.134.105-in deny 20
exit
!
route-map 172.26.134.105-out permit 1
 match ip address prefix-list 172.26.134.105-pl-ipv4
exit
!
route-map 172.26.134.105-out permit 2
 match ipv6 address prefix-list 172.26.134.105-pl-ipv4
exit
!
ip nht resolve-via-default
!
ipv6 nht resolve-via-default
!
bfd
 profile bfd-stage
  detect-multiplier 10
  transmit-interval 700
  receive-interval 700
  echo-mode
  minimum-ttl 250
 exit
 !
exit
!
end

service:

apiVersion: v1
kind: Service
metadata:
  annotations:
    field.cattle.io/publicEndpoints: '[{"addresses":["10.50.64.0"],"port":80,"protocol":"TCP","serviceName":"nginx-ingress:ingress-internal-ingress-nginx-controller-loadbalancer","allNodes":false},{"addresses":["10.50.64.0"],"port":443,"protocol":"TCP","serviceName":"nginx-ingress:ingress-internal-ingress-nginx-controller-loadbalancer","allNodes":false}]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":null,"name":"ingress-internal-ingress-nginx-controller-loadbalancer","namespace":"nginx-ingress"},"spec":{"ports":[{"appProtocol":"http","name":"http","port":80,"protocol":"TCP","targetPort":"http"},{"appProtocol":"https","name":"https","port":443,"protocol":"TCP","targetPort":"https"}],"selector":{"app.kubernetes.io/component":"controller","app.kubernetes.io/instance":"ingress-internal","app.kubernetes.io/name":"ingress-nginx"},"type":"LoadBalancer"}}
  creationTimestamp: "2022-10-18T08:07:51Z"
  name: ingress-internal-ingress-nginx-controller-loadbalancer
  namespace: nginx-ingress
  resourceVersion: "18640886"
  uid: c4d61fae-7d83-4ba3-9a41-c060e6e138ab
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.43.225.32
  clusterIPs:
  - 10.43.225.32
  externalTrafficPolicy: Local
  healthCheckNodePort: 31875
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 32245
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 32436
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-internal
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.50.64.0
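
With externalTrafficPolicy: Local, kube-proxy answers the service's health-check NodePort (31875 here) based on the same local-endpoint condition that decides which nodes announce; a quick per-node check, with NODE_IP as a placeholder:

# HTTP 200 means the node has a local endpoint and should announce, 503 means it should not
curl -s -o /dev/null -w "%{http_code}\n" http://NODE_IP:31875/healthz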

@fedepaol
Member

A couple more questions:

  • Can you try with native mode?
  • Can you try bumping the version of the FRR image to latest?

@michalg91

I can confirm that in native mode it works as expected, but the use of BFD is important to us.
Yes, I tried changing the FRR image to latest.

@fedepaol
Member

I was trying to narrow it down; I am not suggesting that reverting is a valid workaround :-)
So, with the latest FRR do you still have the issue?

@fedepaol
Member

And, does this always happen?

@michalg91

Yes, this always happens, both with the latest FRR and with 7.5.1.

@xhejtman

Try building the FRR image yourself from git. At least this head, 403f312d5657d2d62775dc57c0a73bfb85118ea6, worked for me, but I need to use a custom FRR image.

@xhejtman

Or, if you want, you can use our image:
cerit.io/frr/frr:ubi-8-403f312d56

@michalg91

michalg91 commented Nov 2, 2022

@xhejtman @fedepaol thanks for your help, I think I got it running (I've only done a few tests).
I built FRR from the same head as @xhejtman; I also set avoidBuggyIPs in the address pool configuration, and I think that did the trick.

Edit:
I forgot to change externalTrafficPolicy to Local; it still doesn't work.
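
For context, avoidBuggyIPs makes MetalLB skip .0 and .255 addresses when assigning from a pool (the service above had landed on 10.50.64.0); a sketch of the pool from this report with the flag set:

kubectl apply -f - <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: bgp-one-services
  namespace: metallb-system
spec:
  addresses:
  - 10.50.64.0/28
  autoAssign: true
  avoidBuggyIPs: true
EOF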

@github-actions

This issue has been automatically marked as stale because it has been open 30 days
with no activity. This issue will be closed in 10 days unless you do one of the following:

  • respond to this issue
  • have one of these labels applied: bug,good first issue,help wanted,hold,enhancement,documentation,question

@fedepaol
Member

I think this is worked around by #1732, where we verify FRR is healthy via a probe.
Closing this.
