
Service become unreachable after loadBalancerIP change with MetalLB in L2 mode #471

Closed
zigmund opened this issue Aug 27, 2019 · 9 comments · Fixed by #520


@zigmund zigmund commented Aug 27, 2019

Hi there,

I checked a few times; here are the steps to reproduce:

  1. Launch MetalLB with L2 mode configured.
  2. Create a LoadBalancer service, with or without loadBalancerIP.
  3. The service is reachable via the IP MetalLB assigned to it.
  4. Manually set a different loadBalancerIP from the same pool.
  5. MetalLB logs (and records in the service events) that the IP changed and that it is announcing the new IP, but the service is no longer reachable via either the new or the old IP.
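For reference, step 4 just means editing `spec.loadBalancerIP` on the Service object. A minimal manifest might look like the following (the service name, selector, and port here are illustrative, not taken from the report; the addresses match the pool used below):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: os-redis-frontend
  namespace: common-testing
spec:
  type: LoadBalancer
  # Changing this from one pool address to another (e.g. 10.9.244.6
  # to 10.9.244.7) triggers the reassignment that breaks reachability.
  loadBalancerIP: 10.9.244.7
  selector:
    app: redis-frontend
  ports:
    - port: 6379
      targetPort: 6379
```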

Here is an example: the service worked fine on IP 10.9.244.6, and then I changed loadBalancerIP to 10.9.244.7.
Service events:

  Normal  nodeAssigned    4s (x4 over 4m)       metallb-speaker     announcing from node "hw-kube-n5.***"
  Normal  IPAllocated     4s                    metallb-controller  Assigned IP "10.9.244.7"
  Normal  LoadbalancerIP  4s                    service-controller  10.9.244.6 -> 10.9.244.7

Controller log:

{"caller":"main.go:49","event":"startUpdate","msg":"start of service update","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.880651029Z"}
{"caller":"service.go:77","event":"clearAssignment","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.880722368Z"}
{"caller":"service.go:98","event":"ipAllocated","ip":"10.9.244.7","msg":"IP address assigned by controller","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.880757651Z"}
{"caller":"main.go:96","event":"serviceUpdated","msg":"updated service object","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.897212655Z"}
{"caller":"main.go:98","event":"endUpdate","msg":"end of service update","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.897234751Z"}

Speaker log:

{"caller":"main.go:176","event":"startUpdate","msg":"start of service update","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.896635956Z"}
{"caller":"main.go:246","event":"serviceAnnounced","ip":"10.9.244.7","msg":"service has IP, announcing","pool":"alahd-10-9-244","protocol":"layer2","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.896745401Z"}
{"caller":"main.go:249","event":"endUpdate","msg":"end of service update","service":"common-testing/os-redis-frontend-dc1-common","ts":"2019-08-27T04:41:28.896834263Z"}

There are a few ways to restore access to the service:

  • Recreate the service (delete/create);
  • Change the service type, for example, to NodePort and back;
  • Restart the speaker pod (delete it).

MetalLB version:

{"branch":"HEAD","caller":"main.go:72","commit":"v0.8.1","msg":"MetalLB speaker starting version 0.8.1 (commit v0.8.1, branch HEAD)","ts":"2019-08-27T04:37:31.972293077Z","version":"0.8.1"}

Kubernetes version:

Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate:"2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

kube-proxy in iptables mode.


@danderson danderson commented Aug 27, 2019

Thanks for the report!

Okay, I see a logic bug in the speaker when an IP is assigned and then later changed. Basically, the speaker's internal state doesn't get cleaned up, so exported metrics are wrong and the speaker ends up doing a bit too much work for IPv6.

But... Even with that (which is definitely a bug!), the service should still be reachable on the new IP, based on the code I'm staring at...

Can you help me verify a few more things with your setup? I don't have time right now to set up a sandbox and repro (but should tomorrow).

  1. Delete the service.
  2. Restart all speaker pods (to clear any bad internal state).
  3. Create the service.
  4. Run arping <assigned ip> and verify that you're seeing ARP responses from MetalLB.
  5. Change the loadBalancerIP like you did before, to make MetalLB change the IP allocation.
  6. Run arping <new assigned ip>. Do you get any ARP responses? What about arping <old assigned ip> ?

Basically, looking at the code, I don't see how the new IP would not be working. There's definitely some stuff to fix here anyway, and with more steps (multiple services, moving IPs from one to another) there may be a way to trigger what you're seeing with just this logic bug... but if you don't see any ARP traffic with the clean steps above (which don't involve borked internal speaker state), then something else is going on.
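This is not MetalLB's actual code, but a minimal Python sketch of the kind of stale-state bug described above: if the speaker tracks which IP it answers ARP for per service and fails to withdraw the old IP when the allocation changes, the old address keeps responding while the new one may never start. All names here (`Speaker`, `handle_update`, `answers_arp_for`) are illustrative.

```python
class Speaker:
    """Toy model of an L2 speaker: it answers ARP for the IPs in `active`."""

    def __init__(self):
        self.assigned = {}   # service name -> currently assigned IP
        self.active = set()  # IPs this node currently answers ARP for

    def handle_update(self, service, new_ip):
        old_ip = self.assigned.get(service)
        if old_ip is not None and old_ip != new_ip:
            # The crucial cleanup step: withdraw the stale IP on reassignment.
            # A buggy speaker that skips this keeps answering for old_ip and,
            # depending on its bookkeeping, may never announce new_ip at all.
            self.active.discard(old_ip)
        self.assigned[service] = new_ip
        self.active.add(new_ip)

    def answers_arp_for(self, ip):
        return ip in self.active


speaker = Speaker()
speaker.handle_update("os-redis-frontend", "10.9.244.1")
speaker.handle_update("os-redis-frontend", "10.9.244.15")
print(speaker.answers_arp_for("10.9.244.15"))  # True: new IP announced
print(speaker.answers_arp_for("10.9.244.1"))   # False: old IP withdrawn
```

Removing the `discard` call reproduces the symptom reported here: `answers_arp_for("10.9.244.1")` stays true after the change, matching the arping results below.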

@danderson danderson added this to To Do in Layer 2 mode via automation Aug 27, 2019
@danderson danderson self-assigned this Aug 27, 2019

@zigmund zigmund commented Aug 27, 2019

@danderson thanks for the reply.

Restarted all MetalLB pods, including the controller.
Created the service; 10.9.244.1 was assigned.
arping from an independent host in the k8s nodes' L2 domain:

# arping -i ens18 -c 3 10.9.244.1
ARPING 10.9.244.1
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=0 time=9.281 msec
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=1 time=4.254 msec
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=2 time=15.969 msec

--- 10.9.244.1 statistics ---
3 packets transmitted, 3 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 4.254/9.835/15.969/4.799 ms

Edited the service and set loadBalancerIP to 10.9.244.15. The new IP was assigned, according to the logs and events.
No arping response from the new IP, but still a response from the old IP.

# arping -i ens18 -c 3 10.9.244.15
ARPING 10.9.244.15
Timeout
Timeout
Timeout

--- 10.9.244.15 statistics ---
3 packets transmitted, 0 packets received, 100% unanswered (0 extra)

# arping -i ens18 -c 3 10.9.244.1
ARPING 10.9.244.1
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=0 time=13.750 msec
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=1 time=5.406 msec
60 bytes from 0c:c4:7a:9b:fc:b4 (10.9.244.1): index=2 time=12.952 msec

--- 10.9.244.1 statistics ---
3 packets transmitted, 3 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 5.406/10.703/13.750/3.759 ms


@fungnaz fungnaz commented Nov 1, 2019

Same problem here.
Kubernetes v1.16.2 + Calico
MetalLB v0.8.3

After changing the loadBalancerIP, the service IP was updated correctly but I could not connect.

Restarting the speaker brought it back to normal:
kubectl rollout restart daemonset speaker -n metallb-system


@davemuench davemuench commented Nov 25, 2019

I am seeing the same behavior: a quick redeploy of the speakers moves the IP; until then, the service keeps responding on the old IP and not the new one.

Rancher 2.2.3 / kubernetes 1.15.5 / metallb 0.8.3


@syska syska commented Dec 30, 2019

Kind of the same issue here...

LoadBalancerIP: 192.168.1.40

change it to

LoadBalancerIP: 192.168.1.42

Can't access it anymore ... deleting the service and creating it again resolved the issue.

Running Ubuntu 19.10 with MicroK8s and MetalLB 0.8.3.


@DonAndrey DonAndrey commented Jan 15, 2020

Any solution on this? I've faced the same problem.


@KnicKnic KnicKnic commented Jan 27, 2020

For those hitting this issue: I built a new speaker image based off master with change #520. It fixed the problem for me, though I don't know whether there is any incompatibility between a speaker built from master and a controller from the latest release.

Docker image name: knicknic/metallb-speaker:ip_delete_1


@admun admun commented Feb 2, 2020

I deployed knicknic/metallb-speaker:ip_delete_1 on my 3-node cluster, but I'm still seeing the issue..... endpoints just time out / are unreachable.

How do I debug this further?

Sorry, I am not familiar with how to debug this kind of problem.

env: Rancher 3.3.5, Kubernetes 1.17.2 (RKE), Canal, on bare metal (3x CS-24SC)


@carroarmato0 carroarmato0 commented Mar 11, 2020

Same issue when trying out knicknic/metallb-speaker:ip_delete_1 (I only updated the speaker DaemonSet).
Actually, it seems to work: for some reason my arpings don't behave the way I expected, but connecting to the new IP address does get a response.

Layer 2 mode automation moved this from To Do to Done Mar 17, 2020
9 participants