New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGP route not withdrawn, if shared IP, externalTrafficPolicy=Local and no endpoints #295

Closed
swoga opened this Issue Aug 5, 2018 · 7 comments

Comments

Projects
None yet
2 participants
@swoga
Copy link

swoga commented Aug 5, 2018

Is this a bug report or a feature request?:
bug

What happened:
Created a deployment with a scale of 1 and two services with a shared IP.
Initially the BGP announcements are correct, but if I set the scale to 2 and then back to 1, the routes from the second node are not withdrawn.

What you expected to happen:
The same behaviour like non-shared IPs: The routes should be withdrawn if there are no local endpoints.

How to reproduce it (as minimally and precisely as possible):
(needs min. 2 nodes)
Create deployment and two services which share the same IP.
Set the scale to 2.
Set the scale to 1.
Check the BGP peer.

Anything else we need to know?:
When you kill the speaker container on the node who falsely announces the prefixes, the problem is temporary fixed.

Log of the affected speaker container:
announce:

{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/helloworld","ts":"2018-08-05T12:49:19.457974662Z"}
{"caller":"main.go:165","event":"endUpdate","msg":"end of service update","service":"default/helloworld","ts":"2018-08-05T12:49:19.458063763Z"}
{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/test-lb-2","ts":"2018-08-05T12:49:19.472152169Z"}
{"caller":"bgp_controller.go:201","event":"updatedAdvertisements","ip":"100.64.0.3","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/test-lb-2","ts":"2018-08-05T12:49:19.47224787Z"}
{"caller":"main.go:232","event":"serviceAnnounced","ip":"100.64.0.3","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/test-lb-2","ts":"2018-08-05T12:49:19.472319871Z"}
{"caller":"main.go:234","event":"endUpdate","msg":"end of service update","service":"default/test-lb-2","ts":"2018-08-05T12:49:19.472362372Z"}
{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/test-lb","ts":"2018-08-05T12:49:19.47360399Z"}
{"caller":"bgp_controller.go:201","event":"updatedAdvertisements","ip":"100.64.0.3","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"default/test-lb","ts":"2018-08-05T12:49:19.473723792Z"}
{"caller":"main.go:232","event":"serviceAnnounced","ip":"100.64.0.3","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"default/test-lb","ts":"2018-08-05T12:49:19.473774793Z"}
{"caller":"main.go:234","event":"endUpdate","msg":"end of service update","service":"default/test-lb","ts":"2018-08-05T12:49:19.473834694Z"}

withdraw:

{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/test-lb-2","ts":"2018-08-05T12:49:33.375369048Z"}
{"caller":"main.go:261","event":"serviceWithdrawn","ip":"","msg":"withdrawing service announcement","pool":"default","protocol":"bgp","reason":"noLocalEndpoints","service":"default/test-lb-2","ts":"2018-08-05T12:49:33.375477149Z"}
{"caller":"main.go:212","event":"endUpdate","msg":"end of service update","service":"default/test-lb-2","ts":"2018-08-05T12:49:33.37550845Z"}
{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/test-lb","ts":"2018-08-05T12:49:33.385365393Z"}
{"caller":"main.go:261","event":"serviceWithdrawn","ip":"","msg":"withdrawing service announcement","pool":"default","protocol":"bgp","reason":"noLocalEndpoints","service":"default/test-lb","ts":"2018-08-05T12:49:33.385526195Z"}
{"caller":"main.go:212","event":"endUpdate","msg":"end of service update","service":"default/test-lb","ts":"2018-08-05T12:49:33.385568896Z"}
{"caller":"main.go:161","event":"startUpdate","msg":"start of service update","service":"default/helloworld","ts":"2018-08-05T12:49:33.386172805Z"}
{"caller":"main.go:165","event":"endUpdate","msg":"end of service update","service":"default/helloworld","ts":"2018-08-05T12:49:33.386228006Z"}

Deployment:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "19"
  labels:
    workload.user.cattle.io/workloadselector: deployment-default-helloworld
  name: helloworld
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: deployment-default-helloworld
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        workload.cattle.io/state: '{"cmFuY2hlcg==":"c-dl98c:m-0ca6b50b7a6c"}'
      creationTimestamp: null
      labels:
        workload.user.cattle.io/workloadselector: deployment-default-helloworld
    spec:
      affinity: {}
      containers:
      - image: eexit/mirror-http-server
        imagePullPolicy: Always
        name: helloworld
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Service 1:

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/allow-shared-ip: x
  name: test-lb
spec:
  clusterIP: 10.43.66.49
  externalTrafficPolicy: Local
  healthCheckNodePort: 31251
  ports:
  - nodePort: 32256
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    workload.user.cattle.io/workloadselector: deployment-default-helloworld
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 100.64.0.3

Service 2:

apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/allow-shared-ip: x
  name: test-lb-2
spec:
  clusterIP: 10.43.102.162
  externalTrafficPolicy: Local
  healthCheckNodePort: 30716
  loadBalancerIP: 100.64.0.3
  ports:
  - nodePort: 32661
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    workload.user.cattle.io/workloadselector: deployment-default-helloworld
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 100.64.0.3

Environment:

  • MetalLB version: 0.7.2
  • Kubernetes version: v1.10.5-rancher1-1
  • BGP router type/version: MikroTik
  • OS (e.g. from /etc/os-release): RancherOS 1.4
  • Kernel (e.g. uname -a):
@swoga

This comment has been minimized.

Copy link

swoga commented Aug 5, 2018

probably related: #263

@danderson

This comment has been minimized.

Copy link
Member

danderson commented Aug 8, 2018

Thanks for the bug report! This sounds pretty terrible... And even more confusing because according to MetalLB's logs, it's doing exactly the right thing! I see a log entry where it triggers a withdraw of the BGP announcement for reason noLocalEndpoints... So MetalLB knows that it should not be announcing... Why is it?!

Digging into this, thanks.

@danderson danderson self-assigned this Aug 8, 2018

@swoga

This comment has been minimized.

Copy link

swoga commented Aug 9, 2018

I think I have found the issue!

  1. State: Deployment with scale of 2
svcAds: {  
   "default/test-lb":[  
      {  
         "Prefix":{  
            "IP":"100.64.0.3",
            "Mask":"/////w=="
         },
         "NextHop":"",
         "LocalPref":0,
         "Communities":null
      }
   ],
   "default/test-lb-2":[  
      {  
         "Prefix":{  
            "IP":"100.64.0.3",
            "Mask":"/////w=="
         },
         "NextHop":"",
         "LocalPref":0,
         "Communities":null
      }
   ]
}

ipRefcnt: {"100.64.0.3":2}

  1. Scale set to 1

  2. deleteBalancer()@speaker/main.go gets called for service default/test-lb, but bgp_controller.DeleteBalancer() isn't called because ref != 0, but the ipRefcnt gets reduced -> ipRefcnt: {"100.64.0.3":1}

  3. deleteBalancer()@speaker/main.go gets called for service default/test-lb-2, now ref = 0 and bgp_controller.DeleteBalancer() gets called.

  4. Erroneous state

svcAds: {  
   "default/test-lb":[  
      {  
         "Prefix":{  
            "IP":"100.64.0.3",
            "Mask":"/////w=="
         },
         "NextHop":"",
         "LocalPref":0,
         "Communities":null
      }
   ]
}

If I remove the check for ref = 0 in speaker/main.go the route is withdrawn correctly, but I think there is some intent behind the ipRefcnt.

@danderson

This comment has been minimized.

Copy link
Member

danderson commented Aug 9, 2018

Thanks for digging. Yeah, the refcount is important for shared IPs, because we only want to stop announcing if all services have gone away, otherwise we can end up with weird race conditions because of the lack of guarantees k8s gives us about event ordering.

However, the key problem you've identified here is: when ref=0, DeleteBalancer() is called... But the BGP announcer library keeps on announcing. So, there's something broken in the deletion path somewhere.

@swoga

This comment has been minimized.

Copy link

swoga commented Aug 9, 2018

But there is a seperate entry for every service in svcAds and if DeleteBalancer is called only this service is removed from svcAds, as long as another service uses this IP it will still be advertised, so no refcount would be needed.
Or I am overlooking something?

The bgp_controller.go contains the following comment, which I based my assumption on:
"This list might contain duplicates, but that's fine, they'll get compacted by the session code when it's calculating advertisements"

@danderson

This comment has been minimized.

Copy link
Member

danderson commented Aug 10, 2018

Doh! Okay, I see the problem now.

When I refactored everything for IP sharing, I used the internal model of layer2 mode as a reference, where I must refcount externally and only delete when the last service goes away. On the other hand, in BGP mode, because of the comment you noted, you have to delete an equal number of times as you create, otherwise announcements stay behind.

So... Some tweaks needed, obviously :). I can't just delete the refcounting or it'll break layer2 mode. Thinking...

@danderson

This comment has been minimized.

Copy link
Member

danderson commented Aug 10, 2018

Okay, the fix here is simply to move the refcounting into internal/layer2, and unconditionally SetBalancer in the main speaker code.

It's late here, so I'll write that patch tomorrow and release 0.7.3 with the fix. Thanks for getting all the way to the root cause!

@danderson danderson removed the need-repro label Aug 10, 2018

danderson added a commit to danderson/metallb that referenced this issue Aug 17, 2018

Move IP refcounting into the layer2 package.
BGP mode does its own "refcounting" through announcement deduplication.

Fixes google#295

danderson added a commit that referenced this issue Aug 17, 2018

Move IP refcounting into the layer2 package.
BGP mode does its own "refcounting" through announcement deduplication.

Fixes #295

danderson added a commit that referenced this issue Aug 17, 2018

Move IP refcounting into the layer2 package.
BGP mode does its own "refcounting" through announcement deduplication.

Fixes #295

(cherry picked from commit 5cd15b1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment