
Works on one worker, but not on the other #345

Closed
jeroenjacobs79 opened this issue Nov 25, 2018 · 2 comments


jeroenjacobs79 commented Nov 25, 2018

Is this a bug report or a feature request?:

bug report?

What happened:

I have two Kubernetes worker nodes (192.168.90.5 and 192.168.90.10). When a pod is running on 192.168.90.5 and exposed as type LoadBalancer, everything works fine. However, pods running on 192.168.90.10 are unreachable when exposed as type LoadBalancer.

The only difference between the two workers is that the working one is "bare-metal" and the other (non-working) one is a VM running under VMware ESXi.

What you expected to happen:

I expect pods to be reachable through their LoadBalancer IP regardless of which worker node they are scheduled on.

How to reproduce it (as minimally and precisely as possible):

I wish I knew....

Anything else we need to know?:

I'll try to be as descriptive as I can possibly be.

These are my two nodes:

kubectl get nodes -o wide                                                                                                                                                                                                                                         
NAME        STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
nuc-1       Ready    <none>   105d   v1.11.3   192.168.90.5    <none>        CentOS Linux 7 (Core)   4.19.4-1.el7.elrepo.x86_64   docker://18.9.0
worker-01   Ready    <none>   1h     v1.11.3   192.168.90.10   <none>        CentOS Linux 7 (Core)   4.19.4-1.el7.elrepo.x86_64   docker://18.9.0

Speakers are running on both nodes:

kubectl get pods -n metallb-system -o wide                                                                                                                                                                                                                        
NAME                        READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE
controller-9c57dbd4-4qtjq   1/1     Running   0          8h    10.32.0.56      nuc-1       <none>
speaker-b7d86               1/1     Running   0          8h    192.168.90.5    nuc-1       <none>
speaker-cb7bd               1/1     Running   0          1h    192.168.90.10   worker-01   <none>
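
To rule out a BGP session problem on the worker-01 side, it may help to confirm that the speaker there actually established its session with the router. A diagnostic sketch (the exact log lines depend on the MetalLB version):

kubectl logs -n metallb-system speaker-cb7bd | grep -i bgp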

On my router, both are listed as neighbours:

protocols {
    bgp 65000 {
        neighbor 192.168.90.5 {
            remote-as 65001
        }
        neighbor 192.168.90.10 {
            remote-as 65001
        }
        parameters {
            router-id 192.168.90.1
        }
    }
    static {
    }
}
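
On the EdgeRouter side, the session state for both neighbours can be double-checked from operational mode (a sanity-check sketch, assuming the standard EdgeOS BGP show commands):

show ip bgp summary
show ip route bgp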

This is my MetalLB config map:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 192.168.90.1
      peer-asn: 65000
      my-asn: 65001
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 192.168.90.128/25

This is the test application I'm deploying:

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        servicetest: "1"
      tolerations:
        - key: "servicetest"
          operator: "Equal"
          effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx-ext
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.90.222
  externalTrafficPolicy: Local
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: nginx
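
A quick way to see which node MetalLB decided to announce the service from is the service's events (a diagnostic sketch; the exact event wording varies between MetalLB versions):

kubectl describe svc nginx-ext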

On my router, the route table lists this when the pod is scheduled on 192.168.90.5 (nuc-1), which is the node that works correctly:

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         255.255.255.0   U     0      0        0 vtun0
0.0.0.0         81.82.192.1     0.0.0.0         UG    0      0        0 eth2
81.82.192.0     0.0.0.0         255.255.192.0   U     0      0        0 eth2
172.19.0.0      0.0.0.0         255.255.255.0   U     0      0        0 vtun0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0.99
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.30.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.101
192.168.40.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.102
192.168.90.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.110
192.168.90.222  192.168.90.5   255.255.255.255 UGH   0      0        0 eth0.110
192.168.99.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1

Like I said, at this point everything is working. But once the pod is scheduled on the other node, 192.168.90.10 (worker-01), the route table lists this:

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         255.255.255.0   U     0      0        0 vtun0
0.0.0.0         81.82.192.1     0.0.0.0         UG    0      0        0 eth2
81.82.192.0     0.0.0.0         255.255.192.0   U     0      0        0 eth2
172.19.0.0      0.0.0.0         255.255.255.0   U     0      0        0 vtun0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0.99
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.30.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.101
192.168.40.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.102
192.168.90.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.110
192.168.90.222  192.168.90.10   255.255.255.255 UGH   0      0        0 eth0.110
192.168.99.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1

This appears to be correct, but 192.168.90.222 is unreachable.
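
Since the /32 route itself looks correct, the problem is probably on the node rather than on the router. A diagnostic sketch of what could be checked on worker-01 (assuming kube-proxy runs in iptables mode):

iptables-save | grep 192.168.90.222
iptables-save | grep nginx-ext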

I looked, looked, and looked again. Both workers are set up in the same way. The only difference is that the malfunctioning one is a VM, but I don't think that explains this behaviour.

Environment:

  • MetalLB version: v0.7.3
  • Kubernetes version: 1.11.3
  • CNI: 0.6.0
  • BGP router type/version: Ubiquiti EdgeRouter
  • OS (e.g. from /etc/os-release): Centos7
  • Kernel (e.g. uname -a): 4.19.4-1.el7.elrepo.x86_64 #1 SMP Fri Nov 23 08:15:01 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Weave (pod networking overlay): v2.4.1

jeroenjacobs79 commented Nov 25, 2018

Some additional information:

Only services exposed with externalTrafficPolicy: Local seem to be affected.
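
One quick test (a sketch, not a fix) would be to temporarily switch the service to externalTrafficPolicy: Cluster and see whether the IP becomes reachable again; if it does, that points at per-node kube-proxy behaviour rather than at BGP:

kubectl patch svc nginx-ext -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'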

When I ssh into the malfunctioning host 192.168.90.10 (on which the pod is running) and run curl -v http://192.168.90.222, I get a time-out. However, when I use kubectl to start a shell in that container and run the same curl command, the request succeeds!

Clarification: curl -v http://192.168.90.222 on 192.168.90.10 fails, but curl -v http://192.168.90.222 from within the container (which runs on 192.168.90.10) succeeds.

I'm baffled....

@jeroenjacobs79

Closing this issue. Root cause was identified as an incorrect --hostname-override flag passed to kube-proxy.
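
For anyone hitting the same symptom: with externalTrafficPolicy: Local, kube-proxy only routes traffic for the service IP to endpoints it believes are local, and it decides "local" by comparing its own node name (from --hostname-override, if set) with the endpoint's node name. A rough way to spot such a mismatch (a sketch, assuming kube-proxy runs as a pod in kube-system; if it runs under systemd, inspect the unit file instead):

kubectl get nodes -o name
kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system get pod <kube-proxy-pod-name> -o yaml | grep hostname-override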
