
Works on one worker, but not on the other #345

Closed
jeroenjacobs79 opened this issue Nov 25, 2018 · 2 comments


jeroenjacobs79 commented Nov 25, 2018

Is this a bug report or a feature request?:

bug report?

What happened:

I have two Kubernetes worker nodes (192.168.90.5 and 192.168.90.10). When a pod is running on 192.168.90.5 and exposed as type LoadBalancer, everything works fine. However, pods running on 192.168.90.10 are unreachable when exposed as type LoadBalancer.

The only difference between the two workers is that the working one is "bare-metal" and the other (non-working) one is a VM running under VMware ESXi.

What you expected to happen:

I expect pods to be reachable through their LoadBalancer IP regardless of which worker node they are scheduled on.

How to reproduce it (as minimally and precisely as possible):

I wish I knew....

Anything else we need to know?:

I'll try to be as descriptive as I can possibly be.

These are my two nodes:

kubectl get nodes -o wide                                                                                                                                                                                                                                         
NAME        STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
nuc-1       Ready    <none>   105d   v1.11.3   192.168.90.5    <none>        CentOS Linux 7 (Core)   4.19.4-1.el7.elrepo.x86_64   docker://18.9.0
worker-01   Ready    <none>   1h     v1.11.3   192.168.90.10   <none>        CentOS Linux 7 (Core)   4.19.4-1.el7.elrepo.x86_64   docker://18.9.0

Speakers are running on both nodes:

kubectl get pods -n metallb-system -o wide                                                                                                                                                                                                                        
NAME                        READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE
controller-9c57dbd4-4qtjq   1/1     Running   0          8h    10.32.0.56      nuc-1       <none>
speaker-b7d86               1/1     Running   0          8h    192.168.90.5    nuc-1       <none>
speaker-cb7bd               1/1     Running   0          1h    192.168.90.10   worker-01   <none>
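
To rule out a BGP session problem on the worker-01 side, it may help to confirm that the speaker there actually established its session with the router. A diagnostic sketch (the exact log lines depend on the MetalLB version):

kubectl logs -n metallb-system speaker-cb7bd | grep -i bgp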

On my router, both are listed as neighbours:

protocols {
    bgp 65000 {
        neighbor 192.168.90.5 {
            remote-as 65001
        }
        neighbor 192.168.90.10 {
            remote-as 65001
        }
        parameters {
            router-id 192.168.90.1
        }
    }
    static {
    }
}
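
On the EdgeRouter side, the session state for both neighbours can be double-checked from operational mode (a sanity-check sketch, assuming the standard EdgeOS BGP show commands):

show ip bgp summary
show ip route bgp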

This is my MetalLB config map:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 192.168.90.1
      peer-asn: 65000
      my-asn: 65001
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 192.168.90.128/25

This is the test application I'm deploying:

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1 # tells deployment to run 1 pod matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        servicetest: "1"
      tolerations:
        - key: "servicetest"
          operator: "Equal"
          effect: "NoSchedule"
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx-ext
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.90.222
  externalTrafficPolicy: Local
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: nginx
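
A quick way to see which node MetalLB decided to announce the service from is the service's events (a diagnostic sketch; the exact event wording varies between MetalLB versions):

kubectl describe svc nginx-ext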

On my router, the route table lists this when the pod is scheduled on 192.168.90.5 (nuc-1), which is the node that works correctly:

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         255.255.255.0   U     0      0        0 vtun0
0.0.0.0         81.82.192.1     0.0.0.0         UG    0      0        0 eth2
81.82.192.0     0.0.0.0         255.255.192.0   U     0      0        0 eth2
172.19.0.0      0.0.0.0         255.255.255.0   U     0      0        0 vtun0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0.99
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.30.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.101
192.168.40.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.102
192.168.90.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.110
192.168.90.222  192.168.90.5   255.255.255.255 UGH   0      0        0 eth0.110
192.168.99.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1

Like I said, at this point everything is working. But once the pod is scheduled on the other node, 192.168.90.10 (worker-01), the route table lists this:

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         0.0.0.0         255.255.255.0   U     0      0        0 vtun0
0.0.0.0         81.82.192.1     0.0.0.0         UG    0      0        0 eth2
81.82.192.0     0.0.0.0         255.255.192.0   U     0      0        0 eth2
172.19.0.0      0.0.0.0         255.255.255.0   U     0      0        0 vtun0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0.99
192.168.10.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.30.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.101
192.168.40.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.102
192.168.90.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0.110
192.168.90.222  192.168.90.10   255.255.255.255 UGH   0      0        0 eth0.110
192.168.99.0    0.0.0.0         255.255.255.0   U     0      0        0 eth1

This appears to be correct, but 192.168.90.222 is unreachable.
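
Since the /32 route itself looks correct, the problem is probably on the node rather than on the router. A diagnostic sketch of what could be checked on worker-01 (assuming kube-proxy runs in iptables mode):

iptables-save | grep 192.168.90.222
iptables-save | grep nginx-ext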

I looked, looked, and looked again. Both workers are set up in the same way. The only difference is that the malfunctioning one is a VM, but I don't think that explains this behaviour.

Environment:

  • MetalLB version: v0.7.3
  • Kubernetes version: 1.11.3
  • CNI: 0.6.0
  • BGP router type/version: Ubiquiti EdgeRouter
  • OS (e.g. from /etc/os-release): Centos7
  • Kernel (e.g. uname -a): 4.19.4-1.el7.elrepo.x86_64 #1 SMP Fri Nov 23 08:15:01 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Weave (pod networking overlay): v2.4.1

jeroenjacobs79 commented Nov 25, 2018

Some additional information:

Only services exposed with externalTrafficPolicy: Local seem to be affected.
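
One quick test (a sketch, not a fix) would be to temporarily switch the service to externalTrafficPolicy: Cluster and see whether the IP becomes reachable again; if it does, that points at per-node kube-proxy behaviour rather than at BGP:

kubectl patch svc nginx-ext -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'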

When I ssh into the malfunctioning host 192.168.90.10 (on which the pod is running) and run curl -v http://192.168.90.222, I get a time-out. However, when I use kubectl to start a shell in that container and run the same curl command, the request succeeds!

Clarification: curl -v http://192.168.90.222 on 192.168.90.10 fails, but curl -v http://192.168.90.222 from within the container (which runs on 192.168.90.10) succeeds.

I'm baffled....

@jeroenjacobs79

Closing this issue. Root cause was identified as an incorrect --hostname-override flag passed to kube-proxy.
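
For anyone hitting the same symptom: with externalTrafficPolicy: Local, kube-proxy only routes traffic for the service IP to endpoints it believes are local, and it decides "local" by comparing its own node name (from --hostname-override, if set) with the endpoint's node name. A rough way to spot such a mismatch (a sketch, assuming kube-proxy runs as a pod in kube-system; if it runs under systemd, inspect the unit file instead):

kubectl get nodes -o name
kubectl -n kube-system get pods -o wide | grep kube-proxy
kubectl -n kube-system get pod <kube-proxy-pod-name> -o yaml | grep hostname-override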
