
UDP service not working on 1 node #93791

Closed
velkhatib opened this issue Aug 7, 2020 · 11 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@velkhatib

velkhatib commented Aug 7, 2020

What happened:
UDP services are not working on one node (the other three are OK).

What you expected to happen:
UDP Kubernetes services work from my 4th node.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
I built a 3-node cluster about 6 months ago and have just added a node.

  • Cluster of 4 nodes; 3 are running and working well.
  • CoreDNS is on the 2nd node. I have no problems with my first 3 nodes.

When I try to reach CoreDNS through the Kubernetes service from my last node, it doesn't work:

nslookup consul.infra.svc.cluster.local 10.96.0.10
;; connection timed out; no servers could be reached

But when I use the CoreDNS pod IP directly from my last node, it works fine:

nslookup consul.infra.svc.cluster.local 10.244.1.83
Server:		10.244.1.83
Address:	10.244.1.83#53

Name:	consul.infra.svc.cluster.local
Address: 10.109.247.189

And when I use nslookup over TCP from my last node, it works:

nslookup 
> set vc
> server 10.96.0.10
Default server: 10.96.0.10
Address: 10.96.0.10#53
> consul.infra.svc.cluster.local
Server:		10.96.0.10
Address:	10.96.0.10#53

When I capture traffic with tcpdump during the nslookup, I can see the UDP request on my 4th node:

tcpdump -ni any port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
16:59:05.433050 IP 10.244.4.0.50391 > 10.244.1.83.53: 55191+ A? consul.infra.svc.cluster.local. (48)
16:59:10.433433 IP 10.244.4.0.50391 > 10.244.1.83.53: 55191+ A? consul.infra.svc.cluster.local. (48)
16:59:15.433631 IP 10.244.4.0.50391 > 10.244.1.83.53: 55191+ A? consul.infra.svc.cluster.local. (48)

But I see nothing on my 2nd node, which hosts the CoreDNS pod.
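
To narrow down where the query is being dropped, one possible follow-up (a diagnostic sketch, not part of the original report; it assumes flannel's default VXLAN UDP port 8472 and that the conntrack utility is installed on the nodes) is to capture the encapsulated overlay traffic on both nodes and check connection tracking on the sending node:

# On the 4th node: is the DNS query encapsulated and sent over the VXLAN overlay?
# (8472 is flannel's default VXLAN port; adjust if it was overridden)
tcpdump -ni any udp port 8472

# On the 2nd node: does the encapsulated packet arrive at all?
tcpdump -ni any udp port 8472

# On the 4th node: did conntrack record the DNAT from the service IP to the pod IP?
conntrack -L -p udp --dport 53 | grep 10.96.0.10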

My iptables rules related to the DNS service on my 4th node:

iptables-save |grep dns
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.0.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SVC-JD5MR3NA4I4DYORP -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-SEP-GPHYIAY7CKFIF2AF
-A KUBE-SEP-GPHYIAY7CKFIF2AF -s 10.244.1.83/32 -m comment --comment "kube-system/kube-dns:metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-GPHYIAY7CKFIF2AF -p tcp -m comment --comment "kube-system/kube-dns:metrics" -m tcp -j DNAT --to-destination 10.244.1.83:9153
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-57SGA34GFMXC42YK
-A KUBE-SEP-57SGA34GFMXC42YK -s 10.244.1.83/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-57SGA34GFMXC42YK -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 10.244.1.83:53
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-3VSLRZMLJF4URI34
-A KUBE-SEP-3VSLRZMLJF4URI34 -s 10.244.1.83/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-3VSLRZMLJF4URI34 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 10.244.1.83:53
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them
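
The warning on the last line shows that rules exist in both the nft and legacy iptables backends on this node, and mixing the two backends on a single node is a known source of exactly this kind of per-node breakage. A quick check (a sketch reusing the chain names from the dump above) is to watch the packet counters on the UDP service chain while the nslookup runs, and to compare against whatever lives in the legacy tables:

# Counters on the UDP DNS service chain and its endpoint chain (should increment during the nslookup)
iptables -t nat -L KUBE-SVC-TCOU7JCQXEZGVUNU -n -v
iptables -t nat -L KUBE-SEP-57SGA34GFMXC42YK -n -v

# Rules installed through the legacy backend, which may conflict with the ones above
iptables-legacy-save | grep -E 'dns|dport 53'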

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    OVH bare metal, created with kubeadm
  • OS (e.g: cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
  • Kernel (e.g. uname -a):
    Linux 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

  • Install tools:
    Kubeadm

  • Network plugin and version (if this is a network-related bug):
    quay.io/coreos/flannel:v0.11.0-amd64

  • Others:

flannel configuration:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: kube-flannel-ds-amd64
  namespace: kube-system
  selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/kube-flannel-ds-amd64
  uid: 29c39945-b93d-4572-938b-df990474d6ee
  resourceVersion: '69462883'
  generation: 4
  creationTimestamp: '2020-01-20T17:19:31Z'
  labels:
    app: flannel
    tier: node
  annotations:
    deprecated.daemonset.template.generation: '4'
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app":"flannel","tier":"node"},"name":"kube-flannel-ds-amd64","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"app":"flannel"}},"template":{"metadata":{"labels":{"app":"flannel","tier":"node"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"beta.kubernetes.io/os","operator":"In","values":["linux"]},{"key":"beta.kubernetes.io/arch","operator":"In","values":["amd64"]}]}]}}},"containers":[{"args":["--ip-masq","--kube-subnet-mgr"],"command":["/opt/bin/flanneld"],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"quay.io/coreos/flannel:v0.11.0-amd64","name":"kube-flannel","resources":{"limits":{"cpu":"100m","memory":"50Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"securityContext":{"capabilities":{"add":["NET_ADMIN"]},"privileged":false},"volumeMounts":[{"mountPath":"/run/flannel","name":"run"},{"mountPath":"/etc/kube-flannel/","name":"flannel-cfg"}]}],"hostNetwork":true,"initContainers":[{"args":["-f","/etc/kube-flannel/cni-conf.json","/etc/cni/net.d/10-flannel.conflist"],"command":["cp"],"image":"quay.io/coreos/flannel:v0.11.0-amd64","name":"install-cni","volumeMounts":[{"mountPath":"/etc/cni/net.d","name":"cni"},{"mountPath":"/etc/kube-flannel/","name":"flannel-cfg"}]}],"serviceAccountName":"flannel","tolerations":[{"effect":"NoSchedule","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/run/flannel"},"name":"run"},{"hostPath":{"path":"/etc/cni/net.d"},"name":"cni"},{"configMap":{"name":"kube-flannel-cfg"},"name":"flannel-cfg"}]}}}}
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: flannel
        tier: node
    spec:
      volumes:
        - name: run
          hostPath:
            path: /run/flannel
            type: ''
        - name: cni
          hostPath:
            path: /etc/cni/net.d
            type: ''
        - name: flannel-cfg
          configMap:
            name: kube-flannel-cfg
            defaultMode: 420
      initContainers:
        - name: install-cni
          image: 'quay.io/coreos/flannel:v0.11.0-amd64'
          command:
            - cp
          args:
            - '-f'
            - /etc/kube-flannel/cni-conf.json
            - /etc/cni/net.d/10-flannel.conflist
          resources: {}
          volumeMounts:
            - name: cni
              mountPath: /etc/cni/net.d
            - name: flannel-cfg
              mountPath: /etc/kube-flannel/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      containers:
        - name: kube-flannel
          image: 'quay.io/coreos/flannel:v0.11.0-amd64'
          command:
            - /opt/bin/flanneld
          args:
            - '--ip-masq'
            - '--kube-subnet-mgr'
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 100m
              memory: 50Mi
            requests:
              cpu: 100m
              memory: 50Mi
          volumeMounts:
            - name: run
              mountPath: /run/flannel
            - name: flannel-cfg
              mountPath: /etc/kube-flannel/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: flannel
      serviceAccount: flannel
      hostNetwork: true
      securityContext: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: beta.kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
                  - key: flannel
                    operator: In
                    values:
                      - baremetal
      schedulerName: default-scheduler
      tolerations:
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 3
  numberMisscheduled: 0
  desiredNumberScheduled: 3
  numberReady: 3
  observedGeneration: 4
  updatedNumberScheduled: 3
  numberAvailable: 3
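
Given the DaemonSet status above, a quick sanity check (a sketch; it assumes the standard flannel layout, where each node's flannel pod writes a subnet lease to /run/flannel/subnet.env and creates a flannel.1 VXLAN device) is to confirm a flannel pod is actually running on every node, including the newly added one:

# One flannel pod per node? (label taken from the DaemonSet above)
kubectl get pods -n kube-system -o wide -l app=flannel

# On the new node: does it have a subnet lease and a VXLAN device?
cat /run/flannel/subnet.env
ip -d link show flannel.1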

kube-proxy configuration:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: kube-proxy
  namespace: kube-system
  selfLink: /apis/apps/v1/namespaces/kube-system/daemonsets/kube-proxy
  uid: 6b98e7dd-afa9-4e06-9c57-bb3cc0711cca
  resourceVersion: '69462235'
  generation: 3
  creationTimestamp: '2020-01-20T17:14:17Z'
  labels:
    k8s-app: kube-proxy
  annotations:
    deprecated.daemonset.template.generation: '3'
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-proxy
    spec:
      volumes:
        - name: kube-proxy
          configMap:
            name: kube-proxy
            defaultMode: 420
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: FileOrCreate
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
      containers:
        - name: kube-proxy
          image: 'k8s.gcr.io/kube-proxy:v1.18.3'
          command:
            - /usr/local/bin/kube-proxy
            - '--config=/var/lib/kube-proxy/config.conf'
            - '--hostname-override=$(NODE_NAME)'
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          resources: {}
          volumeMounts:
            - name: kube-proxy
              mountPath: /var/lib/kube-proxy
            - name: xtables-lock
              mountPath: /run/xtables.lock
            - name: lib-modules
              readOnly: true
              mountPath: /lib/modules
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubeproxy: iptables
      serviceAccountName: kube-proxy
      serviceAccount: kube-proxy
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 4
  numberMisscheduled: 0
  desiredNumberScheduled: 4
  numberReady: 4
  observedGeneration: 3
  updatedNumberScheduled: 4
  numberAvailable: 4

kube-flannel config:

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  selfLink: /api/v1/namespaces/kube-system/configmaps/kube-flannel-cfg
  uid: 55928a38-48ff-4009-807c-bd69c841d2b3
  resourceVersion: '1035'
  creationTimestamp: '2020-01-20T17:19:31Z'
  labels:
    app: flannel
    tier: node
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"v1","data":{"cni-conf.json":"{\n  \"cniVersion\":
      \"0.2.0\",\n  \"name\": \"cbr0\",\n  \"plugins\": [\n    {\n     
      \"type\": \"flannel\",\n      \"delegate\": {\n        \"hairpinMode\":
      true,\n        \"isDefaultGateway\": true\n      }\n    },\n    {\n     
      \"type\": \"portmap\",\n      \"capabilities\": {\n       
      \"portMappings\": true\n      }\n    }\n  ]\n}\n","net-conf.json":"{\n 
      \"Network\": \"10.244.0.0/16\",\n  \"Backend\": {\n    \"Type\":
      \"vxlan\"\n 
      }\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"app":"flannel","tier":"node"},"name":"kube-flannel-cfg","namespace":"kube-system"}}
data:
  cni-conf.json: |
    {
      "cniVersion": "0.2.0",
      "name": "cbr0",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }

@velkhatib velkhatib added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 7, 2020
@velkhatib
Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 7, 2020
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Aug 7, 2020
@jayunit100
Member

/assign

@jayunit100
Member

To clarify, just a few starter questions ...

  • DNS works on all nodes but one?
  • Can you show us the routing rules to CoreDNS?
  • What happens if you scale the coredns deployment up or down (kubectl scale deployment coredns -n kube-system --replicas=<N>)?
  • Out of curiosity, what is your node IP range?
  • Can you run the Kubernetes e2e test suite and tell us which tests fail/pass? A good start is sonobuoy run --e2e-focus='intra-pod' (a fuller invocation sketch is below).
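
A minimal invocation plus result retrieval might look like this (a sketch; it assumes sonobuoy is installed locally and the kubeconfig points at this cluster):

sonobuoy run --e2e-focus='intra-pod' --wait
sonobuoy status
results=$(sonobuoy retrieve)   # prints the path of the downloaded results tarball
sonobuoy results $results      # summarizes passed/failed tests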

@jayunit100
Member

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label Aug 9, 2020
@velkhatib
Author

Hi,

Thanks for your response, jayunit.

  • Yes, DNS is working on all nodes and pods.
  • I don't think the problem comes from CoreDNS; it's working well on all other nodes and pods. (My config is below.)
  • I can scale the CoreDNS deployment up and down and everything works fine.
  • My node IP range is 10.0.0.0/16.

The sonobuoy results:

Plugin: e2e
Status: failed
Total: 4992
Passed: 269
Failed: 6
Skipped: 4717

Failed tests:
[sig-scheduling] SchedulerPredicates [Serial] validates that NodeSelector is respected if not matching  [Conformance]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: udp [NodeConformance] [Conformance]
[sig-network] Networking Granular Checks: Pods should function for node-pod communication: http [LinuxOnly] [NodeConformance] [Conformance]
[sig-network] Networking Granular Checks: Pods should function for intra-pod communication: http [NodeConformance] [Conformance]
[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run  [Conformance]
[sig-network] Networking Granular Checks: Pods should function for node-pod communication: udp [LinuxOnly] [NodeConformance] [Conformance]

coredns config:

data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
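
Two cross-checks that tie this back to the iptables dump above (a sketch; it assumes the kubeadm defaults, where the CoreDNS pods carry the k8s-app=kube-dns label and back the kube-dns service) are to confirm the service endpoints match the pod IP in the KUBE-SEP rules, and to look at the CoreDNS logs for errors while querying from the 4th node:

# Endpoints should list 10.244.1.83:53 for both UDP and TCP
kubectl -n kube-system get endpoints kube-dns -o wide

# CoreDNS logs (label assumed from the kubeadm defaults)
kubectl -n kube-system logs -l k8s-app=kube-dns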

@jayunit100
Member

jayunit100 commented Aug 16, 2020

  1. Can you paste the logs from the e2es somewhere? I specifically want to see the results of the Networking Granular Checks: Pods should function for intra-pod communication: http tests, and see how many times it had to retry all the nodes. I have a feeling it may be more than just UDP that is down, but I'm not sure.

  2. Can you paste the output of ip a from the node that is broken, compared with ip a from a healthy node? I want to make sure CNI is working properly as well.

  3. Is 10.244.1.83 (or any pod that runs on the broken node) pingable/reachable from pods on the healthy nodes? (See the sketch below.)
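
A rough way to run both checks (a sketch; interface names assume the default flannel setup with flannel.1 and cni0, and the last ping uses a placeholder for whichever pod happens to be scheduled on the broken node):

# Run on the broken node and on a healthy node, then compare
ip addr show flannel.1
ip addr show cni0
ip route | grep 10.244

# From the broken node: reach the CoreDNS pod directly (this already works per the report)
ping -c 3 10.244.1.83

# From a healthy node: reach a pod scheduled on the broken node (placeholder IP)
ping -c 3 <pod-ip-on-broken-node>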

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2020
@jayunit100
Member

I think we can close this. I suspect this issue went away; it was maybe just an infra issue. Also, we now have a way to get breadth-first data from the intra-pod tests, so it should be obvious which nodes are down if someone needs to test this again later.

@jayunit100
Member

/close

@k8s-ci-robot
Contributor

@jayunit100: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
