
docs needed: kube-proxy doesn't support live cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) #102314

Closed
harshanarayana opened this issue May 26, 2021 · 9 comments · Fixed by kubernetes/website#28147
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@harshanarayana
Contributor

harshanarayana commented May 26, 2021

What happened:

As part of our Kubernetes upgrade workflow, we are upgrading Kubernetes from 1.15.3 to 1.18.15 (both versions, and all versions in between, carry custom patches that change kubelet and kubeadm behavior; none of the patches touch kube-proxy or its IPVS handling). As part of this upgrade we are also migrating kube-proxy from iptables to ipvs mode.

We have Calico 3.12 running on the cluster as part of the CNI stack.

Once the kube-proxy pods are migrated from iptables to ipvs mode by applying the new kube-proxy DaemonSet manifest, a set of stale rules is left behind in the KUBE-SERVICES chain, which breaks the runtime state of the pods.
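
As a side note, one way to confirm that the new pods really came up in ipvs mode is to query kube-proxy's proxyMode endpoint on a node and list the programmed IPVS virtual servers (a quick sketch, assuming the default metrics bind address of 127.0.0.1:10249 and that ipvsadm is installed on the node):

# Ask the local kube-proxy which mode it is running in (expect "ipvs").
curl -s http://127.0.0.1:10249/proxyMode

# List the IPVS virtual servers kube-proxy programmed on this node.
ipvsadm -Ln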

Additional Info
  1. The cluster runs in HA mode (a 3-node cluster) where all nodes act as leaders and also take workload; there are no non-leader pods.
  2. We have customized the Kubernetes and Calico infrastructure to allow the use of link-local IP addresses in the 169.254.0.0/16 range for internal use cases.
Logs

Most of the IPs and the names have been mangled.

Stale IP Table Rules
root@maglev-master-10:/home/maglev# iptables -L KUBE-SERVICES | grep my-service.my-namespace.svc.cluster.local
REJECT     tcp  --  anywhere             my-service.my-namespace.svc.cluster.local  /* my-namespace/my-service:api has no endpoints */ tcp dpt:8000 reject-with icmp-port-unreachable
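
As a side note, one way to confirm and manually remove an individual stale rule is to print it in rule-spec form with iptables -S and replay the same spec with -D (a sketch only; kube-proxy owns these chains, so this is at best a stopgap, and the -D arguments must exactly match the spec that -S prints):

# Print the stale rule in rule-spec form (this KUBE-SERVICES chain is in the filter table).
iptables -S KUBE-SERVICES | grep 'my-namespace/my-service:api has no endpoints'

# Delete it by replaying the printed spec with -D instead of -A
# (the spec below is illustrative; copy the exact output of the command above).
iptables -D KUBE-SERVICES -d 169.254.63.88/32 -p tcp \
  -m comment --comment "my-namespace/my-service:api has no endpoints" \
  -m tcp --dport 8000 -j REJECT --reject-with icmp-port-unreachable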
SVC Definition
root@maglev-master-10:/home/maglev# nslookup my-service.my-namespace.svc.cluster.local
Server:		127.0.0.53
Address:	127.0.0.53#53

Name:	my-service.my-namespace.svc.cluster.local
Address: 169.254.63.88

root@maglev-master-10:/home/maglev# kubectl get svc -n my-namespace my-service -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-24T09:07:02Z"
  labels:
    service-type: collector
    serviceName: my-service
    tier: application
  name: my-service
  namespace: my-namespace
  resourceVersion: "8853"
  selfLink: /api/v1/namespaces/my-namespace/services/my-service
  uid: a8492618-c0d9-4d5a-b21b-2efe8db37e7c
spec:
  clusterIP: 169.254.63.88
  ports:
  - name: api
    port: 8000
    protocol: TCP
    targetPort: 8076
  selector:
    serviceName: my-service
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
SVC Endpoint
root@maglev-master-10:/home/maglev# kubectl get ep -n my-namespace my-service -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2021-05-26T06:19:09Z"
  creationTimestamp: "2021-05-24T09:07:02Z"
  labels:
    service-type: collector
    serviceName: my-service
    tier: application
  name: my-service
  namespace: my-namespace
  resourceVersion: "470853"
  selfLink: /api/v1/namespaces/my-namespace/endpoints/my-service
  uid: 416fbf2b-e65b-443b-9c8f-625312f4499a
subsets:
- notReadyAddresses:
  - ip: 169.254.32.24
    nodeName: 10.30.199.249
    targetRef:
      kind: Pod
      name: my-service-7cd5bdd55-4cfln
      namespace: my-namespace
      resourceVersion: "470852"
      uid: d2c099b6-6965-4664-bb62-4b94a3ad033c
  ports:
  - name: api
    port: 8076
    protocol: TCP
Pod State
root@maglev-master-10:/home/maglev# kubectl get pods -n my-namespace my-service-7cd5bdd55-4cfln -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
my-service-7cd5bdd55-4cfln       1/1     Running   0          3m29s   169.254.32.24   10.30.199.249   <none>           <none>
Kube-Proxy Manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
  creationTimestamp: "2021-05-24T11:53:33Z"
  generation: 4
  labels:
    k8s-app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "441510"
  uid: 7681838f-fc46-49de-b7fc-b8ac073934ad
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        k8s-app: kube-proxy
    spec:
      containers:
      - command:
        - /usr/local/bin/kube-proxy
        - --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf
        - --hostname-override=$(NODE_NAME)
        - --cluster-cidr=169.254.32.0/20
        - --bind-address=0.0.0.0
        - --proxy-mode=ipvs
        - --ipvs-scheduler=lc
        - --ipvs-min-sync-period=1s
        - --ipvs-sync-period=3s
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: my-registry.my-system.svc.cluster.local:5000/kube-proxy:v1.18.15-cisco
        imagePullPolicy: IfNotPresent
        name: kube-proxy
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 200m
            memory: 200Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kube-proxy
          name: kube-proxy
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-proxy
      serviceAccountName: kube-proxy
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-proxy
        name: kube-proxy
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 4
  updatedNumberScheduled: 3

What you expected to happen:

Stale iptables rules to be cleaned up once the pods and Endpoints are stable, with all endpoint addresses listed as ready.

How to reproduce it (as minimally and precisely as possible):

  1. Scale down the pods in a given namespace to 0 (leave the Service and Endpoints objects in place).
  2. Upgrade the control plane to 1.18.15, taking the intermediate hop path: 1.15.3 -> 1.16.7 -> 1.17.5 -> 1.18.15.
  3. kubectl apply the kube-proxy manifest provided above.
  4. Scale the pods from Step 1 back up (a shell sketch of these steps follows below).
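
A rough shell sketch of these steps (the Deployment name and manifest filename are placeholders; the control-plane upgrade in Step 2 is done with kubeadm and is not shown):

# 1. Scale the workload down to 0, leaving the Service and Endpoints objects in place.
kubectl -n my-namespace scale deployment my-service --replicas=0

# 2. Upgrade the control plane 1.15.3 -> 1.16.7 -> 1.17.5 -> 1.18.15 (kubeadm; not shown).

# 3. Apply the ipvs-mode kube-proxy DaemonSet manifest shown above.
kubectl -n kube-system apply -f kube-proxy-ds.yaml

# 4. Scale the workload back up.
kubectl -n my-namespace scale deployment my-service --replicas=1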

Anything else we need to know?:

The current workaround we have put in place is to run kube-proxy with --cleanup in an init container and then bring up kube-proxy in IPVS mode in the main container. That takes care of the stale entries, but it feels like too big a hammer for cleaning up.
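
For reference, a minimal sketch of that init container, reusing the image, kubeconfig mount, and volumes from the DaemonSet above (the container name is illustrative); it sits under spec.template.spec alongside the existing containers:

initContainers:
# Runs kube-proxy --cleanup once per pod start to flush stale iptables/IPVS
# state before the main kube-proxy container starts in ipvs mode.
- command:
  - /usr/local/bin/kube-proxy
  - --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf
  - --cleanup
  image: my-registry.my-system.svc.cluster.local:5000/kube-proxy:v1.18.15-cisco
  imagePullPolicy: IfNotPresent
  name: kube-proxy-cleanup   # illustrative name
  securityContext:
    privileged: true
  volumeMounts:
  - mountPath: /var/lib/kube-proxy
    name: kube-proxy
  - mountPath: /run/xtables.lock
    name: xtables-lock
  - mountPath: /lib/modules
    name: lib-modules
    readOnly: true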

We have seen this issue only for endpoints corresponding to pods that were scaled down during the Kubernetes control plane upgrade and scaled back up later.

Environment:

  • Kubernetes version (use kubectl version):
root@maglev-master-10:/home/maglev# kubectl version
Client Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.15-cisco", GitCommit:"9bd6278a0e70c390455f515d696d1b41cbef8e10", GitTreeState:"clean", BuildDate:"2021-03-16T10:17:33Z", GoVersion:"go1.14.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.15-cisco", GitCommit:"9bd6278a0e70c390455f515d696d1b41cbef8e10", GitTreeState:"clean", BuildDate:"2021-03-16T10:17:33Z", GoVersion:"go1.14.13", Compiler:"gc", Platform:"linux/amd64"}

root@maglev-master-10:/home/maglev# kubectl get nodes
NAME            STATUS   ROLES    AGE   VERSION
10.30.199.248   Ready    master   43h   v1.18.15-cisco
10.30.199.249   Ready    master   44h   v1.18.15-cisco
10.30.199.250   Ready    master   47h   v1.18.15-cisco
  • Cloud provider or hardware configuration:
    description: Computer
    product: VMware Virtual Platform
    vendor: VMware, Inc.
    version: None
    serial: VMware-42 24 b1 a9 a5 89 81 ea-48 7e 2b bc 34 57 18 f5
    width: 64 bits
    capabilities: smbios-2.7 dmi-2.7 smp vsyscall32
    configuration: administrator_password=enabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=4224B1A9-A589-81EA-487E-2BBC345718F5
  *-pnp00:00
       product: PnP device PNP0c02
       physical id: 3
       capabilities: pnp
       configuration: driver=system
  *-pnp00:01
       product: PnP device PNP0b00
       physical id: 4
       capabilities: pnp
       configuration: driver=rtc_cmos
  *-pnp00:04
       product: PnP device PNP0103
       physical id: 45
       capabilities: pnp
       configuration: driver=system
  *-pnp00:05
       product: PnP device PNP0c02
       physical id: 46
       capabilities: pnp
       configuration: driver=system
  *-remoteaccess UNCLAIMED
       vendor: Intel
       physical id: 1
       capabilities: inbound
# cat /proc/cpuinfo  | grep "processor" | wc -l
64

# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        124G         76G        312M         50G        124G
Swap:            0B          0B          0B
  • OS (e.g: cat /etc/os-release):
# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
# uname -a
Linux maglev-master-10.30.199.249 5.4.0-73-generic #82~18.04.1 SMP Tue Apr 20 06:24:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • kubeadm
  • Network plugin and version (if this is a network-related bug):
# calicoctl version
Client Version:    v3.12.0.cisco
Git commit:        dca8136d
Cluster Version:   v3.12.0.cisco
Cluster Type:      k8s,bgp
  • Others:
@harshanarayana harshanarayana added the kind/bug Categorizes issue or PR as related to a bug. label May 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 26, 2021
@k8s-ci-robot
Contributor

@harshanarayana: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 26, 2021
@k8s-ci-robot
Contributor

@harshanarayana: The label(s) /label area/kube-proxy cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda

In response to this:

/label area/kube-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@harshanarayana: The label(s) sig/sig/network cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig/network

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@harshanarayana
Contributor Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 26, 2021
@aojea
Member

aojea commented May 26, 2021

The current workaround we have put in place is to run kube-proxy with --cleanup in an init container and then bring up kube-proxy in IPVS mode in the main container. That takes care of the stale entries, but it feels like too big a hammer for cleaning up.

I don't think live migration between kube-proxy implementations (iptables <-> ipvs) is supported; at least I've never heard of it, and it seems tricky.
@thockin, you used to have all the historical context; what do you think?

@thockin
Member

thockin commented May 27, 2021

We tried implementing a cleanup mode, but it was problematic and was removed (see #76109).

You can run kube-proxy --cleanup, but that's pretty coarse. Maybe we should add a value to that flag to say which mode to try to clean up? PRs welcome.

The simplest answer is to reboot.

@thockin thockin self-assigned this May 27, 2021
@thockin thockin closed this as completed May 27, 2021
@thockin thockin reopened this May 27, 2021
@thockin
Member

thockin commented May 27, 2021

Let's make this a docs issue - we should write this down.

@thockin thockin changed the title kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable docs needed: kube-proxy doesn't support cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) May 27, 2021
@thockin thockin added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 27, 2021
@thockin thockin changed the title docs needed: kube-proxy doesn't support cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) docs needed: kube-proxy doesn't support live cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) May 27, 2021
@jayunit100
Member

Can we merge the above so that we can close this issue?

@khenidak
Contributor

@jayunit100 not just yet :-)

Wouldn't registering for signals and calling cleanup at least clean things up on graceful shutdown? Granted, the rules would not be there until kube-proxy is restarted.
