
docs needed: kube-proxy doesn't support live cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) #102314

Closed
harshanarayana opened this issue May 26, 2021 · 9 comments · Fixed by kubernetes/website#28147
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@harshanarayana
Contributor

harshanarayana commented May 26, 2021

What happened:

As part of our Kubernetes upgrade workflow, we are upgrading Kubernetes from 1.15.3 to 1.18.15 (both versions, and all versions in between, carry custom patches that change kubelet and kubeadm behavior; none of the patches touch kube-proxy or its IPVS handling). As part of this upgrade we are also migrating kube-proxy from iptables to ipvs mode.

We have Calico 3.12 running on the cluster as part of the CNI stack.

Once the kube-proxy pods are migrated from iptables to ipvs mode by applying the new kube-proxy DaemonSet manifest, a set of stale rules is left behind in the KUBE-SERVICES chain, which breaks the runtime state of the pods.
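
As a side note, one way to confirm that the new pods really came up in ipvs mode is to query kube-proxy's proxyMode endpoint on a node and list the programmed IPVS virtual servers (a quick sketch, assuming the default metrics bind address of 127.0.0.1:10249 and that ipvsadm is installed on the node):

# Ask the local kube-proxy which mode it is running in (expect "ipvs").
curl -s http://127.0.0.1:10249/proxyMode

# List the IPVS virtual servers kube-proxy programmed on this node.
ipvsadm -Ln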

Additional Info
  1. The cluster runs in HA mode (a 3-node cluster) where all nodes act as leaders and also take workload; there are no non-leader pods.
  2. We have customized the Kubernetes and Calico infrastructure to allow the use of link-local IP addresses in the 169.254.0.0/16 range for internal use cases.
Logs

Most of the IPs and the names have been mangled.

Stale IP Table Rules
root@maglev-master-10:/home/maglev# iptables -L KUBE-SERVICES | grep my-service.my-namespace.svc.cluster.local
REJECT     tcp  --  anywhere             my-service.my-namespace.svc.cluster.local  /* my-namespace/my-service:api has no endpoints */ tcp dpt:8000 reject-with icmp-port-unreachable
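
As a side note, one way to confirm and manually remove an individual stale rule is to print it in rule-spec form with iptables -S and replay the same spec with -D (a sketch only; kube-proxy owns these chains, so this is at best a stopgap, and the -D arguments must exactly match the spec that -S prints):

# Print the stale rule in rule-spec form (this KUBE-SERVICES chain is in the filter table).
iptables -S KUBE-SERVICES | grep 'my-namespace/my-service:api has no endpoints'

# Delete it by replaying the printed spec with -D instead of -A
# (the spec below is illustrative; copy the exact output of the command above).
iptables -D KUBE-SERVICES -d 169.254.63.88/32 -p tcp \
  -m comment --comment "my-namespace/my-service:api has no endpoints" \
  -m tcp --dport 8000 -j REJECT --reject-with icmp-port-unreachable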
SVC Definition
root@maglev-master-10:/home/maglev# nslookup my-service.my-namespace.svc.cluster.local
Server:		127.0.0.53
Address:	127.0.0.53#53

Name:	my-service.my-namespace.svc.cluster.local
Address: 169.254.63.88

root@maglev-master-10:/home/maglev# kubectl get svc -n my-namespace my-service -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-24T09:07:02Z"
  labels:
    service-type: collector
    serviceName: my-service
    tier: application
  name: my-service
  namespace: my-namespace
  resourceVersion: "8853"
  selfLink: /api/v1/namespaces/my-namespace/services/my-service
  uid: a8492618-c0d9-4d5a-b21b-2efe8db37e7c
spec:
  clusterIP: 169.254.63.88
  ports:
  - name: api
    port: 8000
    protocol: TCP
    targetPort: 8076
  selector:
    serviceName: my-service
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
SVC Endpoint
root@maglev-master-10:/home/maglev# kubectl get ep -n my-namespace my-service -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2021-05-26T06:19:09Z"
  creationTimestamp: "2021-05-24T09:07:02Z"
  labels:
    service-type: collector
    serviceName: my-service
    tier: application
  name: my-service
  namespace: my-namespace
  resourceVersion: "470853"
  selfLink: /api/v1/namespaces/my-namespace/endpoints/my-service
  uid: 416fbf2b-e65b-443b-9c8f-625312f4499a
subsets:
- notReadyAddresses:
  - ip: 169.254.32.24
    nodeName: 10.30.199.249
    targetRef:
      kind: Pod
      name: my-service-7cd5bdd55-4cfln
      namespace: my-namespace
      resourceVersion: "470852"
      uid: d2c099b6-6965-4664-bb62-4b94a3ad033c
  ports:
  - name: api
    port: 8076
    protocol: TCP
Pod State
root@maglev-master-10:/home/maglev# kubectl get pods -n my-namespace my-service-7cd5bdd55-4cfln -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
my-service-7cd5bdd55-4cfln       1/1     Running   0          3m29s   169.254.32.24   10.30.199.249   <none>           <none>
Kube-Proxy Manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
  creationTimestamp: "2021-05-24T11:53:33Z"
  generation: 4
  labels:
    k8s-app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "441510"
  uid: 7681838f-fc46-49de-b7fc-b8ac073934ad
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        k8s-app: kube-proxy
    spec:
      containers:
      - command:
        - /usr/local/bin/kube-proxy
        - --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf
        - --hostname-override=$(NODE_NAME)
        - --cluster-cidr=169.254.32.0/20
        - --bind-address=0.0.0.0
        - --proxy-mode=ipvs
        - --ipvs-scheduler=lc
        - --ipvs-min-sync-period=1s
        - --ipvs-sync-period=3s
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: my-registry.my-system.svc.cluster.local:5000/kube-proxy:v1.18.15-cisco
        imagePullPolicy: IfNotPresent
        name: kube-proxy
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 200m
            memory: 200Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kube-proxy
          name: kube-proxy
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-proxy
      serviceAccountName: kube-proxy
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-proxy
        name: kube-proxy
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 4
  updatedNumberScheduled: 3

What you expected to happen:

Stale iptables rules to be cleaned up once the pods and Endpoints are stable, with all endpoint addresses listed as ready.

How to reproduce it (as minimally and precisely as possible):

  1. Scale down the pods in a given namespace to 0 (leave the Service and Endpoints objects in place).
  2. Upgrade the control plane to 1.18.15, taking the intermediate hop path: 1.15.3 -> 1.16.7 -> 1.17.5 -> 1.18.15.
  3. kubectl apply the kube-proxy manifest provided above.
  4. Scale the pods from Step 1 back up (a shell sketch of these steps follows below).
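
A rough shell sketch of these steps (the Deployment name and manifest filename are placeholders; the control-plane upgrade in Step 2 is done with kubeadm and is not shown):

# 1. Scale the workload down to 0, leaving the Service and Endpoints objects in place.
kubectl -n my-namespace scale deployment my-service --replicas=0

# 2. Upgrade the control plane 1.15.3 -> 1.16.7 -> 1.17.5 -> 1.18.15 (kubeadm; not shown).

# 3. Apply the ipvs-mode kube-proxy DaemonSet manifest shown above.
kubectl -n kube-system apply -f kube-proxy-ds.yaml

# 4. Scale the workload back up.
kubectl -n my-namespace scale deployment my-service --replicas=1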

Anything else we need to know?:

The current workaround we have put in place is to run kube-proxy with --cleanup in an init container and then bring up kube-proxy in IPVS mode in the main container. That takes care of the stale entries, but it feels like too big a hammer for cleaning up.
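
For reference, a minimal sketch of that init container, reusing the image, kubeconfig mount, and volumes from the DaemonSet above (the container name is illustrative); it sits under spec.template.spec alongside the existing containers:

initContainers:
# Runs kube-proxy --cleanup once per pod start to flush stale iptables/IPVS
# state before the main kube-proxy container starts in ipvs mode.
- command:
  - /usr/local/bin/kube-proxy
  - --kubeconfig=/var/lib/kube-proxy/kubeconfig.conf
  - --cleanup
  image: my-registry.my-system.svc.cluster.local:5000/kube-proxy:v1.18.15-cisco
  imagePullPolicy: IfNotPresent
  name: kube-proxy-cleanup   # illustrative name
  securityContext:
    privileged: true
  volumeMounts:
  - mountPath: /var/lib/kube-proxy
    name: kube-proxy
  - mountPath: /run/xtables.lock
    name: xtables-lock
  - mountPath: /lib/modules
    name: lib-modules
    readOnly: true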

We have seen this issue only for endpoints corresponding to pods that were scaled down during the Kubernetes control plane upgrade and scaled back up later.

Environment:

  • Kubernetes version (use kubectl version):
root@maglev-master-10:/home/maglev# kubectl version
Client Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.15-cisco", GitCommit:"9bd6278a0e70c390455f515d696d1b41cbef8e10", GitTreeState:"clean", BuildDate:"2021-03-16T10:17:33Z", GoVersion:"go1.14.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.15-cisco", GitCommit:"9bd6278a0e70c390455f515d696d1b41cbef8e10", GitTreeState:"clean", BuildDate:"2021-03-16T10:17:33Z", GoVersion:"go1.14.13", Compiler:"gc", Platform:"linux/amd64"}

root@maglev-master-10:/home/maglev# kubectl get nodes
NAME            STATUS   ROLES    AGE   VERSION
10.30.199.248   Ready    master   43h   v1.18.15-cisco
10.30.199.249   Ready    master   44h   v1.18.15-cisco
10.30.199.250   Ready    master   47h   v1.18.15-cisco
  • Cloud provider or hardware configuration:
    description: Computer
    product: VMware Virtual Platform
    vendor: VMware, Inc.
    version: None
    serial: VMware-42 24 b1 a9 a5 89 81 ea-48 7e 2b bc 34 57 18 f5
    width: 64 bits
    capabilities: smbios-2.7 dmi-2.7 smp vsyscall32
    configuration: administrator_password=enabled boot=normal frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=4224B1A9-A589-81EA-487E-2BBC345718F5
  *-pnp00:00
       product: PnP device PNP0c02
       physical id: 3
       capabilities: pnp
       configuration: driver=system
  *-pnp00:01
       product: PnP device PNP0b00
       physical id: 4
       capabilities: pnp
       configuration: driver=rtc_cmos
  *-pnp00:04
       product: PnP device PNP0103
       physical id: 45
       capabilities: pnp
       configuration: driver=system
  *-pnp00:05
       product: PnP device PNP0c02
       physical id: 46
       capabilities: pnp
       configuration: driver=system
  *-remoteaccess UNCLAIMED
       vendor: Intel
       physical id: 1
       capabilities: inbound
# cat /proc/cpuinfo  | grep "processor" | wc -l
64

# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        124G         76G        312M         50G        124G
Swap:            0B          0B          0B
  • OS (e.g: cat /etc/os-release):
# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
# uname -a
Linux maglev-master-10.30.199.249 5.4.0-73-generic #82~18.04.1 SMP Tue Apr 20 06:24:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • kubeadm
  • Network plugin and version (if this is a network-related bug):
# calicoctl version
Client Version:    v3.12.0.cisco
Git commit:        dca8136d
Cluster Version:   v3.12.0.cisco
Cluster Type:      k8s,bgp
  • Others:
@harshanarayana harshanarayana added the kind/bug Categorizes issue or PR as related to a bug. label May 26, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 26, 2021
@k8s-ci-robot
Contributor

@harshanarayana: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 26, 2021
@k8s-ci-robot
Contributor

@harshanarayana: The label(s) /label area/kube-proxy cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda

In response to this:

/label area/kube-proxy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@harshanarayana: The label(s) sig/sig/network cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig/network

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@harshanarayana
Contributor Author

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 26, 2021
@aojea
Member

aojea commented May 26, 2021

The current workaround we have put in place is to run kube-proxy with --cleanup in an init container and then bring up kube-proxy in IPVS mode in the main container. That takes care of the stale entries, but it feels like too big a hammer for cleaning up.

I don't think live migration between kube-proxy implementations (iptables <-> ipvs) is supported; at least I've never heard of it, and it seems tricky.
@thockin, you used to have all the historical context; what do you think?

@thockin
Member

thockin commented May 27, 2021

We tried implementing a cleanup mode, but it was problematic and was removed (see #76109).

You can run kube-proxy --cleanup, but that's pretty coarse. Maybe we should add a value to that flag to say which mode to try to clean up? PRs welcome.

The simplest answer is to reboot.

@thockin thockin self-assigned this May 27, 2021
@thockin thockin closed this as completed May 27, 2021
@thockin thockin reopened this May 27, 2021
@thockin
Member

thockin commented May 27, 2021

Let's make this a docs issue - we should write this down.

@thockin thockin changed the title kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable docs needed: kube-proxy doesn't support cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) May 27, 2021
@thockin thockin added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 27, 2021
@thockin thockin changed the title docs needed: kube-proxy doesn't support cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) docs needed: kube-proxy doesn't support live cleanup (kube-proxy leaves behind stale IP Table rules in KUBE-SERVICES for icmp-port-unreachable) May 27, 2021
@jayunit100
Member

Can we merge the above so that we can close this issue?

@khenidak
Contributor

@jayunit100 not just yet :-)

Wouldn't registering for signals and calling cleanup at least clean things up on graceful shutdown? Granted, the rules would not be there until kube-proxy is restarted.
