
/opt/kubernetes/helpers/docker-healthcheck low timeout can cause DEADLOCK of the node #5434

Closed
tatobi opened this issue Jul 12, 2018 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


tatobi commented Jul 12, 2018

High I/O or swap usage by memory-intensive applications deployed to an AWS instance can make the instance unresponsive, and can deadlock the node if the healthcheck restarts dockerd and all containers restart at the same time. A typical case is multiple JVM / Java-based applications running on the same node.

We faced this issue several times.

Using a swap file helps absorb spikes in JVM memory usage and makes it practical to run many low-CPU but memory-intensive applications (e.g. Spring Boot) on one node.

I made this reproducible and simulate it below using the stress command.


The dockerd restart causing many problems:

  • restarts kube-proxy, BUT kube-proxy cannot immediately re-attach to its services because the kernel TCP bind fails (sockets already in use):
E0712 13:02:55.499053       1 proxier.go:1379] can't open "nodePort for default/bm-user-management:" (:31223/tcp), skipping this nodePort: listen tcp :31223: bind: address already in use
E0712 13:02:55.499079       1 proxier.go:1379] can't open "nodePort for default/bm-retail-account:" (:31755/tcp), skipping this nodePort: listen tcp :31755: bind: address already in use
E0712 13:02:55.499187       1 proxier.go:1379] can't open "nodePort for default/bm-deposit:" (:30870/tcp), skipping this nodePort: listen tcp :30870: bind: address already in use
E0712 13:02:55.499241       1 proxier.go:1379] can't open "nodePort for default/bm-emarsys:" (:31155/tcp), skipping this nodePort: listen tcp :31155: bind: address already in use
E0712 13:02:55.499325       1 proxier.go:1379] can't open "nodePort for default/euribor-integration:" (:30336/tcp), skipping this nodePort: listen tcp :30336: bind: address already in use
E0712 13:02:55.499368       1 proxier.go:1379] can't open "nodePort for default/bm-corporate-loan:" (:31779/tcp), skipping this nodePort: listen tcp :31779: bind: address already in use
E0712 13:02:55.499396       1 proxier.go:1379] can't open "nodePort for default/bm-user-profiling:" (:31163/tcp), skipping this nodePort: listen tcp :31163: bind: address already in use
...

Results:

So the K8s services remain inaccessible from/on the node.

Many pods have no external access to network because of calico pod restart.

The pod/service restarts cause high I/O load, which makes the healthcheck fail again, which restarts dockerd again: the deadlock loop described above.

Only the node restart helps.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

ubuntu@kubernetes-test-bastion:~$ kops version
Version 1.9.1 (git-ba77c9ca2)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:14:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Prerequisites:
A running cluster where you can access the node.

Enabled swap in the kops deployment according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/:
Instancegroup configuration in kops YAML:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: ***********************************
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 20
  minSize: 10
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 50
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

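To sanity-check that the swap userdata actually took effect, something like the following can be run on a node. This is a sketch, not part of the original report; it assumes the /swapfile and sysctl values from the script above.

```shell
# Verify the swapfile and sysctl values set by the userdata script.
swapon --show            # should list /swapfile (~10G) if the script ran
free -h | grep -i swap   # swap total should be non-zero
sysctl vm.swappiness     # expected (per the script): vm.swappiness = 10
```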
I have a running cluster, deploy a simple pod e.g. a simple webserver.

ubuntu@kubernetes-test-bastion:~$ kops validate cluster
Using cluster from kubectl context: ****************************************-

Validating cluster ************************************-

INSTANCE GROUPS
NAME                    ROLE    MACHINETYPE     MIN     MAX     SUBNETS
master-eu-west-1b       Master  t2.small        1       1       eu-west-1b
nodes                   Node    t2.small        10      20      eu-west-1a,eu-west-1b,eu-west-1c

NODE STATUS
NAME                                            ROLE    READY
ip-10-201-32-209.eu-west-1.compute.internal     node    True
ip-10-201-35-203.eu-west-1.compute.internal     node    True
ip-10-201-37-245.eu-west-1.compute.internal     master  True
ip-10-201-44-110.eu-west-1.compute.internal     node    True
ip-10-201-48-43.eu-west-1.compute.internal      node    True
ip-10-201-49-106.eu-west-1.compute.internal     node    True
ip-10-201-62-72.eu-west-1.compute.internal      node    True
ip-10-201-67-230.eu-west-1.compute.internal     node    True
ip-10-201-69-39.eu-west-1.compute.internal      node    True
ip-10-201-74-8.eu-west-1.compute.internal       node    True
ip-10-201-79-13.eu-west-1.compute.internal      node    True

Your cluster ************************************ is ready

Attach to a pod running on the node.
Install the stress command into the pod:

apt-get update && apt-get install stress

Force the node out to swap with the stress command:

root@bed0b1a5f0ec:/# stress --vm-bytes $(awk '/MemTotal/{printf "%d\n", $2 * 1.0;}' < /proc/meminfo)k --vm-keep -m 1
stress: info: [271] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
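For reference, the awk expression in the stress invocation just reads the node's total memory (MemTotal, in kB) from /proc/meminfo, so stress is asked to allocate all of RAM and the node is pushed into swap. As a standalone check:

```shell
# Total memory in kB -- the value passed to stress via --vm-bytes <n>k.
awk '/MemTotal/{printf "%d\n", $2}' /proc/meminfo
```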
  5. What happened after the commands executed?

On the node, /opt/kubernetes/helpers/docker-healthcheck fails because of its 10-second timeout, even though the node and dockerd itself are healthy. The kubelet's default health timeout is 2 minutes; a 10-second healthcheck timeout is overkill.

  6. What did you expect to happen?

The healthcheck timeout should be increased from 10 seconds to 1 minute.
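A minimal sketch of what a more tolerant helper could look like. This is hypothetical: the real /opt/kubernetes/helpers/docker-healthcheck differs, and the `docker ps` probe and systemd unit name here are assumptions, not the actual kops script.

```shell
#!/bin/bash
# Sketch: probe dockerd with a 60s ceiling instead of 10s, so transient
# I/O or swap pressure does not immediately trigger a dockerd restart.
TIMEOUT=60

if timeout "${TIMEOUT}" docker ps > /dev/null 2>&1; then
  exit 0    # dockerd answered within the window: healthy
fi

# Only restart when dockerd really did not answer for a full minute.
echo "dockerd unresponsive after ${TIMEOUT}s, restarting" >&2
systemctl restart docker
```

The key point is only the larger timeout value; GNU `timeout` exits with status 124 when the probed command overruns the limit.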

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -o yaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-07-08T14:49:17Z
  name: ****************************************
spec:
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Name: ***************************************
  cloudProvider: aws
  configBase: s3://********************************
  dnsZone: ***********************************-
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1b
      name: b
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-1b
      name: b
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    runtimeConfig:
      autoscaling/v2beta1: "true"
  kubelet:
    failSwapOn: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.9.9
  masterPublicName: ************************************
  networkCIDR: 10.201.0.0/16
  networkID: vpc-ca4d27ac
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.201.64.0/20
    id: subnet-86c135dc
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.201.32.0/20
    id: subnet-bfb80dd9
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.201.48.0/20
    id: subnet-40992308
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.201.16.0/20
    id: subnet-12b80d74
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.201.80.0/20
    id: subnet-799e2431
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  - cidr: 10.201.0.0/20
    id: subnet-e1d92dbb
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: *********************************************
  name: master-eu-west-1b
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1b
  role: Master
  rootVolumeSize: 50
  subnets:
  - eu-west-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: ************************************************
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 20
  minSize: 10
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 50
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else we need to know?

------------- FEATURE REQUEST TEMPLATE --------------------

  1. Describe IN DETAIL the feature/behavior/change you would like to see.

Increase healthcheck timeout to 1m

  2. Feel free to provide a design supporting your feature request.

tatobi commented Jul 16, 2018

Workaround (switch off the dockerd healthcheck): apply to kops instance groups to avoid the "timeout node deadlocks"; it works quite well on small AWS instances:
...

  additionalUserData:
  - content: |
      #!/bin/bash -xe
      echo "* * * * * root echo '#!/bin/bash\nexit 0' > /opt/kubernetes/helpers/docker-healthcheck" >> /etc/crontab
    name: docker-healthcheck.sh
    type: text/x-shellscript

...

tatobi added a commit to tatobi/kops that referenced this issue Aug 16, 2018
In case of increased I/O load, the 10-second timeout is not enough on small or heavily loaded systems, so I propose 60 seconds. The kubelet timeout is 2m (120s) by default to detect health problems. Second, a docker restart can heavily load the host OS, even on large systems, because many pods initialize at the same time; a continuous dockerd restart loop, a deadlock of the node, is observed. Third, because of the forcibly closed sockets and the kernel TCP TIME_WAIT state, the TCP sockets are not immediately reusable after a restart; waiting for tcp_fin_timeout is necessary before starting services.
Workaround kubernetes#1 for: kubernetes#5434
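The TIME_WAIT point above can be inspected directly on a node. A sketch, not from the original thread; port 31223 is just one of the nodePorts from the kube-proxy errors earlier, used here as an example.

```shell
# How long the kernel keeps orphaned FIN_WAIT sockets before reclaiming
# them (seconds) -- the wait the commit message refers to:
cat /proc/sys/net/ipv4/tcp_fin_timeout

# Sockets currently stuck in TIME_WAIT for an example nodePort:
ss -tan state time-wait '( sport = :31223 )'
```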
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 13, 2018
@rifelpet
Member

This has been fixed and will be released in Kops 1.11.0

/close

@k8s-ci-robot
Contributor

@rifelpet: Closing this issue.

In response to this:

This has been fixed and will be released in Kops 1.11.0

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
