
/opt/kubernetes/helpers/docker-healthcheck low timeout can cause DEADLOCK of the node #5434

Closed
tatobi opened this issue Jul 12, 2018 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


tatobi commented Jul 12, 2018

High I/O or swap usage by memory-intensive applications deployed to an AWS instance can make the instance unresponsive, and can deadlock the node if the healthcheck restarts dockerd and all containers restart at the same time. A typical case is multiple JVM / Java-based applications running on the same node.

We faced this issue several times.

Using a swap file helps absorb spikes in JVM memory usage and makes it practical to run many low-CPU but memory-intensive applications (e.g. Spring Boot) on one node.

I made this reproducible and simulate it below using the stress command.


The dockerd restart causing many problems:

  • restarts kube-proxy, BUT kube-proxy cannot immediately re-attach to its services because the kernel TCP bind fails (sockets already in use):
E0712 13:02:55.499053       1 proxier.go:1379] can't open "nodePort for default/bm-user-management:" (:31223/tcp), skipping this nodePort: listen tcp :31223: bind: address already in use
E0712 13:02:55.499079       1 proxier.go:1379] can't open "nodePort for default/bm-retail-account:" (:31755/tcp), skipping this nodePort: listen tcp :31755: bind: address already in use
E0712 13:02:55.499187       1 proxier.go:1379] can't open "nodePort for default/bm-deposit:" (:30870/tcp), skipping this nodePort: listen tcp :30870: bind: address already in use
E0712 13:02:55.499241       1 proxier.go:1379] can't open "nodePort for default/bm-emarsys:" (:31155/tcp), skipping this nodePort: listen tcp :31155: bind: address already in use
E0712 13:02:55.499325       1 proxier.go:1379] can't open "nodePort for default/euribor-integration:" (:30336/tcp), skipping this nodePort: listen tcp :30336: bind: address already in use
E0712 13:02:55.499368       1 proxier.go:1379] can't open "nodePort for default/bm-corporate-loan:" (:31779/tcp), skipping this nodePort: listen tcp :31779: bind: address already in use
E0712 13:02:55.499396       1 proxier.go:1379] can't open "nodePort for default/bm-user-profiling:" (:31163/tcp), skipping this nodePort: listen tcp :31163: bind: address already in use
...

Results:

So the K8s services remain inaccessible from/on the node.

Many pods have no external access to network because of calico pod restart.

The pod/service restarts cause high I/O load, which makes the healthcheck fail again, which restarts dockerd again: the deadlock loop described above.

Only the node restart helps.

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.

ubuntu@kubernetes-test-bastion:~$ kops version
Version 1.9.1 (git-ba77c9ca2)

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:14:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.9", GitCommit:"57729ea3d9a1b75f3fc7bbbadc597ba707d47c8a", GitTreeState:"clean", BuildDate:"2018-06-29T01:07:01Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?

AWS

  4. What commands did you run? What is the simplest way to reproduce this issue?

Prerequisites:
A running cluster where you can access the node.

Enabled swap in the kops deployment according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/:
Instancegroup configuration in kops YAML:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: ***********************************
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 20
  minSize: 10
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 50
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

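To sanity-check that the swap userdata actually took effect, something like the following can be run on a node. This is a sketch, not part of the original report; it assumes the /swapfile and sysctl values from the script above.

```shell
# Verify the swapfile and sysctl values set by the userdata script.
swapon --show            # should list /swapfile (~10G) if the script ran
free -h | grep -i swap   # swap total should be non-zero
sysctl vm.swappiness     # expected (per the script): vm.swappiness = 10
```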
I have a running cluster, deploy a simple pod e.g. a simple webserver.

ubuntu@kubernetes-test-bastion:~$ kops validate cluster
Using cluster from kubectl context: ****************************************-

Validating cluster ************************************-

INSTANCE GROUPS
NAME                    ROLE    MACHINETYPE     MIN     MAX     SUBNETS
master-eu-west-1b       Master  t2.small        1       1       eu-west-1b
nodes                   Node    t2.small        10      20      eu-west-1a,eu-west-1b,eu-west-1c

NODE STATUS
NAME                                            ROLE    READY
ip-10-201-32-209.eu-west-1.compute.internal     node    True
ip-10-201-35-203.eu-west-1.compute.internal     node    True
ip-10-201-37-245.eu-west-1.compute.internal     master  True
ip-10-201-44-110.eu-west-1.compute.internal     node    True
ip-10-201-48-43.eu-west-1.compute.internal      node    True
ip-10-201-49-106.eu-west-1.compute.internal     node    True
ip-10-201-62-72.eu-west-1.compute.internal      node    True
ip-10-201-67-230.eu-west-1.compute.internal     node    True
ip-10-201-69-39.eu-west-1.compute.internal      node    True
ip-10-201-74-8.eu-west-1.compute.internal       node    True
ip-10-201-79-13.eu-west-1.compute.internal      node    True

Your cluster ************************************ is ready

Attach to a pod running on the node.
Install the stress command into the pod:

apt-get update && apt-get install stress

Force the node out to swap with the stress command:

root@bed0b1a5f0ec:/# stress --vm-bytes $(awk '/MemTotal/{printf "%d\n", $2 * 1.0;}' < /proc/meminfo)k --vm-keep -m 1
stress: info: [271] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
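For reference, the awk expression in the stress invocation just reads the node's total memory (MemTotal, in kB) from /proc/meminfo, so stress is asked to allocate all of RAM and the node is pushed into swap. As a standalone check:

```shell
# Total memory in kB -- the value passed to stress via --vm-bytes <n>k.
awk '/MemTotal/{printf "%d\n", $2}' /proc/meminfo
```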
  5. What happened after the commands executed?

On the node, /opt/kubernetes/helpers/docker-healthcheck fails because of its 10-second timeout, even though the node and dockerd itself are healthy. The kubelet's default health timeout is 2 minutes; a 10-second healthcheck timeout is overkill.

  6. What did you expect to happen?

The healthcheck timeout should be increased from 10 seconds to 1 minute.
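A minimal sketch of what a more tolerant helper could look like. This is hypothetical: the real /opt/kubernetes/helpers/docker-healthcheck differs, and the `docker ps` probe and systemd unit name here are assumptions, not the actual kops script.

```shell
#!/bin/bash
# Sketch: probe dockerd with a 60s ceiling instead of 10s, so transient
# I/O or swap pressure does not immediately trigger a dockerd restart.
TIMEOUT=60

if timeout "${TIMEOUT}" docker ps > /dev/null 2>&1; then
  exit 0    # dockerd answered within the window: healthy
fi

# Only restart when dockerd really did not answer for a full minute.
echo "dockerd unresponsive after ${TIMEOUT}s, restarting" >&2
systemctl restart docker
```

The key point is only the larger timeout value; GNU `timeout` exits with status 124 when the probed command overruns the limit.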

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -o yaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-07-08T14:49:17Z
  name: ****************************************
spec:
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    Name: ***************************************
  cloudProvider: aws
  configBase: s3://********************************
  dnsZone: ***********************************-
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1b
      name: b
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-1b
      name: b
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    runtimeConfig:
      autoscaling/v2beta1: "true"
  kubelet:
    failSwapOn: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.9.9
  masterPublicName: ************************************
  networkCIDR: 10.201.0.0/16
  networkID: vpc-ca4d27ac
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.201.64.0/20
    id: subnet-86c135dc
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.201.32.0/20
    id: subnet-bfb80dd9
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.201.48.0/20
    id: subnet-40992308
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.201.16.0/20
    id: subnet-12b80d74
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.201.80.0/20
    id: subnet-799e2431
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  - cidr: 10.201.0.0/20
    id: subnet-e1d92dbb
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: *********************************************
  name: master-eu-west-1b
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1b
  role: Master
  rootVolumeSize: 50
  subnets:
  - eu-west-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-07-08T14:49:18Z
  labels:
    kops.k8s.io/cluster: ************************************************
  name: nodes
spec:
  additionalSecurityGroups:
  - sg-09d85674
  additionalUserData:
  - content: |
      #!/bin/bash -xe
      fallocate -l 10G /swapfile
      chmod 600 /swapfile
      mkswap /swapfile
      swapon /swapfile
      sysctl vm.swappiness=10
      sysctl vm.vfs_cache_pressure=50
    name: swap.sh
    type: text/x-shellscript
  - content: |
      #!/bin/bash -xe
      apt-get update
      apt-get -y install nfs-common
    name: nfs.sh
    type: text/x-shellscript
  associatePublicIp: false
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  kubelet:
    failSwapOn: false
  machineType: t2.small
  maxSize: 20
  minSize: 10
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 50
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else we need to know?

------------- FEATURE REQUEST TEMPLATE --------------------

  1. Describe IN DETAIL the feature/behavior/change you would like to see.

Increase healthcheck timeout to 1m

  2. Feel free to provide a design supporting your feature request.

tatobi commented Jul 16, 2018

Workaround (switch off the dockerd healthcheck): apply to kops instance groups to avoid the "timeout node deadlocks"; it works quite well on small AWS instances:
...

  additionalUserData:
  - content: |
      #!/bin/bash -xe
      echo "* * * * * root echo '#!/bin/bash\nexit 0' > /opt/kubernetes/helpers/docker-healthcheck" >> /etc/crontab
    name: docker-healthcheck.sh
    type: text/x-shellscript

...

tatobi added a commit to tatobi/kops that referenced this issue Aug 16, 2018
In case of increased I/O load, the 10-second timeout is not enough on small or heavily loaded systems, so I propose 60 seconds. The kubelet timeout is 2m (120s) by default to detect health problems. Second, a docker restart can heavily load the host OS, even on large systems, because many pods initialize at the same time; a continuous dockerd restart loop, a deadlock of the node, is observed. Third, because of the forcibly closed sockets and the kernel TCP TIME_WAIT state, the TCP sockets are not immediately reusable after a restart; waiting for tcp_fin_timeout is necessary before starting services.
Workaround kubernetes#1 for: kubernetes#5434
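The TIME_WAIT point above can be inspected directly on a node. A sketch, not from the original thread; port 31223 is just one of the nodePorts from the kube-proxy errors earlier, used here as an example.

```shell
# How long the kernel keeps orphaned FIN_WAIT sockets before reclaiming
# them (seconds) -- the wait the commit message refers to:
cat /proc/sys/net/ipv4/tcp_fin_timeout

# Sockets currently stuck in TIME_WAIT for an example nodePort:
ss -tan state time-wait '( sport = :31223 )'
```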
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 13, 2018
@rifelpet
Member

This has been fixed and will be released in Kops 1.11.0

/close

@k8s-ci-robot
Contributor

@rifelpet: Closing this issue.

In response to this:

This has been fixed and will be released in Kops 1.11.0

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
