K8s system pods fail due to liveness check not working #6506

Closed
ivanvtimofeev opened this issue Aug 6, 2020 · 14 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ivanvtimofeev

What happened:
kube-scheduler and kube-controller-manager pods fail because their liveness checks no longer work. The liveness checks fail because the healthz endpoints for these pods were removed in Kubernetes 1.16.13 (for the kube-scheduler pod http://127.0.0.1:10251/healthz and for the kube-controller-manager pod http://127.0.0.1:10252/healthz).
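For illustration, the failure is easy to confirm on a control-plane node: the probe targets refuse connections, and the failing liveness probes show up as restarts on the static pods. A minimal check (ports taken from the manifests; the tier=control-plane label is an assumption based on the kubeadm defaults):

curl -sf http://127.0.0.1:10251/healthz || echo "kube-scheduler healthz unreachable"
curl -sf http://127.0.0.1:10252/healthz || echo "kube-controller-manager healthz unreachable"
kubectl -n kube-system get pods -l tier=control-plane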

What you expected to happen:
I expect the k8s pod manifests not to contain liveness checks if the containers don't expose endpoints for them.
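The mismatch in the generated manifests looks like this: kubeadm now passes --port=0, which disables insecure serving, while the liveness probe still targets the insecure port. An illustrative excerpt for kube-scheduler (field values assumed from the kubeadm defaults of that release):

spec:
  containers:
  - command:
    - kube-scheduler
    - --port=0
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP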

How to reproduce it (as minimally and precisely as possible):
Deploy k8s using kubespray release-2.12 (https://github.com/kubernetes-sigs/kubespray/tree/release-2.12) with default k8s version.

Anything else we need to know?:

Environment:

  • Cloud provider or hardware configuration:
    AWS
  • OS (cat /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.7.16
    config file = None
    configured module search path = ['/home/centos/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

  • Version of Python (python --version):
    [centos@ip-172-31-15-227 ~]$ python --version
    Python 2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
2acc5a7

Network plugin used:
Tungsten Fabric, Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
all:
  hosts:
    node1:
      ansible_host: 172.31.15.227
      ip: 172.31.15.227
      access_ip: 172.31.15.227
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -e kube_pods_subnet=10.32.0.0/12 -e kube_service_addresses=10.96.0.0/12

@ivanvtimofeev ivanvtimofeev added the kind/bug Categorizes issue or PR as related to a bug. label Aug 6, 2020
@ivanvtimofeev
Author

ivanvtimofeev commented Aug 6, 2020

Additionally, here is the report about this bug in the k8s repo. They asked me to report it here: kubernetes/kubernetes#93746

@eifelmicha
Contributor

eifelmicha commented Aug 8, 2020

❯ kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
etcd-1               Healthy     {"health":"true"}                                                                           
etcd-0               Healthy     {"health":"true"}                                                                           
etcd-2               Healthy     {"health":"true"}         

I have the same issue. A "workaround" is to delete that port flag from the Kubernetes manifests, but I would be happy to have a better fix. This happened after I upgraded to Kubernetes 1.17.9 and release 2.13 a few days ago.

sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml
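(Side note: kubelet watches /etc/kubernetes/manifests and recreates the static pods on its own once the files change, so nothing needs restarting; afterwards the components should report healthy again:)

kubectl get componentstatuses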

@paulcosma

paulcosma commented Aug 12, 2020

Same issue here after upgrading from v1.18.5 to v1.18.6.

Edit: Also reproduced on a clean install (v2.14.0)
Server Version: v1.18.8 on Debian 10
Output

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0               Healthy     {"health":"true"}                                                 
etcd-1               Healthy     {"health":"true"}                                                 
etcd-2               Healthy     {"health":"true"}  

Cluster seems to work fine, though.
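For what it's worth, only the probe target is gone; the components still serve health on their secure ports, which can be checked directly (10259 and 10257 are the default secure ports for kube-scheduler and kube-controller-manager; -k because the serving certificates are self-signed):

curl -k https://127.0.0.1:10259/healthz
curl -k https://127.0.0.1:10257/healthz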

@pedrohmuniz

Hi, I'm having the same issue on the master; this worked for me:
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml

But when running cluster.yml again, these changes are not persisted.
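(That is expected: cluster.yml regenerates the static pod manifests from Kubespray's templates, which overwrites manual edits, so the sed has to be reapplied after each run, e.g. ad hoc across the masters; the inventory path matches the report above:)

ansible kube-master -i inventory/mycluster/hosts.yml --become -m shell -a "sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml"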

@linkvt
Contributor

linkvt commented Aug 24, 2020

Seems to be fixed in Kubernetes 1.16.14: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.16.md#changelog-since-v11613

Fixed a regression in kubeadm manifests for kube-scheduler and kube-controller-manager which caused continuous restarts because of failing health checks (#93208, @SataQiu) [SIG Cluster Lifecycle]

@linkvt
Contributor

linkvt commented Aug 25, 2020

I will create a PR to use the fixed 1.16.14 version very soon.
Until then, everybody should also be able to just fix the liveness probe instead of re-enabling the insecure liveness check ports, e.g. with a basic playbook like:

- hosts: kube-master
  gather_facts: false
  tasks:
  - name: kube-controller-manager - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: '10252'
      replace: '10257'
  - name: kube-controller-manager - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
  - name: Wait a few seconds as too fast updates don't tear down the previous version correctly
    pause:
      seconds: 10
  - name: kube-scheduler - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: '10251'
      replace: '10259'
  - name: kube-scheduler - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
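Saved as, say, fix-liveness-probes.yml (the filename is just an example), it can be run against the same inventory:

ansible-playbook -i inventory/mycluster/hosts.yml --become fix-liveness-probes.yml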

@diogomurta

(Quoting @eifelmicha's workaround above.)

Thanks!
Workaround works for me.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2020
@eifelmicha
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2020
@eifelmicha
Contributor

Seems that component status will be replaced anyway: kubernetes/kubernetes#93570

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 30, 2021
@oomichi
Contributor

oomichi commented May 20, 2021

This issue seems to be fixed with #6583

/close

@k8s-ci-robot
Contributor

@oomichi: Closing this issue.

In response to this:

This issue seems to be fixed with #6583

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
