K8s system pods fail due to liveness check not working #6506

Closed
ivanvtimofeev opened this issue Aug 6, 2020 · 14 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ivanvtimofeev

What happened:
kube-scheduler and kube-controller-manager pods fail because their liveness checks no longer work. The liveness checks fail because the healthz endpoints for these pods were removed in Kubernetes 1.16.13 (for the kube-scheduler pod http://127.0.0.1:10251/healthz and for the kube-controller-manager pod http://127.0.0.1:10252/healthz).
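For illustration, the failure is easy to confirm on a control-plane node: the probe targets refuse connections, and the failing liveness probes show up as restarts on the static pods. A minimal check (ports taken from the manifests; the tier=control-plane label is an assumption based on the kubeadm defaults):

curl -sf http://127.0.0.1:10251/healthz || echo "kube-scheduler healthz unreachable"
curl -sf http://127.0.0.1:10252/healthz || echo "kube-controller-manager healthz unreachable"
kubectl -n kube-system get pods -l tier=control-plane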

What you expected to happen:
I expect the k8s pod manifests not to contain liveness checks if the containers don't expose endpoints for them.
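The mismatch in the generated manifests looks like this: kubeadm now passes --port=0, which disables insecure serving, while the liveness probe still targets the insecure port. An illustrative excerpt for kube-scheduler (field values assumed from the kubeadm defaults of that release):

spec:
  containers:
  - command:
    - kube-scheduler
    - --port=0
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP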

How to reproduce it (as minimally and precisely as possible):
Deploy k8s using kubespray release-2.12 (https://github.com/kubernetes-sigs/kubespray/tree/release-2.12) with default k8s version.

Anything else we need to know?:

Environment:

  • Cloud provider or hardware configuration:
    AWS
  • OS (cat /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.7.16
    config file = None
    configured module search path = ['/home/centos/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

  • Version of Python (python --version):
    [centos@ip-172-31-15-227 ~]$ python --version
    Python 2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
2acc5a7

Network plugin used:
Tungsten Fabric, Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
all:
  hosts:
    node1:
      ansible_host: 172.31.15.227
      ip: 172.31.15.227
      access_ip: 172.31.15.227
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -e kube_pods_subnet=10.32.0.0/12 -e kube_service_addresses=10.96.0.0/12

@ivanvtimofeev ivanvtimofeev added the kind/bug Categorizes issue or PR as related to a bug. label Aug 6, 2020
@ivanvtimofeev
Author

ivanvtimofeev commented Aug 6, 2020

Additionally, here is the report about this bug in the k8s repo. They asked me to report it here: kubernetes/kubernetes#93746

@eifelmicha
Contributor

eifelmicha commented Aug 8, 2020

❯ kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
etcd-1               Healthy     {"health":"true"}                                                                           
etcd-0               Healthy     {"health":"true"}                                                                           
etcd-2               Healthy     {"health":"true"}         

I have the same issue. A "workaround" is to delete that port flag from the Kubernetes manifests, but I would be happy to have a better fix. This happened after I upgraded to Kubernetes 1.17.9 and release 2.13 a few days ago.

sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml
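(Side note: kubelet watches /etc/kubernetes/manifests and recreates the static pods on its own once the files change, so nothing needs restarting; afterwards the components should report healthy again:)

kubectl get componentstatuses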

@paulcosma

paulcosma commented Aug 12, 2020

Same issue here after upgrading from v1.18.5 to v1.18.6.

Edit: Also reproduced on a clean install (v2.14.0)
Server Version: v1.18.8 on Debian 10
Output

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0               Healthy     {"health":"true"}                                                 
etcd-1               Healthy     {"health":"true"}                                                 
etcd-2               Healthy     {"health":"true"}  

Cluster seems to work fine, though.
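For what it's worth, only the probe target is gone; the components still serve health on their secure ports, which can be checked directly (10259 and 10257 are the default secure ports for kube-scheduler and kube-controller-manager; -k because the serving certificates are self-signed):

curl -k https://127.0.0.1:10259/healthz
curl -k https://127.0.0.1:10257/healthz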

@pedrohmuniz

Hi, I'm having the same issue on the master; this worked for me:
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml

But when running cluster.yml again, these changes are not persisted.
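(That is expected: cluster.yml regenerates the static pod manifests from Kubespray's templates, which overwrites manual edits, so the sed has to be reapplied after each run, e.g. ad hoc across the masters; the inventory path matches the report above:)

ansible kube-master -i inventory/mycluster/hosts.yml --become -m shell -a "sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml /etc/kubernetes/manifests/kube-controller-manager.yaml"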

@linkvt
Contributor

linkvt commented Aug 24, 2020

Seems to be fixed in Kubernetes 1.16.14: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.16.md#changelog-since-v11613

Fixed a regression in kubeadm manifests for kube-scheduler and kube-controller-manager which caused continuous restarts because of failing health checks (#93208, @SataQiu) [SIG Cluster Lifecycle]

@linkvt
Contributor

linkvt commented Aug 25, 2020

I will create a PR to use the fixed 1.16.14 version very soon.
Until then, everybody should also be able to just fix the liveness probe instead of re-enabling the insecure liveness check ports, e.g. with a basic playbook like:

- hosts: kube-master
  gather_facts: false
  tasks:
  - name: kube-controller-manager - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: '10252'
      replace: '10257'
  - name: kube-controller-manager - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
  - name: Wait a few seconds as too fast updates don't tear down the previous version correctly
    pause:
      seconds: 10
  - name: kube-scheduler - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: '10251'
      replace: '10259'
  - name: kube-scheduler - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
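Saved as, say, fix-liveness-probes.yml (the filename is just an example), it can be run against the same inventory:

ansible-playbook -i inventory/mycluster/hosts.yml --become fix-liveness-probes.yml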

@diogomurta

(Quoting @eifelmicha's workaround above.)

Thanks!
Workaround works for me.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2020
@eifelmicha
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2020
@eifelmicha
Contributor

Seems that component status will be replaced anyway: kubernetes/kubernetes#93570

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 30, 2021
@oomichi
Contributor

oomichi commented May 20, 2021

This issue seems to be fixed with #6583

/close

@k8s-ci-robot
Contributor

@oomichi: Closing this issue.

In response to this:

This issue seems to be fixed with #6583

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
