kube-controller-manager in CrashLoopBackOff and Error #76069

Closed
vishnukraj1111 opened this issue Apr 3, 2019 · 3 comments

Comments

@vishnukraj1111

commented Apr 3, 2019

@kubernetes/sig-api-machinery-bugs

What happened:
/kind bug
I have a cluster with 3 master servers, and suddenly the kube-controller-manager pods went into CrashLoopBackOff and Error state on all 3 master nodes.
What you expected to happen:
They should run fine or restart successfully.

How to reproduce it (as minimally and precisely as possible):
It happened suddenly in two of my clusters. kubeadm reset solves it for some time, but it reoccurs after a few days.
Anything else we need to know?:
I can also see restarts in the kube-schedulers, and kubectl commands take more time than usual to return output.
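A rough way to quantify the slowness from a master node, sketched here only for reference, is to time a simple request and raise kubectl's verbosity so it prints the per-request latency (the resource queried is arbitrary):

# time an ordinary request against the apiserver
time kubectl get nodes
# at -v=6 kubectl logs each HTTP call with its round-trip time, e.g. "... 200 OK in 12 milliseconds"
kubectl get pods -n kube-system -v=6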

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    KVM 2 core 4GB Ram
  • OS (e.g: cat /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a):
    4.4.176-1.el7.elrepo.x86_64 #1 SMP Sat Feb 23 09:05:01 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:

Below are the error logs from one of the kube-controller-manager pods; I saw this particular log only once. The other logs are given further below.

Flag --address has been deprecated, see --bind-address instead.
I0403 05:23:25.287429 1 serving.go:318] Generated self-signed cert in-memory
I0403 05:23:25.842858 1 controllermanager.go:151] Version: v1.13.4
I0403 05:23:25.843339 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 05:23:25.843757 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 05:23:25.844529 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 05:23:43.136941 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 05:23:43.138223 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3612528", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster1.company.com_9b4ae0ba-55d0-11e9-8454-525400ec7062 became leader
I0403 05:23:44.904263 1 plugins.go:103] No cloud provider specified.
I0403 05:23:44.906064 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
E0403 05:23:56.904607 1 leaderelection.go:270] error retrieving resource lock kube-system/kube-controller-manager: Get https://172.17.33.40:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
E0403 05:23:56.904763 1 event.go:259] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'stagingmaster1.company.com_9b4ae0ba-55d0-11e9-8454-525400ec7062 stopped leading'
I0403 05:23:56.904908 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 05:23:56.904937 1 controllermanager.go:254] leaderelection lost

Below is how the pods look now

kube-controller-manager-stagingmaster1.company.com 0/1 CrashLoopBackOff 106 18h
kube-controller-manager-stagingmaster2.company.com 0/1 CrashLoopBackOff 100 18h
kube-controller-manager-stagingmaster3.company.com 0/1 CrashLoopBackOff 113 18h
kube-scheduler-stagingmaster1.company.com 1/1 Running 98 18h
kube-scheduler-stagingmaster2.company.com 1/1 Running 19 18h
kube-scheduler-stagingmaster3.company.com 1/1 Running 90 19d

These are the logs from all three pods

[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster1.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:25:55.717052 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:25:56.895769 1 controllermanager.go:151] Version: v1.13.4
I0403 06:25:56.898638 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:25:56.899251 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:25:56.899467 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:26:14.145302 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:26:14.147100 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617358", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster1.company.com_57186fbd-55d9-11e9-8026-525400ec7062 became leader
I0403 06:26:15.234291 1 plugins.go:103] No cloud provider specified.
I0403 06:26:15.235961 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
I0403 06:26:26.358371 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:26:26.358427 1 controllermanager.go:254] leaderelection lost
[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster2.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:26:25.738821 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:26:26.222086 1 controllermanager.go:151] Version: v1.13.5
I0403 06:26:26.223659 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:26:26.224681 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:26:26.224887 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:26:41.959151 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:26:41.959707 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617395", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster2.company.com_689324cf-55d9-11e9-a090-5254002bc8cc became leader
I0403 06:26:51.959652 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:26:51.959714 1 controllermanager.go:254] leaderelection lost
[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster3.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:27:02.373917 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:27:02.793242 1 controllermanager.go:151] Version: v1.13.4
I0403 06:27:02.794348 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:27:02.795223 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:27:02.795376 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:27:18.372944 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:27:18.373736 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617444", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster3.company.com_7e5f5bd7-55d9-11e9-a20d-525400652e16 became leader
I0403 06:27:18.451891 1 plugins.go:103] No cloud provider specified.
I0403 06:27:18.455452 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
I0403 06:27:30.455940 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:27:30.456007 1 controllermanager.go:254] leaderelection lost
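
The fatal "leaderelection lost" above means the controller-manager could not renew its lease within the leader-election renew deadline (10s by default, which matches the timeout=10s on the failed request in the first log). As a mitigation sketch only, not a fix for whatever is slowing the apiserver down, the leader-election timings can be relaxed in the kubeadm static pod manifest; the flag names are the documented kube-controller-manager flags, and the values here are arbitrary examples:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (default kubeadm path)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --leader-elect=true
    - --leader-elect-lease-duration=60s   # default 15s
    - --leader-elect-renew-deadline=40s   # must stay below the lease duration; default 10s
    - --leader-elect-retry-period=10s     # default 2s
    # ...existing flags unchanged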

/var/log/messages
Apr 3 12:02:19 stagingmaster1 kubelet: E0403 12:02:19.799477 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"
Apr 3 12:02:30 stagingmaster1 kubelet: E0403 12:02:30.799260 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"
Apr 3 12:02:45 stagingmaster1 kubelet: E0403 12:02:45.799568 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"

In all my searches I found very little online about resolving this, although many people seem to face it; most of them fixed it by rebuilding their clusters from scratch or by running kubeadm reset when it occurs.

If you need more information, please ask.

@vishnukraj1111

commented Apr 3, 2019

/sig api-machinery

@yue9944882

commented Apr 3, 2019

This is by design: the controller manager crashes when it loses connectivity to the apiserver.

@vishnukraj1111

commented Apr 4, 2019

This was a load balancer issue affecting the controller's connection to the API server.

@yue9944882 Thanks for the hint, otherwise I wouldn't have checked the API server logs.
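
For anyone else who lands here: a quick way to confirm a load balancer problem like this is to compare a health check sent through the load balancer address with one sent straight to the apiserver on the same master (the addresses below are placeholders; depending on the anonymous-auth setting you may need a client certificate instead of just -k):

# through the load balancer / VIP in front of the apiservers
time curl -sk https://<lb-address>:6443/healthz
# directly against the local apiserver, bypassing the load balancer
time curl -sk https://127.0.0.1:6443/healthz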

/close
