
kube-controller-manager in CrashLoopBackoff and Error #76069

Closed
vishnukraj1111 opened this issue Apr 3, 2019 · 6 comments

@vishnukraj1111 commented Apr 3, 2019

@kubernetes/sig-api-machinery-bugs

What happened:
/kind bug
I had a cluster with 3 master servers, and suddenly the kube-controller-managers went into CrashLoopBackOff and Error state on all three master nodes.
What you expected to happen:
The controller managers should run fine, or restart successfully.

How to reproduce it (as minimally and precisely as possible):
It happened suddenly in two of my clusters. kubeadm reset solves it for some time, but it reoccurs after a few days.
Anything else we need to know?:
I can also see restarts in the kube-schedulers, and kubectl commands take longer than usual to return output.
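
For reference, a quick way to quantify both symptoms (a sketch, assuming a working kubeconfig on a master node):

# Time an apiserver round trip; multi-second responses point at apiserver, etcd, or load balancer latency
time kubectl get nodes
# Restart counts for the control-plane pods
kubectl -n kube-system get pods | grep -E 'kube-(apiserver|controller-manager|scheduler)'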

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    KVM 2 core 4GB Ram
  • OS (e.g: cat /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a):
    4.4.176-1.el7.elrepo.x86_64 #1 SMP Sat Feb 23 09:05:01 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:

Below are the error logs from one of the kube-controller-manager pods. I saw this particular log only once; the other logs are given further below.

Flag --address has been deprecated, see --bind-address instead.
I0403 05:23:25.287429 1 serving.go:318] Generated self-signed cert in-memory
I0403 05:23:25.842858 1 controllermanager.go:151] Version: v1.13.4
I0403 05:23:25.843339 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 05:23:25.843757 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 05:23:25.844529 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 05:23:43.136941 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 05:23:43.138223 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3612528", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster1.company.com_9b4ae0ba-55d0-11e9-8454-525400ec7062 became leader
I0403 05:23:44.904263 1 plugins.go:103] No cloud provider specified.
I0403 05:23:44.906064 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
E0403 05:23:56.904607 1 leaderelection.go:270] error retrieving resource lock kube-system/kube-controller-manager: Get https://172.17.33.40:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
E0403 05:23:56.904763 1 event.go:259] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'stagingmaster1.company.com_9b4ae0ba-55d0-11e9-8454-525400ec7062 stopped leading'
I0403 05:23:56.904908 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 05:23:56.904937 1 controllermanager.go:254] leaderelection lost
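
For context on the fatal exit above: in v1.13 the controller-manager records its leadership as the control-plane.alpha.kubernetes.io/leader annotation on the kube-system/kube-controller-manager Endpoints object, and the process intentionally exits when it cannot renew that record in time. The current holder and renew times can be inspected with, for example:

kubectl -n kube-system get endpoints kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'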

Below is how the pods look now:

kube-controller-manager-stagingmaster1.company.com 0/1 CrashLoopBackOff 106 18h
kube-controller-manager-stagingmaster2.company.com 0/1 CrashLoopBackOff 100 18h
kube-controller-manager-stagingmaster3.company.com 0/1 CrashLoopBackOff 113 18h
kube-scheduler-stagingmaster1.company.com 1/1 Running 98 18h
kube-scheduler-stagingmaster2.company.com 1/1 Running 19 18h
kube-scheduler-stagingmaster3.company.com 1/1 Running 90 19d

These are the logs from all three pods:

[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster1.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:25:55.717052 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:25:56.895769 1 controllermanager.go:151] Version: v1.13.4
I0403 06:25:56.898638 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:25:56.899251 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:25:56.899467 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:26:14.145302 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:26:14.147100 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617358", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster1.company.com_57186fbd-55d9-11e9-8026-525400ec7062 became leader
I0403 06:26:15.234291 1 plugins.go:103] No cloud provider specified.
I0403 06:26:15.235961 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
I0403 06:26:26.358371 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:26:26.358427 1 controllermanager.go:254] leaderelection lost
[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster2.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:26:25.738821 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:26:26.222086 1 controllermanager.go:151] Version: v1.13.5
I0403 06:26:26.223659 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:26:26.224681 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:26:26.224887 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:26:41.959151 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:26:41.959707 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617395", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster2.company.com_689324cf-55d9-11e9-a090-5254002bc8cc became leader
I0403 06:26:51.959652 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:26:51.959714 1 controllermanager.go:254] leaderelection lost
[root@stagingmaster1 ~]# kubectl logs -n kube-system kube-controller-manager-stagingmaster3.company.com
Flag --address has been deprecated, see --bind-address instead.
I0403 06:27:02.373917 1 serving.go:318] Generated self-signed cert in-memory
I0403 06:27:02.793242 1 controllermanager.go:151] Version: v1.13.4
I0403 06:27:02.794348 1 secure_serving.go:116] Serving securely on [::]:10257
I0403 06:27:02.795223 1 deprecated_insecure_serving.go:51] Serving insecurely on 127.0.0.1:10252
I0403 06:27:02.795376 1 leaderelection.go:205] attempting to acquire leader lease kube-system/kube-controller-manager...
I0403 06:27:18.372944 1 leaderelection.go:214] successfully acquired lease kube-system/kube-controller-manager
I0403 06:27:18.373736 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"eb40eac2-464b-11e9-8868-525400ec7062", APIVersion:"v1", ResourceVersion:"3617444", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' stagingmaster3.company.com_7e5f5bd7-55d9-11e9-a20d-525400652e16 became leader
I0403 06:27:18.451891 1 plugins.go:103] No cloud provider specified.
I0403 06:27:18.455452 1 controller_utils.go:1027] Waiting for caches to sync for tokens controller
I0403 06:27:30.455940 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0403 06:27:30.456007 1 controllermanager.go:254] leaderelection lost

/var/log/messages
Apr 3 12:02:19 stagingmaster1 kubelet: E0403 12:02:19.799477 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"
Apr 3 12:02:30 stagingmaster1 kubelet: E0403 12:02:30.799260 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"
Apr 3 12:02:45 stagingmaster1 kubelet: E0403 12:02:45.799568 38844 pod_workers.go:190] Error syncing pod e8895af6723911002b5cc8f57e191fb8 ("kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-stagingmaster1.company.com_kube-system(e8895af6723911002b5cc8f57e191fb8)"
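
One way to check whether each master can actually reach the apiserver address from the earlier error (a sketch; even a 401/403 response proves connectivity, and what matters is the round-trip time against the 10s leader-election timeout):

# Run from each master node
time curl -k https://172.17.33.40:6443/healthz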

There is very little about resolving this available online. In all my searches I could see many people facing it, and most of them fixed it by rebuilding the cluster from scratch or running kubeadm reset when it occurs.

If you need more information, please ask.

@vishnukraj1111 (Author) commented Apr 3, 2019

/sig api-machinery

@yue9944882 (Member) commented Apr 3, 2019

This is by design. The controller crashes when it loses connectivity to the apiserver (exiting on a lost lease prevents two instances from acting as leader at once).
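
For reference, the timing behind that crash is governed by the leader-election flags on kube-controller-manager (v1.13 defaults shown below; the 10s renew deadline matches the timeout=10s in the logs above, and raising these only masks slow apiserver round trips):

--leader-elect=true
--leader-elect-lease-duration=15s
--leader-elect-renew-deadline=10s
--leader-elect-retry-period=2s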

@vishnukraj1111 (Author) commented Apr 4, 2019

This was a load balancer issue affecting the controller's connection to the API server.

@yue9944882 Thanks for the hint, or else I wouldn't have checked the API server logs.

/close
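
A quick way to confirm a load balancer problem like this is to compare latency through the LB against a direct hit on one apiserver (a sketch; the addresses are placeholders to substitute for your own):

# Through the load balancer address that kubeconfig points at
time curl -k https://<lb-address>:6443/healthz
# Bypassing the LB, directly against one master
time curl -k https://<master-ip>:6443/healthz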

@shuaipeng123 commented Oct 9, 2019

@vishnukraj1111 Hi, how did you fix this problem?

@chenshuyuhhh commented Jul 10, 2020

I also have a similar problem. My kube-controller-manager pod went into CrashLoopBackOff after the Wi-Fi network changed.
I used the below command to restart the pod:
kubectl get pod kube-controller-manager-k8s-master -n kube-system -o yaml | kubectl replace --force -f -
The pod was then deleted, but not restarted:
pod "kube-controller-manager-k8s-master" deleted
error: timed out waiting for the condition
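
Note that on a kubeadm cluster kube-controller-manager is a static pod: the API object is only a mirror, so kubectl replace cannot recreate it, and the kubelet only restarts it when its manifest changes. A sketch, assuming the default kubeadm manifest path:

# Move the static pod manifest away, wait for the kubelet to tear the pod down, then restore it
mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
sleep 20
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/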

@thien-enouvo commented Dec 21, 2020

I got the same issue. I had set up a full HA cluster on a single disk.
From my investigation, the disk's write speed was too low. The cluster has worked well since I migrated to an SSD with a high write speed.
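
That matches the symptoms: slow fsync on the etcd volume delays every write, including lease renewals. Two ways to check (a sketch; the etcd pod name is a placeholder, and the fio invocation follows the commonly cited etcd disk benchmark):

# etcd warns when a WAL fsync takes longer than expected
kubectl -n kube-system logs etcd-<master-node> | grep -i 'sync duration'
# Benchmark fsync latency on the same device etcd uses (requires fio)
mkdir -p /var/lib/etcd-fio-test
fio --name=etcd-fsync --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-fio-test --size=22m --bs=2300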
