Managed Istio: Manager exited non-zero #27509

Closed
luiof opened this issue Sep 24, 2020 · 6 comments · Fixed by #28085 or #32165

luiof commented Sep 24, 2020

Hi team! We have managed Istio on multiple IKS clusters.
We are using the latest version, 1.7.2, and sporadically on some clusters we see logs like the following:

2020-09-24T13:23:12.697712Z	error	error retrieving resource lock ibm-operators/istio-operator-lock: Get "https://172.21.0.1:443/api/v1/namespaces/ibm-operators/configmaps/istio-operator-lock": context deadline exceeded
2020-09-24T13:23:12.697802Z	info	failed to renew lease ibm-operators/istio-operator-lock: timed out waiting for the condition
2020-09-24T13:23:12.697865Z	fatal	Manager exited non-zero: leader election lost

and node logs like the following:

Sep 24 15:23:13 kube-brq9cjpd05d8tb2l7bfg-ghostfordoc-default-00000115 syslog E0924 13:23:13.260809    6966 pod_workers.go:191] Error syncing pod bd4d62ea-e38d-4e67-98c0-48b339f06337 ("managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"), skipping: failed to "StartContainer" for "istio-operator" with CrashLoopBackOff: "back-off 10s restarting failed container=istio-operator pod=managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"

Sep 24 15:23:13 kube-brq9cjpd05d8tb2l7bfg-ghostfordoc-default-00000115 kubelet.log E0924 13:23:13.260809    6966 pod_workers.go:191] Error syncing pod bd4d62ea-e38d-4e67-98c0-48b339f06337 ("managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"), skipping: failed to "StartContainer" for "istio-operator" with CrashLoopBackOff: "back-off 10s restarting failed container=istio-operator pod=managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"

This appears only on some days, and otherwise everything seems fine (the system works as usual).

istioctl analyze -n <namespace> doesn't report any problems.
istioctl version shows that every proxy is updated to the latest 1.7.2 version.

Any ideas? Are there actions we can take to avoid these crashes?

If required, I can attach some extra logs.

sdake (Member) commented Sep 27, 2020

@ostromart this looks like a crash failure. Can you take a look?

sdake added the area/environments and area/environments/operator labels on Sep 27, 2020
sdake added this to the 1.8 milestone on Sep 27, 2020
sdake (Member) commented Sep 27, 2020

@luiof just a brief look. It appears as if leader election, which is used to ensure only one operator is running at a time within a namespace (aka fencing), is lost because of a timeout reading from the K8s API server. Looking at the K8s client code, the default is 10 seconds for RenewDeadline and the operator does not override this configuration.

10 seconds may be insufficient on a heavily loaded K8s API server.

@ostromart how do you think we should proceed? Increase the timeout to something more reasonable like 30 seconds?

@morvencao if you're bored :)

Cheers,
-steve
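
For context, the knobs in question live on the controller-runtime manager options. Below is a minimal sketch of overriding them, assuming the operator wires its manager through sigs.k8s.io/controller-runtime; the 60s/30s/5s values are illustrative, not the operator's actual settings:

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// client-go defaults are 15s leaseDuration, 10s renewDeadline, and
	// 2s retryPeriod; raising renewDeadline also requires raising
	// leaseDuration, which must stay strictly greater.
	leaseDuration := 60 * time.Second
	renewDeadline := 30 * time.Second
	retryPeriod := 5 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "istio-operator-lock",
		LeaderElectionNamespace: "ibm-operators",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Start blocks until the manager stops; a lost leader election
	// surfaces here as an error, which is what produces the
	// "Manager exited non-zero" fatal log above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}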

@richardwxn (Contributor)

We can add a new environment variable to the template to customize the renewDeadline and pass it into the manager config.
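
A minimal sketch of that idea, assuming an env var named RENEW_DEADLINE; the helper below is hypothetical (not necessarily how the fix PR implemented it), and its result would feed the manager options shown above:

package main

import (
	"fmt"
	"os"
	"time"
)

// renewDeadlineFromEnv is a hypothetical helper: it parses RENEW_DEADLINE
// (e.g. "30s") from the environment and falls back to the 10s client-go
// default when the variable is unset or unparseable.
func renewDeadlineFromEnv() *time.Duration {
	d := 10 * time.Second
	if v := os.Getenv("RENEW_DEADLINE"); v != "" {
		if parsed, err := time.ParseDuration(v); err == nil {
			d = parsed
		}
	}
	return &d
}

func main() {
	// e.g. RENEW_DEADLINE=30s ./operator
	fmt.Println(*renewDeadlineFromEnv())
}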

luiof (Author) commented Oct 10, 2020

Any update or ETA on the new customization to mitigate the timeout?
Our managed-istio-operator has restarted multiple times in four days:

$ kubectl get pods -n ibm-operators
NAME                                     READY   STATUS    RESTARTS   AGE
managed-istio-operator-56759bf69-xjdxc   1/1     Running   10         4d

@bianpengyuan (Contributor)

@richardwxn I made this a P0 for 1.8, as this seems to be an important bug fix. It looks like you have an idea of how to fix it, so I'm temporarily assigning it to you. Please downgrade if you think this is not a P0, or reassign if necessary. Thanks!

@William-Newshutz

This lets us set RENEW_DEADLINE, but the leaseDuration limits how high the renewDeadline can go.

We set RENEW_DEADLINE to 30s and got this error when the operator pod tried to start:

2021-04-09T18:45:44.565506Z	info	installer	Controller added
2021-04-09T18:45:44.565541Z	info	Starting the Cmd.
2021-04-09T18:45:44.565767Z	fatal	Manager exited non-zero: leaseDuration must be greater than renewDeadline

It seems like the leaseDuration defaults to 15s, so any renewDeadline above that is rejected.

cc @richardwxn
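
For reference, the startup failure above comes from client-go's leader-election validation: the lease duration (how long other candidates wait before trying to take over) must be strictly greater than the renew deadline (how long the current leader keeps retrying a renewal), otherwise two operators could both believe they hold the lock. Here is a loose, illustrative reimplementation of that check, not the library's actual code:

package main

import (
	"fmt"
	"time"
)

// validateLeaderElectionTimeouts loosely mirrors the check client-go runs
// before starting leader election (illustrative, not the library's code).
func validateLeaderElectionTimeouts(leaseDuration, renewDeadline time.Duration) error {
	if leaseDuration <= renewDeadline {
		return fmt.Errorf("leaseDuration must be greater than renewDeadline")
	}
	return nil
}

func main() {
	// Reproduces the failure above: RENEW_DEADLINE=30s against the default 15s lease.
	fmt.Println(validateLeaderElectionTimeouts(15*time.Second, 30*time.Second))
	// Raising both together passes (prints <nil>).
	fmt.Println(validateLeaderElectionTimeouts(60*time.Second, 30*time.Second))
}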
