Managed Istio: Manager exited non-zero #27509

Closed
luiof opened this issue Sep 24, 2020 · 6 comments · Fixed by #28085 or #32165

luiof commented Sep 24, 2020

Hi team! We have managed Istio on multiple IKS clusters.
We are using the latest version, 1.7.2, and sporadically on some clusters we see logs like the following:

2020-09-24T13:23:12.697712Z	error	error retrieving resource lock ibm-operators/istio-operator-lock: Get "https://172.21.0.1:443/api/v1/namespaces/ibm-operators/configmaps/istio-operator-lock": context deadline exceeded
2020-09-24T13:23:12.697802Z	info	failed to renew lease ibm-operators/istio-operator-lock: timed out waiting for the condition
2020-09-24T13:23:12.697865Z	fatal	Manager exited non-zero: leader election lost

and node logs like the following:

Sep 24 15:23:13 kube-brq9cjpd05d8tb2l7bfg-ghostfordoc-default-00000115 syslog E0924 13:23:13.260809    6966 pod_workers.go:191] Error syncing pod bd4d62ea-e38d-4e67-98c0-48b339f06337 ("managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"), skipping: failed to "StartContainer" for "istio-operator" with CrashLoopBackOff: "back-off 10s restarting failed container=istio-operator pod=managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"

Sep 24 15:23:13 kube-brq9cjpd05d8tb2l7bfg-ghostfordoc-default-00000115 kubelet.log E0924 13:23:13.260809    6966 pod_workers.go:191] Error syncing pod bd4d62ea-e38d-4e67-98c0-48b339f06337 ("managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"), skipping: failed to "StartContainer" for "istio-operator" with CrashLoopBackOff: "back-off 10s restarting failed container=istio-operator pod=managed-istio-operator-88cc547b4-8p9x4_ibm-operators(bd4d62ea-e38d-4e67-98c0-48b339f06337)"

This appears only on some days, and otherwise everything seems fine (the system works as usual).

istioctl analyze -n <namespace> doesn't report any problems.
istioctl version shows that every proxy is updated to the latest 1.7.2 version.

Any ideas? Are there actions we can take to avoid these crashes?

If required, I can attach some extra logs.

sdake (Member) commented Sep 27, 2020

@ostromart this looks like a crash failure. Can you take a look?

sdake added the area/environments and area/environments/operator labels on Sep 27, 2020
sdake added this to the 1.8 milestone on Sep 27, 2020
sdake (Member) commented Sep 27, 2020

@luiof just a brief look. It appears as if leader election, which is used to ensure only one operator is running at a time within a namespace (aka fencing), is lost because of a timeout reading from the K8s API server. Looking at the K8s client code, the default is 10 seconds for RenewDeadline and the operator does not override this configuration.

10 seconds may be insufficient on a heavily loaded K8s API server.

@ostromart how do you think we should proceed? Increase the timeout to something more reasonable like 30 seconds?

@morvencao if you're bored :)

Cheers,
-steve
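
For context, the knobs in question live on the controller-runtime manager options. Below is a minimal sketch of overriding them, assuming the operator wires its manager through sigs.k8s.io/controller-runtime; the 60s/30s/5s values are illustrative, not the operator's actual settings:

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// client-go defaults are 15s leaseDuration, 10s renewDeadline, and
	// 2s retryPeriod; raising renewDeadline also requires raising
	// leaseDuration, which must stay strictly greater.
	leaseDuration := 60 * time.Second
	renewDeadline := 30 * time.Second
	retryPeriod := 5 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "istio-operator-lock",
		LeaderElectionNamespace: "ibm-operators",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Start blocks until the manager stops; a lost leader election
	// surfaces here as an error, which is what produces the
	// "Manager exited non-zero" fatal log above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}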

@richardwxn (Contributor)

We can add a new environment variable to the template to customize the renewDeadline and pass it into the manager config.
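
A minimal sketch of that idea, assuming an env var named RENEW_DEADLINE; the helper below is hypothetical (not necessarily how the fix PR implemented it), and its result would feed the manager options shown above:

package main

import (
	"fmt"
	"os"
	"time"
)

// renewDeadlineFromEnv is a hypothetical helper: it parses RENEW_DEADLINE
// (e.g. "30s") from the environment and falls back to the 10s client-go
// default when the variable is unset or unparseable.
func renewDeadlineFromEnv() *time.Duration {
	d := 10 * time.Second
	if v := os.Getenv("RENEW_DEADLINE"); v != "" {
		if parsed, err := time.ParseDuration(v); err == nil {
			d = parsed
		}
	}
	return &d
}

func main() {
	// e.g. RENEW_DEADLINE=30s ./operator
	fmt.Println(*renewDeadlineFromEnv())
}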

luiof (Author) commented Oct 10, 2020

Any update or ETA on the new customization to mitigate the timeout?
Our managed-istio-operator has restarted multiple times in four days:

$ kubectl get pods -n ibm-operators
NAME                                     READY   STATUS    RESTARTS   AGE
managed-istio-operator-56759bf69-xjdxc   1/1     Running   10         4d

@bianpengyuan (Contributor)

@richardwxn I made this a P0 for 1.8, as this seems to be an important bug fix. It looks like you have an idea of how to fix it, so I'm temporarily assigning it to you. Please downgrade if you think this is not a P0, or reassign if necessary. Thanks!

@William-Newshutz

This lets us set RENEW_DEADLINE, but the leaseDuration limits how high the renewDeadline can go.

We set RENEW_DEADLINE to 30s and got this error when the operator pod tried to start:

2021-04-09T18:45:44.565506Z	info	installer	Controller added
2021-04-09T18:45:44.565541Z	info	Starting the Cmd.
2021-04-09T18:45:44.565767Z	fatal	Manager exited non-zero: leaseDuration must be greater than renewDeadline

It seems like the leaseDuration defaults to 15s, so any renewDeadline above that is rejected.

cc @richardwxn
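
For reference, the startup failure above comes from client-go's leader-election validation: the lease duration (how long other candidates wait before trying to take over) must be strictly greater than the renew deadline (how long the current leader keeps retrying a renewal), otherwise two operators could both believe they hold the lock. Here is a loose, illustrative reimplementation of that check, not the library's actual code:

package main

import (
	"fmt"
	"time"
)

// validateLeaderElectionTimeouts loosely mirrors the check client-go runs
// before starting leader election (illustrative, not the library's code).
func validateLeaderElectionTimeouts(leaseDuration, renewDeadline time.Duration) error {
	if leaseDuration <= renewDeadline {
		return fmt.Errorf("leaseDuration must be greater than renewDeadline")
	}
	return nil
}

func main() {
	// Reproduces the failure above: RENEW_DEADLINE=30s against the default 15s lease.
	fmt.Println(validateLeaderElectionTimeouts(15*time.Second, 30*time.Second))
	// Raising both together passes (prints <nil>).
	fmt.Println(validateLeaderElectionTimeouts(60*time.Second, 30*time.Second))
}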
