Manager error results in pod crash #1774

Closed
delps1001 opened this issue Jan 12, 2022 · 3 comments

@delps1001

delps1001 commented Jan 12, 2022

When using the manager, if there is an issue with leader election, our program should exit immediately, according to the docs:

// Start starts all registered Controllers and blocks until the context is cancelled.
// Returns an error if there is an error starting any controller.
//
// If LeaderElection is used, the binary must be exited immediately after this returns,
// otherwise components that need leader election might continue to run after the leader
// lock was lost.
Start(ctx context.Context) error
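
For reference, here is a minimal sketch of what honoring that contract looks like in a main func (assuming ctrl is sigs.k8s.io/controller-runtime; the lock name is hypothetical), exiting the process as soon as Start returns:

package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Sketch only: enable leader election and exit the process as soon as
	// Start returns, so nothing keeps running after the leader lock is lost.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock", // hypothetical lock name
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err) // log.Fatalf calls os.Exit(1)
	}
}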

However, when there is an intermittent network issue in our Kubernetes cluster, our application pod sometimes restarts continually because it has momentarily lost its connection to the Kubernetes API. When the pod comes back up and tries to acquire/reacquire the leader lock, the request times out, the pod exits, and the cycle repeats.

I'm seeing logs like this:

E0108 19:06:53.325550 1 leaderelection.go:361] Failed to update lock: Put "https://1.1.1.1:443/apis/coordination.k8s.io/v1/namespaces/namespace/leases/my-lock": context deadline exceeded
[application will fatally exit]
...
E0108 19:15:09.629250 1 leaderelection.go:325] error retrieving resource lock my-namespace/my-lock: Get "https://1.1.1.1:443/api/v1/namespaces/namespace/configmaps/my-lock": context deadline exceeded
[application will fatally exit]

Is this expected behavior? Is there anything I can do to prevent the pods from restarting continually when there are connectivity issues to the Kubernetes API?

@FillZpp
Contributor

FillZpp commented Jan 13, 2022

Is this expected behavior?

Yes, it is the expected behavior. The client can't tell whether the network issue affects the whole cluster or only itself. If it continued to run without restarting while another manager pod became the new leader, both could work at the same time and lead to unexpected results.

Is there anything I can do to prevent the pods from restarting continually when there are connectivity issues to the Kubernetes API?

  1. You can deploy your controller with a single replica and disable leader election, but this is not the recommended way.
  2. You can change the lease and renew durations of leader election, which are options provided in manager.Options (see the sketch after the field comments below). For example, if you set LeaseDuration=60s and RenewDeadline=55s, the manager can tolerate a network issue for about 60s, during which it keeps retrying every 2s (RetryPeriod). By default it can only wait about 15s before stopping itself.

// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack. Default is 15 seconds.
LeaseDuration *time.Duration
// RenewDeadline is the duration that the acting controlplane will retry
// refreshing leadership before giving up. Default is 10 seconds.
RenewDeadline *time.Duration
// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions. Default is 2 seconds.
RetryPeriod *time.Duration
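
For concreteness, a minimal sketch of wiring those values in (the durations are the example values above; assuming ctrl is sigs.k8s.io/controller-runtime and the lock name is hypothetical):

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 60 * time.Second // how long non-leaders wait before force-acquiring
	renewDeadline := 55 * time.Second // how long the leader retries renewal before giving up
	retryPeriod := 2 * time.Second    // interval between attempts (same as the default)

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock", // hypothetical lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
}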

@delps1001
Author

@FillZpp thank you for the information!

@shaharr-ma

Hi, is it possible to expose those leader-election values (LeaseDuration, RenewDeadline, RetryPeriod) as Helm variables?
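
One possible pattern (a sketch only: the flag names, chart value paths, and lock name are hypothetical, not existing chart options) is to render the durations from Helm values into the container args and parse them as flags before building manager.Options:

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical flags; a chart could render them into the Deployment's
// container args from values such as .Values.leaderElection.leaseDuration.
var (
	leaseDuration = flag.Duration("leader-elect-lease-duration", 15*time.Second, "leader election LeaseDuration")
	renewDeadline = flag.Duration("leader-elect-renew-deadline", 10*time.Second, "leader election RenewDeadline")
	retryPeriod   = flag.Duration("leader-elect-retry-period", 2*time.Second, "leader election RetryPeriod")
)

func managerOptions() ctrl.Options {
	flag.Parse()
	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock",     // hypothetical lock name
		LeaseDuration:    leaseDuration, // flag.Duration already returns *time.Duration
		RenewDeadline:    renewDeadline,
		RetryPeriod:      retryPeriod,
	}
}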
