Manager error results in pod crash #1774

Closed
delps1001 opened this issue Jan 12, 2022 · 3 comments

@delps1001

delps1001 commented Jan 12, 2022

When using the manager, if there is an issue with leader election, our program should exit immediately, according to the docs:

// Start starts all registered Controllers and blocks until the context is cancelled.
// Returns an error if there is an error starting any controller.
//
// If LeaderElection is used, the binary must be exited immediately after this returns,
// otherwise components that need leader election might continue to run after the leader
// lock was lost.
Start(ctx context.Context) error
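
For reference, here is a minimal sketch of what honoring that contract looks like in a main func (assuming ctrl is sigs.k8s.io/controller-runtime; the lock name is hypothetical), exiting the process as soon as Start returns:

package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Sketch only: enable leader election and exit the process as soon as
	// Start returns, so nothing keeps running after the leader lock is lost.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock", // hypothetical lock name
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err) // log.Fatalf calls os.Exit(1)
	}
}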

However, when there is an intermittent network issue in our Kubernetes cluster, our application pod sometimes restarts continually because it has momentarily lost its connection to the Kubernetes API. When the pod comes back up and tries to acquire/reacquire the leader lock, the request times out, the pod exits, and the cycle repeats.

I'm seeing logs like this:

E0108 19:06:53.325550 1 leaderelection.go:361] Failed to update lock: Put "https://1.1.1.1:443/apis/coordination.k8s.io/v1/namespaces/namespace/leases/my-lock": context deadline exceeded
[application will fatally exit]
...
E0108 19:15:09.629250 1 leaderelection.go:325] error retrieving resource lock my-namespace/my-lock: Get "https://1.1.1.1:443/api/v1/namespaces/namespace/configmaps/my-lock": context deadline exceeded
[application will fatally exit]

Is this expected behavior? Is there anything I can do to prevent the pods from restarting continually when there are connectivity issues to the Kubernetes API?

@FillZpp
Contributor

FillZpp commented Jan 13, 2022

Is this expected behavior?

Yes, it is the expected behavior. The client can't tell whether the network issue affects the whole cluster or only itself. If it continued to run without restarting while another manager pod became the new leader, both could work at the same time and lead to unexpected results.

Is there anything I can do to prevent the pods from restarting continually when there are connectivity issues to the Kubernetes API?

  1. You can deploy your controller with a single replica and disable leader election, but this is not the recommended way.
  2. You can change the lease and renew durations of leader election, which are options provided in manager.Options (see the sketch after the field comments below). For example, if you set LeaseDuration=60s and RenewDeadline=55s, the manager can tolerate a network issue for about 60s, during which it keeps retrying every 2s (RetryPeriod). By default it can only wait about 15s before stopping itself.

// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack. Default is 15 seconds.
LeaseDuration *time.Duration
// RenewDeadline is the duration that the acting controlplane will retry
// refreshing leadership before giving up. Default is 10 seconds.
RenewDeadline *time.Duration
// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions. Default is 2 seconds.
RetryPeriod *time.Duration
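
For concreteness, a minimal sketch of wiring those values in (the durations are the example values above; assuming ctrl is sigs.k8s.io/controller-runtime and the lock name is hypothetical):

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 60 * time.Second // how long non-leaders wait before force-acquiring
	renewDeadline := 55 * time.Second // how long the leader retries renewal before giving up
	retryPeriod := 2 * time.Second    // interval between attempts (same as the default)

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock", // hypothetical lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
}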

@delps1001
Author

@FillZpp thank you for the information!

@shaharr-ma

Hi, is it possible to expose those leader-election values (LeaseDuration, RenewDeadline, RetryPeriod) as Helm variables?
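
One possible pattern (a sketch only: the flag names, chart value paths, and lock name are hypothetical, not existing chart options) is to render the durations from Helm values into the container args and parse them as flags before building manager.Options:

import (
	"flag"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical flags; a chart could render them into the Deployment's
// container args from values such as .Values.leaderElection.leaseDuration.
var (
	leaseDuration = flag.Duration("leader-elect-lease-duration", 15*time.Second, "leader election LeaseDuration")
	renewDeadline = flag.Duration("leader-elect-renew-deadline", 10*time.Second, "leader election RenewDeadline")
	retryPeriod   = flag.Duration("leader-elect-retry-period", 2*time.Second, "leader election RetryPeriod")
)

func managerOptions() ctrl.Options {
	flag.Parse()
	return ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "my-lock",     // hypothetical lock name
		LeaseDuration:    leaseDuration, // flag.Duration already returns *time.Duration
		RenewDeadline:    renewDeadline,
		RetryPeriod:      retryPeriod,
	}
}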
