
etcd lease auto-renewal can extend event TTL indefinitely #65497

Closed
jpbetz opened this issue Jun 26, 2018 · 15 comments

Comments

@jpbetz
Contributor

commented Jun 26, 2018

Per the etcd ops guide: "Once a majority of members works, the etcd cluster elects a new leader automatically and returns to a healthy state. The new leader extends timeouts automatically for all leases. This mechanism ensures no lease expires due to server side unavailability."

For events, if leader elections occur more often than the event-ttl (which defaults to 1h), event leases will be renewed indefinitely, and the number of events stored in etcd (and the number of open leases) will grow until either the etcd storage space limit is exceeded or some other limit is hit (e.g. the lease count becomes large enough that revoke operations are excessively expensive). A few options:

  1. Add an option to etcd lease creation to disable auto-renewal, and use this option to disable auto-renewal of event leases created by k8s
  2. Add a remediation routine (in kube-apiserver?) to remove old events (feels like an ugly hack)
  3. Transition to a different approach of expiring events in k8s (GC?)

We're currently considering (1) as a short term fix for this issue.
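
For context, this is roughly how a key ends up bound to a per-object lease with the configured TTL via the etcd clientv3 API. A minimal sketch, assuming etcd client v3.5+; the endpoint, key, and helper name are illustrative and not the actual apiserver code:

```go
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// putWithTTL writes a key whose lifetime is bound to a fresh lease, similar in
// spirit to how kube-apiserver stores Event objects with --event-ttl (default 1h).
// If leader elections keep extending the lease server side, the key outlives ttl.
func putWithTTL(ctx context.Context, cli *clientv3.Client, key, val string, ttl int64) error {
	lease, err := cli.Grant(ctx, ttl) // one lease per object in this sketch
	if err != nil {
		return err
	}
	_, err = cli.Put(ctx, key, val, clientv3.WithLease(lease.ID))
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// 3600s mirrors the default event-ttl of one hour.
	if err := putWithTTL(context.Background(), cli,
		"/registry/events/default/example", "...", 3600); err != nil {
		panic(err)
	}
}
```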

@gyuho, @xiang90 Do you have data or intuition about how often leader elections typically occur? Or is this too dependent on the environment to say?

cc @wenjiaswe @wojtek-t @mborsz

@jpbetz jpbetz added the area/etcd label Jun 26, 2018

@jpbetz jpbetz self-assigned this Jun 26, 2018

@k8s-ci-robot k8s-ci-robot removed the needs-sig label Jun 26, 2018

@xiang90

Contributor

commented Jun 26, 2018

Do you have data or intuition about how often leader elections typically occur? Or is this too dependent on the environment to say?

In a stable environment it should happen less than once per week, and certainly not multiple times a day.

Add an option to etcd lease creation to disable auto-renewal, and use this option to disable auto-renewal of event leases created by k8s

We do not persist the remaining lease duration for performance reasons. We could, however, persist the remaining duration lazily for long-term leases. For example, we could persist a one-hour lease's remaining duration every 10 minutes. The overhead should be minimal.
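
To illustrate the idea (a rough sketch only, not etcd's actual implementation; the lessor type, persist callback, and interval are made up): a background loop could periodically checkpoint the remaining TTL of long-lived leases so a new leader restores that remainder instead of granting the full original TTL again.

```go
package lease

import (
	"sync"
	"time"
)

// checkpointInterval and the persist callback are illustrative; the real etcd
// change (lease checkpointing) differs in detail.
const checkpointInterval = 10 * time.Minute

type lease struct {
	id     int64
	expiry time.Time // in-memory expiry; lost on leader change without checkpoints
}

type lessor struct {
	mu      sync.Mutex
	leases  map[int64]*lease
	persist func(id int64, remaining time.Duration) // e.g. record via raft / backend
}

// run periodically records the remaining TTL of long-lived leases so that a
// new leader can restore the remainder rather than the full original TTL.
func (le *lessor) run(stop <-chan struct{}) {
	ticker := time.NewTicker(checkpointInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			le.mu.Lock()
			now := time.Now()
			for _, l := range le.leases {
				if remaining := l.expiry.Sub(now); remaining > checkpointInterval {
					le.persist(l.id, remaining)
				}
			}
			le.mu.Unlock()
		case <-stop:
			return
		}
	}
}
```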

We probably should move this discussion to etcd repo.

@gyuho

Member

commented Jun 26, 2018

@jpbetz

Contributor Author

commented Jun 26, 2018

Created etcd-io/etcd#9888 to track this on the etcd repo.

@fedebongio

Contributor

commented Jun 28, 2018

/cc @jingyih

@jpbetz

Contributor Author

commented Aug 3, 2018

Fixed by #64539

@mborsz

Member

commented Aug 6, 2018

@jpbetz Could you elaborate how #64539 fixes this issue?

@bgrant0607

Member

commented Aug 7, 2018

cc @janetkuo @kow3ns since we had discussed a controller-based TTL GC

@jpbetz

Contributor Author

commented Aug 7, 2018

@mborsz The root-cause fix is etcd-io/etcd#9924, but it seems likely that #64539 helps mitigate this issue for k8s in cases where event load is high, since without #64539 the lease revoke operations become expensive in etcd and can themselves cause leader election churn.
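
For reference, the mitigation in #64539 is essentially lease reuse: many events created close together share one lease instead of each getting its own, which keeps the total lease count (and revoke cost) low. A hedged sketch of that idea; the type name and reuse window are illustrative, not the actual kube-apiserver code:

```go
package storage

import (
	"context"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// leaseReuser hands out the same lease to all objects created within
// reuseWindow, so N events share one lease instead of creating N leases.
type leaseReuser struct {
	mu          sync.Mutex
	client      *clientv3.Client
	current     clientv3.LeaseID
	expiresAt   time.Time
	reuseWindow time.Duration
}

func (r *leaseReuser) leaseFor(ctx context.Context, ttlSeconds int64) (clientv3.LeaseID, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.current != clientv3.NoLease && time.Now().Before(r.expiresAt) {
		return r.current, nil // reuse the recently granted lease
	}
	resp, err := r.client.Grant(ctx, ttlSeconds)
	if err != nil {
		return clientv3.NoLease, err
	}
	r.current = resp.ID
	r.expiresAt = time.Now().Add(r.reuseWindow)
	return r.current, nil
}
```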

@fejta-bot


commented Nov 5, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@jpbetz

Contributor Author

commented Nov 5, 2018

/remove-lifecycle stale

@fejta-bot


commented Feb 3, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@wojtek-t

Member

commented Feb 4, 2019

/remove-lifecycle stale

@fejta-bot


commented May 5, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@wojtek-t

Member

commented May 6, 2019

@jpbetz - can we close it now?

@jpbetz

Contributor Author

commented May 6, 2019

Yes, this is resolved.

@jpbetz jpbetz closed this May 6, 2019
