Reevaluate events deletion policy #38949
Ideally, this would also allow a separate X for different kinds of events – for example, "container failed liveness probe and was restarted" is not really relevant after 24h, but "this node stopped reporting" would be to me. The latter are also much rarer. |
We don't cache watch events today, so this would be adding a new events
watch, which is likely to be high bandwidth. It also increases the number
of deletes sent to etcd, although I suspect this would result in less
overall load on etcd than before. So I'm cautiously in favor of this.
|
As far as I can tell, events don't need the same guarantees in terms of
consistency and persistence as more critical objects. How could they be
stored such that they don't need to go through the same consensus?
|
At one point we allowed events to be stored in their own etcd cluster. Not
sure anyone runs that way today after the fixes to reduce the write volume
of events.
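For reference, pointing events at their own etcd is done today with kube-apiserver's `--etcd-servers-overrides` flag, which maps a resource to a dedicated set of etcd endpoints. A minimal sketch (the endpoint addresses are placeholders):

```
kube-apiserver \
  --etcd-servers=https://etcd-main:2379 \
  --etcd-servers-overrides=/events#https://etcd-events:2379
```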
|
As far as I'm aware that is still possible, but I'm wondering if it's the
right kind of storage at all. On the other hand there isn't really any
other kind available.
|
We have seen event saturation in a few cases, where an overzealous
component went into a hot loop, was insufficiently rate-limited, and took
all the master's write capacity. We should probably be limiting event
creation independently at the server level, but that hasn't been bumped in
priority yet.
Elasticsearch has been proposed as a backend for events but no one has
stepped up to implement the basic CRUD (and watch is problematic).
|
#19637, as Clayton alluded to, but nobody ever did anything about it. |
Yes - we definitely want to move events to a different kind of storage, but I'm not aware of any real effort towards this goal. Regarding storing events in a separate etcd, we are doing this, and in fact this significantly helps for performance. Though we should reevaluate it after moving to etcd v3 (but I guess there will still be a gain in having it).
Yes - we don't have the cacher enabled for events today. However, even if we decide to go with the proposal, I wouldn't enable it, to avoid a visible increase in apiserver load. I think watch for events should be served directly from etcd because:
That said, I agree it would increase the load on etcd (the number of deletes will be significant). On the other hand, etcd itself should have less work, as there won't be any objects with TTLs in that case. With this approach, implementing the logic that @matthiasr suggested (i.e. different policies for different kinds of events) would also be relatively straightforward. One very minor issue: since we are "merging" similar events, I assume we will be gc-ing them based on last occurrence. That means we can end up in a situation where a very old event is still "kind of present" in the system (because it also happened very recently), whereas some events that happened after it have already been removed (because they had no recent occurrences). I guess it's not a very big deal, but we need to be very clear about how exactly it works and document it somewhere. |
From the user perspective I think we should just say that we care only about the "last seen" timestamp. Then it's pretty straightforward which event objects are deleted and which are not. WDYT? |
I think that's fine. The only complication is occurrence counts – they'd keep counting up as long as an event happens frequently enough. But I think that's easy enough to reason about. |
(also, for such a frequent event, all I would look for in an occurrence count is "a lot" vs. "not a lot", so that'd be fine). |
So I'm going to ignore all the other comments and "just say no!" Events were always intended to be ephemeral; if folks want a history service, then create a history service and shunt events out of etcd entirely. At the scales we reach today, and want to reach tomorrow, this type of option is untenable. |
Yeah, events were intended to be archived and read from the archival store.
I'm not in favor of doing stop-gap work to make the etcd events stick
around longer.
|
Fair enough. How can this, and the options to do so, be made more clear to
people building clusters?
|
#36304 is relevant (I think) |
I don't agree with this reasoning. Events being ephemeral is somewhat orthogonal to how long we try to keep them. Keeping them for 1h makes it impossible to debug problems that occurred during a longer lunch, not to mention during the night. @timothysc what scale issues are you worried about? What size of cluster? If the number of events is high we would just delete more quickly. I imagine this might even be more manageable, as you would be able to say where the threshold is, while with 1h you can't guarantee anything. @lavalamp I don't see this as a stop-gap solution. I think there are actually advantages such as:
|
All the more reason to address this properly with a history service, b/c even if you lengthen the time there will always be the question "How long is long enough vs. what is arbitrary?" If you were to record overnight on a 1k-node / 250k-pod high-churn cluster, that is an obscene amount of data to be holding onto, which can OOM etcd. |
Of course in large clusters this would not solve the problem, as you pointed out. But it would not be worse - we would only change the method of deletion. However this could be a major improvement in smaller clusters. Most likely you don't want to run yet another pod/service just for this as it would be a large tax. Making this part of the "core" would simplify things greatly! I think for people with O(10) nodes this might be a huge improvement. |
I'm ok with an input knob for adjusting the TTL with default unchanged, but not a fan of changing the defaults. |
I thought we already allowed the event TTL to be tuned? We use it in
OpenShift since we generate more events per unit time.
|
Yup.
--event-ttl duration Amount of time to retain events. Default is 1h. (default 1h0m0s)
|
Would you be OK with changing the mechanism from TTL in etcd to something
like garbage collection? My reasoning is that if you care about the total
number of events stored in the system, then a TTL knob doesn't match
your expectations, while a more direct solution where you just keep the
last X events would be easier to manage. WDYT?
|
In order to check the event count there would either need to be a watch or a periodic re-list of events. Neither of which I'm crazy about... I'd be ok with it if it was an opt-in behavior, and defaulted off for smaller clusters. /cc @hongchaodeng |
I'd expect garbage collection or active management to be far more expensive than simple TTL |
I agree w/ @liggitt. Moving away from the TTL would make for a much more complicated and resource-intensive system. |
Actually, since we are already watching most things on the cluster, and we
have the local watch cache, events might actually be less impactful for the
server if we did GC.
Also, in v2 TTL is more expensive than not having TTL, and in v3 I believe
it's still marginally more expensive (correct me if I'm wrong on this).
Having a local events cache would potentially make event compaction and
event updates cheaper.
|
not for events |
Heapster and OpenShift both end up calling List and Watch, so are we
gaining anything by not having it? A flyweight cache might be another
option for events - store only the last touch time.
|
If there is exactly one watcher of all events, we gain pretty much nothing from serving events from a cache (this starts to be useful if there are multiple watchers and/or we are e.g. serving lists from memory).
I think this may be close to true if we do garbage collection in the apiserver itself. But I'm pretty sure we don't want that. If instead we create another controller (or modify the existing gc-controller), then I'm pretty sure this would be significantly more expensive, as the apiserver would have to process a ton of new "remove event" API requests. |
From a feature perspective, I consider everything discussed here to be a near-zero-sum gain, or a band-aid type of solution that breaks down under multiple scenarios. If the overall user story is to provide a history service to glean meaningful insights into the behavior of pods, then I would contend that is a separate service, which is out of core. Nothing today prevents an operator from providing or creating that service with the facilities that already exist. /cc @kubernetes/sig-scalability-misc |
I agree, given the complexity the gains don't seem to be worth it. |
Closing this issue; there are a number of overlapping ones. |
Currently we delete all the events after 1h. This has been introduced as a way to keep the load on apiserver small. However, it also has a serious downside for the user, as they can't debug what has happened unless they persist events somewhere.
I'd like to propose a slightly changed approach where, instead of deleting all events after 1h, we use the same policy as for terminated pods: delete them only once we have more than X events, where X depends on the cluster size. I imagine we would do this by reusing the podgc controller logic and stop setting TTLs in etcd.
@lavalamp @wojtek-t @smarterclayton @piosz
@kubernetes/sig-instrumentation-misc (events are kind of instrumentation of the system)
@kubernetes/sig-api-machinery-misc (it's related to apiserver & co)
@kubernetes/sig-scalability-misc (proposed change might have scalability implications)