Reevaluate events deletion policy #38949

Closed
fgrzadkowski opened this issue Dec 19, 2016 · 32 comments
Labels
area/apiserver area/controller-manager sig/instrumentation sig/scalability

Comments

@fgrzadkowski
Contributor

Currently we delete all events after 1h. This was introduced as a way to keep the load on the apiserver small. However, it also has a serious downside for users: they can't debug what has happened unless they persist events somewhere.

I'd like to propose a slightly changed approach: instead of deleting all events after 1h, we use the same policy as for terminated pods and delete them only once we have more than X events, where X depends on the cluster size. I imagine we would do it by reusing the podgc controller logic and by no longer setting a TTL in etcd.
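
A minimal sketch of what such a loop could look like, assuming a client-go clientset; the function name, the flat count threshold, and the bare List call are hypothetical illustration, not the actual podgc logic:

```go
package eventgc

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gcEventsByCount deletes the oldest events (by last occurrence) once the
// total count exceeds maxEvents, instead of relying on an etcd TTL.
// maxEvents would be derived from cluster size, like the podgc threshold.
func gcEventsByCount(ctx context.Context, client kubernetes.Interface, maxEvents int) error {
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	excess := len(events.Items) - maxEvents
	if excess <= 0 {
		return nil
	}
	// Oldest "last seen" timestamps go first, so the most recently active
	// events survive the trim.
	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].LastTimestamp.Before(&events.Items[j].LastTimestamp)
	})
	for _, ev := range events.Items[:excess] {
		if err := client.CoreV1().Events(ev.Namespace).Delete(ctx, ev.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```

A real controller would use informers rather than a raw List; this only shows the deletion policy itself.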

@lavalamp @wojtek-t @smarterclayton @piosz

@kubernetes/sig-instrumentation-misc (events are kind of instrumentation of the system)
@kubernetes/sig-api-machinery-misc (it's related to apiserver & co)
@kubernetes/sig-scalability-misc (proposed change might have scalability implications)

@fgrzadkowski added the area/apiserver, area/controller-manager, and sig/instrumentation labels on Dec 19, 2016
@matthiasr

Ideally, this would also mean a separate X for different kinds of events – for example, "container failed liveness probe and was restarted" is not really relevant after 24h, but "this node stopped reporting" would be to me. The latter are also much rarer.
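
Purely as an illustrative sketch (the reason strings and durations here are hypothetical, not an agreed-on list), that could be expressed as a per-reason retention table consulted by whatever does the deletion:

```go
package eventgc

import "time"

// Hypothetical per-reason retention; events whose reason is not listed
// fall back to defaultEventRetention.
var eventRetention = map[string]time.Duration{
	"Unhealthy":    24 * time.Hour,     // e.g. failed liveness probe: stale quickly
	"NodeNotReady": 7 * 24 * time.Hour, // rare, and worth keeping much longer
}

const defaultEventRetention = 24 * time.Hour
```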

@smarterclayton
Contributor

smarterclayton commented Dec 19, 2016 via email

@matthiasr

matthiasr commented Dec 19, 2016 via email

@smarterclayton
Contributor

smarterclayton commented Dec 19, 2016 via email

@matthiasr

matthiasr commented Dec 19, 2016 via email

@smarterclayton
Contributor

smarterclayton commented Dec 20, 2016 via email

@davidopp
Member

#19637 is what Clayton alluded to, but nobody ever did anything about it.

@wojtek-t
Member

Yes - we definitely want to move events to a different kind of storage, but I'm not aware of any real effort towards this goal.

Regarding storing events in a separate etcd - we are doing this, and in fact it significantly helps performance. Though we should reevaluate it after moving to etcd v3 (but I guess there will still be a gain in having it).

> We don't cache watch the events today, so this would be adding a new events watch which is likely to be high bandwidth. It also increases the amount of deletes sent to etcd, although I suspect that this would result in less load overall than on etcd before. So I'm cautiously in favor of this.

Yes - we don't have the cacher enabled for events today. However, to avoid a visible increase of load on the apiserver, I wouldn't enable it even if we decide to go with this proposal. That said, I think the watch for events should be served directly from etcd because:

  • the watcher we are talking about is pretty much interested in all events
  • there is only one watcher

So the gains we get from the cacher (serving multiple watchers from a single data stream and better filtering) don't apply in this case.

That said, I agree it would increase the load on etcd (the number of deletes will be significant). On the other hand, etcd itself should have less work, as there won't be any objects with TTLs in that case.

With this approach, implementing the logic that @matthiasr suggested (i.e. different policies for different kinds of events) would also be relatively straightforward.

One very minor issue is that since we are "merging" similar events, I assume we will be gc-ing them based on last occurrence. As a result, we may end up in a situation where a very old event is still "kind of present" in the system (because it also happened very recently), whereas some events that happened after it have already been removed (because they didn't have recent occurrences). I guess it's not a very big deal, but we need to be very clear about how exactly it works and document it somewhere.

@fgrzadkowski
Contributor Author

From the user perspective I think we should just say that we only care about the "last seen" timestamp. Then it's pretty straightforward which event objects are deleted and which are not. WDYT?

@matthiasr

I think that's fine. The only complication is occurrence counts – they'd keep counting up as long as an event happens frequently enough. But I think that's easy enough to reason about.

@matthiasr

(Also, for such a frequent event, all I would look for in an occurrence count is "a lot" vs. "not a lot", so that'd be fine.)

@timothysc
Member

timothysc commented Jan 5, 2017

So I'm going to ignore all the other comments and "just say no!"

Events were always intended to be ephemeral; if folks want a history service, then create a history service and shunt events out of etcd entirely. At the scales we reach today, and want to reach tomorrow, this type of option is untenable.

@timothysc added the sig/scalability label on Jan 5, 2017
@lavalamp
Member

lavalamp commented Jan 6, 2017 via email

@matthiasr

matthiasr commented Jan 6, 2017 via email

@davidopp
Member

davidopp commented Jan 6, 2017

#36304 is relevant (I think)

@fgrzadkowski
Contributor Author

I don't agree with this reasoning. Events being ephemeral is somewhat orthogonal to how long we try to keep them. Keeping them for 1h makes it impossible to debug problems that occurred during a longer lunch, not to mention during the night.

@timothysc what scale issues are you worried about? What size of cluster? If the number of events is high, we would just delete them more quickly. I imagine this might even be more manageable, as you would be able to say where the threshold is, while with 1h you can't guarantee anything.

@lavalamp I don't see this as a stop-gap solution. I think there are actually advantages such as:

  1. it's more manageable (the admin just defines a threshold on the number of events, which corresponds directly to CPU/memory usage)
  2. it will make it much easier to debug small clusters (as we would keep events for longer)

@timothysc
Member

> Keeping them for 1h makes it impossible to debug problems that occurred during a longer lunch, not to mention during the night.

All the more reason to address this properly with a history service, b/c even if you lengthen the time there will always be the question "how long is long enough vs. what is arbitrary?" If you were to record overnight on a 1k-node / 250k-pod high-churn cluster, that is an obscene amount of data to be holding onto, which can OOM etcd.

@fgrzadkowski
Contributor Author

Of course in large clusters this would not solve the problem, as you pointed out. But it would not be worse - we would only change the method of deletion.

However this could be a major improvement in smaller clusters. Most likely you don't want to run yet another pod/service just for this as it would be a large tax. Making this part of the "core" would simplify things greatly! I think for people with O(10) nodes this might be a huge improvement.

@timothysc
Member

I'm ok with an input knob for adjusting the TTL with default unchanged, but not a fan of changing the defaults.

@smarterclayton
Contributor

smarterclayton commented Jan 10, 2017 via email

@timothysc
Member

Yup.

--event-ttl duration Amount of time to retain events. Default is 1h. (default 1h0m0s)
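
(For reference, that is a kube-apiserver flag, so anyone who wants a longer window today can already pass e.g. `--event-ttl=24h0m0s` on the apiserver command line; the default shown above stays 1h.)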

@fgrzadkowski
Contributor Author

fgrzadkowski commented Jan 10, 2017 via email

@timothysc
Member

In order to check the event count there would either need to be a watch or a periodic re-list of events. Neither of which I'm crazy about... I'd be ok with it if it was an opt-in behavior, and defaulted off for smaller clusters.
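
For context, the lighter of those two options (a periodic re-list) would still look roughly like this sketch; client-go is assumed, and the interval, the counts channel, and the use of a bare List rather than an informer are hypothetical choices made only to show where the extra load comes from:

```go
package eventgc

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countEventsPeriodically re-lists all events on every tick just to learn the
// count; the repeated full-list traffic is the cost being objected to above.
func countEventsPeriodically(ctx context.Context, client kubernetes.Interface, interval time.Duration, counts chan<- int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			if err != nil {
				continue // a real controller would log and back off here
			}
			counts <- len(events.Items)
		}
	}
}
```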

/cc @hongchaodeng

@liggitt
Member

liggitt commented Jan 10, 2017

I'd expect garbage collection or active management to be far more expensive than simple TTL

@lavalamp
Member

I agree w/ @liggitt. Moving away from the TTL would make for a much more complicated and resource-intensive system.

@smarterclayton
Contributor

smarterclayton commented Jan 18, 2017 via email

@liggitt
Member

liggitt commented Jan 18, 2017

> we have the local watch cache

not for events

@smarterclayton
Contributor

smarterclayton commented Jan 18, 2017 via email

@wojtek-t
Member

> Heapster and OpenShift both end up calling List and Watch, so are we gaining anything by not having it?

If there is exactly one watcher of all events, we gain pretty much nothing from serving events from the cache (this starts to be useful if there are multiple watchers and/or we are e.g. serving lists from memory).

> Actually, since we are already watching most things on the cluster, and we have the local watch cache, events might actually be less impactful for the server if we did GC.

I think this may be close to true if we do garbage collection in the apiserver itself, but I'm pretty sure we don't want that. If instead we create another controller (or modify the existing gc-controller), then I'm pretty sure this would be significantly more expensive, as the apiserver would have to process a ton of new "remove event" API requests.

@timothysc
Member

From a feature perspective, I consider everything discussed here to be a near zero-sum gain, or a band-aid type of solution that breaks down under multiple scenarios.

If the overall user story is to provide a history service to glean meaningful insights into the behavior of pods, then I would contend that is a separate service, which is out of core. There is nothing that prevents an operator from using the facilities that exist today to provide or create that service.

/cc @kubernetes/sig-scalability-misc

@matthiasr

I agree; given the complexity, the gains don't seem to be worth it.

@timothysc
Member

Closing this issue; there are a number of overlapping ones.
