Reevaluate events deletion policy #38949

Closed
fgrzadkowski opened this issue Dec 19, 2016 · 32 comments
Labels
area/apiserver area/controller-manager sig/instrumentation sig/scalability

Comments

@fgrzadkowski
Contributor

Currently we delete all events after 1h. This was introduced as a way to keep the load on the apiserver small. However, it also has a serious downside for users: they can't debug what has happened unless they persist events somewhere.

I'd like to propose a slightly changed approach: instead of deleting all events after 1h, we use the same policy as for terminated pods and delete them only once we have more than X events, where X depends on the cluster size. I imagine we would do it by reusing the podgc controller logic and by no longer setting a TTL in etcd.
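
A minimal sketch of what such a loop could look like, assuming a client-go clientset; the function name, the flat count threshold, and the bare List call are hypothetical illustration, not the actual podgc logic:

```go
package eventgc

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gcEventsByCount deletes the oldest events (by last occurrence) once the
// total count exceeds maxEvents, instead of relying on an etcd TTL.
// maxEvents would be derived from cluster size, like the podgc threshold.
func gcEventsByCount(ctx context.Context, client kubernetes.Interface, maxEvents int) error {
	events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	excess := len(events.Items) - maxEvents
	if excess <= 0 {
		return nil
	}
	// Oldest "last seen" timestamps go first, so the most recently active
	// events survive the trim.
	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].LastTimestamp.Before(&events.Items[j].LastTimestamp)
	})
	for _, ev := range events.Items[:excess] {
		if err := client.CoreV1().Events(ev.Namespace).Delete(ctx, ev.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```

A real controller would use informers rather than a raw List; this only shows the deletion policy itself.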

@lavalamp @wojtek-t @smarterclayton @piosz

@kubernetes/sig-instrumentation-misc (events are kind of instrumentation of the system)
@kubernetes/sig-api-machinery-misc (it's related to apiserver & co)
@kubernetes/sig-scalability-misc (proposed change might have scalability implications)

@fgrzadkowski added the area/apiserver, area/controller-manager, and sig/instrumentation labels on Dec 19, 2016
@matthiasr

Ideally, this would also mean a separate X for different kinds of events – for example, "container failed liveness probe and was restarted" is not really relevant after 24h, but "this node stopped reporting" would be to me. The latter are also much rarer.
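
Purely as an illustrative sketch (the reason strings and durations here are hypothetical, not an agreed-on list), that could be expressed as a per-reason retention table consulted by whatever does the deletion:

```go
package eventgc

import "time"

// Hypothetical per-reason retention; events whose reason is not listed
// fall back to defaultEventRetention.
var eventRetention = map[string]time.Duration{
	"Unhealthy":    24 * time.Hour,     // e.g. failed liveness probe: stale quickly
	"NodeNotReady": 7 * 24 * time.Hour, // rare, and worth keeping much longer
}

const defaultEventRetention = 24 * time.Hour
```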

@smarterclayton
Contributor

smarterclayton commented Dec 19, 2016 via email

@matthiasr

matthiasr commented Dec 19, 2016 via email

@smarterclayton
Contributor

smarterclayton commented Dec 19, 2016 via email

@matthiasr

matthiasr commented Dec 19, 2016 via email

@smarterclayton
Contributor

smarterclayton commented Dec 20, 2016 via email

@davidopp
Member

#19637 is what Clayton alluded to, but nobody ever did anything about it.

@wojtek-t
Member

Yes - we definitely want to move events to a different kind of storage, but I'm not aware of any real effort towards this goal.

Regarding storing events in a separate etcd - we are doing this, and in fact it significantly helps performance. Though we should reevaluate it after moving to etcd v3 (but I guess there will still be a gain in having it).

> We don't cache watch the events today, so this would be adding a new events watch which is likely to be high bandwidth. It also increases the amount of deletes sent to etcd, although I suspect that this would result in less load overall than on etcd before. So I'm cautiously in favor of this.

Yes - we don't have the cacher enabled for events today. However, to avoid a visible increase of load on the apiserver, I wouldn't enable it even if we decide to go with this proposal. That said, I think the watch for events should be served directly from etcd because:

  • the watcher we are talking about is pretty much interested in all events
  • there is only one watcher

So the gains we get from the cacher (serving multiple watchers from a single data stream and better filtering) don't apply in this case.

That said, I agree it would increase the load on etcd (the number of deletes will be significant). On the other hand, etcd itself should have less work, as there won't be any objects with TTLs in that case.

With this approach, implementing the logic that @matthiasr suggested (i.e. different policies for different kinds of events) would also be relatively straightforward.

One very minor issue is that since we are "merging" similar events, I assume we will be gc-ing them based on last occurrence. As a result, we may end up in a situation where a very old event is still "kind of present" in the system (because it also happened very recently), whereas some events that happened after it have already been removed (because they didn't have recent occurrences). I guess it's not a very big deal, but we need to be very clear about how exactly it works and document it somewhere.

@fgrzadkowski
Contributor Author

From the user perspective I think we should just say that we only care about the "last seen" timestamp. Then it's pretty straightforward which event objects are deleted and which are not. WDYT?

@matthiasr

I think that's fine. The only complication is occurrence counts – they'd keep counting up as long as an event happens frequently enough. But I think that's easy enough to reason about.

@matthiasr

(Also, for such a frequent event, all I would look for in an occurrence count is "a lot" vs. "not a lot", so that'd be fine.)

@timothysc
Member

timothysc commented Jan 5, 2017

So I'm going to ignore all the other comments and "just say no!"

Events were always intended to be ephemeral; if folks want a history service, then create a history service and shunt events out of etcd entirely. At the scales we reach today, and want to reach tomorrow, this type of option is untenable.

@timothysc added the sig/scalability label on Jan 5, 2017
@lavalamp
Member

lavalamp commented Jan 6, 2017 via email

@matthiasr

matthiasr commented Jan 6, 2017 via email

@davidopp
Member

davidopp commented Jan 6, 2017

#36304 is relevant (I think)

@fgrzadkowski
Contributor Author

I don't agree with this reasoning. Events being ephemeral is somewhat orthogonal to how long we try to keep them. Keeping them for 1h makes it impossible to debug problems that occurred during a longer lunch, not to mention during the night.

@timothysc what scale issues are you worried about? What size of cluster? If the number of events is high, we would just delete them more quickly. I imagine this might even be more manageable, as you would be able to say where the threshold is, while with 1h you can't guarantee anything.

@lavalamp I don't see this as a stop-gap solution. I think there are actually advantages such as:

  1. it's more manageable (the admin just defines a threshold on the number of events, which corresponds directly to CPU/memory usage)
  2. it will make it much easier to debug small clusters (as we would keep events for longer)

@timothysc
Member

> Keeping them for 1h makes it impossible to debug problems that occurred during a longer lunch, not to mention during the night.

All the more reason to address this properly with a history service, b/c even if you lengthen the time there will always be the question "how long is long enough vs. what is arbitrary?" If you were to record overnight on a 1k-node / 250k-pod high-churn cluster, that is an obscene amount of data to be holding onto, which can OOM etcd.

@fgrzadkowski
Contributor Author

Of course in large clusters this would not solve the problem, as you pointed out. But it would not be worse - we would only change the method of deletion.

However this could be a major improvement in smaller clusters. Most likely you don't want to run yet another pod/service just for this as it would be a large tax. Making this part of the "core" would simplify things greatly! I think for people with O(10) nodes this might be a huge improvement.

@timothysc
Member

I'm ok with an input knob for adjusting the TTL with default unchanged, but not a fan of changing the defaults.

@smarterclayton
Contributor

smarterclayton commented Jan 10, 2017 via email

@timothysc
Member

Yup.

--event-ttl duration Amount of time to retain events. Default is 1h. (default 1h0m0s)
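
(For reference, that is a kube-apiserver flag, so anyone who wants a longer window today can already pass e.g. `--event-ttl=24h0m0s` on the apiserver command line; the default shown above stays 1h.)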

@fgrzadkowski
Contributor Author

fgrzadkowski commented Jan 10, 2017 via email

@timothysc
Member

In order to check the event count there would either need to be a watch or a periodic re-list of events. Neither of which I'm crazy about... I'd be ok with it if it was an opt-in behavior, and defaulted off for smaller clusters.
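
For context, the lighter of those two options (a periodic re-list) would still look roughly like this sketch; client-go is assumed, and the interval, the counts channel, and the use of a bare List rather than an informer are hypothetical choices made only to show where the extra load comes from:

```go
package eventgc

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countEventsPeriodically re-lists all events on every tick just to learn the
// count; the repeated full-list traffic is the cost being objected to above.
func countEventsPeriodically(ctx context.Context, client kubernetes.Interface, interval time.Duration, counts chan<- int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			if err != nil {
				continue // a real controller would log and back off here
			}
			counts <- len(events.Items)
		}
	}
}
```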

/cc @hongchaodeng

@liggitt
Member

liggitt commented Jan 10, 2017

I'd expect garbage collection or active management to be far more expensive than simple TTL

@lavalamp
Member

I agree w/ @liggitt. Moving away from the TTL would make for a much more complicated and resource-intensive system.

@smarterclayton
Contributor

smarterclayton commented Jan 18, 2017 via email

@liggitt
Member

liggitt commented Jan 18, 2017

> we have the local watch cache

not for events

@smarterclayton
Contributor

smarterclayton commented Jan 18, 2017 via email

@wojtek-t
Member

> Heapster and OpenShift both end up calling List and Watch, so are we gaining anything by not having it?

If there is exactly one watcher of all events, we gain pretty much nothing from serving events from the cache (this starts to be useful if there are multiple watchers and/or we are e.g. serving lists from memory).

> Actually, since we are already watching most things on the cluster, and we have the local watch cache, events might actually be less impactful for the server if we did GC.

I think this may be close to true if we do garbage collection in the apiserver itself, but I'm pretty sure we don't want that. If instead we create another controller (or modify the existing gc-controller), then I'm pretty sure this would be significantly more expensive, as the apiserver would have to process a ton of new "remove event" API requests.

@timothysc
Member

From a feature perspective, I consider everything discussed here to be a near zero-sum gain, or a band-aid type of solution that breaks down under multiple scenarios.

If the overall user story is to provide a history service to glean meaningful insights into the behavior of pods, then I would contend that is a separate service, which is out of core. There is nothing that prevents an operator from using the facilities that exist today to provide or create that service.

/cc @kubernetes/sig-scalability-misc

@matthiasr

I agree; given the complexity, the gains don't seem to be worth it.

@timothysc
Member

Closing this issue; there are a number of overlapping ones.
