DELETECOLLECTION doesn't always #90743
Comments
|
cc @wojtek-t
|
another possibility is to change deletecollection to page over the list internally and process/delete the items in each page before proceeding; xref #80877 (comment)
|
I agree with Jordan (and thanks for the link - I remembered there was some discussion about it in the past).
|
Sorry - by "I agree with Jordan" I meant: I think the option he mentioned seems best to me.
|
Yeah, that'd be a huge improvement, but it could still take a long time. Maybe we need to also exempt it from the global timeout?
But we can (before every call to etcd) check whether the context is done, and if so return.
Plumbing the request context through to etcd lets us honor the timeout exactly. The issue is not exceeding the request timeout internally; it's that "call deletecollection repeatedly, get timeout errors, and retry until success" is really unpleasant guidance for API clients.
|
No, then the client calls it again and it starts from the beginning, double-deleting objects that have a deletion timestamp but haven't finalized. (as pointed out by @liggitt)
I don't think it's double-deleting, because the next request lists again and the previously deleted objects won't be returned. I think Jordan's argument about "unpleasant guidance" is stronger. I guess we have this guidance for read requests already, but for mutating requests it's not the most intuitive thing.
The objects haven't "really" been deleted because they have a finalizer or a grace period or something.
|
Hmm - so I guess I don't really understand this option then. @liggitt - can you clarify?
|
Another issue is what happens when APF (priority & fairness) is on. Self calls are unlimited, so this could generate a LOT of traffic. Especially if clients re-call while a zombie request is already in flight, or if multiple clients happen to call this at the same time.
|
I'm tempted to say we should just deprecate DELETECOLLECTION and make clients implement it. It is literally faster to use curl and bash to delete everything in a large collection right now. A client can do this super fast. Before pagination, clients could not, because the list smashed the server. So, now that we have pagination, I think it might make sense to move delete collection back into the client library.
|
Maybe best is just to change the client library implementation to do that. It doesn't help non-Go languages, though, unfortunately.
Longer term - maybe. But I don't think we're ready for it.
for now, all requests raised from
currently the client honors
|
xref #91497
|
We have seen a cluster hitting this issue recently. I'm wondering how we can actually fix this issue.
For me, the last option (recommending the use of pagination in DeleteCollection calls) is actually the cleanest solution. It guarantees that we delete each object exactly once and avoids triggering timeout errors. We should also make sure that DeleteCollection properly honors context cancellation, so that if we hit a timeout (e.g. because the caller isn't using pagination) we don't leak any goroutine. This would require making DeleteCollection's pagination a fully supported feature, which I understand is unwanted for some reason. WDYT?
|
I opened #107950 to add support for context cancellation in DeleteCollection (this isn't a full solution, because the List call itself doesn't fully support it, but at least we won't be leaking goroutines that keep deleting objects and collide with the next DeleteCollection calls). Regarding the main problem - I would actually expect to use pagination internally. So the first option. If an object is already scheduled for deletion (but waiting for a finalizer or something like that), we can actually skip it by looking at its DeletionTimestamp, so I don't think this is actually a problem [yes, we're not doing that now, but we should probably add it].
|
Thanks Wojtek! I think we can apply a fix similar to #107950 to the List call, right? Re external vs internal pagination: could you explain why you prefer internal pagination over external?
External pagination has none of the mentioned issues, but requires supporting the pagination feature for Delete. The way I see this: in limited time we can handle a limited number of objects. We know this, and for that reason we introduced pagination, which is the recommended way of fetching huge numbers of objects in Lists (ignoring the watch cache for now). In DeleteCollection we have a pretty similar issue: in limited time we can delete only a limited number of objects. Reusing the solution from List seems to be the easiest way to achieve this effect. For free we also get additional guarantees like predictability: the caller knows which set of objects will be deleted (a consistent view of objects with the RV from the first page).
It's a bit more complex than that.
[In other words - for deletecollection, the third item is negligible, because the response is small, as opposed to list requests.]
Yes, I agree. This is problematic now. But I think it's mostly not the listing part, but rather the fact that (a) we send a lot of data from etcd, and (b) we need to deserialize all of it, which is expensive. If we switched to something like points A-C described in #108003 it shouldn't be that problematic. Anyway - thinking more about it, I think that unifying the API with LIST and adding support for external pagination as you're suggesting can actually be a better approach. We have the problem that pagination still hasn't reached GA, but I would like to push it through by adding support for pagination in the watch cache (namely point D in #108003). And then we would have a clear API. But I would also like to hear from Jordan before we proceed with it.
|
+1 to getting something done for deletecollection in the case where the target collection is very large. Right now, the unchunked list call to etcd puts stress on etcd and the API server. And it's hard to debug (since you have to dig around and find out how delete collection works).
|
Any update on plans and timing for this fix?
|
BTW, another factor in favor of possibly moving this client-side is that the LISTing phase of the work would then be covered by the same list width estimation that API Priority and Fairness uses for normal LIST calls. Right now, deletecollection does an unchunked list call, and the width (size) of that list cannot be taken into account when trying to throttle the deletecollection calls.
That actually can be fixed on the APF side. But it doesn't fully solve the problem.
|
Getting back to it in the context of #115090. With #105606 and #107950 we actually got to the point where DeleteCollection respects context cancellation, and thus finishes soon after the API call times out. So by just introducing pagination into the DeleteCollection implementation (which should be relatively straightforward) we should actually improve its state to the point where:
I'm going to take a look at implementing it in the 1.27 timeframe (unless someone objects in the meantime).
|
/cc
|
I opened #117971 to address my comment above - it still requires adding appropriate tests, but it seems to work fine in that it passes all existing tests. Will try to finish that soon.
This is great. Thanks for all of your hard work on this.
|
@wojtek-t Any update on the status of this, with respect to @jensentanlo's question?
|
It will be available in 1.28 - we can potentially consider backporting it, but backports haven't been opened at this point.
|
@wojtek-t did this make it into 1.28?
|
Yup - the stop-the-bleeding part: #117971. We should think about whether we want to evolve it further in the medium/longer term.
If someone puts a huge number of objects into the API server, and comes to regret doing so, it's handy to be able to give them a way to remove them. Even if the removal is going to take a while, we want something we can document and put into a troubleshooting guide.

I think ideally I'd like the API servers to have special-case handling for this, because although this kind of thing is rare, one day we might see a combination of managed Kubernetes and a third-party tool that breaks lots of clusters all at once. Something that's the equivalent of either:

For the special case where you're authorized to bypass finalizers, my ideal is that we come up with a way to return 202 Accepted to the caller pretty soon, and for the actual removals from etcd to continue at a more leisurely pace. Those background removals could still count against the requester's APF quota.

If the best advice is to do a list on the collection and then iterate, so be it. Even in that case it'd be handy if DELETECOLLECTION could return a 4xx code when there are too many things to remove, and nudge the caller to try something different.
What happened:
Deleting a large collection times out (504) and leaks a goroutine that's busy deleting the collection. Calling it multiple times in a row makes it worse. There's no indication that the apiserver is currently deleting the collection.
What you expected to happen:
a) Change delete collection so that it deletes just the first page of items in the collection, rather than loading the whole collection into memory. Callers must call repeatedly to delete the whole collection.
b) Change delete collection to use a single-flight approach, so that if you call delete collection while one is in progress, you join it rather than starting a new one. (This actually doesn't work, because there are multiple apiservers.)
Environment:
Kubernetes version (kubectl version): 1.14.x (but I don't think it matters)
/sig api-machinery