
DELETECOLLECTION doesn't always #90743

Open
lavalamp opened this issue May 4, 2020 · 34 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@lavalamp
Member

lavalamp commented May 4, 2020

What happened:

Deleting a large collection times out (504) and leaks a goroutine that's busy deleting the collection. Calling multiple times in a row makes it worse. There's no indication that apiserver is currently deleting the collection.

What you expected to happen:

a) Change delete collection so that it deletes just the first page of items in the collection, rather than loading the whole collection into memory. Callers must call repeatedly to delete the whole collection.
b) Change delete collection to use a single-flight approach, so that if you call delete collection while one is in progress, you join it rather than start a new one. (This actually doesn't work because there are multiple apiservers.)
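Option (b)'s single-flight idea could be sketched roughly as below. `inflight` and its method are hypothetical illustrations (hand-rolled rather than pulling in golang.org/x/sync/singleflight, to stay self-contained), not apiserver code, and as the issue notes this only dedupes within a single apiserver process:

```go
package main

import (
	"fmt"
	"sync"
)

// inflight dedupes concurrent DeleteCollection calls for the same resource
// key: a call that arrives while a deletion is already running joins the
// in-progress one instead of starting another. Hypothetical sketch only.
type inflight struct {
	mu   sync.Mutex
	runs map[string]chan struct{} // resource key -> done channel
}

// do runs fn for key unless a flight is already in progress, in which case
// it blocks until that flight finishes. It reports whether this call was
// the one that actually ran fn.
func (f *inflight) do(key string, fn func()) (started bool) {
	f.mu.Lock()
	if done, ok := f.runs[key]; ok {
		f.mu.Unlock()
		<-done // join the deletion already in flight
		return false
	}
	done := make(chan struct{})
	f.runs[key] = done
	f.mu.Unlock()

	fn() // we hold the single flight: run the actual deletion

	f.mu.Lock()
	delete(f.runs, key)
	f.mu.Unlock()
	close(done)
	return true
}

func main() {
	f := &inflight{runs: map[string]chan struct{}{}}
	var wg sync.WaitGroup
	ran := 0
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.do("pods/default", func() { ran++ })
		}()
	}
	wg.Wait()
	// Only flight holders increment ran, one at a time per key.
	fmt.Println(ran >= 1 && ran <= 5)
}
```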

Environment:

  • Kubernetes version (use kubectl version): 1.14.x (but I don't think it matters)

/sig api-machinery

@lavalamp lavalamp added the kind/bug Categorizes issue or PR as related to a bug. label May 4, 2020
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 4, 2020
@lavalamp
Member Author

lavalamp commented May 4, 2020

cc @wojtek-t

@liggitt
Member

liggitt commented May 5, 2020

another possibility is to change deletecollection to page over the list internally and process/delete the items in each page before proceeding

xref #80877 (comment)
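liggitt's internal-pagination suggestion could be sketched as below. All names are hypothetical; an in-memory slice stands in for etcd, and an index stands in for a list continue token:

```go
package main

import "fmt"

// deleteCollectionPaged sketches the suggestion above: instead of loading
// the whole collection into memory, list one page of at most `limit` items
// at a time and delete that page's items before fetching the next page.
func deleteCollectionPaged(collection []string, limit int, deleteItem func(string)) int {
	deleted := 0
	cont := 0 // stand-in for a list continue token
	for cont < len(collection) {
		end := cont + limit
		if end > len(collection) {
			end = len(collection)
		}
		page := collection[cont:end] // one bounded LIST page
		for _, name := range page {
			deleteItem(name) // process this page before listing the next
			deleted++
		}
		cont = end
	}
	return deleted
}

func main() {
	var log []string
	n := deleteCollectionPaged(
		[]string{"pod-a", "pod-b", "pod-c", "pod-d", "pod-e"}, 2,
		func(name string) { log = append(log, name) })
	fmt.Println(n, len(log)) // 5 5
}
```

The point of the sketch is the memory bound: at most one page of objects is resident at a time, so even a huge collection makes steady progress instead of timing out on one giant list.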

@wojtek-t
Member

wojtek-t commented May 5, 2020

I agree with Jordan (and thanks for the link - I remembered there was some discussion about it in the past).

@wojtek-t
Member

wojtek-t commented May 5, 2020

Sorry - by "I agree with Jordan" I meant - I think the option he mentioned seems the best for me.

@lavalamp
Member Author

lavalamp commented May 5, 2020

Yeah that'd be a huge improvement but it could still take a long time. Maybe we need to also exempt it from the global timeout?

@wojtek-t
Member

wojtek-t commented May 5, 2020

Yeah that'd be a huge improvement but it could still take a long time.

But we can (before every call to etcd) check if context is done and if so return.
So that shouldn't lead to exceeding timeout by any significant factor...

@liggitt
Member

liggitt commented May 5, 2020

But we can (before every call to etcd) check if context is done and if so return.
So that shouldn't lead to exceeding timeout by any significant factor...

Plumbing the request context through to etcd lets us honor the timeout exactly. The issue is not exceeding the request timeout internally, it's that "call deletecollection repeatedly, get timeout errors, and retry until success" is really unpleasant guidance for API clients.

@lavalamp
Member Author

lavalamp commented May 5, 2020

No, then the client calls it again and it starts from the beginning, double-deleting objects that have a deletion timestamp but haven't finalized. (as pointed out by @liggitt)

@wojtek-t
Member

wojtek-t commented May 5, 2020

No, then the client calls it again and it starts from the beginning, double-deleting objects that have a deletion timestamp but haven't finalized. (as pointed out by @liggitt)

I don't think it's double deleting, because the next request lists again and the previously deleted objects won't be returned.

I think Jordan's argument about "unpleasant guidance" is stronger. I guess we have this guidance for read requests already, but for mutating requests it's not the most intuitive thing.

@lavalamp
Member Author

lavalamp commented May 5, 2020

next request lists again and the previously deleted objects won't be returned.

The objects haven't "really" been deleted because they have a finalizer or a grace period or something.

@wojtek-t
Member

wojtek-t commented May 5, 2020

Hmm - so I guess I don't really understand this option then. @liggitt - can you clarify?
[So I agree that we should add pagination, but I can't see how that addresses Daniel's concerns from the issue - namely the second one.]

@lavalamp
Member Author

lavalamp commented May 5, 2020

Another issue is what happens when APF (priority & fairness) is on. Self calls are unlimited, so this could generate a LOT of traffic. Especially if clients re-call while a zombie one is already in flight, or multiple clients happen to call this at the same time.

@lavalamp
Member Author

lavalamp commented May 5, 2020

I'm tempted to say we should just deprecate DELETECOLLECTION and make clients implement it. It is literally faster to use curl and bash to delete everything on a large collection right now.

A client can do this super fast:

for collectionNonEmpty() {
  for p := nextListPage(); p != nil; p = nextListPage() {
    for _, item := range p.Items {
      // skip objects already scheduled for deletion
      if item.Metadata.DeletionTimestamp == nil {
        go deleteItem(item.Metadata.Namespace, item.Metadata.Name)
      }
    }
  }
}

Before pagination existed, clients could not do this super fast, because the full list smashed the server.

So, now that we have pagination, I think it might make sense to move delete collection back into the client library.

@lavalamp
Member Author

lavalamp commented May 5, 2020

Maybe best is just to change the client library implementation to do that. It doesn't help non-Go languages, though, unfortunately.

@wojtek-t
Member

wojtek-t commented May 5, 2020

Maybe best is just to change the client library implementation to do that. It doesn't help non-Go languages, though, unfortunately.

Longer term - maybe. But I don't think we're ready for it.
In particular, as long as we have client QPS limits, deleting a large namespace can take tens of minutes (e.g. deleting the events there) [where with DeleteCollection it's orders of magnitude faster]. Once we have P&F, that would make a lot of sense to me. But we're not there yet.

@yue9944882
Member

yue9944882 commented May 6, 2020

Another issue is what happens when APF (priority & fairness) is on. Self calls are unlimited, so this could generate a LOT of traffic.

For now, all requests raised from the system:masters group are unlimited. There's a chance that a cluster admin with that role calls DELETECOLLECTION, which leads to worse situations. But one thing we can do is exclude DELETECOLLECTION from the rule's matching-verbs list, so that we force these requests to be policed by the P&F system.

The issue is not exceeding the request timeout internally, it's that "call deletecollection repeatedly, get timeout errors, and retry until success" is really unpleasant guidance for API clients.

Currently the client honors the Retry-After header, but it's always 1s. I presume another approach is to apply a minute-ish (or greater) Retry-After penalty if it times out.

@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 6, 2020
@wojtek-t
Member

xref #91497

@mborsz
Member

mborsz commented Feb 4, 2022

We have seen a cluster hitting this issue recently.

I'm wondering how we can actually fix this issue.
IIUC we discussed the following options so far:

  • Implementing pagination inside of the DeleteCollection call, so that even if it times out we will have made some progress
    • This approach has the issue that if we time out, the next call will try to delete the same set of items if they use a grace period/finalizer/something
    • Even in the simple case, when items are simply deleted, the recommendation to use DELETECOLLECTION, expect a timeout, and retry seems far from an ideal solution
  • Deprecating this API call in favor of calling DELETE multiple times
    • As mentioned by Wojtek, I think this would lead to order-of-magnitude worse performance
  • Recommending that clients use pagination for DeleteCollection API calls (which is implemented, but per #80877 (comment) ("Generated DeleteCollection methods make it impossible to use pagination correctly") it seems to be a bug)

For me, the last option (recommending pagination in DeleteCollection calls) is actually the cleanest solution. It guarantees that we delete each object exactly once and avoids triggering timeout errors.

We should also make sure that DeleteCollection properly honors context cancellation, so that if we hit the timeout (e.g. the caller not using pagination) we will not leak any goroutines.

This would require making DeleteCollection's pagination a fully supported feature, which I understand is unwanted for some reason.

WDYT?

@wojtek-t
Member

wojtek-t commented Feb 4, 2022

I opened #107950 to add support for context cancellation in DeleteCollection (this isn't a full solution, because the List call itself doesn't fully support it, but at least we won't be leaking goroutines that delete objects and collide with subsequent DeleteCollection calls).

Regarding the main problem - I would actually expect us to use pagination internally. So the first option. If an object is already scheduled for deletion (but waiting for a finalizer or something like that), we can actually skip it by looking at its DeletionTimestamp, so I don't think this is actually a problem [yes, we're not doing that now, but we should probably add it].

@mborsz
Member

mborsz commented Feb 4, 2022

Thanks Wojtek!

the List call itself doesn't fully support it

I think we can apply a fix similar to #107950 to the List call, right?
In the case I have seen, the etcd-level list was really fast (etcd logged a latency of ~500ms), but the overall call took ~45s (spent mostly in the deserialization loop, IIUC). We can add a similar check in the deserialization loop.

Re external vs internal pagination:

Could you say why you prefer internal pagination over external?
Internal pagination has at least two problems, I think:

  • If the bottleneck is actually the List calls taking too long (in the cluster I debugged, that was the case), then even skipping already-deleted objects will take nontrivial time: say we already deleted N pages in previous calls; then the current call must deserialize at least N+1 pages to make any progress, and for large enough N this will hit the timeout.
  • This request will still fail if we try to delete too many objects, and the caller is not able to do anything about that (other than setting a limit, which isn't recommended in this solution). I don't think we should design APIs in a way that assumes that in some cases callers should expect a "timeout" error.
  • If we still want to list the deleted objects (as promised by DeleteCollection), is it possible to do so with a timeout error returned? I'm asking about both the theoretical aspect (whether the API allows returning such an error and a List object at the same time) and the practical one (how we would know when to stop deleting objects so that we still have time to serialize the List and pass it to the API caller).

The external pagination has none of the mentioned issues, but requires supporting the pagination feature for DeleteCollection.
Could you clarify why this isn't the desired solution?
It seems to be a much cleaner solution, as we attempt to delete a single object only once, without teaching DeleteCollection what deletionTimestamp and finalizers are.

The way I see this: in limited time we can handle a limited number of objects. We know this, and for that reason we introduced pagination, which is the recommended way of fetching huge numbers of objects in List calls (ignoring the watch cache for now).

In DeleteCollection we have a pretty similar issue: in limited time we can delete only a limited number of objects. Reusing the solution from List seems to be the easiest way to achieve this effect. For free we also get additional guarantees like predictability: the caller knows which set of objects will be deleted (a consistent view of the objects at the RV from the first page).

@wojtek-t
Member

wojtek-t commented Mar 7, 2022

I think we can apply a fix similar to #107950 to the List call, right?
In the case I have seen, the etcd-level list was really fast (etcd logged a latency of ~500ms), but the overall call took ~45s (spent mostly in the deserialization loop, IIUC). We can add a similar check in the deserialization loop.

It's a bit more complex than that.
For List calls there are a few things:

  • listing from etcd [I agree with you that we can probably ignore that in the first attempt]
  • the deserializing/filtering logic [this should be relatively simple to do]
  • writing the result [this one we know can take tens of seconds, and fixing it would require much bigger changes]

[In other words - for deletecollection, the third item is negligible, because the response is small, as opposed to list requests.]

If the bottleneck is actually the List calls taking too long (in the cluster I debugged, that was the case), then even skipping already-deleted objects will take nontrivial time (say we already deleted N pages in previous calls; then the current call must deserialize at least N+1 pages to make any progress, and for large enough N this will hit the timeout).

Yes, I agree. This is problematic now. But I think it's mostly not the listing part, but rather the fact that (a) we send a lot of data from etcd, and (b) we need to deserialize all of it, which is expensive. If we switched to something like points A-C described in #108003, it shouldn't be that problematic.

Anyway - thinking more about it, I think that unifying the API with LIST and adding support for external pagination as you're suggesting may actually be the better approach.
Given that the biggest user of DeleteCollection is effectively the namespace controller, adjusting it to use pagination should be very simple, and it would make the API much more intuitive.

There is the problem that pagination still hasn't reached GA, but I would like to push it through by adding pagination support to the watchcache (namely point D in #108003). Then we would have a clear API.

But I would also like to hear from Jordan before we start proceeding with it.

@JohnRusk

JohnRusk commented Sep 1, 2022

+1 to getting something done for deletecollection in the case where the target collection is very large. Right now, the unchunked list call to etcd puts stress on both etcd and the API server. And it's hard to debug (since you have to dig around to find out how delete collection works).

@JohnRusk

Any update on plans and timing for this fix?

@JohnRusk

BTW, another factor in favor of possibly moving this client-side is that the LISTing phase of the work would then be covered by the same list width estimation that API Priority and Fairness uses for normal LIST calls. Right now, DELETECOLLECTION does an unchunked list call, and the width (size) of that list cannot be taken into account when trying to throttle DELETECOLLECTION calls.

@wojtek-t
Member

Right now, DELETECOLLECTION does an unchunked list call, and the width (size) of that list cannot be taken into account when trying to throttle DELETECOLLECTION calls.

That actually can be fixed on APF side. But it doesn't fully solve the problem.

@wojtek-t
Member

Getting back to it in the context of #115090

With #105606 and #107950 we actually got to the point where DeleteCollection respects context cancellation, and thus finishes soon after the API call times out.

So by just introducing pagination into the DeleteCollection implementation (which should be relatively straightforward) we should actually improve its state to the point where:

  • it can make some progress even if the collection of objects is huge (and currently listing all those objects times out)
  • it will not hit the problem of deleting the same objects multiple times when we retry a previously failed DeleteCollection request

I'm going to take a look at implementing it in 1.27 timeframe (unless someone objects in the meantime).

@tkashem
Contributor

tkashem commented Jan 26, 2023

/cc

@wojtek-t
Member

I opened #117971 to address my comment above - it still requires adding appropriate tests, but it passes all existing tests. Will try to finish that soon.

@jensentanlo

I opened #117971 to address my comment above - it still requires adding appropriate tests, but it passes all existing tests. Will try to finish that soon.

This is great. Thanks for all of your hard work on this.
I'm not familiar with the release cycles of bug fixes on the client side; will I need to wait for 1.28, or will it be available as soon as it's merged into the main branch?

@JohnRusk

@wojtek-t Any update on the status of this, with respect to @jensentanlo's question?

@wojtek-t
Member

It will be available in 1.28 - we can potentially consider backporting it, but backports haven't been opened at this point.

@JohnRusk

@wojtek-t did this make it into 1.28?

@wojtek-t
Member

Yup - the stop-the-bleeding part: #117971

We should think about whether we want to evolve that further in the medium/longer term.

@sftim
Contributor

sftim commented Jun 2, 2024

I'm tempted to say we should just deprecate DELETECOLLECTION and make clients implement it. It is literally faster to use curl and bash to delete everything on a large collection right now.

If someone puts a huge number of objects into the API server, and comes to regret doing so, it's handy to be able to give them a way to remove them. Even if the removal is going to take a while, we want something we can document and put into a troubleshooting guide.

I think ideally I'd like the API servers to have special-case handling for this, because although this kind of thing is rare, one day we might see a combination of managed Kubernetes and a third-party tool that breaks lots of clusters all at once.

Something that's the equivalent of either: TRUNCATE TABLE nodes; or DELETE FROM nodes; if our API looked like SQL.
In that world, people might turn off triggers or otherwise do something to make the removal effective; if you have a million things each with 3 finalizers, that's a lot of cleaning up.

For the special case where you're authorized to bypass finalizers, my ideal is that we come up with a way to return 202 Accepted to the caller pretty soon, and for the actual removals from etcd to continue at a more leisurely pace. Those background removals could still count against the requester's APF quota.

If the best advice is to do a list on the collection and then iterate, so be it. Even in that case it'd be handy if DELETECOLLECTION could return a 4xx code when there are too many things to remove, and nudge the caller to try something different.
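The 202 Accepted idea above could be sketched roughly like this; `acceptDeleteCollection` and its shape are hypothetical illustrations of the pattern, not proposed apiserver code:

```go
package main

import (
	"fmt"
	"net/http"
)

// acceptDeleteCollection kicks off the removals in the background and
// returns 202 Accepted right away, instead of holding the request open
// until every object is gone. The done channel is only for observing
// completion in this toy example.
func acceptDeleteCollection(items []string, deleteItem func(string)) (status int, done <-chan struct{}) {
	ch := make(chan struct{})
	go func() { // removals continue at their own pace after we respond
		for _, name := range items {
			deleteItem(name)
		}
		close(ch)
	}()
	return http.StatusAccepted, ch
}

func main() {
	deleted := 0
	status, done := acceptDeleteCollection([]string{"a", "b", "c"}, func(string) { deleted++ })
	fmt.Println(status) // 202, returned before the deletions finish
	<-done
	fmt.Println(deleted) // 3
}
```

As the comment notes, those background removals would still need to count against the requester's APF quota; the sketch only shows the early-response half of the idea.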
