Delete documents by filter #3550
I think we must follow the behavior of the current update system. The way deletion works is that the ids to delete are evaluated when the task is executed; no error is raised until the update is processed. That is much easier to work with when using the task queue, and one day when replicating the queue too.
Thanks for the answer @Kerollmops, but unfortunately I don't understand your suggestion. Are you suggesting that the filter be applied at request time, or only when the deletion task is processed?

If applied at request time, that means we evaluate which documents match the filter synchronously, before answering the HTTP request. It means we have to go back to the user (external) document ids, because we have no guarantee that a previous document update will not modify the internal document ids. In that case we can reuse the existing tasks without a change, because the filter is converted to document ids synchronously.

If applied at processing time, that means we evaluate which documents match the filter asynchronously, after answering the HTTP request. It means we have to somehow pass the filter to the task so the filter evaluation can take place in the index scheduler. It also means we have to handle any filter-related error that can happen in the index scheduler (I don't think we're currently filtering inside of the index scheduler). We can directly use the internal ids here, because no other task can change them during the processing of our task.
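To illustrate the concern about internal ids with a toy model (hypothetical structures, not Meilisearch's actual internals): if we resolve the filter at request time, we must hold on to the stable external ids, because a later update can reassign internal ids before the deletion task runs. A minimal sketch:

```python
# Toy model of an index where updates may reassign internal ids.
# All names here are hypothetical, for illustration only.

class ToyIndex:
    def __init__(self):
        self.docs = {}        # external id -> document
        self.internal = {}    # external id -> internal id
        self.next_internal = 0

    def upsert(self, ext_id, doc):
        # A re-addition assigns a fresh internal id to the same external id.
        self.docs[ext_id] = doc
        self.internal[ext_id] = self.next_internal
        self.next_internal += 1

    def resolve_filter_external(self, pred):
        # Request-time resolution: return stable *external* ids.
        return [e for e, d in self.docs.items() if pred(d)]

index = ToyIndex()
index.upsert("a", {"doggo": "bernese"})
index.upsert("b", {"doggo": "corgi"})

# Request time: resolve the filter to external ids.
to_delete = index.resolve_filter_external(lambda d: d["doggo"] == "bernese")

# A later update re-adds "a", changing its internal id...
index.upsert("a", {"doggo": "bernese"})

# ...but external ids are stable, so the deletion still targets the right doc.
for ext in to_delete:
    index.docs.pop(ext, None)

print(sorted(index.docs))  # ['b']
```

Had we stored the internal id of "a" at request time instead, the intervening upsert would have invalidated it.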
Only when the deletion task is processed. I want to evaluate the request in the queue as we already do for the other way to batch delete. When we delete the documents by using the array of ids, we don't evaluate the ids at request time but rather at processing time.
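The processing-time behavior being described can be sketched as follows (a hypothetical task queue, not the real index scheduler API): the enqueued task stores only the filter, and the matching documents are computed when the task is processed.

```python
from collections import deque

# Hypothetical queue: a delete-by-filter task stores the filter itself;
# matching documents are only computed at processing time.

docs = {"1": {"doggo": "bernese"}, "2": {"doggo": "corgi"}}
queue = deque()

def enqueue_delete_by_filter(filter_pred):
    # Request time: no evaluation, no error; just record the filter.
    queue.append(("documentDeletion", filter_pred))

def process_one():
    kind, pred = queue.popleft()
    # Processing time: evaluate the filter against the *current* documents.
    matching = [doc_id for doc_id, doc in docs.items() if pred(doc)]
    for doc_id in matching:
        del docs[doc_id]
    return {"deletedDocuments": len(matching)}

enqueue_delete_by_filter(lambda d: d["doggo"] == "bernese")
# A document added *after* the request but *before* processing is still matched.
docs["3"] = {"doggo": "bernese"}
details = process_one()
print(details)       # {'deletedDocuments': 2}
print(sorted(docs))  # ['2']
```

Note how the deletion reaches a document that did not exist at request time, which matches the "feature, not a problem" point made later in this thread.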
Indeed, it's unfortunately breaking and for the dumps too!
Indeed, the deletion-by-filter feature will only work if the fields are filterable. However, we could at least check the syntax at request time, when we receive the filter. We could also accept delete-by-filter requests even if some fields are non-filterable, but that's a future improvement.
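A request-time syntax check could look something like this (a deliberately tiny toy grammar for illustration; the real filter syntax is much richer): the route rejects malformed filters immediately, while whether the fields are actually filterable is only known at processing time.

```python
import re

# Toy grammar: `field = value` clauses joined by " AND ". Illustration only;
# Meilisearch's real filter syntax supports many more operators.
CLAUSE = re.compile(r"^\s*\w+\s*=\s*\S+\s*$")

def check_filter_syntax(filter_str):
    """Return None if the syntax looks valid, else an error message.

    Only the *syntax* is checked here; whether the fields are filterable
    can only be verified when the deletion task is processed.
    """
    for clause in filter_str.split(" AND "):
        if not CLAUSE.match(clause):
            return f"invalid filter clause: {clause!r}"
    return None

print(check_filter_syntax("doggo = bernese"))             # None (accepted)
print(check_filter_syntax("doggo = bernese AND = oops"))  # error message
```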
If you mean that the index-scheduler will execute the filter to fetch the internal document ids, and that the scheduler always processes one task at a time, then yes, that's what I was thinking about 😃
Sorry, this is still unclear to me 😓 When we delete documents by using an array of external document ids, it is true that the external document ids are not evaluated at request time. Only at processing time are they used by the batcher to determine the transform output, and after the transform is applied they are turned into internal ids for the actual deletion. For the delete by filter, we could do either:
So, discussing this further with the team, @irevoire suggested we don't auto-batch delete-by-filter tasks, in which case we would be able to delete the documents directly from their internal ids. Summarizing the 3 solutions I see at the moment:
Note that whether we want filters at processing or query time is a product question. At the moment it is unclear to me what the best way to fetch the external ids from the internal ids might be. We could:
It's wrong to apply the filter at query time. The filter should always be used at the last update before deleting the documents, or else we may:
Since everything is asynchronous in meilisearch, the fact that the deletion may target a document that has yet to exist is a feature of meilisearch, not a problem. We also talked about being able to delete non-filterable documents. Has that been canceled/delayed? And finally:
There is definitely an issue we need to consider with the tenant tokens that can contain a filter!
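One way the tenant-token concern could be addressed (a hedged sketch only; how Meilisearch actually resolves this was still an open question in this thread) is to intersect the token's embedded filter with the user-supplied deletion filter, so the deletion can never reach documents outside the token's visibility. A hypothetical helper:

```python
def combine_filters(token_filter, request_filter):
    """AND a tenant token's embedded filter with a deletion filter.

    Hypothetical helper, not a Meilisearch API: the parenthesized AND
    ensures the deletion is scoped to the documents the token may see.
    """
    if token_filter is None:
        return request_filter
    return f"({token_filter}) AND ({request_filter})"

print(combine_filters("tenant = acme", "doggo = bernese"))
# (tenant = acme) AND (doggo = bernese)
print(combine_filters(None, "doggo = bernese"))
# doggo = bernese
```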
For the context: I said that's not really important because most people who asked about the "auto-batch addition & deletion" were synchronizing Meilisearch with another database and mimicking the updates. And from what I understand, they get a list of IDs, not a filter. Thus, it's not essential to auto-batch this update for now. If I'm wrong, it won't be too late to do it, though; it's non-breaking and won't add work to this PR 👍
Thanks for the input @irevoire !
I don't see how this could happen if we're resolving to external ids at query time. Can you describe the sequence of events that could lead to this happening? I agree that it could definitely happen if we resolve the filter to internal ids at query time and then use these outdated internal ids to delete at processing time, but this is not what I'm proposing.
Cannot find a reference to it, but yes, we agreed to have a first implementation that doesn't do that. Discussion with users struggled to find a use case for this.
I was under the impression that tenant tokens would only apply to the search; is that incorrect? Considering that users can delete documents by their external ids today with just the […]
Updated with added commits that implement solution (3) from #3550 (comment). The description of the PR is up to date.
Looks OK; needs some minor changes (typo, analytics).
did not check the dumps
Would approve if it were possible.
Tested exporting and importing a dump, worked OK; did not test importing a dump from a previous version.
Tested deleting documents in the movie dataset by genre. Worked like a charm.
I love my code
bors merge
3550: Delete documents by filter r=irevoire a=dureuill

# Prototype `prototype-delete-by-filter-0`

Usage:

A new route is available under `POST /indexes/{index_uid}/documents/delete` that allows you to delete your documents by filter. The expected payload looks like this:

```json
{
  "filter": "doggo = bernese"
}
```

It'll then enqueue a task in your task queue that'll delete all the documents matching this filter once it's processed. Here is an example of the associated details:

```json
"details": {
  "deletedDocuments": 53,
  "originalFilter": "\"doggo = bernese\""
}
```

----------

# Pull Request

## Related issue

Related to #3477

## What does this PR do?

### User standpoint

- Modifies the `/indexes/{:indexUid}/documents/delete-batch` route to accept either the existing array of document ids, or a JSON object with a `filter` field representing a filter to apply. If the latter variant is used, any document matching the filter will be deleted.

### Implementation standpoint

- (processing-time version) Adds a new BatchKind that is not autobatchable and that performs the delete by filter
- Reuses the `documentDeletion` task with a new `originalFilter` detail that replaces the `providedIds` detail.

## Example

<details>
<summary>Sample request, response and task result</summary>

Request:

```
curl \
  -X POST 'http://localhost:7700/indexes/index-10/documents/delete-batch' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "filter" : "mass = 600"}'
```

Response:

```
{
  "taskUid": 3902,
  "indexUid": "index-10",
  "status": "enqueued",
  "type": "documentDeletion",
  "enqueuedAt": "2023-02-28T20:50:31.667502Z"
}
```

Task log:

```json
{
  "uid": 3906,
  "indexUid": "index-12",
  "status": "succeeded",
  "type": "documentDeletion",
  "canceledBy": null,
  "details": {
    "deletedDocuments": 3,
    "originalFilter": "\"mass = 600\""
  },
  "error": null,
  "duration": "PT0.001819S",
  "enqueuedAt": "2023-03-07T08:57:20.11387Z",
  "startedAt": "2023-03-07T08:57:20.115895Z",
  "finishedAt": "2023-03-07T08:57:20.117714Z"
}
```

</details>

## Draft status

- [ ] Error handling
- [ ] Analytics
- [ ] Do we want to reuse the `delete-batch` route in this way, or create a new route instead?
- [ ] Should the filter be applied at request time or when the deletion task is processed?
  - The first commit in this PR applies the filter at request time, meaning that even if a document is modified in a way that no longer matches the filter in a later update, it will be deleted as long as the deletion task is processed after that update.
  - The other commits in this PR apply the filter only when the asynchronous deletion task is processed, meaning that documents that match the filter at processing time are deleted even if they didn't match the filter at request time.
- [ ] If keeping the filter at request time, find a more elegant way to recover the user document ids from the internal document ids. The current way implemented in the first commit of this PR involves getting all the documents matching the filter, looking for the value of their primary key, and turning it into a string by copy-pasting routines found in milli...
- [ ] Security consideration, if any
- [ ] Fix the tests (but waiting until product questions are resolved)
- [ ] Add delete-by-filter-specific tests

3720: Change links of docs everywhere r=curquiza a=curquiza

Completely fixes #3668

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: curquiza <clementine@meilisearch.com>
Build failed (retrying...)
Trying again after spurious failure. bors merge
Already running a review