Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete documents by filter #3550

Merged
merged 17 commits into from
May 4, 2023
Merged

Delete documents by filter #3550

merged 17 commits into from
May 4, 2023

Conversation

dureuill
Copy link
Contributor

@dureuill dureuill commented Mar 1, 2023

Prototype prototype-delete-by-filter-0

Usage:
A new route is available under POST /indexes/{index_uid}/documents/delete that allows you to delete your documents by filter.
The expected payload looks like that:

{
  "filter": "doggo = bernese",
}

It'll then enqueue a task in your task queue that'll delete all the documents matching this filter once it's processed.
Here is an example of the associated details;

  "details": {
    "deletedDocuments": 53,
    "originalFilter": "\"doggo = bernese\""
  }

Pull Request

Related issue

Related to #3477

What does this PR do?

User standpoint

  • Modifies the /indexes/{:indexUid}/documents/delete-batch route to accept either the existing array of documents ids, or a JSON object with a filter field representing a filter to apply. If that latter variant is used, any document matching the filter will be deleted.

Implementation standpoint

  • (processing time version) Adds a new BatchKind that is not autobatchable and that performs the delete by filter
  • Reuse the documentDeletion task with a new originalFilter detail that replaces the providedIds detail.

Example

Sample request, response and task result

Request:

curl \
  -X POST 'http://localhost:7700/indexes/index-10/documents/delete-batch' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "filter" : "mass = 600"}'

Response:

{
  "taskUid": 3902,
  "indexUid": "index-10",
  "status": "enqueued",
  "type": "documentDeletion",
  "enqueuedAt": "2023-02-28T20:50:31.667502Z"
}

Task log:

    {
      "uid": 3906,
      "indexUid": "index-12",
      "status": "succeeded",
      "type": "documentDeletion",
      "canceledBy": null,
      "details": {
        "deletedDocuments": 3,
        "originalFilter": "\"mass = 600\""
      },
      "error": null,
      "duration": "PT0.001819S",
      "enqueuedAt": "2023-03-07T08:57:20.11387Z",
      "startedAt": "2023-03-07T08:57:20.115895Z",
      "finishedAt": "2023-03-07T08:57:20.117714Z"
    }

Draft status

  • Error handling
  • Analytics
  • Do we want to reuse the delete-batch route in this way, or create a new route instead?
  • Should the filter be applied at request time or when the deletion task is processed?
    • The first commit in this PR applies the filter at request time, meaning that even if a document is modified in a way that no longer matches the filter in a later update, it will be deleted as long as the deletion task is processed after that update.
    • The other commits in this PR apply the filter only when the asynchronous deletion task is processed, meaning that documents that match the filter at processing time are deleted even if they didn't match the filter at request time.
  • If keeping the filter at request time, find a more elegant way to recover the user document ids from the internal document ids. The current way implemented in the first commit of this PR involves getting all the documents matching the filter, looking for the value of their primary key, and turning it into a string by copy-pasting routines found in milli...
  • Security consideration, if any
  • Fix the tests (but waiting until product questions are resolved)
  • Add delete by filter specific tests

@dureuill dureuill marked this pull request as draft March 1, 2023 14:32
@github-actions
Copy link

github-actions bot commented Mar 1, 2023

Uffizzi Ephemeral Environment deployment-17867

☁️ https://app.uffizzi.com/github.com/meilisearch/meilisearch/pull/3550

📄 View Application Logs etc.

The meilisearch preview environment contains a web terminal from where you can run the
meilisearch command. You should be able to access this instance of meilisearch running in
the preview from the link Meilisearch Endpoint link given below.

Web Terminal Endpoint :
Meilisearch Endpoint : /meilisearch

@Kerollmops
Copy link
Member

Should the filter be applied at request time or when the deletion task is processed?

I think that we must follow the behavior of the current update system. The way the deletion work is that it evaluates the ids to delete when it is executed, no error is raised until the update is processed.

Doing that is much easier to work with when using the task queue and one day replicating the queue too.

@dureuill
Copy link
Contributor Author

dureuill commented Mar 2, 2023

thanks for the answer @Kerollmops, but unfortunately I'm don't understand your suggestion.

Are you suggesting that the filter be applied at request time, or only when the deletion task is processed?

If applied at request time, that means that we evaluate which documents match the filter synchronously before answering the HTTP request. It means we have to go back to the user (external) document ids because we have no guarantee that a previous document update will not modify the internal document ids. If applying at request time we can reuse the existing tasks without a change because the filter is converted to the ids of the documents synchronously.

If applied at processing time, that means that we evaluate which documents match the filter asynchronously, after answering the HTTP request. It means we have to somehow pass the filter to the task so the filter evaluation can take place in the index scheduler. It also means we have to handle any filter-related error that can happen in the index scheduler (I don't think we're currently filtering inside of the index scheduler). We can directly use the internal ids here because no other task can change them during the processing of our task.

@Kerollmops
Copy link
Member

Are you suggesting that the filter be applied at request time, or only when the deletion task is processed?

Only when the deletion task is processed. I want to evaluate the request in the queue as we already do for the other way to batch delete. When we delete the documents by using the array of ids, we don't evaluate the ids at request time but rather at processing time.

It means we have to somehow pass the filter to the task so the filter evaluation can take place in the index scheduler.

Indeed, it's unfortunately breaking and for the dumps too!

It also means we have to handle any filter-related error that can happen in the index scheduler (I don't think we're currently filtering inside of the index scheduler).

Indeed, the deletion-by-filter feature will only work if the fields are filterable. However, we could check, at least, the syntax at request time when we receive the filter. We could also accept to delete-by-filter even if some fields are non-filterable, but that's a future improvement.

We can directly use the internal ids here because no other task can change them during the processing of our task.

If you mean that the index-scheduler will execute the filter to fetch the internal document ids and that the scheduler always process one task at a time, then yes, that's what I was thinking about 😃

@dureuill
Copy link
Contributor Author

dureuill commented Mar 2, 2023

Only when the deletion task is processed. I want to evaluate the request in the queue as we already do for the other way to batch delete. When we delete the documents by using the array of ids, we don't evaluate the ids at request time but rather at processing time.

Sorry this is still unclear to me 😓

When we delete the documents by using an array of external document ids, it is true that the external document ids are not evaluated at request time. Only at processing time are they used by the batcher to determine the transform output, and after the transform is applied they are turned to internal ids for the actual deletion.

For the delete by filter, we could do either:

  1. keep the filter, evaluate it to internal id at processing time
  2. evaluate the filter to internal ids at processing time, then go back the corresponding external ids still at processing time, so that we can express the delete by filter like a delete by array, basically.

@dureuill
Copy link
Contributor Author

dureuill commented Mar 6, 2023

So, discussing this further with the team, @irevoire suggested we don't auto-batch delete-by-filter tasks, in which case we would be able to delete the documents directly from their internal ids.

Summarizing the 3 solutions I see at the moment:

  1. Apply filter at query time (this PR as of its first commit).
  • Needs to round trip from internal ids to external ids
  • No need to change the documentDeletion task, we can reuse it directly
  • Benefits from autobatching
  1. Apply filter at processing time, keeping autobatching.
  • Needs a new kind of task, or to modify documentDeletion task.
  • Needs to round trip from internal ids to external ids anyway.
  • Benefits from autobatching
  1. Apply filter at processing time, disabling autobatching for this deleteByFilter tasks (this PR as of its second commit).
  • Needs a new kind of task (or to modify documentDeletion task and have it autobatch conditionally)
  • No round trip from internal ids to external ids.
  • No autobatching

Note that whether we want filters at processing or query time is a product question.

At the moment it is unclear to me what the best way to fetch the external ids from the internal ids might be.

We could:

  1. Load the full document from the internal id, then read the primary key value (this PR in its current form)
  2. Iterate on the external ids -> internal ids FST until we find the correct internal id

@irevoire
Copy link
Member

irevoire commented Mar 6, 2023

It's wrong to apply the filter at query time. The filter should always be used at the last update before deleting the documents, or else we may;

  • Delete a document that the user never wanted to delete in an obscure manner
  • Fail to delete a document that the user clearly specified

Since everything is asynchronous in meilisearch, the fact that the deletion may target a document that has yet to exist is a feature of meilisearch, not a problem.


We also talked about being able to delete non-filterable documents. Has that been canceled/delayed?
That part will also be a little tricky because we'll need to implement a way to process the filter by iterating on all the documents.


And finally:

Security consideration, if any

There is definitely an issue we need to consider with the tenant tokens that can contain a filter!
I think « correcting » the user filter by something like {tenant_token filter} AND ( {user_filter} ) should do it (after ensuring the user filter is effectively well-formed / parsed).


@irevoire suggested we don't auto-batch delete-by-filter tasks

For the context: I said that's not really important because most people who asked about the «auto batch addition&deletion » were synchronizing meilisearch with another database and mimicking the updates. And from what I understand, they get a list of IDs, not a filter. Thus, it's not essential to auto-batch this update for now.

If I'm wrong, it won't be too late to do it, though; it's non-breaking and won't add work to this PR 👍

@dureuill
Copy link
Contributor Author

dureuill commented Mar 6, 2023

Thanks for the input @irevoire !

Delete a document that the user never wanted to delete in an obscure manner
Fail to delete a document that the user clearly specified

I don't see how this could happen if we're resolving to external ids at query time. Can you describe the sequence of events that could lead to this happening?

I agree that it could definitely happen if we resolve the filter to internal ids at query time and then use these outdated internal ids to delete at processing time, but this is not what I'm proposing.

We also talked about being able to delete non-filterable documents. Has that been canceled/delayed?

Cannot find a reference to it, but yes, we agreed to have a first implementation that doesn't do that. Discussion with users struggled to find a use case for this.

There is definitely an issue we need to consider with the tenant tokens that can contain a filter!

I was under the impression that tenant tokens would only apply to the search; is that incorrect? Considering that users can delete documents by their external ids today with just the documents.delete action + index authorization, does it make sense to limit the filter by tenant token? If yes, then you're correct that we need to take it into account.

@dureuill dureuill added this to the v1.2.0 milestone Mar 7, 2023
@dureuill
Copy link
Contributor Author

dureuill commented Mar 7, 2023

Updated with added commits that implement solution (3) from #3550 (comment)

The description of the PR is up to date

@dureuill
Copy link
Contributor Author

dureuill commented Mar 7, 2023

Rebased on main

@irevoire irevoire changed the title Draft implementation of filter support for /delete-batch route Delete documents by filter Apr 27, 2023
@dureuill dureuill force-pushed the delete_documents_by_query branch from 4c45461 to 84e7bd9 Compare May 3, 2023 15:51
@dureuill dureuill marked this pull request as ready for review May 3, 2023 15:53
Copy link
Contributor Author

@dureuill dureuill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking OK, needing some minor changes (typo analytics)

did not check the dumps

meilisearch/src/analytics/segment_analytics.rs Outdated Show resolved Hide resolved
meilisearch/src/routes/indexes/documents.rs Outdated Show resolved Hide resolved
Copy link
Contributor Author

@dureuill dureuill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would approve if it was possible.

Tested exporting and importing a dump, worked OK; did not test importing a dump from a previous version.

Tested deleting documents in the movie dataset by genre. Worked like a charm.

Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love my code

bors merge

meili-bors bot added a commit that referenced this pull request May 4, 2023
3550: Delete documents by filter r=irevoire a=dureuill

# Prototype `prototype-delete-by-filter-0`

Usage:
A new route is available under `POST /indexes/{index_uid}/documents/delete` that allows you to delete your documents by filter.
The expected payload looks like that:
```json
{
  "filter": "doggo = bernese",
}
```

It'll then enqueue a task in your task queue that'll delete all the documents matching this filter once it's processed.
Here is an example of the associated details;
```json
  "details": {
    "deletedDocuments": 53,
    "originalFilter": "\"doggo = bernese\""
  }
```

----------


# Pull Request

## Related issue
Related to #3477

## What does this PR do?

### User standpoint

- Modifies the `/indexes/{:indexUid}/documents/delete-batch` route to accept either the existing array of documents ids, or a JSON object with a `filter` field representing a filter to apply. If that latter variant is used, any document matching the filter will be deleted.

### Implementation standpoint

- (processing time version) Adds a new BatchKind that is not autobatchable and that performs the delete by filter
- Reuse the `documentDeletion` task with a new `originalFilter` detail that replaces the `providedIds` detail.

## Example

<details>
<summary>Sample request, response and task result</summary>

Request:

```
curl \
  -X POST 'http://localhost:7700/indexes/index-10/documents/delete-batch' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "filter" : "mass = 600"}'
```

Response:

```
{
  "taskUid": 3902,
  "indexUid": "index-10",
  "status": "enqueued",
  "type": "documentDeletion",
  "enqueuedAt": "2023-02-28T20:50:31.667502Z"
}
```

Task log:

```json
    {
      "uid": 3906,
      "indexUid": "index-12",
      "status": "succeeded",
      "type": "documentDeletion",
      "canceledBy": null,
      "details": {
        "deletedDocuments": 3,
        "originalFilter": "\"mass = 600\""
      },
      "error": null,
      "duration": "PT0.001819S",
      "enqueuedAt": "2023-03-07T08:57:20.11387Z",
      "startedAt": "2023-03-07T08:57:20.115895Z",
      "finishedAt": "2023-03-07T08:57:20.117714Z"
    }
```

</details>

## Draft status

- [ ] Error handling
- [ ] Analytics
- [ ] Do we want to reuse the `delete-batch` route in this way, or create a new route instead?
- [ ] Should the filter be applied at request time or when the deletion task is processed? 
  - The first commit in this PR applies the filter at request time, meaning that even if a document is modified in a way that no longer matches the filter in a later update, it will be deleted as long as the deletion task is processed after that update. 
  - The other commits in this PR apply the filter only when the asynchronous deletion task is processed, meaning that documents that match the filter at processing time are deleted even if they didn't match the filter at request time.
- [ ] If keeping the filter at request time, find a more elegant way to recover the user document ids from the internal document ids. The current way implemented in the first commit of this PR involves getting all the documents matching the filter, looking for the value of their primary key, and turning it into a string by copy-pasting routines found in milli...
- [ ] Security consideration, if any
- [ ] Fix the tests (but waiting until product questions are resolved)
- [ ] Add delete by filter specific tests



3720: Change links of docs everywhere r=curquiza a=curquiza

Completely fixes #3668 

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: curquiza <clementine@meilisearch.com>
@meili-bors
Copy link
Contributor

meili-bors bot commented May 4, 2023

Build failed (retrying...):

@dureuill
Copy link
Contributor Author

dureuill commented May 4, 2023

trying again after spurious failure

bors merge

@meili-bors
Copy link
Contributor

meili-bors bot commented May 4, 2023

Already running a review

@meili-bors
Copy link
Contributor

meili-bors bot commented May 4, 2023

@meili-bors meili-bors bot merged commit a95128d into main May 4, 2023
7 checks passed
@meili-bors meili-bors bot deleted the delete_documents_by_query branch May 4, 2023 11:19
@meili-bot meili-bot added the v1.2.0 PRs/issues solved in v1.2.0 released on 2023-06-05 label Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.2.0 PRs/issues solved in v1.2.0 released on 2023-06-05
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants