This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Better threshold #607

Merged
merged 2 commits into from
Aug 17, 2022

Conversation

irevoire
Member

Pull Request

What does this PR do?

Fixes #570

This PR tries to improve the threshold used to trigger the real deletion of documents.
The deletion is now triggered in two cases:

  • 10% of the total available space is used by soft-deleted documents.
  • 90% of the total available space is used.

In this context, "total available space" means the map_size of LMDB.
The size used by the soft-deleted documents is only an estimate. We can't determine precisely the size used by a single document, so we take the total space used and divide it by the number of documents plus soft-deleted documents to estimate the size of an average document. We then multiply that average size by the number of soft-deleted documents.
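
The estimation and the two triggers described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SpaceStats`, its fields, and both function names are hypothetical and are not taken from milli's code, and the use of integer arithmetic and of `>=` for the comparisons is a guess rather than something the PR description specifies.

```rust
/// Hypothetical snapshot of the numbers the estimation needs.
/// None of these names come from milli; they are for illustration only.
#[derive(Debug, Clone, Copy)]
struct SpaceStats {
    /// Total available space, i.e. the LMDB `map_size`, in bytes.
    map_size: u64,
    /// Space currently used, in bytes.
    used_size: u64,
    /// Number of live documents.
    documents: u64,
    /// Number of soft-deleted documents.
    soft_deleted: u64,
}

/// Estimate the space held by soft-deleted documents:
/// average document size multiplied by the number of soft-deleted documents.
fn estimated_soft_deleted_size(stats: SpaceStats) -> u64 {
    let total = stats.documents.saturating_add(stats.soft_deleted);
    if total == 0 {
        // No documents at all: nothing can be attributed to soft deletion.
        return 0;
    }
    // Integer division, so the average is rounded down (assumption).
    let average_document_size = stats.used_size / total;
    average_document_size.saturating_mul(stats.soft_deleted)
}

/// Return true when either of the two conditions from the description holds.
fn should_delete_for_real(stats: SpaceStats) -> bool {
    // 10% of the total available space.
    let soft_deleted_threshold = stats.map_size / 10;
    // 90% of the total available space.
    let used_threshold = stats.map_size / 10 * 9;

    estimated_soft_deleted_size(stats) >= soft_deleted_threshold
        || stats.used_size >= used_threshold
}
```

For instance, with a `map_size` of 100_000 bytes, 50_000 bytes used, 80 live documents and 20 soft-deleted ones, the average document is estimated at 500 bytes, the soft-deleted documents at 10_000 bytes, and the 10% condition is met.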


image

Here we can see that we end up with a ~10GB drift between the estimated space used by the soft-deleted documents and the real space used by the documents.
Personally, I don't think that's a big issue because once the red line reaches 90GB everything will be freed, but now you know.

If you have an idea on how to improve this estimation, I would love to hear it.
It looks like the difference is linear, so maybe we could simply multiply the current estimation by two?

Kerollmops
Kerollmops previously approved these changes Aug 17, 2022
@irevoire irevoire added no breaking The related changes are not breaking (DB nor API) API breaking The related changes break the milli API and removed no breaking The related changes are not breaking (DB nor API) labels Aug 17, 2022
@irevoire irevoire marked this pull request as ready for review August 17, 2022 15:44
Member

@Kerollmops Kerollmops left a comment


Thank you very much! Looks good and well commented!
bors merge

bors bot added a commit that referenced this pull request Aug 17, 2022
607: Better threshold r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Fixes #570 

This PR tries to improve the threshold used to trigger the real deletion of documents.
The deletion is now triggered in two cases:
- 10% of the total available space is used by soft-deleted documents.
- 90% of the total available space is used.

In this context, "total available space" means the `map_size` of LMDB.
The size used by the soft-deleted documents is only an estimate. We can't determine precisely the size used by a single document, so we take the total space used and divide it by the number of documents plus soft-deleted documents to estimate the size of an average document. We then multiply that average size by the number of soft-deleted documents.

--------

<img width="808" alt="image" src="https://user-images.githubusercontent.com/7032172/185083075-92cf379e-8ae1-4bfc-9ca6-93b54e6ab4e9.png">

Here we can see that we end up with a ~10GB drift between the estimated space used by the soft-deleted documents and the real space used by the documents.
Personally, I don't think that's a big issue because once the red line reaches 90GB everything will be freed, but now you know.

If you have an idea on how to improve this estimation, I would love to hear it.
It looks like the difference is linear, so maybe we could simply multiply the current estimation by two?

Co-authored-by: Irevoire <tamo@meilisearch.com>
@bors
Contributor

bors bot commented Aug 17, 2022

Build failed:

@Kerollmops
Member

bors merge

bors bot added a commit that referenced this pull request Aug 17, 2022
607: Better threshold r=Kerollmops a=irevoire

Co-authored-by: Irevoire <tamo@meilisearch.com>
@bors
Contributor

bors bot commented Aug 17, 2022

Build failed:

@Kerollmops
Member

bors merge

@bors
Contributor

bors bot commented Aug 17, 2022

Build succeeded:

@bors bors bot merged commit 79094bc into main Aug 17, 2022
@bors bors bot deleted the better-threshold branch August 17, 2022 16:48
bors bot added a commit that referenced this pull request Aug 18, 2022
609: Retry downloading the benchmarks datasets r=Kerollmops a=irevoire

Downloading the benchmarks datasets is failing [more and more](#607 (review)) often; thus, instead of fixing the issue, I thought we could retry multiple times.


Co-authored-by: Irevoire <tamo@meilisearch.com>
Labels
API breaking The related changes break the milli API
Development

Successfully merging this pull request may close these issues.

Choose a better threshold for the soft deletion
2 participants