Reduce Transform's disk usage when changing the settings #4485

Closed
3 tasks
ManyTheFish opened this issue Mar 13, 2024 · 0 comments · Fixed by #4646
Labels
  • performance: Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
  • settings diff-indexing: Issues related to settings diff-indexing

ManyTheFish (Member) commented Mar 13, 2024

Related product team resources: PRD (internal only)

⚠️ This issue depends on #4480, which must be implemented first.

Summary

This issue is a subset of the work implementing the settings diff-indexing enhancement.

The method prepare_documents_for_reindexing exports every document in the database to disk in two different formats, using Grenad sorters:

The original OBKV format

It is used to write documents to the database and to recompute the semantic search vectors.

Rewriting the documents to the database is unnecessary when only the settings change.
The original OBKV format should therefore be computed only if a setting related to the vector pipeline changes;
otherwise, the Grenad sorter shouldn't be created or sent to indexing.
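
The rule above can be sketched as a guard placed in front of the export. This is only an illustration under stated assumptions: `SettingsChange`, `OriginalSorter`, and `export_original` are hypothetical names, the sorter is a plain `Vec` standing in for a Grenad sorter, and none of this is milli's actual code.

```rust
/// Hypothetical summary of what changed in the settings update.
/// The field names are illustrative, not milli's actual types.
#[derive(Debug, Default, Clone, Copy)]
pub struct SettingsChange {
    /// A setting related to the vector pipeline changed.
    pub vector_settings_changed: bool,
    /// The searchable attributes changed.
    pub searchable_changed: bool,
}

impl SettingsChange {
    /// The original OBKV export is only useful for recomputing vectors.
    pub fn needs_original_obkv(&self) -> bool {
        self.vector_settings_changed
    }
}

/// Stand-in for a Grenad sorter: (document id, raw OBKV bytes) pairs.
#[derive(Debug, Default)]
pub struct OriginalSorter {
    pub entries: Vec<(u32, Vec<u8>)>,
}

/// Returns `None` without copying anything when the original format is not
/// needed, so no sorter is created and nothing is written for it.
pub fn export_original(
    change: &SettingsChange,
    docs: &[(u32, Vec<u8>)],
) -> Option<OriginalSorter> {
    if !change.needs_original_obkv() {
        return None;
    }
    let mut sorter = OriginalSorter::default();
    for (id, obkv) in docs {
        sorter.entries.push((*id, obkv.clone()));
    }
    Some(sorter)
}
```

Returning an `Option` makes the "no sorter" case explicit for the caller, which can then skip handing anything to the vector pipeline.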

The flattened OBKV format

It is used to compute the searchable pipeline and the facet pipeline.

The flattened OBKV format should only be created if the settings change impacts the searchable pipeline, the facet pipeline, or the word-pair-proximity database.
Moreover, the fields written for each document could be filtered depending on which setting changed.
For instance:

  • If only the searchableAttributes have been changed, keep only the searchable fields and the primary key in the documents.
  • If only a faceted field has been added, keep only this field and the primary key in the documents.

Related Benchmarks:

  • settings-add-embeddings.json
  • settings-add-remove-filters.json
  • settings-proximity-precision.json
  • settings-remove-add-swap-searchable.json
  • settings-typo.json

TODO

  • Create the original OBKV format only when it's needed
  • Create the flattened OBKV format only when it's needed
  • Filter out the unnecessary fields from the flattened OBKV format
@ManyTheFish ManyTheFish added performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption settings diff-indexing Issues related to settings diff-indexing labels Mar 13, 2024
@Kerollmops Kerollmops self-assigned this May 16, 2024
@Kerollmops Kerollmops linked a pull request May 21, 2024 that will close this issue
meili-bors bot added a commit that referenced this issue May 23, 2024
4646: Reduce `Transform`'s disk usage r=ManyTheFish a=Kerollmops

This PR implements what is described in #4485. It reduces the number of disk writes and disk usage.

Co-authored-by: Clément Renault <clement@meilisearch.com>
meili-bors bot added a commit that referenced this issue May 23, 2024
4646: Reduce `Transform`'s disk usage r=Kerollmops a=Kerollmops

This PR implements what is described in #4485. It reduces the number of disk writes and disk usage.

Co-authored-by: Clément Renault <clement@meilisearch.com>