Don't store the vectors in the documents database #4649

Merged
merged 44 commits into from
Jun 17, 2024

irevoire
Member

@irevoire irevoire commented May 21, 2024

Pull Request

Related issue

Fixes #4607

What does this PR do?

  • Ensure that anything falling under _vectors is NOT searchable, filterable or sortable
  • Per embedder, add a roaring bitmap of the documents that provide "userProvided" embeddings
  • During indexing, in extract_vector_points, set the bit corresponding to the document depending on the "userProvided" subfield of its _vectors field
  • In the documents DB (typed chunks), when writing the _vectors field, remove all keys that correspond to an embedder
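
The last three points can be sketched as follows. This is a minimal illustration, not the PR's code: `VectorEntry`, `EmbedderState`, and `index_vectors` are hypothetical names, and a `BTreeSet<u32>` stands in for the roaring bitmap so that the sketch has no dependencies. It also assumes that a document which does not mention an embedder gets its bit cleared.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Internal document id (hypothetical alias for this sketch).
type DocId = u32;

/// One entry of a document's `_vectors` object, reduced to what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct VectorEntry {
    embedding: Vec<f32>,
    user_provided: bool,
}

/// Per-embedder state. The PR uses a roaring bitmap for `user_provided`;
/// a BTreeSet<u32> is a dependency-free stand-in here.
#[derive(Debug, Default)]
struct EmbedderState {
    user_provided: BTreeSet<DocId>,
    vectors: BTreeMap<DocId, Vec<f32>>,
}

/// Processes the `_vectors` object of one document.
/// Returns what is left to write to the documents DB: every key that
/// corresponds to a configured embedder has been removed.
fn index_vectors(
    doc_id: DocId,
    mut vectors: BTreeMap<String, VectorEntry>,
    embedders: &mut BTreeMap<String, EmbedderState>,
) -> BTreeMap<String, VectorEntry> {
    for (name, state) in embedders.iter_mut() {
        match vectors.remove(name) {
            Some(entry) => {
                // Set or clear the bit depending on the userProvided subfield.
                if entry.user_provided {
                    state.user_provided.insert(doc_id);
                } else {
                    state.user_provided.remove(&doc_id);
                }
                // The embedding goes to the vector store, not the document.
                state.vectors.insert(doc_id, entry.embedding);
            }
            // Assumption: no entry for this embedder means no user-provided vector.
            None => {
                state.user_provided.remove(&doc_id);
            }
        }
    }
    vectors
}
```

Calling `index_vectors` on a document whose `_vectors` holds both a configured embedder and an unrelated key returns only the unrelated key, which is what would remain in the stored document.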

@irevoire irevoire self-assigned this May 21, 2024
@irevoire irevoire changed the base branch from main to dont-regenerate-vecs-in-dump May 21, 2024 15:20
@irevoire irevoire force-pushed the dont-store-vectors-in-documents branch 3 times, most recently from 4caa914 to 0370a09 Compare May 22, 2024 13:29
Base automatically changed from dont-regenerate-vecs-in-dump to main May 23, 2024 08:57
@Kerollmops
Member

/bench *embeddings.json

@Kerollmops
Member

/bench workloads/*embeddings.json

@irevoire irevoire force-pushed the dont-store-vectors-in-documents branch 5 times, most recently from 5941391 to c03609d Compare May 28, 2024 14:53
@irevoire irevoire force-pushed the dont-store-vectors-in-documents branch from 04683ef to d9cc108 Compare June 4, 2024 15:52
@irevoire irevoire added the performance, breaking change, CPU/RAM usage, experimental feature, and disk space usage labels Jun 4, 2024
@irevoire irevoire requested a review from dureuill June 4, 2024 16:24
@irevoire irevoire added this to the v1.9.0 milestone Jun 4, 2024
@irevoire irevoire changed the base branch from main to release-v1.9.0 June 4, 2024 16:25
@meilisearch meilisearch deleted a comment from meili-bot Jun 4, 2024
@irevoire irevoire force-pushed the dont-store-vectors-in-documents branch from 0fb0fc9 to 96d288d Compare June 5, 2024 09:05
@dureuill
Contributor

dureuill commented Jun 5, 2024

/bench workloads/*embeddings.json

shouldn't change drastically

@meili-bot
Contributor

Review thread on meilisearch/src/search.rs (outdated, resolved)
@dureuill dureuill force-pushed the dont-store-vectors-in-documents branch from 9bc901f to e35ef31 Compare June 13, 2024 12:21
- when the feature is disabled, documents are never modified
- when the feature is enabled and `retrieveVectors` is disabled, `_vectors` is removed from documents
- when the feature is enabled and `retrieveVectors` is enabled, vectors from the vectors DB are merged with `_vectors` in documents

Additionally `_vectors` is never displayed when the `displayedAttributes` list does not contain either `*` or `_vectors`

- fixed an issue where `_vectors` was not injected when all vectors in the dataset were always generated
- update tests following changes in behavior from previous commit
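
The display rules listed in the commit message above can be read as a small decision function. The sketch below is only an illustration under stated assumptions: `displayed_vectors` and its parameters are hypothetical names, the `displayedAttributes` check is assumed to apply before the other rules, and on a key collision the value from the vectors DB is assumed to win.

```rust
use std::collections::BTreeMap;

/// Embedder name -> embedding (simplified shape for this sketch).
type Vectors = BTreeMap<String, Vec<f32>>;

/// Returns the `_vectors` value to show for one document,
/// or `None` when the field must be omitted.
fn displayed_vectors(
    feature_enabled: bool,
    retrieve_vectors: bool,
    displayed_attributes: &[&str],
    stored_in_document: Option<Vectors>,
    from_vectors_db: &Vectors,
) -> Option<Vectors> {
    // `_vectors` is never displayed unless `*` or `_vectors` is listed.
    let displayable = displayed_attributes
        .iter()
        .any(|attr| *attr == "*" || *attr == "_vectors");
    if !displayable {
        return None;
    }
    // Feature disabled: the document is left untouched.
    if !feature_enabled {
        return stored_in_document;
    }
    // Feature enabled, retrieveVectors disabled: `_vectors` is removed.
    if !retrieve_vectors {
        return None;
    }
    // Feature enabled, retrieveVectors enabled: merge the vectors DB
    // into whatever the document already carries.
    let mut merged = stored_in_document.unwrap_or_default();
    for (name, embedding) in from_vectors_db {
        merged.insert(name.clone(), embedding.clone());
    }
    Some(merged)
}
```

With the feature and `retrieveVectors` both enabled, a document that stores a `noise` entry and has a `default` embedding in the vectors DB would come back with both keys under `_vectors`.
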
@dureuill
Contributor

Update:

Discussed with @irevoire

  • A _vectors.noise field that is not linked to an embedder configuration should always be displayed, regardless of retrieveVectors
    • KO: retrieveVectors overrides _vectors.noise
      -> ✅ fixed by @dureuill
  • Clearing all documents should reset the userProvided roaring bitmap
    • KO: after clearing all documents, it is possible to get a document with 0 userProvided vectors without specifying the _vectors field
      -> ✅ fixed by @dureuill
  • Changing the settings of an embedder without removing its configuration should retain the userProvided vectors
    • KO: updating an embedder loses the userProvided vectors
      -> ✅ fixed by @dureuill
  • Deleting a document with userProvided vectors should not affect documents created afterwards (it should reset the user_provided bit for that document)
    • KO: after deleting a document, a document added without a vector is handled as if it had a vector
      -> ✅ fixed by @dureuill
  • retrieveVectors=true returns an error if the vectorStore feature is not enabled
    -> ✅ fixed by @irevoire
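
The two bitmap-related cases in this list (clearing all documents, and deleting a single document) amount to keeping the user-provided set in sync with the document lifecycle. Here is a minimal sketch of that invariant, assuming a hypothetical `UserProvided` type with a `BTreeSet<u32>` standing in for the roaring bitmap. None of these names come from the PR.

```rust
use std::collections::BTreeSet;

/// Tracks which internal document ids carry user-provided vectors
/// for one embedder (stand-in for the roaring bitmap).
#[derive(Debug, Default)]
struct UserProvided {
    docids: BTreeSet<u32>,
}

impl UserProvided {
    /// Adding or replacing a document always rewrites its bit, so a
    /// reused id never inherits the state of a previous document.
    fn add_document(&mut self, id: u32, has_user_vector: bool) {
        if has_user_vector {
            self.docids.insert(id);
        } else {
            self.docids.remove(&id);
        }
    }

    /// Deleting a document resets its bit.
    fn delete_document(&mut self, id: u32) {
        self.docids.remove(&id);
    }

    /// Clearing all documents resets the whole set.
    fn clear_all_documents(&mut self) {
        self.docids.clear();
    }

    fn is_user_provided(&self, id: u32) -> bool {
        self.docids.contains(&id)
    }
}
```

For example, after `add_document(0, true)`, `delete_document(0)`, and `add_document(0, false)`, the call `is_user_provided(0)` returns `false`, which is the behavior the fourth bullet asks for.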

@dureuill also changed the behavior and name of the userProvided parameter: it is now a regenerate parameter that is true when an update to the document should regenerate vectors

Member Author

@irevoire irevoire left a comment

I approve my PR @dureuill

Review thread on meilisearch/src/search.rs (resolved)
@irevoire irevoire requested a review from dureuill June 17, 2024 12:29
Contributor

@dureuill dureuill left a comment

Thanks for the work @irevoire

@dureuill
Contributor

bors merge

Contributor

meili-bors bot commented Jun 17, 2024

@meili-bors meili-bors bot merged commit e9bf4c4 into release-v1.9.0 Jun 17, 2024
10 checks passed
@meili-bors meili-bors bot deleted the dont-store-vectors-in-documents branch June 17, 2024 13:13
@meili-bot meili-bot added the v1.9.0 label (PRs/issues solved in v1.9.0 released on 2024-07-01) Jul 4, 2024
Labels
breaking change: The related changes are breaking for the users
CPU/RAM usage
disk space usage
experimental feature: Related to an experimental feature
performance: Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
v1.9.0: PRs/issues solved in v1.9.0 released on 2024-07-01
Development

Successfully merging this pull request may close these issues.

Only ever store vectors in the vector store + hide embeddings
4 participants