Qdrant refactor #20
Notes on ONNX performance

It looks like ONNX does utilize all available CPU cores when processing the text and generating the embeddings (the image below was generated from an AWS EC2 T2 Ubuntu instance with a single 4-core CPU). On average, the entire wine reviews dataset of 129,971 reviews is vectorized and ingested into Qdrant in 34 minutes via the quantized ONNX model, as opposed to more than 1 hour for the regular model.
This amounts to a roughly 1.8x reduction in indexing time, with a ~26% smaller (quantized) model that loads and processes results faster. To verify that the embeddings from the quantized model are of similar quality, some example cosine similarities are shown below.

Example results

The following results are for the vanilla model:
Quantized ONNX model
As can be seen, the similarity scores are very close to those of the vanilla model, but the model is ~26% smaller and we are able to process the sentences much faster on the same CPU.
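Comparisons like the ones above can be reproduced with a short cosine-similarity helper. The following is a minimal sketch in pure Python; the vectors shown are made-up placeholders, not actual model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for vanilla vs. quantized embeddings
vanilla = [0.12, 0.48, -0.33, 0.91]
quantized = [0.11, 0.47, -0.35, 0.90]
print(round(cosine_similarity(vanilla, quantized), 4))
```

A score close to 1.0 indicates that the quantized model's embedding points in nearly the same direction as the vanilla one, which is what the examples above demonstrate.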
* Using a transformers pipeline with mean pooling prior to optimization allows us to generate similar quality embeddings as the original
* The model is still the same size, but the similarities it predicts are now much more similar to the un-optimized model
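The mean pooling step mentioned above is simple to express. As a sketch in pure Python for clarity (real pipelines would do this with tensors), it averages the token embeddings while skipping padding positions indicated by the attention mask:

```python
def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token embeddings, counting only non-padding tokens."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask:  # mask == 0 marks a padding token
            count += 1
            for i, v in enumerate(emb):
                summed[i] += v
    return [s / count for s in summed]

# Two real tokens plus one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # → [2.0, 3.0]
```

Note that the padding token's values are excluded entirely, so the pooled sentence embedding depends only on real tokens.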
Purpose of this PR
This PR refactors the Qdrant codebase to offer better performance and a structure that lets the user decide how to run the pipeline, depending on the available hardware and Python version.
* `sbert` model (without any optimizations) -- this is the slower option on CPU
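The user-facing choice between backends might look something like the following sketch; the function and backend names here are hypothetical illustrations, not the PR's actual API:

```python
def choose_backend(use_onnx: bool, has_gpu: bool = False) -> str:
    """Pick an embedding backend (hypothetical helper, not the PR's actual API).

    - on a GPU, the plain sbert model is a natural choice
    - on CPU, the quantized ONNX model is the faster option
    - plain sbert on CPU is the slower fallback
    """
    if has_gpu:
        return "sbert"
    return "onnx-quantized" if use_onnx else "sbert"

print(choose_backend(use_onnx=True))
```

The point of the refactor, as described above, is that this decision is left to the user rather than hard-coded.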