New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ScaNN support in vectorstore. #8251
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really cool! few small comments (can also take over PR if you don't have bandwidth)
Thanks for the suggestions! |
Let me know if anything else is needed. Thanks! @baskaryan |
Another gentle ping. Thanks! @baskaryan |
""" | ||
scores, indices = self.index.search( | ||
np.array([embedding], dtype=np.float32), | ||
fetch_k if filter is None else fetch_k * 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm are we sure we want to do this behind the scenes? would it be clearer if we just expected users to adjust fetch_k themselves if they're passing in a filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I yanked this logic from faiss.py and thought this is a common practice in the implementation of MMR.
fetch_k if filter is None else fetch_k * 2, |
Upon a closer look, I think there should be a better implementation when built-in filtering becomes available in our package. But this should a future change, and I have pulled out the MMR logic from this file for clarity.
" Each document should have an id." | ||
) | ||
|
||
docstore = InMemoryDocstore(dict(zip(index_to_id.values(), documents))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this check if a doctor has been passed in via kwargs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah docstore*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi - I read from_texts
of other vectorstore implementations and I don't see them checking if a docstore is passed in. I agree that this would be a good feature to have - but maybe we should do it in a separate pull request when we refactor the interface of from_texts
instead of this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think probably overkill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very cool! i like it :)
Description: Add ScaNN vectorstore to langchain.
ScaNN is a Open Source, high performance vector similarity library optimized for AVX2-enabled CPUs.
https://github.com/google-research/google-research/tree/master/scann
Python notebook to illustrate the usage: docs/extras/integrations/vectorstores/scann.ipynb
Integration test: libs/langchain/tests/integration_tests/vectorstores/test_scann.py
@rlancemartin, @eyurtsev for review.
Thanks!