Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ScaNN support in vectorstore. #8251

Merged
merged 5 commits into from Aug 4, 2023

Conversation

arron2003
Copy link
Contributor

Description: Add ScaNN vectorstore to langchain.
ScaNN is a Open Source, high performance vector similarity library optimized for AVX2-enabled CPUs.
https://github.com/google-research/google-research/tree/master/scann

  • Dependencies: scann

Python notebook to illustrate the usage: docs/extras/integrations/vectorstores/scann.ipynb
Integration test: libs/langchain/tests/integration_tests/vectorstores/test_scann.py

@rlancemartin, @eyurtsev for review.

Thanks!

@vercel
Copy link

vercel bot commented Jul 25, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Aug 1, 2023 1:49am

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels Jul 25, 2023
Copy link
Collaborator

@baskaryan baskaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool! few small comments (can also take over PR if you don't have bandwidth)

libs/langchain/langchain/vectorstores/scann.py Outdated Show resolved Hide resolved
libs/langchain/langchain/vectorstores/scann.py Outdated Show resolved Hide resolved
libs/langchain/langchain/vectorstores/scann.py Outdated Show resolved Hide resolved
libs/langchain/langchain/vectorstores/scann.py Outdated Show resolved Hide resolved
@arron2003
Copy link
Contributor Author

Thanks for the suggestions!
I have address the comments and ran lint / format.
Also feel free to take over this PR if needed.

@arron2003
Copy link
Contributor Author

Let me know if anything else is needed. Thanks! @baskaryan

@arron2003
Copy link
Contributor Author

Another gentle ping. Thanks! @baskaryan

"""
scores, indices = self.index.search(
np.array([embedding], dtype=np.float32),
fetch_k if filter is None else fetch_k * 2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm are we sure we want to do this behind the scenes? would it be clearer if we just expected users to adjust fetch_k themselves if they're passing in a filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I yanked this logic from faiss.py and thought this is a common practice in the implementation of MMR.

fetch_k if filter is None else fetch_k * 2,

Upon a closer look, I think there should be a better implementation when built-in filtering becomes available in our package. But this should a future change, and I have pulled out the MMR logic from this file for clarity.

" Each document should have an id."
)

docstore = InMemoryDocstore(dict(zip(index_to_id.values(), documents)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this check if a doctor has been passed in via kwargs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah docstore*

Copy link
Contributor Author

@arron2003 arron2003 Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi - I read from_texts of other vectorstore implementations and I don't see them checking if a docstore is passed in. I agree that this would be a good feature to have - but maybe we should do it in a separate pull request when we refactor the interface of from_texts instead of this one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think probably overkill

Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool! i like it :)

@hwchase17 hwchase17 merged commit 6aee589 into langchain-ai:master Aug 4, 2023
23 checks passed
@arron2003 arron2003 deleted the langchain_scann branch August 4, 2023 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features Ɑ: vector store Related to vector store module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants