Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: DuckDB VS - expose similarity, improve performance of from_texts #20971

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

jaceksan
Copy link
Contributor

@jaceksan jaceksan commented Apr 27, 2024

3 fixes of DuckDB vector store:

Dependencies: added Pandas to speed up from_documents.
I was thinking about CSV and JSON options, but I expect trouble loading JSON values this way and also CSV and JSON options require storing data to disk.
Anyway, the poetry file for langchain-community already contains a dependency on Pandas.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 27, 2024
Copy link

vercel bot commented Apr 27, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 22, 2024 8:57am

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 27, 2024
@jaceksan jaceksan force-pushed the duckdb branch 2 times, most recently from afc56bb to b171b58 Compare April 27, 2024 19:00
@jaceksan
Copy link
Contributor Author

jaceksan commented May 2, 2024

@baskaryan @eyurtsev should I try to fix the failing test?
To be honest, I don't know why it started to fail.
The import in the notebook is identical to what I use in my code base...well, expect I import from langchain_community and the notebook imports from langchain. I tried to change it to langchain_community but it did not help.

Any suggestion would be appreciated.

@efriis efriis added the partner label May 3, 2024
@efriis efriis self-assigned this May 3, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 3, 2024
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels May 3, 2024
@jaceksan
Copy link
Contributor Author

@hwchase17 can we merge it now?

Row-by-row INSERTs are not recommended by the official DOC.
They are very slow and utilize heavily the storage.

I tested it with 100+ documents, duration went down from 27s to 7s and local SSD is far less utilized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:improvement Medium size change to existing code to handle new use-cases partner size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DuckDB: distance/similarity property not reported to documents returned by similarity_search
6 participants