community: DuckDB VS - expose similarity, improve performance of from_texts #20971

jaceksan · 2024-04-27T18:26:15Z

Three fixes for the DuckDB vector store:

Unify the defaults in the constructor and from_texts (users no longer have to specify vector_key).
Include the search similarity in the output metadata (fixes #20969, "DuckDB: distance/similarity property not reported to documents returned by similarity_search").
Significantly improve the performance of from_documents.
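To illustrate the second fix, here is a minimal plain-Python sketch of returning the similarity alongside each document by writing it into the document's metadata. It is not the code from this PR: the `Doc` class, the `similarity_search` helper, the `score_key` name, and the use of cosine similarity are all assumptions made for illustration, and the real vector store computes the score inside DuckDB.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document returned by a vector store."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query_vec, rows, k=4, score_key="similarity_score"):
    """Return the k most similar rows, with the score copied into metadata.

    `rows` is an iterable of (text, embedding, metadata) tuples; `score_key`
    is an illustrative name, not necessarily the key used by the library.
    """
    docs = [
        Doc(text, {**meta, score_key: cosine_similarity(query_vec, vec)})
        for text, vec, meta in rows
    ]
    docs.sort(key=lambda d: d.metadata[score_key], reverse=True)
    return docs[:k]


rows = [
    ("cat", [1.0, 0.0], {"source": "a"}),
    ("dog", [0.0, 1.0], {"source": "b"}),
    ("kitten", [0.9, 0.1], {"source": "c"}),
]
top = similarity_search([1.0, 0.0], rows, k=2)
for doc in top:
    print(doc.page_content, doc.metadata)
```

With the score in the metadata, callers can filter or rank results without a separate scoring call.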

Dependencies: added Pandas to speed up from_documents.
I considered the CSV and JSON loading options, but I expect trouble loading JSON values that way, and both options require writing the data to disk first.
In any case, the poetry file for langchain-community already lists Pandas as a dependency.

jaceksan · 2024-05-02T08:19:54Z

@baskaryan @eyurtsev should I try to fix the failing test?
To be honest, I don't know why it started to fail.
The import in the notebook is identical to what I use in my code base, except that I import from langchain_community and the notebook imports from langchain. I tried changing it to langchain_community, but that did not help.

Any suggestion would be appreciated.

jaceksan · 2024-05-16T11:25:39Z

@hwchase17 can we merge it now?

fixes langchain-ai#20969

The official documentation advises against row-by-row INSERTs: they are very slow and put a heavy load on storage. In a test with 100+ documents, the duration went down from 27s to 7s and the local SSD was far less utilized.

dosubot bot added the size:M (this PR changes 30-99 lines, ignoring generated files) label Apr 27, 2024

dosubot bot added the Ɑ: vector store (related to vector store module) and 🤖:improvement (medium size change to existing code to handle new use-cases) labels Apr 27, 2024

hkad98 reviewed Apr 27, 2024

libs/community/langchain_community/vectorstores/duckdb.py (outdated, resolved)

hkad98 reviewed Apr 27, 2024

libs/community/langchain_community/vectorstores/duckdb.py (outdated, resolved)

jaceksan force-pushed the duckdb branch 2 times, most recently from afc56bb to b171b58, April 27, 2024 19:00

leo-gan reviewed Apr 28, 2024

libs/community/langchain_community/vectorstores/duckdb.py (outdated, resolved)

jaceksan force-pushed the duckdb branch from b171b58 to 862e130, April 29, 2024 09:18

baskaryan requested a review from eyurtsev April 29, 2024 15:35

jaceksan force-pushed the duckdb branch from de7d06e to e9c1fa5, April 30, 2024 13:46

baskaryan reviewed May 2, 2024

libs/community/langchain_community/vectorstores/duckdb.py (outdated, resolved)

vercel bot deployed to Preview May 2, 2024 17:28

efriis added the partner label May 3, 2024

efriis self-assigned this May 3, 2024

dosubot bot added the size:XXL (this PR changes 1000+ lines, ignoring generated files) label and removed the size:M (this PR changes 30-99 lines, ignoring generated files) label May 3, 2024

jaceksan force-pushed the duckdb branch from 0573ce3 to 7d220af, May 3, 2024 08:40

dosubot bot added the size:M (this PR changes 30-99 lines, ignoring generated files) label and removed the size:XXL (this PR changes 1000+ lines, ignoring generated files) label May 3, 2024

vercel bot deployed to Preview May 3, 2024 08:47

hwchase17 reviewed May 9, 2024

libs/community/langchain_community/vectorstores/duckdb.py (resolved)

vercel bot deployed to Preview May 16, 2024 11:33

jaceksan added 4 commits May 22, 2024 10:06

fix: DuckDB - unify defaults in constructor and from_texts

c4093a4

fix: DuckDB - include search similarity into output metadata

8991b3d

fixes langchain-ai#20969

fix: DuckDB - improve performance of from_documents

461588d

The official documentation advises against row-by-row INSERTs: they are very slow and put a heavy load on storage. In a test with 100+ documents, the duration went down from 27s to 7s and the local SSD was far less utilized.

fixup! format and lint

660c2bd

baskaryan and others added 3 commits May 22, 2024 10:10

fmt

2faf796

fmt

4ff3079

WIP: support both row by row and from Dataframe

dce52a8

jaceksan force-pushed the duckdb branch from f5bc95b to dce52a8, May 22, 2024 08:10

vercel bot deployed to Preview May 22, 2024 08:27

jaceksan force-pushed the duckdb branch 2 times, most recently from a65c58c to 6275223, May 22, 2024 08:41

fmt

f823e82

jaceksan force-pushed the duckdb branch from 6275223 to f823e82, May 22, 2024 08:42

vercel bot deployed to Preview May 22, 2024 08:57
