
community: DuckDB VS - expose similarity, improve performance of from_texts #20971

Open
wants to merge 6 commits into master
Conversation

@jaceksan (Contributor) commented Apr 27, 2024

Three fixes for the DuckDB vector store:

Dependencies: added Pandas to speed up from_documents.
I considered the CSV and JSON options, but I expect trouble loading JSON values that way, and both options require writing data to disk.
In any case, the poetry file for langchain-community already declares a dependency on Pandas.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 27, 2024

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 27, 2024
@jaceksan jaceksan force-pushed the duckdb branch 2 times, most recently from afc56bb to b171b58 Compare April 27, 2024 19:00
@jaceksan (Contributor, Author) commented May 2, 2024

@baskaryan @eyurtsev should I try to fix the failing test?
To be honest, I don't know why it started to fail.
The import in the notebook is identical to what I use in my code base... well, except that I import from langchain_community while the notebook imports from langchain. I tried changing it to langchain_community, but it did not help.

Any suggestion would be appreciated.

Comment on lines +180 to +182
df = pd.DataFrame.from_dict(data) # noqa: F841
self._connection.execute(
f"INSERT INTO {self._table_name} SELECT * FROM df",
Collaborator

do we need to use pandas here? this is a breaking change since pandas is an optional dependency

Contributor Author

Hey,
I am afraid we do.
DuckDB does not provide any other way to import data efficiently.
To be precise, it supports importing from files (CSV, Parquet, JSON), but that would require writing the file to disk first, which is IMO risky (which folder? what privileges? ...).
If the new dependency becomes a blocker, I can implement a file-based approach, but then I would appreciate guidance on how to do it properly.

Collaborator

What if we try importing pandas: if it's available we use the updated approach, and if it's not we fall back to the current approach?

Contributor Author

I tried to do that, but when I tested it end-to-end, it failed here:
https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/vectorstores/duckdb.py#L192
The similarity_search() method uses the fetchdf() function, which requires Pandas, so the hard dependency was already there.

Do you want me to update the similarity_search() method too?

I noticed that you added new code raising an exception if the import of Pandas fails.
Can we keep the current code in the PR and merge it?

@efriis efriis added the partner label May 3, 2024
@efriis efriis self-assigned this May 3, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 3, 2024
jaceksan and others added 6 commits May 3, 2024 10:39
Row-by-row INSERTs are not recommended by the official docs.
They are very slow and heavily utilize storage.

I tested with 100+ documents: duration went down from 27s to 7s, and the local SSD is far less utilized.
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels May 3, 2024
vector_key: str = DEFAULT_VECTOR_KEY,
id_key: str = DEFAULT_ID_KEY,
text_key: str = DEFAULT_TEXT_KEY,
table_name: str = DEFAULT_TABLE_NAME,
Contributor

this is a change from before - used to be vectorstore, now is embeddings, is that intentional?

Contributor Author

Well, yes.
They were not unified, which caused issues: you would create the table under one name and then try to search it under the other.

Labels
🤖:improvement Medium size change to existing code to handle new use-cases partner size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Development

Successfully merging this pull request may close these issues.

DuckDB: distance/similarity property not reported to documents returned by similarity_search
6 participants