
community: DuckDB VS - expose similarity, improve performance of from_texts #20971

Open
wants to merge 6 commits into master
Conversation

@jaceksan (Contributor) commented Apr 27, 2024

Three fixes for the DuckDB vector store:

Dependencies: added Pandas to speed up from_documents.
I considered the CSV and JSON options, but I expect trouble loading JSON values that way, and both options require writing data to disk.
In any case, the poetry file for langchain-community already declares a dependency on Pandas.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 27, 2024

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 27, 2024
@jaceksan jaceksan force-pushed the duckdb branch 2 times, most recently from afc56bb to b171b58 Compare April 27, 2024 19:00
@jaceksan (Contributor, Author) commented May 2, 2024

@baskaryan @eyurtsev should I try to fix the failing test?
To be honest, I don't know why it started to fail.
The import in the notebook is identical to what I use in my code base... well, except that I import from langchain_community while the notebook imports from langchain. I tried changing it to langchain_community, but it did not help.

Any suggestion would be appreciated.

Comment on lines +180 to +182
df = pd.DataFrame.from_dict(data) # noqa: F841
self._connection.execute(
f"INSERT INTO {self._table_name} SELECT * FROM df",
Collaborator

do we need to use pandas here? this is a breaking change since pandas is an optional dependency

Contributor Author

Hey,
I am afraid we do.
DuckDB does not provide any other way to import data efficiently.
To be precise, it supports importing from files (CSV, Parquet, JSON), but that would require writing the file to disk first, which is IMO risky (which folder? what privileges? ...).
If the new dependency becomes a blocker, I can implement a file-based approach, but then I would appreciate guidance on how to do it properly.

Collaborator

What if we try importing pandas: if it's available we use the updated approach, and if it's not we fall back to the current approach?

Contributor Author

I tried to do that, but when I tested it end-to-end, it failed here:
https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/vectorstores/duckdb.py#L192
The similarity_search() method uses the fetchdf() function, which requires Pandas, so the hard dependency was already there.

Do you want me to update the similarity_search() method too?

I noticed that you added new code raising an exception if the import of Pandas fails.
Can we keep the current code in the PR and merge it?

@efriis efriis added the partner label May 3, 2024
@efriis efriis self-assigned this May 3, 2024
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 3, 2024
jaceksan and others added 6 commits May 3, 2024 10:39
Row-by-row INSERTs are not recommended by the official docs.
They are very slow and heavily utilize storage.

I tested with 100+ documents: duration went down from 27s to 7s, and the local SSD is far less utilized.
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels May 3, 2024
vector_key: str = DEFAULT_VECTOR_KEY,
id_key: str = DEFAULT_ID_KEY,
text_key: str = DEFAULT_TEXT_KEY,
table_name: str = DEFAULT_TABLE_NAME,
Contributor

this is a change from before - used to be vectorstore, now is embeddings, is that intentional?

Contributor Author

Well, yes.
They were not unified, which caused issues: you would create the table under one name and then try to search it under the other.

Labels
🤖:improvement Medium size change to existing code to handle new use-cases partner size:M This PR changes 30-99 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Development

Successfully merging this pull request may close these issues.

DuckDB: distance/similarity property not reported to documents returned by similarity_search
6 participants