community: DuckDB VS - expose similarity, improve performance of from_texts #20971
base: master
Conversation
force-pushed from afc56bb to b171b58
@baskaryan @eyurtsev should I try to fix the failing test? Any suggestions would be appreciated.
df = pd.DataFrame.from_dict(data)  # noqa: F841
self._connection.execute(
    f"INSERT INTO {self._table_name} SELECT * FROM df",
do we need to use pandas here? this is a breaking change since pandas is an optional dependency
Hey,
I am afraid we need it.
DuckDB does not provide any other way to import data efficiently.
To be precise, it also supports importing from files (CSV, Parquet, JSON), but that would require writing the file to disk first, which is IMO risky (which folder? privileges? ...).
If the new dependency becomes a blocker, I can implement a file-based approach, but then I would appreciate guidance on how to do it properly.
what if we try importing pandas, and if it's available we do the updated approach, and if it's not available we fall back to the current approach?
I tried to do it.
But when I tested it end-to-end, it failed here:
https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/vectorstores/duckdb.py#L192
The similarity_search() method uses the fetchdf() function, which requires pandas.
So the hard dependency was already there before this PR.
Do you want me to update the similarity_search() method too?
I noticed that you added new code raising an exception if the pandas import fails.
Can we keep the current code in the PR and merge it?
Row-by-row INSERTs are discouraged by the official docs: they are very slow and heavily utilize the storage. I tested with 100+ documents; the duration went down from 27 s to 7 s, and the local SSD is far less utilized.
vector_key: str = DEFAULT_VECTOR_KEY,
id_key: str = DEFAULT_ID_KEY,
text_key: str = DEFAULT_TEXT_KEY,
table_name: str = DEFAULT_TABLE_NAME,
this is a change from before - used to be vectorstore, now is embeddings, is that intentional?
Well, yes.
They were not unified, and that caused issues: you could create the table under one name and then try to search a table under the other name.
3 fixes of the DuckDB vector store:
- vector_key
- from_documents

Dependencies: added pandas to speed up from_documents. I was thinking about CSV and JSON options, but I expect trouble loading JSON values this way, and both the CSV and JSON options require storing data to disk. Anyway, the poetry file for langchain-community already contains a dependency on pandas.