community: Add SQLDatabaseLoader document loader #16246
Conversation
CONNECTION_STRING = os.environ.get(
    "TEST_POSTGRESQL_CONNECTION_STRING",
    "postgresql+psycopg2://postgres@localhost:5432/",
)

@pytest.fixture
@unittest.skipIf(not psycopg2_installed, "psycopg2 not installed")
def engine() -> sa.Engine:
    """Return an SQLAlchemy engine object."""
    return sa.create_engine(CONNECTION_STRING, echo=False)
PostgreSQL-based tests will be skipped if the psycopg2 driver is not installed. Because it is not installed by default, it needs to be installed manually.
a) Let me know if you want to include psycopg2-binary in the list of dependencies, and if so, where.
b) Also let me know if there is a way to invoke a PostgreSQL instance on CI, or if one is available already, so the integration tests can also run on CI. Otherwise, do you think it is good to go as it is, skipping the integration tests on CI?
Dear @baskaryan,
I've submitted 28ca847 as a separate patch, so that it can be accompanied by corresponding test cases.
With kind regards,
Andreas.
Hi again. Thank you both for all the excellent reviews on this patch.
GH-16655 now also gained a dedicated documentation page about the relevant features of the `SQLDatabase` utility we improved along the way; see sql_database.ipynb.
I think it makes sense to bring in that patch before coming back here, so this one has been toggled back into draft mode. After returning from yak shaving, I will rebase this patch and squash the individual commits before presenting it for review again, if you agree with this procedure.
Hi again. I've just refreshed this patch, and CI signals success, so it might be good to go if @eyurtsev doesn't have any objections? Thanks again for the guidance to improve `SQLDatabase` beforehand.
LGTM
def __init__(
    self,
    query: Union[str, sa.Select],
I'm reviewing PR #16655, which implements a cursor that I think will be used by this document loader for loading documents from SQL.
I've mostly used keyset pagination in the past because it's easier to add retry logic (as well as parallelization) in case of network interruptions or other server-side errors. Is there a way to do retries with the cursor if the connection times out or we get disconnected for whatever reason?
Keyset pagination: https://www.cockroachlabs.com/docs/stable/pagination#keyset-pagination
The basic idea is to combine a sort on an id field, a filter on the id field, and a limit.
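The idea described above can be sketched with the stdlib `sqlite3` module. This is a minimal illustration, not code from the patch; the table and column names are made up. Each page is an independent, individually retryable query combining a sort on `id`, a filter on `id`, and a limit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO items (value) VALUES (?)",
    [(f"row-{i}",) for i in range(10)],
)

def fetch_pages(conn, page_size=4):
    """Yield result pages, remembering the last seen id between queries."""
    last_id = 0
    while True:
        # Each page is a self-contained query: sort + filter + limit.
        rows = conn.execute(
            "SELECT id, value FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]

pages = list(fetch_pages(conn))  # three pages: 4 + 4 + 2 rows
```

Because each page query carries its own keyset boundary, a failed page can be re-issued without restarting the whole scan.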
Hi @eyurtsev. I hear your argument, but at the same time, I think the Cockroach documentation is misleading.
Allocating resources for a cursor once, and then using it to churn through the results, optionally also in async/streaming mode, is far more efficient than sending multiple independent database queries, each of which needs to allocate corresponding resources on the server side.
Looking at this from a different perspective, it probably depends very much on the use case at hand. I think using keyset pagination is out of scope for this patch, as it depends on an id field, which some database tables may not have.
If such functionality is needed, it can probably be implemented on top of, or orthogonal to, this infrastructure code, which is merely foundational.
The patch also does not actually change how database querying works through SQLAlchemy. The `fetch=cursor` literal just signals that the result wrapper should be returned, so that the calling site can iterate over it. Otherwise, records will already be collected in `SQLDatabase` itself.
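The distinction between the two fetch modes can be shown with a toy sketch using stdlib `sqlite3`. The `run` helper and its `fetch` parameter are illustrative only, not the actual `SQLDatabase` API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO t (value) VALUES (?)", [(f"v{i}",) for i in range(5)])

def run(query: str, fetch: str = "all"):
    cursor = conn.execute(query)
    if fetch == "cursor":
        # Hand the result wrapper to the calling site, which iterates lazily.
        return cursor
    # Default behavior: collect all records eagerly.
    return cursor.fetchall()

collected = run("SELECT value FROM t ORDER BY id")
streamed = [row[0] for row in run("SELECT value FROM t ORDER BY id", fetch="cursor")]
```

The point is that `fetch="cursor"` changes who drives iteration, not how the query is executed.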
> I've mostly used keyset pagination in the past.

> If such functionality is needed, it can probably be implemented on top of, or orthogonal to, this infrastructure code, which is merely foundational.
Hi Eugene,
let me know if you think differently, or if you agree with this assessment. If you are using this technique, and would like to see any sort of adapter for it, I think it may be valuable to implement in LangChain. However, I don't see how it can be implemented within the scope of this patch.
With kind regards,
Andreas.
I'll try to review and merge today; if not, then Monday at the earliest -- I'm a bit pressed for time and I want to review carefully :)
I agree with your points -- cursor-based pagination is likely a good fit for most users and will result in a simpler API.
Users working at scale will likely need to do additional optimizations anyway, and may want to implement their own solution regardless.
FWIW, keyset pagination is a general pagination technique (it isn't a CockroachDB thing) -- depending on infrastructure and use case, it can be used to parallelize reads easily at larger scales of records (say ~10M), and it's fault-tolerant because subsets of pages can be retried independently.
Thanks Eugene, please take your time. Also thanks for educating me about keyset pagination; I like the idea, and it is definitely worth considering when needing to serve data at scale.
Probably a bit OT, so we might want to continue discussing at another spot, but you sparked my interest ;]:
Is keyset pagination also used when paging through datasets which are larger than memory, so that processing a potentially large result set in response to a query will not trip memory exhaustion, both server- and client-side?
> [...] so that processing a potentially large result set in response to a query will not trip memory exhaustion, both server- and client-side.

On this very matter, I think it would be sweet to implement server-side cursor support in a later iteration, where applicable. PostgreSQL can do it, and CrateDB implemented it recently, also for that very purpose. See async_streaming.py.
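The memory-bounded access pattern a cursor enables can be sketched with stdlib `sqlite3` and batched fetching; the table name and batch size are illustrative, and a server-side cursor would apply the same shape on the server end as well:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(250)],
)

cursor = conn.execute("SELECT id, payload FROM events")
batches, seen = 0, 0
while True:
    # Only one batch of rows is materialized client-side at a time,
    # so memory use is bounded by the batch size, not the result size.
    batch = cursor.fetchmany(100)
    if not batch:
        break
    batches += 1
    seen += len(batch)
```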
Apologies, this took a bit longer than planned. The PR looks great! I left a few nits!
Also, I wanted to thank @cbornet for his review.
Hi Eugene. Thanks for your review. I will be AFK for hiking starting tomorrow, so I may only be able to get back to this on Monday or later. If you have some spare cycles, feel free to take over at any time. I don't have any objections to your suggestions; thank you very much for your guidance! Otherwise, see you next week.
…rs, query by selectable, expose execution options, and documentation (#17191)

- **Description:** Improve `SQLDatabase` adapter component to promote code re-use, see [suggestion](#16246 (review)).
- **Needed by:** GH-16246
- **Addressed to:** @baskaryan, @cbornet

## Details

- Add `cursor` fetch mode
- Accept SQL query parameters
- Accept both `str` and SQLAlchemy selectables as query expression
- Expose `execution_options`
- Documentation page (notebook) about `SQLDatabase` [^1]

See [About SQLDatabase](https://github.com/langchain-ai/langchain/blob/c1c7b763/docs/docs/integrations/tools/sql_database.ipynb).

[^1]: Apparently there hasn't been any yet?

Co-authored-by: Andreas Motl <andreas.motl@crate.io>
Looks great to me! A few minor comments and we should merge
For talking to the database, the document loader uses the `SQLDatabase`
utility from the LangChain integration toolkit.

Each document represents one row of the result.
Would you be willing to add a `.. code-block:: python` example that shows how to use the loader?
It'll populate the API reference with a nicely formatted Python example: https://api.python.langchain.com/en/v0.0.349/api_reference.html
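As a rough sketch of what such an example could demonstrate -- turning each row of a query result into one document -- here is a self-contained toy version using stdlib `sqlite3`. The `Document` class is a stand-in for LangChain's document type, and the table and column names are made up; this is not the actual loader API:

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in for LangChain's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows behave like dicts keyed by column name
conn.execute("CREATE TABLE mlb_teams_2012 (name TEXT, payroll REAL)")
conn.executemany(
    "INSERT INTO mlb_teams_2012 VALUES (?, ?)",
    [("Nationals", 81.34), ("Reds", 82.20)],
)

docs = []
for row in conn.execute("SELECT * FROM mlb_teams_2012"):
    # Serialize all columns into the page content, one per line.
    content = "\n".join(f"{key}: {row[key]}" for key in row.keys())
    docs.append(Document(page_content=content, metadata={"source": row["name"]}))
```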
db: A LangChain `SQLDatabase`, wrapping an SQLAlchemy engine.
sqlalchemy_kwargs: More keyword arguments for SQLAlchemy's `create_engine`.
parameters: Optional. Parameters to pass to the query.
page_content_mapper: Optional. Function to convert a row into a string
Is a row a dict or a tuple? Any chance we could update the type signatures of page_content_mapper and metadata_mapper with the argument type, if it's known?
With the type information missing and no example Python snippet, the usage of these parameters is non-obvious from the documentation.
metadata_mapper: Optional. Function to convert a row into a dictionary
    to use as the `metadata` of the document. By default, no columns are
    selected into the metadata dictionary.
source_columns: Optional. The names of the columns to use as the `source`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are source columns?
Should we use a fixture with module scope and clean up after the tests end?
try:
    import sqlite3  # noqa: F401
What do you think about using:
`pytest.skip(reason="the reason", allow_module_level=True)`
to skip the unit tests and provide a reason?
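The suggested pattern could look roughly like this. The availability check uses stdlib `importlib`; the commented-out pytest call shows where the module-level skip would go in a real test module (pytest itself is deliberately not imported here, so this snippet stands alone):

```python
import importlib.util

# Detect whether the optional driver is importable, without importing it.
psycopg2_installed = importlib.util.find_spec("psycopg2") is not None

# In the actual test module, one would then write at module level:
# if not psycopg2_installed:
#     pytest.skip(reason="psycopg2 not installed", allow_module_level=True)
```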
@amotl Let me know if you'd like me to commandeer to incorporate some of the nits!
Hi @eyurtsev. While I feel bad about it, if you have the cycles, that would indeed be very nice! I am currently a bit swamped with documentation improvements on our end, which others are dearly waiting for, so I will appreciate your efforts very much.
Commandeering
Closing in favor of: #18281
SQLDatabaseLoader: `cursor`, query parameters, query by selectable, expose execution options, and documentation #16655

Hi from CrateDB again,
in the same spirit as GH-16243 and GH-16244, this patch breaks out another commit from crate-workbench#1, in order to reduce the size of that patch before submitting it, and to separate concerns.
To accompany the SQLAlchemy adapter implementation, the patch includes integration tests for both SQLite and PostgreSQL. Let me know if the corresponding utility resources should be added at different spots.
With kind regards,
Andreas.
Software Tests
docker compose --file libs/community/tests/integration_tests/document_loaders/docker-compose/postgresql.yml up