community: Add `SQLDatabaseLoader` document loader #16246

amotl · 2024-01-19T01:52:35Z

Description: A generic document loader adapter for SQLAlchemy on top of LangChain's SQLDatabaseLoader.
Needed by: Add support for CrateDB to LangChain LLM framework crate-workbench/langchain#1
Depends on: community/SQLDatabase: Fetch mode cursor, query parameters, query by selectable, expose execution options, and documentation #16655
Addressed to: @baskaryan, @cbornet, @eyurtsev

Hi from CrateDB again,

in the same spirit as GH-16243 and GH-16244, this patch breaks out another commit from crate-workbench#1, in order to reduce the size of that patch before submitting it, and to separate concerns.

To accompany the SQLAlchemy adapter implementation, the patch includes integration tests for both SQLite and PostgreSQL. Let me know if corresponding utility resources should be added at different spots.

With kind regards,
Andreas.

Software Tests

docker compose --file libs/community/tests/integration_tests/document_loaders/docker-compose/postgresql.yml up

cd libs/community
pip install psycopg2-binary
pytest -vvv tests/integration_tests -k sqldatabase

14 passed

amotl · 2024-01-19T02:44:26Z

libs/community/tests/integration_tests/document_loaders/test_sqlalchemy_postgresql.py

+CONNECTION_STRING = os.environ.get(
+    "TEST_POSTGRESQL_CONNECTION_STRING",
+    "postgresql+psycopg2://postgres@localhost:5432/",
+)
+
+
+@pytest.fixture
+@unittest.skipIf(not psycopg2_installed, "psycopg2 not installed")
+def engine() -> sa.Engine:
+    """
+    Return an SQLAlchemy engine object.
+    """
+    return sa.create_engine(CONNECTION_STRING, echo=False)


PostgreSQL-based tests will be skipped if the psycopg2 driver is not installed. Because it is not installed by default, it needs to be installed manually in order to run them.

a) Let me know if you want to include psycopg2-binary in the list of dependencies, and if so, where.
b) Also let me know if there is any means to invoke a PostgreSQL instance on CI, or if there is one available already, in order to also run the integration tests on CI. Otherwise, do you think it is good to go as it is, skipping the integration tests on CI?
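
A minimal sketch of how such a skip could be expressed with the standard library only, assuming the availability check is done with `importlib.util.find_spec` and the skip is attached to a test class rather than to the fixture. The names `driver_available` and `PostgresqlLoaderTest` are hypothetical and not part of the patch.

```python
import importlib.util
import unittest


def driver_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical test class: it is skipped as a whole when psycopg2 is missing,
# and the reason shows up in the test report.
@unittest.skipUnless(driver_available("psycopg2"), "psycopg2 not installed")
class PostgresqlLoaderTest(unittest.TestCase):
    def test_driver_importable(self) -> None:
        import psycopg2  # noqa: F401
```

With this shape, a missing driver produces an explicit "skipped" entry with a reason instead of a silent omission.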

amotl · 2024-01-27T00:07:15Z

libs/community/langchain_community/utilities/sql_database.py

Dear @baskaryan,

I've submitted 28ca847 as a separate patch, also so that it can be accompanied by corresponding test cases.

community/SQLDatabase: Fetch mode cursor, query parameters, query by selectable, expose execution options, and documentation #16655

With kind regards,
Andreas.

Hi again. Thank you both for all the excellent reviews on this patch.

GH-16655 has now also gained a dedicated documentation page about the relevant features of the SQLDatabase utility we improved along the way, see sql_database.ipynb.

I think it makes sense to bring in that patch before coming back here. Thus, it has been toggled back into draft mode. After coming back from yak shaving, I will rebase this patch, and squash the individual commits, before presenting it for review again, if you agree with this procedure.

Hi. @eyurtsev integrated GH-16655 by way of GH-17191. Thank you very much. I will get back to refreshing this PR soon, and will let you know about it.

Hi again. I've just refreshed this patch, and CI signals success, so it might be good to go if @eyurtsev doesn't have any objections? Thanks again for the guidance to improve SQLDatabase beforehand.

cbornet

LGTM

eyurtsev · 2024-02-07T19:11:56Z

libs/community/langchain_community/document_loaders/sqldatabase.py

+
+    def __init__(
+        self,
+        query: Union[str, sa.Select],


I'm reviewing this PR: #16655, which implements a cursor that I think will be used by this data loader for loading documents from SQL.

I've mostly used keyset pagination in the past because it's easier to add retry logic to it in case of network connection problems or other server-side errors (as well as parallelization). Is there a way to do retries with the cursor if the connection times out or we get disconnected for whatever reason?

Keyset pagination: https://www.cockroachlabs.com/docs/stable/pagination#keyset-pagination

The basic idea is to combine a sort on an id field, a filter on the id field, and a limit.
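
The idea can be sketched as follows, using the standard-library `sqlite3` module instead of SQLAlchemy. The table `items`, its columns, and the helper `iter_pages` are assumptions made for illustration; ids are assumed to be positive integers.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def iter_pages(conn: sqlite3.Connection, page_size: int) -> Iterator[List[Row]]:
    """Yield pages of (id, name) rows using keyset pagination."""
    last_id = 0  # assumption: ids are positive integers
    while True:
        # Sort on id, filter on id, and limit the page size.
        page = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        # The last id seen becomes the lower bound of the next page.
        last_id = page[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)
pages = list(iter_pages(conn, page_size=3))
```

Each page is an independent query, which is what makes this approach easy to retry or distribute, at the cost of one round trip per page.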

Hi @eyurtsev. I hear your argument, but at the same time, I think the Cockroach documentation is misleading.

Allocating resources for a cursor once, and then using it to churn through the results, optionally also in async/streaming mode, is far more efficient than sending multiple independent database queries which need to allocate corresponding resources on the server side each and every time.

Looking at this from a different perspective, it probably depends very much on the use case at hand. I think using keyset pagination is out of scope for this patch, as it depends on an id field, and some database tables may not have one.

If such functionality is needed, it can probably be implemented on top, or orthogonal to this infrastructure code, which is merely foundational.

The patch also does not actually change how database querying works through SQLAlchemy. The fetch=cursor literal just signals that the result wrapper should be returned, so that the calling site can iterate over it. Otherwise, records are already collected in SQLDatabase itself.
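
The difference between the two modes can be illustrated with a small stand-in built on the standard-library `sqlite3` module. The function `run_query` and its `fetch` parameter are hypothetical and only mimic the behaviour described here; this is not the `SQLDatabase` implementation.

```python
import sqlite3
from typing import List, Union


def run_query(
    conn: sqlite3.Connection, sql: str, fetch: str = "all"
) -> Union[List[tuple], sqlite3.Cursor]:
    """Hypothetical stand-in: collect all rows, or hand back the cursor."""
    cursor = conn.execute(sql)
    if fetch == "cursor":
        # The caller iterates; rows are produced on demand.
        return cursor
    if fetch == "all":
        # All rows are collected here before returning.
        return cursor.fetchall()
    raise ValueError(f"Unknown fetch mode: {fetch}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(5)])

collected = run_query(conn, "SELECT n FROM numbers ORDER BY n")
cursor = run_query(conn, "SELECT n FROM numbers ORDER BY n", fetch="cursor")
first_two = [next(cursor), next(cursor)]  # consume row by row
next_batch = cursor.fetchmany(2)  # or in bounded batches
```

Handing back the cursor leaves the decision about batching and memory use to the caller.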

> I've mostly used keyset pagination in the past.

> If such functionality is needed, it can probably be implemented on top, or orthogonal to this infrastructure code, which is merely foundational.

Hi Eugene,

let me know if you think differently, or if you agree with this assessment. If you are using this technique, and would like to see any sort of adapter for it, I think it may be valuable to implement in LangChain. However, I don't see how it can be implemented within the scope of this patch.

With kind regards,
Andreas.

I'll try to review / merge today; if not, then Monday at the earliest -- a bit pressed for time and I want to review carefully :)

I agree with your points -- cursor-based pagination is likely a good fit for most users and will result in a simpler API.

If they're working at scale, they'll likely need to do additional optimizations and may want to implement their own solution regardless.

fwiw keyset pagination is a general pagination technique (it isn't a CockroachDB thing) -- depending on infra and use case, it can be used to parallelize reads easily for larger numbers of records (say ~10M records), and it's fault tolerant because subsets of pages can be retried independently
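
The fault-tolerance argument can be sketched like this: because a page is fully defined by the last id seen, a failed page fetch can be retried without restarting the whole read. The sketch uses `sqlite3` and a simulated disconnect; `fetch_page`, `read_all`, and `flaky_fetch` are hypothetical names, and ids are assumed to be positive integers.

```python
import sqlite3
from typing import Callable, List, Tuple

Row = Tuple[int, str]


def fetch_page(conn: sqlite3.Connection, after_id: int, limit: int) -> List[Row]:
    """Fetch one page of rows whose id is greater than `after_id`."""
    return conn.execute(
        "SELECT id, body FROM docs WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()


def read_all(
    fetch: Callable[[int, int], List[Row]], limit: int, max_attempts: int = 3
) -> List[Row]:
    """Read every page, retrying each page independently on connection errors."""
    rows: List[Row] = []
    after_id = 0
    while True:
        for attempt in range(1, max_attempts + 1):
            try:
                page = fetch(after_id, limit)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise
        if not page:
            return rows
        rows.extend(page)
        after_id = page[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [(i, f"doc-{i}") for i in range(1, 6)],
)

calls = {"count": 0}


def flaky_fetch(after_id: int, limit: int) -> List[Row]:
    calls["count"] += 1
    if calls["count"] == 2:  # simulate a dropped connection on the second call
        raise ConnectionError("simulated disconnect")
    return fetch_page(conn, after_id, limit)


rows = read_all(flaky_fetch, limit=2)
```

Only the failed page is fetched again; pages that were already read are not repeated.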

Thanks Eugene, please take your time. Also thanks for educating me about keyset pagination, I like the idea and it is definitely worth considering when needing to serve data at scale.

Probably a bit OT, so we might want to continue discussing at another spot, but you sparked my interest ;]:
Is keyset pagination also used when needing to page through datasets which are larger-than-memory, so that processing a potential large resultset in response to a query will not trip memory exhaustion both server- and client-side?

> [...] so that processing a potential large resultset in response to a query will not trip memory exhaustion both server- and client-side.

On this very matter, I think it will be sweet to implement server-side cursor support in a later iteration, when applicable. PostgreSQL can do it, and CrateDB implemented it recently, also for that very purpose. See async_streaming.py.

Apologies, this took a bit longer than planned. The PR looks great! I left a few nits!

Also wanted to thank @cbornet for his review

Hi Eugene. Thanks for your review. I will be afk for hiking starting tomorrow, so I may be able to get back to this only on Monday ff. If you have some spare cycles, feel free to take over any time. I don't have any objections about your suggestions, thank you very much for your guidance! Otherwise, see you next week.

@baskaryan

…rs, query by selectable, expose execution options, and documentation (#17191)

- **Description:** Improve `SQLDatabase` adapter component to promote code re-use, see [suggestion](#16246 (review)).
- **Needed by:** GH-16246
- **Addressed to:** @baskaryan, @cbornet

## Details
- Add `cursor` fetch mode
- Accept SQL query parameters
- Accept both `str` and SQLAlchemy selectables as query expression
- Expose `execution_options`
- Documentation page (notebook) about `SQLDatabase` [^1]

See [About SQLDatabase](https://github.com/langchain-ai/langchain/blob/c1c7b763/docs/docs/integrations/tools/sql_database.ipynb).

[^1]: Apparently there hasn't been any yet?

---------

Co-authored-by: Andreas Motl <andreas.motl@crate.io>

eyurtsev

Looks great to me! A few minor comments and we should merge

libs/community/langchain_community/document_loaders/sql_database.py

eyurtsev · 2024-02-13T04:03:10Z

libs/community/langchain_community/document_loaders/sql_database.py

+    For talking to the database, the document loader uses the `SQLDatabase`
+    utility from the LangChain integration toolkit.
+
+    Each document represents one row of the result.


Would you be willing to add a `.. code-block:: python` example that shows how to use the loader?

It'll populate the API reference with a nicely formatted Python example: https://api.python.langchain.com/en/v0.0.349/api_reference.html

eyurtsev · 2024-02-13T04:04:34Z

libs/community/langchain_community/document_loaders/sql_database.py

+            db: A LangChain `SQLDatabase`, wrapping an SQLAlchemy engine.
+            sqlalchemy_kwargs: More keyword arguments for SQLAlchemy's `create_engine`.
+            parameters: Optional. Parameters to pass to the query.
+            page_content_mapper: Optional. Function to convert a row into a string


Is a row a dict or a tuple? Any chance we could update the type signatures of page_content_mapper and metadata_mapper with the argument type, if it's known?

With the type information missing and no example Python snippet, the usage of these parameters is not obvious from the documentation.
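
A dependency-free sketch of what typed mapper signatures could look like, assuming each row is handed to the mappers as a mapping keyed by column name. `RowMapping`, `Document` (a stand-in for LangChain's class), `load_rows`, and the `teams` table are hypothetical and do not reflect the loader's actual implementation.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, Mapping, Optional

RowMapping = Mapping[str, Any]  # assumption: one row, keyed by column name


@dataclass
class Document:
    """Stand-in for LangChain's Document, to keep the sketch self-contained."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def default_page_content(row: RowMapping) -> str:
    """Render all columns as `name: value` lines."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())


def load_rows(
    conn: sqlite3.Connection,
    query: str,
    page_content_mapper: Optional[Callable[[RowMapping], str]] = None,
    metadata_mapper: Optional[Callable[[RowMapping], Dict[str, Any]]] = None,
) -> Iterator[Document]:
    """Yield one document per result row."""
    to_content = page_content_mapper or default_page_content
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # rows become addressable by column name
    for row in cursor.execute(query):
        mapping = dict(row)
        # By default, no columns are selected into the metadata dictionary.
        metadata = metadata_mapper(mapping) if metadata_mapper else {}
        yield Document(page_content=to_content(mapping), metadata=metadata)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (team TEXT, payroll INTEGER)")
conn.executemany(
    "INSERT INTO teams (team, payroll) VALUES (?, ?)",
    [("Nationals", 81), ("Reds", 82)],
)
query = "SELECT team, payroll FROM teams ORDER BY team"
mapped = list(
    load_rows(
        conn,
        query,
        page_content_mapper=lambda row: f"Team: {row['team']}",
        metadata_mapper=lambda row: {"payroll": row["payroll"]},
    )
)
plain = list(load_rows(conn, query))
```

Spelling out the row type in the `Callable` annotations answers the "dict or tuple" question directly in the signature.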

eyurtsev · 2024-02-13T04:05:29Z

libs/community/langchain_community/document_loaders/sql_database.py

+            metadata_mapper: Optional. Function to convert a row into a dictionary
+              to use as the `metadata` of the document. By default, no columns are
+              selected into the metadata dictionary.
+            source_columns: Optional. The names of the columns to use as the `source`


What are source columns?

eyurtsev · 2024-02-13T04:13:13Z

libs/community/tests/integration_tests/document_loaders/test_sql_database.py

+
+

Should we use a fixture with module scope and clean up after the tests end?
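
A minimal sketch of that shape, assuming pytest and an in-memory SQLite database. The names `provision_database` and `database`, and the table `mlb_teams`, are hypothetical; the setup/teardown logic is kept in a plain generator so it can be exercised without a test runner.

```python
import sqlite3
from typing import Iterator

import pytest


def provision_database() -> Iterator[sqlite3.Connection]:
    """Create the test table, hand out the connection, then clean up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mlb_teams (team TEXT)")
    try:
        yield conn
    finally:
        # Teardown: runs once the consumer is done with the connection.
        conn.execute("DROP TABLE mlb_teams")
        conn.close()


@pytest.fixture(scope="module")
def database() -> Iterator[sqlite3.Connection]:
    # Set up once per test module; the code after the `yield` inside
    # `provision_database` runs after the last test of the module.
    yield from provision_database()
```

A test would then request the fixture by name, for example `def test_load(database): ...`, and all tests in the module would share one connection.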

eyurtsev · 2024-02-13T04:16:18Z

libs/community/tests/integration_tests/document_loaders/test_sql_database.py

+
+
+try:
+    import sqlite3  # noqa: F401


What do you think about using:

pytest.skip(reason="the reason", allow_module_level=True)

to skip the unit tests and provide a reason?

libs/community/tests/integration_tests/document_loaders/test_sql_database.py


eyurtsev · 2024-02-22T15:42:02Z

@amotl Let me know if you'd like me to commandeer this to incorporate some of the nits!

amotl · 2024-02-22T17:29:39Z

Hi @eyurtsev. While I feel bad about it, if you have the cycles, it would be so nice, indeed! I am currently a bit swamped with documentation improvement matters on our end, which others are eagerly waiting for, so I will appreciate your efforts very much.

eyurtsev · 2024-02-28T20:18:49Z

Commandeering

eyurtsev · 2024-02-28T20:41:36Z

Closing in favor of: #18281

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. Ɑ: doc loader Related to document loader module (not documentation) 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels Jan 19, 2024

amotl force-pushed the document-loader-sqlalchemy branch from fc6d36b to 6a9ffae Compare January 19, 2024 02:23

amotl force-pushed the document-loader-sqlalchemy branch 2 times, most recently from ffd88b2 to 2b3aed3 Compare January 19, 2024 02:44

amotl force-pushed the document-loader-sqlalchemy branch from 2b3aed3 to 912a702 Compare January 19, 2024 02:52

cbornet reviewed Jan 19, 2024

amotl force-pushed the document-loader-sqlalchemy branch from 912a702 to 8364e8c Compare January 25, 2024 00:44

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jan 25, 2024

amotl force-pushed the document-loader-sqlalchemy branch from 8364e8c to cde3aa9 Compare January 25, 2024 00:58

amotl requested a review from cbornet January 25, 2024 01:11

cbornet reviewed Jan 25, 2024

amotl changed the title ~~community: Add SQLAlchemy document loader~~ community: Add SQLDatabaseLoader document loader Jan 27, 2024

cbornet approved these changes Jan 27, 2024

hwchase17 closed this Jan 30, 2024

baskaryan reopened this Jan 30, 2024

baskaryan closed this Jan 30, 2024

baskaryan reopened this Jan 30, 2024

eyurtsev reviewed Feb 7, 2024

eyurtsev self-assigned this Feb 7, 2024

eyurtsev mentioned this pull request Feb 8, 2024

community[minor]: SQLDatabase Add fetch mode cursor, query parameters, query by selectable, expose execution options, and documentation #17191

Merged

amotl force-pushed the document-loader-sqlalchemy branch 2 times, most recently from b19a392 to 7894cdf Compare February 8, 2024 23:45

community/SQLAlchemyLoader: Add generic SQLAlchemy document loader

56e6318

amotl force-pushed the document-loader-sqlalchemy branch from 7894cdf to 56e6318 Compare February 8, 2024 23:52

amotl marked this pull request as ready for review February 9, 2024 00:05

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 9, 2024

amotl requested a review from eyurtsev February 9, 2024 00:05

dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Feb 9, 2024

amotl requested a review from baskaryan February 9, 2024 00:05

eyurtsev reviewed Feb 13, 2024

eyurtsev closed this Feb 28, 2024

amotl mentioned this pull request Mar 28, 2024

community[minor]: Add SQLDatabaseLoader document loader #18281

Merged
