Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: Add SQLDatabaseLoader document loader #16246

Closed

Conversation

amotl
Copy link
Contributor

@amotl amotl commented Jan 19, 2024

Hi from CrateDB again,

in the same spirit like GH-16243 and GH-16244, this patch breaks out another commit from crate-workbench#1, in order to reduce the size of this patch before submitting it, and to separate concerns.

To accompany the SQLAlchemy adapter implementation, the patch includes integration tests for both SQLite and PostgreSQL. Let me know if corresponding utility resources should be added at different spots.

With kind regards,
Andreas.

Software Tests

docker compose --file libs/community/tests/integration_tests/document_loaders/docker-compose/postgresql.yml up
cd libs/community
pip install psycopg2-binary
pytest -vvv tests/integration_tests -k sqldatabase
14 passed

image

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. Ɑ: doc loader Related to document loader module (not documentation) 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels Jan 19, 2024
Copy link

vercel bot commented Jan 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 9, 2024 0:05am

Comment on lines 29 to 42
CONNECTION_STRING = os.environ.get(
"TEST_POSTGRESQL_CONNECTION_STRING",
"postgresql+psycopg2://postgres@localhost:5432/",
)


@pytest.fixture
@unittest.skipIf(not psycopg2_installed, "psycopg2 not installed")
def engine() -> sa.Engine:
"""
Return an SQLAlchemy engine object.
"""
return sa.create_engine(CONNECTION_STRING, echo=False)
Copy link
Contributor Author

@amotl amotl Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL-based tests will be skipped if the psycopg2 driver is not installed. Because this is not happening by default, it would need to be installed manually.

a) Let me know if you want to include psycopg2-binary into the list of dependencies, and if so, where.
b) Also let me know if there is any means to invoke a PostgreSQL instance on CI, or if there is one available already, in order to also run the integration tests on CI. Otherwise, do you think it is good to go like it is, skipping the integration tests on CI?

@amotl

This comment was marked as resolved.

@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jan 25, 2024
@amotl amotl requested a review from cbornet January 25, 2024 01:11
libs/community/poetry.lock Outdated Show resolved Hide resolved
Copy link
Contributor Author

@amotl amotl Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @baskaryan,

I've submitted 28ca847 as a separate patch, also in order to accompany it by corresponding test cases.

With kind regards,
Andreas.

Copy link
Contributor Author

@amotl amotl Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi again. Thank you both for all the excellent reviews on this patch.

GH-16655 now also gained a dedicated documentation page about relevant features of the SQLDatabase utility we improved along the lines, see sql_database.ipynb.

I think it makes sense to bring in that patch before coming back here. Thus, it has been toggled back into draft mode. After coming back from yak shaving, I will rebase this patch, and squash the individual commits, before presenting it for review again, if you agree with this procedure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. @eyurtsev integrated GH-16655 on behalf of GH-17191. Thank you very much. I will get back to refreshing this PR soon, and will let you know about it.

Copy link
Contributor Author

@amotl amotl Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi again. I've just refreshed this patch, and CI signals success, so it might be good to go if @eyurtsev doesn't have any objections? Thanks again for the guidance to improve SQLDatabase beforehand.

@amotl amotl changed the title community: Add SQLAlchemy document loader community: Add SQLDatabaseLoader document loader Jan 27, 2024
Copy link
Collaborator

@cbornet cbornet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan baskaryan closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024

def __init__(
self,
query: Union[str, sa.Select],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reviewing this PR: #16655 , which implements a cursor and I think will be used by this data loader for loading documents from SQL.

I've mostly used keyset pagination in the past because it's easier to add retry logic to it in case of network connections or some other server side errors (as well as parallelization). Is there way to do retries with the cursor if the connection times out or we get disconnected for whatever reason?

Key set pagination: https://www.cockroachlabs.com/docs/stable/pagination#keyset-pagination

Basic idea is to combine a sort on an id field, a filter on the id filed and a limit.

Copy link
Contributor Author

@amotl amotl Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eyurtsev. I hear your argument, but at the same time, I think the Cockroach documentation is misleading.

Allocating resources for a cursor once, and then using it to churn through the results, optionally also in async/streaming mode, is far more efficient than sending multiple independent database queries which need to allocate corresponding resources on the server side each and every time.

Looking at this from a different perspective, it probably depends very much on the use case at hand. It think using key set pagination is out of scope for this patch, as it depends on an id field, and some database tables may not have it.

If such functionality is needed, it can probably implemented on top, or orthogonal to this infrastructure code, which is merely foundational.

The patch also actually does not change how database querying works through SQLAlchemy. The fetch=cursor literal just signals that it should return the result wrapper to be able to iterate on behalf of the calling site. Otherwise, records will already be collected in SQLDatabase itself.

Copy link
Contributor Author

@amotl amotl Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've mostly used keyset pagination in the past.

If such functionality is needed, it can probably implemented on top, or orthogonal to this infrastructure code, which is merely foundational.

Hi Eugene,

let me know if you think differently, or if you agree with this assessment. If you are using this technique, and would like to see any sort of adapter for it, I think it may be valuable to implement in LangChain. However, I don't see how it can be implemented within the scope of this patch.

With kind regards,
Andreas.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to review / merge today if not then earliest on Monday -- a bit pressed on time and I want to review carefully :)

I agree with your points -- cursor based pagination is likely good fit for most users and will result in a simpler API.

If they're working at scale, they'll likely need to do additional optimizations regardless and may want to implement their own solution regardless.


fwiw keyset pagination is a general pagination technique (isn't a cockroachdb thing) -- depending on infra and use case it can be used to parallelize reads easily for larger scales of records (~say 10M records), and it's fault tolerant because subsets of pages can be retried independently

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Eugene, please take your time. Also thanks for educating me about keyset pagination, I like the idea and it is definitively worth to consider when needing to serve data at scale.

Probably a bit OT, so we might want to continue discussing at another spot, but you sparked my interest ;]:
Is keyset pagination also used when needing to page through datasets which are larger-than-memory, so that processing a potential large resultset in response to a query will not trip memory exhaustion both server- and client-side?

Copy link
Contributor Author

@amotl amotl Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] so that processing a potential large resultset in response to a query will not trip memory exhaustion both server- and client-side.

On this very matter, I think it will be sweet to implement server-side cursor support on behalf of a later iteration, when applicable. PostgreSQL can do it, and CrateDB implemented it recently, also for that very purpose. See async_streaming.py.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies took a bit longer than planned. PR Looks great! I left a few nits!

Also wanted to thank @cbornet for his review

Copy link
Contributor Author

@amotl amotl Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Eugene. Thanks for your review. I will be afk for hiking starting tomorrow, so I may be able to get back to this only on Monday ff. If you have some spare cycles, feel free to take over any time. I don't have any objections about your suggestions, thank you very much for your guidance! Otherwise, see you next week.

@eyurtsev eyurtsev self-assigned this Feb 7, 2024
eyurtsev added a commit that referenced this pull request Feb 8, 2024
…rs, query by selectable, expose execution options, and documentation (#17191)

- **Description:** Improve `SQLDatabase` adapter component to promote
code re-use, see
[suggestion](#16246 (review)).
  - **Needed by:** GH-16246
  - **Addressed to:** @baskaryan, @cbornet 

## Details
- Add `cursor` fetch mode
- Accept SQL query parameters
- Accept both `str` and SQLAlchemy selectables as query expression
- Expose `execution_options`
- Documentation page (notebook) about `SQLDatabase` [^1]
See [About
SQLDatabase](https://github.com/langchain-ai/langchain/blob/c1c7b763/docs/docs/integrations/tools/sql_database.ipynb).

[^1]: Apparently there hasn't been any yet?

---------

Co-authored-by: Andreas Motl <andreas.motl@crate.io>
@amotl amotl force-pushed the document-loader-sqlalchemy branch 2 times, most recently from b19a392 to 7894cdf Compare February 8, 2024 23:45
@amotl amotl marked this pull request as ready for review February 9, 2024 00:05
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 9, 2024
@amotl amotl requested a review from eyurtsev February 9, 2024 00:05
@dosubot dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Feb 9, 2024
@amotl amotl requested a review from baskaryan February 9, 2024 00:05
Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! A few minor comments and we should merge

For talking to the database, the document loader uses the `SQLDatabase`
utility from the LangChain integration toolkit.

Each document represents one row of the result.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be willing to add a ..code-block python example that shows how to use the loader?

It'll populate the API reference with a nice python formatted example: https://api.python.langchain.com/en/v0.0.349/api_reference.html

db: A LangChain `SQLDatabase`, wrapping an SQLAlchemy engine.
sqlalchemy_kwargs: More keyword arguments for SQLAlchemy's `create_engine`.
parameters: Optional. Parameters to pass to the query.
page_content_mapper: Optional. Function to convert a row into a string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a row a dict or a tuple? Any chance we could update the type signature on page_content_mapping and metadata_mapper with the argument type if it's known?

With the type information missing and no example python snippet -- usage of these parameters is non obvious from documentation

metadata_mapper: Optional. Function to convert a row into a dictionary
to use as the `metadata` of the document. By default, no columns are
selected into the metadata dictionary.
source_columns: Optional. The names of the columns to use as the `source`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are source columns?

Comment on lines +70 to +71


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use a fixture with module scope and clean up after the test end?



try:
import sqlite3 # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about using:

pytest.skip(reason="the reason", alllow_module_level=True)

to skip the unit tests and provide a reason?

snsten pushed a commit to snsten/langchain that referenced this pull request Feb 15, 2024
…rs, query by selectable, expose execution options, and documentation (langchain-ai#17191)

- **Description:** Improve `SQLDatabase` adapter component to promote
code re-use, see
[suggestion](langchain-ai#16246 (review)).
  - **Needed by:** langchain-aiGH-16246
  - **Addressed to:** @baskaryan, @cbornet 

## Details
- Add `cursor` fetch mode
- Accept SQL query parameters
- Accept both `str` and SQLAlchemy selectables as query expression
- Expose `execution_options`
- Documentation page (notebook) about `SQLDatabase` [^1]
See [About
SQLDatabase](https://github.com/langchain-ai/langchain/blob/c1c7b763/docs/docs/integrations/tools/sql_database.ipynb).

[^1]: Apparently there hasn't been any yet?

---------

Co-authored-by: Andreas Motl <andreas.motl@crate.io>
@eyurtsev
Copy link
Collaborator

eyurtsev commented Feb 22, 2024

@amotl Let me know if you'd like me to comnandeer to incorporate some of the nits!

@amotl
Copy link
Contributor Author

amotl commented Feb 22, 2024

Hi @eyurtsev. While I feel bad about it, if you have the cycles, it would be so nice, indeed! I am currently a bit swamped with documentation improvement matters on our ends, where others are dearly waiting for, so I will appreciate your efforts very much.

@eyurtsev
Copy link
Collaborator

Comandeering

@eyurtsev
Copy link
Collaborator

Closing in favor of: #18281

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features 🤖:improvement Medium size change to existing code to handle new use-cases size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants