Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor sqlalchemy code in pandas.io.sql to help prepare for sqlalchemy 2.0. #49531

Merged
merged 2 commits into from
Nov 17, 2022
Merged

Refactor sqlalchemy code in pandas.io.sql to help prepare for sqlalchemy 2.0. #49531

merged 2 commits into from
Nov 17, 2022

Conversation

cdcadman
Copy link
Contributor

@cdcadman cdcadman commented Nov 4, 2022

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I am splitting this out of #48576 , because it is a major refactor of the code, with the goal of making SQLDatabase only accept a Connection and not an Engine. sqlalchemy 2.0 restricts the methods that are available to them, which makes it harder to write code that works with both. For example, Connection.connect() creates a branched connection in sqlalchemy 1.x, but is removed in 2.0, but this is called in SQLDatabase.check_case_sensitive().

I also added some clarification on how transactions work in DataFrame.to_sql, based on this example, run against pandas 1.5.1:

import sqlite3
from pandas import DataFrame
from sqlalchemy import create_engine

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = con.execute("insert into test values (2, 4), (5, 10)").rowcount
    if row_count > 1:
        con.rollback()
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 0

with sqlite3.connect(":memory:") as con:
    con.execute("create table test (A integer, B integer)")
    row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
    if row_count > 1:
        con.rollback() # does nothing, because pandas already committed the transaction.
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 2
    
with create_engine("sqlite:///:memory:").connect() as con:
    with con.begin():
        con.exec_driver_sql("create table test (A integer, B integer)")
    try:
        with con.begin():
            row_count = DataFrame({'A': [2, 5], 'B': [4, 10]}).to_sql('test', con, if_exists='append', index=False)
            assert row_count < 2
    except AssertionError:
        pass
    print(con.execute("select count(*) from test").fetchall()[0][0]) # prints 0

@mroeschke
Copy link
Member

Thanks for the followup, but could you pair down the changes even more (1 PR for 1 targeted change). It will greatly help the review process. It appears this PR is tackling 3 things which would be better if they were 3 individual PRs

  • SQLDatabase accepting connections only
  • Disposing engines
  • test refactoring

@cdcadman
Copy link
Contributor Author

cdcadman commented Nov 4, 2022

Yes, I think I can do that. For the first part, where I make SQLDatabase accept Connections only, I might have to disable the tests which pass an Engine, but that will be a much smaller change than the test refactor.

@cdcadman
Copy link
Contributor Author

cdcadman commented Nov 5, 2022

I pulled the engine disposal and test refactoring out of this PR. I think the engine disposal can wait until that last PR. But part of my motivation for going with this approach was so that I could put the engine disposal into the _sqlalchemy_con function.

pandas/io/sql.py Outdated Show resolved Hide resolved
pandas/io/sql.py Outdated Show resolved Hide resolved
pandas/io/sql.py Outdated Show resolved Hide resolved
@mroeschke mroeschke added the IO SQL to_sql, read_sql, read_sql_query label Nov 7, 2022
@cdcadman
Copy link
Contributor Author

@mroeschke , I've addressed all the comments. After this PR, the next step would be to refactor the test classes, since the ones involving the sqlalchemy engines would no longer be necessary. In a later PR, I would dispose of the sqlalchemy engine, if it needed to be created, within the new function _sqlalchemy_con.



@contextmanager
def _sqlalchemy_con(connectable, need_transaction: bool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be reused in your followup PR? I have a slight preference of just folding this into pandasSQL_builder for now until there's a need to share this connection logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only plan for _sqlalchemy_con to be called by pandasSQL_builder and not by anything else. I split it out to reduce the amount of indentation and keep the functions small. There is an additional try/finally block and some additional if/else blocks in lines 772-800 here: https://github.com/cdcadman/pandas/blob/sql_fixes/pandas/io/sql.py . I think I can fold _sqlalchemy_con into pandasSQL_builder if you still prefer that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay gotcha. I think it'll be fine to leave as a separate function with more code incoming.

However, could you replace import sqlalchemy here with sqlalchemy = import_optional_dependnecy(..., errors="raise")? Just so that if someone else tries using this function in the future it's known that sqlalchemy is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I made the change.

@mroeschke mroeschke added this to the 2.0 milestone Nov 17, 2022
@mroeschke mroeschke merged commit 08f070d into pandas-dev:main Nov 17, 2022
@mroeschke
Copy link
Member

Awesome, thanks for the progress here @cdcadman

@cdcadman cdcadman deleted the refactor_sql branch November 17, 2022 18:54
@cdcadman cdcadman mentioned this pull request Nov 17, 2022
5 tasks
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Nov 18, 2022
…emy 2.0. (pandas-dev#49531)

* DOC: Clarify behavior of DataFrame.to_sql

* CLN: Make SQLDatabase only accept a sqlalchemy Connection.

Co-authored-by: Chuck Cadman <charles.cadman@standard.com>
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
…emy 2.0. (pandas-dev#49531)

* DOC: Clarify behavior of DataFrame.to_sql

* CLN: Make SQLDatabase only accept a sqlalchemy Connection.

Co-authored-by: Chuck Cadman <charles.cadman@standard.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants