
DBAPI - running a series of statements that depend on each other (DECLARE, SET, SELECT, ...) #377

Open
mistercrunch opened this issue Nov 10, 2020 · 15 comments

Comments

@mistercrunch (Contributor) commented Nov 10, 2020

see original issue here: googleapis/python-bigquery-sqlalchemy#74

I'm a committer on Apache Superset, and we generally integrate fairly well with BigQuery through the DBAPI driver provided here. Using the Python DBAPI driver, though, we're having an issue running a chain of statements that depend on each other within a common session. Behind the scenes I'm using a plain connection and cursor.

It appears that the statements are not chained and the session/connection isn't persisted between them. Any clue on how to solve this?

Other DBAPI implementations allow this type of chaining, and the same logic does not work properly with this driver.

Here's a simple script exemplifying this:

DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);
SELECT Add(3, 4);

Environment details

  • OS type and version: Linux
  • Python version: 3.7.x
  • pip version: 20.2.4
  • google-cloud-bigquery version: 1.28.0 (per pip show google-cloud-bigquery)

Steps to reproduce

code:

from contextlib import closing

# engine is a SQLAlchemy engine created for BigQuery,
# e.g. engine = create_engine("bigquery://my-project")
with closing(engine.raw_connection()) as conn:
    with closing(conn.cursor()) as cursor:
        cursor.execute("DECLARE d DATE DEFAULT CURRENT_DATE()")
        cursor.execute("CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y)")
        cursor.execute("SELECT Add(3, 4)")
@eugeniamz commented Nov 10, 2020

I can confirm the issue. The driver executes the statements in separate sessions; in the project history you can see that

DECLARE d DATE DEFAULT CURRENT_DATE();
CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);

appear as separate entries:

[screenshot of the BigQuery project history]

and because they execute as separate entries,

CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y);

fails, since a temporary function must be followed by a SELECT, as the error explains:

[screenshot of the error message]

If I execute these statements directly from the console, all three run in one session:

[screenshot of the console run]

@tswast (Contributor) commented Nov 10, 2020

Since the backend API doesn't really have a concept of connection or cursor, we'll have to do this bundling client-side. Does the following sound reasonable?

Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called.

Perhaps we add a "DEFERRED" isolation mode to select that behavior and keep the current "IMMEDIATE"-style execution the default?
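For illustration, client-side batching along the lines of this proposal could look roughly like the sketch below. BufferedCursor and run_script are hypothetical names for this sketch, not part of google-cloud-bigquery:

```python
# Minimal sketch of the proposed "DEFERRED" behavior: statements passed to
# execute() are buffered client-side and only submitted as one script when a
# fetch* method forces a flush.
class BufferedCursor:
    def __init__(self, run_script):
        # run_script: a callable that executes one SQL string and returns rows.
        self._run_script = run_script
        self._buffer = []
        self._rows = None

    def execute(self, sql):
        # Defer execution: just remember the statement.
        self._buffer.append(sql.rstrip(";"))

    def _flush(self):
        if self._buffer:
            # Join the buffered statements into a single script and run it
            # once, so all of them share one BigQuery session.
            self._rows = self._run_script(";\n".join(self._buffer) + ";")
            self._buffer = []

    def fetchall(self):
        self._flush()
        return self._rows or []
```

With this shape, the three statements from the original report would reach BigQuery as one script, and fetchall() would return the rows of the final SELECT.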

@tswast (Contributor) commented Nov 10, 2020

Technically, reads in BigQuery use SNAPSHOT isolation, so if we go this route, perhaps that should be the only option besides None (autocommit).

@villebro commented Nov 11, 2020

> Since the backend API doesn't really have a concept of connection or cursor, we'll have to do this bundling client-side. Does the following sound reasonable?
>
> Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called.
>
> Perhaps we add a "DEFERRED" isolation mode to select that behavior and keep the current "IMMEDIATE"-style execution the default?

Having an option to enable client-side batching sounds like a good solution, and one we should be able to leverage in Superset with minimal changes.

@mistercrunch (Contributor, Author) commented Nov 11, 2020

Live-debugging the content of PR apache/superset#11660, which attempts to run cursor.fetchall() only on the last statement:

The error message as text, for convenience:

400 CREATE TEMPORARY FUNCTION statements must be followed by an actual query.

@mistercrunch (Contributor, Author) commented Nov 11, 2020

Also, the simpler

DECLARE d DATE DEFAULT CURRENT_DATE();
SELECT d as TEST;

also does not work; I get the BigQuery error 400 Unrecognized name: d at [1:8].

@tswast (Contributor) commented Nov 11, 2020

Sorry, I was thinking aloud: "Statements from Cursor.execute are batched client-side until either a fetch* method or close (or __del__?) is called." is a proposal, not the current state.

I worry about changing the default behavior, so I'm currently thinking of adding a SNAPSHOT isolation level to enable such behavior.

@mistercrunch (Contributor, Author) commented Nov 11, 2020

Oh, gotcha! I thought it was a statement of fact, not a proposal.

Can you think of workarounds on our side? Otherwise, we're happy to help improve the DBAPI implementation to make this possible. We'd love to help resolve this quickly!

@tswast (Contributor) commented Nov 11, 2020

The workaround in BigQuery is to take all the statements, join them with ;, and call execute once. Potentially we could add the (non-standard) executescript method to indicate that this is supported.

Related: https://docs.python.org/3.8/library/sqlite3.html#sqlite3.Cursor.executescript
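A sketch of that workaround from the caller's side. The as_script helper is illustrative, not part of google-cloud-bigquery:

```python
# Sketch of the workaround: concatenate the statements and submit them as a
# single script with one execute() call, so they share one BigQuery session.
def as_script(statements):
    """Join individual statements into one ;-separated BigQuery script."""
    return ";\n".join(s.rstrip().rstrip(";") for s in statements) + ";"

script = as_script([
    "DECLARE d DATE DEFAULT CURRENT_DATE()",
    "CREATE TEMP FUNCTION Add(x INT64, y INT64) AS (x + y)",
    "SELECT Add(3, 4)",
])
# The script can then be submitted in one round trip:
#   cursor.execute(script)
#   rows = cursor.fetchall()
```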

@mistercrunch (Contributor, Author) commented Nov 12, 2020

I tried running a multi-statement script as one large string with statements separated by ;\n. After the script executes, cursor.fetchall() returns nothing, and accessing cursor.description raises this:

DEBUG:superset.sql_lab:Query 300: Fetching cursor description
ERROR:sqlalchemy.pool.impl.NullPool:Exception closing connection <google.cloud.bigquery.dbapi.connection.Connection object at 0x126493390>
Traceback (most recent call last):
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 549, in do_close
    dbapi_connection.close()
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 258, in with_closed_check
    return method(self, *args, **kwargs)
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/connection.py", line 79, in close
    cursor_.close()
  File "/Users/max/.pyenv/versions/3.7.7/envs/env37/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 257, in with_closed_check
    raise exc_class(exc_msg)
google.cloud.bigquery.dbapi.exceptions.ProgrammingError: Operating on a closed cursor.

We're in uncharted territory here as far as DBAPI is concerned, but I was expecting (hoping is probably the right word) the state from the last statement to persist in the cursor, since I hadn't closed the cursor at that point.

@mistercrunch (Contributor, Author) commented Nov 12, 2020

Digging deeper, this line is the issue: https://github.com/googleapis/python-bigquery/blob/master/google/cloud/bigquery/dbapi/cursor.py#L227-L229

In this context, self._query_job.statement_type.upper() == "SCRIPT" , and if I set is_dml = False I can read the cursor from the last statement.

I can open a PR that sets is_dml to False when the statement is a script, but we should probably make sure the last statement is a SELECT, which takes us into SQL-parsing land, and we may want to stay away from that.
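For what it's worth, a naive sketch of that check is below, deliberately avoiding a real SQL parser. Splitting on ";" is simplistic and will misfire on semicolons inside string literals, which is exactly the SQL-parsing land referred to above; the function name is illustrative:

```python
# Naive heuristic: split the script on ";" and check whether the final
# non-empty statement begins with SELECT. Breaks on ";" inside string
# literals, comments, etc., so a real parser would be needed for robustness.
def last_statement_is_select(script):
    statements = [s.strip() for s in script.split(";") if s.strip()]
    return bool(statements) and statements[-1].upper().startswith("SELECT")
```

A library such as sqlparse would handle the edge cases, at the cost of a new dependency.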

@mistercrunch (Contributor, Author) commented Nov 12, 2020

Just for giggles: #386

@tswast (Contributor) commented Nov 12, 2020

Sent #387, which fixes the scripting issue, with a system test.


gcf-merge-on-green bot pushed a commit that referenced this issue Nov 12, 2020
The `is_dml` logic is not needed now that we moved to `getQueryResults` instead of `tabledata.list` (#375).

Previously, the destination table of a DML query would return a non-null value that was unreadable or would return nonsense with DML (and some DDL) queries. 


Towards #377  🦕
@mistercrunch (Contributor, Author) commented Nov 13, 2020

Thankful for amazing maintainers(!) How often do you all push to PyPI?

I think we can work with this fix, but there are some downsides. One is that Superset can't keep track of the statement currently executing. Another is that we need BigQuery-specific logic in our codebase to handle this. Ideally this DBAPI would conform more closely to the standard.

Loading

tswast mentioned this issue Nov 16, 2020

@tswast (Contributor) commented Nov 16, 2020

Looks like I merged the fix with a failing system test. I'll clean those up in #388.

Once that merges, I think it'll be okay to cut the release in #381.
