Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use BigQuery Storage client by default #55

Merged
merged 16 commits into from Jun 10, 2020

Conversation

@plamut
Copy link
Contributor

@plamut plamut commented Mar 9, 2020

Closes #86.
Closes #84.

No issue yet, just a POC preview.

The PR is now ready to be reviewed. It uses the BQ Storage API by default to fetch query results, and removes the "beta" label from that feature.

It's probably easier to review this PR commit by commit.

Key points:

  • If BQ Storage client cannot be used (e.g. missing dependencies) to fetch query results, the library falls back to the REST API.
  • The BQ storage API v1 stable is now used, replacing v1beta1 client (however, there is a feature request to support both BQ Storage client versions simultaneously).
  • The Python DB API now more closely follows the specs. Closing a connection instance makes it unusable from that point on, and also closes every Cursor instance created by it.
  • Closing a DB API Connection now also closes the BigQuery / BQ Storage clients the latter has created, fixing the sockets leak discovered.
@plamut plamut force-pushed the optimize-to-dataframe branch from 25572ab to 1bb7283 Mar 11, 2020
@plamut
Copy link
Contributor Author

@plamut plamut commented Apr 7, 2020

I did some timings to compare fetching data with and without the BQ Storage client, results below. Spoiler: When using the BQ Storage API, performance improvements are enormous.

Query

SELECT * FROM `project.dataset.table`
LIMIT 200000

Source tables

  • A synthetic table with two FLOAT columns containing 10M rows of random numbers
  • bigquery-public-data-cms_medicare.inpatient_charges_2011 - multiple numeric and string columns, contains ~163k rows.

Timings

REST API (tabledata.list) BQ Storage API
Synthetic table 10 - 11.5 s 4.5 - 5 s
Inpatient charges table 23 - 25s 10.5 - 12.5 s

Loading

@plamut plamut force-pushed the optimize-to-dataframe branch from fde9441 to c01648d Apr 14, 2020
@plamut plamut force-pushed the optimize-to-dataframe branch 4 times, most recently from f080813 to 1aa0c62 Apr 28, 2020
@plamut plamut requested a review from shollyman Apr 28, 2020
@plamut plamut marked this pull request as ready for review Apr 28, 2020
@plamut plamut force-pushed the optimize-to-dataframe branch from 1aa0c62 to 3f73a43 Apr 28, 2020
@plamut plamut force-pushed the optimize-to-dataframe branch from 3f73a43 to 00bf584 Apr 28, 2020
google/cloud/bigquery/_pandas_helpers.py Outdated Show resolved Hide resolved
Loading
@plamut plamut force-pushed the optimize-to-dataframe branch from 12b2171 to dbef14d Apr 30, 2020
Copy link
Contributor

@shollyman shollyman left a comment

Semantic satiation seeing all the v1/v1beta references. Sorry it took me so long to get to this. Couple minor nits, with a couple more interesting bits (avro vs arrow, small result set, versioning for storage).

Loading

google/cloud/bigquery/dbapi/cursor.py Outdated Show resolved Hide resolved
Loading
google/cloud/bigquery/dbapi/cursor.py Outdated Show resolved Hide resolved
Loading
google/cloud/bigquery/job.py Outdated Show resolved Hide resolved
Loading
setup.py Outdated Show resolved Hide resolved
Loading
tests/system.py Outdated Show resolved Hide resolved
Loading
tests/system.py Outdated Show resolved Hide resolved
Loading
@plamut plamut force-pushed the optimize-to-dataframe branch 2 times, most recently from 5b82c76 to c715a74 May 14, 2020
@plamut plamut requested a review from shollyman May 14, 2020
@plamut plamut force-pushed the optimize-to-dataframe branch from 486309f to 6e35d09 May 14, 2020
Copy link
Contributor

@shollyman shollyman left a comment

Thanks for the slogging through this.

Loading

@plamut
Copy link
Contributor Author

@plamut plamut commented May 15, 2020

Just a thought - since this is a pretty significant change, how about releasing all the other smaller changes and fixes that have accumulated since the last release separately, and release this one separately for an easier rollback, should that be necessary?

Loading

@shollyman
Copy link
Contributor

@shollyman shollyman commented May 15, 2020

Isolating this change to its own release seems prudent.

Loading

@plamut
Copy link
Contributor Author

@plamut plamut commented May 19, 2020

Putting this on hold until the next release is made, we will ship it separately.

Loading

@plamut plamut requested a review from shollyman Jun 10, 2020
@plamut
Copy link
Contributor Author

@plamut plamut commented Jun 10, 2020

With the new release (1.25.0) now out, we can now merge this and release in the near future (possibly with miscellaneous related cleanup fixes).

Loading

@plamut plamut merged commit e75ff82 into googleapis:master Jun 10, 2020
3 checks passed
Loading
@plamut plamut deleted the optimize-to-dataframe branch Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants