
feat: use pyarrow stream compression, if available #593

Merged
merged 4 commits into googleapis:master Apr 12, 2021

Conversation

@plamut (Contributor) commented Apr 7, 2021

Closes #579.

This PR uses pyarrow compression of BQ Storage streams when available, i.e. when a recent enough version of google-cloud-bigquery-storage is installed.

Currently blocked on the BQ Storage release that will add this support, but it's still possible to test this PR locally by changing the noxfile to use a development version of google-cloud-bigquery-storage.
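For local testing before that release, the relevant noxfile session could be pointed at the development version roughly like this (an illustrative fragment; the exact session body and VCS URL form are assumptions about this repo's noxfile):

```python
# Inside the relevant nox session in noxfile.py (illustrative fragment):
# install BQ Storage from the main branch instead of the PyPI release.
session.install(
    "git+https://github.com/googleapis/python-bigquery-storage.git"
)
```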

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
@plamut plamut requested a review from tswast Apr 7, 2021
@google-cla google-cla bot added the cla: yes label Apr 7, 2021
import pkg_resources

# Having BQ Storage available implies that pyarrow is available, too.
_ARROW_COMPRESSION_SUPPORT = pkg_resources.get_distribution(
    "pyarrow"
).parsed_version >= pkg_resources.parse_version("1.0.0")
@plamut (Contributor, PR author) commented Apr 7, 2021
Do we actually need the pyarrow version check?

The bqstorage extra already pins the minimum pyarrow version to 1.0.0, so if somebody somehow installs an older version, that can be considered an error on the user's side?
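For reference, the comparison the guard relies on can be exercised in isolation. This is a minimal sketch using `pkg_resources` (which ships with setuptools); the version strings are illustrative:

```python
import pkg_resources

# parse_version gives PEP 440-aware ordering, so pre-releases and
# multi-digit components compare correctly, unlike plain string comparison.
low = pkg_resources.parse_version("0.17.1")
threshold = pkg_resources.parse_version("1.0.0")
high = pkg_resources.parse_version("2.0.0")

print(low >= threshold)    # False
print(high >= threshold)   # True

# Why a real parser matters: as plain strings, "9.0.0" sorts *after* "10.0.0".
print("9.0.0" > "10.0.0")  # True, which would wrongly pass a naive string check
```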


@tswast (Contributor) commented Apr 7, 2021
Sounds good to me (removing this check). Since all of our extras require pyarrow >= 1.0.0, I'm okay with failing.

Counterpoint though: what error will they get when pyarrow is too old? I wonder if we should file an FR to check for minimum versions and raise nicer errors? It might help with some of the issues like #556. This would mean keeping minimum versions in sync across three locations, though: setup.py, constraints.txt, and version_check.py.


@plamut (Contributor, PR author) commented Apr 8, 2021

The new pip dependency resolver (announcement) is significantly stricter, so I expect this will become a non-issue once its use is widespread enough and/or it becomes the default. It shouldn't even be possible to install an incompatible pyarrow version with it.

Yes, we can open an FR to discuss whether covering this presumably corner-case scenario during the transitional period is worth doing, given the problem of keeping the minimum versions in sync.


@tswast (Contributor) commented Apr 7, 2021

> Currently blocked on the BQ Storage release that will add this support

Oops! I'll see what open PRs we have and hopefully cut a release today.


plamut added 2 commits Apr 8, 2021
Arrow stream compression requires pyarrow>=1.0.0, but that's already
guaranteed by a version pin in setup.py if the bqstorage extra is
installed.
@plamut (Contributor, PR author) commented Apr 8, 2021

I see that a new version of BQ Storage has been released, good. But I also see that one of the Python 3.6 unit tests fails; I will investigate.

Update:
Under Python 3.6, version 2.0.0 of google-cloud-bigquery-storage is installed rather than the latest 2.4.0, hence the AttributeError raised in the test itself. This is because we have a non-empty constraints file for the Python 3.6 test dependencies.
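For illustration, the per-version constraints file described above might contain a pin such as the following (the file path follows the usual googleapis repo layout, but both the path and the exact pin are assumptions):

```
# testing/constraints-3.6.txt
google-cloud-bigquery-storage==2.0.0
```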


@plamut plamut marked this pull request as ready for review Apr 8, 2021
@plamut plamut requested a review from tswast as a code owner Apr 8, 2021
tswast approved these changes Apr 12, 2021
@tswast tswast merged commit dde9dc5 into googleapis:master Apr 12, 2021
10 checks passed
gcf-merge-on-green bot pushed a commit that referenced this issue Apr 26, 2021
🤖 I have created a release *beep* *boop*
---
## [2.14.0](https://www.github.com/googleapis/python-bigquery/compare/v2.13.1...v2.14.0) (2021-04-26)


### Features

* accept DatasetListItem where DatasetReference is accepted ([#597](https://www.github.com/googleapis/python-bigquery/issues/597)) ([c8b5581](https://www.github.com/googleapis/python-bigquery/commit/c8b5581ea3c94005d69755c4a3b5a0d8900f3fe2))
* accept job object as argument to `get_job` and `cancel_job` ([#617](https://www.github.com/googleapis/python-bigquery/issues/617)) ([f75dcdf](https://www.github.com/googleapis/python-bigquery/commit/f75dcdf3943b87daba60011c9a3b42e34ff81910))
* add `Client.delete_job_metadata` method to remove job metadata ([#610](https://www.github.com/googleapis/python-bigquery/issues/610)) ([0abb566](https://www.github.com/googleapis/python-bigquery/commit/0abb56669c097c59fbffce007c702e7a55f2d9c1))
* add `max_queue_size` argument to `RowIterator.to_dataframe_iterable` ([#575](https://www.github.com/googleapis/python-bigquery/issues/575)) ([f95f415](https://www.github.com/googleapis/python-bigquery/commit/f95f415d3441b3928f6cc705cb8a75603d790fd6))
* add type hints for public methods ([#613](https://www.github.com/googleapis/python-bigquery/issues/613)) ([f8d4aaa](https://www.github.com/googleapis/python-bigquery/commit/f8d4aaa335a0eef915e73596fc9b43b11d11be9f))
* DB API cursors are now iterable ([#618](https://www.github.com/googleapis/python-bigquery/issues/618)) ([e0b373d](https://www.github.com/googleapis/python-bigquery/commit/e0b373d0e721a70656ed8faceb7f5c70f642d144))
* retry google.auth TransportError by default ([#624](https://www.github.com/googleapis/python-bigquery/issues/624)) ([34ecc3f](https://www.github.com/googleapis/python-bigquery/commit/34ecc3f1ca0ff073330c0c605673d89b43af7ed9))
* use pyarrow stream compression, if available ([#593](https://www.github.com/googleapis/python-bigquery/issues/593)) ([dde9dc5](https://www.github.com/googleapis/python-bigquery/commit/dde9dc5114c2311fb76fafc5b222fff561e8abf1))


### Bug Fixes

* consistent percents handling in DB API query ([#619](https://www.github.com/googleapis/python-bigquery/issues/619)) ([6502a60](https://www.github.com/googleapis/python-bigquery/commit/6502a602337ae562652a20b20270949f2c9d5073))
* missing license headers in new test files ([#604](https://www.github.com/googleapis/python-bigquery/issues/604)) ([df48cc5](https://www.github.com/googleapis/python-bigquery/commit/df48cc5a0be99ad39d5835652d1b7422209afc5d))
* unsetting clustering fields on Table is now possible ([#622](https://www.github.com/googleapis/python-bigquery/issues/622)) ([33a871f](https://www.github.com/googleapis/python-bigquery/commit/33a871f06329f9bf5a6a92fab9ead65bf2bee75d))


### Documentation

* add sample to run DML query ([#591](https://www.github.com/googleapis/python-bigquery/issues/591)) ([ff2ec3a](https://www.github.com/googleapis/python-bigquery/commit/ff2ec3abe418a443cd07751c08e654f94e8b3155))
* update the description of the return value of `_QueryResults.rows()` ([#594](https://www.github.com/googleapis/python-bigquery/issues/594)) ([8f4c0b8](https://www.github.com/googleapis/python-bigquery/commit/8f4c0b84dac3840532d7865247b8ad94b625b897))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).