
feat: add max_queue_size argument to RowIterator.to_dataframe_iterable #575

Merged
merged 5 commits into googleapis:master from plamut:iss-561 on Apr 14, 2021

Conversation


@plamut plamut commented Mar 24, 2021

Closes #561.

This PR limits the size of the internal queue that stores result pages when streaming data over the BQ Storage API. It also makes the limit configurable.

Still need to add a few additional unit tests, but that should be it.

Note:
The new parameter is not exposed to the bigquery Jupyter cell magic - I presume that's fine? I don't think cell magic needs such fine-grained control, since it's not really meant to fetch huge query results into a Jupyter notebook session where any performance difference could actually matter.
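The mechanism described above can be sketched with a stdlib producer/consumer pipeline: a download worker puts result pages onto a bounded `queue.Queue`, and the consumer pulls pages off as it iterates, so the producer can never run more than `max_queue_size` pages ahead. This is a minimal illustration of the idea, not the library's actual internals; `stream_pages` and `_SENTINEL` are hypothetical names.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the page stream


def stream_pages(pages, max_queue_size=2):
    """Yield pages through a bounded queue (hypothetical sketch).

    The producer blocks on ``q.put`` once the queue is full, which is
    what caps the number of pages buffered in memory at any one time.
    """
    q = queue.Queue(maxsize=max_queue_size)

    def producer():
        for page in pages:
            q.put(page)      # blocks while the queue is full
        q.put(_SENTINEL)     # signal that no more pages are coming

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item


# Consume ten "pages"; at most max_queue_size + 1 are in flight at once.
result = list(stream_pages(range(10), max_queue_size=2))
```

All pages still arrive in order; the bound only limits how far ahead the download side may buffer.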

PR checklist:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
@google-cla google-cla bot added the cla: yes label Mar 24, 2021
@plamut plamut force-pushed the iss-561 branch 2 times, most recently from 30ffe85 to 680f952 Mar 24, 2021
@plamut plamut requested a review from tswast Mar 24, 2021
@plamut plamut changed the title feat: add configurable max size for the queue holding the result pages streamed over the BQ Stroage API feat: add configurable max size for the queue holding the result pages streamed over the BQ Storage API Mar 29, 2021
plamut added 2 commits Mar 29, 2021
The new parameter allows configuring the maximum size of the internal
queue used to hold result pages when query data is streamed over the
BigQuery Storage API.
@plamut plamut marked this pull request as ready for review Mar 29, 2021
@plamut plamut requested a review from a code owner Mar 29, 2021

@tswast tswast commented Mar 30, 2021

Looks like some tests are timing out now. I suspect that 1 is not the right default.


@tswast tswast commented Mar 30, 2021

How about we only add the argument to to_dataframe_iterable, since that is where it is most relevant? I think None, or perhaps a value equal to the number of workers, is probably the right default.

In the other methods we are expected to download the whole table/query results at once anyway, so conserving memory isn't as important.


@plamut plamut commented Mar 30, 2021

I'm fine with that; I'll remove the parameter from the other methods where query results are expected to be downloaded in full. I will also look into the timeouts and what a better default could be.


By default, the max queue size is set to the number of BQ Storage streams
created by the server. If ``max_queue_size`` is :data:`None`, the queue
size is infinite.

@plamut plamut Mar 31, 2021


Just in case somebody really wants the old behavior, I added it as an option.
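The documented behavior quoted above (default bounded to the number of BQ Storage streams, `None` meaning an unlimited queue) maps naturally onto stdlib `queue.Queue`, where a `maxsize` of zero or less means unbounded. A hedged sketch of that translation follows; `make_page_queue` and the `_DEFAULT` sentinel are illustrative names, not the library's API.

```python
import queue

_DEFAULT = object()  # stand-in for an internal "use the default" sentinel


def make_page_queue(max_queue_size=_DEFAULT, num_streams=4):
    """Build the page queue per the documented semantics (sketch).

    - default: bound the queue to the number of streams
    - None: unbounded queue (queue.Queue treats maxsize <= 0 as infinite)
    - int: use the caller's explicit bound
    """
    if max_queue_size is _DEFAULT:
        max_queue_size = num_streams  # one buffered page slot per stream
    elif max_queue_size is None:
        max_queue_size = 0            # 0 means "no limit" for queue.Queue
    return queue.Queue(maxsize=max_queue_size)


bounded = make_page_queue(num_streams=4)          # maxsize == 4
unbounded = make_page_queue(max_queue_size=None)  # maxsize == 0 (infinite)
explicit = make_page_queue(max_queue_size=10)     # maxsize == 10
```

Keeping `None` as the "unbounded" spelling preserves the pre-change behavior as an opt-in, which is exactly the escape hatch this comment describes.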



@plamut plamut commented Apr 14, 2021

@tswast ping :)


tswast approved these changes Apr 14, 2021

@tswast tswast left a comment

Wonderful!


@tswast tswast merged commit f95f415 into googleapis:master Apr 14, 2021
10 checks passed
@tswast tswast changed the title feat: add configurable max size for the queue holding the result pages streamed over the BQ Storage API feat: add max_queue_size argument to RowIterator.to_dataframe_iterable Apr 14, 2021
@plamut plamut deleted the iss-561 branch Apr 14, 2021
gcf-merge-on-green bot pushed a commit that referenced this issue Apr 26, 2021
🤖 I have created a release *beep* *boop*
---
## [2.14.0](https://www.github.com/googleapis/python-bigquery/compare/v2.13.1...v2.14.0) (2021-04-26)


### Features

* accept DatasetListItem where DatasetReference is accepted ([#597](https://www.github.com/googleapis/python-bigquery/issues/597)) ([c8b5581](https://www.github.com/googleapis/python-bigquery/commit/c8b5581ea3c94005d69755c4a3b5a0d8900f3fe2))
* accept job object as argument to `get_job` and `cancel_job` ([#617](https://www.github.com/googleapis/python-bigquery/issues/617)) ([f75dcdf](https://www.github.com/googleapis/python-bigquery/commit/f75dcdf3943b87daba60011c9a3b42e34ff81910))
* add `Client.delete_job_metadata` method to remove job metadata ([#610](https://www.github.com/googleapis/python-bigquery/issues/610)) ([0abb566](https://www.github.com/googleapis/python-bigquery/commit/0abb56669c097c59fbffce007c702e7a55f2d9c1))
* add `max_queue_size` argument to `RowIterator.to_dataframe_iterable` ([#575](https://www.github.com/googleapis/python-bigquery/issues/575)) ([f95f415](https://www.github.com/googleapis/python-bigquery/commit/f95f415d3441b3928f6cc705cb8a75603d790fd6))
* add type hints for public methods ([#613](https://www.github.com/googleapis/python-bigquery/issues/613)) ([f8d4aaa](https://www.github.com/googleapis/python-bigquery/commit/f8d4aaa335a0eef915e73596fc9b43b11d11be9f))
* DB API cursors are now iterable ([#618](https://www.github.com/googleapis/python-bigquery/issues/618)) ([e0b373d](https://www.github.com/googleapis/python-bigquery/commit/e0b373d0e721a70656ed8faceb7f5c70f642d144))
* retry google.auth TransportError by default ([#624](https://www.github.com/googleapis/python-bigquery/issues/624)) ([34ecc3f](https://www.github.com/googleapis/python-bigquery/commit/34ecc3f1ca0ff073330c0c605673d89b43af7ed9))
* use pyarrow stream compression, if available ([#593](https://www.github.com/googleapis/python-bigquery/issues/593)) ([dde9dc5](https://www.github.com/googleapis/python-bigquery/commit/dde9dc5114c2311fb76fafc5b222fff561e8abf1))


### Bug Fixes

* consistent percents handling in DB API query ([#619](https://www.github.com/googleapis/python-bigquery/issues/619)) ([6502a60](https://www.github.com/googleapis/python-bigquery/commit/6502a602337ae562652a20b20270949f2c9d5073))
* missing license headers in new test files ([#604](https://www.github.com/googleapis/python-bigquery/issues/604)) ([df48cc5](https://www.github.com/googleapis/python-bigquery/commit/df48cc5a0be99ad39d5835652d1b7422209afc5d))
* unsetting clustering fields on Table is now possible ([#622](https://www.github.com/googleapis/python-bigquery/issues/622)) ([33a871f](https://www.github.com/googleapis/python-bigquery/commit/33a871f06329f9bf5a6a92fab9ead65bf2bee75d))


### Documentation

* add sample to run DML query ([#591](https://www.github.com/googleapis/python-bigquery/issues/591)) ([ff2ec3a](https://www.github.com/googleapis/python-bigquery/commit/ff2ec3abe418a443cd07751c08e654f94e8b3155))
* update the description of the return value of `_QueryResults.rows()` ([#594](https://www.github.com/googleapis/python-bigquery/issues/594)) ([8f4c0b8](https://www.github.com/googleapis/python-bigquery/commit/8f4c0b84dac3840532d7865247b8ad94b625b897))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).